Instantly Unlock PostgreSQL CDC Data on the Lakehouse with Onehouse (2023)

You can now continuously stream data into your data lake using the new Onehouse PostgreSQL Change Data Capture (CDC) feature. In this post, I'll dive into the CDC architecture and pipeline using the Onehouse product, which is built on open source systems such as Apache Kafka, Debezium, and Apache Hudi.

To enable effective data-driven decision-making, organizations recognize the need to leverage the large volumes of data hosted in their various OLTP databases. A popular solution is to replicate this data in near real time to a data lakehouse, a powerful foundation for many business-critical use cases across industries.

Building a reliable and efficient data replication pipeline that moves data from an RDBMS to a data lakehouse requires a thorough understanding of both OLTP systems and analytics needs. It requires expertise in data integration, database technology, data governance, and scalable architecture design. By successfully implementing such solutions, organizations can unlock the power of their high-value data, facilitate data-driven decision-making, and enable a host of critical use cases that drive business success.

Onehouse architecture overview

By automating data management tasks and providing a fully managed cloud-native platform, Onehouse allows customers to build data lakes quickly and easily, saving significant time and cost compared to a do-it-yourself approach. With Onehouse, customers can focus on deriving insights and value from their data. This is especially valuable for organizations that lack the resources or expertise to design, build, and manage their own data infrastructure.

Onehouse also offers a range of enterprise security features such as encryption, access control and monitoring to ensure data is always protected. This is critical for organizations that need to comply with industry regulations or data privacy laws. By using Onehouse, customers can have peace of mind knowing that their data is safe and compliant.

Onehouse's modern architecture gives customers full control over their own data and over the compute resources used by Onehouse's data plane. Because the data plane runs inside customers' cloud accounts and VPCs, customer data stays in their accounts and never leaves. This is in contrast to some other SaaS cloud architectures, where data is processed and stored outside of customer accounts, often raising questions and concerns about data privacy and security. With this architecture, Onehouse offers customers the benefits of a cloud-based solution while allowing them to maintain security and control over their data.

(Figure 1)

Pitfalls of Using PostgreSQL as a Data Lake

Relational database management systems (RDBMS) like PostgreSQL play a key role in storing much of the world's most valuable data. These databases are primarily used in online transaction processing (OLTP) systems, which are critical to day-to-day business operations. However, running complex analytics directly on these OLTP databases (such as PostgreSQL) can be challenging, because it can disrupt the high volume of transactional activity that is critical to smooth business operations.

To address this challenge, it is critical to develop a change data capture (CDC) and replication framework that coordinates the movement of data from the OLTP database to the data lake. The framework should be designed to handle complex data structures, maintain data consistency, minimize latency, and achieve near real-time synchronization. It must also account for the inherent challenges of replicating data from OLTP systems, such as concurrent updates, schema changes, and transactional dependencies.

In addition, the architecture supporting the data lake house must be scalable, fault-tolerant, and optimized for analytical workloads. It should take advantage of distributed computing techniques and parallel processing capabilities to efficiently handle large data sets and complex analytical tasks. This enables organizations to gain valuable insights from replicated data without compromising the operational efficiency of the OLTP database.

CDC process, architecture and components

There are several components involved in setting up change data capture (CDC) streaming and replication to a data lake using PostgreSQL as a source. Let's break down the common components in a typical architecture:

(Figure 2)

  1. PostgreSQL is the source database where the data resides.
  2. Debezium is a popular open source CDC tool. It connects to a PostgreSQL database and captures changes as they occur, including inserts, updates, and deletes, in real time. Debezium turns these changes into event streams.
  3. Apache Kafka is a distributed streaming platform. Debezium publishes captured database changes as events into Kafka topics. Kafka ensures a reliable and scalable flow of events, acting as a buffer between data sources and downstream processing.
  4. Apache Spark is a powerful distributed data processing framework. In this use case, Spark reads change events from Kafka and performs further processing and transformation on the data. Spark can handle large-scale data processing and is commonly used for analytics, ETL (extract, transform, load), and machine learning tasks.
  5. Apache Hudi is an open source data storage technology. It focuses on efficiently managing incremental updates and providing fast data ingestion and query capabilities on large datasets. In this case, Hudi applies the changes to a table stored in a cloud storage system such as Amazon S3 (a minimal sketch of this Spark-to-Hudi step follows the list).
  6. Apache Airflow is an open source platform for orchestrating and scheduling workflows. It enables you to define and manage dependencies between different tasks or components in a data pipeline. In this use case, Airflow coordinates the execution of Spark and Hudi, ensuring that these steps are properly sequenced and coordinated.
  7. AWS Glue Data Catalog is a fully managed metadata catalog service from Amazon Web Services (AWS). It enables you to store, organize, and manage metadata about data assets. In this use case, the Glue Data Catalog is used to update the state of the tables and make them discoverable by the query engine, allowing efficient querying and analysis of the data stored in the data lake.
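
To make the division of labor concrete, here is a minimal, hypothetical sketch of the Spark-to-Hudi step from items 4 and 5 above. The topic name (cdc.public.orders), the simplified payload schema, the broker address, and the S3 paths are all illustrative assumptions; this is not Onehouse's actual implementation, which manages these details for you.

# Hypothetical sketch: consume Debezium change events from Kafka with Spark
# Structured Streaming and upsert them into a Hudi table on S3.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType, TimestampType

# Requires the Hudi Spark bundle on the classpath, e.g.
#   spark-submit --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.1 ...
spark = SparkSession.builder.appName("postgres-cdc-to-hudi").getOrCreate()

# Simplified "after" image of an assumed orders table captured by Debezium.
after_schema = StructType([
    StructField("order_id", LongType()),
    StructField("customer_id", LongType()),
    StructField("status", StringType()),
    StructField("updated_at", TimestampType()),
])
envelope_schema = StructType([
    StructField("after", after_schema),
    StructField("op", StringType()),  # c = create, u = update, d = delete
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "msk-broker:9092")  # assumed MSK endpoint
       .option("subscribe", "cdc.public.orders")              # assumed Debezium topic
       .load())

changes = (raw.selectExpr("CAST(value AS STRING) AS json")
           .select(from_json(col("json"), envelope_schema).alias("e"))
           .select("e.after.*", "e.op")
           .where(col("op") != "d"))  # deletes need separate handling; omitted for brevity

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

(changes.writeStream
 .format("hudi")
 .options(**hudi_options)
 .option("checkpointLocation", "s3://my-bucket/checkpoints/orders")  # assumed path
 .outputMode("append")
 .start("s3://my-bucket/lake/orders")                                # assumed table path
 .awaitTermination())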

Implementing a data pipeline using the above components is a complex task that requires a solid understanding of data engineering principles and experience integrating and configuring different tools. Data engineers need a deep understanding of each tool's capabilities, limitations, and integration points to ensure smooth interoperability and efficient processing. They also need to be familiar with best practices in data governance, security, and performance optimization to ensure data pipelines are reliable, scalable, and cost-effective.

Additionally, they need to be proficient in scripting and coding to automate tasks and reduce manual errors. Building a data pipeline from the above components requires a combination of skills, knowledge, and experience in data engineering and related fields that many customers simply do not have in-house.

As an example, let's look at a common use case of CDC ingestion from a PostgreSQL database with 100+ tables to replicate data into a data lake.

Based on our experience, it typically takes three data engineers about three to six months to build a pipeline of this nature. And if you need the same data pipeline for four different business units, you will have to repeat the same laborious process four times, consuming even more valuable engineering time. By using Onehouse's managed lakehouse solution for the same use case, a single data engineer could deliver all four deployments in a few weeks, a savings of 10x or more. Beyond cost savings, the solution offers other intangible benefits, including simplified maintenance and faster time to business insight.

Onehouse's built-in CDC capability

Onehouse now has an exciting new capability to perform CDC ingestion from PostgreSQL into an analytics-ready data lakehouse. Several customers, such as Apna, have been running this pipeline in production. Behind the scenes, Onehouse automates the data infrastructure in your account to leverage Debezium, Kafka, Spark, and Hudi without exposing any of this complexity or maintenance to you. If you're currently designing your CDC ingestion pipeline to your data lake, or if you're tired of maintenance and on-call monitoring, Onehouse can provide a seamless experience.

(Figure 3)

In the following, I'll walk through an example of continuously ingesting data from an AWS RDS PostgreSQL database into an Amazon S3 data lake using the Onehouse product.

Enable CDC on the source PostgreSQL database

To configure change data capture (CDC) on the source PostgreSQL database, you need to turn on logical replication with write-ahead logging (WAL). For more information, see How to replicate tables in RDS for PostgreSQL using logical replication and Set up logical replication for an Aurora PostgreSQL DB cluster.
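
As a hedged illustration of how this can be scripted on Amazon RDS, the sketch below uses boto3 to set rds.logical_replication in a custom DB parameter group; the parameter group name and region are assumptions, the instance must already be associated with that group, and a reboot is required because the parameter is static.

# Illustrative sketch: enable logical replication for RDS PostgreSQL by setting
# rds.logical_replication=1 in a custom parameter group. Names are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.modify_db_parameter_group(
    DBParameterGroupName="cdc-postgres-params",    # assumed custom parameter group
    Parameters=[
        {
            "ParameterName": "rds.logical_replication",
            "ParameterValue": "1",
            "ApplyMethod": "pending-reboot",       # static parameter: takes effect after reboot
        },
    ],
)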

To verify the status of logical replication, log in to the source PostgreSQL database and run the following query:

cdc_db=> select name, setting from pg_settings where name in ('rds.logical_replication', 'wal_level');

           name           | setting
--------------------------+---------
 rds.logical_replication  | on
 wal_level                | logical
(2 rows)

Create a Schema Registry in AWS Glue

In this example, we will use the AWS Glue Schema Registry to register and validate the schemas of the real-time streaming data, similar to the functionality provided by the Confluent Schema Registry.

  1. In the AWS Management Console, search for and open Glue.
  2. On the left tab, click Schema Registry.
  3. Click Add Registry.
  4. Enter a name, an optional description, and any tags, then click Add Registry.
  5. Make a note of the registry name, as it is needed when creating the PostgreSQL CDC data source in the Onehouse Admin Console (a scripted alternative to these console steps is sketched below).
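
If you prefer to script this step rather than clicking through the console, a rough boto3 equivalent is sketched below; the registry name, description, and tags are placeholders you would replace with your own values.

# Illustrative sketch: create the Glue Schema Registry with boto3 instead of the console.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

response = glue.create_registry(
    RegistryName="onehouse-cdc-registry",             # note this name for the Onehouse data source
    Description="Schemas for PostgreSQL CDC streams",
    Tags={"owner": "data-platform"},
)
print(response["RegistryArn"])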

(Figure 4)

Create a new CDC source in Onehouse

To create a new PostgreSQL CDC source, log into the Onehouse management console; go to Connections -> Sources and click "Add Data Source". Select "Postgres CDC".

(Figure 5)

Before adding a new source, Onehouse will perform some network connectivity verification tests.

Set up CDC stream capture in Onehouse

On the left side of the Onehouse Admin Console, go to Services -> Capture and select Add Stream. Give the new stream a name and select the data source as the PostgreSQL CDC source created earlier. Since we are working on a CDC use case, select the Write Mode as "Mutable (Updates/deletes)".

(Figure 6)

For the test source table, specify the target table name, then click Configure and provide the following information:

(Figure 7)

In the Transformation section, select Transform data from CDC format and click Add Transformation; for CDC format, select Postgres (Debezium) as shown below:

(Figure 8)

For this simple example, you can leave the other settings at their defaults. In the final step, provide the target data lake, database, and data catalog information:

(Figure 9)

That's it! You can now click "Start Capture" and go back to Services -> Capture to confirm that the new stream capture is starting. The initial setup can take around 30-40 minutes, as Onehouse needs to provision an Amazon MSK cluster and configure Debezium behind the scenes. After that, all incremental CDC streams are processed, converted to Hudi format, and loaded into the data lake without further delay.

Built-in table optimization and interoperability

Onehouse does more than that. It provides comprehensive data management and built-in support for Apache Hudi's advanced table services, such as clustering, compaction, and cleaning. These services are critical to optimizing the performance and cost-effectiveness of Hudi tables, especially for OLTP CDC and replication use cases, where table data changes very quickly.

Hudi's industry-leading, innovative asynchronous compaction capability is a game changer in the world of cloud data analytics. Compaction merges small log files (Avro format) into larger base files (Parquet format) to reduce the number of files and improve query performance without blocking writes.
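
For readers building this stack themselves rather than using Onehouse, the sketch below shows the kinds of Hudi write options that control these table services on a merge-on-read table; the option values are illustrative examples, not Onehouse's tuned settings, and they would be merged into the writer options shown earlier.

# Illustrative Hudi options for a merge-on-read CDC table: asynchronous compaction,
# asynchronous clustering, and cleaning. Values are examples only.
table_service_options = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # Schedule compaction but never run it inline, so it does not block the writer.
    "hoodie.compact.inline": "false",
    "hoodie.datasource.compaction.async.enable": "true",
    "hoodie.compact.inline.max.delta.commits": "5",   # schedule after N delta commits
    # Asynchronous clustering to co-locate related records and right-size files.
    "hoodie.clustering.async.enabled": "true",
    # Cleaning reclaims storage while retaining history for incremental readers.
    "hoodie.cleaner.commits.retained": "10",
}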

Since the data resides in a data lake running on Apache Hudi, customers have the flexibility to choose their preferred query engine, such as Amazon Athena, Presto, Trino, Databricks, or Snowflake, based on their specific requirements and cost/performance trade-offs. They can also use the OneTable feature to easily generate interoperable metadata for Delta Lake and Apache Iceberg without touching the underlying data.
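
Because the tables are registered in the AWS Glue Data Catalog, querying them from an engine such as Athena is straightforward. The sketch below submits a query with boto3; the database name, table name, and results location are assumptions for this example.

# Illustrative sketch: query the replicated Hudi table from Amazon Athena via boto3.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS order_count FROM orders GROUP BY status",
    QueryExecutionContext={"Database": "cdc_lakehouse"},              # assumed Glue database
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(query["QueryExecutionId"])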

(Figure 10)

Conclusion

In this blog post, we showed how easy it is to set up an end-to-end data pipeline using the Onehouse PostgreSQL change data capture (CDC) feature. By leveraging this capability, organizations can replicate data from their PostgreSQL databases to a data lakehouse with near real-time latency in just a few clicks! This enables them to keep up with the rapidly changing demands of the modern business world, where data-driven decision-making is critical to success.

If you want to learn more about Onehouse and want to try it out, visit Onehouse on AWS Marketplace or contact gtm@onehouse.ai.

I would like to thank my Onehouse team members Vinish Reddy Pannala and Daniel Lee for their contributions to the Onehouse PostgreSQL CDC feature.
