Implementing a Data Lakehouse Architecture in AWS - Part 1 of 4 (2023)

Jonathan Reiss

Numerous applications in today's world accumulate vast amounts of data to build insight and knowledge. Adding value and improving functionality is essential, but at what cost? A key factor in the arrival of the "big data" era is the drastic reduction in data storage costs combined with the increase in computing power available in the cloud.

Since Apache Hadoop was first released in 2008, the extensibility of the framework has allowed it to continuously evolve and improve its ability to collect, store, analyze, and manage multiple sets of data. Several "big data" projects have been added to the data architect's toolbox to create more powerful, smarter solutions. One of these newer data architectures is the so-called data lakehouse.

A data lakehouse is a new form of data architecture that combines the strengths of data lakes and data warehouses. The advantages of a data lake include flexibility, low cost, and scale; combining them with the strong data management and ACID transactions of a data warehouse enables business intelligence, analytics, and machine learning on all data in a fast and agile manner.

But it doesn't end there. A robust Data Lakehouse will connect the Data Lake, Data Warehouse, and purpose-built databases and services into a single fabric. This also includes a unified governance approach, tools and policies to move data inside-out and outside-in with minimal effort and cost.

A data lakehouse is enabled by a new, open system design that implements data structures and data management capabilities similar to those of a data warehouse directly on the low-cost storage used for data lakes. Consolidating them into one system means data teams can move faster, since they can work with data without having to access multiple systems. Data lakehouses also ensure teams have the most complete and up-to-date data for data science, machine learning, and business analytics projects.

In this series of articles, we explore how to embrace this big data architecture and implement it in AWS.

The implementation of a data lakehouse architecture in AWS centers on Amazon Simple Storage Service (S3), which provides the object storage used to build the data lake. Around the data lake, you can add other components such as:

  • Big data processing with Amazon EMR (Elastic MapReduce);
  • Relational databases with Amazon RDS (Relational Database Service), in particular Amazon Aurora;
  • Non-relational databases such as Amazon DynamoDB;
  • Machine learning for business analytics with Amazon SageMaker;
  • Data warehousing with Amazon Redshift;
  • Log analytics with Amazon OpenSearch Service.

As you can probably guess, connecting all of the above components into a data lakehouse requires setting up data stores, databases, and services in a way that lets data move quickly and easily: from purpose-built services into the data lake (outside-in movement), from the data lake out to purpose-built databases and services (inside-out movement), and from one purpose-built data store to another (movement around the perimeter).

To get the most out of their data lakes and these purpose-built stores, customers need to easily move data between these systems. For example, clickstream data from a web application can be collected directly in a data lake. A portion of this data can be moved out to the data warehouse for daily reporting. We think of this concept as data movement from the inside out.

Likewise, customers move data in the other direction: from the outside in. For example, they might copy query results for product sales in a given region from their data warehouse into their data lake to run a product recommendation algorithm with ML against a larger data set.

Finally, there are other situations where customers want to move data from one purpose-built data store to another: movement around the perimeter. For example, they might replicate product catalog data stored in a database into their search service to make the catalog easier to browse and to offload search queries from the database.

As the data in data lakes and other data stores grows, data gravity also grows, and it becomes more challenging to move data in any direction. The key service that keeps data movement agile as data gravity grows is Amazon Kinesis. In short, Amazon Kinesis allows you to process and analyze data as it arrives, responding quickly and in near real time with very low latency.

Let's take a deeper look at how this works. Kinesis offers three different services: Kinesis Data Streams, Kinesis Data Firehose, and Kinesis Data Analytics.

Kinesis is a serverless service that is elastic, durable, reliable, and available by default. With it, you can develop applications to ingest streams of data and react to them in real time, enabling you to build value from streaming data very quickly with minimal operational overhead.

Kinesis Data Streams is best for fast, continuous ingestion of data, which can include social media feeds, application logs, market data feeds, or IT infrastructure log data. Each stream consists of a set of shards, which are the units of read and write capacity in Kinesis. Each shard supports writes of up to 1 MB per second (or 1,000 records per second) and reads of up to 2 MB per second. When creating a Kinesis stream, you must specify the number of shards to provision for the stream, along with some other parameters. If the amount of data written to the stream increases, you can add more shards to scale the stream without downtime.
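
As a minimal sketch of what this looks like through the API (the stream name and shard counts are examples, not values from the article), you could create a provisioned stream with boto3 and later reshard it without downtime:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Create a provisioned stream with two shards (names and counts are illustrative).
kinesis.create_stream(StreamName="events-stream", ShardCount=2)

# Wait until the stream becomes ACTIVE before writing to it.
kinesis.get_waiter("stream_exists").wait(StreamName="events-stream")

# Later, scale out by doubling the shard count; the stream stays available meanwhile.
kinesis.update_shard_count(
    StreamName="events-stream",
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",
)
```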

To use Kinesis Data Streams, a typical scenario is to have producers write data directly into the stream. Producers can be EC2 instances, mobile applications, on-premises servers, or IoT devices. On the other side, consumers retrieve records from the data stream and process them accordingly. Consumers can be applications running on EC2 instances or AWS Lambda functions. If a consumer runs on Amazon EC2, you can place it in an Auto Scaling group to scale out. With provisioned capacity, you pay only for the shards you allocate, per hour.

More than one application can process the same data. An easy way to develop consumer applications is AWS Lambda, which scales up and down automatically and lets you run your consumer code without provisioning or managing servers.
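
For illustration only, a bare-bones Lambda consumer could look like the sketch below; the event shape is the standard Kinesis-to-Lambda integration, and the processing step is just a placeholder:

```python
import base64
import json


def handler(event, context):
    """Process a batch of records that Lambda pulled from the Kinesis stream."""
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        message = json.loads(payload)
        # Placeholder processing: look for a keyword, update a metric, etc.
        print(message)
```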

Kinesis Data Streams preserves the ordering of records within each shard, so consumers receive data in the order producers sent it for a given partition key. Data can also be processed in parallel: multiple applications can consume the same data stream at the same time. For example, you can ingest application log data and have one consumer look for a specific keyword while a second consumer uses the same data to generate performance insights. This decouples the collection and processing of data while providing a durable buffer, and each application can process the data at its own rate, so downstream processes are not directly dependent on one another.

Additionally, Kinesis Data Streams makes it easy to ingest and store streaming data while ensuring your data is durable and available, typically less than one second after it is written to the stream. For security, server-side encryption can be used to protect your data and meet compliance requirements.
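
As a small sketch (the stream name and KMS key alias are assumptions), enabling server-side encryption is a single API call:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Encrypt data at rest in the stream with a customer-managed KMS key.
kinesis.start_stream_encryption(
    StreamName="events-stream",
    EncryptionType="KMS",
    KeyId="alias/my-kinesis-key",
)
```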

Some common use cases for streaming data are:

  • Fraud detection on e-commerce platforms (real-time log monitoring);
  • Analyzing mobile or web applications with millions of users by ingesting clickstreams;
  • Tracking and responding in real time to IoT devices emitting massive amounts of sensor data;
  • Performing sentiment analysis on social media posts in real time.

All of the above use cases have one thing in common: the need to process many small messages generated by various sources in near real time. Messages need to be ingested, stored, and made available to different applications, and this has to be done in a reliable, secure, and scalable manner. At the same time, you don't want to keep one copy of the data per application; a single shared copy that every application can consume is sufficient.

In general, Kinesis stores streaming data for a limited amount of time. By default, it retains the sequence of events for 24 hours, and the retention period can be extended to up to 7 days (or up to 365 days with long-term retention). For anything longer, you need to write the processed results to persistent storage such as Amazon S3, DynamoDB, or Redshift.
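
For example (the stream name is again an assumption), the retention window can be extended through the API; 168 hours corresponds to 7 days:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Extend retention from the default 24 hours to 7 days (168 hours).
kinesis.increase_stream_retention_period(
    StreamName="events-stream",
    RetentionPeriodHours=168,
)
```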

In this article, we want to illustrate how easy it is to start feeding data lakes built on top of Amazon S3 from streaming data. Applications may generate events, which we may wish to store to build insight and knowledge, while creating value from derived data.

An initial idea for implementing a data lake fed by real-time streaming data might be to write a Kinesis consumer that stores the consumed events in S3 using the AWS SDK. However, there is a more elegant option.

We will use a Kinesis Data Firehose delivery stream to feed data into S3 seamlessly.

Among the several ways to send message streams to Kinesis, we will use a producer written in Python that relies on the boto3 library to communicate with the service and publish messages. The producer will use the Faker library, widely used to create random fake data, and will take a parameter that defines the number of messages to create and send to Kinesis. We will use Terraform to create the necessary resources automatically and keep the infrastructure under version control.

Below we list each resource used and what it does in that context:

  • Data producer - a Python program that generates data;
  • Kinesis Data Streams - receives the generated messages;
  • Kinesis Data Firehose - delivers the received messages to S3, converting them into Parquet files;
  • Amazon S3 - buckets for storing the generated data;
  • Glue Data Catalog - provides a unified metastore;
  • Amazon Athena - used to query the data stored in the S3 buckets.

The following figure illustrates the architectural design of the proposed solution.

[Figure: architecture of the proposed solution]

AWS service creation

To start the deployment, we will review the infrastructure code developed with Terraform. If you do not have Terraform installed, there are two common methods: installing it from your system's package repository or downloading the standalone binary.

Now we need to initialize Terraform by running terraform init. Terraform will create a .terraform directory and download the provider plugins it needs.

Following best practices, always run terraform plan -out=kinesis-stack-plan and review the output before creating or changing any resources.

Once the plan has been reviewed, the changes can be safely applied by running terraform apply "kinesis-stack-plan", which applies exactly the changes recorded in the saved plan.

In the video embedded below, you'll see the entire process of creating the environment.

Data producer

The code snippet below shows how to randomly create a data dictionary and then convert it to JSON. We need to convert the dictionary to JSON in order to put the data into the Kinesis data stream.
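
The original snippet is embedded in the post; as a stand-in, here is a minimal sketch of such a producer. The stream name, record fields, and --count flag are illustrative choices rather than the article's exact code: Faker generates the fake values and boto3 puts each JSON-encoded record on the stream.

```python
import argparse
import json
import uuid

import boto3
from faker import Faker

STREAM_NAME = "events-stream"  # example name; the real stream is created by Terraform

fake = Faker()
kinesis = boto3.client("kinesis", region_name="us-east-1")


def build_record() -> dict:
    """Build a random data dictionary with Faker."""
    return {
        "event_id": str(uuid.uuid4()),
        "name": fake.name(),
        "city": fake.city(),
        "created_at": fake.iso8601(),
    }


def send_records(count: int) -> None:
    for _ in range(count):
        record = build_record()
        # The dictionary must be serialized to JSON before being put on the stream.
        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=json.dumps(record).encode("utf-8"),
            PartitionKey=record["event_id"],
        )


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Send fake records to Kinesis")
    parser.add_argument("--count", type=int, default=10)
    args = parser.parse_args()
    send_records(args.count)
```

Running a script like this with --count 10 matches the ten records sent in the next step.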

Now it is time to start generating data: set up a Python virtual environment and install the dependencies. The isolated environment prevents conflicts with any other Python libraries you may already be using.

Using the data generator script, we'll start sending 10 records to our Kinesis service.

An important thing to note is that it takes a few seconds to see the generated data reach Kinesis.

Check out the Kinesis Data Streams chart below to get an idea of the rate at which data is being received and processed.

[Figure: Kinesis Data Streams monitoring metrics]

The same goes for the Kinesis Data Firehose statistics in the image below.

[Figure: Kinesis Data Firehose delivery metrics]

We can navigate within our S3 bucket to see how Kinesis Data Firehose delivers and organizes the Parquet files, creating "subfolders" that will be recognized as partitions of our table (year/month/day/hour, e.g. 2022/02/04/21).

[Figure: Parquet files organized under year/month/day/hour prefixes in the S3 bucket]
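
To see those prefixes outside the console, a quick listing with boto3 works as well; the bucket name and prefix below are examples, not the article's actual values:

```python
import boto3

s3 = boto3.client("s3")

# Firehose writes objects under year/month/day/hour prefixes that become table partitions.
response = s3.list_objects_v2(Bucket="datalake-raw-example", Prefix="2022/02/04/21/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```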

Using the AWS Glue service, we can see the schema automatically identified from the messages processed by Kinesis Data Firehose.

[Figure: table schema detected in the AWS Glue Data Catalog]
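
If you prefer to inspect the catalog programmatically, a small sketch like the following prints the detected columns and partition keys; the database and table names are assumptions:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Database and table names are illustrative; use the ones created in your account.
table = glue.get_table(DatabaseName="datalake_db", Name="events_table")["Table"]

for column in table["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])

# Partition keys correspond to the year/month/day/hour "subfolders" in S3.
print([key["Name"] for key in table.get("PartitionKeys", [])])
```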

Now that our data is available and its schema is defined, we can execute ad hoc SQL queries with Amazon Athena, as shown below.

[Figure: ad hoc SQL query results in Amazon Athena]
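
The same kind of ad hoc query can also be issued programmatically through the Athena API; in the sketch below the database, table, and results location are assumptions:

```python
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Database, table, and output location are examples; they must match your catalog and bucket.
query = athena.start_query_execution(
    QueryString="SELECT city, COUNT(*) AS events FROM events_table GROUP BY city LIMIT 10",
    QueryExecutionContext={"Database": "datalake_db"},
    ResultConfiguration={"OutputLocation": "s3://datalake-athena-results-example/"},
)
execution_id = query["QueryExecutionId"]

# Poll until the query finishes.
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```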

Finally, we need to tear down our infrastructure by running terraform destroy to avoid additional costs. The destroy command first asks for confirmation and deletes the infrastructure after receiving an affirmative answer, as you can see in the short video below.

The data lakehouse is an emerging architectural paradigm that opens new opportunities for enterprises planning to start their data-driven journey; the technologies, frameworks, and cost profiles associated with cloud platforms are now more attractive than ever.

In this first blog post, we introduced a scenario where streaming data is ingested with Kinesis Data Streams, processed in near real time by Kinesis Data Firehose, and delivered to object storage, where data analysts and ML engineers can use Amazon Athena to run ad hoc SQL queries.

If you liked this article, here's our Lightning Talk on the topic.

In this series on data lakehouses, we'll discuss this architecture further. Stay tuned!

Does your business require an efficient and robust data architecture to succeed? Get in touch; we can help you make it happen!
