What is a data lake? (2023)

TL;DR: Some controversy initially surrounded the purpose of the data lake. Most now agree that it is an on-premises and/or cloud repository for an organization's own or third-party raw data. It accepts all data types from virtually any data source and stores the data until it is ready to be consumed. This storage is usually referred to as "object" or "blob" storage, since anything stored (e.g., CSV, TXT, MP4, Parquet, Avro) is typically classified as an object. The data lake is often compared to a data warehouse because, early on, people conflated a general-purpose storage location for raw data of any type with the Hadoop Distributed File System (HDFS) and its sole purpose of analyzing big data or replacing traditional data warehouses that could not scale to support the ever-growing datasets Hadoop proved it could handle. "Data Lake" now often carries the suffix "Storage" to distinguish Data Lake Storage from other purported uses. There is a good article on why "Data Lake" is short for "Data Lake Storage".

If you think of a datamart as a store of bottled water, cleansed and packaged and structured for easy consumption, the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.

James Dixon

Why have a data lake?

Soon the concept of big data will simply mean "data", because "big" is becoming the new normal. Data is everywhere and mostly pedestrian until someone decides how to use it to produce some kind of value. The flow of data is not going to stop anytime soon. Every transaction carries data. Organizations can even buy data to help achieve their goals, and data is now generated by objects that were once considered static. What every organization has to start with is a way to capture, store, organize, secure, share, and generate value from data. These are the six key principles of data lakes. Here are some key reasons to have a data lake:

  • Data is actually an asset, so start collecting and storing it now
  • Your organization needs data, so give approved users the best possible way to get it
  • Storage is cheaper than in earlier periods, and a data lake can be an excellent way to offload data from legacy systems and reduce costs elsewhere
  • It leads to other value-added initiatives, such as moving to a cloud or hybrid cloud architecture, or starting an updated conversation about cybersecurity in the organization
  • It plants the seeds for system modernization, such as updating legacy data warehouses or building new ones, and reorganizing legacy data pipelines or moving legacy ETL to ETL as code
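
The "ETL as code" idea in the last bullet can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a recommended framework: the sample data, the function names (extract, transform, load, run_pipeline), and the in-memory list standing in for a warehouse table are all hypothetical.

```python
import csv
import io

# Hypothetical raw file content as it might sit in a data lake.
RAW_CSV = "order_id,amount\n1,19.99\n2,5.00\n3,not_a_number\n"


def extract(raw_text):
    """Read raw CSV text into a list of dict rows (all values are strings)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Convert types and drop rows that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            clean.append(
                {"order_id": int(row["order_id"]), "amount": float(row["amount"])}
            )
        except ValueError:
            continue  # this sketch simply skips malformed rows
    return clean


def load(rows, target):
    """Append cleaned rows to an in-memory list (stand-in for a warehouse table)."""
    target.extend(rows)
    return len(rows)


def run_pipeline(raw_text, target):
    """The whole pipeline is ordinary code that can be versioned and tested."""
    return load(transform(extract(raw_text)), target)


warehouse_table = []
loaded = run_pipeline(RAW_CSV, warehouse_table)
print(loaded, warehouse_table)
```

Because each step is a plain function, the pipeline can be reviewed, version-controlled, and unit-tested like any other code, which is the main appeal over pipelines defined only in a graphical tool.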

Some organizations had the general concept of a dumping ground where data sits for a long time well before it was called a data lake. Some will argue that if the six key principles are not met, the result is just a data swamp: an unorganized, nearly dysfunctional data repository.

Remember that a data lake is really just the storage aspect: a place where any type of data is stored. Different vendors that offer Data Lake Storage provide different object storage management capabilities. Some add detailed search, object security, REST API functionality, and so on, while others add data wrangling capabilities. This adds to the confusion about where data lakes start and end. DataLakeHouse attempts to bridge this gap by educating on the end-to-end lifecycle (from data lake to business value) and by providing a recognized separation of duties, which still confuses many.
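
To make the "storage aspect" concrete, the following is a toy, in-memory sketch of how object storage treats every file as an opaque object addressed by a key. The class ToyObjectStore, its methods, and the key names are hypothetical and do not correspond to any vendor's API; real services add durability, access control, and REST interfaces on top of the same basic idea.

```python
import hashlib


class ToyObjectStore:
    """In-memory stand-in for an object store: each key maps to raw bytes plus metadata."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, content_type="application/octet-stream"):
        """Store bytes under a key; the store does not inspect or parse the content."""
        checksum = hashlib.sha256(data).hexdigest()
        self._objects[key] = {
            "data": data,
            "content_type": content_type,
            "checksum": checksum,
        }
        return checksum

    def get(self, key):
        """Return the bytes exactly as they were stored."""
        return self._objects[key]["data"]

    def list_keys(self, prefix=""):
        """List keys that start with a prefix, similar to browsing a folder."""
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ToyObjectStore()
store.put("raw/sales/2023/orders.csv", b"order_id,amount\n1,19.99\n", "text/csv")
store.put("raw/video/intro.mp4", b"\x00\x01\x02", "video/mp4")
print(store.list_keys("raw/sales/"))
```

The CSV and the video are handled identically: the store keeps bytes and a little metadata, and any interpretation of the content is left to whatever system reads the object later.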

Why are data lakes and data warehouses confused?

Not every organization is as cutting-edge as the household names those of us in tech read about in the latest TechCrunch or Forbes article. Companies like Uber, Airbnb, and Amazon have to move fast and stay ahead in order to stay in business and outperform their competitors. These companies were either built on technology principles or became leaders in the technology field. Although they can solve their business problems quickly with unique technology solutions, companies that are not household names are trying to solve many of the same problems, albeit possibly at lower data volumes.

Many companies focused on running their business (those not covered by TechCrunch et al.) may or may not have a data warehouse, let alone a Hadoop instance or a data lake. Although the buzzword "big data" dominated tech conference headlines in the 2010s, the most relevant reference point for improving data analytics in organizations over the past 30 years has been data warehousing. People naturally make inferences from what they already understand, so, unfortunately, many conversations and articles about data lakes compare them to the historical concept of the data warehouse.

Another common comparison sets the storage of structured data against that of unstructured data. For people coming only from the realm of relational databases that support operational systems and possibly data marts or warehouses, unfamiliar definitions of unstructured data can lead to confusion. For systems that will consume data from the data lake (rather than the data lake itself), this resembles the familiar argument about schema-on-read versus schema-on-write patterns. These comparisons between what a data lake makes possible and what a data lake actually is again simply use the data warehouse as a point of reference. The value of the data warehouse remains intact.

For anyone who has worked with object storage solutions and moved data in and out of object storage "buckets", the difference between a data lake and a data warehouse is easy to understand. However, the comparison seems to proliferate further in writing about "data scientists" and their need for data for machine learning, prediction, and so on, which may or may not be possible in a data warehouse. Again, such writing compares the two types of systems but does not define the difference in their purpose.

Finally, newer vendor systems labeled Cloud Data Warehouse work very well as data warehouse solutions and offer highly scalable computing power. Some cloud data warehouse vendors now combine data warehousing, machine learning, and direct loading from object storage into the data warehouse (pseudo-ETL) in their solutions. Comparing or even mixing the data warehouse concept with the historical data lake concept adds to the confusion. The DataLakeHouse project aims to educate and clarify these common misconceptions so that every organization can realize business value through the big data ecosystem. The latter example is one reason the DataLakeHouse project is split into Front Lake and Back Lake concepts, so that organizations can properly align their skill sets and realize the maximum potential of their data management investments.

How much maintenance does a data lake require?

Like any other infrastructure initiative in an organization, a data lake is an asset that requires ongoing attention. Measuring the immediate business value and ROI of the infrastructure setup and of the data lake's associated inputs and outputs is a subjective task. Maintaining a data lake requires people who are familiar with data management and who understand the vendors used to support object storage, as well as the associated networking and security controls. Their job is to provide the access that makes the data lake valuable to the people in the organization who can turn its data into information that generates business value.

How long does it take to build a data lake?

As with building a data warehouse, setting up a new CRM system, or any other project, building a data lake takes time, and how much depends on the requirements of the business case. Like most technology projects, the solution can scale and grow. The foundational elements of a data lake can be built starting from a small part of a larger use case. That initial business use case can then produce a minimum viable product (MVP) that proves the concept and serves as a pilot for similar implementations. Once basic requirements have been met with proven results, the data lake can grow almost without limit.

The biggest gap between the expected and the actual time it takes to build a data lake is closed by a well-defined business case. Since many business cases are similar regardless of industry, the DataLakeHouse project seeks to equip all organizations with proven business cases and pre-built data-lake-to-business-value pipelines to accelerate the education and implementation of these big data solutions.

FAQs

What is a data lake?

A data lake is a centralized repository designed to store, process, and secure large amounts of structured, semistructured, and unstructured data. It can store data in its native format and process any variety of it, ignoring size limits.

What is a data lake vs data warehouse?

A data warehouse and a data lake are two related but fundamentally different technologies. While data warehouses store structured data, a lake is a centralized repository that allows you to store any data at any scale.

What is a data lake vs database?

A database stores the current data required to power an application. A data lake stores current and historical data for one or more systems in its raw form for the purpose of analyzing the data.

What are examples of data lakes?

A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).

Is Snowflake a data lake?

Snowflake has always been a hybrid of data warehouse and data lake.

Is data lake a data dump?

The data lake is a breath of fresh air to many, especially those within an organization who regularly work with data and need the ability to run analysis on demand, without waiting for IT. However, given the free-spirited nature of what can be stored in a lake's architecture, lakes can quickly turn into data dumps.

Is ETL a data lake?

ETL was developed when there were no data lakes; the staging area for the data that was being transformed acted as a virtual data lake. Now that storage and compute are relatively cheap, we can have an actual data lake and a virtual data warehouse built on top of it.

Does data lake have ETL?

There are many different options for data lake ETL – from open-source frameworks such as Apache Spark, to managed solutions offered by companies like Databricks and StreamSets, and purpose-built data lake pipeline tools such as Upsolver.

What is the difference between data lake and ETL?

A data lake defines the schema after data is stored (schema-on-read), whereas a data warehouse defines the schema before data is stored (schema-on-write). A data lake typically uses the ELT (Extract, Load, Transform) process, while a data warehouse typically uses the ETL (Extract, Transform, Load) process.
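
The difference between defining the schema before storing and defining it after can be sketched as follows. This is a minimal illustration with made-up event data; the function names write_with_schema and read_with_schema and the two-field schema (user, amount) are assumptions chosen for the example, not part of any product.

```python
import json

# Hypothetical raw events exactly as they arrive from a source system.
RAW_EVENTS = [
    '{"user": "a", "amount": "12.50", "channel": "web"}',
    '{"user": "b", "amount": "3"}',
    '{"user": "c", "amount": "oops", "extra": true}',
]


def write_with_schema(line):
    """Warehouse style: validate and convert BEFORE storing; bad rows raise an error."""
    record = json.loads(line)
    return {"user": str(record["user"]), "amount": float(record["amount"])}


def read_with_schema(lines):
    """Lake style: raw lines are stored as-is; the schema is applied only when reading."""
    for line in lines:
        record = json.loads(line)
        try:
            yield {"user": str(record["user"]), "amount": float(record["amount"])}
        except (KeyError, ValueError):
            continue  # not readable under this schema, but the raw line is still kept


# Schema applied before storage: the malformed event never gets stored.
warehouse_rows = []
rejected = 0
for line in RAW_EVENTS:
    try:
        warehouse_rows.append(write_with_schema(line))
    except (KeyError, ValueError):
        rejected += 1

# Schema applied after storage: every raw event is kept and interpreted at query time.
lake_rows = list(RAW_EVENTS)
total = sum(row["amount"] for row in read_with_schema(lake_rows))
print(len(warehouse_rows), rejected, len(lake_rows), total)
```

In the second style the malformed third event is still in storage, so a later reader with a different schema (for example one that uses the extra field) can still make use of it.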

Is SQL a data lake?

No. A data lake is a centralized repository that allows for the storage of structured and unstructured data at any scale. SQL (Structured Query Language) is a language used to communicate with and manipulate databases, and it is commonly used to query data stored in a lake.

What is Snowflake vs data lake?

The biggest difference between Snowflake and a data lakehouse platform is that Snowflake's hybrid model has better capabilities for the security and governance of sensitive data, as well as more automation, better economics, and better performance.

Why would you use a data lake?

Data Lakes allow you to store relational data like operational databases and data from line of business applications, and non-relational data like mobile apps, IoT devices, and social media. They also give you the ability to understand what data is in the lake through crawling, cataloging, and indexing of data.

Is Azure a data lake?

Azure Data Lake includes all the capabilities required to make it easy for developers, data scientists, and analysts to store data of any size, shape, and speed, and do all types of processing and analytics across platforms and languages.

Who builds data lakes?

Data lake management is often the domain of data engineers, who help design, build and maintain the data pipelines that bring data into data lakes. With data lakehouses, there can often be multiple stakeholders for management in addition to data engineers, including data scientists.

Is Google a data lake?

Google is just one piece of the data lake puzzle. Our key partners can help you unlock new capabilities that seamlessly integrate with the rest of your IT investments.

Is data lake in AWS?

To support our customers as they build data lakes, AWS offers Data Lake on AWS, which deploys a highly available, cost-effective data lake architecture on the AWS Cloud along with a user-friendly console for searching and requesting datasets.

Is Hadoop a data lake?

Hadoop is an important element of the architecture that is used to build data lakes. A Hadoop data lake is one which has been built on a platform made up of Hadoop clusters. Hadoop is particularly popular in data lake architecture as it is open source (as part of the Apache Software Foundation project).

Is S3 a data lake?

Amazon S3 provides an optimal foundation for a data lake because of its virtually unlimited scalability and high durability. You can seamlessly and non-disruptively increase storage from gigabytes to petabytes of content, paying only for what you use.

Is Kafka part of the data lake?

A modern data lake solution that uses Apache Kafka, or a fully managed Apache Kafka service like Confluent Cloud, allows organizations to use the wealth of existing data in their on-premises data lake while moving that data to the cloud.

What is the opposite of data lake?

The opposite of a data lake, a data warehouse is a hierarchical, structured data repository of integrated data from multiple sources, organized for creating analytical reports.

What is data lake vs pool?

A data pool is an independent, isolated micro-data lake. A data lake includes at least one, but ideally many data pools that belong to the same organization, and are managed independently (they can even run on different cloud vendors!).

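What is another word for data lake?

There is no exact synonym. "Data lake storage" is sometimes used for the same thing. A data warehouse is a different kind of system, not a synonym.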

Which statement best describes a data lake?

A storage repository holding raw data in its native format.

Why is it called data lake?

Pentaho CTO James Dixon is generally credited with coining the term “data lake”. He describes a data mart (a subset of a data warehouse) as akin to a bottle of water, “cleansed, packaged and structured for easy consumption”, while a data lake is more like a body of water in its natural state.

What is the reason for data lake?

The primary purpose of a data lake is to make organizational data from different sources accessible to various end-users like business analysts, data engineers, data scientists, product managers, executives, etc., to enable these personas to leverage insights in a cost-effective manner for improved business performance ...

Author: Zonia Mosciski DO

Last Updated: 04/13/2023
