The theme of this year's Data + AI Summit is that we are building the modern data stack with the lakehouse. A fundamental requirement of a data lakehouse is the need to bring reliability to your data with a format that is open, simple, production-ready, and platform-agnostic, such as Delta Lake. With this, we are happy to announce that with Delta Lake 2.0, we will be open sourcing all of Delta Lake!
What Makes Delta Lake Different
Delta Lake enables organizations to build data lakehouses, which support data warehousing and machine learning directly on the data lake. But Delta Lake doesn't stop there. Today, it is the most comprehensive lakehouse format, used by more than 7,000 organizations and processing exabytes of data every day. Beyond its core capability of seamlessly ingesting and consuming streaming and batch data in a reliable and performant manner, one of the most important features of Delta Lake is Delta Sharing, which enables different companies to share datasets securely. Delta Lake also comes with standalone readers and writers, allowing any Python, Ruby, or Rust client to write data directly to Delta Lake without a big data engine such as Apache Spark™. Finally, Delta Lake has been optimized over time and significantly outperforms other lakehouse formats, and it comes with a rich set of open source connectors, including Apache Flink, Presto, and Trino. Today, we're excited to announce our commitment to the community by open sourcing all of Delta Lake, including features that until now were available only in Databricks. We hope this democratizes the use and adoption of the data lakehouse. But before we do that, we want to tell you a bit about the history of Delta Lake.
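To make the streaming-plus-batch point concrete, here is a minimal PySpark sketch (not from the original announcement) in which one Delta table is fed by a streaming writer while a batch query reads a consistent snapshot of the same table; the table path, checkpoint location, and the toy rate source are all illustrative assumptions.

```python
from pyspark.sql import SparkSession

# A Delta-enabled SparkSession; Delta's Spark connector must be on the
# classpath (e.g. started with --packages io.delta:delta-core_2.12:2.0.0).
spark = (
    SparkSession.builder.appName("delta-streaming-and-batch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Continuously append a stream (a toy rate source here) to a Delta table.
query = (
    spark.readStream.format("rate").load()
    .writeStream.format("delta")
    .option("checkpointLocation", "/tmp/delta/_checkpoints/events")
    .start("/tmp/delta/events")
)

# Concurrently, batch readers see a consistent, transactional snapshot
# of the very same table.
snapshot = spark.read.format("delta").load("/tmp/delta/events")
print(snapshot.count())
```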
Origin of Delta Lake
The genesis of this project was a casual conversation at Spark Summit 2018 between Dominique Brezinski, a Distinguished Engineer at Apple, and our own Michael Armbrust (who originally created Delta Lake, Spark SQL, and Structured Streaming). Dominique, who works on intrusion monitoring and threat response, approached Michael with ideas on how to address the processing demands created by massive concurrent batch and streaming workloads (petabytes of log and telemetry data per day). His team was unable to use data warehouses for this use case because (i) they were cost-prohibitive for the massive amount of event data involved, (ii) they did not support the real-time streaming use cases essential for intrusion detection, and (iii) they lacked support for the advanced machine learning needed to detect zero-day attacks and other suspicious patterns. Building on a data lake was therefore the only viable option at the time, but the team was struggling with pipeline failures caused by the high number of concurrent streaming and batch jobs, and could not ensure transactional consistency and data accessibility for all of their data.
So the two of them got together to discuss the need to unify data warehousing and AI, planting the seeds of what we now know as Delta Lake. Over the next few months, Michael and his team worked closely with Dominique's team to build an ingestion architecture designed to solve data problems at scale: one that allowed their team to easily and reliably handle low-latency stream processing and interactive queries without job failures or reliability issues in the underlying cloud object storage systems, while enabling Apple's data scientists to process large volumes of data to detect anomalous patterns. We quickly realized that this problem was not unique to Apple; many of our customers had experienced the same issues. Fast forward, and we started to see Databricks customers effortlessly building reliable data lakes at scale using Delta Lake. We began calling this approach to building a reliable data lake the lakehouse pattern, because it provides the reliability and performance of a data warehouse with the openness, data science support, and real-time nature of a massive data lake.
Delta Lake becomes a Linux Foundation project
As more and more organizations started building lakehouses with Delta Lake, we heard that they wanted the format on the data lake to be open source, avoiding vendor lock-in entirely. As a result, at the 2019 Spark + AI Summit, together with the Linux Foundation, we announced the open sourcing of the Delta Lake format so that the larger community of data practitioners could make better use of their existing data lakes without sacrificing data quality. Since open sourcing Delta Lake (under the permissive Apache License v2, the same license we use for Apache Spark), we have seen massive adoption and growth of the Delta Lake developer community, as well as a paradigm shift in how practitioners and companies unify their data with machine learning and artificial intelligence use cases. That's why we're seeing such massive adoption and success.
The Delta Lake Community Grows
Today, the Delta Lake project is thriving, with more than 190 contributors across more than 70 organizations, nearly two-thirds of whom are contributors from outside Databricks, from leading companies such as Apple, IBM, Microsoft, Disney, Amazon, and eBay, to name just a few. In fact, we've seen a 633% increase in contributor strength (as defined by the Linux Foundation) over the past three years. This level of support is the heart and strength of this open source project.
Source: The Linux Foundation. Contributor strength: growth in the total number of unique contributors analyzed over the past three years. A contributor is anyone who is associated with a project through any code activity (commits/PRs/changesets) or who helps find and fix bugs.
Delta Lake: The fastest and most advanced multi-engine storage format
Delta Lake was built not just for one tech company's special use case but for a wide range of use cases representing our customers and communities, from finance, healthcare, and manufacturing to operations and the public sector. Delta Lake has been deployed and battle-tested across tens of thousands of deployments, with the largest tables measured in exabytes. As a result, Delta Lake has come out far ahead of other formats, time and time again, in real customer testing and third-party benchmarks¹ on performance and ease of use.
With Delta Sharing, anyone can easily share their data and read data shared with them as Delta tables. We released Delta Sharing in 2021 to give the data community an option to escape vendor lock-in. As data sharing became more popular, many of you expressed frustration that proprietary data formats, and the proprietary compute required to read them, were creating even more data silos, now even outside your organization. Delta Sharing introduces an open protocol for the secure, real-time exchange of large datasets, enabling secure data sharing across products for the first time. Data users can now connect directly to shared data through pandas, Tableau, Presto, Trino, or dozens of other systems that implement the open protocol, all without using any proprietary systems, including Databricks.
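As a sketch of what the consumer side of that open protocol can look like, the snippet below uses the delta-sharing Python connector to load a shared table into pandas; the profile file path and the share, schema, and table names are placeholders you would receive from a data provider.

```python
# Minimal sketch, assuming a provider has given you a Delta Sharing
# profile file; the share/schema/table names below are hypothetical.
import delta_sharing

# A profile file holds the sharing server endpoint and a bearer token.
profile = "/path/to/open-datasets.share"

# List the tables the provider has shared with you.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load one shared table directly into a pandas DataFrame,
# with no proprietary system required on the consumer side.
table_url = profile + "#share_name.schema_name.table_name"
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```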
Delta Lake also has the richest ecosystem of direct connectors, such as Apache Flink, Presto, and Trino, enabling you to read from and write to Delta Lake directly from the most popular engines without Apache Spark. Thanks to Delta Lake contributors from Scribd and Back Market, you can also use delta-rs, a native Rust library for Delta Lake, enabling Python, Rust, and Ruby developers to read and write Delta tables without any big data framework. Today, Delta Lake is the most widely used lakehouse storage format in the world, with over 7 million monthly downloads, a 10x increase in monthly downloads in just one year.
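For a sense of what Spark-free access looks like, here is a minimal sketch using the deltalake Python package (the Python bindings for delta-rs); the local table path and the toy DataFrame are illustrative assumptions.

```python
# Write and read a Delta table without Spark, via the delta-rs bindings.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Write a new Delta table (creates the Parquet files and the _delta_log).
write_deltalake("./my_delta_table", df)

# Read it back and inspect the table's transaction state.
dt = DeltaTable("./my_delta_table")
print(dt.version())     # current table version
print(dt.to_pandas())   # full contents as a pandas DataFrame
```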
Announcing Delta 2.0: Open Source Everything
Delta Lake 2.0, the latest release of Delta Lake, will further enable our large community to benefit from all Delta Lake innovations, with all Delta Lake APIs being open sourced, in particular the performance optimizations and features introduced by Delta Engine, such as Z-Ordering, change data feed, dynamic partition overwrite, and dropped columns. With these new capabilities, Delta Lake continues to deliver unmatched out-of-the-box price/performance for all lakehouse workloads, from streaming to batch, up to 4.3x faster than other storage layers. Over the past six months, we have put in a huge effort to implement and contribute all of these performance enhancements to Delta Lake. As a result, we are open sourcing all of Delta Lake, and we are committed to ensuring that all features of Delta Lake will be open source.
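To illustrate the features named above, here is a hedged PySpark sketch; the events table, its eventType column, the new_data DataFrame, and the pre-configured Delta-enabled spark session are assumptions for the example, not part of the announcement.

```python
# Z-Ordering: co-locate related data to speed up selective queries.
spark.sql("OPTIMIZE events ZORDER BY (eventType)")

# Change data feed: record row-level changes for downstream consumers.
spark.sql(
    "ALTER TABLE events SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
)
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("events")
)

# Dynamic partition overwrite: replace only the partitions being written.
(
    new_data.write.format("delta")
    .mode("overwrite")
    .option("partitionOverwriteMode", "dynamic")
    .saveAsTable("events")
)

# Drop a column as a metadata-only operation (requires column mapping
# to be enabled on the table).
spark.sql("ALTER TABLE events DROP COLUMN obsolete_col")
```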
We're excited to see Delta Lake going from strength to strength. We look forward to working with you to continue the rapid innovation and adoption of Delta Lake in the years to come.
Interested in joining the open source Delta Lake community?
Visit Delta Lake to learn more; you can join the Delta Lake community via Slack and Google Groups.
¹ https://databeans-blogs.medium.com/delta-vs-iceberg-performance-as-a-decisive-criteria-add7bcdde03d