From data warehouses to big data to data lakes and now data lakehouses, the journey may not be over yet. Data warehouses are built to report on business operations and to support some analysis for predictive modeling (forecasting) or discovery, with access limited to a small group of experts. Big data is often stored in file- or object-based repositories in unsorted, raw form, waiting for intrepid analysts to mine it for insights. The data lake architecture evolved out of those big data repositories. And now there is the data lakehouse.
Tomer Shiran, Dremio's co-founder and chief product officer, titled his keynote at Subsurface 2023 "The Year of the Data Lakehouse." Shiran sat down with Cloud Data Insights (CDI) to explain why, covering topics such as the importance of the semantic layer, performance optimization, and emerging lakehouse capabilities. (See Shiran's bio below.)
CDI: Your keynote at this year's Subsurface conference was titled "The Year of the Data Lakehouse." What kind of inflection point do you see for the lakehouse?
Tomer: We've gone through many different phases or eras over the past decades, each arriving faster than the last. We had decades of enterprise data warehouses, and then the whole Hadoop and big data boom; I was involved in that phase as VP of Product at MapR. Then came the rise of the public cloud, seemingly out of nowhere. That led to solutions like Redshift and Snowflake and made cloud data warehouses popular. I think they solve some of the ease-of-use issues and can handle complete data warehouse use cases that data lakes couldn't.
In just the past year, Apache Iceberg has emerged as the common table format, and the entire ecosystem has rallied behind it. Now you can do things in the lake that you couldn't do before: basically everything a data warehouse can do. So, for the first time, all of these use cases can really be addressed with an open data architecture. That's why we call it the Year of the Data Lakehouse.
We at Dremio actually think "lakehouse" is a good category name for what we're doing. Yes, at our core we have a query engine. Yes, we can also connect to other data sources and federate them. But a fundamental feature of a lakehouse is the ability to work with data in object storage and also across other sources, because in the real world companies have data in various places, and they can't always centralize all of it.
CDI: One of the main points in your keynote was the tension between governance and accessibility. We think of a data warehouse as a very controlled, inaccessible store of data to which only a few people have the "keys." How do you think about this tension?
Tomer: You cannot put physical or human barriers between teams, departments, and projects. Of course, from a governance and security standpoint, not everyone can see every piece of data, so you want the ability to control things. But that control should be driven by business needs and compliance, not by physical constraints such as data existing in one system and not another. The semantic layer lets you access data regardless of where it resides or how large it is.
See also: The Role of the Semantic Layer in Analytics and Data Integration
CDI: Tell us about your strategy for increasing access and how Dremio Arctic achieves this.
Tomer: Arctic introduces the new idea of sandboxing data while you're doing intermediate work, like fixing up a batch of data or ingesting new sources that you haven't tested yet and want to validate. Arctic lets you do this in isolation, using the same concepts found in Git and GitHub. You create a branch, ingest data into the branch, and verify it there. These concepts are very valuable while the data is still a work in progress; it's not data you want to share with anyone yet. But once you're ready, you run a command and, boom, it's shared. The whole idea of a branch lets you work in isolation.
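The branch-then-merge workflow Shiran describes can be sketched in a few lines. This is a minimal in-memory illustration, not the actual Dremio Arctic or Project Nessie API; the `Catalog` class and its methods are invented for the example.

```python
# Hypothetical in-memory sketch of Git-style branching for data tables.
# Not the Arctic/Nessie API, just the workflow: branch, ingest, validate, merge.

class Catalog:
    def __init__(self):
        self.branches = {"main": {}}  # branch name -> {table name: rows}

    def create_branch(self, name, source="main"):
        # A new branch starts as a copy of its source's table listing.
        self.branches[name] = dict(self.branches[source])

    def ingest(self, branch, table, rows):
        # Writes land only on the branch; "main" readers never see them.
        self.branches[branch] = {**self.branches[branch], table: rows}

    def merge(self, branch, into="main"):
        # Once validated, a single merge publishes the work atomically.
        self.branches[into].update(self.branches[branch])

catalog = Catalog()
catalog.create_branch("etl")
catalog.ingest("etl", "calls", [{"number": "555-0100", "minutes": 12}])

assert "calls" not in catalog.branches["main"]  # work in progress is isolated
catalog.merge("etl")
assert "calls" in catalog.branches["main"]      # now visible to everyone
```

In a real system the branch would be a cheap metadata pointer rather than a copy, but the isolation guarantee is the same.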
CDI: This is a way of dealing with transient data. Branches are also useful for rolling back in case of errors. It removes the risk of experimentation.
Tomer: Yes. I think this Git-style philosophy applied to data might represent one of the biggest changes in the history of data management. Take today's data warehouse as an example. Yes, it's more scalable and easier to use than it was 20 years ago, but fundamentally it's still the same thing: a bunch of tables on which I run SQL commands, right? The model is exactly the one that's been in use for decades. In software development, though, everything has changed in the past decade. We have tools like GitHub for collaboration and agility, version control, and source code governance. We have CI/CD and the whole ecosystem around it. That hasn't happened in the data world. Arctic [based on the open source project Nessie] is a critical part of driving that change forward.
See also: Python: Top Programming Language, But SQL Gets You Noticed
CDI: This is a good point about using the same rules and tools for working with data and data products as you use to create applications.
Tomer: These development tools are already familiar. Data engineers are software engineers too; at a minimum they build Python scripts, and they use tools like GitHub. We don't have to reinvent everything. We can apply those tools to a new space.
CDI: There has been some hesitation in adopting the semantic layer, mainly around its impact on performance. One workaround is to have the semantic layer cover only a portion of the organization's data assets, which defeats the purpose. How do you overcome this problem?
Tomer: If things aren't fast enough, the way to gain performance is to optimize the data. For example, rather than querying a table with a billion call records, I aggregate by phone number, so I have statistics for each phone number instead of all the individual calls. Then my queries are faster because the dataset is smaller. Another example is pre-joining tables that people query frequently, so the join doesn't happen on every query. Some people optimize manually when pulling data from a data warehouse that is too slow and potentially overloaded: they take the data and start caching pieces of it on the BI side, inside applications like Tableau.
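The pre-aggregation idea above can be shown concretely. This is an illustrative sketch; the table and column names are made up, not taken from any Dremio schema.

```python
# Sketch of the optimization Shiran describes: instead of scanning every call
# record at query time, maintain one summary row per phone number.
calls = [
    {"number": "555-0100", "minutes": 3},
    {"number": "555-0100", "minutes": 7},
    {"number": "555-0199", "minutes": 4},
]

# Behind-the-scenes aggregation: one row per phone number.
summary = {}
for call in calls:
    stats = summary.setdefault(call["number"], {"calls": 0, "minutes": 0})
    stats["calls"] += 1
    stats["minutes"] += call["minutes"]

# "Total minutes for 555-0100" now reads 1 summary row, not every call record.
assert summary["555-0100"] == {"calls": 2, "minutes": 10}
```

With a billion calls but only millions of distinct numbers, the summary table is orders of magnitude smaller than the raw one, which is where the speedup comes from.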
These approaches can buy you performance, but they create a lot of problems, because you've now created all these disconnected copies of the data, which you then have to manage. You must also keep permissions on those datasets up to date, since they contain sensitive data and permissions are not automatically carried over from the source. In our opinion, the only real way to solve performance problems is to recognize that people need to transform data for logical business reasons. Maybe I need zip codes instead of addresses. Things are misspelled and need to be corrected. Maybe data needs to be organized differently. There will always be a need for logical transformations, but there's no reason to tie them to physical transformations. So we say: put the logical transformations in a semantic layer.
You still need to worry about performance, though, because these logical transformations happen at query time. So we created what we call materializations, which are basically different aggregations or sorts of the data, maintained behind the scenes, without the user ever connecting to one of them directly. In the traditional model, users connect to pre-joined or pre-aggregated tables, so whenever you want to make a change, you can't, because users have already built their dashboards on that particular table. In our world, users connect to a logical layer, and the query planner uses those additional aggregations or sorts automatically and transparently. Your data team can change the optimized versions of the data without users ever knowing, while giving them sub-second response times.
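The transparent-substitution idea can be sketched as a toy query planner. This is an illustration only, assuming an invented `total_minutes` query and a hand-maintained materialization; it is not Dremio's planner.

```python
# Toy "planner": the user asks a logical question; if a matching
# materialization exists, answer from it; otherwise fall back to a full scan.
raw_calls = [("555-0100", 3), ("555-0100", 7), ("555-0199", 4)]

# Maintained behind the scenes; users never reference it directly.
materialized_totals = {"555-0100": 10}

def total_minutes(number):
    # The substitution happens inside the planner, invisibly to the caller.
    if number in materialized_totals:
        return materialized_totals[number]              # fast path
    return sum(m for n, m in raw_calls if n == number)  # fallback full scan

assert total_minutes("555-0100") == 10  # served from the materialization
assert total_minutes("555-0199") == 4   # served by scanning raw data
```

The key property is that both paths return the same answer, so the data team can add, drop, or rebuild materializations without breaking any dashboard.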
CDI: How do you pre-select what to aggregate, merge or join?
Tomer: We have features in technology preview that automate these choices based on a company's previous workloads. We study query patterns and how data is typically aggregated, and from that we generate recommendations that data engineers can override. For example, we might not know that a certain dashboard is run by the CEO and is very, very important, while another is used by interns and matters less. You can't always see the business context behind the scenes, so we give data teams the control they need.
CDI: Sounds like artificial intelligence in terms of pattern detection and prediction. Does Dremio have AI embedded?
Tomer: Yes. Over time, our understanding of workloads gets smarter. That's the key to preselecting optimizations efficiently.
CDI: In addition to Dremio's commercial track record, it has made significant contributions to open source software. You founded the Apache Arrow project, which saw over 7 million downloads in 2022. How do you explain its popularity?
Tomer: It is one of the most popular open source projects today. First, it's driven by the need for data: the number of data scientists, and the number of people using data science tools and Python. The PyArrow library is in basically every notebook data scientists create and every application they build. Arrow provides very fast data access, and that's why our entire engine is based on it.
In addition to Apache Arrow, we are also huge contributors to Iceberg, and we support it natively. We built a team dedicated to evangelizing it and educating the market about its benefits, and now we see many companies adopting Iceberg in their products.
See also:Why Python is Best for AI, ML, and Deep Learning
CDI: Open source projects sometimes start with exploration and experimentation. Was that Dremio's approach, or were you solving a real problem needed to complete your product strategy?
Tomer: It was more the latter, something we needed. We were building our query engine, and we knew that to get world-class performance we needed a columnar, in-memory format. Initially, this technology existed only in our product. But before we actually shipped the first version, we realized that other database companies and other data science tools would need something like it. Everyone could build their own, of course, but if we open-sourced that part of Dremio, this single format might become the standard, which would also benefit us in many ways, because a large community of developers would contribute to the project and make it better, faster.
CDI: Let's end our conversation with a forward-looking question: where do you think data lakehouses are headed?
Tomer: Well, I'm really excited about the new paradigm of managing data as code and all the benefits and agility it will bring to companies. I think it will be a game changer for many of them, because it makes data easier to work with, collaborate on, and manage. It also makes it easier to build data meshes and data products. There are many benefits to this approach.
[Note: Subsurface 2023 sessions are available to watch on demand after free registration.]
Bio: Tomer Shiran, Dremio's co-founder and CPO, spent his first 4.5 years as Dremio's CEO, overseeing the development of the company's core technology and growing the team to 100 employees. Previously, he was the fourth employee and VP of Product at MapR, a pioneer in big data analytics. Tomer has held a number of product management and engineering positions at IBM Research and Microsoft. He is the founder of two websites that have served millions of users and over 100,000 paying customers, and is the author of numerous US patents. He holds an MS in Computer Engineering from Carnegie Mellon University and a BS in Computer Science from the Technion-Israel Institute of Technology.
Elisabeth Strenger is with CDInsights.ai.