This architecture presents the platform topology, a component overview, recommended best practices, and Terraform automation for deploying an open source data lakehouse on OCI.
A data lakehouse can store and aggregate data from enterprise applications. Data can be sent to a data lake or to a data warehouse. Data in the data lake can be processed and loaded into the data warehouse, or it can be read directly from the data lake for advanced analytics.
The following diagram illustrates this reference architecture.
Illustration open-source-data-lakehouse.png
Open Source Data Lake House oracle.zip
In this data lakehouse architecture on OCI, Oracle MySQL HeatWave serves as the data warehouse. Oracle MySQL HeatWave is the only MySQL cloud service with a built-in, high-performance, in-memory query accelerator. It is the only service that lets database administrators and application developers run OLTP and OLAP workloads directly from a MySQL database. Because MySQL is optimized for OLTP, many MySQL deployments use a separate OLAP database for business analysis.
Oracle MySQL HeatWave improves MySQL performance for analytical and mixed workloads by orders of magnitude without requiring changes to existing applications. It provides a single, unified platform for transactional and analytical workloads, which eliminates the need for complex, time-consuming, and costly ETL and integration with a separate analytics database. MySQL Autopilot automates configuration, data loading, query execution, and error handling for Oracle MySQL HeatWave, which saves developers and DBAs significant time.
Oracle Cloud Infrastructure Object Storage acts as the data lake in this architecture. With OCI Object Storage, organizations can store all their data in a cost-effective, elastic environment while providing the processing, persistence, and analytics services needed to discover new business insights. With a data lake in OCI Object Storage, you can store and curate structured and unstructured data, and use methods for organizing large volumes of disparate data from multiple sources.
The presented architecture consists of the following open source components:
- Apache Zeppelin
Apache Zeppelin is a web-based notebook that supports data-driven, interactive data analytics and collaborative documents using SQL, Scala, Python, R, and more.
Zeppelin is used for data science and data exploration in this architecture. In Zeppelin, you create notebooks and use the Zeppelin interpreter concept, which allows any language or data-processing back end to be plugged in. With Zeppelin connected to both MySQL and Object Storage, you can run federated queries that retrieve data from the data warehouse and the data lake at the same time, for a true data lakehouse query experience.
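The federated query idea can be illustrated with a small, self-contained sketch. It is not Zeppelin interpreter code: Python's built-in `sqlite3` stands in for the MySQL warehouse, an in-memory CSV string stands in for an object in the data lake, and the table, column, and function names are hypothetical.

```python
import csv
import io
import sqlite3

# Stand-in for the data warehouse (MySQL in the architecture).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
warehouse.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 120.0), (2, 80.0), (1, 30.0)],
)

# Stand-in for a CSV object stored in the data lake (Object Storage).
LAKE_OBJECT = "customer_id,segment\n1,retail\n2,wholesale\n"


def federated_totals(conn, lake_csv):
    """Join per-customer order totals from the warehouse with segments from the lake."""
    segments = {
        int(row["customer_id"]): row["segment"]
        for row in csv.DictReader(io.StringIO(lake_csv))
    }
    totals = conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders "
        "GROUP BY customer_id ORDER BY customer_id"
    ).fetchall()
    # Customers missing from the lake file are labeled "unknown".
    return [(cid, segments.get(cid, "unknown"), total) for cid, total in totals]


result = federated_totals(warehouse, LAKE_OBJECT)
```

In a real notebook the two sides would be reached through Zeppelin interpreters rather than local stand-ins; the sketch only shows the shape of combining both sources in one result.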
- Grafana
Grafana is the open source dashboard platform in this architecture. It is a popular web application written in TypeScript (front end) and Go (back end) that renders graphs and charts from its supported data sources, one of which is MySQL. Many plugins are available online to extend Grafana.
Zeppelin and Grafana use a Network File System (NFS) shared between two VM instances on a private subnet. These instances exist in two different fault domains within an availability domain. The file system exists on a dedicated private subnet with a Network Security Group (NSG) that allows access to the mount target from all instances. As part of this high-availability design, users can access Zeppelin and Grafana through the OCI load balancer.
The architecture uses Oracle Cloud Infrastructure Data Integration to load files from Object Storage into MySQL. The MySQL, Data Integration, and Marine Life Data Science Workshop describes how to set up and run a data flow from Object Storage to MySQL.
The architecture consists of the following OCI components:
- Tenancy
A tenancy is a secure and isolated partition that Oracle sets up within Oracle Cloud when you sign up for Oracle Cloud Infrastructure. You can create, organize, and administer your resources in Oracle Cloud within your tenancy. A tenancy corresponds to a company or organization. Typically, a company has a single tenancy and mirrors its organizational structure within that tenancy. A single tenancy is usually associated with a single subscription, and a single subscription usually has only one tenancy.
- Compartment
Compartments are cross-region logical partitions within an Oracle Cloud Infrastructure tenancy. Use compartments to organize your resources in Oracle Cloud, control access to the resources, and set usage quotas. To control access to the resources in a given compartment, you define policies that specify who can access the resources and what actions they can perform.
- Policy
An Oracle Cloud Infrastructure Identity and Access Management policy specifies who can access which resources, and how. Access is granted at the group and compartment level, which means you can write a policy that gives a group a specific type of access within a specific compartment or to the whole tenancy.
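As an illustration of how group-level and compartment-level access combine, the following sketch builds policy statements in the general `Allow group ... to ... in ...` form used by OCI IAM. The helper function and the group and compartment names are hypothetical, and the sketch only formats text; it does not call any OCI API.

```python
VERBS = ("inspect", "read", "use", "manage")  # policy verbs, least to most access


def policy_statement(group, verb, resource_type, compartment=None):
    """Build a policy statement scoped to a compartment, or to the tenancy if none is given."""
    if verb not in VERBS:
        raise ValueError(f"unknown verb: {verb!r}")
    scope = f"compartment {compartment}" if compartment else "tenancy"
    return f"Allow group {group} to {verb} {resource_type} in {scope}"


# Hypothetical group and compartment names, for illustration only.
statement = policy_statement("DataEngineers", "manage", "object-family", "lakehouse")
tenancy_wide = policy_statement("Auditors", "inspect", "all-resources")
```

Keeping the verb list explicit makes it harder to grant broader access than intended by mistyping a statement.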
- Region
An Oracle Cloud Infrastructure region is a localized geographic area that contains one or more data centers, called availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or even continents).
- Virtual Cloud Network (VCN) and Subnets
A VCN is a customizable, software-defined network that you set up in an Oracle Cloud Infrastructure region. Like traditional data center networks, VCNs give you complete control over your network environment. A VCN can have multiple non-overlapping CIDR blocks, which you can change after you create the VCN. You can segment a VCN into subnets, which can be scoped to a region or to an availability domain. Each subnet consists of a contiguous range of addresses that does not overlap with the other subnets in the VCN. You can resize a subnet after you create it. Subnets can be public or private.
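The non-overlap requirement can be checked before any network is created. The following is a minimal sketch using Python's standard `ipaddress` module; the function name and the address plan are assumptions made for illustration, not values from this architecture.

```python
import ipaddress
from itertools import combinations


def validate_subnets(vcn_cidr, subnet_cidrs):
    """Return a list of problems; an empty list means the address plan is consistent."""
    vcn = ipaddress.ip_network(vcn_cidr)
    subnets = [ipaddress.ip_network(cidr) for cidr in subnet_cidrs]
    problems = []
    # Every subnet must sit inside the VCN's CIDR block.
    for subnet in subnets:
        if not subnet.subnet_of(vcn):
            problems.append(f"{subnet} is outside {vcn}")
    # No two subnets may share addresses.
    for first, second in combinations(subnets, 2):
        if first.overlaps(second):
            problems.append(f"{first} overlaps {second}")
    return problems


# Hypothetical plan: three /24 subnets inside a /16 VCN.
ok = validate_subnets("10.0.0.0/16", ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"])
# Hypothetical broken plan: the second subnet falls inside the first.
bad = validate_subnets("10.0.0.0/16", ["10.0.0.0/24", "10.0.0.128/25"])
```

For simplicity the sketch handles a single VCN CIDR block, although a VCN can have several.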
- Availability domain
An availability domain is a standalone, independent data center within a region. The physical resources in each availability domain are isolated from the resources in the other availability domains, which provides fault tolerance. Availability domains do not share infrastructure such as power or cooling, or the internal availability domain network. A failure in one availability domain is therefore unlikely to affect the other availability domains in the region.
- Fault domain
A fault domain is a grouping of hardware and infrastructure within an availability domain. Each availability domain has three fault domains with independent power and hardware. When you distribute resources across multiple fault domains, your applications can tolerate physical server failure, system maintenance, and power failures inside a fault domain.
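A simple way to picture distribution across fault domains is round-robin placement. The sketch below assumes the three fault domain names OCI uses within an availability domain; the instance names and the helper function are hypothetical, and nothing here provisions real resources.

```python
from itertools import cycle

FAULT_DOMAINS = ("FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3")


def spread(instance_names, fault_domains=FAULT_DOMAINS):
    """Assign instances to fault domains round-robin so neighbors land in different domains."""
    return dict(zip(instance_names, cycle(fault_domains)))


# Hypothetical names for the two VM instances that host Zeppelin and Grafana.
placement = spread(["zeppelin-grafana-1", "zeppelin-grafana-2"])
```

With two instances, each lands in its own fault domain, so a hardware failure or maintenance event inside one fault domain leaves the other instance running.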
- Object storage
OCI Object Storage is a high-performance, internet-scale storage platform that provides reliable and cost-effective data durability. Object Storage can store an unlimited amount of unstructured data of any content type, including analytic data. You can safely and securely store data, and then retrieve it directly from the internet or from within the cloud platform. Multiple management interfaces let you start small and scale seamlessly without any loss of performance or service reliability.
Use Object Storage as a cold storage layer for the data warehouse by storing infrequently used data there and then joining it seamlessly with the most recent data by using Apache Zeppelin. Use archive storage for files that must be retained for long periods and are accessed infrequently.
- MySQL HeatWave
Oracle MySQL Database Service is a fully managed database service that lets developers quickly develop and deploy secure, cloud-native applications using the world's most popular open source database. Oracle MySQL HeatWave is a new, integrated, high-performance, in-memory query accelerator for Oracle MySQL Database Service that accelerates MySQL performance for analytical and transactional queries.
- Data Integration
Oracle Cloud Infrastructure Data Integration is a fully managed, serverless, cloud-native service that extracts, loads, transforms, cleanses, and reshapes data from a variety of sources into target Oracle Cloud Infrastructure services. ETL (extract, transform, load) leverages fully managed scale-out processing on Spark. Users design data integration processes in an intuitive, code-free user interface that optimizes integration flows to generate the most efficient engine and orchestration. Execution environments are allocated and scaled automatically. OCI Data Integration provides interactive exploration and data preparation, and helps data engineers protect against schema drift by defining rules for handling schema changes.
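To make the idea of rule-based handling of schema changes concrete, here is a minimal sketch in plain Python. It is not the Data Integration API: the function, the pattern-based rules, and the sample records are all assumptions chosen for illustration.

```python
from fnmatch import fnmatch


def apply_drift_rules(record, include_patterns, exclude_patterns=()):
    """Keep fields whose names match an include pattern and no exclude pattern."""
    return {
        name: value
        for name, value in record.items()
        if any(fnmatch(name, pattern) for pattern in include_patterns)
        and not any(fnmatch(name, pattern) for pattern in exclude_patterns)
    }


# Hypothetical records: the newer one gained a "region" column upstream.
old_record = {"id": 1, "species": "orca", "tmp_batch": "a1"}
new_record = {"id": 2, "species": "seal", "region": "north", "tmp_batch": "a2"}

# Rule: pass every column through except temporary ones.
rules = {"include_patterns": ["*"], "exclude_patterns": ["tmp_*"]}
old_out = apply_drift_rules(old_record, **rules)
new_out = apply_drift_rules(new_record, **rules)
```

Because the rule is written against name patterns rather than a fixed column list, the new `region` column flows through without any change to the pipeline definition.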
- Load balancer
The Oracle Cloud Infrastructure Load Balancing service provides automated traffic distribution from a single entry point to multiple servers in the back end.
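The following sketch shows one common distribution strategy, round robin with a health check, in plain Python. It illustrates the concept only and is not how the OCI service is implemented or configured; the class and the back-end addresses are hypothetical.

```python
class RoundRobinBalancer:
    """Hand out back ends in turn, skipping any that are marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Try each back end at most once per call, starting from the cursor.
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy back ends")


# Hypothetical private addresses of the two Zeppelin and Grafana instances.
balancer = RoundRobinBalancer(["10.0.1.10", "10.0.1.11"])
picks = [balancer.pick() for _ in range(4)]
balancer.mark_down("10.0.1.10")
after_failure = [balancer.pick() for _ in range(2)]
```

While both back ends are healthy, requests alternate between them; once one is marked down, all requests go to the remaining instance.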
- Compute instance
The Oracle Cloud Infrastructure Compute service lets you provision and manage compute hosts in the cloud. You can launch compute instances with shapes that meet your resource requirements for CPU, memory, network bandwidth, and storage. After you create a compute instance, you can access it securely, restart it, attach and detach volumes, and terminate it when you no longer need it.
- File storage
The Oracle Cloud Infrastructure File Storage service provides a durable, scalable, secure, enterprise-grade network file system. You can connect to a File Storage service file system from any bare metal, virtual machine, or container instance in a VCN. You can also access a file system from outside the VCN by using Oracle Cloud Infrastructure FastConnect and IPSec VPN.
- Internet gateway
An internet gateway allows traffic between the public subnets in the VCN and the public internet.
- Network address translation (NAT) gateway
A NAT gateway allows private resources in the VCN to access hosts on the Internet without exposing those resources to incoming Internet connections.
- Network Security Group (NSG)
An NSG acts as a virtual firewall for your cloud resources. With the zero-trust security model of Oracle Cloud Infrastructure, all traffic is denied, and you control which network traffic is allowed inside a VCN. An NSG consists of a set of ingress and egress security rules that apply only to a specified set of VNICs in a single VCN.
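The deny-by-default behavior can be sketched as a small rule evaluator. This is a simplified model for illustration, not the OCI rule engine: the `IngressRule` type, the function, and the sample rule (NFS traffic on TCP port 2049 from a hypothetical VCN range) are assumptions.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    source_cidr: str  # where traffic may come from
    protocol: str     # for example "tcp"
    port: int         # destination port


def is_allowed(rules, source_ip, protocol, port):
    """Deny by default: traffic passes only if at least one rule matches it."""
    address = ipaddress.ip_address(source_ip)
    return any(
        address in ipaddress.ip_network(rule.source_cidr)
        and rule.protocol == protocol
        and rule.port == port
        for rule in rules
    )


# Hypothetical rule set for a file system mount target.
mount_target_rules = [IngressRule("10.0.0.0/16", "tcp", 2049)]

from_vcn = is_allowed(mount_target_rules, "10.0.1.5", "tcp", 2049)
from_outside = is_allowed(mount_target_rules, "192.168.1.5", "tcp", 2049)
wrong_port = is_allowed(mount_target_rules, "10.0.1.5", "tcp", 22)
```

An empty rule list allows nothing, which mirrors the starting point of the zero-trust model: access exists only where a rule grants it.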
FAQs
Which lakehouse service should you use for serverless Spark processing in OCI?
Oracle Cloud Infrastructure Data Flow provides a serverless Spark environment to process data at scale with a pay-per-use, extremely elastic model.
Which two services are used for persistence of the data in a lakehouse on OCI?
OCI Object Storage and OCI Data Catalog are the services used for persistence of the data in the lakehouse.
Is a data lakehouse open source?
An open data lakehouse is a powerful and cost-effective solution for managing and analyzing open data. Leveraging open source technologies to provide flexibility and transparency enables efficient data analysis, eliminates data silos, and supports data-driven decisions.
How do you implement a data lakehouse?
- Make sense of the data available in the data lake for a business case.
- Ingest or acquire data that is not yet available.
- Organize, integrate, and enrich the data to make it accessible and user-friendly for the business.
Oracle Cloud Infrastructure (OCI) Functions is a serverless compute service that lets developers create, run, and scale applications without managing any infrastructure.
Which three services integrate with OCI Key Management?
The Key Management service is integrated with many OCI services, including Block Volumes, File Storage, Oracle Container Engine for Kubernetes, and Object Storage.