To ingest data from an API into a Databricks environment (especially one using the Lakehouse architecture), you typically follow a structured pipeline that manages the data in layers (e.g., Bronze, Silver, and Gold). The Silver layer is where you cleanse and enrich the data before it is fully prepared for downstream analytics, business intelligence, or machine learning tasks.
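As a rough orientation before the details below, here is a minimal PySpark sketch of landing an API payload in the Bronze layer as a Delta table. The endpoint URL and table name are hypothetical placeholders, not part of any real pipeline.

```python
# Minimal sketch: pull raw JSON from a (hypothetical) API and land it in Bronze.
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Call the source API; the endpoint is a placeholder for illustration only
response = requests.get("https://api.example.com/v1/orders")
response.raise_for_status()
records = response.json()  # assume the API returns a JSON array of objects

# Land the payload as-is in the Bronze layer -- no cleansing at this stage
bronze_df = spark.createDataFrame(records)
bronze_df.write.format("delta").mode("append").saveAsTable("bronze.orders_raw")
```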
1. Parquet File
Parquet is a columnar file format that is commonly used for storing large datasets in a compressed and optimized manner. It is supported by many data processing systems, such as Apache Spark, Hive, and BigQuery, and is a common default file format for Spark workloads.
Features of Parquet Files:
- Columnar storage: Parquet stores data in columns rather than rows, making it highly efficient for analytical queries (e.g., aggregate functions).
- Efficient compression: It offers high compression rates, which reduces storage costs.
- Schema support: Parquet files include schema information (e.g., data types, column names), so they are self-describing.
- Interoperability: Parquet is an open-source format that can be used by multiple data engines (Spark, Hive, AWS Redshift, etc.).
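To make the columnar-storage point concrete, the short PySpark sketch below writes a Parquet dataset and then reads back only one column; the path and column names are illustrative.

```python
# Sketch: Parquet's columnar layout lets Spark read only the columns it needs.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# A small demo dataset with two columns
df = spark.range(1_000_000).withColumn("amount", F.rand() * 100)

# Write as Parquet (Spark applies snappy compression by default)
df.write.mode("overwrite").parquet("/tmp/demo/sales_parquet")

# Selecting a single column lets Spark skip the other columns entirely
spark.read.parquet("/tmp/demo/sales_parquet") \
    .select("amount") \
    .agg(F.avg("amount").alias("avg_amount")) \
    .show()
```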
Parquet as Raw Data:
In the Lakehouse architecture and in data pipelines generally, Parquet files are often used to store raw or semi-processed data in the Bronze layer, or even in the Silver layer once the data has been cleaned and transformed.
2. Delta Lake Table
Delta Lake is an open-source storage layer built on top of existing data lakes. It is essentially a transactional layer over Parquet files: it keeps the benefits of the Parquet format and adds ACID transactions, schema enforcement, time travel, and other advanced features.
Key Features of Delta Lake:
- ACID Transactions: Delta Lake brings atomicity, consistency, isolation, and durability (ACID) properties to data lakes. This allows you to perform updates, inserts, and deletes on data without worrying about corrupting the data.
- Schema Evolution: Delta Lake supports schema evolution, meaning you can change the schema of your table (add/remove columns) as your data model evolves. It can also enforce schema validation.
- Time Travel: Delta Lake supports time travel, allowing you to query previous versions of the data. This is useful for debugging, audit trails, or rolling back changes.
- Optimized Reads and Writes: Delta Lake leverages data skipping and optimized file formats for faster reads and writes, especially for large datasets.
- Concurrency: Delta Lake allows multiple concurrent reads and writes, enabling collaborative data processing.
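To illustrate a few of these features together, here is a hedged PySpark sketch that creates a Delta table, applies an ACID upsert with MERGE, and inspects the commit history. The schema, table, and column names are assumptions for illustration.

```python
# Sketch of Delta Lake's transactional features on Databricks.
# Assumes the "silver" schema/database already exists; names are illustrative.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Initial load into a Delta table
spark.createDataFrame(
    [(1, "EUR", 120.0), (2, "USD", 80.0)],
    ["order_id", "currency", "amount"],
).write.format("delta").mode("overwrite").saveAsTable("silver.orders")

# Incoming changes: an update to order 1 and a brand-new order 3
updates = spark.createDataFrame(
    [(1, "EUR", 150.0), (3, "GBP", 60.0)],
    ["order_id", "currency", "amount"],
)

# ACID upsert: MERGE updates matching rows and inserts new ones atomically
target = DeltaTable.forName(spark, "silver.orders")
(target.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Every commit is recorded; the table history backs time travel
target.history().select("version", "operation").show()
```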
Databricks Functionalities:
What types of transformations are performed in the Silver layer?
Don't think of the Silver layer as only very basic transformations such as concatenating columns or data type conversions. Silver-layer processing usually involves more complex transformations, such as the ones listed below. (Some of these transformations are implemented in later chapters, but they are consolidated here for complete reference; a short sketch follows the list.)
- Standardising date/time values coming from different source systems
- Integrating (joining) data from multiple source tables or from multiple source systems (covered in Section 10)
- Populating intermediate tables used for loading reporting tables (covered in Section 10) and data lake tables (covered in Section 15)
- Deriving business mapping values for the source values (e.g., converting different currency values into one single currency)
- Converting complex nested JSON data into flattened tables (covered in Section 14)
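As an example of the first transformation, the sketch below standardises date/time values that arrive in different formats from different source systems. The source systems, date formats, and table names are assumptions for illustration.

```python
# Sketch of one Silver-layer transformation: standardising date/time values
# that arrive in different formats from different (hypothetical) source systems.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.createDataFrame(
    [("SAP", "2024/03/15"), ("CRM", "15-03-2024"), ("WEB", "2024-03-15T10:22:00")],
    ["source_system", "order_date_raw"],
)

# Parse each source system's format into a single standard DATE column
silver = raw.withColumn(
    "order_date",
    F.when(F.col("source_system") == "SAP", F.to_date("order_date_raw", "yyyy/MM/dd"))
     .when(F.col("source_system") == "CRM", F.to_date("order_date_raw", "dd-MM-yyyy"))
     .otherwise(F.to_date(F.to_timestamp("order_date_raw"))),
)

silver.write.format("delta").mode("overwrite").saveAsTable("silver.orders_standardised")
```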
Advantages of Using Delta Lake Tables
- Delta Lake tables are ACID-compliant.
- ACID compliance ensures data integrity and consistency.
- Delta tables support versioning: each commit on the table creates a new version of the table data.
- You can query the different versions of the data (time travel).
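The sketch below shows how such version queries look in practice; the table name is illustrative, and the version number and timestamp would come from the table's own history.

```python
# Sketch: querying earlier versions of a Delta table (time travel).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each commit is listed in the table history with its version number
spark.sql("DESCRIBE HISTORY silver.orders") \
    .select("version", "timestamp", "operation").show()

# Read the table as of a specific version...
v0 = spark.sql("SELECT * FROM silver.orders VERSION AS OF 0")

# ...or as of a point in time
earlier = spark.sql("SELECT * FROM silver.orders TIMESTAMP AS OF '2024-03-14'")
```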