
Spark SQL

 1. Converting a DataFrame (Spark DataFrame API) into a temp view to use the Spark SQL API, and then converting it back into a DataFrame (Spark DataFrame API)


If we want to work in a different language in the Azure Databricks notebook, we just need to click on "Python" (or whatever language is shown) in the notebook cell and switch it. We can choose SQL if we want to write SQL directly, without importing the pyspark.sql.functions package.



As usual, we create the connection between the storage account and Databricks, then read the file and convert it into a DataFrame.

Now we create a temporary view of the DataFrame, which turns it into a temporary table, by calling df.createOrReplaceTempView("tempview_name"). This view can be used only in this notebook.



Then we can use that temp view to run any SQL functions. If we want to write normal SQL queries inside the notebook, we just change the cell language from Python to SQL and start querying,

or else we just need to use spark.sql("<any query>") from Python.



If we want to change the temp view back into a DataFrame, we just need to call spark.table("temp_view_name").
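
As a rough sketch of the full round trip, assuming a DataFrame df has already been read from the mounted storage (the view and column names here are made up):

Python
# DataFrame -> temp view, usable only from this notebook's Spark session
df.createOrReplaceTempView("sales_tempview")

# query the temp view with Spark SQL from a Python cell
result_df = spark.sql("SELECT * FROM sales_tempview WHERE amount > 100")  # 'amount' is a hypothetical column

# temp view -> DataFrame again
back_to_df = spark.table("sales_tempview")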


If we want to use the temp view outside of this notebook as well, we can call df.createOrReplaceGlobalTempView("temp_view_name").

Global temp views are not available after the cluster is shut down; they last only for the lifetime of the Spark application.
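
A minimal sketch of the global temp view variant (the view name is a placeholder); note that global temp views must always be qualified with the global_temp database:

Python
# visible to other notebooks attached to the same cluster (same Spark application)
df.createOrReplaceGlobalTempView("sales_global")

# query it, always prefixed with global_temp
spark.sql("SELECT * FROM global_temp.sales_global").show()

# the view disappears once the cluster / Spark application is terminated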

2. Creating Managed Tables


We switch the cell language from Python to SQL, because most of the operations here are going to be in SQL,

and we open the already created global temp view to use it in SQL via SELECT * FROM global_temp.tablename.



CREATE TABLE AS SELECT (CTAS) is a common SQL construct in a Spark SQL environment that allows you to create a new table based on the results of a SELECT query.


Creating managed tables with CTAS is a common and efficient way to build tables for your data warehouse. Let's walk through how you can do this.

When you use CTAS to create a managed table in Spark, the following happens:

  • Spark manages the lifecycle of the table's data and metadata.
  • The data for the table is stored in the location specified by Spark's warehouse directory (typically under /user/hive/warehouse/ in HDFS, but this can be configured).
  • When you drop a managed table, both the metadata and the underlying data are deleted.

Here's the basic syntax for creating a managed table using CTAS in Spark SQL:

SQL
CREATE TABLE managed_table_name
AS SELECT column1, column2, ...
FROM source_table;

Let's break down the components and explore some common options:

1. Basic Syntax:

  • CREATE TABLE managed_table_name: This specifies that you want to create a new table named managed_table_name. Since we're not specifying EXTERNAL, Spark will create a managed table.
  • AS SELECT column1, column2, ... FROM source_table: This is the crucial part. The SELECT statement defines the data that will populate the new table. You can select specific columns, apply transformations, and join multiple tables here.

2. Specifying Data Types:

Spark SQL generally does not allow an explicit column list in a CTAS statement, so if you don't want to rely on Spark's type inference, cast the columns inside the SELECT itself:

SQL
CREATE TABLE managed_table_name
AS SELECT
    CAST(col_a AS STRING)    AS column1,
    CAST(col_b AS INT)       AS column2,
    CAST(col_c AS TIMESTAMP) AS column3
FROM source_table;

3. Partitioning:

For large datasets, partitioning can significantly improve query performance. You can specify partitioning columns during table creation:

SQL
CREATE TABLE partitioned_managed_table
PARTITIONED BY (partition_column)
AS SELECT col_a AS column1, col_b AS column2, date_col AS partition_column
FROM source_table;

(In a CTAS statement the partition column is referenced by name from the query output, without repeating its data type.)

4. Bucketing:

Bucketing is another optimization technique that can be useful for joins and aggregations on specific columns:

SQL
CREATE TABLE bucketed_managed_table
CLUSTERED BY (column1) INTO 4 BUCKETS
AS SELECT col_a AS column1, col_b AS column2
FROM source_table;

5. Table Properties:

You can set various table properties using the TBLPROPERTIES clause. These properties can control storage formats, compression, and other table-specific settings:

SQL
CREATE TABLE managed_table_with_props
TBLPROPERTIES ('storage.location.template'='/path/to/data/${year}/${month}/', 'parquet.compression'='SNAPPY')
AS SELECT column1, column2
FROM source_table;

6. Using Different File Formats:

By default, managed tables in Spark use the configured default file format (Delta on Databricks, otherwise usually Parquet). However, you can explicitly specify a different format with the USING clause (STORED AS is the Hive-compatible equivalent for Hive formats such as ORC):

SQL
CREATE TABLE managed_table_orc
USING ORC
AS SELECT column1, column2
FROM source_table;

CREATE TABLE managed_table_csv
USING CSV
OPTIONS ('header'='true', 'delimiter'=',')
AS SELECT column1, column2
FROM source_table;

Example Scenario:

Let's say you have a source table named raw_sales_data with columns product_id, sale_date, and amount. You want to create a managed table called daily_sales_summary that aggregates the total sales amount for each day.

SQL
CREATE TABLE daily_sales_summary
PARTITIONED BY (sale_day)
AS SELECT
    product_id,
    CAST(sale_date AS DATE) AS sale_day,
    SUM(amount) AS total_amount
FROM raw_sales_data
GROUP BY product_id, CAST(sale_date AS DATE);

In this example:

  • We create a managed table named daily_sales_summary.
  • We partition the table by the sale_day.
  • We select the product_id, cast the sale_date to a DATE for partitioning, and calculate the SUM of the amount.
  • The data is grouped by product_id and the extracted sale_day.

Key Considerations:

  • Permissions: Ensure that the Spark user has the necessary permissions to create tables in the designated warehouse directory.
  • Performance: Consider partitioning and bucketing strategies based on your query patterns and data volume.
  • Data Types: Be mindful of data type conversions during the SELECT operation. Explicitly casting when necessary can prevent unexpected issues.
  • Managed vs. External: Remember that managed tables are tightly coupled with Spark's metastore. Dropping the table will delete the data. If you need the data to persist independently, consider creating an external table instead.



In Databricks, we now create a permanent (managed) table using CTAS, which can then be viewed under the Catalog section.


The table was stored under the default Unity Catalog; one catalog was available automatically once the cluster was created under Compute.
Now we insert rows into the managed table by using the normal INSERT statement in SQL.



After inserting, if we rerun the COUNT query, the count changes. We are responsible for the contents that should be available in the table, so we should always be careful about it.
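
A hedged sketch of the insert-and-recount step, reusing the daily_sales_summary example table from above (the inserted values are just placeholders):

Python
# add one more row to the managed table
spark.sql("INSERT INTO daily_sales_summary VALUES (101, DATE '2024-01-01', 250.0)")

# rerun the count; it goes up by the number of inserted rows
spark.sql("SELECT COUNT(*) FROM daily_sales_summary").show()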



Now we alter the table by adding a column. The table schema is updated with the new column, but the column has no data yet, since we haven't loaded any values into it.
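
For example (the new column name is hypothetical), adding a column to the same example table might look like this:

Python
# add a new column; existing rows show NULL for it until data is loaded
spark.sql("ALTER TABLE daily_sales_summary ADD COLUMNS (region STRING)")
spark.sql("SELECT * FROM daily_sales_summary").show()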




Now we update the data in the created table with the usual UPDATE command.

Like Hive, we can run DESCRIBE tablename to see what is in the table; in order to know more, we can use DESCRIBE EXTENDED tablename for a detailed description.
If we drop the table, it wipes out the entire table, including its data and metadata, because it is a managed table.
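
A sketch of those commands against the same example table (the predicate and value are placeholders):

Python
# UPDATE is supported because the managed table is a Delta table
spark.sql("UPDATE daily_sales_summary SET region = 'EU' WHERE product_id = 101")

spark.sql("DESCRIBE daily_sales_summary").show()                           # columns and data types
spark.sql("DESCRIBE EXTENDED daily_sales_summary").show(truncate=False)    # provider, location, table type, etc.

spark.sql("DROP TABLE daily_sales_summary")   # managed table: data and metadata are both removed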


Now we create a managed table again from the global temp view, but this time using the Python language:
df.write.saveAsTable("table_name")
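
A minimal sketch, reusing the placeholder global temp view name from earlier and a made-up target table name:

Python
# read the global temp view back as a DataFrame
df2 = spark.table("global_temp.sales_global")

# persist it as a managed (Delta) table registered in the catalog
df2.write.mode("overwrite").saveAsTable("sales_managed_py")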

The table was created and saved under the Unity Catalog.



In the table details, the provider shows as Delta, which is the basis of a Delta Lake table.

the "provider" often shows as "Delta" when you're working with Delta Lake tables, especially in environments like Azure Databricks or Azure Synapse Analytics.

Yes, "Delta" as the provider is fundamental and essentially the identifier for a Delta Lake table.

Here's why:

  • Delta Lake's Core: Delta Lake is an open-source storage layer that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions to Apache Spark and big data workloads. It enables features like schema evolution, time travel, and unified streaming and batch processing on data lakes.

External Tables:


Using dbutils.fs.ls shows all the files available in the file system. Now we are going to create the external table from that data.
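
For instance (the mount path here is just a placeholder):

Python
# list the files and folders under the mounted storage path
display(dbutils.fs.ls("/mnt/raw"))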

Just by querying, we can view the data inside the file in table format, by running a SELECT over the file path.
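
A hedged example of querying a file directly, assuming a Parquet folder at a made-up mount path (the back-quoted path is Spark SQL's file-query syntax):

Python
# read the files as a table without registering any table first
spark.sql("SELECT * FROM parquet.`/mnt/raw/sales/`").show()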

Now we create a view from that query, and the view is stored by default in the catalog.

We can see its details by running DESCRIBE EXTENDED viewname.
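
A sketch of creating that view and inspecting it (the view name and path are placeholders):

Python
# register a view over the direct file query; it is stored in the catalog by default
spark.sql("""
    CREATE OR REPLACE VIEW sales_file_view AS
    SELECT * FROM parquet.`/mnt/raw/sales/`
""")

# detailed description: view text, catalog/database, etc.
spark.sql("DESCRIBE EXTENDED sales_file_view").show(truncate=False)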

Now we create the external table directly from the file location by using CTAS.
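
A sketch under assumed names and paths (on Databricks the default provider is Delta); specifying LOCATION is what makes the CTAS table external here:

Python
# create an external table whose data lives at the given storage path
spark.sql("""
    CREATE TABLE sales_ext
    LOCATION '/mnt/external/sales_ext'
    AS SELECT * FROM parquet.`/mnt/raw/sales/`
""")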

If you drop an external table, it just deletes the metadata of the table, not the files stored in the file system.

Now we create the external table without CTAS.

An external table was created.
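
A minimal sketch of the non-CTAS version, with an explicit schema and an assumed CSV location:

Python
# define the schema yourself and point the table at existing files
spark.sql("""
    CREATE TABLE sales_ext_csv (
        product_id INT,
        sale_date  DATE,
        amount     DOUBLE
    )
    USING CSV
    OPTIONS ('header' = 'true', 'delimiter' = ',')
    LOCATION '/mnt/raw/sales_csv/'
""")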

Module Summary
  1.   Explain the scenarios where you have used external tables.

    • When data is stored in external storage and also managed by external systems

    • Need to keep the underlying data files when dropping the table

    • Multiple teams share the data stored in the common place

  2. Difference Between External Tables and Managed (Delta) Tables?

    • Managed Tables

      • Use when you want Databricks to handle the entire lifecycle(create/insert/update/delete) of the table and data

      • Ideal for the tables used inside the Databricks platform

      • Simplifies data load since Databricks handles storage and clean-up.

    • External Tables

      • Use when you want to maintain control over the data processing and storage location

      • Ideal for data that is shared across multiple platforms or external systems

      • Useful for scenarios where data must persist even if the table is dropped

  3. Different types of Spark SQL Functions(Transformations) used in the project?

    • SQL Transformation Functions

      • DISTINCT -  Deduplicate records

      • AGGREGATION FUNCTION - SUM , COUNT

      • ANALYTICAL FUNCTIONS - ROW_NUMBER

      • CASE WHEN FUNCTION - Derive Complex Conditional Logic

    • Date Manipulation Functions

      • current_timestamp() - Get the current date and time value

      • Date Extraction Functions - year(), month(), weekofyear()

      • Date Formatting Functions - date_format()

      • Date Conversion Functions - to_date()
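
A rough sketch that combines several of these functions in one query (the table and column names are made up):

Python
spark.sql("""
    SELECT
        product_id,
        to_date(sale_ts)                AS sale_date,    -- date conversion
        year(sale_ts)                   AS sale_year,    -- date extraction
        date_format(sale_ts, 'yyyy-MM') AS sale_month,   -- date formatting
        CASE WHEN amount > 1000 THEN 'HIGH' ELSE 'LOW' END AS amount_band,         -- conditional logic
        ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY sale_ts DESC) AS rn,   -- analytical function
        current_timestamp()             AS load_ts       -- current date and time
    FROM (SELECT DISTINCT * FROM raw_sales_data)         -- deduplicate records
""").show()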
