
Spark SQL

 1. Converting a DataFrame (Spark DataFrame API) into a temp view to use the Spark SQL API, and then converting it back into a DataFrame (Spark DataFrame API)


If we want to work in a different language in the Azure Databricks notebook, we just need to click on "Python" (or whatever language is shown) in the notebook cell and switch it. We can choose SQL if we want to write SQL directly, without importing the pyspark.sql.functions package.



As usual, we create the connection between the storage account and Databricks, then read the file and convert it into a DataFrame.

Now we create a temporary view of the DataFrame, which turns it into a temporary table, by calling df.createOrReplaceTempView("tempview_name"). This view can be used only in this notebook.



Then we can use that temp view to run any SQL functions. If we want to write normal SQL queries inside the notebook, we just change the cell language from Python to SQL and start querying,

or else we just need to use spark.sql("<any query>") from Python.



If we want to change the temp view back into a DataFrame, we just need to call spark.table("temp_view_name").
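
As a rough sketch of the full round trip, assuming a DataFrame df has already been read from the mounted storage (the view and column names here are made up):

Python
# DataFrame -> temp view, usable only from this notebook's Spark session
df.createOrReplaceTempView("sales_tempview")

# query the temp view with Spark SQL from a Python cell
result_df = spark.sql("SELECT * FROM sales_tempview WHERE amount > 100")  # 'amount' is a hypothetical column

# temp view -> DataFrame again
back_to_df = spark.table("sales_tempview")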


If we want to use the temp view outside of this notebook as well, we can call df.createOrReplaceGlobalTempView("temp_view_name").

Global temp views are not available after the cluster is shut down; they last only for the lifetime of the Spark application.
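
A minimal sketch of the global temp view variant (the view name is a placeholder); note that global temp views must always be qualified with the global_temp database:

Python
# visible to other notebooks attached to the same cluster (same Spark application)
df.createOrReplaceGlobalTempView("sales_global")

# query it, always prefixed with global_temp
spark.sql("SELECT * FROM global_temp.sales_global").show()

# the view disappears once the cluster / Spark application is terminated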

2. Creating Managed Tables


We switch the cell language from Python to SQL, because most of the operations here are going to be in SQL,

and we open the already created global temp view to use it in SQL via SELECT * FROM global_temp.tablename.



CREATE TABLE AS SELECT (CTAS) is a common SQL construct in a Spark SQL environment that allows you to create a new table based on the results of a SELECT query.


Creating managed tables with CTAS is a common and efficient way to build tables for your data warehouse. Let's walk through how you can do this.

When you use CTAS to create a managed table in Spark, the following happens:

  • Spark manages the lifecycle of the table's data and metadata.
  • The data for the table is stored in the location specified by Spark's warehouse directory (typically under /user/hive/warehouse/ in HDFS, but this can be configured).
  • When you drop a managed table, both the metadata and the underlying data are deleted.

Here's the basic syntax for creating a managed table using CTAS in Spark SQL:

SQL
CREATE TABLE managed_table_name
AS SELECT column1, column2, ...
FROM source_table;

Let's break down the components and explore some common options:

1. Basic Syntax:

  • CREATE TABLE managed_table_name: This specifies that you want to create a new table named managed_table_name. Since we're not specifying EXTERNAL, Spark will create a managed table.
  • AS SELECT column1, column2, ... FROM source_table: This is the crucial part. The SELECT statement defines the data that will populate the new table. You can select specific columns, apply transformations, and join multiple tables here.

2. Specifying Data Types:

Spark SQL generally does not allow an explicit column list in a CTAS statement, so if you don't want to rely on Spark's type inference, cast the columns inside the SELECT itself:

SQL
CREATE TABLE managed_table_name
AS SELECT
    CAST(col_a AS STRING)    AS column1,
    CAST(col_b AS INT)       AS column2,
    CAST(col_c AS TIMESTAMP) AS column3
FROM source_table;

3. Partitioning:

For large datasets, partitioning can significantly improve query performance. You can specify partitioning columns during table creation:

SQL
CREATE TABLE partitioned_managed_table
PARTITIONED BY (partition_column)
AS SELECT col_a AS column1, col_b AS column2, date_col AS partition_column
FROM source_table;

(In a CTAS statement the partition column is referenced by name from the query output, without repeating its data type.)

4. Bucketing:

Bucketing is another optimization technique that can be useful for joins and aggregations on specific columns:

SQL
CREATE TABLE bucketed_managed_table
CLUSTERED BY (column1) INTO 4 BUCKETS
AS SELECT col_a AS column1, col_b AS column2
FROM source_table;

5. Table Properties:

You can set various table properties using the TBLPROPERTIES clause. These properties can control storage formats, compression, and other table-specific settings:

SQL
CREATE TABLE managed_table_with_props
TBLPROPERTIES ('storage.location.template'='/path/to/data/${year}/${month}/', 'parquet.compression'='SNAPPY')
AS SELECT column1, column2
FROM source_table;

6. Using Different File Formats:

By default, managed tables in Spark use the configured default file format (Delta on Databricks, otherwise usually Parquet). However, you can explicitly specify a different format with the USING clause (STORED AS is the Hive-compatible equivalent for Hive formats such as ORC):

SQL
CREATE TABLE managed_table_orc
USING ORC
AS SELECT column1, column2
FROM source_table;

CREATE TABLE managed_table_csv
USING CSV
OPTIONS ('header'='true', 'delimiter'=',')
AS SELECT column1, column2
FROM source_table;

Example Scenario:

Let's say you have a source table named raw_sales_data with columns product_id, sale_date, and amount. You want to create a managed table called daily_sales_summary that aggregates the total sales amount for each day.

SQL
CREATE TABLE daily_sales_summary
PARTITIONED BY (sale_day)
AS SELECT
    product_id,
    CAST(sale_date AS DATE) AS sale_day,
    SUM(amount) AS total_amount
FROM raw_sales_data
GROUP BY product_id, CAST(sale_date AS DATE);

In this example:

  • We create a managed table named daily_sales_summary.
  • We partition the table by the sale_day.
  • We select the product_id, cast the sale_date to a DATE for partitioning, and calculate the SUM of the amount.
  • The data is grouped by product_id and the extracted sale_day.

Key Considerations:

  • Permissions: Ensure that the Spark user has the necessary permissions to create tables in the designated warehouse directory.
  • Performance: Consider partitioning and bucketing strategies based on your query patterns and data volume.
  • Data Types: Be mindful of data type conversions during the SELECT operation. Explicitly casting when necessary can prevent unexpected issues.
  • Managed vs. External: Remember that managed tables are tightly coupled with Spark's metastore. Dropping the table will delete the data. If you need the data to persist independently, consider creating an external table instead.



In Databricks, we now create a permanent (managed) table using CTAS, which can then be viewed under the Catalog section.


The table was stored under the default Unity Catalog; one catalog was available automatically once the cluster was created under Compute.
Now we insert rows into the managed table by using the normal INSERT statement in SQL.



After inserting, if we rerun the COUNT query, the count changes. We are responsible for the contents that should be available in the table, so we should always be careful about it.
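
A hedged sketch of the insert-and-recount step, reusing the daily_sales_summary example table from above (the inserted values are just placeholders):

Python
# add one more row to the managed table
spark.sql("INSERT INTO daily_sales_summary VALUES (101, DATE '2024-01-01', 250.0)")

# rerun the count; it goes up by the number of inserted rows
spark.sql("SELECT COUNT(*) FROM daily_sales_summary").show()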



Now we alter the table by adding a column. The table schema is updated with the new column, but the column has no data yet, since we haven't loaded any values into it.
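
For example (the new column name is hypothetical), adding a column to the same example table might look like this:

Python
# add a new column; existing rows show NULL for it until data is loaded
spark.sql("ALTER TABLE daily_sales_summary ADD COLUMNS (region STRING)")
spark.sql("SELECT * FROM daily_sales_summary").show()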




Now we update the data in the created table with the usual UPDATE command.

Like Hive, we can run DESCRIBE tablename to see what is in the table; in order to know more, we can use DESCRIBE EXTENDED tablename for a detailed description.
If we drop the table, it wipes out the entire table, including its data and metadata, because it is a managed table.
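
A sketch of those commands against the same example table (the predicate and value are placeholders):

Python
# UPDATE is supported because the managed table is a Delta table
spark.sql("UPDATE daily_sales_summary SET region = 'EU' WHERE product_id = 101")

spark.sql("DESCRIBE daily_sales_summary").show()                           # columns and data types
spark.sql("DESCRIBE EXTENDED daily_sales_summary").show(truncate=False)    # provider, location, table type, etc.

spark.sql("DROP TABLE daily_sales_summary")   # managed table: data and metadata are both removed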


Now we create a managed table again from the global temp view, but this time using the Python language:
df.write.saveAsTable("table_name")
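
A minimal sketch, reusing the placeholder global temp view name from earlier and a made-up target table name:

Python
# read the global temp view back as a DataFrame
df2 = spark.table("global_temp.sales_global")

# persist it as a managed (Delta) table registered in the catalog
df2.write.mode("overwrite").saveAsTable("sales_managed_py")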

The table was created and saved under the Unity Catalog.



In the table details, the provider shows as Delta, which is the basis of a Delta Lake table.

the "provider" often shows as "Delta" when you're working with Delta Lake tables, especially in environments like Azure Databricks or Azure Synapse Analytics.

Yes, "Delta" as the provider is fundamental and essentially the identifier for a Delta Lake table.

Here's why:

  • Delta Lake's Core: Delta Lake is an open-source storage layer that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions to Apache Spark and big data workloads. It enables features like schema evolution, time travel, and unified streaming and batch processing on data lakes.

External Tables:


Using dbutils.fs.ls shows all the files available in the file system. Now we are going to create the external table from that data.
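
For instance (the mount path here is just a placeholder):

Python
# list the files and folders under the mounted storage path
display(dbutils.fs.ls("/mnt/raw"))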

Just by querying, we can view the data inside the file in table format, by running a SELECT over the file path.
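
A hedged example of querying a file directly, assuming a Parquet folder at a made-up mount path (the back-quoted path is Spark SQL's file-query syntax):

Python
# read the files as a table without registering any table first
spark.sql("SELECT * FROM parquet.`/mnt/raw/sales/`").show()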

Now we create a view from that query, and the view is stored by default in the catalog.

We can see its details by running DESCRIBE EXTENDED viewname.
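
A sketch of creating that view and inspecting it (the view name and path are placeholders):

Python
# register a view over the direct file query; it is stored in the catalog by default
spark.sql("""
    CREATE OR REPLACE VIEW sales_file_view AS
    SELECT * FROM parquet.`/mnt/raw/sales/`
""")

# detailed description: view text, catalog/database, etc.
spark.sql("DESCRIBE EXTENDED sales_file_view").show(truncate=False)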

Now we create the external table directly from the file location by using CTAS.
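
A sketch under assumed names and paths (on Databricks the default provider is Delta); specifying LOCATION is what makes the CTAS table external here:

Python
# create an external table whose data lives at the given storage path
spark.sql("""
    CREATE TABLE sales_ext
    LOCATION '/mnt/external/sales_ext'
    AS SELECT * FROM parquet.`/mnt/raw/sales/`
""")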

If you drop an external table, it just deletes the metadata of the table, not the files stored in the file system.

Now we create the external table without CTAS.

An external table was created.
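
A minimal sketch of the non-CTAS version, with an explicit schema and an assumed CSV location:

Python
# define the schema yourself and point the table at existing files
spark.sql("""
    CREATE TABLE sales_ext_csv (
        product_id INT,
        sale_date  DATE,
        amount     DOUBLE
    )
    USING CSV
    OPTIONS ('header' = 'true', 'delimiter' = ',')
    LOCATION '/mnt/raw/sales_csv/'
""")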

Module Summary
  1.   Explain the scenarios where you have used external tables.

    • When data is stored in external storage and also managed by external systems

    • Need to keep the underlying data files when dropping the table

    • Multiple teams share the data stored in the common place

  2. Difference Between External Tables and Managed (Delta) Tables?

    • Managed Tables

      • Use when you want Databricks to handle the entire lifecycle(create/insert/update/delete) of the table and data

      • Ideal for the tables used inside the Databricks platform

      • Simplifies data load since Databricks handles storage and clean-up.

    • External Tables

      • Use when you want to maintain control over the data processing and storage location

      • Ideal for data that is shared across multiple platforms or external systems

      • Useful for scenarios where data must persist even if the table is dropped

  3. Different types of Spark SQL Functions(Transformations) used in the project?

    • SQL Transformation Functions

      • DISTINCT -  Deduplicate records

      • AGGREGATION FUNCTION - SUM , COUNT

      • ANALYTICAL FUNCTIONS - ROW_NUMBER

      • CASE WHEN FUNCTION - Derive Complex Conditional Logic

    • Date Manipulation Functions

      • current_timestamp() - Get the current date and time value

      • Date Extraction Functions - year(), month(), weekofyear()

      • Date Formatting Functions - date_format()

      • Date Conversion Functions - to_date()
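
A rough sketch that combines several of these functions in one query (the table and column names are made up):

Python
spark.sql("""
    SELECT
        product_id,
        to_date(sale_ts)                AS sale_date,    -- date conversion
        year(sale_ts)                   AS sale_year,    -- date extraction
        date_format(sale_ts, 'yyyy-MM') AS sale_month,   -- date formatting
        CASE WHEN amount > 1000 THEN 'HIGH' ELSE 'LOW' END AS amount_band,         -- conditional logic
        ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY sale_ts DESC) AS rn,   -- analytical function
        current_timestamp()             AS load_ts       -- current date and time
    FROM (SELECT DISTINCT * FROM raw_sales_data)         -- deduplicate records
""").show()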
