1. Converting the DataFrame (Spark DataFrame API) into a temp view to use the Spark SQL API, and then converting it back into a DataFrame
As usual, we create the connection between the storage account and Databricks, then read the file and convert it into a DataFrame.
We can then use that temp view to run any SQL functions. If we want to write plain SQL queries inside the notebook, we just change the cell language from Python to SQL and start querying.
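A minimal PySpark sketch of this round trip, assuming a DataFrame called sales_df has already been read from the storage account (the view and column names are hypothetical, and spark is the SparkSession that Databricks notebooks provide automatically):

# register the DataFrame as a temp view so it can be queried with the SQL API
sales_df.createOrReplaceTempView("sales_vw")

# run a Spark SQL query against the temp view; the result comes back as a DataFrame again
product_totals_df = spark.sql("""
    SELECT product_id, SUM(amount) AS total_amount
    FROM sales_vw
    GROUP BY product_id
""")
product_totals_df.show()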
2. Creating the managed tables
We change the notebook language from Python to SQL because most of the operations from here on are in SQL,
and open the already created global temp view in SQL using SELECT * FROM global_temp.tablename.
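A short sketch of that step, reusing the hypothetical sales_df from above; the view name is a placeholder:

# create a global temp view; it lives in the reserved global_temp database
# and is visible across notebooks attached to the same cluster
sales_df.createOrReplaceGlobalTempView("sales_gv")

# query it through the global_temp prefix
spark.sql("SELECT * FROM global_temp.sales_gv").show()

In a %sql cell the same query is simply SELECT * FROM global_temp.sales_gv.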
CREATE TABLE AS SELECT (CTAS) is a common SQL construct in Spark SQL that lets you create a new table from the results of a SELECT query. It is a common and efficient way to build managed tables, so let's walk through how to do this.
When you use CTAS to create a managed table in Spark, the following happens:
- Spark manages the lifecycle of the table's data and metadata.
- The data for the table is stored in the location specified by Spark's warehouse directory (typically under /user/hive/warehouse/ in HDFS, but this can be configured).
- When you drop a managed table, both the metadata and the underlying data are deleted.
Here's the basic syntax for creating a managed table using CTAS in Spark SQL:
CREATE TABLE managed_table_name
AS SELECT column1, column2, ...
FROM source_table;
Let's break down the components and explore some common options:
1. Basic Syntax:
- CREATE TABLE managed_table_name: this specifies that you want to create a new table named managed_table_name. Since we're not specifying EXTERNAL, Spark will create a managed table.
- AS SELECT column1, column2, ... FROM source_table: this is the crucial part. The SELECT statement defines the data that will populate the new table. You can select specific columns, apply transformations, and join multiple tables here.
2. Specifying Data Types:
In open-source Spark SQL a CTAS statement cannot include an explicit column list, so if you don't want to rely on Spark's type inference, the portable way to control the column types is to cast inside the SELECT itself:

CREATE TABLE managed_table_name
AS SELECT
  CAST(col_a AS STRING)    AS column1,
  CAST(col_b AS INT)       AS column2,
  CAST(col_c AS TIMESTAMP) AS column3
FROM source_table;
3. Partitioning:
For large datasets, partitioning can significantly improve query performance. You can specify the partitioning column during table creation; in a CTAS statement the partition column is referenced by name, and its type comes from the SELECT:

CREATE TABLE partitioned_managed_table
PARTITIONED BY (partition_column)
AS SELECT col_a AS column1, col_b AS column2, date_col AS partition_column
FROM source_table;
4. Bucketing:
Bucketing is another optimization technique that can be useful for joins and aggregations on specific columns:

CREATE TABLE bucketed_managed_table
CLUSTERED BY (column1) INTO 4 BUCKETS
AS SELECT col_a AS column1, col_b AS column2
FROM source_table;
5. Table Properties:
You can set various table properties using the TBLPROPERTIES clause, which goes before the AS SELECT in a CTAS statement. These properties can control compression and other table-specific settings:

CREATE TABLE managed_table_with_props
TBLPROPERTIES ('parquet.compression' = 'SNAPPY')
AS SELECT column1, column2
FROM source_table;
6. Using Different File Formats:
By default, managed tables in Spark use the configured default file format (often Parquet; on Databricks the default is Delta). However, you can explicitly specify a different format with the USING clause:

CREATE TABLE managed_table_orc
USING ORC
AS SELECT column1, column2
FROM source_table;

CREATE TABLE managed_table_csv
USING CSV
OPTIONS ('header' = 'true', 'delimiter' = ',')
AS SELECT column1, column2
FROM source_table;
Example Scenario:
Let's say you have a source table named raw_sales_data with columns product_id, sale_date, and amount. You want to create a managed table called daily_sales_summary that aggregates the total sales amount for each product per day.
CREATE TABLE daily_sales_summary
PARTITIONED BY (sale_day)
AS SELECT
product_id,
CAST(sale_date AS DATE) AS sale_day,
SUM(amount) AS total_amount
FROM raw_sales_data
GROUP BY product_id, CAST(sale_date AS DATE);
In this example:
- We create a managed table named daily_sales_summary.
- We partition the table by sale_day.
- We select the product_id, cast the sale_date to a DATE for partitioning, and calculate the SUM of the amount.
- The data is grouped by product_id and the extracted sale_day.
Key Considerations:
- Permissions: Ensure that the Spark user has the necessary permissions to create tables in the designated warehouse directory.
- Performance: Consider partitioning and bucketing strategies based on your query patterns and data volume.
- Data Types: Be mindful of data type conversions during the SELECT operation. Explicitly casting when necessary can prevent unexpected issues.
- Managed vs. External: Remember that managed tables are tightly coupled with Spark's metastore. Dropping the table will delete the data. If you need the data to persist independently, consider creating an external table instead.
The table was stored under the default Unity Catalog; a default catalog is already available when you create the cluster under Compute.
After inserting rows, if we rerun the count() query the count changes; we are responsible for the contents of the table, so we should always be careful about what goes into it.
Now we alter the table by adding a column; the table is updated with the new column, but the column has no data yet since we haven't loaded anything into it.
Like Hive, we can run DESCRIBE tablename to see what is in the table; to know more, we can use DESCRIBE EXTENDED tablename for a detailed description.
If we drop the table, it wipes out the entire table, including its data and metadata.
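A hedged sketch of that sequence against the daily_sales_summary table from the earlier example; the inserted row and the added column name are made up for illustration:

# insert a row, then recount: the count now reflects the insert
spark.sql("INSERT INTO daily_sales_summary VALUES (101, DATE'2024-01-01', 250.0)")
spark.sql("SELECT COUNT(*) FROM daily_sales_summary").show()

# add a column (it exists in the schema but holds no data yet)
spark.sql("ALTER TABLE daily_sales_summary ADD COLUMNS (currency STRING)")

# detailed description; on Databricks the Provider row shows 'delta'
spark.sql("DESCRIBE EXTENDED daily_sales_summary").show(truncate=False)

# managed table: dropping it removes both the metadata and the underlying data
spark.sql("DROP TABLE daily_sales_summary")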
In the DESCRIBE EXTENDED output, the provider shows as Delta; this is the basis of a Delta Lake table.
The provider shows as "Delta" when you're working with Delta Lake tables, especially in environments like Azure Databricks or Azure Synapse Analytics.
"Delta" as the provider is fundamental and essentially the identifier for a Delta Lake table.
Here's why:
- Delta Lake's Core: Delta Lake is an open-source storage layer that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions to Apache Spark and big data workloads. It enables features like schema evolution, time travel, and unified streaming and batch processing on data lakes.

Using dbutils.fs.ls shows all the files available in the file system. Now we are going to create the external table from those files.
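A small sketch of that listing step; dbutils is available automatically inside Databricks notebooks, and the storage path is a hypothetical example:

# list the files under a (placeholder) container path and print name and size
for f in dbutils.fs.ls("abfss://raw@mystorageacct.dfs.core.windows.net/sales/"):
    print(f.name, f.size)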


We can see what it contains by running the command DESCRIBE EXTENDED on the view.

Now we are creating the external table directly from the file location by using CTAS (see the sketch below).
If you drop an external table, it just deletes the table's metadata, not the files in the underlying storage.
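A minimal sketch of that step under assumed paths: both ABFSS locations below are placeholders for your own storage account, and on Unity Catalog the target path would need to be registered as an external location.

# placeholder paths for the raw files and the external table's storage location
raw_path = "abfss://raw@mystorageacct.dfs.core.windows.net/sales/"
ext_path = "abfss://curated@mystorageacct.dfs.core.windows.net/ext_sales/"

# expose the raw files to SQL through a temp view
spark.read.option("header", "true").csv(raw_path).createOrReplaceTempView("raw_sales_vw")

# CTAS with LOCATION creates an external table: dropping it later removes only
# the metadata, while the files at ext_path stay in the storage account
spark.sql(f"""
    CREATE TABLE ext_sales
    LOCATION '{ext_path}'
    AS SELECT * FROM raw_sales_vw
""")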
Explain the scenarios where you have used external tables.
When data is stored in external storage and is also managed by external systems
When you need to keep the underlying data files after dropping the table
When multiple teams share data stored in a common place
Difference between External Tables and Managed (Delta) Tables?
Managed Tables
Use when you want Databricks to handle the entire lifecycle (create/insert/update/delete) of the table and its data
Ideal for tables used inside the Databricks platform
Simplifies data loading since Databricks handles storage and clean-up
External Tables
Use when you want to maintain control over the data processing and storage location
Ideal for data that is shared across multiple platforms or external systems
Useful for scenarios where data must persist even if the table is dropped
Different types of Spark SQL functions (transformations) used in the project? (a combined example is sketched after this list)
String Manipulation Functions
DISTINCT - Deduplicate records
AGGREGATION FUNCTIONS - SUM, COUNT
ANALYTICAL FUNCTIONS - ROW_NUMBER
CASE WHEN - Derive complex conditional logic
Date Manipulation Functions
current_timestamp() - Get the current date and time value
Date Extraction Functions - year, month, weekofyear
Date Formatting Functions - date_format()
Date Conversion Functions - to_date()
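A hedged sketch pulling several of these functions together in one query, reusing the hypothetical raw_sales_data table and columns from the CTAS example above:

# the table and column names come from the earlier hypothetical example
spark.sql("""
    SELECT DISTINCT
        product_id,
        to_date(sale_date)                                              AS sale_day,
        date_format(sale_date, 'yyyy-MM')                               AS sale_month,
        SUM(amount) OVER (PARTITION BY product_id)                      AS total_amount,
        ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY sale_date)  AS rn,
        CASE WHEN amount > 1000 THEN 'HIGH' ELSE 'LOW' END              AS amount_band,
        current_timestamp()                                             AS load_ts
    FROM raw_sales_data
""").show()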