
Session 10 - Reporting the dimension and fact tables

 


Spark SQL is an ideal choice when the data is in a structured format. Go to the SQL editor and save the file.


We are going to create the reporting tables mentioned above.




We create this query to populate the first dimension table.



If we have to apply multiple transformations, we can do it by chaining the transformations, but if we want to do it across different source tables, we have to do it through staging, starting with a stage 1 table.
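As a sketch (the schema and table names here are placeholders, not the course's exact ones), the stage 1 step can be a simple CTAS that captures the distinct values from the silver source:

```sql
-- Hypothetical names: silver.daily_pricing_silver is the source, gold.stage1_state the stage 1 table.
-- Stage 1: capture the distinct state values from the silver source via CTAS.
CREATE OR REPLACE TABLE gold.stage1_state AS
SELECT DISTINCT state
FROM silver.daily_pricing_silver;
```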



Now we are adding the surrogate key to the dimension table.



86. Dimension table load - generate surrogate keys.


Here we generate the surrogate keys in the State ID column for the unique values obtained by the CTAS above. The ROW_NUMBER() window function creates the surrogate key (identity column).


Then we add the timestamp, and the query is ready for reporting: we insert into the table Reporting_dim_state_gold (the reporting table) using an INSERT ... SELECT statement.
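A minimal sketch of this query, assuming the stage 1 table from above; the reporting table's exact columns may differ:

```sql
-- Sketch: generate surrogate keys with ROW_NUMBER(), add a timestamp,
-- and load the dimension via INSERT ... SELECT. Column names are assumed.
INSERT INTO gold.reporting_dim_state_gold (state_id, state, load_timestamp)
SELECT
  ROW_NUMBER() OVER (ORDER BY state) AS state_id,   -- surrogate key (identity column)
  state,
  current_timestamp()                AS load_timestamp
FROM gold.stage1_state;
```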



Check that the data has been loaded into the reporting table and that all the values are unique, because the surrogate keys (identity column) of a dimension table have to be unique so they can be linked to the fact table. Here the data got inserted twice, which is why we ended up with duplicate rows, so always review before loading into the table.


87. Dimension Table Load - Change Data Capture (CDC).

source table --> Silver layer table
reporting table --> Gold layer table

Now we compare both tables using a join query to check that all the data has landed properly in the reporting table.




If we rerun the comparison above, we can identify the newly added values (CDC) that need to be added to the reporting table.
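One way to write this comparison, as a sketch with assumed names, is a LEFT ANTI JOIN that returns only the silver values not yet present in the reporting table:

```sql
-- Sketch: rows returned here are the newly arrived (CDC) values
-- that still need surrogate keys and a load into the reporting table.
SELECT s.state
FROM (SELECT DISTINCT state FROM silver.daily_pricing_silver) s
LEFT ANTI JOIN gold.reporting_dim_state_gold r
  ON s.state = r.state;
```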


Here the state ID has to be unique, but the new rows would again get the values 1, 2, ..., and 1 and 2 are already assigned to other states. So we need to know the maximum value already given to the state ID in the reporting table; for that we create one extra table on top of the existing one and name it the stage 2 table.


We extract the max value and cross join it to the existing stage 2 table, so the new surrogate keys automatically continue from it (for example 24, 25).

Always remember that during the first run there will be no records in the reporting table, because the data is loaded in batches.
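A hedged sketch of the stage 2 step, with assumed names: the current maximum surrogate key is cross joined onto the new records so their keys continue from it, and COALESCE covers the empty first run.

```sql
-- Sketch: offset the new surrogate keys by the current maximum in the reporting table.
-- COALESCE(MAX(...), 0) handles the very first batch run, when the table is still empty.
SELECT
  n.state,
  m.max_state_id + ROW_NUMBER() OVER (ORDER BY n.state) AS state_id
FROM gold.stage2_new_state n
CROSS JOIN (
  SELECT COALESCE(MAX(state_id), 0) AS max_state_id
  FROM gold.reporting_dim_state_gold
) m;
```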

Now we are completely ready to load the data into the reporting table.

88. Dimension table load - Load Reporting_DIM_state_load (reporting table).
 
We are going to create the stage 3 (intermediate) table from the join query with the existing stage 2 table, with the current and previous timestamps added.



Remember: Always use the proper name for an alias.

Truncate the gold table (the reporting table), which needs to be refreshed because the data is now fully available in the stage 3 table, and insert the values using an INSERT INTO ... SELECT statement.
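As a sketch (stage 3 table and columns assumed), the refresh is just a truncate followed by an INSERT ... SELECT:

```sql
-- Sketch: full refresh of the gold reporting table from the stage 3 table.
TRUNCATE TABLE gold.reporting_dim_state_gold;

INSERT INTO gold.reporting_dim_state_gold
SELECT state_id, state, load_timestamp
FROM gold.stage3_state;
```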


Then check that all the data got loaded into the reporting table.

Remember: now we have created all the staging tables and loaded the data into the reporting table, which is the dimension table. After this is complete, clean up the notebook by deleting the unnecessary commands or queries that we used to check the results, and run all cells to ensure that the data flows seamlessly.

89. Dimensional Table load - populate all the tables in the gold layer.

Now we create the other dimension tables in the same way as the previous one, but remember to change the table names and columns properly.




After creating all the tables, we write the query below to check how the data loaded in an earlier run gets updated after one more run.


90. Dimensional DATE table processing:

The most common, almost mandatory, reporting table in the gold layer is the time table; sometimes it is a date or a timestamp, but one way or another it is related to time.

Here, the data is based on the DATE.



We use the pandas library to build the date table.

Go to the Workspace, create a new notebook, save it, then attach it to the cluster and start the cluster.

Import pandas, use pandas.date_range(start=' ', end=' ') and assign it to a DataFrame, convert the pandas DataFrame into a Spark DataFrame, and use the display() function to display it.




Create the column with the calendar date values and convert the DataFrame into a temp view so it can be turned into a table.

Then change the datatype from string to date.


Add the current timestamp, and the table is ready to be loaded into the gold layer as a reporting (dimension) table.
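A minimal SQL sketch of these last steps, assuming the Spark DataFrame was registered as a temp view named calendar_vw with a string column calendar_date (names are placeholders):

```sql
-- Sketch: cast the string column to DATE, add the load timestamp,
-- and materialise the date dimension in the gold layer.
CREATE OR REPLACE TABLE gold.reporting_dim_date_gold AS
SELECT
  CAST(calendar_date AS DATE) AS calendar_date,
  current_timestamp()         AS load_timestamp
FROM calendar_vw;
```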


91. Fact table - daily pricing table (Gold layer)

Step 1: map the source table in the silver layer to create the fact table.
Step 2: link the surrogate keys to the fact table.





These are the fact table columns that hold the metrics information.



We pick up the surrogate key (identity column) using a SELECT with a join query.


We use the lakehouse process_run table to get only the up-to-date data.



Likewise, for each dimension just add its table name to the join query and pick the surrogate key column from that dimension table.


Now we add all the metrics columns from the source table and the current timestamp in the same query.


Now we have all the columns required for the fact table, and we can insert them into the fact table we created in the gold layer.



This INSERT command is used for the incremental load.
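Putting the pieces together, a sketch of the fact load might look like the following; all table and column names are assumptions, not the course's exact ones, and the filter against the process_run control table (for the incremental part) is only indicated in a comment:

```sql
-- Sketch: pick up surrogate keys from the dimension tables via joins,
-- carry the metric columns from the silver source, and stamp the load time.
-- For the incremental load, a WHERE clause driven by the process_run control
-- table would restrict this SELECT to newly arrived rows.
INSERT INTO gold.reporting_fact_daily_pricing_gold
SELECT
  ds.state_id,
  dp.product_id,
  dd.calendar_date,
  s.arrival_quantity,
  s.modal_price,
  current_timestamp() AS load_timestamp
FROM silver.daily_pricing_silver s
JOIN gold.reporting_dim_state_gold   ds ON s.state        = ds.state
JOIN gold.reporting_dim_product_gold dp ON s.product_name = dp.product_name
JOIN gold.reporting_dim_date_gold    dd ON s.price_date   = dd.calendar_date;
```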


92. Slowly Changing Dimensions table load - SCD Type 1 and Type 2:

What we have created so far only loads the data; we also have to handle changes to existing data.




Mostly, the data in dimension tables does not change often because it is descriptive (for example, product attributes). But when it does change over time, we need to use the SCD concept.

93. SCD Type 1 processing - part 1

Very important interview question:

  Cloning one table into another 


CREATE TABLE catalog.schema.table_name LIKE catalog.schema.tablename
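As a quick hedged example with placeholder names (CREATE TABLE ... LIKE copies only the schema, not the data):

```sql
-- Sketch: clone the structure of an existing dimension table into a new, empty table.
CREATE TABLE catalog.schema.dim_product_clone
LIKE catalog.schema.dim_product;
```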



Here, we just replace the existing product type with a new product type, which is onion. Then we need to update the table names in all the staging table creation steps, all the way through to loading the reporting and fact tables.



Another interview question: loading the table using the MERGE statement.

In SQL, the MERGE statement is a powerful tool that allows you to perform INSERT, UPDATE, or DELETE operations on a target table based on the results of a join with a source table. It essentially combines these three data manipulation language (DML) operations into a single statement.

MERGE INTO target_table [AS target_alias]
USING source_table [AS source_alias]
ON (merge_condition)
WHEN MATCHED THEN merge_matched_action
[WHEN MATCHED AND condition THEN merge_matched_action]
WHEN NOT MATCHED [BY TARGET] THEN merge_not_matched_action
[WHEN NOT MATCHED [BY TARGET] AND condition THEN merge_not_matched_action]
[WHEN NOT MATCHED BY SOURCE THEN merge_not_matched_by_source_action
  [WHEN NOT MATCHED BY SOURCE AND condition THEN merge_not_matched_by_source_action]];
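For example, here is a hedged sketch of an SCD Type 1 load with MERGE (table and column names are assumptions): matched product rows are updated in place, unmatched ones are inserted.

```sql
-- Sketch: SCD Type 1 via MERGE - overwrite changed attributes, insert new products.
MERGE INTO gold.reporting_dim_product_gold AS target
USING gold.stage_product AS source
ON target.product_name = source.product_name
WHEN MATCHED THEN
  UPDATE SET target.product_type   = source.product_type,
             target.load_timestamp = current_timestamp()
WHEN NOT MATCHED THEN
  INSERT (product_id, product_name, product_type, load_timestamp)
  VALUES (source.product_id, source.product_name, source.product_type, current_timestamp());
```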


94. SCD TYPE 1 - part 2 

Run the query on the reporting table; now we can see that the product type has been changed.


95. SCD Type 2 overview:

We must keep both the change and the original product record, rather than just overwriting the changed column.


Here we have both the changed record as well as the previous record.

If we add rows like that, confusion will arise because the product ID is the identity column (surrogate key) that needs to be linked to the fact table, so we add start date and end date columns to make the versions clear.



Additionally, some projects might also keep a current-record flag set to Y.

To create the table, we can use SHOW CREATE TABLE. To create the table quickly, just copy and paste the output to create a table.

Always add the timestamp at the end of the create query.


CREATE THE TABLE FOR SCD TYPE 2 
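A sketch of such a table, with assumed names and types, including the start/end dates, the optional record flag, and the timestamp added at the end:

```sql
-- Sketch: SCD Type 2 dimension table with versioning columns.
CREATE TABLE IF NOT EXISTS gold.reporting_dim_product_scd2_gold (
  product_id     BIGINT,      -- surrogate key linked to the fact table
  product_name   STRING,
  product_type   STRING,
  start_date     DATE,        -- when this version of the record became effective
  end_date       DATE,        -- NULL (or a far-future date) while the version is current
  record_flag    STRING,      -- optional 'Y'/'N' current-record indicator
  load_timestamp TIMESTAMP    -- timestamp column added at the end of the CREATE
);
```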

96. SCD TYPE 2 processing - Take 1.




Keep the tables that were created during the SCD Type 1 implementation as they are.



Change the process name from type 1 to type 2.


Remove the existing dimension table and add the new product ID column to implement SCD Type 2.


The CASE WHEN statement is important for interview questions.
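As a small hedged example tied to the SCD Type 2 columns sketched above, CASE WHEN can derive the current-record flag from the end date:

```sql
-- Sketch: flag the current version of each product ('Y') versus expired versions ('N').
SELECT
  product_id,
  product_name,
  product_type,
  start_date,
  end_date,
  CASE WHEN end_date IS NULL THEN 'Y' ELSE 'N' END AS record_flag
FROM gold.reporting_dim_product_scd2_gold;
```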


99. Scheduling the reporting, dimension, and fact table loads.


