
Session 9 - Dimensional Data Modelling

Dimensional Data Modelling

Dimensional modeling transforms data into a format that is optimized for reading and analysis, which is the primary goal of a reporting database (data warehouse).


Dimensional modeling is a data modeling technique used primarily in data warehousing to organize and structure data in a way that is optimized for analysis and reporting. Instead of focusing on minimizing data redundancy like in transactional (OLTP) systems, dimensional modeling prioritizes making data easy to understand and query for business intelligence (BI) purposes.

The core concepts of dimensional modeling revolve around facts and dimensions:

  • Facts: These are the quantitative measurements or metrics of a business process. They are typically numeric and can be aggregated (e.g., sales revenue, units sold, profit). Fact tables usually contain foreign keys that link them to dimension tables.
  • Dimensions: These provide the context for the facts. They are descriptive attributes that answer the "who, what, when, where, and why" of a business event (e.g., product name, customer location, order date). Dimension tables contain attributes that are used for filtering, grouping, and labeling data in analyses.

These facts and dimensions are organized into structures called schemas, with the most common being the star schema.

Star Schema:

  • This schema consists of one central fact table surrounded by several dimension tables, resembling a star.
  • Each dimension table is directly related to the fact table through a foreign key.
  • Dimension tables are typically denormalized, meaning they may contain some redundant data to improve query performance by reducing the need for joins.
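As a rough sketch of what a star schema looks like in code (the table and column names here are illustrative, not the exact ones used later in this session), the fact table holds the measures plus one foreign key per dimension:

# Illustrative star schema sketch: one fact table plus a dimension table.
# Table and column names are assumed for illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_product (
        product_key        BIGINT,    -- surrogate key
        product_group_name STRING,
        product_name       STRING
    )
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS fact_sales (
        product_key   BIGINT,         -- foreign key to dim_product
        date_key      BIGINT,         -- foreign key to a date dimension
        units_sold    BIGINT,         -- measure
        sales_revenue DECIMAL(18,2)   -- measure
    )
""")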
Our source data:



Step 1: Identify the Dimension and Fact Tables

 "measure column" in the context of dimensional modeling it's referring to the column in a fact table that holds the numerical data you want to analyze and report. In the here measure column is the maximum price, the minimum price, and the model price 

Likely Dimension Columns (indicated by the left green box):

These columns appear to contain descriptive attributes that provide context to the data. They seem to categorize or describe the business events.

  • DATE OF PRICING: (When did this pricing information apply?)
  • REP_ID: (Who is the representative associated with this data?)
  • STATE NAME: (Where did this occur?)
  • MARKET NAME: (Which specific market is this related to?)
  • PRODUCT GROUP NAME: (What broad category does the product belong to?)
  • PRODUCT NAME: (What is the specific product?)
  • VARIETY: (What is the specific variety of the product?)
  • ORIGIN: (Where did the product originate?)

Likely Fact Columns (indicated by the right green box):

These columns appear to contain numerical values that could represent measurements or metrics that you would want to analyze.

  • ARRIVAL IN TONNES: (How much arrived?)
  • MINIMUM PRICE: (What was the lowest price?)
  • MAXIMUM PRICE: (What was the highest price?)
  • MODAL PRICE: (What was the most frequent price?)

In summary, based on this snapshot:

  • The columns in the left green box are strong candidates for dimension columns. They provide the context for understanding the "when," "where," "what," and "who" related to the pricing information.
  • The columns in the right green box are strong candidates for fact columns (measures). They contain the numerical data about the arrival quantity and prices, which you would likely want to aggregate and analyze.
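To see why this split matters, here is a hedged sketch of the kind of query this model is meant to serve: aggregate the fact (measure) columns, grouped by the dimension columns. The source table name and the underscore-style column names are assumptions for illustration.

# Sketch: aggregate measures by dimension attributes.
# "source_prices" and the underscored column names are assumed for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

src = spark.table("source_prices")

(src.groupBy("STATE_NAME", "PRODUCT_NAME")                          # dimension columns give context
    .agg(F.sum("ARRIVAL_IN_TONNES").alias("total_arrival_tonnes"),  # measures get aggregated
         F.avg("MODAL_PRICE").alias("avg_modal_price"))
    .orderBy("STATE_NAME", "PRODUCT_NAME")
    .show())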

Step 2: Designing the Dimension and Fact Tables
To create the dimension tables, we have to group together the descriptive columns that have a one-to-one relationship with each other.

Within the descriptive columns, if some columns have a many-to-one relationship with the rest, separate them out. Here the descriptive columns split into five groups that have many-to-one relationships with each other, so we need to create five dimension tables.



Add surrogate keys to each of the separated column groups and create the dimension tables.
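A minimal sketch of this step for one dimension, assuming the product-related columns go into a product dimension (table and column names are illustrative): a row_number over the distinct rows serves as the surrogate key.

# Sketch: build a product dimension with a generated surrogate key.
# Table and column names are assumed; adjust to the actual source.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

src = spark.table("source_prices")

dim_product = (
    src.select("PRODUCT_GROUP_NAME", "PRODUCT_NAME", "VARIETY")
       .dropDuplicates()                          # one row per distinct product
       .withColumn("product_key",                 # surrogate key
                   F.row_number().over(
                       # a global (unpartitioned) window is fine for a small dimension
                       Window.orderBy("PRODUCT_GROUP_NAME", "PRODUCT_NAME", "VARIETY")))
)

dim_product.write.mode("overwrite").saveAsTable("dim_product")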



To design the fact table:

1. We need to identify the measure columns.
2. We need to map those measure columns to the surrogate keys we created as part of the dimension tables (as sketched below).
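A sketch of that mapping for one dimension (names are illustrative and follow the product dimension sketched above): join the source back to the dimension on its descriptive columns, then keep only the surrogate key and the measures.

# Sketch: build the fact table by swapping descriptive columns for surrogate keys.
# Only the product dimension join is shown; repeat the join for each dimension.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

src = spark.table("source_prices")
dim_product = spark.table("dim_product")

fact_prices = (
    src.join(dim_product,
             on=["PRODUCT_GROUP_NAME", "PRODUCT_NAME", "VARIETY"],
             how="left")
       .select("product_key",            # surrogate key from dim_product
               "ARRIVAL_IN_TONNES",      # measures
               "MINIMUM_PRICE",
               "MAXIMUM_PRICE",
               "MODAL_PRICE")
)

fact_prices.write.mode("overwrite").saveAsTable("fact_prices")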



What we have done so far is a logical data model.

We have separated out the dimension and fact tables logically. Now we are going to create those tables in the gold layer, because this layer is built specifically for reporting.

We can create the tables using CREATE TABLE syntax from PySpark (create the catalog, then the schema, then the tables).
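A hedged sketch of that sequence, assuming Unity Catalog is enabled; the catalog, schema, and column names are illustrative:

# Sketch: create a catalog, a schema, and gold-layer tables from a notebook.
# Names are illustrative; adjust to your workspace.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE CATALOG IF NOT EXISTS gold")
spark.sql("CREATE SCHEMA IF NOT EXISTS gold.reporting")

spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.reporting.dim_product (
        product_key        BIGINT,
        product_group_name STRING,
        product_name       STRING,
        variety            STRING
    )
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.reporting.fact_prices (
        product_key       BIGINT,
        arrival_in_tonnes DOUBLE,
        minimum_price     DOUBLE,
        maximum_price     DOUBLE,
        modal_price       DOUBLE
    )
""")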






Now we have created all the dimension and fact tables as designed.

If we want to reuse the queries in the SQL environment, go to the SQL section, open Queries, copy and paste the query, and save it.

  1.  What is Dimension Table (or) How to Identify Dimension Table?

       If you discuss any one or two behaviours of a dimension table, that will be fine.

    • Dimension tables store business description column values, like markets and products.

    • The relationship between the columns inside one dimension table is always 1-1. If any columns have a 1-many relationship, they need to be moved to a new dimension table.

    • Dimension table column data doesn't change frequently; it changes slowly over time.

    • Include surrogate keys on dimension tables to keep a history of changes in the source (as sketched below).
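As a small, hedged illustration of that last point (the values are made up): when a market is renamed in the source, the dimension can keep both versions as separate rows, each with its own surrogate key, while old fact rows keep pointing at the version that was current when they were loaded.

# Sketch: surrogate keys allow a dimension to keep history of source changes.
# Values are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rows = [
    (1, "M001", "Old Market Name"),   # version before the rename
    (2, "M001", "New Market Name"),   # same market (natural key M001) after the rename
]
dim_market = spark.createDataFrame(rows, ["market_key", "market_id", "market_name"])
dim_market.show()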

  2.  What is Fact Table (or) How to Identify Fact Table?

       If you discuss any one or two behaviours of a fact table, that will be fine.

    • Fact tables store business measure column values, like product prices and product quantities.

    • Fact tables include the surrogate keys from the dimension tables, for joining and getting the additional dimension values associated with the fact rows.

    • Fact table measure data changes at regular intervals, so we need to load the fact tables regularly as well.

    • Multiple fact tables are required if the measures captured are associated with different sets of dimension tables.

  3.  What is Logical and Physical Data Modelling?

      Logical Data Model: created based on the business requirements, without considering the physical implementation.

      Physical Data Model: created on top of the logical model to implement it physically, normally in databases.

  4.  What is Snowflake Schema?

     The relationship between the columns inside a dimension table is always 1-1. If some columns have a 1-many relationship but are still associated only with that same dimension (and are not directly linked to the fact table), we create a child dimension table to store those 1-many columns and link it to the main dimension table using a surrogate key. This way of designing a star schema is called a snowflake schema.
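A hedged sketch of that idea (table and column names are illustrative): the 1-many columns move into a child dimension, the main dimension carries a surrogate key pointing at it, and only the main dimension links to the fact table.

# Sketch: snowflake schema - a child dimension hangs off the main dimension,
# not off the fact table. Names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_product_group (   -- child dimension (1-many columns)
        product_group_key  BIGINT,                   -- surrogate key of the child dimension
        product_group_name STRING
    )
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_product (         -- main dimension, linked to the fact table
        product_key        BIGINT,                   -- surrogate key used by the fact table
        product_group_key  BIGINT,                   -- link to the child dimension
        product_name       STRING,
        variety            STRING
    )
""")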
