
Session 9 - Dimensional Data Modelling

Dimensional Data Modelling

Dimensional modeling transforms data into a format that is optimized for reading and analysis, which is the primary goal of a reporting database (data warehouse).


Dimensional modeling is a data modeling technique used primarily in data warehousing to organize and structure data in a way that is optimized for analysis and reporting. Instead of focusing on minimizing data redundancy like in transactional (OLTP) systems, dimensional modeling prioritizes making data easy to understand and query for business intelligence (BI) purposes.

The core concepts of dimensional modeling revolve around facts and dimensions:

  • Facts: These are the quantitative measurements or metrics of a business process. They are typically numeric and can be aggregated (e.g., sales revenue, units sold, profit). Fact tables usually contain foreign keys that link them to dimension tables.
  • Dimensions: These provide the context for the facts. They are descriptive attributes that answer the "who, what, when, where, and why" of a business event (e.g., product name, customer location, order date). Dimension tables contain attributes that are used for filtering, grouping, and labeling data in analyses.

These facts and dimensions are organized into structures called schemas, with the most common being the star schema.

Star Schema:

  • This schema consists of one central fact table surrounded by several dimension tables, resembling a star.
  • Each dimension table is directly related to the fact table through a foreign key.
  • Dimension tables are typically denormalized, meaning they may contain some redundant data to improve query performance by reducing the need for joins.
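As a rough sketch of what a star schema looks like in code (the table and column names here are illustrative, not the exact ones used later in this session), the fact table holds the measures plus one foreign key per dimension:

# Illustrative star schema sketch: one fact table plus a dimension table.
# Table and column names are assumed for illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_product (
        product_key        BIGINT,    -- surrogate key
        product_group_name STRING,
        product_name       STRING
    )
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS fact_sales (
        product_key   BIGINT,         -- foreign key to dim_product
        date_key      BIGINT,         -- foreign key to a date dimension
        units_sold    BIGINT,         -- measure
        sales_revenue DECIMAL(18,2)   -- measure
    )
""")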
Our source data:



Step 1: Identify the Dimension and Fact Tables

 "measure column" in the context of dimensional modeling it's referring to the column in a fact table that holds the numerical data you want to analyze and report. In the here measure column is the maximum price, the minimum price, and the model price 

Likely Dimension Columns (indicated by the left green box):

These columns appear to contain descriptive attributes that provide context to the data. They seem to categorize or describe the business events.

  • DATE OF PRICING: (When did this pricing information apply?)
  • REP_ID: (Who is the representative associated with this data?)
  • STATE NAME: (Where did this occur?)
  • MARKET NAME: (Which specific market is this related to?)
  • PRODUCT GROUP NAME: (What broad category does the product belong to?)
  • PRODUCT NAME: (What is the specific product?)
  • VARIETY: (What is the specific variety of the product?)
  • ORIGIN: (Where did the product originate?)

Likely Fact Columns (indicated by the right green box):

These columns appear to contain numerical values that could represent measurements or metrics that you would want to analyze.

  • ARRIVAL IN TONNES: (How much arrived?)
  • MINIMUM PRICE: (What was the lowest price?)
  • MAXIMUM PRICE: (What was the highest price?)
  • MODAL PRICE: (What was the most frequent price?)

In summary, based on this snapshot:

  • The columns in the left green box are strong candidates for dimension columns. They provide the context for understanding the "when," "where," "what," and "who" related to the pricing information.
  • The columns in the right green box are strong candidates for fact columns (measures). They contain the numerical data about the arrival quantity and prices, which you would likely want to aggregate and analyze.
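To see why this split matters, here is a hedged sketch of the kind of query this model is meant to serve: aggregate the fact (measure) columns, grouped by the dimension columns. The source table name and the underscore-style column names are assumptions for illustration.

# Sketch: aggregate measures by dimension attributes.
# "source_prices" and the underscored column names are assumed for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

src = spark.table("source_prices")

(src.groupBy("STATE_NAME", "PRODUCT_NAME")                          # dimension columns give context
    .agg(F.sum("ARRIVAL_IN_TONNES").alias("total_arrival_tonnes"),  # measures get aggregated
         F.avg("MODAL_PRICE").alias("avg_modal_price"))
    .orderBy("STATE_NAME", "PRODUCT_NAME")
    .show())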

Step 2: Designing the Dimension and Fact Tables
To create the dimension tables, we have to group together the descriptive columns that have a one-to-one relationship with each other.

Within the descriptive columns, if some columns have a many-to-one relationship with the rest, separate them out. Here the descriptive columns split into five groups that have many-to-one relationships with each other, so we need to create five dimension tables.



Add surrogate keys to each of the separated column groups and create the dimension tables.
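A minimal sketch of this step for one dimension, assuming the product-related columns go into a product dimension (table and column names are illustrative): a row_number over the distinct rows serves as the surrogate key.

# Sketch: build a product dimension with a generated surrogate key.
# Table and column names are assumed; adjust to the actual source.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

src = spark.table("source_prices")

dim_product = (
    src.select("PRODUCT_GROUP_NAME", "PRODUCT_NAME", "VARIETY")
       .dropDuplicates()                          # one row per distinct product
       .withColumn("product_key",                 # surrogate key
                   F.row_number().over(
                       # a global (unpartitioned) window is fine for a small dimension
                       Window.orderBy("PRODUCT_GROUP_NAME", "PRODUCT_NAME", "VARIETY")))
)

dim_product.write.mode("overwrite").saveAsTable("dim_product")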



To design the fact table:

1. We need to identify the measure columns.
2. We need to map those measure columns to the surrogate keys we created as part of the dimension tables (as sketched below).
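A sketch of that mapping for one dimension (names are illustrative and follow the product dimension sketched above): join the source back to the dimension on its descriptive columns, then keep only the surrogate key and the measures.

# Sketch: build the fact table by swapping descriptive columns for surrogate keys.
# Only the product dimension join is shown; repeat the join for each dimension.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

src = spark.table("source_prices")
dim_product = spark.table("dim_product")

fact_prices = (
    src.join(dim_product,
             on=["PRODUCT_GROUP_NAME", "PRODUCT_NAME", "VARIETY"],
             how="left")
       .select("product_key",            # surrogate key from dim_product
               "ARRIVAL_IN_TONNES",      # measures
               "MINIMUM_PRICE",
               "MAXIMUM_PRICE",
               "MODAL_PRICE")
)

fact_prices.write.mode("overwrite").saveAsTable("fact_prices")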



What we have done so far is a logical data model.

We have separated out the dimension and fact tables logically. Now we are going to create those tables in the gold layer, because this layer is built specifically for reporting.

We can create the tables using CREATE TABLE syntax from PySpark (create the catalog, then the schema, then the tables).
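A hedged sketch of that sequence, assuming Unity Catalog is enabled; the catalog, schema, and column names are illustrative:

# Sketch: create a catalog, a schema, and gold-layer tables from a notebook.
# Names are illustrative; adjust to your workspace.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE CATALOG IF NOT EXISTS gold")
spark.sql("CREATE SCHEMA IF NOT EXISTS gold.reporting")

spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.reporting.dim_product (
        product_key        BIGINT,
        product_group_name STRING,
        product_name       STRING,
        variety            STRING
    )
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.reporting.fact_prices (
        product_key       BIGINT,
        arrival_in_tonnes DOUBLE,
        minimum_price     DOUBLE,
        maximum_price     DOUBLE,
        modal_price       DOUBLE
    )
""")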






Now we have created all the dimension and fact tables as designed.

If we want to reuse the queries in the SQL environment, go to the SQL section, open Queries, copy and paste the query, and save it.

  1.  What is Dimension Table (or) How to Identify Dimension Table?

       If you discuss any one or two behaviours of a dimension table, that will be fine.

    • Dimension tables store business description column values, like markets and products.

    • The relationship between the columns inside one dimension table is always 1-1. If any columns have a 1-many relationship, they need to be moved to a new dimension table.

    • Dimension table column data doesn't change frequently; it changes slowly over time.

    • Include surrogate keys on dimension tables to keep a history of changes in the source (as sketched below).
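As a small, hedged illustration of that last point (the values are made up): when a market is renamed in the source, the dimension can keep both versions as separate rows, each with its own surrogate key, while old fact rows keep pointing at the version that was current when they were loaded.

# Sketch: surrogate keys allow a dimension to keep history of source changes.
# Values are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rows = [
    (1, "M001", "Old Market Name"),   # version before the rename
    (2, "M001", "New Market Name"),   # same market (natural key M001) after the rename
]
dim_market = spark.createDataFrame(rows, ["market_key", "market_id", "market_name"])
dim_market.show()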

  2.  What is Fact Table (or) How to Identify Fact Table?

       If you discuss any one or two behaviours of a fact table, that will be fine.

    • Fact tables store business measure column values, like product prices and product quantities.

    • Fact tables include the surrogate keys from the dimension tables, for joining and getting the additional dimension values associated with the fact rows.

    • Fact table measure data changes at regular intervals, so we need to load the fact tables regularly as well.

    • Multiple fact tables are required if the measures captured are associated with different sets of dimension tables.

  3.  What is Logical and Physical Data Modelling?

      Logical Data Model: created based on the business requirements, without considering the physical implementation.

      Physical Data Model: created on top of the logical model to implement it physically, normally in databases.

  4.  What is Snowflake Schema?

     The relationship between the columns inside a dimension table is always 1-1. If some columns have a 1-many relationship but are still associated only with that same dimension (and are not directly linked to the fact table), we create a child dimension table to store those 1-many columns and link it to the main dimension table using a surrogate key. This way of designing a star schema is called a snowflake schema.
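A hedged sketch of that idea (table and column names are illustrative): the 1-many columns move into a child dimension, the main dimension carries a surrogate key pointing at it, and only the main dimension links to the fact table.

# Sketch: snowflake schema - a child dimension hangs off the main dimension,
# not off the fact table. Names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_product_group (   -- child dimension (1-many columns)
        product_group_key  BIGINT,                   -- surrogate key of the child dimension
        product_group_name STRING
    )
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS dim_product (         -- main dimension, linked to the fact table
        product_key        BIGINT,                   -- surrogate key used by the fact table
        product_group_key  BIGINT,                   -- link to the child dimension
        product_name       STRING,
        variety            STRING
    )
""")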
