
Posts

2.a Flattening the JSON file

Think of JSON processing as this journey:

Raw JSON → Python objects → Flatten → Clean → DataFrame → Spark → Production pipeline

| Goal | In Local Python | In PySpark (Databricks) | In Cloud SQL (Snowflake/BigQuery) |
|---|---|---|---|
| Read a file | json.load(f) | spark.read.json() | COPY INTO / Storage Integration |
| Go inside an object | data["key"]["subkey"] | df.select("key.subkey") | SELECT column:key.subkey |
| Turn a list into rows | for item in my_list: | explode(col("my_list")) | LATERAL FLATTEN() / UNNEST() |

Phase 1: JSON Fundamentals

1. JSON Data Types

You need to immediately recognize how JSON maps to Python.

| JSON | Python | Example |
|---|---|---|
| Object | dict | {"name":"John"} |
| Array | list | [1,2,3] |
| String | str | "London" |
| Number | int/float | 100 |
| Boolean | bool | true |
| Null | None | null |

Example:

    {
      "employee": { "id": 100, "name": "John" },
      "skills": [ "Python", "Spark" ]
    }

Python sees this as: { "employee...
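The "Python objects → Flatten" step of the journey above can be sketched as follows. This is a minimal illustration, not the post's own implementation: the `flatten` helper, the dotted-key naming, and the use of list indexes as key parts are all assumptions made for the example. It reuses the employee/skills sample from the excerpt.

```python
import json


def flatten(obj, parent_key="", sep="."):
    """Flatten nested dicts and lists into one dict with dotted keys.

    Hypothetical helper for illustration; list items are keyed by index.
    """
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else key
            items.update(flatten(value, new_key, sep))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            new_key = f"{parent_key}{sep}{index}" if parent_key else str(index)
            items.update(flatten(value, new_key, sep))
    else:
        # Scalar leaf (str, int, float, bool, None): store it under the built key.
        items[parent_key] = obj
    return items


raw = '{"employee": {"id": 100, "name": "John"}, "skills": ["Python", "Spark"]}'
data = json.loads(raw)   # JSON object -> dict, JSON array -> list
flat = flatten(data)

for key, value in flat.items():
    print(key, "=", value)
```

Each key in `flat` could then become a column name when the record is loaded into a DataFrame. Keying list items by index is only one option; turning each list item into its own row, as `explode` does in PySpark, is the other common choice.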

2. Things a data engineer should know

A data engineer should know how to read and write Python code that extracts data from internal and external applications.

1. Understand what an API call is. When we search in the front end (GUI), the results are fetched from the database through an API call.
   a. We need to know how to write Python code that fetches data through API calls. This is possible with the requests library in Python.
2. Understand the company's use case. Example: a telecom company (news channel). Services such as news channels and weather reports, like those shown in the diagram, can be offered when company X exposes certain APIs to another company. The other company gets the details by sending API calls.

The structure of an API call is generally the same across industries. In telecom, APIs are commonly used to manage subscribers, retrieve usage, send SMS, provision services, or check network status.

General Structure of an API Call

HTTP Method + URL + Headers + Paramete...
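As a rough illustration of the parts of an API call named above (method, URL, headers, parameters), the sketch below builds a request object without sending it. It uses the standard library `urllib` instead of the requests library mentioned in the post, so it runs without extra installs. The host `api.example.com`, the path, the parameter names, and the token are all hypothetical placeholders, not a real telecom API.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(base_url, path, params, token):
    """Assemble a GET request from its parts; nothing is sent over the network."""
    query = urlencode(params)                  # parameters -> "a=1&b=2"
    url = f"{base_url}{path}?{query}"          # URL = base + path + query string
    headers = {
        "Authorization": f"Bearer {token}",    # hypothetical auth scheme
        "Accept": "application/json",
    }
    return Request(url, headers=headers, method="GET")


# All values below are made-up placeholders for illustration only.
req = build_request(
    "https://api.example.com",
    "/v1/subscribers/usage",
    {"subscriber_id": "12345", "month": "2024-01"},
    "DUMMY_TOKEN",
)

print(req.get_method(), req.full_url)
print(req.header_items())
```

To actually send the request, the object can be passed to `urllib.request.urlopen(req)`. With the requests library, the same parts map onto `requests.get(url, headers=headers, params=params)`.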

1. Creating a venv

Data Engineer Virtual Environment (venv) + Cloud Workflow Notes

1. What is a Virtual Environment (venv)?

A virtual environment is an isolated Python environment for a project. It allows each project to have its own Python libraries and versions.

Example:

    Project A
    |
    └── venv
        ├── pandas 2.0
        └── pyspark

    Project B
    |
    └── venv
        ├── pandas 2.2
        └── tensorflow

Without venv, package versions can conflict.

2. Why Data Engineers need venv

Data engineers use many Python libraries: pandas, numpy, pyspark, requests, cloud SDKs.

Example: Automation script:

    Python script
         |
         |
         v
    Azure Storage / GCP Storage / APIs

The venv keeps the required libraries separate.

3. venv in Cloud Environment

Cloud does NOT replace venv.

Cloud = infrastructure
venv = Python dependency isolation

Example:

    Cloud VM
    |
    ├── Python
    |
    ├── Project A
    |   └── venv
    |
    └── Project B
        └── venv

4. Team Environment: Do we share the same venv?

NO. Each developer has their own venv.

Example: Develop...
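The per-project layout described above can be sketched with Python's standard `venv` module. This is a minimal example under stated assumptions: the `create_project_venv` helper and the `project_a` folder name are hypothetical, and the environment is created in a temporary directory so that nothing is left behind. On the command line the usual equivalent is `python -m venv venv`.

```python
import tempfile
import venv
from pathlib import Path


def create_project_venv(project_dir):
    """Create an isolated environment at <project_dir>/venv and return its path."""
    env_dir = Path(project_dir) / "venv"
    # with_pip=False keeps the example fast; real projects usually want pip installed.
    venv.create(env_dir, with_pip=False)
    return env_dir


# A temporary directory stands in for a real project folder (hypothetical name).
with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_project_venv(Path(tmp) / "project_a")
    # Every venv contains a pyvenv.cfg file describing the environment.
    cfg_exists = (env_dir / "pyvenv.cfg").is_file()
    print("venv created:", cfg_exists)
```

Because each project gets its own `venv` folder, Project A and Project B can pin different pandas versions without affecting each other.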

Hands-on project for ADF

Azure Pipeline Creation and Configuration Steps

To log in to the Azure portal, a Microsoft account is required. After creating the account, sign in to the Azure portal and proceed with the following steps to build the data pipeline.

1. Create Resource Group and Storage Account

Create a Resource Group in Azure. Under the resource group, create the required resources for the pipeline.

Azure Data Lake Storage (ADLS Gen2)

Create a Storage Account.
Enable Hierarchical Namespace to convert it into Data Lake Storage Gen2.
Inside the storage account:
    Create a container (e.g., blog-container).
    Organize data using folders/subfolders (can be created dynamically or manually using directory structure).

Storage Structure Example

Create a storage account named sdmm:

    sdmm
        gold
            processing
            sales
        silver
            sd
            mm
        bronze
            sd
            mm

2. Azure Data Factory (ADF) Setup

Create an Azure Data Factory instance.
Go to Managed Identities and enabl...