Skip to main content

2.a flattening the JSON File.


Think of JSON processing as this journey:

Raw JSON → Python objects → Flatten → Clean → DataFrame → Spark → Production pipeline

GoalIn Local PythonIn PySpark (Databricks)In Cloud SQL (Snowflake/BigQuery)
Read a filejson.load(f)spark.read.json()COPY INTO / Storage Integration
Go inside an objectdata["key"]["subkey"]df.select("key.subkey")SELECT column:key.subkey
Turn a list into rowsfor item in my_list:explode(col("my_list"))LATERAL FLATTEN() / UNNEST()

Phase 1 — JSON Fundamentals

1. JSON Data Types

You need to immediately recognize how JSON maps to Python.

JSONPythonExample
Objectdict{"name":"John"}
Arraylist[1,2,3]
Stringstr"London"
Numberint/float100
Booleanbooltrue
NullNonenull

Example:

{
 "employee":{
    "id":100,
    "name":"John"
 },
 "skills":[
    "Python",
    "Spark"
 ]
}

Python sees this as:

{
 "employee":{
    "id":100,
    "name":"John"
 },

 "skills":[
    "Python",
    "Spark"
 ]
}

Your brain should immediately think:

  • {} → dictionary

  • [] → list


Phase 2 — Python JSON Library

The first tool every Data Engineer uses.

import json

json.load()

Read JSON file.

Example:

employee.json

{
"id":1,
"name":"John"
}

Python:

with open("employee.json") as f:

    data=json.load(f)


print(data)

Output:

{
'id':1,
'name':'John'
}

Use when:

file → python dictionary

json.loads()

String → dictionary

Example:

json_string='{"name":"John"}'


data=json.loads(json_string)


print(data["name"])

Output:

John

Use:

JSON text → python object

json.dump()

Dictionary → JSON file

employee={
"name":"John"
}


with open("output.json","w") as f:

    json.dump(employee,f)

Creates:

{
"name":"John"
}

json.dumps()

Dictionary → JSON string

json.dumps(employee)

Output:

'{"name":"John"}'

Interview:

load vs loads

Answer:

load  = file
loads = string

Phase 3 — Python Dictionary Mastery

Because JSON objects become dictionaries.

Example:

employee={
"name":"John",
"salary":5000
}

Access:

employee["name"]

Output:

John

Problem:

employee["age"]

Error:

KeyError

Better:

employee.get("age")

Output:

None

Important functions:

keys()

employee.keys()

Output:

name salary

values()

employee.values()

items()

Used in loops.

for k,v in employee.items():

    print(k,v)

Output:

name John
salary 5000

Phase 4 — Lists (JSON Arrays)

Example:

skills=[
"Python",
"Spark",
"SQL"
]

Loop:

for skill in skills:

    print(skill)

Output:

Python
Spark
SQL

Important:

append()

Add one item

skills.append("Azure")

extend()

Add multiple:

skills.extend(
["AWS","Snowflake"]
)

enumerate()

Very common interview topic.

for index,item in enumerate(skills):

    print(index,item)

Output:

0 Python
1 Spark
2 SQL

Phase 5 — Access Nested JSON

Real JSON:

{
"id":1,

"address":{
 "city":"London",
 "country":"UK"
}

}

Python:

employee["address"]["city"]

Output:

London

Phase 6 — Loop Nested JSON

Real example:

orders={

"orders":[

{
"id":1,
"price":100
},

{
"id":2,
"price":200
}

]

}

Loop:

for order in orders["orders"]:

    print(order["price"])

Output:

100
200

Phase 7 — isinstance() (Critical)

This is the heart of dynamic flattening.

Example:

value={
"name":"John"
}


isinstance(value,dict)

Output:

True

Why?

Because flattening logic is:

If dictionary:
    go deeper

If list:
    expand

Otherwise:
    save value

Phase 8 — Recursive Flattening

Example:

Input:

{
"user":{
"name":"John",
"city":"London"
}
}

Output:

{
"user_name":"John",
"user_city":"London"
}

Code:

def flatten(data,parent=""):

    result={}


    for key,value in data.items():

        new_key = (
        parent+"_"+key
        if parent
        else key
        )


        if isinstance(value,dict):

            result.update(
            flatten(value,new_key)
            )


        else:

            result[new_key]=value


    return result

Usage:

flatten(data)

Output:

user_name John
user_city London

Phase 9 — Handling Arrays

Example:

{
"name":"John",

"skills":[
"Python",
"Spark"
]

}

Need output:

nameskill
JohnPython
JohnSpark

This requires:

  • loop

  • enumerate

  • explode concept


Phase 10 — Pandas JSON Processing

Most common:

import pandas as pd

Example:

data={

"employee":{
"id":1,
"name":"John"

}

}

Flatten:

df=pd.json_normalize(data)

Output:

employee.idemployee.name
1John

Arrays:

pd.json_normalize(
data,
record_path="orders"
)

Phase 11 — Missing Data Handling

Bad:

employee["salary"]

Good:

employee.get("salary")

Pandas:

df.fillna(0)

Phase 12 — Exception Handling

Production code must not crash.

Example:

try:

    data=json.load(file)


except Exception as e:

    print(e)

Phase 13 — Large JSON Files

Problem:

10GB JSON:

json.load()

loads everything into memory.

Better:

  • streaming

  • chunks

  • Spark


Phase 14 — PySpark JSON (Databricks)

Senior Data Engineer level.

Read:

df=spark.read.json(
"path/file.json"
)

Nested column:

df.select(
"employee.name"
)

Flatten struct:

df.select(
"employee.*"
)

Arrays:

Example:

skills=[
Python,
Spark
]

Explode:

from pyspark.sql.functions import explode


df.withColumn(
"skill",
explode("skills")
)

Result:

Python
Spark

Phase 15 — Spark Functions to Master

For Azure Databricks interviews:

Must know:

spark.read.json()

from_json()

to_json()

explode()

explode_outer()

posexplode()

select()

select("struct.*")

withColumn()

col()

getField()

schema_of_json()


3 Small Additions to Make it Perfect

To truly secure a Senior Data Engineer title, add these edge cases to your checklist:

1. PySpark Explicit Schemas (Crucial for Production)

In Phase 14, you mentioned spark.read.json(). While this works, letting Spark infer the schema on a 10GB file forces it to scan the dataset twice, which kills performance.

  • Interview Tip: Always mention defining a StructType schema explicitly before reading, or using from_json(col, schema).

2. JSON Lines (ndjson) vs. Standard JSON

Standard JSON arrays require loading the entire file into memory as one object. Production pipelines usually use JSON Lines (.jsonl or newline-delimited JSON), where every single line is a valid independent JSON object.

  • Interview Tip: Spark processes JSON Lines inherently in parallel, whereas a massive nested standard JSON array is notoriously difficult to split across nodes.

3. Fleshing out the Cloud SQL Phase

Your introduction matrix mentions Snowflake (LATERAL FLATTEN) and BigQuery (UNNEST), but the phases don't dive into them.

  • Interview Tip: Be prepared to write a quick query showing how to parse a variant/JSON column using the colon syntax (column:nested_field) and flattening an array via a SQL join.

Final Skill Checklist for Senior JSON Processing

You should be able to do:

✅ Read JSON files
✅ Convert JSON ↔ Python objects
✅ Navigate nested dictionaries
✅ Handle arrays
✅ Flatten dynamic JSON
✅ Write recursive flatten functions
✅ Use pandas.json_normalize()
✅ Handle missing fields
✅ Process large JSON files
✅ Flatten JSON in PySpark
✅ Handle schema evolution

The Senior-Level Execution Code snippets

1. PySpark Explicit Schema Enforcements (Instead of InferSchema)

When an interviewer asks, "How do you optimize reading a massive JSON file in PySpark?", do not just say "use a schema." Show them how:

Python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType

# 1. Define schema explicitly to avoid a double-pass file scan
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("skills", ArrayType(StringType()), True)
])

# 2. Apply it directly during the read phase
df = spark.read.schema(schema).json("hdfs:///data/large_employees.json")

2. JSON Lines (.jsonl) Production Standard

If they ask, "What is the difference between parsing a standard JSON array vs. Newline-Delimited JSON (NDJSON) in production?", highlight that standard JSON forces a single executor to parse the file, while JSON Lines allows distributed, parallel processing.

Note: When using spark.read.json(), PySpark assumes JSON Lines format by default (one JSON object per line). If you are reading a single, massive multi-line JSON array file, you must explicitly pass the multiLine option:

df = spark.read.option("multiLine", True).json("file.json")

3. The Cloud SQL Snippet (Snowflake & BigQuery)

If they pivot to the data warehouse phase and ask how to flatten semi-structured data directly in SQL, be ready to sketch these architectures:

Snowflake (LATERAL FLATTEN)

Assuming you have a table named employee_stage with a VARIANT column named json_data:

SQL
SELECT 
    json_data:id::INT as employee_id,
    json_data:name::STRING as employee_name,
    f.value::STRING as skill
FROM employee_stage,
LATERAL FLATTEN(input => json_data:skills) f;

BigQuery (UNNEST)

Assuming you have a table named employee_stage with a JSON column named json_data:

SQL
SELECT 
    LAX_INT64(json_data.id) as employee_id,
    LAX_STRING(json_data.name) as employee_name,
    skill
FROM employee_stage,
UNNEST(JSON_EXTRACT_STRING_ARRAY(json_data, '$.skills')) as skill;

🚀 Final Interview Cheat Sheet: The 3-Second Deflection Rules

When hit with an open-ended JSON question, default to these architectural architectural decisions instantly:

  • Memory constraint issue? Drop json.load() $\rightarrow$ Switch to ijson (streaming) or PySpark.

  • Performance bottleneck? Drop inferSchema $\rightarrow$ Provide explicit StructType.

  • Explode vs. Explode_outer? Use explode() if you want to drop rows with empty/null arrays. Use explode_outer() if you must retain the parent row even if its array payload is completely empty.




Comments

Popular posts from this blog

Entity Relationship (ER) Diagram Model with DBMS Example

Reference :   Entity Relationship (ER) Diagram Model with DBMS Example What is ER Diagram? ER Diagram  stands for Entity Relationship Diagram, also known as ERD is a diagram that displays the relationship of entity sets stored in a database. In other words, ER diagrams help to explain the logical structure of databases. ER diagrams are created based on three basic concepts: entities, attributes and relationships. ER Diagrams contain different symbols that use rectangles to represent entities, ovals to define attributes and diamond shapes to represent relationships. At first look, an ER diagram looks very similar to the flowchart. However, ER Diagram includes many specialized symbols, and its meanings make this model unique. The purpose of ER Diagram is to represent the entity framework infrastructure. Entity Relationship Diagram Example Table of Content: What is ER Diagram? What is ER Model? History of ER models Why use ER Diagrams? Facts about ER Diagram Model ER Diagram...

Transformation - section 6 - data flow

  Feature from Slide Explanation ✅ Code-free data transformations Data Flows in ADF allow you to build transformations using a drag-and-drop visual interface , with no need for writing Spark or SQL code. ✅ Executed on Data Factory-managed Databricks Spark clusters Internally, ADF uses Azure Integration Runtimes backed by Apache Spark clusters , managed by ADF, not Databricks itself . While it's similar in concept, this is not the same as your own Databricks workspace . ✅ Benefits from ADF scheduling and monitoring Data Flows are fully integrated into ADF pipelines, so you get all the orchestration, parameterization, logging, and alerting features of ADF natively. ⚠️ Important Clarification Although it says "executed on Data Factory managed Databricks Spark clusters," this does not mean you're using your own Azure Databricks workspace . Rather: ADF Data Flows run on ADF-managed Spark clusters. Azure Databricks notebooks (which you trigger via an "Exe...

Session 7 data flow part 2

  Data Flow Name : df_transform_hospital_admissions Pipeline Steps : Source (HospitalAdmissionSource) : Pulls data from ds_raw_hospital_admission . SelectReqdFields : Renames or selects specific fields: country , indicator , etc. LookupCountry : Performs a lookup using CountrySource (likely from ds_country_lookup ) to enrich the data. SelectReqdFields2 : Refines the result further with a new set of selected or renamed fields. Split into Weekly and Daily : A Conditional Split divides the data into two branches: Weekly (9 columns total) Daily (filtering on indicator column, likely conditional logic) Right Panel : Shows general properties. Name: df_transform_hospital_admissions . Description: Empty. Bottom Panel (Data preview) : Currently loading: “Fetching data…”. Status: Data flow debug is enabled (green). Operation counts like INSERT , UPDATE , DELETE , etc., are N/A , meaning this is likely a preview r...