2.a flattening the JSON File.

Think of JSON processing as this journey:

Raw JSON → Python objects → Flatten → Clean → DataFrame → Spark → Production pipeline

Goal	In Local Python	In PySpark (Databricks)	In Cloud SQL (Snowflake/BigQuery)
Read a file	json.load(f)	spark.read.json()	`COPY INTO` / Storage Integration
Go inside an object	data["key"]["subkey"]	df.select("key.subkey")	SELECT column:key.subkey
Turn a list into rows	for item in my_list:	explode(col("my_list"))	`LATERAL FLATTEN()` / `UNNEST()`

Phase 1 — JSON Fundamentals

1. JSON Data Types

You need to immediately recognize how JSON maps to Python.

JSON	Python	Example
Object	dict	`{"name":"John"}`
Array	list	`[1,2,3]`
String	str	`"London"`
Number	int/float	`100`
Boolean	bool	`true`
Null	None	`null`

Example:

{
 "employee":{
    "id":100,
    "name":"John"
 },
 "skills":[
    "Python",
    "Spark"
 ]
}

Python sees this as:

{
 "employee":{
    "id":100,
    "name":"John"
 },

 "skills":[
    "Python",
    "Spark"
 ]
}

Your brain should immediately think:

{} → dictionary
[] → list

Phase 2 — Python JSON Library

The first tool every Data Engineer uses.

import json

json.load()

Read JSON file.

Example:

employee.json

{
"id":1,
"name":"John"
}

Python:

with open("employee.json") as f:

    data=json.load(f)


print(data)

Output:

{
'id':1,
'name':'John'
}

Use when:

file → python dictionary

json.loads()

String → dictionary

Example:

json_string='{"name":"John"}'


data=json.loads(json_string)


print(data["name"])

Output:

John

Use:

JSON text → python object

json.dump()

Dictionary → JSON file

employee={
"name":"John"
}


with open("output.json","w") as f:

    json.dump(employee,f)

Creates:

{
"name":"John"
}

json.dumps()

Dictionary → JSON string

json.dumps(employee)

Output:

'{"name":"John"}'

Interview:

load vs loads

Answer:

load  = file
loads = string

Phase 3 — Python Dictionary Mastery

Because JSON objects become dictionaries.

Example:

employee={
"name":"John",
"salary":5000
}

Access:

employee["name"]

Output:

John

Problem:

employee["age"]

Error:

KeyError

Better:

employee.get("age")

Output:

None

Important functions:

keys()

employee.keys()

Output:

name salary

values()

employee.values()

items()

Used in loops.

for k,v in employee.items():

    print(k,v)

Output:

name John
salary 5000

Phase 4 — Lists (JSON Arrays)

Example:

skills=[
"Python",
"Spark",
"SQL"
]

Loop:

for skill in skills:

    print(skill)

Output:

Python
Spark
SQL

Important:

append()

Add one item

skills.append("Azure")

extend()

Add multiple:

skills.extend(
["AWS","Snowflake"]
)

enumerate()

Very common interview topic.

for index,item in enumerate(skills):

    print(index,item)

Output:

0 Python
1 Spark
2 SQL

Phase 5 — Access Nested JSON

Real JSON:

{
"id":1,

"address":{
 "city":"London",
 "country":"UK"
}

}

Python:

employee["address"]["city"]

Output:

London

Phase 6 — Loop Nested JSON

Real example:

orders={

"orders":[

{
"id":1,
"price":100
},

{
"id":2,
"price":200
}

]

}

Loop:

for order in orders["orders"]:

    print(order["price"])

Output:

100
200

Phase 7 — isinstance() (Critical)

This is the heart of dynamic flattening.

Example:

value={
"name":"John"
}


isinstance(value,dict)

Output:

True

Why?

Because flattening logic is:

If dictionary:
    go deeper

If list:
    expand

Otherwise:
    save value

Phase 8 — Recursive Flattening

Example:

Input:

{
"user":{
"name":"John",
"city":"London"
}
}

Output:

{
"user_name":"John",
"user_city":"London"
}

Code:

def flatten(data,parent=""):

    result={}


    for key,value in data.items():

        new_key = (
        parent+"_"+key
        if parent
        else key
        )


        if isinstance(value,dict):

            result.update(
            flatten(value,new_key)
            )


        else:

            result[new_key]=value


    return result

Usage:

flatten(data)

Output:

user_name John
user_city London

Phase 9 — Handling Arrays

Example:

{
"name":"John",

"skills":[
"Python",
"Spark"
]

}

Need output:

name	skill
John	Python
John	Spark

This requires:

loop
enumerate
explode concept

Phase 10 — Pandas JSON Processing

Most common:

import pandas as pd

Example:

data={

"employee":{
"id":1,
"name":"John"

}

}

Flatten:

df=pd.json_normalize(data)

Output:

employee.id	employee.name
1	John

Arrays:

pd.json_normalize(
data,
record_path="orders"
)

Phase 11 — Missing Data Handling

Bad:

employee["salary"]

Good:

employee.get("salary")

Pandas:

df.fillna(0)

Phase 12 — Exception Handling

Production code must not crash.

Example:

try:

    data=json.load(file)


except Exception as e:

    print(e)

Phase 13 — Large JSON Files

Problem:

10GB JSON:

json.load()

loads everything into memory.

Better:

streaming
chunks
Spark

Phase 14 — PySpark JSON (Databricks)

Senior Data Engineer level.

Read:

df=spark.read.json(
"path/file.json"
)

Nested column:

df.select(
"employee.name"
)

Flatten struct:

df.select(
"employee.*"
)

Arrays:

Example:

skills=[
Python,
Spark
]

Explode:

from pyspark.sql.functions import explode


df.withColumn(
"skill",
explode("skills")
)

Result:

Python
Spark

Phase 15 — Spark Functions to Master

For Azure Databricks interviews:

Must know:

spark.read.json()

from_json()

to_json()

explode()

explode_outer()

posexplode()

select()

select("struct.*")

withColumn()

col()

getField()

schema_of_json()


3 Small Additions to Make it Perfect
To truly secure a Senior Data Engineer title, add these edge cases to your checklist:
1. PySpark Explicit Schemas (Crucial for Production)
In Phase 14, you mentioned spark.read.json(). While this works, letting Spark infer the schema on a 10GB file forces it to scan the dataset twice, which kills performance.
Interview Tip: Always mention defining a StructType schema explicitly before reading, or using from_json(col, schema).
2. JSON Lines (ndjson) vs. Standard JSON
Standard JSON arrays require loading the entire file into memory as one object. Production pipelines usually use JSON Lines (.jsonl or newline-delimited JSON), where every single line is a valid independent JSON object.
Interview Tip: Spark processes JSON Lines inherently in parallel, whereas a massive nested standard JSON array is notoriously difficult to split across nodes.
3. Fleshing out the Cloud SQL Phase
Your introduction matrix mentions Snowflake (LATERAL FLATTEN) and BigQuery (UNNEST), but the phases don't dive into them.
Interview Tip: Be prepared to write a quick query showing how to parse a variant/JSON column using the colon syntax (column:nested_field) and flattening an array via a SQL join.

Final Skill Checklist for Senior JSON Processing

You should be able to do:

✅ Read JSON files
✅ Convert JSON ↔ Python objects
✅ Navigate nested dictionaries
✅ Handle arrays
✅ Flatten dynamic JSON
✅ Write recursive flatten functions
✅ Use pandas.json_normalize()
✅ Handle missing fields
✅ Process large JSON files
✅ Flatten JSON in PySpark
✅ Handle schema evolution

The Senior-Level Execution Code snippets

1. PySpark Explicit Schema Enforcements (Instead of InferSchema)

When an interviewer asks, "How do you optimize reading a massive JSON file in PySpark?", do not just say "use a schema." Show them how:

Python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType

# 1. Define schema explicitly to avoid a double-pass file scan
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("skills", ArrayType(StringType()), True)
])

# 2. Apply it directly during the read phase
df = spark.read.schema(schema).json("hdfs:///data/large_employees.json")

2. JSON Lines (.jsonl) Production Standard

If they ask, "What is the difference between parsing a standard JSON array vs. Newline-Delimited JSON (NDJSON) in production?", highlight that standard JSON forces a single executor to parse the file, while JSON Lines allows distributed, parallel processing.

Note: When using spark.read.json(), PySpark assumes JSON Lines format by default (one JSON object per line). If you are reading a single, massive multi-line JSON array file, you must explicitly pass the multiLine option:
df = spark.read.option("multiLine", True).json("file.json")

3. The Cloud SQL Snippet (Snowflake & BigQuery)

If they pivot to the data warehouse phase and ask how to flatten semi-structured data directly in SQL, be ready to sketch these architectures:

Snowflake (`LATERAL FLATTEN`)

Assuming you have a table named employee_stage with a VARIANT column named json_data:

SQL
SELECT 
    json_data:id::INT as employee_id,
    json_data:name::STRING as employee_name,
    f.value::STRING as skill
FROM employee_stage,
LATERAL FLATTEN(input => json_data:skills) f;

BigQuery (`UNNEST`)

Assuming you have a table named employee_stage with a JSON column named json_data:

SQL
SELECT 
    LAX_INT64(json_data.id) as employee_id,
    LAX_STRING(json_data.name) as employee_name,
    skill
FROM employee_stage,
UNNEST(JSON_EXTRACT_STRING_ARRAY(json_data, '$.skills')) as skill;

🚀 Final Interview Cheat Sheet: The 3-Second Deflection Rules

When hit with an open-ended JSON question, default to these architectural architectural decisions instantly:

Memory constraint issue? Drop json.load() $\rightarrow$ Switch to ijson (streaming) or PySpark.
Performance bottleneck? Drop inferSchema $\rightarrow$ Provide explicit StructType.
Explode vs. Explode_outer? Use explode() if you want to drop rows with empty/null arrays. Use explode_outer() if you must retain the parent row even if its array payload is completely empty.

Keerthana Blogs

2.a flattening the JSON File.

Phase 1 — JSON Fundamentals

1. JSON Data Types

Phase 2 — Python JSON Library

json.load()

json.loads()

json.dump()

json.dumps()

Phase 3 — Python Dictionary Mastery

keys()

values()

items()

Phase 4 — Lists (JSON Arrays)

append()

extend()

enumerate()

Phase 5 — Access Nested JSON

Phase 6 — Loop Nested JSON

Phase 7 — isinstance() (Critical)

Phase 8 — Recursive Flattening

Phase 9 — Handling Arrays

Phase 10 — Pandas JSON Processing

Phase 11 — Missing Data Handling

Phase 12 — Exception Handling

Phase 13 — Large JSON Files

Phase 14 — PySpark JSON (Databricks)

Phase 15 — Spark Functions to Master

3 Small Additions to Make it Perfect

1. PySpark Explicit Schemas (Crucial for Production)

2. JSON Lines (ndjson) vs. Standard JSON

3. Fleshing out the Cloud SQL Phase

Final Skill Checklist for Senior JSON Processing

The Senior-Level Execution Code snippets

1. PySpark Explicit Schema Enforcements (Instead of InferSchema)

2. JSON Lines (.jsonl) Production Standard

3. The Cloud SQL Snippet (Snowflake & BigQuery)

Snowflake (LATERAL FLATTEN)

BigQuery (UNNEST)

🚀 Final Interview Cheat Sheet: The 3-Second Deflection Rules

Comments

Post a Comment

Popular posts from this blog

Entity Relationship (ER) Diagram Model with DBMS Example

Transformation - section 6 - data flow

Session 7 data flow part 2

Snowflake (`LATERAL FLATTEN`)

BigQuery (`UNNEST`)