Think of JSON processing as this journey:
Raw JSON → Python objects → Flatten → Clean → DataFrame → Spark → Production pipeline
| Goal | In Local Python | In PySpark (Databricks) | In Cloud SQL (Snowflake/BigQuery) |
| Read a file | json.load(f) | spark.read.json() | COPY INTO / Storage Integration |
| Go inside an object | data["key"]["subkey"] | df.select("key.subkey") | SELECT column:key.subkey |
| Turn a list into rows | for item in my_list: | explode(col("my_list")) | LATERAL FLATTEN() / UNNEST() |
Phase 1 — JSON Fundamentals
1. JSON Data Types
You need to immediately recognize how JSON maps to Python.
| JSON | Python | Example |
|---|---|---|
| Object | dict | {"name":"John"} |
| Array | list | [1,2,3] |
| String | str | "London" |
| Number | int/float | 100 |
| Boolean | bool | true |
| Null | None | null |
Example:
{
"employee":{
"id":100,
"name":"John"
},
"skills":[
"Python",
"Spark"
]
}
Python sees this as:
{
"employee":{
"id":100,
"name":"John"
},
"skills":[
"Python",
"Spark"
]
}
Your brain should immediately think:
{}→ dictionary[]→ list
Phase 2 — Python JSON Library
The first tool every Data Engineer uses.
import json
json.load()
Read JSON file.
Example:
employee.json
{
"id":1,
"name":"John"
}
Python:
with open("employee.json") as f:
data=json.load(f)
print(data)
Output:
{
'id':1,
'name':'John'
}
Use when:
file → python dictionary
json.loads()
String → dictionary
Example:
json_string='{"name":"John"}'
data=json.loads(json_string)
print(data["name"])
Output:
John
Use:
JSON text → python object
json.dump()
Dictionary → JSON file
employee={
"name":"John"
}
with open("output.json","w") as f:
json.dump(employee,f)
Creates:
{
"name":"John"
}
json.dumps()
Dictionary → JSON string
json.dumps(employee)
Output:
'{"name":"John"}'
Interview:
load vs loads
Answer:
load = file
loads = string
Phase 3 — Python Dictionary Mastery
Because JSON objects become dictionaries.
Example:
employee={
"name":"John",
"salary":5000
}
Access:
employee["name"]
Output:
John
Problem:
employee["age"]
Error:
KeyError
Better:
employee.get("age")
Output:
None
Important functions:
keys()
employee.keys()
Output:
name salary
values()
employee.values()
items()
Used in loops.
for k,v in employee.items():
print(k,v)
Output:
name John
salary 5000
Phase 4 — Lists (JSON Arrays)
Example:
skills=[
"Python",
"Spark",
"SQL"
]
Loop:
for skill in skills:
print(skill)
Output:
Python
Spark
SQL
Important:
append()
Add one item
skills.append("Azure")
extend()
Add multiple:
skills.extend(
["AWS","Snowflake"]
)
enumerate()
Very common interview topic.
for index,item in enumerate(skills):
print(index,item)
Output:
0 Python
1 Spark
2 SQL
Phase 5 — Access Nested JSON
Real JSON:
{
"id":1,
"address":{
"city":"London",
"country":"UK"
}
}
Python:
employee["address"]["city"]
Output:
London
Phase 6 — Loop Nested JSON
Real example:
orders={
"orders":[
{
"id":1,
"price":100
},
{
"id":2,
"price":200
}
]
}
Loop:
for order in orders["orders"]:
print(order["price"])
Output:
100
200
Phase 7 — isinstance() (Critical)
This is the heart of dynamic flattening.
Example:
value={
"name":"John"
}
isinstance(value,dict)
Output:
True
Why?
Because flattening logic is:
If dictionary:
go deeper
If list:
expand
Otherwise:
save value
Phase 8 — Recursive Flattening
Example:
Input:
{
"user":{
"name":"John",
"city":"London"
}
}
Output:
{
"user_name":"John",
"user_city":"London"
}
Code:
def flatten(data,parent=""):
result={}
for key,value in data.items():
new_key = (
parent+"_"+key
if parent
else key
)
if isinstance(value,dict):
result.update(
flatten(value,new_key)
)
else:
result[new_key]=value
return result
Usage:
flatten(data)
Output:
user_name John
user_city London
Phase 9 — Handling Arrays
Example:
{
"name":"John",
"skills":[
"Python",
"Spark"
]
}
Need output:
| name | skill |
|---|---|
| John | Python |
| John | Spark |
This requires:
loop
enumerate
explode concept
Phase 10 — Pandas JSON Processing
Most common:
import pandas as pd
Example:
data={
"employee":{
"id":1,
"name":"John"
}
}
Flatten:
df=pd.json_normalize(data)
Output:
| employee.id | employee.name |
|---|---|
| 1 | John |
Arrays:
pd.json_normalize(
data,
record_path="orders"
)
Phase 11 — Missing Data Handling
Bad:
employee["salary"]
Good:
employee.get("salary")
Pandas:
df.fillna(0)
Phase 12 — Exception Handling
Production code must not crash.
Example:
try:
data=json.load(file)
except Exception as e:
print(e)
Phase 13 — Large JSON Files
Problem:
10GB JSON:
json.load()
loads everything into memory.
Better:
streaming
chunks
Spark
Phase 14 — PySpark JSON (Databricks)
Senior Data Engineer level.
Read:
df=spark.read.json(
"path/file.json"
)
Nested column:
df.select(
"employee.name"
)
Flatten struct:
df.select(
"employee.*"
)
Arrays:
Example:
skills=[
Python,
Spark
]
Explode:
from pyspark.sql.functions import explode
df.withColumn(
"skill",
explode("skills")
)
Result:
Python
Spark
Phase 15 — Spark Functions to Master
For Azure Databricks interviews:
Must know:
spark.read.json()
from_json()
to_json()
explode()
explode_outer()
posexplode()
select()
select("struct.*")
withColumn()
col()
getField()
schema_of_json()
3 Small Additions to Make it Perfect
To truly secure a Senior Data Engineer title, add these edge cases to your checklist:
1. PySpark Explicit Schemas (Crucial for Production)
In Phase 14, you mentioned spark.read.json(). While this works, letting Spark infer the schema on a 10GB file forces it to scan the dataset twice, which kills performance.
Interview Tip: Always mention defining a StructType schema explicitly before reading, or using from_json(col, schema).
2. JSON Lines (ndjson) vs. Standard JSON
Standard JSON arrays require loading the entire file into memory as one object. Production pipelines usually use JSON Lines (.jsonl or newline-delimited JSON), where every single line is a valid independent JSON object.
Interview Tip: Spark processes JSON Lines inherently in parallel, whereas a massive nested standard JSON array is notoriously difficult to split across nodes.
3. Fleshing out the Cloud SQL Phase
Your introduction matrix mentions Snowflake (LATERAL FLATTEN) and BigQuery (UNNEST), but the phases don't dive into them.
Interview Tip: Be prepared to write a quick query showing how to parse a variant/JSON column using the colon syntax (column:nested_field) and flattening an array via a SQL join.
Final Skill Checklist for Senior JSON Processing
You should be able to do:
✅ Read JSON files
✅ Convert JSON ↔ Python objects
✅ Navigate nested dictionaries
✅ Handle arrays
✅ Flatten dynamic JSON
✅ Write recursive flatten functions
✅ Use pandas.json_normalize()
✅ Handle missing fields
✅ Process large JSON files
✅ Flatten JSON in PySpark
✅ Handle schema evolution
The Senior-Level Execution Code snippets
1. PySpark Explicit Schema Enforcements (Instead of InferSchema)
When an interviewer asks, "How do you optimize reading a massive JSON file in PySpark?", do not just say "use a schema." Show them how:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType
# 1. Define schema explicitly to avoid a double-pass file scan
schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("skills", ArrayType(StringType()), True)
])
# 2. Apply it directly during the read phase
df = spark.read.schema(schema).json("hdfs:///data/large_employees.json")
2. JSON Lines (.jsonl) Production Standard
If they ask, "What is the difference between parsing a standard JSON array vs. Newline-Delimited JSON (NDJSON) in production?", highlight that standard JSON forces a single executor to parse the file, while JSON Lines allows distributed, parallel processing.
Note: When using
spark.read.json(), PySpark assumes JSON Lines format by default (one JSON object per line). If you are reading a single, massive multi-line JSON array file, you must explicitly pass the multiLine option:
df = spark.read.option("multiLine", True).json("file.json")
3. The Cloud SQL Snippet (Snowflake & BigQuery)
If they pivot to the data warehouse phase and ask how to flatten semi-structured data directly in SQL, be ready to sketch these architectures:
Snowflake (LATERAL FLATTEN)
Assuming you have a table named employee_stage with a VARIANT column named json_data:
SELECT
json_data:id::INT as employee_id,
json_data:name::STRING as employee_name,
f.value::STRING as skill
FROM employee_stage,
LATERAL FLATTEN(input => json_data:skills) f;
BigQuery (UNNEST)
Assuming you have a table named employee_stage with a JSON column named json_data:
SELECT
LAX_INT64(json_data.id) as employee_id,
LAX_STRING(json_data.name) as employee_name,
skill
FROM employee_stage,
UNNEST(JSON_EXTRACT_STRING_ARRAY(json_data, '$.skills')) as skill;
🚀 Final Interview Cheat Sheet: The 3-Second Deflection Rules
When hit with an open-ended JSON question, default to these architectural architectural decisions instantly:
Memory constraint issue? Drop
json.load()$\rightarrow$ Switch toijson(streaming) or PySpark.Performance bottleneck? Drop
inferSchema$\rightarrow$ Provide explicitStructType.Explode vs. Explode_outer? Use
explode()if you want to drop rows with empty/null arrays. Useexplode_outer()if you must retain the parent row even if its array payload is completely empty.
Comments
Post a Comment