Data Engineer Virtual Environment (venv) + Cloud Workflow Notes
1. What is a Virtual Environment (venv)?
A virtual environment is an isolated Python environment for a project.
It allows each project to have its own Python libraries and versions.
Example:
Project A
|
└── venv
    ├── pandas 2.0
    └── pyspark
Project B
|
└── venv
    ├── pandas 2.2
    └── tensorflow
Without venv, package versions can conflict.
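A quick way to see this isolation from inside Python is to compare sys.prefix with sys.base_prefix. The sketch below is only an illustration; the function name in_virtualenv is made up for these notes and is not part of any library.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("prefix:       ", sys.prefix)
print("base prefix:  ", sys.base_prefix)
print("inside a venv:", in_virtualenv())
```

Run it once with the venv activated and once without: the two prefixes should differ only in the first case.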
2. Why Data Engineers need venv
Data engineers use many Python libraries:
pandas
numpy
pyspark
requests
cloud SDKs
Example:
Automation script:
Python script
|
|
v
Azure Storage / GCP Storage / APIs
The venv keeps the required libraries separate.
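To check which of these libraries the current environment actually has, something like the following can help. It is a minimal sketch using the standard importlib.metadata module (Python 3.8+); installed_version is a hypothetical helper name, and the package list simply repeats the examples above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report what the active environment (venv or system Python) contains.
for name in ["pandas", "numpy", "pyspark", "requests"]:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

The answers will change depending on which venv is active, which is exactly the separation described above.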
3. venv in Cloud Environment
Cloud does NOT replace venv.
Cloud = infrastructure
venv = Python dependency isolation
Example:
Cloud VM
|
├── Python
|
├── Project A
|   └── venv
|
└── Project B
    └── venv
4. Team Environment: Do we share the same venv?
NO.
Each developer has their own venv.
Example:
Developer A laptop
|
└── venv
Developer B laptop
|
└── venv
The team shares:
Git Repository
├── Python code
├── requirements.txt
└── configuration
The team does NOT share:
❌ venv folder
5. requirements.txt
This file stores package dependencies.
Example:
pandas==2.2.0
numpy==1.26.0
pyspark==3.5.0
requests==2.32.0
Team member creates the same environment:
pip install -r requirements.txt
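After installing, it can be useful to confirm that the environment really matches the pinned versions. The sketch below is an assumption-heavy illustration: parse_pins and find_mismatches are hypothetical helpers, and the parser only understands simple name==version lines (no extras, markers, URLs or -r includes).

```python
from importlib.metadata import PackageNotFoundError, version


def parse_pins(text: str) -> dict:
    """Collect name -> version for simple 'name==version' lines."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "==" not in line:
            continue  # this sketch skips blanks and unpinned entries
        name, _, wanted = line.partition("==")
        pins[name.strip().lower()] = wanted.strip()
    return pins


def find_mismatches(pins: dict) -> dict:
    """Return name -> (wanted, found) for packages that differ or are missing."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None  # not installed in this environment
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


example = """\
pandas==2.2.0
numpy==1.26.0
pyspark==3.5.0
requests==2.32.0
"""
print(find_mismatches(parse_pins(example)))
```

An empty dictionary means the active environment matches the file; anything else lists the wanted and the found version per package.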
6. Docker vs venv
venv
Used mainly for:
local development
testing Python scripts
Example:
Laptop
venv
|
Python libraries
|
Code
Docker
Used for:
production deployment
cloud environments
Example:
Docker Container
Python
|
Libraries
|
Application
Usually:
Developer:
venv
Production:
Docker
7. Azure Data Engineer Example
Stack:
Azure Data Factory
Azure Databricks
Power BI
Flow:
Developer Laptop
venv
|
Python automation scripts
|
v
Azure DevOps Git
|
v
ADF / Databricks / Power BI
venv in Azure Data Engineering
Used for:
✅ Python automation scripts
✅ Azure SDK scripts
✅ Data validation scripts
✅ Testing ETL logic
Not used for:
❌ ADF itself
❌ Power BI
❌ Databricks cluster runtime
8. GCP Data Engineer Example
Stack:
Cloud Composer
Dataflow
Dataproc
BigQuery
Flow:
Developer
venv
Python code
Git
CI/CD
Cloud Services
venv used for:
✅ Google Cloud SDK scripts
✅ Apache Beam testing
✅ BigQuery automation
✅ Local development
Not used for:
❌ BigQuery engine
❌ Dataproc cluster Python environment
9. Creating a venv (Windows)
Create:
python -m venv firstenv
Meaning:
python
|
-m = run module
|
venv = Python virtual environment module
|
firstenv = environment name
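The same thing can be done from Python code, because venv is an ordinary standard-library module. This is a minimal sketch, assuming a throwaway temporary folder; create_env is a hypothetical wrapper, and with_pip=False is chosen here only to keep the example fast (the command line installs pip by default).

```python
import tempfile
import venv
from pathlib import Path


def create_env(env_dir, with_pip: bool = False) -> Path:
    """Create a virtual environment at env_dir and return its path."""
    # venv.create is the standard library's programmatic counterpart
    # to running `python -m venv <dir>` in a terminal.
    venv.create(env_dir, with_pip=with_pip)
    return Path(env_dir)


with tempfile.TemporaryDirectory() as tmp:
    env = create_env(Path(tmp) / "firstenv")
    # Every venv contains a pyvenv.cfg file describing the base interpreter.
    print((env / "pyvenv.cfg").read_text())
```

The printed pyvenv.cfg shows which base Python the environment was built from.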
10. Activate venv (Windows CMD)
Correct:
firstenv\Scripts\activate
You should see:
(firstenv) C:\Users\keert>
Deactivate:
deactivate
11. Activate venv (Linux/Mac)
source firstenv/bin/activate
12. Important venv Commands
Create environment
python -m venv venv
Activate
Windows:
venv\Scripts\activate
Linux/Mac:
source venv/bin/activate
Exit
deactivate
Check Python version
python --version
Check which Python is running
Windows:
where python
The first line of the output should end with:
venv\Scripts\python.exe
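The interpreter lives in a different subfolder depending on the operating system, which is why the activate commands differ. The sketch below spells out that layout; venv_python is a hypothetical helper written for these notes.

```python
import os
from pathlib import PurePosixPath, PureWindowsPath


def venv_python(env_dir: str, os_name: str = os.name):
    """Return the expected interpreter path inside a venv for the given OS."""
    if os_name == "nt":
        # Windows layout: <env>\Scripts\python.exe
        return PureWindowsPath(env_dir) / "Scripts" / "python.exe"
    # Linux/macOS layout: <env>/bin/python
    return PurePosixPath(env_dir) / "bin" / "python"


print(venv_python("venv", "nt"))     # venv\Scripts\python.exe
print(venv_python("venv", "posix"))  # venv/bin/python
```

Calling this interpreter directly (for example venv\Scripts\python.exe script.py) uses the venv's packages even without activating it first.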
13. pip Commands
Install package
pip install pandas
Install project dependencies
pip install -r requirements.txt
Meaning:
-r = read requirements file
List libraries
pip list
Show exact versions
pip freeze
Example:
pandas==2.2.0
numpy==1.26.0
Save environment
pip freeze > requirements.txt
Team shares this file.
Remove package
pip uninstall pandas
Upgrade package
pip install --upgrade pandas
or:
pip install -U pandas
Package details
pip show pandas
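For intuition about what pip freeze reports, the installed packages can also be listed from Python itself. This is only a rough approximation using importlib.metadata; freeze_lines is a hypothetical name, and real pip freeze output can differ (for example for editable or VCS installs).

```python
from importlib.metadata import distributions


def freeze_lines() -> list:
    """Return 'name==version' lines for packages visible to this interpreter."""
    lines = set()
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries with unreadable metadata
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


for line in freeze_lines():
    print(line)
```

Inside a fresh venv the list is short; in a system-wide Python it is usually much longer, which is one reason to freeze from the venv.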
14. Meaning of Common Flags
Flags start with - (short form, e.g. -r) or -- (long form, e.g. --help).
Python
-m
Run module:
python -m venv venv
pip
-r
Read requirements file:
pip install -r requirements.txt
-U
Upgrade:
pip install -U pandas
--help
Help:
pip --help
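One practical consequence of -m is that the module runs with exactly the interpreter you name, which is why python -m pip is often preferred over a bare pip. The sketch below illustrates this with the standard-library platform module (used here only because it is always available); run_module is a hypothetical helper.

```python
import subprocess
import sys


def run_module(module: str, *args: str) -> subprocess.CompletedProcess:
    """Run `python -m <module> [args]` with the current interpreter."""
    # sys.executable is the interpreter running this script, so the module
    # is looked up in that interpreter's environment (the active venv, if any).
    return subprocess.run(
        [sys.executable, "-m", module, *args],
        capture_output=True,
        text=True,
        check=False,
    )


result = run_module("platform")
print(result.stdout.strip())
```

Swapping "platform" for "pip" with the argument "--version" would show which environment that pip belongs to, assuming pip is installed there.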
15. Other Package Managers
Python packages
pip
Example:
pip install pandas
Windows software
Chocolatey:
choco install git
Winget:
winget install python
Mac software
Homebrew:
brew install python
brew install git
16. Data Engineer Daily Workflow
Typical:
git clone project
cd project
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
python script.py
pip freeze > requirements.txt
deactivate
Most Important Commands to Memorize
python -m venv venv
venv\Scripts\activate
deactivate
where python
pip install package
pip list
pip freeze
pip freeze > requirements.txt
pip install -r requirements.txt
pip uninstall package
Final Mental Model
venv
=
Developer's Python workspace
requirements.txt
=
Team dependency sharing
Docker
=
Production environment
Git
=
Code sharing
Azure DevOps/GCP CI-CD
=
Automation
ADF / Composer
=
Orchestration
Databricks / Dataflow
=
Processing
Power BI / BigQuery
=
Analytics layer
This is the workflow you should remember as a Data Engineer.
Step 1: Make sure Python is installed on the system.
Step 2: Make sure it is Python 3, then upgrade pip with:
python -m pip install --upgrade pip
(pip is the Python package manager, Chocolatey (choco) is a Windows package manager, and Homebrew (brew) is a package manager for Mac.)
Step 3: Create the venv with:
python -m venv firstenv
python → runs Python
-m → tells Python "run a module"
venv → the module
firstenv → name of your environment
Step 4: Activate the venv.
From the folder that contains firstenv, run:
firstenv\Scripts\activate
The prompt then shows the environment name, and you can install the libraries you need for development:
(firstenv) C:\Users\keert>pip install pandas
Step 5: To see which packages or libraries are installed, run:
(firstenv) C:\Users\keert>pip freeze
(firstenv) C:\Users\keert>pip list
Step 6: In general, teams keep the list of required packages in Git, for example in requirements.txt.
To install them, run:
pip install -r requirements.txt
Step 7: To leave the environment, run:
(firstenv) C:\Users\keert>deactivate
To go back in, run the same activate command:
C:\Users\keert>firstenv\Scripts\activate
Then carry on with your work.
The Cloud vs. venv Distinction: This is a massive stumbling block for beginners. Recognizing that Cloud = Infrastructure and venv = Python Isolation shows you understand that running an EC2, VM, or Cloud Function doesn't magically solve dependency hell.
Databricks/Dataproc Runtimes: You correctly noted that venv is not used for managed Spark cluster runtimes. Databricks handles its own cluster-level libraries (or uses init scripts/%pip), so keeping your local script automation isolated from cluster runtimes is exactly how it works in production.
The Git Guardrail: Omitting the ❌ venv folder from Git and only sharing requirements.txt is standard industry best practice. (Pro-tip: Always add venv/ or .venv/ to your .gitignore file immediately.)
The "where python" / "which python" Check: Checking your environment context via where python (Windows) or which python (Mac/Linux) is the ultimate debugging step when a package isn't loading properly.
🔍 Micro-Tips to Level Up Your Notes
Your commands and logic are flawless, but as you work in enterprise environments, keep these tiny nuances in mind:
Deterministic Builds (pip-compile): While pip freeze > requirements.txt is perfect, it dumps everything, including sub-dependencies. As you advance, you might encounter tools like pip-tools or Poetry which separate your top-level dependencies (e.g., just pandas) from their underlying sub-dependencies to keep things cleaner.
Mac/Linux Equivalent for where: You noted where python for Windows. For Mac/Linux, the exact equivalent command to memorize is:
which python
Docker + venv (The Hybrid Approach): While your distinction that Developer = venv and Production = Docker is great for a mental model, in advanced production environments, engineers actually use both together. Inside a Dockerfile, it is common practice to create a venv to isolate the app from the base OS Python layer.