Creating a venv

Data Engineer Virtual Environment (venv) + Cloud Workflow Notes

1. What is a Virtual Environment (venv)?

A virtual environment is an isolated Python environment for a project.

It allows each project to have its own Python libraries and versions.

Example:

Project A
|
└── venv
    ├── pandas 2.0
    └── pyspark


Project B
|
└── venv
    ├── pandas 2.2
    └── tensorflow

Without venv, package versions can conflict.

2. Why Data Engineers need venv

Data engineers use many Python libraries:

pandas
numpy
pyspark
requests
cloud SDKs

Example:

Automation script:

Python script
      |
      |
      v
Azure Storage / GCP Storage / APIs

The venv keeps the required libraries separate.

3. venv in Cloud Environment

Cloud does NOT replace venv.

Cloud = infrastructure

venv = Python dependency isolation

Example:

Cloud VM
|
├── Python
|
├── Project A
|     └── venv
|
└── Project B
      └── venv

4. Team Environment: Do we share the same venv?

NO.

Each developer has their own venv.

Example:

Developer A laptop
|
└── venv


Developer B laptop
|
└── venv

The team shares:

Git Repository

├── Python code
├── requirements.txt
└── configuration

The team does NOT share:

❌ venv folder

5. requirements.txt

This file stores package dependencies.

Example:

pandas==2.2.0
numpy==1.26.0
pyspark==3.5.0
requests==2.32.0

Team member creates the same environment:

pip install -r requirements.txt
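
To make the file format concrete, here is a minimal sketch that reads pinned lines like the ones above into a dictionary. The function name parse_pinned_requirements is hypothetical, and the sketch only handles the simple name==version form shown in these notes, not the full syntax that pip accepts.

```python
def parse_pinned_requirements(text):
    """Return {package: version} for lines of the form name==version."""
    pins = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "==" not in line:
            continue  # skip blanks and anything that is not an exact pin
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


example = """\
pandas==2.2.0
numpy==1.26.0
pyspark==3.5.0
requests==2.32.0
"""

pins = parse_pinned_requirements(example)
print(pins["pandas"])  # prints 2.2.0
```

Exact pins like these are what let every team member rebuild the same set of versions.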

6. Docker vs venv

venv

Used mainly for:

local development
testing Python scripts

Example:

Laptop

venv
 |
Python libraries
 |
Code

Docker

Used for:

production deployment
cloud environments

Example:

Docker Container

Python
 |
Libraries
 |
Application

Usually:

Developer:
venv

Production:
Docker

7. Azure Data Engineer Example

Stack:

Azure Data Factory
Azure Databricks
Power BI

Flow:

Developer Laptop

venv
 |
Python automation scripts

        |
        v

Azure DevOps Git

        |
        v

ADF / Databricks / Power BI

venv in Azure Data Engineering

Used for:

✅ Python automation scripts
✅ Azure SDK scripts
✅ Data validation scripts
✅ Testing ETL logic

Not used for:

❌ ADF itself
❌ Power BI
❌ Databricks cluster runtime

8. GCP Data Engineer Example

Stack:

Cloud Composer
Dataflow
Dataproc
BigQuery

Flow:

Developer

venv

Python code

Git

CI/CD

Cloud Services

venv used for:

✅ Google Cloud SDK scripts
✅ Apache Beam testing
✅ BigQuery automation
✅ Local development

Not used for:

❌ BigQuery engine
❌ Dataproc cluster Python environment

9. Creating a venv (Windows)

Create:

python -m venv firstenv

Meaning:

python
 |
 -m = run module
 |
 venv = Python virtual environment module
 |
 firstenv = environment name
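
The same venv module can also be called from Python code. The sketch below is an illustration under assumptions: the helper name create_env is hypothetical, the environment is created in a temporary folder so nothing is left behind, and with_pip=False skips installing pip to keep the example fast (the command-line form above installs pip by default).

```python
import tempfile
import venv
from pathlib import Path


def create_env(parent_dir, name="firstenv"):
    """Create a virtual environment called `name` inside `parent_dir`."""
    env_dir = Path(parent_dir) / name
    venv.create(env_dir, with_pip=False)  # roughly: python -m venv --without-pip firstenv
    return env_dir


with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_env(tmp)
    # Every venv has a pyvenv.cfg file at its root that marks it as an environment
    has_config = (env_dir / "pyvenv.cfg").is_file()
    print("pyvenv.cfg created:", has_config)
```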

10. Activate venv (Windows CMD)

Correct:

firstenv\Scripts\activate

You should see:

(firstenv) C:\Users\keert>

Deactivate:

deactivate

11. Activate venv (Linux/Mac)

source firstenv/bin/activate
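
The Windows and Linux/Mac activation commands differ because the environment's executables live in Scripts on Windows and in bin elsewhere. Activation mainly puts that folder first on PATH, so a script can also call the environment's interpreter directly without activating. The helper below is a hypothetical sketch of that path lookup, not part of any library.

```python
import os
from pathlib import Path


def venv_python(env_dir, os_name=os.name):
    """Return the path of the Python interpreter inside a venv folder."""
    env_dir = Path(env_dir)
    if os_name == "nt":  # Windows layout
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"  # Linux/Mac layout


print(venv_python("firstenv"))
```

Passing that path to subprocess.run, for example with ["-m", "pip", "list"] as arguments, runs pip inside the environment even when it is not activated.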

12. Important venv Commands

Create environment

python -m venv venv

Activate

Windows:

venv\Scripts\activate

Linux/Mac:

source venv/bin/activate

Exit

deactivate

Check Python version

python --version

Check which Python is running

Windows:

where python

Should show:

venv\Scripts\python.exe
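
The same check can be done from inside Python, which is handy when a script fails to import a package. This is a minimal sketch; the function name in_virtualenv is hypothetical, while sys.executable, sys.prefix and sys.base_prefix are standard attributes.

```python
import sys


def in_virtualenv():
    """True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a venv:", in_virtualenv())
```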

13. pip Commands

Install package

pip install pandas

Install project dependencies

pip install -r requirements.txt

Meaning:

-r = read requirements file

List libraries

pip list

Show exact versions

pip freeze

Example:

pandas==2.2.0
numpy==1.26.0

Save environment

pip freeze > requirements.txt

Team shares this file.
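
To see roughly what pip freeze reports, the sketch below lists the packages visible to the current interpreter in name==version form using the standard importlib.metadata module. It is an approximation for illustration, not how pip itself is implemented, and installed_pins is a hypothetical name.

```python
from importlib import metadata


def installed_pins():
    """Return sorted 'name==version' strings for the current environment."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with broken metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


for pin in installed_pins():
    print(pin)
```

Run it with the venv activated and the list covers only that environment's packages.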

Remove package

pip uninstall pandas

Upgrade package

pip install --upgrade pandas

or:

pip install -U pandas

Package details

pip show pandas

14. Meaning of Common Flags

Flags start with - (short form, such as -r) or -- (long form, such as --help).

Python

`-m`

Run module:

python -m venv venv

pip

`-r`

Read requirements file:

pip install -r requirements.txt

`-U`

Upgrade:

pip install -U pandas

`--help`

Help:

pip --help

15. Other Package Managers

Python packages

pip

Example:

pip install pandas

Windows software

Chocolatey:

choco install git

Winget:

winget install python

Mac software

Homebrew:

brew install python
brew install git

16. Data Engineer Daily Workflow

Typical (Windows CMD shown; on Linux/Mac, activate with source venv/bin/activate):

git clone <repository-url>

cd project

python -m venv venv

venv\Scripts\activate

pip install -r requirements.txt

python script.py

pip freeze > requirements.txt

deactivate

Most Important Commands to Memorize

python -m venv venv

venv\Scripts\activate

deactivate

where python

pip install package

pip list

pip freeze

pip freeze > requirements.txt

pip install -r requirements.txt

pip uninstall package

Final Mental Model

venv
 =
Developer's Python workspace


requirements.txt
 =
Team dependency sharing


Docker
 =
Production environment


Git
 =
Code sharing


Azure DevOps/GCP CI-CD
 =
Automation


ADF / Composer
 =
Orchestration


Databricks / Dataflow
 =
Processing


Power BI / BigQuery
 =
Analytics layer

This is the workflow you should remember as a Data Engineer.

Step 1. Make sure Python is installed on the system.

Step 2. Make sure the installed version is Python 3 (check with python --version). It also helps to upgrade pip:

python -m pip install --upgrade pip

(pip is the Python package manager, choco is a Windows package manager, and brew is a package manager for Mac.)

Step 3. Create the venv with the command:

python -m venv firstenv

python → runs Python

-m → tells Python to run a module

venv → the module

firstenv → the name of your environment

Step 4. Go inside the venv.

From the folder where you created it, run:

firstenv\Scripts\activate

The prompt then shows the environment name, and you can install the libraries you need for development:

(firstenv) C:\Users\keert>pip install pandas

Step 5. To see which packages or libraries are installed, run:

(firstenv) C:\Users\keert>pip freeze

(firstenv) C:\Users\keert>pip list

Step 6. In general, the team keeps the list of required packages in Git, for example in requirements.txt.

To install them, run:

pip install -r requirements.txt

Step 7. To leave the environment, run:

(firstenv) C:\Users\keert>deactivate

To enter it again, run the same activate command:

C:\Users\keert>firstenv\Scripts\activate

Then carry on with your work.

The Cloud vs. venv Distinction: This is a common stumbling block for beginners. Cloud = infrastructure and venv = Python isolation, so running an EC2 instance, a VM, or a Cloud Function does not by itself solve dependency conflicts.

Databricks/Dataproc Runtimes: venv is not used for managed Spark cluster runtimes. Databricks handles its own cluster-level libraries (or uses init scripts/%pip), so local automation scripts stay isolated from the cluster runtime.

The Git Guardrail: Keeping the venv folder out of Git and sharing only requirements.txt is standard practice. (Pro-tip: add venv/ or .venv/ to your .gitignore file right away.)

The "where python" / "which python" Check: Checking which interpreter is active with where python (Windows) or which python (Mac/Linux) is a useful first debugging step when a package isn't loading properly.

🔍 Micro-Tips to Level Up Your Notes

The commands above work as written, but in enterprise environments keep these nuances in mind:

Deterministic Builds (pip-compile): pip freeze > requirements.txt works, but it dumps everything, including sub-dependencies. As you advance, you may encounter tools like pip-tools or Poetry, which separate your top-level dependencies (e.g., just pandas) from their underlying sub-dependencies to keep things cleaner.

Mac/Linux Equivalent for where: The notes use where python on Windows. On Mac/Linux, the equivalent command is:

which python

Docker + venv (The Hybrid Approach): Developer = venv and Production = Docker is a good mental model, but the two are sometimes combined in production. Inside a Dockerfile, some teams create a venv to isolate the app from the base OS Python layer.
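
The top-level versus sub-dependency distinction from the deterministic builds tip can be inspected with the standard importlib.metadata module. The sketch below lists what a package itself declares as requirements. The helper name direct_requirements is hypothetical, and the output depends entirely on what is installed in your environment.

```python
from importlib import metadata


def direct_requirements(package):
    """Return the requirement strings declared by an installed package."""
    try:
        # requires() returns None when a package declares no dependencies
        return list(metadata.requires(package) or [])
    except metadata.PackageNotFoundError:
        return []  # the package is not installed in this environment


for requirement in direct_requirements("pandas"):
    print(requirement)
```

If pandas is installed, the printed entries are its sub-dependencies. pip freeze would write those next to pandas itself, while a top-level list would name only pandas.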

Keerthana Blogs
