Creating a venv

 

Data Engineer Virtual Environment (venv) + Cloud Workflow Notes

1. What is a Virtual Environment (venv)?

A virtual environment is an isolated Python environment for a project.

It allows each project to have its own Python libraries and versions.

Example:

Project A
|
└── venv
    ├── pandas 2.0
    └── pyspark


Project B
|
└── venv
    ├── pandas 2.2
    └── tensorflow

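Without a venv, every project installs into the same Python, so only one version of each package can be present and the projects' version requirements can conflict.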
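A minimal sketch of the conflict, assuming each project lists its packages as name==version lines. The helper names (parse_pins, find_conflicts) are made up for this illustration; the pandas versions mirror the Project A / Project B diagram.

```python
def parse_pins(lines):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, version = line.partition("==")
        pins[name.strip().lower()] = version.strip()
    return pins


def find_conflicts(project_a, project_b):
    """Return packages that both projects pin to different versions."""
    a, b = parse_pins(project_a), parse_pins(project_b)
    return {
        name: (a[name], b[name])
        for name in a.keys() & b.keys()
        if a[name] != b[name]
    }


# Pins taken from the diagram; unpinned names get an empty version.
project_a = ["pandas==2.0", "pyspark"]
project_b = ["pandas==2.2", "tensorflow"]

print(find_conflicts(project_a, project_b))
```

Both projects want pandas, but in different versions. A single shared Python can hold only one of them, which is the problem a per-project venv avoids.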


2. Why Data Engineers need venv

Data engineers use many Python libraries:

  • pandas

  • numpy

  • pyspark

  • requests

  • cloud SDKs

Example:

Automation script:

Python script
      |
      |
      v
Azure Storage / GCP Storage / APIs

The venv keeps the required libraries separate.


3. venv in Cloud Environment

Cloud does NOT replace venv.

Cloud = infrastructure

venv = Python dependency isolation

Example:

Cloud VM
|
├── Python
|
├── Project A
|     └── venv
|
└── Project B
      └── venv

4. Team Environment: Do we share the same venv?

NO.

Each developer has their own venv.

Example:

Developer A laptop
|
└── venv


Developer B laptop
|
└── venv

The team shares:

Git Repository

├── Python code
├── requirements.txt
└── configuration

The team does NOT share:

❌ venv folder
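One common way to keep the venv folder out of Git is a .gitignore entry. The sketch below appends venv/ and .venv/ to a repository's .gitignore if they are missing. The function name ignore_venv is hypothetical, and the demo runs against a temporary folder rather than a real repository.

```python
import tempfile
from pathlib import Path


def ignore_venv(repo_dir, entries=("venv/", ".venv/")):
    """Append missing venv entries to <repo_dir>/.gitignore; return what was added."""
    gitignore = Path(repo_dir) / ".gitignore"
    text = gitignore.read_text() if gitignore.exists() else ""
    existing = {line.strip() for line in text.splitlines()}
    missing = [entry for entry in entries if entry not in existing]
    if missing:
        # Start on a fresh line if the file does not already end with one.
        separator = "" if text == "" or text.endswith("\n") else "\n"
        gitignore.write_text(text + separator + "\n".join(missing) + "\n")
    return missing


with tempfile.TemporaryDirectory() as repo:
    print(ignore_venv(repo))  # first call adds both entries
    print(ignore_venv(repo))  # second call finds nothing left to add
```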

5. requirements.txt

This file stores package dependencies.

Example:

pandas==2.2.0
numpy==1.26.0
pyspark==3.5.0
requests==2.32.0

A team member recreates the same environment with:

pip install -r requirements.txt
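After installing, it can be useful to confirm that the environment really matches the file. This is a rough sketch using the standard importlib.metadata module. It only understands exact name==version pins, and check_pins is a hypothetical helper name, not a pip feature.

```python
from importlib import metadata


def check_pins(requirement_lines):
    """Compare 'name==version' pins with what is installed in this environment.

    Returns {name: (wanted, installed)} for mismatches only;
    installed is None when the package is not installed at all.
    """
    problems = {}
    for line in requirement_lines:
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # this sketch only handles exact pins
        name, _, wanted = line.partition("==")
        name, wanted = name.strip(), wanted.strip()
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            problems[name] = (wanted, installed)
    return problems


# The package name below is deliberately fake, so it is reported as missing.
print(check_pins(["no-such-package-xyz-123==1.0"]))
```

In a real project the lines would come from open("requirements.txt"), and an empty result means every pinned package is installed at the pinned version.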

6. Docker vs venv

venv

Used mainly for:

  • local development

  • testing Python scripts

Example:

Laptop

venv
 |
Python libraries
 |
Code

Docker

Used for:

  • production deployment

  • cloud environments

Example:

Docker Container

Python
 |
Libraries
 |
Application

Usually:

Developer:
venv

Production:
Docker

7. Azure Data Engineer Example

Stack:

  • Azure Data Factory

  • Azure Databricks

  • Power BI

Flow:

Developer Laptop

venv
 |
Python automation scripts

        |
        v

Azure DevOps Git

        |
        v

ADF / Databricks / Power BI

venv in Azure Data Engineering

Used for:

✅ Python automation scripts
✅ Azure SDK scripts
✅ Data validation scripts
✅ Testing ETL logic

Not used for:

❌ ADF itself
❌ Power BI
❌ Databricks cluster runtime


8. GCP Data Engineer Example

Stack:

  • Cloud Composer

  • Dataflow

  • Dataproc

  • BigQuery

Flow:

Developer

venv

Python code

Git

CI/CD

Cloud Services

venv used for:

✅ Google Cloud SDK scripts
✅ Apache Beam testing
✅ BigQuery automation
✅ Local development

Not used for:

❌ BigQuery engine
❌ Dataproc cluster Python environment


9. Creating a venv (Windows)

Create:

python -m venv firstenv

Meaning:

python
 |
 -m = run module
 |
 venv = Python virtual environment module
 |
 firstenv = environment name
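The same venv module can also be called from Python code, which shows what the command produces. This is a sketch only: create_env is a hypothetical helper, the environment is built in a temporary folder, and with_pip=False is used to keep it quick (the command-line form bootstraps pip by default).

```python
import os
import tempfile
import venv
from pathlib import Path


def create_env(parent_dir, name="firstenv"):
    """Create a virtual environment, similar to `python -m venv <name>`."""
    env_dir = Path(parent_dir) / name
    # Symlink the interpreter on Linux/Mac, copy it on Windows.
    venv.create(env_dir, with_pip=False, symlinks=(os.name != "nt"))
    return env_dir


with tempfile.TemporaryDirectory() as tmp:
    env = create_env(tmp)
    # pyvenv.cfg is the marker file that makes the folder a virtual environment.
    print((env / "pyvenv.cfg").exists())
    print(sorted(p.name for p in env.iterdir()))
```

The listing shows Scripts on Windows or bin on Linux/Mac, which is where the activate script used in the next sections lives.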

10. Activate venv (Windows CMD)

Correct:

firstenv\Scripts\activate

You should see:

(firstenv) C:\Users\keert>

Deactivate:

deactivate

11. Activate venv (Linux/Mac)

source firstenv/bin/activate

12. Important venv Commands

Create environment

python -m venv venv

Activate

Windows:

venv\Scripts\activate

Linux/Mac:

source venv/bin/activate

Exit

deactivate

Check Python version

python --version

Check which Python is running

Windows:

where python

Should show:

venv\Scripts\python.exe
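The same check can be done from inside Python. In an environment created by the standard venv module, sys.prefix points at the environment while sys.base_prefix points at the Python it was created from. The function name in_virtualenv is just for this sketch.

```python
import sys


def in_virtualenv():
    """True when the running interpreter belongs to a venv-style environment."""
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("inside venv:", in_virtualenv())
```

If this prints False while the prompt shows (venv), the script is probably being run by a different Python than the activated one.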

13. pip Commands

Install package

pip install pandas

Install project dependencies

pip install -r requirements.txt

Meaning:

-r = read requirements file

List libraries

pip list

Show exact versions

pip freeze

Example:

pandas==2.2.0
numpy==1.26.0

Save environment

pip freeze > requirements.txt

Team shares this file.
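For intuition, the name==version lines can be approximated from Python with importlib.metadata. This is not how pip freeze is implemented and its output can differ (for example, pip freeze leaves out some packaging tools and handles editable installs); freeze_lines is a hypothetical helper.

```python
from importlib import metadata


def freeze_lines():
    """Return sorted 'name==version' strings for packages this interpreter can see."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries whose metadata is broken
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


for line in freeze_lines():
    print(line)
```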


Remove package

pip uninstall pandas

Upgrade package

pip install --upgrade pandas

or:

pip install -U pandas

Package details

pip show pandas

14. Meaning of Common Flags

Flags start with - (short form, e.g. -r) or -- (long form, e.g. --help).

Python

-m

Run module:

python -m venv venv

pip

-r

Read requirements file:

pip install -r requirements.txt

-U

Upgrade:

pip install -U pandas

--help

Help:

pip --help

15. Other Package Managers

Python packages

pip

Example:

pip install pandas

Windows software

Chocolatey:

choco install git

Winget:

winget install python

Mac software

Homebrew:

brew install python
brew install git

16. Data Engineer Daily Workflow

Typical (Windows CMD; on Linux/Mac activate with source venv/bin/activate):

git clone project

cd project

python -m venv venv

venv\Scripts\activate

pip install -r requirements.txt

python script.py

pip freeze > requirements.txt

deactivate
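The setup part of this workflow can be scripted. The sketch below only builds the commands; it calls the environment's own python directly, so no activate step is needed. The function names are hypothetical, and script.py and requirements.txt are the file names used in the notes.

```python
import ntpath
import os
import posixpath
import subprocess
import sys


def workflow_commands(env_name="venv", requirements="requirements.txt",
                      script="script.py", windows=None):
    """Build argv lists for: create venv, install dependencies, run the script."""
    if windows is None:
        windows = os.name == "nt"
    if windows:
        env_python = ntpath.join(env_name, "Scripts", "python.exe")
    else:
        env_python = posixpath.join(env_name, "bin", "python")
    return [
        [sys.executable, "-m", "venv", env_name],
        [env_python, "-m", "pip", "install", "-r", requirements],
        [env_python, script],
    ]


def run_workflow(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


# Print the Linux/Mac form of the commands without running anything.
for argv in workflow_commands(windows=False):
    print(" ".join(argv))
```

Calling run_workflow(workflow_commands()) inside a cloned project would actually execute the steps; it is left out of the demo because it installs packages.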

Most Important Commands to Memorize

python -m venv venv

venv\Scripts\activate

deactivate

where python

pip install package

pip list

pip freeze

pip freeze > requirements.txt

pip install -r requirements.txt

pip uninstall package

Final Mental Model

venv
 =
Developer's Python workspace


requirements.txt
 =
Team dependency sharing


Docker
 =
Production environment


Git
 =
Code sharing


Azure DevOps/GCP CI-CD
 =
Automation


ADF / Composer
 =
Orchestration


Databricks / Dataflow
 =
Processing


Power BI / BigQuery
 =
Analytics layer

This is the workflow you should remember as a Data Engineer.


Step 1: Make sure Python is installed on the system.

Step 2: Make sure it is Python 3 and that pip is up to date. Check the version with

python --version

and upgrade pip with

python -m pip install --upgrade pip

(pip is the Python package manager, choco is a Windows package manager, and brew is the one for Mac.)

Step 3: Create the venv using the command

python -m venv firstenv

python → runs Python

-m → tells Python "run a module"

venv → the module

firstenv → name of your environment

Step 4: Go inside the venv.

Go to the folder where firstenv was created and give the command

firstenv\Scripts\activate

The prompt then shows the environment name, and you can install the libraries you need for development:

(firstenv) C:\Users\keert>pip install pandas

Step 5: To see which packages or libraries are installed, give either command

(firstenv) C:\Users\keert>pip freeze

(firstenv) C:\Users\keert>pip list

Step 6: In general, the team keeps the list of requirements in Git, for example in requirements.txt. To install everything in that list, use

pip install -r requirements.txt

Step 7: To come out of the environment, give the command

(firstenv) C:\Users\keert>deactivate

To go inside again, give the same activate command:

C:\Users\keert>firstenv\Scripts\activate

Then do the work you want.


  • The Cloud vs. venv Distinction: This is a massive stumbling block for beginners. Recognizing that Cloud = Infrastructure and venv = Python Isolation shows you understand that running an EC2, VM, or Cloud Function doesn't magically solve dependency hell.

  • Databricks/Dataproc Runtimes: You correctly noted that venv is not used for managed Spark cluster runtimes. Databricks handles its own cluster-level libraries (or uses init scripts/%pip), so your local automation scripts stay isolated from the cluster runtime, and that is how it works in production.

  • The Git Guardrail: Omitting the ❌ venv folder from Git and only sharing requirements.txt is standard industry best practice. (Pro-tip: Always add venv/ or .venv/ to your .gitignore file immediately).

  • The "where python" / "which python" Check: Checking your environment context via where python (Windows) or which python (Mac/Linux) is the ultimate debugging step when a package isn't loading properly.

🔍 Micro-Tips to Level Up Your Notes

Your commands and logic are flawless, but as you work in enterprise environments, keep these tiny nuances in mind:

  1. Deterministic Builds (pip-compile): pip freeze > requirements.txt works, but it dumps everything that is installed, including sub-dependencies. As you advance, you might encounter tools like pip-tools or Poetry, which separate your top-level dependencies (e.g., just pandas) from their underlying sub-dependencies to keep things cleaner.

  2. Mac/Linux Equivalent for where: You noted where python for Windows. For Mac/Linux, the exact equivalent command to memorize is:

    which python
    
  3. Docker + venv (The Hybrid Approach): While your distinction that Developer = venv and Production = Docker is great for a mental model, in advanced production environments, engineers actually use both together. Inside a Dockerfile, it is common practice to create a venv to isolate the app from the base OS Python layer.
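To make the first tip concrete, here is a small sketch of the top-level vs. sub-dependency split. It is not how pip-tools or Poetry work internally; it only sorts pip freeze style lines into the packages you asked for and everything else. The function name and the version numbers are hypothetical.

```python
def split_requirements(frozen_lines, top_level):
    """Split 'name==version' lines into (direct, transitive) lists."""
    wanted = {name.lower() for name in top_level}
    direct, transitive = [], []
    for line in frozen_lines:
        name = line.split("==", 1)[0].strip().lower()
        (direct if name in wanted else transitive).append(line)
    return direct, transitive


# Illustrative freeze output: pandas was installed on purpose,
# the other packages came along as its dependencies.
frozen = [
    "numpy==1.26.0",
    "pandas==2.2.0",
    "python-dateutil==2.9.0",
    "pytz==2024.1",
]

direct, transitive = split_requirements(frozen, ["pandas"])
print("direct:    ", direct)
print("transitive:", transitive)
```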
