Posts

Creating a venv

Data Engineer Virtual Environment (venv) + Cloud Workflow Notes

1. What is a Virtual Environment (venv)?

A virtual environment is an isolated Python environment for a project. It allows each project to have its own Python libraries and versions.

Example:

    Project A
    |
    └── venv
        ├── pandas 2.0
        └── pyspark

    Project B
    |
    └── venv
        ├── pandas 2.2
        └── tensorflow

Without venv, package versions can conflict.

2. Why Data Engineers need venv

Data engineers use many Python libraries:

- pandas
- numpy
- pyspark
- requests
- cloud SDKs

Example: Automation script:

    Python script
         |
         |
         v
    Azure Storage / GCP Storage / APIs

The venv keeps the required libraries separate.

3. venv in Cloud Environment

Cloud does NOT replace venv.

    Cloud = infrastructure
    venv  = Python dependency isolation

Example:

    Cloud VM
    |
    ├── Python
    |
    ├── Project A
    |   └── venv
    |
    └── Project B
        └── venv

4. Team Environment: Do we share the same venv?

NO. Each developer has their own venv.

Example: Develop...
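The per-project isolation described in this excerpt can be sketched with Python's standard-library venv module. This is a minimal illustration, not the post's own method: the helper name create_project_venv and the project folder names are hypothetical, and pip is left out by default (with_pip=False) so the sketch runs without network access.

```python
import tempfile
import venv
from pathlib import Path


def create_project_venv(project_dir: Path, with_pip: bool = False) -> Path:
    """Create an isolated environment at <project_dir>/venv and return its path.

    Hypothetical helper for illustration; missing parent folders are created.
    """
    env_dir = project_dir / "venv"
    # venv.create builds the environment directory and writes pyvenv.cfg into it.
    venv.create(env_dir, with_pip=with_pip)
    return env_dir


if __name__ == "__main__":
    # Two projects, each with its own independent environment.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("project_a", "project_b"):
            env = create_project_venv(root / name)
            print(name, "->", env, (env / "pyvenv.cfg").is_file())
```

From a shell, the usual equivalent is python -m venv venv inside each project folder, followed by activating that environment before installing packages.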

Hands-on project for ADF

Azure Pipeline Creation and Configuration Steps

To log in to the Azure portal, a Microsoft account is required. After creating the account, sign in to the Azure portal and proceed with the following steps to build the data pipeline.

1. Create Resource Group and Storage Account

- Create a Resource Group in Azure.
- Under the resource group, create the required resources for the pipeline.

Azure Data Lake Storage (ADLS Gen2)

- Create a Storage Account.
- Enable Hierarchical Namespace to convert it into Data Lake Storage Gen2.
- Inside the storage account:
  - Create a container (e.g., blog-container).
  - Organize data using folders/subfolders (can be created dynamically or manually using directory structure).

Storage Structure Example

Create a storage account named sdmm:

    sdmm
    ├── gold
    |   ├── processing
    |   └── sales
    ├── silver
    |   ├── sd
    |   └── mm
    └── bronze
        ├── sd
        └── mm

2. Azure Data Factory (ADF) Setup

- Create an Azure Data Factory instance.
- Go to Managed Identities and enabl...
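The storage structure in this excerpt can be written down as data and turned into folder paths. The sketch below rests on two assumptions that the excerpt does not confirm: that the flattened listing nests as gold (processing, sales), silver (sd, mm), bronze (sd, mm), and that these folders live inside the blog-container container. The function names are hypothetical; the abfss:// form is the usual URI scheme for addressing ADLS Gen2 paths. Nothing here calls Azure, it only builds strings.

```python
ACCOUNT = "sdmm"              # storage account name from the example
CONTAINER = "blog-container"  # container name from the example

# Assumed nesting of the flattened listing: layer -> subfolders.
LAYOUT = {
    "gold": ["processing", "sales"],
    "silver": ["sd", "mm"],
    "bronze": ["sd", "mm"],
}


def folder_paths(layout: dict[str, list[str]]) -> list[str]:
    """Return every layer/subfolder path, in the order the layout lists them."""
    return [f"{layer}/{sub}" for layer, subs in layout.items() for sub in subs]


def abfss_uri(account: str, container: str, path: str) -> str:
    """Build an ADLS Gen2 URI of the form abfss://container@account.dfs.core.windows.net/path."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"


if __name__ == "__main__":
    for path in folder_paths(LAYOUT):
        print(abfss_uri(ACCOUNT, CONTAINER, path))
```

Keeping the layout in one dictionary means a new layer or subfolder is a one-line change, and the same list of paths can feed whichever tool actually creates the directories.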