Composer - GCP




The DAG files are stored in a "dags" folder inside a Cloud Storage bucket that is created automatically when a Cloud Composer environment is created.




Inside the dags folder there will be a default .py file for orchestration.

You can create any number of DAG files and upload them into this folder to create workflows.

If you click on a file, it leads to the Airflow monitoring UI: under Graph you can see the DAG and its operators (tasks), and under Code you can see the orchestration code.
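One way to upload a DAG file is the `gcloud composer environments storage dags import` command, which copies a local file into the environment's dags folder. The sketch below only builds that command as an argument list; the environment name, location, and file name are placeholder assumptions, not values from this post.

```python
import subprocess


def build_dag_import_command(environment: str, location: str, source: str) -> list[str]:
    """Build the gcloud command that copies a local DAG file into the
    environment's dags folder. All three values are placeholders."""
    return [
        "gcloud", "composer", "environments", "storage", "dags", "import",
        f"--environment={environment}",
        f"--location={location}",
        f"--source={source}",
    ]


def upload_dag(environment: str, location: str, source: str) -> None:
    """Run the command; requires an installed, authenticated gcloud CLI."""
    subprocess.run(build_dag_import_command(environment, location, source), check=True)


# Hypothetical values for illustration only.
command = build_dag_import_command("my-composer-env", "us-central1", "my_data_pipeline.py")
print(" ".join(command))
```

Calling `upload_dag(...)` with real values would run the upload; the example stops at printing the command so it can be checked without touching a live environment.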



If the DAG is yellow, it is not working properly and we have to debug it; if it is green, it is working properly.


from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'your_name',
    'start_date': datetime(2023, 10, 26),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    dag_id='my_data_pipeline',  # DAG name
    schedule_interval='@daily',  # Schedule interval
    default_args=default_args,
    dagrun_timeout=timedelta(hours=2),  # DAG-level setting, so it is passed to DAG() rather than default_args
    catchup=False,  # Whether Airflow should "catch up" on past scheduled runs that were missed
) as dag:
    task1 = BashOperator(
        task_id='extract_data',  # Task ID
        bash_command='echo "Extracting data..."',
    )

    task2 = BashOperator(
        task_id='transform_data',  # Task ID
        bash_command='echo "Transforming data..."',
    )

    task1 >> task2  # Define task dependencies, >> operator defines the order in which tasks should be executed
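In Airflow, `task1 >> task2` is equivalent to `task1.set_downstream(task2)`. To make the mechanism concrete, here is a toy sketch of how a `>>` operator can record ordering in plain Python. `ToyTask` and the third task `load_data` are hypothetical and exist only for illustration; this is not Airflow's actual implementation.

```python
class ToyTask:
    """Toy stand-in for an Airflow task; NOT Airflow's real implementation."""

    def __init__(self, task_id: str) -> None:
        self.task_id = task_id
        self.downstream: list["ToyTask"] = []

    def __rshift__(self, other: "ToyTask") -> "ToyTask":
        # `a >> b` records b as downstream of a and returns b,
        # so chains like a >> b >> c read left to right.
        self.downstream.append(other)
        return other


extract = ToyTask("extract_data")
transform = ToyTask("transform_data")
load = ToyTask("load_data")  # hypothetical third task

extract >> transform >> load

print([t.task_id for t in extract.downstream])    # ['transform_data']
print([t.task_id for t in transform.downstream])  # ['load_data']
```

Returning the right-hand task from `__rshift__` is what lets a chain of several tasks be written on one line.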

Key Points:

  • DAG parameters control the overall behavior of the workflow.
  • Operator parameters define the individual steps within the workflow.
  • task_id values must be unique within a DAG.
  • The dag_id must be unique across all DAGs in the Airflow environment.
  • The schedule_interval tells Airflow when to run the DAG.
  • The dagrun_timeout is a safety net: it limits how long a single DAG run may stay running before Airflow marks it as failed.
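To see how the schedule and the catchup parameter interact, the sketch below models a daily schedule in plain Python. It is a simplified model, not Airflow's scheduler: it assumes a run is created once its one-day interval has fully elapsed, and that with catchup disabled only the most recent completed interval gets a run. The function name and the "now" timestamp are hypothetical.

```python
from datetime import datetime, timedelta


def daily_runs_to_create(start_date: datetime, now: datetime, catchup: bool) -> list[datetime]:
    """Return the start of each daily interval that has fully elapsed by `now`.

    Simplified model: a run for an interval is created once that interval ends.
    With catchup=False only the most recent completed interval is kept.
    """
    runs = []
    interval_start = start_date
    while interval_start + timedelta(days=1) <= now:
        runs.append(interval_start)
        interval_start += timedelta(days=1)
    return runs if catchup else runs[-1:]


start = datetime(2023, 10, 26)
now = datetime(2023, 10, 30, 12, 0)  # hypothetical moment the DAG is first enabled

print(len(daily_runs_to_create(start, now, catchup=True)))  # 4
print(daily_runs_to_create(start, now, catchup=False))      # [datetime.datetime(2023, 10, 29, 0, 0)]
```

With catchup enabled, the model produces one run per missed day (26 through 29 October); with catchup disabled, only the latest interval is run.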

I hope this helps clarify these important Airflow concepts.
