
GCP - Dataproc -- gcs buckets -- pyspark jobs

Dataproc --> managed Hadoop and Spark service available in GCP 

In order to create a cluster, search for Dataproc inside the Google Cloud console and enable its related API.






Then navigate into Clusters and start creating a new cluster. In order to create the cluster, run the following commands in Cloud Shell.



Before creating the cluster, we need permission to do so. To do that, go to IAM, search for service accounts, select the particular service account, and grant it the roles needed to create clusters.
In general, unlike AWS, where access is granted at the user level, in GCP we create service accounts scoped to the entire project.




Step 1: gcloud compute networks create dataproc-network2 --subnet-mode=auto

This creates the network for the cluster.

Step 2: gcloud compute firewall-rules create allow-internal --allow all --source-ranges 10.128.0.0/9 --network dataproc-network2

This firewall rule allows internal communication between the worker nodes and the master within the network.

Step 3: gcloud compute firewall-rules create allow-external2 --allow tcp:22,tcp:3389,icmp --network dataproc-network2

This firewall rule allows external access to the nodes (SSH on 22, RDP on 3389, and ICMP).

Step 4: gcloud dataproc clusters create test-cluster2 --region us-central1 --zone us-central1-a --master-machine-type n2-standard-2 --master-boot-disk-size 50 --num-workers 2 --worker-machine-type n2-standard-2 --worker-boot-disk-size 50 --network dataproc-network2 --enable-component-gateway --image-version 2.2.40-debian12 --project dataengineering-jan2025

This specifies the number of worker nodes, the machine types and disk sizes for the master and workers, the region and zone, the image version, and so on.


The cluster is now created.



In an earlier session, after creating the cluster with master and worker nodes, we went into SSH mode to interact with the machines and entered commands like hdfs dfs, hive, etc. whenever we wanted to interact with the Dataproc cluster.

Now, with the same Dataproc cluster, we want to submit Spark jobs. There are commands to submit jobs, but before that we need to place our Spark job and any relevant files (such as local data files it reads) somewhere the cluster can reach. That place is called a GCS bucket in Google Cloud; in AWS we call it an S3 bucket, and in Azure it is Blob Storage. It is cloud storage where we place the code we are going to run on the cluster.

What are GCS Buckets?

  • Containers for Data:
    • GCS buckets are essentially containers that hold your data objects (files). You can store any type of data in them, from text files and images to large datasets and backups.
  • Global Namespace:
    • Bucket names are globally unique across all of Google Cloud Storage. This means that once a bucket name is taken, no one else can use it.
  • Object Storage:
    • GCS is an object storage service, meaning data is stored as objects within buckets. This differs from file systems that use hierarchical directories.
In order to create a bucket, go to Cloud Storage, create the bucket, and view what is inside it. The buckets below are created automatically when the clusters are created.


Created a bucket to test the Spark jobs, named it spark_jobs1, and uploaded the already created .py file.
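
As a side note, the same upload can also be scripted instead of using the console. Below is a minimal sketch using the google-cloud-storage Python client, assuming the library is installed and application-default credentials are configured; the project, bucket, and file names are the ones used in this post.

# Sketch: upload the PySpark script to the spark_jobs1 bucket
# using the google-cloud-storage client (pip install google-cloud-storage).
from google.cloud import storage

client = storage.Client(project="dataengineering-jan2025")   # project used in step 4
bucket = client.bucket("spark_jobs1")                        # bucket created above
blob = bucket.blob("Spark Practice.py")                      # object name inside the bucket
blob.upload_from_filename("Spark Practice.py")               # local path to the script
print("Uploaded to gs://spark_jobs1/Spark Practice.py")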



Upload the .py files and their related files into GCS buckets to test the Spark jobs.
In the .py files, edit the locations of the files, since the files have been moved into the GCS buckets.
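
For example, a minimal sketch of what that edit looks like inside the script (the data file name here is only an illustration):

# Sketch: local file paths in the script are replaced with gs:// paths
# once the data files live in the GCS bucket.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark Practice").getOrCreate()

# Before (local path on the development machine):
# df = spark.read.csv("file:///home/user/data/sales.csv", header=True, inferSchema=True)

# After (data file uploaded to the spark_jobs1 bucket; file name is illustrative):
df = spark.read.csv("gs://spark_jobs1/sales.csv", header=True, inferSchema=True)
df.show()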





Command to run the PySpark job: gcloud dataproc jobs submit pyspark --cluster=<cluster name> --region=<region> gs://<bucket name>/<python file name>

e.g.: gcloud dataproc jobs submit pyspark --cluster=cluster-1bd6 --region=us-central1 "gs://spark_jobs1/Spark Practice.py"



We can see the results of the jobs and the DAG under the particular cluster, under the Jobs tab.





  1. First we wrote the Spark code, which contains transformations and actions.
  2. We uploaded the Spark code into a GCS bucket.
  3. Then we submitted the Spark job using the command gcloud dataproc jobs submit pyspark ... gs://bucketname/sparkcode.py
  4. The job wrote the output to the bucket. We can also write the output to BigQuery as well; it is very simple, as shown below.

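A minimal sketch of step 4 (the output path, dataset, and table names are illustrative; the BigQuery write assumes the spark-bigquery connector is available on the cluster, which recent Dataproc images provide, or it can be added with --jars):

# Sketch: df is the DataFrame produced by the job's transformations.

# Write the output back to the bucket as Parquet:
df.write.mode("overwrite").parquet("gs://spark_jobs1/output/result")

# Write the same output to BigQuery (spark-bigquery connector;
# a GCS bucket is used for temporary staging):
df.write.format("bigquery") \
    .option("table", "my_dataset.result_table") \
    .option("temporaryGcsBucket", "spark_jobs1") \
    .mode("overwrite") \
    .save()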

5. If we also want to write the data into something like a database, then we will use the jdbc format, as shown below. (In Google, search for "pyspark write dataframe to postgresql" for the username, password, and other options.)

JDBC (Java Database Connectivity)

JDBC is an API (Application Programming Interface) in Java that allows applications to connect to and interact with databases...
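
A minimal sketch of such a JDBC write to PostgreSQL (host, database, table, username, and password are placeholders; the PostgreSQL JDBC driver jar must be available to the job, for example passed with --jars when submitting):

# Sketch: write the DataFrame to a PostgreSQL table over JDBC.
# Connection details are placeholders; the PostgreSQL driver jar must be on the classpath.
df.write.format("jdbc") \
    .option("url", "jdbc:postgresql://<host>:5432/<database>") \
    .option("dbtable", "public.result_table") \
    .option("user", "<username>") \
    .option("password", "<password>") \
    .option("driver", "org.postgresql.Driver") \
    .mode("append") \
    .save()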







Refer to Rama Blogs for a refresher.







