Dataproc --> managed Hadoop/Spark service available in GCP
To create a cluster, search for Dataproc in the Google Cloud console and enable its related API.
Then navigate to Clusters and start creating a new cluster. To create the cluster, run the following commands in Cloud Shell.
Before creating the cluster, we need permission to do so. To do that, go to IAM, search for service accounts, select the particular service account, and grant it the access required to create the cluster.
step 1: gcloud compute networks create dataproc-network2 --subnet-mode=auto
creates the network for the Dataproc cluster
step 2: gcloud compute firewall-rules create allow-internal --allow all --source-ranges 10.128.0.0/9 --network dataproc-network2
creates a firewall rule that allows internal communication between the worker nodes and the master
step 3: gcloud compute firewall-rules create allow-external2 --allow tcp:22,tcp:3389,icmp --network dataproc-network2
creates a firewall rule that allows external access to the nodes (SSH on port 22, RDP on port 3389, and ICMP)
step 4: gcloud dataproc clusters create test-cluster2 --region us-central1 --zone us-central1-a --master-machine-type n2-standard-2 --master-boot-disk-size 50 --num-workers 2 --worker-machine-type n2-standard-2 --worker-boot-disk-size 50 --network dataproc-network2 --enable-component-gateway --image-version 2.2.40-debian12 --project dataengineering-jan2025
creates the cluster, specifying the number of worker nodes, the machine types and boot disk sizes for the master and workers, the region and zone, the network, and the image version
In an earlier session, after creating the cluster with master and worker nodes, we connected to the machine over SSH and ran commands like hdfs dfs, hive, etc. whenever we wanted to interact with the Dataproc cluster.
Now, with the same Dataproc cluster, we want to submit Spark jobs. There are commands to submit jobs, but before that we need to place our Spark job and any related files (such as local data files it reads) somewhere the cluster can reach. That place is a GCS bucket in Google Cloud; in AWS the equivalent is an S3 bucket, and in Azure it is Blob Storage. It is cloud storage where we keep the code we are going to run on the cluster.
What are GCS Buckets?
- Containers for Data: GCS buckets are essentially containers that hold your data objects (files). You can store any type of data in them, from text files and images to large datasets and backups.
- Global Namespace: Bucket names are globally unique across all of Google Cloud Storage. This means that once a bucket name is taken, no one else can use it.
- Object Storage: GCS is an object storage service, meaning data is stored as objects within buckets. This differs from file systems that use hierarchical directories.
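As an aside, below is a minimal sketch of uploading a job file into a bucket programmatically with the google-cloud-storage Python client; the bucket name, object path, and local file name are placeholders, and in practice the Cloud console or gsutil cp does the same thing.

from google.cloud import storage

# placeholder names: replace the bucket and file paths with your own
client = storage.Client(project="dataengineering-jan2025")
bucket = client.bucket("bucketname")              # existing GCS bucket
blob = bucket.blob("jobs/sparkcode.py")           # destination object path inside the bucket
blob.upload_from_filename("sparkcode.py")         # local file to upload
print(f"uploaded to gs://{bucket.name}/{blob.name}")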
Uploading the .py files and their related files into a GCS bucket to test the Spark jobs:
- first we wrote the Spark code, which contains transformations and actions.
- we uploaded the Spark code into a GCS bucket.
- then we submitted the Spark job with a command like: gcloud dataproc jobs submit pyspark gs://bucketname/sparkcode.py --cluster=test-cluster2 --region=us-central1
- we wrote the output to a bucket; we can also write the output to BigQuery, which is very simple, as shown below.
- if we also want to write the data into something like a relational database, we use the jdbc format, as shown below. (In Google, search for "pyspark write dataframe to postgresql" for the full set of options such as username, password, etc.)
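To make the steps above concrete, here is a minimal sketch of the kind of PySpark job we upload to the bucket. The input path, column name, output locations, and BigQuery dataset/table are placeholder assumptions, and the BigQuery write relies on the spark-bigquery connector (bundled in recent Dataproc images, or it can be supplied with --jars when submitting).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sample-gcs-job").getOrCreate()

# read an input file that was uploaded to the bucket alongside the code
df = spark.read.csv("gs://bucketname/input/data.csv", header=True, inferSchema=True)

# a simple transformation and an action ("category" is a placeholder column)
result = df.groupBy("category").count()
result.show()

# write the output back to the bucket
result.write.mode("overwrite").parquet("gs://bucketname/output/")

# optionally write the same output to BigQuery (dataset/table and temp bucket are placeholders)
result.write.format("bigquery") \
    .option("table", "my_dataset.my_table") \
    .option("temporaryGcsBucket", "bucketname-temp") \
    .mode("overwrite") \
    .save()

spark.stop()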
JDBC (Java Database Connectivity)
JDBC is an API (Application Programming Interface) in Java that allows applications to connect to and interact with databases...
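A rough sketch of the JDBC write to PostgreSQL, assuming placeholder host, database, table, and credentials; the PostgreSQL JDBC driver jar has to be available to the job (for example, passed with --jars when submitting).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-write-example").getOrCreate()

# placeholder DataFrame; in the real job this would be the transformed output
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# write the DataFrame to a PostgreSQL table over JDBC (all connection details are placeholders)
df.write.format("jdbc") \
    .option("url", "jdbc:postgresql://<host>:5432/<database>") \
    .option("dbtable", "public.output_table") \
    .option("user", "<username>") \
    .option("password", "<password>") \
    .option("driver", "org.postgresql.Driver") \
    .mode("append") \
    .save()

spark.stop()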
Refer to Rama Blogs for a refresher.