To check that the data was inserted, query the target table and confirm the new rows are present.
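A quick check in a notebook cell might look like the sketch below; the table name is only a placeholder, not from the course:

```python
# Placeholder table name used only for illustration
target_table = "bronze.orders"

# Count the rows and preview a few records to confirm the insert worked
print(spark.table(target_table).count())
display(spark.table(target_table).limit(10))
```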
We have now included the logic to identify what the next incremental load would be; next we are going to refactor the code (which means update the existing code).
Then delete the unwanted print statements and the dbutils.help() cell, and remove the unwanted dbutils widgets by calling dbutils.widgets.remove() (or by deleting the code related to that parameter).
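A widget cleanup sketch, with an assumed widget name, looks like this:

```python
# Remove a single widget that is no longer needed (the widget name here is just an example)
dbutils.widgets.remove("file_date")

# Or clear every widget from the notebook at once
dbutils.widgets.removeAll()
```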
Since this SQL cell needs to be run only once, move it to the SQL editor instead of keeping it in the notebook that runs on a regular basis.
While creating the workflow task, give the task a unique name; for Type there are multiple options such as Python, SQL, another job, and so on.
We have to set the cluster to a job cluster, because the all-purpose cluster we created stays alive (and keeps incurring cost) after the run, whereas a job cluster is created for the run and terminated automatically.
In order to create the table that maintains the details for orchestration, go to the workspace catalog and then the default schema, and create the table there.
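A minimal sketch of such an orchestration table, with assumed column names, could be created from a notebook cell like this:

```python
# Orchestration/audit table in the workspace catalog, default schema.
# The table and column names are assumptions for illustration only.
spark.sql("""
    CREATE TABLE IF NOT EXISTS workspace.default.orchestration_details (
        notebook_name   STRING,
        processed_date  DATE,
        status          STRING,
        load_timestamp  TIMESTAMP
    )
""")
```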
How did you manage to connect to different Source Systems from Databricks?
The answer depends on the type of source systems you mentioned in your CV; below are some of the source systems used in our course.
Web Services (or) Websites
We can connect using certificates created by the web services and use them in the Databricks notebook to connect.
We can use a username/password to connect to the web services.
More secure web services use OAuth to connect externally.
Please explore whichever authentication method applies to your project and be ready to explain it properly. (I have not included this in the course because you would get confused between the code used for the security setup and the ingestion process; now that you properly know how ingestion works, please explore how to connect, as it is a one-time setup. A rough OAuth sketch follows below for reference.)
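Purely as an illustration (this is not course code), an OAuth 2.0 client-credentials call from a notebook could look roughly like this; the URLs, secret scope, and key names are placeholders:

```python
import requests

# Placeholder endpoint and credentials -- replace with your web service's values
token_url = "https://example.com/oauth2/token"
client_id = dbutils.secrets.get(scope="my_scope", key="client_id")
client_secret = dbutils.secrets.get(scope="my_scope", key="client_secret")

# Request an access token using the client-credentials grant
token_resp = requests.post(
    token_url,
    data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    },
)
access_token = token_resp.json()["access_token"]

# Call the web service with the bearer token and read the response
api_resp = requests.get(
    "https://example.com/api/orders",
    headers={"Authorization": f"Bearer {access_token}"},
)
data = api_resp.json()
```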
What type of ingestion load is performed in the project?
Incremental Ingestion (Delta Load)
When we can use some column values (latest_updated_datetime, maximum(primary-key-value), datetime values in the files) to identify incremental data from the source, mention that Incremental Data Load notebooks are developed. This is the recommended way of developing ingestion pipelines.
For some migration projects, or when there is no data attribute to identify incremental source data, we end up developing Full Load Ingestion notebooks, and in that case we end up ingesting all source data on every run.
How did you perform the incremental load?
Explain in detail, depending on the data attribute (latest_updated_datetime, maximum(primary-key-value), datetime values in the files) used to identify incremental data from the source. In the course we used the processed source file dates to identify new source files from the source system.
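A simplified sketch of that watermark pattern, with assumed table, column, and path names, looks like this:

```python
# All names below (orchestration table, source/target tables, columns) are placeholders
last_processed = spark.sql("""
    SELECT MAX(processed_date) AS max_date
    FROM workspace.default.orchestration_details
""").collect()[0]["max_date"]

# Pull only the records that arrived after the stored watermark
incremental_df = (
    spark.table("bronze.orders_raw")
         .filter(f"latest_updated_datetime > '{last_processed}'")
)

# Append the new data; the watermark table is then updated with this run's date
incremental_df.write.mode("append").saveAsTable("silver.orders")
```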
How do you migrate or load data into multiple environments (dev/test)?
Using notebook parameters is the answer to any question about developing or using code that works across multiple environments.
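For example, an environment parameter can be read through a widget and used to build environment-specific names; everything below is illustrative, not course code:

```python
# The workflow/job passes the environment name in through this widget
dbutils.widgets.text("environment", "dev")
env = dbutils.widgets.get("environment")

# Derive environment-specific locations from the parameter (names are placeholders)
catalog = f"{env}_catalog"
landing_path = f"/mnt/{env}/landing/orders/"
```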
How did you automate the run of the ingestion notebook, and how frequently does the notebook run?
Use Databricks Workflows to run the notebooks and schedule them to run regularly.
The scheduling frequency depends on how frequently the source data changes (if it changes at the end of the day, schedule the job to run once a day; if the source data changes every 30 minutes, schedule it to run every 30 minutes).
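For reference, Databricks Workflows schedules use Quartz cron expressions; the values below are generic examples, not the schedule used in the course:

```python
# Example Quartz cron expressions for a Workflows schedule
daily_at_2am     = "0 0 2 * * ?"     # once a day at 02:00
every_30_minutes = "0 0/30 * * * ?"  # every 30 minutes

# Shape of the schedule block accepted by the Jobs API (values are illustrative)
schedule = {
    "quartz_cron_expression": daily_at_2am,
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED",
}
```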
What type of cluster is used to run Databricks Workflows? (or) What is the difference between an All-Purpose Cluster and a Job Cluster?
Please find below the differences and use cases for both of these clusters; we always use Job Clusters for scheduled jobs in Databricks Workflows (a job-cluster sketch follows the comparison).
All-Purpose Cluster
Running interactive workloads and ad-hoc tasks; mainly used for development activities.
Used by data engineers during development, and by data analysts and data scientists for ad-hoc tasks.
Manually started/stopped, can be shared with other users; not cost efficient because of the manual start/stop process, and sometimes runs for a long time.
Job Clusters
Running automated jobs, scheduled tasks/jobs, and batch processing jobs.
Automatically created, started, and terminated; more cost-efficient because of the automated start/stop.
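To show the difference in practice, a job cluster is defined inside the job itself and exists only for the run; below is a rough Jobs API-style sketch where the notebook path, runtime version, and node type are placeholders:

```python
# Sketch of a job task that runs a notebook on a job cluster (all values are placeholders)
job_task = {
    "task_key": "ingest_orders",
    "notebook_task": {"notebook_path": "/Workspace/Ingestion/orders_incremental"},
    "new_cluster": {                      # job cluster: created for the run, terminated after it
        "spark_version": "14.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
    },
}
```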