
Factors to consider when choosing a storage account in the cloud

Type of data to be stored:

Structured

Semi-structured

Unstructured

The choice also depends on operational needs:

1. How often will the data be accessed?

2. How quickly do we need to serve it?

3. Do we need to run simple queries?

4. Do we need to run complex queries?

5. Will the data be accessed from multiple regions?


1. Azure Storage Account: Think of an Azure Storage Account as the top-level resource that groups together various Azure Storage services. When you create a "General-purpose v2" storage account, it can contain:

  • Azure Blob Storage: For unstructured data (files, images, videos, backups).

  • Azure File Storage: For managed file shares (SMB/NFS).

  • Azure Queue Storage: For messaging (queues).

  • Azure Table Storage: For NoSQL key-value data.

  • Azure Disks: For VM disks (though often managed separately as "Managed Disks").
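
As a rough sketch, a general-purpose v2 account can be created with the Azure management SDK for Python; the subscription ID, resource group, and account name below are placeholders.

  from azure.identity import DefaultAzureCredential
  from azure.mgmt.storage import StorageManagementClient

  # Placeholder subscription; authentication uses whatever DefaultAzureCredential finds.
  client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

  # Create a general-purpose v2 ("StorageV2") account with locally redundant storage.
  poller = client.storage_accounts.begin_create(
      "my-resource-group",
      "mystorageacct123",   # account names must be globally unique, lowercase, 3-24 chars
      {
          "location": "eastus",
          "kind": "StorageV2",
          "sku": {"name": "Standard_LRS"},
      },
  )
  account = poller.result()   # waits until provisioning completes
  print(account.primary_endpoints.blob)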

2. Azure Blob Storage:

  • This service is for storing unstructured data (Binary Large Objects or "blobs").

  • It's like a vast digital bucket where you put files. These files are accessed via HTTP/HTTPS endpoints.

  • It has different "types of blobs" (Block, Page, Append) and "access tiers" (Hot, Cool, Archive) as we discussed, but none of these are "queues."
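
As a small illustration (the connection string, container, and file names are placeholders, and the container is assumed to exist), uploading a block blob and moving it to the Cool tier with the Python SDK might look like this:

  from azure.storage.blob import BlobServiceClient

  service = BlobServiceClient.from_connection_string("<storage-connection-string>")
  container = service.get_container_client("images")

  # Upload a local file as a block blob; it is then served via the HTTPS endpoint.
  with open("photo.jpg", "rb") as data:
      container.upload_blob(name="photo.jpg", data=data, overwrite=True)

  # Optionally move the blob to a cheaper access tier (Hot / Cool / Archive).
  container.get_blob_client("photo.jpg").set_standard_blob_tier("Cool")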

3. Azure Queue Storage (The Actual Queue Service):

  • Purpose: Azure Queue Storage is a service specifically designed for storing large numbers of messages. Its primary use case is to decouple components of an application, enabling asynchronous communication.

  • How it works (see the sketch after this section):

    • One part of an application (the "producer") places messages into a queue.

    • Another part of the application (the "consumer" or "worker") retrieves and processes these messages at its own pace.

    • This makes applications more scalable, resilient, and responsive.

  • Key Characteristics:

    • Messages: Each message can be up to 64 KB in size.

    • Retention: Messages can remain in the queue for up to 7 days by default (and can be set to never expire for newer API versions).

    • Access: Messages are typically accessed on a First-In, First-Out (FIFO) basis, though due to its distributed nature, strict order isn't guaranteed without specific design patterns.

    • Visibility Timeout: When a message is retrieved, it becomes "invisible" to other consumers for a configurable period (visibility timeout), preventing multiple workers from processing the same message simultaneously. If processing fails, the message reappears after the timeout.

    • Scalability: Can store millions of messages, up to the total capacity limit of the storage account.
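
Here is the minimal producer/consumer sketch mentioned above, using the azure-storage-queue Python package; the connection string, queue name, and process() function are placeholders.

  from azure.storage.queue import QueueClient

  queue = QueueClient.from_connection_string("<storage-connection-string>", "orders")

  # Producer: enqueue a message (each message can be up to 64 KB).
  queue.send_message("order-12345")

  # Consumer: each retrieved message stays invisible to other workers for the
  # visibility timeout and reappears if it is not deleted in time.
  for msg in queue.receive_messages(visibility_timeout=30):
      process(msg.content)        # hypothetical processing step
      queue.delete_message(msg)   # delete only after successful processing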

Azure Data Lake Storage Gen2 and Delta Lake
 Azure Delta Lake Gen2 refers to the powerful synergy between two key Azure components for modern data analytics:
  1. Azure Data Lake Storage Gen2 (ADLS Gen2): This is Microsoft's enterprise-grade, highly scalable, and cost-effective data lake solution built on top of Azure Blob Storage. Its key differentiator is the Hierarchical Namespace (HNS), which allows it to organize data into folders and subfolders, much like a traditional file system. This HNS is critical for the performance and compatibility with big data analytics engines like Apache Spark and Hadoop, which expect a file system-like interaction.

  2. Delta Lake: This is an open-source storage layer that brings ACID transactions (Atomicity, Consistency, Isolation, Durability) to data lakes. It was originally developed by Databricks and operates on top of existing cloud object storage (like ADLS Gen2). Delta Lake fundamentally transforms a traditional data lake into a "Lakehouse" architecture, combining the flexibility and low cost of a data lake with the reliability and performance typically found in data warehouses.

What is Azure Delta Lake Gen2?

When we talk about "Azure Delta Lake Gen2," we're referring to implementing Delta Lake as the storage format and transaction layer directly on files stored in Azure Data Lake Storage Gen2.

Essentially:

  • ADLS Gen2 provides the foundational, massively scalable, and cost-effective storage for your raw data files (typically Parquet, ORC, JSON, CSV).

  • Delta Lake sits on top of these files, adding a transactional layer (via a transaction log) and metadata management that provides features traditionally associated with databases.
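
As a small sketch, assuming a Spark environment with Delta Lake available (for example a Databricks cluster) and an existing DataFrame df, writing and reading a Delta table on ADLS Gen2 might look like this; the storage account, container, and folder path are placeholders.

  # Hierarchical-namespace path on ADLS Gen2 (abfss = ADLS Gen2 over TLS).
  path = "abfss://lake@mystorageacct.dfs.core.windows.net/silver/customers"

  # Write: Parquet data files plus a _delta_log/ transaction log are created at the path.
  df.write.format("delta").mode("overwrite").save(path)

  # Read the table back through the same transactional layer.
  customers = spark.read.format("delta").load(path)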


Characteristics of Delta Lake:

1. ACID Transactions
2. Schema Enforcement and Evolution
3. Time Travel (Data Versioning)
4. Unified Batch and Streaming
5. Performance Optimizations
6. Full Compatibility with Apache Spark
7. Cost-Effective Storage
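
For example, two of these characteristics can be exercised against the Delta path from the sketch above (new_rows is a hypothetical DataFrame with extra columns):

  # Time travel: read the table as it was at an earlier version.
  v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

  # Schema evolution: allow new columns to be merged into the table schema on append.
  new_rows.write.format("delta").mode("append").option("mergeSchema", "true").save(path)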

