101. Streaming Source Data Generation
Typically, a streaming process involves publishing data to a message bus such as Apache Kafka, Amazon Kinesis (AWS), or Google Pub/Sub (GCP).
The data flows continuously into the bronze layer; this kind of data is called unbounded data.
In Azure Databricks, if no format is specified, the default is the Delta format. For a normal batch write we use .mode("overwrite"), whereas in Spark Structured Streaming we use .outputMode("append").
# Checkpoint location for the streaming JSON sink (ADLS Gen2 path)
sinkStreamJSONCheckpointPath = 'abfss://bronze@adlsudadatalakedev.dfs.core.windows.net/daily-pricing-streaming-data/json/checkpoint/'

streamProcessingQuery = (sourceStreamJSONFileDF                      # streaming DataFrame defined earlier
    .writeStream
    .outputMode("append")                                            # write only newly arrived rows
    .format("json")                                                  # sink file format
    .queryName("stream-processing")                                  # logical name for monitoring
    .trigger(processingTime="5 minutes")                             # micro-batch every 5 minutes
    .option("checkpointLocation", sinkStreamJSONCheckpointPath)      # required for fault tolerance
    .start(sinkStreamJSONFilePath)                                   # sink path defined earlier
)
The key additions compared to the previous snippet are:
.queryName("stream-processing") assigns the logical name "stream-processing" to the streaming query. Benefits of setting a query name:
- Monitoring and Management: It makes it easier to identify and monitor this specific streaming query in the Spark UI (the Structured Streaming tab) or the Spark History Server.
- Debugging: When troubleshooting issues, the query name helps distinguish logs and metrics related to this particular stream.
- Programmatic Access: You can look up and manage the running stream programmatically, for example by filtering spark.streams.active by name (though this is less common in basic write scenarios).
.trigger(processingTime="5 minutes") configures the trigger interval for the streaming query: Spark will attempt to process new data and update the sink every 5 minutes.
- Trigger Modes: Spark Structured Streaming supports several trigger modes:
- processingTime: As used here, triggers a micro-batch at a fixed time interval.
- once: Processes all data available at that moment in a single micro-batch and then terminates; useful for one-time, batch-like processing of streaming data.
- availableNow (Spark 3.3+): Processes all currently available data, possibly across multiple micro-batches, and then terminates.
- continuous (experimental): For low-latency processing with certain sources and sinks.
- Impact: The trigger interval controls the latency of your streaming pipeline. A shorter interval gives lower latency but potentially higher resource consumption; a longer interval reduces resource usage but increases latency. Choose the interval based on your application's requirements.
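For reference, the other trigger modes are specified the same way on the writer (a minimal sketch; df stands in for any streaming DataFrame, and only the trigger clause differs):

# Fixed-interval micro-batches (as used above)
df.writeStream.trigger(processingTime="5 minutes")

# Process everything available in one micro-batch, then stop
df.writeStream.trigger(once=True)

# Process everything available, possibly in several micro-batches, then stop (Spark 3.3+)
df.writeStream.trigger(availableNow=True)

# Experimental low-latency continuous processing
df.writeStream.trigger(continuous="1 second")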
The rest of the code remains the same and performs the following actions:
- Reads from a streaming DataFrame (sourceStreamJSONFileDF).
- Writes the output in "append" mode (only new data is written).
- Formats the output as "json" files.
- Uses the specified sinkStreamJSONCheckpointPath to maintain the query's state for fault tolerance.
- Starts the streaming query and writes the output to sinkStreamJSONFilePath (assumed to be defined elsewhere).
In summary, this updated code snippet enhances the previous one by:
- Providing a logical name to the streaming query for better monitoring and management.
- Defining a specific processing interval of 5 minutes, controlling how frequently new data is processed and written to the sink.
This configuration is more explicit about how the streaming pipeline will operate in terms of processing frequency and provides better metadata for monitoring.
streamProcessingQuery.id returns the unique identifier of the Spark Structured Streaming query that was just started. This confirms that the .start() action successfully launched the streaming process: the query is now running in the background, continuously processing data according to the configured trigger and writing the results to the specified sink.
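The returned StreamingQuery object exposes a few other handy attributes for monitoring (a minimal sketch using the query started above):

streamProcessingQuery.id            # unique id, stable across restarts from the same checkpoint
streamProcessingQuery.runId         # id of the current run (changes on every restart)
streamProcessingQuery.name          # "stream-processing", as set via queryName()
streamProcessingQuery.status        # current status, e.g. whether a trigger is active
streamProcessingQuery.lastProgress  # metrics for the most recent micro-batch
# streamProcessingQuery.stop()      # gracefully stop the query when needed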
Module Summary: Spark Structured Streaming Functionalities
Spark Structured Streaming provides a powerful and scalable framework for processing real-time and near real-time data streams. It extends the familiar DataFrame and Dataset APIs of Spark SQL to handle continuous data. Here's a breakdown of its core functionalities:
1. Unified Batch and Streaming API:
- Core Concept: Structured Streaming treats a continuous stream of data as an unbounded table. Computations on this "table" are performed incrementally and continuously as new data arrives.
- DataFrame/Dataset Abstraction: You use the same high-level DataFrame and Dataset APIs for both batch and streaming data processing. This unification simplifies development and allows for code reuse.
- SQL Integration: You can apply SQL queries directly to streaming DataFrames and Views.
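For example, a streaming DataFrame can be registered as a temporary view and queried with SQL; the result is itself a streaming DataFrame (a minimal sketch, where pricingStreamDF and its price column are assumed placeholders):

# Register the streaming DataFrame as a temporary view
pricingStreamDF.createOrReplaceTempView("pricing_stream")

# SQL over the view returns another streaming DataFrame
highValueDF = spark.sql("SELECT * FROM pricing_stream WHERE price > 100")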
2. Data Sources (Readers):
- Variety of Sources: Structured Streaming supports reading data from various streaming sources, including:
- File-based: Reading new files as they appear in a directory (e.g., text, CSV, JSON, Parquet, ORC).
- Network Sockets: Reading data from network connections.
- Message Queues: Integrating with systems like Apache Kafka, Azure Event Hubs, AWS Kinesis.
- Custom Sources: Allows developers to create their own data sources.
- Configuration Options: Each source provides specific configuration options to control aspects like:
- Path patterns for file-based sources.
- Kafka broker lists, topics, and consumer groups.
- Event Hubs connection strings and consumer groups.
- Rate limiting for testing and controlled ingestion.
- Starting offsets or positions in the stream.
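For quick experiments, the built-in rate source generates a controlled test stream without any external system (a minimal sketch; rowsPerSecond is just an illustrative setting):

# Synthetic test stream: emits (timestamp, value) rows at a fixed rate
testStreamDF = (spark.readStream
                .format("rate")
                .option("rowsPerSecond", 10)
                .load())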
3. Data Transformations:
- Rich Set of Operations: You can apply a wide range of transformations to streaming DataFrames, similar to batch processing:
- Basic Transformations: select, where (filter), groupBy, orderBy, withColumn, drop, union (note that orderBy on a streaming DataFrame is only supported after an aggregation in complete output mode).
. - Windowing Operations: For time-based aggregations (e.g., counting events in the last 5 minutes). Supports various window types (tumbling, sliding, session).
- Watermarking: For handling late-arriving data in windowed aggregations and managing state efficiently (a short example follows this list).
- Joins: Joining streaming DataFrames with static DataFrames or other streaming DataFrames (with specific constraints).
- User-Defined Functions (UDFs): Allows applying custom logic.
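To illustrate the windowing and watermarking points above, here is a minimal sketch of a 5-minute tumbling-window count with a 10-minute watermark (eventsDF and its event_time column are assumed placeholders):

from pyspark.sql.functions import window, count

# Tolerate events arriving up to 10 minutes late, then count events per 5-minute window
windowedCountsDF = (eventsDF
                    .withWatermark("event_time", "10 minutes")
                    .groupBy(window("event_time", "5 minutes"))
                    .agg(count("*").alias("event_count")))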
4. Data Sinks (Writers):
- Diverse Output Destinations: Structured Streaming supports writing processed data to various sinks:
- File-based: Writing to directories in formats like text, CSV, JSON, Parquet, ORC.
- Foreach Sink: Allows writing data using custom logic (e.g., to external databases or APIs).
- Message Queues: Writing to systems like Apache Kafka, Azure Event Hubs.
- Console Sink: For debugging and development (prints output to the console).
- Memory Sink: For testing and in-memory processing.
- Data Lakes/Warehouses: Integrating with systems like Delta Lake, Apache Hive, cloud-based data warehouses.
- Output Modes: Control how data is written to the sink on each trigger:
- append: Only rows newly added to the result table since the last trigger are written (the most common mode for simple transformations).
- complete: The entire updated result table is rewritten to the sink on every trigger (only supported for queries with aggregations).
- update: Only the rows of the result table that were updated since the last trigger are written; without aggregations this behaves like append.
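As an illustration of complete mode, a streaming aggregation written to the console sink might look like this (a minimal sketch; pricingStreamDF and its product_category column are assumed placeholders):

from pyspark.sql.functions import count

categoryCountsDF = pricingStreamDF.groupBy("product_category").agg(count("*").alias("records"))

# Complete mode rewrites the full aggregated result on every trigger (handy with the console sink for debugging)
consoleQuery = (categoryCountsDF.writeStream
                .outputMode("complete")
                .format("console")
                .start())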
5. Triggering and Latency:
- Control over Processing Intervals: You can configure how frequently Spark checks for new data and processes it using triggers:
- Fixed Interval (processingTime): Triggers processing at a specified time interval (e.g., every 5 seconds).
- Once: Processes all available data in a single micro-batch and then stops.
- Continuous (experimental): Aims for the lowest possible latency.
- Available Now (availableNow, Spark 3.3+): Processes all currently available data, possibly across multiple micro-batches, and then stops.
- Trade-off between Latency and Throughput: Shorter trigger intervals result in lower latency but potentially higher resource consumption. Longer intervals improve throughput but increase latency.
6. Fault Tolerance and State Management:
- Checkpointing: A crucial mechanism for ensuring fault tolerance. Spark saves the state of the streaming query (e.g., processed offsets, intermediate aggregations) to a reliable storage system (like HDFS or cloud object storage). If a failure occurs, the query can restart from the last successful checkpoint without losing data.
- Stateful Operations: For operations like windowing and aggregations, Spark manages state. Checkpointing ensures that this state is preserved across failures.
Spark Structured Streaming Use Case Scenarios
Spark Structured Streaming is well-suited for a wide range of real-time and near real-time data processing applications:
- Real-time Analytics:
- Clickstream Analysis: Tracking user activity on websites or applications in real-time to understand behavior and trends.
- IoT Data Processing: Ingesting and analyzing sensor data from connected devices for monitoring, alerting, and predictive maintenance.
- Financial Trading Systems: Analyzing market data in real-time for algorithmic trading and risk management.
- Network Monitoring: Analyzing network traffic for security threats and performance issues.
- ETL and Data Pipelines:
- Real-time Data Ingestion and Transformation: Continuously ingesting raw data, performing transformations, and loading it into data lakes or data warehouses.
- Stream Enrichment: Joining real-time streams with static datasets or other streams to add contextual information.
- Real-time Monitoring and Alerting:
- System Monitoring: Tracking system metrics and generating alerts based on predefined thresholds.
- Fraud Detection: Analyzing transaction streams in real-time to identify potentially fraudulent activities.
- Anomaly Detection: Identifying unusual patterns in data streams.
- Real-time Dashboards and Visualization:
- Powering live dashboards with continuously updating data.
- Complex Event Processing (CEP):
- Detecting patterns of events occurring in a specific sequence or within a certain time window.
Spark Structured Streaming Reader / Writer Configurations
As mentioned earlier, both the data source (reader) and the data sink (writer) offer various configuration options. Here are some common examples:
Reader Configurations (Example: Kafka):
spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "your_kafka_brokers") \
.option("subscribe", "your_topic") \
.option("startingOffsets", "latest") \
.load()
- kafka.bootstrap.servers: List of Kafka broker addresses.
- subscribe: Topic(s) to subscribe to.
- startingOffsets: Where to start reading data ("earliest", "latest", or a JSON string specifying offsets per partition).
- kafka.security.protocol, kafka.sasl.mechanism, kafka.sasl.jaas.config: For secure Kafka connections.
- maxOffsetsPerTrigger: Limits the number of messages processed in each micro-batch.
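The Kafka source exposes key and value as binary columns, so they typically need to be cast and parsed before use (a minimal sketch; pricing_schema is an assumed placeholder schema):

from pyspark.sql.functions import col, from_json

kafkaStreamDF = (spark.readStream
                 .format("kafka")
                 .option("kafka.bootstrap.servers", "your_kafka_brokers")
                 .option("subscribe", "your_topic")
                 .load())

# Cast the binary value column to string, then parse the JSON payload with a predefined schema
parsedDF = (kafkaStreamDF
            .select(from_json(col("value").cast("string"), pricing_schema).alias("payload"))
            .select("payload.*"))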
Reader Configurations (Example: File-based):
# File-based streaming sources need an explicit schema (jsonSchema, sketched below)
# unless spark.sql.streaming.schemaInference is enabled
spark.readStream \
    .format("json") \
    .schema(jsonSchema) \
    .option("path", "/path/to/your/streaming/data") \
    .option("recursiveFileLookup", "true") \
    .option("fileNameOnly", "false") \
    .load()
- path: The directory to monitor for new files.
- recursiveFileLookup: Whether to look for files recursively in subdirectories.
- fileNameOnly: Whether to consider only the file name (rather than the full path) when detecting new files.
- schema: Explicitly defines the schema of the files; required for file-based streaming sources unless spark.sql.streaming.schemaInference is enabled.
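For completeness, here is one way to define the jsonSchema referenced above (a minimal sketch; the column names are placeholders for whatever the incoming JSON actually contains):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Explicit schema for the incoming JSON files
jsonSchema = StructType([
    StructField("product_id", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("event_time", TimestampType(), True),
])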
Writer Configurations (Example: Kafka):
query = streamingDF.writeStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "your_kafka_brokers") \
.option("topic", "your_output_topic") \
.option("checkpointLocation", "/path/to/checkpoint") \
.start()
- `kafka.bootstrap.servers`: List of Kafka broker addresses.
- `topic`: The Kafka topic to write to.
- The DataFrame being written must contain a `value` column (string or binary) and may optionally contain `key` and `topic` columns; Spark serializes these itself, so Kafka's `key.serializer` / `value.serializer` settings cannot be set on this sink.
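A typical pattern is therefore to build the key and value columns explicitly before writing (a minimal sketch; streamingDF and its product_id column are assumed placeholders):

# Serialize the payload into the key/value columns the Kafka sink expects
kafkaReadyDF = streamingDF.selectExpr(
    "CAST(product_id AS STRING) AS key",
    "to_json(struct(*)) AS value"
)

kafkaQuery = (kafkaReadyDF.writeStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "your_kafka_brokers")
              .option("topic", "your_output_topic")
              .option("checkpointLocation", "/path/to/checkpoint")
              .start())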
Writer Configurations (Example: File-based):
query = streamingDF.writeStream \
.format("parquet") \
.option("path", "/path/to/your/output/directory") \
.option("checkpointLocation", "/path/to/checkpoint") \
.outputMode("append") \
.partitionBy("date") \
.start()
- path: The output directory.
- partitionBy: Columns to partition the output data by.
- Format-specific options (e.g., compression).
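On Databricks, or anywhere the Delta Lake libraries are installed, the same pattern works with the Delta format (a minimal sketch; the paths are placeholders):

# Delta sink: append new micro-batches to a Delta table
deltaQuery = (streamingDF.writeStream
              .format("delta")
              .outputMode("append")
              .option("checkpointLocation", "/path/to/delta/checkpoint")
              .start("/path/to/delta/table"))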
Spark Structured Streaming CHECKPOINT options Purpose and Usability
Purpose of Checkpointing:
- Fault Tolerance: Enables the streaming query to recover from failures (e.g., node crashes, application restarts) and continue processing from where it left off without losing data or state.
- State Persistence: For stateful operations (like windowing, aggregations, and joins), checkpointing periodically saves the current state of the computation. This ensures that the state is not lost in case of failures and can be restored upon recovery.
- Exactly-Once Processing (with some sources and sinks): When combined with idempotent sinks, checkpointing helps achieve exactly-once processing semantics (ensuring that each event is processed exactly once).
Usability of Checkpointing:
- Configuration: You specify the checkpoint location using .option("checkpointLocation", "/path/to/checkpoint") on the DataStreamWriter. The path should be on a reliable, fault-tolerant storage system accessible by the Spark cluster (e.g., HDFS, S3, Azure Blob Storage).
- Content of Checkpoints: The checkpoint directory typically contains metadata about the progress of the query, the processed offsets of the input data, and the state of any stateful operations.
- Importance: Setting a checkpointLocation is crucial for any production-level Spark Structured Streaming application that requires fault tolerance or uses stateful operations. Without it, the query will likely fail irrecoverably upon any issue, and you risk data loss or inconsistencies.
- Management:
- Permissions: Ensure the Spark application has the necessary read and write permissions to the checkpoint location.
- Storage Considerations: Checkpoint directories can grow over time, especially for long-running stateful queries. Monitor the size and consider strategies for managing or archiving old checkpoints if necessary (though Spark manages this to some extent).
- Consistency: Avoid sharing the same checkpoint location between different streaming queries, as this can lead to unexpected behavior and corruption.
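Because progress is tracked in the checkpoint, restarting a query with the same checkpointLocation resumes from the last committed offsets (a minimal sketch reusing the variables from the earlier example):

# Stop the running query (e.g., before a deployment)
streamProcessingQuery.stop()

# Restarting with the same checkpointLocation resumes where the query left off
restartedQuery = (sourceStreamJSONFileDF.writeStream
                  .outputMode("append")
                  .format("json")
                  .option("checkpointLocation", sinkStreamJSONCheckpointPath)
                  .start(sinkStreamJSONFilePath))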
In conclusion, Spark Structured Streaming provides a robust and flexible framework for building real-time data processing pipelines. Understanding its core functionalities, use case scenarios, configuration options for readers and writers, and the critical role of checkpointing is essential for developing and deploying reliable streaming applications.