
PySpark


Follow the link above for sample PySpark code.

Transformations create new datasets; actions return values to the driver or write data to external storage.

Transformations:

  • Definition:
    • Transformations are operations that create a new RDD or DataFrame from an existing one.
    • They are "lazy," meaning they don't execute immediately. Instead, Spark builds a lineage graph (DAG - Directed Acyclic Graph) of transformations.
Examples (a short sketch follows this list):
  • map(): Applies a function to each element.
  • filter(): Selects elements based on a condition.
  • flatMap(): Similar to map(), but flattens the results.
  • groupByKey(): Groups elements by key.
  • reduceByKey(): Reduces elements by key.
  • join(): Joins two datasets.
  • distinct(): Removes duplicate elements.
  • union(): Combines two datasets.
  • coalesce(): Reduces the number of partitions.
  • repartition(): Increases or decreases the number of partitions.
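
A minimal sketch of lazy transformations, assuming an existing SparkSession named spark (the sample data is made up for illustration):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("transformations-demo").getOrCreate()
  sc = spark.sparkContext

  rdd = sc.parallelize(["a b", "b c", "a c", "a b"])

  # Each step below only extends the lineage graph (DAG); nothing executes yet.
  words  = rdd.flatMap(lambda line: line.split(" "))   # flatMap(): split each line and flatten
  pairs  = words.map(lambda w: (w, 1))                 # map(): one (word, 1) pair per word
  counts = pairs.reduceByKey(lambda a, b: a + b)       # reduceByKey(): sum the counts per word
  unique = words.distinct()                            # distinct(): remove duplicate words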

Actions:

  • Definition:
    • Actions trigger execution of the lineage graph, returning a result to the driver program or writing data to external storage.
  • Examples (see the sketch after this list):
  • collect(): Returns all elements as an array to the driver.
  • count(): Returns the number of elements.
  • first(): Returns the first element.
  • take(n): Returns the first 'n' elements.
  • reduce(): Reduces elements using a function.
  • foreach(): Applies a function to each element.
  • saveAsTextFile(): Writes elements to a text file.
  • show(): Displays the contents of a DataFrame.
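
Continuing the sketch above, the following actions force the lineage to execute (names such as counts and pairs come from that sketch):

  print(counts.count())      # count(): number of distinct words
  print(counts.take(2))      # take(n): first n elements returned to the driver
  print(counts.collect())    # collect(): all elements to the driver (only safe for small data)

  total = pairs.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)   # reduce(): total word occurrences
  print(total)

  df = counts.toDF(["word", "count"])
  df.show()                  # show(): display the contents of a DataFrame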

  • Repartitioning Mechanisms

    Spark provides two primary methods for repartitioning (a short sketch follows the list below):

    • repartition():
      • This method is used to either increase or decrease the number of partitions.  
      • It performs a full shuffle of the data, meaning that data is redistributed across all partitions. This can be an expensive operation, especially for large datasets, as it involves significant network communication.  
      • repartition() aims to distribute data evenly across the new partitions.
      • It is typically used when you need to redistribute data to achieve a more balanced workload or when you are increasing the number of partitions.
    • coalesce():
      • This method is primarily used to decrease the number of partitions.  
      • It attempts to minimize data shuffling by combining existing partitions.  
      • It is generally more efficient than repartition() when reducing the number of partitions.
      • However, coalesce() may result in unevenly sized partitions.
      • It is best used after a filter has reduced the size of the dataset, leaving more partitions than the data needs.
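
    A minimal sketch contrasting the two, assuming an existing SparkSession named spark (the numbers are arbitrary):

      df = spark.range(0, 1_000_000)          # example DataFrame
      wide = df.repartition(200)              # full shuffle; data spread evenly over 200 partitions
      print(wide.rdd.getNumPartitions())      # -> 200

      filtered = wide.where("id % 100 = 0")   # filtering leaves many near-empty partitions
      narrow = filtered.coalesce(10)          # merges existing partitions, avoiding a full shuffle
      print(narrow.rdd.getNumPartitions())    # -> 10
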
    Optimization techniques:
    In Apache Spark, cache() and persist() are fundamental optimization techniques used to store intermediate results of computations, thereby avoiding redundant recomputation. While they are closely related, there is a key distinction:

    • cache(): uses a fixed, default storage level with no options; if memory fills up, partitions that do not fit may be recomputed later, whereas persist() can be configured to spill the overflow to disk.
    • cache() is essentially shorthand for persist(StorageLevel.MEMORY_ONLY) (for RDDs) or persist(StorageLevel.MEMORY_AND_DISK) (for DataFrames and Datasets).
      • It stores the data in memory, if possible. If there's not enough memory, some partitions might be recomputed.
      • It's a convenient way to quickly cache data when you're primarily concerned with in-memory storage.
    • persist(): once data is persisted, it must be released explicitly with unpersist() when it is no longer needed.
      • persist() provides more fine-grained control over storage. You can specify different storage levels, such as:
        • MEMORY_ONLY: Store data in memory only.
        • MEMORY_AND_DISK: Store data in memory, and spill to disk if memory is insufficient.
        • DISK_ONLY: Store data only on disk.
        • And variations of these, including serialization and replication.
      • This allows you to choose the storage strategy that best suits your needs, balancing performance and fault tolerance.

    In essence:

    • cache() is a simple way to use the default storage level.
    • persist() offers the flexibility to choose from various storage levels (a short sketch follows).
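
    A minimal sketch of both, assuming an existing SparkSession named spark (the datasets are made up for illustration):

      from pyspark import StorageLevel

      logs_a = spark.range(0, 1_000_000)                        # hypothetical dataset for cache()
      logs_b = spark.range(0, 1_000_000)                        # second dataset for persist()

      logs_a.cache()                                            # default storage level
      logs_a.count()                                            # first action materializes the cache

      logs_b.persist(StorageLevel.MEMORY_AND_DISK)              # explicit level: spill to disk if memory runs out
      logs_b.count()

      logs_a.unpersist()                                        # release storage explicitly when done
      logs_b.unpersist()
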
    Broadcast Join:

    A broadcast join in Apache Spark is an optimization technique designed to significantly speed up join operations when one of the DataFrames involved is relatively small. In a standard shuffle join, both DataFrames are repartitioned by the join key and moved across the network, which is expensive for large datasets. A short sketch follows the list below.

    The Solution: Broadcast Joins

    • A broadcast join avoids this shuffling by copying the smaller DataFrame to every executor node in the Spark cluster.
    • This means each executor has a complete copy of the small DataFrame in its memory.
    • Then, when the join is performed, each executor can perform the join locally, without needing to shuffle the larger DataFrame.
    • Improved Performance: By eliminating shuffling, broadcast joins can dramatically reduce the execution time of join operations.
    • Reduced Network Traffic: Network traffic is minimized, as the larger DataFrame does not need to be shuffled.
    • Limitations

      • Memory Constraints: The smaller DataFrame must fit in the memory of each executor node. If it's too large, you'll encounter out-of-memory errors.
      • Suitability: Broadcast joins are not suitable for joining two large DataFrames. In those cases, shuffle joins are necessary.
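
    A minimal sketch using the broadcast() hint, assuming an existing SparkSession named spark (the table names and columns are made up):

      from pyspark.sql.functions import broadcast

      large_df = spark.range(0, 1_000_000).withColumnRenamed("id", "customer_id")              # hypothetical large table
      small_df = spark.createDataFrame([(0, "gold"), (1, "silver")], ["customer_id", "tier"])  # small lookup table

      joined = large_df.join(broadcast(small_df), "customer_id")   # small_df is copied to every executor
      joined.explain()                                             # physical plan should show BroadcastHashJoin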
