Spark Narrow Transformation and Wide Transformation
In Spark, transformations are categorized as narrow or wide based on how each output partition depends on the input partitions, and therefore on whether data has to be shuffled between partitions. Understanding this distinction is crucial for optimizing the performance of your Spark jobs.
Narrow Transformation:
Narrow transformations are the ones where each input partition contributes to at most one output partition. These transformations are performed independently on each partition, and they do not require shuffling or redistributing data across partitions.
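# Narrow transformation example: map
from pyspark import SparkContext
sc = SparkContext.getOrCreate()  # reuse the active context, or start one
rdd = sc.parallelize([1, 2, 3, 4, 5], 2)  # 2 partitions
def square(x):
    return x * x
result_rdd = rdd.map(square)  # runs inside each partition, no shuffle
print(result_rdd.collect())  # [1, 4, 9, 16, 25]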
In this example, the map operation applies the square function to each element in the RDD independently within its partition. There's no need to shuffle or exchange data between partitions.
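To make the "independently within its partition" point concrete, the partitions can be modeled as plain Python lists. This is only a minimal sketch outside of Spark: split_into_partitions and narrow_map are hypothetical helpers, not Spark APIs, and the contiguous slicing is an assumption made for illustration.

```python
def split_into_partitions(data, num_partitions):
    """Slice data into contiguous chunks (hypothetical helper, not a Spark API)."""
    size = len(data)
    return [
        data[i * size // num_partitions:(i + 1) * size // num_partitions]
        for i in range(num_partitions)
    ]


def narrow_map(partitions, func):
    """Apply func inside each partition; no element crosses a partition boundary."""
    return [[func(x) for x in part] for part in partitions]


partitions = split_into_partitions([1, 2, 3, 4, 5], 2)
squared = narrow_map(partitions, lambda x: x * x)

print(partitions)  # [[1, 2], [3, 4, 5]]
print(squared)     # [[1, 4], [9, 16, 25]]
```

Each output partition is computed from exactly one input partition, and the partition sizes stay the same. That one-to-one dependency is what lets a narrow transformation run without any data exchange.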
Wide Transformation:
Wide transformations are the ones where each input partition can contribute to multiple output partitions. These transformations require shuffling and redistribution of data across partitions.
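# Wide transformation example: groupByKey
from pyspark import SparkContext
sc = SparkContext.getOrCreate()  # reuse the active context, or start one
pair_rdd = sc.parallelize([(1, 'a'), (2, 'b'), (1, 'c'), (2, 'd')], 2)  # 2 partitions
result_rdd = pair_rdd.groupByKey()  # shuffles so all values for a key land in one partition
# groupByKey yields iterables per key; convert them to lists so the output is readable
print(result_rdd.mapValues(list).collect())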
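The shuffle behind a wide transformation such as groupByKey can also be sketched in plain Python. The code below is a simplified model, not Spark's implementation: shuffle_by_key is a hypothetical helper, and routing records with hash(key) % num_output_partitions is an assumption that only mirrors the general idea of hash partitioning.

```python
from collections import defaultdict


def shuffle_by_key(partitions, num_output_partitions):
    """Route each (key, value) pair to an output partition chosen by its key.

    Hypothetical helper: a simplified model of hash partitioning, not Spark code.
    """
    outputs = [defaultdict(list) for _ in range(num_output_partitions)]
    for part in partitions:
        for key, value in part:
            target = hash(key) % num_output_partitions
            outputs[target][key].append(value)
    return [dict(out) for out in outputs]


# Two input partitions, each holding records for both keys.
input_partitions = [[(1, 'a'), (2, 'b')], [(1, 'c'), (2, 'd')]]
grouped = shuffle_by_key(input_partitions, 2)

print(grouped)  # [{2: ['b', 'd']}, {1: ['a', 'c']}]
```

Here every input partition sends records to both output partitions, so each output partition depends on all input partitions. In a real cluster that movement means data is written out and transferred between executors, which is why wide transformations tend to be the expensive steps in a job.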