Spark partitionBy("column")?
Partitioning in Apache Spark is a crucial optimization technique: it enhances query performance by organizing data on disk in a way that aligns with Spark's execution model. The partitionBy operation is typically used when writing data out in file formats such as Apache Parquet or Apache Avro. It allows you to specify which columns the data should be physically partitioned by, which can significantly impact query performance.
The partitionBy operation is used in conjunction with the write operation on a DataFrame in Spark. Here's a simple example using PySpark:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitionBy-example").getOrCreate()

# Input DataFrame
data = [
    ("John", "Doe", "Engineering", 30),
    ("Jane", "Doe", "Sales", 25),
    ("Bob", "Smith", "Engineering", 40),
    ("Alice", "Jones", "HR", 35),
    ("Charlie", "Brown", "Sales", 28)
]
columns = ["First_Name", "Last_Name", "Department", "Age"]
input_df = spark.createDataFrame(data=data, schema=columns)
input_df.show()
Now, let's say we want to partition this data by the "Department" column and write it out as Parquet:

input_df.write.format("parquet").partitionBy("Department").save("/path/to/output")
Output Parquet Table:
/path/to/output
|-- Department=Engineering
|   |-- <parquet files for employees in the Engineering department>
|-- Department=Sales
|   |-- <parquet files for employees in the Sales department>
|-- Department=HR
|   |-- <parquet files for employees in the HR department>
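To see why this layout helps query performance, here is a minimal sketch (assuming the data was written to /path/to/output as above; the app name is arbitrary) that reads the partitioned table back and filters on the partition column. Because each department lives in its own directory, Spark can prune partitions and scan only the matching Department=Engineering folder:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("partition-pruning-example").getOrCreate()

# Read the partitioned output; Spark discovers the Department=... directories
# and exposes "Department" as a regular column on the DataFrame.
employees = spark.read.parquet("/path/to/output")

# Filtering on the partition column lets Spark skip the other directories entirely.
engineering = employees.filter(col("Department") == "Engineering")
engineering.show()

# The physical plan lists the partition filters applied during the scan.
engineering.explain()

Running explain() on the filtered DataFrame is a quick way to confirm that the Department filter is being pushed down to the file scan rather than applied after reading the whole table.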