How to read an .xlsx file in PySpark?

Pinjari Akbar
Dec 17, 2024

To read an .xlsx file in PySpark, you can use the com.crealytics:spark-excel package, or read the file with openpyxl (for example via pandas) and convert the result to a Spark DataFrame. Below is a working example of how to read .xlsx files using PySpark.

Install Required Dependencies

You need to install the following libraries to handle Excel files:

  • openpyxl: reads .xlsx files (also used as the engine behind pandas.read_excel).
  • com.crealytics:spark-excel: a Spark package that exposes Excel files through the DataFrame API.

Install Libraries:

  1. PyPI installation for local testing:

pip install openpyxl

  2. Add the Spark package: use --packages to add the required JAR when launching PySpark:

pyspark --packages com.crealytics:spark-excel_2.12:3.2.1

Working PySpark Code to Read Excel File

Sample Code:

from pyspark.sql import SparkSession

# Initialize Spark session with Excel support
spark = SparkSession.builder \
    .appName("ReadExcelInPySpark") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:3.2.1") \
    .getOrCreate()

# Define the file path and options
file_path = "/path/to/your/file.xlsx"
excel_options = {
    "header": "true",              # Use the first row as column headers
    "inferSchema": "true",         # Infer data types
    "dataAddress": "'Sheet1'!A1",  # Optional: specify the sheet name and range
}

# Read the Excel file into a DataFrame
df = spark.read.format("com.crealytics.spark.excel") \
    .options(**excel_options) \
    .load(file_path)

# Show the data
df.show()

Code Breakdown

Excel File Path:

  • Replace "/path/to/your/file.xlsx" with the actual path to your .xlsx file.

Excel Options:

  • header: If the file contains a header row, set this to true.
  • inferSchema: Automatically infers column types based on data.
  • dataAddress: Specifies the sheet and range in the format 'SheetName'!Range. For example, 'Sheet1'!A1:C100.
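Because the 'SheetName'!Range convention is easy to mistype, here is a small hypothetical helper (not part of spark-excel, just an illustration) that builds dataAddress strings:

```python
# Hypothetical convenience helper for building dataAddress strings in the
# "'SheetName'!Range" format that spark-excel expects.
def data_address(sheet, start, end=None):
    """Return a dataAddress string for a sheet and cell range."""
    cell_range = f"{start}:{end}" if end else start
    return f"'{sheet}'!{cell_range}"

print(data_address("Sheet1", "A1"))          # 'Sheet1'!A1
print(data_address("Sheet1", "A1", "C100"))  # 'Sheet1'!A1:C100
```

The returned string can be passed straight into the excel_options dictionary shown above.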

Write Excel Data to Other Formats

After reading the Excel data, you can write it to formats like Parquet, Delta, or CSV for downstream processing.

Write to CSV Example:

output_path = "/path/to/output/csv"
df.write.format("csv").option("header", "true").save(output_path)

Write to Delta Example (note: the delta format requires the Delta Lake library to be configured in your Spark session):

output_delta_path = "/path/to/output/delta"
df.write.format("delta").save(output_delta_path)

Additional Notes

  • Ensure the Excel file is accessible in the Spark cluster (e.g., local file system, S3, HDFS).
  • The library com.crealytics:spark-excel works with .xlsx files but may need additional drivers for older formats like .xls.
  • If you don’t need Spark for large-scale processing, libraries like pandas can also read .xlsx files.
