How to read an .xlsx file in PySpark?

Pinjari Akbar
Dec 17, 2024

To read an .xlsx file in PySpark, you can use the com.crealytics:spark-excel package, or read the file with openpyxl (for example via pandas) and convert the result to a Spark DataFrame. Below is a working example of how to read .xlsx files using PySpark.

Install Required Dependencies

You need to install the following libraries to handle Excel files:

  • openpyxl: reads .xlsx files (also used as the engine behind pandas.read_excel).
  • com.crealytics:spark-excel: a Spark package that exposes Excel files through the DataFrame API.

Install Libraries:

  1. PyPI installation for local testing:

pip install openpyxl

  2. Add the Spark package: use --packages to add the required JAR when launching PySpark:

pyspark --packages com.crealytics:spark-excel_2.12:3.2.1

Working PySpark Code to Read Excel File

Sample Code:

from pyspark.sql import SparkSession

# Initialize Spark session with Excel support
spark = SparkSession.builder \
    .appName("ReadExcelInPySpark") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:3.2.1") \
    .getOrCreate()

# Define the file path and options
file_path = "/path/to/your/file.xlsx"
excel_options = {
    "header": "true",              # Use the first row as column headers
    "inferSchema": "true",         # Infer data types
    "dataAddress": "'Sheet1'!A1",  # Optional: specify the sheet name and range
}

# Read the Excel file into a DataFrame
df = spark.read.format("com.crealytics.spark.excel") \
    .options(**excel_options) \
    .load(file_path)

# Show the data
df.show()

Code Breakdown

Excel File Path:

  • Replace "/path/to/your/file.xlsx" with the actual path to your .xlsx file.

Excel Options:

  • header: If the file contains a header row, set this to true.
  • inferSchema: Automatically infers column types based on data.
  • dataAddress: Specifies the sheet and range in the format 'SheetName'!Range. For example, 'Sheet1'!A1:C100.
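Because the 'SheetName'!Range convention is easy to mistype, here is a small hypothetical helper (not part of spark-excel, just an illustration) that builds dataAddress strings:

```python
# Hypothetical convenience helper for building dataAddress strings in the
# "'SheetName'!Range" format that spark-excel expects.
def data_address(sheet, start, end=None):
    """Return a dataAddress string for a sheet and cell range."""
    cell_range = f"{start}:{end}" if end else start
    return f"'{sheet}'!{cell_range}"

print(data_address("Sheet1", "A1"))          # 'Sheet1'!A1
print(data_address("Sheet1", "A1", "C100"))  # 'Sheet1'!A1:C100
```

The returned string can be passed straight into the excel_options dictionary shown above.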

Write Excel Data to Other Formats

After reading the Excel data, you can write it to formats like Parquet, Delta, or CSV for downstream processing.

Write to CSV Example:

output_path = "/path/to/output/csv"
df.write.format("csv").option("header", "true").save(output_path)

Write to Delta Example (note: the delta format requires the Delta Lake library to be configured in your Spark session):

output_delta_path = "/path/to/output/delta"
df.write.format("delta").save(output_delta_path)

Additional Notes

  • Ensure the Excel file is accessible in the Spark cluster (e.g., local file system, S3, HDFS).
  • The library com.crealytics:spark-excel works with .xlsx files but may need additional drivers for older formats like .xls.
  • If you don’t need Spark for large-scale processing, libraries like pandas can also read .xlsx files.
