How to read an .xlsx file in PySpark?
2 min read · Dec 17, 2024
To read an .xlsx file in PySpark, you can use libraries like com.crealytics:spark-excel or the openpyxl library in combination with PySpark's DataFrame API. Below is a working example of how to read .xlsx files using PySpark.
Install Required Dependencies
You need to install the following libraries to handle Excel files:
- openpyxl: reads .xlsx files.
- Spark's com.crealytics:spark-excel package: exposes Excel files through the Spark DataFrame API.
Install Libraries:
- PyPI installation for local testing:
pip install openpyxl
- Add Spark package: use --packages to add the required JAR when launching PySpark:
pyspark --packages com.crealytics:spark-excel_2.12:3.2.1
Working PySpark Code to Read Excel File
Sample Code:
from pyspark.sql import SparkSession

# Initialize Spark session with Excel support
spark = SparkSession.builder \
    .appName("ReadExcelInPySpark") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:3.2.1") \
    .getOrCreate()

# Define the file path and options
file_path = "/path/to/your/file.xlsx"
excel_options = {
    "header": "true",              # Use the first row as column headers
    "inferSchema": "true",         # Infer data types
    "dataAddress": "'Sheet1'!A1",  # Optional: specify the sheet name and range
}

# Read the Excel file into a DataFrame
df = spark.read.format("com.crealytics.spark.excel") \
    .options(**excel_options) \
    .load(file_path)

# Show the data
df.show()
Code Breakdown
- Excel File Path: replace "/path/to/your/file.xlsx" with the actual path to your .xlsx file.
- Excel Options:
- header: if the file contains a header row, set this to "true".
- inferSchema: automatically infers column types based on the data.
- dataAddress: specifies the sheet and range in the format 'SheetName'!Range, for example 'Sheet1'!A1:C100.
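To illustrate the dataAddress option, here is a small sketch of two common address forms. The sheet names and ranges below are placeholders, not values from the article:

```python
# Variant 1: a bounded range on a named sheet (sheet name is a placeholder)
options_named_sheet = {
    "header": "true",
    "dataAddress": "'Sales 2024'!B3:F100",  # read only cells B3 through F100
}

# Variant 2: a start cell only; data is read from the first sheet
# starting at that cell ("A1" is also the default when the option is omitted)
options_first_sheet = {
    "header": "true",
    "dataAddress": "A1",
}
```

Either dictionary can be passed to spark.read.format("com.crealytics.spark.excel").options(**...) exactly as in the example above.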
Write Excel Data to Other Formats
After reading the Excel data, you can write it to formats like Parquet, Delta, or CSV for downstream processing.
Write to CSV Example:
output_path = "/path/to/output/csv"
df.write.format("csv").option("header", "true").save(output_path)
Write to Delta Example (requires the Delta Lake libraries to be configured in your Spark session):
output_delta_path = "/path/to/output/delta"
df.write.format("delta").save(output_delta_path)
Additional Notes
- Ensure the Excel file is accessible in the Spark cluster (e.g., local file system, S3, HDFS).
- The com.crealytics:spark-excel library works with .xlsx files but may need additional drivers for older formats like .xls.
- If you don't need Spark for large-scale processing, libraries like pandas can also read .xlsx files.
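For that small-scale case, here is a minimal pandas sketch using the openpyxl engine. The file path is a placeholder, and the sample data exists only to make the example self-contained:

```python
import pandas as pd

# Create a tiny sample workbook so the example can run on its own
sample = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
sample.to_excel("/tmp/sample.xlsx", sheet_name="Sheet1", index=False)  # uses openpyxl

# Read it back; engine="openpyxl" handles .xlsx files
df = pd.read_excel("/tmp/sample.xlsx", sheet_name="Sheet1", engine="openpyxl")
print(df.shape)  # → (3, 2)
```

If the data later outgrows a single machine, the same pandas DataFrame can be handed to Spark via spark.createDataFrame(df).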