PySpark StructType and StructField

Pinjari Akbar
2 min read · Nov 11, 2023
In PySpark, StructType and StructField are classes used to define the schema of a DataFrame. They let you specify the structure of the data, including the name and data type of each column. Here's a brief explanation of each.

  1. StructType:
  • StructType is a class that represents a schema for a DataFrame.
  • It is essentially a collection of StructField objects.
  • It is used to define the structure of the DataFrame, specifying the names and data types of each column.
  • Think of it as a way to define the overall structure (columns) of your DataFrame.

  2. StructField:
  • StructField is a class that represents a field (or column) in a DataFrame.
  • It is a part of the StructType schema.
  • Each StructField object defines the name, data type, and whether the column can contain null values.
  • It also accepts an optional metadata dictionary, which lets you attach extra information to a column.

Here’s a simple example of how you might use StructType and StructField to define a schema:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create (or reuse) a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Define the schema for the DataFrame
schema = StructType([
    StructField("ProductID", IntegerType(), True),
    StructField("ProductName", StringType(), True),
    StructField("Category", StringType(), True),
    StructField("Price", IntegerType(), True),
])

# Sample data: one tuple per row, in the same order as the schema fields
data = [
    (1, "Laptop", "Electronics", 1200),
    (2, "Smartphone", "Electronics", 800),
    (3, "Desk Chair", "Furniture", 150),
    (4, "Coffee Maker", "Appliances", 50),
]

# Create a DataFrame
product_df = spark.createDataFrame(data=data, schema=schema)

# Show the DataFrame (show() works in any PySpark environment;
# display() is only available in Databricks notebooks)
product_df.show()

We import the necessary modules from PySpark.
We create a Spark session using SparkSession.builder.appName("example").getOrCreate().
We define the schema for the DataFrame using StructType and StructField.
We create a list of tuples representing the sample data.
We use createDataFrame to create a DataFrame from the sample data and schema.
We display the DataFrame using show().
