How to Fill and Drop Null Values in PySpark, SQL, and Scala
Apr 13, 2024
In PySpark, SQL, and Scala, handling null values typically involves two main operations: filling nulls with a specified value and dropping rows or columns that contain nulls. Here's how to perform each operation in the three environments:
Null Fill:
PySpark:
from pyspark.sql.functions import when
# Fill null values in a specific column with a specified value
df_filled = df.withColumn("column_name", when(df["column_name"].isNull(), "default_value").otherwise(df["column_name"]))
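A simpler, equivalent option is DataFrame.fillna, which handles the null check for you. This is a minimal sketch, assuming the same df with a string column named column_name:
# Fill nulls in one column with fillna; the dict form limits the fill to that column
df_filled = df.fillna({"column_name": "default_value"})
# Equivalent form using the na accessor
df_filled = df.na.fill("default_value", subset=["column_name"])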
SQL:
-- Fill null values in a specific column with a specified value
SELECT COALESCE(column_name, 'default_value') AS column_name_filled
FROM table_name;
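COALESCE only changes the result of the query. If you need to persist the fill, a hedged sketch (assuming the engine supports DML, e.g. a standard relational database or a Delta table rather than a plain Spark SQL table) is:
-- Overwrite nulls in place with the default value
UPDATE table_name
SET column_name = 'default_value'
WHERE column_name IS NULL;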
Scala:
// Fill null values in a specific column with a specified value
val df_filled = df.na.fill("default_value", Seq("column_name"))
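na.fill also accepts a per-column map, which is handy when different columns need different defaults. A sketch, assuming hypothetical columns column_name (string) and count_column (numeric):
// Fill several columns at once, each with its own default value
val df_filled_multi = df.na.fill(Map(
  "column_name" -> "default_value",
  "count_column" -> 0
))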
Null Drop:
PySpark:
# Drop rows containing any null values
df_dropna = df.dropna()
# Drop rows containing null values in a specific column
df_dropna_specific = df.dropna(subset=["column_name"])
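dropna also takes how and thresh arguments for finer control. A sketch against the same df:
# Drop rows only when every column is null
df_drop_all = df.dropna(how="all")
# Keep rows that have at least two non-null values
df_drop_thresh = df.dropna(thresh=2)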
SQL:
-- Drop rows containing any null values (each column must be listed explicitly)
SELECT * FROM table_name WHERE column1 IS NOT NULL AND column2 IS NOT NULL;
-- Drop rows containing null values in a specific column
SELECT * FROM table_name WHERE column_name IS NOT NULL;
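These queries only filter rows in the result. To remove such rows permanently, a sketch (assuming the engine supports DELETE, e.g. a standard relational database or a Delta table) is:
-- Permanently delete rows where the column is null
DELETE FROM table_name WHERE column_name IS NULL;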
Scala:
// Drop rows containing any null values
val df_dropna = df.na.drop()
// Drop rows containing null values in a specific column
val df_dropna_specific = df.na.drop(Seq("column_name"))
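Like the PySpark version, Scala's na.drop supports a drop mode and a minimum-non-null threshold. A sketch against the same df:
// Drop rows only when every column is null
val df_drop_all = df.na.drop("all")
// Keep rows with at least two non-null values
val df_drop_min = df.na.drop(2)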