Working with Different Types of Data in Spark
The previous chapter presented basic DataFrame concepts and abstractions. This chapter covers building expressions, which are the bread and butter of Spark's structured operations. We also review working with a variety of different kinds of data, including the following:
- Booleans
- Numbers
- Strings
- Dates and timestamps
- Handling null
- Complex types
- User-defined functions
Where to Look for APIs
Before we begin, it's worth explaining where you as a user should look for transformations. Spark is a growing project, and any book (including this one) is a snapshot in time. One of our priorities in this book is to teach where, as of this writing, you should look to find functions to transform your data. Following are the key places to look:
DataFrame (Dataset) Methods
This is actually a bit of a trick because a DataFrame is just a Dataset of Row types, so you'll actually end up looking at the Dataset methods. Dataset submodules like DataFrameStatFunctions and DataFrameNaFunctions have more methods that solve specific sets of problems. DataFrameStatFunctions, for example, holds a variety of statistically related functions, whereas DataFrameNaFunctions refers to functions that are relevant when working with null data.
Column Methods
These were introduced for the most part in Chapter 5. They hold a variety of general column-related methods like alias or contains. You can find the API reference for Column methods in the Spark documentation.
SQL Functions
The pyspark.sql.functions package contains a variety of functions for a range of different data types; the Scala equivalent is org.apache.spark.sql.functions. Often, you'll see the entire package imported (from pyspark.sql.functions import *) because these functions are used so frequently. You can find the SQL and DataFrame functions in the Spark documentation as well. This may feel a bit overwhelming, but have no fear: the majority of these functions are ones that you will find in SQL and analytics systems. All of these tools exist to achieve one purpose: to transform rows of data from one format or structure to another. This might create more rows or reduce the number of rows available. We'll walk through a short sketch of these APIs right after loading our sample data. To begin, let's read in the DataFrame that we'll be using for this analysis:
# in Python
df = spark.read.format("csv")\
.option("header", "true")\
.option("inferSchema", "true")\
.load("/data/retail-data/by-day/2010-12-01.csv")
// in Scala
val df = spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/data/retail-data/by-day/2010-12-01.csv")
df.printSchema()
Here's the result of the schema and a small sample of the data:
root
|-- InvoiceNo: string (nullable = true)
|-- StockCode: string (nullable = true)
|-- Description: string (nullable = true)
|-- Quantity: integer (nullable = true)
|-- InvoiceDate: timestamp (nullable = true)
|-- UnitPrice: double (nullable = true)
|-- CustomerID: double (nullable = true)
|-- Country: string (nullable = true)
+---------+---------+--------------------+--------+-------------------+----...
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|Unit...
+---------+---------+--------------------+--------+-------------------+----...
|   536365|   85123A|WHITE HANGING HEA...|       6|2010-12-01 08:26:00|    ...
|   536365|    71053| WHITE METAL LANTERN|       6|2010-12-01 08:26:00|    ...
...
|   536367|    21755|LOVE BUILDING BLO...|       3|2010-12-01 08:34:00|    ...
|   536367|    21777|RECIPE BOX WITH M...|       4|2010-12-01 08:34:00|    ...
+---------+---------+--------------------+--------+-------------------+----...
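To make the earlier pointers concrete, here is a minimal, illustrative sketch (in Python) of each place to look, using the DataFrame we just loaded: the df.na and df.stat submodules, Column methods such as alias and contains, and functions imported from pyspark.sql.functions. These particular calls are only examples chosen for demonstration; nothing later in the chapter depends on them.
# in Python
from pyspark.sql.functions import col, upper
# Dataset submodules: df.na (null handling) and df.stat (statistics)
df.na.drop("any")  # returns a new DataFrame with rows containing any null removed
df.stat.approxQuantile("UnitPrice", [0.5], 0.05)  # approximate median of UnitPrice
# Column methods such as alias and contains
df.select(col("Description").alias("item_description")).show(2)
df.where(col("Description").contains("WHITE")).show(2)
# Functions from pyspark.sql.functions
df.select(upper(col("Country"))).show(2)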
Converting to Spark Types
One thing you'll see us do throughout this chapter is convert native types to Spark types. We do this by using the first function that we introduce here, the lit function. This function converts a type in another language to its corresponding Spark representation. Here's how we can convert a couple of different kinds of Scala and Python values to their respective Spark types:
// in Scala
import org.apache.spark.sql.{SparkSession, functions}
// Create a Spark session
val spark = SparkSession.builder.appName("example").getOrCreate()
// Sample data
val data = Seq(("Alice", 25), ("Bob", 30), ("Charlie", 35))
// Column names
val columns = Seq("Name", "Age")
val df = spark.createDataFrame(data).toDF(columns: _*)
// Add literal values using lit function
val resultDF = df.select(
  functions.lit(5).alias("LiteralInt"),
  functions.lit("five").alias("LiteralString"),
  functions.lit(5.0).alias("LiteralDouble"))
// Show the result DataFrame
resultDF.show()
# in Python
from pyspark.sql.functions import lit
df.select(lit(5), lit("five"), lit(5.0))
There's no equivalent function necessary in SQL, so we can use the values directly:
-- in SQL
SELECT 5, "five", 5.0
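Where lit really shines is when you need a literal value inside a larger expression, for example in a comparison against a column. The following is an illustrative sketch (not part of the original example) that assumes the retail df loaded earlier in this chapter; note that Spark will also wrap bare Python values in lit automatically when they appear in column expressions, so the explicit call mostly serves readability.
# in Python
from pyspark.sql.functions import col, lit
# Add a constant column and use a literal inside a comparison expression
df.select(
    col("InvoiceNo"),
    col("UnitPrice"),
    lit(1).alias("constant_one"),
    (col("UnitPrice") > lit(5)).alias("more_than_five")
).show(3)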