How to Read Azure SQL into Databricks with PySpark?

Pinjari Akbar
3 min read · Jun 15, 2024


To read data from an Azure SQL database into Databricks using PySpark, follow these steps:

  1. Set Up a Spark Session in Databricks: First, make sure you have a Spark session available. Databricks notebooks automatically provide a Spark session, typically named spark.
  2. Read Data Using JDBC: Use the spark.read method with the JDBC format to read data from your SQL database. You'll need the JDBC URL, table name, and credentials.

Here’s a detailed step-by-step guide with code snippets:

Step 1: Set Up the Spark Session

If you’re in a Databricks notebook, the Spark session is usually available as spark:

# Spark session is typically available as `spark` in Databricks notebooks
spark

Step 2: Configure the JDBC Connection

Set up the JDBC connection properties, including the URL, table name, and credentials.

# Database configurations
jdbc_hostname = "your_sql_server_hostname"
jdbc_port = 1433 # Default port for SQL Server
database_name = "your_database_name"
table_name = "your_table_name"
username = "your_username"
password = "your_password"

# JDBC URL
jdbc_url = f"jdbc:sqlserver://{jdbc_hostname}:{jdbc_port};database={database_name}"

# Connection properties
# Connection properties
connection_properties = {
    "user": username,
    "password": password,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"  # Use the appropriate driver for your database
}
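
If you prefer the options-style reader, the same configuration can be expressed with spark.read.format("jdbc") and explicit options. This is a minimal sketch that reuses the variables defined above; url, dbtable, user, password, and driver are standard Spark JDBC options.

# Alternative: configure the JDBC source via explicit reader options
# (reuses the variables defined above; adjust for your environment)
df_alt = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", table_name)
    .option("user", username)
    .option("password", password)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)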

Step 3: Read Data from the SQL Database

Use the spark.read.jdbc method to read the data into a DataFrame.

# Read data from SQL database into DataFrame
df = spark.read.jdbc(url=jdbc_url, table=table_name, properties=connection_properties)

# Show the schema of the DataFrame
df.printSchema()

# Display the first few rows of the DataFrame
df.display()
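
If you don't need the whole table, you can push a query down to SQL Server, and for large tables you can parallelize the read across partitions. This is a sketch under assumptions: the column names, the numeric partition column "id", and the bounds are placeholders for your own schema.

# Push a query down to the database instead of reading the full table
# (the subquery alias "src" and the column names are placeholders)
pushdown_query = f"(SELECT column1, column2 FROM {table_name} WHERE column1 > 100) AS src"
df_subset = spark.read.jdbc(url=jdbc_url, table=pushdown_query, properties=connection_properties)

# Parallelize the read on a numeric column; the partition settings are illustrative
df_parallel = spark.read.jdbc(
    url=jdbc_url,
    table=table_name,
    column="id",
    lowerBound=1,
    upperBound=1000000,
    numPartitions=8,
    properties=connection_properties,
)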

Step 4: Process the Data (Optional)

You can now perform various operations on the DataFrame, such as filtering, aggregating, or joining with other DataFrames.

# Example: Select specific columns
selected_df = df.select("column1", "column2")

# Example: Filter rows
filtered_df = df.filter(df["column1"] > 100)

# Show the resulting DataFrame
filtered_df.display()
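
Since this step also mentions aggregating and joining, here is a minimal sketch of both, assuming placeholder column names and a hypothetical second table called your_other_table.

from pyspark.sql import functions as F

# Example: aggregate by a grouping column (column names are placeholders)
agg_df = (
    df.groupBy("column2")
      .agg(F.count("*").alias("row_count"), F.avg("column1").alias("avg_column1"))
)
agg_df.display()

# Example: join with a second table read over the same connection
other_df = spark.read.jdbc(url=jdbc_url, table="your_other_table", properties=connection_properties)
joined_df = df.join(other_df, on="column2", how="inner")
joined_df.display()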

Example Code

Here is the complete example in a Databricks notebook:

# Database configurations
jdbc_hostname = "your_sql_server_hostname"
jdbc_port = 1433
database_name = "your_database_name"
table_name = "your_table_name"
username = "your_username"
password = "your_password"

# JDBC URL
jdbc_url = f"jdbc:sqlserver://{jdbc_hostname}:{jdbc_port};database={database_name}"

# Connection properties
# Connection properties
connection_properties = {
    "user": username,
    "password": password,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

# Read data from SQL database into DataFrame
df = spark.read.jdbc(url=jdbc_url, table=table_name, properties=connection_properties)

# Show the schema of the DataFrame
df.printSchema()

# Display the first few rows of the DataFrame
df.display()

# Example: Select specific columns
selected_df = df.select("column1", "column2")

# Example: Filter rows
filtered_df = df.filter(df["column1"] > 100)

# Show the resulting DataFrame
filtered_df.display()

Additional Notes:

  • Driver Configuration: Make sure the JDBC driver for your database is available in Databricks. The Microsoft SQL Server driver is typically included in Databricks Runtime; for other databases, you can upload the driver JAR and attach it to your cluster.
  • Connection Properties: Depending on your database and authentication method, you may need additional entries in the connection_properties dictionary (for example, Azure SQL connections typically require encryption, set via encrypt=true).
  • Security: Handle sensitive information such as database credentials carefully. Use Databricks secrets to store and access them instead of hard-coding values in notebooks (a minimal sketch follows this list).
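
As mentioned in the Security note, credentials can be pulled from a Databricks secret scope at runtime. The scope and key names below are assumptions; create them first (for example with the Databricks CLI) before running this.

# Read credentials from a Databricks secret scope instead of hard-coding them
# (the scope and key names are placeholders you must create beforehand)
username = dbutils.secrets.get(scope="sql-credentials", key="sql-username")
password = dbutils.secrets.get(scope="sql-credentials", key="sql-password")

connection_properties = {
    "user": username,
    "password": password,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}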

By following these steps, you can read data from an SQL database into Databricks using PySpark and perform various data processing tasks. Adjust the configuration and connection properties as needed based on your specific database setup.
