Spark is a popular open-source framework for distributed data processing and analytics.
It allows you to perform various tasks such as data transformation, machine learning, streaming, graph analysis, and more on large-scale datasets.
Azure Synapse Analytics is a cloud-based service that integrates data warehousing, big data analytics, and data integration.
It offers a unified platform for ingesting, preparing, managing, and serving data for various purposes.
One of the features of Azure Synapse Analytics is the ability to create and use serverless Apache Spark pools.
A serverless Spark pool lets you work with Spark without provisioning or managing clusters.
You pay only for the Spark resources used during your session, not for the pool itself.
Create a serverless Spark pool
To create a serverless Spark pool, you need to have an Azure Synapse Analytics workspace and an Azure Data Lake Storage Gen2 account.
If you don’t have them, you can follow the instructions here to create them.
Once you have your workspace and storage account ready, you can use Synapse Studio to create a serverless Spark pool.
Synapse Studio is a web-based tool that allows you to manage and interact with your Azure Synapse resources.
To create a serverless Spark pool, follow these steps:
- Open Synapse Studio and sign in with your Azure account.
- On the left-side pane, select Manage > Apache Spark pools.
- Select New.
- For the Apache Spark pool name, enter Spark1.
- For Node size, select Small.
- For the Number of nodes, set the minimum to 3 and the maximum to 3.
- Select Review + create > Create.
Your Apache Spark pool will be ready in a few seconds.
Load and analyze data with Spark
Now that you have created a serverless Spark pool, you can use it to load and analyze data with Spark.
In this example, you will use sample data from the New York City Taxi dataset, which contains information about taxi trips in NYC.
To load and analyze data with Spark, follow these steps:
- In Synapse Studio, go to the Develop hub.
- Create a new notebook by selecting the + icon and then Notebook.
- Create a new code cell and paste the following code into that cell:
%%pyspark
df = spark.read.load('abfss://users@contosolake.dfs.core.windows.net/NYCTripSmall.parquet', format='parquet')
display(df.limit(10))
This code will read the sample data from your storage account using the abfss URI scheme and load it into a Spark DataFrame named df.
Then it will display the first 10 rows of the data frame.
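Because the file format is Parquet, the same load can also be written with the Parquet-specific reader. The following cell is an equivalent sketch that uses the same path as above:
%%pyspark
# Equivalent shorthand for reading the same Parquet file
df = spark.read.parquet('abfss://users@contosolake.dfs.core.windows.net/NYCTripSmall.parquet')
display(df.limit(10))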
- In the notebook, in the Attach to menu, choose the Spark1 serverless Spark pool that you created earlier.
- Select Run on the cell. Synapse will start a new Spark session to run this cell if needed.
- If you just want to see the schema of the data frame, run a cell with the following code:
%%pyspark
df.printSchema()
This code will print the schema of the data frame, which shows the column names and types.
- To analyze the data using SQL queries, you need to save it into a table in a Spark database named nyctaxi. Add a new code cell to the notebook, and then enter the following code:
%%pyspark
spark.sql("CREATE DATABASE IF NOT EXISTS nyctaxi")
df.write.mode("overwrite").saveAsTable("nyctaxi.trip")
This code will create a database named nyctaxi if it does not exist, and then write the dataframe into a table named trip in that database.
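If you want to confirm that the table was written, you can list the tables in the database with a small optional check using standard Spark SQL:
%%pyspark
# List the tables in the nyctaxi database to confirm that trip was created
spark.sql("SHOW TABLES IN nyctaxi").show()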
- To query the data from the nyctaxi.trip table, create a new code cell and enter the following code:
%%pyspark
df = spark.sql("SELECT * FROM nyctaxi.trip")
display(df)
This code will create a new dataframe from the nyctaxi.trip table and display it.
- To perform some basic analysis of the data, such as finding the average trip distance and fare amount by passenger count, create a new code cell and enter the following code:
%%pyspark
spark.sql("""
SELECT passenger_count, AVG(trip_distance) AS avg_distance, AVG(fare_amount) AS avg_fare
FROM nyctaxi.trip
GROUP BY passenger_count
ORDER BY passenger_count
""").show()
This code will run a SQL query on the nyctaxi.trip table and show the results.
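If you prefer the DataFrame API over SQL, the same aggregation can be expressed as follows. This is an equivalent sketch that uses the column names from the query above:
%%pyspark
from pyspark.sql import functions as F

# Average trip distance and fare amount by passenger count, using the DataFrame API
trips = spark.table("nyctaxi.trip")
(trips.groupBy("passenger_count")
      .agg(F.avg("trip_distance").alias("avg_distance"),
           F.avg("fare_amount").alias("avg_fare"))
      .orderBy("passenger_count")
      .show())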
Create and run Spark job definitions
A Spark job definition describes how you want to run a Spark batch job and can be represented as a JSON file.
You can specify various properties such as the main file, the arguments, the reference files, the libraries, and the Spark configuration for your job.
You can create and run Spark job definitions in Synapse Studio using different languages such as Python, Scala, or C#.
Also, you can import or export Spark job definitions as JSON files.
In this section, you will create and run a Spark job definition for Python using a sample word count program.
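The contents of wordcount.py are not shown in this walkthrough. A minimal PySpark word-count script along the same lines might look like the sketch below, which takes the input text file and the output folder from its two command-line arguments; treat it as an illustrative example rather than the exact sample file.
import sys
from pyspark.sql import SparkSession

# The input text file and the output folder are passed as command-line arguments
input_path, output_path = sys.argv[1], sys.argv[2]

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Split each line into words, count the occurrences of each word,
# and write the (word, count) pairs to the output folder
counts = (spark.read.text(input_path).rdd
          .flatMap(lambda row: row[0].split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile(output_path)

spark.stop()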
To create and run a Spark job definition for Python, follow these steps:
- In Synapse Studio, go to the Data hub.
- Select Linked > Azure Data Lake Storage Gen2, and upload wordcount.py and shakespeare.txt into your storage account.
- Go to the Develop hub, select the + icon, and select Spark job definition to create a new Spark job definition.
- Select PySpark (Python) from the Language drop-down list in the Apache Spark job definition main window.
- Fill in the information for the Apache Spark job definition as follows:
| Property | Description |
| --- | --- |
| Job definition name | Enter a name for your Apache Spark job definition. This name can be updated at any time until it's published. Sample: wordcount |
| Main definition file | The main file used for the job. Select a PY file from your storage. You can select Upload file to upload the file to a storage account. Sample: abfss://…/path/to/wordcount.py |
| Command-line arguments | Optional arguments to the job. Sample: abfss://…/path/to/shakespeare.txt abfss://…/path/to/result Note: Two arguments for the sample job definition are separated by a space. |
| Reference files | Additional files used for reference in the main definition file. You can select Upload file to upload the file to a storage account. Sample: abfss://…/path/to/shakespeare.txt |
- Select Publish to save your Spark job definition.
- Select Submit to run your Spark job definition as a batch job.
- In the Submit Apache Spark Job Definition dialog box, select Spark1 as the target Apache Spark pool and then select Submit.
- Go to the Monitor hub to check the status of your Spark job.
When your Spark job is completed, go back to the Data hub and check the result file in your storage account.
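You can also read the output back in a notebook instead of browsing the storage account. The sketch below uses a placeholder result URI, which should match the second command-line argument of your job definition:
%%pyspark
# Replace the placeholder with the result folder you passed to the word-count job
result_path = 'abfss://<container>@<account>.dfs.core.windows.net/path/to/result'
display(spark.read.text(result_path).limit(10))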