LPSM Université Paris-Cité
launched in 2010 by M. Zaharia (UC Berkeley) et al.
… This release expands Spark’s standard libraries, introducing a new SQL package (Spark SQL) that lets users integrate SQL queries into existing Spark workflows. MLlib, Spark’s machine learning library, is expanded with sparse vector support and several new algorithms. The GraphX and Streaming libraries also introduce new features and optimizations. Spark’s core engine adds support for secured YARN clusters, a unified tool for submitting Spark applications, and several performance and stability improvements.
The major updates are API usability, SQL 2003 support, performance improvements, structured streaming, R UDF support, as well as operational improvements.
… This year is Spark’s 10-year anniversary as an open source project. Since its initial release in 2010, Spark has grown to be one of the most active open source projects. Nowadays, Spark is the de facto unified engine for big data processing, data science, machine learning and data analytics workloads.
Spark SQL is the top active component in this release. 46% of the resolved tickets are for Spark SQL. These enhancements benefit all the higher-level libraries, including structured streaming and MLlib, and higher level APIs, including SQL and DataFrames. Various related optimizations are added in this release. In TPC-DS 30TB benchmark, Spark 3.0 is roughly two times faster than Spark 2.4.
Scalability
Beyond OLTP: OLAP (and BI)
From Data Mining to Big Data
From Data Warehouses to Data Lakes
Hadoop
to industry
Note:
start-master.sh
start-worker.sh
(standalone mode)
RDD (Resilient Distributed Datasets)
RDDs support map-like operations; transformed RDDs can be reduced, and the result can be collected to the driver process
HIVE (Hadoop InteractiVE)
Spark SQL relies on Hive SQL's conventions and functions
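Going back to the RDD workflow just described, here is a minimal sketch (it assumes a running SparkContext `sc`, e.g. `spark.sparkContext`): a local collection is distributed, transformed with `map`, reduced, and the result is collected back to the driver.

```python
# Minimal RDD sketch (assumes a SparkContext `sc`, e.g. spark.sparkContext)
rdd = sc.parallelize(range(10))              # distribute a local collection
squares = rdd.map(lambda x: x * x)           # map-like transformation
total = squares.reduce(lambda a, b: a + b)   # reduce to a single value
values = squares.collect()                   # collect results to the driver
print(total, values)
```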
Since release 2.0, Spark offers a native SQL parser that supports ANSI-SQL and HiveQL
Works for analysts, data engineers, data scientists
Spark SQL is geared towards OLAP, not OLTP
Spark dataframes are RDDs (collections of Rows)
from pyspark.sql import Row
row1 = Row(name="John", age=21)
row2 = Row(name="James", age=32)
row3 = Row(name="Jane", age=18)
row1['name']  ## access a field by name -> 'John'
rows = [row1, row2, row3]
column_names = ["Name", "Age"]
df = spark.createDataFrame(rows, column_names)
df.show()
+-----+---+
| Name|Age|
+-----+---+
| John| 21|
|James| 32|
| Jane| 18|
+-----+---+
RDDs
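A lineage like the one printed below can be obtained from the dataframe's underlying RDD; this is a minimal sketch (RDD ids and call sites vary from one session to another):

```python
# Inspect the lineage of the RDD backing the dataframe defined above
print(df.rdd.toDebugString().decode())
```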
(20) MapPartitionsRDD[66] at javaToPython at NativeMethodAccessorImpl.java:0 []
| MapPartitionsRDD[65] at javaToPython at NativeMethodAccessorImpl.java:0 []
| SQLExecutionRDD[64] at javaToPython at NativeMethodAccessorImpl.java:0 []
| MapPartitionsRDD[63] at javaToPython at NativeMethodAccessorImpl.java:0 []
| MapPartitionsRDD[60] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0 []
| MapPartitionsRDD[59] at map at SerDeUtil.scala:69 []
| MapPartitionsRDD[58] at mapPartitions at SerDeUtil.scala:117 []
| PythonRDD[57] at RDD at PythonRDD.scala:53 []
| ParallelCollectionRDD[56] at readRDDFromFile at PythonRDD.scala:289 []
The Spark dataframe API offers a developer-friendly way of implementing GROUP BY queries
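As a sketch (reusing the toy dataframe with columns name, age, and gender shown later in this section), a GROUP BY written with the dataframe API:

```python
from pyspark.sql import functions as F

# Equivalent to: SELECT gender, AVG(age) AS mean_age, COUNT(*) AS n
#                FROM table GROUP BY gender
(df.groupBy("gender")
   .agg(F.avg("age").alias("mean_age"),
        F.count("*").alias("n"))
   .show())
```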
Compare the Spark Dataframe API with:
dplyr, dtplyr, dbplyr in R (Tidyverse)
Pandas
Pandas on Spark
Chaining and/or piping enable modular query construction
| Operation | Description |
|---|---|
| select | Chooses columns from the table \(\pi\) |
| selectExpr | Chooses columns and expressions from the table \(\pi\) |
| where | Filters rows based on a boolean rule \(\sigma\) |
| limit | Limits the number of rows (LIMIT ...) |
| orderBy | Sorts the DataFrame based on one or more columns (ORDER BY ...) |
| alias | Changes the name of a column (AS ...) |
| cast | Changes the type of a column |
| withColumn | Adds a new column |
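A sketch of chaining several of the operations listed above into one modular query (column names taken from the toy dataframe used in this section):

```python
# select -> where -> orderBy -> limit, chained into one query
(df.select("name", "age")
   .where("age >= 21")                 # sigma: keep rows matching the rule
   .orderBy("age", ascending=False)    # ORDER BY age DESC
   .limit(2)                           # LIMIT 2
   .show())
```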
The argument of select() is *cols, where cols can be built from column names (strings), column expressions like df.age + 10, or lists of these
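For instance, a call of this shape (a sketch; the toy dataframe is assumed to have columns nom and age, as suggested by the output below) mixes a column name with a column expression:

```python
df.select("nom", df.age + 10).show()
```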
+----+----------+
| nom|(age + 10)|
+----+----------+
|John| 31|
|Jane| 35|
+----+----------+
## In a SQL query:
df.createOrReplaceTempView("table")   ## register df so the query can be run
query = "SELECT *, 12*age AS age_months FROM table"
spark.sql(query)                      ## returns the same result as below
## Using Spark SQL API:
df.withColumn("age_months", df.age * 12).show()
## Or
df.select("*",
          (df.age * 12).alias("age_months")
          ).show()
+----+---+------+----------+
|name|age|gender|age_months|
+----+---+------+----------+
|John| 21| male| 252|
|Jane| 25|female| 300|
+----+---+------+----------+
+----+---+------+----------+
|name|age|gender|age_months|
+----+---+------+----------+
|John| 21| male| 252|
|Jane| 25|female| 300|
+----+---+------+----------+
The full list of operations that can be applied to a DataFrame
can be found in the [DataFrame doc]
The list of operations on columns can be found in the [Column docs]
R: SparkR and sparklyr
SparkR is the official Spark API for R users
sparklyr (released 2016) is the de facto Spark API for tidyverse
Spark dataframes can be handled through dplyr pipelines
# Source: spark<?> [?? x 4]
# Groups: Insul
# Ordered by: Insul, Temp
Insul Temp Gas Fn
<chr> <dbl> <dbl> <dbl>
1 After -0.7 4.8 0.0333
2 After 0.8 4.6 0.0667
3 After 1 4.7 0.1
4 After 1.4 4 0.133
5 After 1.5 4.2 0.167
6 After 1.6 4.2 0.2
7 After 2.3 4.1 0.233
8 After 2.5 4 0.3
9 After 2.5 3.5 0.3
10 After 3.1 3.2 0.333
# ℹ more rows
# ℹ Use `print(n = ...)` to see more rows
<SQL>
SELECT PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY `Temp`) AS `x`
FROM `whiteside`
dplyr queries are translated into Spark/Hive SQL
quantile() is a base R function; it is matched to the Spark/Hive percentile() function
sparklyr aims at avoiding sending R functions/objects across the cluster
…
sparklyr translates dplyr functions such as arrange into a SQL query plan that is used by SparkSQL. This is not the case with SparkR, which has functions for SparkSQL tables and Spark DataFrames.
… Databricks does not recommend combining SparkR and sparklyr APIs in the same script, notebook, or job.
Apache Spark includes Arrow-optimized execution of Python logic in the form of pandas function APIs, which allow users to apply pandas transformations directly to PySpark DataFrames. Apache Spark also supports pandas UDFs, which use similar Arrow optimizations for arbitrary user functions defined in Python.
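A minimal sketch of a pandas UDF (the column name age and the month conversion are illustrative, reusing the toy dataframe from earlier):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Vectorized (Arrow-backed) UDF: receives and returns pandas Series batches
@pandas_udf("long")
def age_to_months(age: pd.Series) -> pd.Series:
    return age * 12

df.withColumn("age_months", age_to_months(df.age)).show()
```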