Technologies Big Data

Master I MIDS & Informatique

A. Gheerbrandt, S. Gaïffas, V. Ravelomanana, S. Boucheron

Université Paris Cité

2024-02-19

Spark tips

Tip 1. Use DataFrames instead of RDDs

Instead of using the RDD API

rdd = sc.textFile("/path/to/file.txt")

Use the DataFrame API

df = spark.read.textFile("/path/to/file.txt")

The DataFrame API uses the * Catalyst* optimizer to improve the execution plan of your Spark Job
The low-level RDD API does not
Most of the recent Spark advances are towards an improvement of the SQL

Tip 2. Avoid using regular expressions

Java Regex is great to parse data in an expected structure
But, unfortunately, it is generally a slow process when processing millions of rows
Increasing a little bit the parsing of rows increases a lot the entire job
If possible, avoid using Regex’s and try to load your data in a more structured format

Tip 3. Joins: largest dataset on the left

When joining two datasets where one is smaller than the other, you must put the largest on the left

joinedDF = largeDF.join(smallDF, on="id")

The data specified on the left is static on the executors while the data on the right is transfered between the executors
Something like

joinedDF = smallDF.join(largeDF, on="id")

can be much longer or even fail if largeDF is large

Tip 4. Joins: use broadcast joining

Often, we need to join a huge dataframe with a small one
Use broadcast joins for joining small datasets to larger ones

from pyspark.sql.functions import broadcast

joinedDF = largeDF.join(broadcast(smallDF), on="id")

Usually leads to much faster joins since is allows to avoid shuffles

Tip 5. Use caching when repeating queries

If you are constantly using the same DataFrame on multiple queries, you can use caching or persistence:

df = (
  spark
    .read
    .textFile("/path/to/file.txt")
    .cache()
)

But avoid overusing this. Depending on caching strategy (in-memory then swap to disk), cache can end up being slower than reading
Storage space used for caching means less space for processing
Caching can cost more than reading the DataFrame (e.g. only few columns are useful, predictate pushdown)

Tip 6. COMPUTE STATISTICS of tables

Before querying a table, it can be helpful to compute the statistics of those tables so that Catalyst can find a better plan to process it:

query = "ANALYZE TABLE db.table COMPUTE STATISTICS"
spark.sql(query)

However, Spark does not always get everything it needs just from the above broad COMPUTE STATISTICS call

Tip 6. COMPUTE STATISTICS of tables

Also helps to check specific columns so that Catalyst can better check those columns
It’s recommended to COMPUTE STATISTICS for any columns involved in filtering and joining :

query = "ANALYZE TABLE db.table COMPUTE STATISTICS"
            " FOR COLUMNS joinColumn, filterColumn"

spark.sql(query)

Tip 7. Shuffles: know your data

Shuffle is the transportation of data between workers across a Spark cluster’s network
It’s central for operations where a reorganization of data is required, referred to as wide dependencies (wide vs narrow dependencies)
This kind of operation usually is the bottleneck of your Spark application
To use Spark well, you need to know what you shuffle, and for this it’s essential that you know your data

Tip 8. Shuffles: beware of skews

Skew is an imbalance in the distribution of your data
If you fail to account for how your data is distributed, you may find that Spark naively places an overwhelming majority of rows on one executor, and a fraction on all the rest
This is skew, and it will kill your application
Whether by causing out of memory errors, network timeouts, or exponentially long running processes that will never terminate

Tip 9. Partitions: change the default

It’s absolutely essential to model the number of partitions around the kind of stuff you’re solving
The default value for spark.sql.shuffle.partitions is 200. It controls the number of partitions used by shuffles (= number of partitions in the resulting DataFrame of RDD).
Number of shuffle partitions does not change with different data size. For small data, 200 is overkill, for large data, it does not effectively use the all resources.
Rule of thumb: set this configuration to the number of cores you have available across all your executors

(
  spark
    .conf
    .set("spark.sql.shuffle.partitions", 42)
)

Tip 10. Partitions: well-distributed columns

A powerful way to control Spark shuffles is to partition your data intelligently
Partitioning on the right column (or set of columns) helps to balance the amount of data mapped across the cluster network in order to perform actions
Partitioning on a unique ID is generally a good strategy, but don’t partition on sparsely filled columns (with many NAs) or columns that over-represent particular values

Tip 11. Joins again: highly flammable

Joins are shuffle offenders. Dangers of SQL joining are amplified by the scale enabled by Spark
Even joining medium sized data can cause an explosion if there are repeated join values on both sides of your join
Million rows datasets with “pseudo unique” keys can explode into a billions rows join!
Join columns with null values usually means massive skew and an explosive join
A solution is to pre-fill empty cells to arbitrary balanced values (e.g. uniform random values) before running a join

Tip 12. Is your data real yet?

Don’t forget that operations in Spark are divided between transformations and actions. Transformations are lazy operations allowing Spark to optimize your query
Transformations set up a DataFrame for changes (adding a column, joining it to another, etc.) but will not execute these until an action is performed.
This can result in surprising results: imagine that you create an id column using monotonically_increasing_id, and then join on that column. If you do not place an action in between, your values have not been materialized. The result will be non-deterministic!

Tip 13. Checkpointing is your friend

Checkpointing means saving data to disk and reloading it back in, which is redundant anywhere else besides Spark.
It triggers an action on any waiting transformations, and truncates the Spark query plan for the checkpointed data.
This action shows up in your Spark UI, indicating where you are in your job.
It can help to conserve resources, since it can release memory that would otherwise be cached for downstream access.
Checkpointed data is also a valuable source for data-debugging.

Tip 14. Check your runtime with monitoring

Spark UI is your friend, and so are other monitoring tools that let you know how your run is going in real-time.
The Spark UI contains information on the job level, the stage level, and the executor level. You can see if the volume of data going to each partition or each executor makes sense, if some part of your job is taking too much time.
Such a monitoring tool allowing to view your total memory and CPU usage across executors is essential for resource planning and “autopsies” on failed jobs.

Tip 15. CSV reading is brittle

Naively reading CSVs in Spark can result in silent escape-character errors

df = spark.read.csv("quote-happy.csv")

Your DataFrame seems happy: no runtime exceptions, and you can execute operations on the DataFrame
But after careful debugging, you realize that at some point in the data, everything has shifted over one or several columns!
To be safe, you can include escape and quote options in your reads. Even better: use Parquet instead of CSV files!

Tip 16. Parquet is your friend

Read/Write operations are order of magnitude more efficient with Parquet than with uncompressed CSV files
Parquet is “columnar”: reads only the columns required for a sql query and skip over those that are not requested.
And also predicate pushdown operations on filtering operations: run queries only on relevant subsets of the values.
Switching from CSV to Parquet is the first thing you can do to improve performance.
If you are generating Parquet files from another format (using PyArrow, Pandas, etc.) be conscious that creating a single parquet file gives up a major benefit of the format: you need to partition it!

Tip 17. Problems with UDFs

UDF = User Defined Function = something very convenient

>>> from pyspark.sql import functions as F, types as T

>>> data = [{'a': 1, 'b': 0}, {'a': 10, 'b': 3}]
>>> df = spark.createDataFrame(data)

>>> def calculate_a_b_ratio(a, b):
>>>     if b > 0:
>>>         return a / b
>>>     return 0.

>>> udf_ratio_calculation = F.udf(calculate_a_b_ratio, T.FloatType())

>>> df = df.withColumn('a_b_ratio_float', udf_ratio_calculation('a', 'b'))
>>> df.show()
+---+---+---------------+
|  a|  b|a_b_ratio_float|
+---+---+---------------+
|  1|  0|            0.0|
| 10|  3|      3.3333333|
+---+---+---------------+

Tip 17. Problems with UDFs

UDF are Excruciatingly slow with pyspark and spark won’t complain if the return type is incorrect and just return nulls

>>> udf_ratio_calculation = F.udf(calculate_a_b_ratio, T.DecimalType())
>>> df = df.withColumn('a_b_ratio_dec', udf_ratio_calculation('a', 'b'))
>>> df.show()
+---+---+---------------+-------------+
|  a|  b|a_b_ratio_float|a_b_ratio_dec|
+---+---+---------------+-------------+
|  1|  0|            0.0|         null|
| 10|  3|      3.3333333|         null|
+---+---+---------------+-------------+

>>> udf_ratio_calculation = F.udf(calculate_a_b_ratio, T.BooleanType())
>>> df = df.withColumn('a_b_ratio_bool', udf_ratio_calculation('a', 'b'))
>>> df.show()
+---+---+---------------+-------------+--------------+
|  a|  b|a_b_ratio_float|a_b_ratio_dec|a_b_ratio_bool|
+---+---+---------------+-------------+--------------+
|  1|  0|            0.0|         null|          null|
| 10|  3|      3.3333333|         null|          null|
+---+---+---------------+-------------+--------------+

Tip 18. Use all of the resources

Spark driver memory and executor memory are set by default to 1 Go.
It is in general very useful to take a look at the many configuration parameters and their defaults:

https://spark.apache.org/docs/latest/configuration.html

Many things there that can influence your spark application
When running locally, adjust spark.driver.memory to something that’s reasonable for your system, e.g. "8g"
When running on a cluster, you might also want to tweak the spark.executor.memory (though it depends on your cluster and its configuration).

Tip 18. Use all of the resources

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set('spark.executor.memory', '16g')
conf.set('spark.driver.memory', '8g')

spark_session = SparkSession.builder \
        .config(conf=conf) \
        .appName('Name') \
        .getOrCreate()

Interpret error messages

Error messages don’t mean what they say
Takes quite a while to understand that Spark complains about one thing, when the problem is somewhere else
"Connection reset by peer" often means that you have skews and one particular worker has run out of memory
"java.net.SocketTimeoutException: Write timed out" can mean that the number of partitions too high, so that the filesystem is too slow to handle the number of simultaneous writes attempted by Spark

Interpret error messages

"Total size of serialized results[...] is bigger than spark.driver.maxResultSize" can mean that the number of partitions is too high and results can’t fit onto a particular worker
"Column a is not a member of table b": you have a sql join error. Try your job locally on a small sample to avoid reverse engineering of such errors
Sometimes you get a true "out of memory" error. You can increase the size of individual workers, but before you do that, ask yourself, is the data well distributed ?

Interpret error messages

"ClassNotFoundException": usually when you are trying to connect your application to an external a database. Here is an example

Interpret error messages

Means that Spark cannot find the necessary jar driver to connect to the database
Need to provide the correct jars to your application using the spark configuration or as a command line argument

from pyspark import SparkConf
from pyspark.sql import SparkSession

jars = "/full/path/to/postgres.jar,/full/path/to/other/jar"
conf = SparkConf()
conf.set("spark.jars", jars)

spark = (
  SparkSession
    .builder
    .config(conf=conf)
    .appName('test')
    .getOrCreate()
)

spark-submit --jars /full/path/to/postgres.jar,/full/path/to/other/jar ...

Interpret error messages

All the jars must be accessible to all nodes and not local to the driver.
This error might also mean a Spark version mismatch between the cluster components
Make sure there is no space between the commas in the list of jars.

Interpret error messages

Trying to connect to a database: "java.sql.SQLException: No suitable driver"

Interpret error messages

Error happens while trying to save to a database: "java.lang.NullPointerException"

Interpret error messages

This errors usually mean that we forgot to set the driver, "org.postgresql.Driver" for Postgres:

df = spark.read.format('jdbc').options(
    url= 'db_url',
    driver='org.postgresql.Driver',  # <-- here
    dbtable='table_name',
    user='user',
    password='password'
).load()

and also make sure that the drivers’ jars are set.

Interpret error messages

Horrible error : 'NoneType' object has no attribute '_jvm'

…mainly comes from two mistakes

Interpret error messages

You are using pyspark functions without having an active spark session

from pyspark.sql import SparkSession, functions as fn

class A(object):
    def __init__(self):
        self.calculations = fn.col('a') / fn.col('b')
...
# Instantiating A without an active spark session 
# will give you this error
a = A()

Interpret error messages

You are using pyspark functions within a UDF:

# Create a dataframe
data = [{'a': 1, 'b': 0}, {'a': 10, 'b': 3}]
df = spark.createDataFrame(data)

# Define a simple function that returns a / b
def calculate_a_b_max(a, b):
    return F.max([a, b])

# and a udf for this function - notice the return datatype
udf_max_calculation = F.udf(calculate_a_b_ratio, T.FloatType())

df = df.withColumn('a_b_max', udf_max_calculation('a', 'b'))

df.show()

We CANNOT use pyspark functions inside a udf: a UDF operates on a row per row basis while pyspark functions on a column basis.

Technologies Big Data

Spark tips

Tip 1. Use DataFrames instead of RDDs

Tip 2. Avoid using regular expressions

Tip 3. Joins: largest dataset on the left

Tip 4. Joins: use broadcast joining

Tip 5. Use caching when repeating queries

Tip 6. COMPUTE STATISTICS of tables

Tip 6. COMPUTE STATISTICS of tables

Tip 7. Shuffles: know your data

Tip 8. Shuffles: beware of skews

Tip 9. Partitions: change the default

Tip 10. Partitions: well-distributed columns

Tip 11. Joins again: highly flammable

Tip 12. Is your data real yet?

Tip 13. Checkpointing is your friend

Tip 14. Check your runtime with monitoring

Tip 15. CSV reading is brittle

Tip 16. Parquet is your friend

Tip 17. Problems with UDFs

Tip 17. Problems with UDFs

Tip 18. Use all of the resources

Tip 18. Use all of the resources

Interpret error messages

Interpret error messages

Interpret error messages

Interpret error messages

Interpret error messages

Interpret error messages

Interpret error messages

Interpret error messages

Interpret error messages

Interpret error messages

Interpret error messages

Interpret error messages

Thank you !