Master I MIDS & Informatique
Université Paris Cité
2024-02-19
RDD
APIDataFrame
APIThe DataFrame API uses the * Catalyst
* optimizer to improve the execution plan of your Spark Job
The low-level RDD
API does not
Most of the recent Spark advances are towards an improvement of the SQL
Java Regex
is great to parse data in an expected structure
But, unfortunately, it is generally a slow process when processing millions of rows
Increasing a little bit the parsing of rows increases a lot the entire job
If possible, avoid using Regex’s and try to load your data in a more structured format
The data specified on the left is static on the executors while the data on the right is transfered between the executors
Something like
can be much longer or even fail if largeDF
is large
Often, we need to join a huge dataframe with a small one
Use broadcast joins for joining small datasets to larger ones
But avoid overusing this. Depending on caching strategy (in-memory then swap to disk), cache can end up being slower than reading
Storage space used for caching means less space for processing
Caching can cost more than reading the DataFrame (e.g. only few columns are useful, predictate pushdown)
COMPUTE STATISTICS
callAlso helps to check specific columns so that Catalyst can better check those columns
It’s recommended to COMPUTE STATISTICS for any columns involved in filtering and joining :
Shuffle is the transportation of data between workers across a Spark cluster’s network
It’s central for operations where a reorganization of data is required, referred to as wide dependencies (wide vs narrow dependencies)
This kind of operation usually is the bottleneck of your Spark application
To use Spark well, you need to know what you shuffle, and for this it’s essential that you know your data
Skew is an imbalance in the distribution of your data
If you fail to account for how your data is distributed, you may find that Spark naively places an overwhelming majority of rows on one executor, and a fraction on all the rest
This is skew, and it will kill your application
Whether by causing out of memory errors, network timeouts, or exponentially long running processes that will never terminate
It’s absolutely essential to model the number of partitions around the kind of stuff you’re solving
The default value for spark.sql.shuffle.partitions
is 200. It controls the number of partitions used by shuffles (= number of partitions in the resulting DataFrame of RDD).
Number of shuffle partitions does not change with different data size. For small data, 200 is overkill, for large data, it does not effectively use the all resources.
Rule of thumb: set this configuration to the number of cores you have available across all your executors
A powerful way to control Spark shuffles is to partition your data intelligently
Partitioning on the right column (or set of columns) helps to balance the amount of data mapped across the cluster network in order to perform actions
Partitioning on a unique ID is generally a good strategy, but don’t partition on sparsely filled columns (with many NAs) or columns that over-represent particular values
Joins are shuffle offenders. Dangers of SQL joining are amplified by the scale enabled by Spark
Even joining medium sized data can cause an explosion if there are repeated join values on both sides of your join
Million rows datasets with “pseudo unique” keys can explode into a billions rows join!
Join columns with null values usually means massive skew and an explosive join
A solution is to pre-fill empty cells to arbitrary balanced values (e.g. uniform random values) before running a join
Don’t forget that operations in Spark are divided between transformations and actions. Transformations are lazy operations allowing Spark to optimize your query
Transformations set up a DataFrame for changes (adding a column, joining it to another, etc.) but will not execute these until an action is performed.
This can result in surprising results: imagine that you create an id column using monotonically_increasing_id
, and then join on that column. If you do not place an action in between, your values have not been materialized. The result will be non-deterministic!
Checkpointing means saving data to disk and reloading it back in, which is redundant anywhere else besides Spark.
It triggers an action on any waiting transformations, and truncates the Spark query plan for the checkpointed data.
This action shows up in your Spark UI, indicating where you are in your job.
It can help to conserve resources, since it can release memory that would otherwise be cached for downstream access.
Checkpointed data is also a valuable source for data-debugging.
Spark UI is your friend, and so are other monitoring tools that let you know how your run is going in real-time.
The Spark UI contains information on the job level, the stage level, and the executor level. You can see if the volume of data going to each partition or each executor makes sense, if some part of your job is taking too much time.
Such a monitoring tool allowing to view your total memory and CPU usage across executors is essential for resource planning and “autopsies” on failed jobs.
Your DataFrame seems happy: no runtime exceptions, and you can execute operations on the DataFrame
But after careful debugging, you realize that at some point in the data, everything has shifted over one or several columns!
To be safe, you can include escape
and quote
options in your reads. Even better: use Parquet
instead of CSV files!
Read/Write operations are order of magnitude more efficient with Parquet
than with uncompressed CSV files
Parquet is “columnar”: reads only the columns required for a sql query and skip over those that are not requested.
And also predicate pushdown operations on filtering operations: run queries only on relevant subsets of the values.
Switching from CSV to Parquet is the first thing you can do to improve performance.
If you are generating Parquet files from another format (using PyArrow
, Pandas
, etc.) be conscious that creating a single parquet file gives up a major benefit of the format: you need to partition it!
UDF = User Defined Function = something very convenient
>>> from pyspark.sql import functions as F, types as T
>>> data = [{'a': 1, 'b': 0}, {'a': 10, 'b': 3}]
>>> df = spark.createDataFrame(data)
>>> def calculate_a_b_ratio(a, b):
>>> if b > 0:
>>> return a / b
>>> return 0.
>>> udf_ratio_calculation = F.udf(calculate_a_b_ratio, T.FloatType())
>>> df = df.withColumn('a_b_ratio_float', udf_ratio_calculation('a', 'b'))
>>> df.show()
+---+---+---------------+
| a| b|a_b_ratio_float|
+---+---+---------------+
| 1| 0| 0.0|
| 10| 3| 3.3333333|
+---+---+---------------+
UDF are Excruciatingly slow with pyspark
and spark won’t complain if the return type is incorrect and just return nulls
>>> udf_ratio_calculation = F.udf(calculate_a_b_ratio, T.DecimalType())
>>> df = df.withColumn('a_b_ratio_dec', udf_ratio_calculation('a', 'b'))
>>> df.show()
+---+---+---------------+-------------+
| a| b|a_b_ratio_float|a_b_ratio_dec|
+---+---+---------------+-------------+
| 1| 0| 0.0| null|
| 10| 3| 3.3333333| null|
+---+---+---------------+-------------+
>>> udf_ratio_calculation = F.udf(calculate_a_b_ratio, T.BooleanType())
>>> df = df.withColumn('a_b_ratio_bool', udf_ratio_calculation('a', 'b'))
>>> df.show()
+---+---+---------------+-------------+--------------+
| a| b|a_b_ratio_float|a_b_ratio_dec|a_b_ratio_bool|
+---+---+---------------+-------------+--------------+
| 1| 0| 0.0| null| null|
| 10| 3| 3.3333333| null| null|
+---+---+---------------+-------------+--------------+
Spark driver memory and executor memory are set by default to 1 Go.
It is in general very useful to take a look at the many configuration parameters and their defaults:
https://spark.apache.org/docs/latest/configuration.html
Many things there that can influence your spark application
When running locally, adjust spark.driver.memory
to something that’s reasonable for your system, e.g. "8g"
When running on a cluster, you might also want to tweak the spark.executor.memory
(though it depends on your cluster and its configuration).
Error messages don’t mean what they say
Takes quite a while to understand that Spark complains about one thing, when the problem is somewhere else
"Connection reset by peer"
often means that you have skews and one particular worker has run out of memory
"java.net.SocketTimeoutException: Write timed out"
can mean that the number of partitions too high, so that the filesystem is too slow to handle the number of simultaneous writes attempted by Spark
"Total size of serialized results[...] is bigger than spark.driver.maxResultSize"
can mean that the number of partitions is too high and results can’t fit onto a particular worker
"Column a is not a member of table b"
: you have a sql join error. Try your job locally on a small sample to avoid reverse engineering of such errors
Sometimes you get a true "out of memory"
error. You can increase the size of individual workers, but before you do that, ask yourself, is the data well distributed ?
"ClassNotFoundException"
: usually when you are trying to connect your application to an external a database. Here is an exampleMeans that Spark cannot find the necessary jar driver to connect to the database
Need to provide the correct jars to your application using the spark configuration or as a command line argument
from pyspark import SparkConf
from pyspark.sql import SparkSession
jars = "/full/path/to/postgres.jar,/full/path/to/other/jar"
conf = SparkConf()
conf.set("spark.jars", jars)
spark = (
SparkSession
.builder
.config(conf=conf)
.appName('test')
.getOrCreate()
)
or
All the jars must be accessible to all nodes and not local to the driver.
This error might also mean a Spark version mismatch between the cluster components
Make sure there is no space between the commas in the list of jars.
Trying to connect to a database: "java.sql.SQLException: No suitable driver"
Error happens while trying to save to a database: "java.lang.NullPointerException"
This errors usually mean that we forgot to set the driver, "org.postgresql.Driver"
for Postgres
:
df = spark.read.format('jdbc').options(
url= 'db_url',
driver='org.postgresql.Driver', # <-- here
dbtable='table_name',
user='user',
password='password'
).load()
and also make sure that the drivers’ jars are set.
Horrible error : 'NoneType' object has no attribute '_jvm'
…mainly comes from two mistakes
# Create a dataframe
data = [{'a': 1, 'b': 0}, {'a': 10, 'b': 3}]
df = spark.createDataFrame(data)
# Define a simple function that returns a / b
def calculate_a_b_max(a, b):
return F.max([a, b])
# and a udf for this function - notice the return datatype
udf_max_calculation = F.udf(calculate_a_b_ratio, T.FloatType())
df = df.withColumn('a_b_max', udf_max_calculation('a', 'b'))
df.show()
We CANNOT use pyspark functions inside a udf: a UDF operates on a row per row basis while pyspark functions on a column basis.