Master I MIDS & Informatique
Université Paris Cité
2024-02-19
What happens when we do a reduceByKey on an RDD?
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> rdd.reduceByKey(lambda a, b: a + b).collect()
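For this toy input, the collected result is [('a', 2), ('b', 1)] (the order of the pairs may vary).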
Let’s have a look at the Spark UI available at http://localhost:4040/jobs/
Spark has to move data from one node to another so that it is grouped with its key. Doing this is called shuffling. It is never called directly: it happens behind the curtains for some other functions, such as reduceByKey above.
This can be very expensive, mostly because of network latency.
reduceByKey results in one key-value pair per key, and this single key-value pair cannot span multiple workers.
>>> from collections import namedtuple
>>> columns = ["client_id", "destination", "price"]
>>> CFFPurchase = namedtuple("CFFPurchase", columns)
>>> CFFPurchase(100, "Geneva", 22.25)
CFFPurchase(client_id=100, destination='Geneva', price=22.25)
Goal: calculate how many trips were made and how much money was spent by each client.
>>> purchases = [
CFFPurchase(100, "Geneva", 22.25),
CFFPurchase(100, "Zurich", 42.10),
CFFPurchase(100, "Fribourg", 12.40),
CFFPurchase(101, "St.Gallen", 8.20),
CFFPurchase(101, "Lucerne", 31.60),
CFFPurchase(100, "Basel", 16.20)
]
>>> purchases = sc.parallelize(purchases)
>>> purchases.collect()
[CFFPurchase(client_id=100, destination='Geneva', price=22.25),
CFFPurchase(client_id=100, destination='Zurich', price=42.1),
CFFPurchase(client_id=100, destination='Fribourg', price=12.4),
CFFPurchase(client_id=101, destination='St.Gallen', price=8.2),
CFFPurchase(client_id=101, destination='Lucerne', price=31.6),
CFFPurchase(client_id=100, destination='Basel', price=16.2)]
>>> purchases_per_client = (purchases
# Pair RDD
.map(lambda p: (p.client_id, p.price))
# RDD[p.client_id, List[p.price]]
.groupByKey()
.map(lambda p: (p[0], (len(p[1]), sum(p[1]))))
.collect()
)
>>> purchases_per_client
[(100, (4, 92.95)), (101, (2, 39.8))]
How would this look on a cluster? (Imagine that the dataset has millions of purchases.)
This shuffling is very expensive because of latency. Can we do a better job?
Perhaps we can reduce before we shuffle, in order to greatly reduce the amount of data sent over the network. This is what reduceByKey does.
>>> purchases_per_client = (purchases
.map(lambda p: (p.client_id, (1, p.price)))
.reduceByKey(lambda v1, v2: (v1[0] + v2[0], v1[1] + v2[1]))
.collect()
)
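This produces the same result as before, [(100, (4, 92.95)), (101, (2, 39.8))], but far less data has to travel across the network.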
This is how it looks on the cluster: groupByKey (left) vs reduceByKey (right). With reduceByKey we shuffle considerably less data.
Benefits of this approach:
groupByKey requires collecting all key-value pairs with the same key on the same machine, while reduceByKey reduces locally before shuffling.
How does Spark know which key to put on which machine?
The data within an RDD is split into several partitions. Some properties of partitions:
- Partitions never span multiple machines: tuples in the same partition are guaranteed to be on the same machine.
- Each machine in the cluster contains one or more partitions.
- The number of partitions is configurable; by default it equals the total number of cores on all executor nodes.
Two kinds of partitioning are available in Spark:
- Hash partitioning
- Range partitioning
Customizing the partitioning is only possible on a PairRDD or a DataFrame, namely something with keys.
Given a Pair RDD that should be grouped, groupByKey first computes for each tuple (k, v) its partition p:

p = k.hashCode() % numPartitions

Then, all tuples in the same partition p are sent to the machine hosting p.
Intuition: hash partitioning attempts to spread data evenly across partitions based on the key.
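As a minimal pyspark sketch (reusing the purchases RDD from above; partitionBy hashes each key with portable_hash by default):
>>> pairs = purchases.map(lambda p: (p.client_id, p.price))
>>> pairs = pairs.partitionBy(4)     # hash partitioning into 4 partitions
>>> pairs.glom().map(len).collect()  # tuples per partition: same key => same partition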
The other kind of partitioning is range partitioning. It only applies to keys that have an ordering defined (int, String, etc.). Using a range partitioner, keys are partitioned according to 2 things:
- an ordering for keys
- a set of sorted ranges of keys
(key, value) pairs with keys in the same range end up in the same partition.
Consider a Pair RDD with keys [8, 96, 240, 400, 401, 800] and a desired number of partitions of 4. Hash partitioning (here p = k % 4) leads to:
- Partition 0: [8, 96, 240, 400, 800]
- Partition 1: [401]
- Partition 2: []
- Partition 3: []
This results in a very unbalanced distribution, which hurts performance: almost all the data sits in a single partition (and thus on one node), so there is little parallelism.
In this case, range partitioning can improve the distribution significantly. Assume the 4 ranges [1-200], [201-400], [401-600], [601-800]. Based on these, the keys are distributed as follows:
- Partition 0: [8, 96]
- Partition 1: [240, 400]
- Partition 2: [401]
- Partition 3: [800]
This is much more balanced.
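A quick sanity check in plain Python (assuming hash(k) == k for these small integers, and ranges of width 200):
>>> keys = [8, 96, 240, 400, 401, 800]
>>> {k: k % 4 for k in keys}             # hash partitioning: p = hash(k) % numPartitions
{8: 0, 96: 0, 240: 0, 400: 0, 401: 1, 800: 0}
>>> {k: (k - 1) // 200 for k in keys}    # range partitioning: bucket of width 200
{8: 0, 96: 0, 240: 1, 400: 1, 401: 2, 800: 3}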
How do we set a partitioning for our data?
- On a Pair RDD: call partitionBy, providing an explicit Partitioner (scala only; in pyspark, pass a partitioning function).
- On a DataFrame: call repartition for hash partitioning and repartitionByRange for range partitioning.
- Use transformations that return an RDD or a DataFrame with a specific partitioner.
RDD: using partitionBy
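A minimal pyspark sketch (the explicit Partitioner class is scala-only; here we pass the number of partitions and, optionally, an illustrative partitioning function that maps a key to an integer):
>>> pairs = purchases.map(lambda p: (p.client_id, p.price))
>>> # the partition of a key k is partitionFunc(k) % numPartitions
>>> pairs = pairs.partitionBy(8, partitionFunc=lambda client_id: client_id).persist()
>>> pairs.getNumPartitions()
8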
Using RangePartitioner with pyspark requires:
- specifying the desired number of partitions;
- providing a DataFrame with orderable keys.
pairs = purchases.map(lambda p: (p.client_id, p.price))
pairs = spark.createDataFrame(pairs, ["id", "price"])
# repartitionByRange returns a new DataFrame, so reassign it before use
pairs = pairs.repartitionByRange(3, "price").persist()
pairs.show()
Important: the result of partitionBy, repartition and repartitionByRange should be persisted. Otherwise the partitioning is repeatedly applied (with shuffling!) each time the partitioned data is used.
Pair RDDs that are the result of a transformation on a partitioned Pair RDD typically use the same hash partitioner.
Some operations on RDDs automatically result in an RDD with a known partitioner - when it makes sense.
Examples:
- When using sortByKey, a RangePartitioner is used.
- With groupByKey, a default hash partitioner is used.
Operations that result in an RDD holding (and propagating) a partitioner:
- cogroup, groupWith, join, leftOuterJoin, rightOuterJoin
- [group, reduce, fold, combine]ByKey, partitionBy, sort
- mapValues, flatMapValues, filter (if the parent has a partitioner)
All other operations will produce a result without a partitioner!
Consider the map transformation. Given that we have a hash-partitioned Pair RDD, why lose the partitioner in the returned RDD? Because it is possible for map or flatMap to change the key:
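For instance, nothing prevents a map from collapsing every key to the same constant (a hypothetical snippet, reusing the purchases RDD):
>>> pairs = purchases.map(lambda p: (p.client_id, p.price)).partitionBy(8)
>>> pairs.map(lambda kv: ("doh!", kv[1]))   # every key is now the string "doh!"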
If the map transformation preserved the previous partitioner, it would no longer make sense: all the keys are the same after this map.
Hence, use mapValues: it lets us apply map transformations without changing the keys, thereby preserving the partitioner.
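In pyspark we can check this through the partitioner attribute of an RDD (a small sketch, continuing with the pairs RDD above):
>>> pairs.partitioner is not None                                      # True: hash-partitioned
>>> pairs.mapValues(lambda price: 2 * price).partitioner is not None   # True: partitioner preserved
>>> pairs.map(lambda kv: (kv[0], 2 * kv[1])).partitioner is not None   # False: partitioner lost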
Why would we want to repartition the data?
Because it can bring substantial performance gains, especially before shuffles.
We saw that using reduceByKey instead of groupByKey localizes data better (it reduces locally before shuffling) and thus reduces latency to deliver performance gains.
By manually repartitioning the data for the same example as before, we can improve the performance even further. By using range partitioners we can optimize the use of reduceByKey in that example so that it does not involve any shuffling over the network at all!
Compared to what we did previously, we use sortByKey to produce a range partitioner for the RDD, which we immediately persist.
>>> pairs = purchases.map(lambda p: (p.client_id, (1, p.price)))
>>> pairs = pairs.sortByKey().persist()
>>> pairs.reduceByKey(
lambda v1, v2: (v1[0] + v2[0], v1[1] + v2[1])
).collect()
This typically leads to much faster computations in this case (for large RDDs, not the small toy example from before).
When joining two DataFrames where one is small enough to fit in memory, the small one is broadcasted to all the workers where the large DataFrame resides, and a hash join is performed. This has two phases:
- a broadcast phase: the small DataFrame is sent to every executor holding partitions of the large one;
- a hash join phase: each executor joins its partitions of the large DataFrame with its local copy of the small one.
There is therefore no shuffling involved, and this can be much faster than a regular join.
The default threshold for broadcasting is 10485760 bytes, meaning 10MB. It can be changed via the spark.sql.autoBroadcastJoinThreshold configuration entry.
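For instance, from pyspark:
>>> spark.conf.get("spark.sql.autoBroadcastJoinThreshold")                     # default: 10 MB
>>> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)   # raise to 50 MB
>>> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)                 # disable automatic broadcasting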
"spark.broadcast.compress"
can be used to configure whether to compress the data before sending it (True
by default).
It uses the compression specified in "spark.io.compression.codec config"
and the default is "lz4"
. We can use other compression codecs but what the hell.
More important: even when a DataFrame is small, Spark sometimes cannot estimate its size. We can enforce broadcasting with a broadcast hint.
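In pyspark the hint is the broadcast function (a minimal sketch; small_df and large_df are hypothetical DataFrames sharing a client_id column):
>>> from pyspark.sql.functions import broadcast
>>> large_df.join(broadcast(small_df), on="client_id")   # small_df is force-broadcasted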
If a broadcast hint is specified, the side with the hint will be broadcasted irrespective of autoBroadcastJoinThreshold.
If both sides have broadcast hints, the side with the smallest estimated size will be broadcasted.
If there is no hint and the estimated size of a DataFrame is below autoBroadcastJoinThreshold, that table is usually broadcasted.
Spark has a BitTorrent-like implementation to perform broadcasts. This avoids the driver becoming the bottleneck when sending data to many executors.
Usually, a broadcast join performs faster than other join algorithms when the broadcast side is small enough.
However, broadcasting tables is network-intensive and can cause out-of-memory errors, or even perform worse than other joins, if the broadcasted table is too large.
Broadcast join is not supported for a full outer join. For a right outer join only the left-side table can be broadcasted, and for the other (left) joins only the right table can be broadcasted.
Besides broadcast joins, Spark mainly uses two strategies for joining: sort merge join and shuffle hash join.
Sort merge join is the default join strategy, since it is very scalable and performs better than other joins most of the time.
Shuffle hash join is used as the join strategy when spark.sql.join.preferSortMergeJoin is set to False (among other conditions, e.g. one side being small enough to build a per-partition hash map).
Shuffle hash join has 2 phases:
- a shuffle phase: both tables are repartitioned by the join key, so that rows with the same key end up in the same partition;
- a hash join phase: within each partition, a hash table is built from the smaller side and probed with the rows of the other side.
Thus, shuffle hash join breaks the big join of two tables apart into localized, smaller chunks.
By default, sort merge join is preferred over shuffle hash join. ShuffledHashJoin is still useful when building a per-partition hash map of the smaller side is cheaper than sorting both sides, typically when one side is much smaller than the other. This also explains why hashing is used for broadcast joins.
Sort merge join is Spark’s default join strategy whenever the join keys are sortable (and the tables are too large to be broadcasted).
It is very scalable and is inherited from Hadoop and map-reduce programs. What makes it scalable is that it can spill data to disk and does not require the entire data to fit in memory.
It has three phases:
- a shuffle phase: both tables are repartitioned by the join key;
- a sort phase: within each partition, both sides are sorted by the join key;
- a merge phase: the two sorted sides are traversed together and rows with matching keys are joined.
There are other join types, such as BroadcastNestedLoopJoin, used in corner cases where no join keys are specified and either there is a broadcast hint or the size of a table is below autoBroadcastJoinThreshold.
In words: don’t use these. If you see them in an execution plan or in the Spark UI, it usually means that something has been done poorly.
Sort merge join is the default join and performs well in most of the scenarios.
In some cases, if you are confident enough that shuffle hash join is better than sort merge join, you can disable sort merge join for those scenarios.
Tune spark.sql.autoBroadcastJoinThreshold accordingly if deemed necessary. Try to use broadcast joins wherever possible, and filter out rows irrelevant to the join key before the join to avoid unnecessary data shuffling.
Joins without unique join keys or no join keys can often be very expensive and should be avoided.
Rule of thumb: a shuffle can occur when the resulting data depends on data living in other partitions (of the same or of another RDD/DataFrame).
We can also figure out if a shuffle has been planned or executed via:
- the return type of certain transformations, e.g. a ShuffledRDD (visible in the scala REPL only);
- calling toDebugString on an RDD to see its execution plan:
>>> print(pairs.toDebugString().decode("utf-8"))
(3) PythonRDD[157] at RDD at PythonRDD.scala:53 [Memory Serialized 1x Replicated]
| CachedPartitions: 3; MemorySize: 233.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
| MapPartitionsRDD[156] at mapPartitions at PythonRDD.scala:133 [Memory Serialized 1x Replicated]
| ShuffledRDD[155] at partitionBy at <unknown>:0 [Memory Serialized 1x Replicated]
+-(8) PairwiseRDD[154] at sortByKey at <ipython-input-35-112008c310ec>:2 [Memory Serialized 1x Replicated]
| PythonRDD[153] at sortByKey at <ipython-input-35-112008c310ec>:2 [Memory Serialized 1x Replicated]
| ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195 [Memory Serialized 1x Replicated]
Operations that might cause a shuffle: cogroup, groupWith, join, leftOuterJoin, rightOuterJoin, groupByKey, reduceByKey, combineByKey, distinct, intersection, repartition, coalesce.
When can we avoid shuffles using partitioning?
- reduceByKey running on a pre-partitioned RDD will cause the values to be computed locally, requiring only the final reduced values to be sent to the driver.
- join called on 2 RDDs that are pre-partitioned with the same partitioner and cached on the same machines will cause the join to be computed locally, with no shuffling across the network (see the sketch below).
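A sketch of that second point, reusing the purchases RDD (the number of partitions is arbitrary):
>>> prices = purchases.map(lambda p: (p.client_id, p.price)).partitionBy(8).persist()
>>> trips = purchases.map(lambda p: (p.client_id, p.destination)).partitionBy(8).persist()
>>> prices.join(trips).collect()   # both sides share the same partitioner: joined locally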
We have seen that some transformations are significantly more expensive (in latency) than others. This is often explained by wide versus narrow dependencies, which describe the relationships between RDDs in the graph of computations and have a lot to do with shuffling.
Computations on RDDs are represented as a lineage graph: a DAG representing the computations done on the RDD.
This DAG is what Spark analyzes to do optimizations. Thanks to this, it is possible to step back from an operation and figure out how a result is derived from a particular point of the computation.
RDDs are made up of 4 parts:
- Partitions: atomic pieces of the dataset; one or many per worker.
- Dependencies: model the relationship between this RDD and its partitions and the RDD(s) it was derived from (dependencies may be modeled per partition).
- A function for computing the dataset based on its parent RDDs.
- Metadata about the partitioning scheme and data placement.
RDD dependencies and shuffles
Transformations can cause shuffles, and can have 2 kinds of dependencies:
- Narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD. Fast! No shuffle is necessary and optimizations like pipelining are possible. Transformations with narrow dependencies are fast.
- Wide dependencies: a partition of the parent RDD may be used by several partitions of the child RDD, as in:
                           ---> [child RDD partition 1]
    [parent RDD partition] ---> [child RDD partition 2]
                           ---> [child RDD partition 3]
  Slow! A shuffle is necessary for all or some of the data over the network. Transformations with wide dependencies are slow.
Assume that we have the following DAG:
What do the dependencies look like? Which is wide and which is narrow?
join is narrow because groupByKey already partitions the keys and places them appropriately in B. join operations can be narrow or wide depending on lineage.
Transformations with (usually) narrow dependencies:
map, mapValues, flatMap, filter, mapPartitions, mapPartitionsWithIndex
Transformations with (usually) wide dependencies (might cause a shuffle):
cogroup, groupWith, join, leftOuterJoin, rightOuterJoin, groupByKey, reduceByKey, combineByKey, distinct, intersection, repartition, coalesce
These lists are usually correct but, as seen above for join, the correct answer depends on lineage.
How do we find out if an operation is wide or narrow?
- Monitor the job with the Spark UI and check whether ShuffledRDDs are used.
- Use the toDebugString method. It prints the RDD lineage along with other information relevant to scheduling. Indentations separate groups of narrow transformations that may be pipelined together from wide transformations that require shuffles. These groupings are called stages.
>>> print(pairs.toDebugString().decode("utf-8"))
(3) PythonRDD[157] at RDD at PythonRDD.scala:53 [Memory Serialized 1x Replicated]
| CachedPartitions: 3; MemorySize: 233.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
| MapPartitionsRDD[156] at mapPartitions at PythonRDD.scala:133 [Memory Serialized 1x Replicated]
| ShuffledRDD[155] at partitionBy at <unknown>:0 [Memory Serialized 1x Replicated]
+-(8) PairwiseRDD[154] at sortByKey at <ipython-input-35-112008c310ec>:2 [Memory Serialized 1x Replicated]
| PythonRDD[153] at sortByKey at <ipython-input-35-112008c310ec>:2 [Memory Serialized 1x Replicated]
| ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195 [Memory Serialized 1x Replicated]
With the scala and java APIs, there is also a dependencies method on RDDs. It returns a sequence of Dependency objects, which are the dependencies used by Spark’s scheduler to know how this RDD depends on other RDDs (or on itself). The types of dependency objects that this method may return include:
- narrow dependencies: OneToOneDependency, PruneDependency, RangeDependency
- wide dependencies: ShuffleDependency
Example in scala:
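A minimal sketch (assuming wordsRdd is a small RDD of characters, as in the toDebugString example below; the exact printed output will differ):
val pairs = wordsRdd.map(c => (c, 1)).groupByKey()
pairs.dependencies
// returns a Seq containing an org.apache.spark.ShuffleDependency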
Also, toDebugString is more precise with the scala API:
val pairs = wordsRdd.map(c=>(c,1))
.groupByKey
.toDebugString
// pairs: String =
// (8) ShuffledRDD[219] at groupByKey at <console>:38 []
// +-(8) MapPartitionsRDD[218] at map at <console>:37 []
// | ParallelCollectionRDD[217] at parallelize at <console>:36 []
We can see immediately that a ShuffledRDD is used.
Lineages are the key to fault tolerance in Spark.
Ideas from functional programming enable fault tolerance in Spark:
- RDDs are immutable;
- we use higher-order functions such as map, flatMap, filter to do functional transformations on this immutable data;
- the function for computing a dataset based on its parent RDDs is part of the RDD’s representation.
This is all done in Spark RDDs, and a by-product of these ideas is fault tolerance: a lost partition can be recomputed by replaying the transformations recorded in its lineage, without having to replicate the data. Aren’t you amazed by this?!
If a partition fails, Spark recomputes it from its lineage to get back on track.
spark.sql
What is the Catalyst optimizer?
An optimizer that automatically finds out the most efficient plan to execute data operations specified in the user’s program.
It translates transformations used to build the dataset to an optimized physical plan of execution, which is a DAG of low-level operations on RDDs.
A precious tool for spark.sql in terms of performance. It understands the structure of the data used and of the operations made on it, so the optimizer can make decisions that help reduce execution time.
Let’s first define some terminology used in the optimizer:
Logical plan: series of algebraic or language constructs, for example: SELECT, GROUP BY, UNION, etc. Usually represented as a DAG where nodes are the constructs.
Physical plan: similar to the logical plan, also represented by a DAG but concerning low-level operations (operations on RDDs).
Unoptimized/optimized plans: a plan becomes optimized once the optimizer has passed over it and made some optimizations, such as merging filter() calls or replacing some instructions with faster ones.
Catalyst helps to move from an unoptimized logical query plan to an optimized physical plan in several steps:
- Try to optimize the logical plan through predefined rule-based optimizations, such as constant folding, predicate pushdown, projection pruning and expression simplification (for example, replacing simple LIKE patterns with startsWith or contains). This produces the optimized logical plan.
- Construct multiple physical plans from the optimized logical plan. These are also optimized; examples include merging different filter() calls and pushing predicates/projections down to the data source to eliminate data directly at the source.
- Determine which physical plan has the lowest cost of execution and choose it as the physical plan used for the computation.
- Generate bytecode for the best physical plan thanks to a scala feature called quasiquotes.
Once a physical plan is chosen, it is executed and the retrieved data is put into the output DataFrame.
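To see the plans Catalyst produces, we can call explain on a DataFrame (a small sketch, reusing the pairs DataFrame from the range-partitioning example):
>>> # prints the parsed, analyzed and optimized logical plans, plus the physical plan
>>> pairs.filter(pairs.price > 20).explain(True)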
Let’s understand how the Catalyst optimizer works for a given query.
By performing these transformations, Catalyst improves the execution time of relational queries and makes performance less dependent on exactly how the query is written.
Catalyst makes use of some powerful functional programming features of Scala to allow developers to concisely specify complex relational optimizations.
Catalyst helps, but only when it can: explicit schemas, precise function calls and a clever order of operations all help Catalyst.