name: inter-slide class: left, middle, inverse {{ content }} --- name: layout-general layout: true class: left, middle <style> .remark-slide-number { position: inherit; } .remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 4px; display: block; left: 0; right: 0; } .remark-slide-number .progress-bar { height: 100%; background-color: red; } /* custom.css */ .plot-callout { width: 300px; bottom: 5%; right: 5%; position: absolute; padding: 0px; z-index: 100; } .plot-callout img { width: 100%; border: 1px solid #23373B; } </style>
---
class: middle, left, inverse

# Technologies Big Data : Apache Spark and RDD

### 2024-02-13

#### [Master I MIDS Master I Informatique]()

#### [Technologies Big Data](http://stephane-v-boucheron.fr/courses/grosses-data/)

#### [Amélie Gheerbrandt, Stéphane Gaïffas, Vlady Ravelomanana, Stéphane Boucheron](http://stephane-v-boucheron.fr)

---
template: inter-slide

### Introduction

---
### Principles

The `Spark` computing framework deals with many complex issues: fault tolerance, slow machines, big datasets, etc.

*"Here's an operation, run it on all the data."*

- I do not care where it runs
- Feel free to run it twice on different nodes

--

Jobs are divided into tasks, which are executed by the workers

- How do we deal with *failure*? Launch *another* task!
- How do we deal with *stragglers*? Launch *another task*! <br> .fr[... and kill the original task]

???

In Apache Spark, "jobs" and "tasks" are fundamental concepts related to the execution of distributed computations:

### Job:

A job in Spark represents a complete computation triggered by an *action* in the application code.

When you invoke an action (such as `collect()`, `saveAsTextFile()`, etc.) on a Spark RDD, DataFrame, or Dataset, it triggers the execution of one or more jobs.

Each job consists of one or more *stages*, where each stage represents a set of *tasks* that can be executed in parallel.

Stages are delimited by *transformations* that require a shuffle; stages with no dependency on each other can execute independently.

### Task:

A task is the smallest unit of work in Spark and represents the execution of a computation on a single *partition* of data.

Tasks are created for each partition of the RDD, DataFrame, or Dataset involved in the computation.

Spark's execution engine assigns tasks to individual executor nodes in the cluster for parallel execution.

Tasks are executed within the context of a specific *stage*, and each task typically operates on a subset of the data distributed across the cluster.

The number of tasks within a stage depends on the number of partitions of the input data and the degree of parallelism configured for the Spark application.

In summary, a "job" represents the entire computation triggered by an action, composed of one or more stages, each of which is divided into smaller units of work called "tasks." Tasks operate on individual partitions of the data in parallel to achieve efficient and scalable distributed computation in Spark.

---
### API

An *API* allows a user to interact with the software

`Spark` is implemented in [Scala](https://www.scala-lang.org) and runs on the *JVM* (Java Virtual Machine)

*Multiple* Application Programming Interfaces (APIs):

- `Scala` (JVM)
- `Java` (JVM)
- `Python`
- `R`

*This course uses the `Python` API*. Easier to learn than `Scala` and `Java`

- About the `R` APIs: See [Mastering Spark in R](https://therinspark.com)

???
API: Application Programming Interface

See [https://en.wikipedia.org/wiki/API](https://en.wikipedia.org/wiki/API) for more on this acronym

In the `Python` language, look at `interface` and the corresponding chapter *Interfaces, Protocols and ABCs* in [Fluent Python](https://www.fluentpython.com)

For `R` there are in fact two APIs, or two packages that offer a `Spark` API

- [`sparklyr`](https://spark.rstudio.com)
- [`SparkR`](https://spark.apache.org/docs/latest/sparkr.html)

See [Mastering `Spark` with `R` by Javier Luraschi, Kevin Kuo, Edgar Ruiz](https://therinspark.com/index.html)

---
### Architecture

When you interact with `Spark` through its API, you send instructions to the *Driver*

- The *Driver* is the **central coordinator**
- It communicates with distributed workers called *executors*
- Creates a *logical directed acyclic graph* (DAG) of operations
- *Merges operations* that can be merged
- *Splits* the operations into *tasks* (the smallest unit of work in Spark)
- *Schedules* the tasks and sends them to the *executors*
- *Tracks* data and tasks

#### Example

- Example of a DAG: `map(f) - map(g) - filter(h) - reduce(l)`
- The two consecutive maps can be merged into a single `map(g o f)`

---
## SparkSession and SparkContext

---

`SparkContext` and `SparkSession` serve different purposes

SparkContext was the main entry point for Spark applications in the first versions of Apache Spark.

SparkContext represented the connection to a Spark cluster, allowing the application to interact with the cluster manager.

SparkContext was responsible for coordinating and managing the execution of *jobs* and *tasks*.

SparkContext provided APIs for creating `RDDs` (Resilient Distributed Datasets), which were the primary abstraction in Spark for representing distributed data.

---
### SparkContext object

Your `python` session interacts with the **driver** through a `SparkContext` object

- In the `Spark` interactive shell <br> An object of class `SparkContext` is automatically created in the session and named `sc`

- In a `jupyter notebook` <br> Create a `SparkContext` object using:

```python
>>> from pyspark import SparkConf, SparkContext
>>> conf = SparkConf().setAppName(appName).setMaster(master)
>>> sc = SparkContext(conf=conf)
```

---
### SparkSession

In Spark 2.0 and later versions, `SparkContext` is still available but is not the primary entry point. Instead, SparkSession is preferred.

`SparkSession` was introduced in Spark 2.0 as a higher-level abstraction that encapsulates SparkContext, SQLContext, and HiveContext.

`SparkSession` provides a unified entry point for Spark functionality, integrating Structured APIs:

- SQL,
- DataFrame,
- Dataset

and the traditional RDD-based APIs.

`SparkSession` is designed to make it easier to work with structured data (like data stored in tables or files with a schema) using Spark's DataFrame and Dataset APIs.

It also provides built-in support for reading data from various sources (like Parquet, JSON, JDBC, etc.) into DataFrames and writing DataFrames back to different formats.

Additionally, SparkSession simplifies the configuration of Spark properties and provides a Spark SQL CLI and a Spark Shell with SQL and DataFrame support.

It's important to note that SparkSession internally creates and manages a SparkContext, so when you create a SparkSession, you don't need to create a SparkContext separately.
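Below is a minimal sketch of creating a `SparkSession` and retrieving the `SparkContext` it manages; the application name and master URL are placeholders:

```python
>>> from pyspark.sql import SparkSession
>>> spark = (SparkSession.builder
...          .appName("my-app")       # placeholder application name
...          .master("local[*]")      # placeholder master URL: local mode, all cores
...          .getOrCreate())
>>> sc = spark.sparkContext           # the SparkContext managed by the session
```

`getOrCreate()` returns the already-running session if there is one, so the snippet is safe to re-execute in a notebook.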
In summary, while SparkContext is lower-level and primarily focused on managing the execution of Spark jobs and interacting with the cluster, SparkSession provides a higher-level, more user-friendly interface for working with structured data and integrates various Spark functionalities, including SQL, DataFrame, and Dataset APIs. --- ### RDDs and running model Spark programs are written in terms of operations on **RDDs** - *RDD* = **Resilient Distributed Dataset** <br> - An **immutable distributed collection** of objects spread across the cluster disks or memory - RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes - Parallel *transformations* and *actions* can be applied to RDDs - RDDs are automatically rebuilt on machine failure ??? --- ### Creating a RDD From an iterable object `iterator` (e.g. a `Python` `list`, etc.): ```python lines = sc.parallelize(iterator) ``` From a text file: ```python lines = sc.textFile("/path/to/file.txt") ``` where `lines` is the resulting RDD, and `sc` the spark context **Remarks** - `parallelize` not really used in practice - In real life: **load data from external storage** - External storage is often **HDFS** (Hadoop Distributed File System) - Can read most formats (`json`, `csv`, `xml`, `parquet`, `orc`, etc.) ??? For iterators look again at [Fluent Python](https://www.fluentpython.com), chapter 17 *Iterators, Generators, and Classic Coroutines* --- ### Operations on RDD **Two families of operations** can be performed on RDDs - *Transformations* <br> Operations on RDDs which return a new RDD <br> *Lazy evaluation* - *Actions* <br> Operations on RDDs that return some other data type <br> **Triggers computations** What is *lazy evaluation* ? -- When a transformation is called on a RDD: - The operation is *not immediately performed* - Spark internally *records that this operation has been requested* - Computations are triggered only *if an action requires the result of this transformation* at some point --- template: inter-slide ## Transformations --- ### Transformations The most important transformation is `map` .pure-table.pure-table-striped[ | transformation | description | | :-------------: |:-----------------------------------------------| | `map(f)` | apply a function `f` to each element of the RDD | ] Here is an example: ```python >>> rdd = sc.parallelize([2, 3, 4]) >>> rdd.map(lambda x: list(range(1, x))).collect() [[1], [1, 2], [1, 2, 3]] ``` - We need to call `collect` (an *action*) otherwise *nothing happens* - Once again, transformation `map` is lazy-evaluated
- In `Python`, *three options for passing functions* into `Spark`
  - for short functions: `lambda` expressions (anonymous functions)
  - top-level functions
  - *locally defined / user-defined functions* with `def`

---
### Transformations

About passing functions to `map` (a small sketch follows the list below):

- Involves *serialization* with `pickle`
- `Spark` sends the *entire pickled function* to worker nodes
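A minimal sketch of the `lambda` and `def` options; the sample data and the function name `first_letter` are ours, and a running `SparkContext` named `sc` is assumed:

```python
>>> rdd = sc.parallelize(["spark", "hadoop", "flink"])
>>> # a lambda expression, convenient for short functions
>>> rdd.map(lambda s: s.upper()).collect()
['SPARK', 'HADOOP', 'FLINK']
>>> # a named function defined with `def` (top-level or locally defined)
>>> def first_letter(s):
...     return s[0]
...
>>> rdd.map(first_letter).collect()
['s', 'h', 'f']
```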
**Warning.** If the function is an *object method*:

- The *whole object is pickled* since the method contains references to the object (`self`) and references to attributes of the object
- The whole object can be *large*
- The whole object *may not be serializable with `pickle`*

.footnote[[[Let's go to notebook05_sparkrdd.ipynb]](http://localhost:8888/notebooks/notebooks/notebook05_sparkrdd.ipynb)]

???

> *serialization*
> Converting an object from its in-memory structure to a binary or text-oriented format for storage or transmission,
> in a way that allows the future reconstruction of a clone of the object on the same system or on a different one.
> The `pickle` module supports serialization of arbitrary `Python` objects to a binary format.

.fr[from Fluent Python by Ramalho]

---
### Transformations

Then we have `flatMap`

.pure-table.pure-table-striped[
| transformation | description |
| :------------: | :------------------------------- |
| `flatMap(f)` | apply `f` to each element of the RDD, then flatten the results |
]

Example

```python
>>> rdd = sc.parallelize([2, 3, 4, 5])
>>> rdd.flatMap(lambda x: range(1, x)).collect()
[1, 1, 2, 1, 2, 3, 1, 2, 3, 4]
```

---
### Transformations

`filter` allows us to filter an RDD

.pure-table.pure-table-striped[
| transformation | description |
| :-------------: | :------------------------------ |
| `filter(f)` | Return an RDD consisting of only the elements that pass the condition `f` passed to `filter()` |
]

Example

```python
>>> rdd = sc.parallelize(range(10))
>>> rdd.filter(lambda x: x % 2 == 0).collect()
[0, 2, 4, 6, 8]
```

---
### Transformations

About `distinct` and `sample`

.pure-table.pure-table-striped[
| transformation | description |
| :-------------: | :------------------------------- |
| `distinct()` | Removes duplicates |
| `sample(withReplacement, fraction, [seed])` | Sample an RDD, with or without replacement |
]

Example

```python
>>> rdd = sc.parallelize([1, 1, 4, 2, 1, 3, 3])
>>> rdd.distinct().collect()
[1, 2, 3, 4]
```

---
### Transformations

We also have pseudo-set-theoretical operations

.pure-table.pure-table-striped[
| transformation | description |
|: -------------: |: -------------------------------|
| `union(otherRdd)` | Returns union with `otherRdd` |
| `intersection(otherRdd)` | Returns intersection with `otherRdd` |
| `subtract(otherRdd)` | Return each value in `self` that is not contained in `otherRdd`. |
]

- If there are duplicates in the input RDD, the result of `union()` *will* contain duplicates (fixed with `distinct()`)
- `intersection()` removes all duplicates (including duplicates from a single RDD)
- Performance of `intersection()` is much worse than `union()` since it requires a *shuffle* to identify common elements
- `subtract` also requires a shuffle

---
### Transformations

We also have pseudo-set-theoretical operations

.pure-table.pure-table-striped[
| transformation | description |
|: -------------: |: -------------------------------|
| `union(otherRdd)` | Returns union with `otherRdd` |
| `intersection(otherRdd)` | Returns intersection with `otherRdd` |
| `subtract(otherRdd)` | Return each value in `self` that is not contained in `otherRdd`.
| ] Example with `union` and `distinct` ```python >>> rdd1 = sc.parallelize(range(5)) >>> rdd2 = sc.parallelize(range(3, 9)) >>> rdd3 = rdd1.union(rdd2) >>> rdd3.collect() [0, 1, 2, 3, 4, 3, 4, 5, 6, 7, 8] ``` ```python >>> rdd3.distinct().collect() [0, 1, 2, 3, 4, 5, 6, 7, 8] ``` --- ### About shuffles - Certain operations trigger a *shuffle* - It is `Spark`’s mechanism for *re-distributing data* so that it’s grouped differently across partitions - It involves *copying data across executors and machines*, making the shuffle a complex and costly operation - We will discuss shuffles in detail later in the course ## Performance Impact - A shuffle involves disk I/O, data serialization and network I/O. - To organize data for the shuffle, `Spark` generates sets of *tasks* - *map tasks* to organize the data - and a set of *reduce tasks* to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations. --- ### Transformations Another "pseudo set" operation .pure-table.pure-table-striped[ | transformation | description | |: -------------: |: -------------------------------| | `cartesian(otherRdd)` | Return the Cartesian product of this RDD and another one | ] Example ```python >>> rdd1 = sc.parallelize([1, 2]) >>> rdd2 = sc.parallelize(["a", "b"]) >>> rdd1.cartesian(rdd2).collect() [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')] ``` - `cartesian()` is **very expensive** for large RDDs .footnote[[[Let's go to notebook05_sparkrdd.ipynb]](http://localhost:8888/notebooks/notebooks/notebook05_sparkrdd.ipynb)] --- template: inter-slide ## Actions --- ### Actions `collect` brings the `RDD` back to the driver .pure-table.pure-table-striped[ | transformation | description | |: -------------: |: -------------------------------| | `collect()` | Return all elements from the RDD | ] Example ```python >>> rdd = sc.parallelize([1, 2, 3, 3]) >>> rdd.collect() [1, 2, 3, 3] ``` #### Remarks -
Be sure that the *retrieved data fits in the driver memory* !
- Useful when developing and working on small data for testing
- We'll use it a lot here, but *we don't use it in real-world problems*

---
### Actions

It's important to count !

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `count()` | Return the number of elements in the RDD |
| `countByValue()` | Return the count of each unique value in the RDD as a dictionary of `{value: count}` pairs. |
]

Example

```python
>>> rdd = sc.parallelize([1, 3, 1, 2, 2, 2])
>>> rdd.count()
6
```

```python
>>> rdd.countByValue()
defaultdict(int, {1: 2, 3: 1, 2: 3})
```

???

In SQL, you would first perform a `group by`, then a `count(*)` aggregation

---
### Actions

How to get some values in an RDD ?

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `take(n)` | Return `n` elements from the RDD (deterministic) |
| `top(n)` | Return the first `n` elements of the RDD (descending order) |
| `takeOrdered(num, key=None)` | Get `num` elements from an RDD, ordered in ascending order or as specified by the optional key function. |
]

**Remarks**

- `take(n)` returns `n` elements from the RDD and attempts to **minimize the number of partitions it accesses**
- the result may be a *biased* collection
- `collect` and `take` may return the elements in an order you don't expect

---
### Actions

How to get some values in an RDD ?

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `take(n)` | Return `n` elements from the RDD (deterministic) |
| `top(n)` | Return the first `n` elements of the RDD (descending order) |
| `takeOrdered(num, key=None)` | Get `num` elements from an RDD, ordered in ascending order or as specified by the optional key function. |
]

Example

```python
>>> rdd = sc.parallelize([(3, 'a'), (1, 'b'), (2, 'd')])
>>> rdd.takeOrdered(2)
[(1, 'b'), (2, 'd')]
```

```python
>>> rdd.takeOrdered(2, key=lambda x: x[1])
[(3, 'a'), (1, 'b')]
```

???

deterministic but arbitrary (may depend on implementation)

---
### Actions

The `reduce` action

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `reduce(f)` | Reduces the elements of this RDD using the specified commutative and associative binary operator `f`. |
| `fold(zeroValue, op)` | Same as `reduce()` but with the provided zero value. |
]

- `op(x, y)` is allowed to modify `x` and return it as its result value to avoid object allocation; however, it should not modify `y`.
- `reduce` applies some operation to pairs of elements until there is just one left. It throws an exception for empty collections.
- `fold` has an initial zero value: it is defined for empty collections.

---
### Actions

The `reduce` action

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `reduce(f)` | Reduces the elements of this RDD using the specified commutative and associative binary operator `f`. |
| `fold(zeroValue, op)` | Same as `reduce()` but with the provided zero value. |
]

Example

```python
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.reduce(lambda a, b: a + b)
6
```

```python
>>> rdd.fold(0, lambda a, b: a + b)
6
```

---
### Actions

The `reduce` action

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `reduce(f)` | Reduces the elements of this RDD using the specified commutative and associative binary operator `f`. |
| `fold(zeroValue, op)` | Same as `reduce()` but with the provided zero value. |
]

**Warning with `fold`.** The result can depend on the number of partitions

```python
>>> rdd = sc.parallelize([1, 2, 4], 2)  # RDD with 2 partitions
>>> rdd.fold(2.5, lambda a, b: a + b)
14.5
```

- The RDD has 2 partitions: say [1, 2] and [4]
- Sum in the partitions: 2.5 + (1 + 2) = 5.5 and 2.5 + (4) = 6.5
- Sum over partitions: 2.5 + (5.5 + 6.5) = 14.5

---
### Actions

The `reduce` action

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `reduce(f)` | Reduces the elements of this RDD using the specified commutative and associative binary operator `f`. |
| `fold(zeroValue, op)` | Same as `reduce()` but with the provided zero value. |
]

**Warning with `fold`.** The result can depend on the number of partitions

```python
>>> rdd = sc.parallelize([1, 2, 3], 5)  # RDD with 5 partitions
>>> rdd.fold(2, lambda a, b: a + b)
???
```

.footnote[[[Let's go to notebook05_sparkrdd.ipynb]](http://localhost:8888/notebooks/notebooks/notebook05_sparkrdd.ipynb)]

---
### Actions

The `reduce` action

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `reduce(f)` | Reduces the elements of this RDD using the specified commutative and associative binary operator `f`. |
| `fold(zeroValue, op)` | Same as `reduce()` but with the provided zero value. |
]

**Warning with `fold`.** The result can depend on the number of partitions

```python
>>> rdd = sc.parallelize([1, 2, 3], 5)  # RDD with 5 partitions
>>> rdd.fold(2, lambda a, b: a + b)
18
```

- Yes, even if there are fewer partitions than elements !
- 18 = 2 * 5 (one zero value per partition) + (1 + 2 + 3) + 2 (the zero value used when merging the partition results)

???

Find a proper showcase for `fold()`

---
### Actions

The `aggregate` action

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `aggregate(zero, seqOp, combOp)` | Similar to `reduce()` but used to return a different type. |
]

Aggregates the elements of each partition, and then the results for all the partitions, given aggregation functions and a zero value.

- `seqOp(acc, val)`: function to combine the elements of a partition from the RDD (`val`) with an accumulator (`acc`). It can return a different result type than the type of this `RDD`
- `combOp`: function that merges the accumulators of two partitions
- Once again, in both functions, the first argument can be modified while the second cannot

---
### Actions

The `aggregate` action

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `aggregate(zero, seqOp, combOp)` | Similar to `reduce()` but used to return a different type. |
]

Example

```python
>>> seqOp = lambda x, y: (x[0] + y, x[1] + 1)
>>> combOp = lambda x, y: (x[0] + y[0], x[1] + y[1])
>>> sc.parallelize([1, 2, 3, 4]).aggregate((0, 0), seqOp, combOp)
(10, 4)
```

```python
>>> (
  sc.parallelize([])
    .aggregate((0, 0), seqOp, combOp)
)
(0, 0)
```

.footnote[[[Let's go to notebook05_sparkrdd.ipynb]](http://localhost:8888/notebooks/notebooks/notebook05_sparkrdd.ipynb)]

???

The result is partition-dependent

---
### Actions

The `foreach` action

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `foreach(f)` | Apply a function `f` to each element of a RDD |
]

- Performs an action on all of the elements in the RDD without returning any result to the driver.
- Example: insert records into a database with `f`

The `foreach()` action lets us perform computations on each element in the RDD without bringing it back locally

???

In which way do `foreach` and `map` differ?

---
class: center, middle, inverse

### Persistence

---
### Lazy evaluation and persistence

- Spark RDDs are **lazily evaluated**
- Each time an action is called on an RDD, this RDD and all its dependencies are *recomputed*
- If you plan to reuse an RDD multiple times, you should use *persistence*

**Remarks**

- Lazy evaluation helps `spark` to **reduce the number of passes** over the data it has to make by grouping operations together
- No substantial benefit to writing a single complex map instead of chaining together many simple operations
- Users are free to organize their program into **smaller**, more **manageable operations**

???

Distinguish persistence from caching

---
### Persistence

How to use persistence ?
.pure-table.pure-table-striped[
| method | description |
|: ---------------------------:|: --------------------------------------------|
| `cache()` | Persist the RDD in memory |
| `persist(storageLevel)` | Persist the RDD according to `storageLevel` |
]

- These methods must be called *before* the action, and do not trigger the computation

Usage of `storageLevel`

```python
pyspark.StorageLevel(
    useDisk, useMemory, useOffHeap, deserialized, replication=1
)
```

???

- What does persistence in memory mean?
- Make `storageLevel` explicit
- Any difference between `cache()` and `persist()` with `useMemory`?
- Why do we call persistence caching?

---
name: option-for-persistence
### Persistence

Options for persistence

.pure-table.pure-table-striped[
| argument | description |
|: -------------: |: -------------------------------|
| `useDisk` | Allow caching to use disk if `True` |
| `useMemory` | Allow caching to use memory if `True` |
| `useOffHeap` | Store data outside of the JVM heap if `True`. Useful if using some in-memory storage system (such as `Tachyon`) |
| `deserialized` | Cache data without serialization if `True` |
| `replication` | Number of replications of the cached data |
]

---
template: option-for-persistence

`replication`

- If you cache data that is quite slow to recompute, you can use replication. If a machine fails, the data will not have to be recomputed.

???

`Tachyon` :

---
template: option-for-persistence

`deserialized`

- Serialization is the conversion of the data to a binary format
- To the best of our knowledge, `PySpark` only supports serialized caching (using `pickle`)

???

---
template: option-for-persistence

`useOffHeap`

- Data is cached in the JVM heap by default
- There are very interesting alternative in-memory solutions such as `Tachyon`
- Don't forget that `spark` is `scala` running on the JVM

---
### Back to options for persistence

```python
StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)
```

You can use these constants:

```python
DISK_ONLY = StorageLevel(True, False, False, False, 1)
DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)
MEMORY_AND_DISK = StorageLevel(True, True, False, True, 1)
MEMORY_AND_DISK_2 = StorageLevel(True, True, False, True, 2)
MEMORY_AND_DISK_SER = StorageLevel(True, True, False, False, 1)
MEMORY_AND_DISK_SER_2 = StorageLevel(True, True, False, False, 2)
MEMORY_ONLY = StorageLevel(False, True, False, True, 1)
MEMORY_ONLY_2 = StorageLevel(False, True, False, True, 2)
MEMORY_ONLY_SER = StorageLevel(False, True, False, False, 1)
MEMORY_ONLY_SER_2 = StorageLevel(False, True, False, False, 2)
OFF_HEAP = StorageLevel(False, False, True, False, 1)
```

and simply call for instance

```python
rdd.persist(MEMORY_AND_DISK)
```

---
### Persistence

What if you attempt to *cache too much data to fit in memory ?*

Spark will automatically evict old partitions using a *Least Recently Used* (LRU) cache policy:

- For the *memory-only* storage levels, it will recompute these partitions the next time they are accessed
- For the *memory-and-disk* ones, it will write them out to disk

Use `unpersist()` on RDDs to **manually remove them** from the cache
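---
### Persistence

A minimal sketch of persistence in action, assuming a running `SparkContext` named `sc`; the file path is a placeholder:

```python
from pyspark import StorageLevel

rdd = sc.textFile("/path/to/file.txt").map(lambda line: line.split(","))
rdd.persist(StorageLevel.MEMORY_AND_DISK)  # nothing is computed yet

rdd.count()      # first action: computes the RDD and caches its partitions
rdd.count()      # second action: reuses the cached partitions, no recomputation
rdd.unpersist()  # manually remove the RDD from the cache
```

Without the call to `persist`, the second `count()` would read and split the file again.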
---
### Reminder: about passing functions

**Warning**

- When passing functions, you can *inadvertently serialize the object containing the function*. If you pass a function that:
  - is a member of an object
  - contains references to fields in an object

  then `Spark` sends the *entire object to worker nodes*, which can be **much larger** than the bit of information you need

- This can cause your *program to fail*, if your class contains objects that **Python can't pickle**

---
### About passing functions

Passing a function with field references (don't do this !
) ```python class SearchFunctions(object): def __init__(self, query): self.query = query def isMatch(self, s): return self.query in s def getMatchesFunctionReference(self, rdd): # Problem: references all of "self" in "self.isMatch" return rdd.filter(self.isMatch) def getMatchesMemberReference(self, rdd): # Problem: references all of "self" in "self.query" return rdd.filter(lambda x: self.query in x) ``` Instead, **just extract the fields you need** from your object into a local variable and pass that in --- ### About passing functions `Python` function passing without field references ```python class WordFunctions(object): ... def getMatchesNoReference(self, rdd): # Safe: extract only the field we need into a local variable query = self.query return rdd.filter(lambda x: query in x) ``` -- Much better to do this instead --- template: inter-slide ## Pair RDD: key-value pairs --- ### Pair RDD: key-value pairs It's roughly an RDD where each element is a tuple with two elements: a key and a value - For numerous tasks, such as aggregations tasks, storing information as `(key, value)` pairs into RDD is very convenient - Such RDDs are called `PairRDD` - Pair RDDs expose *new operations* such as **grouping together** data with the same key, and **grouping together two different RDDs** ### Creating a pair RDD Calling `map` with a function returning a `tuple` with two elements ```python >>> rdd = sc.parallelize([[1, "a", 7], [2, "b", 13], [2, "c", 17]]) >>> rdd = rdd.map(lambda x: (x[0], x[1:])) >>> rdd.collect() [(1, ['a', 7]), (2, ['b', 13]), (2, ['c', 17])] ``` --- ###
Warning All elements of a `PairRDD` must be tuples with two elements (the key and the value) ```python >>> rdd = sc.parallelize([[1, "a", 7], [2, "b", 13], [2, "c", 17]]) >>> rdd.keys().collect() [1, 2, 2] >>> rdd.values().collect() ['a', 'b', 'c'] ``` -- For things to work as expected you **must** do ```python >>> rdd = sc.parallelize([[1, "a", 7], [2, "b", 13], [2, "c", 17]])\ .map(lambda x: (x[0], x[1:])) >>> rdd.keys().collect() [1, 2, 2] >>> rdd.values().collect() [['a', 7], ['b', 13], ['c', 17]] ``` --- name: transformations-for-a-single-PairRDD ### Transformations for a single `PairRDD` .pure-table.pure-table-striped.f6[ | transformation | description | |: -------------: |: -------------------------------| | `keys()` | Return an RDD containing the keys | | `values()` | Return an RDD containing the values | | `sortByKey()` | Return an RDD sorted by the key | | `mapValues(f)` | Apply a function `f` to each value of a pair RDD without changing the key | | `flatMapValues(f)` | Pass each value in the key-value pair RDD through a flatMap function `f` without changing the keys | ] --- template: transformations-for-a-single-PairRDD Example with `mapValues` ```python >>> rdd = sc.parallelize([("a", "x y z"), ("b", "p r")]) >>> rdd.mapValues(lambda v: v.split(' ')).collect() [('a', ['x', 'y', 'z']), ('b', ['p', 'r'])] ``` --- template: transformations-for-a-single-PairRDD Example with `flatMapValues` ```python >>> texts = sc.parallelize([("a", "x y z"), ("b", "p r")]) >>> tokenize = lambda x: x.split(" ") >>> texts.flatMapValues(tokenize).collect() [('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')] ``` --- name: transformations-for-a-single-PairRDD-keyed ### Transformations for a single `PairRDD` (keyed) .pure-table.pure-table-striped.f6[ | transformation | description | |: -------------: |: -------------------------------| | `groupByKey()` | Group values with the same key | | `reduceByKey(f)`| Merge the values for each key using an associative reduce function `f`. | | `foldByKey(f)` | Merge the values for each key using an associative reduce function `f`. | | `combineByKey(createCombiner, mergeValue, mergeCombiners, [partitioner])` | Generic function to combine the elements for each key using a custom set of aggregation functions. 
| ]

---
template: transformations-for-a-single-PairRDD-keyed

Example with `groupByKey`

```python
>>> rdd = sc.parallelize([
  ("a", 1), ("b", 1), ("a", 1), ("b", 3), ("c", 42)
])
>>> rdd.groupByKey().mapValues(list).collect()
[('c', [42]), ('b', [1, 3]), ('a', [1, 1])]
```

---
.center[<img src="figs/group_by.png">]

---
template: transformations-for-a-single-PairRDD-keyed

Example with `reduceByKey`

```python
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> rdd.reduceByKey(lambda a, b: a + b).collect()
[('a', 2), ('b', 1)]
```

- The reducing occurs first **locally** (within partitions)
- Then, a shuffle is performed with the local results to reduce globally

---
.center[<img src="figs/reduce_by.png">]

---
template: transformations-for-a-single-PairRDD-keyed

`combineByKey`

Transforms an `RDD[(K, V)]` into another RDD of type `RDD[(K, C)]` for a "combined" type `C` that can be different from `V`

The user must define

- `createCombiner` : which turns a `V` into a `C`
- `mergeValue` : to merge a `V` into a `C`
- `mergeCombiners` : to combine two `C`'s into a single one

---
template: transformations-for-a-single-PairRDD-keyed

In this example

- `createCombiner` : converts the value to `str`
- `mergeValue` : concatenates two `str`
- `mergeCombiners` : concatenates two `str`

```python
>>> rdd = sc.parallelize([('a', 1), ('b', 2), ('a', 13)])
>>> def add(a, b): return a + str(b)
>>> rdd.combineByKey(str, add, add).collect()
[('a', '113'), ('b', '2')]
```

---
### Transformations for two `PairRDD`

.pure-table.pure-table-striped[
| transformation | description |
|: -------------: |: -------------------------------|
| `subtractByKey(other)` | Remove elements with a key present in the `other` RDD. |
| `join(other)` | Inner join with `other` RDD. |
| `rightOuterJoin(other)` | Right join with `other` RDD. |
| `leftOuterJoin(other)` | Left join with `other` RDD. |
]

- Left join: the result keeps every key present in the first (`self`) RDD
- Right join: the result keeps every key present in the `other` RDD

.center[<img width='600px' src="figs/join-types.png">]

---
### Transformations for two `PairRDD`

- Join operations are mainly used through the high-level API: `DataFrame` objects and the `spark.sql` API
- We will use them a lot with the high-level API (`DataFrame` from `spark.sql`)

.footnote[[[Let's go to notebook05_sparkrdd.ipynb]](http://localhost:8888/notebooks/notebooks/notebook05_sparkrdd.ipynb)]

---
### Actions for a single `PairRDD`

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `countByKey()` | Count the number of elements for each key. |
| `lookup(key)` | Return all the values associated with the provided `key`. |
| `collectAsMap()` | Return the key-value pairs in this RDD to the master as a Python dictionary. |
]

.footnote[[[Let's go to notebook05_sparkrdd.ipynb]](http://localhost:8888/notebooks/notebooks/notebook05_sparkrdd.ipynb)]

---
### Data partitioning

- Some operations on `PairRDD`s, such as `join`, require scanning the data **more than once**
- Partitioning the RDDs **in advance** can reduce network communications (see the sketch on the next slide)
- When a key-oriented dataset is reused several times, partitioning can improve performance
- In `Spark`: you can *choose which keys will appear on the same node*, but you have no explicit control over which worker node each key goes to.
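---
### Data partitioning

A minimal sketch of pre-partitioning a pair RDD before a `join`, assuming a running `SparkContext` named `sc`; the RDD names and data are ours:

```python
users = sc.parallelize([(1, "Alice"), (2, "Bob"), (3, "Carol")])
events = sc.parallelize([(1, "click"), (3, "view"), (1, "view")])

users_part = users.partitionBy(4).cache()  # hash-partition once and keep in memory

users_part.join(events).collect()
# [(1, ('Alice', 'click')), (1, ('Alice', 'view')), (3, ('Carol', 'view'))]  (order may vary)
```

Because `users_part` keeps its partitioner once cached, it does not need to be re-shuffled for every subsequent key-oriented operation.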
---
### Data partitioning

In practice, you can specify the number of partitions with

```python
rdd.partitionBy(100)
```

You can also use a custom partition function such that `f(key)` returns a hash

```python
from urllib.parse import urlparse

def hash_domain(url):
    # Returns a hash associated with the domain of a website
    return hash(urlparse(url).netloc)

rdd.partitionBy(20, hash_domain)  # Create 20 partitions
```

To have finer control over partitioning, you must use the Scala API.

---
class: center, middle, inverse

### Thank you !