name: inter-slide class: left, middle, inverse {{ content }} --- name: layout-general layout: true class: left, middle <style> .remark-slide-number { position: inherit; } .remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 4px; display: block; left: 0; right: 0; } .remark-slide-number .progress-bar { height: 100%; background-color: red; } /* custom.css */ .plot-callout { width: 300px; bottom: 5%; right: 5%; position: absolute; padding: 0px; z-index: 100; } .plot-callout img { width: 100%; border: 1px solid #23373B; } </style>
---
class: middle, left, inverse

# Technologies Big Data : Apache Spark and RDD

### 2024-02-13

#### [Master I MIDS Master I Informatique]()

#### [Technologies Big Data](http://stephane-v-boucheron.fr/courses/grosses-data/)

#### [Amélie Gheerbrandt, Stéphane Gaïffas, Vlady Ravelomanana, Stéphane Boucheron](http://stephane-v-boucheron.fr)

---
template: inter-slide

### Introduction

---
### Principles

The `Spark` computing framework deals with many complex issues: fault tolerance, slow machines, big datasets, etc.

*"Here's an operation, run it on all the data."*

- I do not care where it runs
- Feel free to run it twice on different nodes

--

Jobs are divided into tasks, which are executed by the workers

- How do we deal with *failure*? Launch *another* task!
- How do we deal with *stragglers*? Launch *another task*! <br> .fr[... and kill the original task]

???

In Apache Spark, "jobs" and "tasks" are fundamental concepts related to the execution of distributed computations:

### Job:

A job in Spark represents a complete computation triggered by an *action* in the application code.

When you invoke an action (such as `collect()`, `saveAsTextFile()`, etc.) on a Spark RDD, DataFrame, or Dataset, it triggers the execution of one or more jobs.

Each job consists of one or more *stages*, where each stage represents a set of *tasks* that can be executed in parallel.

Stages are delimited by *transformations* that require a shuffle; stages with no dependency on each other can execute independently.

### Task:

A task is the smallest unit of work in Spark and represents the execution of a computation on a single *partition* of data.

Tasks are created for each partition of the RDD, DataFrame, or Dataset involved in the computation.

Spark's execution engine assigns tasks to individual executor nodes in the cluster for parallel execution.

Tasks are executed within the context of a specific *stage*, and each task typically operates on a subset of the data distributed across the cluster.

The number of tasks within a stage depends on the number of partitions of the input data and the degree of parallelism configured for the Spark application.

In summary, a "job" represents the entire computation triggered by an action, composed of one or more stages, each of which is divided into smaller units of work called "tasks." Tasks operate on individual partitions of the data in parallel to achieve efficient and scalable distributed computation in Spark.

---
### API

An *API* allows a user to interact with the software

`Spark` is implemented in [Scala](https://www.scala-lang.org) and runs on the *JVM* (Java Virtual Machine)

*Multiple* Application Programming Interfaces (APIs):

- `Scala` (JVM)
- `Java` (JVM)
- `Python`
- `R`

*This course uses the `Python` API*. Easier to learn than `Scala` and `Java`

- About the `R` APIs: See [Mastering Spark in R](https://therinspark.com)

???
API: Application Programming Interface

See [https://en.wikipedia.org/wiki/API](https://en.wikipedia.org/wiki/API) for more on this acronym

In the `Python` language, look at `interface` and the corresponding chapter *Interfaces, Protocols and ABCs* in [Fluent Python](https://www.fluentpython.com)

For `R` there are in fact two APIs, or two packages that offer a `Spark` API

- [`sparklyr`](https://spark.rstudio.com)
- [`SparkR`](https://spark.apache.org/docs/latest/sparkr.html)

See [Mastering `Spark` with `R` by Javier Luraschi, Kevin Kuo, Edgar Ruiz](https://therinspark.com/index.html)

---
### Architecture

When you interact with `Spark` through its API, you send instructions to the *Driver*

- The *Driver* is the **central coordinator**
- It communicates with distributed workers called *executors*
- Creates a *logical directed acyclic graph* (DAG) of operations
- *Merges operations* that can be merged
- *Splits* the operations into *tasks* (the smallest unit of work in Spark)
- *Schedules* the tasks and sends them to the *executors*
- *Tracks* data and tasks

#### Example

- Example of a DAG: `map(f) - map(g) - filter(h) - reduce(l)`
- The two consecutive maps can be merged into a single `map(g o f)`

---
## SparkSession and SparkContext

---

`SparkContext` and `SparkSession` serve different purposes

SparkContext was the main entry point for Spark applications in the first versions of Apache Spark.

SparkContext represented the connection to a Spark cluster, allowing the application to interact with the cluster manager.

SparkContext was responsible for coordinating and managing the execution of *jobs* and *tasks*.

SparkContext provided APIs for creating `RDDs` (Resilient Distributed Datasets), which were the primary abstraction in Spark for representing distributed data.

---
### SparkContext object

Your `python` session interacts with the **driver** through a `SparkContext` object

- In the `Spark` interactive shell <br> An object of class `SparkContext` is automatically created in the session and named `sc`

- In a `jupyter notebook` <br> Create a `SparkContext` object using:

```python
>>> from pyspark import SparkConf, SparkContext
>>> conf = SparkConf().setAppName(appName).setMaster(master)
>>> sc = SparkContext(conf=conf)
```

---
### SparkSession

In Spark 2.0 and later versions, `SparkContext` is still available but is not the primary entry point. Instead, SparkSession is preferred.

`SparkSession` was introduced in Spark 2.0 as a higher-level abstraction that encapsulates SparkContext, SQLContext, and HiveContext.

`SparkSession` provides a unified entry point for Spark functionality, integrating Structured APIs:

- SQL,
- DataFrame,
- Dataset

and the traditional RDD-based APIs.

`SparkSession` is designed to make it easier to work with structured data (like data stored in tables or files with a schema) using Spark's DataFrame and Dataset APIs.

It also provides built-in support for reading data from various sources (like Parquet, JSON, JDBC, etc.) into DataFrames and writing DataFrames back to different formats.

Additionally, SparkSession simplifies the configuration of Spark properties and provides a Spark SQL CLI and a Spark Shell with SQL and DataFrame support.

It's important to note that SparkSession internally creates and manages a SparkContext, so when you create a SparkSession, you don't need to create a SparkContext separately.
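Below is a minimal sketch of creating a `SparkSession` and retrieving the `SparkContext` it manages; the application name and master URL are placeholders:

```python
>>> from pyspark.sql import SparkSession
>>> spark = (SparkSession.builder
...          .appName("my-app")       # placeholder application name
...          .master("local[*]")      # placeholder master URL: local mode, all cores
...          .getOrCreate())
>>> sc = spark.sparkContext           # the SparkContext managed by the session
```

`getOrCreate()` returns the already-running session if there is one, so the snippet is safe to re-execute in a notebook.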
In summary, while SparkContext is lower-level and primarily focused on managing the execution of Spark jobs and interacting with the cluster, SparkSession provides a higher-level, more user-friendly interface for working with structured data and integrates various Spark functionalities, including SQL, DataFrame, and Dataset APIs. --- ### RDDs and running model Spark programs are written in terms of operations on **RDDs** - *RDD* = **Resilient Distributed Dataset** <br> - An **immutable distributed collection** of objects spread across the cluster disks or memory - RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes - Parallel *transformations* and *actions* can be applied to RDDs - RDDs are automatically rebuilt on machine failure ??? --- ### Creating a RDD From an iterable object `iterator` (e.g. a `Python` `list`, etc.): ```python lines = sc.parallelize(iterator) ``` From a text file: ```python lines = sc.textFile("/path/to/file.txt") ``` where `lines` is the resulting RDD, and `sc` the spark context **Remarks** - `parallelize` not really used in practice - In real life: **load data from external storage** - External storage is often **HDFS** (Hadoop Distributed File System) - Can read most formats (`json`, `csv`, `xml`, `parquet`, `orc`, etc.) ??? For iterators look again at [Fluent Python](https://www.fluentpython.com), chapter 17 *Iterators, Generators, and Classic Coroutines* --- ### Operations on RDD **Two families of operations** can be performed on RDDs - *Transformations* <br> Operations on RDDs which return a new RDD <br> *Lazy evaluation* - *Actions* <br> Operations on RDDs that return some other data type <br> **Triggers computations** What is *lazy evaluation* ? -- When a transformation is called on a RDD: - The operation is *not immediately performed* - Spark internally *records that this operation has been requested* - Computations are triggered only *if an action requires the result of this transformation* at some point --- template: inter-slide ## Transformations --- ### Transformations The most important transformation is `map` .pure-table.pure-table-striped[ | transformation | description | | :-------------: |:-----------------------------------------------| | `map(f)` | apply a function `f` to each element of the RDD | ] Here is an example: ```python >>> rdd = sc.parallelize([2, 3, 4]) >>> rdd.map(lambda x: list(range(1, x))).collect() [[1], [1, 2], [1, 2, 3]] ``` - We need to call `collect` (an *action*) otherwise *nothing happens* - Once again, transformation `map` is lazy-evaluated
- In `Python`, *three options for passing functions* into `Spark`
  - for short functions: `lambda` expressions (anonymous functions)
  - top-level functions
  - *locally defined / user-defined functions* with `def`

---
### Transformations

About passing functions to `map` (a small sketch follows the list below):

- Involves *serialization* with `pickle`
- `Spark` sends the *entire pickled function* to worker nodes
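A minimal sketch of the `lambda` and `def` options; the sample data and the function name `first_letter` are ours, and a running `SparkContext` named `sc` is assumed:

```python
>>> rdd = sc.parallelize(["spark", "hadoop", "flink"])
>>> # a lambda expression, convenient for short functions
>>> rdd.map(lambda s: s.upper()).collect()
['SPARK', 'HADOOP', 'FLINK']
>>> # a named function defined with `def` (top-level or locally defined)
>>> def first_letter(s):
...     return s[0]
...
>>> rdd.map(first_letter).collect()
['s', 'h', 'f']
```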
**Warning.** If the function is an *object method*:

- The *whole object is pickled* since the method contains references to the object (`self`) and references to attributes of the object
- The whole object can be *large*
- The whole object *may not be serializable with `pickle`*

.footnote[[[Let's go to notebook05_sparkrdd.ipynb]](http://localhost:8888/notebooks/notebooks/notebook05_sparkrdd.ipynb)]

???

> *serialization*
> Converting an object from its in-memory structure to a binary or text-oriented format for storage or transmission,
> in a way that allows the future reconstruction of a clone of the object on the same system or on a different one.
> The `pickle` module supports serialization of arbitrary `Python` objects to a binary format.

.fr[from Fluent Python by Ramalho]

---
### Transformations

Then we have `flatMap`

.pure-table.pure-table-striped[
| transformation | description |
| :------------: | :------------------------------- |
| `flatMap(f)` | apply `f` to each element of the RDD, then flatten the results |
]

Example

```python
>>> rdd = sc.parallelize([2, 3, 4, 5])
>>> rdd.flatMap(lambda x: range(1, x)).collect()
[1, 1, 2, 1, 2, 3, 1, 2, 3, 4]
```

---
### Transformations

`filter` allows us to filter an RDD

.pure-table.pure-table-striped[
| transformation | description |
| :-------------: | :------------------------------ |
| `filter(f)` | Return an RDD consisting of only the elements that pass the condition `f` passed to `filter()` |
]

Example

```python
>>> rdd = sc.parallelize(range(10))
>>> rdd.filter(lambda x: x % 2 == 0).collect()
[0, 2, 4, 6, 8]
```

---
### Transformations

About `distinct` and `sample`

.pure-table.pure-table-striped[
| transformation | description |
| :-------------: | :------------------------------- |
| `distinct()` | Removes duplicates |
| `sample(withReplacement, fraction, [seed])` | Sample an RDD, with or without replacement |
]

Example

```python
>>> rdd = sc.parallelize([1, 1, 4, 2, 1, 3, 3])
>>> rdd.distinct().collect()
[1, 2, 3, 4]
```

---
### Transformations

We also have pseudo-set-theoretical operations

.pure-table.pure-table-striped[
| transformation | description |
|: -------------: |: -------------------------------|
| `union(otherRdd)` | Returns union with `otherRdd` |
| `intersection(otherRdd)` | Returns intersection with `otherRdd` |
| `subtract(otherRdd)` | Return each value in `self` that is not contained in `otherRdd`. |
]

- If there are duplicates in the input RDD, the result of `union()` *will* contain duplicates (fixed with `distinct()`)
- `intersection()` removes all duplicates (including duplicates from a single RDD)
- Performance of `intersection()` is much worse than `union()` since it requires a *shuffle* to identify common elements
- `subtract` also requires a shuffle

---
### Transformations

We also have pseudo-set-theoretical operations

.pure-table.pure-table-striped[
| transformation | description |
|: -------------: |: -------------------------------|
| `union(otherRdd)` | Returns union with `otherRdd` |
| `intersection(otherRdd)` | Returns intersection with `otherRdd` |
| `subtract(otherRdd)` | Return each value in `self` that is not contained in `otherRdd`.
| ] Example with `union` and `distinct` ```python >>> rdd1 = sc.parallelize(range(5)) >>> rdd2 = sc.parallelize(range(3, 9)) >>> rdd3 = rdd1.union(rdd2) >>> rdd3.collect() [0, 1, 2, 3, 4, 3, 4, 5, 6, 7, 8] ``` ```python >>> rdd3.distinct().collect() [0, 1, 2, 3, 4, 5, 6, 7, 8] ``` --- ### About shuffles - Certain operations trigger a *shuffle* - It is `Spark`’s mechanism for *re-distributing data* so that it’s grouped differently across partitions - It involves *copying data across executors and machines*, making the shuffle a complex and costly operation - We will discuss shuffles in detail later in the course ## Performance Impact - A shuffle involves disk I/O, data serialization and network I/O. - To organize data for the shuffle, `Spark` generates sets of *tasks* - *map tasks* to organize the data - and a set of *reduce tasks* to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations. --- ### Transformations Another "pseudo set" operation .pure-table.pure-table-striped[ | transformation | description | |: -------------: |: -------------------------------| | `cartesian(otherRdd)` | Return the Cartesian product of this RDD and another one | ] Example ```python >>> rdd1 = sc.parallelize([1, 2]) >>> rdd2 = sc.parallelize(["a", "b"]) >>> rdd1.cartesian(rdd2).collect() [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')] ``` - `cartesian()` is **very expensive** for large RDDs .footnote[[[Let's go to notebook05_sparkrdd.ipynb]](http://localhost:8888/notebooks/notebooks/notebook05_sparkrdd.ipynb)] --- template: inter-slide ## Actions --- ### Actions `collect` brings the `RDD` back to the driver .pure-table.pure-table-striped[ | transformation | description | |: -------------: |: -------------------------------| | `collect()` | Return all elements from the RDD | ] Example ```python >>> rdd = sc.parallelize([1, 2, 3, 3]) >>> rdd.collect() [1, 2, 3, 3] ``` #### Remarks -
Be sure that the *retrieved data fits in the driver memory* !
- Useful when developing and working on small data for testing
- We'll use it a lot here, but *we don't use it in real-world problems*

---
### Actions

It's important to count !

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `count()` | Return the number of elements in the RDD |
| `countByValue()` | Return the count of each unique value in the RDD as a dictionary of `{value: count}` pairs. |
]

Example

```python
>>> rdd = sc.parallelize([1, 3, 1, 2, 2, 2])
>>> rdd.count()
6
```

```python
>>> rdd.countByValue()
defaultdict(int, {1: 2, 3: 1, 2: 3})
```

???

In SQL, you would first perform a `group by`, then a `count(*)` aggregation

---
### Actions

How to get some values in an RDD ?

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `take(n)` | Return `n` elements from the RDD (deterministic) |
| `top(n)` | Return the first `n` elements of the RDD (descending order) |
| `takeOrdered(num, key=None)` | Get `num` elements from an RDD, ordered in ascending order or as specified by the optional key function. |
]

**Remarks**

- `take(n)` returns `n` elements from the RDD and attempts to **minimize the number of partitions it accesses**
- the result may be a *biased* collection
- `collect` and `take` may return the elements in an order you don't expect

---
### Actions

How to get some values in an RDD ?

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `take(n)` | Return `n` elements from the RDD (deterministic) |
| `top(n)` | Return the first `n` elements of the RDD (descending order) |
| `takeOrdered(num, key=None)` | Get `num` elements from an RDD, ordered in ascending order or as specified by the optional key function. |
]

Example

```python
>>> rdd = sc.parallelize([(3, 'a'), (1, 'b'), (2, 'd')])
>>> rdd.takeOrdered(2)
[(1, 'b'), (2, 'd')]
```

```python
>>> rdd.takeOrdered(2, key=lambda x: x[1])
[(3, 'a'), (1, 'b')]
```

???

deterministic but arbitrary (may depend on implementation)

---
### Actions

The `reduce` action

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `reduce(f)` | Reduces the elements of this RDD using the specified commutative and associative binary operator `f`. |
| `fold(zeroValue, op)` | Same as `reduce()` but with the provided zero value. |
]

- `op(x, y)` is allowed to modify `x` and return it as its result value to avoid object allocation; however, it should not modify `y`.
- `reduce` applies some operation to pairs of elements until there is just one left. It throws an exception for empty collections.
- `fold` has an initial zero value: it is defined for empty collections.

---
### Actions

The `reduce` action

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `reduce(f)` | Reduces the elements of this RDD using the specified commutative and associative binary operator `f`. |
| `fold(zeroValue, op)` | Same as `reduce()` but with the provided zero value. |
]

Example

```python
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.reduce(lambda a, b: a + b)
6
```

```python
>>> rdd.fold(0, lambda a, b: a + b)
6
```

---
### Actions

The `reduce` action

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `reduce(f)` | Reduces the elements of this RDD using the specified commutative and associative binary operator `f`. |
| `fold(zeroValue, op)` | Same as `reduce()` but with the provided zero value. |
]

**Warning with `fold`.** The result can depend on the number of partitions

```python
>>> rdd = sc.parallelize([1, 2, 4], 2)  # RDD with 2 partitions
>>> rdd.fold(2.5, lambda a, b: a + b)
14.5
```

- The RDD has 2 partitions: say [1, 2] and [4]
- Sum in the partitions: 2.5 + (1 + 2) = 5.5 and 2.5 + (4) = 6.5
- Sum over partitions: 2.5 + (5.5 + 6.5) = 14.5

---
### Actions

The `reduce` action

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `reduce(f)` | Reduces the elements of this RDD using the specified commutative and associative binary operator `f`. |
| `fold(zeroValue, op)` | Same as `reduce()` but with the provided zero value. |
]

**Warning with `fold`.** The result can depend on the number of partitions

```python
>>> rdd = sc.parallelize([1, 2, 3], 5)  # RDD with 5 partitions
>>> rdd.fold(2, lambda a, b: a + b)
???
```

.footnote[[[Let's go to notebook05_sparkrdd.ipynb]](http://localhost:8888/notebooks/notebooks/notebook05_sparkrdd.ipynb)]

---
### Actions

The `reduce` action

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `reduce(f)` | Reduces the elements of this RDD using the specified commutative and associative binary operator `f`. |
| `fold(zeroValue, op)` | Same as `reduce()` but with the provided zero value. |
]

**Warning with `fold`.** The result can depend on the number of partitions

```python
>>> rdd = sc.parallelize([1, 2, 3], 5)  # RDD with 5 partitions
>>> rdd.fold(2, lambda a, b: a + b)
18
```

- Yes, even if there are fewer partitions than elements !
- 18 = 2 * 5 (one zero value per partition) + (1 + 2 + 3) + 2 (the zero value used when merging the partition results)

???

Find a proper showcase for `fold()`

---
### Actions

The `aggregate` action

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `aggregate(zero, seqOp, combOp)` | Similar to `reduce()` but used to return a different type. |
]

Aggregates the elements of each partition, and then the results for all the partitions, given aggregation functions and a zero value.

- `seqOp(acc, val)`: function to combine the elements of a partition from the RDD (`val`) with an accumulator (`acc`). It can return a different result type than the type of this `RDD`
- `combOp`: function that merges the accumulators of two partitions
- Once again, in both functions, the first argument can be modified while the second cannot

---
### Actions

The `aggregate` action

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `aggregate(zero, seqOp, combOp)` | Similar to `reduce()` but used to return a different type. |
]

Example

```python
>>> seqOp = lambda x, y: (x[0] + y, x[1] + 1)
>>> combOp = lambda x, y: (x[0] + y[0], x[1] + y[1])
>>> sc.parallelize([1, 2, 3, 4]).aggregate((0, 0), seqOp, combOp)
(10, 4)
```

```python
>>> (
  sc.parallelize([])
    .aggregate((0, 0), seqOp, combOp)
)
(0, 0)
```

.footnote[[[Let's go to notebook05_sparkrdd.ipynb]](http://localhost:8888/notebooks/notebooks/notebook05_sparkrdd.ipynb)]

???

The result is partition-dependent

---
### Actions

The `foreach` action

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `foreach(f)` | Apply a function `f` to each element of a RDD |
]

- Performs an action on all of the elements in the RDD without returning any result to the driver.
- Example: insert records into a database with `f`

The `foreach()` action lets us perform computations on each element in the RDD without bringing it back locally

???

In which way do `foreach` and `map` differ?

---
class: center, middle, inverse

### Persistence

---
### Lazy evaluation and persistence

- Spark RDDs are **lazily evaluated**
- Each time an action is called on an RDD, this RDD and all its dependencies are *recomputed*
- If you plan to reuse an RDD multiple times, you should use *persistence*

**Remarks**

- Lazy evaluation helps `spark` to **reduce the number of passes** over the data it has to make by grouping operations together
- No substantial benefit to writing a single complex map instead of chaining together many simple operations
- Users are free to organize their program into **smaller**, more **manageable operations**

???

Distinguish persistence from caching

---
### Persistence

How to use persistence ?
.pure-table.pure-table-striped[
| method | description |
|: ---------------------------:|: --------------------------------------------|
| `cache()` | Persist the RDD in memory |
| `persist(storageLevel)` | Persist the RDD according to `storageLevel` |
]

- These methods must be called *before* the action, and do not trigger the computation

Usage of `storageLevel`

```python
pyspark.StorageLevel(
    useDisk, useMemory, useOffHeap, deserialized, replication=1
)
```

???

- What does persistence in memory mean?
- Make `storageLevel` explicit
- Any difference between `cache()` and `persist()` with `useMemory`?
- Why do we call persistence caching?

---
name: option-for-persistence
### Persistence

Options for persistence

.pure-table.pure-table-striped[
| argument | description |
|: -------------: |: -------------------------------|
| `useDisk` | Allow caching to use disk if `True` |
| `useMemory` | Allow caching to use memory if `True` |
| `useOffHeap` | Store data outside of the JVM heap if `True`. Useful if using some in-memory storage system (such as `Tachyon`) |
| `deserialized` | Cache data without serialization if `True` |
| `replication` | Number of replications of the cached data |
]

---
template: option-for-persistence

`replication`

- If you cache data that is quite slow to recompute, you can use replication. If a machine fails, the data will not have to be recomputed.

???

`Tachyon` :

---
template: option-for-persistence

`deserialized`

- Serialization is the conversion of the data to a binary format
- To the best of our knowledge, `PySpark` only supports serialized caching (using `pickle`)

???

---
template: option-for-persistence

`useOffHeap`

- Data is cached in the JVM heap by default
- There are very interesting alternative in-memory solutions such as `Tachyon`
- Don't forget that `spark` is `scala` running on the JVM

---
### Back to options for persistence

```python
StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)
```

You can use these constants:

```python
DISK_ONLY = StorageLevel(True, False, False, False, 1)
DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)
MEMORY_AND_DISK = StorageLevel(True, True, False, True, 1)
MEMORY_AND_DISK_2 = StorageLevel(True, True, False, True, 2)
MEMORY_AND_DISK_SER = StorageLevel(True, True, False, False, 1)
MEMORY_AND_DISK_SER_2 = StorageLevel(True, True, False, False, 2)
MEMORY_ONLY = StorageLevel(False, True, False, True, 1)
MEMORY_ONLY_2 = StorageLevel(False, True, False, True, 2)
MEMORY_ONLY_SER = StorageLevel(False, True, False, False, 1)
MEMORY_ONLY_SER_2 = StorageLevel(False, True, False, False, 2)
OFF_HEAP = StorageLevel(False, False, True, False, 1)
```

and simply call for instance

```python
rdd.persist(MEMORY_AND_DISK)
```

---
### Persistence

What if you attempt to *cache too much data to fit in memory ?*

Spark will automatically evict old partitions using a *Least Recently Used* (LRU) cache policy:

- For the *memory-only* storage levels, it will recompute these partitions the next time they are accessed
- For the *memory-and-disk* ones, it will write them out to disk

Use `unpersist()` on RDDs to **manually remove them** from the cache
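---
### Persistence

A minimal sketch of persistence in action, assuming a running `SparkContext` named `sc`; the file path is a placeholder:

```python
from pyspark import StorageLevel

rdd = sc.textFile("/path/to/file.txt").map(lambda line: line.split(","))
rdd.persist(StorageLevel.MEMORY_AND_DISK)  # nothing is computed yet

rdd.count()      # first action: computes the RDD and caches its partitions
rdd.count()      # second action: reuses the cached partitions, no recomputation
rdd.unpersist()  # manually remove the RDD from the cache
```

Without the call to `persist`, the second `count()` would read and split the file again.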
---
### Reminder: about passing functions

**Warning**

- When passing functions, you can *inadvertently serialize the object containing the function*. If you pass a function that:
  - is a member of an object
  - contains references to fields in an object

  then `Spark` sends the *entire object to worker nodes*, which can be **much larger** than the bit of information you need

- This can cause your *program to fail*, if your class contains objects that **Python can't pickle**

---
### About passing functions

Passing a function with field references (don't do this !
) ```python class SearchFunctions(object): def __init__(self, query): self.query = query def isMatch(self, s): return self.query in s def getMatchesFunctionReference(self, rdd): # Problem: references all of "self" in "self.isMatch" return rdd.filter(self.isMatch) def getMatchesMemberReference(self, rdd): # Problem: references all of "self" in "self.query" return rdd.filter(lambda x: self.query in x) ``` Instead, **just extract the fields you need** from your object into a local variable and pass that in --- ### About passing functions `Python` function passing without field references ```python class WordFunctions(object): ... def getMatchesNoReference(self, rdd): # Safe: extract only the field we need into a local variable query = self.query return rdd.filter(lambda x: query in x) ``` -- Much better to do this instead --- template: inter-slide ## Pair RDD: key-value pairs --- ### Pair RDD: key-value pairs It's roughly an RDD where each element is a tuple with two elements: a key and a value - For numerous tasks, such as aggregations tasks, storing information as `(key, value)` pairs into RDD is very convenient - Such RDDs are called `PairRDD` - Pair RDDs expose *new operations* such as **grouping together** data with the same key, and **grouping together two different RDDs** ### Creating a pair RDD Calling `map` with a function returning a `tuple` with two elements ```python >>> rdd = sc.parallelize([[1, "a", 7], [2, "b", 13], [2, "c", 17]]) >>> rdd = rdd.map(lambda x: (x[0], x[1:])) >>> rdd.collect() [(1, ['a', 7]), (2, ['b', 13]), (2, ['c', 17])] ``` --- ###
Warning All elements of a `PairRDD` must be tuples with two elements (the key and the value) ```python >>> rdd = sc.parallelize([[1, "a", 7], [2, "b", 13], [2, "c", 17]]) >>> rdd.keys().collect() [1, 2, 2] >>> rdd.values().collect() ['a', 'b', 'c'] ``` -- For things to work as expected you **must** do ```python >>> rdd = sc.parallelize([[1, "a", 7], [2, "b", 13], [2, "c", 17]])\ .map(lambda x: (x[0], x[1:])) >>> rdd.keys().collect() [1, 2, 2] >>> rdd.values().collect() [['a', 7], ['b', 13], ['c', 17]] ``` --- name: transformations-for-a-single-PairRDD ### Transformations for a single `PairRDD` .pure-table.pure-table-striped.f6[ | transformation | description | |: -------------: |: -------------------------------| | `keys()` | Return an RDD containing the keys | | `values()` | Return an RDD containing the values | | `sortByKey()` | Return an RDD sorted by the key | | `mapValues(f)` | Apply a function `f` to each value of a pair RDD without changing the key | | `flatMapValues(f)` | Pass each value in the key-value pair RDD through a flatMap function `f` without changing the keys | ] --- template: transformations-for-a-single-PairRDD Example with `mapValues` ```python >>> rdd = sc.parallelize([("a", "x y z"), ("b", "p r")]) >>> rdd.mapValues(lambda v: v.split(' ')).collect() [('a', ['x', 'y', 'z']), ('b', ['p', 'r'])] ``` --- template: transformations-for-a-single-PairRDD Example with `flatMapValues` ```python >>> texts = sc.parallelize([("a", "x y z"), ("b", "p r")]) >>> tokenize = lambda x: x.split(" ") >>> texts.flatMapValues(tokenize).collect() [('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')] ``` --- name: transformations-for-a-single-PairRDD-keyed ### Transformations for a single `PairRDD` (keyed) .pure-table.pure-table-striped.f6[ | transformation | description | |: -------------: |: -------------------------------| | `groupByKey()` | Group values with the same key | | `reduceByKey(f)`| Merge the values for each key using an associative reduce function `f`. | | `foldByKey(f)` | Merge the values for each key using an associative reduce function `f`. | | `combineByKey(createCombiner, mergeValue, mergeCombiners, [partitioner])` | Generic function to combine the elements for each key using a custom set of aggregation functions. 
| ]

---
template: transformations-for-a-single-PairRDD-keyed

Example with `groupByKey`

```python
>>> rdd = sc.parallelize([
  ("a", 1), ("b", 1), ("a", 1), ("b", 3), ("c", 42)
])
>>> rdd.groupByKey().mapValues(list).collect()
[('c', [42]), ('b', [1, 3]), ('a', [1, 1])]
```

---
.center[<img src="figs/group_by.png">]

---
template: transformations-for-a-single-PairRDD-keyed

Example with `reduceByKey`

```python
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> rdd.reduceByKey(lambda a, b: a + b).collect()
[('a', 2), ('b', 1)]
```

- The reducing occurs first **locally** (within partitions)
- Then, a shuffle is performed with the local results to reduce globally

---
.center[<img src="figs/reduce_by.png">]

---
template: transformations-for-a-single-PairRDD-keyed

`combineByKey`

Transforms an `RDD[(K, V)]` into another RDD of type `RDD[(K, C)]` for a "combined" type `C` that can be different from `V`

The user must define

- `createCombiner` : which turns a `V` into a `C`
- `mergeValue` : to merge a `V` into a `C`
- `mergeCombiners` : to combine two `C`'s into a single one

---
template: transformations-for-a-single-PairRDD-keyed

In this example

- `createCombiner` : converts the value to `str`
- `mergeValue` : concatenates two `str`
- `mergeCombiners` : concatenates two `str`

```python
>>> rdd = sc.parallelize([('a', 1), ('b', 2), ('a', 13)])
>>> def add(a, b): return a + str(b)
>>> rdd.combineByKey(str, add, add).collect()
[('a', '113'), ('b', '2')]
```

---
### Transformations for two `PairRDD`

.pure-table.pure-table-striped[
| transformation | description |
|: -------------: |: -------------------------------|
| `subtractByKey(other)` | Remove elements with a key present in the `other` RDD. |
| `join(other)` | Inner join with `other` RDD. |
| `rightOuterJoin(other)` | Right join with `other` RDD. |
| `leftOuterJoin(other)` | Left join with `other` RDD. |
]

- Left join: the result keeps every key present in the first (`self`) RDD
- Right join: the result keeps every key present in the `other` RDD

.center[<img width='600px' src="figs/join-types.png">]

---
### Transformations for two `PairRDD`

- Join operations are mainly used through the high-level API: `DataFrame` objects and the `spark.sql` API
- We will use them a lot with the high-level API (`DataFrame` from `spark.sql`)

.footnote[[[Let's go to notebook05_sparkrdd.ipynb]](http://localhost:8888/notebooks/notebooks/notebook05_sparkrdd.ipynb)]

---
### Actions for a single `PairRDD`

.pure-table.pure-table-striped[
| action | description |
|: -------------: |: -------------------------------|
| `countByKey()` | Count the number of elements for each key. |
| `lookup(key)` | Return all the values associated with the provided `key`. |
| `collectAsMap()` | Return the key-value pairs in this RDD to the master as a Python dictionary. |
]

.footnote[[[Let's go to notebook05_sparkrdd.ipynb]](http://localhost:8888/notebooks/notebooks/notebook05_sparkrdd.ipynb)]

---
### Data partitioning

- Some operations on `PairRDD`s, such as `join`, require scanning the data **more than once**
- Partitioning the RDDs **in advance** can reduce network communications (see the sketch on the next slide)
- When a key-oriented dataset is reused several times, partitioning can improve performance
- In `Spark`: you can *choose which keys will appear on the same node*, but you have no explicit control over which worker node each key goes to.
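---
### Data partitioning

A minimal sketch of pre-partitioning a pair RDD before a `join`, assuming a running `SparkContext` named `sc`; the RDD names and data are ours:

```python
users = sc.parallelize([(1, "Alice"), (2, "Bob"), (3, "Carol")])
events = sc.parallelize([(1, "click"), (3, "view"), (1, "view")])

users_part = users.partitionBy(4).cache()  # hash-partition once and keep in memory

users_part.join(events).collect()
# [(1, ('Alice', 'click')), (1, ('Alice', 'view')), (3, ('Carol', 'view'))]  (order may vary)
```

Because `users_part` keeps its partitioner once cached, it does not need to be re-shuffled for every subsequent key-oriented operation.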
---
### Data partitioning

In practice, you can specify the number of partitions with

```python
rdd.partitionBy(100)
```

You can also use a custom partition function such that `f(key)` returns a hash

```python
from urllib.parse import urlparse

def hash_domain(url):
    # Returns a hash associated with the domain of a website
    return hash(urlparse(url).netloc)

rdd.partitionBy(20, hash_domain)  # Create 20 partitions
```

To have finer control over partitioning, you must use the Scala API.

---
class: center, middle, inverse

### Thank you !