name: inter-slide
class: left, middle, inverse

{{ content }}

---
name: layout-general
layout: true
class: left, middle

<style>
.remark-slide-number {
  position: inherit;
}

.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 4px;
  display: block;
  left: 0;
  right: 0;
}

.remark-slide-number .progress-bar {
  height: 100%;
  background-color: red;
}

/* custom.css */
.plot-callout {
  width: 300px;
  bottom: 5%;
  right: 5%;
  position: absolute;
  padding: 0px;
  z-index: 100;
}

.plot-callout img {
  width: 100%;
  border: 1px solid #23373B;
}
</style>
---
class: middle, left, inverse

# Technologies Big Data : Introduction

### 2024-01-23

#### [Master I MIDS & Master I Informatique]()
#### [Technologies Big Data](http://stephane-v-boucheron.fr/courses/isidata/)
#### [S. Gaïffas, A. Gheerbrant, S. Has, S. Boucheron, V. Ravelomanana](http://stephane-v-boucheron.fr)

---
exclude: true
class: center, middle

# Big Data Technologies

## Master Mathematics and Informatics

.medium[Stéphane Gaïffas - Stéphane Boucheron]

.center[
<img src="figs/lpsm.png" style="height: 160px;" />
<img src="" style="width: 30px;" />
<img src="figs/paris-diderot.png" style="height: 90px;" />
<img src="" style="width: 30px;" />
<img src="figs/uparis.png" style="height: 120px;" />
]

---
layout: true
class: top

---
template: inter-slide

## Course logistics

---
exclude: true

### Who are we ?

.fl.w-50.pa2[
.center[
<img src="figs/stephaneb.jpg" style="height: 140px;" />
]
- Stéphane Boucheron
- LPSM
- Statistics
- [https://stephane-v-boucheron.fr](https://stephane-v-boucheron.fr)
]

.fl.w-50.pa2[
.center[
<img src="figs/amelie.jpeg" style="height: 140px;" />
]
- Amélie Gheerbrant
- IRIF
- Data Science, Databases
- [https://www.irif.fr/~amelie/](https://www.irif.fr/~amelie/)
]

---

### Who are we ?

.fl.w-50.pa2[
.center[
<img src="figs/stephaneb.jpg" style="height: 140px;" />
]
- Stéphane Boucheron
- LPSM
- Statistics
- [https://stephane-v-boucheron.fr](https://stephane-v-boucheron.fr)
]

.fl.w-50.pa2[
.center[
<img src="figs/vlad.png" style="height: 140px;" />
]
- Vlady Ravelomanana
- IRIF
- Data Science, Graphs, Algorithms
- [https://www.irif.fr/~vlad/](https://www.irif.fr/~vlad/)
]

---

### Course logistics

- 24 hours = 2 hours `\(\times\)` .stress[12 weeks] : classes + hands-on
- [Agenda](https://edt.math.univ-paris-diderot.fr/#/parcours/mathinfo/m1)

#### About the hands-on

- Hands-on sessions and homework use .stress[`Jupyter`/`Quarto` notebooks]
- Using a `Docker` image built for the course
- Hands-on sessions must be carried out on your .stress[own laptop]. Bring it to **every class**

---
exclude: true

### Course logistics

- The .stress[webpage] of the course is:
.center[[https://stephane-v-boucheron.fr/courses/grosses-data/](https://stephane-v-boucheron.fr/courses/grosses-data/)]

- .stress[Bookmark it] !
- Follow .stress[carefully] the steps described in the `tools` page:

.center[[https://stephanegaiffas.github.io/big_data_course/tools](https://stephanegaiffas.github.io/big_data_course/tools)]
- Who knows about `docker` ?

.center[<img src="figs/docker.png" style="width: 70%;" />]

---

### Course evaluation

- .stress[Evaluation] using **homework assignments** and a **final project**
- Find a .stress[friend] : all work is done in **pairs of students**
- **All your work** goes in your private repository and nowhere else: .stress[no emails] !
- All homework is done in .stress[`jupyter` notebooks] or .stress[`quarto`] files

---
exclude: true
template: inter-slide

## `Docker`
---
exclude: true

### Why [`docker`](https://www.docker.com) ? What is it ?

- Don't mess with your `python` env. and configuration files
- Everything is embedded in a .stress[container] (better than a Virtual Machine)
- A .stress[container] is an **instance** of an .stress[image]
- Same image = same environment for everybody
- Same image = no {version, dependencies, install} problems
- It is an .stress[industry standard] used everywhere now!

.pull-left[
<img src="figs/containers.png" style="width: 70%;" />
]
.pull-right[
<img src="figs/python_environment.png" style="width: 75%;" />
]

---
exclude: true

### `docker`

- Have a look at

.center[[https://s-v-b.github.io/big_data_course/tools](https://s-v-b.github.io/big_data_course/tools)]

- Have a look at the `Dockerfile` to see how the image is built
- Perform a quick demo on how to use the `docker` image

<br>

#### And that's it for the logistics !

---
class: center, middle, inverse

## Big data

---

### Big data

- .stress[Moore's Law]: *computing power* **doubled** every two years between 1975 and 2012
- Nowadays, doubling takes closer to **two and a half years**
- .stress[Rapid growth of datasets]: **internet activity**, social networks, genomics, physics, sensor networks, IoT, ...
- .stress[Data size trends]: **doubles every year** according to the [IDC executive summary](https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm)
- .stress[Data deluge]: today, data is growing faster than computing power

### Question

- How do we **catch up** to **process the data deluge** and **learn from it** ?

---

### Orders of magnitude

#### bit

A *bit* is a single binary value, either 1 or 0 (on or off)

#### byte (B)

A *byte* is made of 8 bits

- 1 character, e.g. "a", is one byte

#### Kilobyte (KB)

A kilobyte is `\(1\,024 = 2^{10}\)` bytes

- **2** or **3** paragraphs of ASCII text

---

### Some more comparisons

#### Megabyte (MB)

A megabyte is `\(1\,048\,576 = 2^{20}\)` B or `\(1\,024\)` KB

- **873** pages of plain text
- **4** books (200 pages or 240 000 characters)

#### Gigabyte (GB)

A gigabyte is `\(1\,073\,741\,824 = 2^{30}\)` B, `\(1\,024\)` MB or `\(1\,048\,576\)` KB

- **894 784** pages of plain text (1 200 characters)
- **4 473** books (200 pages or 240 000 characters)
- **640** web pages (with 1.6 MB average file size)
- **341** digital pictures (with 3 MB average file size)
- **256** MP3 audio files (with 4 MB average file size)
- **1.5** 650 MB CDs

---

### Even more

#### Terabyte (TB)

A terabyte is `\(1\,099\,511\,627\,776 = 2^{40}\)` B, **1 024** GB or **1 048 576** MB

- **916 259 689** pages of plain text (1 200 characters)
- **4 581 298** books (200 pages or 240 000 characters)
- **655 360** web pages (with 1.6 MB average file size)
- **349 525** digital pictures (with 3 MB average file size)
- **262 144** MP3 audio files (with 4 MB average file size)
- **1 613** 650 MB CDs
- **233** 4.38 GB DVDs
- **40** 25 GB Blu-ray discs

---

### The deluge

#### Petabyte (PB)

A petabyte is **1 024** TB, **1 048 576** GB or **1 073 741 824** MB

`$$1\,125\,899\,906\,842\,624 = 2^{50} \quad\text{Bytes}$$`

- **938 249 922 368** pages of plain text (1 200 characters)
- **4 691 249 611** books (200 pages or 240 000 characters)
- **671 088 640** web pages (with 1.6 MB average file size)
- **357 913 941** digital pictures (with 3 MB average file size)
- **268 435 456** MP3 audio files (with 4 MB average file size)
- **1 651 910** 650 MB CDs
- **239 400** 4.38 GB DVDs
- **41 943** 25 GB Blu-ray discs

#### Exabyte, etc.

- 1 EB = 1 exabyte = 1 024 PB
- 1 ZB = 1 zettabyte = 1 024 EB
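---

### Units in code

To make these conversions concrete, here is a minimal `Python` sketch (illustration only, not course material) that renders a raw byte count in the binary units above:

```python
# Minimal sketch: pretty-print byte counts with binary prefixes (1 KB = 2**10 B)
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def humanize(n_bytes: float) -> str:
    """Return n_bytes as a human-readable string, e.g. humanize(1536) == '1.5 KB'."""
    for unit in UNITS:
        if n_bytes < 1024 or unit == UNITS[-1]:
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1024

print(humanize(2**50))        # 1.0 PB
print(humanize(4.8 * 2**70))  # 4.8 ZB
```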
---

### Some figures

Every .stress[single second] `\(\mbox{}^1\)` :

- At least **8,000 tweets** sent
- **900+ photos** posted on **Instagram**
- **Thousands of Skype calls** made
- Over **70,000 Google searches** performed
- Around **80,000 YouTube videos** viewed
- Over **2 million emails** sent

.footnote[[1] [https://www.internetlivestats.com](https://www.internetlivestats.com)]

---

### Some figures

There are `\(\mbox{}^1\)` :

- .stress[5 billion web pages] as of mid-2019 (indexed web)

and an expected `\(\mbox{}^2\)` :

- .stress[4.8 ZB] of annual IP traffic in 2022

Note that

- **1** ZB `\(\approx\)` **36 000** years of HD video
- Netflix's **entire catalog** is `\(\approx\)` **3.5 years** of HD video

.footnote[
[1] [https://www.worldwidewebsize.com](https://www.worldwidewebsize.com) <br>
[2] Cisco's Visual Networking Index
]

---

### Some figures

More figures :

- **Facebook** daily logs: **60TB**
- **1000 genomes** project: **200TB**
- Google web index: **10+ PB**
- Cost of **1TB** of storage: **~$35**
- Time to read **1TB** from disk: **3 hours** at **100MB/s**

<!-- ### Let's give some .stress[latencies] now -->

---

### Latency numbers

.f6[.pure-table.pure-table-striped[
| Operation | Latency (ns) | Latency (us) | Latency (ms) | Comparison |
| :--------------------------------- | ---------------: | ----------: | -----: | :-------------------------- |
| L1 cache reference | 0.5 ns | | | |
| L2 cache reference | 7 ns | | | 14x L1 cache |
| Main memory reference | 100 ns | | | 20x L2, 200x L1 |
| Compress 1K bytes with Zippy/Snappy | 3,000 ns | 3 us | | |
| Send 1K bytes over 1 Gbps network | 10,000 ns | 10 us | | |
| Read 4K randomly from SSD* | 150,000 ns | 150 us | | ~1GB/sec SSD |
| Read 1 MB sequentially from memory | 250,000 ns | 250 us | | |
| Round trip within same datacenter | 500,000 ns | 500 us | | |
| Read 1 MB sequentially from SSD* | 1,000,000 ns | 1,000 us | 1 ms | ~1GB/sec SSD, 4X memory |
| Disk seek | 10,000,000 ns | 10,000 us | 10 ms | 20x datacenter roundtrip |
| Read 1 MB sequentially from disk | 20,000,000 ns | 20,000 us | 20 ms | 80x memory, 20x SSD |
| Send packet US -> Europe -> US | 150,000,000 ns | 150,000 us | 150 ms | 600x memory |
]]

---
exclude: true

```
traceroute to mathscinet.ams.org (104.238.176.204), 64 hops max
 1  192.168.10.1  3,149ms  1,532ms  1,216ms
 2  192.168.0.254  1,623ms  1,397ms  1,309ms
 3  78.196.1.254  2,571ms  2,120ms  2,371ms
 4  78.255.140.126  2,813ms  2,621ms  2,200ms
 5  78.254.243.86  2,626ms  2,528ms  2,517ms
 6  78.254.253.42  2,517ms  4,129ms  2,671ms
 7  78.254.242.54  2,535ms  2,258ms  2,350ms
 8  * * *
 9  195.66.224.191  12,231ms  11,718ms  12,486ms
10  * * *
11  63.218.14.58  26,213ms  19,264ms  18,949ms
12  63.218.231.106  29,135ms  22,078ms  17,954ms
```

---

### Latency numbers

- Reading 1MB from **disk** = **100 x** reading 1MB from **memory**
- Sending packet from **US to Europe to US** = **1 000 000 x** main memory reference

#### General tendency

True in general, though not always:

- memory operations : .stress[fastest]
- disk operations : .stress[slow]
- network operations : .stress[slowest]

---

### Latency numbers

.small[[https://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html](https://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html)]

.center[
<img src="figs/latency_numbers.png" style="width: 100%;" />
]
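---

### Humanizing latencies in code

The next slide translates these numbers to a human scale by multiplying every duration by a billion (1 ns becomes 1 s). A minimal `Python` sketch of that arithmetic (illustration only; the constants come from the latency table above):

```python
# Scale each latency by 1e9: a duration of x ns becomes x seconds.
LATENCIES_NS = {
    "L1 cache reference": 0.5,
    "Main memory reference": 100,
    "Read 1 MB sequentially from SSD": 1_000_000,
    "Disk seek": 10_000_000,
    "Send packet US -> Europe -> US": 150_000_000,
}

for op, ns in LATENCIES_NS.items():
    seconds = ns  # x ns * 1e9 = x s
    print(f"{op:35s} {seconds:>13,.1f} s (~{seconds / 86_400:10,.1f} days)")
```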
---

### Humanized latency numbers

Let's multiply all these durations by a billion

.f6[.pure-table.pure-table-striped[
| Memory type | Latency | Human duration |
| :--------------------------------- | -----------: | ----------------------------------------------------: |
| L1 cache reference | 0.5 s | One heartbeat (0.5 s) |
| L2 cache reference | 7 s | Long yawn |
| Main memory reference | 100 s | Brushing your teeth |
| Send 2K bytes over 1 Gbps network | 5.5 hr | From lunch to end of work day |
| SSD random read | 1.7 days | A normal weekend |
| Read 1 MB sequentially from memory | 2.9 days | A long weekend |
| Round trip within same datacenter | 5.8 days | A medium vacation |
| Read 1 MB sequentially from SSD | 11.6 days | Waiting almost 2 weeks for a delivery |
| Disk seek | 16.5 weeks | A semester in university |
| Read 1 MB sequentially from disk | 7.8 months | Almost producing a new human being |
| Send packet US -> Europe -> US | 4.8 years | Average time it takes to complete a bachelor's degree |
]]

---
template: inter-slide

## Challenges

---

### Challenges with big datasets

- Large datasets .stress[don't fit] on a **single** hard drive
- **One** large (and expensive) machine .stress[can't process or store] **all** the data
- For **computations**, how do we .stress[stream data] from the **disk to the different layers of memory** ?
- **Concurrent accesses** to the data: disks .stress[cannot] be **read in parallel**

---

### Solutions

- Combine .stress[several machines] containing **hard drives** and **processors** on a **network**
- Use .stress[commodity hardware]: cheap, common architecture, i.e. **processor** + **RAM** + **disk**
- .stress[Scalability] = **more machines** on the network
- .stress[Partition] the data across the machines

<!-- .center[ <img src="figs/big-data-tease.jpg" style="width: 35%;" /> ] -->

---

### Challenges

Dealing with distributed computations adds **software complexity**

- .stress[Scheduling]: How to **split the work across machines**? Must exploit and optimize data locality, since moving data is very expensive (see the sketch a few slides below)
- .stress[Reliability]: How to **deal with failure**? Commodity (cheap) hardware fails more often: @Google, [1%, 5%] HD failures/year and 0.2% [DIMM](https://en.wikipedia.org/wiki/DIMM) failures/year
- .stress[Uneven performance] of the machines: some nodes are slower than others

???

.fl.w-50.pa2[
Problems sketched in:

![](./figs/next-gen-databases.png)
]
.fl.w-50.pa2[
]

---

### Solutions

- .stress[Schedule], **manage** and **coordinate** threads and resources using appropriate software
- .stress[Locks] to **limit** access to resources
- .stress[Replicate] data for **faster reading** and **reliability**

---

### Is it HPC ?

- **High Performance Computing** (HPC)
- **Parallel computing**

#### Comments

- For HPC, *scaling up* means using a .stress[bigger machine]
- Huge performance increase for **medium** scale problems
- .stress[Very expensive], specialized machines, lots of processors and memory

#### Answer is no !

???

> Google committed to a number of key tenets when designing its data center architecture. Most significantly—and at the time, uniquely—Google committed to massively parallelizing and distributing processing across very large numbers of commodity servers. Google also adopted a “Jedis build their own lightsabers” attitude: very little third party— and virtually no commercial—software would be found in the Google architecture. “Build” was considered better than “buy” at Google.
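---

### Why data locality matters

A back-of-the-envelope `Python` sketch (illustration only), using the latency numbers from the earlier table, of what it costs to move 1 GB around instead of reading it from memory:

```python
# Rough orders of magnitude, taken from the latency table.
NS_PER_MB_MEMORY = 250_000    # read 1 MB sequentially from memory
NS_PER_MB_DISK = 20_000_000   # read 1 MB sequentially from disk
NS_PER_KB_NETWORK = 10_000    # send 1 KB over a 1 Gbps network

GB_IN_MB = 1024

for label, ns in [
    ("1 GB from memory", GB_IN_MB * NS_PER_MB_MEMORY),
    ("1 GB from local disk", GB_IN_MB * NS_PER_MB_DISK),
    ("1 GB over the network", GB_IN_MB * 1024 * NS_PER_KB_NETWORK),
]:
    print(f"{label:22s} ~{ns / 1e9:5.1f} s")
```

Memory wins by roughly two orders of magnitude, which is why schedulers try to move computation to the data rather than the other way around.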
---

### The Big Data universe

Many technologies combining .stress[software] and .stress[cloud computing]

.center[
<img src="figs/teasing2.jpg" style="width: 100%;" />
]

---

### The Big Data universe

Often used with/for .stress[Machine Learning] (or AI)

.center[
<img src="figs/teasing3.png" style="width: 90%;" />
]

---

### Tools

- Software such as .stress[`HadoopMR`] (Hadoop MapReduce) and, more recently, .stress[`Spark`] and .stress[`Dask`] cope with these challenges
- They are .stress[distributed computational engines]: software that eases the development of distributed algorithms

They run on .stress[clusters] (several machines on a network), managed by a .stress[resource manager] such as :

- **`Yarn` :** [https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html)
- **`Mesos` :** [http://mesos.apache.org](http://mesos.apache.org)
- **`Kubernetes` :** [https://kubernetes.io](https://kubernetes.io/)

A resource manager ensures that the tasks running on the cluster do not try to use the same resources all at once

???

---
class: center, middle, inverse

## `Apache Spark`

---

### Apache `Spark`

The course will focus mainly on .stress[`Spark`] for big data processing

.center[
<img src="figs/spark.png" style="width: 35%;" />

[https://spark.apache.org](https://spark.apache.org)
]

- `Spark` is an .stress[industry standard] <br> (cf. [https://spark.apache.org/powered-by.html](https://spark.apache.org/powered-by.html))
- One of the most widely used .stress[big data processing frameworks]
- .stress[Open source]

The predecessor of `Spark` is [`Hadoop`](https://hadoop.apache.org)

???

See Chapter 2 in [Next Generation Databases](https://link.springer.com/book/10.1007/978-1-4842-1329-2) by [Guy Harrison](https://www.guyharrison.net)

---

### [`Hadoop`](https://hadoop.apache.org)

- `Hadoop` has a simple API and good fault tolerance (tolerance to nodes failing midway through a processing job)
- The cost is lots of .stress[data shuffling] across the network
- With intermediate results .stress[written to disk] **over the network**, which we know is .stress[very expensive] in time

It is made of three components:

- .stress[`HDFS`] (Hadoop Distributed File System), inspired by the `GoogleFileSystem`, see .small[[https://ai.google/research/pubs/pub51](https://ai.google/research/pubs/pub51)]
- .stress[`YARN`] (Yet Another Resource Negotiator)
- .stress[`MapReduce`], inspired by Google <br> .small[[https://research.google.com/archive/mapreduce.html](https://research.google.com/archive/mapreduce.html)]

???

> The Hadoop 1.0 architecture is powerful and easy to understand, but it is limited to MapReduce workloads and it provides limited flexibility with regard to scheduling and resource allocation.

> In the Hadoop 2.0 architecture, YARN (Yet Another Resource Negotiator or, recursively, YARN Application Resource Negotiator) improves scalability and flexibility by splitting the roles of the Task Tracker into two processes.

> A *Resource Manager* controls access to the cluster's resources (memory, CPU, etc.) while the *Application Manager* (one per job) controls task execution.

.fr[Guy Harrison. Next Generation Databases]

---

### MapReduce's wordcount example

.center[<img src="figs/WordCountFlow.JPG" width=95%/>]
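---

### Wordcount in `PySpark`

For concreteness, here is a minimal `PySpark` version of the wordcount flow from the previous slide (a sketch: the input file name is hypothetical):

```python
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")

counts = (
    sc.textFile("words.txt")               # read input as lines
      .flatMap(lambda line: line.split())  # map: emit one record per word
      .map(lambda word: (word, 1))         # map: (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)     # shuffle + reduce: sum per word
)

print(counts.collect())
```

`Spark` keeps the intermediate pairs in memory whenever it can, whereas `HadoopMR` writes them to disk between the map and reduce phases.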
---

### `Spark`

Advantages of `Spark` over `HadoopMR` ?

- .stress[In-memory storage]: use **RAM** for fast iterative computations
- .stress[Lower overhead] for starting jobs
- .stress[Simple and expressive] with `Scala`, `Python`, `R`, `Java` APIs
- .stress[Higher level libraries] with `SparkSQL`, `SparkStreaming`, etc.

Disadvantages of `Spark` over `HadoopMR` ?

- `Spark` requires servers with **more CPU** and **more memory**
- But still much cheaper than HPC

`Spark` is .stress[much faster] than `Hadoop`

- `Hadoop` uses **disk** and **network**
- `Spark` tries to use **memory** as much as possible for operations, while minimizing network use

---

### `Spark` and `Hadoop` comparison

<br>

.pure-table.pure-table-striped[
| | HadoopMR | Spark |
|:-------------------------|:--------------|:------------------------------ |
| Storage | Disk | In-memory or disk |
| Operations | Map, reduce | Map, reduce, join, sample, ... |
| Execution model | Batch | Batch, interactive, streaming |
| Programming environments | Java | Scala, Java, Python, R |
]

---

### `Spark` and `Hadoop` comparison

For **logistic regression** training (a simple **classification** algorithm which requires **several passes** over a dataset)

.center[
<img src="figs/spark-dev3.png" width=50%/>
]

<br>

.center[
<img src="figs/logistic-regression.png" width=30%/>
]

---

### The `Spark` stack

.center[<img src="figs/spark_stack.png" width=85%/>]

---

### The `Spark` stack

.center[<img src="figs/spark-env-source.png" width=95%/>]

???

---

### `Spark` can run "everywhere"

.center[<img src="figs/spark-runs-everywhere.png" width=55%/>]

???

- [https://mesos.apache.org](https://mesos.apache.org): Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be easily built and run effectively. Mesos is built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments.
- [https://kubernetes.io](https://kubernetes.io): Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications.

---
template: inter-slide

## Agenda, tools and references

---

### Very tentative agenda for the course

**Weeks 1, 2 and 3** <br> The .stress[`Python` data-science stack] for **medium-scale** problems

**Weeks 4 and 5** <br> Introduction to .stress[`spark`] and its .stress[low-level API]

**Weeks 6, 7 and 8** <br> `Spark`'s high-level API: .stress[`spark.sql`]. Data from different formats and sources (a first taste on the next slide)

**Week 9** <br> Run a job on a cluster with .stress[`spark-submit`], monitoring, mistakes and debugging

**Weeks 10, 11, 12** <br> Introduction to .stress[`spark-streaming`] and a glimpse at other big data technologies (Dask)
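---

### A first taste of `spark.sql`

A minimal `PySpark` sketch of the high-level DataFrame API covered in weeks 6 to 8 (illustration only; the data is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("teaser").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 36), ("alice", 30)],
    schema=["name", "age"],
)

df.groupBy("name").count().show()  # SQL-like aggregation, no explicit map/reduce
```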
---

### Main tools for the course (tentative...)

#### Infrastructure

.center[
<img src="figs/docker.png" width=25%/>
<img src="" width=10%/>
]

#### Python stack

.center[
<img src="figs/python.png" width=20%/>
<img src="" width=5%/>
<img src="figs/numpy.jpg" width=18%/>
<img src="" width=5%/>
<img src="figs/pandas.png" width=28%/>
<img src="" width=5%/>
<img src="figs/jupyter_logo.png" width=7%/>
]

#### Data Visualization

.center[
<img src="figs/matplotlib.png" width=20%/>
<img src="" width=5%/>
<img src="figs/seaborn.png" width=20%/>
<img src="" width=5%/>
<img src="figs/bokeh.png" width=20%/>
<img src="" width=5%/>
<img src="figs/plotly-logo.png" width=20%/>
]

---

### Main tools for the course (tentative...)

#### Big data processing

.center[
<img src="figs/spark.png" width=20%/>
<img src="" width=10%/>
<img src="figs/pyspark.jpg" width=20%/>
<img src="" width=10%/>
<img src="figs/dask.png" width=10%/>
]

#### Data storage / formats / querying

.center[
<img src="figs/sql.jpg" width=20%/>
<img src="" width=5%/>
<img src="figs/orc.png" width=20%/>
<img src="" width=5%/>
<img src="figs/parquet.png" width=30%/>
<img src="figs/json.png" width=20%/>
<img src="" width=15%/>
<img src="figs/hdfs.png" width=25%/>
]

---

### Learning resources

- .stress[Spark Documentation Website] <br> .small[[http://spark.apache.org/docs/latest/](http://spark.apache.org/docs/latest/)]
- .stress[API docs] <br> .small[[http://spark.apache.org/docs/latest/api/scala/index.html](http://spark.apache.org/docs/latest/api/scala/index.html)] <br> .small[[http://spark.apache.org/docs/latest/api/python/](http://spark.apache.org/docs/latest/api/python/)]
- .stress[`Databricks` learning notebooks] <br> .small[[https://databricks.com/resources](https://databricks.com/resources)]
- .stress[StackOverflow] <br> .small[[https://stackoverflow.com/tags/apache-spark](https://stackoverflow.com/tags/apache-spark)] <br> .small[[https://stackoverflow.com/tags/pyspark](https://stackoverflow.com/tags/pyspark)]
- .stress[More advanced] <br> .small[[http://books.japila.pl/apache-spark-internals/](http://books.japila.pl/apache-spark-internals/)]
- .stress[Misc.] <br> .small[[Next Generation Databases: NoSQL and Big Data by Guy Harrison](https://link.springer.com/book/10.1007/978-1-4842-1329-2)] <br> .small[[Data Pipelines Pocket Reference by J. Densmore](https://www.oreilly.com/library/view/data-pipelines-pocket/9781492087823/)]

---

### Learning Resources

.pull-left-80[
- .stress[Book]: **"Spark: The Definitive Guide"** .small[[http://shop.oreilly.com/product/0636920034957.do](http://shop.oreilly.com/product/0636920034957.do)] <br> .tiny[[https://github.com/databricks/Spark-The-Definitive-Guide](https://github.com/databricks/Spark-The-Definitive-Guide)]
]
.pull-right-20[
<img src="figs/spark_book.gif" style="height: 160px;" />
]

<img src="" style="height: 200px;" />

And the **most important thing is:**

.pull-left[
.stress[.large[Practice!]]
]
.pull-right[
<img src="figs/wtf.jpg" style="height: 200px;" />
]

---
template: inter-slide

## Data centers

---

### Data centers

Wonder what a .stress[datacenter looks like] ?

- Have a look at [http://www.google.com/about/datacenters](http://www.google.com/about/datacenters)

---

### Data centers

Wonder what a .stress[datacenter looks like] ?

.center[<img src="figs/datacenter2.jpg" width=80%/>]

---

### Data centers

Wonder what a .stress[datacenter looks like] ?

<br>

.center[
<iframe width="672" height="378" src="https://www.youtube.com/embed/avP5d16wEp0" frameborder="0" allowfullscreen>
</iframe>
]

---
class: center, middle, inverse

# Thank you !