name: inter-slide
class: left, middle, inverse

{{ content }}

---

name: layout-general
layout: true
class: left, middle

<style>
.remark-slide-number {
  position: inherit;
}

.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 4px;
  display: block;
  left: 0;
  right: 0;
}

.remark-slide-number .progress-bar {
  height: 100%;
  background-color: red;
}

/* custom.css */
.plot-callout {
  width: 300px;
  bottom: 5%;
  right: 5%;
  position: absolute;
  padding: 0px;
  z-index: 100;
}

.plot-callout img {
  width: 100%;
  border: 1px solid #23373B;
}
</style>
---

class: middle, left, inverse

# Technologies Big Data: Introduction

### 2023-01-17

#### [Master I MIDS - Master I Informatique]()

#### [Technologies Big Data](http://stephane-v-boucheron.fr/courses/isidata/)

#### [Amélie Gheerbrant, Stéphane Gaïffas, Stéphane Boucheron](http://stephane-v-boucheron.fr)

---

exclude: true
class: center, middle

# Big Data Technologies

## Master Mathematics and Informatics

.medium[Stéphane Gaïffas - Stéphane Boucheron]

.center[
<img src="figs/lpsm.png" style="height: 160px;" />
<img src="" style="width: 30px;" />
<img src="figs/paris-diderot.png" style="height: 90px;" />
<img src="" style="width: 30px;" />
<img src="figs/uparis.png" style="height: 120px;" />
]

---

layout: true
class: top

---

template: inter-slide

## Course logistics

---

### Who are we?

.fl.w-50.pa2[
.center[
<img src="figs/stephaneb.jpg" style="height:140px;" />
]

- Stéphane Boucheron
- LPSM
- Statistics
- [https://stephane-v-boucheron.fr](https://stephane-v-boucheron.fr)
]

.fl.w-50.pa2[
.center[
<img src="figs/amelie.jpeg" style="height: 140px;" />
]

- Amélie Gheerbrant
- IRIF
- Data Science, Databases
- [https://www.irif.fr/~amelie/](https://www.irif.fr/~amelie/)
]

---

### Course logistics

- 24 hours = 2 hours `\(\times\)` .stress[12 weeks]: classes + hands-on
- Tuesdays, 10:30 - 12:30

#### About the hands-on

- Hands-on sessions and homework use .stress[`Jupyter` notebooks]
- Using a `Docker` image built for the course
- Hands-on work must be carried out on your .stress[own laptop]. Bring it to **every class**

---

### Course logistics

- The .stress[
webpage] of the course is:

.center[[https://stephane-v-boucheron.fr/courses/grosses-data/](https://stephane-v-boucheron.fr/courses/grosses-data/)]

- .stress[
Bookmark it]!

- Follow .stress[carefully] the steps described in the `tools` page:

.center[[https://stephanegaiffas.github.io/big_data_course/tools](https://stephanegaiffas.github.io/big_data_course/tools)]
- Who knows about `docker`?

.center[<img src="figs/docker.png" style="width: 70%;" />]

---

### Course evaluation

- .stress[Evaluation] using **homework assignments** and a **final project**
- Find a .stress[friend]: all work is done by **pairs of students**
- **All your work** goes in your private repository and nowhere else: .stress[no emails]!
- All your homework will use .stress[`jupyter` notebooks] or .stress[`quarto`] files

---

template: inter-slide

## `Docker`

---

### Why `docker`? What is it?

- Don't mess with your `python` environment and configuration files
- Everything is embedded in a .stress[container] (better than a VM)
- A .stress[container] is an **instance** of an .stress[image]
- Same image = same environment for everybody
- Same image = no {version, dependencies, install} problems
- It is an .stress[industry standard], used everywhere now!

.pull-left[
<img src="figs/containers.png" style="width: 70%;" />
]
.pull-right[
<img src="figs/python_environment.png" style="width: 75%;" />
]

---

### `docker`

- Have a look at

.center[[https://stephanegaiffas.github.io/big_data_course/tools](https://stephanegaiffas.github.io/big_data_course/tools)]

- We will walk through the `Dockerfile` to explain how the image is built
- and do a quick demo of how to use the `docker` image (a sketch follows)
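---

### `docker`: typical usage

In practice, running the course environment boils down to two commands. A minimal sketch, assuming a hypothetical image name `bigdatacourse/notebook` (use the actual name from the `tools` page) and a `jupyter-docker-stacks`-style image where notebooks live under `/home/jovyan/work`:

```bash
# Fetch the course image (hypothetical name, see the tools page)
docker pull bigdatacourse/notebook:latest

# Start a container: expose Jupyter's default port and mount the current
# directory so your notebooks survive when the container is removed
docker run --rm -it \
  -p 8888:8888 \
  -v "$PWD":/home/jovyan/work \
  bigdatacourse/notebook:latest
```

Then open `http://localhost:8888` in a browser: the same image gives everybody the same environment.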
<br>

#### And that's it for the logistics!

---

class: center, middle, inverse

## Big data

---

### Big data

- .stress[Moore's Law]: computing power **doubled** every two years from 1975 to 2012
- Nowadays, doubling takes closer to **two and a half years**
- .stress[Rapid growth of datasets]: **internet activity**, social networks, genomics, physics, sensor networks, etc.
- .stress[Data size trends]: **doubles every year** according to [IDC executive summary](https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm)
- .stress[Now, data grows faster than Moore's law]

### Question

- How do we **scale** to **process it** and to **learn from it**?

---

### Let's recall some units

#### bit

A bit is a value of either a 1 or 0 (on or off)

#### byte (B)

A byte is 8 bits

- 1 character, e.g., "a", is one byte

#### Kilobyte (KB)

A kilobyte is **1 024** B

- **2** or **3** paragraphs of text

---

### Let's recall some units

#### Megabyte (MB)

A megabyte is **1 048 576** B or **1 024** KB

- **873** pages of plain text
- **4** books (200 pages or 240 000 characters)

#### Gigabyte (GB)

A gigabyte is **1 073 741 824** B, **1 024** MB or **1 048 576** KB

- **894 784** pages of plain text (1 200 characters)
- **4 473** books (200 pages or 240 000 characters)
- **640** web pages (with 1.6 MB average file size)
- **341** digital pictures (with 3 MB average file size)
- **256** MP3 audio files (with 4 MB average file size)
- **1.5** 650 MB CDs

---

### Let's recall some units

#### Terabyte (TB)

A terabyte is **1 099 511 627 776** B, **1 024** GB or **1 048 576** MB

- **916 259 689** pages of plain text (1 200 characters)
- **4 581 298** books (200 pages or 240 000 characters)
- **655 360** web pages (with 1.6 MB average file size)
- **349 525** digital pictures (with 3 MB average file size)
- **262 144** MP3 audio files (with 4 MB average file size)
- **1 613** 650 MB CDs
- **233** 4.38 GB DVDs
- **40** 25 GB Blu-ray discs

---

### Let's recall some units

#### Petabyte (PB)

A petabyte is **1 024** TB, **1 048 576** GB or **1 073 741 824** MB

- **938 249 922 368** pages of plain text (1 200 characters)
- **4 691 249 611** books (200 pages or 240 000 characters)
- **671 088 640** web pages (with 1.6 MB average file size)
- **357 913 941** digital pictures (with 3 MB average file size)
- **268 435 456** MP3 audio files (with 4 MB average file size)
- **1 651 910** 650 MB CDs
- **239 400** 4.38 GB DVDs
- **41 943** 25 GB Blu-ray discs

#### Exabyte, etc.

- 1 EB = 1 exabyte = 1 024 PB
- 1 ZB = 1 zettabyte = 1 024 EB
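---

### Let's recall some units

#### Sanity check in `Python`

The equivalences above are just powers of **1 024**. A quick sketch of the arithmetic, with the book and CD sizes taken from the previous slides:

```python
# Binary units: every step up is a factor of 1024
KB, MB, GB, TB, PB = 1024, 1024**2, 1024**3, 1024**4, 1024**5

book = 240_000            # one 200-page book, in characters (bytes)
print(GB // book)         # 4473     books per gigabyte
print(TB // book)         # 4581298  books per terabyte
print(PB // (650 * MB))   # 1651910  650 MB CDs per petabyte
```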
---

### Some figures

Every .stress[single second]$^1$, there are:

- At least **8,000 tweets** sent
- **900+ photos** posted on **Instagram**
- **Thousands of Skype calls** made
- Over **70,000 Google searches** performed
- Around **80,000 YouTube videos** viewed
- Over **2 million emails** sent

.footnote[$^1$[https://www.internetlivestats.com](https://www.internetlivestats.com)]

---

### Some figures

There are$^1$:

- .stress[5 billion web pages] as of mid-2019 (indexed web)

and we expect$^2$:

- .stress[4.8 ZB] of annual IP traffic in 2022

Note that

- **1** ZB `\(\approx\)` **36 000** years of HD video
- Netflix's **entire catalog** is `\(\approx\)` **3.5 years** of HD video

.footnote[
`\(^1\)`[https://www.worldwidewebsize.com](https://www.worldwidewebsize.com) <br>
`\(^2\)`Cisco's Visual Networking Index
]

---

### Some figures

More figures:

- **facebook** daily logs: **60TB**
- **1000 genomes** project: **200TB**
- Google web index: **10+ PB**
- Cost of **1TB** of storage: **~$35**
- Time to read **1TB** from disk: **3 hours** at **100MB/s**

### Let's give some .stress[latencies] now

---

### Latency numbers

.f6[.pure-table.pure-table-striped[

| Operation                          | Latency (ns)     | Latency (us) | Latency (ms) | Notes                  |
| :--------------------------------- | ---------------: | ----------: | -----: | :-------------------------- |
| L1 cache reference                 | 0.5 ns           |             |        |                              |
| L2 cache reference                 | 7 ns             |             |        | 14x L1 cache                 |
| Main memory reference              | 100 ns           |             |        | 20x L2, 200x L1              |
| Compress 1K bytes with Zippy       | 3,000 ns         | 3 us        |        |                              |
| Send 1K bytes over 1 Gbps network  | 10,000 ns        | 10 us       |        |                              |
| Read 4K randomly from SSD*         | 150,000 ns       | 150 us      |        | ~1GB/sec SSD                 |
| Read 1 MB sequentially from memory | 250,000 ns       | 250 us      |        |                              |
| Round trip within same datacenter  | 500,000 ns       | 500 us      |        |                              |
| Read 1 MB sequentially from SSD*   | 1,000,000 ns     | 1,000 us    | 1 ms   | ~1GB/sec SSD, 4X memory      |
| Disk seek                          | 10,000,000 ns    | 10,000 us   | 10 ms  | 20x datacenter roundtrip     |
| Read 1 MB sequentially from disk   | 20,000,000 ns    | 20,000 us   | 20 ms  | 80x memory, 20x SSD          |
| Send packet US -> Europe -> US     | 150,000,000 ns   | 150,000 us  | 150 ms | 600x memory                  |

]]

---

### Latency numbers

- Reading 1MB from **disk** = **80X** reading 1MB from **memory**
- Sending a packet from **US to Europe to US** = **1 000 000X** a main memory reference

#### General tendency

True in general, not always:

- memory operations: .stress[fastest]
- disk operations: .stress[slow]
- network operations: .stress[slowest]

---

### Latency numbers

.small[[https://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html](https://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html)]

.center[
<img src="figs/latency_numbers.png" style="width: 100%;" />
]
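---

### Humanized latency numbers

Hard to intuit? The trick on the next slide is to multiply every duration by `\(10^9\)`, i.e. to reread nanoseconds as seconds. A small sketch of that conversion:

```python
# "Humanize" latencies: after a x1e9 scaling, 1 ns reads as 1 s
latencies_ns = {
    "L1 cache reference": 0.5,
    "main memory reference": 100,
    "disk seek": 10_000_000,
}

for op, ns in latencies_ns.items():
    seconds = ns                      # nanoseconds reread as seconds
    print(f"{op}: {seconds:g} s, i.e. {seconds / 86_400:.1f} days")
```

For instance a **disk seek** (10 ms) becomes `\(10^7\)` s, about 16.5 weeks: a semester.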
---

### Humanized latency numbers

Let's multiply all these durations by a billion

.f6[.pure-table.pure-table-striped[

| Operation                          | Latency      | Human duration                                        |
| :--------------------------------- | -----------: | ----------------------------------------------------: |
| L1 cache reference                 | 0.5 s        | One heart beat (0.5 s)                                |
| L2 cache reference                 | 7 s          | Long yawn                                             |
| Main memory reference              | 100 s        | Brushing your teeth                                   |
| Send 2K bytes over 1 Gbps network  | 5.5 hr       | From lunch to end of work day                         |
| SSD random read                    | 1.7 days     | A normal weekend                                      |
| Read 1 MB sequentially from memory | 2.9 days     | A long weekend                                        |
| Round trip within same datacenter  | 5.8 days     | A medium vacation                                     |
| Read 1 MB sequentially from SSD    | 11.6 days    | Waiting for almost 2 weeks for a delivery             |
| Disk seek                          | 16.5 weeks   | A semester in university                              |
| Read 1 MB sequentially from disk   | 7.8 months   | Almost producing a new human being                    |
| Send packet US -> Europe -> US     | 4.8 years    | Average time it takes to complete a bachelor's degree |

]]

---

template: inter-slide

## Challenges

---

### Challenges with big datasets

- Large datasets .stress[don't fit] on a **single** hard drive
- **One** large machine .stress[can't process or store] **all** the data
- For **computations**, how do we .stress[stream data] from the **disk to the different layers of memory**?
- **Concurrent accesses** to the data: a disk .stress[cannot] be **read in parallel**

---

### Solutions

- Combine .stress[several machines] containing **hard drives** and **processors** on a **network**
- Use .stress[commodity hardware]: cheap, common architecture, i.e. **processor** + **RAM** + **disk**
- .stress[Scalability] = **more machines** on the network
- .stress[Partition] the data across the machines (see the toy sketch a few slides ahead)

<!-- .center[ <img src="figs/big-data-tease.jpg" style="width: 35%;" /> ] -->

---

### Challenges

Dealing with distributed computations adds **software complexity**

- .stress[Scheduling]: How to **split the work across machines**? Must exploit and optimize data locality, since moving data is very expensive
- .stress[Reliability]: How to **deal with failure**? Commodity (cheap) hardware fails more often. At Google: 1% to 5% HD failures/year and 0.2% [DIMM](https://en.wikipedia.org/wiki/DIMM) failures/year
- .stress[Uneven performance] of the machines: some nodes are slower than others

---

### Solutions

- .stress[Schedule], **manage** and **coordinate** threads and resources using appropriate software
- .stress[Locks] to **limit** access to resources
- .stress[Replicate] data for **faster reading** and **reliability**

---

### Is it HPC?

- **High Performance Computing** (HPC)
- **Parallel computing**

#### Comments

- For HPC, scaling up means using a .stress[bigger machine]
- Huge performance increase for **medium**-scale problems
- .stress[Very expensive], specialized machines, with lots of processors and memory

#### The answer is no!
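---

### Partitioning, in miniature

Distributed engines industrialize a simple pattern: partition the data, compute on each partition independently, then combine the partial results. A minimal single-machine sketch of that pattern, using processes instead of machines:

```python
# Toy "cluster": each process plays the role of one machine
from multiprocessing import Pool

def partial_sum(chunk):
    # work done independently on one partition of the data
    return sum(chunk)

if __name__ == "__main__":
    data = range(10_000_000)
    n = 8                                          # number of workers
    chunks = [data[i::n] for i in range(n)]        # partition the data
    with Pool(n) as pool:
        total = sum(pool.map(partial_sum, chunks)) # map, then reduce
    print(total)                                   # 49999995000000
```

Real engines add everything the sketch ignores: scheduling, data locality, failures, stragglers.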
---

### The Big Data universe

Many technologies combining .stress[software] and .stress[cloud computing]

.center[
<img src="figs/teasing2.jpg" style="width: 100%;" />
]

---

### The Big Data universe

Often used with/for .stress[Machine Learning] (or AI)

.center[
<img src="figs/teasing3.png" style="width: 90%;" />
]

---

### Tools

- Software such as .stress[`Spark`] or .stress[`HadoopMR`] (Hadoop MapReduce) takes care of these challenges
- They are .stress[distributed compute engines]: software that eases the development of distributed algorithms

They run on .stress[clusters] (several machines on a network), managed by a .stress[resource manager] such as:

- **`Yarn`:** [https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html)
- **`Mesos`:** [http://mesos.apache.org](http://mesos.apache.org)
- **`Kubernetes`:** [https://kubernetes.io](https://kubernetes.io/)

A resource manager ensures that the tasks running on the cluster do not try to use the same resources all at once

---

class: center, middle, inverse

## `Apache Spark`

---

### Apache `Spark`

The course will focus mainly on .stress[`Spark`] for big data processing

.center[
<img src="figs/spark.png" style="width: 35%;" />

[https://spark.apache.org](https://spark.apache.org)
]

- `Spark` is an .stress[industry standard] <br> (cf [https://spark.apache.org/powered-by.html](https://spark.apache.org/powered-by.html))
- One of the most used .stress[big data processing frameworks]
- .stress[Open source]

The predecessor of `Spark` is `Hadoop`

---

### `Hadoop`

- `Hadoop` has a simple API and good fault tolerance (tolerance to nodes failing midway through a processing job)
- The cost is lots of .stress[data shuffling] across the network
- with intermediate computations .stress[written to disk] **over the network**, which we know is .stress[very expensive] time-wise

It is made of three components:

- .stress[`HDFS`] (Hadoop Distributed File System), inspired by the Google File System, see .small[[https://ai.google/research/pubs/pub51](https://ai.google/research/pubs/pub51)]
- .stress[`YARN`] (Yet Another Resource Negotiator)
- .stress[`MapReduce`], inspired by Google's MapReduce <br> .small[[https://research.google.com/archive/mapreduce.html](https://research.google.com/archive/mapreduce.html)]

---

### MapReduce's wordcount example

.center[<img src="figs/WordCountFlow.JPG" width=95%/>]
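---

### MapReduce's wordcount example, in `Spark`

For comparison, the whole map/shuffle/reduce flow above fits in a few lines of `Spark`'s low-level `Python` API (more on it later in the course). A minimal sketch, assuming a local `Spark` installation and a hypothetical plain-text file `words.txt`:

```python
# Word count with PySpark RDDs, mirroring the MapReduce flow above
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("words.txt")                # read: one record per line
      .flatMap(lambda line: line.split())   # map: lines -> words
      .map(lambda word: (word, 1))          # map: word -> (word, 1)
      .reduceByKey(lambda a, b: a + b)      # shuffle + reduce: sum per word
)

print(counts.take(5))
spark.stop()
```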
---

### `Spark`

Advantages of `Spark` over `HadoopMR`?

- .stress[In-memory storage]: use **RAM** for fast iterative computations
- .stress[Lower overhead] for starting jobs
- .stress[Simple and expressive] with `Scala`, `Python`, `R`, `Java` APIs
- .stress[Higher-level libraries] with `SparkSQL`, `SparkStreaming`, etc.

Disadvantages of `Spark` over `HadoopMR`?

- `Spark` requires servers with **more CPU** and **more memory**
- But still much cheaper than HPC

`Spark` is .stress[much faster] than `Hadoop`:

- `Hadoop` uses **disk** and **network**
- `Spark` tries to use **memory** as much as possible for operations, while minimizing network use

---

### `Spark` and `Hadoop` comparison

<br>

.pure-table.pure-table-striped[

| | HadoopMR | Spark |
| -----------------------: | -----------: | ------------------------------: |
| storage | Disk | in-memory or disk |
| operations | Map, reduce | Map, reduce, join, sample, among many others |
| execution model | Batch | Batch, interactive, streaming |
| Programming environments | Java | Scala, Java, Python, R |

]

---

### `Spark` and `Hadoop` comparison

For **logistic regression** training (a simple **classification** algorithm which requires **several passes** over a dataset)

.center[
<img src="figs/spark-dev3.png" width=50%/>
]

<br>

.center[
<img src="figs/logistic-regression.png" width=30%/>
]
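---

### Why in-memory helps here

Iterative algorithms like logistic regression re-read the same data at every pass: `Spark` caches it once in RAM, where `HadoopMR` would write and re-read intermediate results on disk between passes. A toy sketch (the file `points.txt` and the update rule are hypothetical):

```python
# Cache once, then iterate in memory
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("iterative").getOrCreate()

points = (spark.sparkContext.textFile("points.txt")
          .map(lambda line: [float(x) for x in line.split()])
          .cache())                          # materialized in RAM at first use

w = 0.0
for step in range(10):                       # each pass reuses the cached data
    grad = points.map(lambda p: p[0] - w).mean()
    w += 0.1 * grad

spark.stop()
```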
---

### The `Spark` stack

.center[<img src="figs/spark_stack.png" width=85%/>]

---

### The `Spark` stack

.center[<img src="figs/spark-env-source.png" width=95%/>]

---

### `Spark` can run "everywhere"

.center[<img src="figs/spark-runs-everywhere.png" width=55%/>]

---

template: inter-slide

## Agenda, tools and references

---

### Very tentative agenda for the course

**Weeks 1, 2 and 3** <br> The .stress[`Python` data-science stack] for **medium-scale** problems

**Weeks 4 and 5** <br> Introduction to .stress[`spark`] and its .stress[low-level API]

**Weeks 6, 7 and 8** <br> `Spark`'s high-level API: .stress[`spark.sql`]. Data from different formats and sources

**Week 9** <br> Run a job on a cluster with .stress[`spark-submit`], monitoring, mistakes and debugging

**Weeks 10, 11, 12** <br> Introduction to .stress[`spark-streaming`] and a glimpse of other big data technologies (Dask)

---

### Main tools for the course (tentative...)

#### Infrastructure

.center[
<img src="figs/docker.png" width=25%/> <img src="" width=10%/>
]

#### Python stack

.center[
<img src="figs/python.png" width=20%/> <img src="" width=5%/> <img src="figs/numpy.jpg" width=18%/> <img src="" width=5%/> <img src="figs/pandas.png" width=28%/> <img src="" width=5%/> <img src="figs/jupyter_logo.png" width=7%/>
]

#### Data Visualization

.center[
<img src="figs/matplotlib.png" width=20%/> <img src="" width=5%/> <img src="figs/seaborn.png" width=20%/> <img src="" width=5%/> <img src="figs/bokeh.png" width=20%/> <img src="" width=5%/> <img src="figs/plotly-logo.png" width=20%/>
]

---

### Main tools for the course (tentative...)

#### Big data processing

.center[
<img src="figs/spark.png" width=20%/> <img src="" width=10%/> <img src="figs/pyspark.jpg" width=20%/> <img src="" width=10%/> <img src="figs/dask.png" width=10%/>
]

#### Data storage / formats / querying

.center[
<img src="figs/sql.jpg" width=20%/> <img src="" width=5%/> <img src="figs/orc.png" width=20%/> <img src="" width=5%/> <img src="figs/parquet.png" width=30%/> <img src="figs/json.png" width=20%/> <img src="" width=15%/> <img src="figs/hdfs.png" width=25%/>
]

---

### Learning resources

- .stress[Spark Documentation Website] <br> .small[[http://spark.apache.org/docs/latest/](http://spark.apache.org/docs/latest/)]
- .stress[API docs] <br> .small[[http://spark.apache.org/docs/latest/api/scala/index.html](http://spark.apache.org/docs/latest/api/scala/index.html)] <br> .small[[http://spark.apache.org/docs/latest/api/python/](http://spark.apache.org/docs/latest/api/python/)]
- .stress[`Databricks` learning notebooks] <br> .small[[https://databricks.com/resources](https://databricks.com/resources)]
- .stress[StackOverflow] <br> .small[[https://stackoverflow.com/tags/apache-spark](https://stackoverflow.com/tags/apache-spark)] <br> .small[[https://stackoverflow.com/tags/pyspark](https://stackoverflow.com/tags/pyspark)]
- .stress[More advanced] <br> .small[[http://books.japila.pl/apache-spark-internals/](http://books.japila.pl/apache-spark-internals/)]
- .stress[Misc.] <br> .small[[Next Generation Databases: NoSQL and Big Data by Guy Harrison](https://link.springer.com/book/10.1007/978-1-4842-1329-2)] <br> .small[[Data Pipelines Pocket Reference by J. Densmore](https://www.oreilly.com/library/view/data-pipelines-pocket/9781492087823/)]

---

### Learning resources

.pull-left-80[
- .stress[Book]: **"Spark: The Definitive Guide"** .small[[http://shop.oreilly.com/product/0636920034957.do](http://shop.oreilly.com/product/0636920034957.do)] <br> .tiny[[https://github.com/databricks/Spark-The-Definitive-Guide](https://github.com/databricks/Spark-The-Definitive-Guide)]
]
.pull-right-20[
<img src="figs/spark_book.gif" style="height: 160px;" />
]

<img src="" style="height: 200px;" />

And the **most important thing is:**

.pull-left[
.stress[.large[Practice!]]
]
.pull-right[
<img src="figs/wtf.jpg" style="height: 200px;" />
]

---

template: inter-slide

# Data centers

---

### Data centers

Wonder what a .stress[datacenter looks like]?

- Have a look at [http://www.google.com/about/datacenters](http://www.google.com/about/datacenters)

---

### Data centers

Wonder what a .stress[datacenter looks like]?

.center[<img src="figs/datacenter2.jpg" width=80%/>]

---

### Data centers

Wonder what a .stress[datacenter looks like]?

<br>

.center[
<iframe width="672" height="378" src="https://www.youtube.com/embed/avP5d16wEp0" frameborder="0" allowfullscreen>
</iframe>
]

---

class: center, middle, inverse

# Thank you !