Technologies Big Data

---
name: layout-general
layout: true
class: left, middle

<style type="text/css">
.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 4px;
  display: block;
  left: 0;
  right: 0;
}

.remark-slide-number .progress-bar {
  height: 100%;
  background-color: red;
}

/* custom.css */
.plot-callout {
  width: 300px;
  bottom: 5%;
  right: 5%;
  position: absolute;
  padding: 0px;
  z-index: 100;
}
.plot-callout img {
  width: 100%;
  border: 1px solid #23373B;
}
</style>

<div>
<style type="text/css">.xaringan-extra-logo {
width: 110px;
height: 128px;
z-index: 0;
background-image: url(./figs/UniversiteParisCite_logo_horizontal_couleur_RVB.jpeg);
background-size: contain;
background-repeat: no-repeat;
position: absolute;
top:1em;right:1em;
}
</style>
<script>(function () {
  let tries = 0
  function addLogo () {
    if (typeof slideshow === 'undefined') {
      tries += 1
      if (tries < 10) {
        setTimeout(addLogo, 100)
      }
    } else {
      document.querySelectorAll('.remark-slide-content:not(.hide_logo)')
        .forEach(function (slide) {
          const logo = document.createElement('a')
          logo.classList = 'xaringan-extra-logo'
          logo.href = 'http://master.math.univ-paris-diderot.fr/'
          slide.appendChild(logo)
        })
    }
  }
  document.addEventListener('DOMContentLoaded', addLogo)
})()</script>
</div>

---

# Technologies Big Data : Python Data Science Stack

### 2024-01-23

#### [Master I MIDS Master I Informatique]()

#### [Technologies Big Data](http://stephane-v-boucheron.fr/courses/isidata/)

#### [Amélie Gheerbrandt, Stéphane Gaïffas, Stéphane Boucheron, Vlady Ravelomanana](http://stephane-v-boucheron.fr)

---

### What is `Python`?

- first released in 1991

- designed by Guido van Rossum (BDFL until 2018)

- multi-purpose

- easy to read

- easy to learn

- object-oriented

- strongly and dynamically typed

- cross-platform
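
---

### What is `Python`?

A minimal sketch of what "strongly and dynamically typed" means in practice. The variable names are illustrative only.

```python
# Dynamic typing: a name can be rebound to objects of different types.
value = 42
type_before = type(value).__name__   # 'int'
value = "forty-two"
type_after = type(value).__name__    # 'str'

# Strong typing: no implicit conversion between unrelated types.
try:
    result = 1 + "1"
except TypeError as err:
    result = f"TypeError: {err}"

# An explicit conversion is required instead.
explicit_sum = 1 + int("1")

print(type_before, type_after)
print(result)
print(explicit_sum)
```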

---

### Features of `Python`

- High-level data types (`tuple`, `dict`, `list`, `set`, etc.)

- Standard library with batteries included

- String services, regular expressions

- Libraries for scientific computing

- Easy and efficient I/O, many file formats

- OS, threading, multiprocessing

- Networking, email, html, webserver, scraping

- Can be extended with `C/C++` and easily accelerated (`cython`, `numba`, `pypy`)

- Tons of external libraries
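
---

### Features of `Python`

A short sketch of several of the features listed above, using only the standard library. The sample text is made up for illustration.

```python
import json
import re
from collections import Counter

text = "big data, big models, small budgets"

# list: ordered, mutable sequence of words found by a regular expression
words = re.findall(r"[a-z]+", text)

# set: unique elements
unique_words = set(words)

# Counter (a dict subclass): word frequencies
counts = Counter(words)

# tuple: immutable (word, frequency) record
most_common = counts.most_common(1)[0]

# json: serialise a plain dict to a string and back
payload = json.dumps({"n_words": len(words), "top": most_common[0]})
restored = json.loads(payload)

print(words)
print(most_common)
print(restored)
```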

---

### The [`stackoverflow` 2022 survey](https://survey.stackoverflow.co/2022/)

---

# `Python` popularity growth

---

# Why `Python` for data science?

Besides these features, `Python` has:

- large communities for data science, analytics, etc.

- many well-established libraries

- lots of examples and documentation

- **huge** demand from the industry

---

# The `Python` Data Science Stack

### Maths / Science

.center[
<img src="figs/numpy.jpg" width=28%/>
<img src="" width=10%/>
<img src="figs/scipy.png" width=28%/>
]

---

# The `Python` Data Science Stack

### Maths / Science

- `numpy` is all about **multi-dimensional arrays** and **matrices**.

- High-level mathematical computation such as **linear algebra** in `numpy.linalg` and **random number generation** in `numpy.random`

- **Fast** but not optimized for multi-threaded architectures

- Not designed for **distributed** multi-machine settings
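
---

# The `Python` Data Science Stack

### Maths / Science

A minimal sketch of the points above: arrays, `numpy.linalg` and `numpy.random`. The matrix and seed are arbitrary choices for illustration.

```python
import numpy as np

# A 2-D array (matrix) and a 1-D array (vector)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# Vectorised arithmetic: applied element-wise, without a Python loop
scaled = 2 * A + 1

# Linear algebra: solve A x = b
x = np.linalg.solve(A, b)

# Random number generation with a seeded generator (reproducible)
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))

print(x)                     # solution of the linear system
print(samples.mean(axis=0))  # column means, close to 0
```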

---

# The `Python` Data Science Stack

### Maths / Science

- `scipy` extends `numpy` with extra modules

- Mainly optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing

- And very useful sparse matrix formats in `scipy.sparse`
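
---

# The `Python` Data Science Stack

### Maths / Science

A small sketch of two `scipy` modules mentioned above, `scipy.sparse` and `scipy.optimize`. The matrix and the objective function are made up for illustration.

```python
import numpy as np
from scipy import optimize, sparse

# Sparse matrix in CSR format: only non-zero entries are stored
dense = np.array([[0.0, 0.0, 3.0],
                  [0.0, 0.0, 0.0],
                  [1.0, 0.0, 2.0]])
S = sparse.csr_matrix(dense)
v = np.array([1.0, 1.0, 1.0])
product = S @ v   # sparse matrix-vector product

# Optimization: minimise a smooth convex function of two variables
def objective(p):
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2

res = optimize.minimize(objective, x0=np.zeros(2))

print(S.nnz)      # number of stored non-zero entries
print(product)
print(res.x)      # approximately [1, -2]
```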

---

# The `Python` Data Science Stack

### Data processing

.center[
<img src="figs/pandas.png" width=40%/>
<img src="" width=5%/>
<img src="figs/dask.png" width=10%/>
<img src="" width=5%/>
<img src="figs/pyspark.jpg" width=20%/>
]

---

# The `Python` Data Science Stack

### Data processing

.center[
<img src="figs/pandas.png" width=40%/>
<img src="" width=5%/>
<img src="" width=10%/>
<img src="" width=5%/>
<img src="" width=20%/>
]

- `pandas` builds upon `numpy` to provide a high-performance, easy-to-use `DataFrame` object, with high-level data processing

- Easy I/O with most data formats: `csv`, `json`, `hdf5`, `feather`, `parquet`, etc.

- `SQL`-like operations: `groupby`, `agg`, `merge`, `query`, etc.

- Some data visualization tools

- Very large **general-purpose library for data processing**, not distributed, **medium-scale** data only
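
---

# The `Python` Data Science Stack

### Data processing

A minimal sketch of I/O, filtering and aggregation with a `DataFrame`. The data and column names are hypothetical; `pd.read_csv` also accepts a file path.

```python
import io
import pandas as pd

# Small CSV held in memory (made-up data)
csv_text = """city,year,sales
Paris,2022,10
Paris,2023,14
Lyon,2022,7
Lyon,2023,9
"""
df = pd.read_csv(io.StringIO(csv_text))

# WHERE-like filtering with a boolean mask
recent = df[df["year"] == 2023]

# GROUP BY followed by named aggregations
summary = (
    df.groupby("city")
      .agg(total_sales=("sales", "sum"), mean_sales=("sales", "mean"))
      .reset_index()
)

print(recent)
print(summary)
```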

---

# The `Python` Data Science Stack

### Data processing

.center[
<img src="" width=40%/>
<img src="" width=5%/>
<img src="figs/dask.png" width=10%/>
<img src="" width=5%/>
<img src="" width=20%/>
]

- `dask` is roughly a **distributed** and **parallel** `pandas`

- Largely the same API as `pandas`!

- Task scheduling, lazy evaluation, distributed dataframes

- Still young and **far behind** `spark`, but can be useful

- Easier than `spark`, full `Python` (no `JVM`)

---

# The `Python` Data Science Stack

### Data processing

.center[
<img src="" width=40%/>
<img src="" width=5%/>
<img src="" width=10%/>
<img src="" width=5%/>
<img src="figs/pyspark.jpg" width=20%/>
]

- `pyspark` is the `python` API to `spark`, a big data processing framework

- We will use it **a lot** in this course

- The native `spark` API is in `scala`: `pyspark` can be **slower** (much slower if you are not careful)

---

# The `Python` Data Science Stack

### Data Visualization

.center[
<img src="figs/matplotlib.png" width=25%/>
<img src="" width=10%/>
<img src="figs/seaborn.png" width=20%/>
<img src="" width=10%/>
<img src="figs/bokeh.png" width=20%/>
]

---

# The `Python` Data Science Stack

### Data Visualization

.center[
<img src="figs/matplotlib.png" width=25%/>
<img src="" width=10%/>
<img src="" width=20%/>
<img src="" width=10%/>
<img src="" width=20%/>
]

- `matplotlib` provides **2D plotting capabilities**

- **Very large** and **highly customizable** library

- The historical plotting library; somewhat **low-level** for data-oriented plots
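
---

# The `Python` Data Science Stack

### Data Visualization

A short sketch of the explicit, somewhat low-level style of `matplotlib`: each element of the figure is set by hand. The `Agg` backend and the in-memory buffer are assumptions that let the script run without a display; they are not needed in a notebook.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 200)

# Axes, labels, legend and grid are all set explicitly
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), linestyle="--", label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.set_title("Two curves on one Axes")
ax.legend()
ax.grid(True)

# Render to PNG in memory; a file path would work as well
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=100)
```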

---

# The `Python` Data Science Stack

### Data Visualization

.center[
<img src="" width=25%/>
<img src="" width=10%/>
<img src="figs/seaborn.png" width=20%/>
<img src="" width=10%/>
<img src="" width=20%/>
]

- `seaborn` is a **higher-level** plotting library built on top of `matplotlib`

- Designed to be used **with `pandas` dataframes** as the data source

- Higher-level plotting possibilities

- Usually better-looking plots with good default parameters

---

# The `Python` Data Science Stack

### Data Visualization

.center[
<img src="" width=25%/>
<img src="" width=10%/>
<img src="" width=20%/>
<img src="" width=10%/>
<img src="figs/bokeh.png" width=20%/>
]

- `bokeh` is an **interactive visualization library** for web browsers, rendered by its own `javascript` library (`BokehJS`), in the spirit of [`d3.js`](https://d3js.org)

- With a clean and simple `python` interface, can be used in a `jupyter` notebook

- Interactions enabled by default (zoom, etc.) and fast rendering

- Good-looking plots with good default parameters

[there is also `plotly`...]

---

# The `Python` Data Science Stack

### Interfaces

Ways to use all these tools

- Write a script `script.py` and run it with `python` from the command line: `python script.py`

- Use the `ipython` interactive shell
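
---

# The `Python` Data Science Stack

### Interfaces

A minimal sketch of what a `script.py` could contain. The content is hypothetical; it would be run as `python script.py Alice Bob`.

```python
# script.py (hypothetical content)
import sys


def main(argv):
    # argv[0] is the script name; the remaining items are the arguments
    names = argv[1:] or ["world"]
    message = "Hello, " + ", ".join(names) + "!"
    print(message)
    return message


# Only runs when the file is executed, not when it is imported
if __name__ == "__main__":
    main(sys.argv)
```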

---

# The `Python` Data Science Stack

### Interfaces

- Use `jupyter`: a web application that lets you create and run documents called **notebooks** (with the `.ipynb` extension)

- Notebooks can contain code, equations, visualizations, text, etc. We will **use these a lot** in the course.

- Each notebook has a `kernel` running a `python` process

- A **problem**: an `ipynb` file is a `json` document. This leads to poor code diffs, which is a problem for `git` versioning
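
---

# The `Python` Data Science Stack

### Interfaces

A simplified sketch of the `json` structure of a notebook, written by hand for illustration (real files carry more metadata). It shows why diffs are noisy: code, outputs and execution counters live in the same file.

```python
import json

# Hand-written, simplified notebook document (nbformat 4)
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# A title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 1,
            "source": ["print(1 + 1)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
}

serialized = json.dumps(notebook, indent=1)

# Re-running a cell changes outputs and execution_count,
# so the file changes even when the code does not
cells = json.loads(serialized)["cells"]
code_cells = [c for c in cells if c["cell_type"] == "code"]
print(code_cells[0]["source"])
print(code_cells[0]["outputs"])
```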

---

# But also...

Many libraries for statistics, machine learning and deep learning

### Statistics

- `statlearn`, `statsmodels`

### Machine learning

- `scikit-learn`, `xgboost`, `lightgbm`

### Deep learning

- `tensorflow`, `pytorch`

### Getting faster

- `numba`, `cython`, `dask`

---

# But also...

- `Python` APIs for most databases and clouds

- Processing and plotting tools for Geospatial data

- Image processing

- Web development, web scraping

among many many many other things...

---

# Thank you!