name: inter-slide
class: left, middle, inverse

{{ content }}

---
name: layout-general
layout: true
class: left, middle

<style>
.remark-slide-number {
  position: inherit;
}

.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 4px;
  display: block;
  left: 0;
  right: 0;
}

.remark-slide-number .progress-bar {
  height: 100%;
  background-color: red;
}

/* custom.css */
.plot-callout {
  width: 300px;
  bottom: 5%;
  right: 5%;
  position: absolute;
  padding: 0px;
  z-index: 100;
}

.plot-callout img {
  width: 100%;
  border: 1px solid #23373B;
}
</style>
---
class: middle, left, inverse

# Technologies Big Data : Python Data Science Stack

### 2024-01-23

#### [Master I MIDS Master I Informatique]()

#### [Technologies Big Data](http://stephane-v-boucheron.fr/courses/isidata/)

#### [Amélie Gheerbrandt, Stéphane Gaïffas, Stéphane Boucheron, Vlady Ravelomanana](http://stephane-v-boucheron.fr)

---

### What is `Python`?

.center[<img src="figs/python.png" width=40%/>]

- first released in 1991
- designed by Guido van Rossum (BDFL)
- multi-purpose
- easy to read
- easy to learn
- object-oriented
- strongly and dynamically typed
- cross-platform

---

### Features of `Python`

- High-level data types (`tuples`, `dict`, `list`, `set`, etc.)
- Standard libraries with batteries included
- String services, regular expressions
- Libraries for scientific computing
- Easy and efficient I/O, many file formats
- OS, threading, multiprocessing
- Networking, email, html, webserver, scraping
- Can be extended with `C/C++` and easily accelerated (`cython`, `numba`, `pypy`)
- Tons of external libraries

---

### Features of `Python`

.center[
<img src="figs/python_antigravity.png" style="width:55%;" />
]

---

### The [`stackoverflow` 2022 survey](https://survey.stackoverflow.co/2022/)

.center[
<img src="figs/stackoverflow-survey.png" style="width:50%;" />
]

---

# `Python` popularity growth

.center[
<img src="figs/python_growth_major_languages.png" style="width:75%;" />
]

---

# `Python` popularity growth

.center[
<img src="figs/python_growth_smaller_languages.png" style="width:75%;" />
]

---

# Why `Python` for data science?

Besides these features, `Python` has:

- large communities for data science, analytics, etc.
- many well-established libraries
- lots of examples and documentation
- **huge** demand from the industry

---

# The `Python` Data Science Stack

### Maths / Science

.center[
<img src="figs/numpy.jpg" width=28%/>
<img src="" width=10%/>
<img src="figs/scipy.png" width=28%/>
]

---

# The `Python` Data Science Stack

### Maths / Science

.center[
<img src="figs/numpy.jpg" width=28%/>
<img src="" width=10%/>
<img src="" width=28%/>
]

- `numpy` is all about **multi-dimensional arrays** and **matrices**
- High-level mathematical computation such as **linear algebra** in `numpy.linalg` and **random number generation** in `numpy.random`
- **Fast** but not optimized for multi-threaded architectures
- And not for **distributed** multi-machine settings

---

# The `Python` Data Science Stack

### Maths / Science

.center[
<img src="" width=28%/>
<img src="" width=10%/>
<img src="figs/scipy.png" width=28%/>
]

- `scipy` extends `numpy` with extra modules
- Mainly optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing
- And very useful sparse matrix formats in `scipy.sparse`

---

# The `Python` Data Science Stack

### Data processing

.center[
<img src="figs/pandas.png" width=40%/>
<img src="" width=5%/>
<img src="figs/dask.png" width=10%/>
<img src="" width=5%/>
<img src="figs/pyspark.jpg" width=20%/>
]

---

# The `Python` Data Science Stack

### Data processing

.center[
<img src="figs/pandas.png" width=40%/>
<img src="" width=5%/>
<img src="" width=10%/>
<img src="" width=5%/>
<img src="" width=20%/>
]

- `pandas` builds upon `numpy` to provide a high-performance, easy-to-use `DataFrame` object, with high-level data processing
- Easy I/O with most data formats: `csv`, `json`, `hdf5`, `feather`, `parquet`, etc.
- `SQL` semantics: `groupby`, `agg`, `select`, `where`, etc.
- Some data visualization tools
- Very large **general-purpose library for data processing**, not distributed, **medium-scale** data only

---

# The `Python` Data Science Stack

### Data processing

.center[
<img src="" width=40%/>
<img src="" width=5%/>
<img src="figs/dask.png" width=10%/>
<img src="" width=5%/>
<img src="" width=20%/>
]

- `dask` is roughly a **distributed** and **parallel** `pandas`
- Same API as `pandas`!
- Task scheduling, lazy evaluation, distributed dataframes
- Still young and **far behind** `spark`, but can be useful
- Easier than `spark`, full `Python` (no `JVM`)

---

# The `Python` Data Science Stack

### Data processing

.center[
<img src="" width=40%/>
<img src="" width=5%/>
<img src="" width=10%/>
<img src="" width=5%/>
<img src="figs/pyspark.jpg" width=20%/>
]

- `pyspark` is the `python` API to `spark`, a big data processing framework
- We will use it **a lot** in this course
- The native API of `spark` is `scala`: `pyspark` can be **slower** (much slower if you are not careful)

---

# The `Python` Data Science Stack

### Data Visualization

.center[
<img src="figs/matplotlib.png" width=25%/>
<img src="" width=10%/>
<img src="figs/seaborn.png" width=20%/>
<img src="" width=10%/>
<img src="figs/bokeh.png" width=20%/>
]

---

# The `Python` Data Science Stack

### Data Visualization

.center[
<img src="figs/matplotlib.png" width=25%/>
<img src="" width=10%/>
<img src="" width=20%/>
<img src="" width=10%/>
<img src="" width=20%/>
]

- `matplotlib` provides **2D plotting capabilities**
- **Very large** and **highly customizable** library
- The historical plotting library, somewhat **low-level** for data-oriented plots

---

# The `Python` Data Science Stack

### Data Visualization

.center[
<img src="" width=25%/>
<img src="" width=10%/>
<img src="figs/seaborn.png" width=20%/>
<img src="" width=10%/>
<img src="" width=20%/>
]

- `seaborn`: a **higher-level** plotting library built on top of `matplotlib`
- To be used **with `pandas` dataframes** as data source
- 
Higher-level plotting possibilities
- Usually better-looking plots with good default parameters

---

# The `Python` Data Science Stack

### Data Visualization

.center[
<img src="" width=25%/>
<img src="" width=10%/>
<img src="" width=20%/>
<img src="" width=10%/>
<img src="figs/bokeh.png" width=20%/>
]

- `bokeh`: an **interactive visualization library** for web browsers, rendered by its own `javascript` library (BokehJS), in the style of [`d3.js`](https://d3js.org)
- With a clean and simple `python` interface, can be used in a `jupyter` notebook
- Interactions enabled by default (zoom, etc.) and fast rendering
- Very good-looking plots with good default parameters

[there is also `plotly`...]

---

# The `Python` Data Science Stack

### Interfaces

<img src="figs/python.png" width=35%/>
<img src="" width=10%/>
<img src="figs/ipython.jpg" width=20%/>
<img src="" width=10%/>
<img src="figs/jupyter_logo.png" width=12%/>

---

# The `Python` Data Science Stack

### Interfaces

<img src="figs/python.png" width=35%/>
<img src="" width=10%/>
<img src="figs/ipython.jpg" width=20%/>
<img src="" width=10%/>
<img src="" width=12%/>

Ways to use all these tools:

- Write a script `script.py` and run it with `python` directly from the command line: `python script.py`
- Use the `ipython` interactive shell

---

# The `Python` Data Science Stack

### Interfaces

<img src="" width=35%/>
<img src="" width=10%/>
<img src="" width=20%/>
<img src="" width=10%/>
<img src="figs/jupyter_logo.png" width=12%/>

- Use `jupyter`: a web application that lets you create and run documents called **notebooks** (with the `.ipynb` extension)
- Notebooks can contain code, equations, visualizations, text, etc. We will **use these a lot** in the course.
- Each notebook has a `kernel` running a `python` process
- A **problem**: an `.ipynb` file is a `json` document, which leads to noisy code diffs and makes `git` versioning awkward

---

# But also...
Many libraries for statistics, machine learning and deep learning

### Statistics

- `statlearn`, `statsmodels`

### Machine learning

- `scikit-learn`, `xgboost`, `lightgbm`

### Deep learning

- `tensorflow`, `pytorch`

### Getting faster

- `numba`, `cython`, `dask`

---

# But also...

- `Python` APIs for most databases and clouds
- Processing and plotting tools for geospatial data
- Image processing
- Web development, web scraping

among many many many other things...

---
class: center, middle, inverse

# Thank you!
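---

# Appendix: a first taste of the stack

A minimal sketch of the `numpy` and `pandas` features mentioned earlier (`numpy.linalg`, `numpy.random`, `groupby` / `agg`). The matrix, the seed and the toy `city` / `sales` table are invented for illustration only.

```python
import numpy as np
import pandas as pd

# numpy: multi-dimensional arrays and linear algebra (numpy.linalg)
A = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(A, b)  # solves the linear system A @ x = b

# numpy.random: reproducible random number generation
rng = np.random.default_rng(seed=0)
noise = rng.normal(size=4)  # four standard normal draws

# pandas: a DataFrame with SQL-like groupby / agg
# (hypothetical toy data, not taken from the course)
df = pd.DataFrame({
    "city": ["Paris", "Paris", "Lyon", "Lyon"],
    "sales": [10, 20, 5, 15],
})
totals = df.groupby("city")["sales"].agg("sum")  # total sales per city

print(x)
print(totals)
```

Save it as `script.py` and run `python script.py`, or paste it into `ipython` or a notebook cell.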