name: inter-slide
class: left, middle, inverse

{{ content }}

---
name: layout-general
layout: true
class: left, middle

<style>
.remark-slide-number {
  position: inherit;
}

.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 4px;
  display: block;
  left: 0;
  right: 0;
}

.remark-slide-number .progress-bar {
  height: 100%;
  background-color: red;
}

/* custom.css */

.plot-callout {
  width: 300px;
  bottom: 5%;
  right: 5%;
  position: absolute;
  padding: 0px;
  z-index: 100;
}

.plot-callout img {
  width: 100%;
  border: 1px solid #23373B;
}
</style>
---
class: middle, left, inverse

# Technologies Big Data : File formats for Big Data

### 2023-03-29

#### [Master I MIDS & Master I Informatique]()

#### [Technologies Big Data](http://stephane-v-boucheron.fr/courses/isidata/)

#### [Amélie Gheerbrant, Stéphane Gaïffas, Stéphane Boucheron](http://stephane-v-boucheron.fr)

---
template: inter-slide

## File formats

---
### File formats

- You will need to choose the .stress[right format] for your data
- The right format typically .stress[depends on the use-case]

#### Why different file formats?

- A **huge bottleneck** for big data applications is the .stress[time spent finding data] in a particular location and the .stress[time spent writing it] back to another location
- Things get even more complicated with **large datasets** with .stress[evolving schemas] or .stress[storage constraints]
- Several `Hadoop` file formats **evolved to ease these issues** across a number of use cases

---
### File formats

Choosing an appropriate file format can have the following benefits:

- Faster **read times**
- Faster **write times**
- **Splittable** files
- **Schema evolution** support (schema changes over time)
- **Advanced compression** support

Some file formats are designed for .stress[general use]

Others for more .stress[specific use cases]

Some with .stress[specific data characteristics] in mind

???

What is the meaning of *Schema*? May depend on format.

---
template: inter-slide

## Main file formats for big data

---
### Main file formats

.center[
<img src="figs/parquet.png" style="width: 30%;" />
<img src="" style="width: 5%;" />
<img src="figs/orc.png" style="width: 23%;" />
<img src="" style="width: 5%;" />
<img src="figs/avro.png" style="width: 25%;" />
]

We will talk about the .stress[core concepts] and .stress[use-cases] of the following widely used data formats:

- `Avro` : [https://avro.apache.org](https://avro.apache.org)
- `ORC` : [https://orc.apache.org](https://orc.apache.org)
- `Parquet` : [https://parquet.apache.org](https://parquet.apache.org)

---
### About `Avro`

.center[<img src="figs/avro.png" style="width: 35%;" />]

- `Avro` is a .stress[row-based data format and data serialization system] released by the `Hadoop` working group in 2009
- The data schema is stored as `JSON` in the header; the rest of the data is stored in a **binary format** that makes it compact and efficient
- `Avro` is language-neutral and can be used from many languages (for now `C`, `C++`, `C#`, `Java`, `Python`, `R` and `Ruby`)
- One shining point of `Avro` is its .stress[robust support for schema evolution]: missing, added, or changed fields

???

Avro is used in streaming applications

---
### About `Avro`

- `Avro` provides rich data structures: you can create a record that contains an array, an enumerated type, and a sub-record

Ideal candidate to .stress[store data in a data lake] since:

1. Data is usually **read as a whole** in a data lake, for further processing by downstream systems
2. Downstream systems can **retrieve schemas easily from the files** (no need to store the schemas separately)
3. Any source .stress[schema change is easily handled]

???

- data lake
- data warehouse
- database

Spot the differences

---
### About `Avro`

.center[<img src="figs/avro-file.png" style="width: 100%;" />]
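---
### About `Avro`: a quick sketch in code

A minimal writing/reading sketch in `Python`, assuming the third-party [`fastavro`](https://fastavro.readthedocs.io) package is installed; the record type, field names, and file name are made up for illustration

```python
# A minimal sketch: write then read an Avro file with fastavro.
# The schema travels inside the file header; names are illustrative.
from fastavro import parse_schema, reader, writer

schema = parse_schema({
    "type": "record",
    "name": "Sale",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "product", "type": "string"},
    ],
})

records = [
    {"id": 1, "name": "name1", "product": "product1"},
    {"id": 2, "name": "name2", "product": "product2"},
]

# The JSON schema goes in the header, the records in compact binary blocks
with open("sales.avro", "wb") as out:
    writer(out, schema, records)

# The reader recovers the schema from the header: no external schema needed
with open("sales.avro", "rb") as fo:
    for record in reader(fo):
        print(record)
```

Note that no schema is passed when reading: it is retrieved from the file itself, which is what makes `Avro` files self-describing for downstream systems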
---
### About `Parquet`

.center[<img src="figs/parquet.png" style="width: 40%;" />]

- `Parquet` is an open-source file format for `Hadoop` created by `Cloudera` and `Twitter` in 2013
- It stores **nested data structures** in a .stress[flat columnar format]
- Compared to traditional **row-oriented approaches**, `Parquet` is .stress[more efficient in terms of storage and performance]
- It is especially good for queries that need to .stress[read a small subset of columns] from a data file with many columns: .stress[only the required columns are read] (optimized I/O)

???

- meaning of nested data structure

---
### Row-wise vs columnar storage format

If you have a dataframe like this

```
+----+-------+----------+
| ID | Name  | Product  |
+----+-------+----------+
| 1  | name1 | product1 |
| 2  | name2 | product2 |
| 3  | name3 | product3 |
+----+-------+----------+
```

in the **row-wise** storage format, .stress[records are contiguous] in the file:

```
1 name1 product1 2 name2 product2 3 name3 product3
```

while in the **columnar storage** format, .stress[columns are stored together]:

```
1 2 3 name1 name2 name3 product1 product2 product3
```

---
### About `Parquet`

- This makes **columnar storage** more efficient when .stress[querying a few columns] from the table
- There is no need to read whole records, only the .stress[required columns]
- A unique feature of `Parquet` is that even .stress[nested fields] can be read individually, without reading all the other fields
- `Parquet` uses the **record shredding and assembly algorithm** to store nested structures in a columnar fashion

???

Examples of nested fields

---
### About `Parquet`

.center[<img src="figs/parquet-format.gif" style="width: 80%;" />]

---
### About `Parquet`

The main entities in a `Parquet` file are the following:

- **Row group**: a horizontal partitioning of the data into rows. A row group consists of a column chunk for each column in the dataset
- **Column chunk**: a chunk of the data for a particular column. Column chunks live in a particular row group and are guaranteed to be contiguous in the file
- **Page**: column chunks are divided up into pages written back to back. The pages share a common header, and readers can skip the pages they are not interested in

---
### About `Parquet`

.center[<img src="figs/parquet-dive.png" style="width: 80%;" />]

---
### About `Parquet`

- The header just contains a 4-byte magic number "PAR1" that identifies the file as a `Parquet` format file

The footer contains:

- the **file metadata**: the start locations of all the column metadata. Readers first read the file metadata to find the column chunks they need; the column chunks are then read sequentially. The file metadata also includes the format version, the schema, and any extra key-value pairs
- the **length** of the file metadata (4 bytes)
- the **magic number** "PAR1" (4 bytes)
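---
### About `Parquet`: a quick sketch in code

A minimal sketch in `Python`, assuming the third-party [`pyarrow`](https://arrow.apache.org/docs/python/) package; the file and column names are illustrative

```python
# Write a small table to Parquet, then exploit the columnar layout.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "ID": [1, 2, 3],
    "Name": ["name1", "name2", "name3"],
    "Product": ["product1", "product2", "product3"],
})
pq.write_table(table, "sales.parquet")

# Only the column chunks of "Name" are read: optimized I/O
names = pq.read_table("sales.parquet", columns=["Name"])

# The footer's file metadata describes row groups and column chunks
meta = pq.ParquetFile("sales.parquet").metadata
print(meta.num_row_groups)
print(meta.row_group(0).column(0))
```

The last two lines peek at the footer metadata discussed above: the number of row groups, and the metadata of the first column chunk of the first row group (offsets, encodings, statistics)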
---
### About `ORC`

.center[<img src="figs/orc.png" style="width: 35%;" />]

- `ORC` stands for the .stress[Optimized Row Columnar] file format. It was created by Hortonworks in 2013 in order to speed up `Hive`
- The `ORC` file format provides a .stress[highly efficient way to store data]
- It is a .stress[row-columnar data format] highly optimized for reading, writing, and processing data in `Hive`
- It stores data in a .stress[compact way] and enables .stress[quickly skipping irrelevant parts]

???

A few words about `Hive`

[Hive official site](https://hive.apache.org)

---
### About `ORC`

- `ORC` stores .stress[collections of rows in one file]. Within the collection, row data is stored in a .stress[columnar format]
- An `ORC` file contains **groups of row data** called .stress[stripes], along with auxiliary information in a file footer. At the end of the file, a postscript holds the compression parameters and the size of the compressed footer
- The default stripe size is 250 MB. **Large stripe** sizes enable .stress[large, efficient reads from HDFS]
- The **file footer** contains a list of the stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates: count, min, max, and sum

---
### About `ORC`

- **Index data** includes min and max values for each column and the row positions within each column
- **`ORC` indexes** are used only for the selection of stripes and row groups, not for answering queries

.center[<img src="figs/orc-file-structure.png" style="width: 50%;" />]

---
### About `ORC`

The `ORC` file format has many advantages, such as:

- `Hive` type support, including `DateTime`, `decimal`, and the complex types (`struct`, `list`, `map` and `union`)
- **Concurrent reads** of the same file
- The ability to split files **without scanning for markers**
- An **upper bound on heap memory allocation** can be estimated from the information in the file footer

---
template: inter-slide

## Comparison between formats

---
### `Avro` versus `Parquet`

- `Avro` is a .stress[row-based] storage format, whereas `Parquet` is a .stress[columnar] storage format
- `Parquet` is much better for .stress[analytical querying]: reads and queries are much more efficient than writes
- **Write operations** in `Avro` are better than in `Parquet`
- `Avro` is more mature than `Parquet` for .stress[schema evolution]: `Parquet` only supports **schema appends**, while `Avro` supports more, such as **adding or modifying columns**
- `Parquet` is ideal for .stress[querying a subset of columns] in a multi-column table. `Avro` is ideal for **operations where all the columns are needed** (such as in an ETL workflow)

---
### `ORC` vs `Parquet`

- `Parquet` is better at .stress[storing nested data]
- `ORC` is better at .stress[predicate pushdown] (SQL queries on a data file are better optimized: chunks of data can be **skipped** directly while reading)
- `ORC` is more .stress[compression efficient]

---
### In summary...

.center[<img src="figs/file-formats.png" style="width: 80%;" />]

---
template: inter-slide

## How to choose a file format

---
### Read / write intensity & query pattern

- **Row-based** file formats are overall better for storing write-intensive data, because .stress[appending new records is easier]
- If only a **small subset of columns** is queried frequently, .stress[columnar formats will be better], since only the needed columns are accessed and transmitted (whereas row formats need to pull all the columns)

---
### Compression

- Compression is one of the key aspects to consider, since .stress[compression helps reduce the resources] required to store and transmit data
- .stress[Columnar formats are better than row-based formats in terms of compression], because **storing values of the same type together allows more efficient compression**
- In columnar formats, **a different, efficient encoding can be used for each column**
- .stress[`ORC` has the best compression rate of all three], thanks to its **stripes**

A small experiment follows on the next slide
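---
### Compression in practice

A small sketch, again assuming `pyarrow`; the data and the set of codecs are illustrative, and the resulting sizes will vary with the data

```python
# Compare Parquet file sizes under different compression codecs.
# Repetitive values of a single type (typical of a column) compress well.
import os
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"product": ["product1", "product2"] * 500_000})

for codec in ["none", "snappy", "gzip", "zstd"]:
    path = f"sales_{codec}.parquet"
    pq.write_table(table, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")
```

Codecs trade compression ratio against (de)compression speed: measuring on your own data is the only reliable guide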
---
### Schema Evolution

- One challenge in big data is the .stress[frequent change of data schema]: e.g. **adding/dropping columns** and changing column names
- If you know that the **schema of the data will change** several times, .stress[the best choice is `Avro`]
- The `Avro` data schema is in JSON, and `Avro` is able to keep data compact even when many different schemas exist

---
### Nested Columns

- If you have a lot of .stress[complex nested columns] in your dataset and often query only a **subset of columns or subcolumns**, .stress[`Parquet` is the best choice]
- `Parquet` allows you to .stress[access and retrieve subcolumns without pulling the rest] of the nested column

---
### Framework support

- You have to .stress[consider the framework] you are using when choosing a data format
- Data formats **perform differently** depending on where they are used
- `ORC` works best with `Hive` (it was made for it)
- `Spark` provides great support for processing the `Parquet` format
- `Avro` is often a good choice for `Kafka`

But... you can .stress[use and try all formats with any framework], as the sketch on the next slide shows
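---
### Framework support in practice

A sketch of writing and reading all three formats from `PySpark`; the paths are illustrative, and `Avro` support assumes the external `spark-avro` package is on the classpath (e.g. started with `--packages org.apache.spark:spark-avro_2.12:3.3.2`)

```python
# Write one DataFrame in the three formats, then read back with column pruning.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("formats-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "name1", "product1"), (2, "name2", "product2")],
    ["ID", "Name", "Product"],
)

# Parquet and ORC support is built into Spark
df.write.mode("overwrite").parquet("/tmp/sales_parquet")
df.write.mode("overwrite").orc("/tmp/sales_orc")

# Avro needs the external spark-avro package
df.write.mode("overwrite").format("avro").save("/tmp/sales_avro")

# Column pruning: only the Name column chunks are read from Parquet
spark.read.parquet("/tmp/sales_parquet").select("Name").show()
```

---
template: inter-slide

## Thank you !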