Master I MIDS & Informatique
Université Paris Cité
2024-02-19
You will need to choose the right format for your data
The right format typically depends on the use-case
A huge bottleneck for big data applications is the time spent finding data in a particular location and the time spent writing it back to another location
This gets even more complicated with large datasets, evolving schemas, or storage constraints
Several Hadoop file formats have evolved to ease these issues across a number of use cases
Choosing an appropriate file format has the following potential benefits
Faster reads or faster writes
Splittable files
Schema evolution support (schema changes over time)
Advanced compression support
Some file formats are designed for general use
Others for more specific use cases
Some with specific data characteristics in mind
We shall talk about the core concepts and use-cases for the following popular data formats:
Avro: https://avro.apache.org
Parquet: https://parquet.apache.org
ORC: https://orc.apache.org
Avro
Avro is a row-based data format and data serialization system released by the Hadoop working group in 2009
Data schema is stored as JSON in the header. The rest of the data is stored in a binary format to make it compact and efficient
Avro is language-neutral and can be used from many languages (for now C, C++, ..., Python, and R)
One shining point of Avro: robust support for schema evolution
Avro: rationale
Avro provides rich data structures: one can create a record that contains an array, an enumerated type, and a sub-record
Avro is an ideal candidate to store data in a data lake since:
Data is usually read as a whole in a data lake for further processing by downstream systems
Downstream systems can retrieve schemas easily from files (no need to store the schemas separately).
Any source schema change is easily handled
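As an illustration of these points, here is a minimal sketch of writing and reading an Avro file with the fastavro Python library (the schema, records, and file name are made up for the example):

from fastavro import writer, reader, parse_schema

# Illustrative schema; in Avro the schema is declared as JSON
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
    ],
})

records = [{"id": 1, "name": "name1"}, {"id": 2, "name": "name2"}]

# The JSON schema is written into the file header; records follow in compact binary form
with open("users.avro", "wb") as out:
    writer(out, schema, records)

# A downstream system recovers both the schema and the records from the file itself
with open("users.avro", "rb") as f:
    avro_reader = reader(f)
    print(avro_reader.writer_schema)  # schema embedded in the header
    for record in avro_reader:
        print(record)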
Avro: organization
Parquet
Parquet is an open-source file format for Hadoop created by Cloudera and Twitter in 2013
It stores nested data structures in a flat columnar format.
Compared to traditional row-oriented approaches, Parquet is more efficient in terms of storage and performance
It is especially good for queries that need to read a small subset of columns from a data file with many columns: only the required columns are read (optimized I/O)
If you have a dataframe like this
+----+-------+----------+
| ID | Name | Product |
+----+-------+----------+
| 1 | name1 | product1 |
| 2 | name2 | product2 |
| 3 | name3 | product3 |
+----+-------+----------+
In a row-wise storage format, records are stored contiguously in the file:
While in the columnar storage format, columns are stored together:
Parquet: organization
This makes columnar storage more efficient when querying a few columns from the table
No need to read whole records, but only the required columns
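For instance, with the pyarrow library (column and file names made up for the example), reading a column subset only touches the corresponding column chunks:

import pyarrow as pa
import pyarrow.parquet as pq

# Write a small illustrative table to Parquet
table = pa.table({"ID": [1, 2, 3],
                  "Name": ["name1", "name2", "name3"],
                  "Product": ["product1", "product2", "product3"]})
pq.write_table(table, "sales.parquet")

# Only the 'ID' and 'Name' columns are read from disk; 'Product' is never touched
subset = pq.read_table("sales.parquet", columns=["ID", "Name"])
print(subset)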
A unique feature of Parquet is that even nested fields can be read individually without the need to read all the fields
Parquet uses a record shredding and assembly algorithm to store nested structures in a columnar fashion
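A PySpark sketch of this feature (the file path and nested field names are hypothetical; with Spark's nested-schema pruning enabled, only the requested subfield is read):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nested-parquet").getOrCreate()

# Hypothetical Parquet file with a nested 'user' struct
df = spark.read.parquet("/data/events.parquet")

# Only the nested field 'user.address.city' is requested; with nested-schema
# pruning, the other fields of the 'user' struct are not read from the file
cities = df.select("user.address.city")
cities.show()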
Parquet: organization (continued)
Parquet: organization (lexicon)
The main entities in a Parquet file are the following:
Row group: a horizontal partition of the data, i.e. a subset of the rows
Column chunk: the data of one column within a given row group
Page: the smallest unit inside a column chunk, where values are encoded and compressed
Parquet: headers and footers
Parquet format file
The footer contains:
format fileThe footer contains:
File metadata: the start locations of all the column metadata. Readers first read the file metadata to find the column chunks they need; column chunks are then read sequentially. It also includes the format version, the schema, and any extra key-value pairs.
Length of the file metadata (4 bytes)
Magic number “PAR1” (4 bytes)
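As a sketch, the footer can be inspected with pyarrow (file name taken from the earlier illustrative example):

import pyarrow.parquet as pq

pf = pq.ParquetFile("sales.parquet")

print(pf.metadata)             # format version, schema, number of row groups, key-value metadata
print(pf.schema_arrow)         # schema as read from the footer
rg = pf.metadata.row_group(0)  # per-row-group metadata
print(rg.column(0))            # column chunk metadata: offsets, statistics, encodings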
ORC
ORC stands for Optimized Row Columnar file format. It was created by Hortonworks in 2013 in order to speed up Hive
The ORC file format provides a highly efficient way to store data
It is a row-columnar data format highly optimized for reading, writing, and processing data in Hive
It stores data in a compact way and enables quickly skipping irrelevant parts
ORC: organization
ORC stores collections of rows in one file. Within each collection, row data is stored in a columnar format
An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer. At the end of the file, a postscript holds compression parameters and the size of the compressed footer
The default stripe size is 250 MB. Large stripe sizes enable large, efficient reads from HDFS
The file footer contains a list of the stripes in the file, the number of rows per stripe, and each column’s data type. It also contains column-level aggregates: count, min, max, and sum
ORC (continued)
Index data include min and max values for each column and the positions of the rows within each column
ORC indexes are used only for the selection of stripes and row groups, not for answering queries
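A minimal PySpark sketch, assuming a local Spark session (paths and column names are made up); the filter can be pushed down to the ORC reader, which uses the stripe and row-group statistics described above to skip irrelevant data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("orc-demo").getOrCreate()

# Write an illustrative DataFrame as ORC
df = spark.range(1_000_000).withColumn("value", col("id") * 2)
df.write.mode("overwrite").orc("/tmp/demo_orc")

# The predicate on 'id' can be pushed down to the ORC reader: stripes and row groups
# whose min/max statistics exclude id > 999000 are skipped instead of being read
result = spark.read.orc("/tmp/demo_orc").filter(col("id") > 999_000)
print(result.count())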
ORC
The ORC file format has many advantages, such as:
Hive type support including DateTime, decimal, and the complex types (struct, list, map, and union)
Concurrent reads of the same file
Ability to split files without scanning for markers
Ability to estimate an upper bound on heap memory allocation based on the information in the file footer
Avro versus Parquet
Avro is a row-based storage format whereas Parquet is a column-based storage format
Parquet is much better for analytical querying, i.e. reads and queries are much more efficient than writes
Write operations in Avro are more efficient than in Parquet
Avro is more mature than Parquet for schema evolution: Parquet supports only schema append, while Avro supports more, such as adding or modifying columns
Parquet is ideal for querying a subset of columns in a multi-column table. Avro is ideal for operations where all the columns are needed (such as in an ETL workflow)
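To make the comparison concrete, here is a sketch writing the same DataFrame in both formats with PySpark (paths are made up; Avro support in Spark requires the external spark-avro package, e.g. passed with --packages org.apache.spark:spark-avro_2.12:<version>):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-vs-parquet").getOrCreate()

df = spark.createDataFrame(
    [(1, "name1", "product1"), (2, "name2", "product2")],
    ["ID", "Name", "Product"],
)

# Row-based: records are written contiguously, good for write-heavy / full-record workloads
df.write.format("avro").save("/tmp/demo_avro")

# Columnar: columns are stored together, good for column-subset analytical queries
df.write.parquet("/tmp/demo_parquet")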
ORC vs Parquet
Parquet is more capable of storing nested data
ORC is more capable of predicate pushdown (SQL queries on a data file are better optimized: chunks of data can be skipped directly while reading)
ORC is more compression-efficient
Row-based file formats are overall better for storing write-intensive data because appending new records is easier
If only a small subset of columns is queried frequently, columnar formats will be better since only those needed columns will be accessed and transmitted (whereas row formats need to pull all the columns)
Compression is one of the key aspects to consider since compression helps reduce the resources required to store and transmit data
Columnar formats are better than row-based formats in terms of compression because storing the same type of values together allows more efficient compression
In columnar formats, a different and efficient encoding is utilized for each column
ORC has the best compression rate of all three, thanks to its stripes
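As a sketch with pyarrow (file names are made up; codec availability depends on how pyarrow was built), the compression codec is chosen at write time and applied column by column:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"ID": [1, 2, 3], "Name": ["name1", "name2", "name3"]})

# Each column is encoded and compressed independently, which is why columnar
# formats compress well: similar values end up next to each other
pq.write_table(table, "data_snappy.parquet", compression="snappy")  # fast, moderate ratio
pq.write_table(table, "data_zstd.parquet", compression="zstd")      # slower, better ratio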
One challenge in big data is the frequent change of data schemas: e.g. adding or dropping columns and changing column names
If you know that the schema of the data will change several times, the best choice is Avro
The Avro data schema is in JSON and Avro is able to keep data compact even when many different schemas exist
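A sketch of schema evolution with the fastavro library (schemas and file names are made up): a file written with the old schema stays readable with a newer schema, provided the added field has a default value:

from fastavro import writer, reader, parse_schema

schema_v1 = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "int"}],
})

schema_v2 = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "email", "type": "string", "default": ""},  # new column with a default
    ],
})

with open("users_v1.avro", "wb") as out:
    writer(out, schema_v1, [{"id": 1}, {"id": 2}])

# Read the old file with the new (reader) schema: missing fields get their default values
with open("users_v1.avro", "rb") as f:
    for record in reader(f, reader_schema=schema_v2):
        print(record)  # {'id': 1, 'email': ''}, {'id': 2, 'email': ''}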
If you have a lot of complex nested columns in your dataset and often only query a subset of columns or subcolumns, Parquet is the best choice
Parquet makes it possible to access and retrieve subcolumns without pulling the rest of the nested column
You have to consider the framework you are using when choosing a data format
Data formats perform differently depending on where they are used
ORC works best with Hive (it was designed for it)
Spark provides great support for processing the Parquet format
Avro is often a good choice for Kafka (streaming applications)
But… you can use and try all formats with any framework