name: inter-slide
class: left, middle, inverse

{{ content }}

---
name: layout-general
layout: true
class: left, middle

<style>
.remark-slide-number {
  position: inherit;
}

.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 4px;
  display: block;
  left: 0;
  right: 0;
}

.remark-slide-number .progress-bar {
  height: 100%;
  background-color: red;
}

/* custom.css */
.plot-callout {
  width: 300px;
  bottom: 5%;
  right: 5%;
  position: absolute;
  padding: 0px;
  z-index: 100;
}

.plot-callout img {
  width: 100%;
  border: 1px solid #23373B;
}
</style>

---
class: middle, left, inverse

# Technologies Big Data : JSON

### 2023-03-22

#### [Master I MIDS Master I Informatique]()

#### [Technologies Big Data](http://stephane-v-boucheron.fr/courses/isidata/)

#### [Amélie Gheerbrandt, Stéphane Gaïffas, Stéphane Boucheron](http://stephane-v-boucheron.fr)

---
template: inter-slide

## JSON data format

---

### What is `JSON`?

- JavaScript Object Notation (JSON) is a .stress[lightweight data-interchange format] based on the syntax of JavaScript objects

- It is a text-based, human-readable, language-independent format for **representing structured object data** for **easy transmission** or **saving**

- `JSON` objects can also be **stored in files**, typically a text file with a `.json` extension

- `JSON` is used for **two-way data transmission** between a web server and a client, but it is also often used as a .stress[semi-structured data format]

- Its syntax **closely resembles JavaScript** objects, but `JSON` can be used independently of `JavaScript`

---

### What is `JSON`?

- Most languages have libraries to manipulate `JSON`

- We'll use `JSON` data in `python` using the `json` module from the standard library

### Some terminology

- `JSON` exists as a **string** or a **sequence** (or series) of **bytes**

- .stress[Serialization]: converting an object (e.g. a `dict`) to a `JSON` representation. The object is **encoded** for easy transmission

- .stress[Deserialization]: the opposite of serialization. It involves **decoding** data in `JSON` format to **native data types** that can be manipulated

---

### Why `JSON`?

- .stress[Much smaller representation than `XML`] (its predecessor) in server-client communication: **faster data transfers**

- `JSON` exists as a **"sequence of bytes"**: very useful to transmit (stream) data over a network

- `JSON` is .stress[human-friendly] since it is **textual**, and simultaneously **machine-friendly**

- `JSON` has an .stress[expressive syntax] for representing arrays, objects, numbers and booleans

---
template: inter-slide

## Using JSON with Python

---

### Working with built-in datatypes

- The `json` module encodes `Python` objects as `JSON` strings using the `json.JSONEncoder` class

- and decodes `JSON` strings into `Python` objects using the `json.JSONDecoder` class

- By default, the `JSON` encoder .stress[only understands native `Python` data types] (`str`, `int`, `float`, `bool`, `list`, `tuple`, `dict` and `None`)

The `json` module provides two very handy functions for .stress[serialization]:

.pure-table.pure-table-striped[
| Function  | Description |
| :---------| :----- |
| `dumps()` | serializes an object to a `JSON` **formatted string** |
| `dump()`  | serializes an object to a `JSON` **formatted stream** (which supports writing to a file) |
]

---

### Serializing built-in datatypes

`json.dumps` and `json.dump` use the following conversions for built-in datatypes:

.pure-table.pure-table-striped[
| Python      | `JSON` |
| :---------- | -----: |
| dict        | object |
| list, tuple | array  |
| str         | string |
| int, float  | number |
| True        | true   |
| False       | false  |
| None        | null   |
]

---

### Serialization example

.pull-left[

Serialize a `Python` object into a `JSON` formatted string using `json.dumps`

```python
>>> import json
>>> print(json.dumps({
        "name": "Foo Bar",
        "age": 78,
        "friends": ["Jane","John"],
        "balance": 345.80,
        "other_names":("Doe","Joe"),
        "active": True,
        "spouse": None
    }, sort_keys=True, indent=4))
```

]

.pull-right-40[

Output

```json
{
    "active": true,
    "age": 78,
    "balance": 345.8,
    "friends": [
        "Jane",
        "John"
    ],
    "name": "Foo Bar",
    "other_names": [
        "Doe",
        "Joe"
    ],
    "spouse": null
}
```

]

.pull-left[

Pretty printing options

- `sort_keys=True`: sort the keys of the JSON object
- `indent=4`: indent using 4 spaces

]

---

### Serialization example

Similarly, `json.dump()` writes the serialized output to a file

```python
>>> import json
>>> with open('user.json','w') as file:
        json.dump({
            "name": "Foo Bar",
            "age": 78,
            "friends": ["Jane","John"],
            "balance": 345.80,
            "other_names": ("Doe","Joe"),
            "active": True,
            "spouse": None
        }, file, sort_keys=True, indent=4)
```

This writes a `user.json` file to disk with the same content as the output of the previous example

---

### Deserializing built-in datatypes

As for serialization, the `json` module exposes two functions for deserialization:

.pure-table.pure-table-striped[
| Function  | Description |
| :---------| :----- |
| `loads()` | deserializes a `JSON` document to a Python object |
| `load()`  | deserializes a `JSON` formatted stream (which supports reading from a file) to a Python object |
]

---

### Deserializing built-in datatypes

The decoder converts `JSON` encoded data into native Python data types as in the table below:

.pure-table.pure-table-striped[
| `JSON`        | Python |
| :------------ | -----: |
| object        | dict   |
| array         | list   |
| string        | str    |
| number (int)  | int    |
| number (real) | float  |
| true          | True   |
| false         | False  |
| null          | None   |
]

---

### Deserialization example

Passing a `JSON` string to the `json.loads()` function:

```python
>>> import json
>>> json.loads('{"active": true, "age": 78, "balance": 345.8, "friends": ["Jane","John"], "name": "Foo Bar", "other_names": ["Doe","Joe"],"spouse":null}')
```

we obtain a dictionary as the output:

```python
{'active': True,
 'age': 78,
 'balance': 345.8,
 'friends': ['Jane', 'John'],
 'name': 'Foo Bar',
 'other_names': ['Doe', 'Joe'],
 'spouse': None}
```

---

### Deserialization example

We can also read from the `user.json` file that we created before:

```python
>>> import json
>>> with open('user.json', 'r') as file:
        user_data = json.load(file)
>>> print(user_data)
{'active': True, 'age': 78, 'balance': 345.8, 'friends': ['Jane', 'John'], 'name': 'Foo Bar', 'other_names': ['Doe', 'Joe'], 'spouse': None}
```

We obtain the same `dict`. This is .stress[very simple] and actually pretty fast.

NB: doing the same in `Java` is more involved, uglier and slower...

---

### Serialize and deserialize custom objects

- Using `JSON`, we serialized and deserialized objects containing only .stress[encapsulated built-in types]

- We can also work a little bit to serialize .stress[custom objects]

- Let's go to [notebook07_json-format.ipynb](../labs/notebook07_json-format.ipynb)

---
template: inter-slide

## Using JSON data with Spark

---

### Using `JSON` data with `Spark`

Typically achieved using

```python
spark.read.json(filename, multiLine=True)
```

- Pretty simple

- but usually requires .stress[extra cleaning] or .stress[schema flattening]

Everything is explained in the notebook:

.center[[notebook07_json-format.ipynb](../labs/notebook07_json-format.ipynb)]

---
template: inter-slide

### Thank you!
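
---

### Appendix: serializing a custom object (sketch)

The slides above leave custom objects to the notebook. As a minimal sketch (not the notebook's code), one common approach is to subclass `json.JSONEncoder` for encoding and to pass an `object_hook` to `json.loads` for decoding. The `User` class, the `UserEncoder` and `as_user` helpers, and the `"__type__"` marker key are all hypothetical names introduced here for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class User:
    # Hypothetical custom type, used only for this illustration
    name: str
    age: int
    friends: list


class UserEncoder(json.JSONEncoder):
    def default(self, obj):
        # default() is called only for objects the encoder cannot handle natively
        if isinstance(obj, User):
            data = asdict(obj)
            data["__type__"] = "User"  # assumed marker key, so decoding can spot it
            return data
        return super().default(obj)  # raises TypeError for other unknown types


def as_user(d):
    # object_hook receives every decoded JSON object as a dict
    if d.get("__type__") == "User":
        return User(name=d["name"], age=d["age"], friends=d["friends"])
    return d


user = User("Foo Bar", 78, ["Jane", "John"])
encoded = json.dumps(user, cls=UserEncoder, sort_keys=True)
decoded = json.loads(encoded, object_hook=as_user)
```

Here `encoded` is an ordinary `JSON` string and `decoded` is again a `User` instance. The marker key is a design choice of this sketch: plain `JSON` carries no type information, so the decoder needs some convention to tell which objects to rebuild.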