Master I MIDS & Informatique
Université Paris Cité
2024-02-19
JavaScript Object Notation (JSON) is a lightweight data-interchange format based on the syntax of JavaScript objects
It is a text-based, human-readable, language-independent format for representing structured object data for easy transmission or saving
JSON objects can also be stored in files — typically a text file with a .json extension
JSON is used for two-way data transmission between a web server and a client, but it is also often used as a semi-structured data format
Its syntax closely resembles JavaScript objects, but JSON can be used independently of JavaScript
Most languages have libraries to manipulate JSON
We shall use JSON data in Python using the json module from the standard library
R has several packages to handle JSON, for example jsonlite
JSON objects should be thought of as strings or sequences (series) of bytes complying with the JSON syntax
Serialization converts an object (e.g. a dict) to a JSON representation: the object is encoded for easy storage and/or transmission
Deserialization converts data from the JSON format back to native data types that can be manipulated
Why JSON? Much smaller representation than XML (its predecessor) in client-server communication: faster data transfers
JSON exists as a sequence of bytes: very useful to transmit (stream) data over a network
JSON is reader-friendly since it is ultimately text and simultaneously machine-friendly
JSON has an expressive syntax for representing arrays, objects, numbers and booleans/logicals
The json module:
encodes Python objects as JSON strings using instances of class json.JSONEncoder
decodes JSON strings into Python objects using instances of class json.JSONDecoder
Warning
The JSON encoder only handles native Python data types (str, int, float, bool, list, tuple and dict)
The json module provides two very handy methods for serialization:

| Function | Description |
|---|---|
| dumps() | serializes an object to a JSON formatted string |
| dump() | serializes an object to a JSON formatted stream (which supports writing to a file) |
json.dumps() and json.dump() use the following mapping conventions for built-in data types:
| Python | JSON |
|---|---|
| dict | object |
| list, tuple | array |
| str | string |
| int, float | number |
| True | true |
| False | false |
| None | null |
Warning
list and tuple are mapped to the same JSON type (array).
int and float are mapped to the same JSON type (number).
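A quick sketch of these conventions in action (the values below are purely illustrative):

```python
import json

# tuple -> array, None -> null, bool -> true/false, int/float -> number
print(json.dumps({"t": (1, 2), "n": None, "ok": True, "x": 1.5}))
# {"t": [1, 2], "n": null, "ok": true, "x": 1.5}
```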
Serialize a Python object into a JSON formatted string using json.dumps(); the result is of type str.
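A minimal sketch, with a dict that echoes the user record used later in this section:

```python
import json

user = {
    "name": "Foo Bar",
    "age": 78,
    "other_names": ("Doe", "Joe"),   # a tuple: it becomes a JSON array
    "active": True,
    "spouse": None,
}

text = json.dumps(user)
print(type(text))   # <class 'str'>
print(text)         # {"name": "Foo Bar", "age": 78, ..., "spouse": null}
```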
Remember:
json.dumps() converts a Python object into JSON formatted text.
A Python date object is not serializable, but it can be converted into serializable types, for example:
'"2024-03-04"'
'[2024, 10, 1]'
'[2024, 3, 4, 0, 0, 0, 0, 64, -1]'
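Conversions along the following lines produce outputs like those above (a sketch; the exact calls and the date are assumptions):

```python
import json
from datetime import date

d = date(2024, 3, 4)

# json.dumps(d)                       # raises TypeError: Object of type date is not JSON serializable
json.dumps(d.isoformat())             # '"2024-03-04"'  : ISO-formatted string
json.dumps([d.year, d.month, d.day])  # '[2024, 3, 4]'  : list of integers
json.dumps(d.timetuple())             # '[2024, 3, 4, 0, 0, 0, 0, 64, -1]' : struct_time is a tuple
```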
json.dump() allows us to write the output stream to a file
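A minimal sketch that writes the user record to user.json, the file read back later (the indent option is only for readability):

```python
import json

user = {
    "name": "Foo Bar",
    "other_names": ["Doe", "Joe"],
    "active": True,
    "age": 78,
    "balance": 345.80,
    "friends": ["Jane", "John"],
    "spouse": None,
}

with open("user.json", "w") as fp:
    json.dump(user, fp, indent=4)   # serialize straight into the open file object
```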
Similarly to serialization, the json module exposes two methods for deserialization:
| Function | Description |
|---|---|
| loads() | deserializes a JSON document to a Python object |
| load() | deserializes a JSON formatted stream (which supports reading from a file) to a Python object |
The decoder converts JSON encoded data into native Python data types as in the table below:
| JSON | Python |
|---|---|
| object | dict |
| array | list |
| string | str |
| number (int) | int |
| number (real) | float |
| true | True |
| false | False |
| null | None |
Pass a JSON string to the json.loads() method:
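A minimal sketch (the JSON string below is illustrative):

```python
import json

text = '{"name": "Foo Bar", "age": 78, "friends": ["Jane", "John"], "spouse": null}'
obj = json.loads(text)
print(type(obj))        # <class 'dict'>
print(obj["friends"])   # ['Jane', 'John']
```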
We can also read from the user.json file we created before:
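A sketch, assuming user.json was written as above (pprint is used here only for display):

```python
import json
from pprint import pprint

with open("user.json") as fp:
    user = json.load(fp)   # deserialize directly from the open file object

pprint(user)
```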
{'active': True,
'age': 78,
'balance': 345.8,
'friends': ['Jane', 'John'],
'name': 'Foo Bar',
'other_names': ['Doe', 'Joe'],
'spouse': None}
We obtain the same dict. This is simple and fast.
Using JSON, we serialized and deserialized objects containing only encapsulated built-in types
We can also work a little bit to serialize custom objects
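As a teaser, a minimal sketch using the default= hook of json.dumps (the User class and encode_user helper are purely hypothetical):

```python
import json
from datetime import date

class User:
    def __init__(self, name, birthday):
        self.name = name
        self.birthday = birthday

def encode_user(obj):
    # json.dumps calls this hook for any object it cannot serialize natively
    if isinstance(obj, User):
        return {"name": obj.name, "birthday": obj.birthday.isoformat()}
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

print(json.dumps(User("Foo Bar", date(1946, 3, 4)), default=encode_user))
# {"name": "Foo Bar", "birthday": "1946-03-04"}
```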
Let’s go to notebook07_json-format.ipynb
JSON data with Spark: reading and writing JSON data is typically achieved using Spark's JSON reader and writer.
Pretty simple, but it usually requires extra cleaning or schema flattening.
(Almost) Everything is explained in the notebook:
The JSON reader and writer allow us to save and read Spark dataframes with composite types.
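A minimal PySpark sketch (the dataframe contents and the output path are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-demo").getOrCreate()

# A small dataframe with a composite (array) column
df = spark.createDataFrame(
    [("Foo Bar", 78, ["Jane", "John"]), ("Baz Qux", 45, ["Doe"])],
    ["name", "age", "friends"],
)

df.write.mode("overwrite").json("users_json")   # one JSON object per line (JSON Lines)
df2 = spark.read.json("users_json")             # schema (including the array) is inferred
df2.printSchema()
```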
A common use of JSON is to collect JSON data from a web server as a file or HTTP request, and convert the JSON data to a Python/R/Spark object.
An object can be defined as an unordered set of name/value pairs. A JSON object starts with { (left brace) and ends with } (right brace). Each name is followed by : (colon) and the name/value pairs are separated by , (comma).
JSON syntax is a subset of the JavaScript object notation syntax.
JSON types: objects, arrays, and primitive values (strings, numbers, booleans, null).
json versus pickle: two competing serialization modules?
Pickle is Python bound.
Pickle handles (almost) everything that can be defined in Python.
pickle dumps bytes, not human-readable text.
json is used by widely different languages and systems.
json is readable.
json is less prone to malicious code injection.
JSON objects are used extensively to handle spatial or textual data.
JSON objects are used by spatial extensions of Pandas and Spark.
GeoJSON is a format for encoding a variety of geographic data structures. GeoJSON supports the following geometry types: Point, LineString, Polygon, MultiPoint, MultiLineString, and MultiPolygon. Geometric objects with additional properties are Feature objects. Sets of features are contained by FeatureCollection objects.
{'type': 'FeatureCollection',
'features': [{'type': 'Feature',
'properties': {},
'geometry': {'coordinates': [2.381584638521815, 48.82906361931293],
'type': 'Point'}}]}
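Since GeoJSON is plain JSON, the json module can decode the FeatureCollection above directly (a minimal sketch):

```python
import json

geojson_str = """{
  "type": "FeatureCollection",
  "features": [{"type": "Feature",
                "properties": {},
                "geometry": {"type": "Point",
                             "coordinates": [2.381584638521815, 48.82906361931293]}}]
}"""

fc = json.loads(geojson_str)
lon, lat = fc["features"][0]["geometry"]["coordinates"]
print(lon, lat)   # 2.381584638521815 48.82906361931293
```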
Natural Language Processing (NLP) handles corpora of texts (called documents), annotates the documents, parses the documents into sentences and tokens, performs syntactic analysis (POS tagging), and eventually enables topic modeling, sentiment analysis, automatic translation, and other machine learning tasks.
Corpus annotation can be performed using spark-nlp, a package developed by John Snow Labs to offer NLP on top of Spark SQL and Spark MLlib.
Annotation starts by applying a DocumentAssembler() transformation to a corpus. This introduces columns with composite types.
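A minimal sketch of that step, using the sentence that appears in the output below (spark-nlp installation and session setup details are assumed):

```python
import sparknlp
from sparknlp.base import DocumentAssembler

spark = sparknlp.start()   # Spark session with the spark-nlp jars loaded

data = spark.createDataFrame(
    [["Spark NLP is an open-source text processing library."]]
).toDF("text")

documentAssembler = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)
```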
>>> result = documentAssembler.transform(data)
>>> result.select("document").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
>>> result.select("document").printSchema()
root
|-- document: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- annotatorType: string (nullable = true)
| | |-- begin: integer (nullable = false)
| | |-- end: integer (nullable = false)
| | |-- result: string (nullable = true)
| | |-- metadata: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
| | |-- embeddings: array (nullable = true)
| | | |-- element: float (containsNull = false)

Column document is of type ArrayType(). The element type of the document column is a StructType(); each element contains subfields of primitive types, but also a field of type map (MapType()) and a field of type array (ArrayType()).
