Master I MIDS & Informatique
Université Paris Cité
2024-02-19
JSON
JavaScript Object Notation (JSON) is a lightweight data-interchange format based on the syntax of JavaScript objects. It is a text-based, human-readable, language-independent format for representing structured data for easy transmission or storage.
JSON objects can also be stored in files, typically a text file with a `.json` extension.
JSON is used for two-way data transmission between a web server and a client, but it is also often used as a semi-structured data format. Its syntax closely resembles that of JavaScript objects, but JSON can be used independently of JavaScript.
Most languages have libraries to manipulate JSON:

- We shall use JSON data in Python, using the `json` module from the standard library.
- R has several packages to handle JSON, for example `jsonlite`.
JSON objects should be thought of as strings, or sequences (series) of bytes, complying with the JSON syntax.

- *Serialization* converts a native object (for example a Python `dict`) to a JSON representation. The object is encoded for easy storage and/or transmission.
- *Deserialization* converts data in JSON format to native data types that can be manipulated.
- **Much smaller representation than XML** (its predecessor) in client-server communication: faster data transfers.
- JSON exists as a sequence of bytes: very useful to transmit (stream) data over a network.
- JSON is reader-friendly, since it is ultimately text, and simultaneously machine-friendly.
- JSON has an expressive syntax for representing arrays, objects, numbers, and booleans/logicals.
The `json` module:

- encodes Python objects as JSON strings, using instances of class `json.JSONEncoder`
- decodes JSON strings into Python objects, using instances of class `json.JSONDecoder`
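For illustration, the encoder and decoder classes can be used directly; a minimal sketch (the `json.dumps()`/`json.loads()` shortcuts introduced below are what you would normally call):

```python
import json

# encode a Python dict to a JSON string with a JSONEncoder instance
encoder = json.JSONEncoder()
s = encoder.encode({"a": 1, "b": [True, None]})
print(s)  # {"a": 1, "b": [true, null]}

# decode it back with a JSONDecoder instance
decoder = json.JSONDecoder()
obj = decoder.decode(s)
print(obj)  # {'a': 1, 'b': [True, None]}
```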
Warning

The JSON encoder only handles native Python data types (`str`, `int`, `float`, `bool`, `list`, `tuple`, `dict`, and `None`).
The `json` module provides two very handy methods for serialization:

| Function | Description |
|---|---|
| `dumps()` | serializes an object to a JSON formatted string |
| `dump()` | serializes an object to a JSON formatted stream (which supports writing to a file) |
`json.dumps()` and `json.dump()` use the following mapping conventions for built-in datatypes:
| Python | JSON |
|---|---|
| `dict` | object |
| `list`, `tuple` | array |
| `str` | string |
| `int`, `float` | number |
| `True` | true |
| `False` | false |
| `None` | null |
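A quick sketch of these mappings in action (the dictionary is illustrative):

```python
import json

# dict -> object, list -> array, str -> string, int/float -> number,
# True/False -> true/false, None -> null
print(json.dumps({"name": "Foo", "tags": ["a", "b"], "age": 78,
                  "balance": 345.8, "active": True, "spouse": None}))
# {"name": "Foo", "tags": ["a", "b"], "age": 78, "balance": 345.8, "active": true, "spouse": null}
```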
Warning

- `list` and `tuple` are mapped to the same JSON type (array).
- `int` and `float` are mapped to the same JSON type (number).
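Consequently, round trips are not faithful for tuples; a sketch:

```python
import json

# a tuple is serialized as a JSON array, which deserializes as a list
assert json.loads(json.dumps((1, 2))) == [1, 2]

# int and float are both JSON numbers; the textual form decides the decoded type
print(json.dumps([1, 2.5]))  # [1, 2.5]
```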
Serialize a Python object into a JSON formatted string using `json.dumps()`; the result is of type `str`.

Remember: `json.dumps()` converts a Python object into JSON formatted text.
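A minimal sketch (the dictionary is illustrative):

```python
import json

user = {"name": "Foo Bar", "age": 78, "active": True, "spouse": None}
s = json.dumps(user)
print(type(s))  # <class 'str'>
print(s)        # {"name": "Foo Bar", "age": 78, "active": true, "spouse": null}
```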
A Python `date` object is not serializable, but it can be converted into serializable types:

```
'"2024-03-04"'
'[2024, 10, 1]'
'[2024, 3, 4, 0, 0, 0, 0, 64, -1]'
```
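The three outputs above can be reproduced as follows (a sketch; the date 2024-03-04 is chosen to match them):

```python
import json
from datetime import date

d = date(2024, 3, 4)
# json.dumps(d) would raise TypeError: date is not JSON serializable

print(json.dumps(d.isoformat()))          # "2024-03-04"
print(json.dumps(list(d.isocalendar())))  # [2024, 10, 1]  (ISO year, week, weekday)
print(json.dumps(list(d.timetuple())))    # [2024, 3, 4, 0, 0, 0, 0, 64, -1]
```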
`json.dump()` allows writing the output stream to a file.
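A sketch writing the user dict shown later in this section to `user.json`:

```python
import json

user = {"active": True, "age": 78, "balance": 345.8,
        "friends": ["Jane", "John"], "name": "Foo Bar",
        "other_names": ["Doe", "Joe"], "spouse": None}

# json.dump writes directly to any file-like object
with open("user.json", "w") as f:
    json.dump(user, f, indent=2)
```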
Similarly to serialization, the `json` module exposes two methods for deserialization:

| Function | Description |
|---|---|
| `loads()` | deserializes a JSON document to a Python object |
| `load()` | deserializes a JSON formatted stream (which supports reading from a file) to a Python object |
The decoder converts JSON encoded data into native Python data types as in the table below:

| JSON | Python |
|---|---|
| object | `dict` |
| array | `list` |
| string | `str` |
| number (int) | `int` |
| number (real) | `float` |
| true | `True` |
| false | `False` |
| null | `None` |
Pass a JSON string to the `json.loads()` method:
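For instance (a sketch; note how `true`/`null` become `True`/`None`):

```python
import json

s = '{"name": "Foo Bar", "age": 78, "active": true, "spouse": null}'
obj = json.loads(s)
print(obj)  # {'name': 'Foo Bar', 'age': 78, 'active': True, 'spouse': None}
```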
We can also read from the `user.json` file we created before:

```
{'active': True,
 'age': 78,
 'balance': 345.8,
 'friends': ['Jane', 'John'],
 'name': 'Foo Bar',
 'other_names': ['Doe', 'Joe'],
 'spouse': None}
```
We obtain the same `dict`. This is simple and fast.
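The call producing that output looks like this (a self-contained sketch that recreates `user.json` first):

```python
import json

user = {"active": True, "age": 78, "balance": 345.8,
        "friends": ["Jane", "John"], "name": "Foo Bar",
        "other_names": ["Doe", "Joe"], "spouse": None}
with open("user.json", "w") as f:
    json.dump(user, f)

# json.load reads from any file-like object
with open("user.json") as f:
    loaded = json.load(f)

print(loaded == user)  # True
```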
Using JSON, we serialized and deserialized objects containing only encapsulated built-in types. We can also work a little bit to serialize custom objects.

Let's go to `notebook07_json-format.ipynb`.
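As a taste of what the notebook covers, a custom class (the `Point` class here is hypothetical) can be serialized by passing a `default` function to `json.dumps()`:

```python
import json

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

def encode_point(obj):
    # called by json.dumps for objects it cannot serialize natively
    if isinstance(obj, Point):
        return {"x": obj.x, "y": obj.y}
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")

s = json.dumps(Point(1, 2), default=encode_point)
print(s)  # {"x": 1, "y": 2}
```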
JSON data with Spark

Typically achieved using the Spark JSON reader (`spark.read.json()`) and writer (`df.write.json()`). Pretty simple, but it usually requires extra cleaning or schema flattening.

(Almost) everything is explained in the notebook: the JSON reader and writer allow us to save and read Spark dataframes with composite types.
A common use of JSON is to collect JSON data from a web server, as a file or via an HTTP request, and convert the JSON data to a Python/R/Spark object.

An object can be defined as an unordered set of name/value pairs. An object in JSON starts with `{` (left brace) and ends with `}` (right brace). Each name is followed by `:` (colon), and the name/value pairs are separated by `,` (comma).

JSON syntax is a subset of the JavaScript object notation syntax.
JSON types:

- objects
- arrays
`json` versus `pickle`

Two competing serialization modules?

- `pickle` is Python-bound: `pickle` handles (almost) everything that can be defined in Python (via `pickle.dumps()`).
- `json` is used by widely different languages and systems.
- `json` is readable.
- `json` is less prone to malicious code injection.

JSON objects are used extensively to handle spatial or textual data.
JSON objects are used by spatial extensions of Pandas and Spark.
GeoJSON is a format for encoding a variety of geographic data structures. GeoJSON supports the following geometry types: `Point`, `LineString`, `Polygon`, `MultiPoint`, `MultiLineString`, and `MultiPolygon`. Geometric objects with additional properties are `Feature` objects. Sets of features are contained by `FeatureCollection` objects.
```
{'type': 'FeatureCollection',
 'features': [{'type': 'Feature',
   'properties': {},
   'geometry': {'coordinates': [2.381584638521815, 48.82906361931293],
    'type': 'Point'}}]}
```
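Since GeoJSON is just JSON, the `json` module handles it directly; a sketch using the feature collection above:

```python
import json

feature_collection = {
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {},
        "geometry": {"coordinates": [2.381584638521815, 48.82906361931293],
                     "type": "Point"},
    }],
}

# round trip through a JSON string, then dig out the point coordinates
s = json.dumps(feature_collection)
lon, lat = json.loads(s)["features"][0]["geometry"]["coordinates"]
print(lon, lat)  # 2.381584638521815 48.82906361931293
```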
Natural Language Processing (NLP) handles corpora of texts (called documents), annotates the documents, parses the documents into sentences and tokens, performs syntactic analysis (POS tagging), and eventually enables topic modeling, sentiment analysis, automatic translation, and other machine learning tasks.

Corpus annotation can be performed using `spark-nlp`, a package developed by John Snow Labs to offer NLP on top of Spark SQL and Spark MLlib.
Annotation starts by applying a `DocumentAssembler()` transformation to a corpus. This introduces columns with composite types:
```
>>> result = documentAssembler.transform(data)
>>> result.select("document").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
>>> result.select("document").printSchema()
root
 |-- document: array (nullable = True)
 |    |-- element: struct (containsNull = True)
 |    |    |-- annotatorType: string (nullable = True)
 |    |    |-- begin: integer (nullable = False)
 |    |    |-- end: integer (nullable = False)
 |    |    |-- result: string (nullable = True)
 |    |    |-- metadata: map (nullable = True)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = True)
 |    |    |-- embeddings: array (nullable = True)
 |    |    |    |-- element: float (containsNull = False)
```
Column `document` is of type `ArrayType()`. The base type of the `document` column is a `StructType()` (`element`); the element contains subfields of primitive types, but also a field of type map (`MapType()`, the `metadata` field) and a field of type array (`ArrayType()`, the `embeddings` field).