print("Salut tout le monde!")
Salut tout le monde!
January 21, 2024
We introduce here the Python language: only the bare minimum needed to get started with the data-science stack (a set of libraries for data science). Python is a programming language, as are C++, Java, Fortran, JavaScript, etc.
An interpreted (as opposed to compiled) language. Contrary to e.g. C++ or Fortran, one does not compile Python code before executing it.
Used as a scripting language, by running python script.py in a terminal
But it can also be used interactively: the Jupyter notebook, IPython, etc.
Free software released under an open-source license: Python can be used and distributed free of charge, even for building commercial software.
Multi-platform: Python is available for all major operating systems: Windows, Linux/Unix, macOS, most likely your mobile phone OS, etc.
A very readable language with clear, non-verbose syntax
A language for which a large number of high-quality packages are available for various applications, including web frameworks and scientific computing
It has been one of the top languages for data science and machine learning for several years, because it is expressive and easy to deploy
An object-oriented language
See https://www.python.org/about/ for more information about distinguishing features of Python.
Simple answer: don’t use Python 2, use Python 3
Python 2 reached its end of life on January 1, 2020 and is no longer maintained
You’ll end up hanged if you use Python 2
If Python 2 is mandatory at your workplace, find another job
In a jupyter notebook, you have an interactive interpreter.
You type in the code cells and execute them with Shift + Enter
We can assign values to variables with =
We don’t declare the type of a variable before assigning its value. In C, conversely, one should write
or even
8004153099680695240677662228684856314409365427758266999205063931175132640587226837141154215226851187899067565063096026317140186260836873939218139105634817684999348008544433671366043519135008200013865245747791955240844192282274023825424476387832943666754140847806277355805648624376507618604963106833797989037967001806494232055319953368448928268857747779203073913941756270620192860844700087001827697624308861431399538404552468712313829522630577767817531374612262253499813723569981496051353450351968993644643291035336065584116155321928452618573467361004489993801594806505273806498684433633838323916674207622468268867047187858269410016150838175127772100983052010703525089
There exists a floating point type that is created when the variable has decimal values
Similarly, boolean types are created from a comparison
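The points above about assignment and types can be illustrated with a minimal sketch. The variable names are chosen here for illustration only and do not come from the original notebook.

```python
# The type comes from the value that is assigned, not from a declaration.
a = 2                   # int
big = 2 ** 200          # Python integers have arbitrary precision
x = 3.14                # a literal with a decimal part creates a float
is_small = a < 5        # a comparison creates a bool

print(type(a), type(big), type(x), type(is_small))

# The same name can later be bound to a value of another type.
a = "two"
print(type(a))
```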
Python provides many efficient types of containers, in which collections of objects can be stored.
The main ones are list, tuple, set and dict (but there are many others…)
You can’t change a tuple, we say that it’s immutable
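As a quick illustration of the four main containers and of tuple immutability, here is a small sketch. All names are hypothetical.

```python
a_list = [1, 2, 3]
a_tuple = (1, 2, 3)
a_set = {1, 2, 2, 3}        # duplicates collapse: the set holds 1, 2, 3
a_dict = {"one": 1, "two": 2}

a_list[0] = 42              # fine: a list is mutable

try:
    a_tuple[0] = 42         # not allowed: a tuple is immutable
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```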
Three ways of doing the same thing
Simpler is better in Python, so usually you want to use Method 2.
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
A list is an ordered collection of objects. These objects may have different types. For example:
Indexing: accessing individual objects contained in the list
Warning. Indexing starts at 0 (as in C), not at 1 (as in Fortran, R, or Matlab) for any iterable object in Python.
Counting from the end with negative indices:
Index must remain in the range of the list
This works with anything iterable whenever it makes sense (list, str, tuple, etc.)
Slicing syntax: colors[start:stop:stride]
NB: All slicing parameters are optional
slice(None, 4, None)
slice(1, 5, None)
slice(None, 13, 3)
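Indexing and slicing can be sketched as follows. The content of colors is an assumption made for this example; the slice objects printed above are what Python builds internally from the start:stop:stride notation.

```python
colors = ['red', 'blue', 'green', 'black', 'white']  # assumed example data

first = colors[0]             # indexing starts at 0
last = colors[-1]             # negative indices count from the end
middle = colors[1:4]          # the stop index is excluded
every_other = colors[::2]     # start and stop omitted, stride of 2
backwards = colors[::-1]      # a negative stride walks backwards

# The bracket notation is shorthand for an explicit slice object.
same_as_middle = colors[slice(1, 4)]

try:
    colors[10]                # the index must stay within the list
except IndexError:
    out_of_range = True
```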
Different string syntaxes (single, double or triple quotes):
" Bonjour,\nJe m'appelle Stephane.\nJe vous souhaite une bonne journée.\nSalut. \n"
['B', 'o', 'n', 'j', 'o', 'u', 'r', ',', '\n', 'J', 'e', ' ', 'm', "'", 'a']
['Bonjour,',
"Je m'appelle Stephane.",
'Je vous souhaite une bonne journée.',
'Salut.']
Chaining method calls is the basis of pipeline building.
(
    " ".join(['Il', 'fait', 'super', 'beau', "aujourd'hui"])
    .title()
    .replace(' ', '')
    .replace("'","")
)
'IlFaitSuperBeauAujourdHui'
A string is immutable !!
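A short sketch of what immutability means for strings: item assignment fails, and methods such as replace return a new string instead of changing the original. The example strings are made up.

```python
s = "Python is great"

try:
    s[0] = "J"                       # item assignment is not supported
except TypeError:
    print("str does not support item assignment")

t = s.replace("great", "fun")        # returns a new string
print(s)                             # the original is unchanged
print(t)
```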
in keyword
You can use the in keyword with any container, whenever it makes sense
['red', 'blue', 3.14, 'black', 'white', ('truc', 3.14, 'truc')]
False
Explain this weird behaviour:
dict under the hood in Python
{'emmanuelle': 5752, 'sebastian': 5578}
<class 'dict'>
{'emmanuelle': 5752,
'sebastian': 5578,
'francis': '5919',
7162453: [1, 3, 2],
3.14: 'bidule',
('jaouad', 2): 1234}
tel = {'emmanuelle': 5752, 'sebastian' : 5578, 'jaouad' : 1234}
print(tel.keys())
print(tel.values())
print(tel.items())
dict_keys(['emmanuelle', 'sebastian', 'jaouad'])
dict_values([5752, 5578, 1234])
dict_items([('emmanuelle', 5752), ('sebastian', 5578), ('jaouad', 1234)])
You can swap values like this
{'emmanuelle': 5752, 'sebastian': 5578, 'jaouad': 1234}
{'emmanuelle': 5578, 'sebastian': 5752, 'jaouad': 1234}
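One way to obtain the swap shown above is tuple unpacking. This is a sketch assuming that this idiom was the one intended; the dictionary is the tel defined earlier.

```python
tel = {'emmanuelle': 5752, 'sebastian': 5578, 'jaouad': 1234}

# The right-hand side is evaluated first, then unpacked into the two targets.
tel['emmanuelle'], tel['sebastian'] = tel['sebastian'], tel['emmanuelle']
print(tel)
```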
Get keys of tel sorted by decreasing order
Get keys of tel sorted by increasing values
Obtain a sorted-by-key version of tel
<class 'dict_items'>
False
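A possible answer to the three sorting exercises above, given as a sketch rather than the reference solution. It relies only on the built-in sorted and on dict.get.

```python
tel = {'emmanuelle': 5752, 'sebastian': 5578, 'jaouad': 1234}

# Keys in decreasing (alphabetical) order
keys_decreasing = sorted(tel, reverse=True)

# Keys ordered by increasing value: the sort key is the value of each key
keys_by_value = sorted(tel, key=tel.get)

# A new dict whose insertion order follows the sorted keys
sorted_by_key = dict(sorted(tel.items()))

print(keys_decreasing)
print(keys_by_value)
print(sorted_by_key)
```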
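If you really want an ordered dict, OrderedDict (from the collections module) remembers the order in which keys were inserted. Since Python 3.7, a plain dict also preserves insertion order.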
A set is an unordered container, containing unique elements
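A small sketch of set behaviour: duplicates are dropped and the usual set operations are available as operators. The example data is made up.

```python
letters = set("abracadabra")          # unique letters only
vowels = {"a", "e", "i", "o", "u"}

common = letters & vowels             # intersection
union = letters | vowels              # union
consonants = letters - vowels         # difference

print(sorted(letters))
print(sorted(consonants))
```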
You can combine all containers together
Python is name binding
Question. What is in ss ?
ss and sss are names for the same object
Comparisons
***********
Unlike C, all comparison operations in Python have the same priority,
which is lower than that of any arithmetic, shifting or bitwise
operation. Also unlike C, expressions like "a < b < c" have the
interpretation that is conventional in mathematics:
comparison ::= or_expr (comp_operator or_expr)*
comp_operator ::= "<" | ">" | "==" | ">=" | "<=" | "!="
| "is" ["not"] | ["not"] "in"
Comparisons yield boolean values: "True" or "False". Custom *rich
comparison methods* may return non-boolean values. In this case Python
will call "bool()" on such value in boolean contexts.
Comparisons can be chained arbitrarily, e.g., "x < y <= z" is
equivalent to "x < y and y <= z", except that "y" is evaluated only
once (but in both cases "z" is not evaluated at all when "x < y" is
found to be false).
Formally, if *a*, *b*, *c*, …, *y*, *z* are expressions and *op1*,
*op2*, …, *opN* are comparison operators, then "a op1 b op2 c ... y
opN z" is equivalent to "a op1 b and b op2 c and ... y opN z", except
that each expression is evaluated at most once.
Note that "a op1 b op2 c" doesn’t imply any kind of comparison between
*a* and *c*, so that, e.g., "x < y > z" is perfectly legal (though
perhaps not pretty).
Value comparisons
=================
The operators "<", ">", "==", ">=", "<=", and "!=" compare the values
of two objects. The objects do not need to have the same type.
Chapter Objects, values and types states that objects have a value (in
addition to type and identity). The value of an object is a rather
abstract notion in Python: For example, there is no canonical access
method for an object’s value. Also, there is no requirement that the
value of an object should be constructed in a particular way, e.g.
comprised of all its data attributes. Comparison operators implement a
particular notion of what the value of an object is. One can think of
them as defining the value of an object indirectly, by means of their
comparison implementation.
Because all types are (direct or indirect) subtypes of "object", they
inherit the default comparison behavior from "object". Types can
customize their comparison behavior by implementing *rich comparison
methods* like "__lt__()", described in Basic customization.
The default behavior for equality comparison ("==" and "!=") is based
on the identity of the objects. Hence, equality comparison of
instances with the same identity results in equality, and equality
comparison of instances with different identities results in
inequality. A motivation for this default behavior is the desire that
all objects should be reflexive (i.e. "x is y" implies "x == y").
A default order comparison ("<", ">", "<=", and ">=") is not provided;
an attempt raises "TypeError". A motivation for this default behavior
is the lack of a similar invariant as for equality.
The behavior of the default equality comparison, that instances with
different identities are always unequal, may be in contrast to what
types will need that have a sensible definition of object value and
value-based equality. Such types will need to customize their
comparison behavior, and in fact, a number of built-in types have done
that.
The following list describes the comparison behavior of the most
important built-in types.
* Numbers of built-in numeric types (Numeric Types — int, float,
complex) and of the standard library types "fractions.Fraction" and
"decimal.Decimal" can be compared within and across their types,
with the restriction that complex numbers do not support order
comparison. Within the limits of the types involved, they compare
mathematically (algorithmically) correct without loss of precision.
The not-a-number values "float('NaN')" and "decimal.Decimal('NaN')"
are special. Any ordered comparison of a number to a not-a-number
value is false. A counter-intuitive implication is that not-a-number
values are not equal to themselves. For example, if "x =
float('NaN')", "3 < x", "x < 3" and "x == x" are all false, while "x
!= x" is true. This behavior is compliant with IEEE 754.
* "None" and "NotImplemented" are singletons. **PEP 8** advises that
comparisons for singletons should always be done with "is" or "is
not", never the equality operators.
* Binary sequences (instances of "bytes" or "bytearray") can be
compared within and across their types. They compare
lexicographically using the numeric values of their elements.
* Strings (instances of "str") compare lexicographically using the
numerical Unicode code points (the result of the built-in function
"ord()") of their characters. [3]
Strings and binary sequences cannot be directly compared.
* Sequences (instances of "tuple", "list", or "range") can be compared
only within each of their types, with the restriction that ranges do
not support order comparison. Equality comparison across these
types results in inequality, and ordering comparison across these
types raises "TypeError".
Sequences compare lexicographically using comparison of
corresponding elements. The built-in containers typically assume
identical objects are equal to themselves. That lets them bypass
equality tests for identical objects to improve performance and to
maintain their internal invariants.
Lexicographical comparison between built-in collections works as
follows:
* For two collections to compare equal, they must be of the same
type, have the same length, and each pair of corresponding
elements must compare equal (for example, "[1,2] == (1,2)" is
false because the type is not the same).
* Collections that support order comparison are ordered the same as
their first unequal elements (for example, "[1,2,x] <= [1,2,y]"
has the same value as "x <= y"). If a corresponding element does
not exist, the shorter collection is ordered first (for example,
"[1,2] < [1,2,3]" is true).
* Mappings (instances of "dict") compare equal if and only if they
have equal "(key, value)" pairs. Equality comparison of the keys and
values enforces reflexivity.
Order comparisons ("<", ">", "<=", and ">=") raise "TypeError".
* Sets (instances of "set" or "frozenset") can be compared within and
across their types.
They define order comparison operators to mean subset and superset
tests. Those relations do not define total orderings (for example,
the two sets "{1,2}" and "{2,3}" are not equal, nor subsets of one
another, nor supersets of one another). Accordingly, sets are not
appropriate arguments for functions which depend on total ordering
(for example, "min()", "max()", and "sorted()" produce undefined
results given a list of sets as inputs).
Comparison of sets enforces reflexivity of its elements.
* Most other built-in types have no comparison methods implemented, so
they inherit the default comparison behavior.
User-defined classes that customize their comparison behavior should
follow some consistency rules, if possible:
* Equality comparison should be reflexive. In other words, identical
objects should compare equal:
"x is y" implies "x == y"
* Comparison should be symmetric. In other words, the following
expressions should have the same result:
"x == y" and "y == x"
"x != y" and "y != x"
"x < y" and "y > x"
"x <= y" and "y >= x"
* Comparison should be transitive. The following (non-exhaustive)
examples illustrate that:
"x > y and y > z" implies "x > z"
"x < y and y <= z" implies "x < z"
* Inverse comparison should result in the boolean negation. In other
words, the following expressions should have the same result:
"x == y" and "not x != y"
"x < y" and "not x >= y" (for total ordering)
"x > y" and "not x <= y" (for total ordering)
The last two expressions apply to totally ordered collections (e.g.
to sequences, but not to sets or mappings). See also the
"total_ordering()" decorator.
* The "hash()" result should be consistent with equality. Objects that
are equal should either have the same hash value, or be marked as
unhashable.
Python does not enforce these consistency rules. In fact, the
not-a-number values are an example for not following these rules.
Membership test operations
==========================
The operators "in" and "not in" test for membership. "x in s"
evaluates to "True" if *x* is a member of *s*, and "False" otherwise.
"x not in s" returns the negation of "x in s". All built-in sequences
and set types support this as well as dictionary, for which "in" tests
whether the dictionary has a given key. For container types such as
list, tuple, set, frozenset, dict, or collections.deque, the
expression "x in y" is equivalent to "any(x is e or x == e for e in
y)".
For the string and bytes types, "x in y" is "True" if and only if *x*
is a substring of *y*. An equivalent test is "y.find(x) != -1".
Empty strings are always considered to be a substring of any other
string, so """ in "abc"" will return "True".
For user-defined classes which define the "__contains__()" method, "x
in y" returns "True" if "y.__contains__(x)" returns a true value, and
"False" otherwise.
For user-defined classes which do not define "__contains__()" but do
define "__iter__()", "x in y" is "True" if some value "z", for which
the expression "x is z or x == z" is true, is produced while iterating
over "y". If an exception is raised during the iteration, it is as if
"in" raised that exception.
Lastly, the old-style iteration protocol is tried: if a class defines
"__getitem__()", "x in y" is "True" if and only if there is a non-
negative integer index *i* such that "x is y[i] or x == y[i]", and no
lower integer index raises the "IndexError" exception. (If any other
exception is raised, it is as if "in" raised that exception).
The operator "not in" is defined to have the inverse truth value of
"in".
Identity comparisons
====================
The operators "is" and "is not" test for an object’s identity: "x is
y" is true if and only if *x* and *y* are the same object. An
Object’s identity is determined using the "id()" function. "x is not
y" yields the inverse truth value. [4]
Related help topics: EXPRESSIONS, BASICMETHODS
When you code
you just
- bind the variable name x to a list [1, 2, 3]
- give another name y to the same object
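The two steps described above can be sketched as follows. The names x and y follow the text; z is an extra name introduced here to contrast binding with copying.

```python
x = [1, 2, 3]      # bind the name x to a list
y = x              # y is another name for the same object, not a copy

y.append(4)        # modifying the object through y...
print(x)           # ...is visible through x

same_object = x is y

z = list(x)        # builds a new list with the same content
z.append(5)        # x and y are not affected
```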
Important remarks
A list is mutable
140143828322240 [1, 2, 3]
140143828322240 [43, 2, 3, 3.14]
A str is immutable
In order to “change” an immutable object, Python creates a new one
Once again, a list is mutable
(140143828457216, 140143828457216)
other_list and super_list are the same list
id returns the identity of an object. Two objects with the same identity are the same (not only the same type, but the same instance)
([3.14, 'youps', 'tintin'], [3.14, 'youps', 'tintin'])
Only other_list is modified.
But… what if you have a list of lists ? (or a mutable object containing mutable objects)
(140142153134336, 140142153387200, 140142153387200, True)
Let’s make a copy of list_list
([[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6], 'super'])
OK, only copy_list is modified, as expected
But now…
([[1, 'oups', 3], [4, 5, 6], 'super'], [[1, 'oups', 3], [4, 5, 6]])
Question. What happened ?!?
The list_list object is copied
copy does a shallow copy, not a deep copy
Use deepcopy (from the copy module) to make a deep copy.
140142153248448 ([1, 2, 3], [4, 5, 6])
[140143828522688, 140143828456448]
140142153248448 ([1, '42', 3], [4, 5, 6])
[140143828522688, 140143828456448]
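The difference between a shallow and a deep copy can be reproduced with the sketch below. It uses copy and deepcopy from the standard copy module; the names shallow and deep are chosen for this example.

```python
from copy import copy, deepcopy

list_list = [[1, 2, 3], [4, 5, 6]]

shallow = copy(list_list)       # new outer list, same inner lists
deep = deepcopy(list_list)      # new outer list and new inner lists

shallow.append('super')         # only the outer list of shallow changes
shallow[0][1] = 'oups'          # the inner list is shared with list_list

print(list_list)
print(shallow)
print(deep)
```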
Namely tests, loops, again booleans, etc.
For example, don’t do this to test if a list is empty
but this
Some poetry
dict is False
string is False
list is False
tuple is False
set is False
0 is False
.0 is False
True
Empty sequences are falsies
Non-empty sequences are truthies
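A minimal sketch of truthiness in tests. The helper describe is hypothetical; it shows the idiomatic emptiness test (if not items) instead of comparing the length to 0.

```python
# All of these are considered false in a boolean context.
empty_values = [[], (), {}, set(), "", 0, 0.0, None]
all_falsy = not any(empty_values)

def describe(items):
    """Hypothetical helper: report whether a container is empty."""
    if not items:                       # idiomatic emptiness test
        return "empty"
    return f"{len(items)} item(s)"

print(all_falsy)
print(describe([]))
print(describe([1, 2]))
```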
Approximate Pi using the Wallis formula truncated to 100 factors
\[ \pi \approx 2 \prod_{i=1}^{100} \frac{4i^2}{4i^2 - 1} \]
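One possible solution to this exercise, written as a plain loop. It is a sketch, not necessarily the solution intended by the notebook; the name pi_approx is chosen here.

```python
import math

pi_approx = 2.0
for i in range(1, 101):                 # i = 1, ..., 100
    pi_approx *= 4 * i**2 / (4 * i**2 - 1)

print(pi_approx)
print(math.pi - pi_approx)              # the product converges slowly
```

With only 100 factors the result stays slightly below math.pi, which shows how slowly the Wallis product converges.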
for loop with range
range takes start, stop and stride parameters, like slicing start:end:stride; only stop is mandatory
for i in range(4):
    print(i + 1)
print('-')
for i in range(1, 5):
    print(i)
print('-')
for i in range(1, 10, 3):
    print(i)
1
2
3
4
-
1
2
3
4
-
1
4
7
Something for nerds. You can use else in a for loop
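The else clause of a for loop runs only when the loop finishes without hitting break. A small sketch, using a hypothetical helper first_even:

```python
def first_even(numbers):
    """Hypothetical helper: return the first even number, or None."""
    for n in numbers:
        if n % 2 == 0:
            result = n
            break               # leaving with break skips the else clause
    else:
        result = None           # runs only if the loop was never broken
    return result

print(first_even([1, 3, 4, 5]))
print(first_even([1, 3, 5]))
```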
You can iterate using for over any container: list, tuple, dict, str, set among others…
# This is stupid
for i in range(len(colors)):
    print(colors[i])
# This is better
for color in colors:
    print(color)
red
blue
black
white
red
blue
black
white
To iterate over several sequences at the same time, use zip
red stephane
blue jaouad
black mokhtar
white yiyang
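A sketch of zip on the colors and peoples lists that appear in these notes. Note that zip stops at the end of the shortest sequence, so the fifth person has no color.

```python
colors = ['red', 'blue', 'black', 'white']
peoples = ['stephane', 'jaouad', 'mokhtar', 'yiyang', 'rémi']

for color, people in zip(colors, peoples):
    print(color, people)

pairs = list(zip(colors, peoples))   # zip is lazy; list() materializes it
```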
Bonjour 7
{'francis': 5214, 'stephane': 5123} 2
('truc', 3) 2
Loop over a str
Loop over a dict
dd = {(1, 3): {'hello', 'world'}, 'truc': [1, 2, 3], 5: (1, 4, 2)}
# Default is to loop over keys
for key in dd:
    print(key)
(1, 3)
truc
5
(1, 3) {'hello', 'world'}
truc [1, 2, 3]
5 (1, 4, 2)
You can construct a list, dict, set and others using the comprehension syntax
list comprehension
['red', 'blue', 'black', 'white']
['stephane', 'jaouad', 'mokhtar', 'yiyang', 'rémi']
# The list of people with favorite color that has no more than 4 characters
[people for color, people in zip(colors, peoples) if len(color) <= 4]
['stephane', 'jaouad']
dict comprehension
{'stephane': 'red', 'jaouad': 'blue'}
# Builds a dict from two lists (for keys and values)
{key: value for (key, value) in zip(peoples, colors)}
{'stephane': 'red', 'jaouad': 'blue', 'mokhtar': 'black', 'yiyang': 'white'}
{'stephane': 'red', 'jaouad': 'blue', 'mokhtar': 'black', 'yiyang': 'white'}
Something very convenient is enumerate
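enumerate yields (index, item) pairs, which avoids indexing by hand. A minimal sketch on the colors list; the start parameter shown at the end is optional.

```python
colors = ['red', 'blue', 'black', 'white']

for index, color in enumerate(colors):
    print(index, color)

# The counter can start from another value.
numbered = list(enumerate(colors, start=1))
```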
We can use lambda to define anonymous functions, and use them in the map and reduce functions
['__annotations__',
'__builtins__',
'__call__',
'__class__',
'__closure__',
'__code__',
'__defaults__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__get__',
'__getattribute__',
'__globals__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__kwdefaults__',
'__le__',
'__lt__',
'__module__',
'__name__',
'__ne__',
'__new__',
'__qualname__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__']
Intended for short and one-line function.
More complex functions use def (see below)
Print the squares of even numbers between 0 and 15
map
Remark. We will see later why we need to use list above
Now, to get the sum of these squares, we can use sum
We can also use reduce (not a good idea here, but it’s good to know that it exists)
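The steps above (squares of the even numbers between 0 and 15, then their sum) could look like the sketch below. The variable names are chosen for this example.

```python
from functools import reduce

evens = range(0, 16, 2)                          # 0, 2, ..., 14

# map is lazy, hence the call to list to see the values
squares = list(map(lambda x: x ** 2, evens))
print(squares)

total = sum(squares)                             # the simple way
total_reduce = reduce(lambda acc, x: acc + x, squares, 0)   # the same with reduce
print(total, total_reduce)
```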
There is also something that can be useful in functools called partial
It lets you simplify functions by freezing some arguments
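A minimal sketch of partial. The function power and the derived names square and cube are hypothetical; partial freezes the exponent argument.

```python
from functools import partial

def power(base, exponent):
    """Hypothetical two-argument function."""
    return base ** exponent

square = partial(power, exponent=2)   # exponent is frozen to 2
cube = partial(power, exponent=3)     # exponent is frozen to 3

print(square(4), cube(2))
```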
What is the output of
This does the following
import sys
import matplotlib.pyplot as plt
plt.figure(figsize=(6, 6))
plt.plot([sys.getsizeof(list(range(i))) for i in range(10000)], lw=3)
plt.plot([sys.getsizeof(range(i)) for i in range(10000)], lw=3)
plt.xlabel('Number of elements (value of i)', fontsize=14)
plt.ylabel('Size (in bytes)', fontsize=14)
_ = plt.legend(['list(range(i))', 'range(i)'], fontsize=16)
The memory used by range(i) does not grow with i, whereas the memory used by list(range(i)) grows roughly linearly
What is happening ?
range(n) does not allocate a list of n elements !
Many objects in the Python standard library behave like this
Warning. Getting the real memory footprint of a Python object is difficult. Note that sys.getsizeof calls the __sizeof__ method of the object, which in general does not give the actual memory used by an object. But never mind here.
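The same observation can be made without a plot. The sketch below assumes the CPython interpreter, where a range object stores only its start, stop and step.

```python
import sys

small = sys.getsizeof(range(10))
large = sys.getsizeof(range(10**9))       # no list of 10**9 integers is built
as_list = sys.getsizeof(list(range(1000)))

print(small, large, as_list)
```

On CPython the first two numbers are identical, while the list is much larger.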
The following computation has no memory footprint:
map does not return a list for the same reason
Namely generators defined through comprehensions. Just replace [] by () in the comprehension.
A generator can be iterated on only once
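A short sketch of a generator expression and of the fact that it is exhausted after one pass. The names are chosen for this example.

```python
squares = (i ** 2 for i in range(4))   # parentheses: nothing is computed yet

first_pass = list(squares)             # consumes the generator
second_pass = list(squares)            # nothing is left

print(first_pass)
print(second_pass)
```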
(0, 0)
(0, 1)
(0, 2)
(1, 0)
(1, 1)
(1, 2)
(2, 0)
(2, 1)
(2, 2)
(3, 0)
(3, 1)
(3, 2)
<generator object <genexpr> at 0x7f7561b827a0>
yield
Something very powerful
But also with a for loop
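A minimal generator function, consumed first with next and then with a loop. The function countdown is a hypothetical example, not taken from the notebook.

```python
def countdown(n):
    """Hypothetical generator: yields n, n-1, ..., 1."""
    while n > 0:
        yield n          # execution pauses here until the next value is requested
        n -= 1

gen = countdown(3)
first = next(gen)                 # advance by hand
rest = [value for value in gen]   # a loop picks up where next stopped

print(first, rest)
```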
collections module
(This is where the good stuff hides)
texte = """
Bonjour,
Python c'est super.
Python ca a l'air quand même un peu compliqué.
Mais bon, ca a l'air pratique.
Peut-être que je pourrais m'en servir pour faire des trucs super.
"""
texte
" \nBonjour,\nPython c'est super.\nPython ca a l'air quand même un peu compliqué.\nMais bon, ca a l'air pratique.\nPeut-être que je pourrais m'en servir pour faire des trucs super.\n"
Bonjour,
Python c'est super.
Python ca a l'air quand même un peu compliqué.
Mais bon, ca a l'air pratique.
Peut-être que je pourrais m'en servir pour faire des trucs super.
# Some basic text preprocessing
new_text = (
    texte
    .strip()
    .replace('\n', ' ')
    .replace(',', ' ')
    .replace('.', ' ')
    .replace("'", ' ')
)
print(new_text)
print('-' * 8)
words = new_text.split()
print(words)
Bonjour Python c est super Python ca a l air quand même un peu compliqué Mais bon ca a l air pratique Peut-être que je pourrais m en servir pour faire des trucs super
--------
['Bonjour', 'Python', 'c', 'est', 'super', 'Python', 'ca', 'a', 'l', 'air', 'quand', 'même', 'un', 'peu', 'compliqué', 'Mais', 'bon', 'ca', 'a', 'l', 'air', 'pratique', 'Peut-être', 'que', 'je', 'pourrais', 'm', 'en', 'servir', 'pour', 'faire', 'des', 'trucs', 'super']
Count the number of occurrences of all the words in words.
Output must be a dictionary containing word: count
['Bonjour', 'Python', 'c', 'est', 'super', 'Python', 'ca', 'a', 'l', 'air', 'quand', 'même', 'un', 'peu', 'compliqué', 'Mais', 'bon', 'ca', 'a', 'l', 'air', 'pratique', 'Peut-être', 'que', 'je', 'pourrais', 'm', 'en', 'servir', 'pour', 'faire', 'des', 'trucs', 'super']
words_counts = {}
for word in words:
    if word in words_counts:
        words_counts[word] += 1
    else:
        words_counts[word] = 1
print(words_counts)
{'Bonjour': 1, 'Python': 2, 'c': 1, 'est': 1, 'super': 2, 'ca': 2, 'a': 2, 'l': 2, 'air': 2, 'quand': 1, 'même': 1, 'un': 1, 'peu': 1, 'compliqué': 1, 'Mais': 1, 'bon': 1, 'pratique': 1, 'Peut-être': 1, 'que': 1, 'je': 1, 'pourrais': 1, 'm': 1, 'en': 1, 'servir': 1, 'pour': 1, 'faire': 1, 'des': 1, 'trucs': 1}
defaultdict
from collections import defaultdict
words_counts = defaultdict(int)
for word in words:
    words_counts[word] += 1
print(words_counts)
defaultdict(<class 'int'>, {'Bonjour': 1, 'Python': 2, 'c': 1, 'est': 1, 'super': 2, 'ca': 2, 'a': 2, 'l': 2, 'air': 2, 'quand': 1, 'même': 1, 'un': 1, 'peu': 1, 'compliqué': 1, 'Mais': 1, 'bon': 1, 'pratique': 1, 'Peut-être': 1, 'que': 1, 'je': 1, 'pourrais': 1, 'm': 1, 'en': 1, 'servir': 1, 'pour': 1, 'faire': 1, 'des': 1, 'trucs': 1})
defaultdict can be extremely useful
An int is created (defaults to 0) if the key is not found
Counter
{'Bonjour': 1, 'Python': 2, 'c': 1, 'est': 1, 'super': 2, 'ca': 2, 'a': 2, 'l': 2, 'air': 2, 'quand': 1, 'même': 1, 'un': 1, 'peu': 1, 'compliqué': 1, 'Mais': 1, 'bon': 1, 'pratique': 1, 'Peut-être': 1, 'que': 1, 'je': 1, 'pourrais': 1, 'm': 1, 'en': 1, 'servir': 1, 'pour': 1, 'faire': 1, 'des': 1, 'trucs': 1}
Counter counts the number of occurrences of all objects in an iterable
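A minimal sketch of Counter on a short, made-up list of words (not the full words list built above).

```python
from collections import Counter

sample_words = ['Python', 'super', 'Python', 'ca', 'super', 'bon']
counts = Counter(sample_words)

print(dict(counts))
print(counts['missing'])          # a missing key counts as 0, no KeyError

top_two = counts.most_common(2)   # the two most frequent words
```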
Question. Which one do you prefer ?
The Counter one, right ?
When you need to do something, assume that there is a tool to do it directly
If you can’t find it, ask Google or Stack Overflow
Otherwise, try to do it as simply as possible
Compute the number of occurrences AND the length of each word in words.
Output must be a dictionary containing word: (count, length)
from collections import Counter
{word: (count, len(word)) for word, count in Counter(words).items()}
{'Bonjour': (1, 7),
'Python': (2, 6),
'c': (1, 1),
'est': (1, 3),
'super': (2, 5),
'ca': (2, 2),
'a': (2, 1),
'l': (2, 1),
'air': (2, 3),
'quand': (1, 5),
'même': (1, 4),
'un': (1, 2),
'peu': (1, 3),
'compliqué': (1, 9),
'Mais': (1, 4),
'bon': (1, 3),
'pratique': (1, 8),
'Peut-être': (1, 9),
'que': (1, 3),
'je': (1, 2),
'pourrais': (1, 8),
'm': (1, 1),
'en': (1, 2),
'servir': (1, 6),
'pour': (1, 4),
'faire': (1, 5),
'des': (1, 3),
'trucs': (1, 5)}
namedtuple
There is also the namedtuple. It’s a tuple but with named attributes
from collections import namedtuple
Jedi = namedtuple('Jedi', ['firstname', 'lastname', 'age', 'color'])
yoda = Jedi('Minch', 'Yoda', 900, 'green')
yoda
Jedi(firstname='Minch', lastname='Yoda', age=900, color='green')
Remark. A better alternative since Python 3.7 is dataclasses. We will talk about it later
Next, put a text file miserables.txt in the folder containing this notebook. If it is not there, the next cell downloads it; if it is there, we do nothing.
import requests
import os
# The path containing your notebook
path_data = './'
# The name of the file
filename = 'miserables.txt'
if os.path.exists(os.path.join(path_data, filename)):
    print('The file %s already exists.' % os.path.join(path_data, filename))
else:
    url = 'https://stephanegaiffas.github.io/big_data_course/data/miserables.txt'
    r = requests.get(url)
    with open(os.path.join(path_data, filename), 'wb') as f:
        f.write(r.content)
    print('Downloaded file %s.' % os.path.join(path_data, filename))
Downloaded file ./miserables.txt.
total 3,5M
drwxrwxr-x 4 boucheron boucheron 4,0K janv. 30 13:38 ./
drwxrwxr-x 8 boucheron boucheron 4,0K janv. 30 13:33 ../
-rw-rw-r-- 1 boucheron boucheron 163 janv. 30 13:33 fruits.csv
drwxrwxr-x 2 boucheron boucheron 4,0K janv. 29 21:33 img/
drwxrwxr-x 2 boucheron boucheron 4,0K janv. 29 23:35 .ipynb_checkpoints/
-rw-rw-r-- 1 boucheron boucheron 3,1M janv. 30 13:38 miserables.txt
-rw-rw-r-- 1 boucheron boucheron 44 janv. 30 11:51 miserable_word_counts.pkl
-rw-rw-r-- 1 boucheron boucheron 162K janv. 30 13:38 notebook01_python.ipynb
-rw-rw-r-- 1 boucheron boucheron 70K janv. 29 20:23 notebook01_python.qmd
-rw-rw-r-- 1 boucheron boucheron 61K janv. 30 08:29 notebook02_numpy.ipynb
-rw-rw-r-- 1 boucheron boucheron 28K janv. 29 17:24 notebook02_numpy.qmd
-rw-rw-r-- 1 boucheron boucheron 32K janv. 30 13:32 notebook03_pandas.qmd
-rw-rw-r-- 1 boucheron boucheron 24K janv. 26 23:12 notebook04_pandas.qmd
-rw-rw-r-- 1 boucheron boucheron 7,8K janv. 29 22:13 tips.csv
In jupyter and ipython you can run terminal command lines using !
Let’s count number of lines and number of words with the wc command-line tool (linux or mac only, don’t ask me how on windows)
Count the number of occurrences of each word in the text file miserables.txt. We use an open context and the Counter from before.
from collections import Counter
counter = Counter()
with open('miserables.txt', encoding='utf8') as f:
    for line_idx, line in enumerate(f):
        line = line.strip().replace('\n', ' ')\
            .replace(',', ' ')\
            .replace('.', ' ')\
            .replace('»', ' ')\
            .replace('-', ' ')\
            .replace('!', ' ')\
            .replace('(', ' ')\
            .replace(')', ' ')\
            .replace('?', ' ').split()
        counter.update(line)
FileNotFoundError: [Errno 2] No such file or directory: 'miserables.txt'
A context in Python is something that we use with the with keyword.
Here it automatically handles the opening and the closing of the file.
Note the for loop:
You loop directly over the lines of the open file from within the open context
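The behaviour of the open context can be checked on a small temporary file. This is a sketch: the file name sample.txt and its content are made up, and tempfile is used so that nothing is left on disk.

```python
import os
import tempfile
from collections import Counter

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")      # hypothetical file

    with open(path, "w", encoding="utf8") as f:
        f.write("one two\ntwo three\n")

    counter = Counter()
    with open(path, encoding="utf8") as f:
        for line in f:                  # loop directly over the lines
            counter.update(line.split())

    closed_after_block = f.closed       # the file is closed on leaving the block

print(counter)
print(closed_after_block)
```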
pickle
You can save your computation with pickle.
pickle is a way of saving almost anything with Python.
You must use functions to organize and reuse code
Function blocks must be indented as other control-flow blocks.
Functions can optionally return values. By default, functions return None.
The syntax to define a function:
the def keyword;
return object for optionally returning values.
A function that returns several elements returns a tuple
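The rules above can be sketched with two small functions. min_max and say_hello are hypothetical names chosen for this illustration.

```python
def min_max(values):
    """Return the smallest and the largest element."""
    return min(values), max(values)     # several values are packed in a tuple

result = min_max([3, 1, 4])
low, high = min_max([3, 1, 4])          # the tuple can be unpacked directly

def say_hello(name):
    print(f"Hello {name}")              # no return statement

nothing = say_hello("anna")             # a function returns None by default
```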
Mandatory parameters (positional arguments)
Optional parameters
You can do stuff like this, using unpacking * notation
Back to function f you can unpack a tuple as positional arguments
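A sketch of mandatory and optional parameters, and of unpacking with * and **. The function describe and its parameters are hypothetical; they stand in for the function f of the notebook, whose definition is not shown here.

```python
def describe(name, age, city="Paris"):
    """name and age are mandatory, city is optional (default value)."""
    return f"{name}, {age}, {city}"

# Unpacking on the left-hand side of an assignment
head, *middle, tail = [1, 2, 3, 4, 5]

# Unpacking a tuple as positional arguments and a dict as keyword arguments
args = ("anna", 37)
options = {"city": "Lyon"}

with_default = describe(*args)
with_options = describe(*args, **options)

print(head, middle, tail)
print(with_default)
print(with_options)
```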
def f(*args, **kwargs):
    print('args=', args)
    print('kwargs=', kwargs)
f(1, 2, 'truc', lastname='gaiffas', firstname='stephane')
args= (1, 2, 'truc')
kwargs= {'lastname': 'gaiffas', 'firstname': 'stephane'}
* for argument unpacking and ** for keyword argument unpacking
args and kwargs are a convention, not mandatory
# How to get fired
def f(*aaa, **bbb):
    print('args=', aaa)
    print('kwargs=', bbb)
f(1, 2, 'truc', lastname='gaiffas', firstname='stephane')
args= (1, 2, 'truc')
kwargs= {'lastname': 'gaiffas', 'firstname': 'stephane'}
Remark. A function is a regular object… you can add attributes to it !
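A tiny sketch of this remark. The function greet and its calls attribute are made up for the example.

```python
def greet():
    return "hello"

greet.calls = 0        # attach an attribute to the function object
greet()
greet.calls += 1       # the attribute is ordinary state, updated by hand here

print(type(greet), greet.calls)
```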
Python supports object-oriented programming (OOP). The goals of OOP are:
Here is a small example: we create a Student class, which is an object gathering several custom functions (called methods) and variables (called attributes).
class Student(object):
    def __init__(self, name, birthyear, major='computer science'):
        self.name = name
        self.birthyear = birthyear
        self.major = major
    def __repr__(self):
        return "Student(name='{name}', birthyear={birthyear}, major='{major}')"\
            .format(name=self.name, birthyear=self.birthyear, major=self.major)
anna = Student('anna', 1987)
anna
Student(name='anna', birthyear=1987, major='computer science')
The __repr__ is what we call a ‘magic method’ in Python; it makes it easy to display an object as a string. There is a very large number of such magic methods. They are used to implement interfaces
Add an age method to the Student class that computes the age of the student.
- You can (and should) use the datetime module.
- Since we only know about the birth year, let’s assume that the day of birth is January 1st.
from datetime import datetime
class Student(object):
    def __init__(self, name, birthyear, major='computer science'):
        self.name = name
        self.birthyear = birthyear
        self.major = major
    def __repr__(self):
        return "Student(name='{name}', birthyear={birthyear}, major='{major}')"\
            .format(name=self.name, birthyear=self.birthyear, major=self.major)
    def age(self):
        return datetime.now().year - self.birthyear
anna = Student('anna', 1987)
anna.age()
37
We can make methods look like attributes using properties, as shown below
class Student(object):
    def __init__(self, name, birthyear, major='computer science'):
        self.name = name
        self.birthyear = birthyear
        self.major = major
    def __repr__(self):
        return "Student(name='{name}', birthyear={birthyear}, major='{major}')"\
            .format(name=self.name, birthyear=self.birthyear, major=self.major)
    @property
    def age(self):
        return datetime.now().year - self.birthyear
anna = Student('anna', 1987)
anna.age
37
A MasterStudent is a Student with a new extra mandatory internship attribute
class MasterStudent(Student):
    def __init__(self, name, birthyear, internship, major='computer science'):
        # Reuse the parent initializer for the shared attributes
        Student.__init__(self, name, birthyear, major)
        self.internship = internship

    def __repr__(self):
        return f"MasterStudent(name='{self.name}', internship={self.internship}, birthyear={self.birthyear}, major={self.major})"

MasterStudent('djalil', 1996, 'pwc')
MasterStudent(name='djalil', internship=pwc, birthyear=1996, major=computer science)
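An alternative way to call the parent initializer is super(), which avoids naming the parent class explicitly. A minimal sketch, assuming a stripped-down Student without __repr__ so that the block is self-contained:

```python
class Student:
    def __init__(self, name, birthyear, major='computer science'):
        self.name = name
        self.birthyear = birthyear
        self.major = major


class MasterStudent(Student):
    def __init__(self, name, birthyear, internship, major='computer science'):
        # super() resolves to the next class in the MRO, here Student
        super().__init__(name, birthyear, major)
        self.internship = internship


djalil = MasterStudent('djalil', 1996, 'pwc')
print(isinstance(djalil, Student))  # True: a MasterStudent is a Student
print(djalil.major, djalil.internship)
```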
class MasterStudent(Student):
    def __init__(self, name, birthyear, internship, major='computer science'):
        # Reuse the parent initializer for the shared attributes
        Student.__init__(self, name, birthyear, major)
        self.internship = internship

    def __repr__(self):
        return "MasterStudent(name='{name}', internship='{internship}'" \
               ", birthyear={birthyear}, major='{major}')"\
            .format(name=self.name, internship=self.internship,
                    birthyear=self.birthyear, major=self.major)

djalil = MasterStudent('djalil', 1996, 'pwc')
djalil.__dict__
{'name': 'djalil',
 'birthyear': 1996,
 'major': 'computer science',
 'internship': 'pwc'}
Python objects are actually dicts under the hood: the attributes of an instance are stored in its __dict__.
Since Python 3.7 you can use a dataclass for this kind of class.
It does a lot of work for you (it generates __init__ and __repr__, among many other things).
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Student(object):
    name: str
    birthyear: int
    major: str = 'computer science'

    @property
    def age(self):
        return datetime.now().year - self.birthyear

anna = Student(name="anna", birthyear=1987)
anna
Student(name='anna', birthyear=1987, major='computer science')
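Beyond __repr__, a dataclass also generates __eq__ and offers helpers such as field and asdict. The sketch below shows these; the courses attribute is a hypothetical addition used only to illustrate default_factory for mutable defaults.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Student:
    name: str
    birthyear: int
    major: str = 'computer science'
    # Hypothetical extra attribute: default_factory builds a fresh list
    # for every instance instead of sharing one default list
    courses: list = field(default_factory=list)


anna = Student('anna', 1987)
twin = Student('anna', 1987)

print(anna == twin)            # True: generated __eq__ compares field by field

anna.courses.append('statistics')
print(twin.courses)            # []: each instance has its own list
print(asdict(anna))
```

Writing courses: list = [] directly is rejected by dataclass with a ValueError, precisely to avoid a default list shared between instances.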
First, the best way to learn and practice Python:
Start with the official tutorial https://docs.python.org/fr/3/tutorial/index.html
Look at https://python-3-for-scientists.readthedocs.io/en/latest/index.html
Continue with the documentation at https://docs.python.org/fr/3/index.html and work!
def foo(bar=[]):
    bar.append('oops')
    return bar

print(foo())
print(foo())
print(foo())
print('-' * 8)
print(foo(['Ah ah']))
print(foo([]))
['oops']
['oops', 'oops']
['oops', 'oops', 'oops']
--------
['Ah ah', 'oops']
['oops']
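The default list is stored on the function object itself, in foo.__defaults__. Displaying it before and after one more call to foo() gives
(['oops', 'oops', 'oops'],)
(['oops', 'oops', 'oops', 'oops'],)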
- The bar argument is initialized to its default (i.e., an empty list) only once, when foo() is defined
- Calls to foo() (with no bar argument specified) all use the same list!
One should use instead
def foo(bar=None):
    if bar is None:
        bar = []
    bar.append('oops')
    return bar

print(foo())
print(foo())
print(foo())
print(foo(['OK']))
['oops']
['oops']
['oops']
['OK', 'oops']
No problem with immutable types
('oops',)
('oops',)
('oops',)
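The code that produced this output is not shown here. A minimal sketch that behaves this way, assuming a tuple is used as the default value, could look like:

```python
def foo(bar=()):
    # `+` builds a new tuple; the default `()` itself is never modified
    bar = bar + ('oops',)
    return bar


print(foo())
print(foo())
print(foo())
print(foo.__defaults__)  # the stored default is still the empty tuple
```

Since a tuple cannot be changed in place, every call starts again from the same empty default.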
- x is not an attribute of b nor c
- It is found through their classes B and C, which inherit from A, which contains a class attribute x
- Classes and objects contain a hidden dict to store their attributes, which are looked up following a method resolution order (MRO)
(mappingproxy({'__module__': '__main__',
'x': 4,
'__init__': <function __main__.A.__init__(self)>,
'__dict__': <attribute '__dict__' of 'A' objects>,
'__weakref__': <attribute '__weakref__' of 'A' objects>,
'__doc__': None}),
mappingproxy({'__module__': '__main__',
'__init__': <function __main__.B.__init__(self)>,
'__doc__': None}),
mappingproxy({'__module__': '__main__',
'__init__': <function __main__.C.__init__(self)>,
'__doc__': None}))
This can lead to nasty errors when using class attributes: learn more about this
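A minimal sketch of the kind of surprise this can cause, assuming simplified versions of A, B and C (no __init__, unlike the classes whose dicts are displayed above): changing the class attribute affects every instance that has not shadowed it, while assigning through an instance silently creates a new instance attribute.

```python
class A:
    x = 4  # class attribute, stored in A.__dict__


class B(A):
    pass


class C(A):
    pass


b, c = B(), C()
print(b.x, c.x)          # 4 4: both lookups end up in A

A.x = 5                  # changing the class attribute...
print(b.x, c.x)          # 5 5: ...is seen by every instance

b.x = 1                  # assignment creates an attribute on b only
print(b.x, c.x, A.x)     # 1 5 5
print(vars(b), vars(c))  # {'x': 1} {}

# Lookup order for instances of B: the instance dict, then each class in the MRO
print([klass.__name__ for klass in B.__mro__])  # ['B', 'A', 'object']
```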
means
which is an assignment: ints must be defined in the local scope, but it is not, while
is not an assignment
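The snippets these sentences refer to are not reproduced here. The sketch below assumes the classic situation they describe: a global list named ints that is modified inside functions, once with a method call and once with an augmented assignment. The function names are hypothetical.

```python
ints = [1, 2, 3]


def append_in_place():
    # A method call, not an assignment: `ints` is looked up in the global scope
    ints.append(4)


def augmented_assign():
    # `ints += [5]` means `ints = ints + [5]` as far as scoping is concerned:
    # the assignment makes `ints` local, so reading it first fails
    ints += [5]


append_in_place()
print(ints)  # [1, 2, 3, 4]

caught = None
try:
    augmented_assign()
except UnboundLocalError as error:
    caught = type(error).__name__
    print(caught)
```

Declaring global ints at the top of augmented_assign would make the augmented assignment work, although passing the list as an argument is usually cleaner.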
Modifying a list while iterating over it
odd = lambda x: bool(x % 2)
numbers = list(range(10))
for i in range(len(numbers)):
    # The list shrinks after each deletion, but range(len(numbers)) was computed once
    if odd(numbers[i]):
        del numbers[i]
IndexError: list index out of range
Typically an example where one should use a list comprehension
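A minimal sketch of the list-comprehension version, which builds a new list instead of deleting elements from the one being traversed (odd is rewritten as a regular function here, a stylistic choice rather than a requirement):

```python
def odd(x):
    return bool(x % 2)


numbers = list(range(10))

# Keep the even numbers; the original list is never modified during iteration
numbers = [n for n in numbers if not odd(n)]
print(numbers)  # [0, 2, 4, 6, 8]
```

If other variables refer to the same list and must see the change, numbers[:] = [...] replaces the content in place instead of rebinding the name.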
Take the time to write clean docstrings (my favourite is the numpydoc style)
def create_student(name, age, address, major='computer science'):
    """Add a student in the database

    Parameters
    ----------
    name: `str`
        Name of the student

    age: `int`
        Age of the student

    address: `str`
        Address of the student

    major: `str`, default='computer science'
        The major chosen by the student

    Returns
    -------
    output: `Student`
        A fresh student
    """
    pass
While it’s always better than a hand-made solution