Pandas
January 30, 2024
The Pandas library (https://Pandas.pydata.org) is one of the most popular tools for pythonistas dealing with data science. It provides the DataFrame object (a table of data) with a huge set of functionalities. Through Pandas, you get acquainted with your data by analyzing them:

- Does column A correlate with column B?
- What does the distribution of column C look like?

But first, you get acquainted with your data by cleaning and transforming them.
Pandas also allows you to visualize data with plotting libraries (matplotlib, seaborn, plotly, …).

Pandas is a central component of the python “stack” for data science:

- Pandas is built on top of numpy
- A DataFrame is often fed to plotting functions or machine learning algorithms (such as those provided by scikit-learn)
- It integrates nicely with jupyter, leading to a nice interactive environment for data exploration and modeling

The two primary components of Pandas are Series and DataFrame.
A Series is essentially a column. A DataFrame is a two-dimensional table made up of a collection of Series of equal length.
According to Wes McKinney (the designer of Pandas), Pandas dataframes were inspired by the dataframes offered by the R language for statistical computing. The latter parallel the tables that are at the heart of Relational Database Management Systems.
Creating a DataFrame from scratch

import pandas as pd
fruits = {
"apples": [3, 2, 0, 1],
"oranges": [0, 3, 7, 2]
}
df_fruits = pd.DataFrame(fruits)
df_fruits
apples | oranges | |
---|---|---|
0 | 3 | 0 |
1 | 2 | 3 |
2 | 0 | 7 |
3 | 1 | 2 |
Dataframes can be created in many different ways. Here we used a dictionary with values made of sequences of equal length. The dictionary keys serve as column names.
Common patterns of creation use DataFrame() from Pandas and call it with a list of dictionaries (each dictionary represents a row of the dataframe under construction) or with a dictionary of lists/containers (as above).
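As a minimal sketch of the first pattern (the row values here are illustrative):

```python
import pandas as pd

# Each dictionary is one row; keys become the column names
rows = [
    {"apples": 3, "oranges": 0},
    {"apples": 2, "oranges": 3},
]
df_rows = pd.DataFrame(rows)
```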
The notion of tidy dataframe was popularized by Hadley Wickham. It embodies a collection of good practices that every data scientist should adopt.
A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.
Your Pandas dataframes should abide by these principles.
tutorials = {
"HS118": [
"Getting Started With JupyterLab",
"Introduction to Python for Scientific Programming",
"Introduction to NumPy",
"Introduction to pandas",
],
"HS120": [
"Increase Citations, Ease Review and Collaboration",
"Time Series Forecasting With scikit-learn",
"Evaluating Your Machine Learning Models",
"Introduction to Audio and Speech Recognition",
],
}
schedule = pd.DataFrame(tutorials, index=["08:30", "10:30", "13:30", "15:30"])
schedule
HS118 | HS120 | |
---|---|---|
08:30 | Getting Started With JupyterLab | Increase Citations, Ease Review and Collaboration |
10:30 | Introduction to Python for Scientific Programming | Time Series Forecasting With scikit-learn |
13:30 | Introduction to NumPy | Evaluating Your Machine Learning Models |
15:30 | Introduction to pandas | Introduction to Audio and Speech Recognition |
Is it tidy? Why?
Pivot to long form using the melt method. What happens if we modify the optional argument ignore_index?
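A minimal sketch of the melt call (shortened to two time slots for brevity; ignore_index=False keeps the time slots as the index, whereas the default ignore_index=True replaces them with a fresh range index):

```python
import pandas as pd

schedule = pd.DataFrame(
    {
        "HS118": ["Getting Started With JupyterLab", "Introduction to NumPy"],
        "HS120": ["Time Series Forecasting With scikit-learn",
                  "Evaluating Your Machine Learning Models"],
    },
    index=["08:30", "10:30"],
)
# Pivot to long form: one row per (time, room) pair
molten_schedule = schedule.melt(ignore_index=False)
```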
(
molten_schedule
.reset_index()
.rename(columns={
"index": "time",
"variable": "room",
"value": "title"})
.sort_values(by=["time", "room"])
.reset_index(drop=True)
)
time | room | title | |
---|---|---|---|
0 | 08:30 | HS118 | Getting Started With JupyterLab |
1 | 08:30 | HS120 | Increase Citations, Ease Review and Collaboration |
2 | 10:30 | HS118 | Introduction to Python for Scientific Programming |
3 | 10:30 | HS120 | Time Series Forecasting With scikit-learn |
4 | 13:30 | HS118 | Introduction to NumPy |
5 | 13:30 | HS120 | Evaluating Your Machine Learning Models |
6 | 15:30 | HS118 | Introduction to pandas |
7 | 15:30 | HS120 | Introduction to Audio and Speech Recognition |
At the end of the day, dataframes have to be queried and updated, just like tables in an RDBMS.
Every Pandas dataframe is equipped with an Index.
By default, a DataFrame uses a contiguous range index, but a custom index can be supplied, as below.

apples | oranges | |
---|---|---|
Daniel | 3 | 0 |
Sean | 2 | 3 |
Pierce | 0 | 7 |
Roger | 1 | 2 |
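The named index above can be supplied at construction time; a sketch:

```python
import pandas as pd

fruits = {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]}
# Pass an explicit index instead of the default RangeIndex
df_fruits = pd.DataFrame(fruits, index=["Daniel", "Sean", "Pierce", "Roger"])
```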
Is the index a column? How is it implemented?
.loc versus .iloc

- .loc locates by name
- .iloc locates by numerical index (position)

Daniel 0
Sean 3
Pierce 7
Roger 2
Name: oranges, dtype: int64
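The Series above can be obtained either way; a sketch contrasting the two accessors:

```python
import pandas as pd

fruits = {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]}
df_fruits = pd.DataFrame(fruits, index=["Daniel", "Sean", "Pierce", "Roger"])

by_name = df_fruits.loc[:, "oranges"]  # select the column by its label
by_position = df_fruits.iloc[:, 1]     # select the same column by position
```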
A DataFrame has many attributes and many methods. For instance, info() prints a concise summary:
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, Daniel to Roger
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 apples 4 non-null int64
1 oranges 4 non-null int64
dtypes: int64(2)
memory usage: 268.0+ bytes
What if we don’t know how many apples are in Sean’s basket?
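A missing value can be recorded as NaN; a sketch (casting to float first, since NaN is a float):

```python
import numpy as np
import pandas as pd

fruits = {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]}
df_fruits = pd.DataFrame(fruits, index=["Daniel", "Sean", "Pierce", "Roger"])

# NaN is a float, so the apples column must hold floats
df_fruits["apples"] = df_fruits["apples"].astype("float64")
df_fruits.loc["Sean", "apples"] = np.nan
```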
apples | oranges | |
---|---|---|
Daniel | 3.0 | 0 |
Sean | NaN | 3 |
Pierce | 0.0 | 7 |
Roger | 1.0 | 2 |
apples | oranges | |
---|---|---|
count | 3.000000 | 4.00000 |
mean | 1.333333 | 3.00000 |
std | 1.527525 | 2.94392 |
min | 0.000000 | 0.00000 |
25% | 0.500000 | 1.50000 |
50% | 1.000000 | 2.50000 |
75% | 2.000000 | 4.00000 |
max | 3.000000 | 7.00000 |
Note that count is now 3 for apples, since 1 of the 4 values is missing.
Ooooops, we forgot about the bananas!
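The bananas column can be appended like this (values read off the table below); a sketch:

```python
import numpy as np
import pandas as pd

df_fruits = pd.DataFrame(
    {"apples": [3.0, np.nan, 0.0, 1.0], "oranges": [0, 3, 7, 2]},
    index=["Daniel", "Sean", "Pierce", "Roger"],
)
# Assigning a list of the right length creates a new column
df_fruits["bananas"] = [0, 2, 1, 6]
```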
And we forgot the dates!
df_fruits['time'] = [
"2020/10/08 12:13", "2020/10/07 11:37",
"2020/10/10 14:07", "2020/10/09 10:51"
]
df_fruits
apples | oranges | bananas | time | |
---|---|---|---|---|
Daniel | 3.0 | 0 | 0 | 2020/10/08 12:13 |
Sean | NaN | 3 | 2 | 2020/10/07 11:37 |
Pierce | 0.0 | 7 | 1 | 2020/10/10 14:07 |
Roger | 1.0 | 2 | 6 | 2020/10/09 10:51 |
It’s not a date but a string (str)! So we convert this column to the datetime type.
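The conversion goes through pd.to_datetime; a sketch (only the time column shown):

```python
import pandas as pd

df_fruits = pd.DataFrame(
    {"time": ["2020/10/08 12:13", "2020/10/07 11:37",
              "2020/10/10 14:07", "2020/10/09 10:51"]},
    index=["Daniel", "Sean", "Pierce", "Roger"],
)
# Parse the strings into datetime64 values
df_fruits["time"] = pd.to_datetime(df_fruits["time"], format="%Y/%m/%d %H:%M")
```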
apples | oranges | bananas | time | |
---|---|---|---|---|
Daniel | 3.0 | 0 | 0 | 2020-10-08 12:13:00 |
Sean | NaN | 3 | 2 | 2020-10-07 11:37:00 |
Pierce | 0.0 | 7 | 1 | 2020-10-10 14:07:00 |
Roger | 1.0 | 2 | 6 | 2020-10-09 10:51:00 |
What if we want to keep only the baskets after (and including) October 9th?
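A boolean mask on the datetime column does the job; a sketch:

```python
import pandas as pd

df_fruits = pd.DataFrame(
    {
        "bananas": [0, 2, 1, 6],
        "time": pd.to_datetime([
            "2020-10-08 12:13", "2020-10-07 11:37",
            "2020-10-10 14:07", "2020-10-09 10:51",
        ]),
    },
    index=["Daniel", "Sean", "Pierce", "Roger"],
)
# Comparison against a date string is broadcast over the column
kept = df_fruits[df_fruits["time"] >= "2020-10-09"]
```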
apples | oranges | bananas | time | |
---|---|---|---|---|
Pierce | 0.0 | 7 | 1 | 2020-10-10 14:07:00 |
Roger | 1.0 | 2 | 6 | 2020-10-09 10:51:00 |
Have a look at the documentation of the datetime
package
apples | oranges | bananas | time | |
---|---|---|---|---|
Daniel | 3.0 | 0 | 0 | 2020-10-08 12:13:00 |
Sean | NaN | 3 | 2 | 2020-10-07 11:37:00 |
Pierce | 0.0 | 7 | 1 | 2020-10-10 14:07:00 |
Roger | 1.0 | 2 | 6 | 2020-10-09 10:51:00 |
oranges | bananas | time | |
---|---|---|---|
Daniel | 0 | 0 | 2020-10-08 12:13:00 |
Sean | 3 | 2 | 2020-10-07 11:37:00 |
Pierce | 7 | 1 | 2020-10-10 14:07:00 |
Roger | 2 | 6 | 2020-10-09 10:51:00 |
apples | oranges | bananas | |
---|---|---|---|
Daniel | 3.0 | 0 | 0 |
Sean | NaN | 3 | 2 |
What if we want to write the dataframe to a file?
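Writing goes through to_csv; a sketch on a two-row frame (the index is saved as the first, unnamed column):

```python
import pandas as pd

df_fruits = pd.DataFrame(
    {"apples": [3.0, None], "oranges": [0, 3]},
    index=["Daniel", "Sean"],
)
# Missing values are written as empty fields
df_fruits.to_csv("fruits.csv")
```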
apples | oranges | bananas | time | |
---|---|---|---|---|
Daniel | 3.0 | 0 | 0 | 2020-10-08 12:13:00 |
Sean | NaN | 3 | 2 | 2020-10-07 11:37:00 |
Pierce | 0.0 | 7 | 1 | 2020-10-10 14:07:00 |
Roger | 1.0 | 2 | 6 | 2020-10-09 10:51:00 |
-rw-rw-r-- 1 boucheron boucheron 163 janv. 30 13:33 fruits.csv
,apples,oranges,bananas,time
Daniel,3.0,0,0,2020-10-08 12:13:00
Sean,,3,2,2020-10-07 11:37:00
Pierce,0.0,7,1,2020-10-10 14:07:00
Roger,1.0,2,6,2020-10-09 10:51:00
CSV files are not the end of history: csv files have no built-in compression mechanism. The Apache Arrow project offers a variety of file formats to remedy this set of drawbacks.
The DataFrame
class is equipped with methods for saving to many formats.
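The list below can be produced by introspection; a sketch:

```python
import pandas as pd

# Saving methods on DataFrame all start with the "to_" prefix
savers = [name for name in dir(pd.DataFrame) if name.startswith("to_")]
```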
['to_clipboard',
'to_csv',
'to_dict',
'to_excel',
'to_feather',
'to_gbq',
'to_hdf',
'to_html',
'to_json',
'to_latex',
'to_markdown',
'to_numpy',
'to_orc',
'to_parquet',
'to_period',
'to_pickle',
'to_records',
'to_sql',
'to_stata',
'to_string',
'to_timestamp',
'to_xarray',
'to_xml']
Let’s read the file tips.csv. It is a CSV file (Comma Separated Values). It contains data about a restaurant: the bill, the tip and some information about the customers.
fruits.csv notebook02_numpy.ipynb notebook04_pandas.qmd
img notebook02_numpy.qmd tips.csv
miserable_word_counts.pkl notebook03_pandas.ipynb
notebook01_python.qmd notebook03_pandas.qmd
import requests
import os

# The path containing your notebook
path_data = './'
# The name of the file
filename = 'tips.csv'

if os.path.exists(os.path.join(path_data, filename)):
    print('The file %s already exists.' % os.path.join(path_data, filename))
else:
    url = 'http://stephane-v-boucheron.fr/data/tips.csv'
    r = requests.get(url)
    with open(os.path.join(path_data, filename), 'wb') as f:
        f.write(r.content)
    print(f'Downloaded file {os.path.join(path_data, filename)}.')
The file ./tips.csv already exists.
-rw-rw-r-- 1 boucheron boucheron 163 janv. 30 13:33 fruits.csv
-rw-rw-r-- 1 boucheron boucheron 7943 janv. 29 22:13 tips.csv
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
5 | 25.29 | 4.71 | Male | No | Sun | Dinner | 4 |
6 | 8.77 | 2.00 | Male | No | Sun | Dinner | 2 |
7 | 26.88 | 3.12 | Male | No | Sun | Dinner | 4 |
8 | 15.04 | 1.96 | Male | No | Sun | Dinner | 2 |
9 | 14.78 | 3.23 | Male | No | Sun | Dinner | 2 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 total_bill 244 non-null float64
1 tip 244 non-null float64
2 sex 244 non-null object
3 smoker 244 non-null object
4 day 244 non-null object
5 time 244 non-null object
6 size 244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB
By default, non-numerical columns contain strings (str type).

The category type

An important type in Pandas is category, for variables that are non-numerical.

Pro tip. It’s always a good idea to tell Pandas which columns should be imported as categorical. So, let’s read the file again, passing some dtypes to the read_csv function.
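A minimal sketch of the idea on an inline two-row CSV (the notebook passes the same kind of dtype mapping when reading tips.csv):

```python
import io
import pandas as pd

csv_text = "total_bill,tip,sex,smoker\n16.99,1.01,Female,No\n10.34,1.66,Male,No\n"
dtypes = {"sex": "category", "smoker": "category"}
# read_csv accepts a column -> dtype mapping; file-like objects work like paths
mini = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
```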
total_bill float64
tip float64
sex category
smoker category
day category
time category
size int64
dtype: object
Have a look at the Pandas documentation on categorical data.
total_bill | tip | size | |
---|---|---|---|
count | 244.000000 | 244.000000 | 244.000000 |
mean | 19.785943 | 2.998279 | 2.569672 |
std | 8.902412 | 1.383638 | 0.951100 |
min | 3.070000 | 1.000000 | 1.000000 |
25% | 13.347500 | 2.000000 | 2.000000 |
50% | 17.795000 | 2.900000 | 2.000000 |
75% | 24.127500 | 3.562500 | 3.000000 |
max | 50.810000 | 10.000000 | 6.000000 |
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
count | 244.000000 | 244.000000 | 244 | 244 | 244 | 244 | 244.000000 |
unique | NaN | NaN | 2 | 2 | 4 | 2 | NaN |
top | NaN | NaN | Male | No | Sat | Dinner | NaN |
freq | NaN | NaN | 157 | 151 | 87 | 176 | NaN |
mean | 19.785943 | 2.998279 | NaN | NaN | NaN | NaN | 2.569672 |
std | 8.902412 | 1.383638 | NaN | NaN | NaN | NaN | 0.951100 |
min | 3.070000 | 1.000000 | NaN | NaN | NaN | NaN | 1.000000 |
25% | 13.347500 | 2.000000 | NaN | NaN | NaN | NaN | 2.000000 |
50% | 17.795000 | 2.900000 | NaN | NaN | NaN | NaN | 2.000000 |
75% | 24.127500 | 3.562500 | NaN | NaN | NaN | NaN | 3.000000 |
max | 50.810000 | 10.000000 | NaN | NaN | NaN | NaN | 6.000000 |
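The correlation matrix below comes from .corr(); a minimal sketch on toy data (values are illustrative, not those of tips.csv):

```python
import pandas as pd

df_toy = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip": [1.01, 1.66, 3.50, 3.31],
    "size": [2, 3, 3, 2],
})
# Pairwise Pearson correlations between the numeric columns
corr = df_toy[["total_bill", "tip", "size"]].corr()
```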
total_bill | tip | size | |
---|---|---|---|
total_bill | 1.000000 | 0.675734 | 0.598315 |
tip | 0.675734 | 1.000000 | 0.489299 |
size | 0.598315 | 0.489299 | 1.000000 |
matplotlib and seaborn

Let’s show how we can use matplotlib and seaborn to visualize data contained in a Pandas dataframe. Let us first read again the tips.csv file.
import pandas as pd
dtypes = {
"sex": "category",
"smoker": "category",
"day": "category",
"time": "category"
}
df = pd.read_csv("tips.csv", dtype=dtypes)
df.head()
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Pandas: broadcasting

Let’s add a column that contains the tip percentage.
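The new column is the result of one broadcast division; a sketch on two rows:

```python
import pandas as pd

df = pd.DataFrame({"total_bill": [16.99, 10.34], "tip": [1.01, 1.66]})
# Elementwise division of two columns of equal length
df["tip_percentage"] = df["tip"] / df["total_bill"]
```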
total_bill | tip | sex | smoker | day | time | size | tip_percentage | |
---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0.059447 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0.160542 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 0.166587 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0.139780 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0.146808 |
The computation df["tip"] / df["total_bill"] uses a broadcast rule: operations combine numpy arrays, Series or Pandas dataframes whenever the computation makes sense in view of their respective shapes. This principle is called broadcast or broadcasting. The tip and total_bill columns have the same shape, so broadcasting performs pairwise division.
This corresponds to the following “hand-crafted” approach with a for loop. But such a loop is dramatically slower: NEVER use Python for loops for numerical computations!
%%timeit -n 10
for i in range(df.shape[0]):
    df.loc[i, "tip_percentage"] = df.loc[i, "tip"] / df.loc[i, "total_bill"]
35.9 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The broadcast version, by contrast: 198 µs ± 27.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The for loop is \(\approx\) 200 times slower! (And it gets even worse on larger data.)
Changing values in a DataFrame

When you want to change a value in a DataFrame, never use chained indexing; use a single loc or iloc statement instead. The chained version might not work: it might modify a copy of the column and not the dataframe itself!
Another example of broadcasting is:
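Namely, multiplying a column by a scalar; a sketch (head() keeps the first rows):

```python
import pandas as pd

df = pd.DataFrame({"tip_percentage": [0.059447, 0.160542]})
# The scalar 100 is broadcast against every entry of the column
percent = df[["tip_percentage"]].head() * 100
```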
tip_percentage | |
---|---|
0 | 5.944673 |
1 | 16.054159 |
2 | 16.658734 |
3 | 13.978041 |
4 | 14.680765 |
where we multiplied each entry of the tip_percentage
column by 100.
Remark. Note the difference between df[["tip_percentage"]], which returns a DataFrame containing only the tip_percentage column, and df["tip_percentage"], which returns a Series containing the data of the tip_percentage column.
What about tip_percentage outliers?

groupby and aggregate

Many computations can be formulated as a groupby followed by an aggregation.
What is the average tip and tip percentage each day?

total_bill | tip | sex | smoker | day | time | size | tip_percentage | |
---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0.059447 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0.160542 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 0.166587 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0.139780 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0.146808 |
A naive df.groupby("day").mean() fails on the non-numeric columns: TypeError: category dtype does not support aggregation 'mean'
But we don’t care about the size column here, so we can use instead:
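A sketch of the select-then-group pattern on toy data (the notebook applies it to total_bill, tip and tip_percentage grouped by day):

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["Sun", "Sun", "Sat"],
    "total_bill": [16.99, 10.34, 20.00],
    "tip": [1.01, 1.66, 3.00],
})
# Keep only the columns of interest, then partition by day and average
daily_mean = df[["total_bill", "tip", "day"]].groupby("day").mean()
```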
total_bill | tip | tip_percentage | |
---|---|---|---|
day | |||
Fri | 17.151579 | 2.734737 | 0.169913 |
Sat | 20.441379 | 2.993103 | 0.153152 |
Sun | 21.410000 | 3.255132 | 0.166897 |
Thur | 17.682742 | 2.771452 | 0.161276 |
If we want to be more precise, we can groupby
using several columns
(
df[["total_bill", "tip", "tip_percentage", "day", "time"]] # selection
.groupby(["day", "time"]) # partition
.mean() # aggregation
)
total_bill | tip | tip_percentage | ||
---|---|---|---|---|
day | time | |||
Fri | Dinner | 19.663333 | 2.940000 | 0.158916 |
Lunch | 12.845714 | 2.382857 | 0.188765 | |
Sat | Dinner | 20.441379 | 2.993103 | 0.153152 |
Lunch | NaN | NaN | NaN | |
Sun | Dinner | 21.410000 | 3.255132 | 0.166897 |
Lunch | NaN | NaN | NaN | |
Thur | Dinner | 18.780000 | 3.000000 | 0.159744 |
Lunch | 17.664754 | 2.767705 | 0.161301 |
Remarks

- We obtain a DataFrame with a two-level index: on the day and on the time
- There are NaN values for empty groups (e.g. Sat, Lunch)

Sometimes, it’s more convenient to get the groups as columns instead of a multi-level index. For this, use reset_index:
(
df[["total_bill", "tip", "tip_percentage", "day", "time"]] # selection
.groupby(["day", "time"]) # partition
.mean() # aggregation
.reset_index() # a.k.a. ungroup
)
day | time | total_bill | tip | tip_percentage | |
---|---|---|---|---|---|
0 | Fri | Dinner | 19.663333 | 2.940000 | 0.158916 |
1 | Fri | Lunch | 12.845714 | 2.382857 | 0.188765 |
2 | Sat | Dinner | 20.441379 | 2.993103 | 0.153152 |
3 | Sat | Lunch | NaN | NaN | NaN |
4 | Sun | Dinner | 21.410000 | 3.255132 | 0.166897 |
5 | Sun | Lunch | NaN | NaN | NaN |
6 | Thur | Dinner | 18.780000 | 3.000000 | 0.159744 |
7 | Thur | Lunch | 17.664754 | 2.767705 | 0.161301 |
Computations with Pandas often pipeline many operations up to a final result. Pipelining many operations is good practice and perfectly normal, but in order to keep the code readable you can put the pipeline between parentheses (a single python expression) as follows:
(
df[["total_bill", "tip", "tip_percentage", "day", "time"]]
.groupby(["day", "time"])
.mean()
.reset_index()
# and on top of all this we sort the dataframe with respect
# to the tip_percentage
.sort_values("tip_percentage")
)
day | time | total_bill | tip | tip_percentage | |
---|---|---|---|---|---|
2 | Sat | Dinner | 20.441379 | 2.993103 | 0.153152 |
0 | Fri | Dinner | 19.663333 | 2.940000 | 0.158916 |
6 | Thur | Dinner | 18.780000 | 3.000000 | 0.159744 |
7 | Thur | Lunch | 17.664754 | 2.767705 | 0.161301 |
4 | Sun | Dinner | 21.410000 | 3.255132 | 0.166897 |
1 | Fri | Lunch | 12.845714 | 2.382857 | 0.188765 |
3 | Sat | Lunch | NaN | NaN | NaN |
5 | Sun | Lunch | NaN | NaN | NaN |
DataFrame with style

Now, we can answer, with style, the question: what are the average tip percentages along the week?
(
df[["tip_percentage", "day", "time"]]
.groupby(["day", "time"])
.mean()
# At the end of the pipeline you can use .style
.style
# Print numerical values as percentages
.format("{:.2%}")
.background_gradient()
)
NaN values

But the NaN values are somewhat annoying. Let’s remove them:
(
df[["tip_percentage", "day", "time"]]
.groupby(["day", "time"])
.mean()
# We just add this from the previous pipeline
.dropna()
.style
.format("{:.2%}")
.background_gradient()
)
Now, we clearly see when tip_percentage is maximal. But what about the standard deviation? We used only .mean() so far, but we can apply several aggregating functions at once with .agg():
(
df[["tip_percentage", "day", "time"]]
.groupby(["day", "time"])
.agg(["mean", "std"]) # we feed `agg` with a list of names of callables
.dropna()
.style
.format("{:.2%}")
.background_gradient()
)
We can also use .describe() as an aggregation function. Moreover, we use the subset option to specify which columns we want to style, and a tuple such as ("tip_percentage", "count") to access the multi-level column index.
(
df[["tip_percentage", "day", "time"]]
.groupby(["day", "time"])
.describe() # all-purpose summarising function
)
tip_percentage | |||||||||
---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | ||
day | time | ||||||||
Fri | Dinner | 12.0 | 0.158916 | 0.047024 | 0.103555 | 0.123613 | 0.144742 | 0.179199 | 0.263480 |
Lunch | 7.0 | 0.188765 | 0.045885 | 0.117735 | 0.167289 | 0.187735 | 0.210996 | 0.259314 | |
Sat | Dinner | 87.0 | 0.153152 | 0.051293 | 0.035638 | 0.123863 | 0.151832 | 0.188271 | 0.325733 |
Sun | Dinner | 76.0 | 0.166897 | 0.084739 | 0.059447 | 0.119982 | 0.161103 | 0.187889 | 0.710345 |
Thur | Dinner | 1.0 | 0.159744 | NaN | 0.159744 | 0.159744 | 0.159744 | 0.159744 | 0.159744 |
Lunch | 61.0 | 0.161301 | 0.038972 | 0.072961 | 0.137741 | 0.153846 | 0.193424 | 0.266312 |
(
df[["tip_percentage", "day", "time"]]
.groupby(["day", "time"])
.describe()
.dropna()
.style
.bar(subset=[("tip_percentage", "count")])
.background_gradient(subset=[("tip_percentage", "50%")])
)
Predicting tip based on total_bill

As an example of a very simple machine-learning problem, let’s try to understand how we can predict tip based on total_bill.
import numpy as np
import matplotlib.pyplot as plt

plt.scatter(df["total_bill"], df["tip"])
plt.xlabel("total_bill", fontsize=12)
plt.ylabel("tip", fontsize=12)
Text(0, 0.5, 'tip')
There’s a rough linear dependence between the two. Let’s try to find it by hand!
Namely, we look for numbers \(b\) and \(w\) such that
\[ \texttt{tip} \approx b + w \times \texttt{total_bill} \]
for all the examples of pairs of \((\texttt{tip}, \texttt{total_bill})\) we observe in the data.
In machine learning, we say that this is a very simple example of a supervised learning problem (here it is a regression problem), where \(\texttt{tip}\) is the label and where \(\texttt{total_bill}\) is the (only) feature, for which we intend to use a linear predictor.
plt.scatter(df["total_bill"], df["tip"])
plt.xlabel("total_bill", fontsize=12)
plt.ylabel("tip", fontsize=12)
slope = 1.0
intercept = 0.0
x = np.linspace(0, 50, 1000)
plt.plot(x, intercept + slope * x, color="red")
This might require
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
%matplotlib widget
%matplotlib inline
x = np.linspace(0, 50, 1000)

@widgets.interact(intercept=(-5, 5, 1.), slope=(0, 1, .05))
def update(intercept=0.0, slope=0.5):
    plt.scatter(df["total_bill"], df["tip"])
    plt.plot(x, intercept + slope * x, color="red")
    plt.xlim((0, 50))
    plt.ylim((0, 10))
    plt.xlabel("total_bill", fontsize=12)
    plt.ylabel("tip", fontsize=12)
It is kind of tedious to do this by hand… it would be nice to come up with an automated way. Moreover, we use only the total_bill column to predict the tip, while we know about many other things.

We can’t perform computations (products and sums) with columns containing categorical variables, so we can’t use them as-is to predict the tip. We need to convert them to numbers somehow.
The most classical approach for this is one-hot encoding (also called “creating dummies” or “binarizing”) of the categorical variables, which can be easily achieved with pandas.get_dummies.

Why one-hot? See Wikipedia for a plausible explanation.
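A sketch of get_dummies on a toy frame (the notebook uses prefix_sep="#", which is visible in the column names below):

```python
import pandas as pd

df = pd.DataFrame({
    "total_bill": [16.99, 10.34],
    "sex": pd.Categorical(["Female", "Male"]),
})
# Only the categorical column is one-hot encoded; numeric columns pass through
df_one_hot = pd.get_dummies(df, prefix_sep="#")
```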
total_bill | tip | size | tip_percentage | sex#Female | sex#Male | smoker#No | smoker#Yes | day#Fri | day#Sat | day#Sun | day#Thur | time#Dinner | time#Lunch | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | 2 | 0.059447 | True | False | True | False | False | False | True | False | True | False |
1 | 10.34 | 1.66 | 3 | 0.160542 | False | True | True | False | False | False | True | False | True | False |
2 | 21.01 | 3.50 | 3 | 0.166587 | False | True | True | False | False | False | True | False | True | False |
3 | 23.68 | 3.31 | 2 | 0.139780 | False | True | True | False | False | False | True | False | True | False |
4 | 24.59 | 3.61 | 4 | 0.146808 | True | False | True | False | False | False | True | False | True | False |
Only the categorical columns have been one-hot encoded. For instance, the "day"
column is replaced by 4 columns named "day#Thur"
, "day#Fri"
, "day#Sat"
, "day#Sun"
, since "day"
has 4 modalities (see next line).
['Sun', 'Sat', 'Thur', 'Fri']
Categories (4, object): ['Fri', 'Sat', 'Sun', 'Thur']
Sums over the dummies for sex, smoker, day and time are all equal to one (by construction of the one-hot encoded vectors).
day_cols = [col for col in df_one_hot.columns if col.startswith("day")]
df_one_hot[day_cols].head()
df_one_hot[day_cols].sum(axis=1)
0 1
1 1
2 1
3 1
4 1
..
239 1
240 1
241 1
242 1
243 1
Length: 244, dtype: int64
The most standard solution is to remove one modality (i.e. remove one one-hot encoding vector). This is simply achieved by specifying drop_first=True in the get_dummies function.
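A sketch of the drop_first option:

```python
import pandas as pd

df = pd.DataFrame({"sex": pd.Categorical(["Female", "Male", "Male"])})
# The first modality (Female) gets no dummy column
dummies = pd.get_dummies(df, prefix_sep="#", drop_first=True)
```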
['Sun', 'Sat', 'Thur', 'Fri']
Categories (4, object): ['Fri', 'Sat', 'Sun', 'Thur']
total_bill | tip | size | tip_percentage | sex#Male | smoker#Yes | day#Sat | day#Sun | day#Thur | time#Lunch | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | 2 | 0.059447 | False | False | False | True | False | False |
1 | 10.34 | 1.66 | 3 | 0.160542 | True | False | False | True | False | False |
2 | 21.01 | 3.50 | 3 | 0.166587 | True | False | False | True | False | False |
3 | 23.68 | 3.31 | 2 | 0.139780 | True | False | False | True | False | False |
4 | 24.59 | 3.61 | 4 | 0.146808 | False | False | False | True | False | False |
Now, if a categorical feature has \(K\) modalities, we use only \(K-1\) dummies. For instance, there is no more sex#Female
binary column.
Question. So, a linear regression won’t fit a weight for sex#Female. But where do the model weights of the dropped binary columns go?

Answer. They just “go” to the intercept: the interpretation of the intercept depends on the “dropped” one-hot encodings.
So, we actually fit
\[ \begin{array}{rl} \texttt{tip} \approx b &+ w_1 \times \texttt{total_bill} + w_2 \times \texttt{size} \\ &+ w_3 \times \texttt{sex#Male} + w_4 \times \texttt{smoker#Yes} \\ &+ w_5 \times \texttt{day#Sat} + w_6 \times \texttt{day#Sun} + w_7 \times \texttt{day#Thur} \\ &+ w_8 \times \texttt{time#Lunch} \end{array} \]