import pandas as pd
fruits = {
"apples": [3, 2, 0, 1],
"oranges": [0, 3, 7, 2]
}
df_fruits = pd.DataFrame(fruits)
df_fruits
| | apples | oranges |
|---|---|---|
| 0 | 3 | 0 |
| 1 | 2 | 3 |
| 2 | 0 | 7 |
| 3 | 1 | 2 |
Pandas
January 30, 2024
The Pandas library (https://Pandas.pydata.org) is one of the most popular tools for Pythonistas dealing with data science.
Pandas provides the DataFrame object (a table of data) with a huge set of functionalities. Pandas allows you to:

- get acquainted with your data by analyzing them: does column A correlate with column B? What does the distribution of column C look like?
- first get acquainted with your data by cleaning and transforming them
- interface with the rest of the python “stack” for data science (matplotlib, seaborn, plotly, …), of which Pandas is a central component

Pandas is built on top of numpy. A DataFrame is often fed to plotting functions or machine learning algorithms (such as those provided by scikit-learn). Pandas also plays nicely with jupyter, leading to a nice interactive environment for data exploration and modeling.

The two primary components of Pandas are Series and DataFrame.
A Series is essentially a column
A DataFrame is a two-dimensional table made up of a collection of Series with equal length
According to Wes McKinney (the designer of Pandas), Pandas dataframes were inspired by the dataframes offered by the R language for statistical computing. The latter parallel the tables that are at the heart of Relational Database Management Systems.
DataFrame from scratch
import pandas as pd
fruits = {
"apples": [3, 2, 0, 1],
"oranges": [0, 3, 7, 2]
}
df_fruits = pd.DataFrame(fruits)
df_fruits
| | apples | oranges |
|---|---|---|
| 0 | 3 | 0 |
| 1 | 2 | 3 |
| 2 | 0 | 7 |
| 3 | 1 | 2 |
Dataframes can be created in many different ways. Here we used a dictionary with values made of sequences of equal length. The dictionary keys serve as column names.
Common patterns of creation call DataFrame() from Pandas with a list of dictionaries (each dictionary represents one row of the dataframe under construction) or with a dictionary of lists/containers (as above). The dictionary keys serve as column names.
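For instance, the fruits dataframe above can equally be built from a list of row dictionaries (a minimal sketch):

```python
import pandas as pd

# Each dictionary is one row; keys become column names
rows = [
    {"apples": 3, "oranges": 0},
    {"apples": 2, "oranges": 3},
    {"apples": 0, "oranges": 7},
    {"apples": 1, "oranges": 2},
]
df_rows = pd.DataFrame(rows)
```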
The notion of tidy dataframe was popularized by Hadley Wickham. It embodies a collection of good practices that every data scientist should adopt.
A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.
Your Pandas dataframes should abide by these principles.
tutorials = {
"HS118": [
"Getting Started With JupyterLab",
"Introduction to Python for Scientific Programming",
"Introduction to NumPy",
"Introduction to pandas",
],
"HS120": [
"Increase Citations, Ease Review and Collaboration",
"Time Series Forecasting With scikit-learn",
"Evaluating Your Machine Learning Models",
"Introduction to Audio and Speech Recognition",
],
}
schedule = pd.DataFrame(tutorials, index=["08:30", "10:30", "13:30", "15:30"])
schedule
| | HS118 | HS120 |
|---|---|---|
| 08:30 | Getting Started With JupyterLab | Increase Citations, Ease Review and Collaboration |
| 10:30 | Introduction to Python for Scientific Programming | Time Series Forecasting With scikit-learn |
| 13:30 | Introduction to NumPy | Evaluating Your Machine Learning Models |
| 15:30 | Introduction to pandas | Introduction to Audio and Speech Recognition |
Is it tidy? Why?
Pivot to long form using the melt method.
What happens if we modify the optional argument ignore_index?
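The molten_schedule used in the pipeline below is obtained with melt; here is a sketch on an abridged two-slot schedule (ignore_index=False keeps the time slots in the index):

```python
import pandas as pd

tutorials = {
    "HS118": ["Getting Started With JupyterLab", "Introduction to NumPy"],
    "HS120": ["Increase Citations, Ease Review and Collaboration",
              "Evaluating Your Machine Learning Models"],
}
schedule = pd.DataFrame(tutorials, index=["08:30", "13:30"])

# ignore_index=False keeps the time-slot index on every molten row
molten_schedule = schedule.melt(ignore_index=False)
```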
(
molten_schedule
.reset_index()
.rename(columns={
"index": "time",
"variable": "room",
"value": "title"})
.sort_values(by=["time", "room"])
.reset_index(drop=True)
)
| | time | room | title |
|---|---|---|---|
| 0 | 08:30 | HS118 | Getting Started With JupyterLab |
| 1 | 08:30 | HS120 | Increase Citations, Ease Review and Collaboration |
| 2 | 10:30 | HS118 | Introduction to Python for Scientific Programming |
| 3 | 10:30 | HS120 | Time Series Forecasting With scikit-learn |
| 4 | 13:30 | HS118 | Introduction to NumPy |
| 5 | 13:30 | HS120 | Evaluating Your Machine Learning Models |
| 6 | 15:30 | HS118 | Introduction to pandas |
| 7 | 15:30 | HS120 | Introduction to Audio and Speech Recognition |
At the end of the day, dataframes have to be queried and updated, just like tables in an RDBMS.
Every Pandas dataframe is equipped with an Index.
By default, a DataFrame uses a contiguous index. Here the index is made of names instead:
| | apples | oranges |
|---|---|---|
| Daniel | 3 | 0 |
| Sean | 2 | 3 |
| Pierce | 0 | 7 |
| Roger | 1 | 2 |
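A dataframe with such a named index was presumably built along these lines, passing explicit labels through the index argument:

```python
import pandas as pd

fruits = {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]}
# Explicit labels replace the default contiguous RangeIndex
df_fruits = pd.DataFrame(fruits, index=["Daniel", "Sean", "Pierce", "Roger"])
```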
Is the index a column? How is it implemented?
.loc versus .iloc
.loc locates by name
.iloc locates by numerical index (position)
Daniel 0
Sean 3
Pierce 7
Roger 2
Name: oranges, dtype: int64
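A minimal sketch contrasting the two accessors on the fruits dataframe:

```python
import pandas as pd

fruits = {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]}
df_fruits = pd.DataFrame(fruits, index=["Daniel", "Sean", "Pierce", "Roger"])

by_name = df_fruits.loc["Sean", "oranges"]   # label-based: row "Sean", column "oranges"
by_position = df_fruits.iloc[1, 1]           # position-based: second row, second column
```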
DataFrame
A DataFrame has many attributes
A DataFrame has many methods
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, Daniel to Roger
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 apples 4 non-null int64
1 oranges 4 non-null int64
dtypes: int64(2)
memory usage: 268.0+ bytes
What if we don’t know how many apples are in Sean’s basket?
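One way to record the unknown quantity is to assign a missing value; note how the apples column is then upcast to float64:

```python
import numpy as np
import pandas as pd

fruits = {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]}
df_fruits = pd.DataFrame(fruits, index=["Daniel", "Sean", "Pierce", "Roger"])

# Mark Sean's apple count as missing; the int64 column becomes float64
df_fruits.loc["Sean", "apples"] = np.nan
```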
| | apples | oranges |
|---|---|---|
| Daniel | 3.0 | 0 |
| Sean | NaN | 3 |
| Pierce | 0.0 | 7 |
| Roger | 1.0 | 2 |
| | apples | oranges |
|---|---|---|
| count | 3.000000 | 4.00000 |
| mean | 1.333333 | 3.00000 |
| std | 1.527525 | 2.94392 |
| min | 0.000000 | 0.00000 |
| 25% | 0.500000 | 1.50000 |
| 50% | 1.000000 | 2.50000 |
| 75% | 2.000000 | 4.00000 |
| max | 3.000000 | 7.00000 |
Note that count is 3 for apples now, since we have 1 missing value among the 4
Ooooops, we forgot about the bananas!
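The bananas column can be added by plain assignment (the values are read off the tables that follow):

```python
import pandas as pd

df_fruits = pd.DataFrame(
    {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]},
    index=["Daniel", "Sean", "Pierce", "Roger"],
)
# Assigning a list of the right length creates a new column
df_fruits["bananas"] = [0, 2, 1, 6]
```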
And we forgot the dates!
df_fruits['time'] = [
"2020/10/08 12:13", "2020/10/07 11:37",
"2020/10/10 14:07", "2020/10/09 10:51"
]
df_fruits
| | apples | oranges | bananas | time |
|---|---|---|---|---|
| Daniel | 3.0 | 0 | 0 | 2020/10/08 12:13 |
| Sean | NaN | 3 | 2 | 2020/10/07 11:37 |
| Pierce | 0.0 | 7 | 1 | 2020/10/10 14:07 |
| Roger | 1.0 | 2 | 6 | 2020/10/09 10:51 |
It’s not a date but a string (str)! So we convert this column to the datetime type.
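The conversion presumably goes through pd.to_datetime (the format string is an assumption matching the strings above):

```python
import pandas as pd

df_fruits = pd.DataFrame(
    {"time": ["2020/10/08 12:13", "2020/10/07 11:37",
              "2020/10/10 14:07", "2020/10/09 10:51"]},
    index=["Daniel", "Sean", "Pierce", "Roger"],
)
# Parse the strings into proper datetime64 values
df_fruits["time"] = pd.to_datetime(df_fruits["time"], format="%Y/%m/%d %H:%M")
```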
| | apples | oranges | bananas | time |
|---|---|---|---|---|
| Daniel | 3.0 | 0 | 0 | 2020-10-08 12:13:00 |
| Sean | NaN | 3 | 2 | 2020-10-07 11:37:00 |
| Pierce | 0.0 | 7 | 1 | 2020-10-10 14:07:00 |
| Roger | 1.0 | 2 | 6 | 2020-10-09 10:51:00 |
What if we want to keep only the baskets on or after October 9th?
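A boolean mask on the datetime column does the job; a string on the right-hand side is parsed on the fly:

```python
import pandas as pd

df_fruits = pd.DataFrame(
    {"time": pd.to_datetime(["2020/10/08 12:13", "2020/10/07 11:37",
                             "2020/10/10 14:07", "2020/10/09 10:51"])},
    index=["Daniel", "Sean", "Pierce", "Roger"],
)
# Keep rows whose timestamp falls on or after October 9th
recent = df_fruits[df_fruits["time"] >= "2020-10-09"]
```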
| | apples | oranges | bananas | time |
|---|---|---|---|---|
| Pierce | 0.0 | 7 | 1 | 2020-10-10 14:07:00 |
| Roger | 1.0 | 2 | 6 | 2020-10-09 10:51:00 |
Have a look at the documentation of the datetime package
| | apples | oranges | bananas | time |
|---|---|---|---|---|
| Daniel | 3.0 | 0 | 0 | 2020-10-08 12:13:00 |
| Sean | NaN | 3 | 2 | 2020-10-07 11:37:00 |
| Pierce | 0.0 | 7 | 1 | 2020-10-10 14:07:00 |
| Roger | 1.0 | 2 | 6 | 2020-10-09 10:51:00 |
| | oranges | bananas | time |
|---|---|---|---|
| Daniel | 0 | 0 | 2020-10-08 12:13:00 |
| Sean | 3 | 2 | 2020-10-07 11:37:00 |
| Pierce | 7 | 1 | 2020-10-10 14:07:00 |
| Roger | 2 | 6 | 2020-10-09 10:51:00 |
| | apples | oranges | bananas |
|---|---|---|---|
| Daniel | 3.0 | 0 | 0 |
| Sean | NaN | 3 | 2 |
What if we want to write the dataframe to a file?
| | apples | oranges | bananas | time |
|---|---|---|---|---|
| Daniel | 3.0 | 0 | 0 | 2020-10-08 12:13:00 |
| Sean | NaN | 3 | 2 | 2020-10-07 11:37:00 |
| Pierce | 0.0 | 7 | 1 | 2020-10-10 14:07:00 |
| Roger | 1.0 | 2 | 6 | 2020-10-09 10:51:00 |
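The listing below suggests the dataframe was written with to_csv; a minimal sketch, recreating the data inline (time column omitted for brevity):

```python
import numpy as np
import pandas as pd

# Recreate the fruit baskets
df_fruits = pd.DataFrame(
    {"apples": [3.0, np.nan, 0.0, 1.0],
     "oranges": [0, 3, 7, 2],
     "bananas": [0, 2, 1, 6]},
    index=["Daniel", "Sean", "Pierce", "Roger"],
)
# The index is written as the first, unnamed column of the csv file
df_fruits.to_csv("fruits.csv")
```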
-rw-rw-r-- 1 boucheron boucheron 163 janv. 30 13:33 fruits.csv
,apples,oranges,bananas,time
Daniel,3.0,0,0,2020-10-08 12:13:00
Sean,,3,2,2020-10-07 11:37:00
Pierce,0.0,7,1,2020-10-10 14:07:00
Roger,1.0,2,6,2020-10-09 10:51:00
csv files are not the end of history. Among other drawbacks, csv files have no built-in compression mechanism. The Apache Arrow project offers a variety of file formats to remedy this set of drawbacks.
The DataFrame class is equipped with methods for saving to many formats.
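The list below can be produced by introspection:

```python
import pandas as pd

# Collect the exporter methods by name
exporters = [name for name in dir(pd.DataFrame) if name.startswith("to_")]
```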
['to_clipboard',
'to_csv',
'to_dict',
'to_excel',
'to_feather',
'to_gbq',
'to_hdf',
'to_html',
'to_json',
'to_latex',
'to_markdown',
'to_numpy',
'to_orc',
'to_parquet',
'to_period',
'to_pickle',
'to_records',
'to_sql',
'to_stata',
'to_string',
'to_timestamp',
'to_xarray',
'to_xml']
Let’s read the file tips.csv. It is a CSV file (Comma Separated Values).
It contains data about a restaurant: the bill, tip, and some information about the customers.
fruits.csv notebook02_numpy.ipynb notebook04_pandas.qmd
img notebook02_numpy.qmd tips.csv
miserable_word_counts.pkl notebook03_pandas.ipynb
notebook01_python.qmd notebook03_pandas.qmd
import requests
import os
# The path containing your notebook
path_data = './'
# The name of the file
filename = 'tips.csv'
if os.path.exists(os.path.join(path_data, filename)):
    print('The file %s already exists.' % os.path.join(path_data, filename))
else:
    url = 'http://stephane-v-boucheron.fr/data/tips.csv'
    r = requests.get(url)
    with open(os.path.join(path_data, filename), 'wb') as f:
        f.write(r.content)
    print(f'Downloaded file {os.path.join(path_data, filename)}.')
The file ./tips.csv already exists.
-rw-rw-r-- 1 boucheron boucheron 163 janv. 30 13:33 fruits.csv
-rw-rw-r-- 1 boucheron boucheron 7943 janv. 29 22:13 tips.csv
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| 5 | 25.29 | 4.71 | Male | No | Sun | Dinner | 4 |
| 6 | 8.77 | 2.00 | Male | No | Sun | Dinner | 2 |
| 7 | 26.88 | 3.12 | Male | No | Sun | Dinner | 4 |
| 8 | 15.04 | 1.96 | Male | No | Sun | Dinner | 2 |
| 9 | 14.78 | 3.23 | Male | No | Sun | Dinner | 2 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 total_bill 244 non-null float64
1 tip 244 non-null float64
2 sex 244 non-null object
3 smoker 244 non-null object
4 day 244 non-null object
5 time 244 non-null object
6 size 244 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB
By default, columns that are non-numerical contain strings (str type)
The category type
An important type in Pandas is category for variables that are non-numerical
Pro tip. It’s always a good idea to tell Pandas which columns should be imported as categorical
So, let’s read again the file specifying some dtypes to the read_csv function
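The exact call is shown later in this notebook; here is a self-contained sketch using an inline two-row sample in place of tips.csv (with the real file, pass the filename instead of the buffer):

```python
import io
import pandas as pd

# Two rows standing in for tips.csv
sample = io.StringIO(
    "total_bill,tip,sex,smoker,day,time,size\n"
    "16.99,1.01,Female,No,Sun,Dinner,2\n"
    "10.34,1.66,Male,No,Sun,Dinner,3\n"
)
dtypes = {"sex": "category", "smoker": "category",
          "day": "category", "time": "category"}
df = pd.read_csv(sample, dtype=dtypes)
```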
total_bill float64
tip float64
sex category
smoker category
day category
time category
size int64
dtype: object
Have a look at the Pandas book on categorical data
| | total_bill | tip | size |
|---|---|---|---|
| count | 244.000000 | 244.000000 | 244.000000 |
| mean | 19.785943 | 2.998279 | 2.569672 |
| std | 8.902412 | 1.383638 | 0.951100 |
| min | 3.070000 | 1.000000 | 1.000000 |
| 25% | 13.347500 | 2.000000 | 2.000000 |
| 50% | 17.795000 | 2.900000 | 2.000000 |
| 75% | 24.127500 | 3.562500 | 3.000000 |
| max | 50.810000 | 10.000000 | 6.000000 |
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| count | 244.000000 | 244.000000 | 244 | 244 | 244 | 244 | 244.000000 |
| unique | NaN | NaN | 2 | 2 | 4 | 2 | NaN |
| top | NaN | NaN | Male | No | Sat | Dinner | NaN |
| freq | NaN | NaN | 157 | 151 | 87 | 176 | NaN |
| mean | 19.785943 | 2.998279 | NaN | NaN | NaN | NaN | 2.569672 |
| std | 8.902412 | 1.383638 | NaN | NaN | NaN | NaN | 0.951100 |
| min | 3.070000 | 1.000000 | NaN | NaN | NaN | NaN | 1.000000 |
| 25% | 13.347500 | 2.000000 | NaN | NaN | NaN | NaN | 2.000000 |
| 50% | 17.795000 | 2.900000 | NaN | NaN | NaN | NaN | 2.000000 |
| 75% | 24.127500 | 3.562500 | NaN | NaN | NaN | NaN | 3.000000 |
| max | 50.810000 | 10.000000 | NaN | NaN | NaN | NaN | 6.000000 |
| | total_bill | tip | size |
|---|---|---|---|
| total_bill | 1.000000 | 0.675734 | 0.598315 |
| tip | 0.675734 | 1.000000 | 0.489299 |
| size | 0.598315 | 0.489299 | 1.000000 |
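The correlation table above comes from the corr method; a sketch on a stand-in frame built from the first rows of tips.csv:

```python
import pandas as pd

df = pd.DataFrame({"total_bill": [16.99, 10.34, 21.01, 23.68],
                   "tip": [1.01, 1.66, 3.50, 3.31]})
# Pairwise Pearson correlations between the numerical columns
corr = df.corr(numeric_only=True)
```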
matplotlib and seaborn
Let’s show how we can use matplotlib and seaborn to visualize data contained in a Pandas dataframe
Pandas
Let us read again the tips.csv file
import pandas as pd
dtypes = {
"sex": "category",
"smoker": "category",
"day": "category",
"time": "category"
}
df = pd.read_csv("tips.csv", dtype=dtypes)
df.head()
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Pandas: broadcasting
Let’s add a column that contains the tip percentage
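The new column comes from an element-wise division; a sketch on the first rows of tips.csv:

```python
import pandas as pd

df = pd.DataFrame({"total_bill": [16.99, 10.34, 21.01],
                   "tip": [1.01, 1.66, 3.50]})
# Element-wise division of two equally-shaped columns
df["tip_percentage"] = df["tip"] / df["total_bill"]
```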
| | total_bill | tip | sex | smoker | day | time | size | tip_percentage |
|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0.059447 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0.160542 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 0.166587 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0.139780 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0.146808 |
The computation of df["tip"] / df["total_bill"] uses a broadcast rule. Operations can combine numpy arrays, Series or Pandas dataframes whenever the computation makes sense in view of their respective shapes. This principle is called broadcast or broadcasting.
The tip and total_bill columns have the same shape, so broadcasting performs pairwise division.
This corresponds to the following “hand-crafted” approach with a for loop:
But using such a loop is a bad idea:
NEVER use Python for-loops for numerical computations !
%%timeit -n 10
for i in range(df.shape[0]):
    df.loc[i, "tip_percentage"] = df.loc[i, "tip"] / df.loc[i, "total_bill"]
35.9 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
198 µs ± 27.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The for loop is \(\approx\) 200 times slower! (and even worse on larger data)
Changing values in a DataFrame
When you want to change a value in a DataFrame, never use chained indexing; use a single loc or iloc statement instead. The chained version might not work: it might modify a copy of the column and not the dataframe itself!
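A sketch of the recommended single-step assignment (the values are a stand-in):

```python
import pandas as pd

df = pd.DataFrame({"total_bill": [16.99, 10.34], "tip": [1.01, 1.66]})

# Single loc statement: always modifies df itself
df.loc[0, "tip"] = 2.00
# The discouraged variant is chained indexing, e.g. df["tip"][0] = 2.00,
# which may modify a temporary copy instead of df (SettingWithCopyWarning)
```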
Another example of broadcasting is:
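Multiplying by a scalar broadcasts over every entry; the table below was presumably produced along these lines:

```python
import pandas as pd

df = pd.DataFrame({"tip_percentage": [0.059447, 0.160542, 0.166587]})
# The scalar 100 is broadcast to every entry of the dataframe
percentages = df[["tip_percentage"]] * 100
```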
| | tip_percentage |
|---|---|
| 0 | 5.944673 |
| 1 | 16.054159 |
| 2 | 16.658734 |
| 3 | 13.978041 |
| 4 | 14.680765 |
where we multiplied each entry of the tip_percentage column by 100.
Remark. Note the difference between df[["tip_percentage"]], which returns a DataFrame containing only the tip_percentage column, and df["tip_percentage"], which returns a Series containing the data of the tip_percentage column.
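Concretely (a minimal sketch):

```python
import pandas as pd

df = pd.DataFrame({"tip_percentage": [0.06, 0.16]})

as_frame = df[["tip_percentage"]]   # a list of column names -> DataFrame
as_series = df["tip_percentage"]    # a single column name -> Series
```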
tip_percentage outliers?
groupby and aggregate
Many computations can be formulated as a groupby followed by an aggregation.
What is the average tip and tip percentage each day?
| | total_bill | tip | sex | smoker | day | time | size | tip_percentage |
|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0.059447 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0.160542 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 0.166587 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0.139780 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0.146808 |
/tmp/ipykernel_25457/381386936.py:1: FutureWarning:
The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
TypeError: category dtype does not support aggregation 'mean'
But we don’t care about the size column here, so we can instead use
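A sketch of selecting only the relevant columns before grouping (stand-in data built from the head of tips.csv; observed=True addresses pandas’ FutureWarning about categorical groupers):

```python
import pandas as pd

df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip": [1.01, 1.66, 3.50, 3.31],
    "day": pd.Categorical(["Sun", "Sun", "Sat", "Sat"]),
})
# Keep the numeric columns plus the grouping key, then average each group
means = df[["total_bill", "tip", "day"]].groupby("day", observed=True).mean()
```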
| | total_bill | tip | tip_percentage |
|---|---|---|---|
| day | | | |
| Fri | 17.151579 | 2.734737 | 0.169913 |
| Sat | 20.441379 | 2.993103 | 0.153152 |
| Sun | 21.410000 | 3.255132 | 0.166897 |
| Thur | 17.682742 | 2.771452 | 0.161276 |
If we want to be more precise, we can groupby using several columns
(
df[["total_bill", "tip", "tip_percentage", "day", "time"]] # selection
.groupby(["day", "time"]) # partition
.mean() # aggregation
)
| day | time | total_bill | tip | tip_percentage |
|---|---|---|---|---|
| Fri | Dinner | 19.663333 | 2.940000 | 0.158916 |
| Fri | Lunch | 12.845714 | 2.382857 | 0.188765 |
| Sat | Dinner | 20.441379 | 2.993103 | 0.153152 |
| Sat | Lunch | NaN | NaN | NaN |
| Sun | Dinner | 21.410000 | 3.255132 | 0.166897 |
| Sun | Lunch | NaN | NaN | NaN |
| Thur | Dinner | 18.780000 | 3.000000 | 0.159744 |
| Thur | Lunch | 17.664754 | 2.767705 | 0.161301 |
Remarks
The result is a DataFrame with a two-level indexing: on the day and the time. Note the NaN values for empty groups (e.g. Sat, Lunch). Sometimes, it’s more convenient to get the groups as columns instead of a multi-level index.
For this, use reset_index:
(
df[["total_bill", "tip", "tip_percentage", "day", "time"]] # selection
.groupby(["day", "time"]) # partition
.mean() # aggregation
.reset_index() # aka ungroup
)
| | day | time | total_bill | tip | tip_percentage |
|---|---|---|---|---|---|
| 0 | Fri | Dinner | 19.663333 | 2.940000 | 0.158916 |
| 1 | Fri | Lunch | 12.845714 | 2.382857 | 0.188765 |
| 2 | Sat | Dinner | 20.441379 | 2.993103 | 0.153152 |
| 3 | Sat | Lunch | NaN | NaN | NaN |
| 4 | Sun | Dinner | 21.410000 | 3.255132 | 0.166897 |
| 5 | Sun | Lunch | NaN | NaN | NaN |
| 6 | Thur | Dinner | 18.780000 | 3.000000 | 0.159744 |
| 7 | Thur | Lunch | 17.664754 | 2.767705 | 0.161301 |
Computations with Pandas can include many operations that are pipelined until the final computation.
Pipelining many operations is good practice and perfectly normal, but in order to make the code readable you can put the whole pipeline between parentheses (a single python expression) as follows:
(
df[["total_bill", "tip", "tip_percentage", "day", "time"]]
.groupby(["day", "time"])
.mean()
.reset_index()
# and on top of all this we sort the dataframe with respect
# to the tip_percentage
.sort_values("tip_percentage")
)
| | day | time | total_bill | tip | tip_percentage |
|---|---|---|---|---|---|
| 2 | Sat | Dinner | 20.441379 | 2.993103 | 0.153152 |
| 0 | Fri | Dinner | 19.663333 | 2.940000 | 0.158916 |
| 6 | Thur | Dinner | 18.780000 | 3.000000 | 0.159744 |
| 7 | Thur | Lunch | 17.664754 | 2.767705 | 0.161301 |
| 4 | Sun | Dinner | 21.410000 | 3.255132 | 0.166897 |
| 1 | Fri | Lunch | 12.845714 | 2.382857 | 0.188765 |
| 3 | Sat | Lunch | NaN | NaN | NaN |
| 5 | Sun | Lunch | NaN | NaN | NaN |
DataFrame with style
Now, we can answer, with style, the question: what are the average tip percentages along the week?
(
df[["tip_percentage", "day", "time"]]
.groupby(["day", "time"])
.mean()
# At the end of the pipeline you can use .style
.style
# Print numerical values as percentages
.format("{:.2%}")
.background_gradient()
)
NaN values
But the NaN values are somewhat annoying. Let’s remove them
(
df[["tip_percentage", "day", "time"]]
.groupby(["day", "time"])
.mean()
# We just add this from the previous pipeline
.dropna()
.style
.format("{:.2%}")
.background_gradient()
)
Now, we clearly see when tip_percentage is maximal. But what about the standard deviation?
We used .mean() so far, but we can apply several aggregating functions using .agg()
(
df[["tip_percentage", "day", "time"]]
.groupby(["day", "time"])
.agg(["mean", "std"]) # we feed `agg` with a list of names of callables
.dropna()
.style
.format("{:.2%}")
.background_gradient()
)
We can also use .describe() as the aggregation function. Moreover, we use the subset option to specify which columns we want to style, and we use ("tip_percentage", "count") to access the multi-level column index.
(
df[["tip_percentage", "day", "time"]]
.groupby(["day", "time"])
.describe() # all-purpose summarising function
)
(all columns below describe tip_percentage)
| day | time | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| Fri | Dinner | 12.0 | 0.158916 | 0.047024 | 0.103555 | 0.123613 | 0.144742 | 0.179199 | 0.263480 |
| Fri | Lunch | 7.0 | 0.188765 | 0.045885 | 0.117735 | 0.167289 | 0.187735 | 0.210996 | 0.259314 |
| Sat | Dinner | 87.0 | 0.153152 | 0.051293 | 0.035638 | 0.123863 | 0.151832 | 0.188271 | 0.325733 |
| Sun | Dinner | 76.0 | 0.166897 | 0.084739 | 0.059447 | 0.119982 | 0.161103 | 0.187889 | 0.710345 |
| Thur | Dinner | 1.0 | 0.159744 | NaN | 0.159744 | 0.159744 | 0.159744 | 0.159744 | 0.159744 |
| Thur | Lunch | 61.0 | 0.161301 | 0.038972 | 0.072961 | 0.137741 | 0.153846 | 0.193424 | 0.266312 |
(
df[["tip_percentage", "day", "time"]]
.groupby(["day", "time"])
.describe()
.dropna()
.style
.bar(subset=[("tip_percentage", "count")])
.background_gradient(subset=[("tip_percentage", "50%")])
)
Predicting the tip based on the total_bill
As an example of a very simple machine-learning problem, let’s try to understand how we can predict tip based on total_bill.
import numpy as np
import matplotlib.pyplot as plt

plt.scatter(df["total_bill"], df["tip"])
plt.xlabel("total_bill", fontsize=12)
plt.ylabel("tip", fontsize=12)
Text(0, 0.5, 'tip')

There’s a rough linear dependence between the two. Let’s try to find it by hand!
Namely, we look for numbers \(b\) and \(w\) such that
\[ \texttt{tip} \approx b + w \times \texttt{total_bill} \]
for all the examples of pairs of \((\texttt{tip}, \texttt{total_bill})\) we observe in the data.
In machine learning, we say that this is a very simple example of a supervised learning problem (here it is a regression problem), where \(\texttt{tip}\) is the label and where \(\texttt{total_bill}\) is the (only) feature, for which we intend to use a linear predictor.
plt.scatter(df["total_bill"], df["tip"])
plt.xlabel("total_bill", fontsize=12)
plt.ylabel("tip", fontsize=12)
slope = 1.0
intercept = 0.0
x = np.linspace(0, 50, 1000)
plt.plot(x, intercept + slope * x, color="red")
This might require:
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
%matplotlib widget
%matplotlib inline
x = np.linspace(0, 50, 1000)
@widgets.interact(intercept=(-5, 5, 1.), slope=(0, 1, .05))
def update(intercept=0.0, slope=0.5):
    plt.scatter(df["total_bill"], df["tip"])
    plt.plot(x, intercept + slope * x, color="red")
    plt.xlim((0, 50))
    plt.ylim((0, 10))
    plt.xlabel("total_bill", fontsize=12)
    plt.ylabel("tip", fontsize=12)
This is kind of tedious to do by hand… it would be nice to come up with an automated way of doing it. Moreover:
We used only the total_bill column to predict the tip, while we know about many other things.
We can’t perform computations (products and sums) with columns containing categorical variables. So, we can’t use them like this to predict the tip. We need to convert them to numbers somehow.
The most classical approach for this is one-hot encoding (or “create dummies” or “binarize”) of the categorical variables, which can be easily achieved with pandas.get_dummies
Why one-hot? See wikipedia for a plausible explanation
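The column names below are consistent with a get_dummies call using "#" as separator; a sketch on stand-in data:

```python
import pandas as pd

df = pd.DataFrame({
    "total_bill": [16.99, 10.34],
    "sex": pd.Categorical(["Female", "Male"]),
    "smoker": pd.Categorical(["No", "Yes"]),
})
# One boolean column per modality, named <variable>#<modality>
df_one_hot = pd.get_dummies(df, prefix_sep="#")
```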
| | total_bill | tip | size | tip_percentage | sex#Female | sex#Male | smoker#No | smoker#Yes | day#Fri | day#Sat | day#Sun | day#Thur | time#Dinner | time#Lunch |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 2 | 0.059447 | True | False | True | False | False | False | True | False | True | False |
| 1 | 10.34 | 1.66 | 3 | 0.160542 | False | True | True | False | False | False | True | False | True | False |
| 2 | 21.01 | 3.50 | 3 | 0.166587 | False | True | True | False | False | False | True | False | True | False |
| 3 | 23.68 | 3.31 | 2 | 0.139780 | False | True | True | False | False | False | True | False | True | False |
| 4 | 24.59 | 3.61 | 4 | 0.146808 | True | False | True | False | False | False | True | False | True | False |
Only the categorical columns have been one-hot encoded. For instance, the "day" column is replaced by 4 columns named "day#Thur", "day#Fri", "day#Sat", "day#Sun", since "day" has 4 modalities (see next line).
['Sun', 'Sat', 'Thur', 'Fri']
Categories (4, object): ['Fri', 'Sat', 'Sun', 'Thur']
Sums over the dummies of sex, smoker, day and time are all equal to one (by construction of the one-hot encoded vectors).
day_cols = [col for col in df_one_hot.columns if col.startswith("day")]
df_one_hot[day_cols].head()
df_one_hot[day_cols].sum(axis=1)
0 1
1 1
2 1
3 1
4 1
..
239 1
240 1
241 1
242 1
243 1
Length: 244, dtype: int64
The most standard solution is to remove one modality (i.e. remove one one-hot encoding vector). This is simply achieved by specifying drop_first=True in the get_dummies function.
['Sun', 'Sat', 'Thur', 'Fri']
Categories (4, object): ['Fri', 'Sat', 'Sun', 'Thur']
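A sketch of the drop_first variant on stand-in data:

```python
import pandas as pd

df = pd.DataFrame({
    "sex": pd.Categorical(["Female", "Male", "Male"]),
    "smoker": pd.Categorical(["No", "Yes", "No"]),
})
# K modalities -> K-1 dummies: the first modality of each variable is dropped
df_one_hot = pd.get_dummies(df, prefix_sep="#", drop_first=True)
```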
| | total_bill | tip | size | tip_percentage | sex#Male | smoker#Yes | day#Sat | day#Sun | day#Thur | time#Lunch |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 2 | 0.059447 | False | False | False | True | False | False |
| 1 | 10.34 | 1.66 | 3 | 0.160542 | True | False | False | True | False | False |
| 2 | 21.01 | 3.50 | 3 | 0.166587 | True | False | False | True | False | False |
| 3 | 23.68 | 3.31 | 2 | 0.139780 | True | False | False | True | False | False |
| 4 | 24.59 | 3.61 | 4 | 0.146808 | False | False | False | True | False | False |
Now, if a categorical feature has \(K\) modalities, we use only \(K-1\) dummies. For instance, there is no more sex#Female binary column.
Question. So, a linear regression won’t fit a weight for sex#Female. But, where do the model weights of the dropped binary columns go ?
Answer. They just “go” to the intercept: the interpretation of the intercept (the bias) depends on the “dropped” one-hot encodings.
So, we actually fit
\[ \begin{array}{rl} \texttt{tip} \approx b &+ w_1 \times \texttt{total_bill} + w_2 \times \texttt{size} \\ &+ w_3 \times \texttt{sex#Male} + w_4 \times \texttt{smoker#Yes} \\ &+ w_5 \times \texttt{day#Sat} + w_6 \times \texttt{day#Sun} + w_7 \times \texttt{day#Thur} \\ &+ w_8 \times \texttt{time#Lunch} \end{array} \]