flights
dataset from nycflights
packagepacman::p_load("nycflights13")
On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.
flights
is a multivariate sample
Each row
corresponds to an observation
Each column corresponds to a measurement
Columns: 19$ year <int> 2013,…$ month <int> 1, 1,…$ day <int> 1, 1,…$ dep_time <int> 517, …$ sched_dep_time <int> 515, …$ dep_delay <dbl> 2, 4,…$ arr_time <int> 830, …$ sched_arr_time <int> 819, …$ arr_delay <dbl> 11, 2…$ carrier <chr> "UA",…$ flight <int> 1545,…$ tailnum <chr> "N142…$ origin <chr> "EWR"…$ dest <chr> "IAH"…$ air_time <dbl> 227, …$ distance <dbl> 1400,…$ hour <dbl> 5, 5,…$ minute <dbl> 15, 2…$ time_hour <dttm> 2013…
A sample is a sequence of items of the same type
In , if the type matches a basetype, a sample is typically represented using a vector
A numerical (quantitative) sample is just a sequence of numbers (either integers
or floats), it is represented by a numeric
vector.
A categorical (qualitative) sample is a sequence of values from some predefined finite set
In , categorical samples are handled using a special class: factor
arr_delay
mean(flights$arr_delay)
## [1] NA
mean(flights$arr_delay, na.rm=T)
## [1] 6.895377
sd(flights$arr_delay)
## [1] NA
sd(flights$arr_delay, na.rm=T)
## [1] 44.63329
summary(flights$arr_delay) %>% broom::tidy() %>% knitr::kable(digits = 2, caption = "Summary statistics for nycflights13 arrival delay")
minimum | q1 | median | mean | q3 | maximum | na |
---|---|---|---|---|---|---|
-86 | -17 | -5 | 6.9 | 14 | 1272 | 9430 |
sum_funs <- list(mean, sd, median, IQR)purrr::map_dbl(sum_funs, ~ .x( flights$arr_delay, na.rm=T))
## [1] 6.895377 44.633292 -5.000000 31.000000
NAs matter!
dplyr::summarise
flights %>% select(- year, -month, -day, -hour, -minute , -time_hour, -ends_with("time"), -flight) %>% summarise(across(where(is.numeric), .fns=list("mean"=mean, "median"=median , "IQR"=IQR), na.rm=T, .names="{.col}_{.fn}")) %>% pivot_longer(cols = everything()) %>% knitr::kable(digits=2)
name | value |
---|---|
dep_delay_mean | 12.64 |
dep_delay_median | -2.00 |
dep_delay_IQR | 16.00 |
arr_delay_mean | 6.90 |
arr_delay_median | -5.00 |
arr_delay_IQR | 31.00 |
distance_mean | 1039.91 |
distance_median | 872.00 |
distance_IQR | 887.00 |
The code for mean.default
goes beyond computing the weighted mean
function (x, trim = 0, na.rm = FALSE, ...){ if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) { warning("argument is not numeric or logical: returning NA") return(NA_real_) } if (na.rm) x <- x[!is.na(x)] if (!is.numeric(trim) || length(trim) != 1L) stop("'trim' must be numeric of length one") n <- length(x) if (trim > 0 && n) { if (is.complex(x)) stop("trimmed means are not defined for complex data") if (anyNA(x)) return(NA_real_) if (trim >= 0.5) return(stats::median(x, na.rm = FALSE)) lo <- floor(n * trim) + 1 hi <- n + 1 - lo x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi] } .Internal(mean(x))}
The first if (...) {...}
is an example of defensive programming
if (na.rm) { ... }
handles missing data when the default behavior is overriden.
NA
are filtered out
When positive, parameter trim
allows us to compute trimmed means
Depending on optional arguments, fucntion mean
allows to compute either arithmetic means, or trimmed (arithmetic) means
In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their quantiles (more specifically quartiles)
Box plots also have lines extending from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram
Outliers may be plotted as individual points
Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions of the underlying statistical distribution (though Tukey's boxplot assumes symmetry for the whiskers and normality for their length)
The spacings between the different parts of the box indicate the degree of dispersion (spread) and skewness in the data, and show outliers
Tukey's boxplot assumes symmetry for the whiskers and normality for their length
What does this mean?
Assume that we boxplot the N(μ,σ2)∼μ+σN(0,1) distribution instead of an empirical distribution.
Then:
The median is equal to μ because of the symmetry of N(0,1),
The quartiles are μ−σΦ←(3/4) and μ+σΦ←(3/4).
The IQR is 2σΦ←(3/4)≈1.35×σ
The default length of the whiskers (under geom_boxplot
) is
1.5×IQR
For a Gaussian population, 1.5IQR=3σΦ←(3/4)
In a Gaussian population, the probability that a sample point lies outside the whiskers is 2×(1−Φ(4×Φ←(3/4)))≈7/1000
n <- 100fake_data <- tibble(normal=rnorm(n),t=rt(n, df=3))title <- stringr::str_c("Standard Gaussian sample, size : ", n)fake_data %>% ggplot(mapping = aes(x=1L, y=normal)) + geom_boxplot(outlier.colour = "red", outlier.shape = 2, coef=1.5) + ggtitle(title) + ylab("") + ggplot2::theme( axis.text.x = element_blank(), axis.ticks.x = element_blank() )
summary(fake_data$normal)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -2.10528 -0.67292 -0.20090 -0.07495 0.43615 2.29645
Playing with boxplots
Tune whiskers lengths using coef
Jitter to spot the whole
Point out the distinction between outliers and extremes
The sampling distribution ( t-distribution with 3 degrees of freedom) is heavy-tailed
Some sample points lie far away from the whiskers
title <- stringr::str_c("Student sample with 3 df, size : ", n)ggplot(fake_data, aes(y=t, x=1L)) + geom_boxplot(outlier.shape=2, outlier.colour = 'red', outlier.alpha=.5, coef=1.5) + ylab("") + ggtitle(title) + ggplot2::theme( axis.text.x = element_blank(), axis.ticks.x = element_blank() )
summary(fake_data$t)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -5.629301 -0.736732 -0.005991 0.038924 0.841745 5.314943
n <- 1000tibble(normal=rnorm(n), t=rt(n, df=3)) %>% pivot_longer(cols=c(normal, t), names_to='Distribution') %>% ggplot() + aes(x=Distribution, y=value) + geom_boxplot(outlier.alpha = .5, outlier.colour = 'red', outlier.shape = 2) + ggtitle("Gaussian versus heavy-tailed samples")
With usual whiskers length
In Gaussian samples:
In heavy-tailed distributions:
tibble(x=rnorm(1000, mean = 2)) %>% ggplot() + aes(x=x) + geom_rug(alpha=.1) + geom_histogram(aes(y=..density..), alpha=.5, fill="white", colour="black", bins=30) + stat_function(linetype="dashed", fun=dnorm, args=list(mean=2, sd=1)) + annotate("text", x=4.5, y=.3, label="Gaussian density with mean 2")
tibble(x=rnorm(1000, mean = 2)) %>% ggplot() + aes(x=x) + geom_rug(alpha=.1) + geom_histogram(aes(y=..density..), alpha=.5, fill="white", colour="black", bins=30) + stat_function(linetype="dashed", fun=dnorm, args=list(mean=2, sd=1)) + annotate("text", x=4.5, y=.3, label="Gaussian density with mean 2") + ggtitle("Histogram for Gaussian sample")
tibble(x=rnorm(1000, mean = 2)) %>% ggplot() + aes(x=x) + geom_rug(alpha=.1) + geom_histogram(aes(y=..density..), alpha=.5, fill="white", colour="black", bins=30) + stat_function(linetype="dashed", fun=dnorm, args=list(mean=2, sd=1)) + annotate("text", x=4.5, y=.3, label="Gaussian density with mean 2") + ggtitle("Histogram for Gaussian sample") + labs(caption="Sample size=1000, 30 bins")
flights %>% select( ends_with("_delay") ) %>% pivot_longer( cols=ends_with("_delay"), names_pattern = "(.*)_delay", values_to = "delay" ) %>% filter( abs(delay) < 120 ) %>% ggplot() + aes(x=delay, y=..density.., fill=name) + geom_histogram(alpha=.3, bins=30)
A real-valued sample defines a probability distribution, called the empirical distribution. Each sample point is assigned a probability equal to its relative frequency.
The cumulative distribution function associated with this empirical distribution is called the empirical cumulative distribution function (ECDF).
Any numerical dataset may be used to illustrate ECDFs.
For example,
faithful
: a data frame with 272
observations on 2
variables concerning waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
eruptions
: numeric
, Eruption time in minutess
wating
: numeric
Waiting time to next eruption (in minutes)
A closer look at
faithful$eruptions
reveals that these are heavily rounded times originally in seconds, where multiples of 5 are more frequent than expected under non-human measurement.
ECDF plots may be used to compare two samples or to compare an empirical cumulative distribution function with the cumulative distribution function of some well-known probability distribution.
Fn <- ecdf(faithful$eruptions)x_grid <- seq(min(faithful$eruptions)-1, max(faithful$eruptions)+1, 0.01)df <- data_frame(x=x_grid, Fn=Fn(x_grid))
## Warning: `data_frame()` was deprecated in tibble 1.1.0.## Please use `tibble()` instead.## This warning is displayed once every 8 hours.## Call `lifecycle::last_warnings()` to see where this warning was generated.
df %>% ggplot() + aes(x=x, y=Fn) + geom_step(direction="hv")
In order to plot the quantile function, it is almost enough to exchange the role of the axes in the ECDF plot.
faithful[1:10,] %>% ggplot() + aes(x=eruptions) + geom_step(stat="ecdf", direction='vh') + coord_flip() + xlab("Quantiles") + ylab("Probabilities") + ggtitle("Old Faithful Geyser Data")
Why is it safe to set the direction
argument in the call to geom_step()
When it comes to comparing the empirical distributions defined by two samples,
quantile-quantile plots ( qqplot
) are favored.
F_n and G_m : two ECDFs
the quantile-quantile function is defined by
G_m^\leftarrow \circ F_n(z) = Y_{k+1:m}\qquad \text{if} \qquad \frac{k}{m}< F_n(z) \leq \frac{k+1}{m}
If G_m = F_n (if the two samples are equal up to a permutation), then
F_n^\leftarrow \circ F_n(z) = X_{k+1:n} \text{if} \qquad X_{k:n} < z \leq X_{k+1:n}
The quantile-quantile plot lies above line y=x and meets this line at every unique order statistics.
The departure of the quantile-quantile function from identity reflects the differences between the two empirical distribution functions.
We can for example compare the first and second half of the faithfull
dataset.
Function equantile()
returns a function
The returned function is parametrized by argument
sample
A function captures (encloses) the environment in which it is created
Function equantile
creates a new execution environment every time it is run. This environment is usually ephemeral, but here it becomes the enclosing environment of the manufactured function
The enclosing environment of the manufactured function is an execution environment of the function factory
equantile <- function(sample){ sample <- sort(sample) m <- length(sample) function(p){ ks <- pmax(1, ceiling(m * p)) sample[ks] }}
]
not_cancelled <- flights %>% filter(!is.na(arr_delay)) %>% sample_n(100) %>% arrange(dep_delay) %>% filter(abs(arr_delay) < 120) nr <- nrow(not_cancelled)not_cancelled %>% ggplot() + aes(x=dep_delay, y=quantile(arr_delay, seq(1, nr)/nr) ) + geom_point(size=.1) + coord_fixed() + geom_abline(slope=1, intercept = 0, alpha=.5) + ggtitle("QQ plot clipped arr_delay versus dep_delay") + ylab("arr_delay") + xlab("dep-delay")
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
o | Tile View: Overview of Slides |
Esc | Back to slideshow |
flights
dataset from nycflights
packagepacman::p_load("nycflights13")
On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.
flights
is a multivariate sample
Each row
corresponds to an observation
Each column corresponds to a measurement
Columns: 19$ year <int> 2013,…$ month <int> 1, 1,…$ day <int> 1, 1,…$ dep_time <int> 517, …$ sched_dep_time <int> 515, …$ dep_delay <dbl> 2, 4,…$ arr_time <int> 830, …$ sched_arr_time <int> 819, …$ arr_delay <dbl> 11, 2…$ carrier <chr> "UA",…$ flight <int> 1545,…$ tailnum <chr> "N142…$ origin <chr> "EWR"…$ dest <chr> "IAH"…$ air_time <dbl> 227, …$ distance <dbl> 1400,…$ hour <dbl> 5, 5,…$ minute <dbl> 15, 2…$ time_hour <dttm> 2013…
A sample is a sequence of items of the same type
In , if the type matches a basetype, a sample is typically represented using a vector
A numerical (quantitative) sample is just a sequence of numbers (either integers
or floats), it is represented by a numeric
vector.
A categorical (qualitative) sample is a sequence of values from some predefined finite set
In , categorical samples are handled using a special class: factor
\overline{X}_n = \frac{1}{n}\sum_{i=1}^n x_i \qquad \textbf{(location)}
s^2 = \frac{1}{n-1} \sum_{i=1}^n \big(x_i - \overline{X}_n\big)^2 \quad \text{or}\quad s^2 = \frac{1}{n} \sum_{i=1}^n \big(x_i - \overline{X}_n\big)^2\qquad\textbf{(dispersion)}
arr_delay
mean(flights$arr_delay)
## [1] NA
mean(flights$arr_delay, na.rm=T)
## [1] 6.895377
sd(flights$arr_delay)
## [1] NA
sd(flights$arr_delay, na.rm=T)
## [1] 44.63329
summary(flights$arr_delay) %>% broom::tidy() %>% knitr::kable(digits = 2, caption = "Summary statistics for nycflights13 arrival delay")
minimum | q1 | median | mean | q3 | maximum | na |
---|---|---|---|---|---|---|
-86 | -17 | -5 | 6.9 | 14 | 1272 | 9430 |
sum_funs <- list(mean, sd, median, IQR)purrr::map_dbl(sum_funs, ~ .x( flights$arr_delay, na.rm=T))
## [1] 6.895377 44.633292 -5.000000 31.000000
NAs matter!
dplyr::summarise
flights %>% select(- year, -month, -day, -hour, -minute , -time_hour, -ends_with("time"), -flight) %>% summarise(across(where(is.numeric), .fns=list("mean"=mean, "median"=median , "IQR"=IQR), na.rm=T, .names="{.col}_{.fn}")) %>% pivot_longer(cols = everything()) %>% knitr::kable(digits=2)
name | value |
---|---|
dep_delay_mean | 12.64 |
dep_delay_median | -2.00 |
dep_delay_IQR | 16.00 |
arr_delay_mean | 6.90 |
arr_delay_median | -5.00 |
arr_delay_IQR | 31.00 |
distance_mean | 1039.91 |
distance_median | 872.00 |
distance_IQR | 887.00 |
The code for mean.default
goes beyond computing the weighted mean
function (x, trim = 0, na.rm = FALSE, ...){ if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) { warning("argument is not numeric or logical: returning NA") return(NA_real_) } if (na.rm) x <- x[!is.na(x)] if (!is.numeric(trim) || length(trim) != 1L) stop("'trim' must be numeric of length one") n <- length(x) if (trim > 0 && n) { if (is.complex(x)) stop("trimmed means are not defined for complex data") if (anyNA(x)) return(NA_real_) if (trim >= 0.5) return(stats::median(x, na.rm = FALSE)) lo <- floor(n * trim) + 1 hi <- n + 1 - lo x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi] } .Internal(mean(x))}
The first if (...) {...}
is an example of defensive programming
if (na.rm) { ... }
handles missing data when the default behavior is overriden.
NA
are filtered out
When positive, parameter trim
allows us to compute trimmed means
Depending on optional arguments, fucntion mean
allows to compute either arithmetic means, or trimmed (arithmetic) means
Order statistics: x_{1:n} \leq x_{2:n} \leq \ldots \leq x_{n:n}
Median x_{\lfloor n/2\rfloor+1:n} if n is odd, (x_{n/2:n}+ x_{n/2+1:n})/2 if n is even
Quartiles x_{\lfloor n/4\rfloor:n} and x_{\lfloor 3 n/4\rfloor:n}
Quintiles x_{\lfloor k n/5\rfloor:n} pour k \in 1, \ldots, 4
Deciles x_{\lfloor k n/10\rfloor:n} pour k \in 1, \ldots, 9
Percentiles
...
In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their quantiles (more specifically quartiles)
Box plots also have lines extending from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram
Outliers may be plotted as individual points
Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions of the underlying statistical distribution (though Tukey's boxplot assumes symmetry for the whiskers and normality for their length)
The spacings between the different parts of the box indicate the degree of dispersion (spread) and skewness in the data, and show outliers
Tukey's boxplot assumes symmetry for the whiskers and normality for their length
What does this mean?
Assume that we boxplot the \mathcal{N}(\mu, \sigma^2) \sim \mu + \sigma \mathcal{N}(0, 1) distribution instead of an empirical distribution.
Then:
The median is equal to \mu because of the symmetry of \mathcal{N}(0,1),
The quartiles are \mu - \sigma \Phi^{\leftarrow}(3/4) and \mu + \sigma \Phi^{\leftarrow}(3/4).
The IQR is 2 \sigma \Phi^{\leftarrow}(3/4) \approx 1.35\times \sigma
The default length of the whiskers (under geom_boxplot
) is
1.5 \times \text{IQR}
For a Gaussian population, 1.5 \text{IQR} = 3 \sigma \Phi^{\leftarrow}(3/4)
In a Gaussian population, the probability that a sample point lies outside the whiskers is 2 \times (1 - \Phi(4 \times \Phi^{\leftarrow}(3/4))) \approx 7/1000
n <- 100fake_data <- tibble(normal=rnorm(n),t=rt(n, df=3))title <- stringr::str_c("Standard Gaussian sample, size : ", n)fake_data %>% ggplot(mapping = aes(x=1L, y=normal)) + geom_boxplot(outlier.colour = "red", outlier.shape = 2, coef=1.5) + ggtitle(title) + ylab("") + ggplot2::theme( axis.text.x = element_blank(), axis.ticks.x = element_blank() )
summary(fake_data$normal)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -2.10528 -0.67292 -0.20090 -0.07495 0.43615 2.29645
Point out the distinction between outliers and extremes
The sampling distribution ( t-distribution with 3 degrees of freedom) is heavy-tailed
Some sample points lie far away from the whiskers
title <- stringr::str_c("Student sample with 3 df, size : ", n)ggplot(fake_data, aes(y=t, x=1L)) + geom_boxplot(outlier.shape=2, outlier.colour = 'red', outlier.alpha=.5, coef=1.5) + ylab("") + ggtitle(title) + ggplot2::theme( axis.text.x = element_blank(), axis.ticks.x = element_blank() )
summary(fake_data$t)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -5.629301 -0.736732 -0.005991 0.038924 0.841745 5.314943
n <- 1000tibble(normal=rnorm(n), t=rt(n, df=3)) %>% pivot_longer(cols=c(normal, t), names_to='Distribution') %>% ggplot() + aes(x=Distribution, y=value) + geom_boxplot(outlier.alpha = .5, outlier.colour = 'red', outlier.shape = 2) + ggtitle("Gaussian versus heavy-tailed samples")
With usual whiskers length
In Gaussian samples:
In heavy-tailed distributions:
tibble(x=rnorm(1000, mean = 2)) %>% ggplot() + aes(x=x) + geom_rug(alpha=.1) + geom_histogram(aes(y=..density..), alpha=.5, fill="white", colour="black", bins=30) + stat_function(linetype="dashed", fun=dnorm, args=list(mean=2, sd=1)) + annotate("text", x=4.5, y=.3, label="Gaussian density with mean 2")
tibble(x=rnorm(1000, mean = 2)) %>% ggplot() + aes(x=x) + geom_rug(alpha=.1) + geom_histogram(aes(y=..density..), alpha=.5, fill="white", colour="black", bins=30) + stat_function(linetype="dashed", fun=dnorm, args=list(mean=2, sd=1)) + annotate("text", x=4.5, y=.3, label="Gaussian density with mean 2") + ggtitle("Histogram for Gaussian sample")
tibble(x=rnorm(1000, mean = 2)) %>% ggplot() + aes(x=x) + geom_rug(alpha=.1) + geom_histogram(aes(y=..density..), alpha=.5, fill="white", colour="black", bins=30) + stat_function(linetype="dashed", fun=dnorm, args=list(mean=2, sd=1)) + annotate("text", x=4.5, y=.3, label="Gaussian density with mean 2") + ggtitle("Histogram for Gaussian sample") + labs(caption="Sample size=1000, 30 bins")
flights %>% select( ends_with("_delay") ) %>% pivot_longer( cols=ends_with("_delay"), names_pattern = "(.*)_delay", values_to = "delay" ) %>% filter( abs(delay) < 120 ) %>% ggplot() + aes(x=delay, y=..density.., fill=name) + geom_histogram(alpha=.3, bins=30)
A real-valued sample defines a probability distribution, called the empirical distribution. Each sample point is assigned a probability equal to its relative frequency.
The cumulative distribution function associated with this empirical distribution is called the empirical cumulative distribution function (ECDF).
Let x_1, \ldots, x_n denote the points from a real sample (they may not be pairwise distincts), let P_n be the empirical distribution and F_n be the ECDF.
\begin{array}{rl} P_n(A) & = \frac{1}{n} \sum_{i=1}^n \mathbb{I}_{x_i \in A} \\ F_n(z) & = \frac{1}{n} \sum_{i=1}^n \mathbb{I}_{x_i \leq z} \end{array}
Let x_{1:n} \leq x_{2:n} \leq \ldots \leq x_{n:n} denote the order statistics of the sample (that is the non-decreasing rearrangement of sample, with possible repetitions).
The ECDF is easily computed from the order statistics:
F_n(z) = \begin{cases} 0 & \text{if } z < x_{1:n} \\ \frac{k}{n} & \text{if } x_{k:n} \leq z < x_{k+1:n}\\ 1 & \text{if } x_{n:n} \leq z \end{cases}
Any numerical dataset may be used to illustrate ECDFs.
For example,
faithful
: a data frame with 272
observations on 2
variables concerning waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
eruptions
: numeric
, Eruption time in minutess
wating
: numeric
Waiting time to next eruption (in minutes)
A closer look at
faithful$eruptions
reveals that these are heavily rounded times originally in seconds, where multiples of 5 are more frequent than expected under non-human measurement.
ECDF plots may be used to compare two samples or to compare an empirical cumulative distribution function with the cumulative distribution function of some well-known probability distribution.
Fn <- ecdf(faithful$eruptions)x_grid <- seq(min(faithful$eruptions)-1, max(faithful$eruptions)+1, 0.01)df <- data_frame(x=x_grid, Fn=Fn(x_grid))
## Warning: `data_frame()` was deprecated in tibble 1.1.0.## Please use `tibble()` instead.## This warning is displayed once every 8 hours.## Call `lifecycle::last_warnings()` to see where this warning was generated.
df %>% ggplot() + aes(x=x, y=Fn) + geom_step(direction="hv")
The quantile function of a probability distribution over \mathbb{R} is the generalized, left-continuous inverse function of the cumulative distribution function.
For p \in (0,1)
F_n^{\leftarrow}(p) = \inf \Big\{ z : F_n(z) \geq p\Big\}
F_n^{\leftarrow}(p) = X_{k+1:n} \qquad \text{if } \qquad\frac{k}{n} < p \leq \frac{k+1}{n}
In order to plot the quantile function, it is almost enough to exchange the role of the axes in the ECDF plot.
faithful[1:10,] %>% ggplot() + aes(x=eruptions) + geom_step(stat="ecdf", direction='vh') + coord_flip() + xlab("Quantiles") + ylab("Probabilities") + ggtitle("Old Faithful Geyser Data")
Why is it safe to set the direction
argument in the call to geom_step()
When it comes to comparing the empirical distributions defined by two samples,
quantile-quantile plots ( qqplot
) are favored.
F_n and G_m : two ECDFs
the quantile-quantile function is defined by
G_m^\leftarrow \circ F_n(z) = Y_{k+1:m}\qquad \text{if} \qquad \frac{k}{m}< F_n(z) \leq \frac{k+1}{m}
If G_m = F_n (if the two samples are equal up to a permutation), then
F_n^\leftarrow \circ F_n(z) = X_{k+1:n} \text{if} \qquad X_{k:n} < z \leq X_{k+1:n}
The quantile-quantile plot lies above line y=x and meets this line at every unique order statistics.
The departure of the quantile-quantile function from identity reflects the differences between the two empirical distribution functions.
We can for example compare the first and second half of the faithfull
dataset.
Function equantile()
returns a function
The returned function is parametrized by argument
sample
A function captures (encloses) the environment in which it is created
Function equantile
creates a new execution environment every time it is run. This environment is usually ephemeral, but here it becomes the enclosing environment of the manufactured function
The enclosing environment of the manufactured function is an execution environment of the function factory
equantile <- function(sample){ sample <- sort(sample) m <- length(sample) function(p){ ks <- pmax(1, ceiling(m * p)) sample[ks] }}
]
not_cancelled <- flights %>% filter(!is.na(arr_delay)) %>% sample_n(100) %>% arrange(dep_delay) %>% filter(abs(arr_delay) < 120) nr <- nrow(not_cancelled)not_cancelled %>% ggplot() + aes(x=dep_delay, y=quantile(arr_delay, seq(1, nr)/nr) ) + geom_point(size=.1) + coord_fixed() + geom_abline(slope=1, intercept = 0, alpha=.5) + ggtitle("QQ plot clipped arr_delay versus dep_delay") + ylab("arr_delay") + xlab("dep-delay")