Aim of the package

The Ecume package provides statistical methods to test whether samples are from the same or distinct distributions. It contains non-parametric tests for different settings.

Two sample tests for univariate distribution

General Setting

Consider two distributions P₁ and P₂. We want to test the null hypothesis H₀ : P₁ = P₂. To do this, we have two sets of observations x_i ∼ P₁, with i ∈ [1, …m] and y_j ∼ P₁, with j ∈ [1, …n].

Kolmogorov-Smirnov test

The Kolmogorov-Smirnov statistic for two samples relies on the empirical cumulative distributions F_x and F_y for univariate distributions and is computed as

D_n, m = sup_{x ∈ {x_i, y_j}, i ∈ [1, …m], j ∈ [1, …n]}|F_x(x) − F_y(x)|

We can compute the distribution of D_n, m under the null and device a test. If the distributions are identical we expect to not reject the null about 95% of the time, for a nominal level of .05.

set.seed(20)
x <- rnorm(100, 0, 1)
y <- rnorm(200, 0, 1)
ks.test(x, y)

## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.095, p-value = 0.5842
## alternative hypothesis: two-sided

However, if the distributions are not identical, we see that we would reject the null hypothesis.

set.seed(20)
x <- rnorm(100, 0, 1)
y <- rnorm(200, 0, 2)
ks.test(x, y)

## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.235, p-value = 0.001268
## alternative hypothesis: two-sided

Adding observation weights and threshold

Now, let us imagine we also have observations weights w_x, i and w_y, j for all samples. Moreover, instead of testing H₀ : D_m, n = 0, we want to test against an effect size, i.e. H₀ : D_m, n ≤ c. Relying on (Monahan 2011), we can do this using the ks_test function:

library(Ecume)
set.seed(20)
x <- rnorm(100, 0, 1)
w_x <- runif(100, 0, 1)
y <- rnorm(200, 0, 1)
w_y <- runif(200, 0, 1)
ks_test(x = x, y = y, w_x = w_x, w_y = w_y, thresh = .01)

## 
##  Two-sample Weighted Kolmogorov-Smirnov test with threshold 0.01
## 
## data:  x and y
## = 0.09097, p-value = 0.8142
## alternative hypothesis: two-sided

set.seed(20)
x <- rnorm(100, 0, 1)
w_x <- runif(100, 0, 1)
y <- rnorm(200, 0, 2)
w_y <- runif(200, 0, 1)
ks_test(x = x, y = y, w_x = w_x, w_y = w_y, thresh = .01)

## 
##  Two-sample Weighted Kolmogorov-Smirnov test with threshold 0.01
## 
## data:  x and y
## = 0.23954, p-value = 0.0074
## alternative hypothesis: two-sided

Mutivariate Distributions

However, the KS test does not work with multivariate distributions. Other statistics have been proposed. Here, we have implemented two

The two-sample kernel test from (Gretton et al. 2012)

The two-sample kernel test relies on a kernel function (x, y) → k(x, y) and the Mean Maximum Discrepancy:

$$MMD^2_u = \frac{1}{m(m-1)}\sum_{x\neq x'}k(x, x') + \frac{1}{n(n-1)}\sum_{y\neq y'}k(y, y') - \frac{2}{mn}\sum_{x, y}k(x, y)$$

While this statistic has some closed-form bounds under the null for some kernels, we can also compute its distribution using permutations

set.seed(20)
x <- matrix(c(rnorm(100, 0, 1),
              rnorm(100, 0, 1)),
            ncol = 2)
y <- matrix(c(rnorm(200, 0, 2),
              rnorm(200, 0, 1)),
            ncol = 2)
mmd_test(x = x, y = y, iterations = 10^4)

## $statistic
## [1] 0.03135917
## 
## $p.value
## [1] 1e-04

If the number of samples is too large, we can use a “linear” form of the statistics that samples elements from the sums above.

set.seed(20)
x <- matrix(c(rnorm(100, 0, 1),
              rnorm(100, 0, 1)),
            ncol = 2)
y <- matrix(c(rnorm(200, 0, 2),
              rnorm(200, 0, 1)),
            ncol = 2)
mmd_test(x = x, y = y, iterations = 10^4, type = "linear")

## $statistic
## [1] 0.2239044
## 
## $p.value
## [1] 0.018

Classifier Test

If we split the data into a training and test set and train a classifier Cl on the training set to distinguish between points from P₁ and P₂, then under the null, the classifier will not do better than chance on the test set. So if we have n_test samples in the test set, then the number of correctly classified points c_{Cl, n_test} follows

c_{Cl, n_test}∼_H₀Binom(.5, n_test)

This provides a valid test statistic and its distribution under the null. The quality of the classifier only matters for the power of the test. To provide flexibility for the users, we rely on the caret package (Kuhn 2020). By default, we use a k-NN classifier where k is selected through cross-validation on the training set.

set.seed(20)
x <- matrix(c(rnorm(200, 0, 1),
              rnorm(200, 0, 1)),
            ncol = 2)
y <- matrix(c(rnorm(200, 0, 2),
              rnorm(200, 0, 1)),
            ncol = 2)
classifier_test(x = x, y = y)

## $statistic
## [1] 0.6583333
## 
## $p.value
## [1] 0.0001652

Considering more than 2 distributions

Now we consider the case where we have more than two distributions. Instead we have k distributions and we want to test the hypothesis

H₀ : ∀(i, j) ∈ [1…k], P_i = P_j

The classifier test can be readily extended to more than two distributions. The main difference is that, under the null,

$$ c_{Cl, n_{test}}\sim_{H_0} Binom(\frac{1}{k}, n_{test})$$

set.seed(20)
x1 <- matrix(c(rnorm(200, 0, 1),
              rnorm(200, 0, 1)),
            ncol = 2)
x2 <- matrix(c(rnorm(200, 0, 2),
              rnorm(200, 0, 1)),
            ncol = 2)
x3 <- matrix(c(rnorm(200, 1, 1),
               rnorm(200, 0, 1)),
            ncol = 2)
classifier_test(x = list("x1" = x1, "x2" = x2, "x3" = x3))

## $statistic
## [1] 0.4333333
## 
## $p.value
## [1] 0.002040437

Note that here, we assume that we have as many as samples from each distribution. In case of class imbalance, we downsample everything to the smallest class.

References

Gretton, A., K. Borgwardt, Malte J. Rasch, B. Schölkopf, and Alex Smola. 2012. “A Kernel Two-Sample Test.” Undefined.

Kuhn, Max. 2020. Caret: Classification and Regression Training. https://CRAN.R-project.org/package=caret.

Monahan, John F. 2011. Numerical methods of statistics, second edition. Cambridge University Press. https://doi.org/10.1017/CBO9780511977176.