Package 'Ecume'

Title: Equality of 2 (or k) Continuous Univariate and Multivariate Distributions
Description: We implement (or re-implements in R) a variety of statistical tools. They are focused on non-parametric two-sample (or k-sample) distribution comparisons in the univariate or multivariate case. See the vignette for more info.
Authors: Hector Roux de Bezieux [aut, cre]
Maintainer: Hector Roux de Bezieux <[email protected]>
License: MIT + file LICENSE
Version: 0.9.2
Built: 2024-12-24 04:14:20 UTC
Source: https://github.com/hectorrdb/ecume

Help Index


Classifier k-sample test

Description

Classifier k-sample test

Usage

classifier_test(
  x,
  y,
  split = 0.7,
  thresh = 0,
  method = "knn",
  control = caret::trainControl(method = "cv"),
  ...
)

Arguments

x

Samples from the first distribution or a list of samples from k distribution

y

Samples from the second distribution. Only used if x is a vector.

split

How to split the data between training and test. Default to .7

thresh

Value to add to the null hypothesis. See details.

method

Which model(s) to use during training. Default to knn.

control

Control parameters when fitting the methods. See trainControl

...

Other parameters passed to train

Details

See Lopez-Paz et .al for more background on those tests.

Value

A list containing the following components:

  • statistic the value of the test statistic.

  • p.value the p-value of the test.

References

Lopez-Paz, D., & Oquab, M. (2016). Revisiting Classifier Two-Sample Tests, 1–15. Retrieved from http://arxiv.org/abs/1610.06545

Examples

x <- matrix(c(runif(100, 0, 1),
               runif(100, -1, 1)),
             ncol = 2)
 y <- matrix(c(runif(100, 0, 3),
               runif(100, -1, 1)),
             ncol = 2)
 classifier_test(x, y)

Weighted KS Test

Description

Weighted Kolmogorov-Smirnov Two-Sample Test with threshold

Usage

ks_test(x, y, thresh = 0.05, w_x = rep(1, length(x)), w_y = rep(1, length(y)))

Arguments

x

Vector of values sampled from the first distribution

y

Vector of values sampled from the second distribution

thresh

The threshold needed to clear between the two cumulative distributions

w_x

The observation weights for x

w_y

The observation weights for y

Details

The usual Kolmogorov-Smirnov test for two vectors X and Y, of size m and n rely on the empirical cdfs ExE_x and EyE_y and the test statistic

D=supt(X,Y)Ex(x)Ey(x))D = sup_{t\in (X, Y)} |E_x(x) - E_y(x))

. This modified Kolmogorov-Smirnov test relies on two modifications.

  • Using observation weights for both vectors X and Y: Those weights are used in two places, while modifying the usual KS test. First, the empirical cdfs are updates to account for the weights. Secondly, the effective sample sizes are also modified. This is inspired from https://stackoverflow.com/a/55664242/13768995, using Monahan (2011).

  • Testing against a threshold: the test statistic is thresholded such that D=max(Dthresh,0)D = max(D - thresh, 0). Since 0D10\le D\le 1, the value of the threshold is also between 0 and 1, representing an effect size for the difference.

Value

A list with class "htest" containing the following components:

  • statistic the value of the test statistic.

  • p.value the p-value of the test.

  • alternative a character string describing the alternative hypothesis.

  • method a character string indicating what type of test was performed.

  • data.name a character string giving the name(s) of the data.

References

Monahan, J. (2011). Numerical Methods of Statistics (2nd ed., Cambridge Series in Statistical and Probabilistic Mathematics). Cambridge: Cambridge University Press. doi:10.1017/CBO9780511977176

Examples

x <- runif(100)
 y <- runif(100, min = .5, max = .5)
 ks_test(x, y, thresh = .001)

Perform the Maximum Mean Discrepancy unbiased bootstrap test

Description

Maximum Mean Discrepancy Unbiased Test

Usage

mmd_test(
  x,
  y,
  kernel = "rbfdot",
  type = ifelse(min(nrow(x), nrow(y)) < 1000, "unbiased", "linear"),
  null = c("permutation", "exact"),
  iterations = 10^3,
  frac = 1,
  ...
)

Arguments

x

d-dimensional samples from the first distribution

y

d-dimensional samples from the first distribution

kernel

A character that must match a known kernel. See details.

type

Which statistic to use. One of 'unbiased' or 'linear'. See Gretton et al for details. Default to 'unbiased' if the two vectors are of length less than 1000 and to 'linear' otherwise.

null

How to asses the null distribution. This can only be set to exact if the type is 'unbiased' and the kernel is 'rbf'.

iterations

How many iterations to do to simulate the null distribution. Default to 10^4. Only used if null is 'permutations'

frac

For the linear statistic, how many points to sample. See details.

...

Further arguments passed to kernel functions

Details

This computes the MMD^2u unbiased statistic or the MMDl linear statistic from Gretton et al. The code relies on the pairwise_kernel function from the python module sklearn. To list the available kernels, see the examples.

Value

A list containing the following components:

  • statistic the value of the test statistic.

  • p.value the p-value of the test.

References

Gretton, A., Borgwardt, K., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A Kernel Two-Sample Test Journal of Machine Learning Research (2012)

Examples

x <- matrix(rnorm(1000, 0, 1), ncol = 10)
y <- matrix(rnorm(1000, 0, 2), ncol = 10)
mmd_test(x, y)
mmd_test(x, y, type = "linear")
x <- matrix(rnorm(1000, 0, 1), ncol = 10)
y <- matrix(rnorm(1000, 0, 1), ncol = 10)
 # Set iterations to small number for runtime
 # Increase for more accurate results
mmd_test(x, y, iterations = 10^2)

Stouffer

Description

Stouffer's Z-score method

Usage

stouffer_zscore(pvals, weights = rep(1, seq_along(pvals)), side = "two")

Arguments

pvals

A vector of p-values

weights

A vector of weights

side

How the p-values were generated. One of 'right', 'left' or 'two'.

Details

Given a set of i.i.d p-values and associated weights, it combines the p-values pip_i. Letting ϕ\phi be the standard normal cumulative distribution function and Zi=ϕ1(1pi)Z_i =\phi^{-1} (1-p_i), the meta-analysis Z-score is

Z=(wiZi)((wi)2)(1/2)Z = (\sum w_i Z_i) * (\sum (w_i)^2)^(-1/2)

Value

A list containing the following components:

  • statistic the value of the test statistic.

  • p.value the p-value of the test.

References

Samuel Andrew Stouffer. Adjustment during army life. Princeton University Press, 1949.

Examples

pvals <- runif(100, 0, 1)
 weights <- runif(100, 0, 1)
 stouffer_zscore(pvals, weights)

Permutation test based on Wasserstein distance

Description

Permutation test based on Wasserstein distance

Usage

wasserstein_permut(
  x,
  y,
  iterations = 10^4,
  fast = nrow(x) + nrow(y) > 10^3,
  S = NULL,
  ...
)

Arguments

x

Samples from the first distribution

y

Samples from the second distribution. Only used if x is a vector.

iterations

How many iterations to do to simulate the null distribution. Default to 10^4.

fast

If true, uses the subwasserstein approximate function. Default to true if there are more than 1,000 samples total.

S

Number of samples to use in approximate mode. Must be set if fast=TRUE. See subwasserstein.

...

Other parameters passed to wasserstein or wasserstein1d

Value

A list containing the following components:

  • statistic the Wasserstein distance between x and y.

  • p.value the p-value of the permutation test.

Examples

x <- matrix(c(runif(100, 0, 1),
               runif(100, -1, 1)),
             ncol = 2)
 y <- matrix(c(runif(100, 0, 3),
               runif(100, -1, 1)),
             ncol = 2)
 # Set iterations to small number for runtime
 # Increase for more accurate results
 wasserstein_permut(x, y, iterations = 10^2)