Title: | Equality of 2 (or k) Continuous Univariate and Multivariate Distributions |
---|---|
Description: | We implement (or re-implements in R) a variety of statistical tools. They are focused on non-parametric two-sample (or k-sample) distribution comparisons in the univariate or multivariate case. See the vignette for more info. |
Authors: | Hector Roux de Bezieux [aut, cre] |
Maintainer: | Hector Roux de Bezieux <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.9.2 |
Built: | 2024-12-24 04:14:20 UTC |
Source: | https://github.com/hectorrdb/ecume |
Classifier k-sample test
classifier_test( x, y, split = 0.7, thresh = 0, method = "knn", control = caret::trainControl(method = "cv"), ... )
classifier_test( x, y, split = 0.7, thresh = 0, method = "knn", control = caret::trainControl(method = "cv"), ... )
x |
Samples from the first distribution or a list of samples from k distribution |
y |
Samples from the second distribution. Only used if x is a vector. |
split |
How to split the data between training and test. Default to .7 |
thresh |
Value to add to the null hypothesis. See details. |
method |
Which model(s) to use during training. Default to knn. |
control |
Control parameters when fitting the methods. See trainControl |
... |
Other parameters passed to train |
See Lopez-Paz et .al for more background on those tests.
A list containing the following components:
statistic the value of the test statistic.
p.value the p-value of the test.
Lopez-Paz, D., & Oquab, M. (2016). Revisiting Classifier Two-Sample Tests, 1–15. Retrieved from http://arxiv.org/abs/1610.06545
x <- matrix(c(runif(100, 0, 1), runif(100, -1, 1)), ncol = 2) y <- matrix(c(runif(100, 0, 3), runif(100, -1, 1)), ncol = 2) classifier_test(x, y)
x <- matrix(c(runif(100, 0, 1), runif(100, -1, 1)), ncol = 2) y <- matrix(c(runif(100, 0, 3), runif(100, -1, 1)), ncol = 2) classifier_test(x, y)
Weighted Kolmogorov-Smirnov Two-Sample Test with threshold
ks_test(x, y, thresh = 0.05, w_x = rep(1, length(x)), w_y = rep(1, length(y)))
ks_test(x, y, thresh = 0.05, w_x = rep(1, length(x)), w_y = rep(1, length(y)))
x |
Vector of values sampled from the first distribution |
y |
Vector of values sampled from the second distribution |
thresh |
The threshold needed to clear between the two cumulative distributions |
w_x |
The observation weights for x |
w_y |
The observation weights for y |
The usual Kolmogorov-Smirnov test for two vectors X and Y, of size m
and n rely on the empirical cdfs and
and the test statistic
. This modified Kolmogorov-Smirnov test relies on two modifications.
Using observation weights for both vectors X and Y: Those weights are used in two places, while modifying the usual KS test. First, the empirical cdfs are updates to account for the weights. Secondly, the effective sample sizes are also modified. This is inspired from https://stackoverflow.com/a/55664242/13768995, using Monahan (2011).
Testing against a threshold: the test statistic is thresholded such
that . Since
, the value of
the threshold is also between 0 and 1, representing an effect size for the
difference.
A list with class "htest"
containing the following components:
statistic the value of the test statistic.
p.value the p-value of the test.
alternative a character string describing the alternative hypothesis.
method a character string indicating what type of test was performed.
data.name a character string giving the name(s) of the data.
Monahan, J. (2011). Numerical Methods of Statistics (2nd ed., Cambridge Series in Statistical and Probabilistic Mathematics). Cambridge: Cambridge University Press. doi:10.1017/CBO9780511977176
x <- runif(100) y <- runif(100, min = .5, max = .5) ks_test(x, y, thresh = .001)
x <- runif(100) y <- runif(100, min = .5, max = .5) ks_test(x, y, thresh = .001)
Maximum Mean Discrepancy Unbiased Test
mmd_test( x, y, kernel = "rbfdot", type = ifelse(min(nrow(x), nrow(y)) < 1000, "unbiased", "linear"), null = c("permutation", "exact"), iterations = 10^3, frac = 1, ... )
mmd_test( x, y, kernel = "rbfdot", type = ifelse(min(nrow(x), nrow(y)) < 1000, "unbiased", "linear"), null = c("permutation", "exact"), iterations = 10^3, frac = 1, ... )
x |
d-dimensional samples from the first distribution |
y |
d-dimensional samples from the first distribution |
kernel |
A character that must match a known kernel. See details. |
type |
Which statistic to use. One of 'unbiased' or 'linear'. See
Gretton et al for details. Default to 'unbiased' if the two vectors are of
length less than |
null |
How to asses the null distribution. This can only be set to exact
if the |
iterations |
How many iterations to do to simulate the null distribution.
Default to 10^4. Only used if |
frac |
For the linear statistic, how many points to sample. See details. |
... |
Further arguments passed to kernel functions |
This computes the MMD^2u unbiased statistic or the MMDl linear statistic from Gretton et al. The code relies on the pairwise_kernel function from the python module sklearn. To list the available kernels, see the examples.
A list containing the following components:
statistic the value of the test statistic.
p.value the p-value of the test.
Gretton, A., Borgwardt, K., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A Kernel Two-Sample Test Journal of Machine Learning Research (2012)
x <- matrix(rnorm(1000, 0, 1), ncol = 10) y <- matrix(rnorm(1000, 0, 2), ncol = 10) mmd_test(x, y) mmd_test(x, y, type = "linear") x <- matrix(rnorm(1000, 0, 1), ncol = 10) y <- matrix(rnorm(1000, 0, 1), ncol = 10) # Set iterations to small number for runtime # Increase for more accurate results mmd_test(x, y, iterations = 10^2)
x <- matrix(rnorm(1000, 0, 1), ncol = 10) y <- matrix(rnorm(1000, 0, 2), ncol = 10) mmd_test(x, y) mmd_test(x, y, type = "linear") x <- matrix(rnorm(1000, 0, 1), ncol = 10) y <- matrix(rnorm(1000, 0, 1), ncol = 10) # Set iterations to small number for runtime # Increase for more accurate results mmd_test(x, y, iterations = 10^2)
Stouffer's Z-score method
stouffer_zscore(pvals, weights = rep(1, seq_along(pvals)), side = "two")
stouffer_zscore(pvals, weights = rep(1, seq_along(pvals)), side = "two")
pvals |
A vector of p-values |
weights |
A vector of weights |
side |
How the p-values were generated. One of 'right', 'left' or 'two'. |
Given a set of i.i.d p-values and associated weights, it combines the
p-values . Letting
be the standard normal cumulative distribution function
and
, the meta-analysis Z-score is
A list containing the following components:
statistic the value of the test statistic.
p.value the p-value of the test.
Samuel Andrew Stouffer. Adjustment during army life. Princeton University Press, 1949.
pvals <- runif(100, 0, 1) weights <- runif(100, 0, 1) stouffer_zscore(pvals, weights)
pvals <- runif(100, 0, 1) weights <- runif(100, 0, 1) stouffer_zscore(pvals, weights)
Permutation test based on Wasserstein distance
wasserstein_permut( x, y, iterations = 10^4, fast = nrow(x) + nrow(y) > 10^3, S = NULL, ... )
wasserstein_permut( x, y, iterations = 10^4, fast = nrow(x) + nrow(y) > 10^3, S = NULL, ... )
x |
Samples from the first distribution |
y |
Samples from the second distribution. Only used if x is a vector. |
iterations |
How many iterations to do to simulate the null distribution. Default to 10^4. |
fast |
If true, uses the subwasserstein approximate function. Default to true if there are more than 1,000 samples total. |
S |
Number of samples to use in approximate mode. Must be set if |
... |
Other parameters passed to wasserstein or wasserstein1d |
A list containing the following components:
statistic the Wasserstein distance between x and y.
p.value the p-value of the permutation test.
x <- matrix(c(runif(100, 0, 1), runif(100, -1, 1)), ncol = 2) y <- matrix(c(runif(100, 0, 3), runif(100, -1, 1)), ncol = 2) # Set iterations to small number for runtime # Increase for more accurate results wasserstein_permut(x, y, iterations = 10^2)
x <- matrix(c(runif(100, 0, 1), runif(100, -1, 1)), ncol = 2) y <- matrix(c(runif(100, 0, 3), runif(100, -1, 1)), ncol = 2) # Set iterations to small number for runtime # Increase for more accurate results wasserstein_permut(x, y, iterations = 10^2)