Vectorising chisq.test()
Writing fast code is actually about writing efficient code. Writing code in R doesn’t magically make electrons slower. The perceived slowness of higher-level languages like R and Python comes from inefficiencies at run time, i.e. an increased number of steps from start to finish. Code written in a programming language written in C must take more steps to execute than a program written in C itself (or some other equal or lower-level language). Luckily high-level languages have ways to access the underlying low-level language allowing users to utilise or write more efficient code.
Vectorise your code
A simple way to write more efficient code is to vectorise it. Vectorising code amounts to replacing loops or element-wise operations with low-level vector-based functions. The easiest way to vectorise your code is to replace functions in your program with existing low-level functions. This might involve replacing mean()
with rowMeans()
in R, replacing a loop-based element-wise product with numpy.multiple()
in Python, or doing some matrix math. If this fails, one can always drop into the world of C though Rcpp or the Python/C API.
An Example in R
Below is the is an example of how I vectorised the chisq.test
function in R. A friend wanted a faster way to compute 1000’s of Chi-square tests. The traditional approach might be to do this in parallel, i.e. batching the tests and running them on multiple cores. However, we don’t always have access to a HPC, so I chose to vectorise the function.
If you run View(chisq.test)
in the R console you’ll see the source code of the original chisq.test method. A simple way to speed things up (that doesn’t have anything to do with vectorisation) is to strip out the stuff you don’t need. Functions are written more generally than is required. If we know precisely how our function is going to be used we can remove a lot of the pointless checks included for the general use case. I want a Chi-squared test for given probabilities with a p-value, all other code can be removed.
> chisq.slim.test <- function (x, p) {
n <- sum(x)
E <- n * p
STATISTIC <- sum((x - E)^2/E)
PARAMETER <- length(x) - 1
PVAL <- pchisq(STATISTIC, PARAMETER, lower.tail = FALSE)
structure(
list(statistic = STATISTIC, parameter = PARAMETER, p.value = PVAL))
}
The chisq.slim.test
function above performs a chi-squared test for given probabilities and returns a p-value. It is 120 lines shorter than the original chisq.test
and runs about five times faster. Though it is faster we have lost all the safety checks in the original function, so we need to be sure we are using it correctly.
> library("microbenchmark")
> microbenchmark(
chisq.slim.test(c(100, 0, 100), p=rep(1/3, 3)),
chisq.test(c(100, 0, 100), p=rep(1/3, 3))
)
mean (microseconds)
chisq.test 37.70792
chisq.slim.test 5.22655
To run this slim function 1000 times would require using something like apply()
, i.e. looping over every test and thus taking 1000 times longer. To improve the efficiency of our function, we can vectorise the code. Below is the vectorised version. Lets go through the changes.
> chisq.vec.test <- function(x, p) {
n <- matrix(rowSums(x))
E <- n %*% p
STATISTIC <- rowSums((x - E) ^ 2 / E)
PARAMETER <- ncol(x) - 1
PVAL <- pchisq(STATISTIC, PARAMETER, lower.tail = FALSE)
structure(
list(statistic = STATISTIC, parameter = PARAMETER, p.value = PVAL))
}
Our function now expects x
to be a data frame and p
a vector of probabilities equal length to the number of columns of x
. We don’t have any checks for the inputs as more checks are more steps. The responsibility falls onto us to make sure we are using the function correctly. An example input might look like:
> chisq.vec.test(
x = data.frame(c(100,50,0), c(100,100,50), c(0,50,50)),
p = rep(1/3, 3)
)
We have three independent tests (rows of x
) each with three observations (columns of x
), and we expect the proportion of each observation to be one 3rd (p
). The chi-square function is performed on each row of the data frame x
. To start, we replace sum()
with rowSums()
to produce a one column matrix of the sums of each row. To get the product of each row (observed counts) and the expected proportion, i.e. the expected frequency, we use some matrix math. Matrix multiplication in R uses the infix operator %*%
. The line E <- n %*% p
is the equivalent of multiplying the sum of each row by each expected proportions, resulting in a 3x3 matrix of expected frequencies.
With a little ingenuity a lot of problems can be cast into a matrix multiplication form. This is generally quite efficient relative to alternatives. - R inferno, page 23
The rest of the function is straight forward, we again substitute sum()
with rowSums()
and do more matrix math to calculate the test statistic. The function pchisq
is a wrapper around a vectorised C function, and so it can take a matrix as an input.
To test the chisq.vec.test
function let’s compare it to running chisq.test
1000 times.
> df <- data.frame(
sample(1:100, 10000, replace = T),
sample(1:100, 10000, replace = T),
sample(1:100, 10000, replace = T)
)
> p <-rep(1/3, 3)
> microbenchmark(
res <- apply(df, 1, function(x) chisq.test(x = x, p = p)),
vec_res <- chisq.vec.test(x = df,p = p)
)
mean (milliseconds)
chisq.test 398.860303
chisq.vec.test 5.453677
> all(res$p.value == vec_res$p.value)
[1] TRUE
The chisq.vec.test
function runs about 80 times faster than chisq.test
and produces the same result.
Conclusion
Vectorisation is the way to go if you want a simple method to make your code more efficient. For further reading, check out the links below.
Comments