Vectorising chisq.test()

4 minute read

Writing fast code is actually about writing efficient code. Writing code in R doesn’t magically make electrons slower. The perceived slowness of higher-level languages like R and Python comes from inefficiencies at run time, i.e. an increased number of steps from start to finish. Code written in a programming language written in C must take more steps to execute than a program written in C itself (or some other equal or lower-level language). Luckily high-level languages have ways to access the underlying low-level language allowing users to utilise or write more efficient code.

Vectorise your code

A simple way to write more efficient code is to vectorise it. Vectorising code amounts to replacing loops or element-wise operations with low-level vector-based functions. The easiest way to vectorise your code is to replace functions in your program with existing low-level functions. This might involve replacing mean() with rowMeans() in R, replacing a loop-based element-wise product with numpy.multiple() in Python, or doing some matrix math. If this fails, one can always drop into the world of C though with Rcpp or the Python/C API.

An Example in R

Below is the is an example of how I vectorised the chisq.test function in R. I wanted a faster way to compute 1000’s of Chi-square tests. The traditional approach might be to do this in parallel, i.e. batching the tests and running them on multiple cores. However, we don’t always have access to a HPC, so I chose to vectorise the function.

If you run View(chisq.test) in the R console you’ll see the source code of the original chisq.test method. A simple way to speed things up (that doesn’t have anything to do with vectorisation) is to strip out the stuff you don’t need. Functions are written more generally than is required. If we know precisely how our function is going to be used we can remove a lot of the pointless checks included for the general use case. I want a Chi-squared test for given probabilities with a p-value, all other code can be removed.

> chisq.slim.test <- function (x, p) {
  n <- sum(x)
  E <- n * p
  STATISTIC <- sum((x - E)^2/E)
  PARAMETER <- length(x) - 1
  PVAL <- pchisq(STATISTIC, PARAMETER, lower.tail = FALSE)
  structure(
    list(statistic = STATISTIC, parameter = PARAMETER, p.value = PVAL))
}

The chisq.slim.test function above performs a chi-squared test for given probabilities and returns a p-value. It is 120 lines shorter than the original chisq.test and runs about five times faster. Though it is faster we have lost all the safety checks in the original function, so we need to be sure we are using it correctly.

> library("microbenchmark")
> microbenchmark(
  chisq.slim.test(c(100, 0, 100), p=rep(1/3, 3)),
  chisq.test(c(100, 0, 100), p=rep(1/3, 3))
)
                 mean (microseconds) 
chisq.test       37.70792 
chisq.slim.test  5.22655 

To run this slim function 1000 times would require using something like apply(), i.e. looping over every test and thus taking 1000 times longer. To improve the efficiency of our function, we can vectorise the code. Below is the vectorised version. Lets go through the changes.

> chisq.vec.test  <- function(x, p) {
  n <- matrix(rowSums(x))
  E <- n %*% p
  STATISTIC <- rowSums((x - E) ^ 2 / E)
  PARAMETER <- ncol(x) - 1
  PVAL <- pchisq(STATISTIC, PARAMETER, lower.tail = FALSE)
  structure(
    list(statistic = STATISTIC, parameter = PARAMETER, p.value = PVAL))
}

Our function now expects x to be a data frame and p a vector of probabilities equal length to the number of columns of x. We don’t have any checks for the inputs as more checks are more steps. The responsibility falls onto us to make sure we are using the function correctly. An example input might look like:

> chisq.vec.test(
 x = data.frame(c(100,50,0), c(100,100,50), c(0,50,50)),
 p = rep(1/3, 3)
)

We have three independent tests (rows of x) each with three observations (columns of x), and we expect the proportion of each observation to be one 3rd (p). The chi-square function is performed on each row of the data frame x. To start, we replace sum() with rowSums() to produce a one column matrix of the sums of each row. To get the product of each row (observed counts) and the expected proportion, i.e. the expected frequency, we use some matrix math. Matrix multiplication in R uses the infix operator %*%. The line E <- n %*% p is the equivalent of multiplying the sum of each row by each expected proportions, resulting in a 3x3 matrix of expected frequencies.

With a little ingenuity a lot of problems can be cast into a matrix multiplication form. This is generally quite efficient relative to alternatives. - R inferno, page 23

The rest of the function is straight forward, we again substitute sum() with rowSums() and do more matrix math to calculate the test statistic. The function pchisq is a wrapper around a vectorised C function, and so it can take a matrix as an input.

To test the chisq.vec.test function let’s compare it to running chisq.test 1000 times.

> df <- data.frame(
  sample(1:100, 10000, replace = T),
  sample(1:100, 10000, replace = T),
  sample(1:100, 10000, replace = T)
)
> p <-rep(1/3, 3)
> microbenchmark(
  res <- apply(df, 1, function(x) chisq.test(x = x, p = p)),
  vec_res <- chisq.vec.test(x = df,p = p)
)

                 mean (milliseconds)
chisq.test       398.860303
chisq.vec.test   5.453677

> all(res$p.value == vec_res$p.value)
[1] TRUE

The chisq.vec.test function runs about 80 times faster than chisq.test and produces the same result.

Conclusion

Vectorisation is the way to go if you want a simple method to make your code more efficient. For further reading, check out the links below.

Comments