# Vectorising chisq.test()

Writing fast code is actually about writing efficient code. Writing code in R doesn’t magically make electrons slower. The perceived slowness of higher-level languages like R and Python comes from inefficiencies at run time, i.e. an increased number of steps from start to finish. Code written in a programming language written in C must take more steps to execute than a program written in C itself (or some other equal or lower-level language). Luckily high-level languages have ways to access the underlying low-level language allowing users to utilise or write more efficient code.

## Vectorise your code

A simple way to write more efficient code is to vectorise it. Vectorising code amounts to replacing loops or element-wise operations with low-level vector-based functions. The easiest way to vectorise your code is to replace functions in your program with existing low-level functions. This might involve replacing `mean()`

with `rowMeans()`

in R, replacing a loop-based element-wise product with `numpy.multiple()`

in Python, or doing some matrix math. If this fails, one can always drop into the world of C though Rcpp or the Python/C API.

### An Example in R

Below is the is an example of how I vectorised the `chisq.test`

function in R. A friend wanted a faster way to compute 1000’s of Chi-square tests. The traditional approach might be to do this in parallel, i.e. batching the tests and running them on multiple cores. However, we don’t always have access to a HPC, so I chose to vectorise the function.

If you run `View(chisq.test)`

in the R console you’ll see the source code of the original chisq.test method. A simple way to speed things up (that doesn’t have anything to do with vectorisation) is to strip out the stuff you don’t need. Functions are written more generally than is required. If we know precisely how our function is going to be used we can remove a lot of the pointless checks included for the general use case. I want a Chi-squared test for given probabilities with a p-value, all other code can be removed.

```
> chisq.slim.test <- function (x, p) {
n <- sum(x)
E <- n * p
STATISTIC <- sum((x - E)^2/E)
PARAMETER <- length(x) - 1
PVAL <- pchisq(STATISTIC, PARAMETER, lower.tail = FALSE)
structure(
list(statistic = STATISTIC, parameter = PARAMETER, p.value = PVAL))
}
```

The `chisq.slim.test`

function above performs a chi-squared test for given probabilities and returns a p-value. It is 120 lines shorter than the original `chisq.test`

and runs about five times faster. Though it is faster we have lost all the safety checks in the original function, so we need to be sure we are using it correctly.

```
> library("microbenchmark")
> microbenchmark(
chisq.slim.test(c(100, 0, 100), p=rep(1/3, 3)),
chisq.test(c(100, 0, 100), p=rep(1/3, 3))
)
mean (microseconds)
chisq.test 37.70792
chisq.slim.test 5.22655
```

To run this slim function 1000 times would require using something like `apply()`

, i.e. looping over every test and thus taking 1000 times longer. To improve the efficiency of our function, we can vectorise the code. Below is the vectorised version. Lets go through the changes.

```
> chisq.vec.test <- function(x, p) {
n <- matrix(rowSums(x))
E <- n %*% p
STATISTIC <- rowSums((x - E) ^ 2 / E)
PARAMETER <- ncol(x) - 1
PVAL <- pchisq(STATISTIC, PARAMETER, lower.tail = FALSE)
structure(
list(statistic = STATISTIC, parameter = PARAMETER, p.value = PVAL))
}
```

Our function now expects `x`

to be a data frame and `p`

a vector of probabilities equal length to the number of columns of `x`

. We don’t have any checks for the inputs as more checks are more steps. The responsibility falls onto us to make sure we are using the function correctly. An example input might look like:

```
> chisq.vec.test(
x = data.frame(c(100,50,0), c(100,100,50), c(0,50,50)),
p = rep(1/3, 3)
)
```

We have three independent tests (rows of `x`

) each with three observations (columns of `x`

), and we expect the proportion of each observation to be one 3rd (`p`

). The chi-square function is performed on each row of the data frame `x`

. To start, we replace `sum()`

with `rowSums()`

to produce a one column matrix of the sums of each row. To get the product of each row (observed counts) and the expected proportion, i.e. the expected frequency, we use some matrix math. Matrix multiplication in R uses the infix operator `%*%`

. The line `E <- n %*% p`

is the equivalent of multiplying the sum of each row by each expected proportions, resulting in a 3x3 matrix of expected frequencies.

With a little ingenuity a lot of problems can be cast into a matrix multiplication form. This is generally quite efficient relative to alternatives. - R inferno, page 23

The rest of the function is straight forward, we again substitute `sum()`

with `rowSums()`

and do more matrix math to calculate the test statistic. The function `pchisq`

is a wrapper around a vectorised C function, and so it can take a matrix as an input.

To test the `chisq.vec.test`

function let’s compare it to running `chisq.test`

1000 times.

```
> df <- data.frame(
sample(1:100, 10000, replace = T),
sample(1:100, 10000, replace = T),
sample(1:100, 10000, replace = T)
)
> p <-rep(1/3, 3)
> microbenchmark(
res <- apply(df, 1, function(x) chisq.test(x = x, p = p)),
vec_res <- chisq.vec.test(x = df,p = p)
)
mean (milliseconds)
chisq.test 398.860303
chisq.vec.test 5.453677
> all(res$p.value == vec_res$p.value)
[1] TRUE
```

The `chisq.vec.test`

function runs about 80 times faster than `chisq.test`

and produces the same result.

## Conclusion

Vectorisation is the way to go if you want a simple method to make your code more efficient. For further reading, check out the links below.

## Comments