Renormalizing data with Arsinh instead of log

By Nicolas Gambardella

Do you need to normalize data quickly but are bothered by null or negative values? You can use the Inverse hyperbolic sine, Arsinh, function instead of a simple log function. This approach also allows for treating differently small and high values. Arsinh is defined as:

Firstly, since x+sqrt(x²+1) is always strictly positive, arsinh is defined for all real values, contrary to log, which is only defined for strictly positive numbers. Furthermore, as can easily be seen, for small values of x, the function tends to ln(x+1), something often used to overcome the 0 measurements. For large values of x, arsinh(x) progresses as log(x).

Let’s say we have a dataset that is quite noisy, with unevenly spread sampling, and that includes an unwanted baseline. Here is a made-up dataset:

To create it, we generated 1000 lognormal-distributed sampling values x. The variable value is equal to the sampling value, plus a random noise in which the standard deviation varies as the ratio of sqrt(x)/x (biological noise), plus a noisy constant technical baseline (5 plus a normal noise with SD=0.01).

We are clever, and notice the background noise, so we subtract it:

Now, the first issue is that plenty of values are negative. In some cases, your normalization will fail. Sometimes, the normalization will proceed, ditching values (as R says, “Warning message: NaNs produced”). As can be seen below, there is a large area sparsely populated on the left for low values of x.

If, on the contrary, we use arsinh, we rescue all those values.

Arsinh is used, for instance, in flow cytometry and in mass spectrometry. It is one of the corrections used by the R package BestNormalize