By Nicolas Gambardella
Do you need to quickly normalize data but are bothered by null of negative values? You can use the Inverse hyperbolic sine, Arsinh, function instead of a simple log function. This approach also allows for treating differently small and high values. Arsinh is defined as:
Firstly, since x+sqrt(x²+1) is always strictly positive, arsinh is defined for all real values, contrary to log which is only defined for strictly positive numbers. Furthermore, as can easily be seen, for small values of x, the function tends to ln(x+1), something often used to overcome the 0 measurements. For large values of x, arsinh(x) progresses as log(x).
Let’s say we have a dataset that is quite noisy, with unevenly spread sampling, and that includes an unwanted baseline. Here is a made-up dataset:
To create it, we generated 1000 lognormal-distributed sampling values x. The variable value is equal to the sampling value, plus a random noise in which the standard deviation varies as the ratio of sqrt(x)/x (biological noise), plus a noisy constant technical baseline (5 plus a normal noise with SD=0.01) .
We are clever, and notice the background noise, so we subtract it:
Now, the first issue is that plenty of values are negative. In some cases, your normalization will fail. Sometimes, the normalization will proceed, ditching values (as R says, “Warning message: NaNs produced”). As can be seen below, there is a large area sparsely populated on the left, for low values of x.
If, on the contrary, we use arsinh, we rescue all those values.