Appendix A — Pseudo-log transformations

A.1 Introduction

This supplemental section illustrates how pseudo-log transformations can be used to transform skewed distributions towards normality.

The transformation $f(x) = sinh^-1(x/2\sigma)/log(10)$ is a pseudo-logarithmic transformation mentioned by (Johnson 1949). It has the following advantages over ordinary logarithmic transformations:

it is defined also for $x=$ as $f(0) = 0$
it is a signed logarithmic transformation, and is defined also for negative values as $f(-|x|)=-f(|x|)$

Of course, this comes at the cost of deviation from the logarithmic transformation in terms of interpretability.

The parameter $\sigma$ may be used to adapt the transformation to a specific range of an empirical distribution.

We define the pseudo-logarithmic transformation in R as:

Code

pseudo_log <-
  function(x, sigma = 1, base = 10)
    asinh(x / (2 * sigma)) / log(base)

inv_pseudo_log <-
  function(x, sigma = 1, base = 10)
    2 * sigma * sinh(x * log(base))

A.2 Impact of choice of $\sigma$

Next, we investigate how the parameter $\sigma$ impacts the result of the transformation. We assume that $x$ follows a log-normal distribution, and we will show results of $f(x)$ with different choices of $\sigma$. We center and scale $f(x)$ before plotting.

Code

p <- seq(0.001, 0.999, 0.001)
x <- exp(qnorm(p, mean = 0, sd = 1))

hist(x)

Code

y <-
  cbind(
    log(x),
    scale(pseudo_log(x, 1)),
    scale(pseudo_log(x, 0.5)),
    scale(pseudo_log(x, 2)),
    scale(pseudo_log(x, 0.05))
  )

plot(x,
     y[, 2],
     type = "l",
     ylab = "f(x)",
     ylim = range(y))
lines(x, y[, 1], col = "red")
lines(x, y[, 3], type = "l", lty = 2)
lines(x, y[, 4], type = "l", lty = 3)
lines(x, y[, 5], type = "l", lty = 4)
legend(
  "topleft",
  lty = c(1, 1, 2, 3, 4),
  col = c("red", "black", "black", "black", "black"),
  legend = c("log", "sigma=1", "sigma=0.5", "sigma=2", "sigma=0.05")
)

In the next code chunk, we multiply by 10 and repeat the exercise. We learn that the transformations become more similar and the choice of $\sigma$ less relevant.

Code

x <- x * 10

y <-
  cbind(
    scale(log(x)),
    scale(pseudo_log(x, 1)),
    scale(pseudo_log(x, 0.5)),
    scale(pseudo_log(x, 2)),
    scale(pseudo_log(x, 0.05))
  )


plot(x, y[, 2], type = "l", ylab = "f(x)")
lines(x, y[, 1], col = "red")
lines(x, y[, 3], type = "l", lty = 2)
lines(x, y[, 4], type = "l", lty = 3)
lines(x, y[, 5], type = "l", lty = 4)
legend(
  "bottomright",
  lty = c(1, 1, 2, 3, 4),
  col = c("red", "black", "black", "black", "black"),
  legend = c("log", "sigma=1", "sigma=0.5", "sigma=2", "sigma=0.05")
)

Finally, we apply the pseudo-logarithmic transformation to the original normal deviates. We learn that a higher value for the parameter $\sigma$ makes the distribution ‘slimmer’ while a lower value makes it ‘fatter’, and it is even possible to induce bimodality with low values of sigma:

Code

z <- qnorm(p, mean = 0, sd = 1)
hist(z)
hist(pseudo_log(z, sigma = 1))
hist(pseudo_log(z, sigma = 2))
hist(pseudo_log(z, sigma = 0.5))
hist(pseudo_log(z, sigma = 0.05))

A.3 Finding a parameter that best achieves normality

Any test statistic for testing normality could be chosen to find a suitable value of $\sigma$ that induces normality into the transformed values. Here we use the (Pearson) correlation coefficient to compare the empirical distribution with normal deviates.

A.3.1 Deviates from log normal distribution

We simulate from a shifted log normal distribution and evaluate the value of $\sigma$ that optimizes agreement with a normal distribution:

Code

# #| layout-ncol: 2    #Remark. did not render nicely in docx (GH)

x <- sort(exp(rnorm(1000) + 3))

hist(x)

Code

sigmas <- 2 ** seq(-10, 10, 1)
origcor <- cor(qnorm((1:length(x) - 0.5) / length(x)), x)

ncorx <-
  sapply(sigmas, function(X)
    cor(qnorm((1:length(
      x
    ) - 0.5) / length(x)), pseudo_log(x, X)))

cat("Optimal sigma: ")

Optimal sigma:

Code

(optsigma <- sigmas[ncorx == max(ncorx)])

[1] 0.25

Code

plot(log2(sigmas), ncorx, ylab = "Correlation with normal deviates", ylim =
       c(0, 1))
points(log2(sigmas)[ncorx == max(ncorx)], max(ncorx), pch = 19)
abline(h = origcor)
legend(
  "bottomright",
  lty = c(1, NA, NA),
  pch = c(NA, 1, 19),
  legend = c("Original", "Pseudo-log", "Pseudo-log with optimal sigma")
)
box()

Code

hist(pseudo_log(x, optsigma))

A.3.2 Deviates from exponential distribution

Also with an exponential distribution, the pseudo-logarithm may achieve a better agreement with a normal:

Code

x <- sort(exp(rnorm(1000) + 3))

hist(x)

Code

sigmas <- 2 ** seq(-10, 10, 1)
origcor <- cor(qnorm((1:length(x) - 0.5) / length(x)), x)

ncorx <-
  sapply(sigmas, function(X)
    cor(qnorm((1:length(
      x
    ) - 0.5) / length(x)), pseudo_log(x, X)))

cat("Optimal sigma: ")

Optimal sigma:

Code

(optsigma <- sigmas[ncorx == max(ncorx)])

[1] 0.0009765625

Code

plot(log2(sigmas), ncorx, ylab = "Correlation with normal deviates", ylim =
       c(0, 1))
points(log2(sigmas)[ncorx == max(ncorx)], max(ncorx), pch = 19)
abline(h = origcor)
legend(
  "bottomright",
  lty = c(1, NA, NA),
  pch = c(NA, 1, 19),
  legend = c("Original", "Pseudo-log", "Pseudo-log with optimal sigma")
)
box()

Code

hist(pseudo_log(x, optsigma))

A.3.3 Deviates from mixture distribution

Now we mix a normal, lognormal and exponential distribution:

Code

x1 <- scale(rnorm(1000))
x2 <- scale(rexp(1000))
x3 <- scale(exp(rnorm(1000)))

p1 <- rbinom(1000, 1, 0.33)
p2 <- rbinom(1000, 1, 0.5)

x <- p1 * x1 + (1 - p1) * (p2 * x2 + (1 - p2) * x3)
x <- x - min(x)
x <- sort(x)

hist(x)

Code

sigmas <- 2 ** seq(-10, 10, 1)
origcor <- cor(qnorm((1:length(x) - 0.5) / length(x)), x)

ncorx <-
  sapply(sigmas, function(X)
    cor(qnorm((1:length(
      x
    ) - 0.5) / length(x)), pseudo_log(x, X)))

cat("Optimal sigma: ")

Optimal sigma:

Code

(optsigma <- sigmas[ncorx == max(ncorx)])

[1] 1

Code

plot(log2(sigmas), ncorx, ylab = "Correlation with normal deviates", ylim =
       c(0, 1))
points(log2(sigmas)[ncorx == max(ncorx)], max(ncorx), pch = 19)
abline(h = origcor)
legend(
  "bottomright",
  lty = c(1, NA, NA),
  pch = c(NA, 1, 19),
  legend = c("Original", "Pseudo-log", "Pseudo-log with optimal sigma")
)
box()

Code

hist(pseudo_log(x, optsigma))

A.3.4 Normal deviates

With simulated normal deviates, the pseudo-logarithm cannot improve the already perfect normality.

Code

x <- sort(rnorm(1000))

hist(x)

Code

sigmas <- 2 ** seq(-10, 10, 1)
origcor <- cor(qnorm((1:length(x) - 0.5) / length(x)), x)

ncorx <-
  sapply(sigmas, function(X)
    cor(qnorm((1:length(
      x
    ) - 0.5) / length(x)), pseudo_log(x, X)))

cat("Optimal sigma: ")

Optimal sigma:

Code

(optsigma <- sigmas[ncorx == max(ncorx)])

[1] 4

Code

plot(log2(sigmas), ncorx, ylab = "Correlation with normal deviates", ylim =
       c(0, 1))
points(log2(sigmas)[ncorx == max(ncorx)], max(ncorx), pch = 19)
legend(
  "bottomright",
  lty = c(1, NA, NA),
  pch = c(NA, 1, 19),
  legend = c("Original", "Pseudo-log", "Pseudo-log with optimal sigma")
)
abline(h = origcor)
box()

Code

hist(pseudo_log(x, optsigma))

# Pseudo-log transformations {.appendix} ```{r pl-setup, message=FALSE, warning=FALSE, echo=FALSE, fig.height=3} knitr::opts_chunk$set(echo = TRUE) ``` ## Introduction This supplemental section illustrates how pseudo-log transformations can be used to transform skewed distributions towards normality. The transformation $f(x) = sinh^-1(x/2\sigma)/log(10)$ is a pseudo-logarithmic transformation mentioned by [@Johnson49]. It has the following advantages over ordinary logarithmic transformations: * it is defined also for $x=$ as $f(0) = 0$ * it is a signed logarithmic transformation, and is defined also for negative values as $f(-|x|)=-f(|x|)$ Of course, this comes at the cost of deviation from the logarithmic transformation in terms of interpretability. The parameter $\sigma$ may be used to adapt the transformation to a specific range of an empirical distribution. We define the pseudo-logarithmic transformation in R as: ```{r pl02, message=FALSE, warning=FALSE, echo=TRUE, fig.height=3} pseudo_log <- function(x, sigma = 1, base = 10) asinh(x / (2 * sigma)) / log(base) inv_pseudo_log <- function(x, sigma = 1, base = 10) 2 * sigma * sinh(x * log(base)) ``` ## Impact of choice of $\sigma$ Next, we investigate how the parameter $\sigma$ impacts the result of the transformation. We assume that $x$ follows a log-normal distribution, and we will show results of $f(x)$ with different choices of $\sigma$. We center and scale $f(x)$ before plotting. ```{r pl03, message=FALSE, warning=FALSE, echo=TRUE, fig.height=4} p <- seq(0.001, 0.999, 0.001) x <- exp(qnorm(p, mean = 0, sd = 1)) hist(x) ``` ```{r pl03b, message=FALSE, warning=FALSE, echo=TRUE, fig.height=8} y <- cbind( log(x), scale(pseudo_log(x, 1)), scale(pseudo_log(x, 0.5)), scale(pseudo_log(x, 2)), scale(pseudo_log(x, 0.05)) ) plot(x, y[, 2], type = "l", ylab = "f(x)", ylim = range(y)) lines(x, y[, 1], col = "red") lines(x, y[, 3], type = "l", lty = 2) lines(x, y[, 4], type = "l", lty = 3) lines(x, y[, 5], type = "l", lty = 4) legend( "topleft", lty = c(1, 1, 2, 3, 4), col = c("red", "black", "black", "black", "black"), legend = c("log", "sigma=1", "sigma=0.5", "sigma=2", "sigma=0.05") ) ``` In the next code chunk, we multiply by 10 and repeat the exercise. We learn that the transformations become more similar and the choice of $\sigma$ less relevant. ```{r pl04, message=FALSE, warning=FALSE, echo=TRUE, fig.height=5} x <- x * 10 y <- cbind( scale(log(x)), scale(pseudo_log(x, 1)), scale(pseudo_log(x, 0.5)), scale(pseudo_log(x, 2)), scale(pseudo_log(x, 0.05)) ) plot(x, y[, 2], type = "l", ylab = "f(x)") lines(x, y[, 1], col = "red") lines(x, y[, 3], type = "l", lty = 2) lines(x, y[, 4], type = "l", lty = 3) lines(x, y[, 5], type = "l", lty = 4) legend( "bottomright", lty = c(1, 1, 2, 3, 4), col = c("red", "black", "black", "black", "black"), legend = c("log", "sigma=1", "sigma=0.5", "sigma=2", "sigma=0.05") ) ``` Finally, we apply the pseudo-logarithmic transformation to the original normal deviates. We learn that a higher value for the parameter $\sigma$ makes the distribution 'slimmer' while a lower value makes it 'fatter', and it is even possible to induce bimodality with low values of sigma: ```{r pl05, message=FALSE, warning=FALSE, echo=TRUE, fig.height=5} #| layout-ncol: 3 z <- qnorm(p, mean = 0, sd = 1) hist(z) hist(pseudo_log(z, sigma = 1)) hist(pseudo_log(z, sigma = 2)) hist(pseudo_log(z, sigma = 0.5)) hist(pseudo_log(z, sigma = 0.05)) ``` ## Finding a parameter that best achieves normality Any test statistic for testing normality could be chosen to find a suitable value of $\sigma$ that induces normality into the transformed values. Here we use the (Pearson) correlation coefficient to compare the empirical distribution with normal deviates. ### Deviates from log normal distribution We simulate from a shifted log normal distribution and evaluate the value of $\sigma$ that optimizes agreement with a normal distribution: ```{r pl06, message=FALSE, warning=FALSE, echo=TRUE, fig.height=3} # #| layout-ncol: 2 #Remark. did not render nicely in docx (GH) x <- sort(exp(rnorm(1000) + 3)) hist(x) sigmas <- 2 ** seq(-10, 10, 1) origcor <- cor(qnorm((1:length(x) - 0.5) / length(x)), x) ncorx <- sapply(sigmas, function(X) cor(qnorm((1:length( x ) - 0.5) / length(x)), pseudo_log(x, X))) cat("Optimal sigma: ") (optsigma <- sigmas[ncorx == max(ncorx)]) plot(log2(sigmas), ncorx, ylab = "Correlation with normal deviates", ylim = c(0, 1)) points(log2(sigmas)[ncorx == max(ncorx)], max(ncorx), pch = 19) abline(h = origcor) legend( "bottomright", lty = c(1, NA, NA), pch = c(NA, 1, 19), legend = c("Original", "Pseudo-log", "Pseudo-log with optimal sigma") ) box() hist(pseudo_log(x, optsigma)) ``` ### Deviates from exponential distribution Also with an exponential distribution, the pseudo-logarithm may achieve a better agreement with a normal: ```{r pl07, message=FALSE, warning=FALSE, echo=TRUE, fig.height=5} x <- sort(exp(rnorm(1000) + 3)) hist(x) sigmas <- 2 ** seq(-10, 10, 1) origcor <- cor(qnorm((1:length(x) - 0.5) / length(x)), x) ncorx <- sapply(sigmas, function(X) cor(qnorm((1:length( x ) - 0.5) / length(x)), pseudo_log(x, X))) cat("Optimal sigma: ") (optsigma <- sigmas[ncorx == max(ncorx)]) plot(log2(sigmas), ncorx, ylab = "Correlation with normal deviates", ylim = c(0, 1)) points(log2(sigmas)[ncorx == max(ncorx)], max(ncorx), pch = 19) abline(h = origcor) legend( "bottomright", lty = c(1, NA, NA), pch = c(NA, 1, 19), legend = c("Original", "Pseudo-log", "Pseudo-log with optimal sigma") ) box() hist(pseudo_log(x, optsigma)) ``` ### Deviates from mixture distribution Now we mix a normal, lognormal and exponential distribution: ```{r pl08, message=FALSE, warning=FALSE, echo=TRUE, fig.height=5} x1 <- scale(rnorm(1000)) x2 <- scale(rexp(1000)) x3 <- scale(exp(rnorm(1000))) p1 <- rbinom(1000, 1, 0.33) p2 <- rbinom(1000, 1, 0.5) x <- p1 * x1 + (1 - p1) * (p2 * x2 + (1 - p2) * x3) x <- x - min(x) x <- sort(x) hist(x) sigmas <- 2 ** seq(-10, 10, 1) origcor <- cor(qnorm((1:length(x) - 0.5) / length(x)), x) ncorx <- sapply(sigmas, function(X) cor(qnorm((1:length( x ) - 0.5) / length(x)), pseudo_log(x, X))) cat("Optimal sigma: ") (optsigma <- sigmas[ncorx == max(ncorx)]) plot(log2(sigmas), ncorx, ylab = "Correlation with normal deviates", ylim = c(0, 1)) points(log2(sigmas)[ncorx == max(ncorx)], max(ncorx), pch = 19) abline(h = origcor) legend( "bottomright", lty = c(1, NA, NA), pch = c(NA, 1, 19), legend = c("Original", "Pseudo-log", "Pseudo-log with optimal sigma") ) box() hist(pseudo_log(x, optsigma)) ``` ### Normal deviates With simulated normal deviates, the pseudo-logarithm cannot improve the already perfect normality. ```{r pl09, message=FALSE, warning=FALSE, echo=TRUE, fig.height=5} x <- sort(rnorm(1000)) hist(x) sigmas <- 2 ** seq(-10, 10, 1) origcor <- cor(qnorm((1:length(x) - 0.5) / length(x)), x) ncorx <- sapply(sigmas, function(X) cor(qnorm((1:length( x ) - 0.5) / length(x)), pseudo_log(x, X))) cat("Optimal sigma: ") (optsigma <- sigmas[ncorx == max(ncorx)]) plot(log2(sigmas), ncorx, ylab = "Correlation with normal deviates", ylim = c(0, 1)) points(log2(sigmas)[ncorx == max(ncorx)], max(ncorx), pch = 19) legend( "bottomright", lty = c(1, NA, NA), pch = c(NA, 1, 19), legend = c("Original", "Pseudo-log", "Pseudo-log with optimal sigma") ) abline(h = origcor) box() hist(pseudo_log(x, optsigma)) ```

A.1 Introduction

A.2 Impact of choice of \(\sigma\)

A.3 Finding a parameter that best achieves normality

A.3.1 Deviates from log normal distribution

A.3.2 Deviates from exponential distribution

A.3.3 Deviates from mixture distribution

A.3.4 Normal deviates