Do not over-think about ‘outliers’, use a student-t distribution instead – Towards Data Science

A Student's t-distribution is essentially a Gaussian distribution with heavier tails. In other words, the Gaussian distribution can be seen as a special (limiting) case of the Student's t-distribution. The Gaussian distribution is defined by the mean (μ) and the standard deviation (σ). The Student's t-distribution adds a third parameter, the degrees of freedom (df), which controls the thickness of the tails: the lower the df, the more probability is assigned to events far from the mean. This feature is particularly useful for small sample sizes, such as those common in biomedicine, where the assumption of normality is questionable. Note that as the degrees of freedom increase, the Student's t-distribution approaches the Gaussian distribution. We can visualize this using density plots:
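The same comparison can also be checked numerically. The following is a minimal Python sketch (standard library only, and not the R code behind the figure) that evaluates both densities at the center and at a point four scale units away. The function names and the chosen df values are illustrative assumptions.

```python
import math

def student_t_pdf(x, df, mu=0.0, sigma=1.0):
    """Density of a location-scale Student's t-distribution at x."""
    z = (x - mu) / sigma
    # Log of the normalizing constant, computed with lgamma for stability.
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi) - math.log(sigma))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(z * z / df))

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Gaussian distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Illustrative df values (an assumption, not taken from the article).
for df in (1, 3, 10, 100):
    print(f"df={df:>3}  center={student_t_pdf(0.0, df):.4f}  "
          f"tail(x=4)={student_t_pdf(4.0, df):.6f}")
print(f"normal  center={normal_pdf(0.0):.4f}  tail(x=4)={normal_pdf(4.0):.6f}")
```

With small df the density at the center is lower and the density in the tail is higher than for the Gaussian; as df grows, both values move toward the Gaussian ones.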

Note in Figure 1 that the hill around the mean gets smaller as the degrees of freedom decrease, because probability mass moves to the tails, which become thicker. This property is what gives the Student's t-distribution its reduced sensitivity to outliers. For more details on this matter, you can check this blog.
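To make the reduced sensitivity concrete, here is a small Python sketch under strong simplifying assumptions: df and the scale are fixed by hand rather than estimated, the numbers are made up for illustration (they are not the rotarod data), and the location is found with a simple iterative reweighting scheme rather than the sampling a Bayesian package would perform. The helper names `t_weights` and `t_location` are hypothetical.

```python
def t_weights(xs, mu, df, sigma):
    """Weight each observation receives under a Student's t likelihood."""
    # Observations far from mu (in units of sigma) get weights close to zero.
    return [(df + 1) / (df + ((x - mu) / sigma) ** 2) for x in xs]

def t_location(xs, df, sigma, n_iter=200):
    """Maximum-likelihood location for fixed df and sigma (iterative reweighting)."""
    mu = sorted(xs)[len(xs) // 2]  # start from a middle value
    for _ in range(n_iter):
        w = t_weights(xs, mu, df, sigma)
        mu = sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)
    return mu

# Made-up values for illustration only; the last one is an extreme observation.
values = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0]

sample_mean = sum(values) / len(values)
robust_location = t_location(values, df=3, sigma=1.0)
outlier_weight = t_weights(values, robust_location, df=3, sigma=1.0)[-1]

print(f"sample mean:        {sample_mean:.2f}")
print(f"t-based location:   {robust_location:.2f}")
print(f"weight of outlier:  {outlier_weight:.3f}")
```

Under a Gaussian likelihood every observation carries the same weight, so the estimate is the plain mean and the extreme value pulls it upward. Under the t likelihood the extreme value is strongly down-weighted, and the location stays close to the bulk of the data.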

We load the required libraries:

So, let's skip data simulations and get serious. We'll work with real data I have acquired from mice performing the rotarod test.

First, we load the dataset into our environment and set the corresponding factor levels. The dataset contains IDs for the animals, a grouping variable (Genotype), an indicator for the two different days on which the test was performed (day), and different trials for the same day. For this article, we model only one of the trials (Trial3). We will save the other trials for a future article on modeling variation.

As the data handling implies, our modeling strategy will be based on Genotype and Day as categorical predictors of the distribution of Trial3.

In biomedical science, categorical predictors, or grouping factors, are more common than continuous predictors. Scientists in this field like to divide their samples into groups or conditions and apply different treatments.

Let's take an initial look at the data using raincloud plots, as shown by Guilherme A. Franchi, PhD in this great blog post.

Figure 2 looks different from the original by Guilherme A. Franchi, PhD because we are plotting two factors instead of one. However, the nature of the plot is the same. Pay attention to the red dots: these can be considered extreme observations that pull the measures of central tendency (especially the mean) in one direction. We also observe that the variances differ between groups, so also modeling sigma can give better estimates. Our task now is to model the output using the brms package.
