The essential role of statistical thinking in animal ethics: dealing with reduction


Dr. Vanessa Cave

10 May 2022

Having spent over 15 years working as an applied statistician in the biosciences, I’ve come across my fair share of animal studies. And one of my greatest bugbears is that the full value is rarely extracted from the experimental data collected. This could be because the best statistical approaches haven’t been employed to analyse the data, the findings are selectively or incorrectly reported, other research programmes that could benefit from the data don’t have access to it, or the data aren’t re-analysed following the advent of new statistical methods or tools that have the potential to draw greater insights from them.


An enormous number of scientific research studies involve animals, and with this come many ethical issues and concerns. To help ensure high standards of animal welfare in scientific research, many governments, universities, R&D companies, and individual scientists have adopted the principles of the 3Rs: Replacement, Reduction and Refinement. Indeed, in many countries the tenets of the 3Rs are enshrined in legislation and regulations around the use of animals in scientific research.

Replacement

Use methods or technologies that replace or avoid the use of animals.

Reduction

Limit the number of animals used.

Refinement

Refine methods in order to minimise or eliminate negative animal welfare impacts.

In this blog, I’ll focus on the second principle, Reduction, and argue that statistical expertise is absolutely crucial for achieving reduction.

The aim of reduction is to minimise the number of animals used in scientific research whilst balancing against any additional adverse animal welfare impacts and without compromising the scientific value of the research. This principle demands that before carrying out an experiment (or survey) involving animals, the researchers must consider and implement approaches that both:

  1. Minimise their current animal use – the researchers must consider how to minimise the number of animals in their experiment whilst ensuring sufficient data are obtained to answer their research questions, and
  2. Minimise future animal use – the researchers need to consider how to maximise the information obtained from their experiment in order to potentially limit, or avoid, the subsequent use of additional animals in future research.

Both these considerations involve statistical thinking. Let’s begin by exploring the important role statistics plays in minimising current animal use.

Statistical aspects to minimise current animal use

Reduction requires that any experiment (or survey) carried out must use as few animals as possible. However, with too few animals the study will lack the statistical power to draw meaningful conclusions, ultimately wasting animals. But how do we determine how many animals are needed for a sufficiently powered experiment? The necessary starting point is to establish clearly defined, specific research questions. These can then be formulated into appropriate statistical hypotheses, for which an experiment (or survey) can be designed. 

Statistical expertise in experimental design plays a pivotal role in ensuring enough of the right type of data are collected to answer the research questions as objectively and as efficiently as possible. For example, sophisticated experimental designs involving blocking can be used to reduce random variation, making the experiment more efficient (i.e., achieving greater statistical power with fewer animals) as well as guarding against bias. Once a suitable experimental design has been decided upon, a power analysis can be used to calculate the required number of animals (i.e., determine the sample size). Indeed, a power analysis is typically needed to obtain animal ethics approval - a formal process in which the benefits of the proposed research are weighed up against the likely harm to the animals.
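
To make this concrete, here is a minimal power-analysis sketch in R using the built-in power.t.test function; the effect size, standard deviation and target power below are illustrative assumptions of my own, not values from any particular study.

# Illustrative sample-size calculation for a two-group comparison:
# how many animals per group are needed to detect a difference of 2 units,
# assuming a standard deviation of 2.5, a 5% significance level and 80% power?
power.t.test(delta = 2, sd = 2.5, sig.level = 0.05, power = 0.80)
# The output reports n, the required number of animals per group
# (rounded up to the next whole animal in practice).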

Researchers also need to investigate whether pre-existing sources of information or data could be integrated into their study, enabling them to reduce the number of animals required. For example, by means of a meta-analysis. At the extreme end, data relevant to the research questions may already be available, eradicating the need for an experiment altogether! 

Statistical aspects to minimise future animal use: doing it right the first time

An obvious mechanism for minimising future animal use is to ensure we do it right the first time, avoiding the need for additional experiments. This is easier said than done; there are many statistical and practical considerations at work here. The following paragraphs cover four important steps in experimental research in which statistical expertise plays a major role: data acquisition, data management, data analysis and inference.

Above, I alluded to the validity of the experimental design. If the design is flawed, the data collected will be compromised, if not essentially worthless. Two common mistakes to avoid are pseudo-replication and the lack of (or poor) randomisation. Replication and randomisation are two of the basic principles of good experimental design. Mistaking pseudo-replication (either at the design or analysis stage) for genuine replication will lead to invalid statistical inferences. Randomisation is necessary to ensure the statistical inference is valid and to guard against bias.

Another extremely important consideration when designing an experiment, and setting the sample size, is the risk and impact of missing data due, for example, to animal drop-out or equipment failure. Missing data result in a loss of statistical power, complicate the statistical analysis, and have the potential to cause substantial bias (and potentially invalidate any conclusions). Careful planning and management of an experiment will help minimise the amount of missing data. In addition, safeguards, controls or contingencies could be built into the experimental design to help mitigate the impact of missing data. If missing data do result, appropriate statistical methods to account for them must be applied. Failure to do so could invalidate the entire study.

It is also important that the right data are collected to answer the research questions of interest. That is, the right response and explanatory variables measured at the appropriate scale and frequency. There are many statistics-related questions the researchers must answer, including: what population do they want to make inference about? How generalisable do they need their findings to be? What controllable and uncontrollable variables are there? Answers to these questions not only affect the enrolment of animals into the study, but also the conditions they are subjected to and the data that should be collected.

It is essential that the data from the experiment (including metadata) are appropriately managed and stored to protect their integrity and ensure their usability. If the data get messed up (e.g., if different variables measured on the same animal cannot be linked), are undecipherable (e.g., if the attributes of the variables are unknown) or are incomplete (e.g., if the observations aren’t linked to the structural variables associated with the experimental design), the data are likely worthless. Statisticians can offer invaluable expertise in good data management practices, helping to ensure the data are accurately recorded, the downstream results from analysing the data are reproducible, and the data themselves are reusable at a later date, possibly by a different group of researchers.

Unsurprisingly, it is also vitally important that the data are analysed correctly, using the methods that draw the most value from it. As expected, statistical expertise plays a huge role here! The results and inference are meaningful only if appropriate statistical methods are used. Moreover, often there is a choice of valid statistical approaches; however, some approaches will be more powerful or more precise than others. 

Having analysed the data, it is important that the inference (or conclusions) drawn are sound. Again, statistical thinking is crucial here. For example, in my experience, one all too common mistake in animal studies is to accept the null hypothesis and erroneously claim that a non-significant result means there is no difference (say, between treatment means). 
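
To illustrate this pitfall, here is a small simulated example in R (my own sketch, not data from any actual study): with few animals per group, a real difference can easily yield a non-significant p-value, and the wide confidence interval shows the data are compatible with anything from no difference to a large difference.

set.seed(42)
group_a <- rnorm(6, mean = 10, sd = 3)   # 6 animals, true mean 10
group_b <- rnorm(6, mean = 13, sd = 3)   # 6 animals, true mean 13
t.test(group_a, group_b)
# The p-value may well exceed 0.05 despite a true difference of 3 units;
# concluding "no difference" from such a result would be a mistake.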

Statistical aspects to minimise future animal use: sharing the value from the experiment

The other important mechanism for minimising future animal use is to share the knowledge and information gleaned. The most basic step here is to ensure that all the results are correctly and non-selectively reported. Reporting all aspects of the trial, including the experimental design and statistical analysis, accurately and completely is crucial for the wider interpretation of the findings, reproducibility and repeatability of the research, and for scientific scrutiny. In addition, all results, including null results, are valuable and should be shared. 

Sharing the data (or resources, e.g., animal tissues) also contributes to reduction. The data may be able to be re-used for a different purpose, integrated with other sources of data to provide new insights, or re-analysed in the future using a more advanced statistical technique, or for a different hypothesis. 

Statistical aspects to minimise future animal use: maximising the information obtained from the experiment

Another avenue that should also be explored is whether additional data or information can be obtained from the experiment, without incurring any further adverse animal welfare impacts, that could benefit other researchers and/or future studies. For example, to help address a different research question now or in the future. At the outset of the study, researchers must consider whether their proposed study could be combined with another one, whether the research animals could be shared with another experiment (e.g., animals euthanised for one experiment may provide suitable tissue for use in another), what additional data could be collected that may be (or already is!) of future use, etc.

Statistical thinking clearly plays a fundamental role in reducing the number of animals used in scientific research, and in ensuring the most value is drawn from the resulting data. I strongly believe that, to achieve reduction, statistical expertise must be fully utilised throughout the duration of any research project involving animals, from design through to analysis and dissemination of results. In my experience, most researchers strive for very high standards of animal ethics, and absolutely do not want to cause unnecessary harm to animals. Unfortunately, the role statistical expertise plays here is not always appreciated or taken advantage of. So next time you’re thinking of undertaking research involving animals, ensure you have expert statistical input!

About the author

Dr. Vanessa Cave is an applied statistician interested in the application of statistics to the biosciences, in particular agriculture and ecology, and is a developer of the Genstat statistical software package. She has over 15 years of experience collaborating with scientists, using statistics to solve real-world problems.  Vanessa provides expertise on experiment and survey design, data collection and management, statistical analysis, and the interpretation of statistical findings. Her interests include statistical consultancy, mixed models, multivariate methods, statistical ecology, statistical graphics and data visualisation, and the statistical challenges related to digital agriculture.

Vanessa is currently President of the Australasian Region of the International Biometric Society, past-President of the New Zealand Statistical Association, an Associate Editor for the Agronomy Journal, on the Editorial Board of The New Zealand Veterinary Journal and an honorary academic at the University of Auckland. She has a PhD in statistics from the University of St Andrews.

Related Reads


Dr. Salvador A. Gezan

02 March 2022

Significance levels: a love-hate relationship

We frequently hear in the scientific community a debate in favour of or against the use of p-values for statistical testing. The focus of this note is not to contribute (or muddle) this debate, but just to bring attention to a couple of elements critical to understanding the concept of the significance level. We will focus on three keywords: convention, trade-off and risk.

Recall that significance level refers to that value, say α = 0.05, that we use to decide if we should reject a null hypothesis (e.g., no difference in the means between two treatments). This α value represents the proportion of mistakes (here 5%) that we will make when we reject the null hypothesis when in fact it is true. This type I error, implies that we declare two means significantly different when in fact they are not (an undesirable situation with a false positive). Recall that we also have type II error, or β, the case in which we do not declare two means significantly different when in fact they are (another undesirable situation). This last error is associated with the power of a test, but here we will only focus on the controversial α value.
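
The long-run interpretation of α can be illustrated with a short simulation in R; this is a sketch of my own, with arbitrary sample sizes and number of simulated experiments.

set.seed(123)
n_sims <- 10000
p_values <- replicate(n_sims, {
  x <- rnorm(10)          # treatment 1: true mean 0
  y <- rnorm(10)          # treatment 2: true mean 0, i.e. the null hypothesis is true
  t.test(x, y)$p.value
})
mean(p_values < 0.05)     # proportion of false positives; expected to be close to 0.05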


Let’s start with the keyword convention. Depending on the field, we use different significance levels to ‘dictate’ when a test result is significant. Often, we use 0.10, 0.05 or 0.01; the use of a higher value usually is associated with scientific fields where there are greater levels of uncertainty (e.g., ecology), whereas stricter (lower) values are used in medical or pharmaceutical studies.

Without a doubt, the most common value used is 0.05: but why this value and what does it mean? This value, as indicated earlier, indicates that if the null hypothesis is true, we will only reject it wrongly 5% of the time (or 1 in 20 times). This seems a reasonable number, but it is a somewhat arbitrary choice. This significance level originated from the tables of critical values reported by Fisher and Yates back in 1938, and it has stayed with us as a convenient number. (However, they also reported 0.10, 0.02, 0.01, 0.002 and 0.001.) In part these significance levels were chosen by the above authors because of the practical limitations, at the time, of calculating all possible probabilities. Now, we usually get p-values instead of critical values but, interestingly, we still focus on that historical reference of 5%.

The convenient and arbitrary choice of a 5% significance level has now become a convention. But there is no scientific or empirical reasoning for the use of this or any other value. This is one of the reasons for the present debate on its use. It also seems odd that, for example, when α is 0.05 we declare a test significant with a p-value of 0.047, but not significant if this p-value is 0.053. As you can see, this highlights its arbitrary nature, and maybe we should focus on the strength of the evidence against our null hypothesis, namely talking about mild or strong significance in our results, instead of taking a yes/no attitude.

The second keyword is trade-off. In any decision, even in the simplest life decisions, there is always the possibility of making some mistakes. If, for your statistical inference, you are unwilling to make almost any mistakes, then you should select a very small value: an α, say of 0.0001 (here, 1 in 10,000 times you will make the wrong decision). This, at first glance, seems reasonable, but it has other implications. Requiring a very high level of confidence will mean that you will almost never reject the null hypothesis even if rejecting it is the correct decision! This occurs because you are setting up very strict thresholds to report a significant result and therefore you need extremely strong evidence. 

The above conservative philosophy results in two side effects. First, a waste of resources, as you are extremely unlikely to report significant differences from almost any study you might execute; only those with extremely large effects (or sample sizes) will have sufficient power. Second, scientific advancement in terms of discoveries (for example, a new drug or treatment) will be hampered by these strict values. For example, we will have fewer drugs or treatments available for some illnesses. Here, it is society as a whole that loses out on progress and scientific advancement with such a high threshold.
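
The cost of a very strict α can be seen numerically. The sketch below (with an illustrative effect size and sample size of my own choosing) computes the power of the same two-sample experiment at increasingly strict significance levels using R's power.t.test:

# Power of a two-sample t-test with 15 units per group and an assumed
# effect of one standard deviation, at different significance levels.
sapply(c(0.05, 0.01, 0.001, 0.0001), function(a)
  power.t.test(n = 15, delta = 1, sd = 1, sig.level = a)$power)
# Power drops steeply as alpha is made stricter: an experiment that is
# reasonably powered at 0.05 may have little chance of detecting the same
# real effect at 0.0001.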

On the other hand, something different, but also concerning, occurs with a very flexible significance level (say α = 0.30). Here we are very likely to find a ‘better’ drug or treatment, but these are not necessarily true improvements, as we are likely to have too many false positives, with results that are almost random. This too has strong side effects for all of us, such as: 1) a large societal cost of having drugs or treatments that are not necessarily better than the original ones (even if we think they are), and 2) it will be hard, as an individual, to discriminate between the good and the bad drugs (or treatments), as there are too many ‘good’ options available and all are reported as the ‘best’ ones!

This is where there exists a trade-off between too much and too little; individuals and societies need to define what is an adequate level of mistakes. Somehow it seems the scientific community has chosen 0.05! But any value will always be good or bad depending on the above trade-offs.

The last keyword is risk. Any decision that involves a mistake implies a risk. This is better understood with a few examples. So, imagine that you are considering using an alternative drug to treat your chronic disease, one that potentially has some bad, but manageable, side effects. In this case, you might want stronger evidence that this drug is going to be better than the standard; hence, you might want to use an α of 0.0001, to make sure it is better (at least statistically). Thus, under a potentially high personal risk from a bad decision, the significance level required has to be lower.

In contrast, imagine that you want to use the seeds of an alternative tomato variety that has been recommended to you for planting in your garden. The risk of having the wrong genetic source is ever present, and it will imply a waste of resources (money, time and frustration); but, under failure, your risk is relatively low, as you can buy tomatoes from the supermarket and the following year, go back to the seeds of your typical (and well tested) variety. In this case, the personal risk is relatively low, and you might be happy with an α of 0.10 (implying 1 in 10 times you will be incorrect) to give a chance to this alternative variety.

Hence, it is difficult to define what is an adequate level of risk for our decisions, an aspect that is also subjective. And once more, it seems that the scientific community has decided that 5% is a good level of risk!

In summary, significance levels are a convention that we have carried for a while, but going forward we need to be flexible and critical about how they are used and reported. In many senses they are subjective decisions, and we could even say they are individual decisions that take into consideration our personal attitude to risk and the potential trade-offs of making the right or wrong decisions. So, next time you see a p-value, take a step back and think about the consequences!

About the Author

Salvador Gezan is a statistician/quantitative geneticist with more than 20 years’ experience in breeding, statistical analysis and genetic improvement consulting. He currently works as a Statistical Consultant at VSN International, UK. Dr. Gezan started his career at Rothamsted Research as a biometrician, where he worked with Genstat and ASReml statistical software. Over the last 15 years he has taught ASReml workshops for companies and university researchers around the world. 

Dr. Gezan has worked on agronomy, aquaculture, forestry, entomology, medical, biological modelling, and with many commercial breeding programs, applying traditional and molecular statistical tools. His research has led to more than 100 peer reviewed publications, and he is one of the co-authors of the textbook “Statistical Methods in Biology: Design and Analysis of Experiments and Regression”.


Dr. Vanessa Cave

31 May 2022

Further inference from an ANOVA table: residual variance to standard errors and confidence intervals

Below is an example of a 2-way Analysis of Variance (ANOVA) for a randomised complete block design. From the ANOVA table, we’re going to see how to calculate: 

  • The standard error of a mean (SEM). 
  • The confidence interval (CI) for a mean.
  • The standard error of the difference between two means (SED).
  • The least significant difference between two means (LSD).
  • The confidence interval for the difference between two means.

[ANOVA table for the example randomised complete block design]

The ANOVA table gives us an estimate of the residual mean square ($s^2$) – also known as the mean square error, residual error or residual variation; in our example, $s^2$ = 0.04483. This is the variation in the data unexplained by the ANOVA model. In the example above, this is the variation remaining after the block effects (block) and treatment effects (Nitrogen, Sulphur and the Nitrogen by Sulphur interaction, Nitrogen.Sulphur) have been accounted for.

The standard error of a mean (SEM) is calculated using the following formula:

     $SEM = \sqrt{s^2 / n}$

where $s^2$ is the residual mean square and $n$ is the number of replicates (or sample size).

The SEM describes the uncertainty in our estimate of the population mean for a treatment from the available data. The bigger the sample size (i.e., the larger n), the smaller the SEM and the more precise our estimate is. 

In our example, there are 12 unique treatment combinations: the 3 levels of Nitrogen by the 4 levels of Sulphur. Note, we can obtain the number of levels of each treatment, and number of blocks, from the degrees of freedom in the ANOVA table.

Furthermore, the total number of degrees of freedom + 1 gives us the number of experimental units. Thus, in our example, there are 36 experimental units, with each of the 12 unique treatment combinations occurring exactly once in each of the 3 blocks.

Therefore…

  • The 12 Nitrogen by Sulphur means have replication of 3 (1 replicate per block).

For example, the 3 replicates of the treatment corresponding to the first level of Nitrogen and the second level of Sulphur (Nitrogen 1 Sulphur 2) are highlighted yellow in the schematic below of our randomised complete block design:

[Schematic of the randomised complete block design with the Nitrogen 1 Sulphur 2 plots highlighted]

Thus, the standard error for the Nitrogen by Sulphur means is:

      $SEM = \sqrt{0.04483 / 3} = 0.1222$

  • The 3 Nitrogen means, pooled over the four Sulphur levels, have replication of 12 (3 blocks x 4 levels of Sulphur).

For example, the replicates of the first level of Nitrogen (Nitrogen 1) are highlighted in yellow:

[Schematic of the design with the Nitrogen 1 plots highlighted]

(Note, within each of the 3 blocks, a given level of Nitrogen corresponds to 4 unique treatment combinations: 1 at each level of Sulphur). 

Thus, the standard error for the Nitrogen means is:

      $SEM = \sqrt{0.04483 / 12} = 0.0611$

  • The 4 Sulphur means, pooled over the 3 Nitrogen levels, have replication of 9 (3 blocks x 3 levels of Nitrogen).

For example, the replicates of the second level of Sulphur (Sulphur 2) are highlighted yellow:

[Schematic of the design with the Sulphur 2 plots highlighted]

Thus, the standard error for the Sulphur means is:

      $SEM = \sqrt{0.04483 / 9} = 0.0706$
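
As a quick check, the three SEMs can be reproduced with a couple of lines of R, using the residual mean square and replication numbers given above:

s2   <- 0.04483                                   # residual mean square from the ANOVA table
reps <- c(NxS = 3, Nitrogen = 12, Sulphur = 9)    # replication of each set of means
sem  <- sqrt(s2 / reps)                           # standard errors of the means
round(sem, 4)                                     # 0.1222, 0.0611 and 0.0706, as above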

The confidence interval (CI) for a mean is

      $\bar{x} \pm t_{1-\alpha/2,\, d} \times SEM$

where $\bar{x}$ is the estimated mean and $t_{1-\alpha/2,\, d}$ is the critical value of the Student's t distribution with $d$ degrees of freedom. For a confidence interval of C%, $\alpha = 1 - C/100$; for example, for a 95% confidence interval, $\alpha = 0.05$. Here $d$ refers to the residual degrees of freedom, which can be read directly from the ANOVA table (in our example, $d = 35 - 2 - 11 = 22$).

[ANOVA table with the residual degrees of freedom highlighted]

In a nutshell, a C% confidence interval for a mean is a range of values that you can be C% certain contains the true population mean. Although, strictly speaking, the confidence level C% represents a long-run percentage: the C% confidence interval gives an estimated range of values that we would expect the true, but unknown, population parameter to lie within C% of the times, should we repeat our experiment a large number of times. 
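
If you want to reproduce the critical values yourself, R's qt function will do it; a small sketch, using the 22 residual degrees of freedom derived above:

df_resid <- 22                 # residual degrees of freedom from the ANOVA table
qt(0.975, df_resid)            # critical value for a 95% CI, approximately 2.074
qt(0.995, df_resid)            # critical value for a 99% CI, approximately 2.819
# Multiplying a critical value by the relevant SEM gives the half-width of the
# corresponding confidence interval, as in the worked examples below.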

The tables of means for our example are given below:

[Tables of means for Nitrogen, Sulphur and Nitrogen by Sulphur]

For example, the 95% confidence interval for the overall mean for Sulphur level 4 is:

      Sulphur 4 mean ± $t_{0.975,\,22}$ × 0.0706

      = Sulphur 4 mean ± 2.074 × 0.0706

      = Sulphur 4 mean ± 0.146

Similarly, the 99% confidence interval for the Nitrogen level 1, Sulphur level 3 mean is:

      Nitrogen 1 Sulphur 3 mean ± $t_{0.995,\,22}$ × 0.1222

      = Nitrogen 1 Sulphur 3 mean ± 2.819 × 0.1222

      = Nitrogen 1 Sulphur 3 mean ± 0.344

(The treatment means themselves can be read from the tables of means above.)

The standard error of the difference between two means (SED) is calculated using the following formula:

$SED = SEM \times \sqrt{2}$

Note: The formula is different when the sample sizes of the means being compared are unequal.

The SED describes the uncertainty in our estimate of the difference between two population means. 

For our example, the SED between …

 a) two Nitrogen by Sulphur means is:

       $SED = 0.1222 \times \sqrt{2} = 0.1728$

b) two overall Nitrogen means is:

      $SED = 0.0611 \times \sqrt{2} = 0.0864$

and

c) two overall Sulphur means is:

      $SED = 0.0706 \times \sqrt{2} = 0.0998$
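
Continuing the R sketch from above, the SEDs follow directly from the SEMs:

sed <- sem * sqrt(2)                  # standard errors of differences
round(sed, 4)                         # close to 0.1728, 0.0864 and 0.0998
# (small discrepancies are due to rounding of the SEMs)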

Two means can be compared using their least significant difference (LSD). The LSD gives the smallest value of the absolute difference between two means that is deemed statistically significant at the α level of significance. The LSD is given by:

     $LSD(\alpha \times 100\%) = t_{1-\alpha/2,\, d} \times SED$

Thus, for our example, the LSD(5%) for comparing …

a)  two Nitrogen by Sulphur means is:

     $LSD(5\%) = t_{0.975,\,22} \times 0.1728 = 2.074 \times 0.1728 = 0.358$

b)  two overall Nitrogen means is:

     $LSD(5\%) = t_{0.975,\,22} \times 0.0864 = 2.074 \times 0.0864 = 0.179$

c)  two overall Sulphur means is:

     $LSD(5\%) = t_{0.975,\,22} \times 0.0998 = 2.074 \times 0.0998 = 0.207$
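
And, continuing the same sketch, the 5% LSDs combine the critical t value with the SEDs:

lsd_5 <- qt(0.975, df_resid) * sed    # 5% least significant differences
round(lsd_5, 3)                       # close to 0.358, 0.179 and 0.207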

Using the overall means for Nitrogen as an example, at the 5% significance level…

  • there is statistical evidence that the Nitrogen 1 and Nitrogen 2 means differ

(the absolute difference between these two means is greater than 0.179, the LSD(5%))

  • there is statistical evidence that the Nitrogen 1 and Nitrogen 3 means differ

(the absolute difference between these two means is greater than 0.179, the LSD(5%))

  • there is NO statistical evidence that the Nitrogen 2 and Nitrogen 3 means differ

(the absolute difference between these two means is less than 0.179, the LSD(5%))

The difference between two means can also be compared, and more fully described, using a confidence interval for the difference between two means. This is given by:

$(\bar{x}_1 - \bar{x}_2) \pm t_{1-\alpha/2,\, d} \times SED$

or, equivalently,

$(\bar{x}_1 - \bar{x}_2) \pm LSD(\alpha \times 100\%)$

where $\bar{x}_1$ and $\bar{x}_2$ are the two means being compared.

Once again, using the overall means for Nitrogen as an example, the 95% confidence interval for the difference between:

  • the Nitrogen 1 and Nitrogen 2 means is:

  • the Nitrogen 1 and Nitrogen 3 means is:

  • the Nitrogen 2 and Nitrogen 3 means is:

Notice that the CIs comparing Nitrogen 1 with Nitrogen 2, and Nitrogen 1 with Nitrogen 3, both exclude zero. Hence, we can conclude, at the 5% significance level, that the mean for Nitrogen 1 is significantly different from both the Nitrogen 2 and Nitrogen 3 means. (In this case the mean for Nitrogen 1 is lower than that of Nitrogen 2 and Nitrogen 3.) Conversely, as the CI comparing Nitrogen 2 and Nitrogen 3 includes zero, we conclude that there is no evidence of a difference between these two means (at the 5% significance level).

Luckily for us, we rarely need to calculate these quantities ourselves, as they are generated by most statistical software packages. However, it is useful to understand how they are calculated and how they are related. For example, in order to scrutinise reported results, or to calculate, at a later date, a quantity that you’ve forgotten to generate.

Genstat has a very powerful set of ANOVA tools that are straightforward and easy to use. In addition to the ANOVA table, you can readily output the treatment means, SEMs, SEDs, LSDs and CIs.

About the author

Dr. Vanessa Cave is an applied statistician interested in the application of statistics to the biosciences, in particular agriculture and ecology, and is a developer of the Genstat statistical software package. She has over 15 years of experience collaborating with scientists, using statistics to solve real-world problems.  Vanessa provides expertise on experiment and survey design, data collection and management, statistical analysis, and the interpretation of statistical findings. Her interests include statistical consultancy, mixed models, multivariate methods, statistical ecology, statistical graphics and data visualisation, and the statistical challenges related to digital agriculture.

Vanessa is currently President of the Australasian Region of the International Biometric Society, past-President of the New Zealand Statistical Association, an Associate Editor for the Agronomy Journal, on the Editorial Board of The New Zealand Veterinary Journal and an honorary academic at the University of Auckland. She has a PhD in statistics from the University of St Andrews.


Dr. Salvador A. Gezan

09 March 2022

Meta-analysis using linear mixed models

Meta-analysis is a statistical tool that allows us to combine information from related, but independent, studies that all aim to estimate or compare the same effects from contrasting treatments. Meta-analysis is widely used in many research areas where an extensive literature review is performed to identify studies that had a similar research question. These are later combined using meta-analysis to estimate a single combined effect. Meta-analyses are commonly used to answer healthcare and medical questions, where they are widely accepted, but they are also used in many other scientific fields.

By combining several sources of information, meta-analyses have the advantage of greater statistical power, therefore increasing our chance of detecting a significant difference. They also allow us to assess the variability between studies, and help us to understand potential differences between the outcomes of the original studies.

The underlying premise in meta-analysis is that we are collecting information from a group of, say n, studies that individually estimated a parameter of interest, say θ. It is reasonable to consider that this parameter has some statistical properties. Mainly, we assume that it belongs to a Normal distribution with unknown mean and variance. Hence, mathematically we say:

$\theta_i \sim N(\theta,\ \sigma^2_s)$
In meta-analysis, the target population parameter θ can correspond to any of several statistics, such as a treatment mean, a difference between treatments; or more commonly in clinical trials, the log-odds ratio or relative risk.

There are two models that are commonly used to perform meta-analyses: the fixed-effect model and the random-effects model. For the fixed-effect model, it is assumed that there is only a single unique true effect, our single θ above, which is estimated from a random sample of studies. That is, the fixed-effect model assumes that there is a single population effect, and the deviations obtained from the different studies are only due to sampling error or random noise. The linear model (LM) used to describe this process can be written as:

$y_i = \theta + e_i$

where $y_i$ is each individual observed response, θ is the population parameter (also often known as μ, the overall mean), and $e_i$ is a random error or residual with assumptions of $e_i \sim N(0, \sigma^2_i)$. The variance component $\sigma^2_i$ is a measurement of our uncertainty in the information (i.e., response) of each study. The above model can be easily fitted under any typical LM routine, such as R, SAS, Genstat and ASReml.

For the random-effects model we still assume that there is a common true effect between studies but, in addition, we allow this effect to vary between studies. Variation between these effects is a reasonable assumption as no two studies are identical, differing in many aspects; for example, different demographics in the data, slightly differing measurement protocols, etc. Because we have a random sample of studies, we have a random sample of effects, and therefore we define a linear mixed model (LMM) using the following expression:

$y_i = \theta_i + e_i$

where, as before, $y_i$ is each individual observed response, $\theta_i$ is the study-specific population parameter, with the assumption of $\theta_i \sim N(\theta, \sigma^2_s)$, and $e_i$ is a random error or residual with the same normality assumptions as before. Alternatively, the above LMM can be written as:

$y_i = \theta + s_i + e_i$

where θ is the overall effect mean and $s_i$ is a random deviation from the overall effect mean θ with assumptions $s_i \sim N(0, \sigma^2_s)$.

This is a LMM because we have, besides the residual, an additional random component that has a variance component associated with it, that is, $\sigma^2_s$. This variance is a measurement of the variability ‘between’ studies, and it will reflect the level of uncertainty of observing a specific $\theta_i$. These LMMs can be fitted, and variance components estimated, under many linear mixed model routines, such as nlme in R, proc mixed in SAS, Genstat or ASReml.

Both fixed-effect and random-effects models are often estimated using summary information, instead of the raw data collected from the original study. This summary information corresponds to estimated mean effects together with their variances (or standard deviations) and the number of samples or experimental units considered per treatment. Since the different studies provide different amounts of information, weights should be used when fitting LM or LMM to summary information in a meta-analysis, similar to weighted linear regression. In meta-analysis, each study has a different level of importance, due to, for example, differing number of experimental units, slightly different methodologies, or different underlying variability due to inherent differences between the studies. The use of weights allows us to control the influence of each observation in the meta-analysis resulting in more accurate final estimates.

Different statistical software will manage these weights slightly differently, but most packages will consider the following general expression of weights:

$w_i = 1 / \mathrm{var}(y_i)$

where $w_i$ is the weight and $\mathrm{var}(y_i)$ is the variance of observation $y_i$. For example, if the response corresponds to an estimated treatment mean, then its variance is $MSE/n_i$, with MSE being the mean square error reported for the given study, and $n_i$ the number of experimental units (or replication).

Therefore, after we collect the summary data, we fit our linear or linear mixed model with weights and request from its output an estimation of its parameters and their standard errors. This will allow us to make inference, and construct, for example, a 95% confidence interval around an estimate to evaluate if this parameter/effect is significantly different from zero. This will be demonstrated in the example below.

Motivating example

The dataset we will use to illustrate meta-analysis was presented and previously analysed by Normand (1999). The dataset contains information from nine independent studies in which the length of hospitalisation (measured in days) was recorded for stroke patients under two different treatment regimes. The main objective was to evaluate if specialist inpatient stroke care (sc) resulted in shorter stays when compared to the conventional non-specialist (or routine management) care (rm).

The complete dataset is presented below, and it can also be found in the file STROKE.txt. In this table, the columns present for each study are the sample size (n.sc and n.rm), their estimated mean value (mean.sc and mean.rm) together with their standard deviation (sd.sc and sd.rm) for both the specialist care and routine management care, respectively.

[Table: the STROKE dataset, with the study identifier and the columns n.sc, mean.sc, sd.sc, n.rm, mean.rm and sd.rm]

Statistical analyses

We will use the statistical package R to read and manipulate the data, and then the library ASReml-R (Butler et al. 2017) to fit the models. 
First, we read the data in R and make some additional calculations, as shown in the code below:

STROKE <- read.table("STROKE.TXT", header=TRUE)    # read in the summary data from each study
STROKE$diff <- STROKE$mean.sc - STROKE$mean.rm     # difference between the treatment means
STROKE$Vdiff <- (STROKE$sd.sc^2/STROKE$n.sc) + (STROKE$sd.rm^2/STROKE$n.rm)   # variance of that difference
STROKE$WT <- 1/(STROKE$Vdiff)                      # weight = inverse of the variance

The new column diff contains the difference between treatment means (as reported from each study). We have estimated the variance of this mean difference, Vdiff, by taking each treatment's individual MSE (mean square error), dividing it by the sample size, and then summing the terms of both treatments. This estimate assumes that, for a given study, the samples from both treatments are independent, and for this reason we did not include a covariance. Finally, we have calculated a weight (WT) for each study as the inverse of the variance of the mean difference (i.e., 1/Vdiff).

We can take another look at this data with these extra columns:

[Table: the STROKE dataset with the additional columns diff, Vdiff and WT]

The above table shows a wide range of values between the studies in the mean difference of length of stay between the two treatments, ranging from as low as −71.0 to 11.0, with a raw average of −15.9. Also, the variances of these differences vary considerably, which is also reflected in their weights.

The code to fit the fixed-effect linear model using ASReml-R is shown below:

library(asreml) 
meta_f<-asreml(fixed=diff~1, 
               weights=WT, 
               family=asr_gaussian(dispersion=1), 
               data=STROKE)

In the above model, our response variable is diff, and the weights are indicated by the variate WT. As the precisions are contained within the weights, the family argument is required to fix the residual error (MSE) at exactly 1.0; hence, it will not be estimated.

The model generates output that can be used for inference. We will start by exploring our target parameter, i.e. θ, by looking at the estimated fixed effect mean and its standard error. This is done with the code:

meta_effect <- summary(meta_f, coef=TRUE)$coef.fixed

Resulting in the output:

[Output: estimated intercept (the overall effect θ) with its standard error]

The estimate of θ is equal to −3.464 days, with a standard error of 0.765. An approximate 95% confidence interval can be obtained by using a z-value of 1.96. The resulting approximate 95% confidence interval [−4.963;−1.965] does not contain zero. The significance of this value can be obtained by looking at the approximated ANOVA table using the command:

wald.asreml(meta_f)

Note that this is an approximation: given that the weights are considered to be known, the degrees of freedom are assumed to be infinite; hence, this will be a liberal estimate.

[Approximate Wald (ANOVA) table for the fixed-effect model]

The results from this ANOVA table indicate a high significance of this parameter (θ) with an approximated p-value of < 0.0001. Therefore, in summary, this fixed effect model analysis indicates a strong effect of the specialised care resulting in a reduction of approximately 3.5 days in hospitalisation.
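
As a cross-check, because the weights are treated as known, the fixed-effect estimate is simply the inverse-variance weighted mean of the study differences, and its standard error is the square root of the reciprocal of the summed weights; this can be verified directly in R:

est <- sum(STROKE$WT * STROKE$diff) / sum(STROKE$WT)   # inverse-variance weighted mean
se  <- sqrt(1 / sum(STROKE$WT))                        # standard error of the estimate
c(estimate = est, lower = est - 1.96*se, upper = est + 1.96*se)
# This should reproduce the estimate of about -3.46 days and the approximate
# 95% confidence interval reported above.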

However, as indicated earlier, a random-effects model might seem more reasonable given the inherent differences in the studies under consideration. Here, we extend the model to include the random effect of study. In order to do this, first we need to ensure that this is treated as a factor in the model by running the code:

STROKE$study <- as.factor(STROKE$study)

The LMM to be fitted using ASReml-R is:

meta_r<-asreml(fixed=diff~1,  
               random=~study, 
               weights=WT, 
               family=asr_gaussian(dispersion=1), 
               data=STROKE)

Note that in this example the only difference from the previous code is the inclusion of the line random=~study. This includes the factor study as a random effect. An important result from running this model is the set of estimated variance components. These are obtained with the command:

summary(meta_r)$varcomp

[Estimated variance components from the random-effects model]

In this example, the variance associated with the differences in the target parameter (θ) between the studies is 684.62. When expressed as a standard deviation, this corresponds to 26.16 days. Note that this variation is large in relation to the scale of the data, reflecting large differences between the random sample of studies considered in the meta-analysis.

We can output the fixed and random effects using the following commands:

meta_effect <- summary(meta_r, coef=TRUE)$coef.fixed 
BLUP <- summary(meta_r, coef=TRUE)$coef.random

[Output: estimated overall mean difference (fixed effect) and study-level BLUPs]

Note that our estimated mean difference now corresponds to −15.106 days with a standard error of 8.943, and that the approximate 95% confidence interval [−32.634; 2.423] now contains zero. An approximate ANOVA based on the following code:

wald.asreml(meta_r)

results in the output:

[Approximate Wald (ANOVA) table for the random-effects model]

We have a p-value of 0.0912, indicating that there is no significant difference in length of stay between the treatments evaluated. Note that the estimates of the random effects of study, also known as BLUPs (best linear unbiased predictions) are large, ranging from −45.8 to 22.9, and widely variable. The lack of significance in the random-effects model, when there is a difference of −15.11 days, is mostly due to the large variability of 684.62 found between the different studies, resulting in a substantial standard error for the estimated mean difference.

In the following graph we can observe the 95% confidence intervals for each of the nine studies together with the final parameter estimated under the random-effects model. Some of these confidence intervals contain the value zero, including the one for the random-effects model. However, it can be observed that the confidence interval from the random-effects model is an adequate summarisation of the nine studies, representing a compromise between them.

[Graph: 95% confidence intervals for each of the nine studies and for the random-effects model estimate]

An important aspect to consider is the difference in results between the fixed-effect and the random-effects models, which is associated, as indicated earlier, with their different inferential approaches. One way to understand this is by considering what will happen if a new random study is included. Because we have a large variability in the study effects (as denoted by $\sigma^2_s$), we expect this new study to have a difference between treatments that falls randomly within this wide range. This, in turn, is expressed by the large standard error of the fixed effect θ, and by its large 95% confidence interval, which ensures that for ‘any’ observation we cover the parameter estimate 95% of the time. Therefore, as shown by the data, it seems more reasonable to consider the random-effects model than the fixed-effect model, as it is an inferential approach that deals with several sources of variation.

Summary

In summary, we have used the random-effects model to perform a meta-analysis on a medical research question of treatment differences by combining nine independent studies. Under this approach we assumed that all studies describe the same effect, but we allowed the model to express different effect sizes through the inclusion of a random effect that varies from study to study. The main aim of this analysis was not to explain why these differences occur; rather, our aim was to incorporate a measure of this uncertainty into the estimation of the final effect of treatment differences.

There are several extensions to meta-analysis with different types of responses and effects. Some of the relevant literature recommended to the interested reader are van Houwelingen et al. (2002) and Vesterinen et al. (2014). Also, a clear presentation with further details of the differences between fixed-effect and random-effects models is presented by Borenstein et al. (2010).

Files to download

Dataset: STROKE.txt
R code: STROKE_METAA.R

References

Borenstein, M; Hedges, LV; Higgins, JPT; Rothstein, HR. 2010. A basic introduction to fixed-effect and random-effects models for meta-analysis. Research Synthesis Methods 1: 97-111.

Butler, DG; Cullis, BR; Gilmour, AR; Gogel, BG; Thompson, R. 2017. ASReml-R Reference Manual Version 4. VSNi International Ltd, Hemel Hempstead, HP2 14TP, UK.

Normand, ST. 1999. Meta-analysis: Formulating, evaluating, combining, and reporting. Statistics in Medicine 18: 321-359.

van Houwelingen, HC; Arends, LR; Stijnen, T. 2002. Advanced methods in meta-analysis: multivariate approach and meta-regression. Statistics in Medicine 21: 589-624.

Vesterinen, HM; Sena, ES; Egan, KJ; Hirst, TC; Churilov, L; Currie, GL; Antonic, A; Howells, DW; Macleod, MR. 2014. Meta-analysis of data from animal studies: a practical guide. Journal of Neuroscience Methods 221: 92-102.

About the author

Salvador Gezan is a statistician/quantitative geneticist with more than 20 years’ experience in breeding, statistical analysis and genetic improvement consulting. He currently works as a Statistical Consultant at VSN International, UK. Dr. Gezan started his career at Rothamsted Research as a biometrician, where he worked with Genstat and ASReml statistical software. Over the last 15 years he has taught ASReml workshops for companies and university researchers around the world. 

Dr. Gezan has worked on agronomy, aquaculture, forestry, entomology, medical, biological modelling, and with many commercial breeding programs, applying traditional and molecular statistical tools. His research has led to more than 100 peer reviewed publications, and he is one of the co-authors of the textbook “Statistical Methods in Biology: Design and Analysis of Experiments and Regression”.