The VSNi Team
ASRgenomics: filling the gap on processing molecular data for quantitative genetics
Most breeding programs are supported by an array of genomic information that provides new options for increasing rates of genetic gain. However, performing statistical analyses with molecular data can be a difficult task. This type of data has to be reconciled properly with the available phenotypic and pedigree data, and the overall success of this integration depends on a set of checks, verifications, filters, and careful preparation of all these datasets, so that genetic models can be fitted successfully and the output required to make correct decisions obtained.
The workflow of a molecular data-driven analysis varies with the source of the datasets and, of course, with personal preferences. Nevertheless, regardless of these aspects, an efficient genomics pipeline relies on answering a set of key questions about the quality and compatibility of the data before any model is fitted.
We developed ASRgenomics to help deal with these questions. This is a free-to-use R package which can be downloaded from the ASReml knowledge base. It is a compilation of proven routines developed over several years of study and hands-on experience in the field. ASRgenomics was built with advanced statistical modeling in mind, and it fills a gap by helping you make sure your analyses are as efficient and accurate as they can be, with several explicit diagnostic tools.
The package is aimed at geneticists and breeders with the purpose of improving their experience with genomic analyses, such as Genomic Selection (GS) and Genome Wide Association Studies (GWAS), in a straightforward and efficient manner. The main capabilities of the package include:
The functions included within ASRgenomics are very flexible and can be used for a tailored workflow from raw molecular data to well-behaved model-ready matrices. Additionally, ASRgenomics is capable of seamlessly preparing genomic datasets for integration with ASReml-R to fit linear mixed models (LMMs; e.g., GBLUP or ssGBLUP).
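To give a feel for what "model-ready" means in practice, here is a minimal base-R sketch of one such step: filtering markers by minor allele frequency and then building a VanRaden (2008) genomic relationship matrix by hand. The marker matrix `M`, the sample sizes, and the 0.05 MAF threshold are all invented for illustration; in a real workflow the dedicated ASRgenomics functions described in the user guide would carry out this preparation (and much more) for you.

```r
# Minimal base-R sketch of one step a genomics pipeline automates:
# filter markers by minor allele frequency, then build a
# VanRaden (2008) genomic relationship matrix G.
# 'M' is a hypothetical individuals x markers matrix coded 0/1/2.

set.seed(42)
M <- matrix(rbinom(20 * 50, size = 2, prob = 0.3),
            nrow = 20, ncol = 50,
            dimnames = list(paste0("ind", 1:20), paste0("snp", 1:50)))

p   <- colMeans(M) / 2                    # allele frequencies
maf <- pmin(p, 1 - p)                     # minor allele frequencies
M   <- M[, maf >= 0.05, drop = FALSE]     # drop rare markers (illustrative threshold)
p   <- colMeans(M) / 2                    # recompute after filtering

Z <- sweep(M, 2, 2 * p, FUN = "-")        # centre each marker by 2p
G <- tcrossprod(Z) / sum(2 * p * (1 - p)) # VanRaden genomic relationship matrix

round(G[1:5, 1:5], 3)                     # a model-ready relationship matrix
```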
Please try this free package and check out the user guide included in the doc folder inside the download package for a walk-through of the features, along with details of the methods.
The VSNi Team
13 June 2022
What t-test to use: Student’s or Welch’s?
Imagine we are interested in whether female undergraduate students with blonde or brown hair have different heights, on average. To investigate this, we randomly sample blonde and brown haired female undergraduate students and measure their height.
The hypotheses we want to test are:

H₀: μ_blonde = μ_brown (the two population mean heights are equal)
H₁: μ_blonde ≠ μ_brown (the two population mean heights differ)
Assumptions | Student’s t-test | Welch’s t-test |
Two independent, random samples | ✓ | ✓ |
The two populations have Normal distributions | ✓ | ✓ |
The variances of the two populations are equal | ✓ | |
If the sample sizes are unequal between the two groups, Welch's t-test performs better than Student's t-test.
If you have | Use |
Unequal sample variances and/or unequal sample sizes | Welch's t-test (also known as the unequal variances t-test) performs better (i.e., is more reliable) than Student’s t-test |
Equal population variances and equal sample sizes | Student's t-test has more power than Welch's t-test |
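For readers who prefer working in R rather than Genstat, the sketch below runs both tests on simulated data using `t.test()`: the argument `var.equal = TRUE` gives Student’s (pooled-variance) test, while the default `var.equal = FALSE` gives Welch’s test. The group sizes, means, and standard deviations are invented purely for illustration.

```r
# Simulated heights (cm) for the two hypothetical groups
set.seed(1)
blonde <- rnorm(25, mean = 166, sd = 6)
brown  <- rnorm(40, mean = 168, sd = 8)

# Student's t-test: pooled variance (assumes equal population variances)
t.test(blonde, brown, var.equal = TRUE)

# Welch's t-test: separate variances (the default in R)
t.test(blonde, brown, var.equal = FALSE)
```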
Performing a Student’s t-test or Welch’s t-test in Genstat is straightforward.
From the menu bar, select Stats | Statistical tests | One- and two-sample t-tests. Then, in the T-Tests menu, set the type of test to Two-sample.
We can run a Student’s t-test by clicking the Options button, then selecting Pooled as the method used to estimate the variances for the test.
To run a Welch’s t-test, in the T-Test Options menu select Separate as the method used to estimate the variances for the test.
Dr. Vanessa Cave
31 May 2022
Further inference from an ANOVA table: residual variance to standard errors and confidence intervals
Below is an example of a 2-way Analysis of Variance (ANOVA) for a randomised complete block design. From the ANOVA table, we’re going to see how to calculate:
- the standard error of a mean (SEM)
- confidence intervals (CIs) for the treatment means
- the standard error of the difference between two means (SED)
- the least significant difference (LSD)
- confidence intervals for the difference between two means
The ANOVA table gives us an estimate of the residual mean square (s²) – also known as the mean square error, residual error or residual variation. This is the variation in the data unexplained by the ANOVA model. In the example above, this is the variation remaining after the block effects (block) and treatment effects (Nitrogen, Sulphur and the Nitrogen by Sulphur interaction, Nitrogen.Sulphur) have been accounted for.
The standard error of a mean (SEM) is calculated using the following formula:

SEM = √(s²/n)

where s² is the residual mean square and n is the number of replicates (or sample size).
The SEM describes the uncertainty in our estimate of the population mean for a treatment from the available data. The bigger the sample size (i.e., the larger n), the smaller the SEM and the more precise our estimate is.
In our example, there are 12 unique treatment combinations: the 3 levels of Nitrogen by the 4 levels of Sulphur. Note, we can obtain the number of levels of each treatment, and number of blocks, from the degrees of freedom in the ANOVA table.
Furthermore, the total number of degrees of freedom + 1 gives us the number of experimental units. Thus, in our example, there are 36 experimental units, with each of the 12 unique treatment combinations occurring exactly once in each of the 3 blocks.
Therefore… each Nitrogen by Sulphur treatment combination is replicated n = 3 times, once in each block.
For example, the 3 replicates of the treatment corresponding to the first level of Nitrogen and the second level of Sulphur (Nitrogen 1 Sulphur 2) are highlighted yellow in the schematic below of our randomised complete block design:
Thus, the standard error for the Nitrogen by Sulphur means is:

SEM(Nitrogen.Sulphur) = √(s²/3)
For example, the replicates of the first level of Nitrogen (Nitrogen 1) are highlighted in yellow:
(Note, within each of the 3 blocks, a given level of Nitrogen corresponds to 4 unique treatment combinations: 1 at each level of Sulphur).
Thus, as each Nitrogen mean is the average of n = 12 observations (4 Sulphur levels × 3 blocks), the standard error for the Nitrogen means is:

SEM(Nitrogen) = √(s²/12)
For example, the replicates of the second level of Sulphur (Sulphur 2) are highlighted yellow:
Thus, as each Sulphur mean is the average of n = 9 observations (3 Nitrogen levels × 3 blocks), the standard error for the Sulphur means is:

SEM(Sulphur) = √(s²/9)
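The residual mean square from the original ANOVA table is not reproduced here, so the short R sketch below uses an invented value (s² = 0.4) purely to show how the three SEMs follow from the replication counts worked out above.

```r
# Hypothetical residual mean square (read yours off the ANOVA table)
s2 <- 0.4

# Replication of each type of mean in the 3-block design
n_NS <- 3   # Nitrogen x Sulphur means: 1 plot per block
n_N  <- 12  # Nitrogen means: 4 Sulphur levels x 3 blocks
n_S  <- 9   # Sulphur means: 3 Nitrogen levels x 3 blocks

sem <- function(s2, n) sqrt(s2 / n)
c(Nitrogen.Sulphur = sem(s2, n_NS),
  Nitrogen         = sem(s2, n_N),
  Sulphur          = sem(s2, n_S))
```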
The confidence interval (CI) for a mean is

mean ± t(α; d.f.) × SEM

where t(α; d.f.) is the critical value of the t-distribution with d.f. degrees of freedom. For a confidence interval of C%, α = (100 − C)/100. For example, for a 95% confidence interval, α = 0.05. The d.f. here refers to the residual degrees of freedom, which can be read directly from the ANOVA table (22 in our example).
In a nutshell, a C% confidence interval for a mean is a range of values that you can be C% certain contains the true population mean. Strictly speaking, though, the confidence level C% represents a long-run percentage: were we to repeat our experiment a large number of times, we would expect the C% confidence intervals so constructed to contain the true, but unknown, population parameter C% of the time.
The tables of means for our example are given below:
For example, the 95% confidence interval for the overall mean for Sulphur level 4 is:

mean(Sulphur 4) ± t(0.05; 22) × SEM(Sulphur) = mean(Sulphur 4) ± 2.074 × √(s²/9)
Similarly, the 99% confidence interval for the Nitrogen level 1, Sulphur level 3 mean is:

mean(Nitrogen 1, Sulphur 3) ± t(0.01; 22) × SEM(Nitrogen.Sulphur) = mean(Nitrogen 1, Sulphur 3) ± 2.819 × √(s²/3)
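Continuing the sketch with invented means (and the same invented s²), the two confidence intervals can be computed as follows; `qt()` supplies the t critical values for the 22 residual degrees of freedom.

```r
s2        <- 0.4   # invented residual mean square
df_resid  <- 22    # residual degrees of freedom
mean_S4   <- 5.2   # invented overall mean for Sulphur level 4
mean_N1S3 <- 4.9   # invented Nitrogen 1, Sulphur 3 mean

# 95% CI for the Sulphur 4 mean (SEM based on n = 9)
mean_S4 + c(-1, 1) * qt(0.975, df_resid) * sqrt(s2 / 9)

# 99% CI for the Nitrogen 1, Sulphur 3 mean (SEM based on n = 3)
mean_N1S3 + c(-1, 1) * qt(0.995, df_resid) * sqrt(s2 / 3)
```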
The standard error of the difference between two means (SED) is calculated using the following formula:

SED = √(2s²/n) = √2 × SEM
Note: The formula is different when the sample sizes of the means being compared are unequal.
The SED describes the uncertainty in our estimate of the difference between two population means.
For our example, the SED between …

a) two Nitrogen by Sulphur means is:

SED(Nitrogen.Sulphur) = √(2s²/3)

b) two overall Nitrogen means is:

SED(Nitrogen) = √(2s²/12)

and

c) two overall Sulphur means is:

SED(Sulphur) = √(2s²/9)
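Again with the invented residual mean square, these three SEDs are one line each in R:

```r
s2  <- 0.4                               # invented residual mean square
sed <- function(s2, n) sqrt(2 * s2 / n)  # SED for means each based on n observations

c(Nitrogen.Sulphur = sed(s2, 3),
  Nitrogen         = sed(s2, 12),
  Sulphur          = sed(s2, 9))
```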
Two means can be compared using their least significant difference (LSD). The LSD gives the smallest absolute difference between two means that is deemed statistically significant at the α level of significance. The LSD is given by:

LSD(α × 100%) = t(α; d.f.) × SED
Thus, for our example, the LSD(5%) for comparing …

a) two Nitrogen by Sulphur means is:

LSD(5%) = t(0.05; 22) × SED(Nitrogen.Sulphur) = 2.074 × √(2s²/3)

b) two overall Nitrogen means is:

LSD(5%) = t(0.05; 22) × SED(Nitrogen) = 2.074 × √(2s²/12)

c) two overall Sulphur means is:

LSD(5%) = t(0.05; 22) × SED(Sulphur) = 2.074 × √(2s²/9)
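In R, the same LSDs are simply the t critical value multiplied by the corresponding SEDs (again using the invented s²):

```r
s2       <- 0.4                  # invented residual mean square
df_resid <- 22
t_crit   <- qt(0.975, df_resid)  # t(0.05; 22), approximately 2.074

lsd <- function(s2, n) t_crit * sqrt(2 * s2 / n)
c(Nitrogen.Sulphur = lsd(s2, 3),
  Nitrogen         = lsd(s2, 12),
  Sulphur          = lsd(s2, 9))
```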
Using the overall means for Nitrogen as an example, at the 5% significance level…

As the absolute difference between the Nitrogen 1 and Nitrogen 2 means is greater than the LSD(5%), these two means are significantly different.
As the absolute difference between the Nitrogen 1 and Nitrogen 3 means is greater than the LSD(5%), these two means are significantly different.
As the absolute difference between the Nitrogen 2 and Nitrogen 3 means is less than the LSD(5%), there is no evidence of a difference between these two means.
The difference between two means can also be compared, and more fully described, using a confidence interval for the difference between two means. This is given by:

(mean₁ − mean₂) ± t(α; d.f.) × SED

or, equivalently,

(mean₁ − mean₂) ± LSD(α × 100%)

where mean₁ and mean₂ are the two means being compared.
Once again, using the overall means for Nitrogen as an example, we can calculate the 95% confidence interval for the difference between each pair of Nitrogen means: Nitrogen 1 and Nitrogen 2, Nitrogen 1 and Nitrogen 3, and Nitrogen 2 and Nitrogen 3.
Notice that the CIs comparing Nitrogen 1 with Nitrogen 2, and Nitrogen 1 with Nitrogen 3, both exclude zero. Hence, we can conclude, at the 5% significance level, that the mean for Nitrogen 1 is significantly different from both the Nitrogen 2 and Nitrogen 3 means. (In this case the mean for Nitrogen 1 is lower than those of Nitrogen 2 and Nitrogen 3.) Conversely, as the CI comparing Nitrogen 2 and Nitrogen 3 includes zero, we conclude that there is no evidence of a difference between these two means (at the 5% significance level).
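The equivalence between the LSD comparison and the confidence interval is easy to see in code: with invented Nitrogen means (and the same invented s²), the interval below excludes zero exactly when the absolute difference exceeds the LSD(5%).

```r
s2       <- 0.4   # invented residual mean square
df_resid <- 22
mean_N1  <- 4.8   # invented overall Nitrogen means
mean_N2  <- 5.6

sed_N <- sqrt(2 * s2 / 12)       # SED between two Nitrogen means
t_95  <- qt(0.975, df_resid)

(mean_N1 - mean_N2) + c(-1, 1) * t_95 * sed_N  # 95% CI for the difference
abs(mean_N1 - mean_N2) > t_95 * sed_N          # TRUE when the difference exceeds the LSD(5%)
```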
Luckily for us, we rarely need to calculate these quantities ourselves, as they are generated by most statistical software packages. However, it is useful to understand how they are calculated and how they are related, for example in order to scrutinize reported results, or to calculate, at a later date, a quantity that you’ve forgotten to generate.
Genstat has a very powerful set of ANOVA tools that are straightforward and easy to use. In addition to the ANOVA table, you can readily output the treatment means, SEMs, SEDs, LSDs and CIs.
Dr. Vanessa Cave is an applied statistician interested in the application of statistics to the biosciences, in particular agriculture and ecology, and is a developer of the Genstat statistical software package. She has over 15 years of experience collaborating with scientists, using statistics to solve real-world problems. Vanessa provides expertise on experiment and survey design, data collection and management, statistical analysis, and the interpretation of statistical findings. Her interests include statistical consultancy, mixed models, multivariate methods, statistical ecology, statistical graphics and data visualisation, and the statistical challenges related to digital agriculture.
Vanessa is currently President of the Australasian Region of the International Biometric Society, past-President of the New Zealand Statistical Association, an Associate Editor for the Agronomy Journal, on the Editorial Board of The New Zealand Veterinary Journal and an honorary academic at the University of Auckland. She has a PhD in statistics from the University of St Andrews.
Dr. Vanessa Cave
10 May 2022
The essential role of statistical thinking in animal ethics: dealing with reduction
Having spent over 15 years working as an applied statistician in the biosciences, I’ve come across my fair share of animal studies. And one of my greatest bugbears is that the full value is rarely extracted from the experimental data collected. This could be because the best statistical approaches haven’t been employed to analyse the data, the findings are selectively or incorrectly reported, other research programmes that could benefit from the data don’t have access to it, or the data aren’t re-analysed following the advent of new statistical methods or tools that have the potential to draw greater insights from them.
An enormous number of scientific research studies involve animals, and with this come many ethical issues and concerns. To help ensure high standards of animal welfare in scientific research, many governments, universities, R&D companies, and individual scientists have adopted the principles of the 3Rs: Replacement, Reduction and Refinement. Indeed, in many countries the tenets of the 3Rs are enshrined in legislation and regulations around the use of animals in scientific research.
Replacement | Use methods or technologies that replace or avoid the use of animals. |
Reduction | Limit the number of animals used. |
Refinement | Refine methods in order to minimise or eliminate negative animal welfare impacts. |
In this blog, I’ll focus on the second principle, Reduction, and argue that statistical expertise is absolutely crucial for achieving reduction.
The aim of reduction is to minimise the number of animals used in scientific research whilst balancing this against any additional adverse animal welfare impacts and without compromising the scientific value of the research. This principle demands that, before carrying out an experiment (or survey) involving animals, the researchers consider and implement approaches that both:
- minimise current animal use, and
- minimise future animal use.
Both these considerations involve statistical thinking. Let’s begin by exploring the important role statistics plays in minimising current animal use.
Reduction requires that any experiment (or survey) carried out must use as few animals as possible. However, with too few animals the study will lack the statistical power to draw meaningful conclusions, ultimately wasting animals. But how do we determine how many animals are needed for a sufficiently powered experiment? The necessary starting point is to establish clearly defined, specific research questions. These can then be formulated into appropriate statistical hypotheses, for which an experiment (or survey) can be designed.
Statistical expertise in experimental design plays a pivotal role in ensuring enough of the right type of data are collected to answer the research questions as objectively and as efficiently as possible. For example, sophisticated experimental designs involving blocking can be used to reduce random variation, making the experiment more efficient (i.e., increasing the statistical power achievable with fewer animals) as well as guarding against bias. Once a suitable experimental design has been decided upon, a power analysis can be used to calculate the required number of animals (i.e., determine the sample size). Indeed, a power analysis is typically needed to obtain animal ethics approval - a formal process in which the benefits of the proposed research are weighed up against the likely harm to the animals.
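As a purely hypothetical illustration of such a calculation, the single R call below asks how many animals per group are needed for a two-group comparison to detect a difference of 0.8 standard deviations with 90% power at the 5% significance level. The effect size, power, and design are made up, and in a real study they would need to reflect the actual hypotheses and experimental design.

```r
# Sample size for a two-group comparison: detect a standardised
# difference (delta/sd) of 0.8 with 90% power at the 5% level.
power.t.test(delta = 0.8, sd = 1, sig.level = 0.05, power = 0.90)
```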
Researchers also need to investigate whether pre-existing sources of information or data could be integrated into their study, enabling them to reduce the number of animals required. For example, by means of a meta-analysis. At the extreme end, data relevant to the research questions may already be available, eradicating the need for an experiment altogether!
An obvious mechanism for minimising future animal use is to ensure we do it right the first time, avoiding the need for additional experiments. This is easier said than done; there are many statistical and practical considerations at work here. The following paragraphs cover four important steps in experimental research in which statistical expertise plays a major role: data acquisition, data management, data analysis and inference.
Above, I alluded to the validity of the experimental design. If the design is flawed, the data collected will be compromised, if not essentially worthless. Two common mistakes to avoid are pseudo-replication and the lack of (or poor) randomisation. Replication and randomisation are two of the basic principles of good experimental design. Confusing pseudo-replication (either at the design or analysis stage) for genuine replication will lead to invalid statistical inferences. Randomisation is necessary to ensure the statistical inference is valid and for guarding against bias.
Another extremely important consideration when designing an experiment, and setting the sample size, is the risk and impact of missing data due, for example, to animal drop-out or equipment failure. Missing data results in a loss of statistical power, complicates the statistical analysis, and has the potential to cause substantial bias (and potentially invalidate any conclusions). Careful planning and management of an experiment will help minimise the amount of missing data. In addition, safe-guards, controls or contingencies could be built into the experimental design that help mitigate against the impact of missing data. If missing data does result, appropriate statistical methods to account for it must be applied. Failure to do so could invalidate the entire study.
It is also important that the right data are collected to answer the research questions of interest. That is, the right response and explanatory variables measured at the appropriate scale and frequency. There are many statistics-related questions the researchers must answer, including: What population do they want to make inference about? How generalisable do they need their findings to be? What controllable and uncontrollable variables are there? Answers to these questions not only affect the enrolment of animals into the study, but also the conditions they are subjected to and the data that should be collected.
It is essential that the data from the experiment (including meta-data) are appropriately managed and stored to protect their integrity and ensure their usability. If the data get messed up (e.g., if different variables measured on the same animal cannot be linked), are undecipherable (e.g., if the attributes of the variables are unknown) or are incomplete (e.g., if the observations aren’t linked to the structural variables associated with the experimental design), the data are likely worthless. Statisticians can offer invaluable expertise in good data management practices, helping to ensure the data are accurately recorded, the downstream results from analysing the data are reproducible, and the data themselves are reusable at a later date, possibly by a different group of researchers.
Unsurprisingly, it is also vitally important that the data are analysed correctly, using the methods that draw the most value from it. As expected, statistical expertise plays a huge role here! The results and inference are meaningful only if appropriate statistical methods are used. Moreover, often there is a choice of valid statistical approaches; however, some approaches will be more powerful or more precise than others.
Having analysed the data, it is important that the inference (or conclusions) drawn are sound. Again, statistical thinking is crucial here. For example, in my experience, one all too common mistake in animal studies is to accept the null hypothesis and erroneously claim that a non-significant result means there is no difference (say, between treatment means).
The other important mechanism for minimising future animal use is to share the knowledge and information gleaned. The most basic step here is to ensure that all the results are correctly and non-selectively reported. Reporting all aspects of the trial, including the experimental design and statistical analysis, accurately and completely is crucial for the wider interpretation of the findings, reproducibility and repeatability of the research, and for scientific scrutiny. In addition, all results, including null results, are valuable and should be shared.
Sharing the data (or resources, e.g., animal tissues) also contributes to reduction. The data may be able to be re-used for a different purpose, integrated with other sources of data to provide new insights, or re-analysed in the future using a more advanced statistical technique, or for a different hypothesis.
Another avenue that should be explored is whether additional data or information can be obtained from the experiment, without incurring any further adverse animal welfare impacts, that could benefit other researchers and/or future studies. For example, to help address a different research question now or in the future. At the outset of the study, researchers must consider whether their proposed study could be combined with another one, whether the research animals could be shared with another experiment (e.g., animals euthanized for one experiment may provide suitable tissue for use in another), what additional data could be collected that may be (or is!) of future use, etc.
Statistical thinking clearly plays a fundamental role in reducing the number of animals used in scientific research, and in ensuring the most value is drawn from the resulting data. I strongly believe that, to achieve reduction, statistical expertise must be fully utilised throughout the duration of any research project involving animals, from design through to analysis and dissemination of results. In my experience, most researchers strive for very high standards of animal ethics, and absolutely do not want to cause unnecessary harm to animals. Unfortunately, the role statistical expertise plays here is not always appreciated or taken advantage of. So next time you’re thinking of undertaking research involving animals, ensure you have expert statistical input!
Dr. Vanessa Cave is an applied statistician interested in the application of statistics to the biosciences, in particular agriculture and ecology, and is a developer of the Genstat statistical software package. She has over 15 years of experience collaborating with scientists, using statistics to solve real-world problems. Vanessa provides expertise on experiment and survey design, data collection and management, statistical analysis, and the interpretation of statistical findings. Her interests include statistical consultancy, mixed models, multivariate methods, statistical ecology, statistical graphics and data visualisation, and the statistical challenges related to digital agriculture.
Vanessa is currently President of the Australasian Region of the International Biometric Society, past-President of the New Zealand Statistical Association, an Associate Editor for the Agronomy Journal, on the Editorial Board of The New Zealand Veterinary Journal and an honorary academic at the University of Auckland. She has a PhD in statistics from the University of St Andrews.