Introduction
The purpose of this activity is to review important concepts for quantitative analysis so that you can interpret, plan and conduct a number of commonly used statistical tests.
Objectives
- Demonstrate an understanding of key statistical concepts
- Appropriately interpret common statistical tests used to describe populations, compare two groups, and assess correlations between two variables
- Conduct a qualitative or quantitative analysis
This is not a course in statistics, and most of the content will be a review of the introductory statistics that you will have (at minimum) completed in your undergraduate degree.
The overall objective of the module is to get you comfortable enough with basic statistics so that you can interpret, plan and conduct simple quantitative analysis (which also puts you in a good position to get involved in more advanced statistical analysis).
There is an elective component to this module, based on your personal objectives. If you elect to undertake the qualitative version of the Analysis Assignment, you only need to get as far as being able to interpret and plan simple quantitative analyses (for the purposes of this course). If you elect to undertake the quantitative version of the Analysis Assignment, you need to be able to interpret, plan and conduct simple quantitative analyses.
Resources
I will be unapologetically brief in the discussion below. This is a primer after all! Everything I say here is easily found in any introductory statistics text.
The following two resources will help you work through the activities:
- Mackridge and Rowe (2018) (especially Chapter 2 and Chapter 4), see UQ Library link, and
- Illowsky and Dean (2018), see website
Both are available online. Illowsky and Dean (2018) is freely available, and Mackridge and Rowe (2018) is available via the UQ Library.
I have also used Suemoto and Lee (2018) in developing the content.
Key questions
There are three important questions you need to address when considering which statistical test to conduct.
- What type of data are you collecting? (e.g. nominal, ordinal, …)
- What are the possible values of the variable? (e.g. the possible values for a die roll are [1,2,…,5,6])
- How is the data distributed? (What is the probability of observing different values of the variable?)
How you answer these questions determines which statistical analysis is appropriate.
What type of data are you collecting?
I provide descriptors and examples below. For more detail, see Mackridge and Rowe (2018), Chapter 2.
Data Type | Description | Examples
---|---|---
Nominal | Unordered categories or classes | gender, race, favourite food
Ordinal | Ordered categories | NYHA classification of heart failure
Discrete | Numerical values that represent measurable quantities; restricted to whole numbers | counts
Continuous | Measurable quantities that may be fractions or decimals | height, weight
Nominal data is also referred to as “categorical data”. There is no natural ordering for nominal data. Ordinal data has an ordered scale, but the interval between items on the scale may not be the same. The NYHA Heart Failure classification is a good example of an ordered scale. The differences between levels reflect a clinical assessment of the key features and severity of heart failure; they are not numeric (someone with a classification of II is not twice as sick as someone with a classification of I).
By contrast, discrete and continuous data are interval measures. The values of the variable give information about order (e.g. larger, smaller) and the distance between the values. Count data is discrete and does both of these things. For discrete data, there comes a point at which the measure can’t be broken down any further, e.g. an event either happened or it didn’t. Continuous data also provides information on order and distance between values, but it can be broken down to the limits of how we measure it, e.g. an individual’s height might be measured as 183.258 cm.
The type of data you have informs how to present the data and the kinds of calculations you can do with it.
Summarizing data
Nominal and ordinal data are typically presented as absolute and relative frequencies. How many participants were female? What percentage of the participants had a diagnosis of diabetes? Tables and bar charts are often appropriate.
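As a minimal sketch of absolute and relative frequencies, the Python snippet below tabulates a small nominal variable. The data and the function name `frequency_table` are made up for illustration only.

```python
from collections import Counter

# Hypothetical nominal data: diagnosis recorded for ten participants.
diagnoses = ["diabetes", "none", "asthma", "diabetes", "none",
             "none", "diabetes", "asthma", "none", "none"]

def frequency_table(values):
    """Return {category: (absolute frequency, relative frequency in %)}."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}

table = frequency_table(diagnoses)
for category, (n, pct) in sorted(table.items()):
    print(f"{category:<10} {n:>3} {pct:5.1f}%")
```

The same counts could be passed to a bar chart function in whichever plotting tool you prefer.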
Discrete and continuous data are frequently summarized using some function of the data, such as measures of central tendency (e.g. mean) and measures of dispersion (e.g. standard deviation).
If you need to, review the meaning of the following in a statistics text:
- Mode
- Mean
- Median
- Range
- Interquartile range
- Variance and standard deviation
- Confidence interval
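If it helps to see these summaries computed, here is a small sketch using Python's standard `statistics` module. The heights are invented, and the confidence interval uses a normal approximation, which is an assumption made for simplicity rather than the method a statistics package would necessarily apply to such a small sample.

```python
import math
import statistics

# Hypothetical continuous data: heights (cm) of nine participants.
heights = [158.0, 162.5, 165.0, 167.5, 170.0, 172.5, 175.0, 180.0, 183.5]

mean = statistics.mean(heights)
median = statistics.median(heights)
value_range = max(heights) - min(heights)
variance = statistics.variance(heights)  # sample variance (n - 1 denominator)
sd = statistics.stdev(heights)           # sample standard deviation

# Quartiles. statistics.quantiles defaults to the "exclusive" method,
# so other software may report slightly different quartiles.
q1, q2, q3 = statistics.quantiles(heights, n=4)
iqr = q3 - q1

# The mode is most useful for discrete or categorical data.
mode = statistics.mode([2, 3, 3, 4, 5])

# Rough 95% confidence interval for the mean using the normal approximation
# (an assumption: with only nine values a t-based interval would be wider).
half_width = 1.96 * sd / math.sqrt(len(heights))
ci_95 = (mean - half_width, mean + half_width)

print(f"mean={mean:.1f} median={median} range={value_range} IQR={iqr}")
print(f"sd={sd:.2f} mode={mode} 95% CI=({ci_95[0]:.1f}, {ci_95[1]:.1f})")
```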
What are the possible values of the variable?
The set of possible values that a variable can take is called the sample space.
The sample space for a single coin toss is [H,T].
The sample space for two tosses of a coin is [HH, HT, TH, TT].
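The coin-toss sample spaces can be enumerated mechanically. The sketch below does this in Python; the function name `sample_space` is just an illustrative choice.

```python
from itertools import product

def sample_space(n_tosses):
    """All possible sequences of Heads (H) and Tails (T) for n_tosses tosses."""
    return ["".join(seq) for seq in product("HT", repeat=n_tosses)]

print(sample_space(1))  # ['H', 'T']
print(sample_space(2))  # ['HH', 'HT', 'TH', 'TT']
```

Each extra toss doubles the size of the sample space, so three tosses give eight possible sequences.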
It is harder to put boundaries on some continuous measures. For example, the sample space for heights (cm) of students enrolled in the course would be something like [100–200]; or if we definitely wanted to include every possible person [50–300].
The possible values for a 5-point Likert scale are [1,2,3,4,5]. The possible values of an average of 5-point Likert scales from a sample are continuous [1–5] (assuming it is appropriate to interpret the Likert scale as interval data).
How is the data distributed?
The decision of which statistical test is appropriate depends on how we expect values of the variable to be distributed. This involves assigning probabilities to the sample space: a probability distribution. The probabilities that are assigned will depend on what we know about the mechanism that generates the data.
It is probably easiest to start by thinking about a discrete variable.
Consider an experiment involving the toss of a single coin. There are two possible outcomes: Heads or Tails. The sample space is [H,T]. If the coin is fair, we can easily assign probabilities to the sample space: each outcome has a probability of 0.5. Probability is equally distributed over the two possible outcomes.
Suppose we weren’t sure whether a coin was fair and we wanted to design an experiment to test whether it was. One strategy would be to toss the coin 100 times and record how many Heads we observe.
Now the sample space is [0,1,2,…,99,100]: we could get anything from no Heads to 100 Heads (and every number in between). Since we don’t know whether the coin we are testing is fair, we don’t know the probability of each of the outcomes, [0,1,…,100], for the coin we are interested in.
But we do know what we would expect if the coin were fair. For example, if we observed no Heads in a trial of 100 tosses, very few people would be willing to entertain the idea that the coin was fair. Extending this idea further, if we assume the coin is fair, we can calculate the probability of each of the possible outcomes of the experiment using the binomial distribution.[1] The binomial distribution is defined by \(n\), the number of independent trials, and \(p\), the probability of success on any trial.
This allows us to conduct the experiment and calculate how probable the result we observe (or one more extreme) would be if the coin were fair. If this probability is very low, we might infer that the coin is biased.
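The coin experiment can be sketched in a few lines of Python. The function names `binomial_pmf` and `two_sided_p_value` are illustrative, and the "at least as far from 50 as the observed count" definition of an extreme result is one common convention, assumed here because it is simple for a fair coin.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_p_value(heads, n=100, p=0.5):
    """Probability, under a fair coin, of a count at least as far from
    the expected count n*p as the one observed."""
    expected = n * p
    distance = abs(heads - expected)
    return sum(binomial_pmf(k, n, p)
               for k in range(n + 1)
               if abs(k - expected) >= distance)

print(f"P(exactly 50 Heads)       = {binomial_pmf(50, 100, 0.5):.4f}")
print(f"P(as extreme as 60 Heads) = {two_sided_p_value(60):.4f}")
print(f"P(as extreme as 70 Heads) = {two_sided_p_value(70):.6f}")
```

Observing 60 Heads is only mildly surprising for a fair coin, while 70 Heads would be very hard to reconcile with fairness.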
There are many different probability distributions for discrete and continuous data. A particularly important probability distribution for continuous data is the normal distribution.
The normal distribution is important because it can be used to estimate probabilities for many continuous variables. Many statistical tests are available for normally distributed variables.
If you need to, review the properties of the normal distribution: see Illowsky and Dean (2018), Chapter 6.
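As a quick refresher, the sketch below uses Python's `statistics.NormalDist` to compute a few probabilities. The mean of 170 cm and standard deviation of 10 cm are assumed values chosen purely for illustration.

```python
from statistics import NormalDist

# Hypothetical assumption: heights follow Normal(mean=170 cm, sd=10 cm).
heights = NormalDist(mu=170, sigma=10)

# Probability that a randomly chosen height is below 183 cm.
p_below_183 = heights.cdf(183)

# Probability of falling within 1 and 2 standard deviations of the mean.
within_1_sd = heights.cdf(180) - heights.cdf(160)
within_2_sd = heights.cdf(190) - heights.cdf(150)

print(f"P(height < 183)  = {p_below_183:.3f}")
print(f"P(within 1 sd)   = {within_1_sd:.3f}")
print(f"P(within 2 sd)   = {within_2_sd:.3f}")
```

The last two values are the familiar rule of thumb: roughly 68% of values lie within one standard deviation of the mean and roughly 95% within two.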
Given the many statistical tests that assume data is normally distributed, it is often important to judge whether the variable you are assessing is normally distributed. Many continuous physiological measurements are approximately normally distributed. So too are many functions of other types of data. For example, while responses on a 5-point Likert scale are unlikely to be normally distributed, the difference in mean responses between two groups of participants is likely to be approximately normally distributed, provided the groups are reasonably large.
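The Likert example can be explored with a small simulation. This is only a sketch: the response probabilities, the group size of 50 and the coding of responses as 1 to 5 are all assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

LIKERT = [1, 2, 3, 4, 5]
WEIGHTS = [0.40, 0.25, 0.15, 0.10, 0.10]  # hypothetical, clearly skewed

def mean_difference(n_per_group):
    """Difference in mean response between two simulated groups."""
    a = rng.choices(LIKERT, weights=WEIGHTS, k=n_per_group)
    b = rng.choices(LIKERT, weights=WEIGHTS, k=n_per_group)
    return statistics.mean(a) - statistics.mean(b)

# Repeat the two-group "study" many times and look at the differences.
diffs = [mean_difference(50) for _ in range(2000)]
centre = statistics.mean(diffs)
spread = statistics.stdev(diffs)

# For a normal distribution, about 95% of values lie within 1.96 sd.
share_within = sum(abs(d - centre) <= 1.96 * spread for d in diffs) / len(diffs)

print(f"centre={centre:.3f} spread={spread:.3f} share within 1.96 sd={share_within:.3f}")
```

Although individual responses are heavily skewed, the share of simulated differences within 1.96 standard deviations should come out close to the 95% expected for a normal distribution.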
It is sometimes necessary to check your assumption that the data is normally distributed. A good way to do this is to plot the data to see if the observed data is consistent with the assumption. If you have a small sample size, you wouldn’t expect to see a perfect normal distribution, but seeing the data plotted can give you an idea if the assumption is reasonable. There are also several statistical tests that can be used, though these require a sufficient sample size to be reliable.
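A plot does not need to be sophisticated to be useful. The sketch below prints a crude text histogram in Python; the blood pressure values and the function name `text_histogram` are invented for illustration, and in practice you would more likely use the histogram or Q-Q plot functions of your statistical package.

```python
from collections import Counter

# Hypothetical data: systolic blood pressure (mmHg) for fifteen participants.
pressures = [112, 118, 121, 124, 125, 127, 128, 130,
             131, 133, 134, 136, 139, 142, 148]

def text_histogram(values, bin_width):
    """Print one row per bin with one '#' per observation; return the counts."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    for start in sorted(bins):
        print(f"{start}-{start + bin_width - 1}: {'#' * bins[start]}")
    return bins

bins = text_histogram(pressures, bin_width=10)
```

With a sample this small the shape is only suggestive: a single peak with roughly symmetric tails is consistent with the normality assumption, but does not prove it.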
Choosing the right statistical test
Once you have addressed the questions listed above, you are close to being able to identify an appropriate statistical test to address your research question.
Use Mackridge and Rowe (2018), Chapter 4, as a resource for this section.
A research question that can be answered using quantitative analysis can be framed in terms of the influence of one or more factors (independent variables) on an outcome (dependent variable). Selecting an appropriate statistical test depends on:
- the type of data provided by the outcome variable,
- the type of data and number of factors that are of interest,
- whether or not the outcome variable is normally distributed, and
- whether the groups you are comparing are ‘related’ or ‘independent’.
We have discussed the first three of these; the fourth is new. The choice of statistical test depends on the relationship between the groups you are comparing. If you are conducting a comparison where the ‘before’ group contains the same participants that you compare ‘after’ an intervention, your sample is ‘related’, and you need to select an appropriate test (e.g. paired t-test). If you are conducting a comparison between two groups that have been randomly allocated, your sample is ‘independent’, and you need to select a test for independent groups (e.g. two-sample t-test).
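To see why the distinction matters, the sketch below computes both t statistics for the same invented before/after measurements. The data and function names are hypothetical, the two-sample version assumes equal variances (Student's pooled form), and only the test statistics are computed, not p-values.

```python
import math
import statistics

def paired_t(before, after):
    """t statistic for related samples, based on within-participant differences."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

def two_sample_t(group_a, group_b):
    """Student's t statistic for independent samples (pooled variance)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * statistics.variance(group_a)
                  + (nb - 1) * statistics.variance(group_b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se

# Hypothetical systolic blood pressure for the same six participants.
before = [140, 152, 138, 145, 160, 149]
after = [135, 147, 136, 139, 154, 146]

print(f"paired t     = {paired_t(before, after):.2f}")
print(f"two-sample t = {two_sample_t(after, before):.2f}")
```

Every participant improved by a few units, so the paired statistic is large in magnitude. Treating the same numbers as two unrelated groups buries that consistent change under the variation between participants, and the two-sample statistic is much smaller.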
Question/Factor | Outcome: Continuous (normal) | Outcome: Continuous (non-normal) | Outcome: Ordinal | Outcome: Categorical
---|---|---|---|---
Compare two groups (one factor) | t-test (paired or two-sample) | Mann-Whitney Wilcoxon | Mann-Whitney Wilcoxon | Chi-square
Compare more than two groups (2 or more factors) | ANOVA | Kruskal-Wallis (unpaired) or Friedman test (paired) | Kruskal-Wallis (unpaired) or Friedman test (paired) | Chi-square
Association between two variables (independent, continuous) | Pearson correlation | Spearman correlation | Spearman correlation | Logistic regression
Association between three or more variables (independent, continuous) | Multiple linear regression | | Ordinal logistic regression | Multiple logistic regression or multinomial logistic regression
The table should help you start thinking about an appropriate statistical analysis for a particular research question.
It is important to recognise that it is a starting point.
Seek further information about these tests and the assumptions that need to hold in order to use them reliably.
You will find this information in standard statistics texts, online, and via help functions in common statistical packages.
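One way to use the table is as a simple lookup from research question and outcome type to a candidate test. The Python sketch below encodes the first three rows only; the key strings and the function name `suggest_test` are invented labels, and the result is a starting point for further reading rather than a recommendation.

```python
# Candidate tests keyed by (question, outcome type); first three table rows only.
TESTS = {
    ("two groups", "continuous normal"): "t-test (paired or two-sample)",
    ("two groups", "continuous non-normal"): "Mann-Whitney Wilcoxon",
    ("two groups", "ordinal"): "Mann-Whitney Wilcoxon",
    ("two groups", "categorical"): "Chi-square",
    ("more than two groups", "continuous normal"): "ANOVA",
    ("more than two groups", "continuous non-normal"):
        "Kruskal-Wallis (unpaired) or Friedman test (paired)",
    ("more than two groups", "ordinal"):
        "Kruskal-Wallis (unpaired) or Friedman test (paired)",
    ("more than two groups", "categorical"): "Chi-square",
    ("association, two variables", "continuous normal"): "Pearson correlation",
    ("association, two variables", "continuous non-normal"): "Spearman correlation",
    ("association, two variables", "ordinal"): "Spearman correlation",
    ("association, two variables", "categorical"): "Logistic regression",
}

def suggest_test(question, outcome):
    """Return the candidate test from the table, or raise if not covered."""
    try:
        return TESTS[(question, outcome)]
    except KeyError:
        raise KeyError(f"No table entry for {question!r} with {outcome!r} outcome")

print(suggest_test("two groups", "ordinal"))
print(suggest_test("association, two variables", "continuous normal"))
```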
It is important to keep in mind that statistical tests are implemented in statistical software in different ways.
For example, the Mann-Whitney/Wilcoxon tests are implemented in R via the wilcox.test function: what some call the Mann-Whitney rank sum test is run using wilcox.test with the option paired=FALSE, and the Wilcoxon signed-rank test is run using wilcox.test with the option paired=TRUE.
The lesson here is to read the help/instruction files associated with the tests you use in your statistical software.
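To make clear that these really are two different calculations, the Python sketch below computes the two underlying statistics by hand for the same invented data. It is not how R implements wilcox.test; the function names and data are hypothetical, and no p-values are calculated.

```python
def mann_whitney_u(x, y):
    """Rank sum idea for independent groups: count the (x, y) pairs in which
    the x value is larger, with ties counting as half."""
    return sum(1.0 if xi > yi else 0.5 if xi == yi else 0.0
               for xi in x for yi in y)

def signed_rank_w_plus(before, after):
    """Signed-rank idea for related samples: rank the absolute non-zero
    differences (ties share the average rank) and sum the ranks of the
    positive differences."""
    diffs = [a - b for a, b in zip(after, before) if a != b]
    ordered = sorted(abs(d) for d in diffs)

    def average_rank(value):
        first = ordered.index(value) + 1
        last = first + ordered.count(value) - 1
        return (first + last) / 2

    return sum(average_rank(abs(d)) for d in diffs if d > 0)

# Hypothetical systolic blood pressure for the same six participants.
before = [140, 152, 138, 145, 160, 149]
after = [135, 147, 136, 139, 154, 146]

print(mann_whitney_u(after, before))      # ignores the pairing
print(signed_rank_w_plus(before, after))  # uses the pairing
```

The first statistic compares every 'after' value with every 'before' value, while the second looks only at each participant's own change, which is why choosing the wrong option in your software answers a different question.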
References
Illigens, Ben M. W., Fernanda Lopes, and Felipe Fregni. 2018. “Parametric Statistical Tests.” In Critical Thinking in Clinical Research, edited by Felipe Fregni and Ben M. W. Illigens, 181–205.
Illowsky, Barbara, and Susan Dean. 2018. Introductory Statistics. Houston: OpenStax.
Mackridge, Adam J, and Philip H Rowe. 2018. A Practical Approach to Using Statistics in Health Research: From Planning to Reporting. New Jersey: Wiley.
Suemoto, Claudia Kimie, and Catherine Lee. 2018. “Basics of Statistics.” In Critical Thinking in Clinical Research, edited by Felipe Fregni and Ben M. W. Illigens, 518.
[1] You need to know that statistics often relies on assumptions about probability distributions. You don’t need to know a lot about the range of different distributions, including the binomial distribution. Focus on the general point, rather than the specific example.