Oral-nasal airflow: statistics
Below are some brief notes on the statistics used for the nasality data.
Populations and samples
When we examine speech data we are usually trying to determine the vocal behaviour of some population of people (the main exception would be a clinical examination of an individual). This might be a large population, such as the entire Australian English speech community, or a smaller population, such as the inhabitants of a particular village in Papua New Guinea. Unless the population is very small and we have access to all of its members (very unlikely even for small populations) we need to record data from a sample (sub-set) of that population. Once we have done so we need to determine whether that sample gives us a reasonable estimate of the vocal behaviour of the entire population. We might, for example, determine the mean (average) value for the nasal airflow (as a percentage of total airflow) in a particular context (eg. a particular vowel in a particular consonantal context). The question that we should ask is whether this value, based on measurements from a small sample of speakers, is a good estimate of the population mean. In general, the larger the random sample, the more likely it is that a value derived from it is a good estimate of the population mean. For random samples of about 30 or more, the means of repeated samples tend to follow a close match to a normal (bell-curve) distribution, largely irrespective of the actual shape of the population distribution. Mathematical formulae known as probability distributions provide us with a measure of the probability that a mean derived from a sample of a particular size is a good estimate of the population mean.
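The behaviour of sample means described above can be checked with a small simulation. The following is only a sketch: the "population" is a hypothetical, strongly skewed (exponential) distribution chosen for illustration, not airflow data, and the sample size and number of repetitions are arbitrary assumptions.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch gives the same numbers each run

SAMPLE_SIZE = 30     # subjects per random sample (assumed for illustration)
NUM_SAMPLES = 2000   # how many separate samples we draw

def draw_sample(n):
    """Draw n values from a hypothetical skewed population (mean 1, SD 1)."""
    return [random.expovariate(1.0) for _ in range(n)]

# One mean per sample; these means are what cluster in a bell curve.
sample_means = [statistics.mean(draw_sample(SAMPLE_SIZE))
                for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_se = 1.0 / SAMPLE_SIZE ** 0.5  # population SD divided by sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean: 1.000)")
print(f"SD of sample means:   {sd_of_means:.3f} (theory: {expected_se:.3f})")
```

Even though the individual values are far from bell-shaped, the sample means should sit close to the population mean, with a spread near the population standard deviation divided by the square root of the sample size.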
Very often what we are particularly interested in is the comparison of means in order to determine the probability that a difference between two sample means is a true indication of an actual difference between the equivalent two population means. For example, we might note that the mean nasal airflow (as a percentage of total airflow) for a particular passage (eg. the "rainbow passage" spoken at a normal speaking rate) is different for a sample of male speakers of Australian English compared to a sample of female speakers of Australian English. Does this difference between these two sample means reasonably suggest that there is such a difference for the Australian English speech community as a whole?
In order to determine this we need to know the two means, but we also need to know the standard deviations of the two distributions (ie. the spread of results for the two samples) as well as the number of subjects in the two samples. A standard deviation can be calculated for any set of measurements, but the interpretation that follows only holds if the data are normally distributed. For a normal curve, the region between the mean and one standard deviation above it contains about 34.1% of the sample, and the region within one standard deviation either side of the mean contains about 68.3% of the sample. Given two samples with their means and standard deviations as well as the sample sizes, we can determine whether the difference between the two means is statistically significant, that is, whether it is unlikely to have arisen by chance if there were no true population difference.
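The normal-curve percentages quoted above (roughly 34% on one side of the mean and 68% across both sides) can be reproduced with Python's standard library. This is only a quick check using a standard normal distribution (mean 0, standard deviation 1); the variable names are my own.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Share of a normal distribution between the mean and one SD above it.
one_side = standard_normal.cdf(1.0) - standard_normal.cdf(0.0)

# Share within one SD either side of the mean.
both_sides = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)

print(f"mean to +1 SD:  {one_side:.1%}")
print(f"-1 SD to +1 SD: {both_sides:.1%}")
```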
There are many statistical tests for comparing means and the one that we choose depends upon the type of data that we are examining and the extent to which the samples are normally distributed (bell-curve shaped). In the oral-nasal airflow and the oral-nasal sound intensity data we have what is known as continuous data. That is, there is a range of values with all possible gradations of measurement in between. In other words we aren't dealing with discrete counts or with binary data (such as yes-no answers). Furthermore, we assume that our data are approximately normally distributed, and the t-test becomes less sensitive to departures from this assumption as samples get larger. The statistic chosen for this experiment is the t-test, and I have opted for the two-tailed t-test as it simply tests whether two means are different and makes no assumptions about the direction of the difference. For some of the comparisons used in this workshop I probably should have used a one-tailed t-test, as we can reasonably predict that, for example, the degree of nasal airflow in the "Naomi passage" will be greater than the degree of nasal airflow in the "rainbow passage", because the former passage has a larger percentage of syllables with nasal consonants. In spite of this I opted for simplicity and used the same two-tailed t-test throughout the workshop.
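The difference between a one-tailed and a two-tailed test can be illustrated with a small worked example. This is a sketch under stated assumptions, not the calculation used for the workshop tables: the means, standard deviations and sample sizes are hypothetical, the unequal-variance (Welch) form of the t-test is my own choice, and the tail area of the t distribution is approximated by numerical integration so that only the standard library is needed.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

def upper_tail(t, df, steps=2000):
    """P(T >= t), using Simpson's rule for the area between 0 and |t|."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 - area if t >= 0 else 0.5 + area

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic and degrees of freedom from summary statistics (Welch)."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical nasal airflow percentages: "Naomi passage" vs "rainbow passage".
t, df = welch_t(30.0, 8.0, 16, 25.0, 8.0, 16)

p_one_tailed = upper_tail(t, df)            # predicts Naomi > rainbow
p_two_tailed = 2 * upper_tail(abs(t), df)   # any difference, either direction

print(f"t = {t:.3f}, df = {df:.1f}")
print(f"one-tailed p = {p_one_tailed:.3f}, two-tailed p = {p_two_tailed:.3f}")
```

With these made-up figures the one-tailed probability falls below 0.05 while the two-tailed probability does not, which shows why the choice of test matters most for borderline results.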
In the tables on the results page I have not quoted the actual t values, but have only quoted the probability levels that indicate whether the difference between two means is significant.
Probability of significant difference between two means
On the results pages I have only quoted three probability (p) values, "p<0.05", "p<0.01" and "p<0.001" or alternatively I have indicated that two means are not significantly different.
Such probabilities are actually tests of what is known as the "null hypothesis". The null hypothesis is simply the hypothesis that there is no true population difference for a particular pair of measurements. For example, when we are comparing the oral-nasal airflow for male and female speakers of Australian English reading the "rainbow passage" at a normal speaking rate we set out to test the hypothesis that there is no true difference for the population as a whole. This is the goal of most statistical tests. Even though we might wish to show that there is a difference, our statistical tests return the probability of obtaining a difference at least as large as the one we observed if there were in fact no population difference. So when we say that "p<0.05" we are saying that, if the two population means were not different, there would be a less than 5% (1/20) chance of finding a difference this large in our samples. Similarly, when we say that "p<0.01" we are saying that there would be a less than 1% (1/100) chance of finding such a difference if the two population means were not different.
When we quote a probability of 5% we are saying that, if the null hypothesis were true, there would be a 1 in 20 chance that random factors alone would produce differences between the means as large as those we observe, even though these differences are not real. If we treat such a result as evidence of a real population difference, are we willing to accept a 5% chance that we may be wrong (that is, that we would be wrong in 1 out of 20 cases where there is no true difference)? In the physical sciences (eg. physics) this is usually considered too high a chance of error, and the acceptable probability of error is typically 1% or even smaller for some types of physical measurement. In the social sciences a 5% probability is often considered acceptable. It is rarely considered appropriate, in either the physical or the social sciences, to accept a probability of error of greater than 5%. In the present workshop we are dealing with highly variable characteristics of human vocal performance. We can quote more exact measures of probability, but it is common to only quote whether the probability is less than 5% (marginally reliable) or less than 1% (highly reliable). Sometimes we might also quote a probability of "p<0.001" or 1 chance in 1000 of error (very highly reliable). For probabilities of 5% or greater (p>=0.05) we assume that there is no significant difference between the measurements (because the chances of error for thinking otherwise are too high).
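The reporting convention described above can be written as a small function. This is just a sketch of the convention, not code used to build the results tables; the function name and the "not significant" wording are my own assumptions.

```python
def significance_label(p):
    """Map an exact probability to the strictest threshold it falls under."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a probability must lie between 0 and 1")
    # Check the strictest threshold first so the strongest label wins.
    for threshold in (0.001, 0.01, 0.05):
        if p < threshold:
            return f"p<{threshold}"
    return "not significant"  # p >= 0.05

for p in (0.0004, 0.008, 0.03, 0.05, 0.2):
    print(p, "->", significance_label(p))
```

Note that a probability of exactly 0.05 is reported as not significant, since the thresholds are strict "less than" comparisons.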
Sample size and degrees of freedom
In the tables on the results page you will find, in addition to the mean and standard deviation values, a value for "Number". In tables 1, 2 and 4 "Number" refers to the number of subjects. The t-test can be calculated from the mean and standard deviation plus the number of degrees of freedom. Degrees of freedom (df) is directly related to the number of tokens or subjects being compared and df for the t-tests used in this workshop is equal to "number - 1".
The "number" quoted in table 3 is more problematic and is a combination of the number of subjects and the number of tokens (eg. 16 subjects x 3 vowels = 48 tokens). It's only possible to combine the number of tokens per subject with the number of subjects to calculate degrees of freedom if tokens and subjects are completely independent. This assumption was made here, and it greatly affects the outcome of the t-test, but it is not a completely compelling assumption as it seems likely that there is an interaction between subjects and tokens. It may be that I should have instead calculated df from the number of subjects alone (not subjects times tokens), and this would have resulted in no significant difference between the "High-back" and the "Non-high-back" vowels. If these same means and standard deviations were to be found for 30 or more subjects then, and only then, would the difference between these two vowel categories be found to be significant (when df is calculated from number of subjects alone).
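The effect of the choice of "number" on the outcome can be illustrated by holding the mean and standard deviation fixed and changing only the number used for the calculation. This is a sketch with hypothetical values, not the table 3 data: the mean difference and standard deviation are invented, and I assume a simple t statistic of the form mean / (SD / sqrt(number)) with df = number - 1, which may not match the exact form of test used in the workshop.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p: 1 minus the area between -|t| and +|t| (Simpson's rule)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * (total * h / 3)

MEAN_DIFF = 2.0  # hypothetical mean difference between two vowel categories
SD_DIFF = 4.5    # hypothetical standard deviation of that difference

def p_for(number):
    """p-value when 'number' observations are treated as independent."""
    t = MEAN_DIFF / (SD_DIFF / math.sqrt(number))
    return two_tailed_p(t, df=number - 1)

for number, meaning in ((16, "subjects only"),
                        (30, "30 subjects"),
                        (48, "16 subjects x 3 tokens")):
    print(f"number = {number:2d} ({meaning}): p = {p_for(number):.4f}")
```

With these invented values the same mean and standard deviation give a non-significant result for 16, a result below 0.05 for 30, and a result below 0.01 for 48, so treating tokens as independent observations can by itself turn a non-significant difference into a significant one.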