Basic
Statistics that summarise/describe a mass of
information.
Graham Tall research@grahamtall.com
September 2003
Averages Mean Median Mode
Spread of Scores Standard Deviation Mean Deviation Range Inter-quartile Range Semi-Inter-Quartile Range
Correlation Pearson's r Correlation Matrix
I. AVERAGE
| a)
|
The MEAN is the average that everybody recognises. It is simply the sum of all the figures, divided by the number of figures. Its widespread use demonstrates its value. It is the only average that can be calculated on most calculators. Requires Interval numbers. |
| b)
|
The MEDIAN is the middle mark and can be used with ordinal numbers (data that can be placed in rank order). To calculate: place the numbers in order and find the middle number. (NB With an even number of marks, the median is halfway between the two middle marks.) |
| c) | The MODE is the most common result. The grade/response that most people obtained/gave. |
EXAMPLE 1: MARKS:
| Data: Marks placed in rank order: | 2 | 4 | 6 | 6 | 6 | 7 | 8 | 8 | 9 | 10 | |
| Rank or Position | 10 | 9 | 6= | 6= | 6= | 5 | 3= | 3= | 2 | 1 | |
| Number from one end | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
The MEAN = (2 + 4 + 6 + 6 + 6 + 7 + 8 + 8 + 9 + 10) ÷ 10 = 66 ÷ 10 = 6.6
The MEDIAN = The middle response: between 5th & 6th numbers = 6.5
The MODE = The commonest number = 6
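The three averages in Example 1 can be checked with a short Python sketch. It assumes Python's standard `statistics` module; the variable names are illustrative only.

```python
import statistics

# Marks from Example 1, already in rank order
marks = [2, 4, 6, 6, 6, 7, 8, 8, 9, 10]

mean = statistics.mean(marks)      # sum of the marks divided by how many there are
median = statistics.median(marks)  # halfway between the 5th and 6th marks
mode = statistics.mode(marks)      # the most common mark

print(mean, median, mode)
```

This should reproduce the values worked out by hand above: 6.6, 6.5 and 6.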
II. Spread of Scores
Whilst the most frequently used statistic is standard deviation, other statistics which are sometimes used include:
| a)
|
Range. This is the easiest statistic to calculate: it is
simply the difference between the smallest and largest scores. The
major weakness of the range is that a single very low or very high result suggests a much
greater general spread of scores than really exists. N.B. Such extreme results are
known statistically as outliers. In
the example below, there are no outliers. Calculation: |
Marks placed in rank order: 2, 2, 3, 4, 5, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9, 10
2 - - - - - - - - - - range - - - - - - - - - - 10
Range = 10 - 2 = 8
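A minimal Python sketch of the range calculation, using the marks listed above. The extra score of 40 is hypothetical, invented here only to illustrate how one outlier inflates the range.

```python
# Marks from the range example, in rank order
marks = [2, 2, 3, 4, 5, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9, 10]

# Range = largest score minus smallest score
score_range = max(marks) - min(marks)

# Hypothetical extra score, not part of the original data
marks_with_outlier = marks + [40]
range_with_outlier = max(marks_with_outlier) - min(marks_with_outlier)

print(score_range)         # 8
print(range_with_outlier)  # 38
```

One added score changes the range from 8 to 38, although the other seventeen scores are exactly as before.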
| b)
|
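Inter-quartile and Semi-inter-quartile range are used with the median
(the middle score). The inter-quartile range simply states the range within which the
middle half of the scores lies and is very easy to calculate. Calculation: place the scores in rank order, find the lower quartile (the score a quarter of the way up the list) and the upper quartile (the score three quarters of the way up); the inter-quartile range is the difference between the two. The semi-inter-quartile range is half the inter-quartile range.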
|
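The inter-quartile range can be sketched in Python as follows. This reuses the marks from the range example and assumes the "inclusive" quartile convention of Python's `statistics.quantiles`; other textbooks and packages use slightly different conventions, so this is not necessarily the method the original calculation used.

```python
import statistics

# Marks from the range example, in rank order
marks = [2, 2, 3, 4, 5, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9, 10]

# Split the ranked scores into four equal parts; the three cut points are
# the lower quartile, the median and the upper quartile.
lower_q, median, upper_q = statistics.quantiles(marks, n=4, method="inclusive")

iqr = upper_q - lower_q   # range covering the middle half of the scores
semi_iqr = iqr / 2        # semi-inter-quartile range

print(lower_q, upper_q, iqr, semi_iqr)
```

With this convention the lower quartile is 5 and the upper quartile is 8, so the middle half of the scores spans 3 marks and the semi-inter-quartile range is 1.5.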
| c)
|
Mean deviation: This is now rarely used. It is, however,
worth understanding because it explains the logic of standard deviation, which is
simply attempting to measure the average spread of marks from the group
mean. Calculation:
1. Discover the group mean.
2. Discover how far each score is from the mean: (score - mean).
3. Add the differences together, ignoring negative signs.
4. Divide the sum of the differences by the number of scores.
|
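The mean deviation steps can be followed in a few lines of Python. This sketch reuses the Example 1 marks; the variable names are illustrative.

```python
# Marks from Example 1
marks = [2, 4, 6, 6, 6, 7, 8, 8, 9, 10]

# Step 1: discover the group mean
mean = sum(marks) / len(marks)

# Step 2: how far each score is from the mean, ignoring negative signs
distances = [abs(score - mean) for score in marks]

# Steps 3 and 4: add the distances and divide by the number of scores
mean_deviation = sum(distances) / len(marks)

print(round(mean_deviation, 2))  # 1.8
```

On average, then, a mark in Example 1 lies 1.8 marks away from the mean of 6.6.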
| d)
|
Standard deviation (S.D. or σ). In statistical terms the
standard deviation is the most commonly used measure of spread and measures the
distribution of scores around the mean. It is measured in the same units as the original
numbers: if the original data is examination marks the standard deviation is in examination marks; if in percentages it is in percentages. Generally speaking, when scores are roughly normally distributed, about 68% of scores are within ± 1 S.D. (i.e. plus or minus one S.D.) of the mean and the great majority (about 95%) are within ± 2 S.D. of the mean. Thus, knowing the standard deviation, we know the spread of scores which most children attain. |
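For an exactly normal distribution, the proportion of scores within one or two standard deviations of the mean can be computed directly. The sketch below assumes Python's standard `statistics.NormalDist`; real test scores are only approximately normal, so the figures are a guide rather than a guarantee.

```python
from statistics import NormalDist

# A standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of scores within 1 S.D. and within 2 S.D. of the mean
within_1_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_2_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)

print(round(within_1_sd, 3), round(within_2_sd, 3))  # 0.683 0.954
```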
Figure 1: Groups where the mean scores are similar but the spread of scores differs substantially
Standard Deviation formula. THIS DOES NOT HAVE TO BE LEARNED
1. The formula first calculates the sum of the deviations of each score from the mean:
$$\sum (X - \bar{X})$$
($X$ means each individual X score; $\bar{X}$ is the MEAN of the X scores)
2. The problem with the above is that the negative deviations of the scores below the mean exactly cancel the positive deviations of the scores above it, so adding up the differences would always give an answer of zero! The mathematical solution is to square the result of each subtraction:
$$\sum (X - \bar{X})^2$$
3. To find out how big each squared deviation from the mean is on average, the statisticians divide the above sum by the number of scores (N). The result is called the variance:
$$\frac{\sum (X - \bar{X})^2}{N}$$
4. Finally, to remove the effect of the squaring, the variance is square-rooted:
$$\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$$
(This is the Standard Deviation)
Note 1: Because the number has been square rooted, the standard deviation is measured in original units.
Note 2: If the general distribution of scores is bell-shaped, i.e. most individuals score around the mean score, the remaining scores being distributed evenly outside this middle band, then the distribution is similar to the normal distribution. When this occurs about 68% of the scores lie within one standard deviation of the mean.
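The four steps of the standard deviation formula can be traced in Python. This is a sketch using the Example 1 marks; the final line compares the result with `statistics.pstdev`, the standard-library function that also divides by N.

```python
import math
import statistics

# Marks from Example 1
marks = [2, 4, 6, 6, 6, 7, 8, 8, 9, 10]
n = len(marks)
mean = sum(marks) / n

# Step 1: deviations of each score from the mean (these sum to zero)
deviations = [x - mean for x in marks]

# Step 2: square each deviation so negatives no longer cancel positives
squared = [d ** 2 for d in deviations]

# Step 3: average squared deviation (the variance)
variance = sum(squared) / n

# Step 4: square root returns the answer to the original units
sd = math.sqrt(variance)

print(round(sum(deviations), 6))             # zero, apart from rounding error
print(round(variance, 2), round(sd, 2))      # 5.04 2.24
print(round(statistics.pstdev(marks), 2))    # 2.24
```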
5. For calculation purposes the following formula is used.
$$\sigma = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}}$$
$\sum X^2$ = sum of each number squared; $(\sum X)^2$ = total squared
The advantage of this formula is that the MEAN does not have to be calculated first and hence the calculation can be started before all the data has been collected. It is the formula used by calculators and computer programmers.
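The point about starting before all the data has been collected can be sketched with running totals. The function name `running_sd` is illustrative, not from the source; it keeps only N, the sum of the scores and the sum of the squared scores, then applies the calculation formula at the end.

```python
import math
import statistics

def running_sd(scores):
    """Standard deviation (dividing by N) from running totals only."""
    n = 0
    sum_x = 0.0    # running total of the scores
    sum_x2 = 0.0   # running total of the squared scores
    for x in scores:  # scores can arrive one at a time
        n += 1
        sum_x += x
        sum_x2 += x * x
    # (sum of squares - total squared / N) / N, then square-rooted
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / n)

marks = [2, 4, 6, 6, 6, 7, 8, 8, 9, 10]  # marks from Example 1
sd = running_sd(marks)
print(round(sd, 2))                         # 2.24
print(round(statistics.pstdev(marks), 2))   # 2.24
```

One caveat: with very large numbers this shortcut can lose precision in floating-point arithmetic, because two nearly equal quantities are subtracted.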
| e)
|
Population Estimate of the Standard Deviation: The only
reason for mentioning this is so that you are not foxed if you meet it in
articles. It is calculated using virtually the same formula as the standard deviation.
The only difference in the calculation is that instead of finally dividing by N (Standard
Deviation) the population estimate divides by N-1. The population estimate of the standard
deviation is, therefore, always slightly larger than the sample's standard deviation.
|
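The two versions can be compared with Python's standard `statistics` module, again using the Example 1 marks. Note the naming: Python calls the divide-by-N version `pstdev` and the divide-by-(N-1) version `stdev`, which differs from the wording used here.

```python
import statistics

# Marks from Example 1
marks = [2, 4, 6, 6, 6, 7, 8, 8, 9, 10]

sd_divide_by_n = statistics.pstdev(marks)         # divides by N
sd_divide_by_n_minus_1 = statistics.stdev(marks)  # divides by N - 1

print(round(sd_divide_by_n, 2))          # 2.24
print(round(sd_divide_by_n_minus_1, 2))  # 2.37
```

With only ten scores the difference is noticeable; with hundreds of scores the two values become almost identical.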
III. Correlation Coefficient
The correlation coefficient measures the extent to which two factors (columns of paired numbers) appear to be related. As such it is a useful measure. Unfortunately, a correlation does not mean that two factors are directly related, with one factor causing the other; the two sets of data may appear to be related because both are caused by a third factor. A high correlation between Maths and Science GCSE results is probably caused by the fact that both are related to a particular kind of thinking. The fact that the more cigarettes individuals smoke the greater their risk of dying from lung cancer/heart disease is supported by the observation that the act of smoking causes tar etc. to enter the lungs, but the reality that some individuals who smoke heavily also live to be 100 illustrates that correlation evidence cannot provide absolute proof.
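Calculation Formula of Correlation Coefficient (Pearson's r). THIS FORMULA DOES NOT HAVE TO BE LEARNED.
$$r = \frac{\sum XY - \frac{\sum X \sum Y}{N}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{N}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{N}\right)}}$$
Mathematically orientated students should see similarities between the standard deviation and correlation formulae.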
The maximum/perfect correlation score is +1 or -1; the negative sign simply means that a high score in one set of data is linked with a low score in the other set rather than a high score. No correlation is a number at or near zero. Statistically significant correlations can, if large numbers of pairs of numbers are involved, be very low; hence, always bear in mind that correlations need to be practically as well as statistically significant. Correlations as high as ±.7 or ±.8 are usually found only with examination or test marks. For factor analysis purposes, with attitude question responses, correlations as low as ±.4 or even ±.3 can be used.
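A minimal Python sketch of Pearson's r, written from the running-totals form of the formula. The function name `pearson_r` and the two tiny data sets are hypothetical, invented only to show the +1 and -1 extremes; they are not real results.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two columns of paired numbers."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    top = sum_xy - sum_x * sum_y / n
    bottom = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return top / bottom

# Hypothetical paired marks, invented for illustration only
maths = [1, 2, 3, 4, 5]
science_same_order = [2, 4, 6, 8, 10]      # high maths goes with high science
science_reverse_order = [10, 8, 6, 4, 2]   # high maths goes with low science

r_positive = pearson_r(maths, science_same_order)
r_negative = pearson_r(maths, science_reverse_order)
print(round(r_positive, 2), round(r_negative, 2))
```

The first pairing gives a correlation of +1 and the second gives -1; real data sets fall somewhere between the two.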
Correlation Matrix - the
inter-correlation of all relevant tests and/or attitude statements.
The purpose of correlation matrices is to help researchers check their own perceptions
as well as allowing them to search for unexpected associations of ideas.
However, whilst a correlation matrix is immensely simpler than the
original raw data bank, there is still a daunting mass of numbers - with 16 statements
there are 120 different correlations!
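The figure of 120 follows from counting every distinct pair of statements once: with n statements there are n(n - 1)/2 pairs. A short Python check, assuming Python 3.8 or later for `math.comb`:

```python
import math

statements = 16

# Each statement pairs with every other statement; each pair is counted once
pairs_by_formula = statements * (statements - 1) // 2
pairs_by_comb = math.comb(statements, 2)

print(pairs_by_formula, pairs_by_comb)  # 120 120
print(math.comb(7, 2))                  # 21 correlations among 7 items
```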
| Attitude Items | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| 1 | 1 | | | | | | |
| 2 | .63 | 1 | | | | | |
| 3 | .09 | .22 | 1 | | | | |
| 4 | .21 | .01 | .21 | 1 | | | |
| 5 | .18 | .33 | .17 | .57 | 1 | | |
| 6 | -.42 | -.25 | .29 | .32 | .28 | 1 | |
| 7 | .58 | .65 | .08 | .21 | .24 | -.19 | 1 |
To highlight underlying patterns in the correlation matrix it is useful to circle, or highlight, all the correlations that are large enough to be statistically significant. (In this example, data was collected from 60 teachers; all correlations greater than 0.25 in size, whether positive or negative, are therefore statistically significant.) If there are many columns of data it may be helpful initially to study only the higher correlations: in this example, say, all those of ±.4 or more. The ± symbol means that both positive and negative correlations count, provided their size (ignoring the sign) is equal to or greater than .4.
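The highlighting step can be sketched in Python using the values from the correlation matrix above. The dictionary layout and variable names are illustrative; each key is a pair of attitude item numbers.

```python
# Lower triangle of the correlation matrix: (row item, column item) -> r
correlations = {
    (2, 1): .63,
    (3, 1): .09, (3, 2): .22,
    (4, 1): .21, (4, 2): .01, (4, 3): .21,
    (5, 1): .18, (5, 2): .33, (5, 3): .17, (5, 4): .57,
    (6, 1): -.42, (6, 2): -.25, (6, 3): .29, (6, 4): .32, (6, 5): .28,
    (7, 1): .58, (7, 2): .65, (7, 3): .08, (7, 4): .21, (7, 5): .24, (7, 6): -.19,
}

# Statistically significant for 60 teachers: size greater than .25
significant = {pair: r for pair, r in correlations.items() if abs(r) > 0.25}

# Higher correlations worth studying first: size of .4 or more
strong = {pair: r for pair, r in correlations.items() if abs(r) >= 0.4}

print(len(correlations), len(significant), len(strong))  # 21 9 5
for pair, r in sorted(strong.items()):
    print(pair, r)
```

Using `abs(r)` treats positive and negative correlations alike, which is what the ± symbol expresses. Of the 21 correlations, 9 are significant and 5 reach the ±.4 level, including the negative correlation between items 6 and 1.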