Basic Statistics that summarise/describe a mass of information.
Graham Tall   
research@grahamtall.com     September 2003

Averages:            Mean, Median, Mode

Spread of Scores:    Standard Deviation, Mean Deviation, Range, Inter-quartile Range, Semi-Inter-Quartile Range

Correlation:         Pearson's r, Correlation Matrix

I.        AVERAGE

a) The MEAN is the average that everybody recognises. It is simply the sum of all the figures divided by the number of figures. Its widespread use demonstrates its value. It is the only average that can be calculated on most calculators. It requires interval numbers.
b) The MEDIAN is the middle mark and can be used with ordinal numbers. To calculate: place the numbers in order and find the middle number. (N.B. With an even number of marks, the median is halfway between the two middle marks.)
c) The MODE is the most common result: the grade/response that most people obtained/gave.

EXAMPLE 1: MARKS:

Data: Marks placed in rank order:    2    4    6    6    6    7    8    8    9   10
Rank or Position:                   10    9   6=   6=   6=    5   3=   3=    2    1
Number from one end:                 1    2    3    4    5    6    7    8    9   10

The MEAN = (2 + 4 + 6 + 6 + 6 + 7 + 8 + 8 + 9 + 10) ÷ 10 = 66 ÷ 10 = 6.6

The MEDIAN = The middle response: halfway between the 5th & 6th numbers (6 and 7) = 6.5

The MODE = The commonest number = 6
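
The three averages in Example 1 can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module; the variable name `marks` is chosen for illustration.

```python
from statistics import mean, median, mode

# Marks from Example 1, already placed in rank order.
marks = [2, 4, 6, 6, 6, 7, 8, 8, 9, 10]

print("Mean:  ", mean(marks))    # sum of the marks divided by the number of marks
print("Median:", median(marks))  # halfway between the 5th and 6th marks
print("Mode:  ", mode(marks))    # the most common mark
```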

II.         Spread of Scores

Whilst the most frequently used statistic is standard deviation, other statistics which are sometimes used include:

a) Range: This is the easiest statistic to calculate; it is simply the difference between the smallest and largest scores. The major weakness of the range is that a single very low or very high result suggests a much greater general spread of scores than really exists. N.B. Such extreme results are known statistically as outliers. In the example below, there are no outliers.
   Calculation:

        Marks placed in rank order: 2, 2, 3, 4, 5, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9, 10
        2 - - - - - - - - - - range - - - - - - - - - - 10        (range = 10 - 2 = 8)
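
As a rough illustration, the range of these marks, and its sensitivity to a single extreme result, can be sketched as follows. The function name `score_range` and the extra score of 30 are hypothetical and are not part of the original data.

```python
# The 17 marks from the calculation above.
marks = [2, 2, 3, 4, 5, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9, 10]


def score_range(scores):
    """Return the difference between the largest and smallest scores."""
    return max(scores) - min(scores)


print(score_range(marks))  # 10 - 2 = 8

# Hypothetical illustration only: one extreme score of 30 is added.
# A single outlier is enough to make the range jump from 8 to 28.
with_outlier = marks + [30]
print(score_range(with_outlier))  # 30 - 2 = 28
```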

b) Inter-quartile and semi-inter-quartile range are used with the median (the middle score). The inter-quartile range simply states the range within which the middle half of the scores lie and is very easy to calculate:
   Calculation:
        Place marks in rank order:   2, 2, 3, 4, 5, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9, 10
        Quartiles:                   lower quartile = 5 (5th mark), median = 7 (9th mark), upper quartile = 8 (13th mark)
        Inter-quartile range:        5 - - - - - - - - - - - - - 8

The semi-inter-quartile range is simply the inter-quartile range divided by two. The reason for preferring the semi-inter-quartile range is that it allows simple generalisations such as ‘about half the students obtained 7 ± 1.5 marks’ (the ± 1.5 is calculated as (8 - 5) / 2). N.B. Such statements are approximations: in this example just over half of the individuals (9 of 17) obtained scores between 5.5 and 8.5 (8 obtained marks outside it).
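
A minimal sketch of the quartile calculation, using the 17 marks listed in the range calculation. It assumes Python's `statistics.quantiles` with `method="inclusive"`; other quartile conventions exist and may give slightly different cut points for other data sets.

```python
from statistics import quantiles

# The 17 marks from the range calculation, in rank order.
marks = [2, 2, 3, 4, 5, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9, 10]

# n=4 asks for the three cut points that split the marks into quarters.
lower_q, median_mark, upper_q = quantiles(marks, n=4, method="inclusive")

iqr = upper_q - lower_q   # inter-quartile range
semi_iqr = iqr / 2        # semi-inter-quartile range

print(lower_q, median_mark, upper_q)
print(iqr, semi_iqr)

# Count how many marks actually fall inside median ± semi-inter-quartile range.
inside = [m for m in marks
          if median_mark - semi_iqr <= m <= median_mark + semi_iqr]
print(len(inside), "of", len(marks), "marks lie inside the band")
```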

 

c) Mean deviation: This is now rarely used. It is, however, worth understanding because it shows the logic behind the standard deviation, which is simply an attempt to measure the average spread of marks from the group mean.
   Calculation:
                  Discover the group mean.
                  Discover how far each score is from the mean: (score - mean).
                  Add the differences together, ignoring negative signs.
                  Divide the sum of the differences by the number of scores.
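
The four steps above can be sketched in Python. The marks are those of Example 1; the function name `mean_deviation` is chosen for illustration.

```python
# Marks from Example 1.
marks = [2, 4, 6, 6, 6, 7, 8, 8, 9, 10]


def mean_deviation(scores):
    """Average distance of the scores from their mean, ignoring signs."""
    group_mean = sum(scores) / len(scores)                  # step 1
    differences = [score - group_mean for score in scores]  # step 2
    total = sum(abs(d) for d in differences)                # step 3
    return total / len(scores)                              # step 4


print(round(mean_deviation(marks), 2))  # 18.0 / 10 = 1.8
```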

 

d) Standard deviation (S.D. or σ): In statistical terms the standard deviation is the most commonly used measure of spread; it measures how widely the scores are distributed around the mean. It is measured in the same units as the original numbers: if the original data are examination marks the standard deviation is in examination marks; if they are percentages it is in percentages.

Generally speaking, when scores are roughly bell-shaped, about 68% of scores are within ± 1 S.D. (i.e. plus or minus one S.D.) of the mean and the great majority of the scores (about 95%) are within ± 2 S.D. of the mean. Thus, knowing the standard deviation, we know the spread of scores which most children attain.
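
For scores that follow the normal (bell-shaped) distribution exactly, the proportions within one and two standard deviations can be looked up rather than memorised. The sketch below assumes an idealised normal distribution; real groups of scores will only approximate these figures.

```python
from statistics import NormalDist

# Idealised normal distribution with mean 0 and S.D. 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of scores between -1 S.D. and +1 S.D., and between -2 and +2.
within_1_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_2_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)

print(f"Within ±1 S.D.: {within_1_sd:.1%}")  # about 68.3%
print(f"Within ±2 S.D.: {within_2_sd:.1%}")  # about 95.4%
```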

        Figure 1: Groups where the mean scores are similar but the spread of scores differs substantially

Standard Deviation formula.      THIS DOES NOT HAVE TO BE LEARNED

1. The formula first calculates the sum of the deviations of each score from the mean (X means each individual score; $\bar{X}$ is the MEAN of the X scores):

        $\sum (X - \bar{X})$

2. The problem with the above is that the negative deviations of the scores below the mean exactly cancel the positive deviations of the scores above it, so adding up the differences would always give an answer of zero! The mathematical solution is to square the result of each subtraction:

        $\sum (X - \bar{X})^2$

3. To find out how big each squared deviation from the mean is on average, the statisticians divide the above sum by the number of scores (N). The result is known as the variance:

        $\frac{\sum (X - \bar{X})^2}{N}$

4. Finally, to remove the effect of the squaring, the variance is square-rooted:

        $\sqrt{\frac{\sum (X - \bar{X})^2}{N}}$

(This is the Standard Deviation)

Note 1:   Because the number has been square-rooted, the standard deviation is measured in the original units.
Note 2:   If the general distribution of scores is ‘bell-shaped’, i.e. most individuals score around the mean score, with the remaining scores distributed evenly on either side of this middle band, then the distribution is similar to the ‘normal distribution’. When this occurs about 68% (roughly two-thirds) of the scores lie within one standard deviation of the mean.
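
Steps 1 to 4 can be followed through in Python. This is a minimal sketch applied to the ten marks of Example 1; the variable names are chosen for illustration.

```python
from math import sqrt

# Marks from Example 1.
marks = [2, 4, 6, 6, 6, 7, 8, 8, 9, 10]
n = len(marks)
mean = sum(marks) / n

# Step 1: deviation of each score from the mean. Their sum is
# effectively zero (apart from tiny floating-point rounding).
deviations = [x - mean for x in marks]
sum_of_deviations = sum(deviations)

# Step 2: square each deviation so that negatives no longer cancel.
sum_of_squares = sum(d ** 2 for d in deviations)

# Step 3: divide by the number of scores to get the variance.
variance = sum_of_squares / n

# Step 4: square-root the variance to return to the original units.
standard_deviation = sqrt(variance)

print(round(sum_of_deviations, 6))
print(round(variance, 2))            # 50.4 / 10 = 5.04
print(round(standard_deviation, 2))  # about 2.24
```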

 

5.       For calculation purposes the following formula is used:

        $\text{S.D.} = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}}$

        $\sum X^2$ = sum of each number squared

        $(\sum X)^2$ = total squared

The advantage of this formula is that the MEAN does not have to be calculated first and hence the calculation can be started before all the data have been collected. It is the formula used by calculators and computer programmers.
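
The idea of starting the calculation before all the data have arrived can be sketched as below: only three running totals (the count, the sum of the scores and the sum of the squared scores) are kept as each score comes in. The function name `running_standard_deviation` is hypothetical, and the code assumes the formula takes the form S.D. = √[(ΣX² - (ΣX)²/N) / N].

```python
from math import sqrt


def running_standard_deviation(scores):
    """Standard deviation from running totals, without computing the mean first."""
    n = 0
    sum_x = 0.0          # running total of the scores
    sum_x_squared = 0.0  # running total of each score squared
    for x in scores:     # scores can be processed one at a time as they arrive
        n += 1
        sum_x += x
        sum_x_squared += x * x
    if n == 0:
        raise ValueError("at least one score is needed")
    return sqrt((sum_x_squared - sum_x ** 2 / n) / n)


# Marks from Example 1: sum of scores = 66, sum of squared scores = 486.
marks = [2, 4, 6, 6, 6, 7, 8, 8, 9, 10]
print(round(running_standard_deviation(marks), 2))  # about 2.24
```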

e) Population* estimate of the standard deviation: The only reason for mentioning this is so that you are not ‘foxed’ if you meet it in articles. It is calculated using virtually the same formula as the standard deviation. The only difference is that instead of finally dividing by N (as for the standard deviation), the population estimate divides by N - 1. The population estimate of the standard deviation is, therefore, always slightly larger than the sample's standard deviation.

* The ‘population’ is the much larger group of individuals who could have been questioned/tested etc.; the individuals actually questioned/tested are known as the ‘sample’. The ‘population estimate of the standard deviation’ is simply an estimate of the standard deviation of all those who could have been used.
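
The two versions can be compared using Python's `statistics` module. Beware that the module's naming runs the other way round from the wording used here: `pstdev` divides by N, while `stdev` divides by N - 1. The variable names below are chosen for illustration.

```python
from statistics import pstdev, stdev

# Marks from Example 1.
marks = [2, 4, 6, 6, 6, 7, 8, 8, 9, 10]

# Divides by N: the standard deviation of the scores actually collected.
sample_sd = pstdev(marks)

# Divides by N - 1: the population estimate of the standard deviation.
population_estimate = stdev(marks)

print(round(sample_sd, 2))            # sqrt(50.4 / 10), about 2.24
print(round(population_estimate, 2))  # sqrt(50.4 / 9), about 2.37
```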

III.  Correlation Coefficient

The correlation coefficient measures the extent to which two factors (columns of paired numbers) appear to be related. As such it is a useful measure. Unfortunately, a correlation does not mean that two factors are directly related, with one factor causing the other; the two sets of data may appear to be related because they are both caused by a third factor. A high correlation between Maths and Science GCSE results is probably caused by the fact that both are related to a particular kind of thinking. The fact that the more cigarettes individuals smoke the greater their risk of dying from lung cancer/heart disease is supported by the observation that the act of smoking causes tar etc. to enter the lungs, but the reality that some individuals who smoke heavily also live to be 100 illustrates that correlation evidence cannot provide absolute proof.

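Calculation Formula of Correlation Coefficient      THIS FORMULA DOES NOT HAVE TO BE LEARNED.

        $r = \frac{N \sum XY - (\sum X)(\sum Y)}{\sqrt{\left[ N \sum X^2 - (\sum X)^2 \right] \left[ N \sum Y^2 - (\sum Y)^2 \right]}}$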

Mathematically orientated students should see similarities between the standard deviation and correlation formulae.
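
As an illustration, Pearson's r can be computed from the same kind of running totals used for the standard deviation (ΣX, ΣX², and so on). The sketch below assumes the usual computational form of Pearson's r; the function name `pearson_r` and the two columns of scores are hypothetical, invented purely for the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r for two equal-length columns of paired numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length columns with at least two pairs")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    # The denominator is zero if either column never varies,
    # in which case r is undefined and this line raises an error.
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical paired scores, invented for illustration only.
test_a = [1, 2, 3, 4, 5]
test_b = [2, 1, 4, 3, 5]

print(pearson_r(test_a, test_b))        # 40 / 50 = 0.8
print(pearson_r(test_a, test_a[::-1]))  # perfect negative correlation: -1.0
```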

The maximum/perfect correlation score is +1 or -1; the negative sign simply means that a high score in one set of data is linked with a low score in the other set rather than a high score. No correlation is a number at or near zero. Statistically significant correlations can, if large numbers of pairs of numbers are involved, be very low; hence, always bear in mind that correlations need to be practically as well as statistically significant. Correlations as high as ±.7 or ±.8 are usually found only with examination or test marks. For factor analysis purposes, with attitude question responses, correlations as low as ±.4 or even ±.3 can be used.

Correlation Matrix  -  the inter-correlation of all relevant tests and/or attitude statements.
The purpose of correlation matrices is to help researchers check their own perceptions as well as allowing them to search for unexpected associations of ideas.     However, whilst a correlation matrix is immensely simpler than the original raw data bank, there is still a daunting mass of numbers - with 16 statements there are 120 different correlations!
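
The figure of 120 comes from counting every pair of statements once: n statements give n × (n - 1) ÷ 2 pairs. A quick check, using Python's `math.comb` (the helper name `number_of_correlations` is chosen for illustration):

```python
from math import comb


def number_of_correlations(n_items):
    """Number of distinct pairs, i.e. n * (n - 1) / 2."""
    return comb(n_items, 2)


print(number_of_correlations(16))  # 16 * 15 / 2 = 120
print(number_of_correlations(7))   # 7 * 6 / 2 = 21
```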

   

                                Attitude Items
                         1      2      3      4      5      6      7
Attitude Items    1      1
                  2    .63      1
                  3    .09    .22      1
                  4    .21    .01    .21      1
                  5    .18    .33    .17    .57      1
                  6   -.42   -.25    .29    .32    .28      1
                  7    .58    .65    .08    .21    .24   -.19      1

To highlight underlying patterns in the correlation matrix it is useful to circle, or highlight, all the correlations that are large enough to be statistically significant. (In this example, data were collected from 60 teachers; all correlations larger than ±.25 are therefore statistically significant.) If there are many columns of data it may be helpful initially to study only the higher correlations: in this example, say, all those of ±.4 or larger. The ± symbol means that both positive and negative correlations count, provided their size, ignoring the sign, is .4 or more.
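
The circling exercise can also be sketched in Python. The correlations are those of the seven attitude items in the matrix; the dictionary layout and the names `significant` and `strong` are chosen for illustration, and the thresholds are the ones quoted for this example.

```python
# Lower triangle of the correlation matrix, keyed by (row item, column item).
# The diagonal of 1s is omitted.
correlations = {
    (2, 1): 0.63,
    (3, 1): 0.09, (3, 2): 0.22,
    (4, 1): 0.21, (4, 2): 0.01, (4, 3): 0.21,
    (5, 1): 0.18, (5, 2): 0.33, (5, 3): 0.17, (5, 4): 0.57,
    (6, 1): -0.42, (6, 2): -0.25, (6, 3): 0.29, (6, 4): 0.32, (6, 5): 0.28,
    (7, 1): 0.58, (7, 2): 0.65, (7, 3): 0.08, (7, 4): 0.21, (7, 5): 0.24,
    (7, 6): -0.19,
}

# abs() ignores the sign, so negative correlations are treated like positive ones.
significant = {pair: r for pair, r in correlations.items() if abs(r) > 0.25}
strong = {pair: r for pair, r in correlations.items() if abs(r) >= 0.4}

print(len(significant), "correlations are larger than ±.25")
for (row, column), r in sorted(strong.items()):
    print(f"Items {row} and {column}: {r:+.2f}")
```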
