Cluster Analysis
Graham Tall    
research@grahamtall.com     September 2003

The purpose of cluster analysis is to identify groups (clusters) of individuals who have similar abilities or common views. It analyses the individual rows of data in a spreadsheet, grouping (clustering) together the individuals who have answered similarly. Selection of clusters is usually done visually, based on the fact that the sooner clusters form, the more their members have in common. Note: Labels are NOT used in the clustering process.

Non-Research uses of cluster analysis:

setting:      pupils are grouped on the basis of their attainment in that subject.

streaming: pupils are grouped using the combined (sometimes weighted) scores on a number of subjects.

Research use  (using questionnaire responses for the example):

To identify individuals who share a common outlook.

Cluster analysis is intuitively simple:
Simply leaf through the questionnaires and find the two which are most similar. Average their responses, then search for the next most similar pair, and so on.
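
The procedure just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any particular statistics package: the function name `agglomerate`, the use of straight-line (Euclidean) distance as the measure of similarity, and the size-weighted averaging of merged clusters are all assumptions made for the example, and the answers are invented.

```python
from itertools import combinations
from math import dist


def agglomerate(responses, stop_at=2):
    """Repeatedly merge the two most similar clusters until `stop_at` remain."""
    # Each cluster is (member indices, mean response per question).
    clusters = [([i], [float(v) for v in row]) for i, row in enumerate(responses)]
    merges = []  # (members_a, members_b, distance), in the order merged
    while len(clusters) > stop_at:
        # Compare every pair of clusters and keep the closest pair.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: dist(clusters[pair[0]][1], clusters[pair[1]][1]),
        )
        (members_a, mean_a), (members_b, mean_b) = clusters[a], clusters[b]
        height = dist(mean_a, mean_b)
        na, nb = len(members_a), len(members_b)
        # Assumption: size-weighted average, so every individual counts equally.
        merged_mean = [(na * x + nb * y) / (na + nb) for x, y in zip(mean_a, mean_b)]
        merges.append((members_a, members_b, height))
        # Delete the higher index first so the lower index stays valid.
        del clusters[b]
        del clusters[a]
        clusters.append((members_a + members_b, merged_mean))
    return clusters, merges


# Hypothetical answers of five respondents to three questions on a 1-5 scale.
answers = [
    [5, 4, 5],
    [5, 5, 5],
    [1, 2, 1],
    [2, 1, 1],
    [3, 3, 3],
]
clusters, merges = agglomerate(answers, stop_at=2)
for members, mean in clusters:
    print(sorted(members), [round(v, 2) for v in mean])
```

The list `merges` records the distance at which each pair was joined, which is the information a cluster diagram displays as the height of each join.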

But, as the example below indicates, the calculation process is immense without the aid of a computer.

Furthermore, determining how many clusters should be studied in more detail will, without a computer's aid, require considerable extra work.

Consider:
Since it is highly unlikely that any two individuals would answer every question identically, the analysis usually begins with as many clusters as there are questionnaires. In order to discover the two individuals who have the most similar views, the researcher would, if there are 30 questionnaires:
have to compare the first questionnaire with the 29 others to identify the one that is most similar to it.
But, because another pair might be more alike, each questionnaire would have to be compared with the others.
Hence, with 29+28+27+26+25+ ... +4+3+2+1, there must be a total of 435 comparisons to identify the first two individuals who answered the questionnaire most similarly.
Once the first two individuals have been identified and joined, there will be 29 clusters, the first of which is the average of the two individuals grouped together. Fortunately, assuming the researcher keeps accurate records, the combined pair only has to be compared with the 28 others to find which pair is next most alike. The process is repeated until there are just 2 clusters.
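
The arithmetic above can be checked with a short Python sketch. The function names are invented for illustration, and the total assumes, as the text does, that the results of earlier comparisons are kept on record so that only the newly formed cluster has to be compared with the others.

```python
def pairwise_comparisons(n):
    """Number of distinct pairs among n items: (n-1) + (n-2) + ... + 1."""
    return n * (n - 1) // 2


def comparisons_per_step(n, stop_at=2):
    """Fresh comparisons needed before each merge, from n clusters down to `stop_at`."""
    steps = [pairwise_comparisons(n)]      # first merge: every pair is compared
    for remaining in range(n - 1, stop_at, -1):
        steps.append(remaining - 1)        # new cluster against each of the others
    return steps


steps = comparisons_per_step(30)
print(steps[0])    # comparisons needed to find the first pair
print(steps[1])    # extra comparisons once 29 clusters remain
print(sum(steps))  # total comparisons down to 2 clusters
```

Without accurate records, every step would need a full set of pairwise comparisons, which is why the manual workload grows so quickly.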

In order to
        a) simplify cluster analysis
and
        b) take into account knowledge of the inter-relationships of the statements,

computer programs usually either:
i)    first carry out a factor analysis procedure to identify the relationships between the various statements and emphasise
        them in the clustering process,
or
ii)    allow the researcher to state which statements should be used, thus allowing the researcher to emphasise
        statements which appear most relevant to the aspect of the research being studied.

Once a cluster analysis procedure has been undertaken, how many clusters should be used - 10? 8? 7? 3? There is NO software answer. The researcher has to study the mean responses of each cluster on each statement to justify whatever decision is made - there is no magical formula.
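
As a rough illustration of what studying the mean responses involves, the sketch below tabulates the mean response of each cluster on each statement. The data, the cluster labels and the function name `cluster_means` are all hypothetical; the labels are assumed to come from an earlier clustering run.

```python
from collections import defaultdict


def cluster_means(responses, labels):
    """Mean response to each statement within each cluster."""
    grouped = defaultdict(list)
    for row, label in zip(responses, labels):
        grouped[label].append(row)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in grouped.items()
    }


# Hypothetical data: six respondents, three statements scored 1-5,
# with one cluster label per respondent from some earlier clustering.
responses = [
    [5, 4, 1],
    [4, 5, 2],
    [1, 2, 5],
    [2, 1, 4],
    [5, 5, 1],
    [1, 1, 5],
]
labels = ["I", "I", "II", "II", "I", "II"]

means = cluster_means(responses, labels)
for label in sorted(means):
    print(label, [round(m, 2) for m in means[label]])
```

Repeating this for each candidate number of clusters shows whether a further split produces groups whose mean responses differ in a way the researcher can interpret.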

 

An example of cluster analysis:

Good cluster analysis diagrams allow the researcher to label the individuals in each cluster (see the horizontal axis), and indicate the extent of commonality of the individuals in each cluster by the different heights at which the horizontal lines are drawn.

Each individual is represented by a single vertical line which begins at the horizontal axis of the diagram (next to the individual labels). The cluster analysis procedure joins individuals/clusters together by horizontal lines - see figure below.

Figure: A schematic cluster analysis diagram

The nearer to the horizontal axis the lines join together, the greater the similarity (degree of commonality) of the views of the individuals who answered the questionnaire. Each new cluster is represented by a new single line.
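
One way to picture reading such a diagram is to imagine drawing a horizontal cut across it at a chosen height and counting the lines that cross the cut. The sketch below does this from a list of join heights. The heights and the function name are invented for illustration, and the sketch assumes that the join heights never decrease as clustering proceeds, which is not guaranteed for every clustering method.

```python
def clusters_at_height(n_individuals, merge_heights, cut):
    """Number of clusters left if the diagram is cut at height `cut`.

    Assumes join heights never decrease as clustering proceeds.
    """
    joined = sum(1 for h in merge_heights if h <= cut)
    return n_individuals - joined


# Hypothetical heights at which 6 individuals were joined (5 joins in all).
heights = [0.8, 1.1, 1.3, 3.9, 4.6]

for cut in (1.0, 2.0, 4.0):
    print(cut, clusters_at_height(6, heights, cut))
```

In this invented example the large jump between the third and fourth joins would suggest, visually, that three clusters are worth studying in more detail.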

The cluster diagram below uses the responses of 60 teachers from 5 different schools to 16 attitude questions concerned with pastoral care and the factors described earlier. The labels are as described above.

 

[Figure: cluster diagram of the 60 teachers' responses (image wpe1.jpg not reproduced)]

How Many Clusters in the above diagram?

The actual number of ‘clusters’ that could have been selected ranges from 60 (one for each individual questioned) to a single cluster in which the mean view of the whole group is given. The latter, though usually unrecognised, is the approach commonly used with questionnaires and interviews where no subdivisions of the database are made and the views summarised are those of all the individuals questioned!

FIRST REASON: As indicated in the activity description, the closeness of the quantitative relationship between the individuals is indicated by the height on the vertical axis at which the various individual and cluster lines combine. Visually this suggests a study of 4, 5, 6 or more clusters.

SECOND REASON: The decision on four clusters was made because it took into account two of the four variables selected by the researcher:

1. School where teachers worked
    
        It is evident in cluster I, for example, that all the teachers in the two sub-groups are from schools A and B.
        Most teachers in cluster II are from school E
and
        Cluster IV contains almost half the teachers from school D.

Few pastoral heads of year (the title of the managerial post coordinating the work of the form tutors for each class year) would be surprised to learn that the one cluster which contained at least one teacher from every school was the cluster that demonstrated a negative commitment to the range of student care statements: teachers who felt that their commitment to their subject was greater than to their role as a tutor.

As the analysis continued, and further information was collected, the importance of the school factor in teacher attitude became undeniable.

2. Gender of staff
  
        It is suggestive that there is a much greater proportion of women in one cluster, though this could be linked to relationships in the school concerned.

The THIRD REASON was the mean scores of the clusters produced on the various attitude statements.
