Correlation [1]

Correlation is a statistical technique that can show whether and how strongly pairs of population characteristics are related. For example, height and weight are related: taller people tend to be heavier than shorter people. The relationship isn't perfect. People of the same height vary in weight, and you can easily think of two people, the shorter of whom is heavier than the taller. Nonetheless, the average weight of people 5'5'' is less than the average weight of people 5'6'', and their average weight is less than that of people 5'7'', etc. Correlation can tell you just how much of the variation in people's weights is related to their heights.

Although the correlation described above is fairly obvious, your data may contain unsuspected correlations. You may also suspect that correlations exist but not know which are the strongest. An intelligent correlation analysis can lead to a greater understanding of your data.

Like all statistical techniques, correlation is only appropriate for certain kinds of data. Correlation works for data in which numbers are meaningful, usually quantities of some sort. It cannot be used for purely categorical data, such as gender, brands purchased or favourite colour.

The main measure of correlation in a dataset is a statistic called the correlation coefficient (or r). It ranges from -1 to +1. The closer r is to +1 or -1, the more closely the two population characteristics are related. If r is close to 0, it means there is little or no linear relationship between the population characteristics. If r is positive, it means that as one population characteristic gets larger the other tends to get larger. If r is negative, it means that as one gets larger the other tends to get smaller (often called an "inverse" correlation).
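As a rough illustration of how r is computed, the sketch below implements the usual Pearson product-moment formula in plain Python. The function name pearson_r and the height and weight figures are made up for this example and do not come from the source.

```python
import math

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when one characteristic never varies")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical sample: five people's heights and weights (invented numbers).
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 62, 63, 72, 74]

r = pearson_r(heights_cm, weights_kg)
print(f"r = {r:.3f}")                                # close to +1: strong positive correlation
print(f"r for reversed weights = {pearson_r(heights_cm, weights_kg[::-1]):.3f}")  # negative
```

For a straight-line fit between two characteristics, the square of r gives the proportion of the variation in one that is accounted for by the other, which is one way to read "how much of the variation" is related.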

Two caveats apply when interpreting correlations. The first is that you should never assume a correlation means that a change in one population characteristic causes a change in another. Sales of personal computers and athletic shoes have both risen strongly in the last several years and there is a high correlation between them, but you cannot assume that buying computers causes people to buy athletic shoes, or vice versa.

The second caveat is that the correlation statistic is only meaningful with respect to linear relationships: as one population characteristic gets larger, the other gets larger (or smaller) in direct proportion. It does not work well with curvilinear relationships, in which the relationship does not follow a straight line. An example of a curvilinear relationship is age and health care use. They are related, but the relationship doesn't follow a straight line. Young children and older people both tend to use much more health care than teenagers or young adults. Multiple linear regression can be used to examine curvilinear relationships, but it is beyond the scope of this encyclopedia.
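The second caveat can be seen in a small sketch. The ages and visit counts below are invented, deliberately symmetric numbers chosen to mimic a U-shaped pattern; they are not real health care data. The helper pearson_r is a hypothetical name used only in this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up U-shaped data: use is high for the youngest and oldest, low in between.
ages = [0, 10, 20, 30, 40, 50, 60]
visits_per_year = [10, 5, 2, 1, 2, 5, 10]

r_all = pearson_r(ages, visits_per_year)
# Looking only at the older half, where use rises steadily with age.
r_older = pearson_r(ages[3:], visits_per_year[3:])

print(f"r over all ages:  {r_all:.3f}")    # near 0 despite a clear relationship
print(f"r for ages 30-60: {r_older:.3f}")  # strongly positive
```

Here age and use are obviously related, yet r over the whole range is essentially zero because the falling and rising halves cancel out. Plotting the data before relying on r is a simple way to catch this kind of pattern.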


1. This definition is based on www.surveysystem.com/correlation.htm (accessed December 28 2006).