In this article we will discuss correlation in statistics: 1. Subject-Matter of Correlation 2. Types of Correlation 3. Methods of Determining Correlation.
Subject-Matter of Correlation:
Sometimes two or more events are interrelated, i.e., any change in an event may affect the other events. If such changes are expressed in the form of numerical data and they appear to be interdependent they are said to be correlated. For example, the weight of human body increases with the increase in height and age.
Here age and body weight are two separate characters, but they are interdependent and are therefore correlated.
Thus, the purpose of correlation is to study the relationship between the data of one series and those of the other series. If two or more variables are so related that an increase in one variable may cause change in the other variables they are said to be correlated.
According to Bowley, when two quantities are so related that the fluctuations in one are in sympathy with the fluctuations in the other, so that an increase or decrease in one is found in connection with an increase or decrease (or inversely) of the other, and the greater the magnitude of the change in one, the greater the magnitude of the change in the other, the two quantities are said to be correlated.
Types of Correlation:
Depending upon the direction and proportion of changes in the variables and the number of data series, the correlation may be of the following types:
1. Positive and negative correlations.
2. Linear and non-linear correlations.
3. Simple, multiple and partial correlations.
1. Positive and Negative Correlations:
When the changes in two variables or data series take place in the same direction, the correlation between them is said to be positive or direct. For example, if an increase in one variable is accompanied by an increase in the other variable, or a decrease in one by a decrease in the other, the two variables show positive correlation.
But when the changes in two variables occur in opposite directions, the correlation is said to be inverse, indirect or negative, i.e., if an increase in one variable is accompanied by a decrease in the other or vice versa, the two variables show negative or inverse correlation.
When an increase or decrease in the value of one variable does not affect the other variable, the correlation is said to be zero. Sometimes two variables show positive correlation with each other at first but later show negative correlation. The correlation between two such variables is said to be curvilinear.
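The distinction between positive, negative and zero correlation can be sketched in Python. This is only a minimal illustration: the function name `correlation_direction` and the sample figures are assumptions made for the example, and the direction is read from the sign of the summed products of deviations from the two means.

```python
def correlation_direction(x, y):
    """Classify the direction of correlation between two equal-length series.

    The sign of the summed products of deviations from the means decides
    the direction (this sum is the numerator of the covariance).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    total = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "zero"


# Illustrative values only, not taken from the article.
height = [150, 155, 160, 165, 170]
weight = [50, 54, 59, 63, 68]

print(correlation_direction(height, weight))        # positive
print(correlation_direction(height, weight[::-1]))  # negative
# A series that first rises and then falls by the same amount
# (a curvilinear pattern) shows no overall linear direction.
print(correlation_direction([1, 2, 3], [1, 2, 1]))  # zero
```

With real measurements the sum is rarely exactly zero, so in practice a value close to zero is read as little or no linear correlation.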
2. Linear and Non-Linear Correlations:
When the ratio of change between the two variables or data series is constant, the correlation is said to be linear. If two such variables are plotted on the O-X and O-Y axes of graph paper, the points will fall on a straight line.
In contrast, if the rate of change is not constant, the graph will be a curved line, and this type of correlation is called non-linear or curvilinear.
3. Simple, Multiple and Partial Correlation:
The correlation existing between two variables or data series is said to be simple correlation. Of the two series, one which causes change in the other is called independent or subject series and the other which is affected is called dependent series.
When the correlation involves three or more variables or data series, it is called multiple correlation. Out of several correlated series, one is the dependent series and the remaining two or more series are independent or subject series which jointly affect the dependent series.
When three or more variables are correlated but the correlation is studied between only two of them, keeping the influence of the other series constant, such a correlation is called partial correlation.
For example, there is a multiple correlation among seed production, number of fruits per plant and weight of seeds, where the number of fruits per plant and the weight of seeds jointly affect seed production. If the correlation is studied only between seed production and the number of fruits per plant, keeping the seed weight constant, this correlation is said to be partial correlation.
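As a rough illustration of partial correlation, the sketch below uses the standard first-order formula, which derives the partial coefficient between two series from the three simple (pairwise) coefficients. The function name and the three coefficient values are hypothetical and are not taken from any data in this article.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y with Z held constant.

    r_xy, r_xz and r_yz are the simple correlation coefficients between
    the respective pairs of series.
    """
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical pairwise coefficients, assumed for the example only.
r_seed_fruit = 0.7    # seed production vs. number of fruits per plant
r_seed_weight = 0.5   # seed production vs. weight of seeds
r_fruit_weight = 0.4  # number of fruits per plant vs. weight of seeds

r_partial = partial_correlation(r_seed_fruit, r_seed_weight, r_fruit_weight)
print(round(r_partial, 2))  # 0.63
```

Here the partial coefficient (about 0.63) is lower than the simple coefficient (0.7), because part of the apparent relationship between seed production and fruit number is shared with seed weight.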
Degree of Correlation:
The degree of correlation is expressed by the value of the coefficient of correlation, which ranges between +1 and -1.
The value of coefficient of correlation directly indicates the degree of correlation as detailed in the following table:
Value of Correlation Coefficient:
Methods of Determining Correlation:
The following methods are generally used to determine simple correlation:
a. Graphic method.
b. Scatter diagram or Dotogram method.
c. Karl Pearson’s method
d. Spearman’s ranking method.
e. Coefficient of correlation by concurrent deviation.
a. Graphic Method:
When the values of the dependent series are plotted on the O-X axis and those of the independent series on the O-Y axis of graph paper, a linear or non-linear graph is obtained. It indicates only the direction of correlation, not its numerical magnitude.
If the graph lines of the two series both move upward from left to right, the correlation is positive; but if the graph line of one series moves upward from left to right while that of the other moves downward, the correlation is negative.
If the values of two data series do not show either positive or negative trend then it should be inferred that there is no correlation.
b. Scatter Diagram or Dotogram Method:
This method is more or less similar to the graphic method. In this method, the values of the independent data series are plotted on the O-X axis and those of the dependent series on the O-Y axis, and then the pairs of values are plotted on the graph paper.
In this way, graphs of dots are obtained. These dots are scattered in different forms, so the graphs are called scatter diagrams or dotograms. The patterns of scatter diagrams indicate the direction and magnitude of correlation.
The scatter diagram may indicate the following conditions:
(i) If the dots of the two series are advancing in a definite direction like a current, this condition indicates that the data series are definitely correlated.
(ii) When the arrays of dots advance from left to right in upward direction, the correlation is definitely positive [Fig. 34.2 (c)].
(iii) When the scatter diagram advances from left to right in downward direction, the correlation is negative [Fig. 34.2 (a)].
(iv) When the dots are not in definite arrays and are scattered haphazardly, this condition indicates that there is no correlation between the data series [Fig. 34.2 (b)].
(v) When the dots appear to be situated on a line which advances upward at 45° angle from the O-X axis, this condition indicates perfect positive correlation among the data series.
(vi) If the dots appear to be situated on a line which moves from left to right in downward direction at 45° angle from the O-X axis, this condition is indicative of perfect negative correlation.
c. Karl Pearson’s Coefficient of Correlation Method:
This is the most widely used mathematical method of determining correlation. The coefficient of correlation (r) is obtained by dividing the covariance of the two series by the product of their standard deviations:
$$r = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
where σX and σY are the standard deviations of the data series X and Y. The covariance of the two series is obtained by dividing the sum of the products of the deviations of the two series from their arithmetic means by the number of observations (N):
$$\operatorname{Cov}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N}$$
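The calculation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `pearson_r` and the sample data are made up for the example, and both the covariance and the standard deviations are divided by the number of observations, as in the description.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's coefficient: covariance divided by the product of the
    standard deviations of the two series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance: mean of the products of deviations from the two means.
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Standard deviations of the two series.
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return covariance / (sd_x * sd_y)


# Illustrative data only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(pearson_r(x, y), 3))  # 0.775
```

The function is undefined when either series is constant, because its standard deviation is then zero.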
If the data in two series are classified, Pearson’s coefficient of correlation is calculated by the following formula:
d. Spearman’s Ranking Method:
Professor Charles Spearman worked out a method for determining correlation in which the values of all data of a series are assigned ranks in descending or ascending order. In this ranking process, the highest value is given rank 1, the next highest value is given rank 2, and so on. In some series the values of two or more data are the same.
In that case, the tied values share the mean of the ranks they would otherwise occupy. For example, suppose a series contains the value 67 twice, at S. No. 3 and at S. No. 10. Instead of being ranked 6 and 7 respectively, both are ranked 6.5 (the mean of rank 6 and rank 7).
In the same way, if three or more data in a series have the same value, all of them share the rank which is the mean of their ranks. The number or frequency of the data with the same value is indicated by m.
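The ranking rule, including shared ranks for equal values, can be sketched as below. The function name `average_ranks` and the sample marks are assumptions made for illustration; the highest value is given rank 1, as described above.

```python
def average_ranks(values):
    """Rank values so that the highest gets rank 1; equal values share
    the mean of the ranks they would otherwise occupy."""
    # Indices of the values, ordered from highest to lowest value.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend 'end' over the run of values equal to the one at 'pos'.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Mean of the ranks pos+1 .. end+1 occupied by this run.
        shared = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


# Hypothetical marks; the value 67 occurs twice (m = 2).
marks = [90, 85, 67, 80, 67, 60]
print(average_ranks(marks))  # [1.0, 2.0, 4.5, 3.0, 4.5, 6.0]
```

The two observations of 67 would occupy ranks 4 and 5, so each receives 4.5.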
In the next step, the differences between the ranks (D) of the respective data of the two series are obtained (D = R1 - R2), which may be positive or negative. Thereafter, the values of D² and their sum (∑D²) are determined.
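For two series that lack data with similar values (i.e., there are no tied ranks), the following formula is used to determine the coefficient of correlation by the ranking method (symbolized by rho, ρ):
$$\rho = 1 - \frac{6 \sum D^2}{N (N^2 - 1)}$$
where N is the number of pairs of observations.
For two series in which 2 or more data have similar values, the following Spearman's formula is used, with one correction term for each group of m tied values:
$$\rho = 1 - \frac{6 \left[ \sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \cdots \right]}{N^3 - N}$$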
ρ = 1 - 0.278 = 0.722 (coefficient of correlation), i.e., the correlation between the two data series is moderate.
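A minimal Python sketch of the ranking calculation is given below. It starts from ranks that have already been assigned and, as an assumption, applies the usual textbook tie correction of (m³ - m)/12 for every group of m equal ranks in either series; the function name `spearman_rho` and the sample ranks are hypothetical.

```python
from collections import Counter


def spearman_rho(ranks_1, ranks_2):
    """Spearman's rank coefficient from two lists of already assigned ranks.

    Tied ranks (ranks shared by m observations) add (m^3 - m) / 12 to the
    sum of squared rank differences; untied ranks add nothing.
    """
    n = len(ranks_1)
    # Sum of squared rank differences, D = R1 - R2.
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    correction = 0.0
    for ranks in (ranks_1, ranks_2):
        for m in Counter(ranks).values():
            correction += (m ** 3 - m) / 12
    return 1 - 6 * (sum_d2 + correction) / (n ** 3 - n)


# Hypothetical ranks without ties: D = -1, 1, -1, 1, 0, so the sum of D² is 4.
print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # 0.8

# Hypothetical ranks with one tie (rank 4.5 shared by m = 2 observations).
rho = spearman_rho([1, 2, 4.5, 3, 4.5, 6], [2, 1, 4, 3, 5, 6])
print(round(rho, 3))  # 0.914
```

When no ranks are tied, every m equals 1, the correction is zero, and the calculation reduces to the simple formula based on ∑D² alone.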
Calculate the coefficient of correlation by Spearman’s ranking method and indicate the degree of correlation in the following two data series:
e. Correlation Coefficient by Concurrent Deviation:
This method is used to indicate whether the correlation is in positive or negative direction especially in the data series characterized by short-term fluctuations of data.
Correlation coefficient by concurrent deviation is calculated as follows:
1. First of all, the direction of deviation [positive (+) or negative (–)] of each observation with respect to the preceding observation is marked for each series in a separate column. If the value is greater than that of the preceding observation in the series, the direction of deviation is marked +, and if it is less, the direction of deviation is marked –.
2. Next, the deviation signs of the respective data of the two series are multiplied (+ × + = +, + × – = – and – × – = +) and the products are recorded in a separate column.
3. The total number of positive signs in the column for the product of deviation signs is recorded, which is called the concurrent deviation (= C).
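4. The coefficient of correlation by concurrent deviation (r_c) is determined by the following formula:
$$r_c = \pm\sqrt{\pm\frac{2C - N}{N}}$$
Where, C = total number of + signs in the column for products of two deviations
N = number of pairs of deviations compared, i.e., one less than the number of observations in a series.
The sign of (2C - N)/N is placed both inside and outside the square root, so that r_c is negative when 2C is less than N.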
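The four steps can be sketched in Python as below. This is only an illustration under stated assumptions: N is taken as the number of pairs of deviations compared (one less than the number of observations), an unchanged value is given no sign and is not counted as concurrent, and the function name and sample series are hypothetical.

```python
from math import sqrt


def concurrent_deviation(x, y):
    """Coefficient of correlation by the concurrent deviation method."""

    def signs(series):
        # +1 if a value is greater than the preceding one, -1 if less,
        # 0 if unchanged (assumption: unchanged values get no sign).
        return [(b > a) - (b < a) for a, b in zip(series, series[1:])]

    signs_x, signs_y = signs(x), signs(y)
    n = len(signs_x)  # number of pairs of deviations compared
    # C: number of positive products, i.e. deviations in the same direction.
    c = sum(1 for a, b in zip(signs_x, signs_y) if a * b > 0)
    value = (2 * c - n) / n
    # The sign of (2C - N)/N is carried outside the square root.
    return sqrt(value) if value >= 0 else -sqrt(-value)


# Hypothetical series with short-term fluctuations.
x = [10, 12, 11, 15, 18, 17]   # signs: +, -, +, +, -
y = [20, 24, 22, 27, 26, 25]   # signs: +, -, +, -, -
print(round(concurrent_deviation(x, y), 3))  # 0.775
```

In this example four of the five products are positive (C = 4, N = 5), so (2C - N)/N = 0.6 and the coefficient is the square root of 0.6.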
The following example will illustrate the process:
Calculate the co-efficient of correlation of the following two data series by concurrent deviation method: