Correlation Research Project
Introduction: Correlation is one of the most commonly encountered and important statistical concepts in the study of relationships between variables. Understanding correlation, and quantifying its extent with a correlation coefficient, provides a basis for describing how two variables relate to each other and supports decisions that depend on that relationship. When a change in one variable is accompanied by a proportional change in the same direction in the other variable, the relationship is termed positive correlation. When a change in one variable is accompanied by a proportional change in the opposite direction in the other, it is termed negative correlation. The larger the absolute value of the correlation coefficient (Fleiss & Cohen, 1973), the stronger the relationship between the two variables; the smaller the absolute value, the weaker the relationship. The sign of the coefficient indicates the direction of the relationship. For example, there can be a positive correlation between education and salary, and a negative correlation between obesity and lifespan. Correlation can be studied between two variables or among more than two variables; the latter case is termed multiple correlation. Depending on the form of the relationship between the variables, a correlation can also be described as linear or nonlinear.
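As a minimal illustration of how the sign and size of a correlation coefficient are read, the sketch below computes the Pearson correlation coefficient for two small made-up data sets. The function name pearson_r and all of the numbers are assumptions chosen for illustration only; they are not part of the study data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented toy data, not taken from the survey.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # moves in the same direction as x
falling = [10, 8, 6, 4, 2]   # moves in the opposite direction to x

print(pearson_r(x, rising))   # positive coefficient
print(pearson_r(x, falling))  # negative coefficient
```

A coefficient near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one.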
The advantages of correlation studies include the ease of data collection, the more detailed picture they give of the relationship between two variables, and the wide range of settings in which they can be applied. The remainder of this report studies the correlation between two variables and presents the complete statistical procedure, from data collection to analysis of the correlation.
Question 1: Define the relationship of interest and a data collection technique.
Answer: The relationship of interest in this study is between the age of a male individual and his monthly salary. It is assumed that the two are correlated. For the purposes of analysis, the study is conducted in a selected town near California with a total population of about 15000 people. The hypothesis is that there is a positive correlation between the age of a male individual in the town and his monthly salary, and that all such individuals aged 37 years or more have a monthly salary of at least $5000. This hypothesis is tested in the current study. The required data are collected using both primary and secondary data collection techniques. Primary data collection consists of meeting randomly selected people and collecting data from them in face-to-face interactions; secondary data collection uses statistics from local offices and government organizations.
Question 2: Determine the appropriate sample size and collect the data.
Answer: The total population of the town is 15000, of whom about 8000 are male, with ages ranging up to 95 years. Collecting data from every male resident would take too much time, so for practical reasons the Cochran formula (Cochran, 2007) is used to determine the sample size. The study is confined to the male population in the salaried group, whose age range is arbitrarily set at 21 to 58 years. The sample is drawn by simple random sampling (Rubin, 1980).
Cochran formula:
n_{0} = Z^{2}pq/e^{2}
where Z is the z-value for the chosen confidence level, e is the desired level of precision, p is the estimated proportion of the population that has the attribute in question, and q = 1 - p.
For example, the proportion of the population aged 37 years or more with a monthly salary above $5000 can be taken as 0.5. Assuming a 95% confidence level and a 5% margin of error, the sample size required for testing this hypothesis is about
n_{0} = 1.96^{2} * 0.5 * 0.5 / (0.05)^{2} = 384.16, rounded up to 385
(Z = 1.96 for a 95% confidence level)
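The sample size calculation can be reproduced with a short Python sketch. The function name cochran_sample_size is illustrative, and rounding up to the next whole respondent is an assumption about how 385 was obtained.

```python
import math

def cochran_sample_size(z, p, e):
    """Cochran's n0 = z^2 * p * q / e^2 with q = 1 - p.

    Returns the raw value and the value rounded up to a whole respondent.
    """
    q = 1.0 - p
    n0 = (z ** 2) * p * q / (e ** 2)
    return n0, math.ceil(n0)

# Inputs from the text: 95% confidence (z = 1.96), p = 0.5, 5% margin of error.
n0, n = cochran_sample_size(z=1.96, p=0.5, e=0.05)
print(round(n0, 2), n)  # 384.16 385
```

Choosing p = 0.5 maximizes the product pq, so it gives the largest sample size for a given confidence level and margin of error.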
The data collected from the 385 respondents are provided in the accompanying Excel table.
Question 3: Perform the appropriate analysis to determine if there is a statistically significant linear relationship between the two variables. Describe the relationship in terms of strength and direction.
Answer: The strength and direction of the relationship between the two variables in the collected sample are as follows.
The correlation analysis shows no meaningful relationship between salary and age for the male individuals in the selected sample. The correlation between the two is very weak: the correlation coefficient is 0.0056, which is very small. No definite direction of correlation between age and salary is apparent either. Hence the stated hypothesis is incorrect in both of its parts (Manly, 1992).
• There is no correlation between the two variables.
In the selected sample of 385 individuals, about 236 are aged 37 years or more. For these 229 instances, the salaries of only 48 people exceed $5000, and hence p is 0.19, far below the assumed 0.5.
• There is also no valid evidence for the hypothesis that people aged 37 years or more have a salary higher than $5000.
So there is neither a meaningful correlation nor any evidence for the hypothesized relation between age and salary.
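One common way to check whether a sample correlation differs significantly from zero is a t-test on the coefficient. The sketch below applies it under the assumption that the reported 0.0056 is a Pearson correlation coefficient computed on all 385 observations. The function name and the normal approximation used for the p-value are illustrative choices, not the procedure used in the study.

```python
import math

def correlation_t_test(r, n):
    """t statistic for H0: rho = 0, with an approximate two-sided p-value."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1.0 - r ** 2)
    # For large df the t distribution is close to the standard normal,
    # so the two-sided p-value is approximated with erfc.
    p_approx = math.erfc(abs(t) / math.sqrt(2.0))
    return t, df, p_approx

# Assumed inputs: r = 0.0056 from the analysis, n = 385 respondents.
t, df, p = correlation_t_test(r=0.0056, n=385)
print(f"t = {t:.3f}, df = {df}, approximate p = {p:.2f}")
```

With these inputs the t statistic is about 0.11, far below the two-sided 5% critical value of roughly 1.97 for 383 degrees of freedom, which is consistent with the finding of no significant linear relationship.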
Question 4: Construct a model of the relationship and evaluate the validity of that model.
Answer: Based on the relation between age and salary in the given sample, the model relating the two can be written as follows:
y = -0.171x^{2} + 21.335x + 3342.5
(where y is the salary and x is the age of the person)
Checking the statistics of 5 randomly selected people:

S.No   Age   Salary Actual ($)   Salary Computed ($)   Difference ($)
1      26    2600                3781.614              1181.614
2      26    2600                3781.614              1181.614
3      54    3700                3995.954              295.954
4      37    3500                3897.796              397.796
5      34    4900                3870.214              1029.79

There is only a very weak correlation between the salaries computed by the model and the actual values. The regression coefficient is 0.0056.
The model therefore has limited validity in this case.
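The spot check above can be reproduced with a few lines of Python. The sketch assumes that the leading coefficient of the quadratic is negative, since that is what reproduces the computed salaries listed in the table (for example 3781.614 at age 26), and it takes the difference to be the absolute gap between actual and computed salary. The function name predict_salary is illustrative.

```python
def predict_salary(age):
    """Quadratic model of monthly salary as a function of age."""
    return -0.171 * age ** 2 + 21.335 * age + 3342.5

# (age, actual salary) pairs as listed in the spot check.
sample = [(26, 2600), (26, 2600), (54, 3700), (37, 3500), (34, 4900)]

for age, actual in sample:
    computed = predict_salary(age)
    difference = abs(actual - computed)
    print(age, actual, round(computed, 3), round(difference, 3))
```

Gaps of several hundred to more than a thousand dollars on salaries of a few thousand dollars illustrate why the model is described as having limited validity.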
Conclusion: Based on the correlation study conducted, it can be concluded that there is no significant correlation between the two variables considered. In addition, the hypothesis that all men aged 37 years or more earn at least $5000 per month is not supported by acceptable evidence in the sample data collected. Hence the hypothesis is rejected. The findings are stated at a 95% confidence level with a margin of error of ±5%.
