# Data Analysis

After running your database queries, you received the following printout detailing the reservoirs:
[Scatterplot 1] [Scatterplot 2] [Scatterplot 3]

6. What conclusions can you draw from the above scatterplots?

Answer (a) — incorrect: There is a clear difference in incidence proportions between the two populations.
Answer (b) — correct: The line fitted to the incidence proportions of Susser Syndrome in areas served by the Rothman Reservoir suggests a strong correlation. Specifically, there is a negative correlation between distance from the Porks-A-Lot Farm and incidence of Susser Syndrome (as distance increases, incidence decreases).
Answer (c) — incorrect: This is an example of the 'ecological fallacy' where we are applying group-level characteristics to individuals within that group. Using the ecological study design, we can only draw conclusions concerning the groups or populations under analysis. We cannot draw conclusions about the individual members of the population because we do not have exposure and outcome data on each member.

## Intellectually curious?

Analysis of the data presented in the tables accompanying scatterplots 1 and 2 shows that in both instances the correlation between distance from the Porks-A-Lot Pig Farm and incidence of Susser Syndrome is very strong (-0.97 for Rothman Reservoir and -0.92 for Greenland Reservoir). This means that as distance increases, incidence tends to decrease in a nearly linear fashion; a coefficient short of -1 does not guarantee a decrease at every step. Since the relationships appear to be linear, we can fit a linear model using the least squares method. The regression coefficients, which predict the average change in incidence for a given change in distance from the farm, are quite different for the two reservoirs (-0.57 for Rothman Reservoir and -0.09 for Greenland Reservoir).
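To see how two data sets can share a strong correlation yet have very different regression slopes, here is a minimal Python sketch. The distances and incidence values are invented for illustration only; they are not the Rothman or Greenland data, and the helper names `pearson_r` and `least_squares_line` are our own.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data (NOT taken from the exercise tables).
distance = [1, 2, 3, 4, 5]                    # distance from the farm
incidence_steep = [10.0, 8.1, 5.9, 4.2, 2.0]  # falls quickly with distance
incidence_shallow = [3.0, 2.7, 2.7, 2.3, 2.2] # falls slowly with distance

for label, incidence in [("steep", incidence_steep), ("shallow", incidence_shallow)]:
    r = pearson_r(distance, incidence)
    slope, intercept = least_squares_line(distance, incidence)
    print(f"{label}: r = {r:.2f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Both invented series give a correlation close to -1, but the fitted slopes differ by roughly a factor of ten. The same pattern appears in the reservoir results: correlation describes how tightly the points follow a line, while the slope describes how steep that line is.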

Correlation coefficients measure the degree of linear dependence between two variables. That is, a correlation coefficient scales the extent to which two variables vary together (their covariance) by the standard deviation of each variable, giving a unit-free value between -1 and 1. Correlation coefficients are useful for an initial characterization of the relationships among variables, but are sensitive to deviations from normality and to extreme observations.

Regression fits a line to the relationship between two variables by minimizing the sum of squared distances between each observation and the prediction that the line would provide. Regression coefficients quantify the magnitude of the relationship between two variables: the estimate that one gets from a linear regression represents the average change in the outcome variable given a one-unit change in the predictor variable. If the regression coefficient is equal to zero, a one-unit change in the predictor is not associated, on average, with any linear change in the outcome. While inference on regression coefficients also relies on a normality assumption and the estimates are also affected by extreme observations, in general the regression format is more robust to deviations from assumptions as compared to correlation coefficients.
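For readers who want the formulas behind these descriptions, the standard definitions for simple linear regression can be written as follows. The symbols are our own notation, not taken from the exercise: $x_i$ is distance, $y_i$ is incidence, $\bar{x}$ and $\bar{y}$ are sample means, and $s_x$ and $s_y$ are sample standard deviations.

```latex
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}},
\qquad
\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
                     {\sum_i (x_i - \bar{x})^2}
              = r \, \frac{s_y}{s_x}
```

The two quantities share the same numerator. The slope $\hat{\beta}_1$ equals the correlation rescaled by $s_y / s_x$, which is how two reservoirs with similar correlations can have very different regression coefficients: the spread of incidence relative to the spread of distance differs between them.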