# Data Analysis

After running your database queries, you received the following printout detailing the reservoirs:

**[Scatterplot 1] [Scatterplot 2] [Scatterplot 3]**

6. What conclusions can you draw from the above scatterplots?

- The incidence proportion of Susser Syndrome in areas served by the Rothman and Greenland reservoirs appears to be identical
- Specifically, there is a negative correlation between distance from the Porks-A-Lot Farm and Susser Syndrome (as distance increases, incidence decreases)
- Individuals drinking Rothman Reservoir water have approximately twice the risk of developing Susser Syndrome as individuals drinking Greenland Reservoir water
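
For readers who want to see how the quantities named in these statements are computed, here is a minimal sketch. The counts are hypothetical placeholders chosen only for illustration; they are not read from the scatterplots, and the function names are assumptions of this example.

```python
def incidence_proportion(new_cases, population_at_risk):
    """New cases during the period divided by the population at risk at its start."""
    if population_at_risk <= 0:
        raise ValueError("population_at_risk must be positive")
    return new_cases / population_at_risk


def risk_ratio(ip_group, ip_reference):
    """Ratio of two incidence proportions (group of interest over reference group)."""
    return ip_group / ip_reference


# Hypothetical counts for two areas, NOT taken from the Epiville data.
ip_a = incidence_proportion(new_cases=30, population_at_risk=1000)
ip_b = incidence_proportion(new_cases=20, population_at_risk=1000)
rr = risk_ratio(ip_a, ip_b)

print(f"Area A: {ip_a:.3f}, Area B: {ip_b:.3f}, risk ratio: {rr:.2f}")
```

A risk ratio of 1 would indicate identical incidence proportions in the two groups, and a risk ratio of 2 would indicate twice the risk in the first group.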

## Intellectually curious?

Analysis of the data in the tables accompanying scatterplots 1 and 2 shows that, in both cases, the correlation between distance from the Porks-A-Lot Pig Farm and incidence of Susser Syndrome is very strong (-0.97 for Rothman Reservoir and -0.92 for Greenland Reservoir). A correlation this close to -1 means that incidence falls almost linearly as distance increases; it does not mean that every increase in distance is accompanied by a decrease in incidence. Since the relationships appear to be linear, we can fit a linear model using the least squares method. The regression coefficients, which estimate the average change in incidence for a one-unit change in distance from the farm, are quite different for the two reservoirs (-0.57 for Rothman and -0.09 for Greenland).
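
The following sketch shows one way to compute a correlation coefficient and a least-squares slope, and why two data sets can share a similar correlation while having very different slopes. The numbers are made up for illustration and are not the values from the tables accompanying the scatterplots; the variable and function names are likewise assumptions of this example.

```python
import numpy as np

# Hypothetical illustration data, NOT the Epiville tables:
# distance from the farm and incidence for two imaginary areas.
distance = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
incidence_a = np.array([9.0, 8.2, 6.9, 6.1, 5.0])
# Area B is a flattened copy of area A: same pattern, one tenth of the spread.
incidence_b = 2.0 + 0.1 * incidence_a


def pearson_r(x, y):
    """Pearson correlation: co-variation scaled by the spread of both variables."""
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


def least_squares_line(x, y):
    """Slope and intercept that minimize the sum of squared vertical residuals."""
    dx = x - x.mean()
    slope = float((dx * (y - y.mean())).sum() / (dx ** 2).sum())
    intercept = float(y.mean() - slope * x.mean())
    return slope, intercept


r_a = pearson_r(distance, incidence_a)
r_b = pearson_r(distance, incidence_b)
slope_a, intercept_a = least_squares_line(distance, incidence_a)
slope_b, intercept_b = least_squares_line(distance, incidence_b)

print(f"Area A: r = {r_a:.3f}, slope = {slope_a:.3f}")
print(f"Area B: r = {r_b:.3f}, slope = {slope_b:.3f}")
```

In this toy example both areas have the same correlation, because the points fall equally close to a straight line, but the slope for area B is one tenth of the slope for area A, because incidence changes far less per unit of distance.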

Correlation coefficients measure the degree of linear dependence between two variables. That is, a correlation coefficient divides the covariance of the two variables by the product of their standard deviations, giving a unitless value between -1 and 1 that describes how closely the variables vary together along a straight line. Correlation coefficients are useful for an initial characterization of the relationships among variables, but are sensitive to deviations from normality and to extreme observations. Linear regression fits a line to the relationship between two variables by minimizing the sum of the squared vertical distances between each observation and the value that the line predicts for it. Regression coefficients express the magnitude of the relationship in the units of the variables: the estimate from a linear regression represents the average change in the outcome variable given a one-unit change in the predictor variable. This is why two relationships can have similarly strong correlations and yet very different regression coefficients. If the regression coefficient is equal to zero, a one-unit change in the predictor provides no linear information about the outcome. While inference about regression coefficients also assumes approximately normally distributed errors, and the estimates are also affected by extreme observations, in general the regression framework is more robust to deviations from its assumptions than correlation coefficients are.
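
The two quantities can be written side by side to make the distinction concrete. The notation here is an assumption of this note rather than something defined in the exercise: $x_i$ is distance from the farm, $y_i$ is incidence, $\bar{x}$ and $\bar{y}$ are their means, and $s_x$ and $s_y$ are their sample standard deviations.

```latex
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}
\qquad
b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
         {\sum_i (x_i - \bar{x})^2}
  = r \, \frac{s_y}{s_x}
```

Both expressions share the same numerator. The correlation $r$ divides it by the spread of both variables and so has no units, while the slope $b$ divides it by the spread of the predictor only and so is expressed in units of incidence per unit of distance.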