Quick recap of simple linear regression from last week and some examples of spurious correlation.
- Shark attacks and ice-cream sales
- AFL Premiership and Economy
- Piracy and Global Warming (that was my offering to the class)
Returning to last week's example of annual sales versus store size, we ended up with the following relationship:
We can define a Measure of Variation: Sum of Squares
- SST (Total Sample Variability) = SSR (Explained Variability) + SSE (Unexplained Variability)
Where:
- SST = total sum of squares and measures the variation of the Y values around their mean
- SSR = regression sum of squares which is the explained variation attributable to the relationship between X and Y
- SSE = error (residual) sum of squares and variation is attributable to factors other than the relationship between X and Y
The Coefficient of Determination r² = SSR/SST = Regression Sum of Squares / Total Sum of Squares (the higher the better)
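The decomposition and r² can be checked numerically. A minimal sketch, using illustrative store-style figures made up for the example (not the lecture's actual data):

```python
import numpy as np

# Illustrative floor space vs annual sales figures -- made up for this
# sketch, not the lecture's actual store data
x = np.array([1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3])
y = np.array([3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7])

b1, b0 = np.polyfit(x, y, 1)   # least-squares slope and intercept
y_hat = b0 + b1 * x            # fitted values

sst = np.sum((y - y.mean()) ** 2)      # total variability
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variability
sse = np.sum((y - y_hat) ** 2)         # unexplained variability

r_squared = ssr / sst  # coefficient of determination
```

The identity SST = SSR + SSE holds exactly (up to floating-point error) for any least-squares fit with an intercept.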
| Regression Statistics | |
| --- | --- |
| Multiple R | 0.971 |
| R Square | 0.942 |
| Adjusted R Square | 0.930 |
| Standard Error | 611.752 |
| Observations | 7 |
Analysis: R Square of 0.942 means 94.2% of the variation in sales is explained by variation in floor space.
BUT if we regress Staff against Floor Space, r² = 0.850 (85%), because staff numbers and floor space are themselves tied together.
Regression and Prediction Error
Predicting Y as YBar (not using regression)
- Errors are approximately 2318.46 ($000)
Predicting Y as b0 + b1X (using regression)
- Errors are approximately 611.75 ($000) - smaller! (more information/variables = less error)
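The comparison can be sketched numerically: predicting with YBar leaves the sample standard deviation of Y as the typical error, while predicting from the regression line leaves the standard error of the estimate. Illustrative data, not the lecture's actual store figures:

```python
import numpy as np

# Made-up illustrative data (not the lecture's store figures)
x = np.array([1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3])
y = np.array([3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7])
n = len(y)

# Error predicting Y as YBar: the sample standard deviation of Y
err_mean = np.sqrt(np.sum((y - y.mean()) ** 2) / (n - 1))

# Error predicting Y from the line: the standard error of the estimate
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
err_reg = np.sqrt(np.sum(resid ** 2) / (n - 2))
```

Whenever X explains a worthwhile share of Y, `err_reg` comes out smaller than `err_mean` — more information means less error.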
Linear Regression Assumptions
- Normality: Y values are normally distributed for each X and the probability distribution of errors is normal
- Linearity (residual analysis will show)
- Homoscedasticity (constant variance) (residual analysis will show)
- Independence of errors (residual analysis will show)
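A sketch of what residual analysis looks like in practice, with hypothetical data; the plot (commented out here) is what actually reveals curvature (non-linearity) or a funnel shape (non-constant variance):

```python
import numpy as np

# Hypothetical data for the sketch
x = np.array([1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3])
y = np.array([3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7])

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# Residuals should scatter evenly around zero with no curve or funnel
# shape when plotted against x, e.g.:
#   import matplotlib.pyplot as plt
#   plt.scatter(x, residuals); plt.axhline(0); plt.show()
```

A quick numeric sanity check: least-squares residuals from a fit with an intercept always sum to (essentially) zero, so a non-zero sum signals a coding error rather than an assumption violation.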
Sales Excel Regression Output

| ANOVA | df | SS | MS | F | Signif F |
| --- | --- | --- | --- | --- | --- |
| Regression | 1 | 30380456 | 30380456 | 81.179 | 0.000 |
| Residual | 5 | 1871200 | 374240 | | |
| Total | 6 | 32251656 | | | |

| MODEL | Coeff | SE | t Stat | P-Value | L95% | U95% |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | 1636.415 | 451.495 | 3.624 | 0.015 | 475.811 | 2797.029 |
| Square Metres | 1.487 | 0.165 | 9.012 | 0.000 | 1.062 | 1.911 |
Analysis: the P-Value of 0.015 says that if there were NO relationship between Sales and Floor size, results this strong would occur only 1.5% of the time. This is key. Also, the 95% confidence interval works out to 1.487 ± 0.425 ≈ (1.062, 1.911) [$000 per square metre], and this does not contain 0.
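The interval in the output can be reproduced from the coefficient and its standard error: margin = t-critical × SE, with n − 2 = 5 degrees of freedom. A sketch:

```python
from scipy import stats

# Slope, its standard error, and degrees of freedom from the output above
b1, se_b1, df = 1.487, 0.165, 5   # df = n - 2 = 7 - 2

t_crit = stats.t.ppf(0.975, df)   # two-sided 95%, so the 0.975 quantile
margin = t_crit * se_b1
lower, upper = b1 - margin, b1 + margin
# lower/upper land at roughly (1.063, 1.911), matching the L95%/U95%
# columns up to rounding; since 0 is excluded, we reject "no relationship"
```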
Purpose of Correlation Analysis
- used to measure the strength of association (linear relationship) between two numerical variables (ranges between -1 and +1, where an absolute value of 1 to 0.7 is strong, 0.7 to 0.3 is moderate and 0.3 to 0 is weak)
- only concerned with the strength of the relationship
- no causal effect is implied
- population correlation coefficient ρ (rho) is used to measure the strength between the variables
- sample correlation coefficient r is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations
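For illustration, the sample correlation can be computed directly and squared to recover the coefficient of determination — an identity that holds for simple linear regression with an intercept. The data here are made up:

```python
import numpy as np

# Made-up illustrative data
x = np.array([1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3])
y = np.array([3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7])

r = np.corrcoef(x, y)[0, 1]  # sample correlation coefficient

# Coefficient of determination from the regression decomposition
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
# For simple regression, r**2 equals r2
```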
Pitfalls of Regression Analysis
- Linear model may be wrong (non-linear? unequal variability? clustering?)
- Incorrect use of model (interpolate in range of X values, do not extrapolate)
- Intercept may not be meaningful (if there is no data near X = 0)
- Explaining Y from X versus explaining X from Y (use care in selecting Y)
- Is there a hidden 'Third Factor' (spurious correlation)
Strategies for Avoiding Pitfalls
- Start with a scatter-plot of Y against X to observe the possible relationship
- Perform residual analysis to check the assumptions
- Use a histogram, stem-&-leaf display or box-&-whisker plot of residuals to uncover possible non-normality
- if there is no evidence of assumption violation, then test for the significance of the regression coefficients & construct confidence intervals
Multiple Regression: predicting a single Y variable from two or more X variables is sometimes necessary because one explanatory variable alone is not enough. Multiple regression is concerned with building a model from the most efficient combination of explanatory variables and is used to
- Describe and understand the relationship (understand the effect of one X variable while holding the other fixed)
- Forecast (predict) a new observation (allows you to use all available information (X variables) to find out about what you don't know (the Y variable for a new situation))
Example: Develop a model for estimating heating oil used (gallons) for a single family home in the month of January based on average temperature (F) and amount of insulation (inches), for 15 houses selected in American cities.
| Regression Statistics | |
| --- | --- |
| Multiple R | 0.983 |
| R Square | 0.966 |
| Adjusted R Square | 0.960 |
| Standard Error | 26.014 |
| Observations | 15 |

| ANOVA | df | SS | MS | F | Signif F |
| --- | --- | --- | --- | --- | --- |
| Regression | 2 | 228015 | 114007 | 168.400 | 0.000 |
| Residual | 12 | 8121 | 677 | | |
| Total | 14 | 236136 | | | |

| MODEL | Coeff | SE | t Stat | P-Value | L95% | U95% |
| --- | --- | --- | --- | --- | --- | --- |
| Intercept | 562.151 | 31.093 | 26.651 | 0.000 | 516.193 | 608.109 |
| Temp | -5.437 | 0.336 | -16.170 | 0.000 | -6.169 | -4.704 |
| Insulation | -20.012 | 2.343 | -8.542 | 0.000 | -25.116 | -14.908 |
Analysis: R Square says the combined effect of temp and insulation is high (96.6% of variation explained). P-Values are 0.000 = Yay! Signif F is below 0.05 = Good!
- Y = 562.151 - 5.437 x temp(F) - 20.012 x Insulation(")
Can use this to work out how much insulation it's worth buying before buying oil becomes the cheaper option.
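A sketch of using the fitted equation for prediction; the function name and the 30°F / 6-inch inputs are my own illustrative choices, not from the lecture:

```python
# Fitted heating-oil model, coefficients taken from the output above
def estimate_oil(temp_f, insulation_in):
    """Estimated January heating oil (gallons) for a single-family home."""
    return 562.151 - 5.437 * temp_f - 20.012 * insulation_in

# e.g. a 30F average month in a house with 6 inches of insulation
gallons = estimate_oil(30, 6)  # 562.151 - 163.11 - 120.072, about 279 gallons
```

Note the signs: colder months (lower temp) and thinner insulation both push estimated consumption up, as you'd expect.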
Coefficient of Multiple Determination
- proportion of total variation in Y explained by all X variables taken together
- never decreases when a new X variable is added to the model (a disadvantage when comparing models)
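This is why the output also reports Adjusted R Square, which penalises extra X variables via 1 − (1 − r²)(n − 1)/(n − k − 1). A sketch (the helper name is mine) that reproduces the heating-oil table's value:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R Square: n observations, k explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Heating-oil output above: R Square 0.966, 15 observations, 2 X variables
adj = adjusted_r2(0.966, n=15, k=2)  # about 0.960, matching the table
```

Plugging in k = 3 with the same r² gives a lower value — the penalty for an extra variable that adds nothing — which is why adjusted R Square is the fairer yardstick when comparing models.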