Tuesday, June 29th, 2010 11:11 pm
Quick recap of simple linear regression from last week and some examples of spurious correlation.
  1. Shark attacks and ice-cream sales
  2. AFL Premiership and Economy
  3. Piracy and Global Warming (that was my offering to the class)
Back to last week's example of annual sales versus store size, where we ended up with the following relationship:
  • Y = 1636.415 + 1.487X
We can define a Measure of Variation: Sum of Squares
  • SST (Total Sample Variability) = SSR (Explained Variability) + SSE (Unexplained Variability)
Where:
  • SST = total sum of squares and measures the variation of the Y values around their mean
  • SSR = regression sum of squares which is the explained variation attributable to the relationship between X and Y
  • SSE = error (residual) sum of squares and variation is attributable to factors other than the relationship between X and Y
The Coefficient of Determination r² = SSR/SST = Regression Sum of Squares / Total Sum of Squares (the higher the better)
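As a rough sketch of how these pieces fit together, the snippet below fits a least-squares line in plain Python and computes the three sums of squares and r². The data points and the function name `sums_of_squares` are made up for illustration; they are not the store data from class.

```python
def sums_of_squares(xs, ys):
    """Fit Y = b0 + b1*X by least squares and return (SST, SSR, SSE)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Least-squares slope and intercept
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)               # total variation
    ssr = sum((f - y_bar) ** 2 for f in fitted)           # explained variation
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # unexplained variation
    return sst, ssr, sse


# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

sst, ssr, sse = sums_of_squares(xs, ys)
r2 = ssr / sst
print(round(sst, 3), round(ssr, 3), round(sse, 3))  # 6.0 3.6 2.4
print(round(r2, 3))                                 # 0.6
```

For this toy data SST = SSR + SSE (6.0 = 3.6 + 2.4), so 60% of the variation in Y is explained by the line.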

Regression Statistics
Multiple R          0.971
R Square            0.942
Adjusted R Square   0.930
Standard Error      611.752
Observations        7

Analysis: R Square of 0.942 means 94.2% of the variation in sales is explained by variation in floor space.

BUT if we do Staff versus Floor Space we get r² = 0.850 (85%), because staff and floor space are also tied to each other.

Regression and Prediction Error
Predicting Y as YBar (not using regression)
  • Errors are approximately 2318.46 ($000)
Predicting Y as b0 + b1X (using regression)
  • Errors are approximately 611.75 ($000) - smaller! (more information/variables = less error)
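Both error figures can be recovered from the sums of squares in the Excel ANOVA output for the sales example (SST = 32251656, SSE = 1871200, 7 observations). A quick check, with variable names of my own choosing:

```python
import math

# Values read from the Excel output for the sales example
n = 7            # observations
sst = 32251656   # total sum of squares
sse = 1871200    # residual (error) sum of squares

# Predicting every Y as the mean: typical error is the standard deviation of Y
error_mean_only = math.sqrt(sst / (n - 1))

# Predicting with the regression line: typical error is the standard error
# of the estimate, which loses one more degree of freedom for the slope
error_regression = math.sqrt(sse / (n - 2))

print(round(error_mean_only, 2))    # 2318.46
print(round(error_regression, 2))   # 611.75
```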
Linear Regression Assumptions
  1. Normality: Y values are normally distributed for each X and probability distribution of errors is normal
  2. Linearity (residual analysis will show)
  3. Homoscedasticity (constant variance) (residual analysis will show)
  4. Independence of errors (residual analysis will show)
Sales Excel Regression Output
ANOVA        df   SS         MS         F        Signif F
Regression   1    30380456   30380456   81.179   0.000
Residual     5    1871200    374240
Total        6    32251656

MODEL           Coeff      SE        t Stat   P-Value   L95%      U95%
Intercept       1636.415   451.495   3.624    0.015     475.811   2797.029
Square Metres   1.487      0.165     9.012    0.000     1.062     1.911

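Analysis: The P-Value for Square Metres is 0.000 (well below 0.05), so we reject the idea that there is NO relationship between Sales and Floor size. This is key. (The 0.015, or 1.5%, sits on the Intercept row, not the slope.) Also the 95% confidence interval for the slope works out to 1.487 +- 0.425 (1.062, 1.911) [$000] and this does not contain 0.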
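The F statistic, the t statistic and the interval for the slope can all be rebuilt from other numbers in the Excel output. The sketch below does that for the sales example. The critical value 2.5706 is an assumption on my part: the two-sided 95% value of Student's t with 5 degrees of freedom (n - 2 = 7 - 2), taken from a t-table rather than from the output.

```python
# Values read from the Excel output for the sales example
ms_regression = 30380456
ms_residual = 374240
slope, slope_se = 1.487, 0.165

# Assumption: two-sided 95% critical value of Student's t with 5 degrees
# of freedom, looked up in a t-table (not part of the Excel output)
T_CRIT_95_DF5 = 2.5706

f_stat = ms_regression / ms_residual    # F = MS Regression / MS Residual
t_stat = slope / slope_se               # t = coefficient / its standard error
half_width = T_CRIT_95_DF5 * slope_se   # margin of error for the slope
ci = (slope - half_width, slope + half_width)

print(round(f_stat, 3))   # 81.179
print(round(t_stat, 3))   # 9.012
print(ci)                 # close to (1.062, 1.911)
```

The interval differs from Excel's in the third decimal place only because the inputs here are already rounded.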

Purpose of Correlation Analysis
  • used to measure the strength of association (linear relationship) between two numerical variables (ranges between -1 and +1 where, ignoring the sign, 1 to 0.7 is strong, 0.7 to 0.3 is moderate and 0.3 to 0 is weak)
    • only concerned with the strength of the relationship
    • no causal effect is implied
  • population correlation coefficient (Rho) is used to measure the strength between the variables
  • Sample correlation coefficient R is an estimate of Rho and is used to measure the strength of the linear relationship in the sample observations
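A small sketch of the sample correlation coefficient and the strong/moderate/weak rule of thumb above. The data and the function names `pearson_r` and `strength` are made up for illustration, and putting the boundary values 0.7 and 0.3 into the stronger band is my own assumption, since the rule of thumb does not say which side they fall on.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired observations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def strength(r):
    """Label the size of r; the sign only gives the direction."""
    size = abs(r)
    if size >= 0.7:   # assumption: boundary values go to the stronger band
        return "strong"
    if size >= 0.3:
        return "moderate"
    return "weak"


# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
print(round(r, 3), strength(r))   # 0.775 strong
print(strength(0.971))            # Multiple R from the sales output: strong
print(strength(-0.5))             # moderate (negative direction)
```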
Pitfalls of Regression Analysis
  • Linear model may be wrong (non-linear? unequal variability? clustering?)
  • Incorrect use of model (interpolate in range of X values, do not extrapolate)
  • Intercept may not be meaningful (if there is no data near X = 0)
  • Explaining Y from X versus explaining X from Y (use care in selecting Y)
  • Is there a hidden 'Third Factor' (spurious correlation)
Strategies for Avoiding Pitfalls
  • Start with a scatter-plot of Y against X to observe possible relationship
  • Perform residual analysis to check the assumptions
  • Use a histogram, stem-&-leaf display or box-&-whisker plot of residuals to uncover possible non-normality
  • if there is no evidence of assumption violation, then test for the significance of the regression coefficients & construct confidence intervals
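A minimal sketch of the residual step, assuming made-up data (not the store data) and a helper name, `residuals`, of my own. It fits the line, computes each residual and prints a crude text display. In practice you would plot the residuals against X and look for curvature (non-linearity), a funnel shape (unequal variance) or runs of the same sign (errors not independent).

```python
def residuals(xs, ys):
    """Return residuals e = y - (b0 + b1*x) from a least-squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    b0 = y_bar - b1 * x_bar
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

res = residuals(xs, ys)
for x, e in zip(xs, res):
    # One row per observation; bar length shows the size of the residual
    bar = "#" * round(abs(e) * 10)
    print(f"x={x}  residual={e:+.2f}  {bar}")
```

Note that residuals from a least-squares line with an intercept always sum to zero, so the sum itself tells you nothing; it is the pattern that matters.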
Multiple Regression: predicting a single Y variable from two or more X variables is sometimes necessary because one explanatory variable alone is not enough. Multiple regression is concerned with creating a model using the most efficient combination of explanatory variables and is used to
  • Describe and understand the relationship (understand the effect of one X variable while holding the other fixed)
  • Forecast (predict) a new observation (allows you to use all available information (X variables) to find out about what you don't know (Y variable for new situation))
Example: Develop a model for estimating heating oil used (gallons) for a single family home in the month of January based on average temperature (F) and amount of insulation (inches), for 15 houses selected in American cities.

Regression Statistics
Multiple R          0.983
R Square            0.966
Adjusted R Square   0.960
Standard Error      26.014
Observations        15

ANOVA        df   SS       MS       F         Signif F
Regression   2    228015   114007   168.400   0.000
Residual     12   8121     677
Total        14   236136

MODEL        Coeff     SE       t Stat    P-Value   L95%      U95%
Intercept    562.151   21.093   26.651    0.000     516.193   608.109
Temp         -5.437    0.336    -16.170   0.000     -6.169    -4.704
Insulation   -20.012   2.343    -8.542    0.000     -25.116   -14.908

Analysis: R Square says combined effect of temp and insulation is high. P-Values are 0.000 = Yay! Signif F is below 0.05 = Good!
  • Y = 562.151 - 5.437 x temp(F) - 20.012 x Insulation(")
We can use this to work out how much insulation is worth buying before simply buying more oil becomes cheaper.
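To make the fitted model concrete, here is a small sketch using the coefficients from the Excel output. The function name `predict_oil` and the example house (30 F average temperature, 6 inches of insulation) are made up for illustration.

```python
def predict_oil(temp_f, insulation_in):
    """Predicted January heating oil use (gallons) from the fitted model."""
    return 562.151 - 5.437 * temp_f - 20.012 * insulation_in


# Hypothetical house: 30 F average temperature, 6 inches of insulation
print(round(predict_oil(30, 6), 3))   # 278.969

# Effect of one extra inch of insulation, holding temperature fixed
saving = predict_oil(30, 6) - predict_oil(30, 7)
print(round(saving, 3))               # 20.012
```

Holding temperature fixed, each extra inch of insulation cuts predicted use by about 20 gallons. Comparing that against the cost of the insulation and the price of oil (neither of which is in these notes) gives the break-even point.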

Coefficient of Multiple Determination
  • proportion of total variation in Y explained by all X variables taken together
  • never decreases when a new X variable is added to model (disadvantage when comparing models)
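This is why the Excel output also reports Adjusted R Square, which penalises extra X variables. A sketch of the standard formula, checked against both outputs above (the function name `adjusted_r_squared` is my own):

```python
def adjusted_r_squared(sse, sst, n, k):
    """Adjusted R Square = 1 - (SSE / (n - k - 1)) / (SST / (n - 1)).

    n is the number of observations, k the number of X variables.
    """
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))


# Heating oil example: 15 houses, 2 X variables (temp, insulation)
print(round(228015 / 236136, 3))                            # 0.966 (R Square)
print(round(adjusted_r_squared(8121, 236136, 15, 2), 3))    # 0.96

# Sales example: 7 stores, 1 X variable (square metres)
print(round(adjusted_r_squared(1871200, 32251656, 7, 1), 3))   # 0.93
```

Python drops the trailing zero, so these print as 0.96 and 0.93 where Excel shows 0.960 and 0.930. Unlike R Square, the adjusted figure can go down when a new X variable adds little, which makes it the fairer number for comparing models of different sizes.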