Wednesday, June 9th, 2010 11:04 am
Talked a bit about the project - supernatural_fic is so going to be my assignment topic. Lecturer recommends regression analysis... there are NUMBERS EVERYWHERE.

Back to Probability, we can do a Sample Space for the sum of two dice that gives us a pretty good idea of what outcomes are likely.

e.g. Sample Space for Sum of Two Dice
D1/D2    1    2    3    4    5    6
1        2    3    4    5    6    7
2        3    4    5    6    7    8
3        4    5    6    7    8    9
4        5    6    7    8    9   10
5        6    7    8    9   10   11
6        7    8    9   10   11   12
Analysis: Pretty! Symmetrical! 7 is more likely

We can also represent this as a Probability Distribution
Sum      P(Sum)
2        1/36
3        2/36
4        3/36
5        4/36
6        5/36
7        6/36
8        5/36
9        4/36
10       3/36
11       2/36
12       1/36
Total    36/36
Analysis: Pretty! Symmetrical! 7 is more likely
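
If you'd rather not count squares by hand, here is a minimal Python sketch (mine, not from the lecture) that enumerates the 36 equally likely outcomes and tallies each sum. The function name dice_sum_distribution is made up for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def dice_sum_distribution(sides=6):
    """Return {sum: probability} for two fair dice with `sides` faces each."""
    # Every (d1, d2) pair is equally likely, so count how often each sum occurs.
    counts = Counter(d1 + d2 for d1, d2 in product(range(1, sides + 1), repeat=2))
    total = sides * sides
    return {s: Fraction(c, total) for s, c in sorted(counts.items())}


dist = dice_sum_distribution()
for s, p in dist.items():
    # Fractions reduce automatically, so rescale to "count/36" to match the table.
    print(s, f"{int(p * 36)}/36")
```

Using Fraction keeps everything exact, so the probabilities add up to exactly 1 rather than 0.9999999.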

We could also do a Histogram which I shall not draw but have a pretty link to wikipedia which talks about histograms in detail.

Summary Measures:
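  • Expected value (the mean) which is the weighted average of the outcomes, using their probabilities as weights
  • Variance which is the weighted average of the squared deviations about the mean; the Standard Deviation is its square root
  • Covariance which measures how X and Y vary together (positive if they tend to move in the same direction, negative if in opposite directions)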
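
To make the first two measures concrete, here is a small sketch applied to the two-dice distribution from earlier. The helper names expected_value and variance are my own, and the shortcut formula for the dice probabilities is just a compact way of writing the table.

```python
from fractions import Fraction
from math import sqrt

# Distribution of the sum of two fair dice: P(sum = s) = (6 - |s - 7|) / 36.
dist = {s: Fraction(6 - abs(s - 7), 36) for s in range(2, 13)}


def expected_value(dist):
    """Weighted average of the outcomes, with probabilities as weights."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """Weighted average of the squared deviations about the mean."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())


mu = expected_value(dist)
var = variance(dist)
sd = sqrt(var)  # the standard deviation is the square root of the variance
print(mu, var, round(sd, 3))
```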
In a business or financial context the:
  • mean represents expected return on investment
  • standard deviation is a measure of the associated risk
Covariance for Investment Returns - I steal this from wikipedia again. All this course wants me to know is that if the covariance is positive (say 1000) the two investments are positively correlated, i.e. their returns tend to move together. Yay!
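
For the curious, a rough sketch of how the covariance of two investments could be computed from a discrete distribution. The three scenarios, their probabilities and the returns are entirely hypothetical numbers I made up for illustration, not figures from the lecture, and the function name covariance is my own.

```python
def covariance(probs, xs, ys):
    """Covariance of X and Y given P(scenario) and the X, Y values per scenario."""
    mean_x = sum(p * x for p, x in zip(probs, xs))
    mean_y = sum(p * y for p, y in zip(probs, ys))
    # Weighted average of the product of deviations from each mean.
    return sum(p * (x - mean_x) * (y - mean_y) for p, x, y in zip(probs, xs, ys))


# Hypothetical scenarios: recession, stable economy, boom.
probs = [0.2, 0.5, 0.3]
returns_a = [-100, 100, 250]  # made-up returns for investment A
returns_b = [-200, 50, 350]   # made-up returns for investment B

cov = covariance(probs, returns_a, returns_b)
print(round(cov, 2))
```

A positive result means the two investments tend to do well or badly together; a negative one means they tend to offset each other.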

Binomial Probability Distribution; characteristics of
  • 'n' identical trials e.g. 15 tosses of a coin, 10 light bulbs taken from a warehouse
  • two mutually exclusive outcomes on each trial e.g. head or tail in each toss of a coin
  • trials are independent e.g. what happens previously does not affect next outcome
  • constant probability for each trial e.g. probability of getting a tail is the same each time we toss (assumes 'fair' coin)
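
When those four conditions hold, the probability of exactly k successes in n trials follows the standard binomial formula. A minimal sketch, using the 15 coin tosses from the list above (the function name binomial_pmf is my own):

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# 15 tosses of a fair coin: probability of each possible number of tails.
n, p = 15, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(round(pmf[7], 4))
```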
Pause for example of how roulette works in casinos and pause to discuss impressive impact of false positives.

Normal Distribution is regarded as the most important theoretical distribution in business statistics. It approximates the observed frequency distributions of many natural and physical measurements such as height, weight, sales, IQ, product lifetimes and the variability of human and machine outputs.
We can find the probability of events occurring by looking at the area underneath the bell curve, and we have tables that allow us to look them up. The tables only cover the standard normal curve (mean 0, standard deviation 1), so we use z-scores, z = (x - mean) / standard deviation, to standardise the data first - yes, I got scaled in high school too.
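
As a stand-in for the printed tables, here is a sketch that computes the same areas directly. The mean of 100 and standard deviation of 15 are hypothetical values picked for illustration, and both function names are my own.

```python
from math import erf, sqrt


def z_score(x, mean, sd):
    """How many standard deviations x lies above (or below) the mean."""
    return (x - mean) / sd


def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))


# Hypothetical example: scores with mean 100 and standard deviation 15.
z = z_score(130, mean=100, sd=15)
print(z, round(standard_normal_cdf(z), 4))
```

The area to the right of z is just 1 minus the value returned, and the area between two values is the difference of their two results.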

Discussion of how to use tables to look up probabilities.

Pause to work through examples - have very messy scribbled notes for this :)

Assessing Normality:
  • Construct charts
    • For small datasets, do the stem-and-leaf and box-and-whisker displays look symmetric?
    • For large datasets does the histogram or polygon appear bell shaped? (FYI supernatural_fic doesn't)
  • Compute descriptive summary measures
    • Do the mean, median and mode have similar values?
    • Is the interquartile range approximately 1.35 standard deviations?
    • Is the range approximately 6 standard deviations?
  • Observe the distribution of the dataset
    • Do approximately 2/3 of the observations lie within 1 standard deviation of the mean?
    • Do approximately 4/5 of the observations lie within 1.28 standard deviations of the mean?
    • Do approximately 19/20 of the observations lie within 2 standard deviations of the mean?
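
The numeric checks in this list are easy to automate. A rough sketch, run here on simulated data because I don't have a real dataset to hand; the function name normality_summary and the simulated sample are assumptions for illustration only.

```python
import random
import statistics


def normality_summary(data):
    """Rule-of-thumb figures to compare against a normal distribution."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    q1, _, q3 = statistics.quantiles(data, n=4)

    def share_within(k):
        # Fraction of observations within k standard deviations of the mean.
        return sum(abs(x - mean) <= k * sd for x in data) / len(data)

    return {
        "mean": mean,
        "median": statistics.median(data),
        "iqr_in_sd": (q3 - q1) / sd,                  # about 1.35 if normal
        "range_in_sd": (max(data) - min(data)) / sd,  # roughly 6 if normal
        "within_1_sd": share_within(1),               # about 2/3 if normal
        "within_2_sd": share_within(2),               # about 19/20 if normal
    }


# Simulated data, purely for illustration.
random.seed(1)
sample = [random.gauss(50, 10) for _ in range(5000)]
summary = normality_summary(sample)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Note that the range check is the shakiest of these: with thousands of observations the range tends to come out wider than 6 standard deviations even for perfectly normal data.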
Sampling Distributions: different samples produce different estimates (large samples are better but cost more), so we need a framework for describing how much those estimates vary.

Off to Chapter 7 Sampling and Sample Distributions p 7-9.

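As far as I can make out, if you take a sample and then take the mean of that sample, the result can vary a bit depending on which sample you happened to draw. If you take the mean of lots of sample means, it comes out at the population mean. The bigger the sample size, the less the sample means vary (their standard deviation is the population standard deviation divided by the square root of n).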
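
A quick simulation makes this easier to believe. This is only a sketch under my own assumptions: the population is a single fair die, and the function name sample_means, the seed and the number of repeats are arbitrary choices.

```python
import random
import statistics


def sample_means(population, n, repeats, rng):
    """Draw `repeats` samples of size n (with replacement) and return their means."""
    return [statistics.mean(rng.choices(population, k=n)) for _ in range(repeats)]


rng = random.Random(42)
population = [1, 2, 3, 4, 5, 6]  # one fair die; the population mean is 3.5

results = {}
for n in (5, 30, 100):
    means = sample_means(population, n, repeats=2000, rng=rng)
    # Store (mean of the sample means, spread of the sample means) for each n.
    results[n] = (statistics.mean(means), statistics.stdev(means))
    print(n, round(results[n][0], 3), round(results[n][1], 3))
```

The first number in each row should sit close to 3.5 whatever n is, while the second should shrink as n grows.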

How large is large enough?
  • for most distributions, n greater than or equal to 30
  • for 'fairly symmetric' distributions, n is greater than or equal to 15
  • for a normally distributed population, n greater than or equal to 1 (the sampling distribution of the mean is always normally distributed, for all values of n)
Population Proportion
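  • is for a Categorical variable e.g. gender, voted in last election, pregnant
  • ps = X / n, where X is the number of observations in the category of interest out of a sample of n - if there are only two outcomes, X has a binomial distribution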
...and then we ran away