Some discussion about 'house' stuff - downloading PHStat was problematic for some; will bring a copy next seminar for people to take away. Asked if small dead tree could be placed on WebCT; possible progress there.
Back to Sampling, and evaluating the worthiness of surveys. Ask: what is the purpose of the survey? Is the survey based on a probability sample? Then check for the four types of survey error:
- Coverage error - pick an appropriate frame
- Non-Response error - follow up
- Measurement error - ask good questions
- Sampling error - will always exist
Organising Data
- Stem-and-leaf Display: groups numbers by their leading digits (tens or hundreds) as the stem, with the final digit of each number as a leaf
- e.g. 21, 24, 26, 27, 27, 30, 32, 38, 41 becomes
- 2 | 14677
- 3 | 028
- 4 | 1
- Histogram: displays frequency of entries in groups as bars on a chart
- Table: displays raw-ish data
- Ogive: who knows? ETA: a cumulative line graph (cumulative % polygon)
- Polygon:
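- Frequency Distribution (steps for tabulating numerical data)
- Sort Raw Data (ascending order)
- Find Range (top minus bottom)
- Select number of classes (usually between 5 & 15)
- Calculate class interval (Range / number of classes, rounded up to a convenient width)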
- Determine Class Boundaries (bin values)
- Calculate Class Midpoints
- Count observations and assign to classes
- Cumulative frequency: the proportion of values (generally) equal to or below a given interval endpoint. Applicable to numerical data. Later leads to percentiles and quartiles.
- Relative frequency: the proportion of values in a particular interval. Applicable to numerical and categorical data. Can be read as the probability of the event occurring.
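A rough Python sketch of the stem-and-leaf display from earlier in this list, using the same nine numbers. The function name `stem_and_leaf` and its `stem_unit` parameter are made up for illustration; nothing here comes from the course material or PHStat.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_unit=10):
    """Group values by stem (value // stem_unit) and collect the leaves (value % stem_unit).

    Assumes non-negative integers and stem_unit=10, so each leaf is a single digit.
    """
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // stem_unit].append(v % stem_unit)
    # Join each stem's leaves into one string, stems in ascending order.
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(groups.items())}

data = [21, 24, 26, 27, 27, 30, 32, 38, 41]
display = stem_and_leaf(data)
for stem, leaves in display.items():
    print(f"{stem} | {leaves}")
```

Stems with no values are simply left out in this version; a fuller display would usually print them with an empty leaf row.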
Data: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Table:
Class | Frequency | Rel. Freq. | Percentage |
10 but under 20 | 3 | 0.15 | 15 |
20 but under 30 | 6 | 0.30 | 30 |
30 but under 40 | 5 | 0.25 | 25 |
40 but under 50 | 4 | 0.20 | 20 |
50 but under 60 | 2 | 0.10 | 10 |
Total | 20 | 1.00 | 100 |
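The table above can be rebuilt by following the tabulation steps in code. This is only a sketch: the helper name `frequency_table` is invented here, and rounding the raw class interval up to the next multiple of 10 is an assumption chosen so the classes match the table. A cumulative percentage column is added to show where an ogive's points would come from.

```python
import math

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
n_classes = 5  # the table uses five classes

values = sorted(data)                      # sort raw data
data_range = values[-1] - values[0]        # range: top minus bottom
# Assumption: round the raw interval (range / classes) up to a multiple of 10.
width = math.ceil(data_range / n_classes / 10) * 10
lower = (values[0] // width) * width       # first boundary at or below the minimum

def frequency_table(values, lower, width, n_classes):
    """Count observations per class and derive relative and cumulative figures."""
    n = len(values)
    rows, cumulative = [], 0
    for i in range(n_classes):
        lo, hi = lower + i * width, lower + (i + 1) * width
        count = sum(1 for v in values if lo <= v < hi)  # "lo but under hi"
        cumulative += count
        rows.append({
            "class": f"{lo} but under {hi}",
            "midpoint": (lo + hi) / 2,
            "frequency": count,
            "relative": count / n,
            "percentage": 100 * count / n,
            "cumulative_pct": 100 * cumulative / n,
        })
    return rows

table = frequency_table(values, lower, width, n_classes)
for row in table:
    print(f"{row['class']:<16} {row['frequency']:>3} {row['relative']:.2f} "
          f"{row['percentage']:>4.0f} {row['cumulative_pct']:>4.0f}")
```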
Not including histogram or polygon or bar chart or line chart or cumulative frequency or cumulative % polygon or bivariate scatter plot or time series plot or pareto diagram but I want you to know it was weirdly engrossing.
Principles of Graphical Excellence:
- Presents data in a way that provides substance, statistics and design
- Communicates complex ideas with clarity, precision and efficiency
- Gives the largest number of ideas in the most efficient manner
- Almost always involves several dimensions
- Tells the truth about the data
Errors in Presenting Data:
- Using 'chart junk' (lots of visual crap)
- Failing to provide a relative basis for comparing data between groups
- Compressing vertical axis
- Not providing a zero point on the vertical axis
Numerical Descriptive Measures:
Topics:
- Measures of central tendency (mean, mode, median)
- Measures of variation (range, interquartile range, variance and standard deviation, coefficient of variation, z-scores & outliers)
- Shape (symmetric, skewed, box-&-whisker plots, dot-scale diagrams)
- Correlation coefficient
- Central Tendency (average)
- Mean (computational - average - affected by extreme values) Good for normal data (average height of Australian women in 1995 was 1.634m - I'm 1.73)
- Mode (frequency - most common) Good for discrete data (more cars are white)
- Median (positional - the middle score) Good for skewed data or data with outliers (median house price in Queens Park, WA is $369,000)
- Geometric Mean (rate of investment over time was 13.5%)
- Quartiles (split data into four quarters; 25%, 50%, 75%, 100%)
- Interquartile Range (Q3 - Q1) also known as the midspread
- Variation
- Range (maximum minus minimum)
- Variance (shows variation around the mean)
- Standard Deviation (square root of variance)
- Coefficient of Variation = (SD/mean) x 100%
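To make the central tendency measures above concrete, here is a small sketch using Python's standard `statistics` module on the 20-value data set from the Organising Data section. Two assumptions to flag: `statistics.quantiles` uses its default "exclusive" method, which may not match the positioning rule used in the seminar, and the yearly returns in the geometric mean part are hypothetical figures, not the 13.5% example from the notes.

```python
import statistics

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

mean = statistics.mean(data)          # computational: pulled around by extreme values
median = statistics.median(data)      # positional: the middle of the sorted data
modes = statistics.multimode(data)    # most frequent value(s); there can be ties

# Quartile cut points (default "exclusive" method, an assumption here).
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                         # interquartile range

# Geometric mean rate of return. Hypothetical returns: +100% one year, -50% the next.
yearly_returns = [1.00, -0.50]
growth_factors = [1 + r for r in yearly_returns]
geometric_rate = statistics.geometric_mean(growth_factors) - 1
arithmetic_rate = statistics.mean(yearly_returns)

print(f"mean={mean}, median={median}, modes={modes}")
print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
print(f"arithmetic rate={arithmetic_rate:.1%}, geometric rate={geometric_rate:.1%}")
```

The hypothetical investment doubles and then halves, so it ends where it started: the geometric rate comes out at about zero while the arithmetic average suggests a 25% gain, which is why the geometric mean is the one used for rates over time.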
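The variation measures can be sketched the same way with the standard `statistics` module, again on the 20-value data set. Assumptions: `statistics.variance` and `statistics.stdev` are the sample versions (dividing by n - 1), and flagging values with a z-score beyond plus or minus 3 as outliers is one common rule of thumb rather than something stated in these notes.

```python
import statistics

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

mean = statistics.mean(data)
data_range = max(data) - min(data)        # maximum minus minimum
variance = statistics.variance(data)      # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)          # square root of the sample variance
cv = std_dev / mean * 100                 # coefficient of variation, as a percentage

# z-score: how many standard deviations each value sits from the mean.
z_scores = [(x - mean) / std_dev for x in data]
# Assumed rule of thumb: |z| > 3 marks a possible outlier.
outliers = [x for x, z in zip(data, z_scores) if abs(z) > 3]

print(f"range={data_range}, variance={variance:.2f}, SD={std_dev:.2f}, CV={cv:.1f}%")
print(f"possible outliers: {outliers}")
```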