Friday, May 28th, 2010 02:18 pm
Some discussion about 'house' stuff - downloading phstat was problematic for some; will bring a copy next seminar for people to take away. Asked if small dead tree could be placed on webct, possible progress there.

Back to Sampling, and evaluating the worthiness of surveys. Ask: what is the purpose of the survey? Is the survey based on a probability sample?
  • Coverage error - pick an appropriate frame
  • Non-Response error - follow up
  • Measurement error - ask good questions
  • Sampling error - will always exist
Then we did a vastly simplified version of the MBTI with Simpson characters, I am sorry to inform you I am Krusty the Clown and we make up about 2.9% of the population (ENTJ).

Organising Data
  • Stem-and-leaf Display: groups numbers according to 10 or 100 ranking
    • e.g. 21, 24, 26, 27, 27, 30, 32, 38, 41 becomes
      • 2 14677
      • 3 028
      • 4 1
  • Histogram: displays frequency of entries in groups as bars on a chart
  • Table: displays raw-ish data
  • Ogive: who knows? ETA cumulative line graph
  • Polygon: line graph joining the frequency plotted at each class midpoint
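For anyone who wants to check a stem-and-leaf display by machine, here is a minimal Python sketch. It assumes non-negative whole numbers grouped by tens (the tens digit is the stem, the units digit is the leaf), and the function name `stem_and_leaf` is made up for this example. It is not how phstat does it.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return one text row per stem, assuming non-negative integers grouped by tens."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)  # stem = tens, leaf = units
    # Stems with no values are simply skipped in this sketch.
    return [f"{stem} | {''.join(str(leaf) for leaf in leaves)}"
            for stem, leaves in sorted(rows.items())]

data = [21, 24, 26, 27, 27, 30, 32, 38, 41]
display = stem_and_leaf(data)
for row in display:
    print(row)
# 2 | 14677
# 3 | 028
# 4 | 1
```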
Tables: (If it's any consolation I suspect most people who've ever made a chart, ever, do this automatically without thinking about it. The below is like a very technical description of how to put your pants on.)
  • Sort Raw Data (ascending order)
  • Find Range (top minus bottom)
  • Select number of classes (usually between 5 & 15)
  • Calculate class interval (Range/No. classes)
  • Determine Class Boundaries (bin values)
  • Calculate Class Midpoints
  • Count observations and assign to classes
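The pants-putting-on steps above can be sketched in Python, using the 20 values from the number-crunching example later in this post. Choosing 5 classes, rounding the interval up to the next multiple of 5, and starting the first class at a multiple of the interval are all assumptions made for this sketch, picked so the classes come out as 10 but under 20, 20 but under 30, and so on.

```python
import math

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

ordered = sorted(data)                        # 1. sort raw data
data_range = ordered[-1] - ordered[0]         # 2. range: 58 - 12 = 46
num_classes = 5                               # 3. assumed choice (usually 5 to 15)
raw_width = data_range / num_classes          # 4. class interval before rounding
width = math.ceil(raw_width / 5) * 5          #    assumption: round up to a multiple of 5
start = ordered[0] // width * width           #    assumption: start at a multiple of the width
boundaries = [start + i * width for i in range(num_classes + 1)]      # 5. bin values
classes = list(zip(boundaries, boundaries[1:]))
midpoints = [(lo + hi) / 2 for lo, hi in classes]                     # 6. class midpoints
counts = [sum(lo <= x < hi for x in ordered) for lo, hi in classes]   # 7. "lo but under hi"

print(boundaries)  # [10, 20, 30, 40, 50, 60]
print(midpoints)   # [15.0, 25.0, 35.0, 45.0, 55.0]
print(counts)      # [3, 6, 5, 4, 2]
```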
Types of Frequencies:
  • Cumulative: the proportion of values (generally) equal to or below a given interval endpoint. Applicable to numerical data. Later leads to percentiles and quartiles.
  • Relative: the proportion of values in a particular interval. Applicable to numerical and categorical data. Probability of event occurring.
e.g. Some number crunching.

Data: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Table:
Frequency, relative frequency and percentage distributions
Class               Frequency   Rel. Freq.   Percentage
10 but under 20         3          0.15          15
20 but under 30         6          0.30          30
30 but under 40         5          0.25          25
40 but under 50         4          0.20          20
50 but under 60         2          0.10          10
Total                  20          1.00         100
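A rough Python sketch that reproduces the table from the data above and adds the cumulative columns that the table leaves out. The class width of 10 matches the table. The variable names are just for this example, and classes with no observations would be skipped by this approach (there are none here).

```python
from collections import Counter
from itertools import accumulate

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
width = 10
n = len(data)

# Key each value by the lower limit of its class, e.g. 27 -> 20.
freq = Counter(x // width * width for x in data)
lowers = sorted(freq)
counts = [freq[lo] for lo in lowers]
rel = [c / n for c in counts]            # relative frequency
cum = list(accumulate(counts))           # cumulative frequency
cum_pct = [100 * c / n for c in cum]     # cumulative percentage

print(f"{'Class':<18}{'Freq':>5}{'Rel.':>8}{'%':>6}{'Cum. %':>8}")
for lo, c, r, cp in zip(lowers, counts, rel, cum_pct):
    label = f"{lo} but under {lo + width}"
    print(f"{label:<18}{c:>5}{r:>8.2f}{100 * r:>6.0f}{cp:>8.0f}")
```

Plotting the cumulative percentages against the upper class boundaries gives the ogive.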

Not including histogram or polygon or bar chart or line chart or cumulative frequency or cumulative % polygon or bivariate scatter plot or time series plot or pareto diagram but I want you to know it was weirdly engrossing.

Principles of Graphical Excellence:
  • Presents data in a way that provides substance, statistics and design
  • Communicates complex ideas with clarity, precision and efficiency
  • Gives the largest number of ideas in the most efficient manner
  • Almost always involves several dimensions
  • Tells the truth about the data
Errors in Presenting Data:
  • Using 'chart junk' (lots of visual crap)
  • Failing to provide a relative basis for comparing data between groups
  • Compressing vertical axis
  • Not providing a zero point on the vertical axis
Cue a tutorial on using phstat and a fantastic moment of revelation explaining what box plot diagrams are for (comparing two sets of data at a high level)

Numerical Descriptive Measures:

Topics:
  • Measures of central tendency (mean, mode, median)
  • Measures of variation (range, interquartile range, variance and standard deviation, coefficient of variation, z-scores & outliers)
  • Shape (symmetric, skewed, box-&-whisker plots, dot-scale diagrams)
  • Correlation coefficient
Summary Measures:
  • Central Tendency (average)
    • Mean (computational - average - affected by extreme values) Good for normal data (average height of Australian women in 1995 was 1.634m - I'm 1.73)
    • Mode (frequency - most common) Good for discrete data (more cars are white)
    • Median (positional - the middle score) Good for skewed data or data with outliers (median house price in Queens Park, WA is $369,000)
    • Geometric Mean (rate of investment over time was 13.5%)
  • Quartiles (split data into four quarters; 25%, 50%, 75%, 100%)
    • Interquartile Range (Q3 - Q1) also known as middle spread
  • Variation
    • Range (maximum minus minimum)
    • Variance (shows variation around the mean)
    • Standard Deviation (square root of variance)
    • Coefficient of Variation = (SD/mean) x 100%
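Most of the summary measures above are available in Python's standard `statistics` module (version 3.8 or later for `multimode` and `quantiles`). A minimal sketch on the 20 values from the frequency table example. It treats the data as a sample, so variance divides by n - 1, which is an assumption on my part, and the quartiles use the module's default "exclusive" method, which may differ slightly from what other packages report.

```python
import statistics

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

# Central tendency
mean = statistics.mean(data)           # 32.4
median = statistics.median(data)       # 31.0
modes = statistics.multimode(data)     # [24, 27], two values tie for most common

# Quartiles and interquartile range (default "exclusive" method)
q1, q2, q3 = statistics.quantiles(data, n=4)   # 24.0, 31.0, 42.5
iqr = q3 - q1                                  # 18.5

# Variation (sample versions, dividing by n - 1)
data_range = max(data) - min(data)     # 46
variance = statistics.variance(data)
sd = statistics.stdev(data)            # square root of the variance
cv = sd / mean * 100                   # coefficient of variation, in percent

print(f"mean={mean} median={median} modes={modes}")
print(f"Q1={q1} Q3={q3} IQR={iqr} range={data_range}")
print(f"variance={variance:.2f} sd={sd:.2f} cv={cv:.1f}%")
```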
...and then we ran away
Wednesday, August 25th, 2010 03:12 am (UTC)
something I've been meaning to do for a while - point you at http://junkcharts.typepad.com/junk_charts/. It seemed appropriate to put it on this post (which I had in my 'to respond to' list of things to do....). Tis a geeky blog for particular geeky types.