Understanding Variances based on Sample Sizes

General

Every now and then you read something that really furthers your understanding of the world around us. I read this fascinating piece in Howard Wainer’s book, Picturing the Uncertain World. The specific chapter I read was called “The Most Dangerous Equation”, where he discusses De Moivre’s equation. It’s quite a lot to chew on. I tried explaining it to my team using just words, and that didn’t cut it, so I put together a quick graphic visualizing the basics of it. This may not be academically super accurate, but it gets the gist across, so bear with me and I welcome you to follow along :)

Below are 32 hypothetical students’ heights, each represented by one vertical bar. They are grouped by color into individual classrooms A, B, C, D … H making it 8 classrooms in all.

In the first row at the top, the solid green horizontal line shows the average of the heights of all the individual students across all 32 individual measurements. The rightmost section shows the average height and also shows the maximum height and the minimum height for this sample of all students.

In the second part, we first calculate the average height of each classroom separately. For example, instead of looking at each yellow bar separately, we now look only at the single green line across those yellow bars, which represents the average height of that classroom. We do that for each cluster of colors, so now we have only 8 measurements, one average height per classroom. Taking an average of those 8 averages, weighted by classroom size, results in the exact same average height. However, the variance in this sample is much lower: the tallest kid in a class is likely to be balanced out by shorter kids in the same class, so the average heights of classrooms show less variation than the heights of the individual kids.

Also, the average height of a large classroom tends to be closer to the overall mean than the average height of a smaller classroom. Small classrooms produce more outliers, because it’s easy for a single tall student to throw off the average of a small classroom. In a large classroom, a single tall student has less impact on the average height.
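De Moivre’s equation, the one the chapter title refers to, puts a number on this: the spread (standard error) of a sample average is the spread of the individual measurements divided by the square root of the sample size. Here is a minimal Python sketch of that relationship. The 10 cm standard deviation and the class sizes are made-up numbers for illustration, not values from my graphic.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """De Moivre's equation: spread of a sample mean = sigma / sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return sigma / math.sqrt(n)


# Hypothetical spread of individual heights, in cm (an assumption).
SIGMA_CM = 10.0

for class_size in (4, 16, 64):
    # Each 4x increase in class size halves the spread of the class average.
    print(class_size, standard_error(SIGMA_CM, class_size))
```

With these assumed numbers, a classroom of 4 has a spread of 5 cm in its average height, while a classroom of 64 has a spread of only 1.25 cm.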

The third section shows that distribution. Classrooms with the tallest average height tend to be smaller classrooms. Similarly, classrooms with the shortest average height also tend to be the smaller classrooms.

It would be erroneous to just look at the top of the distribution and conclude that smaller classrooms have taller students compared to large classrooms. However, now replace height with grades, and that’s exactly the premise of the “small schools” movement. Without understanding the underlying real-world distribution of data and how sample sizes affect variance, small school lobbying centers around the belief that small schools have better grades. It is true that small schools show up at the top of the rankings, but that is due to statistics and how the data is distributed and measured, not because small schools actually do something different. By the same distribution, the worst-performing schools are also small schools.
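If you want to convince yourself of this, a quick simulation helps. The sketch below is my own illustration, not something from the book: it draws every student’s height from the same distribution (the mean of 150 cm, standard deviation of 10 cm, class sizes of 4 and 64, and the seed are all assumed values), so by construction no classroom is “really” taller than any other.

```python
import random
import statistics


def simulate_class_means(rng, n_classes, class_size, mu=150.0, sigma=10.0):
    """Return the average height of each simulated classroom."""
    means = []
    for _ in range(n_classes):
        heights = [rng.gauss(mu, sigma) for _ in range(class_size)]
        means.append(statistics.fmean(heights))
    return means


rng = random.Random(42)  # fixed seed so the run is repeatable
small = simulate_class_means(rng, n_classes=2000, class_size=4)
large = simulate_class_means(rng, n_classes=2000, class_size=64)

# Rank all classrooms by average height and look at both extremes.
ranked = sorted([(m, "small") for m in small] + [(m, "large") for m in large])
bottom10 = [label for _, label in ranked[:10]]
top10 = [label for _, label in ranked[-10:]]

print("spread of small-class averages:", statistics.stdev(small))
print("spread of large-class averages:", statistics.stdev(large))
print("small classes among the 10 tallest:", top10.count("small"))
print("small classes among the 10 shortest:", bottom10.count("small"))
```

Both the tallest and the shortest classrooms come out dominated by the small ones, even though every student was drawn from the same population.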

Understanding this relationship between sample sizes and the variances observed in them is very important when making sense of data. Yet, the chapter states, there are many examples of large policy decisions that were based on an incorrect understanding of the datasets or on looking at just one side of the distribution.

iPhone vs Foursquare: comparing what they know about me

visualization

One of the biggest technology news stories this week has been the announcement made by Alasdair Allan and Pete Warden, researchers at O’Reilly, that the iPhone keeps a log of every location you have been to over the past year or more. One could argue that it isn’t really news, but it definitely is a rude surprise to most people. More so because the researchers also made a tool that makes it super easy for anyone to parse the contents of the file their own iPhone has been keeping on them.

Though I agree that saving an indefinite history of sensitive location data without explicit user notification is a terrible oversight at the least, I was also tempted to see what my own data held. So I went ahead and here’s what it looks like.


My iPhone faithfully recorded my road trip halfway across the country, my SXSW visit to Austin, Bay Area and LA trips, and also my trip to Michigan and Ohio. I think it makes a very interesting sharing object at this level of zoom, especially because I have been voluntarily giving that data to Foursquare anyway. The Foursquare data is a lot sparser than the iPhone data, but it has more explicit knowledge of the exact business/venue I went to, whereas the iPhone data can only be used to make a reasonable guess. Overall, however, the data that the iPhone has been accumulating is obviously more exhaustive.

I am curious to run more detailed analysis on my own data, and possibly compare it with other people I know and other data sources I have to see what interesting stuff I can find. For example, it would be cool to see how much time my wife and I spend with each other and how it correlates to how many steps I took that day, what I ate, or what music I listened to.

Are we really as unique and different as we like to believe or are we just predictable dots on the map? At a higher aggregate level, data from cellphone carriers has already been used to find that we actually are quite predictable!


Why I love my Saturn Ion: A visual graph of my MPG

General

We have a 2004 Saturn Ion. Check out the awesome MPG we get on it:

image

This is real data, from actual data points my wife and I have painstakingly recorded each time we fill up. The spikes in the chart above correspond to road trips. You can tell by the nature of the spikes that we have not taken more than one-tank road trips lately :(
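If you keep a fill-up log yourself, the per-tank MPG is easy to work out by hand. Here is a rough Python sketch of one way to do it. It assumes you fill the tank completely every time and note the odometer reading; the `FillUp` record and the numbers in the example log are made up for illustration, and this is not necessarily how mymilemarker.com computes it.

```python
from dataclasses import dataclass


@dataclass
class FillUp:
    odometer_miles: float  # odometer reading at the pump
    gallons: float         # gallons added to top off the tank


def mpg_per_tank(fillups):
    """MPG per tank: miles driven since the previous fill-up / gallons added now."""
    result = []
    for prev, cur in zip(fillups, fillups[1:]):
        miles = cur.odometer_miles - prev.odometer_miles
        result.append(miles / cur.gallons)
    return result


# Made-up log, not our actual Saturn Ion data.
log = [
    FillUp(odometer_miles=50000, gallons=10.0),
    FillUp(odometer_miles=50300, gallons=10.0),
    FillUp(odometer_miles=50700, gallons=10.0),
]
print(mpg_per_tank(log))  # [30.0, 40.0]
```

The first fill-up only sets the baseline, so a log of three fill-ups gives two MPG values.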

Predictably, the MPG is lower in the winter months, when you have more stops and slow-downs.

Corresponding to the above, this is how much we have been using our car:

image

If you want to record and view data about your car in a fun way like this, head over to http://www.mymilemarker.com