Flowing Data

This graphic appeared in the latest newsletter of the local utility company and shows the (estimated) water consumption during the quarter-final of the soccer world cup between Germany and Argentina.

I think there is no need to translate the annotations, and everybody can guess what the beer-drinking people of Germany did during the break.

Just wondering how a different outcome would have impacted the flow of water …

Is Data the new Plague?

This post is neither fish nor fowl. My review of Kaiser’s book is way overdue, as I got stuck somewhere in the middle of it. In the meantime, Georgios pointed me to this video of David McCandless on TED, as we recently talked about people’s fears and the share the media has in them (see the video at about 3:30).

But what really struck me was the metaphor “data is the new oil” (by now used over and over again; at 5:19), which David modified to “data is the new soil”. You might ask what this has to do with Kaiser’s book. Here is the answer:

Whereas most of the examples in Kaiser’s book illustrate nicely how data analysis can be used to make people’s lives better (reduce congestion on highways, minimize waiting times at Disneyland, or fight an E. coli outbreak), i.e. data is collected and analyzed to serve people’s needs, most of the data collection that spurred the “data is the new oil” metaphor is analyzed only for some company’s profit.

So far, where is the problem? If online retailers (with whom I have a business relationship) analyze my buying behavior, this is OK – the same thing would happen in a conventional (physical) store, with a salesperson who remembers my preferences and taste. But data collection at Google, Facebook, Twitter and others is different. Without actually “doing business” with these companies, people expose more and more of their privacy and personality – mostly without having a clue what is going on.

This example is taken from David’s video (6:22), showing posts on Facebook that mention a “break up”:

I am sure most of those who posted did not mean to contribute to this kind of statistics.

I will stop before I get too philosophical – but one thing is for sure: there are tons of useful data out there, and much of it can be used (and still needs to be analyzed) to make our lives better, i.e. improve our health, save time for our families, save resources for our kids, … All the other data, mostly generated in so-called “social networks”, seems to stick around like the plague – hard to get rid of and highly infectious; at the very least, it can hardly serve any of us.

PS: Reading the post, you might guess that I recommend Kaiser’s book …

The Good & the Bad [9/2010]

This is quite an unusual Good & Bad posting, as it does not refer to some extraordinarily bad graph, but just wants to show some additional aspects of a dataset, compared to the original visualization found on Kaiser’s Junk Charts.

The comments on Kaiser’s post mainly picked on the variability of the ranks, so I set off to the US Bureau of Labor Statistics to get the (raw, though seasonally adjusted) yearly data. Here is what it looks like for the last 10 years, with the US total rate highlighted:
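If you want to reproduce this kind of view outside of Mondrian, here is a minimal sketch in Python. It assumes the BLS series have been exported to a hypothetical file unemployment.csv with a Year column, one column per state, and a US column for the national rate; the file name and layout are my assumptions, not the original workflow.

```python
# Minimal sketch (not the original Mondrian workflow): plot the yearly state
# unemployment rates as grey lines and highlight the US total rate in red.
# Assumes a hypothetical file "unemployment.csv" with a "Year" column,
# one column per state, and a "US" column for the national rate.
import pandas as pd
import matplotlib.pyplot as plt

rates = pd.read_csv("unemployment.csv", index_col="Year")

fig, ax = plt.subplots(figsize=(8, 5))
for state in rates.columns.drop("US"):
    ax.plot(rates.index, rates[state], color="grey", alpha=0.4, linewidth=1)
ax.plot(rates.index, rates["US"], color="red", linewidth=2, label="US total")

ax.set_xlabel("Year")
ax.set_ylabel("Unemployment rate (%)")
ax.legend()
plt.show()
```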

As we can see from Kaiser’s post, the ranks end up as a big zig-zag, so I leave this graph out for a while. To make the variability and range easier to judge, here is the corresponding plot based on boxplots.

The difference between the medians and the US average hints that in years with higher unemployment, larger states seem to be hit more severely.

But what does the data look like when we use the US average as the reference? The following figure centers all data around the US average, retaining the same scale but using different shifts:
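Again as a rough Python sketch, based on the same hypothetical unemployment.csv as above: the centering simply subtracts each year’s US rate from all state values of that year, so every line becomes a deviation from the national average.

```python
# Sketch of the "centered" view: for each year, subtract the US rate so that
# every state is shown as its deviation from the national average
# (same scale, different shifts). Uses the hypothetical "unemployment.csv".
import pandas as pd
import matplotlib.pyplot as plt

rates = pd.read_csv("unemployment.csv", index_col="Year")
centered = rates.drop(columns="US").sub(rates["US"], axis=0)

fig, ax = plt.subplots(figsize=(8, 5))
for state in centered.columns:
    ax.plot(centered.index, centered[state], color="grey", alpha=0.4)
ax.axhline(0, color="red", linewidth=2, label="US average")

ax.set_xlabel("Year")
ax.set_ylabel("Deviation from US rate (percentage points)")
ax.legend()
plt.show()
```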

Now we can see more clearly that there are “winners” and losers within the evolving crisis starting in 2008. I highlighted three states that somewhat stick out from the rest. Alaska does not seem to do very well in the first years, but also does not seem to be hit much by the crisis. Nevada improved until 2004 to a top-10 state, but fell behind starting in 2005 and was hit most severely by the crisis. Finally, Michigan worsened steadily, with the unfolding crisis not really making things even worse:

With only three states of interest left, ranks seem to be the ideal view to show the ups and downs of the unemployment rate:
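A rank view is easy to derive from the same hypothetical table: rank the states within each year and plot only the three highlighted states (the state column names below are assumptions for illustration).

```python
# Sketch of the rank view: rank the states within each year (1 = lowest
# unemployment rate) and plot only the three highlighted states.
# Based on the hypothetical "unemployment.csv" used above.
import pandas as pd
import matplotlib.pyplot as plt

rates = pd.read_csv("unemployment.csv", index_col="Year").drop(columns="US")
ranks = rates.rank(axis=1, method="min")   # rank within each year (row)

fig, ax = plt.subplots(figsize=(8, 5))
for state in ["Alaska", "Nevada", "Michigan"]:
    ax.plot(ranks.index, ranks[state], marker="o", label=state)

ax.invert_yaxis()                          # rank 1 (best) at the top
ax.set_xlabel("Year")
ax.set_ylabel("Rank of unemployment rate")
ax.legend()
plt.show()
```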

The post is already way too long, so I leave you with the data (incl. map) and the software to play around on your own and find some more interesting facts …

(Note: The data is not identical to the data Kaiser used, so there are small differences in the plots. The currently released version of Mondrian does not yet show the colored lines emphasized so nicely … stay tuned.)

Let’s do it in Parallel

Parallel coordinate displays have been popular – especially in InfoVis – for quite a while. Now we have the ultimate reference with Al Inselberg‘s book, (not surprisingly) called “Parallel Coordinates“.

Al Inselberg giving a talk at the DataVis workshop in Berlin 2006

Most of the book actually looks at geometric properties of parallel coordinates, and thus tends to overtax my mathematical education. The most interesting part of the book (from my point of view, which is always biased towards data analysis applications) is Chapter 10: “Data Mining and Other Applications”. One soon gets the idea that parallel coordinate views need interactive tools and an interactive working style, and thus many graphics of real-world data sets fill this chapter. To get a good idea of what is crucial about parallel coordinates, let me point to the discussion of this older post on Andrew’s blog.
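A static plot cannot convey the interactive working style the chapter calls for, but if you have never seen parallel coordinates at all, a minimal sketch with pandas’ built-in helper on a tiny made-up data set shows the basic idea: each case is a polyline crossing one vertical axis per variable.

```python
# Minimal, static parallel coordinates sketch using pandas' plotting helper.
# The data below is made up; note that this helper draws all variables on a
# common y-scale, so real data usually needs to be standardized first.
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.DataFrame({
    "mpg": [21.0, 22.8, 18.7, 14.3, 32.4],
    "hp":  [110, 93, 175, 245, 66],
    "wt":  [2.62, 2.32, 3.44, 3.57, 2.20],
    "cyl": ["6", "4", "8", "8", "4"],      # class column used for coloring
})

parallel_coordinates(df, class_column="cyl", colormap="viridis")
plt.ylabel("value (unscaled)")
plt.show()
```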

So if you still can’t figure out what these funny plots mean, go and get the book!

The only point I don’t like too much about the book is the “useless” CD that comes with it, which has some sample data but misses the real-world examples discussed in Chapter 10. Nowadays everybody would expect this data to come via a webpage.

The ultimate question, though, will not be answered by the book: who invented parallel coordinates? This mystery remains open, and only real insiders will know the answer ;-).

Why do we go to Conferences?

Andrew pointed on his blog to a post by Panos Ipeirotis, who asked why we do not use peer review for conference talks in the same way we are used to it for journal papers.

His idea (which does not come up for the first time here, and this year’s InfoVis worked pretty much this way) is to improve the overall quality of presentations, as we have all sat through boring or technically disastrous talks which we would have liked to see improved.

As you can see from the image above, taken at the joint stat. computing and stat. graphics mixer at the 2009 JSM, I see a very important aspect of conferences in the informal meetings around the talks. Here is my comment on Andrew’s blog:

This idea might be interesting, but I think it totally misses the idea of oral presentations at conferences.

Conferences are for meeting people and exchanging ideas – that is what brings research forward. Having a reviewing process will destroy most of it.

What about being provocative and spontaneous? The reviewing would destroy all of this spice.

What is the point of a conference that essentially gives us journal papers read aloud?

Mac vs PC Reloaded – really?

It is the silly season, a.k.a. the “Sommerloch” or the “morte-saison”, so there is time for this post. We all know the legendary “Mac vs. PC” spots, which Apple aired between 2006 and 2009. The underlying idea was the smart newcomer Mac attacking the established PC, which acts bold but not always very smart.

Mac vs. PC

So far, everything fits the pattern. If you are in the weaker market position, with an apparently smarter product, you need to attack.

Well, now the market leader Microsoft, still outselling 10:1, seems to strike back – although it is hard to understand why, given that the 10:1 share for Microsoft still holds. Microsoft’s campaign seems quite odd. Among other surprising things, MS claims that “Macs can take time to learn” and qualifies the claim with “Things just don’t work the same way on Macs if you’re used to a PC”.

And now we get to the point, which I notice frequently. Developers of ill-designed software (and I am really only referring to the user interface here) have managed to completely screw up users’ expectations of how things should work. Computer users who have been exposed to quirky interfaces for years (if not generations) do not expect the obvious any more. Trained to look for the workaround in the first place, they seem unable to expect the straightforward solution.

Given this situation, MS’s strange claim seems to sell: “Things just don’t work the same way on Macs if you’re used to a PC”. But it does not say at all that PCs do a good job of helping us solve our problems – no, they only meet people’s degraded expectations … sad.

The Good & the Bad [08/2010]

The last regular issue of “The Good & the Bad” dates back to [11/2006], so it is more than time for a new post.

I found this flowchart on Kaiser’s Junk Charts.


The graph was originally posted on the Internet Monk‘s blog – the data comes from a study, which can be found here. No data for this migration matrix was posted on either of the blogs, so I reverse-engineered the graphic (pixel by pixel) and created the data table.

Although the migration matrix has only 36 potential values to depict (some of which do not even occur), the flowchart is already tremendously cluttered. The general question of which group loses most is hard to see, and the many small migration paths obscure the graph’s message to some extent.

Here is my suggestion, which uses barcharts for the source and destination distributions and a fluctuation diagram for the migrations.
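For readers who want to try a fluctuation diagram themselves, here is a rough Python sketch. The counts below are made-up placeholders, not the reverse-engineered migration matrix; the idea is simply that each cell of the source-by-destination grid gets a square whose area is proportional to the number of people moving between the two groups.

```python
# Sketch of a fluctuation diagram: a grid of squares whose areas are
# proportional to the cell counts of a migration matrix.
# The numbers below are purely illustrative placeholders.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

groups = ["Catholic", "Evangelical", "Mainline Prot.",
          "Black Prot.", "Other", "None"]
counts = np.array([            # rows = source group, columns = destination
    [55,  5,  3,  0,  2, 10],
    [ 4, 50,  6,  1,  2,  8],
    [ 3,  7, 45,  0,  2,  9],
    [ 0,  1,  0, 20,  0,  3],
    [ 1,  2,  1,  0, 15,  4],
    [ 2,  3,  2,  0,  1, 25],
])

fig, ax = plt.subplots(figsize=(6, 6))
max_count = counts.max()
for i in range(len(groups)):               # source (row)
    for j in range(len(groups)):           # destination (column)
        side = np.sqrt(counts[i, j] / max_count)   # area proportional to count
        ax.add_patch(Rectangle((j + 0.5 - side / 2, i + 0.5 - side / 2),
                               side, side, color="steelblue"))

ax.set_xticks(np.arange(len(groups)) + 0.5)
ax.set_xticklabels(groups, rotation=45, ha="right")
ax.set_yticks(np.arange(len(groups)) + 0.5)
ax.set_yticklabels(groups)
ax.set_xlim(0, len(groups))
ax.set_ylim(len(groups), 0)                # first source row at the top
ax.set_xlabel("Destination")
ax.set_ylabel("Source")
ax.set_aspect("equal")
plt.tight_layout()
plt.show()
```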

What can we read from the graphs? There is a general movement towards “None”, which is by far the biggest receiving group. Both “Catholics” and “Evangelicals” lose substantially, but “Evangelicals” at least gain somewhat from other groups. “Evangelicals” and “Mainline Protestants” seem to have the biggest two-way exchange. “Black Protestants” only lose to “None”, which might also just be a data error.

The Seven Deadly Sins of Conducting a Survey Study

I stumbled upon this “survey analysis” on an Apple-related list, called “iPad Opinion Profile – iPad Personality Clash: Elites vs. Geeks”. The brief summary of this survey suggests that iPad owners are “Selfish Elites” and those who oppose the iPad are “Independent Geeks”.

It takes a bit to get an idea of what these guys did, but here is my list of the seven deadly sins of conducting a survey:

  1. don’t care about being representative, just ask some guys on, say, Facebook
    (the survey actually started before the iPad was released …)
  2. normalize the data – somehow
    (it says: “The survey sample was normalized to match the gender, age and personality distribution of 13-49 year olds living in the United States”), good luck!
  3. only pick a tiny fraction of the data to make things more interesting
    (the “study” only looks at 9% of the survey data and mixes owners of an iPad with those who intend to buy one …)
  4. ask a lot of unrelated questions
    (the question for “The Biggest Sin” is really something that haunts us, especially when thinking about touch devices)
  5. never ever show the questions you actually asked
    (no sample questionnaire is supplied)
  6. don’t mention the absolute numbers behind the results, only use ratios, which really go haywire for small numbers (see the sketch after this list); compare to the undefined “average person”
    (no quantities at all, only the overall size of 20,000)
  7. only pick the findings that make a good headline and match your insinuation – never point to contradicting results
    (according to the study, iPad ownership is strongest among families with many children, yet owners are described as being “not very kind or altruistic” – great)
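To make sin no. 6 concrete, here is a tiny, made-up calculation (none of these numbers come from the actual study): with a small subgroup, an index of “x times the average person” swings wildly when a single respondent changes category.

```python
# Made-up illustration of sin no. 6: ratios ("index vs. the average person")
# computed from tiny subgroups are extremely unstable.
# None of these numbers come from the actual study.
population_rate = 0.10        # assume 10% of all respondents show some trait
subgroup_size = 30            # a tiny subgroup of a 20,000-person survey

for hits in (2, 3, 4):        # one respondent more or less ...
    subgroup_rate = hits / subgroup_size
    index = subgroup_rate / population_rate * 100    # 100 = "average person"
    print(f"{hits} of {subgroup_size} respondents -> index {index:.0f}")

# 2 of 30 respondents -> index 67
# 3 of 30 respondents -> index 100
# 4 of 30 respondents -> index 133
```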

Although I am neither an iPad owner nor intend to buy one anytime soon (though donations are welcome ;-)), my verdict from the results of the study is that iPad critics are “low stimulated, introverted, reserved, insecure, neurotic young males, tending to be aggressive and lazy, mainly found in Hawaii and Alaska”.

– sometimes I feel guilty being a statistician

Surprise Me!

I was a bit puzzled when I read the lines in Robert’s hint to the InfoVis workshop called “Telling Stories with Data“, saying: “If you haven’t watched the Hans Rosling video yet, you probably haven’t realized that visualization isn’t just there for data analysis, it’s also a great tool for telling stories.”

This is exactly what I mentioned earlier in an older post:

A good visualization should tell us a story about the data you didn’t know before and not the other way round, i.e., once you know the story, you create a visualization around it.

Here is a nice illustration of how this usually works:

Antony and Alan talking about some visualization at the 2002 JSM in NYC

If the result of your graphical analysis is not something you can put into a story, you probably didn’t really succeed with your analysis. Of course, we are not equally gifted in telling stories …

Tour de France 2010 Statistics Art Gallery

Sometimes the title may promise more than the post can hold … but I still try my best. As you might know, there is the usual visualization of the stage times, total times and ranks of all riders in the regular post to start with.

As we have more data on the Tour and its riders, it is fun to look at that data as well:

Let’s first look at the different types of riders and how they performed:

Total Time by Type of Rider

Note that smaller numbers in the boxplots of the total time by type of rider correspond to shorter times and better performance. Obviously the classification is quite accurate. The ordering of the types is not surprising, and given the many hard-core mountain stages, climbers are definitely in for a good overall performance.
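If you want to rebuild this view outside of Mondrian, here is a rough Python sketch. It assumes a hypothetical file tour2010.csv with a Type column (the rider classification) and a TotalTime column holding the overall time in seconds; the file and column names are my assumptions.

```python
# Sketch of "total time by type of rider": one boxplot per rider type,
# ordered by median total time (smaller = better performance).
# Assumes a hypothetical "tour2010.csv" with "Type" and "TotalTime" columns.
import pandas as pd
import matplotlib.pyplot as plt

riders = pd.read_csv("tour2010.csv")

order = riders.groupby("Type")["TotalTime"].median().sort_values().index
data = [riders.loc[riders["Type"] == t, "TotalTime"] for t in order]

fig, ax = plt.subplots(figsize=(7, 4))
ax.boxplot(data, labels=list(order))
ax.set_xlabel("Type of rider")
ax.set_ylabel("Total time (s)")
plt.show()
```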

Type of Rider vs Year of Birth (Age)

Interestingly, “Leader and Top Rider” are the oldest on average, and less surprisingly you seem to start your career as a “Helper”, try your luck as a “Sprinter” and at some point probably get to be more than (only) a specialist.

Team by Type

We only highlighted “Leader and Top Rider” and “Helper” here, and sorted by the number of top riders. Team RadioShack really was different, although they won the team classification by only 9′.

Age by Team

Although Team RadioShack is not the oldest on average (actually, by median), if you look at the Top Riders in the team (highlighted in red), you definitely see that they will need some “fresh blood” in the next years – no, I am not referring to doping here ;-).

As we already looked at age, some more physical attributes might be interesting:

Result vs Age

We actually look at year of birth (which is a bit more time-invariant than age). The best age for the Tour de France (apart from the older professionals who have “survived” over many years) seems to be around 32, i.e., born in 1978.

There is obviously a correlation between the height and weight of the riders, which applies to all of us, so we rather look at the BMI:

Result vs. BMI

Although the variance gets quite big at the ends of the data range, we see that it is not good to enjoy the good French food and red wine during the Tour too much.
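For completeness, a quick sketch of how the BMI view could be computed, assuming the hypothetical tour2010.csv from above also contains Height (in m), Weight (in kg) and Rank columns; BMI is simply weight divided by height squared.

```python
# Sketch of "result vs. BMI": compute BMI = weight / height^2 and plot it
# against the final rank. Assumes the hypothetical "tour2010.csv" also has
# "Height" (m), "Weight" (kg) and "Rank" columns.
import pandas as pd
import matplotlib.pyplot as plt

riders = pd.read_csv("tour2010.csv")
riders["BMI"] = riders["Weight"] / riders["Height"] ** 2

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(riders["BMI"], riders["Rank"], alpha=0.6)
ax.invert_yaxis()                     # rank 1 (best result) at the top
ax.set_xlabel("BMI (kg/m^2)")
ax.set_ylabel("Final rank")
plt.show()
```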

Well, the post is already way too long, though there is more to explore here. I can only encourage you to grab the data and play around yourself using Mondrian or some other visualization package – it’s fun.

Blinded by Animation?

Stopping by at http://www.gapminder.org, you will easily get to the “default” example, which shows the scatterplot of life expectancy vs. income per person animated through the years 1800 to 2009.

You really have to look carefully to spot the problem with Russia in 1933. How do we explain a spontaneous drop in life expectancy from 33 years to only 12 years? This is obviously an error, which was not fixed before the data was released.

It becomes quite clear when you look at the time series itself, which you get when you select time as the x-axis:
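The same check is easy to do outside of Gapminder’s interface. Here is a minimal Python sketch, assuming the life-expectancy table has been downloaded to a hypothetical life_expectancy.csv with one row per country and one column per year (Gapminder offers its data for download, but the exact file name and layout here are assumptions).

```python
# Sketch: plot the life-expectancy time series for a single country to spot
# implausible values such as the 1933 drop. Assumes a hypothetical
# "life_expectancy.csv" with a "country" column and one column per year.
import pandas as pd
import matplotlib.pyplot as plt

lex = pd.read_csv("life_expectancy.csv", index_col="country")

russia = lex.loc["Russia"].astype(float)
russia.index = russia.index.astype(int)   # remaining column names are years

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(russia.index, russia.values)
ax.set_xlabel("Year")
ax.set_ylabel("Life expectancy (years)")
ax.set_title("Russia")
plt.show()
```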

Apart from the drop in 1933, we also find strange data for the time of WWII. The war apparently has no effect on life expectancy at all. Hard to believe – but this data might just have been “optimized” by the political regime.

Once again time to remember Peter J. Huber’s words: “Never underestimate the rawness of raw data!”

Why do we do it – ’cause we can!

I was pointed to this nice video of Robert Kosara’s work by Hadley via Antony.

Emerging technologies – and multi-touch must be counted as such – offer new possibilities for interacting with graphics. Robert’s implementation is certainly clean and straightforward, but it still raises the question of whether these operations are really things we need during a data analysis.

What I always found very distracting when selecting data dynamically was the amount of coordination necessary for the selection, which ultimately drew attention away from the highlighting triggered by the selection. Often enough, this highlighting was most interesting in a different plot, and thus hard to watch while trying to get the dynamic selection right.

I wonder how much this is the case for Robert’s prototype, but I am afraid I can’t tell until I get my hands on the software and a new MacBook Pro.

The final question for me, though, is whether it will help people get their data analysis jobs done more easily or not.