Honourable Mention
ACM SIGCHI Conference on Human Factors in Computing Systems
2017
Datasets which are identical over a number of statistical properties, yet produce dissimilar graphs, are frequently used to illustrate the importance of graphical representations when exploring data. This paper presents a novel method for generating such datasets, along with several examples. Our technique differs from previous approaches in that new datasets are iteratively generated from a seed dataset through random perturbations of individual data points, and can be directed towards a desired outcome through a simulated annealing optimization strategy. Our method has the benefit of being agnostic to the particular statistical properties that are to remain constant between the datasets, and allows for control over the graphical appearance of the resulting output.
It can be difficult to demonstrate the importance of data visualization. Some people are of the impression that charts are simply "pretty pictures", and that all of the important information can be divined through statistical analysis. An effective (and often used) tool for demonstrating that visualizing your data is in fact important is Anscombe's Quartet. Developed by F.J. Anscombe in 1973, Anscombe's Quartet is a set of four datasets, where each produces the same summary statistics (mean, standard deviation, and correlation), which could lead one to believe the datasets are quite similar. However, after visualizing (plotting) the data, it becomes clear that the datasets are markedly different. The effectiveness of Anscombe's Quartet is not due to simply having four different datasets which generate the same statistical properties; it is that four clearly different and visually distinct datasets produce the same statistical properties. In contrast, the "Unstructured Quartet" on the right in Figure 1 also shares the same statistical properties as Anscombe's Quartet; however, without any obvious underlying structure to the individual datasets, this quartet is not nearly as effective at demonstrating the importance of visualizing your data.
While Anscombe's Quartet is very popular and effective for illustrating the importance of visualizing your data, it has been around for nearly 45 years, and it is not known how Anscombe came up with his datasets. So, we developed a technique to create these types of datasets: those which are identical over a range of statistical properties, yet produce dissimilar graphics.
Recently, Alberto Cairo created the Datasaurus dataset, which urges people to "never trust summary statistics alone; always visualize your data": while the data exhibits normal-seeming statistics, plotting it reveals a picture of a dinosaur. Inspired by Anscombe's Quartet and the Datasaurus, we present The Datasaurus Dozen (download .csv):
These 13 datasets (the Datasaurus, plus 12 others) each have the same summary statistics (x/y mean, x/y standard deviation, and Pearson's correlation) to two decimal places, while being drastically different in appearance. This work describes the technique we developed to create this dataset, and others like it.
The key insight behind our approach is that while it is relatively difficult to generate a dataset from scratch with particular statistical properties, it is relatively easy to take an existing dataset, modify it slightly, and maintain those statistical properties. We do this by choosing a point at random, moving it a little bit, then checking that the statistical properties of the set haven't strayed outside the acceptable bounds (in this particular case, we ensure that the means, standard deviations, and correlations remain the same to two decimal places).
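The perturb-and-check step above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual code: the function names and the Gaussian step size are our own assumptions, and NumPy is used for the statistics.

```python
import numpy as np

def same_stats(x1, y1, x2, y2, decimals=2):
    """Check that two datasets share x/y means, x/y standard deviations,
    and Pearson correlation when rounded to `decimals` places."""
    stats = lambda x, y: np.round(
        [x.mean(), y.mean(), x.std(), y.std(), np.corrcoef(x, y)[0, 1]],
        decimals)
    return np.array_equal(stats(x1, y1), stats(x2, y2))

def perturb(x, y, rng, scale=0.1):
    """Move one randomly chosen point by a small Gaussian step."""
    i = rng.integers(len(x))
    x2, y2 = x.copy(), y.copy()
    x2[i] += rng.normal(0, scale)
    y2[i] += rng.normal(0, scale)
    return x2, y2
```

A candidate move produced by `perturb` would only be kept if `same_stats` still holds between the original and perturbed data.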
Repeating this subtle "perturbation" process enough times results in a completely different dataset. However, as mentioned above, for these datasets to be effective tools for underscoring the importance of visualizing your data, they need to be visually distinct and clearly different. We accomplish this by biasing the random point movements towards a particular shape. In the animation below, we show the process of 200,000 perturbation iterations towards a 'circle' shape:
To move the points towards a particular shape, we perform an additional check at each random perturbation. Besides checking that the statistical properties are still valid, we also check to see if the point has moved closer to the target shape. If both of those conditions are met, we "accept" the new position, and move to the next iteration. To mitigate the possibility of getting stuck in a locally-optimal solution, where other, more globally-optimal solutions closer to the target shape are possible, we use a simulated annealing technique which begins by accepting some solutions where the point moves away from the target in the early iterations, and reduces the frequency of such acceptances over time.
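A rough sketch of that acceptance rule is below. It is a simplification under our own assumptions: `dist_to_shape` is a hypothetical helper returning a point's distance to the target shape, the linear cooling schedule and its constants are illustrative, and the separate statistics check described above would still be applied before a move is finally kept.

```python
import numpy as np

def anneal_step(x, y, dist_to_shape, temperature, rng, scale=0.1):
    """One iteration: perturb a random point, and accept the move if it
    lands closer to the target shape, or -- with probability given by the
    current temperature -- even if it moved away (simulated annealing)."""
    i = rng.integers(len(x))
    new_x = x[i] + rng.normal(0, scale)
    new_y = y[i] + rng.normal(0, scale)
    closer = dist_to_shape(new_x, new_y) < dist_to_shape(x[i], y[i])
    if closer or rng.random() < temperature:
        x2, y2 = x.copy(), y.copy()
        x2[i], y2[i] = new_x, new_y
        return x2, y2
    return x, y       # move rejected; keep the current dataset

def temperature(iteration, n_iterations, t_max=0.4, t_min=0.01):
    """Cool linearly from t_max to t_min, so 'worse' moves are accepted
    often early on and rarely near the end."""
    return t_max - (t_max - t_min) * iteration / n_iterations
```

With `temperature` near zero, only moves towards the shape are accepted, which is how the process converges on the target.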
To generate the Datasaurus Dozen, we created 12 shapes to direct the dots towards. Each of the resulting plots has the same summary statistics as the original Datasaurus, and in fact, all of the intermediate frames do as well. The process of converting the Datasaurus into each of these shapes can be seen below. Of course, the technique is not limited to these shapes, any collection of line segments could be used as a target.
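Since a target shape is just a collection of line segments, the "distance to the shape" used to bias the moves can be computed as the distance to the nearest segment. A minimal sketch (our own illustrative helper names, not the released code):

```python
import numpy as np

def point_segment_distance(p, a, b):
    """Euclidean distance from 2D point p to the line segment ab."""
    p, a, b = (np.asarray(v, float) for v in (p, a, b))
    ab = b - a
    denom = ab @ ab
    if denom == 0:                                 # degenerate segment
        return float(np.linalg.norm(p - a))
    t = np.clip((p - a) @ ab / denom, 0.0, 1.0)    # project and clamp
    return float(np.linalg.norm(p - (a + t * ab)))

def distance_to_shape(p, segments):
    """Distance from p to the nearest segment of the target shape."""
    return min(point_segment_distance(p, a, b) for a, b in segments)
```

Any shape drawn as segments (a circle approximated by chords, a star, an 'X') can be dropped in as the `segments` list.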
Iterating through the datasets sequentially, we can see how the data points morph from one shape to another, all the while maintaining the same summary statistical values to two decimal places throughout the entire process.
Besides the Datasaurus Dozen, we have created several other example datasets using our technique. They are explained in more detail in the paper, and can be downloaded for your own visualizations.
One interesting property of our technique is that it can work for visualizations other than 2D scatter plots, and statistical properties besides the standard summary statistics. In the example below each of the datasets start out as a normal distribution of points. The boxplot shown at the bottom is a standard "Tukey Boxplot" which shows the 1st quartile, median, and 3rd quartile values on the "box", and the "whiskers" showing the location of the furthest data points within 1.5 interquartile ranges from the 1st and 3rd quartiles. Boxplots are commonly used to show the distribution of a dataset, and are better than simply showing the mean or median value. However, here we can see as the distribution of points changes, the box-plot remains the same.
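For reference, the five numbers a standard Tukey boxplot draws can be computed as below (a simple NumPy sketch; `np.percentile`'s default interpolation is one of several common quartile conventions). Two datasets with very different distributions can share all five values, which is exactly what the example above exploits.

```python
import numpy as np

def tukey_boxplot_stats(data):
    """The five values drawn by a Tukey boxplot: the quartiles (box and
    median line) plus whiskers at the most extreme data points within
    1.5 interquartile ranges of the box."""
    data = np.asarray(data, float)
    q1, med, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    lo = data[data >= q1 - 1.5 * iqr].min()   # lower whisker
    hi = data[data <= q3 + 1.5 * iqr].max()   # upper whisker
    return lo, q1, med, q3, hi
```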
Another way to look at these 1D distributions is to consider a dataset with seven categories (Figure 8, below). The data in each category is shifting over time, as can clearly be seen in the "Raw" data view, yet the boxplots remain static. Violin plots are a good method for presenting the distribution of a dataset with more detail than is available in a traditional boxplot. This is not to say that using a boxplot is never appropriate, but if you are going to use one, it is important to make sure the underlying data is distributed in a way that does not hide important information.
The datasets presented on this page (and in the paper) are available for download. The Python source code is available for download here. I have tried to remove as many "extra" things from the code as possible to make it more readable, but it is still a bit rough... I would like to put it up on GitHub soon, and if there is enough interest, turn it into a real library. If you are a researcher and would like to see all of the original code (even though it might not run properly, and is quite "researchy"), please contact me.
Amazingly, and without any work on my part, these datasets have been turned into an R package (GitHub, CRAN package). This effort has been led by Stephanie Locke and Lucy McGowan. Thanks!
A special thanks to Alberto Cairo for creating the Datasaurus. When I asked if he had saved the datapoints from his original tweet, he hadn't, but he very graciously (and quickly!) created a new (and even better) dinosaur drawing (using the fantastic DrawMyData tool). Also thanks to Fraser Anderson for the idea to start with an existing dataset which has the desired statistical properties, rather than trying to create one from scratch.
The following are links to some of the attention this article has received. If you have found additional coverage, I'd love to hear about it.
Fast Co. Design: These 12 Graphs Show Why Data Viz is So Important May 10, 2017
Boing Boing: Automatically generate datasets that teach people how (not) to create statistical mirages May 3, 2017
FlowingData: Same summary statistics, completely different plots May 2, 2017
Pour la Science: Coïncidences surprenantes, mais banales N° Spécial 481, November 2017
Microsoft Revolutions: The Datasaurus Dozen May 2, 2017
Eager Eyes: InfoVis Papers at CHI 2017 May 22, 2017
Linear Digressions Podcast: Anscombe's Quartet June 18, 2017
Digital Analytics Power Hour Podcast: The Democratization of the Data July 4, 2017
JWZ: Datasaurus Dozen May 2, 2017
Locke Data: The making of datasauRus May 2, 2017
Dabbling with Data: The Datasaurus: a monstrous Anscombe for the 21st century May 3, 2017
Hacker News (Front Page): Same Stats, Different Graphs May 2, 2017
Reddit (Front Page): Be wary of boxplots, they might be hiding important information! Aug 10, 2017
Meneame (Spanish, Front Page): Los doce del datosaurio: misma estadística, diferentes gráficas Aug 13, 2017
Tomaz Tonguz: When Statistics Will Mislead You May 3, 2017
Waxy.org: The Datasaurus Dozen May 3, 2017
Jake Thompson: Recreating the Datasaurus Dozen Using tweenr and ggplot2 May 5, 2017
Prior Probability: Same Stats, Different Graphs May 2, 2017
The Command Line: Generating datasets with the same summary stats but very different graphs May 3, 2017
Introduction to the New Statistics: What the datasaurus tells us: Data pictures are cool May 11, 2017
Psychometric Studio: Datasaurus May 17, 2017
Lars P. Syll: Don't Trust Summary Statistics Alone May 6, 2017
De Correspondent (Dutch): Na deze 4 grafieken kijk je nooit meer hetzelfde naar het gemiddelde July 6, 2017
Blog Statystyczny (Polish): Kwartet Anscombe’a, Datasaurus – czyli po co w ogóle rysować? May 28, 2017
QED Insight: Statistics in the Triad, Part VII: Mapping The Datasaurus Dozen Aug 18, 2017
InformationWeek: Avoid the Danger Zone of Metrics Sept 28, 2017
Urban Demographics: Why you should always visualize your data Oct 10, 2017
Intuition Machine: Why Probability Theory Should be Thrown Under the Bus Nov 2, 2017
The Morning Paper: Same Stats, Different Graphs Oct 31, 2017
Robert Grant Stats: Dataviz and Methodviz of the Year 2017 Dec 28, 2017
Yannick Assogba: Data Visualization Favorites from 2017 Dec 29, 2017
For more information, please see the research paper or watch the longer research video.
If you have any questions please contact Justin Matejka through email Justin.Matejka@Autodesk.com or Twitter @JustinMatejka.