Data Mining

Submitted to CRTNET, August 10, 2003

Although I'm late to the discussion, I thought there might be value in extending the conversation about data mining a bit in a direction that hasn't been raised, as far as I can see. Data mining is just the latest wrinkle on what I was taught under the label "exploratory data analysis," a practice with a long history before Tukey's classic text "Exploratory Data Analysis". I guess I don't find it odd that folks in our field would have been taught that inferential statistics were the only kind that matter. I've certainly been in courses that leaned that way, but my education in statistics has always valued both hypothesis-based data analysis and exploratory methods. Indeed, I've always been taught to look beyond the results of the experiment and see what else a data set had to reveal. It has always been my understanding that such discovery was, at least in part, the function of the discussion section of a paper, but I can certainly understand why it might be isolated to its own section, as Tim Levine advocates.

Still, exploratory data analysis has a rich tradition of scholarship in its own right. Research questions that start as simply as "there has been very little scholarship exploring this issue, but there is some data that may help us to understand it" are important. They are the starting point for finding systematic questions and hypotheses in underexplored areas of scholarship. EDA is, in effect, the statistical analog to ethnography. The analogy is, I think, an apt one, as many of the same risks and rewards apply. The risks, both in terms of spotting what may be spurious patterns and interpreting even the meaningful patterns badly, are not insubstantial. But the rewards, in terms of finding things that aren't in the literature and may well be important, argue for judicious exploration of data sets where you have the expertise both to find the patterns and to offer meaningful interpretations.

The obviously problematic (and almost certainly biased) interpretation of the 12-hot-dog pattern cited in a previous post is a good case in point. It might well be that there is something about the making of hot dogs that makes them a health hazard at some level, even if there is no corresponding result for similar packaged foods (bologna, for instance). There are differences (probably insignificant differences, but real ones) in the way these foods are made. It might also be that hot dog preparation presents a problem. Most packaged foods are eaten cold. Hot dogs are eaten hot and (especially those bought in convenience stores) are often substantially overcooked. Most people eat hot dogs with mustard, and there may be a critical mass of mustard after which it becomes a dietary hazard. One might just as easily interpret such a pattern as a possible indication of a problematic eating disorder: substantial overeating, a highly unbalanced diet, or a correlated pattern of abuse of other junk foods (how many sodas a week does it take to wash down 12 hot dogs?). My point is that exploration is nothing more than it claims to be. If a researcher finds a pattern, there is an obligation to document all of the possibilities that are consistent with the observations and either to advance a program of research to narrow the alternatives or at least to encourage others to do so.

These are important questions to ask, and we often ignore them. A 1960s survey of over 100,000 people found that eating no fried food in a week was as bad for you as eating a fairly large amount. I no longer remember the exact amounts, but the upshot was that there was a sweet spot (eating fried foods about twice a week, as I recall) that appeared to be healthiest. This result has generally been ignored by those who advocate removing fat, and especially fried food, from our diets, but I note that we see similar results today in studies of wine consumption. Drinking no wine is as unhealthy as drinking two glasses a day; one glass appears to be the healthy sweet spot. I love these kinds of results, as they provide rich veins for hypothesis-based research and theory building. Vive l'interaction effect.
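Patterns like that, where both too little and too much of something look bad, are exactly the kind of nonlinear relationship exploratory work is good at surfacing. As a minimal sketch, with invented numbers rather than the actual survey data, here is one standard way of locating such a sweet spot: fit a quadratic and see where it bottoms out.

```python
# Illustrative only: the numbers below are invented, not the actual survey data.
# Fit a quadratic to (servings per week, health-risk score) pairs and find the
# serving level that minimizes predicted risk, i.e. the "sweet spot".
import numpy as np

servings = np.array([0, 1, 2, 3, 4, 5, 6, 7], dtype=float)
risk     = np.array([1.8, 1.3, 1.0, 1.1, 1.4, 1.9, 2.5, 3.3])  # hypothetical scores

# Least-squares quadratic fit: risk ~ a*servings^2 + b*servings + c
a, b, c = np.polyfit(servings, risk, deg=2)

# If the parabola opens upward (a > 0), its vertex is the estimated sweet spot.
if a > 0:
    print(f"Estimated lowest-risk intake: about {-b / (2 * a):.1f} servings per week")
```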

I note that these are not new problems. We routinely teach students the difference between observing patterns and interpreting patterns in the introductory course in Interpersonal Communication. Misinterpretation is no less a problem in data analysis than it is in any other venue of human observation. We tell our students to beware of fundamental biases in interpretation, including the fundamental attribution error, and I repeatedly have my students brainstorm as many explanations for an observation as they can. We obviously need to be wary of misinterpretation and overinterpretation ourselves. Interestingly enough, we have analysts from the CIA saying the same things now. They looked at data relating to Iraq and, in general, offered a range of possible interpretations. Other, decidedly more biased, individuals apparently decided that only one of those interpretations could possibly be right and dropped the alternative evaluations as the analysis climbed the management chain. The result, of course, is an increasing clamor about lying from the Bush White House. You can be sure that those CIA analysts made extensive use of data mining. You can also be sure that their interpretations documented a range of possibilities and suggested means by which those possibilities might have been more rationally narrowed. That's what a good exploratory data analyst does.

Data mining differs from exploratory data analysis in one important fundamental: it attempts to automate the process of finding patterns in data through the use of pattern-identification algorithms. I understand this process at its fundamentals. I built a series of automated pattern-seeking agents at IBM Research, came up with at least one powerful algorithm for finding patterns in serial data, and even consulted a few times when IBM was getting its data mining software product efforts going. Other things typically associated with data mining, including graphical visualization, aren't so much an innovation as an extension of the methodologies already associated with exploratory data analysis. The most important innovations in Tukey's EDA were visualization techniques based on interesting ways of organizing data. Factor, cluster, and smallest space analysis all have even older histories of using visualization as a tool for identifying patterns.
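The details of that IBM work aren't reproduced here, but as a hedged sketch of what automated pattern seeking in serial data can look like, the toy function below simply counts recurring fixed-length subsequences in a symbol series and reports those that repeat; the data, the function name, and the threshold are all hypothetical.

```python
# Minimal sketch of automated pattern finding in serial data: count every
# fixed-length subsequence (motif) in a symbol series and keep those that
# recur at least min_count times. Data and parameters are hypothetical.
from collections import Counter

def recurring_motifs(series, length=3, min_count=2):
    """Return motifs of the given length that occur at least min_count times."""
    windows = [tuple(series[i:i + length]) for i in range(len(series) - length + 1)]
    return {motif: n for motif, n in Counter(windows).items() if n >= min_count}

# Example: a coded categorical series (states or behaviors observed over time).
series = list("ABCABDABCABCXYZ")
print(recurring_motifs(series))
# ('A', 'B', 'C') occurs 3 times; ('B', 'C', 'A') and ('C', 'A', 'B') occur twice.
```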

The automation of pattern discovery is the important fundamental in data mining. Automated pattern-identification structures and algorithms are a really neat innovation, and they come with a nice side guarantee: computer software isn't going to imagine a pattern and superimpose it on top of otherwise indifferent data (as people might easily do). If it finds a pattern, it will be able to state exactly what that pattern is, so that the same pattern might be readily identified again in another data set. If it finds a pattern, it will usually be able to make a statement of how confident it is that the pattern is really there. If it finds a pattern, it will generally be able to isolate it in a way that makes it easy for an analyst to visualize.
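As a rough illustration of those three properties (an exact statement of the pattern, a confidence figure, and easy isolation for inspection), here is a minimal sketch using synthetic data; the variable names are invented, and the simple correlation scan stands in for whatever pattern-identification algorithm a real data mining tool would use.

```python
# Sketch: scan every pair of variables for a linear association and report each
# pattern found exactly (which variables, what correlation) together with a
# confidence statement (a p-value). All data here are synthetic.
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
hot_dogs = rng.normal(4, 2, 200)
data = {
    "hot_dogs": hot_dogs,
    "sodas":    1.5 * hot_dogs + rng.normal(0, 1, 200),  # built to covary
    "exercise": rng.normal(3, 1, 200),                   # unrelated noise
}

for a, b in combinations(data, 2):
    r, p = stats.pearsonr(data[a], data[b])
    if p < 0.01:  # crude screening threshold
        print(f"Pattern found: {a} ~ {b}, r = {r:.2f}, p = {p:.1g}")
```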

What data mining can't do is ensure that a pattern is generalizable beyond a single data set or reach any reasonable interpretation of what a pattern might mean. Software algorithms and clever data structures can help one find patterns in data, but they are of only limited value when it comes to interpreting and further testing them. The task of making a pattern meaningful, of discovering a relationship that might inform our understanding of the world, remains for us. What fun. That's why puzzles make the best research problems.