As someone who earns his daily bread and crumpets by doing data analysis, yet has a deep affection for the worldviews of times past, I have been struck more and more of late by how "data" has become a term granted incredible reverence in our modern world. Constantly I hear people who are educated and move within educated circles, but who have no particular understanding of data itself, insist that they must be "shown the data" to believe something.
Some little while ago I asserted in conversation about youthful sexual morality that it was entirely possible for a young person to, should he or she so choose, remain chaste until marriage. "The data doesn't support that," I was told. When I responded that data on the topic merely showed that many people do not so choose, but that it was nonetheless entirely possible to pursue this course (and mentioned my own experiences and those of a few friends by way of example) I was informed tendentiously that, "The plural of anecdote is not data."
Others address problems which might rightly be addressed via data, but have no understanding of what data means. "It's shocking that in this day and age half our students are still below average in reading ability," I was once told. I'm afraid I laughed.
Or, slightly less obviously foolish, "The bottom 20% of earners don't make any more, after you adjust for inflation, than they did twenty years ago." This is true in a certain statistical sense, but it fails to account for the fact that individual people move through these classifications quite fluidly. The nineteen-year-old who was in the bottom 20% ten years ago is by no means necessarily still there now.
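To make that concrete, here is a toy sketch in Python. The names and incomes are invented purely for illustration, not drawn from any real survey. It shows how the average income of "the bottom 20%" can sit perfectly still between two snapshots while the people who were in that group the first time around are, on average, doing much better.

```python
# Hypothetical toy data: inflation-adjusted incomes (in thousands) for the
# same ten people, observed in two different years. All numbers are invented.
incomes_then = {"Ann": 12, "Bob": 14, "Cal": 30, "Dee": 35, "Eli": 40,
                "Fay": 45, "Gus": 50, "Hal": 60, "Ida": 75, "Jon": 90}
incomes_now = {"Ann": 38, "Bob": 14, "Cal": 12, "Dee": 36, "Eli": 42,
               "Fay": 47, "Gus": 55, "Hal": 62, "Ida": 80, "Jon": 95}


def bottom_quintile(incomes):
    """Return the names of the lowest-earning 20% of people."""
    ranked = sorted(incomes, key=incomes.get)
    return set(ranked[: len(ranked) // 5])


def mean_income(incomes, names):
    """Average income of the named people within one snapshot."""
    return sum(incomes[name] for name in names) / len(names)


then_group = bottom_quintile(incomes_then)
now_group = bottom_quintile(incomes_now)

# The statistic people quote: average income of whoever is in the bottom 20%.
avg_then = mean_income(incomes_then, then_group)
avg_now = mean_income(incomes_now, now_group)

# The question people think it answers: how are the original members doing?
avg_now_of_then_group = mean_income(incomes_now, then_group)

print("Bottom 20% average, then:", avg_then)
print("Bottom 20% average, now: ", avg_now)
print("Original bottom 20%, now:", avg_now_of_then_group)
print("Same people in both groups?", then_group == now_group)
```

The category's average does not move, yet one of its two original members has climbed well out of it and someone else has fallen in. The statistic describes the bracket, not the people.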
From my vantage point as a producer of data, it strikes me that the illusion under which far too many people labor is that data itself tells you something, and that it tells you this clearly and with authority.
Data is a highly modern phenomenon. When William the Conqueror commissioned the Domesday Book in 1085, it took a year to complete, and the feat of cataloging with relative accuracy all the people and their property in England was so remarkable that it is remembered to this day. The project required sending people out on horseback throughout the kingdom to interview people and write down the results, and no one attempted anything of similar scale again in England for several hundred years.
Today computers and telephones not only make it easy to collect census and survey data, but all manner of transactions (performed on computers) pour out vast quantities of data as a sort of golden waste product. In a modern corporation such as the one I work for, it is impossible to sell people things and ship them out without in the process producing so much data about who bought what, when, and for how much, that we "data monkeys" are challenged to drink in half the available insights from the firehose stream of information turned upon us.
Data sits out there in tables: thousands or millions of records of individual facts or events which those of us who access them can sum and average and graph and flip this way and that in pivot tables. Sitting at my computer I can take millions of rows of data which are the by-product of a week's worth of consumer electronics orders and in an hour or two's worth of work in Access and Excel tell you how often certain products were bought together, what people think is a good price for a 42" TV, or whether a glossy insert in the weekend newspaper sells more digital cameras than our website. The sorts of functions I can run against a data set in moments represent levels of analysis that would almost certainly have been impossible more than fifty years ago. In the days when people really did wear green eye shades and sharpen pencils, it would simply not have been possible to gather and casually experiment with hundreds of thousands of rows of data in the way that I do every day.
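For readers who have never done this sort of thing, here is a rough idea of what "how often certain products were bought together" amounts to. This is a minimal Python sketch rather than the Access and Excel work described above, and the five orders are made up for the purpose.

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders: each set holds the products on one order.
orders = [
    {"tv", "hdmi cable", "wall mount"},
    {"tv", "hdmi cable"},
    {"camera", "memory card"},
    {"camera", "memory card", "tripod"},
    {"tv", "wall mount"},
]


def pair_counts(orders):
    """Count how many orders contain each pair of products."""
    counts = Counter()
    for items in orders:
        # Sort first so that each pair is always recorded in the same order.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(orders)
for pair, n in counts.most_common(3):
    print(pair, n)
```

Every line of output is just a tally of events. Whether people buy the cable because of the TV, or simply because it sat next to the checkout button, is not something the count can tell you.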
The fact that there is simply so much data around in the modern world allows us to investigate all sorts of interesting questions using data. But what must be realized is that "data" is simply the collection of lots of individual records about individual events. It may not be the plural of anecdote, but it is the plural of event. And data does not itself have obvious meaning. One must seek to find some sort of pattern in it, and that pattern may not be right, in the sense that it may well not accurately describe the experiences or motivations of most or any of the people who were involved in the individual events whose descriptions are now "data".
And this is what people need to understand about data. Data is not the deeper essence of the universe, the real world of which mere events are but imperfect instantiations. Rather, data represents the partial leavings of reality. Traces of past events. Footprints and shadows. Clues left behind by the real events, which can only at times be accurately deduced from them.
There are amazing insights that can be gained from data analysis, not only about present day events, but about the past. (Some fascinating historical research I've read lately has been based on analysis of the centuries of records kept in parish registries throughout Europe, now entered into databases by historians.) But these insights are only good to the extent that a good analyst is able to correctly identify patterns which reflect reality.
I'm glad that we live in the age of data. All other things aside, I find it fascinating to play with, and it gives me a good living. But given the modern obsession with data as a defining source of truth, people need to see through the hype and recognize what data actually is and isn't. Data is simply the collection of a set of records which tell you "someone did this" or "this person had this characteristic". By looking at the frequency with which people did some given thing or possessed some given trait, we can learn some very interesting things. But data can never answer qualitative questions for us, though it may provide us with the inputs to make qualitative judgments.
Data cannot tell you what people are capable of or what they should do, but it can tell you what people often do do. Data cannot tell you what the best health care system is, but it can tell you the life expectancies of people with various illnesses in different countries, the average cost of treatments, or the average wait times for procedures. Data cannot tell you what marriage is or what culture has a desirable family structure, but it can tell you who gets divorced and how frequently. And data cannot tell you how accurately data answers all the world's problems -- though there is data on how much more data we produce every year. Indeed if the data I've seen on that point is accurate, the age of data is just beginning.