Because most philosophies that frown on reproduction don't survive.

Friday, January 23, 2009

The Age of Data

As someone who earns his daily bread and crumpets by doing data analysis, yet has a deep affection for the worldviews of times past, it has struck me more and more of late that "data" has become a term granted incredible reverence in our modern world. Constantly I hear people who are educated and move within educated circles, but who have no particular understanding of data itself, insist that they must be "shown the data" to believe something.

Some little while ago I asserted in conversation about youthful sexual morality that it was entirely possible for a young person to, should he or she so choose, remain chaste until marriage. "The data doesn't support that," I was told. When I responded that data on the topic merely showed that many people do not so choose, but that it was nonetheless entirely possible to pursue this course (and mentioned my own experiences and those of a few friends by way of example) I was informed tendentiously that, "The plural of anecdote is not data."

Others address problems which might rightly be addressed via data, but have no understanding of what data means. "It's shocking that in this day and age half our students are still below average in reading ability," I was once told. I'm afraid I laughed.

Or, slightly less obviously foolish: "The bottom 20% of earners don't make any more, after you adjust for inflation, than they did twenty years ago." This is true in a certain statistical sense, but it fails to account for the fact that individual people move through these classifications quite fluidly. The nineteen-year-old who was in the bottom 20% ten years ago is by no means necessarily still there now.
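A small sketch can make the distinction concrete. The incomes below are invented purely for illustration (they are not drawn from any real survey), and the names `incomes_then`, `incomes_now`, and `bottom_quintile` are hypothetical. The point is only that the average income of the bottom 20% can stay flat while the people in that group change.

```python
# Illustrative only: these incomes are invented, not real survey data.
# Inflation-adjusted income (in thousands) for ten hypothetical people.
incomes_then = {"A": 10, "B": 12, "C": 30, "D": 35, "E": 40,
                "F": 45, "G": 50, "H": 60, "I": 70, "J": 90}
incomes_now = {"A": 48, "B": 12, "C": 33, "D": 10, "E": 42,
               "F": 47, "G": 55, "H": 62, "I": 75, "J": 95}


def bottom_quintile(incomes):
    """Return the names of the lowest-earning 20% of people."""
    ranked = sorted(incomes, key=incomes.get)  # lowest income first
    cutoff = max(1, len(ranked) // 5)
    return set(ranked[:cutoff])


def mean_income(incomes, people):
    """Average income of the given people."""
    return sum(incomes[p] for p in people) / len(people)


bottom_then = bottom_quintile(incomes_then)
bottom_now = bottom_quintile(incomes_now)

avg_then = mean_income(incomes_then, bottom_then)
avg_now = mean_income(incomes_now, bottom_now)
moved_out = bottom_then - bottom_now  # in the bottom 20% then, not now

print("Bottom-20% average then:", avg_then)
print("Bottom-20% average now: ", avg_now)
for person in sorted(moved_out):
    print(person, "went from", incomes_then[person], "to", incomes_now[person])
```

In this made-up example the group average is identical in both periods, yet one of the two people who started in the bottom group has moved well up the ladder. The summary statistic describes the bracket, not any particular person in it.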

From my vantage point as a producer of data, it strikes me that the illusion under which far too many people labor is that data itself tells you something, and that it tells you this clearly and with authority.

Data is a highly modern phenomenon. When William the Conqueror commissioned the Domesday Book in 1085, it took a year to complete, and the feat of cataloging with relative accuracy all the people and their property in England was so remarkable that it is remembered to this day. The project required sending people out on horseback throughout the kingdom to interview people and write down the results over the course of a year, and no one attempted anything of similar scale again in England for several hundred years.

Today computers and telephones not only make it easy to collect census and survey data, but all manner of transactions (performed on computers) pour out vast quantities of data as a sort of golden waste product. In a modern corporation such as the one I work for, it is impossible to sell people things and ship them out without in the process producing so much data about who bought what, when, and for how much, that we "data monkeys" are challenged to drink in half the available insights from the firehose stream of information turned upon us.

Data sits out there in tables: thousands or millions of records of individual facts or events which those of us who access them can sum and average and graph and flip this way and that in pivot tables. Sitting at my computer I can take millions of rows of data which are the byproduct of a week's worth of consumer electronics orders and in an hour or two's worth of work in Access and Excel tell you how often certain products were bought together, what people think is a good price for a 42" TV, or whether a glossy insert in the weekend newspaper sells more digital cameras than our website. The sorts of functions I can run against a data set in moments represent levels of analysis that would almost certainly have been impossible more than fifty years ago. In the days when people really did wear green eye shades and sharpen pencils, it would simply not have been possible to gather and casually experiment with hundreds of thousands of rows of data in the way that I do every day.

The fact that there is simply so much data around in the modern world allows us to investigate all sorts of interesting questions using data. But what must be realized is that "data" is simply the collection of lots of individual records about individual events. It may not be the plural of anecdote, but it is the plural of event. And data does not itself have obvious meaning. One must seek to find some sort of pattern in it, and that pattern may not be right, in the sense that it may well not accurately describe the experiences or motivations of most or any of the people who were involved in the individual events whose descriptions are now "data".

And this is what people need to understand about data. Data is not the deeper essence of the universe, the real world of which mere events are but imperfect instantiations. Rather, data represents the partial leavings of reality. Traces of past events. Footprints and shadows. Clues left behind by the real events, which can only at times be accurately deduced from them.

There are amazing insights that can be gained from data analysis, not only about present day events, but about the past. (Some fascinating historical research I've read lately has been based on analysis of the centuries of records kept in parish registries throughout Europe, now entered into databases by historians.) But these insights are only good to the extent that a good analyst is able to correctly identify patterns which reflect reality.

I'm glad that we live in the age of data. All other things aside, I find it fascinating to play with, and it gives me a good living. But given the modern obsession with data as a defining source of truth, people need to see through the hype and recognize what data actually is and isn't. Data is simply the collection of a set of records which tell you "someone did this" or "this person had this characteristic". By looking at the frequency with which people did some given thing or possessed some given trait, we can learn some very interesting things. But data can never answer qualitative questions for us, though it may provide us with the inputs to make qualitative judgments.

Data cannot tell you what people are capable of or what they should do, but it can tell you what people often do do. Data cannot tell you what the best health care system is, but it can tell you the life expectancies of people with various illnesses in different countries, the average cost of treatments, or the average wait times for procedures. Data cannot tell you what marriage is or what culture has a desirable family structure, but it can tell you who gets divorced and how frequently. And data cannot tell you how accurately data answers all the world's problems -- though there is data on how much more data we produce every year. Indeed if the data I've seen on that point is accurate, the age of data is just beginning.


Anonymous said...

The Blackadder Says:

"The plural of anecdote is not data" is kind of like an educated person's version of "I'm rubber, you're glue."

Ginkgo100 said...

You do data analysis, eh? You don't happen to work with Audit Command Language, do you?

Darwin said...

No. Just Excel, Access, SQL and a little SAS.

Patrick said...

Of course, from your point of view it's irrelevant that the vast majority of those who take an abstinence-till-marriage pledge will break it (in fact, the same proportion as their peers who hadn't taken such a pledge), and a plus that fewer of them will use a condom when they do so. It just doesn't seem you can argue on secular grounds that such pledges are a good thing.

More generally, if you can't show empirically that the Catholic virtue of chastity has great value to a human life, then the attempt looks (to a non-Catholic) like a lot of effort for an ephemeral payoff and a substantial risk.

(Of course you're preaching more than pledges; but that's one particularly well-studied example.)

Darwin said...


Actually, I addressed that study (went back to the original data) and pointed out that what the researcher did here was some statistical sleight of hand. She defined "peers" as other young people with exactly the same religious practices, home life and attitudes towards sex as those who took pledges. Compared to the general population of teenagers, those who took the pledges had sex for the first time much later and half of them did indeed "wait until marriage".

Which kind of serves to underline my point that a great many people believe "data" as if it were some deep insight into the nature of the world without bothering to understand what the data is actually looking at.

As a mathematician, I'd have expected you'd agree pretty easily with that one.

CMinor said...

half our students are still below average in reading ability

You've heard, perhaps, of the Lake Wobegon effect in certain public schools?

All the children are above average.

I kid you not.

Daddio said...

"It's shocking that in this day and age half our students are still below average in reading ability,"

Actually, half are below the median. With a large sample, the median and mean could be very close. But there is a difference, which I love pointing out to my customers. They try to escape the blame for their poor insurance claims reporting timeliness. "Yeah, well, that one guy didn't even tell us he was injured until three months later." Fine. That's why we are pointing out the MEDIAN, not the MEAN. Work on improving it, please.
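To see the difference in numbers, here is a minimal sketch using Python's standard `statistics` module. The reporting lags are invented for illustration only, with one late reporter standing in for "that one guy."

```python
from statistics import mean, median

# Invented numbers: days between an injury and the claim being reported.
report_lag_days = [2, 3, 3, 4, 5, 6, 90]  # 90 is the one very late report

lag_mean = mean(report_lag_days)
lag_median = median(report_lag_days)

# Same claims with the single late report removed, for comparison.
without_outlier = [d for d in report_lag_days if d != 90]

print("With the late report:    mean =", round(lag_mean, 1),
      " median =", lag_median)
print("Without the late report: mean =", round(mean(without_outlier), 1),
      " median =", median(without_outlier))
```

In this made-up sample the single late report drags the mean up to roughly four times its value without that report, while the median barely moves. That is why the median is the fairer yardstick when one straggler would otherwise take the blame for everything.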

DMinor said...


From my vantage point as a producer of data, it strikes me that the illusion under which far too many people labor is that data itself tells you something, and that it tells you this clearly and with authority.
. . . .
And data does not itself have obvious meaning. One must seek to find some sort of pattern in it, and that pattern may not be right. . . .

Amen and Amen! Data retrieval and arrangement is not the same as data analysis, although all too often the former is confused with the latter. I call this the "Bat Computer" model -- if I punch the right buttons, the computer will spit out my answer. If there is no critical thinking, there is no analysis.

Thadeus said...

Thanks for the insightful post. It reminded me of a recent article in "Columbia," the K of C magazine, where they referenced Pope Benedict XVI's discussion of faith and reason.

To quote, "...for it is self-contradictory to hold that we should only accept what can be proven scientifically. After all, no scientific experiment verifies that reason is limited to scientific experiments alone."