Professor David Hand says the 15 types of dark data we can’t, or choose not to, see could have a devastating impact on our decision-making.
A lot of stories about data – be they popular stories about big data, open data or data science, or technical statistical books about how to analyse data – are about the data you have. They are about the data sitting in folders on your computer, in files on your desk, or as records in your notebook. In contrast, this story is about data you don’t have – perhaps data you wish you had, or hoped to have, or thought you had; but, nonetheless, data you don’t have.
I believe that these missing data are at least as important as the data you have. Data you can't see have the potential to mislead, sometimes with catastrophic consequences. But, perhaps surprisingly, you can also use the dark data perspective to flip the conventional way of looking at data analysis: hiding data can, if you are clever enough, lead to deeper understanding, better decisions and better choice of actions.
So, what exactly are dark data? They might be blanks in a data set, they might be data you didn’t think to collect, they might be the details in data that are too heavily rounded, or they might be data which are hidden from you for some other reason.
Some are very familiar, such as the non-respondents in surveys, or data from people who drop out of a clinical trial. Others are more subtle, such as selective dropout of companies from financial indexes, or abandoned phone calls to the emergency services, or ceilings on values from some measuring instruments, or the opinions of the man I came across who handed his phone to his five-year-old son to complete a web survey.
The fundamental problem is that the data you don’t see might be very different from the data you do see. It means that basing your analysis solely on the data you have in front of you can lead to dramatically mistaken conclusions and poor decisions. And it affects both small and large data sets.
One of the most devastating small-data examples is the well-known case of the Space Shuttle Challenger, where a calculation showing problems with the seals between the booster rockets’ segments omitted data on launches that had had no problems. Including those extra data gave a dramatically different picture, one which could well have led to a delay in the launch and the saving of seven lives.
An early large-data example I encountered was a consultancy project involving many millions of financial transactions, with the aim of building a model to decide which customers were low risk and should therefore be given a loan. Statistical models of default risk are based on analysing previous customers, whose behaviour and outcome (whether they defaulted or not) are known. The problem is that previous customers have already undergone a selection process – they were presumably given loans because they were thought to be low risk. They are unlikely to be representative of the population of future applicants. A model based solely on previous customers could be seriously misleading.
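To see the mechanism, here is a minimal simulation – entirely made-up numbers, not the client’s data. It assumes the original underwriters selected on information (here a variable z) that is absent from the records, so a model fitted only to accepted customers badly underestimates risk across the full applicant pool:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 100_000

# x: a feature we record (say, an income band); z: information the
# original underwriters used but which is absent from our data set.
x = rng.normal(size=n)
z = rng.normal(size=n)
p_default = 1 / (1 + np.exp(-(-2 - x + 2 * z)))  # z strongly drives default
default = rng.random(n) < p_default

# Historic loans went only to applicants the underwriters liked,
# i.e. those with low z, so their outcomes are all we ever observe.
accepted = z < -0.5
model = LogisticRegression().fit(x[accepted, None], default[accepted])

# Score the *full* applicant pool with the past-customers-only model.
predicted = model.predict_proba(x[:, None])[:, 1]
print(f"model's predicted default rate: {predicted.mean():.1%}")
print(f"true default rate in the pool:  {default.mean():.1%}")
```

Because the rejected applicants are dark, the model’s average prediction sits far below the pool’s true default rate.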
These two examples illustrate the dangers of dark data: the data you see may not give a full picture. The same problem arises with econometric models built during benign economic periods. You don’t have a full picture of potential variability, and when things deteriorate (the 2008 financial crash, for example), the performance of the models may degrade. Physicists use the term ‘cosmic variance’ to describe the same idea: since we observe only a part of the universe, and at only one time, making statements about the universe on a wider scale is difficult.
In scientific experiments, decisions must often be made about which data points to retain and analyse. We might be suspicious about departures from the ideal experimental conditions when some of the measurements were taken, and consider dropping some data. Different choices of values to include, with the excluded ones constituting dark data, can lead to different overall experimental results. The case of Robert Millikan and whether or not electric charge came in discrete quantities is a classic example.
Millikan asserted that he had included all his experimental results in his analysis (“This is not a selected group of drops but represents all of the drops experimented upon during 60 consecutive days”), but his notebooks told a different story. In fact, as I describe in my book Dark Data: Why What You Don’t Know Matters, the truth seems to be that the values he excluded were obtained while he was calibrating his measurement procedure.
The Millikan case involves a small data set in modern terms, but as data capture and storage have become easier and cheaper, ever more, and ever larger, data sets have been kept. This often means that the data accumulate without being looked at.
Indeed, the term ‘dark data’ is sometimes used in a restricted sense to describe data an organisation has collected but hasn’t yet analysed – data stored in metaphorical filing cabinets gathering dust but possibly containing all sorts of treasures. But this is a fairly innocuous kind of dark data. Most dark data – data whose values you don’t know or perhaps even data you don’t know you are missing – are potentially much more serious.
It is this kind of dark data that has led to business failures: lack of data about the intended market is one of the prime causes of failure in startups. Sometimes data are generated at such a rate that it is impossible to store them all, and numbers can arrive so rapidly that selection must be made on the fly, often using quick and dirty methods.
This means there is always the possibility that the discarded data might contain interesting discoveries. In the Large Hadron Collider, for example, about 30 million particle collisions occur per second. This is so many that data for only around 1,200 of them – roughly 0.004 per cent – are saved. As an article by Ethan Siegel in Forbes magazine put it: “We may have collected hundreds of petabytes, but we’ve discarded, and lost forever, many zettabytes of data: more than the total amount of internet data created in a year.” While the more interesting processes are selected, it inevitably prompts the question of whether the 99.996 per cent of the data which are dark contain useful information.
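To give a flavour of how such on-the-fly selection can be done in a principled way, here is a minimal sketch of reservoir sampling – my own illustration, not a method named in the article – which keeps a fixed-size uniform random sample of a stream far too large to store:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniform random sample of k items from a stream of unknown
    length, in a single pass and with only O(k) memory."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randrange(i + 1)   # item i survives with prob k/(i+1)
            if j < k:
                sample[j] = item
    return sample

# Keep five of a million readings without ever storing them all.
print(reservoir_sample(range(1_000_000), 5))
```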
The term ‘dark data’ comes from an analogy with astrophysics, where it was discovered that galaxies rotate more quickly than they should according to our understanding of gravity. An explanation was devised in terms of ‘dark matter’: something which has mass but is invisible to electromagnetic radiation, so we cannot see it. It is now thought that 23 per cent of the universe consists of this dark matter.
Dark data come in many flavours. They might be the values not measured in an experiment because the instrument had a ceiling beyond which it could not register, or the events that occurred too rapidly for your detector to catch them. They might be data missing from medical records because the topic was sensitive and people didn’t like to mention it. Exercises that collect data about people often allow them to opt in or opt out, potentially distorting the overall data set. In my book I give a taxonomy of 15 different types of dark data.
If dark data are ubiquitous and have potentially very serious consequences for understanding and for action, what can we do about them? How do we know something is missing? Does it matter? Is it misleading us, and how seriously?
Statisticians have developed tools to answer these questions. They are based on the general principle of using the data you do have to tell you about the data you don’t. Again, dark matter illustrates this: it was the data on galaxy rotation that led to the detection of dark matter. But the critical thing is to be aware of the dangers of what you don’t know.
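By way of illustration, one such tool is imputation: using the values you can see to model the ones you can’t. Here is a minimal sketch with entirely simulated data (the variables and numbers are my own assumptions, not examples from the book), using scikit-learn’s IterativeImputer:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
n = 10_000

# Two correlated measurements: a cheap proxy (a) and an expensive
# reading (b) that often goes unrecorded precisely when a is high.
a = rng.normal(size=n)
b_true = 2 * a + rng.normal(scale=0.5, size=n)
b = b_true.copy()
b[(a > 0.5) & (rng.random(n) < 0.8)] = np.nan  # value-linked missingness

# Model the dark values of b from a, instead of dropping those rows.
X = np.column_stack([a, b])
imputed = IterativeImputer(random_state=0).fit_transform(X)

print(f"true mean of b:          {b_true.mean():+.3f}")
print(f"complete-case mean of b: {np.nanmean(b):+.3f}")  # biased low
print(f"imputed mean of b:       {imputed[:, 1].mean():+.3f}")
```

Simply averaging the visible values of b gives a biased answer, because b is missing exactly when it would have been large; the imputation, leaning on the correlated a, largely corrects this.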
However, and here’s the good news, it’s not all downside. Surprisingly, you can actually use dark data to your advantage, enabling you to make better decisions and gain greater understanding – think blinding in clinical trials and blind analysis in physics to avoid bias, for example. I call this the strategic application of ignorance. The dark data perspective inverts the normal way of looking at things, so that the strategic casting of shadows can lead to greater understanding and greater illumination.
Dark Data: Why What You Don’t Know Matters, by David Hand, is published by Princeton University Press.
This story was published originally in Imperial 48/Summer 2020.