Dark Data: Why What You Don’t Know Matters

I like data and I like books about data. Unsurprisingly, I found the book Dark Data: Why What You Don’t Know Matters interesting to read. The book is packed with fun and interesting examples of what data can and cannot tell us. For people who teach introductory statistics and are running low on examples, I can recommend consulting this book.

That being said, I have some issues with the book. While the book is about missing data, only a small part of it deals with missing data as defined in statistics textbooks. This is not necessarily a problem, but the book does not succeed in fully connecting the notion of dark data with the traditional understanding of missing data. As a result, after reading the book, almost any issue with data can be framed as dark data, including issues that have very little to do with data.

To distinguish between the different kinds of dark data, the book presents a list of 15 types of dark data (DD):

  1. Data We Know Are Missing
  2. Data We Don’t Know Are Missing
  3. Choosing Just Some Cases
  4. Self-Selection
  5. Missing What Matters
  6. Data Which Might Have Been
  7. Changes with Time
  8. Definitions of Data
  9. Summaries of Data
  10. Measurement Error and Uncertainty
  11. Feedback and Gaming
  12. Information Asymmetry
  13. Intentionally Darkened Data
  14. Fabricated and Synthetic Data
  15. Extrapolating beyond Your Data

If you look at the list and think that these types of dark data are not mutually exclusive, it is because they are not mutually exclusive. Accordingly, the list does not add up and does not provide a unified framework to help the reader understand the various characteristics of dark data. To make matters worse, the overview of the different types is not used to structure the book or its chapters. On the contrary, the various types of dark data are introduced at various places, out of sequence, and some are used multiple times and in relation to other types. I guess this is meant to help the reader and show the complexities of the different types of dark data, but it does not work.

The reason I believe the book serves as a good introduction to various topics in statistics is that it uses several examples that you find in introductory statistics books, including Simpson’s paradox, regression towards the mean, correlation and causation, and measurement validity. Dark Data 9 (Summaries of Data), for example, is about how the average can hide important information (that’s as introductory as it gets).
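To make the point about summaries hiding information concrete, here is a minimal sketch of Simpson’s paradox. The counts are the widely cited kidney stone treatment figures, used purely as an illustration; they are not taken from the book, and the helper names are my own.

```python
# (successes, patients) for each treatment, split by stone size.
# Illustrative counts from the classic kidney stone example, not from the book.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, total):
    """Success rate within a single subgroup."""
    return successes / total


def overall_rate(groups):
    """Success rate after pooling all subgroups of one treatment."""
    successes = sum(s for s, _ in groups.values())
    total = sum(n for _, n in groups.values())
    return successes / total


# Within each stone size, treatment A has the higher success rate...
for size in ("small", "large"):
    a = rate(*counts["A"][size])
    b = rate(*counts["B"][size])
    print(f"{size:>5} stones: A = {a:.3f}, B = {b:.3f}")

# ...but once the subgroups are pooled, treatment B looks better.
print(f"  all stones: A = {overall_rate(counts['A']):.3f}, "
      f"B = {overall_rate(counts['B']):.3f}")
```

The pooled rates reverse the subgroup comparison because treatment A was mostly given to the harder cases (large stones), which is exactly the kind of information a single summary number hides.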

The book is good at providing great examples but rarely focuses on actual solutions or tools that can help us work with “dark data” (beyond being aware of such issues). While the book describes the different types of missing data (missing completely at random, missing at random and not missing at random), there is very little information on how to make the most of the information we have available and minimize bias in relation to the different issues we encounter when working with “dark data”. I know it’s not the point of the book to deal with listwise deletion, imputation etc., but I would have liked to see the book deal a little more with the statistics of missing data.
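As a rough illustration of why the type of missingness matters, the sketch below simulates a variable, removes values in two different ways, and compares the mean after listwise deletion. Everything here is an assumption made for illustration (the distribution, the missingness probabilities, and the function names); none of it comes from the book.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Simulated "true" data: 20,000 draws from a normal distribution (mean 50, sd 10).
values = [rng.gauss(50, 10) for _ in range(20_000)]


def drop_mcar(data, p_missing, rng):
    """Missing completely at random: every value has the same chance of being lost."""
    return [v for v in data if rng.random() >= p_missing]


def drop_mnar(data, threshold, p_high, p_low, rng):
    """Not missing at random: the chance of being lost depends on the value itself."""
    kept = []
    for v in data:
        p_missing = p_high if v > threshold else p_low
        if rng.random() >= p_missing:
            kept.append(v)
    return kept


true_mean = mean(values)

# Listwise deletion: simply analyse the cases that remain.
mcar_mean = mean(drop_mcar(values, 0.3, rng))
mnar_mean = mean(drop_mnar(values, 50, 0.6, 0.1, rng))

print(f"full data:           {true_mean:.2f}")
print(f"MCAR, complete-case: {mcar_mean:.2f}")
print(f"MNAR, complete-case: {mnar_mean:.2f}")
```

Under these assumptions, the complete-case mean stays close to the full-data mean when values are missing completely at random, but is pulled downwards when high values are more likely to be missing. Simple fixes such as replacing missing values with the observed mean would not remove that bias, since they only reuse the already biased estimate.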

In sum, it’s an OK book. If you are already familiar with the concepts above, you don’t need to read the book. However, if you have little to no experience with statistics, this book might serve as a good primer on several problems and issues we encounter when we use and analyse data.