With data as big as we can get it today, the scientific method doesn't work anymore. (Don't take my word for it. Listen to Sandy Pentland.)
A correlation between two factors is judged statistically significant if there is less than a 5%, or 1%, or 0.5% probability that the results would come out this way by chance alone. Even at the strictest of those levels, about 1 in 200 false hypotheses will show up as true out of sheer randomness. With tremendous data, we can test an effectively unlimited number of hypotheses, so plenty of them will look significant when they are not. As Sandy puts it, you can learn that people who drive Fords on Thursdays are more likely to get the flu. The correlation exists, but it's bullshit.
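To put a rough number on that, here is a minimal sketch in Python. It leans on one assumption: when a hypothesis is truly false (no real relationship), its p-value is uniformly distributed between 0 and 1, so a uniform random draw can stand in for each test. The function name and the count of 200,000 hypotheses are made up for illustration.

```python
import random


def count_false_positives(num_hypotheses, alpha, seed=0):
    """Count how many truly-null hypotheses pass a significance test.

    Assumption: under a true null, the p-value is uniform on [0, 1),
    so each test is simulated by a single uniform random draw.
    """
    rng = random.Random(seed)
    return sum(1 for _ in range(num_hypotheses) if rng.random() < alpha)


num_hypotheses = 200_000   # hypothetical: every one of them is false
alpha = 0.005              # the strictest threshold mentioned above

false_positives = count_false_positives(num_hypotheses, alpha)
expected = num_hypotheses * alpha

print(f"Expected about {expected:.0f} 'significant' results from pure noise")
print(f"Simulated: {false_positives}")
```

The expected count is just the number of hypotheses times the threshold, so testing more hypotheses produces proportionally more Fords-on-Thursdays findings.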
With big data, it's time to bring the word "significant" back to its regular-people meaning. We have to look for causality. We have to look for the micropatterns that lead to better health, smoother traffic, lower energy use. No more "this happened and this happened to the same people, so they must be related!" Causality is the difference between truth and mere publishability in an academic paper.
How can we find that causality? It is complex: many influences together trigger each event, and each of these factors is triggered by many influences, including each other. How are we to analyze this?
|A painfully simplified example: Jay's new web site|
Manufacturing has a tool that could be useful. Quality Function Deployment, and in particular the House of Quality tool, addresses the chains and webs of causality. As Chad Fowler explained yesterday at 1DevDayDetroit, the House of Quality starts with desired product characteristics. It identifies the relative importance of each characteristic; a list of measurable factors that influence the characteristics; and which factors influence which characteristics, how much, and in what direction. Magic multiplication formulas then calculate which factors are the most important to the final product.
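The "magic multiplication" can be sketched in a few lines of Python. This is one common way to do it, not necessarily the exact formula Chad showed: each factor's priority is the sum, over the characteristics it touches, of the characteristic's importance times the strength of the influence. The sign of the strength records direction and is ignored when ranking. The characteristics, factors, and numbers below are all hypothetical.

```python
def factor_priorities(importance, influence):
    """Score each factor by importance-weighted influence strength.

    importance: {characteristic: weight}
    influence:  {factor: {characteristic: signed strength}}
    The sign marks direction (negative = more of the factor hurts);
    only the magnitude counts toward the priority score.
    """
    return {
        factor: sum(importance[c] * abs(strength)
                    for c, strength in effects.items())
        for factor, effects in influence.items()
    }


# Hypothetical desired characteristics for a web site, with relative importance.
importance = {
    "fast page loads": 5,
    "easy to find content": 3,
    "looks professional": 2,
}

# Hypothetical measurable factors and how strongly each one
# influences each characteristic.
influence = {
    "page weight (KB)": {"fast page loads": -9, "looks professional": 3},
    "clicks to reach an article": {"easy to find content": -9},
    "number of fonts": {"looks professional": -3, "fast page loads": -1},
}

scores = factor_priorities(importance, influence)
for factor, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:4d}  {factor}")
```

With these made-up numbers, page weight comes out on top (5×9 + 2×3 = 51), ahead of clicks to reach an article (27) and number of fonts (11).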
But don't stop there. Take the factors and turn them into the target characteristics in the next House of Quality. Find factors that influence this new, more detailed set of characteristics. Repeat the determination of what factors influence what characteristics and how much.
|The factors from Iteration 1 become the goals in Iteration 2.|
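Chaining the houses can be sketched the same way. In this hypothetical Python example, the priority scores that come out of one iteration are fed in as the importance weights of the next. The scoring rule (importance times the magnitude of the influence, summed) is an assumption on my part, and every name and number is invented for illustration.

```python
def factor_priorities(importance, influence):
    """Importance-weighted sum of influence magnitudes for each factor."""
    return {
        factor: sum(importance[c] * abs(strength)
                    for c, strength in effects.items())
        for factor, effects in influence.items()
    }


def cascade(importance, iterations):
    """Run several Houses of Quality in sequence.

    Each iteration's factor scores become the next iteration's
    characteristic importances. Returns one score dict per iteration.
    """
    results = []
    for influence in iterations:
        importance = factor_priorities(importance, influence)
        results.append(importance)
    return results


# Iteration 1: hypothetical top-level goals and the factors behind them.
goals = {"fast page loads": 5, "easy to find content": 3}
iteration_1 = {
    "page weight (KB)": {"fast page loads": -9},
    "clicks to reach an article": {"easy to find content": -9,
                                   "fast page loads": -1},
}

# Iteration 2: the factors above are now the things to be explained.
iteration_2 = {
    "image compression": {"page weight (KB)": -9},
    "third-party scripts": {"page weight (KB)": 3},
    "menu depth": {"clicks to reach an article": 9},
}

first, second = cascade(goals, [iteration_1, iteration_2])
print(first)
print(second)
```

Here iteration 1 scores page weight at 45 and clicks at 32, and those numbers then weight iteration 2, where image compression (45×9 = 405) outranks menu depth (32×9 = 288) and third-party scripts (45×3 = 135).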
This kind of causality analysis is a lot of work. Creating this sad little example made my brain hurt. This analysis is no simple graph of heart attacks vs strawberry consumption across populations. On the upside, Big Data drastically expands our selection of measurable factors. If we can identify causality at a level this detailed, we can get a deeper level of information. We can get closer to truth.