Big Data is More Hazardous When it’s Wider

When we find an interesting effect in the data during analytics work, we always need to ask the standard statistical question: how likely is it that this result occurred by chance?

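A p-value of less than 0.05 or 0.01 can be useful when testing a single idea in a traditional experimental approach, but it can quickly become meaningless in the face of big data (and especially wide data), where thousands of ideas can be tested at a time. At a 0.05 threshold, roughly 1 in 20 tests of pure noise will look significant, so 1,000 such tests would be expected to throw up about 50 false positives.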

Dr. John Elder spoke about this last year at the Predict Conference in Dublin and termed it the problem of vast search. “The best result stumbled upon during a vast search has a much greater chance of being spurious than if the vast search was not performed and the effect was the result of a single experiment,” said Dr. Elder.

Think about this example: each player has a 1 in 10 million chance of winning the lottery, and there are 20 million people playing. It is very unlikely that any particular person will win, but it is not at all unlikely that someone will.
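The arithmetic behind the lottery example can be sketched in a few lines of Python. This is only an illustration: it assumes each of the 20 million players holds exactly one ticket and that tickets win independently of one another, neither of which is stated above, and the variable names are made up for the sketch.

```python
import math

# Figures from the example above; independence between tickets is an assumption.
p_win = 1 / 10_000_000      # chance that a single ticket wins
n_players = 20_000_000      # number of tickets in play (one per player, assumed)

# Average number of winners per draw.
expected_winners = p_win * n_players

# P(at least one winner) = 1 - (1 - p)^n.
# log1p/expm1 keep the calculation numerically stable for tiny p and huge n.
p_someone_wins = -math.expm1(n_players * math.log1p(-p_win))

print(f"Expected winners: {expected_winners:.1f}")
print(f"P(at least one winner): {p_someone_wins:.3f}")
```

Under those assumptions there are about two winners on average, and the chance that at least one person wins is roughly 86%, even though each individual's chance is tiny.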

Studying the lucky person who did win the lottery (what they eat for breakfast, what colour socks they wear, what their favourite colour is, and so on) will not increase your chances of winning the lottery in future.

There is an analogy here with a vast search for hypotheses using big data. If you test thousands of hypotheses against a big data set, it is not unlikely that you will find some that look good. But be careful: if you find one, is it an important effect or just random chance in the data? For a more detailed explanation, see Dr. John Elder’s paper Orange Cars Aren't Lemons?, in which he works through a nice used car data example.
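A small simulation makes the vast search problem concrete. The sketch below is not taken from Dr. Elder's paper; it is a hypothetical setup in which both the target and 1,000 candidate features are pure random noise, so every "discovery" is spurious by construction. The cutoff 1.96 / sqrt(n) is the usual large-sample approximation to a two-sided 5% threshold for a correlation coefficient, and all names are illustrative.

```python
import math
import random

rng = random.Random(42)          # fixed seed so the run is repeatable
n_rows, n_features = 200, 1000   # a "wide" data set: many candidate features

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# The target is noise, and so is every feature: there is nothing real to find.
target = [rng.gauss(0, 1) for _ in range(n_rows)]

# Approximate |r| needed for p < 0.05 in a single two-sided test.
threshold = 1.96 / math.sqrt(n_rows)

false_hits = 0
best = 0.0
for _ in range(n_features):
    feature = [rng.gauss(0, 1) for _ in range(n_rows)]
    r = abs(pearson(feature, target))
    best = max(best, r)
    if r > threshold:
        false_hits += 1

print(f"Features passing the single-test cutoff: {false_hits} of {n_features}")
print(f"Strongest correlation found in pure noise: {best:.3f}")
```

Around 5% of the noise features should clear the single-test cutoff, and the best of them will look considerably stronger than the cutoff itself. That best result is exactly the one a vast search would report.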

Target shuffling is a rarely used tool that lets you estimate the chances that the “effect” you are seeing is simply random noise or blind luck; in other words, an accident. As the name suggests, it randomly shuffles the target (outcome) values so that they no longer line up with the inputs, repeats the same analysis on the shuffled data many times, and uses those results to calibrate how reliable the original result is.
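Here is a minimal sketch of how a target shuffling test might look. It is not Dr. Elder's implementation or anything from the Expert Models platform: the "analysis" is assumed to be a simple search for the feature with the strongest absolute correlation to the target, and the function names, data shapes, and number of shuffles are all hypothetical choices made for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def best_score(features, target):
    """The 'vast search': strongest |correlation| over all candidate features."""
    return max(abs(pearson(col, target)) for col in features)

def target_shuffle(features, target, n_shuffles=100, seed=0):
    """Return the observed best score and the share of shuffled runs that match it."""
    rng = random.Random(seed)
    observed = best_score(features, target)
    shuffled = list(target)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)                 # break any real input/target link
        if best_score(features, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimated chance is never exactly zero.
    return observed, (hits + 1) / (n_shuffles + 1)

# Illustrative data: 20 noise features, 60 rows (hypothetical sizes).
rng = random.Random(1)
features = [[rng.gauss(0, 1) for _ in range(60)] for _ in range(20)]

noise_target = [rng.gauss(0, 1) for _ in range(60)]            # no real effect
real_target = [x + 0.1 * rng.gauss(0, 1) for x in features[0]]  # driven by feature 0

print("noise target:", target_shuffle(features, noise_target))
print("real target: ", target_shuffle(features, real_target))
```

The second value returned is the proportion of shuffled runs whose best result was at least as good as the real one. Because the whole search is repeated on each shuffle, that proportion accounts for how many hypotheses were tried, which a single-test p-value does not.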

Dr. Elder is a key promoter of this method, which is currently being rediscovered in multiple fields as people start to question the results of their vast-search machine learning projects before betting the farm on them.

To further complicate the matter, Dr. Elder explained the pitfalls of cognitive bias and recommended Kahneman’s book Thinking, Fast and Slow for further reading.

So, a key lesson for me at PAW was a reinforcement of the solution to the vast search problem: apply a target shuffling method. This is a tool we are looking to build into the Expert Models platform.

Further Reading:

Highlights from the Inaugural Predict Conference 

John Elder – Orange Cars Aren't Lemons?

The Expert Models platform

Recommended books:

Daniel Kahneman – Thinking, Fast and Slow  

Nassim Nicholas Taleb – Fooled by Randomness

Nate Silver – The Signal and the Noise  

David Leinweber – Nerds on Wall Street

Jeff Deal, Gerhard Pilcher – Mining Your Own Business

Written by Cronan McNamara on April 15 2016
