Anyone who works with data can tell you what a chore it is to keep it clean. Errors, duplications, omissions, and other ‘dirty data’ throw off data analytics, sometimes significantly. The only difference between dealing with normal-sized data sets and big data is that the errors and omissions grow right along with the size of the data. Data scientists are continually working to determine when the data is sending them real signals, based on factual information, and when their results are nothing more than illusions based on noise from dirty data.
It’s Hard to Find the Right Combination of Data Sets, Variables, and Algorithms Even When the Data Doesn’t Lie
As the volume and scope of data continue to grow, the number of relationships that modeling could potentially uncover grows without practical limit. Sometimes what looks like a predictive variable actually isn’t. Finding actual correlations through data analytics means sniffing out the right data, specifying the right variables, and analyzing them with the most suitable algorithm. After discovering the right combination, the patterns become quite clear. Unless, of course, they’re based on a lie. Your data is lying to you.
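How a variable can look predictive when it isn’t can be shown with a small sketch. The example below is an illustration only, not a method from any particular toolkit: the function names, the 20-row sample size, the 1,000 candidate variables, and the random seed are all assumptions chosen for the demonstration. It fills a target column and every candidate column with pure random noise, then reports the strongest correlation it can find.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_noise_correlation(n_rows=20, n_candidates=1000, seed=0):
    """Largest |r| between a random target and pure-noise candidate variables.

    Every column is independent Gaussian noise, so any correlation
    found here is an accident of searching many candidates.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(candidate, target)))
    return best


print(f"strongest correlation found in pure noise: "
      f"{strongest_noise_correlation():.2f}")
```

With few rows and many candidates, the best match is usually strong enough to pass for a real finding, even though nothing in the data is related to anything else. The more variables you search, the more of these accidents you should expect.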
It’s actually pretty easy for data to lie to us. Most humans are guilty of analyzing the data just long enough and in just the right way to find what they want to find or prove what they want to prove. Even mathematical and scientific geniuses can make honest mistakes when it comes to logic and math.
In his book Thinking, Fast and Slow, Nobel laureate in economics Daniel Kahneman muses about how humans, no matter how smart, talented, or dedicated to discovering the truth, see patterns where there are none yet entirely miss the deep statistical patterns that actually exist all around us in the real world. This especially holds true when the patterns in the data go against what we believe, what we’ve been taught, or what we hope to find in our research.
Data Scientists Usually Stop the Data Analytics as Soon as They’ve ‘Proven’ Their Hypothesis
When a data scientist who takes pride in their ability to explore the data and find correlations queries the data, there is a deep yet unconscious desire to validate the assumptions on which they’ve based their entire career. No data scientist wants to believe that they’ve done inferior work, merely rubber-stamping their own already-held conclusions and convictions. Yet after a scientist has found the “proof” they’re looking for, it’s difficult to justify deeper investigation into the data. Additional data analytics could uncover some truths that are mighty inconvenient.
Methods to Keep Data Analytics (and Data Scientists) Honest
There are, however, some methods that can help the data scientist who is truly dedicated to the truth find it, even when the data wants to hide it and lie about it:
- Ensemble Learning: Ensemble learning is a process that strategically generates and combines multiple models (like classifiers or experts) to solve a computational problem. The independent models all draw on the same data set but use different samples of it, and each can use different algorithms, different variables, and so on. This can mean more confidence in the results of the data analytics, because there’s far less chance that the researcher has stumbled on dirty data, data noise, or an anomaly.
- Robust Modeling: Robust modeling is a method of due diligence when it comes to modeling. It determines whether the predictions being made are stable with regard to alternative data sets, sampling techniques, algorithms, time periods, and so on.
- A/B Testing: A/B testing (often used in online marketing to compare the performance of two versions of a website) uses two or more models, holding some variables constant while others differ. In real-world use cases, repeated runs of incrementally adjusted A and B models converge on a collection of variables with the highest predictive value.
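The first two methods can be sketched in a few lines. What follows is a minimal illustration under stated assumptions, not a reference implementation: it uses bootstrap resampling (bagging), the simplest form of ensemble, in which every model runs the same algorithm on a different sample, whereas a fuller ensemble would also vary algorithms and variables. The helper names (`fit_slope`, `bootstrap_slopes`, `make_data`), the synthetic data, and the seeds are all hypothetical.

```python
import random
import statistics


def fit_slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


def bootstrap_slopes(points, n_models=200, seed=0):
    """Fit one model per bootstrap resample (rows drawn with replacement)."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_models):
        sample = rng.choices(points, k=len(points))
        slopes.append(fit_slope(sample))
    return slopes


def make_data(slope, n=50, noise=5.0, seed=1):
    """Synthetic data: y = slope * x + Gaussian noise (illustration only)."""
    rng = random.Random(seed)
    return [(float(x), slope * x + rng.gauss(0, noise)) for x in range(n)]


for name, data in [("real signal", make_data(slope=2.0)),
                   ("pure noise", make_data(slope=0.0))]:
    slopes = bootstrap_slopes(data)
    # Ensemble estimate: the average over all resampled models.
    # Robustness check: how much the estimate moves between resamples.
    print(f"{name}: mean slope {statistics.fmean(slopes):.3f}, "
          f"spread {statistics.stdev(slopes):.3f}")
```

If the resampled estimates cluster tightly around a value well away from zero, the relationship holds up across samples. If they scatter around zero, the pattern is more likely noise than signal.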
Through disciplined use of one or more of these methods, the data scientist can have more confidence that the patterns and correlations they discover in big data analytics are true and accurate. In other words, they can catch their data in blatant lies.