Missing: Statistical Due Diligence in Big Data Analysis

Big data has the answers. Lots of good ones, and lots of bad ones, too.

Sometimes separating the wheat from the chaff requires disciplines from outside the realm of data science.

Bill Luker writes in the Predictive Analytics Times:

Big data is and has been less easy to build, manage, and most importantly, analyze, than originally claimed… The facts are that we simply cannot analyze Big Datasets without the tools and empirically grounded theory from what I call the Statistical Data Sciences—known everywhere as just plain statistics. And CS-IT Data Science (with perhaps the exception of machine learning, as in automated applied statistics) has backed itself into a blind alley by dismissing statistics.

But with more and more data, CS-IT Data Scientists and data scientists from all disciplines have a greater need for the tools and statistical theory of data collection, selection, and construction; bivariate and multi-variate correlation analysis; and sampling. Ultimately R&V [reliability and validity] analysis also demands these analytic and testing routines, not the basically zero attention it too often receives.

[…]

The point of this discussion is not to bash Big Data; I’ve done it, along with many others. It’s that if you don’t pay attention to things like R&V (out of statistics research and the accompanying literature) that do require thought and adaptations to Big Data, unreliable measurements can really bite you.
