5 Things Pharma Professionals Need to Know About Big Data
Author: Andrew Bryk (Research Analyst)
- Big Data has its roots in a set of statistical methods called econometrics
- A defining feature of finance, biology, and the social sciences is the apparent randomness of the behaviors being measured. Econometrics exists to separate the random and unpredictable features of life from the underlying drivers. The result is a set of algorithms (mathematical formulas) that describe these processes and allow us to make predictions.
- Econometrics depends on human intuition and context to function
- Creating a useful model requires selecting the right data and specifying how each variable relates to the response variable. Is the relationship linear? Or does the effect degrade over time? Is it a one-off change or a continuous process? Making the right choices requires contextual knowledge and persistent experimentation.
- This hand-built approach is infeasible with modern commercial datasets
- A dataset with 2 variables has 3 possible models. A dataset of 10 variables has 3.6 million. A dataset of 100 variables has more than a googol’s worth of possibilities. Asking a person to find the best model in this massive universe of choices is not realistic.
- Big Data allows us to use non-parametric techniques
- While Big Data creates a great deal of complexity, large datasets also enable the most complex modeling techniques we have. Non-parametric models don’t require an analyst to guess at the underlying relationship. They function largely autonomously and are typically more accurate than a model an analyst could specify by hand. These techniques used to be all but impossible to apply because they require gargantuan datasets. Now that those datasets exist, we can begin to use them.
- Small data is about tactics; Big Data is about strategy
- Small data analysts make decisions about tactics and operations: Which variables should I use? What kind of model should I specify?
- Big Data analysts make decisions about a broader analytic strategy: How many variables should I use? How sensitive should my model be?
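
As a rough check on the model counts quoted in the third point, the sketch below computes two common ways of counting candidate models for n variables: the number of non-empty subsets of variables (2^n - 1) and the number of orderings of those variables (n!). The article does not state which counting rule it uses, so both are shown as assumptions, and the function names are illustrative only.

```python
import math

GOOGOL = 10 ** 100


def subset_count(n: int) -> int:
    """Number of non-empty subsets of n candidate variables (assumed rule 1)."""
    return 2 ** n - 1


def ordering_count(n: int) -> int:
    """Number of ways to order n candidate variables, n! (assumed rule 2)."""
    return math.factorial(n)


for n in (2, 10, 100):
    subsets = subset_count(n)
    orderings = ordering_count(n)
    print(
        f"n={n}: subsets={subsets:.3e}, orderings={orderings:.3e}, "
        f"orderings exceed a googol: {orderings > GOOGOL}"
    )
```

Under these assumptions, 2 variables give 3 subsets, 10 variables give 3,628,800 orderings (about 3.6 million), and 100! exceeds a googol while 2^100 - 1 does not. Under either rule, the number of candidates grows far too quickly to search by hand.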
“KNN regression is a push-button technique and requires no input beyond raw data.”
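
To make the non-parametric idea concrete, here is a minimal sketch of KNN regression in one dimension. It assumes absolute difference as the distance measure and a plain average of the k nearest responses; the data are synthetic and the name `knn_predict` is illustrative. The analyst never specifies a functional form. The one remaining setting is k, which corresponds to the "how sensitive should my model be?" question in the fifth point.

```python
def knn_predict(xs, ys, query, k):
    """Predict the response at `query` as the mean response of the k nearest training points."""
    if not 1 <= k <= len(xs):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort training points by distance to the query and keep the k closest.
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / k


# Synthetic data: a curved relationship that the model is never told about.
xs = [i / 10 for i in range(101)]  # 0.0, 0.1, ..., 10.0
ys = [x * x for x in xs]           # y = x^2

# The same query point predicted at three sensitivity settings.
for k in (1, 5, 25):
    print(k, round(knn_predict(xs, ys, 5.0, k), 3))
```

A small k follows local detail closely, while a larger k averages over more neighbors and smooths the estimate, here pulling the prediction at x = 5 slightly away from the true value of 25.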