Big Data in Pharma: What is Big Data?
We are past the peak of the Big Data hype cycle, not least because about half of the content on the topic put out by marketing gurus, branding experts, and other non-practitioners has moved on from extolling Big Data to actively pointing out that fact. But marketers and other data-dependent roles have soured on Big Data so quickly not because its promise failed; it's because at no point in the discussion have we had a good definition of Big Data or a conversation about realistic applications.
There are reams of articles talking about 3, 4, or 5 Vs. There are even more pointing out that "it's not about Big Data, it's about Smart Data or Useful Data or Big Analysis." And then there are the back-to-basics types who assert that Big Data is useless without Big Questions. All of this tripe gets a lot of play on LinkedIn because there is a dearth of practitioners taking part in these discussions and scores of pretenders and bystanders who know just enough to sound credible.
I am one of those bystanders.
Big Data, the tech trend, exists for three reasons:
- We have lots of data, some of which is textual or otherwise weird
- We have computer systems that can manipulate large datasets
- We have techniques that can mine large datasets
Let’s get the two boring things out of the way first. Facebook exists, your iPhone sends your GPS coordinates to receivers in trash cans hidden throughout WalMart, and your Tweets are being analyzed by sociologists at Penn who think rain in New York makes the entire country sad. This data is useful for making predictions and understanding why people buy stuff (or the effect of a government policy or the impact of a road on wildlife migrations or whatever you like).
The second component is that we've evolved beyond SQL (Structured Query Language), the basis for MS Access and many other relational database systems. SQL isn't well suited to very large or loosely structured datasets (billions of records, free text, and the like), so it's often necessary to use beefier NoSQL ("not only SQL") systems to store and query them. Basically, we are approaching a post-spreadsheet world where the ways we normally store and manipulate data are no longer convenient.
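To make the contrast concrete, here is a minimal sketch using only Python's standard library. The table, field names, and values are invented for illustration: a relational store demands a fixed schema up front, while a document-style record can carry whatever shape each observation happens to have.

```python
import json
import sqlite3

# Relational (SQL) storage: every record must fit one fixed table schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, age INTEGER)")
conn.execute("INSERT INTO customers VALUES (1, 'Alice', 34)")
row = conn.execute("SELECT name, age FROM customers WHERE id = 1").fetchone()

# Document (NoSQL-style) storage: each record carries its own shape,
# which suits messy, text-heavy, or irregular data like tweets.
doc = json.dumps({"id": 1, "name": "Alice", "age": 34,
                  "tweets": ["rain again in New York..."]})
record = json.loads(doc)
```

Real NoSQL systems (document stores, key-value stores, and so on) add distribution and scale on top of this basic idea; the point here is only the difference in how the data is shaped.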
The third part is the most interesting. For a long time social scientists (and marketers, farmers, astronomers, and biologists) have been developing a family of techniques called regression. Basically regression lets me do this:
Let’s pretend that we’re plotting sales on the vertical axis and advertising on the horizontal axis. Regression lets me estimate the average increase in sales I get whenever I increase advertising spend. The line I’ve drawn through the cloud of data shows that relationship.
Now I can do this in many dimensions, with many variations, but the basic idea is that I have some response I’m interested in (will you open an email, will you buy this, etc) and some variables I think might predict that (age, income, etc). With that model I can make predictions or better understand the drivers of a trend or decision. The better my model is, the more accurate my predictions and the clearer my understanding.
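The one-variable version of that picture takes only a few lines of plain Python. The numbers below are invented: five observations of advertising spend and the sales that went with them, fit by ordinary least squares.

```python
# Toy data (invented): advertising spend (x) vs. sales (y).
ads   = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 5.0, 6.9, 9.1, 11.0]

n = len(ads)
mean_x = sum(ads) / n
mean_y = sum(sales) / n

# Ordinary least squares with one predictor: slope = cov(x, y) / var(x).
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(ads, sales))
         / sum((x - mean_x) ** 2 for x in ads))
intercept = mean_y - slope * mean_x

# The fitted line lets us predict sales at a spend level we haven't tried.
predicted_sales = intercept + slope * 6.0
```

Here the slope is the "average increase in sales per unit of advertising" from the paragraph above, and the intercept pins down where the line crosses the vertical axis.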
We’ve been able to do this for a long time. In a world with small datasets, this works great. Let’s say I’ve got two variables in my dataset, age and income, and I want to predict the amount you’ll buy on a trip to my store. That means there are three possible models:
- I model sales as a function of age
- I model sales as a function of income
- I model sales as a function of income and age
Now let's make my dataset a little bigger, with 10 variables to choose from. I now have the potential to make much more accurate predictions and get a clearer understanding of my customers' behavior, but I also have 1,023 possible models to choose from (every non-empty combination of the 10 variables), and far more once interactions and transformations are on the table. We routinely work with datasets of this size, and with intuition and contextual knowledge we can probably select an acceptable model. But what if our dataset contains 100 variables? Then there are more than a million trillion trillion possible combinations of variables (2^100 − 1, about 1.3 × 10^30) to select from.
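The counting behind that explosion is simple: if each candidate variable is either in the model or out, there are 2^p − 1 non-empty choices for p variables (and far more still once you allow interactions or transformations, which this count ignores). A one-line sketch:

```python
# Each of p candidate variables is either in the model or out,
# giving 2**p - 1 possible non-empty models (main effects only).
def n_models(p: int) -> int:
    return 2 ** p - 1

print(n_models(2))    # 3: age, income, or both
print(n_models(10))   # 1023
print(n_models(100))  # about 1.27e30
```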
Clearly, we have a problem. It would be impossible for an analyst to evaluate this many different models, let alone find the best one. This is where Big Data comes in. Rather than have the analyst guess at the correct model for the rest of their career, they instruct the computer (within certain constraints) to try many models at a time until it arrives at the best one it can find.
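Here is a brute-force sketch of that idea in plain Python: score every combination of variables by how well it fits and keep the winner. All the data and names are invented, and real systems use much cleverer search strategies (stepwise selection, regularization like the lasso) rather than literal enumeration, but the mechanics are the same: the machine, not the analyst, walks the space of models.

```python
from itertools import combinations

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination (A is small and well-conditioned)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))  # partial pivoting
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_rss(x_cols, y):
    """OLS with an intercept via the normal equations; returns residual sum of squares."""
    cols = [[1.0] * len(y)] + [list(c) for c in x_cols]  # prepend intercept column
    k = len(cols)
    A = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(k)] for i in range(k)]
    rhs = [sum(a * b for a, b in zip(cols[i], y)) for i in range(k)]
    beta = solve(A, rhs)
    preds = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(len(y))]
    return sum((p - yi) ** 2 for p, yi in zip(preds, y))

def best_subset(variables, y):
    """Score every non-empty subset of variables; keep the lowest-error model."""
    best_rss, best_vars = None, None
    names = list(variables)
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            rss = fit_rss([variables[name] for name in subset], y)
            if best_rss is None or rss < best_rss:
                best_rss, best_vars = rss, subset
    return best_rss, best_vars

# Invented data: sales depend only on age; income is noise.
data = {"age": [20, 30, 40, 50], "income": [55, 48, 62, 51]}
sales = [10, 15, 20, 25]
rss, subset = best_subset(data, sales)  # the search settles on a model using "age"
```

With 2 variables this tries 3 models; with 100 it would be hopeless to enumerate, which is exactly why smarter search (and the computing muscle behind it) matters.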
That’s the true power of big data. It’s not about what kind of analysis you can do. It’s about analysis that you can’t do at all.