A joke's making the rounds in the analytics world. Question: What’s a data scientist? Answer: An intelligence practitioner from San Francisco.
An even more cogent aphorism: A data scientist can program better than a statistician, and has more statistical chops than a programmer.
There's little doubt that ours is now a data-driven world. Business intelligence, predictive analytics, big data and data science dominate business strategy discussions, while companies and vendors jockey to “compete on analytics”. New businesses revolving around data products are emerging daily. Indeed, 50 percent of the customers my company, Inquidia Consulting, does business with today didn't exist ten years ago. The “products” these companies sell? Data and analytics.
We've come a long way from the nascent data warehousing and business intelligence of the early '90s. Then and still today, BI, done properly, helps businesses rigorously measure and manage their performance. Data marts, ad-hoc reporting, OLAP and dashboards were (and remain) the primary tools of business measurement. The major challenge with BI – and any variant of data-driven business for that matter – is to correctly and efficiently integrate data from multiple, disparate sources. Now, though BI remains critical, its focus on “history” is often viewed as limiting, and more companies have progressed to “super-crunching”, the use of historical data to predict the future – i.e. predictive analytics.
Serving the Data-Driven Masters
The theoretical underpinnings of predictive analytics come from both statistics and computer science. The field of statistical learning sits at the confluence of the two disciplines, embracing the theoretical rigor of statistics with the practical, outcome-obsessed orientation of CS. Rather than as a next step to BI, however, predictive analytics is probably best seen as a tool that serves many data-driven masters.
There's no shortage of Masters in Predictive Analytics programs in the university marketplace today. Many are excellent, preparing students to use models emanating from both statistical and computer science worlds.
One big problem though: many PA students go directly from undergrad to grad school, without the benefit of a stop in the business world to “experience” data. What would be much better for PA-hiring employers is for the new grads to first spend a few years in a BI role doing data integration and slicing-dicing-drilling OLAP cubes – developing an understanding of how to integrate and explore data.
Delineation of Data Science
O'Reilly Unix author and industry expert Mike Loukides distinguishes data science from other intelligence disciplines as “not just an application with data; it's a data product. Data science enables the creation of data products.” He notes Google's PageRank algorithm and Amazon's recommendation engine that exploits the exhaust of searches as examples of data science apps.
A characteristic of data science similar to BI's data integration is “data conditioning”, which includes “mashups” and “munging” manipulations with tools such as Perl, Python and Ruby.
DS is also comfortable working with missing and incongruous data.
“In data science, what you have is frequently all you're going to get. It's usually impossible to get 'better' data, and you have no alternative but to work with the data at hand.” One mitigating factor: you can often just kill the missing data problem with sheer volume in DS. Relatedly, DS appears to be content with “approximate” answers. “Most data analysis is comparative: if you're asking whether sales to Northern Europe are increasing faster than sales to Southern Europe, you aren't concerned about the difference between 5.92 percent annual growth and 5.93 percent.”
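The conditioning and “approximate answer” ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sales records, the region names, and the helper functions `parse_amount`, `condition` and `growth_pct` are all hypothetical, and simply skipping unusable rows is one basic way of working with the data at hand, not a method taken from Loukides.

```python
from collections import defaultdict

# Hypothetical raw records (region, year, sales) with messy and missing values.
RAW_ROWS = [
    ("Northern Europe", "2010", "1,000"),
    ("Northern Europe", "2011", "1,100"),
    ("northern europe ", "2011", ""),     # missing amount
    ("Southern Europe", "2010", "2000"),
    ("Southern Europe", "2011", "n/a"),   # unusable amount
    ("Southern Europe", "2011", "2,100"),
]


def parse_amount(text):
    """Return the amount as a float, or None if the field is missing or unparseable."""
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return None


def condition(rows):
    """Normalize region names, skip unusable rows, and total sales by (region, year)."""
    totals = defaultdict(float)
    for region, year, amount in rows:
        value = parse_amount(amount)
        if value is None:
            continue  # work with the data at hand: skip what cannot be used
        totals[(region.strip().title(), int(year))] += value
    return dict(totals)


def growth_pct(totals, region, start, end):
    """Percentage growth in total sales for a region between two years."""
    return (totals[(region, end)] / totals[(region, start)] - 1) * 100


totals = condition(RAW_ROWS)
north = growth_pct(totals, "Northern Europe", 2010, 2011)
south = growth_pct(totals, "Southern Europe", 2010, 2011)

# The comparative question only needs the ordering, not the last decimal place.
north_is_faster = north > south
print(f"Northern Europe: {north:.1f}%  Southern Europe: {south:.1f}%  "
      f"north faster: {north_is_faster}")
```

The final comparison reflects the point of the quotation: the business question is which region is growing faster, so growth figures rounded to one decimal place are precise enough.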
Data science is also obsessed with handling “big data – when the size of the data itself becomes part of the problem.” For DS, this means the database structures that serve BI don't adequately scale for their problems. “Most of the organizations that have built data platforms have found it necessary to go beyond the relational database model. Traditional relational database systems stop being effective at this scale.”
Monica Rogati of LinkedIn confirms that DS is consumed with supporting the development of data products. She notes, “On one side, I’ve been working on building products. The other side is finding interesting stories in the data.”
Practically, data scientist Drew Conway notes: “First, one must have hacking skills (which) in this context mean proficiency working with large, unstructured chunks of electronic data. Second, one needs a basic understanding of mathematics and statistics. Finally, and perhaps most importantly, a data scientist must have some substantive expertise in the data being analyzed.” Former Facebook DS innovator Jeff Hammerbacher described a day in the life of a data scientist: “On any given day, a team member could author a multistage processing pipeline in python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization.”
Some data science thought leaders, though, see DS as simply the application of sophisticated analytics algorithms for business gain. For them, the data integration challenges are less important, or perhaps less sexy – the thinking being that the DI is likely to be automated, while PA is where the brainpower resides. I disagree. Data science is every bit as much, if not more, about data work as it is about analytics. And the analytics side of DS can probably be automated more easily than the data integration.
A few years ago, I saw firsthand a practical difference between the business intelligence and data science worlds. Attending the Netezza conference, I hobnobbed with middle-aged peers like myself. Several weeks later, I attended Strata in Santa Clara, where most participants could have been my kids.
The maturity divide between data science and BI carries with it a number of cultural differences. At present, BI is probably more methodical and bureaucratic than DS, though impatient-with-IT DS'ers argue that's a good thing. I suspect with maturity comes a governance “advantage” for BI as well. DS seems unencumbered by these “shackles,” but will probably start to look more like BI organizationally in time. Indeed, I believe that BI's methodical, governed approach will positively impact DS, just as DS's get-it-done intolerance of sloth and bureaucracy will disrupt BI for the better.
The age differences show in platform software choices as well. Young DS'ers arrive in commerce from academia armed with the open source tools they learned in school: Perl/Python/Ruby for data integration, MySQL and Postgres for database management, R for analytics and graphics and, increasingly, cloud computing and the Hadoop ecosystem for big data handling. Traditional BI is foreign.
BI'ers, in contrast, are more likely to have settled in over the years on proprietary offerings from big technology vendors for their BI tasks – e.g. Informatica or DataStage for data integration, Oracle, Teradata or IBM Netezza for database management, BusinessObjects or Cognos for query and reporting, and SAS or SPSS for analytics.
With maturity also comes a work group size difference that promotes a wider division of labor in BI than in DS. In large BI shops now there are business analysts, data analysts, DBAs, infrastructure specialists, developers, user experience experts, analytics experts, statisticians, et al. While the more sophisticated DS groups are rapidly growing and diversifying, many are still relatively small, with jack-of-all-trades contributors.
My sense is that at least some convergence of the best of BI and DS is now starting to emerge, to the ultimate benefit of data-driven business. Fueled by success, DS is growing larger and accepting both division of labor and governance, a development similar to that of BI over the last 20 years. At the same time, BI teams have had their cages rattled by the can-do attitude of DS; many have responded positively with more productivity and less IT bureaucracy. It's an exciting time for data-driven business.