Artificial intelligence may be poised to ease the shortage of data scientists who build models that explain and predict patterns in the ocean of “Big Data” representing today’s world. An MIT startup’s computer software has proved capable of building better predictive models than the majority of human researchers it competed against in several recent data science contests.
Until now, well-paid data scientists have relied on human intuition to create and test computer models that can explain and predict patterns in data. But MIT’s “Data Science Machine” software represents a fully automated process capable of building such models by identifying relevant features in raw data. The tool could make human data scientists even more effective by letting them build and test predictive models in far less time. It might also help more individuals and companies harness the power of Big Data without the aid of trained data scientists.
“I think the biggest potential is for increasing the pool of people who are capable of doing data science,” Max Kanter, a data scientist at MIT’s Computer Science and AI Lab and co-creator of the Data Science Machine software, told IEEE Spectrum. “If you look at the growth in demand for people with data science abilities, it’s far outpacing the number of people who have those skills.”
The Data Science Machine can automatically create accurate predictive models based on raw datasets within two to 12 hours; a team of human data scientists may require months. A paper on the Data Science Machine will be presented this week at the IEEE International Conference on Data Science and Advanced Analytics being held in Paris from 19–21 Oct.
Trained data scientists, who draw salaries above $100,000 on average, remain a coveted but scarce resource for companies as diverse as Facebook and Walmart. In 2011, the McKinsey Global Institute estimated that the United States alone might face a shortage of 140,000 to 190,000 people with the analytical skills necessary for data science. A 2012 issue of the Harvard Business Review declared data scientist the sexiest job of the 21st century.
Such high demand for data scientists stems from Big Data’s revolutionary promise: tapping into vast collections of data (whether the online behavior of social media users, the movements of financial markets worth trillions of dollars, or the billions of celestial objects spotted by telescopes) to explain and predict patterns in huge datasets. Models built on that data could help companies predict the future behavior of individual customers or aid astronomers in automatically identifying an object in the starry nighttime sky.
But how do you transform a sea of raw data into information that can help businesses or researchers identify and predict patterns? Human data scientists usually have to spend weeks or months working on their predictive computer algorithms. First, they sift through the raw data to identify key variables that could help predict the behavior of related observations over time. Then they must continuously test and refine those variables in a series of computer models that often use machine learning techniques.
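To make that two-step workflow concrete, here is a minimal sketch in Python. It is illustrative only and is not the Data Science Machine's method: the purchase records, the `returned` labels, and the helper names `build_features` and `threshold_accuracy` are all hypothetical. It first aggregates raw rows into candidate variables per customer, then tests each variable with a simple threshold rule to see which predicts best.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw data: one row per purchase (customer_id, amount).
purchases = [
    ("a", 20.0), ("a", 35.0), ("a", 50.0),
    ("b", 5.0),
    ("c", 12.0), ("c", 8.0),
    ("d", 60.0), ("d", 40.0), ("d", 30.0), ("d", 10.0),
]
# Hypothetical labels: did the customer come back the following month?
returned = {"a": True, "b": False, "c": False, "d": True}


def build_features(rows):
    """Step 1: aggregate raw rows into one record of candidate variables per customer."""
    amounts = defaultdict(list)
    for customer_id, amount in rows:
        amounts[customer_id].append(amount)
    return {
        cid: {"n_orders": len(v), "total": sum(v), "mean_amount": mean(v)}
        for cid, v in amounts.items()
    }


def threshold_accuracy(features, labels, name, threshold):
    """Score the rule 'predict True when the feature is >= threshold'."""
    hits = sum((features[cid][name] >= threshold) == labels[cid] for cid in labels)
    return hits / len(labels)


features = build_features(purchases)

# Step 2: try every candidate variable with a few thresholds and keep the best.
candidates = [
    (name, t)
    for name in ("n_orders", "total", "mean_amount")
    for t in (3, 25, 100)
]
best = max(candidates, key=lambda c: threshold_accuracy(features, returned, *c))
print(best, threshold_accuracy(features, returned, *best))
```

In practice the scoring step would use held-out data and a real machine learning model rather than a single threshold tuned on the same four customers. The sketch only shows the shape of the loop that data scientists repeat by hand.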
That time-consuming part of a data scientist’s job inspired Kanter, an MIT grad student at the time, and Kalyan Veeramachaneni, a research scientist at MIT’s Computer Science and AI Lab who acted as Kanter’s master’s thesis advisor, to try creating a computer program that could automate the biggest bottlenecks in data science.
Previous software aimed at solving such data science problems has tended to be one-dimensional, focusing on problems particular to specific industries or fields. But Kanter and Veeramachaneni wanted their Data Science Machine software to be capable of tackling any general data science problem. Veeramachaneni in particular drew on his experience of seeing similar connections among the many industry data science problems he had worked on during his time at MIT.