System that replaces human intuition with algorithms outperforms human teams

The goal of big-data analysis is to find hidden patterns with predictive power. Deciding which data "features" to examine, however, usually requires human intuition. In a database that contains, say, the start and end dates of various sales campaigns and weekly profits, the important information may not be the dates themselves but rather the intervals between them, or the averages of profits across those intervals.
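As a minimal illustration of that kind of feature derivation, the sketch below computes a campaign's duration and its average daily profit from raw start/end dates. The records and field names are invented for the example, not taken from any real data set.

```python
from datetime import date

# Hypothetical campaign records: start date, end date, and total profit.
campaigns = [
    {"start": date(2015, 1, 5), "end": date(2015, 1, 19), "profit": 7000.0},
    {"start": date(2015, 2, 2), "end": date(2015, 2, 23), "profit": 12600.0},
]

# Derived features: each campaign's duration in days and its average
# profit per day -- quantities computed *from* the dates rather than
# the dates themselves.
features = []
for c in campaigns:
    duration = (c["end"] - c["start"]).days
    features.append({
        "duration_days": duration,
        "avg_daily_profit": c["profit"] / duration,
    })
```

The point is that neither derived column exists in the original table; both are compositions of raw fields, which is exactly the step that normally requires human judgment.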

MIT researchers have developed a new system that aims to take the human element out of big-data analysis, not only searching for patterns but composing the feature set itself. To evaluate the first iteration of the system, they entered it in three data science competitions, pitting it against human teams tasked with finding predictive patterns in unfamiliar data sets. Of the 906 teams that competed across the three events, the researchers' "Data Science Machine" finished ahead of 615.

In two of the three competitions, the Data Science Machine's predictions were 94 and 96 percent as accurate as the winning entries; in the third, the figure was a more modest 87 percent. But where the human teams typically labored over their prediction algorithms for months, the Data Science Machine produced each of its entries in two to 12 hours.

The Data Science Machine was developed by Max Kanter, whose computer science master's thesis at MIT served as its foundation. "We view the Data Science Machine as a natural complement to human intelligence," Kanter says. "There is so much data available for analysis, and right now it's just sitting there doing nothing. So perhaps we can devise a solution that will at least get the process started and moving forward."

In a paper that Kanter will present next week at the IEEE International Conference on Data Science and Advanced Analytics, he and his thesis advisor, Kalyan Veeramachaneni, a research scientist at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), describe the Data Science Machine.

At CSAIL, Veeramachaneni co-leads the Anyscale Learning for All group, which uses machine learning techniques to solve real-world big data analysis problems like estimating wind farm sites' power-generation capacity or identifying students who may drop out of online courses.

"Feature engineering is one of the very critical steps that we observed from our experience solving a number of data science problems for industry," Veeramachaneni adds. "The first task is to determine which variables to extract or compose from the database, and that requires a lot of brainstorming."

For example, in predicting dropout, two important features are how long before a deadline a student begins working on a problem set and how much time the student spends on the course website relative to peers. MIT's online learning platform, MITx, doesn't record either of those numbers directly, but it does collect data from which they can be inferred.
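A toy sketch of that inference step might look like the following; the event log, action names, and deadline are all hypothetical stand-ins for whatever MITx actually records.

```python
from datetime import datetime

# Hypothetical clickstream events for one student: (timestamp, action).
events = [
    (datetime(2015, 3, 1, 10, 0), "view_page"),
    (datetime(2015, 3, 3, 21, 15), "start_pset"),
    (datetime(2015, 3, 4, 9, 30), "view_page"),
]
deadline = datetime(2015, 3, 5, 23, 59)

# Derived feature: hours between the student's first work on the
# problem set and the deadline -- never stored directly, but
# computable from the raw timestamps.
first_pset = min(t for t, a in events if a == "start_pset")
hours_before_deadline = (deadline - first_pset).total_seconds() / 3600
```

The same log could yield the second feature (time on site relative to peers) by summing session durations per student and comparing against the class average.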

Feature composition

Kanter and Veeramachaneni use a few different techniques to generate candidate features for data analysis. One is to exploit the structural relationships built into database design. Databases typically store different kinds of data in separate tables, with numerical identifiers indicating the relationships between them. The Data Science Machine tracks these relationships, using them as a cue for feature construction.

For example, one table might list retail prices, while another might list the items that each customer purchased. The Data Science Machine would begin by importing costs from the first table into the second. Then, noting the association of several different items in the second table with the same purchase number, it would perform a suite of operations to generate candidate features: total cost per order, average cost per order, minimum cost per order, and so on. As numerical identifiers proliferate across tables, the Data Science Machine layers operations on top of each other, finding minima of averages, averages of sums, and the like.

It also searches for "categorical data," which appears to be restricted to a limited range of values, such as days of the week or brand names. It then generates further feature candidates by dividing existing features up across categories.
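A minimal sketch of that category split, with invented transaction data: a single numeric feature (cost) becomes one candidate feature per value of the categorical field (brand).

```python
from collections import defaultdict

# Hypothetical transactions with a categorical field (brand)
# and a numeric feature (cost).
transactions = [
    {"brand": "Acme", "cost": 5.0},
    {"brand": "Acme", "cost": 7.0},
    {"brand": "Zenith", "cost": 4.0},
]

# Dividing the numeric feature across categories yields new
# candidates: one average-cost feature per brand value.
per_brand = defaultdict(list)
for t in transactions:
    per_brand[t["brand"]].append(t["cost"])

brand_avg_cost = {b: sum(cs) / len(cs) for b, cs in per_brand.items()}
```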

Once it has generated an array of candidates, it winnows them down by identifying those whose values seem to be correlated. Then it starts testing its reduced set of features on sample data, recombining them in different ways to optimize the accuracy of the predictions they yield.
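The winnowing step could be sketched as a greedy correlation filter like the one below. To be clear, this is an illustration in the spirit of the paragraph, not the paper's actual selection procedure; the candidate columns and the 0.95 threshold are invented.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = (sum((x - mx) ** 2 for x in xs)
             * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / denom

# Hypothetical candidate features, one column of values per sample.
# avg_cost is exactly total_cost / 2 here, so it carries no new signal.
candidates = {
    "total_cost": [14.0, 6.0, 22.0, 9.0],
    "avg_cost":   [7.0, 3.0, 11.0, 4.5],
    "n_items":    [2, 1, 2, 2],
}

# Greedily keep a feature only if it is not highly correlated with
# any feature already kept.
kept = []
for name, col in candidates.items():
    if all(abs(pearson(col, candidates[k])) < 0.95 for k in kept):
        kept.append(name)
```

Here `avg_cost` is dropped as redundant with `total_cost`, while `n_items` survives; the surviving set is what would then be recombined and tested against sample data.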

"The Data Science Machine is one of those unbelievable projects where applying cutting-edge research to solve practical problems opens an entirely new way of looking at the problem," says Margo Seltzer, a professor of computer science at Harvard University who was not involved in the work. "I believe that what they've done will very quickly become the standard."