The goal of big-data analysis is to find hidden patterns with predictive
ability. However, deciding which data "features" to examine often involves
using human intuition. In a database containing, say, the start and end dates of various sales promotions and weekly profits, the crucial information might not be the dates themselves but the intervals between them, or the averages of profit across those intervals.
MIT researchers have developed a new system that not only looks for
patterns but also develops the feature set, with the goal of eliminating the
human factor from big-data analysis. To evaluate it, they entered a prototype of the system into three data science competitions, in which it went up against human teams to find predictive patterns in unfamiliar data sets. Of the 906 teams that competed across the three contests, the researchers' "Data Science Machine" finished ahead of 615.
In two of the three competitions, the Data Science Machine's predictions were 94 percent and 96 percent as accurate as the winning entries. In the third, the figure was a more modest 87 percent. And while the human teams typically spent months refining their prediction algorithms, the Data Science Machine produced each of its entries in between two and 12 hours.
The Data Science Machine was developed by Max Kanter, whose computer
science master's thesis at MIT served as its foundation. "We view the Data
Science Machine as a natural complement to human intelligence," Kanter
explains. "There is an abundance of data available for analysis. And it's
doing nothing right now but sitting there. Therefore, perhaps we can devise
a solution that would at least enable us to begin the process and go
forward."
In a paper that Kanter will present at the IEEE International Conference on
Data Science and Advanced Analytics next week, he and his thesis advisor,
Kalyan Veeramachaneni, a research scientist at MIT's Computer Science and
Artificial Intelligence Laboratory (CSAIL), describe the Data Science Machine.
At CSAIL, Veeramachaneni co-leads the Anyscale Learning for All group,
which uses machine learning techniques to solve real-world big data analysis
problems like estimating wind farm sites' power-generation capacity or
identifying students who may drop out of online courses.
"Feature engineering is one of the very critical steps that we observed
from our experience solving a number of data science problems for industry,"
adds Veeramachaneni. "The first task is to determine which variables to
extract or compose from the database, which requires a lot of
brainstorming."
For example, how long before a deadline a student starts working on a
problem set and how much time the student spends on the course website
compared to peers are two important factors in predicting dropout. Neither of those figures is recorded directly by MIT's online learning platform MITx, but the platform does collect data from which they can be inferred.
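To make this concrete, here is a minimal sketch of how such features might be derived from raw timestamped logs; the table and column names are hypothetical MITx-style stand-ins, not the platform's actual schema.

```python
# Hedged sketch: deriving the two dropout-related features described above
# from hypothetical MITx-style logs. All table and column names are
# illustrative assumptions, not the platform's actual schema.
import pandas as pd

events = pd.DataFrame({
    "user_id":   [1, 1, 2, 2, 2],
    "pset_id":   [7, 7, 7, 7, 7],
    "timestamp": pd.to_datetime([
        "2015-03-01 10:00", "2015-03-02 09:00",
        "2015-03-04 22:00", "2015-03-04 23:00", "2015-03-05 01:00"]),
})
deadlines = pd.DataFrame({
    "pset_id":  [7],
    "deadline": pd.to_datetime(["2015-03-06 23:59"]),
})

# Feature 1: how long before the deadline each student first touched the problem set.
first_touch = events.groupby(["user_id", "pset_id"])["timestamp"].min().reset_index()
first_touch = first_touch.merge(deadlines, on="pset_id")
first_touch["hours_before_deadline"] = (
    first_touch["deadline"] - first_touch["timestamp"]
).dt.total_seconds() / 3600

# Feature 2: each student's activity on the site relative to the class average,
# approximated here by event counts.
activity = events.groupby("user_id").size().rename("n_events")
relative_activity = activity / activity.mean()

print(first_touch[["user_id", "hours_before_deadline"]])
print(relative_activity)
```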
Feature composition
Kanter and Veeramachaneni use a few techniques to generate candidate features for data analysis. One is to exploit the structural relationships built into database design. Databases typically store different kinds of data in separate tables, with numerical identifiers indicating the relationships between them. The Data Science Machine tracks these correlations, using them as a cue for feature construction.
For instance, one table might list retail items and their prices, while another might list the items included in individual customers' purchases. The Data Science Machine would begin by importing costs from the first table into the second. Then, taking its cue from the association of several different items in the second table with the same purchase number, it would perform a series of operations to generate candidate features: total cost per order, average cost per order, minimum cost per order, and so on. As numerical identifiers proliferate across tables, the Data Science Machine layers operations on top of one another, finding minima of averages, averages of sums, and the like.
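The following is a rough pandas sketch of that join-and-aggregate idea; the table and column names (products, order_items, price, order_id, customer_id) are hypothetical, and the actual system derives such operations automatically from the database's relational structure rather than from hand-written code.

```python
# Hedged sketch of the join-and-aggregate feature generation described above.
# All table and column names are illustrative assumptions.
import pandas as pd

products = pd.DataFrame({
    "product_id": [101, 102, 103],
    "price":      [9.99, 24.50, 3.25],
})
order_items = pd.DataFrame({
    "order_id":    [1, 1, 2, 2, 2],
    "customer_id": [10, 10, 11, 11, 11],
    "product_id":  [101, 102, 101, 103, 103],
})

# Step 1: import costs from the first table into the second.
items = order_items.merge(products, on="product_id")

# Step 2: aggregate over items sharing the same order number, producing
# candidate features such as total, average, and minimum cost per order.
per_order = items.groupby(["customer_id", "order_id"])["price"].agg(
    total_cost="sum", average_cost="mean", minimum_cost="min"
).reset_index()

# Step 3: layer another aggregation on top, this time per customer, yielding
# composites such as averages of sums and minima of averages.
per_customer = per_order.groupby("customer_id").agg(
    avg_of_order_totals=("total_cost", "mean"),
    min_of_order_averages=("average_cost", "min"),
)
print(per_customer)
```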
The Data Science Machine also searches for categorical data, which appears to be restricted to a limited range of values, such as days of the week or brand names. It then generates further feature candidates by dividing existing features up across categories.
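A minimal sketch of that categorical split, again with invented column names, might look like the following: an existing numeric feature is broken out per category value, and each category becomes a new candidate feature.

```python
# Hedged sketch of splitting an existing feature across a categorical column.
# Column names and the choice of aggregation are illustrative assumptions.
import pandas as pd

purchases = pd.DataFrame({
    "customer_id": [10, 10, 11, 11, 11],
    "day_of_week": ["Mon", "Sat", "Mon", "Tue", "Sat"],
    "amount":      [20.0, 55.0, 12.0, 7.5, 30.0],
})

# Total spend per customer, split by the categorical "day_of_week" column:
# each day becomes its own candidate feature.
split_features = purchases.pivot_table(
    index="customer_id",
    columns="day_of_week",
    values="amount",
    aggfunc="sum",
    fill_value=0.0,
)
split_features.columns = [f"total_amount_{day}" for day in split_features.columns]
print(split_features)
```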
Once it has produced an array of candidates, it reduces their number by identifying those whose values seem to be correlated. Then it starts testing its pared-down feature set on sample data, recombining the features in different ways to optimize the accuracy of the predictions they yield.
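As a rough illustration of that pruning-and-testing step, the sketch below drops candidate features that are strongly correlated with one already kept and then scores the reduced set on sample data; the correlation threshold, the model, and the feature names are assumptions for illustration, not the settings used in the paper.

```python
# Hedged sketch: prune correlated candidate features, then test the reduced
# set on sample data. Threshold, model, and synthetic data are assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
candidates = pd.DataFrame({
    "total_cost":   rng.normal(100, 20, 200),
    "average_cost": rng.normal(25, 5, 200),
    "n_items":      rng.integers(1, 10, 200),
})
candidates["total_cost_again"] = candidates["total_cost"] * 1.01  # nearly redundant
labels = (candidates["total_cost"] + rng.normal(0, 10, 200) > 100).astype(int)

# Step 1: keep only candidates not highly correlated with one already kept.
corr = candidates.corr().abs()
keep = []
for col in candidates.columns:
    if all(corr.loc[col, kept] < 0.95 for kept in keep):
        keep.append(col)

# Step 2: evaluate the condensed feature set on sample data.
score = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    candidates[keep], labels, cv=5,
).mean()
print(keep, round(score, 3))
```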
Margo Seltzer, a professor of computer science at Harvard University who was not involved in the work, says, "The Data Science Machine is one of those unbelievable projects where applying cutting-edge research to solve practical problems opens an entirely new way of looking at the problem. I believe that what they've done will very quickly become the standard."
Provided by Massachusetts Institute of Technology