My approach to artificial intelligence has primarily been symbolic, and in prior posts on AI I expressed my skepticism of machine learning and other statistical techniques as a valid long-term approach to solving problems. Supervised learning can construct a function from inputs to outputs by learning from data. However, in many cases, particularly with neural networks, the learned function remains a black box from which no model can be extracted that would support more sophisticated kinds of reasoning. This is not entirely true. In reality, a neural network is a set of matrix calculations that can be explored, and some techniques, such as Bayesian models, do offer multi-directional, not just unidirectional, inference, in which the sought probability of any node in the graph may be conditioned on any other nodes.
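To illustrate what multi-directional inference means in practice, here is a minimal sketch of inference by enumeration over a toy Bayesian network (the rain/sprinkler/wet-grass example commonly used in AI texts; the probabilities are made up for illustration). The same code answers queries in either direction: from cause to effect or from effect back to cause.

```python
from itertools import product

# Toy network: Rain -> WetGrass <- Sprinkler (hypothetical numbers).
P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: 0.1, False: 0.9}
# P(WetGrass=True | Rain, Sprinkler)
P_wet_given = {(True, True): 0.99, (True, False): 0.9,
               (False, True): 0.85, (False, False): 0.05}

def joint(rain, sprinkler, wet):
    """Full joint probability of one complete assignment."""
    p_wet = P_wet_given[(rain, sprinkler)]
    return (P_rain[rain] * P_sprinkler[sprinkler] *
            (p_wet if wet else 1 - p_wet))

def query(target, evidence):
    """P(target=True | evidence), by summing the joint distribution
    over all assignments consistent with the evidence."""
    num = den = 0.0
    for rain, sprinkler, wet in product([True, False], repeat=3):
        world = {'rain': rain, 'sprinkler': sprinkler, 'wet': wet}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(rain, sprinkler, wet)
        den += p
        if world[target]:
            num += p
    return num / den

# Inference runs in any direction over the same model:
print(query('wet', {'rain': True}))   # predictive: P(wet | rain)
print(query('rain', {'wet': True}))   # diagnostic: P(rain | wet)
```

Enumeration is exponential in the number of variables, so real systems use smarter algorithms, but the point stands: the model itself is inspectable, and any node can serve as query or evidence.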
I spoke with a former Harvard classmate of mine who pursued a PhD in natural language processing at Harvard under the tutelage of Professor Stuart Shieber, who also sparked my interest in natural language. He went to Microsoft Research after obtaining his degree, only to leave the field of NLP for a director of program management position in the product groups, because he felt that we still don't really understand natural language. Given that natural language processing is the basis of some of my work, and that I have developed effective approaches to incorporating natural language understanding into the products I develop, the comment was somewhat disheartening. Later, after reviewing his CV, I discovered that his work in natural language processing was focused entirely on statistical techniques, which to me offer easy heuristics but very little of the explanatory power that only a real model could provide. Also, my focus has been more on natural language manipulation, which is more tractable than inference, and on watching for any emergent intelligence properties that could reduce the need for the searches that inference would entail.
My gradual warming to machine learning techniques is the result of taking Andrew Ng's online courses on machine learning. I had read about neural networks independently and encountered many of the techniques multiple times in my applied math and management coursework--Bayesian modeling, Markov models, decision trees, regression, etc.--and had even recognized their potential in programming by including some of these algorithms in my AI libraries; however, I never fully appreciated their power.
My warming also mirrors the gradual acceptance of these techniques by industry over the 1990s. Neural networks were initially discredited by Minsky and Papert's 1969 book Perceptrons, which exposed limitations in the expressiveness of single-layer networks; those limitations were later overcome with multilayer networks, and the field exploded. In economics, the term "data mining" was once looked upon with disdain and not regarded as serious research, but mathematical rigor combined with the growing volume of data in the digital age turned it into one of the hottest subject areas in the discipline. Machine learning reduces the need to discover models yet yields good approximate results.
Peter Norvig, co-author of Artificial Intelligence: A Modern Approach, the leading AI text with 95% market share, recently gave a presentation on the rise of big data and machine learning. He is currently the Director of Research at Google, where he applies AI techniques to make sense of the vast amounts of web data crawled by the search engine. Norvig's books also trace the transition away from symbolic AI. His first text, written in 1992, was Paradigms of Artificial Intelligence Programming: Case Studies in Common Lisp, which incorporated only symbolic approaches; the later text mentioned above consists mostly of non-symbolic approaches.
His work at Google led him to write about the rise of data in the famous paper The Unreasonable Effectiveness of Data. Statistical approaches have automated and revolutionized natural language parsing and machine translation. In many cases, they proved superior to more expensive, human-intensive efforts. For instance, Chinese machine translation was automated without a single developer knowing the Chinese language.
In a lecture "Innovation in Search and Artificial Intelligence," Peter Norvig describes the rationale behind the movement from previous approaches to automated statistical approaches.
Below, I have included some of his remarks.
He uses this example in many of his lectures. Traditionally, the program was the focus of artificial intelligence, but now the red circle has shifted to the data. The program is no longer a custom-written component but a generic learning algorithm (like a neural network) that takes data to learn from in order to produce the appropriate output for each input. The function is effectively determined by the training data.
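This shift can be sketched in a few lines: below, a single generic learner (here, a toy gradient-descent linear regression, not anything from Norvig's lecture) computes entirely different functions depending only on which training pairs it is fed.

```python
# The program is generic; the function it ends up computing is
# determined by the training data alone.

def train_linear(pairs, lr=0.01, epochs=2000):
    """Fit y = w*x + b by stochastic gradient descent on squared error."""
    w = b = 0.0
    for _ in range(epochs):
        for x, y in pairs:
            err = (w * x + b) - y
            w -= lr * err * x   # gradient step for the weight
            b -= lr * err       # gradient step for the bias
    return lambda x: w * x + b  # the learned function

# Identical code, different data, different learned functions:
double = train_linear([(0, 0), (1, 2), (2, 4), (3, 6)])  # learns y = 2x
shift  = train_linear([(0, 5), (1, 6), (2, 7), (3, 8)])  # learns y = x + 5
print(round(double(10), 2))   # close to 20
print(round(shift(10), 2))    # close to 15
```

Swap the training set and the "program" changes behavior without a line of code being rewritten, which is exactly what moving the red circle from program to data means.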
First I want to talk about the way we understand the world and make models of the world and try to get them into our computers to make sense of it. This is the process of theory formation. Here's a guy. We call him Isaac, and he makes some observations of the world. Then he gets an idea and decides to formulate the idea in the form of a theory or model.
Then you can apply the model to make predictions of the future.
It's great that that approach works. But, of course, it could be thousands of years before we get someone who is smart enough to come up with a model like that. We need a process where we can iterate a lot faster--a more agile theory-making process to get those kinds of advances.
One of the problems with formulating theories like that is that essentially all models are wrong, but some are useful. They all make approximations somehow. They don't model the world completely, but some of them are very useful, like the ones Isaac was using. So if you are going to be wrong anyway, the question is: is there some shortcut that trades off development time so you can advance much faster--one that may be a little more wrong, but can still be more useful?
Initially, computer programs were taught to behave in this manner:
There's input, output, and data, but computer science was this stuff in the middle. In the past few decades, the processing power of computers has increased dramatically.
As if to emphasize the point, Norvig mentions how it was once believed that certain algorithms were inherently better than others. Improvements came from tweaking algorithms to incorporate more advanced models or additional variables. However, an interesting phenomenon occurs when more training data is fed to each of the algorithms. As the size of the data increases by factors of 10, from sample sizes of thousands to billions, the performance rankings of the algorithms change positions. At some point, each algorithm's performance asymptotes, whereby additional data really doesn't add much more information. The simpler algorithms often outperform the more advanced ones.
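A learning-curve experiment of this kind is easy to set up. The sketch below is a hypothetical harness (not Norvig's actual experiment): it trains two classifiers of different complexity--a nearest-centroid model and a more flexible 1-nearest-neighbor model--on growing samples of synthetic data and reports test accuracy at each size, so the rankings can be compared as data grows.

```python
import random

random.seed(0)

def sample(n):
    """Synthetic data: two overlapping 1-D Gaussian classes."""
    return [(random.gauss(0.0 if cls == 0 else 1.5, 1.0), cls)
            for cls in (0, 1) for _ in range(n // 2)]

def centroid_fit(data):
    """Simple model: classify by the nearer class mean."""
    means = {}
    for cls in (0, 1):
        xs = [x for x, c in data if c == cls]
        means[cls] = sum(xs) / len(xs)
    return lambda x: min((0, 1), key=lambda c: abs(x - means[c]))

def knn_fit(data):
    """More flexible model: 1-nearest neighbor over the training set."""
    def predict(x):
        nearest = min(data, key=lambda p: abs(p[0] - x))
        return nearest[1]
    return predict

test_set = sample(2000)

def accuracy(model):
    return sum(model(x) == c for x, c in test_set) / len(test_set)

# Compare the two algorithms as the training sample grows 10x at a time.
for n in (10, 100, 1000):
    train = sample(n)
    print(n, round(accuracy(centroid_fit(train)), 3),
             round(accuracy(knn_fit(train)), 3))
```

The specific numbers depend on the random seed and the synthetic distribution, so no particular ranking is claimed here; the structure of the experiment--fix a test set, grow the training set by orders of magnitude, re-rank the algorithms--is the point.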