
7 posts from January 2012

January 26, 2012

Rise of Big Data, Machine Learning and Data Mining

My approach to artificial intelligence has primarily been symbolic, and, in prior posts on AI, I indicated my skepticism of machine learning and other statistical techniques as a valid long-term approach to solving problems. With supervised learning techniques, it is possible to construct a function from inputs to outputs by learning from data. However, in many cases, particularly with neural networks, the function remains a black box from which no model can be extracted to support more complicated types of reasoning. That view is not entirely true. In reality, neural networks involve a set of matrix calculations, which can be explored, and some techniques such as Bayesian models do offer multi-directional, not just bidirectional, inference in which the sought probability of any node in the graph may be conditioned on any other nodes.
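To make the point about multi-directional inference concrete, here is a minimal sketch in Python. The three-node network (Cloudy -> Rain -> WetGrass), its probabilities, and the function names are all made up for illustration, and brute-force enumeration stands in for the more efficient algorithms that real Bayesian network libraries use.

```python
from itertools import product

# Hypothetical three-node network: Cloudy -> Rain -> WetGrass.
# All probabilities below are invented for illustration.
P_CLOUDY = {True: 0.5, False: 0.5}
P_RAIN_GIVEN_CLOUDY = {True: 0.8, False: 0.1}  # P(rain=True | cloudy)
P_WET_GIVEN_RAIN = {True: 0.9, False: 0.2}     # P(wet=True | rain)

NAMES = ("cloudy", "rain", "wet")


def joint_probability(cloudy, rain, wet):
    """Probability of one complete assignment, using the chain structure."""
    p = P_CLOUDY[cloudy]
    p_rain = P_RAIN_GIVEN_CLOUDY[cloudy]
    p *= p_rain if rain else 1.0 - p_rain
    p_wet = P_WET_GIVEN_RAIN[rain]
    p *= p_wet if wet else 1.0 - p_wet
    return p


def probability(query, evidence=None):
    """P(query | evidence), computed by enumerating every possible world."""
    evidence = evidence or {}
    numerator = 0.0
    denominator = 0.0
    for values in product([True, False], repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        # Skip worlds that contradict the evidence.
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint_probability(**world)
        denominator += p
        if all(world[name] == value for name, value in query.items()):
            numerator += p
    return numerator / denominator


# Forward (cause to effect), backward (effect to cause), and two hops back.
forward = probability({"wet": True}, {"rain": True})
backward = probability({"rain": True}, {"wet": True})
two_hops = probability({"cloudy": True}, {"wet": True})
```

The same function answers questions in any direction: the forward query returns the 0.9 entered in the table, while the backward queries come out to roughly 0.79 and 0.74 under these made-up numbers.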

I spoke with a former Harvard classmate of mine, who pursued a PhD in Natural Language Processing at Harvard under the tutelage of Professor Stuart Shieber, who also interested me in natural language. He went into Microsoft Research after obtaining his degree, only to leave the field of NLP for a director of program management position in the product groups, because he felt that we still don't really understand natural language. Given that natural language processing is the basis of some of my work and that I have developed effective approaches to incorporating natural language understanding in the products that I develop, the comment was somewhat disheartening. Later, after reviewing his CV, I discovered that his work on natural language processing was focused entirely on statistical techniques, which to me offer easy heuristics but very little of the explanatory power that only a real model could provide. Also, my focus has been more on natural language manipulation, which is more tractable than inference, and on watching for any emergent intelligence properties that could reduce the need for the searches that inference would entail.

My gradual warming to machine learning techniques is the result of taking Andrew Ng's online course on Machine Learning. I had read about neural networks independently and encountered many of the techniques multiple times in my applied math and management coursework--Bayesian modeling, Markov models, decision trees, regression, etc.--and even recognized their potential in programs by including some of these algorithms in my AI libraries; however, I never fully appreciated their power.

My warming also mirrors the gradual acceptance of these techniques by industry over the 1990s. Neural networks were initially discredited by a paper in the 1970s by a well-known researcher in AI; the limitations on the expressiveness of neural networks were later overcome and the field exploded. In economics, the term data mining was once looked upon with disdain and not regarded as serious research, but mathematical rigor combined with the growing volume of data in the digital age turned it into one of the hottest subject areas in the discipline. Machine learning reduces the need to discover models yet yields good approximate results.

Peter Norvig, author of AI: A Modern Approach, the leading AI text with 95% market share, recently gave a presentation on the rise of big data and machine learning. He is currently the director of research at Google, where he applies AI techniques to make sense of the vast amounts of web data crawled by the search engine. Norvig's own books also trace the transition away from symbolic AI. His first text on AI, written in 1992, was Paradigms of AI: Case Studies in Common Lisp, incorporating only symbolic approaches; the second text, mentioned earlier, consists mostly of non-symbolic approaches.

His work at Google led him to write about the rise of data in the famous paper, The Unreasonable Effectiveness of Data. Statistical approaches have automated and revolutionized natural language parsing and machine translation. In many cases, these proved superior to more expensive efforts that relied on human experts. For instance, Chinese machine translation was automated without a single developer knowing the Chinese language.

In a lecture "Innovation in Search and Artificial Intelligence," Peter Norvig describes the rationale behind the movement from previous approaches to automated statistical approaches.

Below, I have included some of his remarks.

First I want to talk about the way we understand the world and make models of the world and try to get them to our computers and make sense. This is the process of theory formation. Here's a guy. We call him Isaac and he makes some observations of the world. Then he gets an idea and decides to formulate the idea into the form of a theory or model.

Image [4]

Then you can apply the model to make predictions of the future.

Image [2]

It's great that approach works. But, of course, it could be thousands of years before we got someone who was smart enough to come up with a model like that. We need a process where we can iterate a lot faster--a more agile theory making process to get those kinds of advances.

One of the problems of this approach of formulating theories like that is that essentially all models are wrong, but some are useful.  They all make approximations somehow. They don't model the world completely, but some of them are very useful, like the ones Isaac was using. So if you are going to be wrong anyways, the question is "is there some shortcut so that you can trade off development time to advance much faster, but that may be a little more wrong, but can still be more useful?"

Initially, computer programs were taught to behave in this manner:

Image [3]

There's input, output and data, but computer science was this stuff in the middle. In the past few decades, the processing power of computers has increased dramatically.

He uses this example in many of his lectures. Traditionally, programs were the focus of artificial intelligence, but now the red circle has shifted to data. The program is no longer a custom-written component, but a generic learning algorithm (like a neural network) that takes data to learn from in order to produce the appropriate output for each input. The function is effectively determined by the training data.
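The idea that the training data, rather than custom code, determines the function can be sketched in a few lines of Python. This is only an illustration: an ordinary least-squares line fit stands in for the neural network mentioned above, and the names learn, celsius_to_fahrenheit and double are my own.

```python
def learn(examples):
    """Generic learner: fit y = slope * x + intercept by least squares."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in examples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The returned function is fixed entirely by the examples it was given.
    return lambda x: slope * x + intercept


# The same generic program, fed different data, yields different functions.
celsius_to_fahrenheit = learn([(0, 32), (100, 212), (37, 98.6)])
double = learn([(1, 2), (2, 4), (3, 6)])
```

Nothing in learn knows anything about temperature or doubling; swapping the list of examples is the only "programming" involved.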


As if to emphasize the point, Norvig mentions how it was once believed that certain algorithms were inherently better than others. The algorithms were tweaked to incorporate more advanced models or additional variables. However, an interesting phenomenon occurs when more training data is fed to each of the algorithms. As the size of the data increases by factors of 10X, from sample sizes of thousands to billions, the performance rankings of the algorithms change positions. At some point, the performance of each algorithm approaches an asymptote, whereby additional data really doesn't add much more information. The simpler algorithms often outperform the more advanced ones.

January 25, 2012

Microsoft AI Initiatives

Several computer science classes focus on algorithms. These include classes in data structures, artificial intelligence, computer graphics and numerical computing.

Some of these data structures are quite involved, and I have felt that they should be incorporated inside system libraries. Many of the classical data structures became a staple of standard libraries in the 1990s, such as the Standard Template Library of C++ and the frameworks included with the Java and .NET runtimes. However, libraries for numerical computing (manipulating matrices and performing statistics), handling artificial intelligence, or doing computational geometry have still not become first-class citizens in modern APIs, although 3D graphics do have some presence.

There has been some activity in developing consumable AI libraries in the past few years at Microsoft.

With SQL Server 2005, Microsoft incorporated various AI and data mining packages: decision trees, association rules, naïve Bayes, sequence clustering, time series, neural nets, and text mining. A few years ago, Microsoft developed the Microsoft Solver Foundation libraries that include optimization, solvers, and latent term-rewriting functionality. A Technical Computing Initiative was launched, but some of the players involved have left the company and the output from the initiative remains to be seen. The goals of this initiative are also not clear.

Microsoft had for a long while made available a Speech API, but its recognition capabilities are somewhat weak and frustrating. There is still no general purpose Natural Language API; this is somewhat complicated by the need to support multiple languages.

Recent developer events have introduced new libraries from research:

  • Infer.NET supports probabilistic inference, though the range of applications for this library is quite limited.
  • A more promising library called Semantic Engine includes a range of technologies from Machine Learning, Computer Vision, Natural Language and others.


There are some downsides to most of these new libraries. They are based on managed code and currently have restrictions that prohibit non-internal commercial use.

January 23, 2012

Leverage in the Software Business

It’s a great time to be in the software business, because there are many levers available to quickly produce products.

Open Source.

In recent years, open source has become a true phenomenon. One can find libraries for advanced technologies that are competitive with research offerings from the likes of Google and Microsoft. Even Google relies heavily on open source, which may be a key reason it iterates faster than Microsoft, which develops most of its software in-house. For instance, Chrome, itself based on the WebKit open source project, uses over 80 other open-source libraries credited in its About box. From machine translation to text-to-speech to optical character recognition to computer vision to numerical computing to video processing to GIS, the range of competencies that open source offers to a new startup is breathtaking. In addition to the traditional source code repositories like SourceForge and CodeProject, many platform and book samples as well as course code offer ready-to-use technology.

Cross-platform languages.

Several cross-platform solutions have emerged (C#/Mono, Qt, Air, HTML and Java) that allow products to be built on one platform such as Windows and quickly migrated to others such as mobile devices and the Mac.

Open Data.

Besides source code, data (both raw numbers and media files) is freely available from the government, universities and elsewhere. Natural language information is available from the Linguistic Data Consortium. Data for mapping, demographics and nutrition is freely available from the government. Websites like infochimp.com serve as a portal for these types of data files.

Component Libraries.

For hard-to-obtain source code and data, there are companies that offer access for small sums. User interface libraries are pervasive. Nuance licenses its speech recognition technology for other companies to use within their products.

Web Services.

Web APIs potentially offer instant access to valuable services on the Web, though they tend to be less stable than OS-specific APIs. Nick Bradbury wrote of the long-term failure of Web APIs: because web APIs have to be maintained continuously, any software that relies on them will need to be updated over time and could potentially break in the future.

A software company could provide its own gateway web service to ameliorate this situation, so that the client application should not have to change. Another advantage of this approach is that the company may use GPL code that would otherwise not be commercially viable.
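The gateway idea can be sketched as follows. This is a toy, in-process illustration in Python rather than a real web service, and every name in it (ProviderV1, ProviderV2, Gateway, the adapter functions) is hypothetical: the point is only that the client keeps calling one stable method while the company swaps what sits behind it.

```python
class ProviderV1:
    """Hypothetical third-party translation API, original version."""

    def translate(self, text, target):
        return {"result": f"[{target}] {text}"}


class ProviderV2:
    """Hypothetical later version with a different call shape and response."""

    def run(self, payload):
        return {"data": {"output": f"[{payload['lang']}] {payload['q']}"}}


def adapt_v1(provider):
    """Wrap the old API so it looks like a plain (text, target) function."""
    return lambda text, target: provider.translate(text, target)["result"]


def adapt_v2(provider):
    """Wrap the new API behind the same (text, target) function shape."""
    def call(text, target):
        response = provider.run({"q": text, "lang": target})
        return response["data"]["output"]
    return call


class Gateway:
    """The stable interface that shipped client applications call."""

    def __init__(self, backend):
        self._backend = backend

    def swap_backend(self, backend):
        # Done on the company's server; clients never see this change.
        self._backend = backend

    def translate(self, text, target):
        return self._backend(text, target)


gateway = Gateway(adapt_v1(ProviderV1()))
before = gateway.translate("hello", "fr")
gateway.swap_backend(adapt_v2(ProviderV2()))
after = gateway.translate("hello", "fr")
```

The client-side call is identical before and after the provider changes; only the adapter behind the gateway had to be rewritten.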

C# Everywhere

Miguel de Icaza, founder of Xamarin, describes his C# Everywhere strategy for Mono. Earlier this year, there was a question of Mono’s survival, when the project was canceled after the Attachmate acquisition. However, the Mono team reconstituted itself under the umbrella of Xamarin, and has regained the rights to sell MonoTouch and MonoDroid.

I standardized on C# years ago because it offers a cleaner and more productive cross-platform solution than the other languages that I have considered. C# is available for all Windows-based platforms. Mono fills in the gap for the other platforms with MonoMac for Mac OS, MonoTouch for iOS, and MonoDroid for Android. C# is used for games through Sony PSSuite, Unity, and XNA. C# is available in the browser with Google NativeClient support. One downside of Mono is that, on platforms that do not support C# natively, build times are considerably slower. Generally, I use C# both for development and for scripts. I don’t really see strong benefits in using dynamic languages; with C#, I have easier access to and type-checking support for existing .NET libraries.

However, C++ has not been standing still. With the new additions in C++11, C++ has become increasingly tempting with its ruthless efficiency and new support for functional programming, including lambda expressions. Most advanced software projects are in C++. Objective-C is compatible with C++. Nokia’s Qt framework is a fantastic cross-platform object-oriented C++ framework, better designed than MFC and supported by a much more polished IDE, QtCreator, than Visual Studio. I have gained enormous respect for Nokia’s development team from my exposure to Qt and QtCreator.

The new Windows Runtime of Windows 8 includes better integration with C++ than with .NET, through special C++/CX component extensions to the language. Though WinRT is currently limited to supporting Metro-based applications, I suspect that over time WinRT will expand to cover desktop applications and gradually replace the legacy Win32 APIs in future Windows releases.

There’s another candidate, Adobe Air, that offers multiplatform support for mobile devices (Android, Blackberry, iOS) and desktops (PCs and Macs). The runtime is used by Balsamiq Mockups. While I am not altogether familiar with it, Air is based on web technologies like HTML, JavaScript, and ActionScript.

Conversational Interfaces Redux

In the past, I have talked about conversational interfaces with posts like the “Turing Test and the Loebner Prize Competition.” My interests are not purely theoretical, as I have actively explored integrating natural language deeply into applications in such ways as interpreting all text inside documents and code files and presenting a conversation stream.

The company I founded, SoftPerson, LLC, develops “smart software,” which are desktop applications that utilize mostly symbolic artificial intelligence, including natural language processing. The overarching design criterion for my software is the capture of the human thought process (human-like reasoning) in the codebase, so that the software ultimately acts as an intelligent agent--a “soft person” or a virtual replacement for a human. In a sense, the applications are similar to Siri, which describes itself as a virtual assistant.

(The desktop market may seem mature, but there are many more undiscovered document types and, even in existing categories, there are additional ways to differentiate. Software applications are a high-margin business. Despite Microsoft having a monopoly on some desktop applications, niche versions of existing types of applications make considerable sums for their owners, and many successful companies are based on a single product: WhiteSmoke, FinalDraft, Moos ProjectViewer, SmartDraw, Quantrix, and Ventuz.)

In a sense, computers have historically been conversational, with a command line console being a form of conversation in which the user communicates to the computer in the computer’s own language with its limited grammar. The trend is moving towards the other end, where the computer understands more and more the language of the user.

The business plan I wrote for SoftPerson in 2002 featured a natural language writing product that incorporated a conversational interface in one of three different writing modes. Rather than feeling forced, the conversation interface was more natural than the one it would replace. The software would become one’s own personal ghostwriter. The plan was a finalist in two national business plan competitions and won prize money in the 2002 New Venture Championship competition held in Portland, Oregon. I have been working on this product for some amount of time. It includes a natural language parser that is an improvement on the Link parser from Carnegie Mellon University; however, the parser may be switched to the Stanford parser, which is more accurate, for a licensing fee.

I won’t talk much about the aforementioned product for intellectual property reasons, but I did look into the possibility of creating a graphics program in which one simply describes the desired image through words, sketching or other forms of input, much as one would dictate to a human artist. In effect, the computer becomes one’s own personal artist.

Wouldn’t it be great if a computer could dynamically produce any image a person desired? How much more versatile could software be if it could render arbitrary scenes depending on the context as part of its operation instead of relying on canned photographs? In addition to the complexity of determining meaning from words, a large library of graphical assets would be needed in mathematical form. A recent TED video of PhotoSynth (at 3:40) actually suggests that these assets can be data-mined from tagged Flickr images using computer vision techniques.

In the course of posing this question and researching its practicality in 2002, I discovered existing research called WordsEye that utilizes a conversational interface to construct images. The technology is described in the research paper WordsEye: Automatic Text-to-Scene Conversion System, but more accessible descriptions and examples are available in the following Creators Project blog post, “Wordseye Is An Artistic Software Designed To Depict Language.”


The above is an example of an uncanny WordsEye rendering of a scene based on a text description. The graphics were licensed from a library of 3D models and are transformed to fit into the scene. This can be taken a step further to produce non-photorealistic rendering effects.

January 16, 2012

Online Courses

There have been videotaped lectures on the web for the past decade, since the arrival of video-sharing sites. Early on, I watched a number of them. Some were computer science lectures from the University of Washington Professional Master’s Program, sponsored by Microsoft, like Data Mining.

However, for the most part, I avoided these video-based lectures or simply played them in the background (learning via osmosis). My issues with video lectures were manifold:

  1. Time. Video lectures consume a considerable amount of time, one to three hours.
  2. Not designed for online consumption. The video is a taping of an in-person lecture. Oftentimes, relevant material on the blackboard or the slides is not even visible.
  3. Not self-contained. Additional readings are required.

Instead, I would read through the lecture presentations and notes for a course from MIT’s OpenCourseWare and other university programs. However, the retention of terminology and information from reading slides is weak. Slides typically have little content and explanation, just bullet points and diagrams, and the time spent reviewing a presentation is a small fraction of the time spent watching a lecture, which is not enough to think deeply about a topic. Lecture notes can help, but they are often dry and usually not available.

Led by Andrew Ng, professors at Stanford launched three unofficial non-credit courses in fall 2011 directed at a worldwide online audience. These included regular homework and exams. About a hundred thousand students signed up for each course, with over ten thousand completing all the requirements.

I took part in all three and found them to be high quality and as effective as regular courses. The Stanford online course lectures have become my sole hobby.

  1. Streamlined videos. Videos are delivered in small chunks with dead time edited out and an option for accelerated viewing.
  2. Optimized for online. Professors speak directly to the camera. Lecture notes are clearly viewable on top of a white background instead of on a distant blackboard.
  3. Course progress. Viewed videos and completed homeworks are marked.
  4. Community forums. Students communicate with each other and with the course staff.

The courses are taught by prominent professors in their fields. Peter Norvig, Google director of R&D and author of AI: A Modern Approach, which is used by 95% of students, teaches AI alongside Sebastian Thrun, an expert in robotics. The courses are somewhat less rigorous than the official Stanford classes and come complete with a certificate of accomplishment. In the AI class, I received congratulatory mail for perfect homework scores and, towards the end, an invitation to a job placement program for the top 1,000 students out of an estimated 36,000. I easily obtained perfect scores on the ML and DB class assignments.

Stanford initiated other, less optimized offerings in previous years, such as Stanford Engineering Everywhere (SEE) and Stanford’s Class X, which are videotaped lectures of courses targeted to Stanford’s professional program; the courses are still available for viewing. In addition to the three Stanford courses mentioned earlier, I watched Introduction to Robotics, Program Analysis and Optimization, and iPhone Application Development.

There are currently sixteen Stanford courses planned for the winter quarter using the same interactive system, in computer science, entrepreneurship, engineering and medicine, of which I plan to take as many as possible.

MITX is an upcoming online course program by MIT in the same vein, extending beyond the OpenCourseWare program.

January 14, 2012

The Computers and Internet of Yesteryear

The underlying experiences that we obtain from using computers and the Internet may not be as alien to prior generations as we may think. Things are faster and smaller, but not fundamentally different.

Many complex systems have been with us for millennia, albeit in somewhat different forms: the rule of law, sophisticated government systems, commerce and engineering. A complex system of check processing operated across distances of 500 miles during the Middle Ages. Our forebears were just as smart as us. For instance, the ancient Latin language is more advanced and refined than modern English in its grammar and sophistication.

We know Charles Babbage as the “father of the computer” for having attempted to build a mechanical computer called the difference engine during the Victorian Era in England, but, some time ago, a far older mechanical computer was discovered in a Roman shipwreck off the Greek island of Antikythera, dating to around 100-150 BC. Scientists who studied it using x-ray tomography and other techniques were amazed by its flawless manufacturing and high level of miniaturization and sophistication. One of the scientists speculated that these kinds of devices may have been quite common, because a whole chain of inventions would have been needed to precede them.

Early archaeologists probably did not recognize such technological advances because the knowledge had been lost over time and the devices were sophisticated enough to require the eyes and tools of trained scientists in other disciplines. In the same way, future generations may not understand, much less reconstruct, modern-day computer chips, given the expertise and billion-dollar expenses involved, were the technology to be lost to world war or natural disaster.

In another instance, the book Group Theory in the Bedroom describes the 160-year-old astronomical clock of Strasbourg Cathedral, which is in essence a working mechanical computer, and one that did not succumb to the Y2K problems of modern computers at the turn of the century. The clock includes a vast, eclectic set of features, among them, for instance….

Wait! There’s even more! The clock is inhabited by enough animated figures to open a small theme park. The day of the week is marked by a slow progression of seven Greco-Roman gods in chariots. At noon each day, the twelve apostles appear saluting a figure of Christ, who blesses each in turn and at the end offers benedictions to all present. Every half hour a putto overturns a sandglass, and on the quarter hours another strikes a chime. Still more chimes are sounded by figures representing the four ages of mankind, followed by a skeletal Death, who rings the hours. And a mechanical cock crows on cue, flapping its metal wings….

With regard to the Internet, there were many analogous social practices in prior times. Card catalogs in libraries served the same function as search engines. Penny universities in 18th-century London mirrored the online communities of today. Long before the era of email, the U.S. Postal Service delivered mail seven days a week, multiple times a day. The high frequency of the postal deliveries compensated for the general slowness of the technology. The book Victorian Internet: The Remarkable Story of the Telegraph and the Nineteenth Century’s On-line Pioneers details the revolutionary impact of the electric telegraph in nullifying distance and shrinking the world during the nineteenth century, much as the Internet has done today.

In his argument against Internet software patents, Philip Greenspun pointed out that the important technologies of today were envisioned by early thinkers well before they were even practical to develop. I think that this hints at the commonalities of our experiences across time.

The old timers wrote that we would have tens of millions of computers connected to the Internet, that we would be using those computers to support collaborative work, that we would be able to search for information that would have been digitized on a vast scale, that we would be exchanging digital multimedia information such as pictures or video streams, that there would be a glut of information and that advertisers would pay to get users' attention. The old timers predicted that hardware engineers would figure out how to make silicon-based integrated circuits ever more dense with transistors and powerful, that we would have vast memories, and that there would be computers in every home. The old timers wrote that most business would be conducted via computer network, that electronic mail would surpass hardcopy letters for person-to-person correspondence, that unwanted email would be annoying.

A commenter cites specific examples.

Regarding UI: In As We May Think (1945), Vannevar Bush described an interface that is stunningly recognizable as web browsing. Doug Engelbart's work in the '60s (from which sprang the GUI, as Phil mentioned) was single-mindedly focused on providing the best user interface for what he called "knowledge workers". Some of his UI features, such as those for condensing text passages for quick skimming, are still unmatched. Direct manipulation showed up in Ivan Sutherland's Sketchpad (1963), whose constrained drawing features are also still unmatched. Windowed UI and WYSIWYG editing came with the Xerox Alto (1973). Even today, Photoshop and its clones still use the UI invented by Bill Atkinson for MacPaint (1983). And many believe that Atkinson's HyperCard (1987) is what the web should have been.

The two decades between Engelbart's 1962 opus and the release of the Macintosh in 1984 were far and away the most fertile period of UI innovation. Most progress since then has been simply the wide-scale adoption of those ideas.

Regarding Moore's law: Look at the section "A Simple Vision of the Future" in Alan Kay's Early History Of Smalltalk. You will see a man who intimately understood Moore's Law, even in the late '70s.

Old-timers were able to conceive much of the digital experiences of today. Modern digital experiences are another reflection through more advanced technologies of fundamental human interactions that have always existed in some form.