October 24, 2011

Siri Technology

Considerable press puffery has followed Apple’s announcement of Siri, dramatizing the significance of its natural language technology. Siri is touted as the product of a $200 million DARPA-funded research project in the same league as the Internet and GPS. Siri is well ahead of existing popular voice control technology in its acceptance of free-form input, access to web services, and conversational feedback; however, it is probably not too much of an advance. Microsoft and Google should catch up quickly. Microsoft previewed a future vision of its acquired Tellme voice control technology, and Google is likely to advance its Voice Actions to match Apple’s developments.

Voice-based assistant technology is ideal for a mobile device because the phone is designed to host conversations, geolocation is available, internet data is available, and all the user’s information is already stored in the device. Because contextual information is abundant and the domain knowledge is constrained, speech recognition can be very effective.

My reaction to Siri was “finally”: finally, the state of the art in technology realized in commercial products. A lot of technology that I encountered in academia decades ago, such as natural language analysis, has not yet made its way to consumers. In college, I wrote a Prolog-based natural language parser based on definite clause grammars that took parsed text and converted it into a semantic representation, on which we performed queries. As part of my entrepreneurial work, I rewrote the Link Grammar parser from CMU and obtained licenses to various natural language data sources like COMLEX, NOMLEX and WordNet. It felt odd that I might be the first to sell a consumer natural-language product aside from grammar checkers and translation software.

So, finally, but… I am a bit skeptical about the technological advances, since I am simply one developer and, though I have leveraged the work of others, I did not need $200M.

First, Siri uses Dragon’s speech recognition technology, which any developer can license as part of the Nuance Mobile Developer Program.

Second, on parsing natural language: the DARPA halo is just for dramatics; this kind of natural language analysis has been around for a while. I credit Apple for pushing the quality of the communication beyond mere keyword and structure recognition and for putting existing art into its products. I suspect that most of those stated 300 SRI researchers consulted but did not actually work on CALO (let alone full-time). From the looks of the projects involved, most of the technology appears in the backend (much of which may not even be relevant to Siri), and very little in natural language analysis.

In How Siri Works, the author, Jeff, presents his own skepticism of Siri’s technology. He is more impressed with the application and integration of Siri into the OS than with the technology itself. Siri performs a limited set of operations centered around built-in iPhone applications, and it integrates with a number of web services such as Yelp, Wolfram|Alpha, OpenTable and Wikipedia. Despite the limitations, it is still an impressive achievement given the naturalness of the implementation.

Another CALO engineer confirms my thoughts of Siri as a compelling but not terribly advanced technology:

I worked at SRI on the CALO project, and built prototypes of the system that was spun off into SIRI. The system uses a simple semantic task model to map language to actions. There is no deep parsing - the model does simple keyword matching and slot filling, and it turns out that with some clever engineering, this is enough to make a very compelling system. It is great to see it launch as a built-in feature on the iPhone.

The NLP approach is based on work at Dejima, an NLP startup: “Iterative Statistical Language Model Generation for Use with an Agent-Oriented Natural Language Interface”

A lot of the work is grounded in Adam Cheyer's (CTO of SIRI) work on the Open Agent Architecture: A more recent publication from Adam and Didier Guzzoni on the Active architecture, which is probably the closest you'll come to a public explanation of how SIRI works: Active, a Platform for Building Intelligent Software
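The keyword matching and slot filling described in this quote can be sketched roughly as follows. This is only an illustration under stated assumptions: the task names, keywords, and slot patterns are hypothetical and invented for the example, and nothing here is taken from Siri or CALO code.

```python
import re

# Hypothetical pattern for a clock time such as "at 7pm" or "at 6:30 am".
TIME = r"(?i)\bat (\d{1,2}(?::\d{2})?\s?(?:am|pm))\b"

# Hypothetical task model: each task has trigger keywords and named slots,
# and each slot is filled by the first regular-expression match.
TASKS = {
    "book_table": {
        "keywords": {"book", "table", "reserve", "reservation"},
        "slots": {
            "party_size": r"\bfor (\d+)\b",
            "time": TIME,
            # A run of capitalized words after "at" is treated as a place name.
            "place": r"\bat ([A-Z][a-z]+(?: [A-Z][a-z]+)*)",
        },
    },
    "set_alarm": {
        "keywords": {"alarm", "wake"},
        "slots": {"time": TIME},
    },
}


def interpret(utterance):
    """Pick the task with the most keyword hits, then fill its slots."""
    tokens = set(re.findall(r"[a-z']+", utterance.lower()))
    best_task, best_score = None, 0
    for name, task in TASKS.items():
        score = len(task["keywords"] & tokens)
        if score > best_score:
            best_task, best_score = name, score
    if best_task is None:
        return None  # no keyword matched any task
    slots = {}
    for slot, pattern in TASKS[best_task]["slots"].items():
        match = re.search(pattern, utterance)
        slots[slot] = match.group(1) if match else None
    return {"task": best_task, "slots": slots}


print(interpret("Book a table for 4 at Chez Panisse at 7pm"))
print(interpret("Wake me up at 6:30 am"))
```

There is no grammar anywhere in this sketch, which is the point of the quote: the word order of the request barely matters as long as the keywords and slot patterns appear somewhere in it.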

His comments on the natural language parsing left me disappointed, but it’s possible that Apple upgraded the natural language processing capabilities of Siri with its own homegrown version after acquiring Siri. However, after reading the Dejima paper, it turns out that a traditional parser may have too rigid a grammar for the short, conversational, and often grammatically incorrect speech input.

NLIs often use text as their main input modality; speech is however, a natural and, in many cases, preferred modality for NLIs. Various speech recognition techniques can be used to provide a speech front end to an NLI. Grammar-based recognizers are rigid and unforgiving, and thus can overshadow the robustness and usability of a good NLI. Word-spotting recognizers are reliable only when the input consists of short utterances, and the number of words to be spotted at each  given time is small. Dictation engines are processor and memory intensive, and often speaker dependent. The dictation vocabulary is often considerably larger than required for domain-specific tasks. General statistical language models (SLMs), although robust enough to be used as a front end for a structured domain, requires a very large training corpus. This is time consuming and expensive since a large number of users needs to be sampled and all speech has to be transcribed.

The system proposed in the paper is a statistical language model designed explicitly for an agent-oriented, speech-based natural language interface. It does have its failings, such as the weaknesses detailed in Siri in Practice, where Siri has trouble with parenthetical or quoted expressions that might have been handled more properly by a grammar-based recognizer, assuming that Apple has not changed Siri’s parser.
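To make the idea of a domain-specific statistical language model concrete, here is a minimal sketch of a bigram model that picks between two recognizer hypotheses. It is a generic textbook bigram model with add-one smoothing, not the iterative method from the Dejima paper, and the training phrases and hypotheses are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count word pairs, with <s> and </s> marking sentence boundaries."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    return counts


def score(counts, sentence):
    """Probability of a sentence under the bigram model (add-one smoothing)."""
    vocab = {word for following in counts.values() for word in following}
    vocab_size = len(vocab) + 1  # one extra slot for unseen words
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        seen = counts.get(prev, Counter())
        prob *= (seen[cur] + 1) / (sum(seen.values()) + vocab_size)
    return prob


# Invented in-domain phrases standing in for a small training corpus.
CORPUS = [
    "call my wife",
    "call my mother",
    "text my wife",
    "remind me to call my mother",
]
model = train_bigrams(CORPUS)

# Two acoustically similar hypotheses a recognizer might propose.
hypotheses = ["cold my wife", "call my wife"]
best = max(hypotheses, key=lambda h: score(model, h))
print(best)  # prints "call my wife"
```

Because the model has only seen phrases from the assistant's own domain, it prefers the hypothesis that looks like a plausible command, which is the advantage a constrained domain gives a speech front end.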

Another telling failure is Siri’s response to “What’s the best iPhone wallpaper?”, to which Siri gives its canned response for “What’s the best phone?” as if it did not process the word “wallpaper” at all and simply latched onto the keywords “best” and “iPhone.”

The response vaguely resembles that of fairly unsophisticated chatterbots. Siri could be performing a keyword match either on the syntactic or the semantic level. It would be easy to test this hypothesis by asking Siri variations of the question. I doubt the seriousness of these mistakes, though, because Apple might be using a separate catch-all keyword-matching system for unanticipated queries.
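The keyword-match hypothesis is easy to picture with a small sketch. Everything here is an assumption made for illustration: the keyword sets, the threshold, and the placeholder reply labels are invented, and this is not a claim about how Siri actually selects its canned responses.

```python
import re

# Hypothetical canned replies, each triggered by a set of keywords.
# The reply strings are placeholder labels, not Siri's actual wording.
CANNED = [
    ({"best", "phone", "iphone"}, "BEST_PHONE_REPLY"),
    ({"meaning", "life"}, "MEANING_OF_LIFE_REPLY"),
]
FALLBACK = "WEB_SEARCH"


def respond(question, threshold=2):
    """Return the canned reply whose keywords overlap the question the most."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    keywords, reply = max(CANNED, key=lambda entry: len(entry[0] & words))
    if len(keywords & words) >= threshold:
        return reply
    return FALLBACK


# "wallpaper" is ignored; "best" and "iphone" are enough to trigger the reply.
print(respond("What's the best iPhone wallpaper?"))     # BEST_PHONE_REPLY
print(respond("What's the best phone?"))                # BEST_PHONE_REPLY
# Variations that drop one of the trigger words fall through to the fallback.
print(respond("What is the nicest iPhone wallpaper?"))  # WEB_SEARCH
```

If Siri's behavior changed in the same way when a trigger word was swapped for a synonym, that would support the keyword-match reading; if it did not, the matching is probably happening at a more semantic level.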

Siri and the Human Connection—The Eliza Effect

I believe that Apple, in keeping with its aim of combining technology with the liberal arts, attempted in Siri to establish a more human connection with the iPhone user: first, by allowing it to understand free-form conversation and, second, by giving it a personality so that the user may feel that she has a relationship with a humanoid rather than a machine. This has literally been the vision of the Knowledge Navigator, with its animated virtual human, and of other social interface projects over the past couple of decades.

The Wall Street Journal recently explored Apple’s motivation in Siri, asking whether smartphones are becoming smart alecks. The reporter noted that the original creators of Siri put “deep thought” into its personality, giving it “a light attitude.”

When Apple began integrating Siri into the iPhone, the team focused on keeping its personality friendly and humble—but also with an edge, according to a person who worked at Apple on the project. As Apple's engineers worked on the software, they were often thinking, "How would we want a person to respond?" this person said.

The Siri group, one of the largest software teams at Apple, fine-tuned Siri's responses in an attempt to forge an emotional tie with its customers. To that end, Siri regularly uses a customer's nickname in responses, as well as those of other important people and places in his or her life. "We thought of it almost as a person on the phone," this person said.

As for its effect, we see evidence on Twitter and in news articles repeating many of Siri’s sassy responses. My iPhone says the darndest things, writes one reporter. In actuality, the emotional bond between the iPhone and its users that Apple attempted to forge through Siri was described years ago as the “Eliza effect.” The Eliza effect is the bond that has been observed in users chatting with even less sophisticated chatterbots.

In a series of articles on the history of Eliza, the first and most widely known chatterbot, Jimmy Maher recounts the unexpected connection with humans who worked with it.

Perhaps the first person to interact extensively with Eliza was Weizenbaum’s secretary: “My secretary, who had watched me work on the program for many months and therefore surely knew it to be merely a computer program, started conversing with it. After only a few interchanges with it, she asked me to leave the room.” Her reaction was not unusual; Eliza became something of a sensation at MIT and the other university campuses to which it spread, and Weizenbaum an unlikely minor celebrity. Mostly people just wanted to talk with Eliza, to experience this rare bit of approachable fun in a mid-1960s computing world that was all Business (IBM) or Quirky Esoterica (the DEC hackers).

Weizenbaum’s reaction to all of this has become almost as famous as the Eliza program itself. When he saw people like his secretary engaging in lengthy heart-to-hearts with Eliza, it… well, it freaked him the hell out. The phenomenon Weizenbaum was observing was later dubbed “the Eliza effect” by Sherry Turkle, which she defined as the tendency “to project our feelings onto objects and to treat things as though they were people.” In computer science and new media circles, the Eliza effect has become shorthand for a user’s tendency to assume based on its surface properties that a program is much more sophisticated, much more intelligent, than it really is.

All that aside, I also believe that, at least in his strong reaction to the Eliza effect itself, Weizenbaum was missing something pretty important. He believed that his parlor trick of a program had induced “powerful delusional thinking in quite normal people.” But that’s kind of an absurd notion, isn’t it? Could his own secretary, who, as he himself stated, had “watched [Weizenbaum] work on the program for many months,” really believe that in those months he had, working all by himself, created sentience? I’d submit that she was perfectly aware that Eliza was a parlor trick of one sort or another, but that she willingly surrendered to the fiction of a psychotherapy session. It’s no great insight to state that human beings are imminently capable of “believing” two contradictory things at once, nor that we willingly give ourselves over to fictional worlds we know to be false all the time. Doing so is in the very nature of stories, and we do it every time we read a novel, see a movie, play a videogame. Not coincidentally, the rise of the novel and of the movie were both greeted with expressions of concern that were not all that removed from those Weizenbaum expressed about Eliza.

Beyond indulging the illusion that the computer is human, users who are aware of the computer’s failings also work to maintain that illusion by steering their own behavior, as Sherry Turkle writes in The Second Self.

As one becomes experienced with the ways of Eliza, one can direct one’s remarks either to “help” the program make seemingly pertinent responses or to provoke nonsense. Some people embark on an all-out effort to “psych out” the program, to understand its structure in order to trick it and expose it as a “mere machine.” Many more do the opposite. I spoke with people who told me of feeling “let down” when they had cracked the code and lost the illusion of mystery. I often saw people trying to protect their relationships with Eliza by avoiding situations that would provoke the program into making a predictable response. They didn’t ask questions that they knew would “confuse” the program, that would make it “talk nonsense.” And they went out of their way to ask questions in a form that they believed would provoke a lifelike response. People wanted to maintain the illusion that Eliza was able to respond to them.

Just imagine the potential impact on consumer brand loyalty that a well-designed assistant like Siri could have, should users willfully engage in the illusion of a human-like assistant and even actively maintain this self-deception. In the least technologically savvy portions of the population, the user may not even understand the technological limitations and could really believe the device truly understands her.

Technology & Liberal Arts

Steve Jobs often attributes the popularity of Apple products to the merging of technology with the humanities.

During the introduction of the first iPad, Jobs notes that "the reason that Apple is able to create products like iPad is because we always try to be at the intersection of technology and liberal arts, to be able to get the best of both, to make extremely advanced products in a technology point of view but also have them be intuitive, easy to use, fun to use so they really fit the users. The users don’t have to come to them, they come to the users. It’s the combination of these two things that has let us make the kind of creative products like the iPad."

He repeated this theme during the unveiling of the second iPad as he proclaimed a new post-PC era. “It’s in Apple’s DNA that technology alone is not enough. It is a marriage of technology with the liberal arts and humanities. The competitors we see it as a new PC market. That’s not the right approach. Tablet is a computer that needs to be easier to use than a PC and should be more intuitive.”

As emerging competitors in the tablet space have attempted to one-up Apple with better “specs,” Apple simply ignores their moves, instead talking, as in the iPad 2 launch, about products like GarageBand that enable users to produce studio-quality music without effort. Each product launch features an exciting consumer software element like iMovie for iPhone 3GS, FaceTime, or Siri that reaffirms the human element in software. Apple products are beautiful, fun, personal, and magical. It just works.

I am reminded of a post, “Apple is professional, the web is amateur,” praising Apple’s craftsmanship and show of love in its software; after reading the post, I played with the iLife suite and was completely stunned by the quality of websites that could be created with iWeb. (In contrast, Microsoft FrontPage was the surest way to produce an amateurish webpage, as if none of its developers had ever designed webpages. My encounter with Dreamweaver in 2000 was a “wake up” moment that alerted me to the mediocrity of some Microsoft applications.) As opposed to Android and Microsoft commercials that push vague, inauthentic, corporate branding messages, Apple’s products speak for themselves front-and-center in commercials with their impeccable beauty and magical qualities, whether it’s the richness of applications (especially those augmenting reality using phone-based sensors or tapping into web services like OpenTable), understanding of natural speech, or video-based calls.

This stance of melding the humanities into technology harkens back to Steve Jobs’s launch of the “Think Different” campaign, which shifted Apple’s marketing focus from technology to values, passion above all.

The Apple brand has clearly suffered from neglect in this area in the last few years and we need to bring it back. The way to do that is not to talk about speeds and feeds; not to talk about MIPS and megahertz; not to talk about why we’re better than Windows…

Our customers want to know who is Apple and what is it that we stand for, where do we fit in this world. What we are about isn’t making boxes to get people to get their jobs done, although we do that well, we do that better than almost anybody in some cases. But Apple is about something more than that. Apple at the core, it’s core value, is that we believe that people with passion can change the world for the better.

In the recent CBS interview, Jobs is taped saying that Microsoft never had the humanities and liberal arts in its DNA, that they are a pure technology company. He also believes Google is turning the same way.

October 14, 2011

Siri and the AI Revolution

With natural user interfaces (like touchscreens and microphones) and sensor-rich devices (with GPS and cameras) rapidly becoming mainstream due to the rise of smartphones and tablets, alongside growing acceptance of speech recognition, augmented reality and now the conversational UI in the form of Siri, there have been growing predictions of a coming AI revolution in the computer industry.

In an interview with 9to5Mac, Norman Winarsky, cofounder of Siri, predicted before the iPhone launch that the Siri Assistant would be a world-changing event. The other cofounder Tom Gruber, former Siri CEO, articulated the underpinnings for this change in his Jan 2010 Web 3.0 conference keynote, “Big Think Small Screen: How semantic computing in the cloud will revolutionize the consumer experience on the phone.”

After the iPhone launch, there is still widespread skepticism about the practicality of Siri and conversational interfaces given the past failures of speech and handwriting recognition software, but I believe that Apple has planted an indelible germ in people’s minds that will continue to grow in coming years and will manifest itself in new products and expectations. I would not be surprised if Siri, in future versions, morphs into an animated person as in the Knowledge Navigator videos.

Siri is also born in the right age. Mobile devices can wirelessly connect to servers, maintain a copy of the user’s important information and contacts, and retrieve context like geolocation, while third-party services like Yelp for businesses, OpenTable for restaurant reservations, and Wolfram Alpha for general facts abound on the semantic and social Web. There are also ample memory and computing resources available within a mobile device; natural language assets have accumulated over time; and the web offers a rich supply of information for data mining and machine learning. In a mobile device designed to host phone conversations, speech may even be the most convenient form of input, since the keyboard is so small and the relevant application, if there is one, may be several clicks away.

With developments such as Watson, Siri, natural user interfaces, and real-time Google translations, AI might finally be here.

I am somewhat worried, though, since my entire business strategy is based on AI technologies and the giant software companies are coming in.

In a presentation of my business plan in 2002 for a natural language software company, one piece of feedback that I received was that AI is a dead end, with allusions to massive failures in the 1980s such as Japan’s Fifth Generation Project. I was amused, not worried. AI is a big field. I am familiar with those failed projects, which involved Prolog and expert systems but have very little relationship to the work that I am doing, for which I had already produced proofs of concept. A fellow business school classmate who helped with my business plan could not comprehend that a computer could truly understand natural language, offering up arguments as to why it was impossible. How quickly times change.

Conversational Interfaces: Siri & Shrdlu

Apple announced Siri as a voice-activated “personal assistant” available exclusively in the iPhone 4S.

Siri reminds me of a few projects from the past, but it’s the first apparent demonstration of natural language understanding in a consumer product (excluding simplistic natural-language-based games like Zork).

First, it’s the realization of the conversational interface envisioned by Apple’s Knowledge Navigator video in 1987 and a couple of other copycat vision videos produced by IBM and Microsoft. I could easily see the Siri user interface expanding to include a digital, three-dimensionally rendered person in the future.

Second, it embodies the inherently social interaction people have with their computers, a thesis put forth by the social interface work of two Stanford researchers (Cliff Nass and Byron Reeves) that led to the Microsoft Bob user interface project and the Microsoft Agent technology.

Mostly, it reminds me of SHRDLU - a natural language program written way back in 1968 (around the same time as the Mother of All Demos which predicted mouse-based interfaces, windows, and video conferencing) that presented the first compelling demonstration of a conversational interface in which the computer fully understood and operated upon the direction of a user. Although ELIZA existed four years earlier, ELIZA was built on heuristics and dumb pattern matching without any real understanding or memory of its conversation with humans.
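To show what “heuristics and dumb pattern matching” looks like in practice, here is a minimal ELIZA-style sketch. The rules and the pronoun table are invented for illustration and are not Weizenbaum's original script; the point is only that each reply is produced from the current sentence alone, with no model of the world and no memory of earlier turns.

```python
import re

# Invented ELIZA-style rules: a pattern and a reply template per rule.
RULES = [
    (re.compile(r"\bi need (.+)", re.I), "Why do you need {0}?"),
    (re.compile(r"\bi am (.+)", re.I), "How long have you been {0}?"),
    (re.compile(r"\bmy (mother|father)\b", re.I), "Tell me more about your {0}."),
]
DEFAULT = "Please go on."

# Swap first-person words for second-person ones when echoing the user.
REFLECT = {"i": "you", "me": "you", "my": "your", "am": "are"}


def reflect(fragment):
    """Echo a fragment back with its pronouns flipped."""
    return " ".join(REFLECT.get(word, word) for word in fragment.lower().split())


def reply(text):
    """Answer with the first rule that matches; no state is kept between calls."""
    text = text.rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(reflect(match.group(1)))
    return DEFAULT


print(reply("I need a vacation."))      # Why do you need a vacation?
print(reply("I am sad about my job"))   # How long have you been sad about your job?
print(reply("The weather is nice"))     # Please go on.
```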

There have been other projects over the years, such as expert systems. Recent products in popular technology include Cyc and IBM’s Watson, which can analyze the Web and give direct answers to natural language queries, and Wolfram Alpha (also used as a source in Siri), which acts as a computation engine to deliver or derive answers to natural language queries from an organized database of facts.

SHRDLU was developed as part of a PhD dissertation generations ago, both in Moore’s Law and human terms, and the original source code in LISP is still available. Despite press excitement that Siri was based on a DARPA-funded CALO project, many aspects of SHRDLU share similarities with Siri. The SHRDLU computer program was able to perform actions, ask questions to clarify any ambiguous requests, recall the history of the conversation, and learn new terminology.

Below is a transcript of a conversation used in the SHRDLU demo:

Person: PICK UP A BIG RED BLOCK.
Computer: OK. (does it)
Person: GRASP THE PYRAMID.
Computer: I DON'T UNDERSTAND WHICH PYRAMID YOU MEAN.
Person: FIND A BLOCK WHICH IS TALLER THAN THE ONE YOU ARE HOLDING AND PUT IT INTO THE BOX.
Computer: BY "IT", I ASSUME YOU MEAN THE BLOCK WHICH IS TALLER THAN THE ONE I AM HOLDING.
Computer: OK. (does it)
Person: WHAT DOES THE BOX CONTAIN?
Computer: THE BLUE PYRAMID AND THE BLUE BLOCK.
Person: WHAT IS THE PYRAMID SUPPORTED BY?
Computer: THE BOX.
Person: HOW MANY BLOCKS ARE NOT IN THE BOX?
Computer: FOUR OF THEM.
Person: IS AT LEAST ONE OF THEM NARROWER THAN THE ONE WHICH I TOLD YOU TO PICK UP?
Computer: YES, THE RED CUBE.
Person: IS IT SUPPORTED?
Computer: YES, BY THE TABLE.
Person: CAN THE TABLE PICK UP BLOCKS?
Computer: NO.
Person: CAN A PYRAMID BE SUPPORTED BY A BLOCK?
Computer: YES.
Person: CAN A PYRAMID SUPPORT A PYRAMID?
Computer: I DON'T KNOW.
Person: STACK UP TWO PYRAMIDS.
Computer: I CAN'T.
Person: WILL YOU PLEASE STACK UP BOTH OF THE RED BLOCKS AND EITHER A GREEN CUBE OR A PYRAMID?
Computer: OK. (does it)
Person: WHICH CUBE IS SITTING ON THE TABLE?
Computer: THE LARGE GREEN ONE WHICH SUPPORTS THE RED PYRAMID.
Person: IS THERE A LARGE BLOCK BEHIND A PYRAMID?
Computer: YES, THREE OF THEM: A LARGE RED ONE, A LARGE GREEN CUBE, AND THE BLUE ONE.
Person: PUT A SMALL ONE ONTO THE GREEN CUBE WHICH SUPPORTS A PYRAMID.
Computer: OK. (does it)
Person: PUT THE LITTLEST PYRAMID ON TOP OF IT.
Computer: OK. (does it)
Person: DOES THE SHORTEST THING THE TALLEST PYRAMID'S SUPPORT SUPPORTS SUPPORT ANYTHING GREEN?
Computer: YES, THE GREEN PYRAMID.
Person: WHAT COLOR IS IT?
Computer: BY "IT", I ASSUME YOU MEAN THE SHORTEST THING THE TALLEST PYRAMID'S SUPPORT SUPPORTS.
Computer: RED
Person: HOW MANY THINGS ARE ON TOP OF GREEN CUBES?
Computer: I'M NOT SURE WHAT YOU MEAN BY "ON TOP OF "IN THE PHRASE "ON TOP OF GREEN CUBES " .
DO YOU MEAN:
1 - DIRECTLY ON THE SURFACE
2 - ANWHERE ON TOP OF ?
Person: 2
Computer: THREE OF THEM.
Person: HAD YOU TOUCHED ANY PYRAMID BEFORE YOU PUT THE GREEN ONE ON THE LITTLE CUBE?
Computer: YES, THE GREEN ONE.
Person: WHEN DID YOU PICK IT UP?
Computer: WHILE I WAS STACKING UP THE RED CUBE, A LARGE RED BLOCK AND A LARGE GREEN CUBE.
Person: WHY?
Computer: TO GET RID OF IT.
Person: WHY DID YOU DO THAT?
Computer: TO CLEAN OFF THE RED CUBE.
Person: WHY DID YOU CLEAR OFF THAT CUBE?
Computer: TO PUT IT ON A LARGE GREEN CUBE.
Person: WHY DID YOU DO THAT?
Computer: TO STACK UP THE RED CUBE , A LARGE RED BLOCK AND A LARGE GREEN CUBE.
Person: WHY DID YOU DO THAT?
Computer: BECAUSE YOU ASKED ME TO.
Person: HOW DID YOU DO IT?
Computer: BY PUTTING A LARGE RED BLOCK ON THE TABLE ; THEN LETTING GO OF IT ; THEN PUTTING A LARGE GREEN CUBE ON IT ; THEN LETTING GO OF THAT CUBE ; THEN PUTTING THE RED CUBE ON THAT CUBE; THEN LETTING GO OF THAT CUBE.
Person: HOW MANY OBJECTS DID YOU TOUCH WHILE YOU WERE DOING IT?
Computer: FOUR OF THEM.
Person: WHAT DID THE RED CUBE SUPPORT BEFORE YOU STARTED TO CLEAN IT OFF?
Computer: THE GREEN PYRAMID.
Person: THERE WERE FIVE BLOCKS TO THE LEFT OF THE BOX THEN.
Computer: NO, ONLY FOUR OF THEM: THE RED CUBE, TWO LARGE GREEN CUBES AND A LARGE RED BLOCK.
Person: PUT THE BLUE PYRAMID ON THE BLOCK IN THE BOX.
Computer: OK. (does it)
Person: IS THERE ANYTHING WHICH IS BIGGER THAN EVERY PYRAMID BUT IS NOT AS WIDE AS THE THING THAT SUPPORTS IT?
Computer: YES, THE BLUE BLOCK.
Person: DOES A STEEPLE
Computer interrupting: SORRY, I DON'T KNOW THE WORD "STEEPLE".
Person: A "STEEPLE" IS A STACK WHICH CONTAINS TWO GREEN CUBES AND A PYRAMID.
Computer: I UNDERSTAND.
Person: ARE THERE ANY STEEPLES NOW?
Computer: NO.
Person: BUILD ONE.
Computer: OK. (does it)
Person: CALL THE BIGGEST BLOCK "SUPERBLOCK".
Computer: OK.
Person: HAVE YOU PICKED SUPERBLOCK UP SINCE WE BEGAN?
Computer: YES.
Person: WHY DID YOU DROP IT?
Computer: BY "IT", I ASSUME YOU MEAN THE SUPERBLOCK.
I CAN'T EXPLAIN A NON-EXISTENT EVENT.
Person: IS THERE ANYTHING TO THE RIGHT OF THE RED PYRAMID?
Computer: YES, FOUR THINGS: THE BOX , THE RED CUBE, THE BLUE PYRAMID AND THE BLOCK NAMED SUPERBLOCK
Person: THANK YOU.
Computer: YOU'RE WELCOME!
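The clarification behavior near the start of the transcript (“GRASP THE PYRAMID”) can be illustrated with a toy blocks world. This is a minimal sketch under my own assumptions, not SHRDLU's actual LISP implementation: the world contents, the `resolve` helper, and the no-match reply are all hypothetical, and the last word of the phrase is simply assumed to be the head noun.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    shape: str
    color: str
    size: str


# An invented world; the real demo scene had more objects.
WORLD = [
    Block("block", "red", "big"),
    Block("pyramid", "green", "small"),
    Block("pyramid", "blue", "small"),
    Block("block", "blue", "big"),
]


def resolve(phrase, world):
    """Map a noun phrase such as 'the blue pyramid' to an object in the world.

    Returns (object, reply). A definite reference ("the ...") that matches
    several objects yields a clarification request instead of a guess.
    """
    words = phrase.lower().split()
    article, attributes = words[0], set(words[1:])
    matches = [
        obj for obj in world
        if attributes <= {obj.shape, obj.color, obj.size}
    ]
    if not matches:
        return None, "I CAN'T FIND ONE."  # invented reply, not from the demo
    if article == "the" and len(matches) > 1:
        return None, f"I DON'T UNDERSTAND WHICH {words[-1].upper()} YOU MEAN."
    return matches[0], "OK."


print(resolve("a big red block", WORLD)[1])   # OK.
print(resolve("the pyramid", WORLD)[1])       # I DON'T UNDERSTAND WHICH PYRAMID YOU MEAN.
print(resolve("the blue pyramid", WORLD)[1])  # OK.
```

Even this toy version differs from keyword matching in one important way: the reply depends on the state of the world being talked about, not only on the words in the request.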