Analytical Views
January 2002

SALT Seasons The Speech Rec Market

BY BRIAN STRACHMAN


Microsoft recently (or should I say finally) settled its lawsuit with the Department of Justice with, in my opinion, a slap on the wrist. That is not to say that I disagree with the results. Microsoft has always been a ferocious competitor. Sometimes bending the rules a bit is just part of the game, and I expect its most recent release to be much the same. Microsoft has founded a forum (in conjunction with Cisco, Comverse, Intel, Philips, and SpeechWorks) called Speech Application Language Tags, or SALT. While we are so overrun by acronyms that they often seem meaningless, this one has considerable significance to our industry.

The forum’s official statement is that SALT’s mission is to “develop a royalty-free, platform-independent standard that will make possible multi-modal and telephony-enabled access to information, applications and Web services from PCs, telephones, tablet PCs, and wireless personal digital assistants (PDAs).” A few of these statements sound very familiar, particularly the part about platform-independent speech development. For example, the VoiceXML Forum has the following goal: “establishing and promoting the Voice Extensible Markup Language (XML), a new specification essential to making Internet content and information accessible via voice and phone.” On the surface the two organizations sound very similar, but they do differ. First, SALT is a standard while VXML is merely a specification. Admittedly, that’s not much of a difference, but there is a more important distinction: SALT focuses on multi-modal access to information.

Multi-modal communication has always been the technical dream of the speech industry, but significant barriers have gotten in the way of achieving it. In a nutshell, multi-modal means communicating using different formats for the input and the output. By contrast, a single-modal communication method uses only one format, such as voice. For example, in an IVR application the user inputs a request to the system using either touch-tone or speech, and the output the user receives is an audio reading of the requested information. This is what happens when I call my bank’s IVR, ask for my account balance, and a machine reads back the information. Similarly, the Internet is single-modal. I type my request for information into the Web browser and the results are displayed on my screen. Multi-modal communication involves a combination of the two -- aural and visual.

Why is this so important? Because mobile Web browsing on a PDA or WAP phone is -- at best -- a painful experience. People rarely use the technology, and it has become a niche market populated only by technogeeks, never to see the mainstream. There is plenty of great information that can be displayed on a mobile device, but getting to that information is the problem. Something as simple as checking a movie listing can take upwards of five minutes on a Web-enabled PDA. It is so impractical that the technology is almost worthless. Moreover, when one considers the challenges (and dangers) of accessing this data while driving, it’s amazing anyone uses it at all.

The speech industry faces a similar challenge. Obtaining information from an IVR has almost become a pleasant experience with the recent advances in speech recognition. Call the number, say what you want, and bingo, you have your stock quotes, prescription refills, account balance, or airline arrival time. The problem is, it can’t do much else. IVRs are severely constrained by the limitations of the human mind. We can only deal with so many choices at once, so experienced IVR programmers will rarely give the user more than four or five options. Output is even worse. Generally, people can only think about one thing at a time, so IVRs are limited to applications where the output is relatively simple. The solution is multi-modal communication.

Imagine using a handheld device like a PDA or cellular handset with a large display, and browsing to a financial Web site. I would simply speak into the device and say, “I’m Brian Strachman. Show me portfolio number one.” Instantly I would see the twenty-plus stocks I own, the price I paid, current market shifts, and any other personalized information. Based on this I would then speak into the mobile device, “Buy 100 shares of Microsoft, short sell 50 shares of Priceline, and sell 25 shares of AOL.” Again, the transaction would be displayed visually on my device, as would the resulting changes in my portfolio. All of this would be far too much information for me to handle over the telephone without a pen, paper, and quite a bit of time.

This is simply one example of a multi-modal application, and obviously there are countless others (e.g., salespeople checking account status on the way to a meeting). The possibilities are endless. Multi-modal communication opens up a new realm of products and solutions that were previously impossible for the IVR or PDA industry. It takes the best possible input -- speech -- and couples it with the best possible output -- data. Theoretically, SALT makes this all possible.

It also becomes painfully clear why Microsoft is now interested in the speech industry. With the PC market declining, and the PDA industry underachieving, SALT could reinvigorate the mobile computing space. With applications designed to utilize speech on mobile devices running Windows powered for Pocket PC, Microsoft stands to do very well. Even if VXML gets put on the back burner because of the success of SALT, the market will benefit … particularly a notoriously ferocious competitor called Microsoft.

Brian Strachman is senior analyst, Voice and Data Communications, Cahners In-Stat Group. To correspond with the author, please send your comments to brians@instat.com.
