Microsoft recently (or should I say finally) settled their
lawsuit with the Department of Justice for what was, in my opinion, a slap on the
wrist. That is not to say that I disagree with the results. Microsoft has
always been a ferocious competitor. Sometimes bending the rules a bit is
just part of the game, and I expect their most recent release to be much
the same. Microsoft has founded a forum (in conjunction with Cisco,
Comverse, Intel, Philips, and SpeechWorks) called Speech Applications
Language Tags, or SALT. While we are so overrun by acronyms that they
often seem meaningless, this one has considerable significance to our
industry.
The forum’s official statement is that SALT’s mission
is to “develop a royalty-free, platform-independent standard that will
make possible multi-modal and telephony-enabled access to information,
applications and Web services from PCs, telephones, tablet PCs, and
wireless personal digital assistants (PDAs).” A few of these statements
sound very familiar, particularly the part about platform-independent speech
development. For example, the Voice XML Forum has the following goal:
“establishing and promoting the Voice Extensible Markup Language (XML),
a new specification essential to making Internet content and information
accessible via voice and phone.” On the surface, the two organizations
sound very similar, but they do differ. First, SALT is a
standard while VXML is merely a specification. Well, admittedly that’s not
much of a difference, but there is a more important
distinction: SALT focuses on multi-modal access to information.
Multi-modal communication has always been the technical
dream of the speech industry, but significant barriers have gotten in
the way of achieving it. In a nutshell, multi-modal means communicating
using different formats for the input and the output. By contrast, a
single-modal communication method would only use one format, such as
voice. For example, in an IVR application the user inputs a request to the
system using either touch-tone or speech, and then the output the user
receives is an audio reading of the information required. This is what
happens when I call my bank’s IVR, ask for my account balance, and a
machine reads back the information. Similarly, the Internet is
single-modal. I type my request for information into the Web browser and
the results are displayed on my screen. Multi-modal communication involves
a combination of the two -- aural and visual.
Why is this so important? Because mobile Web browsing on a PDA or WAP
phone is -- at best -- a painful experience. People rarely use the
technology, and it has become a niche market populated only by technogeeks,
never to see the mainstream. There is plenty of great information that can
be displayed on a mobile device, but getting to that information is the
problem. Something as simple as checking a movie listing can take upwards
of five minutes on a Web-enabled PDA. It is so impractical that the
technology is almost worthless. Moreover, when one considers the
challenges (and dangers) of accessing this data while driving, it’s
amazing anyone uses it at all.
The speech industry faces a similar challenge. Obtaining
information from an IVR has almost become a pleasant experience with the
recent advances in speech recognition. Call the number, say what you want,
and bingo, you have your stock quotes, prescription refills, account
balance, or airline arrival time. The problem is, it can’t do much else.
IVRs are severely constrained by the limitations of the human mind. We can
only deal with so many choices at once. Experienced IVR programmers will
rarely give the user more than four or five options. Output is even worse.
Generally, people can only think about one thing at a time, so IVRs are
limited to applications where the output is relatively simple. The
solution is multi-modal communications.
Imagine using a handheld device like a PDA or cellular
handset with a large display, and browsing to a financial Web site. I
would simply speak into the device and say, “I’m Brian Strachman. Show
me portfolio number one.” Instantly I would see the twenty-plus stocks I
own, the price I paid, current market shifts, and any other
personalization. Based on this I would then speak into the mobile device,
“Buy 100 shares of Microsoft, short sell 50 shares of Priceline, and
sell 25 shares of AOL.” Again the transaction would be displayed
visually on my device, as would the resulting changes in my portfolio. All
of this would be far too much information for me to handle over the
telephone without a pen, paper, and quite a bit of time.
This is simply one example of a multi-modal application
and obviously there are countless others (e.g., salespeople checking
account status on the way to a meeting). The possibilities are endless.
Multi-modal communication opens up a new realm of products and solutions
that were previously impossible for the IVR or PDA industries. It takes the
best possible input -- speech -- and couples it with the best possible
output -- data. Theoretically, SALT makes this all possible.
It also becomes painfully clear why Microsoft is now
interested in the speech industry. With the PC market declining, and the
PDA industry underachieving, SALT could reinvigorate the mobile computing
space. With applications designed to utilize speech on mobile devices
running Windows powered for Pocket PC, Microsoft stands to do very well.
Even if VXML gets put on the backburner because of the success of SALT,
the market will benefit … particularly a notoriously ferocious
competitor called Microsoft.
Brian Strachman is senior analyst, Voice and Data
Communications, Cahners In-Stat Group.
To correspond with the author, please send your comments to brians@instat.com.