VoiceXML: Enabling Voice Access To Information
BY KIMBERLEE KEMBLE
Voice technology used to be something akin to Star Trek -- cool, but
futuristic. Not anymore. With the convergence of voice, Web, and
telephony, and with improvements in processing power and more
sophisticated algorithms, voice technology is already starting to add real
value to companies’ IT infrastructures. New speech-enabled applications
are hitting the market as businesses realize that voice is the most
natural way to access information from the Internet, mobile phones, car
dashboards, or handheld organizers.
Take call centers. Voice technology is a natural way to improve service
while also reducing personnel costs. It allows companies to use
automated speech recognition and text-to-speech audio to serve customers
over the phone, 24/7, without subjecting them to hold times or older
systems that require people to press buttons and respond to rigidly
structured menus. This means cost savings and increased customer
satisfaction.
Voice is poised for growth. The Kelsey Group predicts that the voice
business will reach $41 billion by 2005. Key to the adoption of voice is
VoiceXML: an open standard that gives regular Web developers a common,
easy way to build voice applications. With VoiceXML, developers need
less of the specialized voice expertise that has for so long complicated
the creation of these applications. VoiceXML promises to mean to voice what
other open standards such as Java and XML have meant to the Internet, and
will allow mainstream developers to build voice applications.
WHAT IS VoiceXML?
VoiceXML, or Voice eXtensible Markup Language, is an XML-based markup
language for creating distributed voice applications, much as HTML is a
markup language for creating distributed visual applications. It was
originally driven by the VoiceXML Forum, which is an industry organization
founded by AT&T, IBM, Lucent, and Motorola. Today, more than 500
companies support VoiceXML and use it to develop applications. The
specification has reached Release 2.0 level, and its continued development
is being managed by the W3C’s Voice Browser Working Group.
VoiceXML isn’t limited to voice-enabling Web applications. It is a
standard for voice-enabling any type of e-business application to which
you want to provide voice access. By utilizing the same networking
infrastructure, HTTP communications, and markup language programming
model, VoiceXML leverages not only an enterprise’s often very
significant investment in system resources, but also the skills of many of
its developers and administrators. With VoiceXML, you can quickly and
easily provide speech access to your applications.
This allows for anytime, anywhere access. Customers, employees, and
business partners can access information and complete transactions on the
move, without being tied to phone lines or an Internet connection.
THE EXTENDED WEB WORLD
VoiceXML is designed to extend the existing Web environment by providing
another way of accessing information and services. With VoiceXML, you use
your voice and a telephone to access information instead of a computer and
a mouse.
There are many similarities between the visual (HTML) Web world and the
audio (VoiceXML) Web world. For example, in the visual world, you use a
Web browser to access the Web; in the VoiceXML world, you use a VoiceXML
browser. Web browsers present information to the user through HTML; voice
browsers present information to the user through VoiceXML.
And these similarities are by design. The primary goal of VoiceXML is
to bring the power of Web development and content delivery to voice
applications. It was designed to provide a way for Web developers to use a
familiar markup style and existing Web server-side logic to deliver voice
content over the Internet. If you know HTML or WML, VoiceXML is going to
look very familiar to you.
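To make the resemblance concrete, here is a minimal sketch of what a VoiceXML 2.0 document can look like, held in a Python string and checked with Python's standard XML parser. The greeting text, the form id, and the helper name prompts_in are illustrative assumptions, not taken from any particular product.

```python
import xml.etree.ElementTree as ET

# A minimal VoiceXML 2.0 document: one form whose block speaks a greeting.
# The wording of the prompt is made up for this illustration.
HELLO_VXML = """<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="hello">
    <block>
      <prompt>Welcome to the example voice service.</prompt>
    </block>
  </form>
</vxml>
"""

# ElementTree prefixes every tag with its namespace in braces.
VXML_NS = "{http://www.w3.org/2001/vxml}"


def prompts_in(document):
    """Return the text of every <prompt> element, in document order."""
    root = ET.fromstring(document.encode("utf-8"))
    return [p.text.strip() for p in root.iter(VXML_NS + "prompt") if p.text]


if __name__ == "__main__":
    for line in prompts_in(HELLO_VXML):
        print(line)
```

As with HTML, the document is just nested tags: a form plays the role of a page, and the prompt holds what the caller hears instead of what a reader sees.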
THE VoiceXML BROWSER
In the world of VoiceXML, you interact with your application over the
phone using a VoiceXML browser. The VoiceXML browser is analogous to a
graphical Web browser (such as Netscape Communicator and Microsoft
Internet Explorer). It is the way you interact with a Web server using
your voice and a telephone. Instead of rendering and interpreting HTML
(like a graphical browser), the VoiceXML browser renders and interprets
VoiceXML. Instead of clicking a mouse and using your keyboard, you use
your voice and a telephone (and even the phone keypad) to access
information and services.
One of the primary functions of the VoiceXML browser is to fetch
VoiceXML documents from the Web server (just like a graphical Web browser
fetches HTML documents). The request to fetch a document can be generated
either by the interpretation of a VoiceXML document, or in response to an
external event. The VoiceXML browser uses HTTP over a LAN or the Internet
to fetch the documents (the very same HTTP requests that are used by
the graphical Web browser).
The VoiceXML browser interprets and renders the VoiceXML document. It
manages the dialog between the application and the user by playing audio
prompts, accepting user inputs, and acting on those inputs. The action
might involve jumping to a new dialog, fetching a new document, or
submitting user input to the Web server for processing.
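The cycle of playing prompts, accepting input, and acting on it can be sketched as a toy interpreter. This is a deliberately simplified illustration in Python, not how a real VoiceXML browser is built. It handles only the menu, choice, form, block, and prompt elements, takes already-recognized text in place of speech, and simply stops when nothing matches (a real browser would reprompt). The menu wording and the function names are assumptions.

```python
import xml.etree.ElementTree as ET

# Illustrative document: a menu that jumps to one of two forms.
MENU_VXML = """<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu id="main">
    <prompt>Say checking account or savings account.</prompt>
    <choice next="#checking">checking account</choice>
    <choice next="#savings">savings account</choice>
  </menu>
  <form id="checking">
    <block><prompt>You chose checking.</prompt></block>
  </form>
  <form id="savings">
    <block><prompt>You chose savings.</prompt></block>
  </form>
</vxml>"""


def local(element):
    """Tag name without the XML namespace prefix."""
    return element.tag.split("}")[-1]


def run_dialog(document, utterances):
    """Walk the dialogs of one document and return the prompts played.

    `utterances` stands in for speech recognition results (already text).
    """
    root = ET.fromstring(document)
    dialogs = {d.get("id"): d for d in root if local(d) in ("menu", "form")}
    pending = list(utterances)
    played = []
    current = next(iter(dialogs.values()))  # start with the first dialog
    while current is not None:
        following = None
        # Play every prompt found in the current dialog.
        for element in current.iter():
            if local(element) == "prompt":
                played.append(element.text.strip())
        # A menu consumes one utterance and jumps to the matching choice.
        if local(current) == "menu" and pending:
            heard = pending.pop(0).strip().lower()
            for choice in current:
                if local(choice) == "choice" and choice.text.strip().lower() == heard:
                    following = dialogs.get(choice.get("next").lstrip("#"))
        current = following
    return played
```

Calling run_dialog(MENU_VXML, ["savings account"]) plays the menu prompt and then the prompt of the savings form, which is the "jumping to a new dialog" case. Fetching a new document or submitting input to the server would replace the local lookup with an HTTP request.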
ARCHITECTURE
Let’s take a look at how VoiceXML and the VoiceXML browser fit into the
current Web environment. We’re all very familiar with the Web as it
works today. You use a graphical Web browser (such as Netscape
Communicator or Internet Explorer), which renders and interprets HTML to
present information to the user (text, graphics, audio, hyperlinks, etc.).
When the user makes a selection (for example, a click on a hyperlink), the
graphical Web browser sends an HTTP request to the Web server (in this
case, to retrieve another page). The Web server responds by locating the
new page and returning HTML to the browser to present the new page to the
user. The Web server may also have to interact with a back-end
infrastructure (database, servlets, etc.) to obtain and return the
requested information.
The VoiceXML browser extends this paradigm by adding a telephone and a
voice server to the Web environment. For the purposes of this article, a
voice server is an abstraction. It is an entity that contains the VoiceXML
browser, the speech recognition software, and the text-to-speech software.
Providing voice access to information requires some underlying voice
technologies, not just a programming language (VoiceXML). These
technologies are speech recognition and speech synthesis.
Speech recognition is a software component that translates spoken input
into text. An application can then do something with that text. For
example, if a caller were to say “checking account,” the application
could retrieve the caller’s current checking account balance and tell
her what it is.
On the other hand, speech synthesis (or text-to-speech, as it is more
commonly called) is a voice technology that converts text into spoken
output. In the previous example, the application could “read” the
checking account balance to the caller by using text-to-speech.
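The division of labor can be sketched in a few lines of Python: speech recognition hands the application text, the application decides what to say, and text-to-speech speaks the reply. Both engines are left out here. The function receives text and returns text. The balances and the name respond_to are hypothetical.

```python
# Hypothetical account data standing in for a back-end system.
BALANCES = {"checking account": 1250.75, "savings account": 8300.00}


def respond_to(recognized_text):
    """Map recognized speech (already text) to the reply to be synthesized."""
    account = recognized_text.strip().lower()
    if account not in BALANCES:
        return ("Sorry, I did not understand. "
                "Please say checking account or savings account.")
    # Split the amount into whole dollars and cents so it reads naturally.
    dollars, cents = divmod(round(BALANCES[account] * 100), 100)
    return f"Your {account} balance is {dollars} dollars and {cents} cents."


if __name__ == "__main__":
    print(respond_to("checking account"))
```

The application logic in the middle never touches audio, which is why it can stay on the Web server while the voice server handles recognition and synthesis.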
VoiceXML introduces a new way of presenting the same information and
services to the user. Now, instead of presenting the information visually
(through HTML, graphics, and text), the VoiceXML browser presents the
information to the caller using VoiceXML. When the caller says something
(which is the voice equivalent of clicking on something to make a
selection), the VoiceXML browser sends an HTTP request to the Web server,
which may access the very same back-end infrastructure, to return
information -- this time, in VoiceXML -- to the user.
When the VoiceXML browser is started, it sends an HTTP request over the
LAN or Internet to request an initial VoiceXML document from the Web
server. The requested VoiceXML document can contain static information, or
it can be generated dynamically from data stored in an enterprise database
using the same type of server-side logic (CGI scripts, JavaBeans, ASPs,
JSPs, Java servlets, etc.) that you use to generate dynamic HTML
documents.
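As a rough sketch of dynamic generation, the Python below builds a VoiceXML document from data at request time, using a plain WSGI function as a stand-in for the server-side technologies listed above. The BALANCES dictionary plays the part of the enterprise database, and the account query parameter and function names are assumptions made for the example. The media type application/voicexml+xml was registered after this article first appeared, and a deployment may be configured differently.

```python
from urllib.parse import parse_qs
from xml.sax.saxutils import escape

# Hypothetical stand-in for data held in an enterprise database.
BALANCES = {"checking": "1250 dollars and 75 cents",
            "savings": "8300 dollars"}


def balance_vxml(account):
    """Generate a VoiceXML document that speaks one account balance."""
    if account in BALANCES:
        sentence = f"Your {account} balance is {BALANCES[account]}."
    else:
        sentence = "I could not find that account."
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">\n'
        '  <form>\n'
        f'    <block><prompt>{escape(sentence)}</prompt></block>\n'
        '  </form>\n'
        '</vxml>\n'
    )


def application(environ, start_response):
    """WSGI entry point: answer a request such as /?account=checking."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    account = query.get("account", [""])[0]
    body = balance_vxml(account).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/voicexml+xml"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

For a quick local trial, the standard library's wsgiref.simple_server.make_server can serve this application function. The only difference from a handler that generates HTML is the markup it writes.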
The VoiceXML browser interprets and renders the document. Based on the
user’s input, the VoiceXML browser may request a new VoiceXML document
from the Web server, or may send data back to the Web server to update
information in the back-end database. The important thing is that the
mechanism for accessing your back-end enterprise data does not need to
change; your VoiceXML applications can access the same information from
your enterprise servers that your HTML applications do.
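One way to picture this is a single back-end lookup feeding two presentations. In the hedged Python sketch below, get_order_status is a hypothetical stand-in for whatever query your enterprise servers already answer. Only the two small rendering functions differ between the visual and the voice channel.

```python
from html import escape


def get_order_status(order_id):
    """Hypothetical back-end lookup shared by every front end."""
    orders = {"1001": "shipped", "1002": "being packed"}
    return orders.get(order_id, "unknown")


def render_html(order_id):
    """Visual presentation of the order status."""
    status = escape(get_order_status(order_id))
    return (f"<html><body><p>Order {escape(order_id)}: "
            f"<b>{status}</b></p></body></html>")


def render_vxml(order_id):
    """Spoken presentation of the same order status."""
    status = escape(get_order_status(order_id))
    return (
        '<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">'
        f'<form><block><prompt>Order {escape(order_id)} is {status}.'
        '</prompt></block></form>'
        '</vxml>'
    )


if __name__ == "__main__":
    print(render_html("1001"))
    print(render_vxml("1001"))
```

Because both renderers call the same function, a change to how order data is stored touches neither of them.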
When thinking about voice-enabling the Web, there are two points to
keep in mind:
- Providing voice access to your application doesn’t mean throwing
away the graphics from a traditional visual interface and reading the
rest of the information aloud. That probably wouldn’t be very
useful. What it does mean, though, is providing a different way of
accessing the same information and services. Even though you are
providing the same information and services as you would with a
graphical interface, you probably need to change the way you present
this information. For example, you may be able to show a list box with
30 items in it in a visual application, but you probably don’t want
to read 30 items to the caller over the phone. The point is, you are
changing the presentation of the information, not the information or
how it’s generated (by the Web server and the back-end).
- Voice isn’t always the best user interface for an application.
There are some applications that are just more suited for a visual
medium. And that’s OK. For example, it may be quite acceptable to
purchase a music CD or a book over the phone, but you might not want
to purchase a $300 cashmere sweater without being able to see it and
feel it first. However, a customer might find it very useful to check
the status of their order over the phone.
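The list box example can be sketched in code. One possible approach, shown here in Python purely as an illustration, is to break a long list into short spoken groups and let the caller ask for more. The page size of four and the wording of the prompts are assumptions, since the right values depend on the application and its callers.

```python
def voice_pages(items, page_size=4):
    """Split a long list into short prompts, offering "more" between pages."""
    prompts = []
    for start in range(0, len(items), page_size):
        page = items[start:start + page_size]
        # Join the items the way a person would say them aloud.
        if len(page) > 2:
            spoken = ", ".join(page[:-1]) + ", or " + page[-1]
        else:
            spoken = " or ".join(page)
        prompt = f"Choose from {spoken}."
        # Offer more only when further items remain.
        if start + page_size < len(items):
            prompt += " Say more to hear other choices."
        prompts.append(prompt)
    return prompts


if __name__ == "__main__":
    thirty_items = [f"item {n}" for n in range(1, 31)]
    for prompt in voice_pages(thirty_items):
        print(prompt)
```

The underlying list of 30 items is untouched. Only the way it is presented changes, which is the point of the first bullet above.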
Still, many applications can be voice enabled. What you have to do as
an application designer is to decide what kind of information to provide,
how much information to provide, and how and when to present it to the
user.
CONCLUSION
VoiceXML is an open standard that utilizes common programming
methodologies to enable voice access to critical e-business applications.
This saves companies money and speeds market adoption because mainstream
developers can more easily build voice applications using VoiceXML than
using proprietary software and arcane software languages.
Middleware infrastructure products built to utilize VoiceXML can very
quickly provide a new access method for new and existing e-business
applications. Giving a “voice” to these business applications allows
customers to access vital information more quickly and easily, thus
enabling greater profitability for the companies that use them.
Kimberlee Kemble is the manager of Technical Marketing for IBM Voice
Systems. For more information, visit www.ibm.com.
[ Return To The January 2002 Table Of Contents ]