Feature Article
January 2002 


VoiceXML: Enabling Voice Access To Information

BY KIMBERLEE KEMBLE

Voice technology used to be something akin to Star Trek -- cool, but futuristic. Not anymore. With the convergence of voice, Web, and telephony, and with improvements in processing power and more sophisticated algorithms, voice technology is already starting to add real value to companies’ IT infrastructures. New speech-enabled applications are hitting the market as businesses realize that voice is the most natural way to access information from the Internet, mobile phones, car dashboards, or handheld organizers. Take call centers.

Voice technology is a natural way to improve service while also reducing personnel costs. It allows companies to use automated speech recognition and text-to-speech audio to serve customers over the phone, 24/7, without subjecting them to hold times or to older systems that require people to press buttons and respond to rigidly structured menus. This means cost savings and increased customer satisfaction.

Voice is poised for growth. The Kelsey Group predicts that the voice business will reach $41 billion by 2005. Key to the adoption of voice is VoiceXML: an open standard that gives regular Web developers a common, easy way to build voice applications. With VoiceXML, developers no longer need as much of the specialized voice expertise that has for so long complicated the creation of these applications. VoiceXML promises to mean to voice what other open standards such as Java and XML have meant to the Internet, and will allow mainstream developers to build voice applications.

WHAT IS VoiceXML?
VoiceXML, or Voice eXtensible Markup Language, is an XML-based markup language for creating distributed voice applications, much as HTML is a markup language for creating distributed visual applications. It was originally driven by the VoiceXML Forum, an industry organization founded by AT&T, IBM, Lucent, and Motorola. Today, more than 500 companies support VoiceXML and use it to develop applications. The specification has reached Release 2.0, and its continued development is being managed by the W3C’s Voice Browser Working Group.
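To give a feel for the markup, here is a minimal sketch in Python that holds a small VoiceXML menu as a string and reads back the choices it offers. The menu id, the prompt wording, and the target file names (checking.vxml, savings.vxml) are assumptions made up for illustration; only the element names and the namespace come from the VoiceXML 2.0 specification.

```python
import xml.etree.ElementTree as ET

# Namespace used by VoiceXML 2.0 documents.
VXML_NS = "http://www.w3.org/2001/vxml"

# Illustrative document: the menu id, prompt wording, and target file
# names (checking.vxml, savings.vxml) are assumptions for this sketch.
MENU_DOCUMENT = """<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu id="main">
    <prompt>Say checking account or savings account.</prompt>
    <choice next="checking.vxml">checking account</choice>
    <choice next="savings.vxml">savings account</choice>
  </menu>
</vxml>
"""


def list_choices(document):
    """Return a mapping of spoken phrase -> document to fetch next."""
    root = ET.fromstring(document)
    choices = {}
    for choice in root.iter("{%s}choice" % VXML_NS):
        choices[choice.text.strip()] = choice.get("next")
    return choices


if __name__ == "__main__":
    for phrase, target in list_choices(MENU_DOCUMENT).items():
        print(phrase, "->", target)
```

The structure mirrors an HTML page of hyperlinks: each choice pairs something the caller can say with the document to load next.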

VoiceXML isn’t limited to voice-enabling Web applications. It is a standard for voice-enabling any type of e-business application to which you want to provide voice access. By utilizing the same networking infrastructure, HTTP communications, and markup language programming model, VoiceXML leverages not only an enterprise’s often very significant investment in system resources, but also the skills of many of its developers and administrators. With VoiceXML, you can quickly and easily provide speech access to your applications.

This allows for anytime, anywhere access. Customers, employees, and business partners can access information and complete transactions on the move, without being tied to phone lines or the Internet.

THE EXTENDED WEB WORLD
VoiceXML is designed to extend the existing Web environment by providing another way of accessing information and services. With VoiceXML, you use your voice and a telephone to access information instead of a computer and a mouse.

There are many similarities between the visual (HTML) Web world and the audio (VoiceXML) Web world. For example, in the visual world, you use a Web browser to access the Web; in the VoiceXML world, you use a VoiceXML browser. Web browsers present information to the user through HTML; voice browsers present information to the user through VoiceXML.

And these similarities are by design. The primary goal of VoiceXML is to bring the power of Web development and content delivery to voice applications. It was designed to provide a way for Web developers to use a familiar markup style and existing Web server-side logic to deliver voice content over the Internet. If you know HTML or WML, VoiceXML is going to look very familiar to you.

THE VoiceXML BROWSER
In the world of VoiceXML, you interact with your application over the phone using a VoiceXML browser. The VoiceXML browser is analogous to a graphical Web browser (such as Netscape Communicator or Microsoft Internet Explorer). It is the way you interact with a Web server using your voice and a telephone. Instead of rendering and interpreting HTML, as a graphical browser does, the VoiceXML browser renders and interprets VoiceXML. Instead of clicking a mouse and using your keyboard, you use your voice and a telephone (and even the phone keypad) to access information and services.

One of the primary functions of the VoiceXML browser is to fetch VoiceXML documents from the Web server (just as a graphical Web browser fetches HTML documents). The request to fetch a document can be generated either by the interpretation of a VoiceXML document, or in response to an external event. The VoiceXML browser uses HTTP over a LAN or the Internet to fetch the documents (the very same HTTP requests that are used by the graphical Web browser).
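As a rough illustration of that fetch, the Python sketch below builds, but does not send, the kind of HTTP GET request a voice browser might issue. The URL is a placeholder, and the Accept header value is an assumption about what a particular server would expect; nothing here describes the behavior of a specific VoiceXML browser.

```python
import urllib.request

# Assumed media type for VoiceXML documents; the server in this sketch is
# hypothetical and may ignore or not recognize it.
VOICEXML_MEDIA_TYPE = "application/voicexml+xml"


def build_fetch_request(url):
    """Build (but do not send) the GET request a voice browser might issue."""
    return urllib.request.Request(url, headers={"Accept": VOICEXML_MEDIA_TYPE})


# Placeholder URL for illustration only.
request = build_fetch_request("http://example.com/voice/main.vxml")
print(request.get_method(), request.full_url)
```

Apart from the content type being asked for, this is the same request a graphical browser would make for an HTML page.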

The VoiceXML browser interprets and renders the VoiceXML document. It manages the dialog between the application and the user by playing audio prompts, accepting user inputs, and acting on those inputs. The action might involve jumping to a new dialog, fetching a new document, or submitting user input to the Web server for processing.
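The prompt, input, and action cycle can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any real VoiceXML browser is implemented: the dialog names, the prompts, and the rule that unrecognized input causes a reprompt are all invented for the example, and the caller's speech is simulated as a list of already-recognized phrases.

```python
# Hypothetical dialog table: each dialog has a prompt and maps what the
# caller says to the next dialog. Names and wording are illustrative only.
DIALOGS = {
    "main": {
        "prompt": "Say checking account or savings account.",
        "choices": {"checking account": "checking", "savings account": "savings"},
    },
    "checking": {"prompt": "You chose your checking account.", "choices": {}},
    "savings": {"prompt": "You chose your savings account.", "choices": {}},
}


def run_dialog(utterances, start="main"):
    """Walk the dialogs and return the prompts that would be played, in order."""
    played = []
    current = start
    inputs = iter(utterances)
    while True:
        dialog = DIALOGS[current]
        played.append(dialog["prompt"])      # play the audio prompt
        if not dialog["choices"]:
            return played                    # nothing left to ask
        heard = next(inputs, None)           # accept user input
        if heard is None:
            return played                    # caller stopped responding
        # Act on the input: jump to a new dialog, or stay put and reprompt
        # when the phrase is not one of the expected choices.
        current = dialog["choices"].get(heard.strip().lower(), current)


if __name__ == "__main__":
    for line in run_dialog(["pizza", "Checking Account"]):
        print(line)
```

In this run the caller first says something the menu does not expect, so the main prompt plays twice before the dialog moves on.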

ARCHITECTURE
Let’s take a look at how VoiceXML and the VoiceXML browser fit into the current Web environment. We’re all very familiar with the Web as it works today. You use a graphical Web browser (such as Netscape Communicator or Internet Explorer), which renders and interprets HTML to present information to the user (text, graphics, audio, hyperlinks, etc.). When the user makes a selection (for example, a click on a hyperlink), the graphical Web browser sends an HTTP request to the Web server (in this case, to retrieve another page). The Web server responds by locating the new page and returns HTML to the browser to present the new page to the user. The Web server may also have to interact with a back-end infrastructure (database, servlets, etc.) to obtain and return the requested information.

The VoiceXML browser extends this paradigm by adding a telephone and a voice server to the Web environment. For the purposes of this article, a voice server is an abstraction. It is an entity that contains the VoiceXML browser, the speech recognition software, and the text-to-speech software. Providing voice access to information requires underlying voice technologies, not just a programming language (VoiceXML). These technologies are speech recognition and speech synthesis.

Speech recognition is a software component that translates spoken input into text. An application can then do something with that text. For example, if a caller were to say “checking account,” the application could retrieve the caller’s current checking account balance and tell her what it is.

On the other hand, speech synthesis (or text-to-speech, as it is more commonly called) is a voice technology that converts text into spoken output. In the previous example, the application could “read” the checking account balance to the caller by using text-to-speech.
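The checking account example can be sketched as a small Python function that sits between the two technologies: it takes the text a recognizer produced and returns the sentence to hand to text-to-speech. The recognizer and synthesizer themselves are not modeled here, and the account names and balances are made-up sample data.

```python
# Made-up balances, stored in cents to avoid floating-point rounding.
BALANCES_IN_CENTS = {"checking account": 125075, "savings account": 830000}


def spoken_reply(recognized_text, balances):
    """Turn recognized text into the sentence a text-to-speech engine would read."""
    account = recognized_text.strip().lower()
    if account not in balances:
        return "Sorry, I did not understand. Please say checking account or savings account."
    dollars, cents = divmod(balances[account], 100)
    return "Your %s balance is %d dollars and %d cents." % (account, dollars, cents)


if __name__ == "__main__":
    print(spoken_reply("checking account", BALANCES_IN_CENTS))
```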

VoiceXML introduces a new way of presenting the same information and services to the user. Now, instead of presenting the information visually (through HTML, graphics, and text), the VoiceXML browser presents the information to the caller using VoiceXML. When the caller says something (which is the voice equivalent of clicking on something to make a selection), the VoiceXML browser sends an HTTP request to the Web server, which may access the very same back-end infrastructure, to return information -- this time, in VoiceXML -- to the user.

When the VoiceXML browser is started, it sends an HTTP request over the LAN or Internet to request an initial VoiceXML document from the Web server. The requested VoiceXML document can contain static information, or it can be generated dynamically from data stored in an enterprise database using the same type of server-side logic (CGI scripts, JavaBeans, ASPs, JSPs, Java servlets, etc.) that you use to generate dynamic HTML documents.
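Dynamic generation might look something like the following Python sketch, which plays the role of the server-side logic. It is only an illustration: the ORDERS dictionary stands in for an enterprise database, and the order numbers, status wording, and form id are hypothetical.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"

# Stand-in for an enterprise database; the order data is invented.
ORDERS = {"A1001": "shipped", "A1002": "being packed"}


def order_status_document(order_id):
    """Generate a VoiceXML document that reads one order's status aloud."""
    status = ORDERS.get(order_id, "not found")
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "status"})
    block = ET.SubElement(form, "block")
    prompt = ET.SubElement(block, "prompt")
    # ElementTree escapes special characters in the text for us.
    prompt.text = "Order %s is %s." % (order_id, status)
    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(order_status_document("A1001"))
```

The same lookup could just as easily feed an HTML template; only the markup wrapped around the data differs.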

The VoiceXML browser interprets and renders the document. Based on the user’s input, the VoiceXML browser may request a new VoiceXML document from the Web server, or may send data back to the Web server to update information in the back-end database. The important thing is that the mechanism for accessing your back-end enterprise data does not need to change; your VoiceXML applications can access the same information from your enterprise servers that your HTML applications do.

When thinking about voice-enabling the Web, there are two points to keep in mind:

  1. Providing voice access to your application doesn’t mean throwing away the graphics from a traditional visual interface and reading the rest of the information aloud. That probably wouldn’t be very useful. What it does mean, though, is providing a different way of accessing the same information and services. Even though you are providing the same information and services as you would with a graphical interface, you probably need to change the way you present this information. For example, you may be able to show a list box with 30 items in it in a visual application, but you probably don’t want to read 30 items to the caller over the phone. The point is, you are changing the presentation of the information, not the information or how it’s generated (by the Web server and the back-end).

  2. Voice isn’t always the best user interface for an application. There are some applications that are just better suited to a visual medium. And that’s OK. For example, it may be quite acceptable to purchase a music CD or a book over the phone, but you might not want to purchase a $300 cashmere sweater without being able to see it and feel it first. However, a customer might find it very useful to check the status of their order over the phone.
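The 30-item list from the first point shows what changing the presentation can mean in practice. One possible approach, sketched below in Python, is to read the items a few at a time and let the caller ask for more. The page size of five and the "Say next" wording are arbitrary assumptions for this sketch, not recommendations from any guideline.

```python
def voice_pages(items, page_size=5):
    """Split a long list into short groups that are reasonable to read aloud."""
    return [items[i:i + page_size] for i in range(0, len(items), page_size)]


def page_prompt(page, page_number, page_count):
    """Build the text to speak for one group, offering more if any remain."""
    listing = ", ".join(page)
    more = " Say next to hear more." if page_number < page_count else ""
    return listing + "." + more


if __name__ == "__main__":
    items = ["item %d" % n for n in range(1, 31)]   # the 30-item list box
    pages = voice_pages(items)
    print(page_prompt(pages[0], 1, len(pages)))
```

The underlying data is untouched; the 30 items are the same ones a visual list box would show, just delivered in smaller pieces.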

Still, many applications can be voice enabled. What you have to do as an application designer is to decide what kind of information to provide, how much information to provide, and how and when to present it to the user.

CONCLUSION
VoiceXML is an open standard that utilizes common programming methodologies to enable voice access to critical e-business applications. This saves companies money and speeds market adoption because mainstream developers can more easily build voice applications using VoiceXML than using proprietary software and arcane software languages.

Middleware infrastructure products built to utilize VoiceXML can quickly provide a new way to access new and existing e-business applications. Giving a “voice” to these business applications allows customers to access vital information more quickly and easily, thus enabling greater profitability for the companies that use them.

Kimberlee Kemble is the manager of Technical Marketing for IBM Voice Systems. For more information, visit www.ibm.com.
