Programming Approaches for Speech-Enabled Applications
By Robert Delwood
Introduction
With the advent and proliferation of more powerful computers, speech
technology has become more affordable and accessible. No longer
restricted to specialized or esoteric applications, speech technology
can now reach business and home users in mainstream applications such
as word processors, spreadsheets, e-mail packages, and games. However,
its introduction as an interface requires consideration during the
feature design and software development stages.
Each input device has its strengths and weaknesses, and those
characteristics need to be optimized. Take the case of the keyboard
and mouse. The keyboard predates the mouse and is still capable of
performing many of the same functions, however awkwardly. The mouse is
more adept at tasks such as pointing, selecting, or dragging, but is
inefficient for textual input. For a while, the two devices seemed to
compete. Over the last two decades, they have evolved together, and
application designers have used each device's unique characteristics
to make better products.
The mouse-versus-keyboard example is instructive. As a new technology,
speech recognition has to find its niche and its role in the user
interface. Speech technology is designed for voice input and output,
and it does both very well. When appropriate, designers are encouraged
to use speech in their applications. Clearly, some uses for speech are
better, or at least more obvious, than others. Word processors and
e-mail applications can readily take advantage of both dictation and
text-to-speech capabilities. Games may be better suited to using
speech recognition for command and control features. In contrast, Web
browsers require additional design consideration if speech enabling is
contemplated. For instance, Web pages have fields where the user can
enter information, but the fields are often arranged in a layout that
is visually pleasing rather than systematic. Pages usually have a URL
line, but they can also have search boxes, comment areas, forms, and
check boxes, as well as links. Deciding how the user assigns speech to
a specific box or area can be awkward. Likewise, reading information
back from a Web page can be equally awkward for the same reasons. For
interfaces to be successful, designers must present and use them in a
consistent and straightforward way.
Speech input modes
There are two basic speech input modes. The first is command and
control, which uses speech to issue commands. Typically, the commands
are brief phrases or short sentences. A good example is using spoken
commands as shortcuts to menus and menu items. The user needs only to
say "file open," for instance, to access the Open File dialog box.
Command and control is the simpler of the two speech input modes in
terms of programming, in part because the range of recognizable words
is limited. Using the menu example, the word list may be only as long
as the number of menu items the user is allowed to speak. Existing
applications may be retrofitted for command and control with a few
lines of code and perhaps the addition of a word list.
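
As a rough illustration, the sketch below shows how little SAPI 5.0
C++ code such a retrofit can take. The grammar file name "menu.xml" is
an assumption for the sketch, and a real application would also
register for recognition notifications and check return codes.

    #include <atlbase.h>   // CComPtr
    #include <sphelper.h>  // SAPI 5.0 helpers (includes sapi.h)

    // A minimal command and control setup. Assumes a grammar file
    // named "menu.xml" (hypothetical) listing the menu phrases;
    // error handling is omitted for brevity.
    void StartCommandAndControl()
    {
        ::CoInitialize(NULL);

        CComPtr<ISpRecognizer>  cpRecognizer;
        CComPtr<ISpRecoContext> cpContext;
        CComPtr<ISpRecoGrammar> cpGrammar;

        // Use the shared recognizer so other speech applications
        // can coexist with this one.
        cpRecognizer.CoCreateInstance(CLSID_SpSharedRecognizer);
        cpRecognizer->CreateRecoContext(&cpContext);

        // Load the small, fixed word list and activate its rules.
        cpContext->CreateGrammar(1, &cpGrammar);
        cpGrammar->LoadCmdFromFile(L"menu.xml", SPLO_STATIC);
        cpGrammar->SetRuleState(NULL, NULL, SPRS_ACTIVE);
    }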
Dictation is the second speech input mode. In this mode, users may
speak freely to the computer, which translates the speech into text.
In contrast to command and control, the word list is greatly expanded,
by definition, to the size of a dictionary. Dictation makes no attempt
to interpret the words as commands. Programming for dictation is more
complex than for command and control, and this applies not only to new
applications but to retrofitting existing ones as well.
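
At the API level, however, activating dictation itself is brief; the
complexity lies in handling and correcting the results, not in the
setup. A minimal sketch, assuming the same SAPI 5.0 recognition
context (cpContext) created in the earlier sketch:

    // Given an existing ISpRecoContext, add a dictation grammar.
    CComPtr<ISpRecoGrammar> cpDictation;
    cpContext->CreateGrammar(2, &cpDictation);

    // Load SAPI's general dictation language model in place of a
    // fixed word list, then turn dictation on.
    cpDictation->LoadDictation(NULL, SPLO_STATIC);
    cpDictation->SetDictationState(SPRS_ACTIVE);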
Dictation becomes practical in several situations. Users may have a
large amount of text to add in one session; they may dictate letters
or even write books. In this respect, dictation approaches the ideal
of a speech system, and recognition accuracy improves with user
experience and with microphone voice training. The user may not be a
proficient typist, in which case dictation offers an efficient way to
enter information. Alternatively, the user may not be able to operate
a keyboard or mouse at all because of physical limitations.
It is possible to combine these options. As an example, a page layout
or CAD application depends on the mouse to create a box and place it
correctly and accurately within the design. The software team may
decide to add a speech feature to access the dialog box used to
control the dimensions of a box. This would be a command and control
function, since users would access specific menu items and would use
only a few words to do so. Continuing the dialog box example, the user
would speak the command "dimensions box," then the numeric dimensions
of the object, and then say "okay" to accept the box. In this way,
speech complements the mouse: the user accomplishes one task (placing
and sizing a box) without having to interrupt mouse positioning. The
entire operation is completed more quickly with speech, and users do
not have to move the mouse or reposition their hands. This combination
of command and control with the mouse also keeps the user interface
consistent. The user is simply accessing the application's existing
menu items and using speech as a shortcut to them. Adding speech does
not introduce any new or hidden features, and the user may still
perform the task manually. By combining the different input methods,
users can concentrate more fully on their task because they spend less
time and effort on the mechanics of making the change itself. The
application requires only minor code changes to accommodate speech.
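
As a sketch of how the "dimensions box" command might be registered,
SAPI 5.0 grammars can also be built dynamically in code rather than
loaded from a file. The rule name below is hypothetical, cpGrammar is
an ISpRecoGrammar from a setup like the earlier sketch, and error
handling is omitted.

    // Build the "dimensions box" command rule at run time.
    SPSTATEHANDLE hRule;
    cpGrammar->GetRule(L"DimensionsBox", 0,
                       SPRAF_TopLevel | SPRAF_Active, TRUE, &hRule);

    // One transition per word; the space separator splits the phrase.
    cpGrammar->AddWordTransition(hRule, NULL, L"dimensions box", L" ",
                                 SPWT_LEXICAL, 1.0f, NULL);

    cpGrammar->Commit(0);
    cpGrammar->SetRuleState(L"DimensionsBox", NULL, SPRS_ACTIVE);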
Using Speech Interfaces Effectively
For desktop systems, it is important to remember that some tasks are
easier with speech and others are not. Early speech applications tried
to replace the interface entirely, or at least a large portion of it,
with a voice system. Many of these systems failed because they were
too complex or counterintuitive. Recent applications have been more
successful using speech to complement existing interfaces.
Pick the level of speech interaction that is right for your project.
In reality, speech-enabled applications fall along a spectrum, with
conventional, keyboard-only interaction at one end and
science-fiction-level interaction at the other. At that far end, the
captain of the spaceship has only to speak a command, and the computer
interprets it instantaneously and with perfect accuracy, without
regard to other meanings, confirmation, inflection, accent, or
background noise. Perhaps that is the goal of speech recognition in
general, but in designing applications, consider the level of
interaction your users actually need. For desktop applications, the
keyboard is still an inherent part of the computer system. Asking
users to enter information from the keyboard is not a new concept; in
fact, it is the paradigm speech designers must compete against.
Therefore, it may be acceptable to let users enter some text and use
speech in supplementary roles such as command and control or
navigation. Further along the spectrum, it may be better to reverse
the roles and use speech as the primary input method, reserving the
keyboard or mouse to supplement voice operations; if a word is not
readily recognized, the user can type the correct word. At the far end
of the spectrum, voice is the only practical input method. Speech
applications intended for automobiles cannot rely on the driver to
push a button manually. In the same way, smart phone Internet devices
will have no keyboard and will rely exclusively on speech for all
aspects of their operation.
Speech often works best when it is integrated with other user
interface methods. Speech can complement other input methods; it does
not need to compete against them. For example, action games require
quick responses, and moving a hand from a joystick or keyboard is
often detrimental. Voice commands are useful for some options, such as
firing weapons, while still allowing the user to operate the keyboard
for everything else. In other applications, such as spreadsheets or
chat rooms, speech might let users enter textual information quickly
while the keyboard is used to navigate through the document or
application.
Do not force a fit. If the proposed use of speech is not appropriate,
rethink the approach. The user experience requires a logical and
intuitive interface. Making a task more complex just to accommodate
speech, or using speech where it simply does not make sense, is bound
to confuse users and detract from the application. This includes
forcing a voice equivalent for every other input method. Unless there
is a compelling reason, leave out awkward voice interfaces.
Use speech to simplify complex or tedious sequences, not to complicate
them. Currently, applications must break tasks down into separate
steps, with entry generally restricted to one fact or piece of
information per field. For example, to order airplane tickets, Web
sites have separate boxes for the departure and arrival cities, date,
time, flight, airline, and so on. A natural language approach allows
users to speak a sentence and lets the application parse out the
information. In the travel example, the user can say, "I'd like to
book a flight from Seattle to Boston at one p.m. on the fourth and
come back on the morning of the fifteenth." One sentence conveys all
the information.
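
The parsing itself is outside the speech engine's scope. As a toy
sketch only, and not a real natural language parser, a recognized
sentence could be mined for its slots with a pattern match; the
utterance and pattern below are assumptions for illustration.

    #include <iostream>
    #include <regex>
    #include <string>

    int main()
    {
        // A recognized dictation result (hypothetical).
        std::string utterance = "I'd like to book a flight from Seattle "
                                "to Boston at one p.m. on the fourth";

        // A toy pattern that pulls out the departure city, arrival
        // city, and time. A production system would use a genuine
        // natural language parser, not a regular expression.
        std::regex pattern("from (\\w+) to (\\w+) at ([\\w. ]+?) on");
        std::smatch match;
        if (std::regex_search(utterance, match, pattern))
        {
            std::cout << "Departure: " << match[1] << "\n"
                      << "Arrival:   " << match[2] << "\n"
                      << "Time:      " << match[3] << "\n";
        }
        return 0;
    }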
Speech can also take advantage of information the user knows about but
that is not presently on the screen. In many cases, the screen
represents only a small part of the overall information. For example,
Web pages usually have material off the screen; the user knows that
there is a "submit" button available even though it is not visible. A
speech-enabled application may allow users to say "submit" rather than
having to scroll down the page to the actual button and click it. A
mapping application may let users say a city name to center it on the
screen, avoiding the time-consuming (and often disorienting) chore of
manually scrolling around a potentially large area.
If the user does not or cannot use a mouse or keyboard, speech
may be the most effective option available. Visually impaired
users may not be able to see the screen to scroll, for example.
Disabled users may not be able to manually operate a mouse or
keyboard. In both cases (and certainly these are not the only
ones) speech may be the best, if not the only, method to operate a
computer.
Consider the user's environment. For speech recognition to work
accurately, the environment must be suitable. A relatively quiet one,
such as a business office, is optimal. SAPI 5.0 recognizes background
noise and filters it out. Even occasional loud noises will not
significantly change the accuracy, although frequent noises will slow
the processing rate. Therefore, a perfectly quiet location yields only
marginally better recognition results than a normal business office.
By contrast, a speech-enabled application in an airport or factory may
yield inferior results. Also, since the user will be speaking aloud,
there is an issue of privacy: the user may disturb others nearby, or
the information being spoken may be confidential.
Adding speech to applications
Adding speech to applications is not a difficult task. As mentioned
earlier, many applications may be retrofitted for speech; that is,
speech may be added to existing packages. These changes need not be
extensive and, in some cases, require no modifications to existing
code. In general, there are three approaches to adding speech: without
code changes, with code changes, and from the ground up.
The least intrusive method is without code changes: legacy software
incorporates speech without any changes to its code. This approach
takes advantage of external hooks already present in the software,
usually intended for COM, automation, or keyboard interfaces. However,
an external application or executable is needed. This executable has
the responsibility of handling speech and exporting the features in
the appropriate format for the hooks. As an example, the Microsoft
SAPI 5.0 SDK demonstrates how to add speech to Age of Empires II (AoE
II). The AoE II program itself does not need to be modified; expecting
users to run a patch would be prohibitive. Rather, the demonstration
uses a separate executable, AOESAPI.exe. After handling all the speech
and recognition, it sends the output to AoE II with the Win32 call
SendInput(), simulating a keystroke. In this way, developers can
create speech interfaces for many games and other applications that
use a keyboard.
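
The keystroke side of such a bridge is a standard Win32 call. A
minimal sketch, assuming the recognized command has already been
mapped to a single virtual-key code:

    #include <windows.h>

    // After a phrase is recognized, forward it to the target
    // application as a simulated keystroke: a key-down event
    // followed by a key-up event.
    void SendKeystroke(WORD vk)   // vk: the hotkey's virtual-key code
    {
        INPUT inputs[2] = {};

        inputs[0].type = INPUT_KEYBOARD;            // key down
        inputs[0].ki.wVk = vk;

        inputs[1].type = INPUT_KEYBOARD;            // key up
        inputs[1].ki.wVk = vk;
        inputs[1].ki.dwFlags = KEYEVENTF_KEYUP;

        SendInput(2, inputs, sizeof(INPUT));
    }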
Existing applications may also be modified directly to accept speech.
This requires changes to the application's code base and is therefore
more complex. Before doing this, look for existing commands where
speech adds value. This may be as direct as adding speech commands
that access menus and menu items, which adds only a small amount of
code. The interface also remains the same and does not risk confusing
the user.
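
One way to wire this up, sketched below, is to route the recognized
phrase to the command identifier the menu item already uses. The
window handle, phrase, and ID_FILE_OPEN value are assumptions for
illustration.

    #include <windows.h>
    #include <string>

    // Hypothetical: whatever identifier the application's
    // File > Open menu item already uses.
    const WORD ID_FILE_OPEN = 0xE101;

    // Reuse the normal menu handling: post exactly the message the
    // application receives when the user clicks File > Open.
    void DispatchSpeechCommand(HWND hwndApp, const std::wstring& phrase)
    {
        if (phrase == L"file open")
        {
            PostMessage(hwndApp, WM_COMMAND,
                        MAKEWPARAM(ID_FILE_OPEN, 0), 0);
        }
    }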
Finally, applications can be created from the ground up. This is the
most radical approach but also the most effective for incorporating
the newest speech technology. Here, designers attempt radical, or at
least vastly different, applications from those currently available.
In the case of existing paradigms (word processors, for example),
designers may want to incorporate speech in such fundamental or
integral ways that modifying existing code is not an option. New kinds
of applications, including voice telephony, smart phone Web browsers,
handheld computers, and other new devices, will also require a
ground-up approach.