The Acorn Computer User WWW Server

Speech Recognition: A Performance Boost Brings It To The Mainstream

The Current State of the Market

Until recently, speech recognition was largely an immature technology in search of a market. The first systems had to be trained by the individuals who would use them, and were capable of identifying only a limited vocabulary of words. From its humble beginnings recognizing tens of words for command-and-control applications, the technology has evolved to enable the understanding of thousands of words in full-fledged speech dictation systems. Today, speech recognition is making major strides in its ability to provide information to consumers, dictation capabilities to physicians and lawyers, and reduced costs for large companies with extensive customer service units.

Market research firms predict that the industry, which topped $340 million in 1994, is on its way to $1 billion by the end of the century. An estimated 30 percent of revenue comes from telephone applications, while the balance comes from speech-to-text products, data-entry applications, consumer-market applications and speech-verification products. Speech recognition is beginning to arrive on the desktop with increasing frequency. As with most emerging technologies, the market and the products are diverse and the landscape can be confusing.

Current Technology and Applications

Telephony and speech-to-text applications are in the greatest demand. At the high end of telephony applications are products being offered by Nynex, the baby Bells and other telephone carriers. At a lower level are computer telephony applications such as those based on interactive speech response. The speech-to-text arena includes what are often referred to as dictation systems: sophisticated systems that can recognize more than 50,000 words. Most desktop applications fall under either command-and-control or data-entry product types. Command-and-control products are basically speech-activated "hot keys," and data-entry products usually generate forms but are also based on command-and-control. The technology is basically the same for all these product types, but it is a technology that lends itself to great product diversity.

Discrete or Continuous

Further, speech recognition systems can either be discrete, requiring the user to pause between each utterance ("dial [pause] 5 [pause]...") or continuous, allowing the user to talk conversationally: "dial 555-1234." Systems based on continuous recognition are more natural and efficient for users than discrete systems, but the speech interpretation and analysis is more complex. Whether a system is discrete or continuous, there are two types of recognition: speaker-dependent and speaker-independent.

Speaker Dependent or Independent

Speaker-dependent systems allow the use of specific spoken phrases that are unique to one individual. The user repeats each phrase two or three times to create a speech model.
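The enrollment step described above can be illustrated with a minimal sketch. It assumes each utterance has already been reduced to a short, fixed-length list of numbers; the feature values, phrase names and helper functions below are hypothetical and are not taken from any product mentioned here. Real recognizers use far richer acoustic models, so this shows only the idea of averaging a few repetitions into a per-user model and matching new speech against it.

```python
import math

def build_template(repetitions):
    """Average several equal-length feature vectors into one template."""
    count = len(repetitions)
    return [sum(values) / count for values in zip(*repetitions)]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(features, templates):
    """Return the enrolled phrase whose template is closest to the input."""
    return min(templates, key=lambda phrase: distance(features, templates[phrase]))

# Hypothetical enrollment data: one user repeats each phrase two or three times.
# The numbers are invented stand-ins for real acoustic features.
enrollment = {
    "dial": [[1.0, 0.2, 0.1], [0.9, 0.3, 0.1], [1.1, 0.1, 0.1]],
    "hang up": [[0.1, 0.9, 0.8], [0.2, 1.0, 0.7]],
}

# One template per phrase, specific to the user who recorded the samples.
templates = {phrase: build_template(reps) for phrase, reps in enrollment.items()}

print(recognize([0.95, 0.25, 0.1], templates))
```

Because the templates come from a single user's recordings, a different speaker would need to repeat the enrollment, which is the trade-off the speaker-dependent approach accepts.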

As the name implies, speaker-independent systems understand commands regardless of the speaker: a system can recognize specific words without prior training. Word models are created from samples of a broad range of people saying the word, or else developed phonetically using linguistic methods. Word-based systems offer the advantage of higher recognition accuracy but are less flexible than speaker-dependent systems, since many samples must be collected to build models every time new words are added to the vocabulary.

Future Directions

Several improvements in recognition technology have fueled growth in speech systems. Better processing and filtering techniques improve the quality of the speech signal, making recognition more reliable. Systems are becoming easier to use and more natural with the incorporation of full hands-free speech control and other speech technologies.

There is no disputing that speech recognition and telephony are coming together, which makes sense given lower pricing and the technology's virtue as a more natural interface. But it's early in the game and many vendors are still working out the applications. At its best, speech recognition combined with telephony will be the equivalent of having a personal assistant, especially useful in the mobile workforce. The proliferation of phones will likely accelerate the use of speech recognition in customer service applications. Vendors also peg the millions of cellular phone users as a hot target market: speech recognition would enable them to keep their hands on the wheel instead of on a handset.

The cost of incorporating speech recognition in PCs is drastically lower than in the past, which could bring speech-enabled systems closer to corporate usage. Most audio cards using speech recognition provide users with a low-cost subset of full-blown large-vocabulary recognition systems, which usually require a separate add-in card and cost at least $500. PC companies are also adopting speech recognition as the technology becomes easier to implement on standard sound cards.

What the End Users Say

A survey at Spring Comdex '95 revealed that nearly two-thirds of respondents believe that speech recognition will be very valuable to computing applications over the next three years, while another 27% see it as changing the nature of computing. Further, 83% of those surveyed said that speech recognition will benefit the average computer user within five years. Half of those respondents said that two years is a more likely estimate.

When Link Resources conducted an end-user telephone survey in 1994, over 70% of mobile executives and professionals said that speech input would be useful to extremely useful, and 75% of mobile industrial workers indicated the same.

The Final Hurdles

In a June 1995 review of five dictation packages, The Seybold Report on Desktop Publishing delineated a list of features that could represent the last hurdles to speech recognition becoming mainstream: continuous recognition, both accuracy and "smarts," integration with a word processor, speech commands for other applications, no hands required, no training period, easier installation with no new hardware, less stringent hardware requirements, and affordability.

Further, speech recognition has been limited to desktop systems. Performance estimates for recognizing 5,000 words and phrases indicate the need for more than 100 Dhrystone 2.1 MIPS. This type of performance has traditionally been added through the use of digital signal processors (DSPs) and associated controller circuitry, which add to board space and total system cost. Software-based systems such as Dragon Dictate address manufacturing cost considerations but require that the processor assume more DSP functions, raising again the need for increased performance. Since speech recognition has been identified as a key requirement for mobile workers, being tied to the desktop represents a significant inhibitor to mainstream adoption.

Digital Semiconductor Offers Enabling Technology

The SA-110, Digital's first StrongARM family member, provides performance matched only by desktop processors and has power dissipation levels well below those required for portable battery-operated handheld products. The SA-110 is offered in two speed variants for low power handheld systems. The first operates at 100MHz with an estimated performance rating of 115 Dhrystone 2.1 MIPS and power dissipation of less than 300 mW. A 160MHz version yields 185 MIPS at less than 450 mW of power consumption. While clock rate is a performance enabler, other StrongARM architectural features provide significant performance improvements, making the SA-110 an ideal speech recognition enabler for handheld products.
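To put the quoted figures side by side, the short sketch below derives a few ratios from them. The inputs are the numbers stated in this release; the derived values are simple arithmetic, not vendor specifications. Because power is quoted as "less than" a ceiling, MIPS per watt is a lower bound, and because the 5,000-word estimate is "more than 100" Dhrystone 2.1 MIPS, the headroom is measured against that floor only. The function and variable names are illustrative.

```python
# Estimate quoted earlier for recognizing 5,000 words and phrases
# ("more than 100" Dhrystone 2.1 MIPS, so this is a floor, not a target).
REQUIRED_MIPS = 100.0

# Quoted figures: (clock in MHz, Dhrystone 2.1 MIPS, power ceiling in mW)
VARIANTS = {
    "SA-110 @ 100MHz": (100.0, 115.0, 300.0),
    "SA-110 @ 160MHz": (160.0, 185.0, 450.0),
}

def summarize(clock_mhz, mips, max_power_mw):
    """Derive simple ratios for one speed variant from the quoted figures."""
    return {
        "mips_per_mhz": mips / clock_mhz,
        # Power is an upper bound, so MIPS per watt is a lower bound.
        "min_mips_per_watt": mips / (max_power_mw / 1000.0),
        # Margin over the 100 MIPS floor, not over the true requirement.
        "headroom_mips": mips - REQUIRED_MIPS,
    }

for name, figures in VARIANTS.items():
    s = summarize(*figures)
    print(
        f"{name}: {s['mips_per_mhz']:.2f} MIPS/MHz, "
        f"at least {s['min_mips_per_watt']:.0f} MIPS/W, "
        f"{s['headroom_mips']:+.0f} MIPS relative to the 100 MIPS floor"
    )
```

On these numbers both variants clear the 100 MIPS floor while staying under half a watt, which is the basis for the handheld claim made here.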

The 0.35-micron CMOS process allows the cost-effective addition of larger caches, enhancing processor performance without burdening the system designer with a requirement for expensive external memory. This reduces traffic on the bus, which improves performance and minimizes power consumption. Further, the SA-110's high performance cache architecture and multiplier substantially improve the execution time of software applications.

The combination of clock rate and architectural features provides performance improvements enabling compute-intensive applications like Dragon Systems' industry-leading desktop speech recognition to move to the portable environment. Digital has already ported its DECtalk speech synthesis application to the ARM architecture, giving handheld computing access to hands-free text-to-speech capability. Together with speech recognition product refinements, the SA-110's higher performance can enable more rapid acceptance of speech recognition as a standard technology for handheld computing.

Note:

Digital, Digital Semiconductor, DECtalk and the Digital logo are all trademarks of Digital Equipment Corporation. IBM is a registered trademark of International Business Machines Corporation. Dragon Dictate is a trademark of Dragon Systems, Inc. ARM is a registered trademark, and StrongARM is a trademark of Advanced RISC Machines, Ltd.

The above is a press release from ARM Limited.
