| History | | | | computer input devices. In fact, people who used the |
| The first speech recognizer appeared in 1952 and | | | | keyboard a lot and developed RSI became an urgent |
| consisted of a device for the recognition of single | | | | early market for speech recognition. Speech |
| spoken digits. Another early device was the IBM | | | | recognition is used in deaf telephony, such as |
| Shoebox, exhibited at the 1962 Seattle World's | | | | voicemail to text, relay services, and captioned |
| Fair. | | | | telephone. Individuals with learning disabilities who |
| One of the most notable domains for the commercial | | | | have problems with thought-to-paper communication |
| application of speech recognition in the United States | | | | (essentially they think of an idea but it is processed |
| has been health care and in particular the work of | | | | incorrectly causing it to end up differently on paper) |
| the medical transcriptionist (MT). | | | | can benefit from the software. |
| According to industry experts, at its inception, | | | | |
| speech recognition (SR) was sold as a way to | | | | Further applications |
| completely eliminate transcription rather than make | | | | Automatic translation; |
| the transcription process more efficient, hence it was | | | | Automotive speech recognition (e.g., Ford Sync); |
| not accepted. It was also the case that SR at that | | | | Telematics (e.g. vehicle Navigation Systems); |
| time was often technically deficient. Additionally, to | | | | Court reporting (Realtime Voice Writing); |
| be used effectively, it required changes to the ways | | | | Hands-free computing: voice command recognition |
| physicians worked and documented clinical | | | | computer user interface; |
| encounters, which many, if not all, were reluctant to | | | | Home automation; |
| do. The biggest limitation to speech recognition | | | | Interactive voice response; |
| automating transcription, however, is seen as the | | | | Mobile telephony, including mobile email; |
| software. The nature of narrative dictation is highly | | | | Multimodal interaction; |
| interpretive and often requires judgment that may | | | | Pronunciation evaluation in computer-aided language |
| be provided by a real human but not yet by an | | | | learning applications; |
| automated system. Another limitation has been the | | | | Robotics; |
| extensive amount of time required by the user and/or | | | | Video games, with Tom Clancy's EndWar and Lifeline |
| system provider to train the software. | | | | as working examples; |
| A distinction in ASR is often made between "artificial | | | | Transcription (digital speech-to-text); |
| syntax systems" which are usually domain-specific | | | | Speech-to-text (transcription of speech into mobile |
| and "natural language processing" which is usually | | | | text messages); |
| language-specific. Each of these types of application | | | | Air Traffic Control Speech Recognition. |
| presents its own particular goals and challenges. | | | | Performance of speech recognition systems |
| Applications | | | | The performance of speech recognition systems is |
| Health care | | | | usually specified in terms of accuracy and speed. |
| In the health care domain, even in the wake of | | | | Accuracy is usually rated with the |
| improving speech recognition technologies, medical | | | | word error rate (WER), whereas speed |
| transcriptionists (MTs) have not yet become | | | | is measured with the real-time factor. |
| obsolete. Many experts in the field anticipate | | | | Other measures of accuracy include |
| that with increased use of speech recognition | | | | Single Word Error Rate (SWER) and Command |
| technology, the services provided may be | | | | Success Rate (CSR). |
| redistributed rather than replaced. Speech recognition | | | | Most speech recognition users would tend to agree |
| is used to enable deaf people to understand the | | | | that dictation machines can achieve very high |
| spoken word via speech-to-text conversion, which is | | | | performance in controlled conditions. There is some |
| very helpful. | | | | confusion, however, over the interchangeability of |
| Speech recognition can be implemented at the front end | | | | the terms "speech recognition" and "dictation". |
| or back end of the medical documentation process. | | | | Commercially available speaker-dependent dictation |
| Front-End SR is where the provider dictates into a | | | | systems usually require only a short period of training |
| speech-recognition engine, the recognized words are | | | | (sometimes also called "enrollment") and may |
| displayed right after they are spoken, and the | | | | successfully capture continuous speech with a large |
| dictator is responsible for editing and signing off on | | | | vocabulary at normal pace with a very high accuracy. |
| the document. It never goes through an MT/editor. | | | | Most commercial companies claim that recognition |
| Back-End SR or Deferred SR is where the provider | | | | software can achieve between 98% and 99% |
| dictates into a digital dictation system, and the voice | | | | accuracy if operated under optimal conditions. |
| is routed through a speech-recognition machine and | | | | "Optimal conditions" usually assume that users: have |
| the recognized draft document is routed along with | | | | speech characteristics which match the training |
| the original voice file to the MT/editor, who edits the | | | | data, can achieve proper speaker adaptation, and work |
| draft and finalizes the report. Deferred SR is | | | | in a clean noise environment (e.g. quiet office or |
| currently widely used in the industry. | | | | laboratory space). |
| Many Electronic Medical Records (EMR) applications | | | | This explains why some users, especially those |
| can be more effective and may be performed more | | | | whose speech is heavily accented, might achieve |
| easily when deployed in conjunction with a | | | | recognition rates much lower than expected. Speech |
| speech-recognition engine. Searches, queries, and | | | | recognition in video has become a popular search |
| form filling may all be faster to perform by voice | | | | technology used by several video search companies. |
| than by using a keyboard. | | | | Limited vocabulary systems, requiring no training, can |
| Military | | | | recognize a small number of words (for instance, the |
| High-performance fighter aircraft | | | | ten digits) as spoken by most speakers. Such |
| Substantial efforts have been devoted in the last | | | | systems are popular for routing incoming phone calls |
| decade to the test and evaluation of speech | | | | to their destinations in large organizations. |
| recognition in fighter aircraft. Of particular note are | | | | Both acoustic modeling and language modeling are |
| the U.S. program in speech recognition for the | | | | important parts of modern statistically based speech |
| Advanced Fighter Technology Integration (AFTI)/F-16 | | | | recognition algorithms. Hidden Markov models (HMMs) |
| aircraft (F-16 VISTA), the program in France on | | | | are widely used in many systems. Language modeling |
| installing speech recognition systems on Mirage | | | | has many other applications such as smart keyboard |
| aircraft, and programs in the UK dealing with a | | | | and document classification. |
| variety of aircraft platforms. In these programs, | | | | Hidden Markov model (HMM)-based speech |
| speech recognizers have been operated successfully | | | | recognition |
| in fighter aircraft with applications including: setting | | | | |
| radio frequencies, commanding an autopilot system, | | | | Modern general-purpose speech recognition systems |
| setting steer-point coordinates and weapons release | | | | are generally based on Hidden Markov Models. These |
| parameters, and controlling flight displays. Generally, | | | | are statistical models which output a sequence of |
| only very limited, constrained vocabularies have been | | | | symbols or quantities. One possible reason why HMMs |
| used successfully, and a major effort has been | | | | are used in speech recognition is that a speech signal |
| devoted to integration of the speech recognizer with | | | | could be viewed as a piecewise stationary signal or a |
| the avionics system. | | | | short-time stationary signal. That is, over a short |
| Some important conclusions from the work were as | | | | time scale (on the order of 10 milliseconds), speech |
| follows: | | | | can be approximated as a stationary process. |
| Speech recognition has definite potential for reducing | | | | Speech could thus be thought of as a Markov model |
| pilot workload, but this potential was not realized | | | | for many stochastic processes. |
| consistently. | | | | Another reason why HMMs are popular is that |
| Achievement of very high recognition accuracy (95% | | | | they can be trained automatically and are simple and |
| or more) was the most critical factor for making the | | | | computationally feasible to use. In speech recognition, |
| speech recognition system useful; with lower | | | | the hidden Markov model would output a sequence |
| recognition rates, pilots would not use the system. | | | | of n-dimensional real-valued vectors (with n being a |
| More natural vocabulary and grammar, and shorter | | | | small integer, such as 10), outputting one of these |
| training times would be useful, but only if very high | | | | every 10 milliseconds. The vectors would consist of |
| recognition rates could be maintained. | | | | cepstral coefficients, which are obtained by taking a |
| Laboratory research in robust speech recognition for | | | | Fourier transform of a short time window of speech |
| military environments has produced promising results | | | | and decorrelating the spectrum using a cosine |
| which, if extendable to the cockpit, should improve | | | | transform, then taking the first (most significant) |
| the utility of speech recognition in high-performance | | | | coefficients. The hidden Markov model will tend to |
| aircraft. | | | | have in each state a statistical distribution that is a |
| Working with Swedish pilots flying in the JAS-39 | | | | mixture of diagonal covariance Gaussians which will |
| Gripen cockpit, Englund (2004) found recognition | | | | give a likelihood for each observed vector. Each |
| deteriorated with increasing G-loads. It was also | | | | word, or (for more general speech recognition |
| concluded that adaptation greatly improved the | | | | systems), each phoneme, will have a different output |
| results in all cases and introducing models for | | | | distribution; a hidden Markov model for a sequence of |
| breathing was shown to improve recognition scores | | | | words or phonemes is made by concatenating the |
| significantly. Contrary to what might be expected, no | | | | individual trained hidden Markov models for the |
| effects of the broken English of the speakers were | | | | separate words and phonemes. |
| found. It was evident that spontaneous speech | | | | Described above are the core elements of the most |
| caused problems for the recognizer, as could be | | | | common, HMM-based approach to speech recognition. |
| expected. A restricted vocabulary, and above all, a | | | | Modern speech recognition systems use various |
| proper syntax, could thus be expected to improve | | | | combinations of a number of standard techniques in |
| recognition accuracy substantially. | | | | order to improve results over the basic approach |
| The Eurofighter Typhoon currently in service with the | | | | described above. A typical large-vocabulary system |
| UK RAF employs a speaker-dependent system, i.e. it | | | | would need context dependency for the phonemes |
| requires each pilot to create a template. The system | | | | (so phonemes with different left and right context |
| is not used for any safety critical or weapon critical | | | | have different realizations as HMM states); it would |
| tasks, such as weapon release or lowering of the | | | | use cepstral normalization to normalize for different |
| undercarriage, but is used for a wide range of other | | | | speaker and recording conditions; for further speaker |
| cockpit functions. Voice commands are confirmed by | | | | normalization it might use vocal tract length |
| visual and/or aural feedback. The system is seen as a | | | | normalization (VTLN) for male-female normalization |
| major design feature in the reduction of pilot | | | | and maximum likelihood linear regression (MLLR) for |
| workload, and even allows the pilot to assign targets | | | | more general speaker adaptation. The features would |
| to himself with two simple voice commands or to | | | | have so-called delta and delta-delta coefficients to |
| any of his wingmen with only five commands. | | | | capture speech dynamics and in addition might use |
| Helicopters | | | | heteroscedastic linear discriminant analysis (HLDA); or |
| The problems of achieving high recognition accuracy | | | | might skip the delta and delta-delta coefficients and |
| under stress and noise pertain strongly to the | | | | use splicing and an LDA-based projection followed |
| helicopter environment as well as to the fighter | | | | perhaps by heteroscedastic linear discriminant analysis |
| environment. The acoustic noise problem is actually | | | | or a global semi-tied covariance transform (also known |
| more severe in the helicopter environment, not only | | | | as maximum likelihood linear transform, or MLLT). |
| because of the high noise levels but also because the | | | | Many systems use so-called discriminative training |
| helicopter pilot generally does not wear a facemask, | | | | techniques which dispense with a purely statistical |
| which would reduce acoustic noise in the microphone. | | | | approach to HMM parameter estimation and instead |
| Substantial test and evaluation programs have been | | | | optimize some classification-related measure of the |
| carried out in the past decade in speech recognition | | | | training data. Examples are maximum mutual |
| systems applications in helicopters, notably by the | | | | information (MMI), minimum classification error (MCE) |
| U.S. Army Avionics Research and Development | | | | and minimum phone error (MPE). |
| Activity (AVRADA) and by the Royal Aerospace | | | | Decoding of the speech (the term for what happens |
| Establishment (RAE) in the UK. Work in France has | | | | when the system is presented with a new utterance |
| included speech recognition in the Puma helicopter. | | | | and must compute the most likely source sentence) |
| There has also been much useful work in Canada. | | | | would probably use the Viterbi algorithm to find the |
| Results have been encouraging, and voice applications | | | | best path, and here there is a choice between |
| have included: control of communication radios; | | | | dynamically creating a combination hidden Markov |
| setting of navigation systems; and control of an | | | | model which includes both the acoustic and language |
| automated target handover system. | | | | model information, or combining it statically |
| As in fighter applications, the overriding issue for | | | | beforehand (the finite state transducer, or FST, |
| voice in helicopters is the impact on pilot | | | | approach). |
| effectiveness. Encouraging results are reported for | | | | Dynamic time warping (DTW)-based speech |
| the AVRADA tests, although these represent only a | | | | recognition |
| feasibility demonstration in a test environment. Much | | | | |
| remains to be done both in speech recognition and in | | | | Dynamic time warping is an approach that was |
| overall speech technology, in order to | | | | historically used for speech recognition but has now |
| consistently achieve performance improvements in | | | | largely been displaced by the more successful |
| operational settings. | | | | HMM-based approach. Dynamic time warping is an |
| Battle management | | | | algorithm for measuring similarity between two |
| | | | | sequences which may vary in time or speed. For |
| | | | | instance, similarities in walking patterns would be |
| | | | | detected, even if in one video the person was |
| | | | | walking slowly and if in another they were walking |
| Battle Management command centres generally | | | | more quickly, or even if there were accelerations and |
| require rapid access to and control of large, rapidly | | | | decelerations during the course of one observation. |
| changing information databases. Commanders and | | | | DTW has been applied to video, audio, and graphics; |
| system operators need to query these databases as | | | | indeed, any data which can be turned into a linear |
| conveniently as possible, in an eyes-busy environment | | | | representation can be analyzed with DTW. |
| where much of the information is presented in a | | | | A well known application has been automatic speech |
| display format. Human-machine interaction by voice | | | | recognition, to cope with different speaking speeds. |
| has the potential to be very useful in these | | | | In general, it is a method that allows a computer to |
| environments. A number of efforts have been | | | | find an optimal match between two given sequences |
| undertaken to interface commercially available | | | | (e.g. time series) with certain restrictions, i.e. the |
| isolated-word recognizers into battle management | | | | sequences are "warped" non-linearly to match each |
| environments. In one feasibility study speech | | | | other. This sequence alignment method is often used |
| recognition equipment was tested in conjunction with | | | | in the context of hidden Markov models. |
| an integrated information display for naval battle | | | | Further information |
| management applications. Users were very optimistic | | | | Popular speech recognition conferences held each |
| about the potential of the system, although | | | | year or two include ICASSP, Eurospeech/ICSLP (now |
| capabilities were limited. | | | | named Interspeech) and the IEEE ASRU. |
| Speech understanding programs sponsored by the | | | | Conferences in the field of Natural language |
| Defense Advanced Research Projects Agency | | | | processing, such as ACL, NAACL, EMNLP, and HLT, |
| (DARPA) in the U.S. have focused on this problem of | | | | are beginning to include papers on speech processing. |
| natural speech interface. Speech recognition efforts | | | | Important journals include the IEEE Transactions on |
| have focused on a database of continuous speech | | | | Speech and Audio Processing (now named IEEE |
| recognition (CSR), large-vocabulary speech which is | | | | Transactions on Audio, Speech and Language |
| designed to be representative of the naval resource | | | | Processing), Computer Speech and Language, and |
| management task. Significant advances in the | | | | Speech Communication. Books like "Fundamentals of |
| state-of-the-art in CSR have been achieved, and | | | | Speech Recognition" by Lawrence Rabiner can be |
| current efforts are focused on integrating speech | | | | useful to acquire basic knowledge but may not be |
| recognition and natural language processing to allow | | | | fully up to date (1993). Another good source can be |
| spoken language interaction with a naval resource | | | | "Statistical Methods for Speech Recognition" by |
| management system. | | | | Frederick Jelinek and "Spoken Language Processing |
| Training air traffic controllers | | | | (2001)" by Xuedong Huang et al. More up to date is |
| Training for military (or civilian) air traffic controllers | | | | "Computer Speech", by Manfred R. Schroeder, |
| (ATC) represents an excellent application for speech | | | | second edition published in 2004. The recently |
| recognition systems. Many ATC training systems | | | | updated textbook of "Speech and Language |
| currently require a person to act as a "pseudo-pilot", | | | | Processing (2008)" by Jurafsky and Martin presents |
| engaging in a voice dialog with the trainee controller, | | | | the basics and the state of the art for ASR. A good |
| which simulates the dialog which the controller would | | | | insight into the techniques used in the best modern |
| have to conduct with pilots in a real ATC situation. | | | | systems can be gained by paying attention to |
| Speech recognition and synthesis techniques offer | | | | government sponsored evaluations such as those |
| the potential to eliminate the need for a person to | | | | organised by DARPA (the largest speech |
| act as pseudo-pilot, thus reducing training and support | | | | recognition-related project ongoing as of 2007 is the |
| personnel. Air controller tasks are also characterized | | | | GALE project, which involves both speech recognition |
| by highly structured speech as the primary output of | | | | and translation components). |
| the controller, hence reducing the difficulty of the | | | | In terms of freely available resources, Carnegie Mellon |
| speech recognition task. | | | | University's SPHINX toolkit is one place to start to |
| The U.S. Naval Training Equipment Center has | | | | both learn about speech recognition and to start |
| sponsored a number of developments of prototype | | | | experimenting. Another resource (free as in free |
| ATC trainers using speech recognition. Generally, the | | | | beer, not free software) is the HTK book (and the |
| recognition accuracy falls short of providing graceful | | | | accompanying HTK toolkit). The AT&T GRM library |
| interaction between the trainee and the system. | | | | and DCD library are also general |
| However, the prototype training systems have | | | | software libraries for large-vocabulary speech |
| demonstrated a significant potential for voice | | | | recognition. |
| interaction in these systems, and in other training | | | | A useful review of the area of robustness in ASR is |
| applications. The U.S. Navy has sponsored a | | | | provided by Junqua and Haton (1995). |
| large-scale effort in ATC training systems, where a | | | | See also |
| commercial speech recognition unit was integrated | | | | Audio mining |
| with a complex training system including displays and | | | | Audio visual speech recognition |
| scenario creation. Although the recognizer was | | | | Acoustic Model |
| constrained in vocabulary, one of the goals of the | | | | Digital dictation |
| training programs was to teach the controllers to | | | | Direct Voice Input |
| speak in a constrained language, using specific | | | | Keyword spotting |
| vocabulary specifically designed for the ATC task. | | | | List of speech recognition software |
| Research in France has focused on the application of | | | | Microphone |
| speech recognition in ATC training systems, directed | | | | Mondegreen |
| at issues both in speech recognition and in application | | | | Multimodal interaction |
| of task-domain grammar constraints. | | | | OpenDocument |
| The USAF, USMC, US Army, and FAA are currently | | | | Phonetic search technology |
| using ATC simulators with speech recognition from a | | | | Speech Analytics |
| number of different vendors, including UFA, Inc, and | | | | Speaker identification |
| Adacel Systems Inc (ASI). This software uses | | | | Speaker diarisation |
| speech recognition and synthetic speech to enable | | | | Speech corpus |
| the trainee to control aircraft and ground vehicles in | | | | Speech processing |
| the simulation without the need for pseudo pilots. | | | | Speech recognition in Linux |
| Another approach to ATC simulation with speech | | | | Speech synthesis |
| recognition has been created by Supremis. The | | | | Speech verification |
| Supremis system is not constrained by rigid | | | | Text-to-speech (TTS) |
| grammars imposed by the underlying limitations of | | | | VoiceXML |
| other recognition strategies. | | | | Voxforge |
| Telephony and other domains | | | | Windows Speech Recognition |
| ASR in the field of telephony is now commonplace | | | | Speech technology |
| and in the field of computer gaming and simulation is | | | | References |
| becoming more widespread. Despite the high level of | | | | Karat, Clare-Marie; Vergo, John; Nahamoo, David |
| integration with word processing in general personal | | | | (2007), "Conversational Interface Technologies", in |
| computing, however, ASR in the field of document | | | | Sears, Andrew; Jacko, Julie A., The Human-Computer |
| production has not seen the expected | | | | Interaction Handbook: Fundamentals, Evolving |
| increases in use. | | | | Technologies, and Emerging Applications (Human |
| The improvement of mobile processor speeds made | | | | Factors and Ergonomics), Lawrence Erlbaum |
| feasible the speech-enabled Symbian and Windows | | | | Associates Inc, ISBN 978-0805858709. |
| Mobile smartphones. Speech is used mostly as part | | | | Cole, Ronald; Mariani, Joseph; Uszkoreit, Hans; |
| of the user interface, for creating predefined or custom | | | | et al., eds. (1997), managing editors Giovanni Battista |
| speech commands. Leading software vendors in this | | | | Varile, Antonio Zampolli, Survey of the state of the art in human |
| field are: Microsoft Corporation (Microsoft Voice | | | | language technology, Cambridge Studies In Natural |
| Command), Nuance Communications (Nuance Voice | | | | Language Processing, XIIIII, Cambridge University |
| Control), Vito Technology (VITO Voice2Go), Speereo | | | | Press, ISBN 0-521-59277-1. |
| Software (Speereo Voice Translator) and SVOX. | | | | Junqua, J.-C.; Haton, J.-P. (1995), Robustness in |
| People with disabilities | | | | Automatic Speech Recognition: Fundamentals and |
| People with disabilities can benefit from speech | | | | Applications, Kluwer Academic Publishers, ISBN |
| recognition programs. Speech recognition is especially | | | | 978-0792396468. |
| useful for people who have difficulty using their | | | | Davis, K.H., Biddulph, R. and Balashek, S. (1952) |
| hands, ranging from mild repetitive stress injuries to | | | | Automatic Speech Recognition of Spoken Digits, J. |
| involved disabilities that preclude using conventional | | | | Acoust. Soc. Am. 24(6) pp. |
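
The section on performance of speech recognition systems names word error rate (WER) and the real time factor as the usual measures of accuracy and speed. As a rough illustration, the sketch below computes WER as the word-level edit distance between a reference transcript and a recognizer hypothesis, divided by the number of reference words, and the real time factor as processing time divided by audio duration. These are the commonly used definitions rather than ones spelled out in the article, and the function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length (assumed definition)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds
```

Under these definitions a WER of 0.02 corresponds to the 98% accuracy figure quoted for optimal conditions, and WER can exceed 1.0 when the hypothesis contains many inserted words.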
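
The section on HMM-based recognition notes that decoding would probably use the Viterbi algorithm to find the best path. The sketch below is a minimal, assumed illustration of that algorithm on a toy discrete HMM. The state names, symbols and probabilities are invented for the example, and a real recognizer would work with log-likelihoods from Gaussian mixture output distributions over cepstral vectors rather than a small emission table.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence and its probability for a discrete HMM."""
    # best[t][s]: probability of the best path that ends in state s at time t.
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    # back[t][s]: predecessor of s on that best path.
    back = [{}]

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev_state = max(states, key=lambda p: best[t - 1][p] * trans_p[p][s])
            best[t][s] = (best[t - 1][prev_state] * trans_p[prev_state][s]
                          * emit_p[s][observations[t]])
            back[t][s] = prev_state

    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy two-state model; all values are hypothetical.
states = ["s1", "s2"]
start_p = {"s1": 0.6, "s2": 0.4}
trans_p = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit_p = {"s1": {"a": 0.5, "b": 0.4, "c": 0.1},
          "s2": {"a": 0.1, "b": 0.3, "c": 0.6}}

path, prob = viterbi(["a", "b", "c"], states, start_p, trans_p, emit_p)
```

Multiplying raw probabilities is fine for three observations but underflows on real utterances, which is why practical decoders sum log-probabilities instead.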
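
The section on dynamic time warping describes it as a way to find an optimal match between two sequences that may vary in time or speed. A minimal sketch of the classic dynamic-programming formulation follows, assuming one-dimensional sequences and absolute difference as the local cost. A speech system would instead compare frames of feature vectors, and the function name is hypothetical.

```python
import math


def dtw_distance(a, b):
    """Cumulative cost of the best non-linear alignment between sequences a and b."""
    n, m = len(a), len(b)
    # cost[i][j]: minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances while b stays on the same element
                cost[i][j - 1],      # b advances while a stays on the same element
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]
```

With this cost, a sequence and a time-stretched copy of it, such as [1, 2, 3] and [1, 1, 2, 2, 3, 3], align with zero distance, which is the property that lets DTW cope with different speaking speeds.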