Speech recognition

History

The first speech recognizer appeared in 1952 and consisted of a device for the recognition of single spoken digits. Another early device was the IBM Shoebox, exhibited at the 1964 New York World's Fair.

One of the most notable domains for the commercial application of speech recognition in the United States has been health care, and in particular the work of the medical transcriptionist (MT). According to industry experts, at its inception speech recognition (SR) was sold as a way to completely eliminate transcription rather than to make the transcription process more efficient, and hence it was not accepted. SR at that time was also often technically deficient. Additionally, to be used effectively, it required changes to the ways physicians worked and documented clinical encounters, which many, if not all, were reluctant to make. The biggest limitation to speech recognition automating transcription, however, is seen as the software. The nature of narrative dictation is highly interpretive and often requires judgment that may be provided by a real human but not yet by an automated system. Another limitation has been the extensive amount of time required by the user and/or system provider to train the software.

A distinction in automatic speech recognition (ASR) is often made between "artificial syntax systems", which are usually domain-specific, and "natural language processing", which is usually language-specific. Each of these types of application presents its own particular goals and challenges.

Applications

Health care

In the health care domain, even in the wake of improving speech recognition technologies, medical transcriptionists (MTs) have not yet become obsolete. Many experts in the field anticipate that with increased use of speech recognition technology, the services provided may be redistributed rather than replaced. Speech recognition is also used to enable deaf people to understand the spoken word via speech-to-text conversion.

Speech recognition can be implemented in the front end or the back end of the medical documentation process.

Front-End SR is where the provider dictates into a speech-recognition engine, the recognized words are displayed right after they are spoken, and the dictator is responsible for editing and signing off on the document. It never goes through an MT/editor.

Back-End SR or Deferred SR is where the provider dictates into a digital dictation system, the voice is routed through a speech-recognition machine, and the recognized draft document is routed along with the original voice file to the MT/editor, who edits the draft and finalizes the report. Deferred SR is currently widely used in the industry.

Many Electronic Medical Records (EMR) applications can be more effective and may be performed more easily when deployed in conjunction with a speech-recognition engine. Searches, queries, and form filling may all be faster to perform by voice than by using a keyboard.

Military

High-performance fighter aircraft

Substantial efforts have been devoted in the last decade to the test and evaluation of speech recognition in fighter aircraft. Of particular note are the U.S. program in speech recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16 VISTA), the program in France on installing speech recognition systems on Mirage aircraft, and programs in the UK dealing with a variety of aircraft platforms. In these programs, speech recognizers have been operated successfully in fighter aircraft with applications including: setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons release parameters, and controlling flight displays. Generally, only very limited, constrained vocabularies have been used successfully, and a major effort has been devoted to integration of the speech recognizer with the avionics system.

Some important conclusions from the work were as follows:
Speech recognition has definite potential for reducing pilot workload, but this potential was not realized consistently.
Achievement of very high recognition accuracy (95% or more) was the most critical factor for making the speech recognition system useful; with lower recognition rates, pilots would not use the system.
More natural vocabulary and grammar, and shorter training times would be useful, but only if very high recognition rates could be maintained.
Laboratory research in robust speech recognition for military environments has produced promising results which, if extendable to the cockpit, should improve the utility of speech recognition in high-performance aircraft.

Working with Swedish pilots flying in the JAS-39 Gripen cockpit, Englund (2004) found that recognition deteriorated with increasing G-loads. It was also concluded that adaptation greatly improved the results in all cases, and introducing models for breathing was shown to improve recognition scores significantly. Contrary to what might be expected, no effects of the broken English of the speakers were found. It was evident that spontaneous speech caused problems for the recognizer, as could be expected. A restricted vocabulary, and above all a proper syntax, could thus be expected to improve recognition accuracy substantially.

The Eurofighter Typhoon currently in service with the UK RAF employs a speaker-dependent system, i.e. it requires each pilot to create a template. The system is not used for any safety-critical or weapon-critical tasks, such as weapon release or lowering of the undercarriage, but is used for a wide range of other cockpit functions. Voice commands are confirmed by visual and/or aural feedback. The system is seen as a major design feature in the reduction of pilot workload, and even allows the pilot to assign targets to himself with two simple voice commands or to any of his wingmen with only five commands.

Helicopters

The problems of achieving high recognition accuracy under stress and noise pertain strongly to the helicopter environment as well as to the fighter environment. The acoustic noise problem is actually more severe in the helicopter environment, not only because of the high noise levels but also because the helicopter pilot generally does not wear a facemask, which would reduce acoustic noise in the microphone. Substantial test and evaluation programs have been carried out in the past decade in speech recognition systems applications in helicopters, notably by the U.S. Army Avionics Research and Development Activity (AVRADA) and by the Royal Aerospace Establishment (RAE) in the UK. Work in France has included speech recognition in the Puma helicopter. There has also been much useful work in Canada. Results have been encouraging, and voice applications have included: control of communication radios; setting of navigation systems; and control of an automated target handover system.

As in fighter applications, the overriding issue for voice in helicopters is the impact on pilot effectiveness. Encouraging results are reported for the AVRADA tests, although these represent only a feasibility demonstration in a test environment. Much remains to be done both in speech recognition and in overall speech technology in order to consistently achieve performance improvements in operational settings.

Battle management

Battle management command centres generally require rapid access to and control of large, rapidly changing information databases. Commanders and system operators need to query these databases as conveniently as possible, in an eyes-busy environment where much of the information is presented in a display format. Human-machine interaction by voice has the potential to be very useful in these environments. A number of efforts have been undertaken to interface commercially available isolated-word recognizers into battle management environments. In one feasibility study, speech recognition equipment was tested in conjunction with an integrated information display for naval battle management applications. Users were very optimistic about the potential of the system, although capabilities were limited.

Speech understanding programs sponsored by the Defense Advanced Research Projects Agency (DARPA) in the U.S. have focused on this problem of natural speech interface. Speech recognition efforts have focused on a database of continuous speech recognition (CSR), large-vocabulary speech which is designed to be representative of the naval resource management task. Significant advances in the state of the art in CSR have been achieved, and current efforts are focused on integrating speech recognition and natural language processing to allow spoken language interaction with a naval resource management system.

Training air traffic controllers

Training for military (or civilian) air traffic controllers (ATC) represents an excellent application for speech recognition systems. Many ATC training systems currently require a person to act as a "pseudo-pilot", engaging in a voice dialog with the trainee controller, which simulates the dialog which the controller would have to conduct with pilots in a real ATC situation. Speech recognition and synthesis techniques offer the potential to eliminate the need for a person to act as pseudo-pilot, thus reducing training and support personnel. Air controller tasks are also characterized by highly structured speech as the primary output of the controller, which reduces the difficulty of the speech recognition task.

The U.S. Naval Training Equipment Center has sponsored a number of developments of prototype ATC trainers using speech recognition. Generally, the recognition accuracy falls short of providing graceful interaction between the trainee and the system. However, the prototype training systems have demonstrated a significant potential for voice interaction in these systems, and in other training applications. The U.S. Navy has sponsored a large-scale effort in ATC training systems, where a commercial speech recognition unit was integrated with a complex training system including displays and scenario creation. Although the recognizer was constrained in vocabulary, one of the goals of the training programs was to teach the controllers to speak in a constrained language, using vocabulary specifically designed for the ATC task. Research in France has focused on the application of speech recognition in ATC training systems, directed at issues both in speech recognition and in application of task-domain grammar constraints.

The USAF, USMC, US Army, and FAA are currently using ATC simulators with speech recognition from a number of different vendors, including UFA, Inc, and Adacel Systems Inc (ASI). This software uses speech recognition and synthetic speech to enable the trainee to control aircraft and ground vehicles in the simulation without the need for pseudo-pilots.

Another approach to ATC simulation with speech recognition has been created by Supremis. The Supremis system is not constrained by rigid grammars imposed by the underlying limitations of other recognition strategies.

Telephony and other domains

ASR in the field of telephony is now commonplace, and in the field of computer gaming and simulation it is becoming more widespread. Despite the high level of integration with word processing in general personal computing, however, ASR in the field of document production has not seen the expected increases in use.

The improvement of mobile processor speeds made feasible the speech-enabled Symbian and Windows Mobile smartphones. Speech is used mostly as a part of the user interface, for creating pre-defined or custom speech commands. Leading software vendors in this field are: Microsoft Corporation (Microsoft Voice Command), Nuance Communications (Nuance Voice Control), Vito Technology (VITO Voice2Go), Speereo Software (Speereo Voice Translator) and SVOX.

People with disabilities

People with disabilities can benefit from speech recognition programs. Speech recognition is especially useful for people who have difficulty using their hands, ranging from mild repetitive stress injuries to involved disabilities that preclude using conventional computer input devices. In fact, people who used the keyboard a lot and developed RSI became an urgent early market for speech recognition. Speech recognition is used in deaf telephony, such as voicemail to text, relay services, and captioned telephone. Individuals with learning disabilities who have problems with thought-to-paper communication (essentially they think of an idea but it is processed incorrectly, causing it to end up differently on paper) can benefit from the software.

Further applications

Automatic translation;
Automotive speech recognition (e.g., Ford Sync);
Telematics (e.g. vehicle navigation systems);
Court reporting (Realtime Voice Writing);
Hands-free computing: voice command recognition computer user interface;
Home automation;
Interactive voice response;
Mobile telephony, including mobile email;
Multimodal interaction;
Pronunciation evaluation in computer-aided language learning applications;
Robotics;
Video games, with Tom Clancy's EndWar and Lifeline as working examples;
Transcription (digital speech-to-text);
Speech-to-text (transcription of speech into mobile text messages);
Air Traffic Control Speech Recognition.

Performance of speech recognition systems

The performance of speech recognition systems is usually specified in terms of accuracy and speed. Accuracy is usually rated with word error rate (WER), whereas speed is measured with the real time factor. Other measures of accuracy include Single Word Error Rate (SWER) and Command Success Rate (CSR).

Most speech recognition users would tend to agree that dictation machines can achieve very high performance in controlled conditions. There is some confusion, however, over the interchangeability of the terms "speech recognition" and "dictation".

Commercially available speaker-dependent dictation systems usually require only a short period of training (sometimes also called "enrollment") and may successfully capture continuous speech with a large vocabulary at normal pace with very high accuracy. Most commercial companies claim that recognition software can achieve between 98% and 99% accuracy if operated under optimal conditions. "Optimal conditions" usually assume that users:
have speech characteristics which match the training data,
can achieve proper speaker adaptation, and
work in a clean noise environment (e.g. quiet office or laboratory space).

This explains why some users, especially those whose speech is heavily accented, might achieve recognition rates much lower than expected. Speech recognition in video has become a popular search technology used by several video search companies.

Limited vocabulary systems, requiring no training, can recognize a small number of words (for instance, the ten digits) as spoken by most speakers. Such systems are popular for routing incoming phone calls to their destinations in large organizations.

Both acoustic modeling and language modeling are important parts of modern statistically-based speech recognition algorithms. Hidden Markov models (HMMs) are widely used in many systems. Language modeling has many other applications, such as smart keyboards and document classification.

Hidden Markov model (HMM)-based speech recognition

Main article: Hidden Markov model

Modern general-purpose speech recognition systems are generally based on hidden Markov models. These are statistical models which output a sequence of symbols or quantities. One possible reason why HMMs are used in speech recognition is that a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal. That is, one can assume that over a short time, in the range of 10 milliseconds, speech can be approximated as a stationary process. Speech could thus be thought of as a Markov model for many stochastic processes.

Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. In speech recognition, the hidden Markov model would output a sequence of n-dimensional real-valued vectors (with n being a small integer, such as 10), outputting one of these every 10 milliseconds. The vectors would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech and decorrelating the spectrum using a cosine transform, then taking the first (most significant) coefficients. The hidden Markov model will tend to have in each state a statistical distribution that is a mixture of diagonal covariance Gaussians, which will give a likelihood for each observed vector. Each word, or (for more general speech recognition systems) each phoneme, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes.

Described above are the core elements of the most common, HMM-based approach to speech recognition. Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over this basic approach. A typical large-vocabulary system would need context dependency for the phonemes (so phonemes with different left and right context have different realizations as HMM states); it would use cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have so-called delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA); or the system might skip the delta and delta-delta coefficients and use splicing and an LDA-based projection, followed perhaps by heteroscedastic linear discriminant analysis or a global semitied covariance transform (also known as maximum likelihood linear transform, or MLLT). Many systems use so-called discriminative training techniques which dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are maximum mutual information (MMI), minimum classification error (MCE) and minimum phone error (MPE).

Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path. Here there is a choice between dynamically creating a combination hidden Markov model which includes both the acoustic and language model information, or combining it statically beforehand (the finite state transducer, or FST, approach).

Dynamic time warping (DTW)-based speech recognition

Main article: Dynamic time warping

Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach. Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed. For instance, similarities in walking patterns would be detected even if in one video the person was walking slowly and in another they were walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics; indeed, any data which can be turned into a linear representation can be analyzed with DTW.

A well-known application has been automatic speech recognition, to cope with different speaking speeds. In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g. time series) with certain restrictions, i.e. the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.

Further information

Popular speech recognition conferences held each year or two include ICASSP, Eurospeech/ICSLP (now named Interspeech) and the IEEE ASRU. Conferences in the field of natural language processing, such as ACL, NAACL, EMNLP, and HLT, are beginning to include papers on speech processing. Important journals include the IEEE Transactions on Speech and Audio Processing (now named IEEE Transactions on Audio, Speech and Language Processing), Computer Speech and Language, and Speech Communication. Books like "Fundamentals of Speech Recognition" by Lawrence Rabiner can be useful for acquiring basic knowledge but may not be fully up to date (1993). Other good sources are "Statistical Methods for Speech Recognition" by Frederick Jelinek and "Spoken Language Processing" (2001) by Xuedong Huang et al. More up to date is "Computer Speech" by Manfred R. Schroeder, second edition published in 2004. The recently updated textbook "Speech and Language Processing" (2008) by Jurafsky and Martin presents the basics and the state of the art for ASR. A good insight into the techniques used in the best modern systems can be gained by paying attention to government-sponsored evaluations such as those organised by DARPA (the largest speech recognition-related project ongoing as of 2007 is the GALE project, which involves both speech recognition and translation components).

In terms of freely available resources, Carnegie Mellon University's SPHINX toolkit is one place to start, both to learn about speech recognition and to begin experimenting. Another resource (free as in free beer, not free software) is the HTK book (and the accompanying HTK toolkit). The AT&T GRM library and DCD library are also general software libraries for large-vocabulary speech recognition.

A useful review of the area of robustness in ASR is provided by Junqua and Haton (1995).

See also

Audio mining
Audio visual speech recognition
Acoustic Model
Digital dictation
Direct Voice Input
Keyword spotting
List of speech recognition software
Microphone
Mondegreen
Multimodal interaction
OpenDocument
Phonetic search technology
Speech Analytics
Speaker identification
Speaker diarisation
Speech corpus
Speech processing
Speech recognition in Linux
Speech synthesis
Speech verification
Text-to-speech (TTS)
VoiceXML
Voxforge
Windows Speech Recognition
Speech technology

References

Karat, Clare-Marie; Vergo, John; Nahamoo, David (2007), "Conversational Interface Technologies", in Sears, Andrew; Jacko, Julie A., The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies, and Emerging Applications (Human Factors and Ergonomics), Lawrence Erlbaum Associates Inc, ISBN 978-0805858709.
Cole, Ronald; Mariani, Joseph; Uszkoreit, Hans et al., eds. (1997), Survey of the state of the art in human language technology, managing editors Giovanni Battista Varile, Antonio Zampolli, Cambridge Studies in Natural Language Processing, XII-XIII, Cambridge University Press, ISBN 0-521-59277-1.
Junqua, J.-C.; Haton, J.-P. (1995), Robustness in Automatic Speech Recognition: Fundamentals and Applications, Kluwer Academic Publishers, ISBN 978-0792396468.
Davies, K.H., Biddulph, R. and Balashek, S. (1952), Automatic Speech Recognition of Spoken Digits, J. Acoust. Soc. Am. 24(6) pp.
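
The two performance measures named under "Performance of speech recognition systems", word error rate (WER) and the real time factor, can be illustrated with a minimal sketch. It assumes the common definitions: WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and the real time factor is processing time divided by audio duration. The function names and the whitespace tokenization are illustrative choices, not taken from any particular toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()  # assumption: words are separated by whitespace
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Values below 1.0 mean the recognizer runs faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds
```

For example, comparing the reference "a b c d" with the hypothesis "a x c" gives one substitution and one deletion over four reference words, so the function returns 0.5. Note that WER can exceed 1.0 when the hypothesis contains many insertions.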
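
The Viterbi decoding step mentioned in the section on HMM-based speech recognition can be sketched on a toy model. This is a simplification under stated assumptions: the states emit discrete symbols from a lookup table rather than the Gaussian-mixture likelihoods over cepstral vectors described in that section, and the state names, symbols, and probabilities are invented purely for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_state_path, log_probability) for a discrete-emission HMM."""

    def log(p):
        return math.log(p) if p > 0 else -math.inf

    # best[t][s]: log-probability of the best path ending in state s at time t.
    best = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev_state, score = max(
                ((p, best[t - 1][p] + log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s][observations[t]])
            back[t][s] = prev_state

    # Trace back from the most likely final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical two-state model; all numbers are made up for the example.
STATES = ("A", "B")
START = {"A": 0.6, "B": 0.4}
TRANS = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
EMIT = {
    "A": {"x": 0.5, "y": 0.4, "z": 0.1},
    "B": {"x": 0.1, "y": 0.3, "z": 0.6},
}

best_path, best_logp = viterbi(["x", "y", "z"], STATES, START, TRANS, EMIT)
```

Working in the log domain avoids numerical underflow on long utterances, which is why the scores are summed rather than multiplied. For the toy model above the best path is A, A, B with probability 0.01512.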
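
The idea behind dynamic time warping, matching two sequences that differ in speed, can be shown with a minimal sketch of the classic dynamic-programming recurrence. It assumes one-dimensional numeric sequences and absolute difference as the local cost; a speech system would instead compare feature vectors frame by frame, and would typically add path constraints that are omitted here.

```python
import math


def dtw_distance(a, b):
    """Cumulative cost of the best non-linear alignment between sequences a and b."""
    n, m = len(a), len(b)
    # cost[i][j]: best cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b holds (b element is stretched)
                cost[i][j - 1],      # b advances, a holds (a element is stretched)
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]
```

A sequence and a slowed-down copy of it align at zero cost: `dtw_distance([1, 2, 3], [1, 1, 2, 2, 3, 3])` returns 0.0, whereas a point-by-point comparison would not even be defined for sequences of different lengths.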