January 19, 2007
4.1 | Outline of Topics | |
Concepts & terms
Machine hearing
Speech research
Levels of abstraction in dialogue
Speech generation
Speech recognition
Early recognition
Connected word recognition
Continuous speech recognition
Hidden Markov Models
HMM computations
Errors in speech recognition
4.2 | Concepts & Terms | |
Speech | Sounds that a human makes with their throat that convey information in the form of symbols to other humans | |
Monolog | One person speaking on their own | |
Dialog | Two people talking to each other; often also used for n people talking to each other | |
Conversation | Another word for dialog | |
Representation | The convoluted issue of how data and information are stored — not the medium, but the form (example: representing the relative position of the moon and Earth with absolute numbers or as differences) | |
Phoneme | "The smallest meaningful sound snippet in speech" | |
Corpus | A body of information that can be used for automatic and manual analysis, e.g. to extract probabilities of events in the real world | |
4.3 | Machine hearing | |
Goal | Get machines to hear sounds in a way that allows them to act intelligently | |
Scene analysis | Try to figure out a high-level understanding of where you are from the type of sounds heard all around, e.g. "restaurant", "ocean front", "indoors", etc. | |
Musical genre classification | Identifying what type of music is playing, e.g. "disco", "rap", etc., directly by listening to the audio file | |
Timbral signature classification | "Recognizing objects from their sound" | |
Main focus in machine hearing | Speech recognition | |
Main focus in speech recognition | phonemes, words and sentences | |
Problem | Even if you have figured out what words are being said, in what order, you do not necessarily understand what was meant | |
4.4 | Speech research | |
Why study human speech production & perception? | The human mechanisms used for speech production have still not been matched by artificial means; the human mechanisms for speech recognition are even less well understood | |
We must study the phenomenon we are trying to mimic | This study can take the form of inspiration or faithful reproduction; both have been tried. Speech is not useful for machine-machine interaction; it is useful for human-machine interaction | |
We must study the two together | They evolved together. In all known practical applications they are used together | |
4.5 | Levels of abstraction in dialog | |
Acoustics | Speech is just another sound | |
Articulation | Neural impulses control muscles to produce the sounds | |
Phonemic | Phonemes are the smallest units of sound that convey meaning; the phonemes symbolized by "h" and "y" separate "hello" and "yellow" | |
Lexical | The dictionary | |
Syntactic | Syntax dictates legal ways to combine words, e.g. noun phrase + verb phrase: [The door] [opened] | |
Semantic | "It is hot in here" = the temperature is such that the speaker believes it can be qualified as "hot" | |
Pragmatic | "It is hot in here" = open the window, please! | |
Discourse | Everything below, plus turn-taking and context | |
4.6 | Speech Generation | |
Must be understood for recognition | ||
Voiced and unvoiced sounds | Speech is a sequence of pitched sounds, or "tones", and noise bursts; the former are typically called vowels, the latter consonants | |
Paraverbals | Sounds made by the throat and mouth; what separates paraverbals from speech is that they cannot be used in the same way as words or phonemes. What gives them meaning is their use in the semantic and especially the pragmatic layer of discourse | |
Voiced | Voiced speech sounds typically have the most energy of the speech sounds. Steady state: you can "say it forever". Diphthongs: two vowels strung together, e.g. hi (ha-ee). Nasals: n, m | |
Unvoiced | A noise burst of some sort. Transient state. Plosives: p, t, k. Fricatives: s, th, f | |
Coarticulation | Effects of upcoming sounds on the production of the sounds being made at present, e.g. "show" vs. "shine": the "sh" sound is different in each of these words. Complicates speech synthesis | |
Humans generate sound by modifying physical structures | Very different from how current speech synthesis technology works. | |
Vocal tract is different in everyone | The timbral quality of everyone's voice is different. Timbre: the quality given to a sound by its (unique set of) overtones; the attribute of auditory sensation that enables a listener to distinguish two similar sounds with the same pitch and loudness | |
4.7 | Early Recognition | |
Method: Template matching | Early speech recognition was mostly this | |
Typically used for single words, spoken in isolation | "Isolated word recognition" | |
How it works | Templates are created by transforming a training set into feature vectors; templates {t0, ..., tn} are represented as points in feature space. An incoming sound is transformed into the same feature space, and the distance from its position to each template point is measured; the closest point is the best match. A threshold on the rejection distance, and a threshold on the distance between the closest and second-closest match, may cause all candidates to be rejected (see the sketch below) | |
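To make the matching step concrete, here is a minimal sketch of nearest-neighbor template matching with the two rejection thresholds described above. The feature extraction step is omitted; the word labels, feature vectors and threshold values are all invented for illustration and are not from the original notes.

```python
import math

# Hypothetical templates: word label -> point in a (here 3-dimensional) feature space.
# In a real system each point would come from transforming training recordings into
# feature vectors; these numbers are made up.
templates = {
    "yes":  (0.9, 0.1, 0.3),
    "no":   (0.2, 0.8, 0.5),
    "stop": (0.5, 0.5, 0.9),
}

MAX_DISTANCE = 0.6      # assumed rejection threshold: too far from every template
MIN_SEPARATION = 0.05   # assumed threshold: best and second-best too close to call

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(incoming):
    """Return the best-matching template label, or None if rejected."""
    ranked = sorted(templates, key=lambda w: euclidean(incoming, templates[w]))
    best, second = ranked[0], ranked[1]
    d_best = euclidean(incoming, templates[best])
    d_second = euclidean(incoming, templates[second])
    if d_best > MAX_DISTANCE:               # nothing is close enough
        return None
    if d_second - d_best < MIN_SEPARATION:  # ambiguous between two templates
        return None
    return best

print(recognize((0.85, 0.15, 0.25)))  # -> "yes" (closest template, well separated)
print(recognize((0.0, 0.0, 0.0)))     # -> None (rejected: too far from everything)
```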
This method is still used, for example in some personalized voice dialing services | ||
4.8 | Connected Word Recognition | |
What it is | the ... kind ... of ... speech ... recognition ... that ... requires ... pauses ... between ... each ... word | |
When was it popular | 70s to mid-80s | |
Basic advance in computing power led to an increase in the speed of isolated word recognition | This led to an improvement on the template-matching systems people had built before | |
Main problem | Very unnatural; breaks the flow of speech; introduces hesitations and artifacts -- in short, people are not good at doing this in real situations | |
4.9 | Continuous Speech Recognition | |
What is it | You speak fluently (with silence before and after!) | |
When was it popular | Early 90s to present day | |
Example | Sphinx-4, a Java-based open-source speech recognizer | |
Uses HMMs extensively | ||
4.10 | Hidden Markov Models (HMMs) | |
Good for analyzing temporal patterns | Solid statistical foundation. HMMs can be trained on sequences; training is done using corpora | |
Represented by the triple (π, A, B) | π = initial state probability vector; A = {aij}, the state transition matrix; B = {bij}, the output (emission) matrix | |
π = Initial state probability vector | Some states may be more probable as start states than others | |
A = {aij} State transition matrix | State i has probability aij of transitioning to state j | |
B = {bij} Output matrix | State i has probability bij of outputting symbol j | |
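To make the (π, A, B) triple concrete, below is a toy two-state HMM in Python with invented numbers (nothing here is taken from the lecture figures), together with the standard forward algorithm, which computes the probability of an observation sequence under the model by summing over all possible state paths.

```python
# Toy HMM given as the triple (pi, A, B); all numbers invented for illustration.
pi = [0.6, 0.4]                  # initial state probabilities (2 hidden states)
A  = [[0.7, 0.3],                # A[i][j] = P(next state = j | current state = i)
      [0.4, 0.6]]
B  = [[0.9, 0.1],                # B[i][k] = P(output symbol = k | state = i)
      [0.2, 0.8]]

def sequence_probability(observations, pi, A, B):
    """Forward algorithm: P(observation sequence | model), summed over all state paths."""
    n_states = len(pi)
    # Initialization: start in state i and emit the first symbol
    alpha = [pi[i] * B[i][observations[0]] for i in range(n_states)]
    # Induction: extend the partial sequence one observation at a time
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][obs]
            for j in range(n_states)
        ]
    return sum(alpha)

# Probability of observing the symbol sequence 0, 1, 1 under this toy model
print(sequence_probability([0, 1, 1], pi, A, B))
```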
4.11 | HMM computations (refer to figure 24.36 in your textbook for [m]) | |
States: O, M, E | ||
Transition probabilities | (given in figure 24.36; not reproduced here) | |
Output probabilities | (given in figure 24.36; not reproduced here) | |
Signal quantization | The incoming signal is quantized into the observation symbols C1, C4, C6 | |
Formula | P([C1,C4,C6]|[m]) = (0.7 * 0.1 * 0.6) * (0.5 * 0.7 * 0.5) = 0.00735, i.e. (product of transition probabilities along the path) * (product of output probabilities); see the sketch below | |
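The 0.00735 above is simply the product of the probabilities along the single state path through the [m] model, times the probability of each state emitting the observed symbol. The grouping of factors (three transitions, three emissions) is my reading of the formula, since the figure itself is not reproduced in these notes. A tiny sketch of that arithmetic:

```python
# Reading of the formula above: three transition probabilities along the path
# through [m], and one output probability per state for the observed symbols
# C1, C4, C6. The mapping of numbers to arcs/states is assumed, not quoted.
transition_probs = [0.7, 0.1, 0.6]   # assumed: the three path transitions
output_probs     = [0.5, 0.7, 0.5]   # assumed: P(C1|state1), P(C4|state2), P(C6|state3)

def path_probability(transitions, outputs):
    """Probability of one specific state path emitting one specific symbol sequence."""
    p = 1.0
    for t, o in zip(transitions, outputs):
        p *= t * o
    return p

print(path_probability(transition_probs, output_probs))  # 0.00735 (up to float rounding)
```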
End goal | Compute: P(word_i | word_(i-1)) | |
4.12 | Errors in speech recognition | |
Main problems | Environmental noise | |
Noise | - Included in this category are noises which actually count as communicative, for example tongue clicks and other meaningful non-speech mouth sounds - Constant background noise deteriorates recognition rates; a threshold is quickly reached beyond which recognition quality drops below usable levels - Non-uniform background noises are even worse, as they derail the HMMs down the wrong paths | |
Error types | - Rejection (false negative) - Insertion (false positive) - Substitution (one word recognized as another); in practice these are counted by aligning the output against a reference transcript, as in the sketch below | |
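The standard way to count these error types when scoring a recognizer is to align the recognized word sequence against a reference transcript with minimum edit distance; rejections show up as deletions in that alignment. This is a generic sketch of that bookkeeping, not something from the original notes.

```python
def count_errors(reference, hypothesis):
    """Align hypothesis to reference by minimum edit distance and count error types."""
    n, m = len(reference), len(hypothesis)
    # dp[i][j] = minimum number of edits to turn reference[:i] into hypothesis[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i          # deleting all remaining reference words
    for j in range(m + 1):
        dp[0][j] = j          # inserting all hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion (word rejected/missed)
                           dp[i][j - 1] + 1,         # insertion (spurious word)
                           dp[i - 1][j - 1] + cost)  # substitution or correct match
    # Walk back through the table to classify each error
    i, j, errors = n, m, {"substitution": 0, "insertion": 0, "deletion": 0}
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (0 if reference[i - 1] == hypothesis[j - 1] else 1):
            if reference[i - 1] != hypothesis[j - 1]:
                errors["substitution"] += 1
            i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            errors["insertion"] += 1
            j -= 1
        else:
            errors["deletion"] += 1
            i -= 1
    return errors

print(count_errors(["open", "the", "window", "please"],
                   ["open", "a", "window", "now", "please"]))
# -> {'substitution': 1, 'insertion': 1, 'deletion': 0}
```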
Interaction error types | - Mistaken give-turn signal (triggered by silence before the user is done speaking) - Mistaken take-turn/interrupt signal (triggered by external noise, e.g. a tongue click) - Missed take-turn/interrupt signal - Misinterpreted take-turn/interrupt signal (e.g. taken as a valid response to ongoing talk) | |
The main source of misrecognition | Lack of understanding | |