January 19, 2007
4.1 | Outline of Topics | |
Concepts & terms
Machine hearing
Speech research
Levels of abstraction in dialogue
Speech generation
Speech recognition
Early recognition
Connected word recognition
Continuous speech recognition
Hidden Markov Models
HMM computations
Errors in speech recognition
4.2 | Concepts & Terms | |
Speech | Sounds that a human makes with their throat that convey information in the form of symbols to other humans | |
Monolog | One person speaking on their own | |
Dialog | Two people talking to each other; often also used for n people talking to each other | |
Conversation | Another word for dialog | |
Representation | The convoluted issue of how data and information are stored — not the medium, but the form (example: representing the relative position of the moon and Earth with absolute numbers or as differences) | |
Phoneme | "The smallest meaningful sound snippet in speech" | |
Corpus | A body of information that can be used for automatic and manual analysis, e.g. to extract probabilities of events in the real world | |
4.3 | Machine hearing | |
Goal | Get machines to hear sounds in a way that allows them to act intelligently | |
Scene analysis | Try to figure out a high-level understanding of where you are from the type of sounds heard all around, e.g. "restaurant", "ocean front", "indoors", etc. | |
Musical genre classification | Identifying what type of music is playing, e.g. "disco", "rap", etc., directly by listening to the audio file | |
Timbral signature classification | "Recognizing objects from their sound" | |
Main focus in machine hearing | Speech recognition | |
Main focus in speech recognition | phonemes, words and sentences | |
Problem | Even if you have figured out what words are being said, in what order, you do not necessarily understand what was meant | |
4.4 | Speech research | |
Why study human speech production & perception? | The human mechanisms used for speech production have still not been matched by artificial means; the human mechanisms for speech recognition are even less well understood | |
We must study the phenomenon we are trying to mimic | This study can take the form of inspiration or faithful reproduction; both have been tried. Speech is not useful for machine-machine interaction; it is useful for human-machine interaction | |
We must study the two together | They evolved together. In all known practical applications they are used together | |
4.5 | Levels of abstraction in dialog | |
Acoustics | Speech is just another sound | |
Articulation | Neural impulses control muscles to produce the sounds | |
Phonemic | Phonemes are the smallest units of sound that convey meaning; the phonemes symbolized by "h" and "y" separate "hello" and "yellow" | |
Lexical | The dictionary | |
Syntactic | Syntax dictates legal ways to combine words, e.g. noun phrase + verb phrase: [The door] [opened] | |
Semantic | "It is hot in here" = the temperature is such that the speaker believes it can be qualified as "hot" | |
Pragmatic | "It is hot in here" = open the window, please! | |
Discourse | Everything below, plus turn-taking and context | |
4.6 | Speech Generation | |
Must be understood for recognition | ||
Voiced and unvoiced sounds | Speech is a sequence of pitched sounds, or "tones", and noise bursts; the former are typically called vowels, the latter consonants | |
Paraverbals | Sounds made by the throat and mouth; what separates paraverbals from speech is that they cannot be used in the same way as words or phonemes. What gives them meaning is their use in the semantic and especially the pragmatic layer of discourse | |
Voiced | Voiced speech sounds typically have the most energy of the speech sounds. Steady state: you can "say it forever". Diphthongs: two vowels strung together, e.g. hi (ha-ee). Nasals: n, m | |
Unvoiced | A noise burst of some sort. Transient state. Plosives: p, t, k. Fricatives: s, th, f | |
Coarticulation | Effects of upcoming sounds on the production of the sounds being made at present, e.g. "show" vs. "shine": the "sh" sound is different in each of these words. Complicates speech synthesis | |
Humans generate sound by modifying physical structures | Very different from how current speech synthesis technology works. | |
Vocal tract is different in everyone | The timbral quality of everyone's voice is different. Timbre: the quality given to a sound by its (unique set of) overtones; the attribute of auditory sensation that enables a listener to distinguish two similar sounds with the same pitch and loudness | |
4.7 | Early Recognition | |
Method: Template matching | Early speech recognition was mostly this | |
Typically used for single words, spoken in isolation | "Isolated word recognition" | |
How it works | Templates are created by transforming a training set into feature vectors; templates {t0, ..., tn} are represented as points in feature space. An incoming sound is transformed into the same feature space, and the distance from its position to each template point is measured; the closest point is the best match. A threshold on the rejection distance, and a threshold on the distance between the closest and second-closest match, may cause all candidates to be rejected (see the sketch below) | |
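To make the matching step concrete, here is a minimal sketch of nearest-neighbor template matching with the two rejection thresholds described above. The feature extraction step is omitted; the word labels, feature vectors and threshold values are all invented for illustration and are not from the original notes.

```python
import math

# Hypothetical templates: word label -> point in a (here 3-dimensional) feature space.
# In a real system each point would come from transforming training recordings into
# feature vectors; these numbers are made up.
templates = {
    "yes":  (0.9, 0.1, 0.3),
    "no":   (0.2, 0.8, 0.5),
    "stop": (0.5, 0.5, 0.9),
}

MAX_DISTANCE = 0.6      # assumed rejection threshold: too far from every template
MIN_SEPARATION = 0.05   # assumed threshold: best and second-best too close to call

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(incoming):
    """Return the best-matching template label, or None if rejected."""
    ranked = sorted(templates, key=lambda w: euclidean(incoming, templates[w]))
    best, second = ranked[0], ranked[1]
    d_best = euclidean(incoming, templates[best])
    d_second = euclidean(incoming, templates[second])
    if d_best > MAX_DISTANCE:               # nothing is close enough
        return None
    if d_second - d_best < MIN_SEPARATION:  # ambiguous between two templates
        return None
    return best

print(recognize((0.85, 0.15, 0.25)))  # -> "yes" (closest template, well separated)
print(recognize((0.0, 0.0, 0.0)))     # -> None (rejected: too far from everything)
```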
This method is still used, for example in some personalized voice dialing services | ||
4.8 | Connected Word Recognition | |
What it is | the ... kind ... of ... speech ... recognition ... that ... requires ... pauses ... between ... each ... word | |
When was it popular | 70s to mid-80s | |
Basic advance in computing power led to an increase in the speed of isolated word recognition | This led to an improvement on the template-matching systems people had built before | |
Main problem | Very unnatural; breaks the flow of speech; introduces hesitations and artifacts -- in short, people are not good at doing this in real situations | |
4.9 | Continuous Speech Recognition | |
What is it | You speak fluently (with silence before and after!) | |
When was it popular | Early 90s to present day | |
Example | Sphinx-4, a Java-based open-source speech recognizer | |
Uses HMMs extensively | ||
4.10 | Hidden Markov Models (HMMs) | |
Good for analyzing temporal patterns | Solid statistical foundation. HMMs can be trained on sequences; training is done using corpora | |
Represented by the triple (π, A, B) | π = initial state probability vector; A = {aij}, the state transition matrix; B = {bij}, the output (emission) matrix | |
π = Initial state probability vector | Some states may be more probable as start states than others | |
A = {aij} State transition matrix | State i has probability aij of transitioning to state j | |
B = {bij} Output matrix | State i has probability bij of outputting symbol j | |
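To make the (π, A, B) triple concrete, below is a toy two-state HMM in Python with invented numbers (nothing here is taken from the lecture figures), together with the standard forward algorithm, which computes the probability of an observation sequence under the model by summing over all possible state paths.

```python
# Toy HMM given as the triple (pi, A, B); all numbers invented for illustration.
pi = [0.6, 0.4]                  # initial state probabilities (2 hidden states)
A  = [[0.7, 0.3],                # A[i][j] = P(next state = j | current state = i)
      [0.4, 0.6]]
B  = [[0.9, 0.1],                # B[i][k] = P(output symbol = k | state = i)
      [0.2, 0.8]]

def sequence_probability(observations, pi, A, B):
    """Forward algorithm: P(observation sequence | model), summed over all state paths."""
    n_states = len(pi)
    # Initialization: start in state i and emit the first symbol
    alpha = [pi[i] * B[i][observations[0]] for i in range(n_states)]
    # Induction: extend the partial sequence one observation at a time
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][obs]
            for j in range(n_states)
        ]
    return sum(alpha)

# Probability of observing the symbol sequence 0, 1, 1 under this toy model
print(sequence_probability([0, 1, 1], pi, A, B))
```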
4.11 | HMM computations (refer to figure 24.36 in your textbook for [m]) | |
States: O, M, E | ||
Transition probabilities | (given in figure 24.36; not reproduced here) | |
Output probabilities | (given in figure 24.36; not reproduced here) | |
Signal quantization | The incoming signal is quantized into the observation symbols C1, C4, C6 | |
Formula | P([C1,C4,C6]|[m]) = (0.7 * 0.1 * 0.6) * (0.5 * 0.7 * 0.5) = 0.00735, i.e. (product of transition probabilities along the path) * (product of output probabilities); see the sketch below | |
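The 0.00735 above is simply the product of the probabilities along the single state path through the [m] model, times the probability of each state emitting the observed symbol. The grouping of factors (three transitions, three emissions) is my reading of the formula, since the figure itself is not reproduced in these notes. A tiny sketch of that arithmetic:

```python
# Reading of the formula above: three transition probabilities along the path
# through [m], and one output probability per state for the observed symbols
# C1, C4, C6. The mapping of numbers to arcs/states is assumed, not quoted.
transition_probs = [0.7, 0.1, 0.6]   # assumed: the three path transitions
output_probs     = [0.5, 0.7, 0.5]   # assumed: P(C1|state1), P(C4|state2), P(C6|state3)

def path_probability(transitions, outputs):
    """Probability of one specific state path emitting one specific symbol sequence."""
    p = 1.0
    for t, o in zip(transitions, outputs):
        p *= t * o
    return p

print(path_probability(transition_probs, output_probs))  # 0.00735 (up to float rounding)
```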
End goal | Compute: P(word_i | word_(i-1)) | |
4.12 | Errors in speech recognition | |
Main problems | Environmental noise | |
Noise | - Included in this category are noises which actually count as communicative, for example tongue clicks and other meaningful non-speech mouth sounds - Constant background noise deteriorates recognition rates; a threshold is quickly reached beyond which recognition quality drops below usable levels - Non-uniform background noises are even worse, as they derail the HMMs down the wrong paths | |
Error types | - Rejection (false negative) - Insertion (false positive) - Substitution (one word recognized as another); in practice these are counted by aligning the output against a reference transcript, as in the sketch below | |
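The standard way to count these error types when scoring a recognizer is to align the recognized word sequence against a reference transcript with minimum edit distance; rejections show up as deletions in that alignment. This is a generic sketch of that bookkeeping, not something from the original notes.

```python
def count_errors(reference, hypothesis):
    """Align hypothesis to reference by minimum edit distance and count error types."""
    n, m = len(reference), len(hypothesis)
    # dp[i][j] = minimum number of edits to turn reference[:i] into hypothesis[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i          # deleting all remaining reference words
    for j in range(m + 1):
        dp[0][j] = j          # inserting all hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion (word rejected/missed)
                           dp[i][j - 1] + 1,         # insertion (spurious word)
                           dp[i - 1][j - 1] + cost)  # substitution or correct match
    # Walk back through the table to classify each error
    i, j, errors = n, m, {"substitution": 0, "insertion": 0, "deletion": 0}
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (0 if reference[i - 1] == hypothesis[j - 1] else 1):
            if reference[i - 1] != hypothesis[j - 1]:
                errors["substitution"] += 1
            i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            errors["insertion"] += 1
            j -= 1
        else:
            errors["deletion"] += 1
            i -= 1
    return errors

print(count_errors(["open", "the", "window", "please"],
                   ["open", "a", "window", "now", "please"]))
# -> {'substitution': 1, 'insertion': 1, 'deletion': 0}
```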
Interaction error types | - Mistaken give-turn signal (triggered by silence before the user is done speaking) - Mistaken take-turn/interrupt signal (triggered by external noise, e.g. a tongue click) - Missed take-turn/interrupt signal - Misinterpreted take-turn/interrupt signal (e.g. taken as a valid response to ongoing talk) | |
The main source of misrecognition | Lack of understanding | |