Speech research

Computational models of how an infant learns to speak

It is generally assumed that infants learn to speak by "imitation", without any account of how the ability to imitate could come about. In most accounts imitation is assumed to be achieved by a process of acoustic matching, with the infant comparing his output to that of his caregiver. Both assumptions are problematic. Firstly, acoustic matching presupposes that the infant can solve the correspondence problem between infant and adult speech. Secondly, several developmental observations cannot be explained by these mechanisms.

I have been working on a computational model, Elija, that incorporates an articulatory synthesiser and a simple speech recognizer based on dynamic time warping (DTW), and that does not learn to pronounce speech sounds by acoustic matching. Instead, Elija starts by exploring the sound-making capabilities of his vocal apparatus. This process is formulated as an optimization problem. The objective function receives a positive contribution from the salience of the sensory consequences of an action and from its diversity, and is penalized by the effort involved. This leads to the unsupervised discovery of speech-like sounds. Clustering is used to extract a small number of distinct sub-actions (corresponding to consonants and vowels, Cs and Vs), and their recombination expands Elija’s sound repertoire.
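As a rough illustration of the optimization described above, the sketch below scores candidate motor actions with a weighted sum of salience and diversity minus effort, and searches for a good action at random. The weights, the three toy scoring functions and the random search are all assumptions made for this example; they are not the terms or the optimizer used in Elija.

```python
import random

# Hypothetical weights; the actual values used in Elija are not given here.
W_SALIENCE, W_DIVERSITY, W_EFFORT = 1.0, 0.5, 0.2


def toy_salience(action):
    """Stand-in for sensory salience: peaks when the first parameter is 0.7."""
    return 1.0 - abs(action[0] - 0.7)


def toy_effort(action):
    """Stand-in for effort: squared displacement from a rest position at 0."""
    return sum(a * a for a in action)


def toy_diversity(action, repertoire):
    """Distance to the nearest already-discovered action (0 if none yet)."""
    if not repertoire:
        return 0.0
    return min(
        sum((a - b) ** 2 for a, b in zip(action, known)) ** 0.5
        for known in repertoire
    )


def objective(action, repertoire):
    """Reward salience and diversity, penalize effort."""
    return (
        W_SALIENCE * toy_salience(action)
        + W_DIVERSITY * toy_diversity(action, repertoire)
        - W_EFFORT * toy_effort(action)
    )


def explore(repertoire, n_params=3, n_trials=2000, seed=0):
    """Random search for the best-scoring action given the repertoire."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n_trials):
        candidate = [rng.uniform(-1.0, 1.0) for _ in range(n_params)]
        score = objective(candidate, repertoire)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

Calling `explore([])` returns one candidate action; appending it to the repertoire and calling `explore` again favours actions that differ from those already found, which is the role the diversity term plays in this toy version.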

A second stage of development makes use of interactions with a caregiver, who listens to the sounds and can respond to them. The natural responses from the caregiver are often “reformulations”, i.e. imitations made by the caregiver. These responses are used to reinforce sounds and bias overall production towards those that occur in the ambient language. In addition, the reformulations are used to learn equivalence relations between Elija’s vocal actions and the caregiver’s corresponding speech, forming the basis for learning by imitation. During a final object labelling task, using this newly established mechanism of imitation, Elija is able to learn and reproduce some object names spoken by the caregiver.
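The association between caregiver reformulations and vocal actions can be pictured with a small sketch. It assumes each caregiver utterance is reduced to a sequence of scalar features (real acoustic features would be vectors) and uses a textbook DTW distance to find the stored reformulation closest to a newly heard utterance. The class `ImitationMemory` and its method names are hypothetical and are not taken from Elija.

```python
def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two number sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch b
                cost[i][j - 1],      # stretch a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


class ImitationMemory:
    """Hypothetical store of (caregiver reformulation, own action) pairs."""

    def __init__(self):
        self.pairs = []

    def add_reformulation(self, caregiver_features, motor_action):
        """Remember that the caregiver answered this action with this sound."""
        self.pairs.append((list(caregiver_features), motor_action))

    def imitate(self, heard_features):
        """Return the action whose stored reformulation best matches the input."""
        if not self.pairs:
            return None
        _, action = min(
            self.pairs, key=lambda pair: dtw_distance(heard_features, pair[0])
        )
        return action
```

Because DTW tolerates differences in timing, a slower rendition of a stored reformulation still retrieves the same action. The match is always between two caregiver utterances, so the infant never has to compare his own output with adult speech.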

A Computational Model of Infant Speech Development. We have shown that Elija progresses from a babbling stage to learning the names of objects. The initial publication was at Specom2007. The first published results, in a Motor Control paper, are available here. This research is progressing rapidly and further publications will appear soon. Some newer results were presented at ESSV 2011, 28-30 September in Aachen, Germany.

Modelling motor pattern generation in the development of infant speech production. Because embodiment plays an important role in the development of speech, we have also begun preliminary work to incorporate the role of speech breathing in our computational model. See the publication ISSP2008 for further details.

 

Learning inverse models between the speech and vocal tract parameters

The goal of this work is to build a system that can learn to imitate a version of a spoken utterance using an articulatory speech synthesiser.

Learning to Control an Articulatory Synthesizer by Imitating Real Speech. Imitation is a powerful mechanism by which both animals and people can learn useful behaviour by copying the actions of others, provided they can solve the correspondence problem. We adopt this approach as a means to control an articulatory speech synthesizer. At the heart of this work is learning an inverse model that relates acoustic and motor representations of speech. This involves a babbling phase, which is used to learn the mapping between auditory consequences and the articulatory control trajectories that generated them. See the first publication here: ZAS2005
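The babbling idea can be sketched in a few lines. The example assumes a made-up two-parameter synthesizer (`toy_synthesizer`, with invented coefficients) and a nearest-neighbour lookup as the inverse model. Both are stand-ins chosen for brevity, not the synthesizer or the learning method used in this work.

```python
import random


def toy_synthesizer(motor):
    """Hypothetical forward mapping: two motor parameters -> two 'formants'."""
    m1, m2 = motor
    return (500.0 + 300.0 * m1, 1500.0 + 600.0 * m2 - 200.0 * m1)


def babble(n_samples, seed=0):
    """Issue random motor commands and record (acoustic, motor) pairs."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_samples):
        motor = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        pairs.append((toy_synthesizer(motor), motor))
    return pairs


def inverse_lookup(target_acoustic, pairs):
    """Inverse model by nearest neighbour in acoustic space."""

    def squared_distance(acoustic):
        return sum((t - a) ** 2 for t, a in zip(target_acoustic, acoustic))

    return min(pairs, key=lambda pair: squared_distance(pair[0]))[1]


# Usage: babble, then ask for the motor command that best reproduces a target.
pairs = babble(2000)
target = toy_synthesizer((0.3, -0.4))
motor_estimate = inverse_lookup(target, pairs)
```

Only sounds the system has produced itself are stored here, so the lookup can only work on real speech if that speech is first brought into the same acoustic space. This is where the correspondence problem enters.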

Training a Vocal Tract Synthesizer to Imitate Speech using Distal Supervised Learning. We then compared direct estimation of this inverse model with the distal supervised learning scheme proposed by Jordan & Rumelhart (1992). We found that both schemes perform well, as expected, on speech generated by the synthesizer itself, where no normalization is needed, but that distal learning provided slightly better performance on speech generated by a real human subject. See the publication Specom2005 for further details.
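The principle of distal supervised learning can be shown on a deliberately tiny problem. In the sketch below the "synthesizer" is a one-dimensional linear function (`plant`, an assumption for illustration), a linear forward model is first fitted from babbled data, and a linear inverse model is then trained by passing the error between target and actual output back through the forward model's slope. The models, learning rates and step counts are all illustrative and do not reproduce the experiments described above.

```python
import random


def plant(x):
    """Hypothetical stand-in for the synthesizer; treated as a black box."""
    return 2.0 * x + 1.0


def train_forward_model(n_steps=2000, lr=0.1, seed=0):
    """Fit y_hat = a * x + b to babbled (x, plant(x)) pairs by SGD."""
    rng = random.Random(seed)
    a, b = 0.0, 0.0
    for _ in range(n_steps):
        x = rng.uniform(-1.0, 1.0)
        err = plant(x) - (a * x + b)
        a += lr * err * x
        b += lr * err
    return a, b


def train_inverse_distal(a, n_steps=3000, lr=0.02, seed=1):
    """Train x = c * y_target + d using the distal (output-space) error.

    The plant cannot be differentiated directly, so the slope `a` of the
    learned forward model is used to turn the output error into a motor error.
    """
    rng = random.Random(seed)
    c, d = 0.0, 0.0
    for _ in range(n_steps):
        y_target = rng.uniform(-1.0, 3.0)
        x = c * y_target + d                  # inverse model proposes an action
        distal_err = y_target - plant(x)      # error measured at the output
        motor_err = distal_err * a            # pass it through d(y_hat)/dx = a
        c += lr * motor_err * y_target
        d += lr * motor_err
    return c, d


a, b = train_forward_model()
c, d = train_inverse_distal(a)
```

After training, `plant(c * y + d)` is close to `y` for targets within the training range. The inverse model is never given a "correct" motor command, only the error between target and actual output. This distinguishes the distal scheme from fitting the inverse mapping directly on babbled pairs.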

 

Speech fundamental period estimation using pattern classification

This work formed the basis for my PhD thesis. It involved using a pattern recognition algorithm to locate the points of vocal fold closure in a noisy speech signal. With a training signal for the presence of vocal fold closure provided by a Laryngograph, the task was formulated as a classical supervised learning problem. The algorithm uses a multi-layer perceptron (MLP) classifier with inputs from a window on the speech signal, and an output signifying the presence of a vocal fold closure at the centre of the window. The locations of the vocal fold closures, i.e. the fundamental period epoch markers, are given the name Tx, and our perceptron algorithm the name MLP-Tx.
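The supervised formulation can be made concrete with a sketch of how training examples might be assembled: each window of speech samples is paired with a 0/1 label saying whether a closure lies at its centre. The function name, the window half-width and the assumption that closure instants are already available as sample indices (for example, derived from the Laryngograph signal) are all illustrative. The classifier itself is omitted.

```python
def make_training_examples(speech, closure_samples, half_width):
    """Pair each speech window with a label for a closure at its centre.

    speech          -- list of speech samples
    closure_samples -- sample indices of vocal fold closures (assumed given)
    half_width      -- number of samples on each side of the window centre
    """
    closures = set(closure_samples)
    examples = []
    # Only centres with a full window on both sides are used.
    for centre in range(half_width, len(speech) - half_width):
        window = speech[centre - half_width : centre + half_width + 1]
        label = 1 if centre in closures else 0
        examples.append((window, label))
    return examples


# Usage on a dummy ten-sample signal with one closure at sample 4.
examples = make_training_examples(list(range(10)), [4], half_width=2)
```

Each `(window, label)` pair would then be one input and target output for the classifier. Closures are sparse compared with the number of samples, so such a training set contains far more negative than positive examples.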

The work also involved developing algorithms to evaluate the performance of MLP-Tx quantitatively, and comparing it with several other fundamental frequency estimation algorithms. MLP-Tx is shown to perform well in the presence of noise. See the initial publication here and my PhD thesis for further details.

 

Please send your comments about this web page to: drianhoward@gmail.com 
Copyright © 2005-2010 Ian Howard. All Rights Reserved
Last Changed: 05 August 2010