Steve Cassidy's Publications

Reading Development

S. Cassidy (1990). When is a developmental model not a developmental model? Cognitive Systems, 2(4):329-344. (PDF)

A recent paper by Seidenberg and McClelland describes a computational model of word recognition and naming. The authors claim that it is a developmental model; that is, it explains how word recognition skills are acquired by children. The purpose of this paper is to challenge that claim and, in doing so, to set out a number of criteria against which a model of reading development can be assessed.

To summarize the criteria:

  • The environment of learning should reflect that of the child.
  • The representations used should be accurate and adequate for learning.
  • The model's performance should reproduce observations of children's performance.
  • The model should be consistent with wider theories of cognition.

Some criticisms of the Seidenberg and McClelland model arise directly from their choice of a connectionist architecture for implementing the model. We try to identify those parts of the model that result from this choice and argue that this type of connectionist model is an inappropriate way of describing learning.

We claim that this discussion of models of word recognition is also relevant to other areas of cognitive development.

S. Cassidy (1990). Substitution errors in a computer model of early reading. Paper presented at the First Conference of the Australasian Society for Cognitive Science, Sydney, Australia. (PDF)

A computer model of early word recognition has been built that uses only visual cues to recognise words, broadly following Seymour's view of reading development. Written words are represented as a partial ordering on a set of letter instances. This allows various degrees of positional information to be recorded, from none at all to a complete ordering of all the letters in the word. In addition, some letters may not be identified, their place being marked by a symbol representing, for instance, any ascender, any letter or any group of letters.

The model is exposed to examples of first year reading material collected from local schools. As it `reads' it adds new words to its lexicon. The performance of the model can be monitored as the lexicon grows and a corpus of substitution errors (where an incorrect word is substituted for a target) can be collected.

This paper describes a number of experiments performed using the model to explore the development of visual word recognition. Parts of the recognition procedure are varied and the results are compared to those observed in young children at various stages of development.
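
To make the partial-ordering representation concrete, here is a minimal sketch in Python. The names (WordEntry, ANY_LETTER, ANY_ASCENDER) and the backtracking match procedure are illustrative assumptions, not the model's actual code:

    # Illustrative sketch of a partial-order word representation.
    # Wildcards stand in for unidentified letters, as described above.
    ANY_LETTER = "*"       # matches any single letter
    ANY_ASCENDER = "^"     # matches any letter with an ascender

    ASCENDERS = set("bdfhklt")

    def matches(stored, actual):
        """Does a stored token (letter or wildcard) match an actual letter?"""
        if stored == ANY_LETTER:
            return True
        if stored == ANY_ASCENDER:
            return actual in ASCENDERS
        return stored == actual

    class WordEntry:
        """A lexical entry: a bag of letter tokens plus a partial order.

        `tokens` maps token ids to letters (or wildcards); `order` is a
        set of (before, after) pairs over token ids. An empty `order`
        records no positional information at all; a total order records
        the complete spelling.
        """
        def __init__(self, tokens, order=()):
            self.tokens = tokens
            self.order = set(order)

        def recognises(self, word):
            """Try to align every stored token with a letter of `word`
            so that all precedence constraints are respected."""
            def extend(assignment, remaining):
                if not remaining:
                    return True
                tid = remaining[0]
                for pos, letter in enumerate(word):
                    if pos in assignment.values():
                        continue
                    if not matches(self.tokens[tid], letter):
                        continue
                    assignment[tid] = pos
                    ok = all(assignment[a] < assignment[b]
                             for a, b in self.order
                             if a in assignment and b in assignment)
                    if ok and extend(assignment, remaining[1:]):
                        return True
                    del assignment[tid]
                return False
            return extend({}, list(self.tokens))

    # 'dog' stored with only its first letter ordered before the rest:
    entry = WordEntry({0: "d", 1: ANY_LETTER, 2: "g"}, order={(0, 1), (0, 2)})
    print(entry.recognises("dog"))   # True
    print(entry.recognises("dig"))   # True: a substitution error

Because the stored entry underspecifies the word, visually similar words are also accepted, which is how substitution errors of the kind collected above can arise.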

S. Cassidy, P. Andreae and G. B. Thompson. A Computational Model of the First Stage of Learning to Read. Unpublished manuscript. (PDF)

Current theories of reading development leave unspecified many of the details of the procedures used by children as they learn to read. This lack of detail prevents many questions being answered about the relationship between various sources of knowledge and the child's new reading skills. One way of forcing details to be considered is to build a computational model; this paper describes such a model of the very first part of a child's reading development, corresponding perhaps only to a few weeks of developmental time. One aim of this work is to show how children apply the skills they have before they start to read to their very first experiences with print.

The model implements a visual word recognition procedure based on a lexicon of stored representations accessed via visual cues. Words are stored initially as an unordered set of letter tokens. This representation is incomplete in that some letters may be replaced by markers and others may be omitted altogether. As the reading procedure develops, the representation becomes more accurate and the order of letter tokens is also stored. The way that words are selected from the lexicon changes so that initial and final letters are used as cues. This developmental pattern is explored in a series of `snapshot' simulations which model the procedure under a given set of parametric assumptions. The simulations are used to predict the characteristics of the reading procedure, including the types of errors made at each stage. The error profile of the hypothesised developmental sequence for visual reading is shown to correspond to published data from longitudinal studies of children's reading. Finally, an account of development during this first stage of reading is presented in terms of Karmiloff-Smith's representational redescription framework.
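
The later, cue-based stage of lexical access described above can be suggested with a toy sketch; `cue_lookup` is hypothetical and stands in for the model's selection procedure:

    def cue_lookup(lexicon, word):
        """Return stored words whose initial and final letters match `word`."""
        return [stored for stored in lexicon
                if stored[0] == word[0] and stored[-1] == word[-1]]

    lexicon = ["dog", "dig", "bag", "drag"]
    print(cue_lookup(lexicon, "dug"))   # ['dog', 'dig', 'drag']: the candidate set

The candidate set returned by such a procedure is where stage-specific substitution errors would come from.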

Speech

J. Harrington and S. Cassidy (1994). Dynamic and target theories of vowel classification: Evidence from monophthongs and diphthongs in Australian English. Language and Speech, 37, 357-373.

Recent studies in the perception of speech have suggested that vowel identification depends on dynamic cues, rather than on a single `static' spectral slice at the vowel target. The experiments reported in this paper seek both to test the extent to which vowel recognition depends on dynamic information and to identify the nature of the dynamic cues on which such recognition might depend. Gaussian classification techniques and several kinds of neural network architectures were used to classify around 2000 vowels in /CVd/ citation-form Australian English words, following training on roughly the same number of vowel tokens. The first set of experiments shows that when vowels are classified from three spectral slices taken at the vowel margins and midpoint, only diphthongs, and not monophthongs, benefit from the additional spectral information at the vowel margins. The second set of experiments, in which a time-delay neural network is used, suggests that dynamically changing acoustic information is beneficial to only a small number of monophthongs. Diphthongs are no better classified by this network than by one in which time is not explicitly represented, and many monophthongs are classified just as well from a single `static' spectral slice at the midpoint. The implications of this study are that not all vowels are dynamic, and that those vowels which can be labelled dynamic are dynamic in different ways.
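
The general shape of the Gaussian side of such experiments can be sketched as follows, with one full-covariance Gaussian per vowel class. The features, class set and data below are invented for illustration and do not reproduce the paper's experiments:

    # Toy Gaussian vowel classification: each token is a fixed-length
    # feature vector (here, two formants at three spectral slices).
    import numpy as np
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

    rng = np.random.default_rng(0)

    # Two synthetic vowel classes as Gaussian clouds in 6 dimensions.
    X_i = rng.normal([300, 2200, 320, 2250, 340, 2300], 60, size=(200, 6))
    X_a = rng.normal([750, 1300, 780, 1350, 760, 1320], 60, size=(200, 6))
    X = np.vstack([X_i, X_a])
    y = np.array(["i"] * 200 + ["a"] * 200)

    # One full-covariance Gaussian per class, as in classic Gaussian
    # classification of vowel spaces.
    clf = QuadraticDiscriminantAnalysis().fit(X, y)

    test = rng.normal([300, 2200, 320, 2250, 340, 2300], 60, size=(50, 6))
    print("proportion classified as /i/:", np.mean(clf.predict(test) == "i"))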

S. Cassidy and J. Harrington (1995). The place of articulation distinction in voiced oral stops: Evidence from burst spectra and formant transitions. Phonetica, 52, 263-284.

This study concerns the extent to which place of articulation in the voiced obstruents /b d dʒ g/ can be recovered from spectral parameters taken from the burst, from formant transitions, and from a combination of the two. Classifications were obtained by training on citation-form data produced by male speakers and testing on (i) citation-form data produced by female speakers and (ii) continuous speech data produced by the same male speakers. The results show that there is more information for the place distinction in the burst than in the formant transitions; when the parameters are combined into a single model, classification scores improve for the citation-form data but not for the continuous speech data. The highest classification scores were in the vicinity of 90% correct for both types of data on the combined parameters. The results are seen as supporting a model of sufficient discriminability rather than one in which phonetic categories are characterised by invariant acoustic cues.

S. Cassidy (1999). Compiling Multi-Tiered Speech Databases into the Relational Model: Experiments with the Emu System. In Proceedings of Eurospeech '99, Budapest, September 1999.

The Emu speech database system enables the annotation of speech signals at many levels of detail and provides a mechanism for making links between these levels to produce a hierarchical annotation. Emu provides facilities for searching collections of these annotations according to both sequential and hierarchical criteria. The results of a search can be used to retrieve acoustic and other data stored along with the annotations. One perceived problem with the Emu system is its ability to scale to large databases containing many thousands of utterances. To address this problem we propose a method of translating an Emu database into the relational model, as used by most commercial database systems. Using a Tcl script, the Emu database is converted into a set of tables for the relational database. Queries in the Emu query syntax are translated into SQL and comparisons are made between the query processing time for Emu and the relational database. The results show a marked increase in speed for the relational system on most queries.
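
The flavour of such a translation can be suggested with a small sketch: annotation tokens become rows, dominance links become a relation, and a hierarchical query turns into a join. The schema and generated SQL below are illustrative assumptions, not Emu's actual tables or its translator's output:

    # Sketch of a relational mapping for multi-tiered annotations.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE token (
        id      INTEGER PRIMARY KEY,
        level   TEXT,      -- e.g. 'Word' or 'Phoneme'
        label   TEXT,
        start   REAL,      -- seconds
        finish  REAL
    );
    CREATE TABLE dominates (
        parent  INTEGER REFERENCES token(id),
        child   INTEGER REFERENCES token(id)
    );
    """)

    # One word dominating two phonemes.
    db.executemany("INSERT INTO token VALUES (?,?,?,?,?)", [
        (1, "Word",    "see", 0.10, 0.45),
        (2, "Phoneme", "s",   0.10, 0.25),
        (3, "Phoneme", "i:",  0.25, 0.45),
    ])
    db.executemany("INSERT INTO dominates VALUES (?,?)", [(1, 2), (1, 3)])

    # An Emu-style dominance query (schematically, words labelled 'see'
    # dominating a phoneme 's') might translate to SQL such as:
    rows = db.execute("""
        SELECT w.label, p.label, p.start, p.finish
        FROM token w
        JOIN dominates d ON d.parent = w.id
        JOIN token p     ON p.id = d.child
        WHERE w.level = 'Word' AND w.label = 'see'
          AND p.level = 'Phoneme' AND p.label = 's'
    """).fetchall()
    print(rows)   # [('see', 's', 0.1, 0.25)]

Once annotations are in this form, the query optimiser of the relational engine does the work that a special-purpose search procedure would otherwise have to do, which is one plausible source of the observed speed-up.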

S. Cassidy and J. Harrington (2001). Multi-level annotation in the Emu speech database management system. Speech Communication, 33, 61-77, January 2001.

Researchers in various fields, from acoustic phonetics to child language development, rely on digitised collections of spoken language data as raw material for research. Access to this data has, in the past, been provided in an ad-hoc manner with labelling standards and software tools developed to serve only one or two projects. A few attempts have been made at providing generalised access to speech corpora but none of these has gained widespread popularity. The Emu system, described here, is a general purpose speech database management system which supports complex multi-level annotations. Emu can read a number of popular label and data file formats and supports overlaying additional annotation with inter-token relations on existing time-aligned label files. Emu provides a graphical labelling tool which can be extended to provide special purpose displays. The software is easily extended via the Tcl/Tk scripting language which can be used, for example, to manipulate annotations and build graphical tools for database creation. This paper discusses the design of the Emu system, giving a detailed description of the annotation structures that it supports. It is argued that these structures are sufficiently general to potentially allow Emu to read any time-aligned linguistic annotation.

S. Cassidy and S. Bird (2000). Querying Databases of Annotated Speech. In Proceedings of the Australian Database Conference, Canberra, January 2000. (Citeseer)

Annotated speech corpora are databases consisting of signal data along with time-aligned symbolic `transcriptions'. Such databases are typically multidimensional, heterogeneous and dynamic, properties that present a number of tough challenges for representation and query; the temporal nature of the data adds a further layer of complexity. This paper introduces annotated speech databases and presents and harmonises two independent efforts to model them, one at the University of Pennsylvania and one at Macquarie University. A range of actual and possible query languages is described, along with illustrative applications to a variety of analytical problems. The research reported here forms part of various ongoing projects to develop platform-independent open-source tools for creating, browsing, searching, querying and transforming linguistic databases, and to disseminate large linguistic databases over the internet.
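
The annotation-graph side of this modelling effort can be sketched roughly as follows; the class and method names are invented for illustration and are not either project's actual API:

    # Sketch of an annotation graph: nodes carry optional time offsets,
    # arcs carry typed labels spanning between nodes.
    from dataclasses import dataclass, field

    @dataclass
    class AnnotationGraph:
        times: dict = field(default_factory=dict)  # node id -> time in seconds
        arcs: list = field(default_factory=list)   # (src, dst, type, label)

        def add_arc(self, src, dst, type_, label, t_src=None, t_dst=None):
            if t_src is not None:
                self.times[src] = t_src
            if t_dst is not None:
                self.times[dst] = t_dst
            self.arcs.append((src, dst, type_, label))

        def query(self, type_, label):
            """All arcs of a given type carrying a given label."""
            return [a for a in self.arcs if a[2] == type_ and a[3] == label]

    g = AnnotationGraph()
    g.add_arc(0, 2, "word", "see", t_src=0.10, t_dst=0.45)
    g.add_arc(0, 1, "phone", "s", t_dst=0.25)
    g.add_arc(1, 2, "phone", "i:")
    print(g.query("phone", "s"))   # [(0, 1, 'phone', 's')]

Hierarchical structure falls out of shared nodes (the word arc and its phones share endpoints 0 and 2), which is what makes harmonisation with an intersecting-hierarchies model possible.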

Steve Cassidy, Pauline Welby, Julie McGory and Mary Beckman (2000). Testing the Adequacy of Query Languages Against Annotated Spoken Dialog. In Proceedings of the Speech Science and Technology Conference, Canberra, December 2000. (PDF)

Large annotated collections of speech data are now common in spoken language research, and a recent focus has been on the development of annotation standards and query languages for these annotations. As part of this process it is important to evaluate the emerging proposals against a range of linguistic annotation practices and in many different domains.

This paper presents an example of a richly annotated discourse segment which includes both DAMSL-style discourse-level annotation and a ToBI intonational analysis. We describe how this annotation could be realised in the Emu, MATE and Annotation Graph formalisms.

In order to evaluate the different query languages we take a small number of queries and attempt to express them in each query language. We are particularly interested in the naturalness of the query expression in each case. In some cases we find that queries cannot be expressed in the current language. We make a number of suggestions to guide the development of these query languages.

Steve Cassidy (2002). XQuery as an Annotation Query Language: a Use Case Analysis. In Proceedings of LREC 2002, Las Palmas, Spain, May 2002.

Recent work has shown that a single data model can represent many different kinds of linguistic annotation. This data model can be expressed equivalently as a directed graph of temporal nodes (Bird and Liberman, Speech Communication, 2001) or as a set of intersecting hierarchies (Cassidy and Harrington, Speech Communication, 2001). While some tools are being built to support this data model, there is as yet no query language that can be used to search annotations stored in this way. Since the hierarchical view of annotations has much in common with the XML data model, this paper examines a recent proposal for an XML query language as a candidate annotation query language. The methodology used is a use case analysis. The analysis shows that XQuery provides many useful features, particularly when queries include hierarchical constraints, but that it is weak at expressing sequential constraints.

Steve Cassidy and Catherine Watson (2002). Detecting Backchannel Intrusions in Multi-Party Teleconferences. In Proceedings of the 9th Australian International Speech Science and Technology Conference, Melbourne, December 2002.

Backchannels are short interruptions by a second talker within a dialogue, typically signifying agreement with the main talker or a desire to interrupt. While these short utterances do not contain much useful semantic content in themselves, they provide a key to understanding the structure of a dialogue. We describe experiments aimed at automatically segmenting a multi-party teleconference dialogue into speaker turns, with particular emphasis on the detection of short (less than 500 ms) utterances. Using the Bayesian Information Criterion (BIC) to detect acoustic changes, we are able to segment an input signal and find around 80% of all turn boundaries and around 40% of all segments with a duration of less than 500 ms.
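
The BIC comparison at the heart of such segmentation can be sketched as follows: model a window of feature frames with one Gaussian, or with two Gaussians split at a candidate boundary, and accept the boundary when the penalised difference is positive. This is a hedged illustration on synthetic data, not the paper's implementation:

    # Delta-BIC change detection over a window of feature frames.
    import numpy as np

    def logdet_cov(X):
        """Log-determinant of the sample covariance of X (frames x dims)."""
        sign, logdet = np.linalg.slogdet(np.cov(X, rowvar=False))
        return logdet

    def delta_bic(X, split, lam=1.0):
        """Delta-BIC for a change at frame `split`; positive favours a change."""
        n, d = X.shape
        n1, n2 = split, n - split
        # Standard BIC penalty for the extra Gaussian's parameters.
        penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
        return (0.5 * n * logdet_cov(X)
                - 0.5 * n1 * logdet_cov(X[:split])
                - 0.5 * n2 * logdet_cov(X[split:])
                - penalty)

    # Synthetic two-talker window: the statistics change halfway through.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 1.0, size=(200, 12)),
                   rng.normal(2.0, 0.5, size=(200, 12))])
    print(delta_bic(X, 200) > 0)   # True: a change point is detected

Very short utterances are hard for this scheme precisely because n1 or n2 becomes too small for a stable covariance estimate, which is consistent with the lower recall on sub-500 ms segments reported above.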

Catherine Watson and Steve Cassidy (2003). Speaker Change Detection in Multi-Party Meetings. In Proceedings of the Eighth Western Pacific Acoustics Conference, Melbourne, April 2003.

The longer-term goal of this project is to track speakers throughout a multi-party meeting in a normal meeting-room environment. This study reports initial results in detecting acoustic change in this environment using an array of four microphones, contrasted with a single-microphone condition. The use of multiple microphones is expected to aid acoustic change detection because of the spatial information and improved signal-to-noise ratio they provide. We recorded six people in a 30 minute meeting with four cardioid microphones arranged at the corners of a square. Acoustic change hypotheses were generated separately for each channel using the Bayesian Information Criterion [1]. We found that combining the acoustic change hypotheses from the different channels resulted in a superior overall segmentation of the signal compared with the single-microphone condition.
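
One simple way to combine per-channel hypotheses, shown here purely as an illustration since the paper's exact fusion rule is not reproduced, is to cluster change times that agree within a tolerance and keep clusters supported by several microphones:

    # Hypothetical fusion of per-channel change hypotheses.
    def fuse_hypotheses(channels, tol=0.25, min_support=2):
        """channels: list of sorted change-time lists (seconds)."""
        events = sorted((t, ch) for ch, times in enumerate(channels)
                        for t in times)
        fused, cluster = [], []
        for t, ch in events:
            if cluster and t - cluster[-1][0] > tol:
                if len({c for _, c in cluster}) >= min_support:
                    fused.append(sum(x for x, _ in cluster) / len(cluster))
                cluster = []
            cluster.append((t, ch))
        if cluster and len({c for _, c in cluster}) >= min_support:
            fused.append(sum(x for x, _ in cluster) / len(cluster))
        return fused

    four_mics = [[1.02, 5.50], [1.05, 3.10], [0.98, 5.47], [1.00, 5.52]]
    print(fuse_hypotheses(four_mics))   # boundaries near 1.0 s and 5.5 s

A spurious detection on a single channel (the 3.10 s event above) is discarded for lack of support, which is one way multiple microphones can yield the superior segmentation reported.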

Jonathan Harrington, Steve Cassidy, Tina John and Michel Scheffers (2003). Building an interface between EMU and Praat: a modular approach to speech database analysis. To appear in Proceedings of ICPhS 2003, Barcelona, 2003.

In this paper, we demonstrate the advantages of combining Praat, a computational system for doing phonetics, with the largely complementary EMU system for speech database analysis. The interface operates on the annotations: a Praat TextGrid is converted into an EMU hierarchical annotation structure and vice versa. With the exception of annotations in EMU that are not explicitly linked to times, we show that there is no loss of information in this conversion. The interface between the Praat and EMU systems provides a flexible labelling system: data can be labelled as segments or events in Praat, and various kinds of structures between annotation tiers can be defined and then queried within EMU. We argue that both the variety of existing speech databases and the multitude of different possible types of speech analysis require a modular approach allowing the integration of a number of different stand-alone components, each adapted to a different aspect of creating, annotating, querying and analysing speech data.
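
To give a feel for one direction of such an interface, here is a hedged sketch that reads interval tiers from a long-format Praat TextGrid into flat token lists, over which a hierarchy could then be built. It is not the actual EMU-Praat interface and handles only well-formed IntervalTiers:

    # Minimal long-format TextGrid reader for IntervalTiers.
    import re

    TIER_RE = re.compile(r'class = "IntervalTier"\s*\n\s*name = "([^"]*)"')
    INTERVAL_RE = re.compile(
        r'intervals \[\d+\]:\s*\n\s*xmin = ([\d.]+)\s*\n\s*xmax = ([\d.]+)'
        r'\s*\n\s*text = "([^"]*)"')

    def read_textgrid(text):
        """Return {tier_name: [(xmin, xmax, label), ...]}."""
        tiers = {}
        # Split at each IntervalTier header; the intervals that follow
        # a header belong to that tier.
        parts = TIER_RE.split(text)
        for name, body in zip(parts[1::2], parts[2::2]):
            tiers[name] = [(float(a), float(b), lab)
                           for a, b, lab in INTERVAL_RE.findall(body)]
        return tiers

    sample = '''\
        item [1]:
            class = "IntervalTier"
            name = "phoneme"
            xmin = 0
            xmax = 0.45
            intervals: size = 2
            intervals [1]:
                xmin = 0
                xmax = 0.25
                text = "s"
            intervals [2]:
                xmin = 0.25
                xmax = 0.45
                text = "i:"
    '''
    print(read_textgrid(sample))
    # {'phoneme': [(0.0, 0.25, 's'), (0.25, 0.45, 'i:')]}

The reverse direction, flattening a hierarchy to tiers, is where the information loss noted above can occur, since links not anchored to times have no TextGrid equivalent.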

steve@srsuna.shlrc.mq.edu.au

