Corpus Linguistics

Corpus methodology, the investigation of collections of text to explore patterns of language usage, is commonly employed in linguistics and unites a wide range of subdisciplines. Depending on the nature of the corpus, it is possible to research topics as diverse as the development of child language, language change over time, variation across regions, and the characteristics of different spoken and written registers. Within the Department of Linguistics, various research areas make use of corpora, including child language acquisition, second-language learning, translation, sociolinguistics, World Englishes and computational linguistics.

We have a strong tradition in this area of language research. A range of corpora are hosted at Macquarie, many of which have been built in our department.

Areas of interest

Child language acquisition

Corpora of children's spontaneous speech production and of the input that they hear are essential to research on how children learn language. We use existing corpora in the CHILDES database as well as purpose-built corpora to inform and extend our experimental work on children's acquisition of sound structure, morphology, syntax, and interaction. Three of the audio(/video) corpora available on the CHILDES database were developed by researchers now at Macquarie: the Providence (English) Database; the Lyon (French) Database; and the Demuth Sesotho Corpus. See here for more details.

Discourse analysis

We use natural language corpora to study many kinds of social contexts, including media and political discourse, clinical consultations in medicine and psychotherapy, and literary texts. We draw on both specialised register-specific corpora (where the data comes from one kind of social context) and large multi-generic corpora, such as the British National Corpus.

Language variation and change

A number of researchers working in the focus area Language Variation and Change make use of synchronic and diachronic corpora to investigate how languages vary in different settings, and across time. More information is available here.

Phonetics and phonology

Corpora can be used to investigate variation not just in what people say but also in how they say it. AusTalk is a large state-of-the-art database of spoken Australian English from all around the country. Between 2011 and 2016, almost a thousand adults aged 18 to 83 were recorded at 15 different locations across all states and territories. AusTalk represents the regional and social diversity and linguistic variation of Australian English, including Australian Aboriginal English. Each speaker was audio- and video-recorded on three separate occasions to sample their voice in a range of scripted and spontaneous speech situations at various times. AusTalk is accessible from Alveo.

Student writing

We use different corpora of student writing to search for and investigate the micro- (lexico-grammatical) and macro-level (generic and rhetorical) features of discipline-specific genres. The outcomes of the student writing corpus research will help different stakeholders in academia and beyond to deal with issues related to academic communication and literacy. We also intend to develop local student writing corpora to complement ones such as the British Academic Written English (BAWE) corpus.


Translation

We use electronic corpora and quantitative corpus linguistic methods to analyse the linguistic features that set translated language apart from non-translated language. We try to “fingerprint” what makes translated language different from language that has not been translated, and develop hypotheses about the cognitive and social constraints that give rise to these features. We also use corpus methods to investigate a variety of other research questions in translation, including translation style and ideology in translation. Most of our researchers working in this area also work in the focus area Translation and Interpreting.

World Englishes

Research into the convergence and divergence of Englishes around the world has been greatly facilitated by ICE (the International Corpus of English), which currently contains equivalent 1-million-word corpora of spoken and written English for 23 regions, including Australia, Great Britain, Hong Kong, India, Jamaica, New Zealand, the Philippines and South Africa. For features that require larger amounts of data, the GloWbE (Global Web-based English) corpus provides multi-million-word collections of written text.

Our projects and activities

Corpus collection

The following corpora were collected at Macquarie and are available to researchers on request: ACE (the Australian Corpus of English); ICE-AUS, the Australian component of ICE; and ART (the Australian Radio Talkback corpus). A range of other corpora are also fully searchable via this site.

Corpus linguistics workshops

In association with Lancaster University, Macquarie has organised workshops for beginner (2015) and more advanced (2016) users of corpora. These were attended by students and researchers from Australia and overseas.

Language, register and stylistic change in the Hansard (1900-2015)

This project, funded under a Macquarie University Research Development Grant (MQRDG 2017-2018) and led by Dr Haidee Kruger, uses newly compiled comparable historical corpora of the British, Australian and South African Hansard to investigate how written English usage changes over time in three varieties of English.

Linguistic Epicentres: Empirical perspectives on regional and international influences on World Englishes

Funded by a Universities Australia / DAAD grant (2018-2019) in partnership with the Justus Liebig University Giessen, this project investigates how regional varieties develop their local features while in contact with neighbouring varieties and “supervarieties” (such as American and British English). The research will examine written, spoken and online discussion data from corpus collections of varieties of English such as Australian, Indian, New Zealand and Sri Lankan English, so as to test whether more formal registers of writing (parliamentary records, newspapers) are more or less receptive to international English than informal conversation or online interaction.


This project, initiated in 2006 and still very active, uses specialised corpora to find headwords and provide definitions for online termbanks. Some termbanks focus on academic areas for first-year students (e.g. Accounting, Genetic biology, Statistics); others are designed for use by the general public, in the areas of Family Law (LawTermFinder) and cancer treatment (HealthTermFinder).

Our People

Current researchers

Felicity Cox
Katherine Demuth
Cassi Liardet
Annabelle Lukin
Pam Peters
Mehdi Riazi
Adam Smith
Canzhong Wu

Current PhD students

Ibrahim Alasmri: The features of translated language across register and time: A corpus-based study of translation from English to Arabic

Hayyan Al-Roussan: Subtitling of cultural references from English into Arabic: A corpus-based study

Eisa Asiri: Translation strategies for culture-specific items in the Qur’an: A corpus-based descriptive study

June Hings: Towards making racism more visible: a comparative application and evaluation of selected discourse analytical approaches for explicating racism in public constructions of Indigenous Australians

Neda Karimi: Patient-centred advanced cancer care: A systemic functional linguistic analysis of oncology consultations with advanced cancer patients

Kristin Khoo: Cohesion and language and the self: Linguistic cohesion in the psychotherapeutic register

Melanie Ann Law: The role of editorial intervention in ongoing language variation and change in South African and Australian English

Abdullah Qabani: Language, power and the Arab Spring: Three case studies

Karien Redelinghuys: Language contact and change through translation in Afrikaans and South African English: A diachronic corpus-based study

Yousef Sahari: A corpus-based study of taboo language in Arabic subtitles

Martin Tilney: Style in Peter Carey's short fiction

Xiaomin Zhang: Investigating explicitation in children’s literature translated from English to Chinese


Dr Adam Smith

Content owner: Department of Linguistics Last updated: 01 Apr 2020 9:59am
