Centre for Language Technology

Honours Projects

This page lists possible Honours projects in Language Technology for 2005. If you find that a project listed here is close to something you're interested in, but isn't quite what you were looking for, you should speak to the project supervisor to see if an appropriate project can be constructed. More generally you'll find that members of staff are usually open to suggestions for projects. Note that you need to provide the honours convenor with your project title and supervisor's name by the Monday of Week 2, and you have to submit your proposal and make a presentation in Week 4.

There is also a collection of project topics using Nokia's state-of-the-art mobile network laboratory.

Projects in the area of Speech Recognition

EMU: Tools for Annotation and Corpus Querying

Supervisor: Steve Cassidy

I'm the main author of Emu, a set of software tools for research with annotated speech corpora. The development of Emu is ongoing and there are likely to be various projects apart from the ones listed here. Most of these projects require no knowledge of speech and can be seen as general Software Engineering/Database projects. For more details consult Steve Cassidy's list of Honours projects.

Speaker Identification in Meetings

Supervisor: Steve Cassidy

We have an ongoing project to analyse audio recordings made in meetings. In the first phase we are trying to segment the audio stream according to who is talking: speaker segmentation and identification. Possible student projects in this area might involve evaluating different speaker identification algorithms, applying speech recognition to the audio stream to build an index for information retrieval, or investigating algorithms for coping with varying room acoustics in different meeting rooms.

If you are interested in hardware there are ideas to follow up in building a special purpose meeting recorder device -- something like a PDA which can be used to obtain high quality recordings of meetings and do some of the indexing work on the captured speech signal.

Recognising Australian Speech

Supervisor: Steve Cassidy

This project involves training a speech recogniser to work on Australian speech. This would fit in with the Centre's DARPA Communicator project, using the Sphinx speech recognition engine. The project would involve getting to know Sphinx well, adapting it to and training it on the Australian data we have, and then evaluating its performance, perhaps in the context of an application like the Department's information kiosk.

Projects related to Question Answering and Information Retrieval

Most of the projects in this section are related to AnswerFinder. AnswerFinder is a question answering system that finds and returns the answer to an arbitrary question by exploring text documents. To do this, AnswerFinder constructs the logical forms of the questions and compares them with the logical forms of the answers. To speed up the process and make it possible to explore considerable volumes of text, AnswerFinder incorporates additional methods based on shallow but fast processing of text.

Representing the Semantics of Sentences

Supervisor: Diego Mollá Aliod

AnswerFinder uses logical forms to determine if a sentence contains the answer to the question. These logical forms, however, are rather difficult for humans to understand, and therefore the process of manually discovering inference rules to add to the system is time-consuming. The goal of this project is to determine methods to simplify the presentation of logical forms to the user. The methods may be a combination of graphical expressions (e.g. represent the dependencies between the concepts graphically) and natural language generation (e.g. write a paraphrase that accurately describes the contents of the logical form), or something else. You decide!

Graph-based Question Answering

Supervisor: Diego Mollá Aliod

A sentence is a structured collection of words. This structure can be represented as a graph where the nodes are the concepts expressed in the sentence, and the arcs are the relations between the concepts. The aim of this project is to explore the use of such graphs as a means of sentence representation for the task of question answering. The project will involve the automatic creation of the graph, the use of graph theory methods to determine if a sentence can answer a specific question, and the extraction of the exact answer from the sentence.
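
As a rough illustration of the idea, the sketch below represents a sentence as a set of (head, relation, dependent) arcs and treats a question as the same kind of graph with one unknown node. The relation labels, the `?x` variable convention and the function name `answer_candidates` are all assumptions made for this example; they are not taken from AnswerFinder, and a real project would build the arcs automatically from a parser.

```python
def answer_candidates(question_arcs, sentence_arcs, variable="?x"):
    """Return the sentence nodes that can fill the question's unknown node.

    A candidate is accepted when every question arc, with the variable
    replaced by the candidate, is also an arc of the sentence graph.
    """
    sentence_arcs = set(sentence_arcs)
    nodes = {node for head, _, dep in sentence_arcs for node in (head, dep)}
    answers = []
    for candidate in sorted(nodes):
        bound = {
            (candidate if head == variable else head,
             rel,
             candidate if dep == variable else dep)
            for head, rel, dep in question_arcs
        }
        if bound <= sentence_arcs:  # the question graph is a subgraph
            answers.append(candidate)
    return answers


# Hand-built graph for "Cook discovered Hawaii in 1778" (labels are hypothetical).
sentence = {
    ("discovered", "subj", "Cook"),
    ("discovered", "obj", "Hawaii"),
    ("discovered", "in", "1778"),
}
# "Who discovered Hawaii?"
question = {("discovered", "subj", "?x"), ("discovered", "obj", "Hawaii")}
print(answer_candidates(question, sentence))
```

Exact subgraph matching like this is brittle (it fails on paraphrases and synonyms), which is part of what makes approximate graph matching an interesting topic for the project.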

Question Answering from Speech Data

Supervisor: Diego Mollá Aliod

A speech recognition system for continuous speech may introduce up to 50% recognition errors. This high percentage of recognition errors presents new challenges to question answering systems. This project aims at developing a question answering system that uses the output of a speech recognition system as the input data.

Classification of Bibliographic References

Supervisor: Diego Mollá Aliod

We have a BibTeX database of bibliography entries, where every entry typically contains information about the author, title, abstract, and additional comments. Every entry is also tagged with keywords according to a keyword ontology. However, the process of updating the keywords in the bibliography entries when the ontology changes is too time-consuming and prone to errors. The goal of this project is to automatically assign keywords to the bibliography entries given an arbitrary keyword ontology.

Retrieval of Bibliographic References

Supervisor: Diego Mollá Aliod

It is always difficult to remember who said what in what document. We have a BibTeX database of bibliography entries, where every entry typically contains information about the author, title, abstract, and additional comments. Every entry is also tagged with keywords according to a keyword ontology. The goal of this project is to retrieve the bibliography entries that are relevant to the topic given in an arbitrary user query. An important part of the project is to account for variations of terms describing related concepts.

Classification of Questions

Supervisor: Diego Mollá Aliod

Our system currently uses very simple rules to determine the type of information a question is asking for. The goal of this project is to build a question classification system that automatically learns the types of questions by analysing a corpus that is annotated with the correct question types. This project is especially suitable for those who are doing COMP348 in the first semester of 2007.
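
To give a feel for what learning question types from an annotated corpus might look like, here is a minimal sketch of a bag-of-words Naive Bayes classifier with add-one smoothing. This is only one possible technique, not the method the project prescribes; the tiny training set and the labels PERSON, LOCATION and DATE are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Collect word and label counts from (question, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def classify(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        n_tokens = sum(word_counts[label].values())
        for token in text.lower().split():
            # Add-one smoothing so unseen words do not zero out a label.
            score += math.log((word_counts[label][token] + 1) / (n_tokens + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical annotated questions; a real corpus would be far larger.
corpus = [
    ("who wrote hamlet", "PERSON"),
    ("who discovered penicillin", "PERSON"),
    ("where is sydney", "LOCATION"),
    ("where was mozart born", "LOCATION"),
    ("when did the war end", "DATE"),
    ("when was the bridge built", "DATE"),
]
model = train(corpus)
print(classify(model, "who painted guernica"))
```

With so little data the classifier is effectively keying on the question word, which is roughly what hand-written rules do; the interest lies in what it can pick up from a properly sized annotated corpus.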

Answering Complex Questions

Supervisor: Diego Mollá Aliod

Currently we are developing a system that answers complex questions where the answer needs to be composed by exploring several documents. The current system simply presents all sentences that have some part of the answer but this can be done better. The goal of this project is to combine the independent answers in such a way that the resulting answer is coherent and has reduced redundancy.
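
One simple baseline for reducing redundancy, sketched below, is to drop any candidate sentence whose word overlap with an already selected sentence is too high. The Jaccard measure, the 0.6 threshold and the function names are assumptions chosen for this example; this is not how the current system works, and it does nothing for coherence.

```python
def jaccard(a, b):
    """Word-set overlap between two sentences, between 0.0 and 1.0."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def drop_redundant(sentences, threshold=0.6):
    """Keep sentences in order, skipping near-duplicates of kept ones."""
    kept = []
    for sentence in sentences:
        if all(jaccard(sentence, earlier) < threshold for earlier in kept):
            kept.append(sentence)
    return kept


candidates = [
    "the bridge opened in 1932",
    "the bridge was opened in 1932",
    "it spans the harbour",
]
print(drop_redundant(candidates))
```

The second candidate shares five of six distinct words with the first, so it is dropped, while the third is kept.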

Processing Wikipedia

This set of projects is about extending Wikipedia to make it easier to find information in it.

Question Answering on Wikipedia

Supervisor: Diego Mollá Aliod

The goal of this project is to use a 2-stage question answering system that converts the user question into a series of Web queries on Wikipedia pages, queries Wikipedia, and collects the result. The result is processed to find the exact answer to the query by combining AnswerFinder technology with other state-of-the-art technology on question answering.

Find Related Information

Supervisor: Diego Mollá Aliod

Given a Wikipedia page, find other Wikipedia articles that are related to it and propose them as links from the page.

Find Translations

Supervisor: Diego Mollá Aliod

Given a Wikipedia page in one language, find its equivalent pages in other languages.

Search and Summarise

Supervisor: Diego Mollá Aliod

Find all documents relevant to a topic, and compose a summary from them. (This could easily be extended to a PhD project.)

Learn Entailments

Supervisor: Diego Mollá Aliod

Use Wikipedia to learn text patterns that indicate entailment between two words. This could be done in two steps:

  1. Using known entailment pairs (such as the ones provided by WordNet), mine the web for text that contains these pairs and determine the inherent entailment patterns.
  2. Apply the patterns to learn new pairs, and bootstrap.
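
The two steps above can be sketched in a few lines, assuming a pattern is simply the literal text found between the two words of a known pair. The toy sentences, the single seed pair and the function names are hypothetical; in the project the seeds would come from a resource such as WordNet and the text from Wikipedia, and patterns would need scoring to keep noise out.

```python
import re


def learn_patterns(pairs, sentences):
    """Step 1: collect the text that links each known pair in a sentence."""
    patterns = set()
    for x, y in pairs:
        rx = re.compile(rf"\b{re.escape(x)}\b(.+?)\b{re.escape(y)}\b")
        for sentence in sentences:
            match = rx.search(sentence)
            if match:
                patterns.add(match.group(1))
    return patterns


def apply_patterns(patterns, sentences):
    """Step 2: find new word pairs linked by any learned pattern."""
    pairs = set()
    for pattern in patterns:
        rx = re.compile(rf"\b(\w+){re.escape(pattern)}(\w+)\b")
        for sentence in sentences:
            pairs.update(rx.findall(sentence))
    return pairs


def bootstrap(seed_pairs, sentences, rounds=2):
    """Alternate the two steps, feeding new pairs back in as seeds."""
    pairs = set(seed_pairs)
    for _ in range(rounds):
        pairs |= apply_patterns(learn_patterns(pairs, sentences), sentences)
    return pairs


text = [
    "a dog is a kind of animal",
    "a sparrow is a kind of bird",
    "the oak is a kind of tree",
]
print(sorted(bootstrap({("dog", "animal")}, text)))
```

Starting from the single seed, the sketch learns the pattern " is a kind of " and uses it to propose the other two pairs.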

KELP: Knowledge Extraction and Linguistic Presentation

KELP is a new project aimed at carrying out sophisticated extraction of information from online resources, and then combining and collating this information in novel ways, re-presenting it to users via both speech and text. The project involves the use of techniques in information extraction, natural language analysis, natural language generation, user modelling, and spoken language dialogue systems.

Extracting Tabular Information from Web Pages

Supervisor: Robert Dale, Rolf Schwitter or Diego Mollá Aliod

Much important information in web pages is presented in tables. However, it turns out to be quite difficult to extract the information from tables in a meaningful way, because the authors of web pages use tables for a range of purposes besides laying out data. This project will explore how information extraction techniques can be used to construct well-organised data structures from the information embedded in web pages.
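
The easy part of the problem, reading a well-behaved data table into records, can be sketched with Python's standard `html.parser` module as below. The class and function names are made up for this example, and the sketch assumes the first row is a header row; deciding whether a table holds data at all, or is only being used for layout, is the hard part the project would address.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every table row in a page."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_records(html):
    """Treat the first row as column names and return one dict per row."""
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


page = ("<table><tr><th>Name</th><th>City</th></tr>"
        "<tr><td>Ann</td><td>Sydney</td></tr></table>")
print(table_to_records(page))
```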

An Information Extraction Toolkit

Supervisor: Robert Dale, Rolf Schwitter or Diego Mollá Aliod

Much work in information extraction involves searching for patterns in text and then extracting specific pieces of information on the basis of the patterns that are found. Although this is generally accomplished using regular expressions written expressly for the task at hand, it turns out that there are many patterns which recur from one domain to another, and a number of operations applied to manipulate these patterns that also recur. The aim of this project is to construct a toolkit that operationalises these observations, and so provides a way of easily moving KELP from one domain to another.

Other Language Technology Projects

Stock Portfolio Reporting

Supervisor: Robert Dale

Over the years we have developed a number of research prototypes that dynamically generate textual summaries of stock market behaviour: see, for example, the StockReporter system.

In this project you will develop an application in the same domain. There are a number of possible directions here: for example

  • taking StockReporter as a base, you might extend the program to report on defined portfolios of stock holdings: this might result in reports along the lines of "Your tech stocks remain strong, but you might want to watch your oil stocks: Texaco dropped $3 in the last 24 hours."
  • again taking StockReporter as a base, you might build a voice dialog interface to the system, so that a user can call in and receive up-to-date stock report summaries over the phone.

There are many other possibilities in this domain.

Information Extraction from Job Descriptions

Supervisor: Robert Dale

We have a corpus of over 1000 emailed job descriptions, all in the language technology domain or related areas. Searching through this amount of data for a job that you might be interested in is painful; and although simple information retrieval and search techniques based on keywords can help a little, ultimately what we really want is to be able to derive more structured information from this data, so that the job descriptions can be processed in order to populate a database. This would enable more robust queries to be posed, so that for example you might look for jobs in specific geographic areas that require specific programming languages.

The aim of this project is to develop an information extraction system that can locate and extract useful elements of information from these job ads.

An Automatically Constructed Conferences Website

Supervisor: Robert Dale

We have a corpus of around 6000 conference announcements, all in the language technology domain or related areas. The aim of this project is to develop robust technology that extracts key information (such as the title of the conference, where it is being held, and the dates of the event) from these announcements, stores this information as XML, and then uses XSLT and related technologies to provide a highly-functional web interface for browsing the information. The project involves research in both information extraction and web technology.
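
As a small sketch of the extraction and storage steps, the code below pulls a title, location and date range out of one announcement with regular expressions and stores them as XML using the standard `xml.etree.ElementTree` module. The element names, the two patterns and the sample announcement are all invented for illustration; real announcements vary far more than these patterns allow, which is where the research lies. The XSLT presentation layer is not covered here.

```python
import re
import xml.etree.ElementTree as ET

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Matches date ranges such as "10-12 July 2005" (one format among many).
DATE_RANGE = re.compile(rf"(\d{{1,2}})\s*-\s*(\d{{1,2}})\s+({MONTHS})\s+(\d{{4}})")
# Matches "held in City, Country" (a deliberately naive pattern).
LOCATION = re.compile(r"held in ([A-Z][\w ]*(?:, [A-Z][\w ]*)*)")


def announcement_to_xml(text):
    """Build a <conference> element from whatever fields can be found."""
    conference = ET.Element("conference")
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if lines:
        # Assumption: the first non-empty line is the conference title.
        ET.SubElement(conference, "title").text = lines[0]
    location = LOCATION.search(text)
    if location:
        ET.SubElement(conference, "location").text = location.group(1).strip()
    dates = DATE_RANGE.search(text)
    if dates:
        first, last, month, year = dates.groups()
        ET.SubElement(conference, "start").text = f"{first} {month} {year}"
        ET.SubElement(conference, "end").text = f"{last} {month} {year}"
    return conference


sample = ("Workshop on Example Topics\n\n"
          "The workshop will be held in Sydney, Australia, 10-12 July 2005.\n")
print(ET.tostring(announcement_to_xml(sample), encoding="unicode"))
```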

Coreference Resolution

Supervisor: Robert Dale

There has been a lot of research into pronominal reference resolution, but the problem of determining co-reference between proper names is much less explored. This project, which will attract a $5000 scholarship from the Capital Markets CRC, is concerned with working out when two proper names refer to the same entity. The aim is to develop new techniques that can be applied to a previously unseen text domain in order to (a) identify the proper names that appear in that domain and (b) determine when multiple names refer to the same entity. The techniques will be developed to handle both person names (as in 'Mrs Clinton', 'Hillary Clinton', and 'Bill Clinton's wife') and company names (as in 'BHP' and 'Broken Hill Proprietary Limited').
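
To make the problem concrete, here are two naive string heuristics of the kind a project might start from and then improve on: an acronym test for company names and a surname test for person names. The title and suffix lists and the function names are assumptions made for this sketch, not an established method.

```python
TITLES = {"mr", "mrs", "ms", "dr", "prof"}
COMPANY_SUFFIXES = {"limited", "ltd", "inc", "corporation", "corp"}


def is_acronym_of(short, long):
    """True if `short` spells the initials of `long`, ignoring suffixes."""
    words = [w for w in long.split()
             if w.lower().strip(".") not in COMPANY_SUFFIXES]
    return bool(words) and short.upper() == "".join(w[0].upper() for w in words)


def may_corefer_person(name_a, name_b):
    """True if the names share a surname and their given names do not clash."""
    def strip_titles(name):
        return [w.lower() for w in name.split()
                if w.lower().strip(".") not in TITLES]

    a, b = strip_titles(name_a), strip_titles(name_b)
    if not a or not b or a[-1] != b[-1]:
        return False
    given_a, given_b = a[:-1], b[:-1]
    # A missing given name is compatible with anything.
    return not given_a or not given_b or given_a == given_b


print(is_acronym_of("BHP", "Broken Hill Proprietary Limited"))
print(may_corefer_person("Mrs Clinton", "Hillary Clinton"))
print(may_corefer_person("Hillary Clinton", "Bill Clinton"))
```

Note that the surname heuristic would also accept 'Mrs Clinton' and 'Bill Clinton' as a match, because it ignores what the title implies, and it has nothing to say about descriptions such as 'Bill Clinton's wife'. Cases like these are exactly what the project needs to handle.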

Machine Translation

Supervisor: Mark Dras

This would involve investigating a specific language pair and examining issues in machine translation with respect to that pair. Very recent work at Johns Hopkins University has been exploring integrating structural approaches (where you design rules for translation) with statistical approaches (where the system "learns" translation). A specific project would be to replicate the preliminary work from Johns Hopkins with a closer language pair (say, English-French), and to evaluate results relative to purely structural or purely statistical approaches.

A more general project in this area is also possible. For more details of this project consult Mark Dras' Honours project page.


Supervisor: Mark Dras

This project would be related to some of my research on paraphrasing. The idea would be to build a system that combines an existing broad-coverage parser with an existing mathematical optimisation package in order to take a text (e.g. a paper) and fit it to a set of constraints (e.g. a 2000 word limit with sentences of middling complexity).

Using the Web for Term Translation

Supervisor: Diego Mollá Aliod

Human translators often find it difficult to determine the exact translation of technical terms in specialised areas. The goal of this project is to build a system that, given a term in a specific document, uses the Web to find the most likely translations in the target language. This project combines multilingual information retrieval techniques with machine translation techniques.
