DADA HCS -- File Sharing

The goal of the project is to be able to share corpora among researchers in different locations around Australia and potentially the world. This document discusses some requirements and options for implementation.


Linguistic corpora are collections of text, audio and video with associated annotations stored in a variety of formats and used for a variety of purposes. Examples include the ANDOSL database of spoken Australian English and the Susanne written corpus. ANDOSL consists of about 30 CD-ROMs of audio recordings with annotations showing which words are being spoken and, in some cases, where syllable and phoneme boundaries occur. Susanne is a collection of textual data which has been analysed syntactically; the corpus includes detailed parse trees for every sentence.

The project aims to make it easier for researchers to share this kind of data between themselves and to enable shared annotation of the data. The annotation problem will be considered separately; this document is concerned mainly with access to files.

The size of many of these corpora places some restrictions on what can be done to share them. Many text corpora are only a few hundred kilobytes and could easily be emailed between researchers. Speech and video corpora, though, are much larger and are typically distributed by mailing CD-ROM or DVD-ROM copies between labs. Both methods duplicate the data without any mechanism for managing change. If errors are found in the data or annotations, there is no way to track who has a copy in order to propagate the corrections to users.

Sharing data is quite widespread in some disciplines but many researchers are reluctant to do so for reasons of privacy or to maintain the academic advantage of having collected their own data. In some cases there are restrictions on who can access the data based on the original conditions of collection - for example, only researchers within a given project, institution or country. In some cases, data is licensed to certain groups on payment of a fee or on signing an agreement. Any file sharing system needs to be able to respect and to some extent enforce these requirements if it is to be used by the community.
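The kinds of restriction described above could be checked along the following lines. This is only a minimal sketch: the names User, AccessPolicy and may_access, and the idea of recording signed agreements as a set of labels, are assumptions made for illustration and not part of any existing DADA design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    name: str
    groups: frozenset = frozenset()             # e.g. project, institution, country
    signed_agreements: frozenset = frozenset()  # licences or agreements accepted


@dataclass(frozen=True)
class AccessPolicy:
    # An empty set means the corpus is not restricted to particular groups.
    allowed_groups: frozenset = frozenset()
    # Hypothetical label of a licence the user must have signed, if any.
    required_agreement: Optional[str] = None


def may_access(user: User, policy: AccessPolicy) -> bool:
    """Return True if the user satisfies every restriction in the policy."""
    if policy.allowed_groups and not (user.groups & policy.allowed_groups):
        return False
    if (policy.required_agreement is not None
            and policy.required_agreement not in user.signed_agreements):
        return False
    return True
```

Keeping the policy as data attached to each corpus, rather than as code in the server, would let the group that owns the data state its own conditions.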

In general, research groups will have an existing store for their data, often a large disk on a central file server which holds the data in use by active projects. Publishing data within a research group is commonly a matter of copying it to this disk and letting everyone know it's there. If possible, we should build on this resource by extending the same model to more widespread access. Publishing data should be as simple as copying it to a locally mounted volume and writing some appropriate metadata. Since we have no funds for a large centralised server, we prefer a model where local resources are made available (and perhaps shadowed or cached) to other labs. However, these disparate data stores should be unified in some way so that a user can log on to DADA and see whatever data resources are available around the country.

Although no funds are available for a central server, it may be possible to make use of facilities provided by AC3, which can provide large amounts of storage to some projects. This might be used as a central redundant copy of any data that we have on our system or as a more active component.

There is a wide variety of modes of use of linguistic data, ranging from linguists who want to search and read textual data to speech recognition researchers who need to run complex algorithms over speech recordings. There is a similarly wide range of tools used to access the data. The only thing that these have in common is that they tend to read files from a locally mounted disk. There is no uniformity of file formats (which can make sharing difficult). The system we build will need to be mountable as a local file system or to facilitate copying of data to local storage. If we go for the latter solution, we need some way of keeping track of the copies to allow for updates. We note that for compute-intensive tasks such as speech technology processing, local copies are absolutely necessary.
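One way of keeping track of local copies is to record a checksum for every file and compare it with the checksums published by the original store. The sketch below assumes SHA-256 checksums and a simple path-to-checksum manifest; the function names build_manifest and compare_manifests are hypothetical, and how the remote manifest would be fetched is left open.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file path (relative to root) to the SHA-256 of its content."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MiB chunks so large audio or video files
                # are never loaded into memory in one piece.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[path.relative_to(root).as_posix()] = digest.hexdigest()
    return manifest


def compare_manifests(local, remote):
    """Report which paths differ between a local copy and the remote store."""
    shared = local.keys() & remote.keys()
    return {
        "changed": sorted(p for p in shared if local[p] != remote[p]),
        "missing_locally": sorted(remote.keys() - local.keys()),
        "only_local": sorted(local.keys() - remote.keys()),
    }
```

A comparison like this only detects that two copies differ. Deciding which side holds the correction would need extra information, such as a version number or modification date kept with the manifest.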

We would therefore like to build a file store which:

  1. Is able to deal with potentially large collections of data, delivering them to remote users.
  2. Supports user login and restrictions on access to resources based on group membership or other criteria.
  3. Provides remote access to local data stores, possibly with mirroring or caching of frequently used data.
  4. Provides a unified (federated) interface to users.
  5. Is mountable as a local file system on a range of operating systems.
  6. Supports making local copies and synchronising these copies with remote versions to propagate changes made either locally or remotely.

Implementation Options


WebDAV provides a method for remote access to file stores over HTTP. It has the advantage of being very lightweight and compatible with most operating systems. One implementation option might be Jakarta Slide from the Apache project, but there may be others.
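To give a feel for how light the protocol is, the sketch below lists the contents of a WebDAV collection using only the Python standard library. It sends a PROPFIND request with a Depth: 1 header and reads the 207 Multi-Status reply. The host and path passed to list_collection are placeholders, authentication is omitted, and the set of properties requested is an assumption about what a client would want to display.

```python
import http.client
import xml.etree.ElementTree as ET

DAV = "{DAV:}"  # XML namespace used by WebDAV elements

PROPFIND_BODY = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<D:propfind xmlns:D="DAV:"><D:prop>'
    '<D:getcontentlength/><D:getlastmodified/><D:resourcetype/>'
    '</D:prop></D:propfind>'
)


def parse_multistatus(xml_data):
    """Extract href, size and collection flag from a Multi-Status body."""
    entries = []
    root = ET.fromstring(xml_data)
    for response in root.findall(f"{DAV}response"):
        size = response.findtext(f".//{DAV}getcontentlength")
        collection = response.find(f".//{DAV}resourcetype/{DAV}collection")
        entries.append({
            "href": response.findtext(f"{DAV}href"),
            "size": int(size) if size else None,
            "is_collection": collection is not None,
        })
    return entries


def list_collection(host, path):
    """Send a Depth: 1 PROPFIND to a live server and return parsed entries."""
    connection = http.client.HTTPSConnection(host, timeout=30)
    try:
        connection.request(
            "PROPFIND", path, body=PROPFIND_BODY,
            headers={"Depth": "1", "Content-Type": "application/xml"},
        )
        response = connection.getresponse()
        if response.status != 207:
            raise RuntimeError(f"unexpected status {response.status}")
        return parse_multistatus(response.read())
    finally:
        connection.close()
```

Because the exchange is ordinary HTTP plus XML, the same request could be issued from almost any language or tool, which is part of the appeal of WebDAV for a mixed set of labs.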


The V in WebDAV stands for versioning, but in most cases WebDAV servers do not implement it (although Slide mentions it). If we could use WebDAV versioning, we might have some chance of implementing the ability to propagate changes. The source code control system Subversion also makes use of WebDAV and might provide some useful insights.


For access control, the main player seems to be Shibboleth, which provides federated access control for web-based resources. One advantage is that a group at Macquarie (MELCOE) is heavily involved with Shibboleth and can provide some technical assistance on our project. Our problem will be how to Shibboleth-enable our file store; using an HTTP-based protocol like WebDAV should make this a lot easier.


It may be possible to have a single WebDAV server which federates access to the many stores that are made available around the country.
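At its simplest, a federating server needs a catalogue that records which store holds each corpus, so that users see a single list. The sketch below merges per-store listings into such a catalogue. The function name federated_listing and the store names used in the example are invented for illustration; how each store would report its listing is left open.

```python
def federated_listing(stores):
    """Merge per-store corpus listings into one catalogue.

    stores maps a store name to an iterable of corpus names.
    The result maps each corpus name to the sorted list of stores holding it.
    """
    catalogue = {}
    for store_name, corpora in stores.items():
        for corpus in corpora:
            catalogue.setdefault(corpus, set()).add(store_name)
    return {corpus: sorted(names) for corpus, names in sorted(catalogue.items())}


# Example with made-up store names.
listing = federated_listing({
    "lab-a": ["ANDOSL", "Susanne"],
    "lab-b": ["ANDOSL"],
})
print(listing)
```

A corpus that appears under more than one store is a candidate for mirroring, or a case where the copies need to be checked against each other.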


How should we manage propagating copies of files between servers? Is it even necessary?


It will be necessary to store some metadata on files and corpora in order to mediate access control. It would also be useful to have more metadata to enable searching for resources. There are existing metadata stores that might be useful here: DSpace, Fedora.
