Statistics topics

Associate Professor Georgy Sofronov

Postgraduate research topics in statistics

Thinking of doing a PhD or research degree in statistics? Here are some suggested topics.

Alternatively, you may have your own topic ideas you'd like to explore. In that case, talk to staff with related interests about developing a proposal.

Supervisors: Dr Nino Kordzakhia and Dr Hassan Doosti

Topic Description:

Estimation of an unknown function arises in many fields, e.g. engineering, medicine and the social sciences. We will employ an adaptive method from nonparametric statistics to estimate unknown functions such as:

probability density
quantile or regression functions
intensity function in spatial-temporal models.

We will investigate the properties of the adaptive estimator analytically and through specifically designed numerical experiments in R or Matlab.
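As an informal illustration of the kind of numerical experiment mentioned above, the following Python sketch estimates a probability density with a Gaussian kernel estimator. The page does not specify the adaptive method, so a simple data-driven bandwidth (Silverman's rule of thumb) stands in for it here; the function names and the simulated sample are assumptions made for this example only.

```python
import numpy as np

def silverman_bandwidth(x):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    x = np.asarray(x, dtype=float)
    sd = x.std(ddof=1)
    q75, q25 = np.percentile(x, [75, 25])
    return 0.9 * min(sd, (q75 - q25) / 1.34) * x.size ** (-0.2)

def kde(x, grid, h=None):
    """Gaussian kernel density estimate of the sample x, evaluated on grid."""
    x = np.asarray(x, dtype=float)
    grid = np.asarray(grid, dtype=float)
    if h is None:
        h = silverman_bandwidth(x)  # bandwidth chosen from the data
    z = (grid[:, None] - x[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (x.size * h * np.sqrt(2 * np.pi))

# Hypothetical experiment: recover a standard normal density from 500 draws.
rng = np.random.default_rng(0)
sample = rng.normal(size=500)
grid = np.linspace(-6, 6, 601)
density = kde(sample, grid)
```

The estimate can then be compared with the true density on the grid, for instance through the integrated squared error, as the sample size or the bandwidth rule varies.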

Supervisor: Dr Georgy Sofronov

Topic Description:

Change-point problems (or break point problems, disorder problems) can be considered one of the central problems of mathematical statistics, connecting together asymptotic statistical theory and Monte Carlo methods, frequentist and Bayesian approaches, fixed and sequential procedures.

In many real applications, observations are taken sequentially over time, or can be ordered with respect to some other criterion. The basic question, therefore, is whether the data obtained are generated by one or by many different probabilistic mechanisms.

The project will focus on development of robust and reliable methods for identifying change points in sequences of random variables. The main aims of this project are to:

develop adequate statistical models fitted to real datasets
investigate new analytical and computational methods to improve accuracy of estimates for parameters of the statistical models
apply the methods to problems in signal processing, bioinformatics and stock exchange modelling.
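To make the basic question concrete, here is a minimal Python sketch that locates a single change in the mean of a sequence by least squares. It is a textbook baseline under the assumption of exactly one change point, not one of the methods to be developed in the project; the function name is hypothetical.

```python
import numpy as np

def single_change_point(x):
    """Return the split index k (1 <= k < n) that minimises the total
    within-segment sum of squares when x[:k] and x[k:] each get their own mean."""
    x = np.asarray(x, dtype=float)
    n = x.size
    if n < 2:
        raise ValueError("need at least two observations")
    csum = np.cumsum(x)
    csum2 = np.cumsum(x**2)
    k = np.arange(1, n)  # candidate split points
    # Sum of squares about the mean for the left and right segments.
    left = csum2[:-1] - csum[:-1] ** 2 / k
    right = (csum2[-1] - csum2[:-1]) - (csum[-1] - csum[:-1]) ** 2 / (n - k)
    return int(k[np.argmin(left + right)])

# Usage: a noiseless step from 0 to 5 after 50 observations.
step = np.concatenate([np.zeros(50), np.full(50, 5.0)])
print(single_change_point(step))
```

Cumulative sums keep the search linear in the sequence length. Multiple change points, changes in variance and heavy-tailed noise all need more than this baseline.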

Supervisor: Dr Georgy Sofronov

Topic Description:

In many applications, data are collected sequentially over time, and decisions must be made from the information already obtained while future observations are still unknown.

Examples occur in:

environmental applications (detecting changes in ecological systems)
signal processing (structural analysis of electroencephalographic signals)
epidemiology (timely detection and prevention of various types of diseases)
finance (buying or selling an asset).

This project aims to develop novel optimal sequential procedures. A significant outcome will be the creation of computational infrastructure for identifying optimal decision rules in real applications.
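A toy version of the finance example shows what an optimal decision rule looks like. The Python sketch below assumes, purely for illustration, that a seller receives a fixed number of independent Uniform(0, 1) offers, cannot recall a rejected offer, and must accept one of them. Backward induction then gives the expected value of continuing, v, through the recursion v_new = (1 + v^2) / 2, and the optimal rule accepts an offer when it is at least the value of continuing. The function name and the model are assumptions, not part of the project description.

```python
def selling_thresholds(n_offers):
    """Optimal acceptance thresholds for n_offers i.i.d. Uniform(0, 1) offers
    without recall (an illustrative model).

    Returns (thresholds, value): offer i (0-based) should be accepted when it
    is at least thresholds[i]; value is the expected price under the rule."""
    value = 0.0  # nothing can be earned once every offer has passed
    thresholds = []
    for _ in range(n_offers):
        thresholds.append(value)           # accept if offer >= continuation value
        value = (1.0 + value**2) / 2.0     # E[max(X, value)] for X ~ Uniform(0, 1)
    thresholds.reverse()                   # built from the last offer backwards
    return thresholds, value

thresholds, expected_price = selling_thresholds(3)
print(thresholds, expected_price)  # [0.625, 0.5, 0.0] 0.6953125
```

The thresholds fall as the horizon shortens: with fewer offers left, the seller becomes less selective.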

Supervisor: Dr Georgy Sofronov

Topic Description:

The literature on the development of optimisation techniques is both large and diverse. Optimisation algorithms that construct some kind of statistical model and use this model to influence the search process can be found in areas such as:

evolutionary computation
machine learning
engineering design
stochastic and global optimisation.

The algorithms considered in this project will be based on a model of the density of promising points from a sample or population evaluated at a given iteration of the algorithm.
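One well-known algorithm of this type is the cross-entropy method, sketched below in Python as an assumption about the family the description has in mind rather than as the project's own algorithm. At each iteration a Gaussian density is refitted to the most promising ("elite") points of the current sample. All names and default parameters are hypothetical.

```python
import numpy as np

def cross_entropy_minimise(f, mean, std, n_samples=200, n_elite=20,
                           n_iter=50, seed=0):
    """Minimise f by repeatedly fitting an independent Gaussian model to the
    best n_elite points of a sample drawn from the current model."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float).copy()
    std = np.asarray(std, dtype=float).copy()
    for _ in range(n_iter):
        population = rng.normal(mean, std, size=(n_samples, mean.size))
        scores = np.array([f(p) for p in population])
        elite = population[np.argsort(scores)[:n_elite]]  # promising points
        mean = elite.mean(axis=0)                          # refit the density
        std = elite.std(axis=0) + 1e-12                    # avoid a zero scale
    return mean

# Usage: a quadratic bowl with its minimum at (1, -2).
target = np.array([1.0, -2.0])
best = cross_entropy_minimise(lambda x: np.sum((x - target) ** 2),
                              mean=[0.0, 0.0], std=[5.0, 5.0])
```

The search is driven entirely by the fitted density, so the choice of model family and elite fraction governs the trade-off between exploration and premature convergence.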

Supervisors: A/Prof Jun Ma and Prof Benoit Liquet-Weiland

Topic Description:

Time-to-event outcomes are paramount in biostatistics. To analyse such data, Cox proportional hazards regression is widely used to assess, for instance, the efficacy of new interventions, or to quantify the prognostic and/or predictive value of some features (patient, clinical or genetic). When these models are used to build risk predictions, they often fail to achieve high predictive accuracy. This lack of accuracy can be attributed to the implausibility of the assumptions required by the Cox model.
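For reference, the standard Cox model is fitted by maximising a log partial likelihood in which each observed event contributes its linear predictor minus the log of the summed exponentiated predictors over everyone still at risk. The Python sketch below evaluates that quantity, assuming no tied event times; it illustrates the existing model only, not the new semi-parametric method proposed in this project, and the function name is hypothetical.

```python
import numpy as np

def cox_partial_loglik(beta, X, time, event):
    """Cox log partial likelihood, assuming no tied event times.

    X: (n, p) covariates; time: follow-up times; event: 1 = event, 0 = censored."""
    X = np.asarray(X, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event)
    eta = X @ np.asarray(beta, dtype=float)     # linear predictors
    order = np.argsort(-time)                   # longest follow-up first
    eta_sorted = eta[order]
    event_sorted = event[order]
    # Running log-sum-exp: log of the summed exp(eta) over each risk set,
    # i.e. over all subjects whose time is at least the current subject's.
    log_risk = np.logaddexp.accumulate(eta_sorted)
    return float(np.sum((eta_sorted - log_risk)[event_sorted == 1]))

# Usage: four subjects, no covariate effect, one censored observation.
ll = cox_partial_loglik([0.0], np.zeros((4, 1)), [1, 2, 3, 4], [1, 0, 1, 1])
```

With a zero coefficient every subject in a risk set is equally likely to fail, so the value depends only on the risk-set sizes at the event times.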

To achieve high predictive performance, a predictive model should cope with complicated predictors, such as longitudinal (time-dependent) covariates and random regression coefficients (random effects, used to capture dependence among survival times), and with a large number of predictors. Such predictive models, when successfully derived in medicine for example, can provide a rational basis for making personalised management decisions, recommending appropriate follow-up schedules, and determining clinical trial eligibility and stratification.

The proposed project aims at developing new methods along with software that provide high predictive performance. A new class of semi-parametric predictive survival regression will be developed using constrained maximum penalized likelihood.

All new methods developed in this thesis will be applied to real-life data from the world's largest prospective clinical research database (n = 52,000+) and from the annotated frozen (n = 11,000) and archival tumour collections held at the Melanoma Institute Australia (Sydney).

Supervisor: A/Prof Ayse Bilgin

Topic Description:

In recent decades, and especially during the first two years of the COVID pandemic, statistics became part of daily life, if it was not already. Meetings of high-level government officials on climate change, and discussions of it among parliaments and citizens, also occur weekly, if not daily. We therefore argue that statistical literacy is today as important as literacy itself for everyone on the planet.

Possible topics in statistics education research are:

What would be the benefits and challenges of embedding the United Nations Sustainable Development Goals (UNSDGs) into the statistics curriculum?
What would be the benefits and challenges of embedding Indigenous knowledge into the statistics curriculum?
What is the place of ethics in statistics education? Different perspectives could be considered here, such as the ethics of using and creating infographics, or the ethics of data analysis.

Supervisors: Prof Benoit Liquet, Dr Kelly Williams and Dr Lyndal Henden

Topic Description:

Regions of the genome that have been inherited from a common ancestor are considered identical-by-descent (IBD) and, in the context of hereditary disease, harbour a disease locus and disease-causing gene mutation. Henden et al. (2016) developed the algorithm behind the R package XIBD, which detects regions of the genome inherited identical-by-descent and then infers the degree of relatedness between individuals. In this project, we propose to exploit statistical machine learning methods to increase accuracy in inferring the degree of relationship between individuals, thereby increasing confidence in the identification of distantly related individuals. The dataset used in this project comprises multi-generational motor neuron disease families with known pedigree structure and available genotype data.

Supervisors: Prof Benoit Liquet and Prof Hanlin Shang

Topic Description:

Partial least squares (PLS)-like methods are popular dimension-reduction techniques with numerous applications in genomics, biology, environmental science and engineering. PLS looks for the best orthogonal linear combinations of the predictors that are correlated with a multivariate response variable. So far, the different versions of PLS assume independent and identically distributed (i.i.d.) observations for the predictors and the responses, and can lead to inconsistent estimation when the data are temporally dependent (Singer 2016). Therefore, it is crucial to take the temporal dependence of the data into account when extracting latent components using PLS. In this project, we will tackle temporal dependence by proposing a dynamic partial least squares regression that exploits the auto-covariance structure of the predictors and the response variables.
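As a point of reference for the i.i.d. setting, the first PLS component can be obtained from the cross-covariance between centred predictors and responses. The Python sketch below is a minimal illustration under that i.i.d. assumption (it is not the dynamic method proposed here); the function name is hypothetical.

```python
import numpy as np

def first_pls_component(X, Y):
    """First PLS weight vector and score.

    The weight w is the leading left singular vector of Xc' Yc, i.e. the unit
    direction in predictor space with maximal covariance with the responses."""
    Xc = X - X.mean(axis=0)   # centre the predictors
    Yc = Y - Y.mean(axis=0)   # centre the responses
    U, _, _ = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    w = U[:, 0]
    return w, Xc @ w          # weight vector and latent component (scores)

# Usage with simulated data: 50 observations, 4 predictors, 1 response.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Y = X[:, [0]] - 2.0 * X[:, [1]] + 0.1 * rng.normal(size=(50, 1))
w, scores = first_pls_component(X, Y)
```

Further components are usually extracted after deflating the predictors by the scores already found.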

Supervisors: Prof Benoit Liquet and Dr Hoang Nghiem

Topic Description:

Sliced Inverse Regression (SIR) offers a flexible methodology for analysing complex data sets by combining linear dimension reduction with nonlinear regression. Initially constructed for the low-dimensional setting (p < n), SIR methods have since seen many successes and developments in tackling the challenge of high-dimensional data (p >> n). In this work, we propose to study multivariate regression models in the context of high-dimensional data. A multivariate SIR lasso model will be studied for regularisation and variable selection purposes.
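The classical low-dimensional SIR estimator can be sketched in a few lines of Python: standardise the predictors, slice the observations by the ordered response, and take the leading eigenvectors of the weighted covariance of the slice means. This is a minimal sketch for a univariate response with p < n, without the lasso penalty studied in the project; the function name and defaults are assumptions.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=1):
    """Classical SIR for a univariate response; requires p < n so that the
    sample covariance of X is invertible."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Whiten the predictors: Z = (X - mean) Sigma^(-1/2).
    evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - X.mean(axis=0)) @ inv_sqrt
    # Slice on the ordered response and accumulate the slice-mean covariance.
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    # Leading eigenvectors of M, mapped back to the original predictor scale.
    _, vecs = np.linalg.eigh(M)
    B = inv_sqrt @ vecs[:, ::-1][:, :n_dirs]
    return B / np.linalg.norm(B, axis=0)

# Usage: a single-index model with a cubic link.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
beta = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
y = (X @ beta) ** 3 + 0.1 * rng.normal(size=2000)
b_hat = sir_directions(X, y)[:, 0]
```

The inversion of the sample covariance is the step that breaks down when p exceeds n, which is what motivates regularised variants.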

Supervisor: Dr Nan Zou

Topic Description:

Massive datasets are datasets with sample sizes that exceed the computational capabilities of standard computers. With data sizes growing exponentially nowadays, massive datasets present immense opportunities as well as unprecedented challenges to our society with their prevalence in, for example,

physics,
economy,
finance,
social media.

One family of data science methods designed for massive datasets, namely the massive-data bootstrap procedures, has recently attained considerable popularity, but its theoretical properties, including its reliability, are unknown. This project aims to rectify a major misunderstanding in the literature, provide a comprehensive and complete theory for this family of data science methods, and consequently push forward the frontier of data science in the era of massive data.
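The page does not name a specific procedure. One widely used example of a massive-data bootstrap is the bag of little bootstraps, sketched below in Python for the standard error of a sample mean. Taking this procedure as the example is an assumption, as are the function name and tuning defaults. Each small subset is resampled up to the full sample size through multinomial counts, so no resample of size n is ever stored.

```python
import numpy as np

def blb_se_of_mean(x, n_subsets=10, gamma=0.6, n_resamples=50, seed=0):
    """Bag-of-little-bootstraps estimate of the standard error of the mean.

    Draws n_subsets subsets of size b = n^gamma; within each, size-n resamples
    are represented by multinomial counts over the b subset points."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.size
    b = int(n ** gamma)
    se_per_subset = []
    for _ in range(n_subsets):
        subset = rng.choice(x, size=b, replace=False)
        # counts[r, j] = how often subset point j appears in resample r.
        counts = rng.multinomial(n, np.full(b, 1.0 / b), size=n_resamples)
        means = counts @ subset / n            # resampled means
        se_per_subset.append(means.std(ddof=1))
    return float(np.mean(se_per_subset))       # average across subsets

# Usage: 20,000 standard normal draws; the true standard error is about 0.0071.
rng = np.random.default_rng(42)
data = rng.normal(size=20000)
se_hat = blb_se_of_mean(data)
```

Only b distinct values are touched per resample, which is the source of the computational saving. Whether such estimates are reliable in general is the kind of theoretical question the project addresses.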