Explore the range of research interests in our school

Postgraduate research topics

Thinking of doing a PhD or research degree in statistics? Here are some suggested project topics.

Alternatively, you may have your own topic ideas you'd like to explore. In that case, talk to staff with related interests about developing a proposal.

A – O projects

Supervisors: Dr Nino Kordzakhia and Dr Hassan Doosti

Topic description

Estimation of an unknown function arises in many fields, for example engineering, medicine and the social sciences. We will employ an adaptive method from nonparametric statistics to estimate unknown functions such as:

  • intensity function in spatial-temporal models
  • probability density
  • quantile or regression functions.

We will investigate the properties of the adaptive estimator analytically and through specifically designed numerical experiments in R or Matlab.
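As a point of reference for such numerical experiments, the sketch below estimates a probability density with a fixed-bandwidth Gaussian kernel estimator. It is written in Python rather than R or Matlab purely for illustration, and it is not the adaptive estimator the project will study: the function name `kernel_density_estimate`, the simulated sample and the rule-of-thumb bandwidth are all assumptions made here.

```python
import math
import random
import statistics


def kernel_density_estimate(sample, bandwidth=None):
    """Return a function x -> estimated density, using a Gaussian kernel."""
    n = len(sample)
    if bandwidth is None:
        # Rule-of-thumb bandwidth for a Gaussian kernel (an assumption of
        # this sketch; an adaptive method would choose it from the data).
        bandwidth = 1.06 * statistics.stdev(sample) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Average of Gaussian bumps centred at each observation.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


# Simulated data: 500 draws from a standard normal distribution.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(500)]
f_hat = kernel_density_estimate(sample)

# Riemann-sum check that the estimate integrates to roughly one.
step = 0.05
integral = sum(f_hat(-6.0 + i * step) for i in range(241)) * step
```

A fixed bandwidth smooths every region of the data equally; adaptive methods instead let the amount of smoothing respond to the data, which is the kind of behaviour the project sets out to analyse.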

Supervisor: Associate Professor Jun Ma

Topic description

This project explores extensions of our newly developed cumulative incidence-specific Cox models for competing risks to incorporate time-varying covariates. This extension provides a pathway for joint modelling of competing risks event times and time-dependent measurements, such as biomarkers.

This type of model has wide applications in medicine and can be used to predict patients’ survival probabilities when biomarkers play important roles in determining patients’ survival.

Supervisor: Associate Professor Thomas Fung

Topic description

This project develops novel semiparametric longitudinal methods for modelling overdispersed, multimodal count data arising from sickness absence records, where observations are heaped at weekly intervals. Using the Household, Income and Labour Dynamics in Australia (HILDA) survey, the project investigates how latent psychosocial job quality influences sickness leave over time.

A key methodological innovation is the integration of dynamic latent variable models, incorporating autoregressive temporal dependence on the latent psychosocial factor, within an empirical likelihood framework for mixed-effects models.

Recent advances in empirical likelihood inference for variance components in linear mixed-effects models (Zhang et al., 2025) have established nonparametric versions of Wilks' theorem for variance component testing without requiring Gaussian distributional assumptions on random effects.

Expected outcomes include:

  • new statistical tools that provide robust, distribution-free inference for heaped longitudinal counts with dynamic latent structures
  • a more reliable understanding of how evolving work environments affect employee health in the Australian workforce.

Supervisor: Dr Georgy Sofronov

Topic description

Change-point problems (also called break-point or disorder problems) can be considered one of the central problems of mathematical statistics, connecting asymptotic statistical theory and Monte Carlo methods, frequentist and Bayesian approaches, fixed and sequential procedures.

In many real applications, observations are taken sequentially over time, or can be ordered with respect to some other criterion. The basic question, therefore, is whether the data obtained are generated by one or by many different probabilistic mechanisms.

The project will focus on development of robust and reliable methods for identifying change points in sequences of random variables. The main aims of this project are to:

  • develop adequate statistical models fitted to real datasets
  • investigate new analytical and computational methods to improve accuracy of estimates for parameters of the statistical models
  • apply the methods to problems in signal processing, bioinformatics and stock exchange modelling.
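To make the basic question concrete, the sketch below handles the simplest case: a single shift in the mean of a sequence, located by least squares. It is a textbook baseline under illustrative assumptions (one change point, simulated Gaussian data, the hypothetical function name `estimate_change_point`), not one of the methods the project will develop.

```python
import random


def estimate_change_point(x):
    """Return k (1 <= k < n) so that splitting x into x[:k] and x[k:]
    minimises the total within-segment sum of squares."""
    n = len(x)
    total = sum(x)
    best_k, best_gain = None, -1.0
    left_sum = 0.0
    for k in range(1, n):
        left_sum += x[k - 1]
        right_sum = total - left_sum
        # Minimising the within-segment sum of squares is equivalent to
        # maximising this between-segment term.
        gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if gain > best_gain:
            best_k, best_gain = k, gain
    return best_k


# Simulated sequence: mean 0 for 100 observations, then mean 3 for 100 more.
rng = random.Random(1)
series = [rng.gauss(0.0, 1.0) for _ in range(100)]
series += [rng.gauss(3.0, 1.0) for _ in range(100)]
k_hat = estimate_change_point(series)
```

Real datasets rarely satisfy these assumptions: the number of change points is unknown, and the observations may be dependent or heavy-tailed, which is where more robust methods are needed.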

Supervisor: Associate Professor Maurizio Manuguerra

Topic description

Across many disciplines, there is interest in estimating the time until an event occurs, such as an earthquake, an extreme weather episode, a system failure or the onset of a disease.

Anyone wishing to model time-to-event outcomes faces the problem of choosing the most appropriate model class. Currently, there is no generally accepted methodology, and best practice is to choose a model class and then test its assumptions.

The literature suggests, however, that the most popular approach is to use the model class that is best known and accepted in each specific field, possibly without even testing its assumptions. In particular, the proportional hazards (PH, or Cox) model class has eclipsed other models in popularity.

This project addresses these challenges by filling a critical gap in statistical methodology and introducing a unifying theory for modelling time-to-event outcomes: a super-class of transformation models which enables data-driven selection of the most appropriate modelling approach and extends the framework beyond the limits of existing models.

The main outcomes of this project will be the development of a novel theoretical framework for the modelling of time-to-event data, which will allow for a data-driven statistical test to select the most appropriate model class, either in absolute terms or among a set of relevant models (PH model included).

Supervisor: Dr Georgy Sofronov

Topic description

In many applications, data are collected sequentially over time, and decisions must be made from the information already obtained, before future observations are known.

Examples occur in:

  • environmental applications (detecting changes in ecological systems)
  • epidemiology (timely detection and prevention of various types of diseases)
  • finance (buying or selling an asset)
  • signal processing (structural analysis of electroencephalographic signals).

This project aims to develop novel optimal sequential procedures. A significant outcome will be the creation of computational infrastructure for identifying optimal decision rules in real applications.
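A toy version of the asset-selling example shows what an optimal decision rule looks like. The sketch below assumes a fixed number of offers that are independent and Uniform(0, 1), no recall of past offers, and that one offer must be accepted; under those assumptions backward induction gives the optimal thresholds. The function names are hypothetical, and the setting is far simpler than the procedures the project targets.

```python
def continuation_values(horizon):
    """values[k] is the expected payoff of the optimal rule when k offers
    remain (i.i.d. Uniform(0, 1) offers, no recall)."""
    values = [0.0]  # with no offers left, nothing can be earned
    for _ in range(horizon):
        v = values[-1]
        # E[max(X, v)] for X ~ Uniform(0, 1) and 0 <= v <= 1.
        values.append((1.0 + v * v) / 2.0)
    return values


def should_accept(offer, offers_left_after, values):
    """Accept when the offer is at least the value of continuing."""
    return offer >= values[offers_left_after]


values = continuation_values(3)
```

With one further offer to come, the seller accepts anything of at least 0.5; with two further offers, the threshold rises to 0.625, so the rule becomes more selective the more opportunities remain.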

P – Z projects

Supervisor: Dr Jack Freestone

Topic description

Mass-spectrometry-based proteomics is one of the central technologies for discovering which proteins are present, changing or disease-associated in a biological sample. In a typical analysis, the software compares each experimental spectrum against a reference list of possible peptide sequences and reports the peptide that appears to match best.

Because thousands or millions of such comparisons are made, proteomics pipelines rely on artificial negative controls, called decoys, to estimate how many reported peptide identifications are likely to be false. The reliability of the entire analysis depends on these decoys behaving like realistic incorrect matches.
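The counting idea behind decoys can be sketched in a few lines. The example assumes a commonly used conservative estimator, in which the number of decoy matches above a score threshold (plus one) stands in for the number of false target matches; the function name and the scores are made up for illustration and do not come from any particular proteomics tool.

```python
def estimated_fdr(target_scores, decoy_scores, threshold):
    """Estimate the false discovery rate among target matches scoring at
    least `threshold`, using decoy matches as a proxy for false targets."""
    n_targets = sum(score >= threshold for score in target_scores)
    n_decoys = sum(score >= threshold for score in decoy_scores)
    # The "+1" keeps the estimate conservative; max(..., 1) avoids
    # division by zero when no target passes the threshold.
    return min(1.0, (n_decoys + 1) / max(n_targets, 1))


# Hypothetical scores of the winning matches after target-decoy competition.
targets = [9.0, 8.0, 7.0, 6.0, 5.0, 4.0]
decoys = [5.5, 3.0, 2.0, 1.0]

fdr_at_6 = estimated_fdr(targets, decoys, threshold=6.0)
fdr_at_5 = estimated_fdr(targets, decoys, threshold=5.0)
```

The estimate is only as good as the assumption that a false target match is as likely to score highly as a decoy match.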

This PhD project will develop new statistical methodology for constructing more accurate and informative decoys in proteomics. A key challenge is that the peptide that truly generated a spectrum may be missing from the reference list being searched. In that case, the software may incorrectly match the spectrum to a similar-looking peptide that is in the list, producing a high-scoring but false identification. Standard decoy-generation strategies may fail to capture these 'near-miss' errors, causing false discovery rates to be underestimated.

The project will investigate new multi-decoy methods in which each target peptide is paired with several nearby decoys, created through small sequence changes that better mimic realistic sources of error. The work will also use machine-learning-inspired scoring rules to combine information across multiple decoys while preserving rigorous false discovery control.

It would suit a student interested in:

  • machine learning
  • computational biology
  • multiple testing
  • statistical inference.

Supervisor: Associate Professor Jun Ma

Topic description

The semiparametric accelerated failure time (AFT) model plays an important role in survival analysis, as it offers simple and conventional interpretations of regression coefficients. However, the computation of the parameter estimates is challenging because the baseline distribution (or error distribution) depends on the regression coefficients.

This dependence makes the log-likelihood function nonconcave and, therefore, makes its maximisation numerically unstable.

In this project, we will first develop a numerically stable algorithm for fitting semiparametric AFT models with general partly interval-censored survival data. The algorithm will then be extended to neural network (NN) AFT models, in which the linear predictors are replaced by neural network functions. Left truncation, frailty components, and time-varying covariates will also be incorporated into both the AFT and NN-AFT models.

Supervisor: Dr Nan Zou

Topic description

Massive datasets are datasets with sample sizes that exceed the computational capabilities of standard computers. Nowadays, massive datasets are ubiquitous; for example, ChatGPT users were sending about 2.5 billion prompts per day, and the Nasdaq stock exchange witnessed about 60 million transactions each day.

Massive datasets present both major opportunities and challenges for AI. On the one hand, massive and high-dimensional datasets have driven the development of modern AI methods, including neural networks (NNs) and large language models (LLMs), and have enabled substantial improvements in AI performance through data-rich training environments. On the other hand, massive datasets incur enormous computational costs, for example in CPU/GPU time and electricity, even for classic methods like linear regression, and more so for modern, inherently intricate methods like NNs and LLMs.

Subsampling offers a promising methodology for this challenge. By drawing data randomly or non-randomly from the original dataset, subsampling creates a much smaller dataset that is similar in structure to the original. While preserving the key features of the original dataset, this smaller dataset is significantly cheaper to analyse. Hence, in the context of massive datasets, subsampling has the potential to make modern AI and large-scale statistical learning substantially more computationally tractable.
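The simplest form of the idea, uniform random subsampling, can be sketched as follows. The example fits a straight line to a simulated dataset and to a 1% subsample of it; the data-generating model, the sample sizes and the helper name `ols_slope_intercept` are illustrative assumptions, and the non-uniform schemes studied in the literature are not shown.

```python
import random


def ols_slope_intercept(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Simulated "full" dataset: y = 1 + 2x + noise.
rng = random.Random(2)
n_full = 100_000
xs = [rng.uniform(0.0, 10.0) for _ in range(n_full)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]

# Uniform subsample of 1,000 rows, drawn without replacement.
idx = rng.sample(range(n_full), 1_000)
sub_slope, sub_intercept = ols_slope_intercept(
    [xs[i] for i in idx], [ys[i] for i in idx]
)
full_slope, full_intercept = ols_slope_intercept(xs, ys)
```

The subsample estimate costs about one hundredth of the computation but is noisier than the full-data estimate, which is the trade-off between computational efficiency and statistical validity in miniature.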

With a particular emphasis on the trade-off between computational efficiency and statistical validity, this project aims to provide a strong mathematical foundation for rapid and reliable AI.

Supervisor: Dr Nan Zou

Topic description

Massive datasets are datasets with sample sizes that exceed the computational capabilities of standard computers. With data sizes growing exponentially, massive datasets present immense opportunities as well as unprecedented challenges to our society with their prevalence in, for example:

  • economy
  • finance
  • physics
  • social media.

One family of data science methods designed for massive datasets, namely the massive-data bootstrap procedures, has recently attained considerable popularity, but its theoretical properties, including its reliability, remain unknown.

This project aims to rectify a major misunderstanding in the literature, provide a comprehensive and complete theory for this family of data science methods and consequently push forward the frontier of data science in the era of massive data.

Supervisor: Dr Houying Zhu

Topic description

This project pioneers new statistical methods that prioritise stability, resistance and interpretability for analysing high-dimensional data. By integrating Bayesian and frequentist principles for statistical model building, the new solutions will enhance the reliability of data analysis in neuroscience, meat science and climate change research.

Expected outcomes include innovative statistical tools that provide consistent, interpretable insights, enabling confident decision-making and interdisciplinary collaboration in these fields.

This research will empower scientists to derive trustworthy findings from complex data in Australian and global contexts, advancing research and application in health, food science and climate studies.

Supervisor: Dr Jack Freestone

Topic description

Modern science is awash with large-scale screening experiments: testing thousands of genes for association with disease, thousands of proteins for differential expression or thousands of candidate features in a statistical model. A persistent challenge is that discoveries made in one dataset often fail to replicate in another, especially in high-dimensional settings where apparent signals can be driven by confounding, sampling bias, model misspecification, or the idiosyncrasies of a single experiment.

This PhD project will develop new statistical methods for replicability analysis: identifying signals that are consistently present across multiple studies, populations, experimental conditions or data-collection environments, while controlling the false discovery rate – the expected proportion of reported discoveries that are false.

A key limitation of existing methodology is that it can be overly conservative when estimating the number of false replicability claims. In particular, if a feature is null in more than one environment, naive approaches can effectively count it multiple times, even though it should contribute only once to the collection of non-replicable features.

This project will develop sharper methods that avoid this overcounting, either through principled pre-filtering strategies that remove unpromising features before the final replicability analysis, or through less conservative false discovery rate estimators that more carefully account for overlap among null features across environments.

This project will involve:

  • designing algorithms
  • proving rigorous error-control guarantees
  • running simulations
  • applying the methods to real datasets from areas such as genomics and proteomics.

It will draw on knock-off-style statistics, multi-environment inference and negative controls.

Supervisor: Associate Professor Jun Ma

Topic description

In this project we investigate the issue of model selection for the semiparametric accelerated failure time (AFT) model when event time observations are subject to partly interval censoring. Specifically, we consider fitting the semiparametric AFT model in settings where the number of covariates is large, often much greater than the number of observations.

In modelling survival times, the availability of high-dimensional data necessitates the development of variable selection techniques to identify important risk factors among numerous covariates, particularly when these covariates are related to genes or biomarkers.

We will employ L_0-norm and penalised likelihoods to perform model selection. More complex settings for semiparametric AFT model selection, including time-varying covariates and left truncation, will also be investigated.