Development of machine learning models from imbalanced datasets in biomedicine

Summary

Motivation

Data imbalance in machine learning refers to situations where the distribution of classes in a dataset is highly skewed. In age estimation from orthopantomograms (OPGs), this imbalance often arises because certain age groups, such as children or the elderly, are underrepresented compared to the more common adult population. Dataset imbalance can lead to biased models, but addressing it by excluding data is not always feasible or desirable, particularly when the dataset is large, diverse, and derived from real-world clinical settings. Using all available data, even when imbalanced, is often a necessity, because collecting and annotating new data to balance classes is expensive, time-consuming, and sometimes impractical.

We have acquired a large OPG dataset from a Paris dental clinic, which provides a unique opportunity to develop machine learning models for age determination. Its size, demographic diversity, clinical variability, and real-world age distribution make it an invaluable resource. However, the inherent imbalance in age group representation poses a significant challenge to building unbiased predictive models. Larger datasets, even when imbalanced, offer a richer variety of examples that can improve generalization to unseen data, especially in deep learning, where model capacity benefits from diverse input. For minority classes, every data point removed is a direct loss of critical information. Using the full dataset therefore preserves the natural age group distribution and supports the development of robust and meaningful machine learning models.

Underlying aim

The overarching objective of this project is to develop a deep learning framework tailored to address dataset imbalance in biomedicine. We will first address age estimation using OPGs, drawing on the diversity and scale of the Paris dental clinic dataset. By incorporating state-of-the-art techniques such as transfer learning, data augmentation, and loss function optimization, the proposed solution aims to reduce bias and improve prediction accuracy across all age groups and demographics. The rationale for this approach lies in the composition of the dataset, which provides an opportunity to create a broadly applicable model that reflects real-world diversity while addressing the methodological challenges of imbalance.

Specific objectives/Methodology

The project will involve the following specific aims.

Specific Aim 1: Develop a Baseline Deep Learning Model for Age Estimation.

(a) Objective: Establish a robust baseline by applying a deep learning architecture, such as ResNet, to the OPG dataset.

(b) Hypothesis: A deep learning model pre-trained on a large, general radiographic dataset will effectively predict age when fine-tuned on OPG images.

(c) Approach: Utilize transfer learning to train a ResNet-based model on the Paris dataset. Preprocess the images uniformly and conduct a standard evaluation using cross-validation.

(d) Outcome: A baseline model with defined performance metrics highlighting the effects of dataset imbalance on prediction accuracy.
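As an illustration of the cross-validation step in Aim 1, the following is a minimal sketch of how folds might be stratified by age group so that rare groups appear in every fold. The 10-year bin width, the function names `age_bin` and `stratified_folds`, and the round-robin assignment are assumptions made for this example; they are not part of the project specification.

```python
import random
from collections import defaultdict


def age_bin(age, width=10):
    """Map an age in years to a bin index (hypothetical 10-year bins)."""
    return int(age // width)


def stratified_folds(ages, k=5, seed=0):
    """Assign each sample to one of k folds, spreading each age bin evenly.

    Returns a list where entry i is the fold index of sample i.
    """
    rng = random.Random(seed)
    by_bin = defaultdict(list)
    for idx, age in enumerate(ages):
        by_bin[age_bin(age)].append(idx)

    fold_of = [None] * len(ages)
    offset = 0  # rotate the starting fold so overall fold sizes stay balanced
    for indices in by_bin.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            fold_of[idx] = (offset + pos) % k
        offset += len(indices)
    return fold_of


# Toy example: few children and elderly patients, many adults.
ages = [8] * 5 + [35] * 50 + [82] * 5
folds = stratified_folds(ages, k=5)
```

With this kind of split, each validation fold contains samples from the minority age groups, so per-group performance can be estimated in every fold rather than only in those that happen to include rare ages.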

Specific Aim 2: Implement and Evaluate Strategies to Mitigate Dataset Imbalance.

(a) Objective: Integrate techniques such as class-specific loss weighting, synthetic data generation, and adaptive oversampling to address imbalance.

(b) Hypothesis: Loss function optimization and data augmentation will significantly improve prediction accuracy for underrepresented groups without compromising overall model performance.

(c) Approach: Experiment with focal loss, data augmentation based on the synthetic minority oversampling technique (SMOTE), and generative adversarial networks (GANs) to generate synthetic OPGs for minority groups. Compare these techniques against the baseline model.

(d) Outcome: A comprehensive evaluation of imbalance mitigation strategies, including improved performance metrics for minority age groups and demographics.
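Two of the techniques named in Aim 2, class-specific loss weighting and focal loss, can be sketched in a few lines. The example below assumes that age is treated as a classification problem over age groups; the group labels, the "balanced" inverse-frequency weighting heuristic, and the default `gamma=2.0` are illustrative assumptions rather than choices made by the project.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive weights above 1, common classes below 1.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def focal_loss(p_true, gamma=2.0, weight=1.0):
    """Focal loss for one sample: -weight * (1 - p)^gamma * log(p).

    p_true is the predicted probability of the correct class.
    With gamma = 0 and weight = 1 this reduces to cross-entropy.
    """
    p = min(max(p_true, 1e-12), 1.0)  # clamp to avoid log(0)
    return -weight * (1.0 - p) ** gamma * math.log(p)


# Toy label distribution (hypothetical age groups).
labels = ["child"] * 10 + ["adult"] * 80 + ["elderly"] * 10
weights = inverse_frequency_weights(labels)

# A confidently correct adult sample contributes little loss;
# a poorly predicted child sample contributes much more.
easy = focal_loss(0.95, weight=weights["adult"])
hard = focal_loss(0.30, weight=weights["child"])
```

The focusing term `(1 - p)^gamma` down-weights samples the model already predicts well, while the class weight scales up the contribution of underrepresented groups; the two can be used together or separately when comparing against the baseline.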

Specific Aim 3: Develop a Comprehensive and Generalizable Framework for Age Determination.

(a) Objective: Construct an optimized model that integrates the best-performing techniques from Aims 1 and 2 into a unified framework.

(b) Hypothesis: A deep learning framework incorporating diverse data and advanced techniques will yield a highly generalizable model for age estimation.

(c) Approach: Combine findings from the previous aims, fine-tune the final model, and evaluate its performance on an external validation dataset. Conduct interpretability analyses to ensure the model's predictions are biologically and clinically plausible.

(d) Outcome: A generalizable deep learning framework for age determination, capable of handling diverse real-world data with improved accuracy and fairness across all demographics.
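One simple way to check accuracy and fairness across groups on an external validation set is to report the mean absolute error (MAE) per group and the gap between the best and worst group. The sketch below assumes age is predicted in years; the function names `mae_by_group` and `fairness_gap`, the group labels, and the toy values are hypothetical and serve only to illustrate the calculation.

```python
from collections import defaultdict


def mae_by_group(true_ages, predicted_ages, groups):
    """Return the mean absolute error in years for each group label."""
    errors = defaultdict(list)
    for true, pred, group in zip(true_ages, predicted_ages, groups):
        errors[group].append(abs(true - pred))
    return {group: sum(errs) / len(errs) for group, errs in errors.items()}


def fairness_gap(group_mae):
    """Difference between the worst and best per-group MAE."""
    return max(group_mae.values()) - min(group_mae.values())


# Toy values for illustration only; not project results.
true_ages = [8, 9, 35, 40, 80]
predicted_ages = [10, 9, 36, 38, 70]
groups = ["child", "child", "adult", "adult", "elderly"]

per_group = mae_by_group(true_ages, predicted_ages, groups)
gap = fairness_gap(per_group)
```

Reporting per-group error alongside the overall figure makes it visible when a low average MAE is driven by the majority group while a minority group is predicted poorly.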

Impact

This work will advance the field of age determination by providing a validated, scalable, and generalizable deep learning framework capable of addressing data imbalance in diverse datasets. The findings and models developed through this study will not only enhance forensic and medical applications but also establish methodologies for addressing dataset imbalance in other fields reliant on large, heterogeneous datasets.

Essential criteria

Applicants should hold, or expect to obtain, a First or Upper Second Class Honours Degree in a subject relevant to the proposed area of study.

We may also consider applications from those who hold equivalent qualifications, for example, a Lower Second Class Honours Degree plus a Master’s Degree with Distinction.

In exceptional circumstances, the University may consider a portfolio of evidence from applicants who have appropriate professional experience which is equivalent to the learning outcomes of an Honours degree in lieu of academic qualifications.

  • Sound understanding of subject area as evidenced by a comprehensive research proposal
  • A comprehensive and articulate personal statement

Desirable Criteria

If the University receives a large number of applicants for the project, the following desirable criteria may be applied to shortlist applicants for interview.

  • First Class Honours (1st) Degree
  • Completion of a Master's degree at a level equivalent to commendation or distinction at Ulster
  • Experience using research methods or other approaches relevant to the subject domain
  • Sound understanding of subject area as evidenced by a comprehensive research proposal
  • Work experience relevant to the proposed project
  • Publication record appropriate to career stage
  • Experience of presenting research findings

Equal Opportunities

The University is an equal opportunities employer and welcomes applicants from all sections of the community, particularly from those with disabilities.

Appointment will be made on merit.

Recommended reading

  • Manisha Saini and Seba Susan; Tackling class imbalance in computer vision: a contemporary review; Artificial Intelligence Review, 56 (2023) S1279–S1335.
  • Walid Brahmi et al.; Exploring the role of Convolutional Neural Networks (CNN) in dental radiography segmentation: A comprehensive Systematic Literature Review; Engineering Applications of Artificial Intelligence, 133 (2024) 108510.
  • Mingyu Kim et al.; Realistic high-resolution lateral cephalometric radiography generated by progressive growing generative adversarial network and quality evaluations; Scientific Reports, 11 (2021) 12563.

Key dates

Submission deadline
Monday 7 April 2025, 4:00 PM

Interview Date
April 2025

Preferred student start date
15 September 2025

Contact supervisor

Dr Mateus Webba Da Silva
