Xiruo Ding
Graduated: June 13, 2025
Thesis/Dissertation Title:
Building Robust Text Classification Models under Provenance Shift: Methods of Adjustment and a Framework for Evaluation
Machine learning and deep learning have revolutionized many fields, including biomedical research. However, they often rely on large amounts of data, which can be hard to obtain in fields such as medicine, where access to clinical data is restricted. One solution is to combine data from multiple sites (yielding examples with different provenance), increasing the variety and quantity of data available for modeling. This dissertation addresses the limitations of machine learning models in this setting, which stem from provenance shift and confounding by provenance: models learn to recognize the origin of a data element and bias their predictions toward the class distribution at that source. The work introduces a formal framework for simulating and evaluating model robustness to such confounding by provenance. To improve this robustness, I focus on three methodological categories: statistical adjustment (Backdoor Adjustment), distributional adjustment (the DistMatch framework), and adjustment through modification of the hidden spaces of deep neural networks (including post-hoc editing of language model weights and robust learning).
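As a minimal illustration of confounding by provenance (not taken from the dissertation; all names, data, and the toy classifier here are hypothetical), consider two sites whose class distributions differ. A classifier that picks up on a provenance cue in the text looks accurate when test data share the training sites' class mix, but degrades when the mix at each site shifts:

```python
import random

random.seed(0)

def make_corpus(n, p_pos_site_a):
    # Each example: (tokens, label). A site token leaks provenance;
    # the content word ("good"/"bad") is the true signal but is
    # sometimes absent, so the model is tempted to use the site cue.
    data = []
    for _ in range(n):
        site = random.choice(["siteA", "siteB"])
        p_pos = p_pos_site_a if site == "siteA" else 1 - p_pos_site_a
        label = 1 if random.random() < p_pos else 0
        content = "good" if label == 1 else "bad"
        tokens = [site] + ([content] if random.random() < 0.7 else [])
        data.append((tokens, label))
    return data

def train(data):
    # Per-token positive-rate estimates: a crude stand-in for a
    # bag-of-words text classifier.
    counts = {}
    for tokens, y in data:
        for t in tokens:
            pos, tot = counts.get(t, (0, 0))
            counts[t] = (pos + y, tot + 1)
    return counts

def predict(counts, tokens):
    scores = [counts.get(t, (0, 1))[0] / counts.get(t, (0, 1))[1]
              for t in tokens]
    return 1 if sum(scores) / len(scores) >= 0.5 else 0

def accuracy(counts, data):
    return sum(predict(counts, toks) == y for toks, y in data) / len(data)

# Train where siteA is mostly positive; the site token becomes predictive.
train_data = make_corpus(5000, p_pos_site_a=0.9)
model = train(train_data)

iid_test = make_corpus(2000, p_pos_site_a=0.9)    # same provenance mix
shift_test = make_corpus(2000, p_pos_site_a=0.1)  # class mix per site flipped

acc_iid = accuracy(model, iid_test)
acc_shift = accuracy(model, shift_test)
print(f"in-distribution: {acc_iid:.2f}, under provenance shift: {acc_shift:.2f}")
```

Under this simulation, the gap between the two accuracies is driven entirely by the model's reliance on the provenance cue; the adjustment methods studied in the dissertation aim to close exactly this kind of gap.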
The proposed framework and adjustment methods were evaluated on three datasets: two biomedical and one from the general domain. The results indicate that all three categories of methods can improve model robustness and performance under confounding by provenance, even in extreme cases of provenance shift. This research advances our understanding of how provenance shift affects model performance and offers several approaches for developing more reliable and trustworthy text classification algorithms based on machine learning and deep learning.