
Harkirat Sohi

Graduated: September 10, 2022

Thesis/Dissertation Title:

Understanding the differences in cognitively defined subgroups in Alzheimer's disease: A data science approach

Abstract: My work connects two types of data in Alzheimer’s Disease (AD): structural MRI data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and cognition data in the form of AD subgroups. The subgroups (AD-Executive, AD-Language, AD-Memory and AD-Visuospatial), defined by Crane et al. (2017), are based on cognitive test scores from the time of AD diagnosis, and each subgroup is characterized by marked impairment in the specified cognitive domain relative to the other domains. My dissertation focuses on data science and mathematical methods for understanding how the volumes of 70 brain regions of interest (ROIs) might differ across pairs of AD subgroups, both in cross-sectional data from the time of AD diagnosis (Aim 1) and in longitudinal data (Aim 2). My work demonstrates a careful assessment and implementation of methods to make the best use of the available data, which are currently small in sample size, imbalanced across AD subgroups, and noisy.

In Aim 1, I used random forest models to identify the most important brain ROIs for distinguishing between pairs of AD subgroups. Prior to building classification models, I addressed specific challenges in the cross-sectional data: potential noise due to non-ROI variables and imbalanced AD subgroup sizes. A challenge in using classification models in the domain of AD subgroups is that there is no gold standard for knowing how separable the AD subgroups are based on ROI volumes. The work presented here may be the first to establish a benchmark for classification accuracies for distinguishing between pairs of AD subgroups based on ROI volumes. These models are not intended for prediction in a clinical setting, but rather for understanding which brain regions are most important for distinguishing the AD subgroups.

In Aim 2, I used linear mixed effects (LME) modeling on longitudinal data to determine which of the 70 ROIs’ volume trajectories differ the most across pairs of AD subgroups in terms of longitudinal volume and rate of change of volume with respect to time. First, I laid out criteria for using data from specific MRI scans in an effort to reduce noise in the data, instead of using the default longitudinal dataset. Given the small sample size of the AD subgroups and the irregular data, I implemented LME modeling for each ROI on the original dataset consisting of all time points, and also on a series of subsets obtained by restricting each AD subgroup’s data to time points with a specific minimum number of subjects available. In Aim 2, I also simulated simplistic synthetic longitudinal data for two hypothetical groups, with tweakable parameters for sample size and group differences, which can serve as a test bed for future analysis methods for understanding AD subgroup differences.
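
As an illustration only, the kind of synthetic test bed described above might be sketched as follows. This is not the dissertation's code: the function names, the parameter values, and the arbitrary volume units are all hypothetical, and the per-subject least-squares slopes are a deliberately simpler stand-in for LME modeling.

```python
import random
from statistics import mean


def simulate_group(n_subjects, n_visits, base_volume, slope, rng,
                   sd_intercept=0.3, sd_slope=0.05, sd_noise=0.1):
    """Return {subject_id: [(time, volume), ...]} for one hypothetical group.

    All parameters are illustrative assumptions, not values from the thesis.
    """
    data = {}
    for subject in range(n_subjects):
        # Subject-specific deviations play the role of random effects.
        intercept = base_volume + rng.gauss(0, sd_intercept)
        subject_slope = slope + rng.gauss(0, sd_slope)
        data[subject] = [
            (t, intercept + subject_slope * t + rng.gauss(0, sd_noise))
            for t in range(n_visits)
        ]
    return data


def ols_slope(points):
    """Least-squares slope of volume on time for one subject."""
    t_bar = mean(t for t, _ in points)
    v_bar = mean(v for _, v in points)
    num = sum((t - t_bar) * (v - v_bar) for t, v in points)
    den = sum((t - t_bar) ** 2 for t, _ in points)
    return num / den


def mean_slope(group):
    """Average of the per-subject slopes (a simple stand-in for an LME fit)."""
    return mean(ols_slope(points) for points in group.values())


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Imbalanced group sizes and a tunable difference in rate of volume change.
group_a = simulate_group(40, 5, base_volume=10.0, slope=-0.20, rng=rng)
group_b = simulate_group(15, 5, base_volume=10.0, slope=-0.35, rng=rng)

slope_difference = mean_slope(group_a) - mean_slope(group_b)
print(f"estimated slope difference (A - B): {slope_difference:.3f}")
```

Because the true slope difference is set by hand (here 0.15 per visit), an analysis method can be checked against a known answer while the sample sizes and noise levels are varied.
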
An important finding of my work is that the top ROIs identified as important for distinguishing between pairs of AD subgroups overlapped to some extent between the cross-sectional and longitudinal analyses. Results from my Ph.D. work have potential implications for decisions about which brain regions may be relevant for future neuropathological studies of AD subgroups.


John Gennari (chair), Ellen Wijsman, Paul Crane, David Crosslin, Shuai Huang, Ali Shojaie