I. Introduction
II. Methods
1. Data Collection and Integration
(1) Protein dataset: The protein dataset included data from UniProt [18] and normalized protein expression (NPX) metrics [19].
UniProt is a comprehensive database of protein sequences, functions, and annotations, comprising the UniProt Knowledgebase (UniProtKB), UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc).
NPX values quantified alpha-synuclein levels on Olink's log2 scale, providing standardized measures of protein abundance (Figure 3).
Differentially expressed proteins reflected neurodegeneration, providing insights into PD modeling.
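Because NPX is a log2-scale unit, a difference of one NPX between two samples corresponds to a doubling of relative abundance. The minimal sketch below illustrates that conversion; the function name npx_fold_change is hypothetical and not part of the original pipeline.

```python
def npx_fold_change(npx_a: float, npx_b: float) -> float:
    """Relative abundance of sample A versus sample B for one protein.

    NPX is on a log2 scale, so a difference in NPX is a log2 fold change.
    """
    return 2.0 ** (npx_a - npx_b)


# Toy values for illustration only: A is one NPX unit above B.
ratio = npx_fold_change(5.0, 4.0)
```

NPX differences are only comparable within the same protein assay, so a helper like this would be applied per protein rather than across proteins.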
(2) Peptides dataset
Peptides are short amino acid sequences essential in biological functions such as signaling, enzymatic reactions, and hormone regulation.
The dataset captured peptide concentrations via mass spectrometry, providing insights into protein expression levels crucial for understanding PD mechanisms (Figure 4).
(3) Clinical dataset: The clinical dataset (Figure 5) included semi-annual and annual CSF-MS results and MDS-UPDRS scores [20, 21] from 248 patients, tracked at 0, 6, 12, and 24 months to support disease severity prediction.
(4) Gait dataset & integration with clinical data (updated clinical data): The clinical dataset was expanded with gait analysis data [22], incorporating gait speed, step length, freezing episodes, stride variability, and balance metrics. These parameters were aligned with UPDRS III scores, serving as the primary merging metric due to their comprehensive evaluation of motor symptoms like bradykinesia, rigidity, tremor, and postural instability (Figure 6). Unlike variables such as patient ID or visit month, which might have missing data, UPDRS III provides a consistent reference point for integrating clinical and gait data. As a quantitative measure of motor impairment, UPDRS III facilitates direct comparison with gait-derived features, improving alignment and prediction accuracy. Using UPDRS III as a reference ensures that gait metrics are meaningfully associated with motor severity and disease progression.
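One way to picture a merge keyed on UPDRS III is a nearest-score join. The sketch below is an assumption, not the authors' procedure: the text does not say whether scores were matched exactly or approximately, and the column names (updrs_3, gait_speed, step_length) and all values are hypothetical toy data.

```python
import pandas as pd

# Toy clinical visits with a UPDRS III score (hypothetical values).
clinical = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "visit_month": [0, 6, 0],
    "updrs_3": [12.0, 18.0, 30.0],
})

# Toy gait records that also carry a UPDRS III score (hypothetical values).
gait = pd.DataFrame({
    "gait_updrs_3": [10.0, 20.0, 31.0],
    "gait_speed": [1.20, 0.95, 0.70],
    "step_length": [0.65, 0.55, 0.40],
})

# merge_asof needs both frames sorted on their join keys; direction="nearest"
# attaches to each clinical row the gait record with the closest score.
merged = pd.merge_asof(
    clinical.sort_values("updrs_3"),
    gait.sort_values("gait_updrs_3"),
    left_on="updrs_3",
    right_on="gait_updrs_3",
    direction="nearest",
)
```

Keeping the gait-side score in its own column (gait_updrs_3) makes it possible to inspect how far each matched record is from the clinical score.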
(5) Integration of UniProt protein dataset with clinical data: To predict MDS-UPDRS scores (updrs_1, updrs_2, updrs_3, updrs_4), protein and peptide data were integrated with clinical assessments through the following steps:
Direct Patient Association: All datasets shared the patient_id, linking protein and peptide expressions to individual clinical profiles.
Temporal Tracking: visit_id aligned molecular and clinical data with specific visits for longitudinal modeling.
Improved Predictive Power:
Protein data revealed underlying biological changes,
Gait data captured motor function impairments,
Clinical data provided a holistic view of disease severity.
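The patient- and visit-level linkage described above can be sketched with pandas. This is a minimal illustration under stated assumptions: the long-format protein table, the column names UniProt and NPX, the visit_id format, and all values are hypothetical; only patient_id, visit_id, and the updrs_* targets are named in the text.

```python
import pandas as pd

# Long-format protein table: one row per (visit, protein). Toy values.
proteins = pd.DataFrame({
    "visit_id": ["55_0", "55_0", "55_6"],
    "patient_id": [55, 55, 55],
    "UniProt": ["O00391", "P05067", "O00391"],
    "NPX": [11.2, 9.8, 11.5],
})

# Clinical table: one row per visit with the target score. Toy values.
clinical = pd.DataFrame({
    "visit_id": ["55_0", "55_6"],
    "patient_id": [55, 55],
    "updrs_3": [15.0, 21.0],
})

# Pivot to one row per visit and one column per protein.
wide = proteins.pivot_table(
    index=["patient_id", "visit_id"], columns="UniProt", values="NPX"
).reset_index()

# A left join keeps every clinical visit; proteins not measured at a
# visit appear as NaN and can be handled during preprocessing.
merged = clinical.merge(wide, on=["patient_id", "visit_id"], how="left")
```

A left join from the clinical side is a design choice for this sketch: it preserves every visit that has a target score, even when some molecular measurements are absent.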
2. Data Preprocessing
3. Feature Selection
1) Dimensionality reduction: from principal component analysis (PCA) to mutual information (MI)
Capture nonlinear dependencies between features and target variables.
Preserve biological interpretability.
Model complex relationships in biological data, such as protein expression linked to clinical outcomes.
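The ability of MI to pick up nonlinear dependencies can be illustrated with scikit-learn's mutual_info_regression. The sketch below uses synthetic data, not the study's features: one feature is related to the target quadratically and the other is pure noise.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 500

# Feature 0 drives the target through a nonlinear (quadratic) relation;
# feature 1 is unrelated noise. Both are synthetic.
x_informative = rng.uniform(-2, 2, n)
x_noise = rng.normal(size=n)
X = np.column_stack([x_informative, x_noise])
y = x_informative ** 2 + 0.05 * rng.normal(size=n)

# Estimated MI between each feature and the target.
mi = mutual_info_regression(X, y, random_state=0)

# Feature indices ordered from most to least informative.
ranking = np.argsort(mi)[::-1]
```

Because MI scores are computed per original feature, the selected columns remain named proteins or peptides rather than linear combinations, which is the interpretability advantage over PCA noted above.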
2) Encoding clinical treatment information
On – Ingested, with a good prognosis.
Off – Ingested, but poor prognosis.
No – No ingestion or missing data.
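The three-level medication state can be turned into a numeric feature in several ways. The sketch below is one possibility under stated assumptions: the integer codes, the function name encode_state, and the handling of unrecognized labels are all hypothetical, since the text only defines the three categories and says that missing data falls under "No".

```python
# Hypothetical integer codes for the three categories in the text.
STATE_CODES = {"No": 0, "Off": 1, "On": 2}


def encode_state(raw) -> int:
    """Map a raw medication-state entry to an integer code.

    Missing values (None, NaN, empty strings) and unrecognized labels
    are treated as "No", following the "no ingestion or missing data" rule.
    """
    if not isinstance(raw, str) or raw.strip() == "":
        return STATE_CODES["No"]
    label = raw.strip().capitalize()
    return STATE_CODES.get(label, STATE_CODES["No"])


encoded = [encode_state(v) for v in ["On", "Off", None, " on ", ""]]
```

Integer codes impose an ordering on the categories; a one-hot encoding would avoid that assumption if the downstream model is sensitive to it.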
4. Model Training and Validation
1) ML based regression models
2) Training workflow
Data Structuring: Merged features using visit identifiers.
Cleaning: Removed rows with missing targets.
Formatting: Converted training data to TensorFlow dataset format for GPU/TPU efficiency.
Model Setup: Initialized random forest, linear regression, decision tree, and k-nearest neighbors (KNN) regressors with mean squared error (MSE) as the loss metric.
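The workflow steps above can be sketched with scikit-learn on synthetic data. This is an illustration under stated assumptions: the feature names f1 to f3, the train/test split, and all hyperparameters are hypothetical, and the TensorFlow dataset conversion step is omitted to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic merged table: three features and one target with some gaps.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
df["updrs_3"] = 2.0 * df["f1"] - df["f2"] + rng.normal(scale=0.1, size=200)
df.loc[::25, "updrs_3"] = np.nan  # simulate missing targets

# Cleaning: remove rows whose target is missing.
df = df.dropna(subset=["updrs_3"])

X, y = df[["f1", "f2", "f3"]], df["updrs_3"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Model setup: the four regressors named in the text.
models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "knn": KNeighborsRegressor(n_neighbors=5),
}

# Fit each model and score it with MSE on the held-out split.
scores = {
    name: mean_squared_error(y_test, model.fit(X_train, y_train).predict(X_test))
    for name, model in models.items()
}
```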
4) Custom ensemble model
Linear regression: Served as a baseline model.
XGBRegressor: Captured nonlinear relationships and boosted weak learners.
Random forest & gradient boosting: Robust to noise and biologically interpretable.
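A simple way to combine such base learners is equal-weight averaging, sketched below with scikit-learn's VotingRegressor. The averaging rule is an assumption, since the text does not state how member predictions were combined; XGBRegressor is left out so the sketch depends only on scikit-learn, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.linear_model import LinearRegression

# Synthetic data with one linear and one nonlinear effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(scale=0.1, size=150)

# Equal-weight ensemble of a linear baseline and two tree ensembles.
ensemble = VotingRegressor(
    estimators=[
        ("linear", LinearRegression()),
        ("boosting", GradientBoostingRegressor(random_state=0)),
        ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
    ]
)
ensemble.fit(X, y)

# The ensemble prediction is the mean of the fitted members' predictions.
ensemble_pred = ensemble.predict(X)
member_preds = np.array([m.predict(X) for m in ensemble.estimators_])
```

VotingRegressor also accepts a weights argument, which would allow stronger members to contribute more than the linear baseline.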
5) Phase-shift ensembling for temporal modeling
6) Comparison with time-series ML techniques
Recurrent neural networks: Capture sequential dependencies but suffer from vanishing gradient problems for long-term predictions.
Long short-term memory (LSTM) networks: Retain long-term dependencies and are suitable for modeling progression, but are data- and computation-intensive.
Transformer models (e.g., temporal fusion transformer): Use self-attention to model complex, nonlinear disease trends; robust to missing data but require substantial hyperparameter tuning.
7) Model optimization & custom loss function
8) Final model validation
5. Testing
6. Performance Metrics
APR evaluates clinical validity, marking predictions as accurate if they fall within a specified range (e.g., ±10%) of actual values.
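The APR definition above can be written as the fraction of predictions that fall within a relative tolerance of the actual value. The sketch below is one reading of that definition: the function name apr, the default 10% tolerance, and the treatment of an actual value of zero (only an exact match counts) are assumptions.

```python
from typing import Sequence


def apr(actual: Sequence[float], predicted: Sequence[float],
        tolerance: float = 0.10) -> float:
    """Fraction of predictions within +/- tolerance of the actual value.

    The tolerance is relative to each actual value, so an actual value
    of zero is matched only by a prediction of exactly zero.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one prediction is required")
    hits = sum(
        abs(p - a) <= tolerance * abs(a) for a, p in zip(actual, predicted)
    )
    return hits / len(actual)


# Toy scores: three of the four predictions fall inside the 10% band.
rate = apr([10.0, 20.0, 40.0, 50.0], [10.5, 23.0, 38.0, 50.0])
```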


