Ridit Score Regression: Deep Learning vs Classical Approaches
| Model | MAE ↓ | Pearson r ↑ | Spearman ρ ↑ | Train Time |
|---|---|---|---|---|
| **Deep Learning (3D ResNet-18)** | | | | |
| DL 32³ Best | 0.140 ± 0.017 | 0.598 ± 0.094 | 0.544 ± 0.085 | 2.0 hrs |
| DL 64³ | 0.140 ± 0.010 | 0.589 ± 0.068 | 0.526 ± 0.059 | 9.7 hrs |
| **Classical (Image Statistics)** | | | | |
| Linear Regression | 0.164 | 0.228 | 0.202 | — |
| Random Forest | 0.165 | 0.182 | 0.148 | — |
| Gradient Boosting | 0.165 | 0.197 | 0.167 | — |
| **Radiomics (PyRadiomics Features)** | | | | |
| Linear Regression | 0.164 | 0.216 | 0.218 | — |
| Random Forest | 0.164 | 0.211 | 0.201 | — |
| Gradient Boosting | 0.164 | 0.206 | 0.187 | — |
Per-fold results, DL 32³ (best configuration):

| Fold | MAE | Pearson r | Best Epoch |
|---|---|---|---|
| Fold 0 | 0.127 | 0.680 | 58 |
| Fold 1 | 0.124 | 0.694 | 62 |
| Fold 2 | 0.168 | 0.415 | 45 |
| Fold 3 | 0.138 | 0.622 | 71 |
| Fold 4 | 0.143 | 0.579 | 55 |
| Mean ± Std | 0.140 ± 0.017 | 0.598 ± 0.094 | — |
Per-fold results, DL 64³:

| Fold | MAE | Pearson r | Best Epoch |
|---|---|---|---|
| Fold 0 | 0.133 | 0.642 | 74 |
| Fold 1 | 0.130 | 0.668 | 81 |
| Fold 2 | 0.153 | 0.497 | 63 |
| Fold 3 | 0.139 | 0.601 | 78 |
| Fold 4 | 0.144 | 0.535 | 69 |
| Mean ± Std | 0.140 ± 0.010 | 0.589 ± 0.068 | — |
**YES.** The best model achieves Pearson r = 0.598, confirming that raw CT voxels contain a learnable signal for radiologist-perceived subtlety. This is a strong result given the inherent inter-reader variability in the LIDC-IDRI ground truth (4 readers per nodule, often disagreeing).
**YES.** DL achieves roughly 2.6× the correlation of the best classical or radiomics model (r = 0.598 vs r = 0.228). Classical and radiomics approaches converge near the population-mean baseline, suggesting hand-crafted features fail to capture the perceptual factors that make a nodule subtle or obvious.
**MINIMAL DIFFERENCE.** The 32³ crop matches 64³ performance (r = 0.598 vs 0.589) while training nearly 5× faster (2.0 hrs vs 9.7 hrs). This suggests the critical information for subtlety assessment is concentrated near the nodule center; extra surrounding context adds noise without adding signal.
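The 32³ variant amounts to taking a centered crop of the larger patch. A minimal numpy sketch (the function name and shapes are illustrative, not from the codebase):

```python
import numpy as np

def center_crop(volume: np.ndarray, size: int = 32) -> np.ndarray:
    """Extract a centered cubic crop from a 3D volume."""
    starts = [(dim - size) // 2 for dim in volume.shape]
    return volume[starts[0]:starts[0] + size,
                  starts[1]:starts[1] + size,
                  starts[2]:starts[2] + size]

vol = np.zeros((64, 64, 64), dtype=np.float32)   # placeholder 64³ patch
patch = center_crop(vol, 32)
print(patch.shape)  # (32, 32, 32)
```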
**MODEST SIGNAL.** Even the five most informative radiomics features collectively yield only r ≈ 0.22, reinforcing the need for learned representations.
Paired t-tests (across 5 CV folds) confirm that deep learning improvements are statistically significant:
| Comparison | MAE A | MAE B | t-stat | p-value | Sig. (p<0.05) |
|---|---|---|---|---|---|
| DL 64³ vs Classical GB | 0.142 | 0.165 | 4.29 | 0.013 | Yes |
| DL 64³ vs Classical LR | 0.142 | 0.164 | 4.12 | 0.015 | Yes |
| DL 64³ vs Classical RF | 0.142 | 0.165 | 4.51 | 0.011 | Yes |
| DL 64³ vs Radiomics GB | 0.142 | 0.164 | 3.77 | 0.020 | Yes |
| DL 64³ vs Radiomics LR | 0.142 | 0.164 | 4.12 | 0.015 | Yes |
| DL 64³ vs Radiomics RF | 0.142 | 0.164 | 3.99 | 0.016 | Yes |
| DL 32³ vs DL 64³ | 0.142 | 0.142 | 0.10 | 0.923 | No |
| DL 32³ vs Classical GB | 0.142 | 0.165 | 2.36 | 0.078 | No |
Key takeaway: DL 64³ is significantly better than all baselines (p < 0.05). DL 32³ trends in the same direction (p ≈ 0.08) but does not reach significance due to higher fold variance (Fold 2 instability). The two DL variants are statistically indistinguishable (p = 0.92), supporting the recommendation to use the faster 32³ crop.
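The paired-test procedure above can be sketched with the standard library; the fold MAEs in the example call are illustrative values, not the reported ones:

```python
import math
from statistics import mean, stdev

def paired_t(maes_a, maes_b):
    """Paired t-statistic over per-fold MAE pairs (model A vs model B)."""
    d = [a - b for a, b in zip(maes_a, maes_b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Illustrative per-fold MAEs for two hypothetical models:
t = paired_t([0.14, 0.15, 0.13, 0.16, 0.14],
             [0.16, 0.17, 0.16, 0.17, 0.16])
print(t)  # negative: model A has the lower MAE
```

The p-values in the table would then come from the t-distribution with n−1 = 4 degrees of freedom (e.g. `scipy.stats.ttest_rel`).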
Figure (predicted vs. true): each point is one nodule; perfect predictions fall on the red diagonal. DL models cluster more tightly around the diagonal than the classical/radiomics baselines.
Figure (error histogram): distribution of prediction errors (predicted minus true). A distribution centered at zero indicates no systematic bias; a narrower distribution indicates more precise predictions.
Backbone: MONAI pretrained 3D ResNet-18, modified for single-channel (CT) input. Global average pooling yields a 512-dim feature vector.
Fusion (when tabular features present): 512-dim image features concatenated with 19 normalized tabular features (morphology, radiologist ratings, spatial agreement), passed through a 256-dim FC fusion layer with ReLU + dropout.
Head: RegressionHead — single scalar output predicting continuous Ridit score in [0, 1], with dropout (0.5) for regularization.
Loss: Huber loss (delta=1.0) with entropy-based per-sample weighting (Whitaker method: α=0.5, γ=3.0). Huber loss is quadratic (L2-like) for residuals within the delta threshold and linear (L1-like) beyond it, giving smooth gradients for small errors and robustness to outliers.
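The loss can be sketched in numpy as below. The entropy-based Whitaker weighting itself is not reproduced here, so the per-sample weights are taken as given:

```python
import numpy as np

def huber(residual, delta=1.0):
    """Elementwise Huber loss: quadratic within |r| <= delta, linear beyond."""
    a = np.abs(residual)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

def weighted_huber_loss(pred, target, weights, delta=1.0):
    """Weighted mean of per-sample Huber losses; `weights` would come from
    the entropy-based scheme (not reproduced in this sketch)."""
    return float(np.average(huber(pred - target, delta), weights=weights))

loss = weighted_huber_loss(np.array([0.5, 2.0]), np.zeros(2), np.ones(2))
print(loss)  # (0.125 + 1.5) / 2 = 0.8125
```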
| Category | Features | Count |
|---|---|---|
| Morphology | mean_volume, surface_area, diameter, std_diameter, max_dimension, aspect_ratio_1, aspect_ratio_2 | 7 |
| Radiologist Ratings | malignancy, sphericity, margin, lobulation, spiculation, texture | 6 |
| Spatial Agreement | centroid_std_x, centroid_std_y, centroid_std_z, centroid_variance_3d | 4 |
| Agreement | num_raters, subtlety_consensus_pct | 2 |
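These 19 features are what the fusion layer consumes alongside the 512-dim image embedding. A minimal sketch of assembling the tabular vector (the zero arrays are placeholders for real per-nodule values):

```python
import numpy as np

# Placeholder values for one nodule; ordering follows the table above.
morphology = np.zeros(7)   # mean_volume .. aspect_ratio_2
ratings    = np.zeros(6)   # malignancy .. texture
spatial    = np.zeros(4)   # centroid_std_x .. centroid_variance_3d
agreement  = np.zeros(2)   # num_raters, subtlety_consensus_pct

# Concatenate into the 19-dim vector the 256-dim fusion layer expects
# (after per-feature normalization with training-set statistics).
tabular = np.concatenate([morphology, ratings, spatial, agreement])
print(tabular.shape)  # (19,)
```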
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 1e-4 |
| Scheduler | Cosine Annealing |
| Batch Size | 16 |
| Max Epochs | 100 |
| Early Stopping Patience | 20 epochs |
| Cross-Validation | 5-fold stratified (by binned Ridit) |
| Weighting | Whitaker entropy-based (α=0.5, γ=3.0) |
| HU Clipping | [-1000, 400] → [0, 1] |
| Resampling | 1 mm isotropic |
| 3D Augmentation | Rotation, flip, scale, Gaussian noise, intensity shift, elastic deformation |
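The HU clipping row maps raw Hounsfield units into [0, 1] before the network sees them; a two-line sketch (function name is illustrative):

```python
import numpy as np

def normalize_hu(volume, lo=-1000.0, hi=400.0):
    """Clip Hounsfield units to [lo, hi] and rescale linearly to [0, 1]."""
    v = np.clip(volume, lo, hi)
    return (v - lo) / (hi - lo)

print(normalize_hu(np.array([-2000.0, -1000.0, -300.0, 400.0, 1000.0])))
# air and below -> 0.0, soft tissue -300 HU -> 0.5, 400 HU and above -> 1.0
```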
Source: LIDC-IDRI (Lung Image Database Consortium & Image Database Resource Initiative)
Nodules: 2,651 entries, each rated by 4 radiologists on subtlety (1 = extremely subtle, 5 = obvious).
Target variable: Ridit score — continuous measure of "percentile of obviousness" derived from ordinal ratings, range [0, 1].
Held-out test set: 45 concordant nodules (all 4 raters agree) from Single Concordant Nodule Subtlety Ratings dataset.
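The report does not spell out the exact Ridit derivation; under the standard (Bross) definition, each ordinal category's ridit is the proportion of the reference distribution below it plus half the proportion within it:

```python
from collections import Counter

def ridit_scores(ratings):
    """Standard (Bross) ridit per ordinal category, each in [0, 1]."""
    n = len(ratings)
    counts = Counter(ratings)
    ridits, below = {}, 0.0
    for cat in sorted(counts):          # ascending ordinal categories
        p = counts[cat] / n             # proportion within this category
        ridits[cat] = below + p / 2.0   # mass below + half the mass within
        below += p
    return ridits

print(ridit_scores([1, 1, 2, 2]))  # {1: 0.25, 2: 0.75}
```

A nodule's continuous target could then be, e.g., the mean ridit of its four ratings relative to the population distribution; the exact aggregation used here is an assumption.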
All experiments can be reproduced with the following commands from the repository root:
```bash
# 1. Install dependencies
pip install -e ".[dev,notebook]"

# 2. Verify GPU availability
python -c "import torch; print(torch.cuda.is_available())"

# 3. Preprocess raw scans to .npy patches (with tabular features)
python preprocess_data.py \
    --annotations docs/lidc_final_analysis_MW1.csv \
    --raw-dir data/raw \
    --output-dir data/processed \
    --excel "docs/lidc_final_analysis (1).xlsx"

# 4. Train DL with 64x64x64 crops (5-fold CV)
python run_training.py --config configs/regression.yaml --mode cv

# 5. Train DL with 32x32x32 crops (5-fold CV)
python run_training.py --config configs/regression_32.yaml --mode cv

# 6. Run classical ML baseline
python run_classical_baseline.py --config configs/regression.yaml

# 7. Run radiomics baseline
python run_radiomics_baseline.py --config configs/regression.yaml

# 8. Run unified evaluation across all models
python run_evaluation.py --results-dir results/

# 9. Run tests
pytest tests/
```