We trained two versions of our AI. One used 19 inputs including opinions from radiologists (doctors who read CT scans). The other used only 7 physical measurements — things a computer can calculate from the scan itself, like the size and shape of the nodule.
Removing the doctor opinions made the AI more accurate, not less.
This is a massive simplification for our product. Here's what changed:
The model uses two types of information to make its prediction:
A deep learning model called a 3D ResNet-18 looks at a small cube of the CT scan around the nodule (64×64×64 voxels, about 2.5 inches on each side). It automatically learns to recognize visual patterns in and around the nodule.
This is similar to how a doctor would "eyeball" a scan, except the AI can detect patterns too subtle for humans to articulate.
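To make the patch idea concrete, here is a minimal sketch (in Python with NumPy; the actual pipeline is an assumption, not shown here) of how a fixed 64×64×64 cube might be cut out of a CT volume around a nodule's center, with zero-padding when the nodule sits near the edge of the scan:

```python
import numpy as np

def extract_patch(volume, center, size=64):
    """Cut a size^3 cube around a nodule centroid, zero-padding at borders.

    volume: 3D NumPy array (the CT scan); center: (z, y, x) voxel coordinates.
    Illustrative sketch only, not the production preprocessing code.
    """
    half = size // 2
    patch = np.zeros((size, size, size), dtype=volume.dtype)
    # Clamp the source window to the volume bounds.
    src = [slice(max(c - half, 0), min(c + half, s))
           for c, s in zip(center, volume.shape)]
    # Shift the destination window by the same amount that was clipped.
    dst = [slice(s.start - (c - half), s.stop - (c - half))
           for s, c in zip(src, center)]
    patch[tuple(dst)] = volume[tuple(src)]
    return patch

# Demo on a synthetic scan: a 100^3 volume of ones.
ct = np.ones((100, 100, 100), dtype=np.float32)
patch = extract_patch(ct, (50, 60, 40))
```

The clamping logic matters in practice: a nodule near the lung border would otherwise produce an undersized patch that the network cannot accept.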
These are simple, objective numbers calculated directly from the scan, such as the size and shape of the nodule.
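As an illustration of what "objective numbers a computer can calculate from the scan" look like, here is how a few typical size-and-shape measurements could be computed from a binary nodule mask with NumPy. The exact 7 features the model uses are an assumption here and may differ:

```python
import numpy as np

def morphology_features(mask, spacing=(1.0, 1.0, 1.0)):
    """Illustrative size/shape measurements from a binary nodule mask.

    mask: 3D boolean array; spacing: voxel size in mm along (z, y, x).
    Sketch only; the model's real feature set is not spelled out here.
    """
    voxel_vol = float(np.prod(spacing))
    volume_mm3 = mask.sum() * voxel_vol
    # Diameter of a sphere with the same volume as the nodule.
    eq_diameter = (6.0 * volume_mm3 / np.pi) ** (1.0 / 3.0)
    # Bounding-box extents in mm, from the marked voxel coordinates.
    coords = np.argwhere(mask)
    extents = (coords.max(axis=0) - coords.min(axis=0) + 1) * np.array(spacing)
    elongation = extents.max() / extents.min()  # 1.0 = roughly isotropic
    return {
        "volume_mm3": float(volume_mm3),
        "equivalent_diameter_mm": float(eq_diameter),
        "max_extent_mm": float(extents.max()),
        "elongation": float(elongation),
    }
```

Because every number comes from arithmetic on the mask, two runs on the same scan always agree, which is exactly the property that makes these inputs "objective".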
We trained both models on the same 1,795 nodules from 555 patients using identical settings. The only difference is which input features they received.
| Metric | 19 Features (with doctor ratings) | 7 Features (measurements only) |
|---|---|---|
| Overall accuracy (lower error = better) | 24.83% error | 24.11% error (BETTER) |
| Detection accuracy ("Would a radiologist spot this?") | 32.82% error | 32.41% error (BETTER) |
| Obvious-nodule accuracy ("Is this clearly visible?") | 16.23% error | 15.50% error (BETTER) |
| Test loss (overall model quality score) | 0.5920 | 0.5745 (BETTER) |
| Expert Input Required? | Yes — radiologist must rate 6 characteristics | No — fully automated |
Before we even started training, we took 44 special nodules and locked them away. These are "concordant" nodules — cases where all 4 radiologists agreed the nodule was detectable. The AI never saw these during training, validation, or testing. This is the fairest possible test.
Of the 44, we had preprocessed patches available for 17. Here's how the model performed on each one:
For every single held-out nodule, the model correctly predicted it would be detected. On average, the predicted detection probability was off by less than 5 percentage points: 12 of the 17 cases were within 5 points, and the largest miss was 16.1 points.
Each row is a real nodule from a real patient. The "Actual" column is what the 4 radiologists said. The "Predicted" column is what our AI said, having never seen these cases.
| Patient | Actual Subtlety Rating | Predicted Detection Probability | Error (points) |
|---|---|---|---|
| LIDC-IDRI-0423 | 5.00 (all said "obvious") | 100.0% | 0.0% |
| LIDC-IDRI-0946 | 4.75 | 99.7% | 0.3% |
| LIDC-IDRI-0697 | 5.00 (all said "obvious") | 99.5% | 0.5% |
| LIDC-IDRI-0796 | 5.00 (all said "obvious") | 99.5% | 0.5% |
| LIDC-IDRI-0363 | 4.00 | 99.5% | 0.5% |
| LIDC-IDRI-0403 | 4.75 | 99.3% | 0.7% |
| LIDC-IDRI-0915 | 3.50 | 98.4% | 1.6% |
| LIDC-IDRI-0597 | 4.50 | 98.4% | 1.6% |
| LIDC-IDRI-0803 | 3.75 | 97.4% | 2.6% |
| LIDC-IDRI-0941 | 3.75 | 97.1% | 2.9% |
| LIDC-IDRI-0017 | 3.50 | 96.8% | 3.2% |
| LIDC-IDRI-0699 | 3.25 | 95.5% | 4.5% |
| LIDC-IDRI-0475 | 2.25 (hardest case) | 93.4% | 6.6% |
| LIDC-IDRI-0681 | 3.50 | 87.9% | 12.1% |
| LIDC-IDRI-0644 | 3.75 | 86.0% | 14.0% |
| LIDC-IDRI-0828 | 3.75 | 85.8% | 14.2% |
| LIDC-IDRI-0580 | 3.75 | 83.9% | 16.1% |
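The headline claims can be checked directly from the table above with a few lines of arithmetic: averaging the Error column gives the mean miss in percentage points, and counting entries under 5 shows how many predictions were tight:

```python
# Absolute errors (percentage points) from the 17 held-out nodules above.
errors = [0.0, 0.3, 0.5, 0.5, 0.5, 0.7, 1.6, 1.6, 2.6,
          2.9, 3.2, 4.5, 6.6, 12.1, 14.0, 14.2, 16.1]

mean_abs_error = sum(errors) / len(errors)
print(f"mean absolute error: {mean_abs_error:.1f} points")   # 4.8
within_5 = sum(e < 5.0 for e in errors)
print(f"{within_5} of {len(errors)} nodules within 5 points")  # 12 of 17
```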
The model doesn't give a single score — it answers 5 different questions about each nodule. Here's how accurate it is at each one:
Green = morphology-only (7 features). Gray = previous model (19 features). Green wins at every threshold.
When a lawyer or insurance company submits a CT scan, they get back something like this:
Based on AI analysis of this lung nodule using the LIDC-IDRI reference dataset (1,795 nodules rated by 12 board-certified radiologists across 7 institutions):
| Score | Plain-English meaning |
|---|---|
| P(S≥1) = 94% | 94 out of 100 radiologists would detect this finding at some level |
| P(S≥2) = 87% | 87 out of 100 would rate it at least moderately detectable |
| P(S≥3) = 73% | 73 out of 100 would consider it intermediate or more obvious |
| P(S≥4) = 51% | 51 out of 100 would rate it moderately obvious or more |
| P(S≥5) = 24% | 24 out of 100 would consider it obviously detectable |
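Because the five scores are cumulative (each is the probability of reaching at least that subtlety level), subtracting adjacent thresholds recovers the probability of each exact level. Using the sample numbers above:

```python
# Cumulative detection probabilities from the sample report: P(S >= k), k = 1..5.
cumulative = [0.94, 0.87, 0.73, 0.51, 0.24]

# P(S = k) = P(S >= k) - P(S >= k+1); the top level has no higher threshold.
per_level = [round(p - q, 2) for p, q in zip(cumulative, cumulative[1:] + [0.0])]
print(per_level)  # [0.07, 0.14, 0.22, 0.27, 0.24]
```

The per-level probabilities sum to P(S≥1) = 94%; the remaining 6% is the chance that the finding is not detected at all.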
This assessment was generated using only the CT imaging data and automated physical measurements. No subjective expert ratings were used as inputs.
The model was trained on LIDC-IDRI, the gold standard dataset for lung nodule research. It has been cited in over 500 published scientific papers. Every nodule was independently rated by up to 4 board-certified radiologists.
In AI, "cheating" means the model accidentally gets hints about the answers during training. We took several steps to prevent this, most importantly splitting the data by patient: no patient's scans appear in both the training and test sets.
A single test can get lucky. To show our results are reliable and repeatable, we ran the same training process 5 times, each time holding out a different 20% of patients for testing. This technique, called cross-validation, is the standard way to prove an AI model works consistently.
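A minimal sketch of patient-level fold assignment (a hypothetical helper, not the actual training code): every nodule from a given patient is routed to the same fold, so no patient straddles the train/test boundary:

```python
import random

def patient_folds(patient_ids, n_folds=5, seed=42):
    """Assign each patient (not each nodule) to one of n_folds test folds.

    patient_ids: one entry per nodule, naming the patient it came from.
    Splitting by patient prevents leakage: all of a patient's nodules
    land in the same fold. Illustrative sketch, not the real pipeline.
    """
    patients = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(patients)
    fold_of = {p: i % n_folds for i, p in enumerate(patients)}
    # Map the per-patient folds back to per-nodule labels.
    return [fold_of[p] for p in patient_ids]

# Example: 6 nodules from 3 patients, split into 2 folds.
folds = patient_folds(["a", "a", "b", "b", "c", "c"], n_folds=2)
```

Splitting by nodule instead would let two nodules from the same scan appear on both sides of the split, which is exactly the kind of hint the model must not receive.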
The results are remarkably stable. The spread between the best and worst fold is only 3.6 percentage points.
| Fold | Val Loss | Overall MAE | Detection MAE | Best Epoch |
|---|---|---|---|---|
| Fold 1 | 0.5890 | 25.0% | 32.5% | 66 |
| Fold 2 | 0.5823 | 25.4% | 32.9% | 46 |
| Fold 3 | 0.5594 | 24.9% | 32.8% | 44 |
| Fold 4 | 0.5389 | 22.8% | 30.7% | 55 |
| Fold 5 | 0.5390 | 21.8% | 30.3% | 85 |
| Mean ± Std | 0.562 ± 0.021 | 24.0% ± 1.4% | 31.8% ± 1.1% | — |
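The summary row can be reproduced from the per-fold numbers; the reported spread matches the population standard deviation, which is NumPy's default:

```python
import numpy as np

# Per-fold results from the table above.
val_loss = np.array([0.5890, 0.5823, 0.5594, 0.5389, 0.5390])
overall_mae = np.array([25.0, 25.4, 24.9, 22.8, 21.8])

print(f"{val_loss.mean():.3f} +/- {val_loss.std():.3f}")          # 0.562 +/- 0.021
print(f"{overall_mae.mean():.1f}% +/- {overall_mae.std():.1f}%")  # 24.0% +/- 1.4%
```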
The single-split results fall right within the cross-validation range, confirming they are representative.