A simple guide to what our AI does and how well it works
Imagine you go to the doctor and get a CT scan of your lungs. The doctor sees a small spot — called a nodule — on the scan. Sometimes these nodules are easy to see, and sometimes they're really hard to find.
Now imagine a lawsuit where someone says their doctor missed a nodule on a CT scan. One expert doctor says "That nodule was obvious — any doctor should have seen it!" Another expert says "That nodule was extremely hard to see — it's not the doctor's fault."
Who is right? Right now, there's no objective way to answer that question. It's just one opinion against another.
Our AI looks at a CT scan and answers one simple question: what percentage of radiologists would detect this nodule?

For example, the AI might say: "96% of radiologists would detect this nodule."

This gives lawyers, judges, and juries a number instead of an opinion.
Radiologists rate how easy a nodule is to see on a scale from 1 to 5, where 1 means very hard to see and 5 ("Obvious") means very easy to see.
But here's the thing: different doctors often disagree! One doctor might say a nodule is a 3 ("Intermediate") while another says it's a 5 ("Obvious"). That's normal — just like how two teachers might grade the same essay differently.
Our dataset had 4 different doctors rate each nodule. For example, one nodule might receive ratings of 3, 4, 4, and 5:

Average rating: 4.0 — "Moderately Obvious"

But notice they don't all agree! Ratings range from 3 to 5.
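The arithmetic is simple enough to show in a few lines of Python (the four ratings below are made up for illustration):

```python
# Hypothetical subtlety ratings from 4 radiologists for one nodule (1-5 scale).
ratings = [3, 4, 4, 5]

mean_rating = sum(ratings) / len(ratings)  # the average the AI learns to predict
spread = max(ratings) - min(ratings)       # how far apart the doctors are

print(mean_rating)  # 4.0
print(spread)       # 2
```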
Our AI learned from a huge public dataset called LIDC-IDRI: 1,018 CT scans from 7 hospitals, rated by 12 radiologists. Think of it like a textbook with answer keys: the scans are the practice problems, and the doctors' ratings are the answers the AI checks itself against.
For each nodule, the AI looks at two things:
1. The CT image — a small 3D cube (64 x 64 x 64 voxels) cut out around the nodule. Think of it like a tiny 3D photo of just the area around the suspicious spot.
2. Clinical measurements (19 features) — numbers that describe the nodule, such as its size, shape, and location.
The AI combines both the image and the measurements to make its prediction. This is like how a doctor doesn't just look at the picture — they also consider the size, shape, and location of what they see.
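As a rough sketch of the image input, here is how a 64 x 64 x 64 cube might be cut out of a scan with NumPy. The random array stands in for a real CT volume, and `crop_cube` is an illustrative helper rather than our actual code; it assumes the nodule sits far enough from the scan edges.

```python
import numpy as np

# Random stand-in for a CT volume; a real scan is much larger and not random.
scan = np.random.rand(128, 128, 128).astype(np.float32)

def crop_cube(volume, center, size=64):
    """Cut a size x size x size cube centered on the nodule location."""
    half = size // 2
    return volume[tuple(slice(c - half, c + half) for c in center)]

cube = crop_cube(scan, center=(64, 64, 64))  # the "tiny 3D photo"
clinical = np.zeros(19, dtype=np.float32)    # placeholder for the 19 measurements

print(cube.shape)  # (64, 64, 64)
```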
We took several steps to make sure the AI is actually learning, not just memorizing:
Some patients have multiple nodules. If we put one nodule from a patient in the training set and another in the test set, the AI might recognize the patient rather than the nodule. So we made sure all nodules from the same patient stay in the same group.
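The rule above can be sketched like this; the nodule and patient IDs are hypothetical:

```python
# Hypothetical (nodule_id, patient_id) pairs; patient p1 has two nodules.
nodules = [("n1", "p1"), ("n2", "p1"), ("n3", "p2"), ("n4", "p3"), ("n5", "p3")]

def patient_aware_split(nodules, test_patients):
    """Keep every nodule from the same patient on the same side of the split."""
    train, test = [], []
    for nodule_id, patient_id in nodules:
        (test if patient_id in test_patients else train).append(nodule_id)
    return train, test

train_ids, test_ids = patient_aware_split(nodules, test_patients={"p3"})
print(train_ids)  # ['n1', 'n2', 'n3']
print(test_ids)   # ['n4', 'n5']
```

Because the split is decided per patient, both of p3's nodules land in the test set together.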
We set aside 44 special nodules where all 4 doctors agreed they could see the nodule. These 44 cases were completely hidden from the AI during all training. They're like a secret final exam that the AI never got to practice on.
The training data came from 7 different hospitals using 8 different CT scanner brands. This means the AI learned general patterns — not quirks of one particular machine or hospital.
MAE stands for Mean Absolute Error. It's just a fancy way of saying "on average, how far off is the prediction?"
Here's a concrete example from our AI: its Detection MAE of 5.5% on held-out cases means that, on average, when the AI says "X% of doctors would detect this nodule," the real answer is within about 5.5 percentage points.
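In code, MAE is a one-liner; the predicted and true detection rates below are invented for illustration:

```python
# Hypothetical predicted vs. true detection rates (%) for five nodules.
predicted = [92.0, 88.5, 97.0, 81.0, 95.0]
actual    = [100.0, 90.0, 100.0, 75.0, 100.0]

# Average size of the miss, ignoring whether it was too high or too low.
mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)
print(mae)  # 4.7
```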
The AI doesn't just give one number. It gives 5 probabilities that answer increasingly specific questions, one for each step on the 1-to-5 subtlety scale: from "would radiologists detect this nodule at all?" up to "would they call it 'Obvious'?"
Notice how the probabilities shrink as the questions get more specific. That's because each question is harder to satisfy: almost all doctors would at least detect the nodule (96%), but only 30% would call it "Obvious." These probabilities always go down — never up. This is mathematically guaranteed by our model design.
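Mathematically, these are "at least this level" (cumulative) probabilities, which can only shrink as the level rises. A sketch with made-up per-level probabilities, chosen so the two numbers above (96% and 30%) fall out:

```python
# Hypothetical probabilities P(rating == k) for subtlety levels k = 1..5.
p_level = [0.04, 0.16, 0.30, 0.20, 0.30]

# P(rating >= k) sums the tail of the distribution; raising the bar can only
# remove probability, so the cumulative values never increase.
p_at_least = [sum(p_level[k:]) for k in range(5)]

print(round(p_at_least[1], 2))  # 0.96 -> the 96% "would detect it"
print(round(p_at_least[4], 2))  # 0.3  -> the 30% "would call it Obvious"
```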
The AI's error also varies by level (lower is better): it is most accurate at the extremes, such as predicting whether a nodule is "Obvious," and less accurate in the middle ground. This makes sense: the really obvious and really subtle nodules are easy to distinguish, while the middle is exactly where even doctors disagree.
Remember those 44 special nodules we set aside? These are cases where all 4 doctors agreed they could see the nodule. We have tested 17 of them so far (the scans for the rest are still being downloaded).
For every single one of these nodules, the AI correctly predicted that a high percentage of radiologists would detect it. Even the AI's "worst" prediction (78.9% for LIDC-IDRI-0828) still correctly says the majority of doctors would find the nodule — which matches reality, since all 4 doctors actually did find it.
(Chart notes: numbers in parentheses = true mean subtlety rating from 4 doctors. All bars should be near 100%, since all 4 doctors detected these nodules. Green = 95%+, Blue = 75-95%.)
We trained the model 4 times, making improvements each round:
| Run | What Changed | Nodules | Overall Error | Detection Error |
|---|---|---|---|---|
| Run 1 | First attempt | 1,078 | 25.4% | 30.1% |
| Run 2 | More data (fixed scan finder) | 1,795 | 26.6% | 35.3% |
| Run 3 | Longer training (patience=20) | 1,795 | 24.4% | 32.8% |
| Run 4 | Even longer (63 epochs) | 1,795 | 22.9% | 31.6% |
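Run 3's "patience=20" refers to early stopping: keep training until the validation error fails to improve for that many epochs in a row. A toy sketch, with made-up losses and the patience shortened to 2 for brevity:

```python
# Made-up validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.64, 0.65, 0.66, 0.67]

def early_stop(losses, patience=2):
    """Stop once `patience` epochs pass without a new best loss."""
    best, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best

best_epoch, best_loss = early_stop(val_losses)
print(best_epoch, best_loss)  # 4 0.64
```

Training halts two epochs after the best loss at epoch 4, rather than grinding through every epoch and risking memorization.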
No AI is 100% perfect, and our model still has room to improve.

Why are the held-out results (5.5% error) so much better than the general test (31.6% error)? The held-out set contains only concordant nodules, cases where all 4 doctors agreed, and those sit at the easy-to-predict extreme. The general test also includes the ambiguous middle cases where even the doctors disagree, which pull the average error up.

Ways to improve: download and test the remaining held-out nodules (17 of 44 so far), train on more data, and validate on scans from hospitals the AI has never seen.
The system isn't perfect yet — but it doesn't need to be for a POC. What matters is that the concept works: an AI can learn from expert radiologist ratings and produce objective, quantitative assessments of nodule detectability.
With more data and validation, this technology can provide courtrooms with something they've never had before: an objective, data-driven answer to "how detectable was this nodule?"
| Term | Definition |
|---|---|
| Nodule | A small spot or growth found on a CT scan of the lungs |
| CT Scan | A type of medical imaging that creates 3D pictures of the inside of the body |
| Radiologist | A doctor who specializes in reading medical images like CT scans |
| Subtlety Rating | A 1-5 score of how easy a nodule is to see (1 = very hard, 5 = very easy) |
| MAE | Mean Absolute Error — the average difference between what the AI predicted and the real answer |
| Detection Rate | The percentage of radiologists expected to find a nodule — the AI's main output |
| LIDC-IDRI | The public dataset of 1,018 CT scans from 7 hospitals, rated by 12 radiologists, used to train our AI |
| Concordant Nodule | A nodule where all 4 rating doctors agreed on its visibility — used as our "gold standard" test |
| Patient-Aware Split | A way of dividing data so all of one patient's nodules stay together, preventing the AI from recognizing patients instead of nodules |
| Held-Out Set | Data that is completely hidden from the AI during training, used as the final "surprise test" |
| Epoch | One complete pass through all training data — like reading the entire textbook once |
| Early Stopping | Automatically stopping training when the AI stops improving, to prevent over-studying (memorizing rather than learning) |