Historical Report (March 2026) — Based on the original cumulative ordinal model. For the latest results, see the Ablation Study.

Evidentia Analytics — Results Explained

A simple guide to what our AI does and how well it works

The Problem We're Solving

Imagine you go to the doctor and get a CT scan of your lungs. The doctor sees a small spot — called a nodule — on the scan. Sometimes these nodules are easy to see, and sometimes they're really hard to find.

Now imagine a lawsuit where someone says their doctor missed a nodule on a CT scan. One expert doctor says "That nodule was obvious — any doctor should have seen it!" Another expert says "That nodule was extremely hard to see — it's not the doctor's fault."

Who is right? Right now, there's no objective way to answer that question. It's just one opinion against another.

Think of it like this: Imagine two people arguing about whether a test was "easy" or "hard." Instead of guessing, what if you could say: "83% of students who took this test got it right"? That's a fact, not an opinion. That's what we're building — but for doctors looking at CT scans.

What Our AI Does

Our AI looks at a CT scan and answers one simple question:

"What percentage of radiologists (doctors who read CT scans) would find this nodule?"

For example, the AI might say: "About 83% of radiologists would detect this nodule."

This gives lawyers, judges, and juries a number instead of an opinion.

The Subtlety Scale (1 to 5)

Radiologists rate how easy a nodule is to see on a scale from 1 to 5:

1 = Extremely Subtle
2 = Moderately Subtle
3 = Intermediate
4 = Moderately Obvious
5 = Obvious

But here's the thing: different doctors often disagree! One doctor might say a nodule is a 3 ("Intermediate") while another says it's a 5 ("Obvious"). That's normal — just like how two teachers might grade the same essay differently.

Our dataset had 4 different doctors rate each nodule. Here's an example of how they might rate the same nodule:

Dr. A: 4    Dr. B: 5    Dr. C: 3    Dr. D: 4

Average rating: 4.0 — "Moderately Obvious"
But notice they don't all agree! Ratings range from 3 to 5.

Our AI learns from this disagreement. Instead of just saying "the average is 4," the AI learns the full pattern of how doctors rated similar-looking nodules. It learns: "For nodules that look like this, about 100% of doctors rate it at least a 1, about 75% rate it at least a 4, and about 25% rate it a 5."
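
To see what that pattern looks like concretely, here is a minimal Python sketch (illustrative only, not our training code) that turns the four example ratings above into the cumulative fractions the AI learns to predict:

    # The four doctors' ratings from the example above.
    ratings = [4, 5, 3, 4]

    # For each threshold k, the fraction of doctors who rated at least k.
    for k in range(1, 6):
        frac = sum(r >= k for r in ratings) / len(ratings)
        print(f"P(S>={k}) = {frac:.0%}")

    # Prints: P(S>=1) = 100%, P(S>=2) = 100%, P(S>=3) = 100%,
    #         P(S>=4) = 75%,  P(S>=5) = 25%

Fractions like these, computed for every nodule, are in effect the pattern the AI learns to reproduce.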

How the AI Learns

Our AI learned from a huge dataset called LIDC-IDRI. Think of it like a textbook with answer keys:

1. Collect Scans: 1,795 nodules from 555 real patients at 7 hospitals.
2. Get Expert Ratings: 12 radiologists each rated nodules on the 1-5 scale.
3. AI Studies Them: the AI looks at each scan and learns what makes nodules easy or hard to see.
4. Test on New Cases: we test the AI on cases it has never seen before.

It's like studying for a test: Imagine you have 1,795 practice problems with answer keys. You study about 70% of them (1,279 problems). Then you check your understanding with about 15% more (270 problems) to see if you're learning well. Finally, you take a test with the last 246 problems (about 15%) that you've never seen before. Your score on that final test tells us how well you actually learned — not just memorized.

What the AI Actually Looks At

For each nodule, the AI looks at two things:

1. The CT image — a small 3D cube (64 x 64 x 64 voxels) cut out around the nodule. Think of it like a tiny 3D photo of just the area around the suspicious spot.

2. Clinical measurements (19 features) — numbers that describe the nodule, such as its size, shape, and location in the lung.

The AI combines both the image and the measurements to make its prediction. This is like how a doctor doesn't just look at the picture — they also consider the size, shape, and location of what they see.
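
The report doesn't publish the network architecture, but as a rough sketch of the two-input design, a PyTorch-style model might combine the image cube and the 19 measurements like this (all layer sizes are invented for illustration):

    import torch
    import torch.nn as nn

    class NoduleNet(nn.Module):
        # Hypothetical two-branch model: a small 3D CNN reads the
        # 64x64x64 image cube, an MLP reads the 19 clinical features,
        # and the two summaries are concatenated before the final layer.
        def __init__(self, n_features=19):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool3d(2),                  # 64^3 -> 32^3
                nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),          # 32 numbers summarizing the image
                nn.Flatten(),
            )
            self.mlp = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU())
            self.head = nn.Linear(32 + 32, 6)     # scores: "missed" + ratings 1-5

        def forward(self, image, features):
            fused = torch.cat([self.cnn(image), self.mlp(features)], dim=1)
            return self.head(fused)               # raw scores; a later sketch shows
                                                  # how these become P(S>=k)

    model = NoduleNet()
    scores = model(torch.randn(1, 1, 64, 64, 64), torch.randn(1, 19))

The point is the fusion: neither branch alone sees everything, just as the doctor in the analogy uses both the picture and the measurements.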

How We Made Sure the AI Isn't "Cheating"

We took several steps to make sure the AI is actually learning, not just memorizing:

1. Patient-Aware Splitting

Some patients have multiple nodules. If we put one nodule from a patient in the training set and another in the test set, the AI might recognize the patient rather than the nodule. So we made sure all nodules from the same patient stay in the same group.

Result: 388 patients for training, 83 for validation, 84 for testing. Zero patient overlap between groups.
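
The report doesn't say how the split was implemented; one standard tool that enforces exactly this rule is scikit-learn's GroupShuffleSplit, which keeps every row sharing a patient ID on the same side of the split (a sketch with hypothetical variable names):

    from sklearn.model_selection import GroupShuffleSplit

    def split_by_patient(nodules, patient_ids, test_size=0.15, seed=0):
        # All nodules with the same patient_id land on the same side
        # of the split, so no patient straddles train and test.
        splitter = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                     random_state=seed)
        train_idx, test_idx = next(splitter.split(nodules, groups=patient_ids))
        return train_idx, test_idx

Running this twice, once to carve off the test patients and once more to carve the validation patients out of the remainder, produces a three-way split like the 388/83/84 patients above.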

2. Held-Out Concordant Nodules

We set aside 44 special nodules where all 4 doctors agreed they could see the nodule. These 44 cases were completely hidden from the AI during all training. They're like a secret final exam that the AI never got to practice on.

Why these are special: When all 4 doctors agree, we know the answer. There's no debate. So if the AI gets these right, we can be very confident it's working correctly.

3. Data From Multiple Hospitals

The training data came from 7 different hospitals using 8 different CT scanner brands. This means the AI learned general patterns — not quirks of one particular machine or hospital.

Understanding the Results: What is "MAE"?

MAE stands for Mean Absolute Error. It's just a fancy way of saying "on average, how far off is the prediction?"

Simple example: If the weather forecast says it will be 75°F tomorrow, and it actually turns out to be 72°F, the error is 3 degrees. If the forecast says 75°F and the actual temperature is 80°F, the error is 5 degrees. The MAE is the average of all those errors across many predictions.
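
In code, MAE is a one-liner; here it is applied to the two weather forecasts from the example (a sketch, not our evaluation script):

    import numpy as np

    forecasts = np.array([75.0, 75.0])  # what was predicted
    actuals   = np.array([72.0, 80.0])  # what actually happened

    mae = np.mean(np.abs(forecasts - actuals))
    print(mae)  # (3 + 5) / 2 = 4.0 degrees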

Here's a concrete example from our AI:

Actual (Ground Truth): 100% (all 4 doctors detected it)
AI Prediction: 94.5% (the AI says about 95% would detect it)
Error: 5.5 percentage points (pretty close!)

Our AI's Detection MAE of 5.5% on held-out cases means: on average, when the AI says "X% of doctors would detect this nodule," the real answer is within about 5.5 percentage points.

Is 5.5% error good? Yes! Think of it this way: if the AI says "92% of doctors would detect this," the real answer is probably between about 87% and 97%. That's a tight and useful range for a courtroom.

The 5 Probability Thresholds

The AI doesn't just give one number. It gives 5 probabilities that answer increasingly specific questions:

P(S≥1) = 96%   "Would rate it at least a 1" (detected at all)
P(S≥2) = 88%   "Would rate it at least Moderately Subtle"
P(S≥3) = 72%   "Would rate it at least Intermediate"
P(S≥4) = 55%   "Would rate it at least Moderately Obvious"
P(S≥5) = 30%   "Would rate it Obvious"

Notice how the percentages shrink as you go down the list. That's because each question is harder to satisfy: almost all doctors would at least detect this nodule (96%), but only 30% would call it "Obvious." These probabilities always go down, never up. This is mathematically guaranteed by our model design.

Think of it like grades: If you ask "what percentage of students got at least a D?" the answer is always higher than "what percentage got at least an A?" Our probabilities work the same way: P(S≥1) is always ≥ P(S≥2) ≥ P(S≥3) ≥ P(S≥4) ≥ P(S≥5).
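
The report doesn't detail the mechanism behind that guarantee, but one standard cumulative-ordinal construction makes it automatic: predict a probability for each outcome ("missed" plus the five ratings) with a softmax, then sum from the top down. A sketch, assuming this construction (which the report doesn't confirm):

    import numpy as np

    def cumulative_probs(scores):
        # Softmax over 6 outcomes: "missed" (0) plus ratings 1-5.
        p = np.exp(scores - scores.max())
        p /= p.sum()
        # tail[k] = P(outcome >= k); a reverse cumulative sum of
        # non-negative numbers can only shrink as k rises.
        tail = np.cumsum(p[::-1])[::-1]
        return tail[1:]  # P(S>=1), P(S>=2), ..., P(S>=5)

    probs = cumulative_probs(np.array([0.2, 0.1, 0.6, 1.0, 0.8, -0.5]))
    assert all(probs[i] >= probs[i + 1] for i in range(4))  # always holds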

How Accurate Is Each Threshold?

Here's how much error the AI has at each level (lower is better):

P(S≥1) — Detection: 31.6% error
P(S≥2): 25.7% error
P(S≥3): 21.8% error
P(S≥4): 20.0% error
P(S≥5) — Obvious: 15.3% error

The AI is most accurate at the "Obvious" end of the scale and least accurate on the basic detection question. This makes sense: whether a nodule looks clearly obvious is the easiest judgment to reproduce, while predicting whether a subtle nodule would be detected at all is exactly where doctors themselves disagree most.

The Big Test: Held-Out Concordant Nodules

Remember those 44 special nodules we set aside? These are cases where all 4 doctors agreed they could see the nodule. We have tested 17 of them so far; the remaining 27 require scan data that has not yet been downloaded.

17 out of 17 Correctly Identified
Average detection error: only 5.5%

For every single one of these nodules, the AI correctly predicted that a high percentage of radiologists would detect it. Here are some examples:

LIDC-IDRI-0423 (true subtlety 5.0, all 4 said "Obvious"): AI predicts 99.9%. "Nearly all doctors would find this." Correct.

LIDC-IDRI-0363 (true subtlety 4.0, Moderately Obvious): AI predicts 99.7%. "Almost every doctor would find this." Correct.

LIDC-IDRI-0699 (true subtlety 3.25, Intermediate): AI predicts 92.4%. "Most doctors would find this." Correct.

LIDC-IDRI-0828 (true subtlety 3.75, Moderately Obvious): AI predicts 78.9%. "Majority of doctors would find this." Correct.

Even the AI's "worst" prediction (78.9% for LIDC-IDRI-0828) still correctly says the majority of doctors would find this nodule — which matches reality since all 4 doctors actually did find it.

All 17 Held-Out Results

LIDC-0423 (5.00): 99.9%
LIDC-0697 (5.00): 99.4%
LIDC-0796 (5.00): 99.4%
LIDC-0363 (4.00): 99.7%
LIDC-0403 (4.75): 99.5%
LIDC-0946 (4.75): 99.4%
LIDC-0475 (2.25): 97.6%
LIDC-0017 (3.50): 97.4%
LIDC-0597 (4.50): 97.1%
LIDC-0915 (3.50): 97.1%
LIDC-0941 (3.75): 95.3%
LIDC-0803 (3.75): 93.3%
LIDC-0644 (3.75): 92.4%
LIDC-0699 (3.25): 92.4%
LIDC-0681 (3.50): 85.8%
LIDC-0580 (3.75): 82.2%
LIDC-0828 (3.75): 78.9%

Numbers in parentheses = true mean subtlety rating from the 4 doctors. Every prediction should be near 100%, since all 4 doctors detected these nodules; 11 of the 17 come in at 95% or above, and the remaining 6 fall between 75% and 95%.

How the Model Improved Over Training

We trained the model 4 times, making improvements each round:

Run     What changed                       Nodules   Overall error   Detection error
Run 1   First attempt                      1,078     25.4%           30.1%
Run 2   More data (fixed scan finder)      1,795     26.6%           35.3%
Run 3   Longer training (patience = 20)    1,795     24.4%           32.8%
Run 4   Even longer (63 epochs)            1,795     22.9%           31.6%

Think of it like studying: Run 1 was like studying from half the textbook. Run 2 added the rest of the pages, which made the test harder at first (both errors went up). Runs 3 and 4 gave the student more time to study, and by Run 4 the overall error was the lowest of any run.

What's Not Perfect Yet (And That's OK)

No AI is 100% perfect. Here's where our model still has room to improve:

Detection MAE on the general test set is 31.6%. This means for all types of nodules (not just the concordant ones), the AI's detection prediction can be off by about 32 percentage points. This is because the general test set includes the really tricky middle-ground nodules where even doctors disagree.

Why the held-out results (5.5% error) are so much better than the general test (31.6% error): the held-out set contains only concordant nodules, where all 4 doctors agreed and the right answer is unambiguous. The general test set includes the borderline nodules where even the doctors disagree, so there is no single clean answer for the AI to match.

Ways to improve: download the full scan data for the remaining 27 concordant nodules, train on more cases, and keep tuning the training schedule, since the run-over-run gains suggest there is still headroom.

The Bottom Line

For our Proof of Concept, this is a strong result. The AI has demonstrated that it can:
  • Correctly identify detectable nodules with 5.5% error on held-out cases
  • Produce meaningful probability distributions across subtlety levels
  • Generalize across patients it has never seen
  • Work with data from multiple hospitals and scanner types

The system isn't perfect yet — but it doesn't need to be for a POC. What matters is that the concept works: an AI can learn from expert radiologist ratings and produce objective, quantitative assessments of nodule detectability.

With more data and validation, this technology can provide courtrooms with something they've never had before: an objective, data-driven answer to "how detectable was this nodule?"

Glossary

Nodule: A small spot or growth found on a CT scan of the lungs
CT Scan: A type of medical imaging that creates 3D pictures of the inside of the body
Radiologist: A doctor who specializes in reading medical images like CT scans
Subtlety Rating: A 1-5 score of how easy a nodule is to see (1 = very hard, 5 = very easy)
MAE: Mean Absolute Error — the average difference between what the AI predicted and the real answer
Detection Rate: The percentage of radiologists expected to find a nodule — the AI's main output
LIDC-IDRI: The public dataset of 1,018 CT scans from 7 hospitals, rated by 12 radiologists, used to train our AI
Concordant Nodule: A nodule where all 4 rating doctors agreed on its visibility — used as our "gold standard" test
Patient-Aware Split: A way of dividing data so all of one patient's nodules stay together, preventing the AI from recognizing patients instead of nodules
Held-Out Set: Data that is completely hidden from the AI during training, used as the final "surprise test"
Epoch: One complete pass through all training data — like reading the entire textbook once
Early Stopping: Automatically stopping training when the AI stops improving, to prevent over-studying (memorizing rather than learning)