Testing the performance
For many reasons, you may be interested in measuring the performance of our technology with your own testing dataset. We encourage this practice. However, when conducting such a test, you must understand the right way to test the performance of a computer vision model, and which methods are appropriate for testing the performance of the device.
Template for tests
Download | Title |
---|---|
Template | Performance testing of Top-5 and Top-1 accuracy (clinical decision support) |
Template | Performance testing of prioritisation through malignancy |
Principles of validity
There are a few principles that you must follow:
- The test must resemble the real-world environment as closely as possible
- The gold standard by which you measure the accuracy must match the output of the device
- The performance metric must reflect the goals of the implementation
1. Making the test identical to the real-world environment
In the real world, the use of the device is going to consist of people, such as healthcare practitioners (HCPs) or their patients, taking pictures. This means that people will use their phone cameras to capture an image of a skin lesion.
That is why, when testing the performance of the device, you should use images that match the characteristics of those that HCPs and patients will take.
You should use...
✅ Images taken directly from a smartphone
✅ Images taken directly from a digital camera
✅ Images taken directly from a dermatoscope
You should not use...
❌ Images that have been compressed or optimized
❌ Images downloaded from the Internet
❌ Images transmitted through WhatsApp or WeChat
How to know if an image has been compressed
A good way of telling whether an image has been artificially distorted is to look at its dimensions. Most image compression also reduces the dimensions of the image.
Device | ✅ Normal image size | ❌ Compressed image size |
---|---|---|
iPhone 6 (2014) | 3456 x 2304 pixels | 346 x 204 pixels |
Canon SX610 HS (2015) | 5184 x 2912 pixels | 640 x 360 pixels |
iPhone 13 (2021) | 4032 x 3024 pixels | 403 x 302 pixels |
Xiaomi 12T Pro (2022) | 16384 x 12288 pixels | 819 x 614 pixels |
As you can see, even iPhones as old as 2014 take images with dimensions of around 3000 pixels. If an image is significantly smaller than this, it may indicate compression.
Check that the image dimensions approximate the normal sizes, with at least 2000 pixels of width or height, because this is the image size that users will produce in the real world. A quick way of automating this check is sketched below.
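For instance, a script can scan your testing dataset and flag files whose dimensions suggest compression. The following is a minimal sketch in Python using the Pillow library; the folder path, file pattern and 2000-pixel threshold are assumptions based on the rule of thumb above.

```python
from pathlib import Path

from PIL import Image  # pip install Pillow

MIN_DIMENSION = 2000  # rule of thumb: real-world photos have at least this many pixels on one side


def flag_suspect_images(folder: str) -> list[Path]:
    """Return the images whose largest side is below the expected size."""
    suspects = []
    for path in sorted(Path(folder).glob("*.jpg")):  # adjust the pattern to your dataset
        with Image.open(path) as img:
            width, height = img.size
            if max(width, height) < MIN_DIMENSION:
                suspects.append(path)
    return suspects


# Hypothetical usage: any file listed here was probably compressed or
# transmitted through a messaging app, and should not be used in the test.
print(flag_suspect_images("./testing_dataset"))
```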
2. Ensuring that the output can be matched
The device looks at images and outputs a list of conditions with the probability of each being present. These conditions have names, such as `Psoriasis`, `Basal cell carcinoma` or `Rosacea`, along with a few hundred other conditions. Each condition also has a code, according to the international standard ICD-11.
In the following table, you will see a situation in which the doctor's diagnosis does not match the output from the device. Keep in mind that the doctor's diagnosis acts as the gold standard for this test:
Doctor's diagnosis | Device output | Do they agree? |
---|---|---|
AK | Actinic keratosis | 🤷‍♂️ |
Eczema | Dermatitis | 🤷‍♂️ |
Symptomatic dermographism | Urticaria | 🤷‍♂️ |
Due to this mismatch in terminology, it is very difficult to evaluate the performance of the device correctly, because there is no straightforward way of telling whether the device matched what the doctor said.
How to do it right
The minimum requirement for a test to even begin to be valid is that the codification of the gold standard and of the device output can be matched. If the doctor and the device are using different names for the same conditions, it will look like they disagree, when that is not the case.
Doctor's diagnosis | Device output | Do they agree? |
---|---|---|
Actinic keratosis | Actinic keratosis | ✅ |
Dermatitis | Urticaria | ❌ |
Urticaria | Urticaria | ✅ |
That is why the template offers the diagnostic options in a dropdown. Furthermore, the name of each condition is shown alongside its identifier in an international standard for condition names, the International Classification of Diseases (ICD).
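To illustrate the point, the comparison should be made on codes rather than free-text names. The sketch below is a minimal illustration in Python; the mapping and the codes in it are illustrative placeholders, not verified ICD-11 identifiers.

```python
# Map every free-text label to a shared code before comparing.
# These codes are illustrative placeholders, NOT verified ICD-11 identifiers.
LABEL_TO_CODE = {
    "AK": "code-001",
    "Actinic keratosis": "code-001",
    "Eczema": "code-002",
    "Dermatitis": "code-002",
}


def diagnoses_agree(doctor_label: str, device_label: str) -> bool:
    """Two diagnoses agree when they resolve to the same code."""
    return LABEL_TO_CODE[doctor_label] == LABEL_TO_CODE[device_label]


print(diagnoses_agree("AK", "Actinic keratosis"))  # True: same condition, different names
```

Comparing codes instead of names turns the 🤷‍♂️ rows of the first table into unambiguous ✅ or ❌ results.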
3. Selecting performance metrics that match the goal
Here's a famous quote that is very relevant to the task at hand:
> If you judge a fish by its ability to climb a tree, it will live its whole life believing that it is stupid.
The device is a tool that serves a purpose, and it must be measured in regard to that purpose. The purpose is defined by the Intended Use of the device, but it also depends on the goal with which you will use it.
So, ask yourself: what is the actual implementation of the device? What problem is it solving? Who will use it? Depending on the goal and the type of integration, different tests should be conducted, measuring different metrics.
`Top-5` and `Top-1` accuracy
As you will see in our Intended user section, the intended user of the device is an HCP, because the device is a clinical decision support tool. For this reason, `Top-5` accuracy is the most common performance metric, used together with `Top-1` accuracy as a set.
`Top-5` accuracy is a measure of the correctness of the output of a machine learning model: a prediction counts as correct if the true label appears among the model's five highest-probability outputs. `Top-5` accuracy is frequently used in image recognition, object detection and many other tasks.
Why is `Top-5` so important?
Diagnosing is a cognitive process that HCPs perform with the information they have available. With more information, the accuracy of the HCP increases. And that is what the research shows: the diagnostic accuracy of the HCP increases when they see the `Top-5` results from the device.
To measure the `Top-5` and `Top-1` accuracies, the template we provide lets you write down not only one, but the five most probable conditions output by the device. A sketch of how both metrics are then computed is shown below.
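Assuming the gold standard and the device output share the same codification, as discussed above, both accuracies reduce to the same calculation with a different `k`. The condition codes below are illustrative placeholders.

```python
def top_k_accuracy(gold: list[str], ranked_outputs: list[list[str]], k: int) -> float:
    """Fraction of cases where the gold-standard code appears among
    the device's k highest-probability conditions."""
    hits = sum(code in ranked[:k] for code, ranked in zip(gold, ranked_outputs))
    return hits / len(gold)


# One gold-standard code per case, and the device's ranked Top-5 output.
# All codes are illustrative placeholders, NOT verified ICD-11 identifiers.
gold = ["code-001", "code-002"]
ranked_outputs = [
    ["code-007", "code-001", "code-003", "code-004", "code-005"],  # hit in 2nd place
    ["code-009", "code-008", "code-007", "code-006", "code-005"],  # gold code absent
]

print(top_k_accuracy(gold, ranked_outputs, k=1))  # 0.0 -> Top-1 accuracy
print(top_k_accuracy(gold, ranked_outputs, k=5))  # 0.5 -> Top-5 accuracy
```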
Some conditions turn into other conditions. For instance, `actinic keratoses` can turn into `squamous cell carcinoma`. This means that, if the device looks at a lesion of actinic keratosis, it is very informative to see how close the squamous cell carcinoma diagnosis is to the first guess. That is one way in which `Top-5` is a better metric than `Top-1`: it reflects the evolution from one condition to another.
Malignancy suspicion
If you are using the device to prioritise cases, the metric that you should be testing is the malignancy suspicion index. The malignancy suspicion is a number from 0 to 100 that reflects the probability of a condition being malignant.
In the API, the device's response contains a field called `isMalignantSuspicion`, inside the `preliminaryFindings` group, as shown below:
```json
{
  // ...
  "preliminaryFindings": {
    // ...
    "isMalignantSuspicion": 62
    // ...
  }
  // ...
}
```
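For instance, given a response payload like the one above, the field can be read as follows. This is a minimal sketch: the request code is omitted and the payload is hard-coded for illustration.

```python
import json

# Stand-in for the JSON body returned by the API, as shown above.
raw_response = '{"preliminaryFindings": {"isMalignantSuspicion": 62}}'

payload = json.loads(raw_response)
suspicion = payload["preliminaryFindings"]["isMalignantSuspicion"]
print(suspicion)  # 62, on the 0-100 scale
```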
This can also be shown as a gauge, reflecting the suspicion of malignancy.
In order to measure the performance of the device in prioritising cases through the malignancy suspicion, a table such as the one in the prioritisation template may be useful.
The test consists of measuring whether or not the malignancy suspicion value reflects the specialists' assignment of priority, or even the result of the biopsy, if such data is available.
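One possible way of running this comparison, when biopsy results are available, is to check how well the malignancy suspicion ranks malignant cases above benign ones, for example with the area under the ROC curve. The sketch below uses scikit-learn, and the case data in it is hypothetical.

```python
from sklearn.metrics import roc_auc_score  # pip install scikit-learn

# Hypothetical test cases: the device's isMalignantSuspicion value (0-100)
# and the corresponding biopsy outcome (1 = malignant, 0 = benign).
suspicion = [62, 15, 88, 40, 7]
biopsy = [1, 0, 1, 1, 0]

# An AUROC of 1.0 means the device always ranks malignant cases above benign
# ones; 0.5 means the score is no better than chance at prioritising them.
print(roc_auc_score(biopsy, suspicion))  # 1.0 for this toy data
```

If biopsy data is not available, the same calculation can be run against the specialist's binary priority label instead.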