Overview of VISCERAL Benchmarks in Intelligent Medical Data Analysis

authors:

Henning Muller 1, *

University of Applied Sciences Western Switzerland, Delémont, Switzerland

how to cite: Muller H. Overview of VISCERAL Benchmarks in Intelligent Medical Data Analysis. I J Radiol. 2019;16(Special Issue):e99221. https://doi.org/10.5812/iranjradiol.99221.

Abstract

Background:

Large datasets of annotated medical images are important for training machine learning algorithms. Segmentation is the first step in many decision support applications.

Objectives:

Learning objectives include:
1. How to organize a large-scale scientific challenge on organ segmentation
2. How to develop research infrastructures where data no longer need to be moved
3. How research infrastructures can help with reproducibility and the semi-automatic generation of annotations

Outline:

Organ segmentation is the first step in many decision support tools in radiology. To obtain good segmentation results, large annotated datasets usually need to be available. Moreover, most segmentation algorithms are highly organ- and modality-specific. The VISCERAL benchmark makes data for 20 organs available, both with and without contrast agent, for CT and MR. An architecture for manual segmentation, including quality control, was developed for this purpose.
VISCERAL also introduced a novel Evaluation as a Service (EaaS) architecture to compare algorithm performance. The data remained in a fixed and inaccessible place in the cloud. Participants obtained a virtual machine (VM) and access to the training data; they could then install all necessary tools and test on the training data. When the participants finished, the organizers took control of the VMs and ran the algorithms on the test data. In this way, no manual optimization on the test data is possible and the actual test data are never released to the researchers; with the availability of both executables and data, full reproducibility is given.
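When the organizers run a submitted algorithm on the withheld test data, its output masks are scored against the manual reference annotations. The abstract does not name the metrics, so the following is only an illustrative sketch using the Dice overlap coefficient, a common choice for comparing binary segmentation masks:

```python
import numpy as np

def dice(pred, ref):
    """Dice overlap between a binary predicted mask and a reference mask.

    Both inputs are equally shaped arrays where nonzero means "organ".
    Returns a value in [0, 1]; 1 means perfect overlap.
    """
    pred = np.asarray(pred).astype(bool)
    ref = np.asarray(ref).astype(bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, ref).sum() / denom

# toy 1-D stand-ins for 3-D label volumes (values are illustrative)
pred = np.array([1, 1, 1, 0])
ref = np.array([1, 1, 0, 0])
score = dice(pred, ref)  # intersection=2, sums=3+2=5, Dice=4/5
```

In the EaaS setting, a script like this runs only inside the organizers' environment, so the reference test annotations never leave the protected infrastructure.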
The results of the benchmark compared many algorithms on all four modalities. The segmentation quality on CT was often much better than on MRI. Results for some organs, such as the lungs and liver, were sometimes better than the inter-rater agreement. The availability of the code also made it possible to run the segmentation algorithms on new images and to fuse the algorithmic segmentations into what we call a silver standard, generating large-scale training data.
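The silver standard described above fuses the outputs of several algorithms into one consensus mask. The abstract does not specify the fusion rule, so the sketch below assumes simple per-voxel majority voting (more elaborate schemes such as weighted voting or STAPLE exist):

```python
import numpy as np

def silver_standard(segmentations, threshold=0.5):
    """Fuse binary organ masks from several algorithms by majority voting.

    segmentations: list of equally shaped binary arrays, one per algorithm.
    A voxel belongs to the fused mask when at least `threshold` of the
    algorithms marked it as organ.
    """
    stack = np.stack([np.asarray(s, dtype=float) for s in segmentations])
    votes = stack.mean(axis=0)  # per-voxel fraction of positive votes
    return votes >= threshold

# toy 1-D stand-ins for three algorithmic segmentations of the same scan
a = np.array([1, 1, 0, 0])
b = np.array([1, 0, 1, 0])
c = np.array([1, 1, 1, 0])
fused = silver_standard([a, b, c])  # voxels where at least 2 of 3 agree
```

Applied to many unannotated images, such a fused consensus can serve as approximate ground truth for training, at much lower cost than fully manual annotation.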