
510(k) Data Aggregation

    K Number: K242729
    Manufacturer: Radformation
    Date Cleared: 2024-12-09 (90 days)
    Regulation Number: 892.2050
    Device Name: AutoContour (Model RADAC V4)
    Intended Use

    AutoContour is intended to assist radiation treatment planners in contouring and reviewing structures within medical images in preparation for radiation therapy treatment planning.

    Device Description

    As with AutoContour Model RADAC V3, the AutoContour Model RADAC V4 device is software that uses DICOM-compliant image data (CT or MR) as input to: (1) automatically contour various structures of interest for radiation therapy treatment planning using machine-learning-based contouring, (2) allow the user to review and modify the resulting contours, and (3) generate DICOM-compliant structure set data that can be imported into a radiation therapy treatment planning system. The deep-learning-based structure models are trained using imaging datasets consisting of anatomical organs of the head and neck, thorax, abdomen, and pelvis for adult male and female patients.

    AutoContour Model RADAC V4 consists of three main components:

      1. A .NET client application designed to run on the Windows operating system that allows the user to load image and structure sets for upload to the cloud-based server for automatic contouring, perform registration with other image sets, and review, edit, and export the structure set.
      2. A local "agent" service designed to run on the Windows operating system that the user configures to monitor a network storage location for new CT and MR datasets to be automatically contoured (a minimal sketch of this monitoring pattern follows this list).
      3. A cloud-based automatic contouring service that produces initial contours from image sets sent by the user from the .NET client application.
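
    The submission does not include the agent's implementation (the actual component is a Windows service). As a rough, stdlib-only Python illustration of the "watch a folder, submit new datasets" behavior described in item 2, with all paths and function names hypothetical:

```python
# Illustrative sketch only: watch a network share for new CT/MR series
# directories and hand each one off for automatic contouring.
import time
from pathlib import Path

WATCH_DIR = Path(r"\\storage\incoming")  # hypothetical network share
POLL_SECONDS = 30

def submit_for_contouring(series_dir: Path) -> None:
    """Placeholder for the upload to the cloud-based contouring service."""
    print(f"submitting {series_dir} for automatic contouring")

def main() -> None:
    seen: set[Path] = set()
    while True:
        # Treat each subdirectory as one image dataset awaiting contouring.
        for series_dir in (p for p in WATCH_DIR.iterdir() if p.is_dir()):
            if series_dir not in seen:
                seen.add(series_dir)
                submit_for_contouring(series_dir)
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()
```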
    AI/ML Overview

    Here's an analysis of the acceptance criteria and study findings for the Radformation AutoContour (Model RADAC V4) device, based on the provided text:

    1. Acceptance Criteria and Reported Device Performance

    The primary acceptance criterion for the automated contouring models is the Dice Similarity Coefficient (DSC), which measures the spatial overlap between the AI-generated contour and the ground truth contour. The criteria vary based on the estimated size of the anatomical structure. Additionally, for external clinical testing, an external reviewer rating was used to assess clinical appropriateness.

    | Acceptance Criteria Category | Metric | Performance Criterion | CT Models (Mean ± Std Dev) | MR Models, Training Data (Mean ± Std Dev) | MR Models, External Data (Mean ± Std Dev) |
    |---|---|---|---|---|---|
    | Contouring accuracy | Mean Dice Similarity Coefficient (DSC) | Large volume structures: ≥ 0.8 | 0.92 ± 0.06 | 0.96 ± 0.03 | 0.80 ± 0.09 |
    | Contouring accuracy | Mean DSC | Medium volume structures: ≥ 0.65 | 0.85 ± 0.09 | 0.84 ± 0.07 | 0.84 ± 0.09 |
    | Contouring accuracy | Mean DSC | Small volume structures: ≥ 0.5 | 0.81 ± 0.12 | 0.74 ± 0.09 | 0.61 ± 0.14 |
    | Clinical appropriateness | External reviewer rating (1–5 scale, higher is better) | Average score ≥ 3 | 4.57 (across all CT models) | 4.6 (across all MR models) | — |
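
    For reference, the DSC between an AI-generated contour A and a ground-truth contour B is 2|A ∩ B| / (|A| + |B|). Below is a minimal sketch of the metric together with the size-dependent pass criteria from the table, assuming contours rasterized to binary NumPy masks; the function and dictionary names are illustrative, not from the submission:

```python
# Dice Similarity Coefficient and the size-based acceptance check.
import numpy as np

# Thresholds from the acceptance criteria table above.
DSC_THRESHOLDS = {"large": 0.80, "medium": 0.65, "small": 0.50}

def dice_coefficient(pred: np.ndarray, truth: np.ndarray) -> float:
    """DSC = 2|A ∩ B| / (|A| + |B|) for binary masks A and B."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)

def meets_criterion(pred: np.ndarray, truth: np.ndarray, size: str) -> bool:
    """Check a structure's DSC against its volume-category threshold."""
    return dice_coefficient(pred, truth) >= DSC_THRESHOLDS[size]
```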

    2. Sample Size Used for the Test Set and Data Provenance

    • CT Models Test Set:

      • Sample Size: For individual CT structure models, the number of testing sets ranged from 10 to 116 for internal validation (Table 4) and from 13 to 82 for external clinical testing (Table 6). The document states that "approximately 10% of the number of training image sets" were used for testing in the internal validation, with an average of 54 testing image sets per CT structure model (a sketch of this roughly 90/10 split appears after this list).
      • Data Provenance: Imaging data for training was gathered from 4 institutions in 2 countries (the United States and Switzerland). External clinical testing data for CT models was sourced from several TCIA (The Cancer Imaging Archive) datasets (Pelvic-Ref, Head-Neck-PET-CT, Pancreas-CT-CB, NSCLC, LCTSC, QIN-BREAST) and shared by several unnamed institutions in the United States. The data were retrospective: previously acquired images were used for model validation.
    • MR Models Test Set:

      • Sample Size: For internal validation, an average of 45 testing image sets per MR Brain model and 77 testing image sets per MR Pelvis model were used (Table 8). For external clinical testing, the number of testing sets per MR structure model ranged from 5 to 45 (Table 10).
      • Data Provenance: Imaging data for training and internal testing was acquired from The Cancer Imaging Archive GLIS-RT dataset (for Brain models) and from two open-source datasets plus one institution in the United States (for Pelvis models). External clinical testing data for MR models came from a clinical partner (for Brain models), publicly available datasets (Prostate-MRI-U-S-Biopsy, Gold Atlas Pelvis, SynthRad), and two institutions using MR Linacs for image acquisition. The data were retrospective.
    • General Note: The test datasets were independent of those used for training.
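
    The splitting procedure itself is not published; as a purely illustrative sketch of the roughly 90/10 train/test partition described above (the image-set IDs and fixed seed are assumptions):

```python
# Hold out ~10% of image sets for testing, keeping the test data
# independent of the data used for training.
import random

def split_image_sets(image_set_ids: list[str], test_fraction: float = 0.10,
                     seed: int = 0) -> tuple[list[str], list[str]]:
    rng = random.Random(seed)
    ids = list(image_set_ids)
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]  # (training sets, testing sets)
```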

    3. Number of Experts Used to Establish the Ground Truth for the Test Set and Qualifications of Those Experts

    • Number of Experts: Three (3) experts were used.
    • Qualifications of Experts: The ground truth was established by three clinically experienced experts: 2 radiation therapy physicists and 1 radiation dosimetrist.

    4. Adjudication Method for the Test Set

    • Method: Ground truth for each test data set was generated manually by the three experts, following consensus (NRG/RTOG) guidelines as appropriate. This implies an expert consensus method, likely involving discussion and agreement among the three experts. The document does not specify a quantitative adjudication scheme such as "2+1" or "3+1"; it describes only a consensus guided by established clinical guidelines.

    5. If a Multi-Reader Multi-Case (MRMC) Comparative Effectiveness Study was Done

    • The document does not report an MRMC comparative effectiveness study comparing human readers with AI assistance versus without AI assistance. The study focuses purely on the AI's performance and its clinical appropriateness as rated by external reviewers.

    6. If a Standalone (i.e., algorithm only without human-in-the-loop performance) was Done

    • Yes, a standalone performance evaluation was done. The core of the performance data presented (Dice Similarity Coefficient) is a measure of the algorithm's direct output compared to the ground truth, without a human in the loop during the contour generation phase. The external reviewer ratings also assess the standalone performance of the AI-generated contours regarding their clinical utility for subsequent editing and approval.

    7. The Type of Ground Truth Used

    • Type: The ground truth used was expert consensus, specifically from three clinically experienced experts (2 radiation therapy physicists and 1 radiation dosimetrist), guided by NRG/RTOG guidelines.

    8. The Sample Size for the Training Set

    • CT Models Training Set: For CT structure models, there was an average of 341 training image sets.
    • MR Models Training Set: For MR Brain models, there was an average of 149 training image sets. For MR Pelvis models, there was an average of 306 training image sets.

    9. How the Ground Truth for the Training Set Was Established

    The document states that the deep-learning based structure models were "trained using imaging datasets consisting of anatomical organs" and that the "test datasets were independent from those used for training." While it details at length how ground truth was established for the test sets (manual generation by three experts using consensus and NRG/RTOG guidelines), it does not explicitly describe how the ground truth for the training sets was established.

    However, given the nature of deep learning models for medical image segmentation, it is highly probable that the training data also had meticulously generated, expert-annotated ground truth contours, likely following similarly rigorous processes as the test sets, potentially from various institutions or public datasets. The consistency of the model architecture and training methodologies (e.g., "very similar CNN architecture was used to train these new CT models") suggests a standardized approach to data preparation, including ground truth generation, for both training and testing.
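
    The submission does not disclose the network design beyond calling it a CNN. As a purely illustrative sketch of the general class of model described, here is a minimal PyTorch encoder-decoder for per-voxel structure segmentation; every layer choice below is an assumption, not the device's actual architecture:

```python
# Minimal 3D encoder-decoder for binary structure segmentation (illustrative).
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, in_channels: int = 1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(2),  # halve each spatial dimension
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(32, 16, kernel_size=2, stride=2),  # upsample back
            nn.ReLU(inplace=True),
            nn.Conv3d(16, 1, kernel_size=1),  # per-voxel structure logit
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, depth, height, width) with even spatial dims
        return self.decoder(self.encoder(x))

# Usage: logits = TinySegNet()(ct_volume[None, None])  # ct_volume: (D, H, W)
```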
