AutoContour is intended to assist radiation treatment planners in contouring structures within medical images in preparation for radiation therapy treatment planning.
As with AutoContour Model RADAC V2, the AutoContour Model RADAC V3 device is software that uses DICOM-compliant image data (CT or MR) as input to: (1) automatically contour various structures of interest for radiation therapy treatment planning using machine learning-based contouring (the deep-learning-based structure models are trained on imaging datasets of anatomical organs of the head and neck, thorax, abdomen, and pelvis for adult male and female patients); (2) allow the user to review and modify the resulting contours; and (3) generate DICOM-compliant structure set data that can be imported into a radiation therapy treatment planning system.
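The structure set output referenced above is a standard DICOM RT Structure Set (RTSTRUCT) object. Purely as a point of reference, and not taken from the submission, a minimal pydicom sketch of inspecting such an export might look like the following; the file name is hypothetical:

```python
import pydicom

# Load an RT Structure Set export (file name is hypothetical).
ds = pydicom.dcmread("RS.example_structure_set.dcm")

# An RT Structure Set identifies its type via the Modality attribute.
assert ds.Modality == "RTSTRUCT"

# List the contoured structures by ROI number and name.
for roi in ds.StructureSetROISequence:
    print(roi.ROINumber, roi.ROIName)
```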
AutoContour Model RADAC V3 consists of 3 main components:
- A .NET client application designed to run on the Windows Operating System that allows the user to load image and structure sets for upload to the cloud-based server for automatic contouring, perform registration with other image sets, and review, edit, and export the structure set.
- A local "agent" service designed to run on the Windows Operating System that is configured by the user to monitor a network storage location for new CT and MR datasets that are to be automatically contoured (see the polling sketch after this list).
- A cloud-based automatic contouring service that produces initial contours based on image sets sent by the user from the .NET client application.
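The document does not describe the agent's implementation beyond the behavior above. Purely as an illustration of that folder-monitoring behavior, a minimal polling sketch (with a hypothetical share path, polling interval, and upload placeholder) could look like:

```python
import time
from pathlib import Path

import pydicom

# Hypothetical network share and polling interval; the real agent is a
# user-configured Windows service and is not described at this level of detail.
WATCH_DIR = Path(r"\\storage\incoming_dicom")
POLL_SECONDS = 30

seen: set[Path] = set()

def queue_for_contouring(path: Path) -> None:
    """Placeholder for handing a new CT/MR file to the contouring service."""
    ds = pydicom.dcmread(path, stop_before_pixels=True)
    if ds.Modality in ("CT", "MR"):
        print(f"Would upload {path.name} ({ds.Modality}) for auto-contouring")

while True:
    # Poll the watched folder for DICOM files that have not been seen yet.
    for dcm_path in WATCH_DIR.rglob("*.dcm"):
        if dcm_path not in seen:
            seen.add(dcm_path)
            queue_for_contouring(dcm_path)
    time.sleep(POLL_SECONDS)
```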
Here's a breakdown of the acceptance criteria and the study demonstrating the device's performance, based on the provided document:
1. Table of Acceptance Criteria & Reported Device Performance
| Feature/Metric | Acceptance Criteria | Reported Device Performance (Mean DSC / Rating) |
|---|---|---|
| CT Structures | | |
| Large volume DSC | >= 0.8 | Initial Validation: 0.88 +/- 0.06; External Validation: 0.90 +/- 0.09 |
| Medium volume DSC | >= 0.65 | Initial Validation: 0.88 +/- 0.08; External Validation: 0.83 +/- 0.12 |
| Small volume DSC | >= 0.5 | Initial Validation: 0.75 +/- 0.12; External Validation: 0.79 +/- 0.11 |
| Clinical Appropriateness (1-5 scale, 5 best) | Average score >= 3 | Average rating of 4.5 |
| MR Structures | | |
| Medium volume DSC | >= 0.65 | Initial Validation: 0.87 +/- 0.07; External Validation: 0.87 +/- 0.07 |
| Small volume DSC | >= 0.5 | Initial Validation: 0.74 +/- 0.07; External Validation: 0.74 +/- 0.07 |
| Clinical Appropriateness (1-5 scale, 5 best) | Average score >= 3 | Average rating of 4.4 |
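The DSC values in the table above measure volumetric overlap between an automatic contour A and its ground-truth contour B, DSC = 2|A ∩ B| / (|A| + |B|). A minimal NumPy sketch of that computation on toy binary masks (not part of the submission) is:

```python
import numpy as np

def dice_similarity(auto_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """DSC = 2*|A ∩ B| / (|A| + |B|) for two binary masks."""
    a, b = auto_mask.astype(bool), gt_mask.astype(bool)
    total = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / total if total else 1.0

# Toy 3D masks standing in for an auto-contour and its expert ground truth.
auto = np.zeros((4, 4, 4), dtype=bool)
gt = np.zeros((4, 4, 4), dtype=bool)
auto[1:3, 1:3, 1:3] = True   # 8 voxels
gt[1:3, 1:3, 0:3] = True     # 12 voxels, 8 of which overlap
print(f"DSC = {dice_similarity(auto, gt):.2f}")  # 2*8 / (8+12) = 0.80
```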
2. Sample Sizes Used for the Test Set and Data Provenance
- CT Test Set (Internal Validation): Approximately 10% of the training images, averaging 50 test images per structure model.
- Provenance: Retrospective data. Per the document, "among the patients used for CT training and testing 51.7% were male and 48.3% female. Patient ages range 11-30 : 0.3%, 31-50 : 6.2%, 51-70 : 43.3%, 71-100 : 50.3%. Race 84.0% White, 12.8% Black or African American, 3.2% Other." No specific country of origin is mentioned, but the wording implies internal company data.
- CT Test Set (External Clinical Validation): Variable per structure model, ranging from 19 to 63 images.
- Provenance: Publicly available CT datasets from The Cancer Imaging Archive (TCIA). This suggests diverse, likely multi-national origin, but exact countries are not specified; the studies cited are primarily from US institutions (e.g., Memorial Sloan Kettering Cancer Center, MD Anderson Cancer Center). This data is retrospective.
- MR Test Set (Internal Validation):
- Brain models: 92 testing images (from TCIA GLIS-RT dataset).
- Pelvis models: Sample size for testing not explicitly stated; the document refers to the "Prostate-MRI-US-Biopsy" dataset.
- Provenance: TCIA datasets (implying diverse origin, likely US-centric as above), retrospective.
- MR Test Set (External Clinical Validation):
- Brain models: 20 MR T1 Ax post (BRAVO) image scans acquired from a clinical partner (no specific country mentioned, but likely US given the context).
- Pelvis models: 19 images from the publicly available Gold Atlas dataset. The Gold Atlas project's references indicate collaboration across European and US institutions.
- Provenance: Retrospective.
3. Number of Experts Used to Establish the Ground Truth for the Test Set and Qualifications
- Number of Experts: Three (3)
- Qualifications: "clinically experienced experts consisting of 2 radiation therapy physicists and 1 radiation dosimetrist." No specific years of experience are mentioned.
4. Adjudication Method for the Test Set
- Method: "Ground truthing of each test data set were generated manually using consensus (NRG/RTOG) guidelines as appropriate by three clinically experienced experts". This implies a consensus-based approach, likely 3-way consensus. If initial contours differed, discussions and adjustments would lead to a final agreed-upon ground truth.
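The submission describes manual, guideline-driven consensus by the three experts; it does not describe any algorithmic adjudication. Purely as a generic illustration of how multiple expert segmentations can be combined (not the method described in the document), a per-voxel majority vote over toy masks could look like:

```python
import numpy as np

def majority_vote(masks: list[np.ndarray]) -> np.ndarray:
    """Keep each voxel that more than half of the binary expert masks include."""
    stacked = np.stack([m.astype(bool) for m in masks])
    return stacked.sum(axis=0) > (len(masks) // 2)

# Three toy expert masks; voted voxels are those at least 2 of the 3 agree on.
rng = np.random.default_rng(0)
experts = [rng.random((4, 4, 4)) > 0.5 for _ in range(3)]
consensus = majority_vote(experts)
print(int(consensus.sum()), "voxels in the voted consensus")
```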
5. If a Multi-Reader Multi-Case (MRMC) Comparative Effectiveness Study was Done
- No, an MRMC comparative effectiveness study was not explicitly done to measure improvement for human readers with AI vs without AI assistance.
- The study focuses on the performance of the AI algorithm itself (standalone performance) and its clinical appropriateness as rated by experts. The "External Reviewer Average Rating" indicates how much editing would be required by a human, rather than directly measuring human reader performance improvement with assistance.
- "independent reviewers (not employed by Radformation) were used to evaluate the clinical appropriateness of structure models as they would be evaluated for the purposes of treatment planning. This external review was performed as a replacement to intraobserver variability testing done with the RADAC V2 structure models as it better quantified the usefulness of the structure model outputs in an unbiased clinical setting." This suggests an assessment of the usability of the AI-generated contours for human review and modification, but not a direct MRMC study comparing assisted vs. unassisted human performance.
6. If a Standalone (i.e., algorithm only without human-in-the-loop performance) Was Done
- Yes, standalone performance was done.
- The Dice Similarity Coefficient (DSC) metrics presented are a measure of the algorithm's performance in generating contours when compared to expert-defined ground truth, without human intervention during the contour generation process. The "External Reviewer Average Rating" also evaluates the standalone output's quality before any human editing.
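As an illustration of how such a standalone check reduces to comparing a mean DSC against the volume-class thresholds in the table above, here is a small sketch with hypothetical per-case scores:

```python
import numpy as np

# Acceptance thresholds by structure volume class, as listed in the table above.
THRESHOLDS = {"large": 0.80, "medium": 0.65, "small": 0.50}

# Hypothetical per-case DSC scores for one medium-volume structure model.
volume_class = "medium"
dsc_scores = np.array([0.86, 0.91, 0.83, 0.88, 0.90])

mean_dsc = dsc_scores.mean()
std_dsc = dsc_scores.std(ddof=1)
passed = mean_dsc >= THRESHOLDS[volume_class]
print(f"mean DSC = {mean_dsc:.2f} +/- {std_dsc:.2f}, meets criterion: {passed}")
```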
7. The Type of Ground Truth Used
- Type of Ground Truth: Expert consensus.
- "Ground truthing of each test data set were generated manually using consensus (NRG/RTOG) guidelines as appropriate by three clinically experienced experts consisting of 2 radiation therapy physicists and 1 radiation dosimetrist."
8. The Sample Size for the Training Set
- CT Training Set: Average of 373 training image sets per structure model.
- MR Training Set:
- Brain models: Average of 274 training image sets.
- Pelvis models: Sample size for training not explicitly stated; the document refers to the "Prostate-MRI-US-Biopsy" dataset.
- It's important to note that specific numbers vary per structure, as shown in Table 4 and Table 8.
9. How the Ground Truth for the Training Set Was Established
- The document implies that the training data and their corresponding ground truths were prepared internally prior to the testing phase. While it does not explicitly state how the ground truth for the training set was established, it strongly suggests a similarly rigorous, expert-driven approach to the one described for the test sets.
- "The test datasets were independent from those used for training and consisted of approximately 10% of the number of training image sets used as input for the model." This indicates that ground truth was established for both training and testing datasets.
- "Publically available CT datasets from The Cancer Imaging Archive (TCIA archive) were used and both AutoContour and manually added ground truth contours following the same structure guidelines used for structure model training were added to the image sets." This suggests that for publicly available datasets used for both training and external validation, ground truth was added following the same NRG/RTOG guidelines. For proprietary training data, a similar expert-based ground truth creation likely occurred.
§ 892.2050 Medical image management and processing system.

(a) Identification. A medical image management and processing system is a device that provides one or more capabilities relating to the review and digital processing of medical images for the purposes of interpretation by a trained practitioner of disease detection, diagnosis, or patient management. The software components may provide advanced or complex image processing functions for image manipulation, enhancement, or quantification that are intended for use in the interpretation and analysis of medical images. Advanced image manipulation functions may include image segmentation, multimodality image registration, or 3D visualization. Complex quantitative functions may include semi-automated measurements or time-series measurements.

(b) Classification. Class II (special controls; voluntary standards—Digital Imaging and Communications in Medicine (DICOM) Std., Joint Photographic Experts Group (JPEG) Std., Society of Motion Picture and Television Engineers (SMPTE) Test Pattern).