Multi-GPU reconstructor for large CT datasets

Blazing-fast, scalable CUDA-based CT reconstructor for high-resolution tomography (e.g. 10K × 3600 projections)

Customer:
Neoscan (Belgium)

Links and publications:
• Fast CT reconstruction of large datasets on multi-GPU systems (YouTube)
• Neoscan microCT

Platform:
Windows x64

Technology stack:
C++17, CUDA

Description

Our objective was to create an efficient multi-GPU implementation of the classical Feldkamp cone-beam algorithm. No quality trade-offs were allowed: all optimizations had to be carried out without compromising the final result. The main focus was on processing large datasets (≥ 5K), but efficiency on smaller ones also had to remain high.
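To illustrate one step of the Feldkamp algorithm, here is a minimal CPU sketch of the cosine pre-weighting that is applied to each projection before filtering and backprojection. It is not taken from the Neoscan module: the function names, the flat-detector geometry, and the parameters `pixelPitch` and `sourceToDetector` are assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine of the angle between the central ray and the ray that hits the
// detector at offset (u, v) from its center. All lengths share one unit.
// Hypothetical helper, named for this example only.
inline double fdkCosineWeight(double u, double v, double sourceToDetector) {
    assert(sourceToDetector > 0.0);
    return sourceToDetector /
           std::sqrt(sourceToDetector * sourceToDetector + u * u + v * v);
}

// Applies the weight in place to one row-major projection image.
// Assumes a flat detector whose center lies on the central ray.
inline void weightProjection(std::vector<float>& projection,
                             std::size_t width, std::size_t height,
                             double pixelPitch, double sourceToDetector) {
    assert(width > 0 && height > 0);
    assert(projection.size() == width * height);
    const double centerCol = (width - 1) / 2.0;
    const double centerRow = (height - 1) / 2.0;
    for (std::size_t row = 0; row < height; ++row) {
        const double v = (row - centerRow) * pixelPitch;
        for (std::size_t col = 0; col < width; ++col) {
            const double u = (col - centerCol) * pixelPitch;
            projection[row * width + col] *=
                static_cast<float>(fdkCosineWeight(u, v, sourceToDetector));
        }
    }
}
```

Every pixel is weighted independently of the others, which is why steps like this one map naturally onto one GPU thread per pixel.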

The module developed over the course of the project is integrated into the Neoscan microCT software and supports all common options, such as 360/180+ scans, beam hardening and ring artifact correction, smoothing, misalignment compensation, and ROI.

Our optimization goals were to take full advantage of CUDA, multiple GPUs, and server clusters. We maximized the performance of every computational stage by balancing the load across the GPU subsystems: memory bandwidth, caches, texture units, and SMs. We also implemented direct asynchronous disk access and work directly with the main image file formats, bypassing slow standard codecs.
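The idea of overlapping disk access with computation can be sketched in portable C++ as a simple prefetch pipeline: while block i is being processed, block i + 1 is already being loaded in the background. This is only an illustration and uses `std::async` in place of whatever native asynchronous I/O the actual module relies on. The names `Block` and `runPrefetchPipeline` and the `load`/`process` callbacks are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

using Block = std::vector<float>;  // one chunk of projection data (assumed layout)

// Calls process(i, block) for i = 0 .. blockCount - 1 on the calling thread,
// while load(i + 1) runs on a background thread.
inline void runPrefetchPipeline(
    std::size_t blockCount,
    const std::function<Block(std::size_t)>& load,
    const std::function<void(std::size_t, const Block&)>& process) {
    assert(load && process);
    if (blockCount == 0) return;

    // Start loading the first block before entering the loop.
    std::future<Block> next =
        std::async(std::launch::async, load, std::size_t{0});

    for (std::size_t i = 0; i < blockCount; ++i) {
        Block current = next.get();  // wait for the block requested earlier
        if (i + 1 < blockCount) {
            // Request the following block, then process the current one.
            next = std::async(std::launch::async, load, i + 1);
        }
        process(i, current);
    }
}
```

In a GPU pipeline, `process` would typically upload the block and launch the kernels, so that the disk, the PCIe bus, and the GPU are all busy at the same time.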

Benchmarking on real systems shows a 20–50× speedup over CPU software. Our reconstructor is on par with other GPU software on small datasets and 3–5 times faster than commercially available solutions on large ones. This advantage grows steadily with the complexity of the task.

It is also important to note that our solution scales well. A simple video card upgrade can yield a significant performance increase. Adding a second video card brings a speedup of 1.5–1.8× or more. The reconstructor is also cluster-ready, so each additional server unit scales performance linearly.
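One common way to spread a reconstruction over several GPUs is to give each device a contiguous range of output slices, sized in proportion to its relative throughput. The sketch below shows that idea together with the arithmetic behind the scaling figures: a speedup of 1.5–1.8× on two cards corresponds to a parallel efficiency of 75–90%. The source does not say how the work is actually divided, so `SliceRange`, `splitSlices`, `parallelEfficiency`, and the weights are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

struct SliceRange {
    std::size_t first;  // index of the first slice assigned to the device
    std::size_t count;  // number of slices assigned to the device
};

// Splits sliceCount slices into one contiguous range per device,
// proportional to deviceWeights (e.g. relative measured throughput).
inline std::vector<SliceRange> splitSlices(
    std::size_t sliceCount, const std::vector<double>& deviceWeights) {
    assert(!deviceWeights.empty());
    const double total =
        std::accumulate(deviceWeights.begin(), deviceWeights.end(), 0.0);
    assert(total > 0.0);

    std::vector<SliceRange> ranges;
    ranges.reserve(deviceWeights.size());
    std::size_t assigned = 0;
    double cumulative = 0.0;
    for (std::size_t d = 0; d < deviceWeights.size(); ++d) {
        cumulative += deviceWeights[d];
        // Rounded end of this device's range; the last device takes the rest.
        std::size_t end =
            (d + 1 == deviceWeights.size())
                ? sliceCount
                : static_cast<std::size_t>(sliceCount * (cumulative / total) + 0.5);
        if (end < assigned) end = assigned;
        if (end > sliceCount) end = sliceCount;
        ranges.push_back({assigned, end - assigned});
        assigned = end;
    }
    return ranges;
}

// Fraction of ideal linear scaling that was achieved.
inline double parallelEfficiency(double speedup, std::size_t deviceCount) {
    assert(deviceCount > 0);
    return speedup / static_cast<double>(deviceCount);
}
```

For example, `splitSlices(1000, {1.0, 3.0})` assigns 250 slices to the first device and 750 to the second, and `parallelEfficiency(1.8, 2)` evaluates to 0.9.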

The results of the work are discussed in more detail in the video presentation below: