Multi-GPU reconstructor for large CT datasets
Blazing-fast, scalable CUDA-based CT reconstructor for large datasets (e.g. 10K × 3600 projections)

Links and publications:
Fast CT reconstruction of large datasets on multi-GPU systems (YouTube)
Neoscan microCT
Windows x64

Technology stack:
C++17, CUDA

The objective of the Unicore Solutions team was to create an efficient multi-GPU implementation of the classical Feldkamp cone-beam algorithm. No quality trade-offs were allowed: quality came first, and speed second. We targeted large datasets (≥ 5K) while staying efficient on smaller ones.
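
For readers unfamiliar with the Feldkamp (FDK) algorithm, its first stage scales every projection pixel by the cosine of the angle between that pixel's ray and the central ray, before filtering and backprojection. The sketch below shows only this pre-weighting step on the CPU in plain C++17. It is an illustration under stated assumptions, not the code of the reconstructor described here: the `DetectorGeometry` struct, the function names, and the assumption that the central ray hits the exact centre of a flat detector are all hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical geometry description (illustrative names, not the product's API).
struct DetectorGeometry {
    std::size_t cols;         // detector width in pixels
    std::size_t rows;         // detector height in pixels
    double pixelSize;         // pixel pitch, same unit as sourceToDetector
    double sourceToDetector;  // source-to-detector distance
};

// FDK cosine weight D / sqrt(D^2 + u^2 + v^2) for one detector pixel.
// Assumes the central ray hits the geometric centre of the detector.
inline double fdkWeight(const DetectorGeometry& g, std::size_t col, std::size_t row) {
    const double u = (static_cast<double>(col) - 0.5 * (g.cols - 1)) * g.pixelSize;
    const double v = (static_cast<double>(row) - 0.5 * (g.rows - 1)) * g.pixelSize;
    const double d = g.sourceToDetector;
    return d / std::sqrt(d * d + u * u + v * v);
}

// Apply the weights in place to one row-major projection image.
inline void applyFdkWeights(const DetectorGeometry& g, std::vector<float>& projection) {
    assert(projection.size() == g.cols * g.rows);
    for (std::size_t row = 0; row < g.rows; ++row) {
        for (std::size_t col = 0; col < g.cols; ++col) {
            projection[row * g.cols + col] *= static_cast<float>(fdkWeight(g, col, row));
        }
    }
}
```

The weight equals 1 at the detector centre and falls off towards the edges. Every pixel is independent of the others, so this stage maps naturally to one GPU thread per pixel.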

The module is integrated into the Neoscan microCT software and supports all common options, such as 360/180+ scans, beam hardening, ring artifacts, smoothing, misalignment compensation, ROI, etc.

Our optimization goal was a multi-GPU, cluster-ready CUDA solution. We maximized the performance of every computational stage by balancing the load across GPU subsystems such as memory bandwidth, caches, texture units and SMs. We also implemented efficient asynchronous direct disk I/O, with no slow external codecs for the main formats.
Benchmarks on real systems show a 20–50× speed-up over CPU software. Our reconstructor is on par with other GPU software on small datasets and 3–5 times faster than commercially available solutions on large ones, and this advantage grows with dataset complexity.
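
The text above does not say how work is divided between GPUs. One common approach for FDK, where output slices can be backprojected independently, is to split the volume into contiguous slabs of slices, one slab per device, sized in proportion to each device's relative throughput. The C++17 sketch below illustrates that idea only; `SliceRange`, `partitionSlices` and the weight values are hypothetical and not part of the product.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Half-open range [begin, end) of output slices assigned to one device.
struct SliceRange {
    std::size_t begin;
    std::size_t end;
};

// Split totalSlices into contiguous slabs proportional to deviceWeights
// (e.g. measured relative throughput of each GPU). Hypothetical helper.
inline std::vector<SliceRange> partitionSlices(std::size_t totalSlices,
                                               const std::vector<double>& deviceWeights) {
    assert(!deviceWeights.empty());
    const double sum = std::accumulate(deviceWeights.begin(), deviceWeights.end(), 0.0);
    assert(sum > 0.0);

    std::vector<SliceRange> ranges;
    ranges.reserve(deviceWeights.size());
    double cumulative = 0.0;
    std::size_t begin = 0;
    for (std::size_t i = 0; i < deviceWeights.size(); ++i) {
        cumulative += deviceWeights[i];
        // The last device always takes the remainder so no slice is lost to rounding.
        std::size_t end = (i + 1 == deviceWeights.size())
            ? totalSlices
            : static_cast<std::size_t>(totalSlices * (cumulative / sum) + 0.5);
        end = std::min(std::max(end, begin), totalSlices);
        ranges.push_back({begin, end});
        begin = end;
    }
    return ranges;
}
```

For example, `partitionSlices(1000, {3.0, 1.0})` assigns slices 0 to 749 to the first device and 750 to 999 to the second. In a real multi-GPU program each slab would then be processed on its own device and stream.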

It is worth noting that our solution scales well. A simple video card upgrade can yield a significant performance increase. Adding a second video card brings a further ×1.5–1.8 speedup. The solution is also cluster-ready, so each additional server unit brings linear performance scaling.
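
To put the two-GPU figure in context, a ×1.5–1.8 speedup on two devices corresponds to a parallel efficiency of 75–90%. The small C++17 sketch below does that arithmetic and, as an additional assumption, interprets the same numbers through Amdahl's law. The function names are illustrative, and the Amdahl model is only one simple way to read such measurements, not a claim about how the reconstructor behaves internally.

```cpp
#include <cassert>

// Parallel efficiency: achieved speedup divided by the number of devices.
inline double parallelEfficiency(double speedup, int devices) {
    assert(devices > 0);
    return speedup / devices;
}

// Amdahl's law: S = 1 / ((1 - p) + p / N).
// Solving for the parallelizable fraction p gives p = (1 - 1/S) / (1 - 1/N).
inline double amdahlParallelFraction(double speedup, int devices) {
    assert(devices > 1 && speedup > 0.0);
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / devices);
}
```

With two devices, a ×1.5 speedup gives an efficiency of 0.75 and an implied parallel fraction of about 0.67, while ×1.8 gives 0.90 and about 0.89.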

A more detailed explanation is given in a short video presentation: