Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FBP filtering on GPU #423

Draft
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

tsadakane
Copy link
Contributor

@tsadakane tsadakane commented Jan 8, 2023

Summary

Implemented FBP filtering on GPU using cuFFT.

See #312

Test

Condition

Software Version
Windows 10 64bit
Python(anaconda) 3.10.4
CUDA 11.8
Visual Studio 2022
GPU GTX1060 6GB
Software Version
Windows 10 64bit
MATLAB 2022a
CUDA 11.8
Visual Studio 2022
GPU GTX1060 6GB

Python

import numpy as np
import time

import tigre
import tigre.algorithms as algs
from tigre.utilities import sample_loader
from tigre.utilities.Measure_Quality import Measure_Quality
import tigre.utilities.gpu as gpu
import matplotlib.pyplot as plt

listGpuNames = gpu.getGpuNames()
if len(listGpuNames) == 0:
    print("Error: No gpu found")
else:
    for id in range(len(listGpuNames)):
        print("{}: {}".format(id, listGpuNames[id]))

gpuids = gpu.getGpuIds(listGpuNames[0])
print(gpuids)

mag = 4
# Geometry
# geo1 = tigre.geometry(mode='cone', high_resolution=False, default=True)
geo = tigre.geometry(mode="cone", nVoxel=np.array([256, 256, 256]), default=True)
geo.nDetector = np.array([256, 256])*mag
geo.dDetector = np.array([1.6, 1.6])/mag   # size of each pixel            (mm)
geo.sDetector = geo.dDetector * geo.nDetector
# print(geo)

nangles = 100
angles = np.linspace(0, 2 * np.pi, nangles, endpoint=False, dtype=np.float32)

# Prepare projection data
head = sample_loader.load_head_phantom(geo.nVoxel)
proj = tigre.Ax(head, geo, angles, gpuids=gpuids)
print("proj size = {}".format(proj.shape))

# Reconstruct
niter = 10
print("niter = {}".format(niter))
timer = time.perf_counter
tic = timer()
for _ in range(niter):
    fdkout = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=False)
toc = timer()
print("CPU: {}".format(toc-tic))
tic = timer()
for _ in range(niter):
    ossart = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=True)
toc = timer()
print("GPU: {}".format(toc-tic))


# Measure Quality
# 'RMSE', 'MSSIM', 'SSD', 'UQI'
print("RMSE fdk CPU: ")
print(Measure_Quality(fdkout, head, ["nRMSE"]))
print("RMSE fdk GPU: ")
print(Measure_Quality(ossart, head, ["nRMSE"]))

# Plot
fig, axes = plt.subplots(3, 3)
axes[0, 0].set_title("FDK FFT on CPU")
axes[0, 0].imshow(fdkout[geo.nVoxel[0] // 2])
axes[1, 0].imshow(fdkout[:, geo.nVoxel[1] // 2, :])
axes[2, 0].imshow(fdkout[:, :, geo.nVoxel[2] // 2])
axes[0, 1].set_title("FDK FFT on GPU")
axes[0, 1].imshow(ossart[geo.nVoxel[0] // 2])
axes[1, 1].imshow(ossart[:, geo.nVoxel[1] // 2, :])
axes[2, 1].imshow(ossart[:, :, geo.nVoxel[2] // 2])
axes[0, 2].set_title("GPU-CPU")
axes[0, 2].imshow(ossart[geo.nVoxel[0] // 2]      -fdkout[geo.nVoxel[0] // 2])
axes[1, 2].imshow(ossart[:, geo.nVoxel[1] // 2, :]-fdkout[:, geo.nVoxel[1] // 2, :])
axes[2, 2].imshow(ossart[:, :, geo.nVoxel[2] // 2]-fdkout[:, :, geo.nVoxel[2] // 2])
plt.show()
# tigre.plotProj(proj)
# tigre.plotImg(fdkout)

Matlab

clear;
close all;
%% Geometry
geo=defaultGeometry('nVoxel',[512;512;256]);                     

%% Load data and generate projections 
% define angles
angles=linspace(0,2*pi,100);
% Load thorax phatom data
head=headPhantom(geo.nVoxel);
% generate projections
projections=Ax(head,geo,angles,'interpolated');
sprintf("size of projections = %s", mat2str(size(projections)))

niter = 10;
sprintf("niter = %d", niter)

disp("CPU")
tic
for ii=1:niter
% FDK with CPU filtering
imgFDK_CPU=FDK(projections,geo,angles,'usegpufft',false);
end
toc
disp("GPU")
tic
for ii=1:niter
% FDK with GPU filtering
imgFDK_GPU=FDK(projections,geo,angles,'usegpufft',true);
end
toc
% Show the results
%plotImg([imgFDK_CPU,imgFDK_GPU],'Dim','Z');
%plotImg([imgFDK_GPU],'Dim','Z');
imshow([imgFDK_CPU(:,:,50), imgFDK_GPU(:,:,50), imgFDK_CPU(:,:,50)- imgFDK_GPU(:,:,50)]);

Discussion

This code consumes more time than CPU even when the projections are size of 2048x2048.

  1. I found that
    for (int iV = 0; iV < uiVLen; ++iV) {
    for (int iU = 0; iU < uiULen; ++iU) {
    h_pcfInOut[iU+uiULen*iV] = cufftComplex{pfIn[iU+uiULen*iV], 0};
    }
    }

and

for (int iV = 0; iV < uiVLen; ++iV) {
for (int iU = 0; iU < uiULen; ++iU) {
pfOut[iU+uiULen*iV] = h_pcfInOut[iU+uiULen*iV].x;
}
}

consumes time, while FFT and multiplying the filter in the Fourier space is fast enough.
These are necessary as long as we use FFT C2C.
R2C and C2R may work.

  1. As for now, the CUDA codes are common for Matlab and Python because filtering.m transposes the matrix before calling fft.

  2. Only one GPU is used.

  3. Only one stream is used.

@tsadakane tsadakane force-pushed the enh_fbp-filtering-on-gpu branch from 98fa77f to 262ac8f Compare January 8, 2023 15:01
@tsadakane
Copy link
Contributor Author

tsadakane commented Jan 8, 2023

R2C and C2R may work.

Worked. (a39e6b2)

Results of python test code

Conditions

0: NVIDIA GeForce GTX 1060 6GB
{'name': 'NVIDIA GeForce GTX 1060 6GB', 'devices': [0]}
proj size = (100, 1024, 1024)
niter = 10

Before

CPU: 24.03577409999707 s
GPU: 28.770307099999627 s
RMSE fdk CPU:
526.5072514306678
RMSE fdk GPU:
526.5072768954007

After

CPU: 24.188547600002494 s
GPU: 12.808391700003995 s
RMSE fdk CPU:
526.5072514306678
RMSE fdk GPU:
526.5072259659336

@tsadakane
Copy link
Contributor Author

@AnderBiguri Do you think we should keep the option to use/not to use GPU for FBP filtering for filtering and FDK function?

@AnderBiguri
Copy link
Member

@tsadakane Fantastic work, honestly!

I am actually surprised that GPU is not much faster, but still the speedup achieved is quite significant. Perhaps you are running in a CPU with many cores?

I will test this on my side in a couple of machines (old/new) and report speed (sometime next week). I think if we can more or less verify that the filtering is always faster on GPU, we can just remove the CPU option from FBP/FDK. We can always leave the matlab/python code there, deprecated, for people to be able to read what the CUDA part does, but in an easier mode. What do you think?

@tsadakane
Copy link
Contributor Author

@AnderBiguri Thank you for your reply. I think it would be better to remove the old code if possible. Let me know the result and I can remove the option before submitting this PR.

I am actually surprised that GPU is not much faster, but still the speedup achieved is quite significant. Perhaps you are running in a CPU with many cores?

The CPU is i7-11700 (8 cores, 16 threads).
I found that only about 2 sec. in 12.8 sec. of GPU was creating and destroying FFT "plan"s and FFT, multipling filter and IFFT, other 7 sec. is consumed before and after calling the host function in FbpFiltration.cu, apply_filtration.

@AnderBiguri
Copy link
Member

@tsadakane fantastic, I think its not a bad idea, it will live in git's history anyway, so you are right.
I will test on a few different machines to ensure we can reliably say the GPU is faster all the time.

In your time comments, you are saying that most of the time (7s) is on the FFT maths, and the rest is memory and context management?

@AnderBiguri
Copy link
Member

In my i7-7700HQ/GTX 1070 laptop, GPU is often slower:

MATLAB:

geo.nDetect 256^2, nagles=100, niter=10:

CPU
Elapsed time is 6.132782 seconds.
GPU
Elapsed time is 8.440495 seconds.

geo.nDetect 512^2, nagles=100, niter=10:

CPU
Elapsed time is 16.697335 seconds.
GPU
Elapsed time is 18.622791 seconds.

geo.nDetect 1024^2, nagles=100, niter=10:

CPU
Elapsed time is 54.229799 seconds.
GPU
Elapsed time is 51.912400 seconds.

I still am quite surprised by this performance, I really did think it would be much faster. I'll try to dig a bit further, I am not sure if there is something extra we can optimize in the code that we may not be aware of, or if my assumption that it would be much faster is seriously flawed.

@tsadakane
Copy link
Contributor Author

In your time comments, you are saying that most of the time (7s) is on the FFT maths, and the rest is memory and context management?

Sorry for unclear English.
I meant...

  • 2 sec. is FFT, multiplying filter, and IFFT. (*)
  • 7 sec. is python<->cuda interoperability stuff. (**)
  • other is cuda memory allocation and transfer.

(*) In apply_filtration, commenting out the lines of creating and destroying "plan", which nvidia calls the FFT object, and the lines in the curly bracket, the elapsed time reduced 2sec from 12sec. (became 10sec).
In the 2 sec., 0.7 sec is for FFT, multiplying filter and IFFT, and 1.4 sec. is for creating/destroying the plan.

(**) Returning at the beginning of apply_filtration, the time became 7sec from 12sec.

@AnderBiguri
Copy link
Member

@tsadakane I see, thanks. So if we were to try to optimize this we'd need to look at the python<->cuda interoperability. I need to sit down and read the code more carefully, as this seems surprisingly high? What do you think?

@tsadakane
Copy link
Contributor Author

I have no idea so far.

  • Commenting out all but the last line of filtering function in filtering.py, which is called by FDK before Atb is called, the time became 4sec.
  • In the filtering function in fintering.py, I modified as
...
    if use_gpu:
        bundle_size = 32
        fproj=np.empty((bundle_size, geo.nDetector[0],filt_len),dtype=np.float32)
        for idx_bundle in range(0,(angles.shape[0]+bundle_size-1)//bundle_size):
            fproj.fill(0)
            for k in range(0, bundle_size):
                if idx_bundle*bundle_size+k < angles.shape[0]:
                    fproj[k, :,padding:padding+geo.nDetector[1]]=proj[idx_bundle*bundle_size+k]

            fproj = fproj.reshape([bundle_size*geo.nDetector[0], filt_len])
            fproj = fbpfiltration.apply(fproj, filt, gpuids)

            fproj = fproj.reshape([bundle_size, geo.nDetector[0], filt_len])
            for k in range(0, bundle_size):
                if idx_bundle*bundle_size+k < angles.shape[0]:
                    proj[idx_bundle*bundle_size+k]=fproj[k, :,padding:padding+geo.nDetector[1]] * scale_factor
    else:
...

to reduce the number of calling cuda function to 1/32, but the total time did not change (12s).

@tsadakane
Copy link
Contributor Author

I wrote multi-gpu version and cudaMemcpyAsync version, but as expected, the both were no effect.

@tsadakane
Copy link
Contributor Author

Just removing padding from filtering function in filtering.py reduced 3 sec., so I moved the padding into cuda function.
This reduces the amount of memory to transfer. For effective strided data copy, I used cudaMemcpy2D, then GPU is now three times faster than GPU.

0: NVIDIA GeForce GTX 1060 6GB
{'name': 'NVIDIA GeForce GTX 1060 6GB', 'devices': [0]}
proj size = (100, 1024, 1024)
niter = 10
CPU: 23.939920199998596
GPU: 7.977493500002311
RMSE fdk CPU: 
526.5072514306678
RMSE fdk GPU: 
526.5072514306678

filtering.py will be as

def filtering(proj, geo, angles, parker, verbose=False, use_gpu=True, gpuids=None):
    if parker:
        proj=parkerweight(proj.transpose(0,2,1),geo,angles,parker).transpose(0,2,1)

    filt_len=max(64,2**nextpow2(2*max(geo.nDetector)))
    ramp_kernel=ramp_flat(filt_len)

    d=1
    filt=filter(geo.filter,ramp_kernel[0],filt_len,d,verbose=verbose)

    padding = int((filt_len-geo.nDetector[1])//2 )
    scale_factor = (geo.DSD[0]/geo.DSO[0]) * (2 * np.pi/ len(angles)) / ( 4 * geo.dDetector[0] ) 
    if use_gpu:
        bundle_size = 32 #len(gpuids)
        n_bundles = (angles.shape[0]+bundle_size-1)//bundle_size
        for idx_bundle in range(0,n_bundles):
            bundle_size_actual = bundle_size if idx_bundle != n_bundles-1 else angles.shape[0]-idx_bundle*bundle_size
            idx_begin = idx_bundle*bundle_size
            idx_end = idx_bundle*bundle_size+bundle_size_actual
            proj[idx_begin:idx_end] = fbpfiltration.apply2(proj, idx_begin, idx_end, filt, scale_factor, gpuids) 

    else:
        filt=np.kron(np.ones((np.int64(geo.nDetector[0]),1),dtype=np.float32),filt)

        #filter 2 projection at a time packing in to complex container
        fproj=np.empty((geo.nDetector[0],filt_len),dtype=np.complex64)
        for i in range(0,angles.shape[0]-1,2):
            fproj.fill(0)
            fproj.real[:,padding:padding+geo.nDetector[1]]=proj[i]
            fproj.imag[:,padding:padding+geo.nDetector[1]]=proj[i+1]

            fproj=fft(fproj,axis=1)
            fproj=fproj*filt
            fproj=ifft(fproj,axis=1)

            proj[i]=fproj.real[:,padding:padding+geo.nDetector[1]] * scale_factor
            proj[i+1]=fproj.imag[:,padding:padding+geo.nDetector[1]] * scale_factor

        #if odd number of projections filter last solo
        if angles.shape[0] % 2:
            fproj.fill(0)
            fproj.real[:,padding:padding+geo.nDetector[1]]=proj[angles.shape[0]-1]

            fproj=fft(fproj,axis=1)
            fproj=fproj*filt
            fproj=np.real(ifft(fproj,axis=1))     
            proj[angles.shape[0]-1]=fproj[:,padding:padding+geo.nDetector[1]] * scale_factor

    return proj

I'm worried that by moving "padding" to cuda, the "if GPU" code is not parallel to CPU. What do you think?

I have not yet run this modification on Matlab and not yet commited.

@AnderBiguri
Copy link
Member

@tsadakane interesting!

Sorry, I am a bit busy this week, perhaps next week, so I don't have time to look at it with the detail it deserves.
I intend to have a serious look at it as soon as I can.

@tsadakane
Copy link
Contributor Author

tsadakane commented Jan 21, 2023

About aa51ec6

"CPU" uses CPU for convolution.
In "GPU 1", the convolution is processed in a GPU, even if gpuids includes multiple GPUs. (=a39e6b2)
In "GPU 2", the padding and convolution are processed in GPUs.

Tests

Python

test code

import numpy as np
import time

import tigre
import tigre.algorithms as algs
from tigre.utilities import sample_loader
from tigre.utilities.Measure_Quality import Measure_Quality
import tigre.utilities.gpu as gpu
import matplotlib.pyplot as plt

listGpuNames = gpu.getGpuNames()
if len(listGpuNames) == 0:
    print("Error: No gpu found")
else:
    for id in range(len(listGpuNames)):
        print("{}: {}".format(id, listGpuNames[id]))

gpuids = gpu.getGpuIds(listGpuNames[0])
print(gpuids)

mag = 4
# Geometry
# geo1 = tigre.geometry(mode='cone', high_resolution=False, default=True)
geo = tigre.geometry(mode="cone", nVoxel=np.array([256, 256, 256]), default=True)
geo.nDetector = np.array([256, 256])*mag
geo.dDetector = np.array([1.6, 1.6])/mag   # size of each pixel            (mm)
geo.sDetector = geo.dDetector * geo.nDetector
# print(geo)

nangles = 100
angles = np.linspace(0, 2 * np.pi, nangles, endpoint=False, dtype=np.float32)

# Prepare projection data
head = sample_loader.load_head_phantom(geo.nVoxel)
proj = tigre.Ax(head, geo, angles, gpuids=gpuids)
print("proj size = {}".format(proj.shape))

# Reconstruct
niter = 10
print("niter = {}".format(niter))
timer = time.perf_counter

tic = timer()
for _ in range(niter):
    out_cpu = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=0)
toc = timer()
print("CPU  : {}".format(toc-tic))
tic = timer()
for _ in range(niter):
    out_gpu1 = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=1)
toc = timer()
print("GPU 1: {}".format(toc-tic))

tic = timer()
for _ in range(niter):
    out_gpu2 = algs.fdk(proj, geo, angles, gpuids=gpuids, use_gpu_for_filtration=2)
toc = timer()
print("GPU 2: {}".format(toc-tic))

# Measure Quality
# 'RMSE', 'MSSIM', 'SSD', 'UQI'
print("RMSE fdk CPU: ")
print(Measure_Quality(out_cpu, head, ["nRMSE"]))
print("RMSE fdk GPU 1: ")
print(Measure_Quality(out_gpu1, head, ["nRMSE"]))
print("RMSE fdk GPU 2: ")
print(Measure_Quality(out_gpu2, head, ["nRMSE"]))

# Plot
fig, axes = plt.subplots(3, 5)
axes[0, 0].set_title("FDK FFT on CPU")
axes[0, 0].imshow(out_cpu[geo.nVoxel[0] // 2])
axes[1, 0].imshow(out_cpu[:, geo.nVoxel[1] // 2, :])
axes[2, 0].imshow(out_cpu[:, :, geo.nVoxel[2] // 2])
axes[0, 1].set_title("FDK FFT on GPU")
axes[0, 1].imshow(out_gpu1[geo.nVoxel[0] // 2])
axes[1, 1].imshow(out_gpu1[:, geo.nVoxel[1] // 2, :])
axes[2, 1].imshow(out_gpu1[:, :, geo.nVoxel[2] // 2])
axes[0, 2].set_title("FDK FFT on GPU 2")
axes[0, 2].imshow(out_gpu2[geo.nVoxel[0] // 2])
axes[1, 2].imshow(out_gpu2[:, geo.nVoxel[1] // 2, :])
axes[2, 2].imshow(out_gpu2[:, :, geo.nVoxel[2] // 2])
axes[0, 3].set_title("GPU1-CPU")
axes[0, 3].imshow(out_gpu1[geo.nVoxel[0] // 2]      -out_cpu[geo.nVoxel[0] // 2])
axes[1, 3].imshow(out_gpu1[:, geo.nVoxel[1] // 2, :]-out_cpu[:, geo.nVoxel[1] // 2, :])
axes[2, 3].imshow(out_gpu1[:, :, geo.nVoxel[2] // 2]-out_cpu[:, :, geo.nVoxel[2] // 2])
axes[0, 4].set_title("GPU2-CPU")
axes[0, 4].imshow(out_gpu2[geo.nVoxel[0] // 2]      -out_cpu[geo.nVoxel[0] // 2])
axes[1, 4].imshow(out_gpu2[:, geo.nVoxel[1] // 2, :]-out_cpu[:, geo.nVoxel[1] // 2, :])
axes[2, 4].imshow(out_gpu2[:, :, geo.nVoxel[2] // 2]-out_cpu[:, :, geo.nVoxel[2] // 2])
plt.show()

Result

Software Version
Windows 10 64bit
Python(Intel) 3.9.15
CUDA 11.8
Visual Studio 2022
GPU 2x NVIDIA GeForce RTX 2080 Ti
0: NVIDIA GeForce RTX 2080 Ti
1: NVIDIA GeForce RTX 2080 Ti
2: NVIDIA GeForce GTX 1070
{'name': 'NVIDIA GeForce RTX 2080 Ti', 'devices': [0, 1]}
proj size = (100, 1024, 1024)
niter = 10
[path to tigre working dir.]\Python\tigre\utilities\filtering.py:80: RuntimeWarning: divide by zero encountered in true_divide
  h[odd] = -1 / (np.pi * nn[odd]) ** 2
CPU  : 35.7683762
GPU 1: 19.560339300000003
GPU 2: 12.197235299999996
RMSE fdk CPU: 
526.5050614590306
RMSE fdk GPU 1: 
526.504934134818
RMSE fdk GPU 2: 
526.504934134818

Malab

Software Version
Windows 10 64bit
MATLAB 2022a
CUDA 11.8
Visual Studio 2022
GPU 2x NVIDIA GeForce RTX 2080 Ti

Test code

clear;
close all;
%% Geometry
geo=defaultGeometry('nVoxel',[512;512;256], "nDetector", [1024;1024]);                     

%% GPUs
gpuids = GpuIds('NVIDIA GeForce RTX 2080 Ti');
disp(gpuids);
disp(gpuids.devices);

%% Load data and generate projections 
% define angles
angles=linspace(0,2*pi,100);
% Load thorax phatom data
head=headPhantom(geo.nVoxel);
% generate projections
projections=Ax(head,geo,angles,'interpolated', 'gpuids', gpuids);
sprintf("size of projections = %s", mat2str(size(projections)))

niter = 10;%10;
sprintf("niter = %d", niter)

disp("CPU")
tic
for ii=1:niter
% FDK with CPU filtering
imgFDK_CPU=FDK(projections,geo,angles,'usegpufft',0, 'gpuids', gpuids);
end
toc
disp("GPU 1")
tic
for ii=1:niter
% FDK with GPU filtering
imgFDK_GPU1=FDK(projections,geo,angles,'usegpufft',1, 'gpuids', gpuids);
end
toc
disp("GPU 2")
tic
for ii=1:niter
% FDK with GPU filtering
imgFDK_GPU2=FDK(projections,geo,angles,'usegpufft',2, 'gpuids', gpuids);
end
toc
% Show the results
%plotImg([imgFDK_CPU,imgFDK_GPU],'Dim','Z');
%plotImg([imgFDK_GPU],'Dim','Z');
imshow([imgFDK_CPU(:,:,50), imgFDK_GPU1(:,:,50), imgFDK_GPU2(:,:,50), abs(imgFDK_GPU1(:,:,50) - imgFDK_CPU(:,:,50)), abs(imgFDK_GPU2(:,:,50)- imgFDK_CPU(:,:,50))]);
%imshow([imgFDK_GPU(:,:,2)]);

Result

  GpuIds with properties:

       name: 'NVIDIA GeForce RTX 2080 Ti'
    devices: [0 1]
   0   1
ans = 
    "size of projections = [1024 1024 100]"
ans = 
    "niter = 10"
CPU
Elapsed time is 35.826591 seconds.
GPU 1
Elapsed time is 36.107104 seconds.
GPU 2
Elapsed time is 29.146955 seconds.

@tsadakane
Copy link
Contributor Author

tsadakane commented Jan 21, 2023

The table below is the summary of the results above.
The result of python is acceptable for me, but the result of matlab is hard to understand. At first, I suspected that only one GPU is used due to a bug, but when I checked, deviceCount in apply_filtration2 function was 2.

Python [s] Matlab [s]
CPU 35.8 35.8
GPU 1 (Filtering only) 19.6 36.1
GPU 2 (Padding and Filtering) 12.2 29.1

Here, two GPUs are used.

@tsadakane
Copy link
Contributor Author

tsadakane commented Jan 22, 2023

I ran the python test program above after adding gpuids.devices.pop(-1) to use only one GPU.

Python

2x GPU [s], 1x GPU [s]
CPU 35.9 33.8
GPU 1 (Filtering only) 19.2 16.8
GPU 2 (Padding and Filtering) 12.1 10.4

It seems that the overhead of the copying is larger than the benefit of doubling the GPU resource.

@AnderBiguri
Copy link
Member

@tsadakane that makes sense. Stil surprised about the times, but I think its my own false assumptions of how fast it should be. Still really good :)

@AnderBiguri
Copy link
Member

I am also a bit baffled by the MATLAB slowness, maybe something to do with extra copies? let me check.

@tsadakane
Copy link
Contributor Author

I think matlab code (copies &) transposes the matrix; the fastest variable in the memory is v in matlab while in python it is u, but in the cuda code, there is no difference between matlab and python.

@AnderBiguri
Copy link
Member

@tsadakane ah! We may be able to do then the same that we do with the other cuda code? #if IS_FOR_MATLAB_TIGRE kind of thing? I am not entirely sure where this is applied in the new CUDA code, happy to have a look if you want, but if you know where it happens in CUDA, feel free to do it yourself.

Thanks again, your work is really appreciated!

@tsadakane
Copy link
Contributor Author

I can do it, but I am not sure if it works, because cuda is not good at transposing.

It was not necessary to transpose the projection in ax and atb, because it is copied to the 3d texture, where there is no difference in the memory access efficiency between u and v axes. On the other hand, the convolution must be applied to the u-direction, so transposing in the memory layout is necessary.

That's said, I think, from the point of view of symmetry, permutation line should be moved to the filtering function if possible, i.e

%% Weight
proj=permute(proj,[2 1 3]);
for ii=1:size(angles,2)
us = ((-geo.nDetector(1)/2+0.5):1:(geo.nDetector(1)/2-0.5))*geo.dDetector(1) + offset(1,ii);
vs = ((-geo.nDetector(2)/2+0.5):1:(geo.nDetector(2)/2-0.5))*geo.dDetector(2) + offset(2,ii);
[uu,vv] = meshgrid(us,vs); % detector
% Create weight according to each detector element
w = (geo.DSD(ii))./sqrt((geo.DSD(ii))^2+uu.^2 + vv.^2);
% Multiply the weights with projection data
proj(:,:,ii) = proj(:,:,ii).*w';
end
%% Fourier transform based filtering
proj = filtering(proj,geo,angles,parker); % Not sure if offsets are good in here

=>

%% Weight
% proj=permute(proj,[2 1 3]);  %%%%%%%%%% Detele
for ii=1:size(angles,2)
    us = ((-geo.nDetector(1)/2+0.5):1:(geo.nDetector(1)/2-0.5))*geo.dDetector(1) + offset(1,ii);
    vs = ((-geo.nDetector(2)/2+0.5):1:(geo.nDetector(2)/2-0.5))*geo.dDetector(2) + offset(2,ii);
    [uu,vv] = meshgrid(us,vs); % detector
    
    % Create weight according to each detector element
    w = (geo.DSD(ii))./sqrt((geo.DSD(ii))^2+uu.^2 + vv.^2);
    
    % Multiply the weights with projection data
    proj(:,:,ii) = w.*proj(:,:,ii);   %%%%%%%%%% Transpose
end
%% Fourier transform based filtering
%proj=permute(proj,[2 1 3]);   %%%%%%%%%% Add this line at the beginning of the filtering function.
proj = filtering(proj,geo,angles,parker); % Not sure if offsets are good in here

This modification makes the code a little bit beautiful, but the specification of the filtering function will be changed (the first arg proj is not transposed).

What do you think?

@AnderBiguri
Copy link
Member

@tsadakane I see what you mean with the filtering. My convolution/fft maths are a bit rusty, so I may be completely wrong, but isn't there a way to change the shape of the filter itself so it is equivalent to transposing the projections? That would be the ideal solution, if possible, right?

About changing the transpose to inside filtering, 100% agree is much elegant.

@AnderBiguri
Copy link
Member

Hi @tsadakane, sorry I have been very busy with my life. How is this going? do you need any help? Is it done?

@tsadakane
Copy link
Contributor Author

Hi @AnderBiguri, I've been busy lately too. I've implemented the code to transpose matrices, but I am not satisfied with the result and trying several granularities of processing to pass to mex/cuda.

tsadakane added 6 commits May 29, 2023 22:16
* Modified and added Matlab files
* Modified and added Python files
* Use cufftExecR2C and cufftExecC2R to omit slow copy.
* Padding and FBP filtration are applied in `apply_filtration2` function
@tsadakane tsadakane force-pushed the enh_fbp-filtering-on-gpu branch from bb7e0d7 to 7c97471 Compare May 29, 2023 13:28
@AnderBiguri
Copy link
Member

Aparently making the fft and ifft calls in python have the workers argument can accelerate them. fproj = fft(fproj, axis=1, workers=cores), given cores=os.cpu_count(). Haven't tried it myself.

@tsadakane
Copy link
Contributor Author

Under the condition of

proj size = (100, 1024, 1024)
nVoxel = [256, 512, 512]

On 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz (8 physical core + hyper-threading => cores=16) + GTX1060

0, CPU: Padding & FFT       : 26.59433160000026 s @ 10 times
1, CPU: Multi-Workers       : 15.725976400000036 s @ 10 times

Not bad.

@AnderBiguri
Copy link
Member

I think the GPU implementation can be valuable too. I think the fastest FDK would be one that does interleave filtering and backprojection. I read a paper about it at some point.

@AnderBiguri
Copy link
Member

Hi @tsadakane
What do you think about all this? worth merging it for the cases when its faster?

@tsadakane
Copy link
Contributor Author

Hi, @AnderBiguri ,

It's difficult to answer, but I'd like to say no.

  • Offloading a light weight task like FFT is not worth for the cost of copying back and forth between CPU and GPU (and plus transposing if its run from MATLAB).
  • Using fft's workers argument of python can achieve speedups comparable to GPU offloading.
  • TIGRE's flexibility will be impaired.

@AnderBiguri
Copy link
Member

@tsadakane makes sense!

Flexibility is a bit less important because we can add a flag with default CPU, but I think the rest of the points make total sense.
Indeed, this would make sense if there was a FFT+Backproject in the same CUDA call, but then the multi-GPU code gets complex.

It may be worth leaving the PR here (if you are happy with that), in case anyone in the future wants to use the code?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants