Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review #5

Open
maleadt opened this issue Apr 21, 2023 · 2 comments
Open

Review #5

maleadt opened this issue Apr 21, 2023 · 2 comments

Comments

@maleadt
Copy link

maleadt commented Apr 21, 2023

Hi @coezmaden! I was asked to review your JuliaCon proceedings submission in here JuliaCon/proceedings-review#128, so here goes 🙂

In general, your paper was nicely written and a pleasant read, so kudos for that! I have a few comments on the paper itself, but also on the code and the repository. I hope this is helpful!

(Also a caveat: I'm going to focus on the things I'm familiar with, which is Julia and GPUs. I hope somebody else can take a look at the SDR-related pieces.)

Reproducibility

The README should mention a Julia version to use, as 1.8 seemed incompatible (I took 1.7 from the Manifest, which works)

It would also be good if the README contained some information on how to reproduce the measurements. I gather I first need to run the scripts/run_benchmark scripts before scripts/plot_benchmarks? But that doesn't actually run on the GPU, so I have to customize the params? Doing so I still don't get the actual plots from the paper, so it'd be good to add some instructions on that.

Are there plans to upstream this work into Tracking.jl? I'm missing a bit how this work is to be used with the rest of the ecosystem, as the GPUAcceleratedTracking repository is focussed on the benchmarks related to the paper. The template in JuliaCon/proceedings-review#128 specifically asks about example usage / functionality documentation / tests etc, and that does seem to be missing a bit.

Performance

Generally I was surprised by the desktop GPU being outperformed by the CPU, and the explanation is fairly short. I see in the repository that you did actually profile the code using NVTX/nsys/ncu, so what were the conclusions from that work? What makes the GPU implementation slow (memory copy, kernels, etc)? Launch overhead shouldn't be it when we're doing >1ms of processing.

The charts in the paper also show some suspicious processing time drops at the highest sampling frequency, as well as some generally very noisy measurements for certain GPU implementations. What's going on there?

It also wasn't clear to me if your measurements are fully end-to-end? I notice you're using BenchmarkTools, so you're not 'cheating' by only measuring kernel execution time, but do your measurements include the time to upload memory to the GPU and download results back (if that matters)?

Finally, the template asks for a comparison to approaches with similar goals. Do you know of any?

Paper

In general, the paper is nicely written 👍 There's a couple of typo's (s/tieing/tying/, s/hadrware/hardware/), so I'd recommend going through it with a LaTeX-aware spell checker.

The template does ask for a more explicit 'statement of need' though, so you should extract some of the introduction into such a section. As part of that, I would elaborate some more on the need for GPU power in the context of SDR/GNSS processing, as that's only mentioned in passing.

I would also suggest using \texttt for code-like names instead of quoting them as you currently do (e.g. an evaluation of the "cplx_multi" reduction).

Finally, there's some missing DOIs that the bot spotted in the other thread.

Now for some more detailed comments:

  1. Methodology

I'd mention the CUDA.jl (3.8) and Julia (1.7) version.

You mention 'this ensures a coalesced memory access ... discussed in more detail later', but I don't actually find more details there.

  1. Algorithm

wrt. device limits, I would mention that each dimension has its own limit, and then elaborate that blocks are special because there's a limit in x*y*z.

The grid limit is currently shown in the overview table, and isn't terribly interesting, especially because it's fairly stable: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications-technical-specifications-per-compute-capability

s/first-grade support/first-class support/

About the parallel reduction, it's unclear to me if you're describing CUDA's existing mapreducedim here or whether this is a new implementation or extension. I think it would be good to elaborate on that.

You also mention cooperative groups, but I don't see those used in the code? This because despite what your explanation suggests, they don't fully obviate a multi-pass reduction, as you need to be able to fit all thread blocks in memory when doing a grid-wide sync.

  1. Experiment Evaluation

Generally, see the comments on performance above. It would be good to put some of those clarifications in the paper.

@coezmaden
Copy link
Owner

Hi @maleadt! Thank you very much for your thorough and insightful review. I sincerely appreciate your effort and time invested to review my paper. At the moment I am on a vacation until the 2nd week of May, so please excuse the late response. I will get back later in May with the suggested improvements. Thanks again!

@zsoerenm
Copy link
Collaborator

zsoerenm commented May 19, 2023

@coezmaden I made quite a big restructure to Tracking.jl:
JuliaGNSS/Tracking.jl#43
With this pull request you are able to track multiple satellites coherently.
I adjusted the GPU implementation:
https://github.com/JuliaGNSS/Tracking.jl/blob/master/src/gpu_downconvert_and_correlate.jl
I have not optimized it to the last, but it may stand as a good baseline.
On my 10 year old computer the GPU version is about half as slow as the CPU version.
The implementation could in theory be parallelized over multiple number of satellites, that wasn't possible beforehand.
The kernel is quite simple:
https://github.com/JuliaGNSS/Tracking.jl/blob/master/src/gpu_downconvert_and_correlate.jl#L127
The kernel only downconverts and decodes the PRN.
The correlation (the sum over the samples) is done afterwards: https://github.com/JuliaGNSS/Tracking.jl/blob/master/src/gpu_downconvert_and_correlate.jl#L191

Have a look at this test, if you'd like to know how to use the implementation:
https://github.com/JuliaGNSS/Tracking.jl/blob/master/test/track.jl#L390

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants