Implementation of the C++ word-cloud with SYCL and Intel oneAPI
- Uses oneAPI and runs on Intel DevCloud
- Final iteration is on the v2 target
- `qsub -I -l nodes=1:gpu:ppn=2 -d .`
  - GPU node
- `qsub -I`
  - General interactive node
  - Serial code ran faster on the GPU node than on this one; the GPU device was slower for the parallel code than this one.
- Started with unified shared memory (USM): very easy to use, and the counting portion saw a significant speedup
- Moved to buffers and accessors, and performance got significantly slower

Interval | Time (seconds) |
---|---|
Tokenize file and hash | 0.011 |
Serial count tokens | 1.67 |
Write serial data from vector | 5.61 |
Parallel count tokens | 0.247 |
Write parallel results from malloc | 1.650 |
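
The notes do not show how tokens are hashed or how the intervals above were measured. As a minimal sketch of the serial stages, assuming whitespace tokenization, `std::hash<std::string>` as the hash, and `std::chrono::steady_clock` for timing (all three are assumptions, and `tokenize_and_hash`, `serial_count`, and `time_seconds` are hypothetical names), the pipeline could look like this:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: split text on whitespace and hash each token.
// std::hash is an assumption; the project's real hash function is not shown.
std::vector<std::size_t> tokenize_and_hash(const std::string& text) {
    std::vector<std::size_t> hashes;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        hashes.push_back(std::hash<std::string>{}(word));
    }
    return hashes;
}

// Hypothetical helper: serial count of how often each hash occurs.
std::unordered_map<std::size_t, std::size_t>
serial_count(const std::vector<std::size_t>& hashes) {
    std::unordered_map<std::size_t, std::size_t> counts;
    for (std::size_t h : hashes) {
        ++counts[h];
    }
    return counts;
}

// Hypothetical helper: run a callable once and return elapsed seconds.
template <typename F>
double time_seconds(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    assert(stop >= start);  // steady_clock is monotonic
    return std::chrono::duration<double>(stop - start).count();
}
```

Wrapping each stage in `time_seconds`, for example `time_seconds([&] { counts = serial_count(hashes); })`, gives one row of a table like the one above.
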
- Added windowed results for both the sequential and parallel versions
- Sub-buffers divide the input work by window range
- 2D buffer for the output, where the outer dimension is the window and the inner dimension is the count vector
- Had trouble taking a sub-buffer of the 2D buffer to get a single dimension
- Eliminated that sub-buffer and used a global accessor over the whole output buffer instead
- Reading through sub-buffers allows multiple queues to work on the buffer at the same time

Interval | Time (seconds) |
---|---|
Tokenize file | 0.009 |
Sequential windowed\* count | 15.821 |
Parallel windowed\* count | 0.245 |

\* WINDOW_SIZE = 10,000
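
The windowed layout described above can be sketched in plain C++. This is not the project's kernel: it assumes the inner "count vector" holds, for each slot of a window, how many times that slot's token occurs within the same window, and it stores the 2D result row-major in a flat vector. The name `windowed_count` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed layout: counts[w * window_size + j] is the number of times the
// token at slot j of window w occurs inside that same window.
std::vector<std::size_t> windowed_count(const std::vector<std::size_t>& hashes,
                                        std::size_t window_size) {
    assert(window_size > 0);
    const std::size_t n = hashes.size();
    // Ceiling division so a short final window still gets a row.
    const std::size_t num_windows = (n + window_size - 1) / window_size;
    std::vector<std::size_t> counts(num_windows * window_size, 0);

    for (std::size_t w = 0; w < num_windows; ++w) {
        // Half-open window [begin, end); the last window may be shorter.
        const std::size_t begin = w * window_size;
        const std::size_t end = std::min(begin + window_size, n);
        // Iterations over i are independent of each other, so this is the
        // loop a parallel version could map to one work-item per slot.
        for (std::size_t i = begin; i < end; ++i) {
            std::size_t c = 0;
            for (std::size_t k = begin; k < end; ++k) {
                if (hashes[k] == hashes[i]) ++c;
            }
            counts[w * window_size + (i - begin)] = c;
        }
    }
    return counts;
}
```

Unused slots in a short final window stay at zero, so every row has the same length regardless of how the input divides.
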
- Did it ever need sub-buffers, or could we have partitioned the range some other way (partitioners)?
- Could we have handled the input the same way as the output? Would either option limit the ability of simultaneous kernels to process each window?
- Is there a chance that we are doing any simultaneous processing in the current setup, or is the `q.submit` blocking the outer loop? It probably is, I would think.
- Why are there errors at certain window sizes pertaining to the memory base address alignment and sub-buffer creation?
  - Multiples of 10,000 seem to work well
  - Interestingly, the final section doesn't have issues and is an odd size
- Are the start and ranges inclusive? Are we missing any words when we partition the sub-buffer?
- Decode hashes
- CLI (file path, max kernels)
- Identify further optimizations, particularly for result output
- Working fine with large output arrays, with 4 windows at least
- Update the version 2 sequential code to remove the map creation, or add map creation to the parallel code
- Investigate profiling tools (VTune)
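
On the questions above about sub-buffer alignment and range boundaries: SYCL requires a sub-buffer's starting offset to be a multiple of the device's memory base address alignment (`info::device::mem_base_addr_align`, which is reported in bits), and a sub-buffer is described by a start offset plus a length, so consecutive windows behave as half-open ranges. The sketch below is plain C++ with no SYCL calls and only checks the arithmetic. `Window`, `partition`, and `offsets_aligned` are hypothetical names, and the 1024-bit alignment and 8-byte element size used in the note after the code are illustrative assumptions, not values taken from the project.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One window of the input: elements [offset, offset + length).
struct Window {
    std::size_t offset;
    std::size_t length;
};

// Split `total` elements into consecutive half-open windows. Every index is
// covered exactly once; the final window may be shorter than window_size.
std::vector<Window> partition(std::size_t total, std::size_t window_size) {
    assert(window_size > 0);
    std::vector<Window> windows;
    for (std::size_t begin = 0; begin < total; begin += window_size) {
        windows.push_back({begin, std::min(window_size, total - begin)});
    }
    return windows;
}

// Check that every window's byte offset is a multiple of the alignment.
// align_bits mirrors the unit of mem_base_addr_align; query the real value
// from the device rather than hard-coding it.
bool offsets_aligned(const std::vector<Window>& windows,
                     std::size_t element_bytes, std::size_t align_bits) {
    assert(align_bits >= 8 && align_bits % 8 == 0);
    const std::size_t align_bytes = align_bits / 8;
    for (const Window& w : windows) {
        if ((w.offset * element_bytes) % align_bytes != 0) return false;
    }
    return true;
}
```

Under the assumed values, `partition(25000, 10000)` yields offsets 0, 10000, and 20000, all of which pass `offsets_aligned(windows, 8, 1024)`, while a window size of 10,001 fails. Only the offset is checked, never the length, which would be consistent with an odd-sized final section not raising an error.
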