sycl-word-cloud

Implementation of the C++ word-cloud with SYCL and Intel oneAPI

Running the code

Built with Intel oneAPI and run on the Intel DevCloud.
The final iteration is the v2 target.

qsub Commands for different node types

qsub -I -l nodes=1:gpu:ppn=2 -d .
- GPU node
qsub -I
- general interactive node
- The serial code ran faster on the GPU node than on this node, but the parallel code was slower on the GPU device than on this node.

Results and Observations

Version One Notes (Target sycl-word-cloud)

Started with unified shared memory (USM), which was very easy to adopt and gave a significant speedup in the counting portion.
Moved to buffers and accessors, and performance got significantly slower.

Timing Results (CPU Node)

Original implementation with USM

Interval	Time (Seconds)
Tokenize file and hash	0.011
Serial count tokens	1.67
Write serial data from vector	5.61
Parallel count tokens	0.247
Write parallel results from malloc	1.650

Version Two Notes (Target v2)

Added windowed results for both the sequential and parallel versions.
Used sub-buffers to divide the input by window range.
Used a 2D buffer for the output, where the outer dimension is the window and the inner dimension is the count vector.
Had trouble taking a sub-buffer of the 2D buffer to get a single dimension.
- Eliminated that sub-buffer and used a global accessor instead.
- Reading through sub-buffers allows multiple queues to work on the buffer at the same time.
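The windowed layout described above can be sketched in plain C++, without SYCL, as a sequential reference. This is only an illustration under assumptions: the name count_windowed is hypothetical, tokens are assumed to already be reduced to ids in [0, vocab_size), and the 2D output is flattened so that each row is one window's count vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the repository): counts token ids per window.
// tokens:      token ids, assumed to lie in [0, vocab_size)
// window_size: number of tokens per window (the README uses 10,000)
// Returns a flattened 2D table with one row per window:
//   counts[w * vocab_size + id] = occurrences of id in window w
std::vector<std::uint32_t> count_windowed(const std::vector<std::size_t>& tokens,
                                          std::size_t window_size,
                                          std::size_t vocab_size) {
    assert(window_size > 0 && vocab_size > 0);
    // Round up so a shorter final window still gets its own row.
    const std::size_t num_windows =
        (tokens.size() + window_size - 1) / window_size;
    std::vector<std::uint32_t> counts(num_windows * vocab_size, 0);
    for (std::size_t w = 0; w < num_windows; ++w) {
        const std::size_t begin = w * window_size;
        const std::size_t end = std::min(begin + window_size, tokens.size());
        for (std::size_t i = begin; i < end; ++i) {
            // Ids outside the assumed range are skipped rather than counted.
            if (tokens[i] < vocab_size) ++counts[w * vocab_size + tokens[i]];
        }
    }
    return counts;
}
```

Each window writes only to its own row, which is the property that would let separate kernels work on separate windows without touching each other's output.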

Timing Results (CPU Node)

Interval	Time (Seconds)
Tokenize File	0.009
Sequential Windowed* Count	15.821
Parallel Windowed* Count	0.245
* WINDOW_SIZE = 10,000
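To put the timings in perspective, the ratio of sequential to parallel time can be computed directly. A minimal sketch; the function name speedup is just for illustration.

```cpp
#include <cassert>

// Speedup of a parallel run relative to a serial run of the same work.
double speedup(double serial_seconds, double parallel_seconds) {
    assert(parallel_seconds > 0.0);
    return serial_seconds / parallel_seconds;
}
```

With the numbers above, speedup(15.821, 0.245) is roughly 64.6 for the windowed count, compared with roughly 6.8 for speedup(1.67, 0.247) in version one. The version two sequential code also builds a map that the parallel code does not (see the ToDo list), so the two windowed timings are not measuring exactly the same work.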

Questions unanswered

Did it ever need sub-buffers, or could we have partitioned the range some other way? (Partitioners?)
Could we have handled the input the same way as the output? Would either option limit the ability of simultaneous kernels to process each window?
Is there a chance that we are doing any simultaneous processing in the current setup, or is q.submit blocking the outer loop? My guess is that it is blocking.
Why are there errors at certain window sizes pertaining to memory base address alignment and sub-buffer creation?
- Multiples of 10,000 seem to work well.
- Interestingly, the final section doesn't have issues even though it is an odd size.
Are the starts and ranges inclusive? Are we missing any words when we partition the sub-buffer?
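The alignment and inclusivity questions can be explored without a device. The SYCL specification requires a sub-buffer's starting offset to be a multiple of the device's memory base address alignment, which is reported in bits by info::device::mem_base_addr_align. Whether that is the cause of the errors seen here is an assumption, not something verified against this code. The sketch below uses a hypothetical helper, make_windows, with the element size and alignment passed in as assumed values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One window of the input, as a half-open element range [start, start + count).
struct Window {
    std::size_t start;  // index of the first element in the window
    std::size_t count;  // number of elements in the window
    bool aligned;       // true if the byte offset of `start` meets the alignment
};

// Hypothetical helper (not from the repository): splits `total` elements
// into windows of `window_size` and flags each window's offset alignment.
// elem_bytes: size of one element in bytes (assumed, e.g. 8 for a 64-bit hash)
// align_bits: alignment in bits, in the units SYCL uses for
//             info::device::mem_base_addr_align (the value is an assumption)
std::vector<Window> make_windows(std::size_t total, std::size_t window_size,
                                 std::size_t elem_bytes, std::size_t align_bits) {
    assert(window_size > 0 && elem_bytes > 0);
    assert(align_bits >= 8 && align_bits % 8 == 0);
    const std::size_t align_bytes = align_bits / 8;
    std::vector<Window> windows;
    for (std::size_t start = 0; start < total; start += window_size) {
        // The last window takes whatever is left, so no element is dropped.
        const std::size_t count = std::min(window_size, total - start);
        // Only the starting offset is checked; the window length is not.
        const bool aligned = (start * elem_bytes) % align_bytes == 0;
        windows.push_back({start, count, aligned});
    }
    return windows;
}
```

Under this model only the start of each window is constrained, which would be consistent with the odd-sized final section working. For example, with 8-byte elements and an assumed 1024-bit alignment, an offset of 10,000 elements is 80,000 bytes, which is 625 times 128. The SYCL sub-buffer constructor takes a base index and a range (an element count), so the windows are half-open and adjacent windows neither overlap nor leave gaps as long as each start equals the previous start plus the previous count.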

ToDo:

Functionality

Decode hashes
CLI (File path, max kernels)
Identify further optimizations, particularly for result output
- Working fine with large output arrays with at least 4 windows
Update the version 2 sequential code to remove the map creation, or add map creation to the parallel code

Data Analysis

Investigate Profiling tools (VTune)
