# University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking #

## Xiaomao Ding ##
* Tested on: Windows 8.1, i7-4700MQ @ 2.40GHz 8.00GB, GT 750M 2047MB (Personal Computer)

## Intro ##
The code in this repo is part of Project 1 for CIS 565, Fall 2016, at UPenn. For this project, I accelerated the [Reynolds boids algorithm](http://www.red3d.com/cwr/boids/) using NVIDIA CUDA kernels. There are three implementations: a brute-force method that compares each boid against every other boid, a uniform grid method that divides the space into cells to narrow the neighbor search, and a coherent uniform grid method that additionally reorders the position and velocity arrays to match the grid layout. See the gif below for an example of the algorithm in action! Each color represents a different flock of boids.

<div style="text-align:center"><img src ="https://github.com/xnieamo/Project1-CUDA-Flocking/blob/master/images/dt0.2_particles16000.gif" /></div>

Above is a gif generated using the code in this repo with 16000 boids using the coherent grid implementation.
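For readers unfamiliar with the brute-force baseline, here is a minimal sketch of what each simulation step does. This is an illustrative simplification, not the repo's exact code: the kernel signature and the single `cohesionWeight` rule are stand-ins, while the real implementation applies all three Reynolds rules (cohesion, separation, alignment) with separate distances and weights.

```cuda
// Illustrative brute-force velocity update: one thread per boid, each thread
// scans all N boids, giving O(N^2) comparisons per simulation step.
__global__ void kernUpdateVelocityBruteForce(int N, const float3 *pos,
                                             const float3 *vel, float3 *velNext,
                                             float neighborDist,
                                             float cohesionWeight) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N) return;

  float3 center = make_float3(0.0f, 0.0f, 0.0f);
  int neighbors = 0;
  for (int j = 0; j < N; j++) {  // compare against every other boid
    if (j == i) continue;
    float dx = pos[j].x - pos[i].x;
    float dy = pos[j].y - pos[i].y;
    float dz = pos[j].z - pos[i].z;
    if (sqrtf(dx * dx + dy * dy + dz * dz) < neighborDist) {
      center.x += pos[j].x;
      center.y += pos[j].y;
      center.z += pos[j].z;
      neighbors++;
    }
  }

  float3 v = vel[i];
  if (neighbors > 0) {  // steer toward the perceived center of mass (cohesion)
    v.x += cohesionWeight * (center.x / neighbors - pos[i].x);
    v.y += cohesionWeight * (center.y / neighbors - pos[i].y);
    v.z += cohesionWeight * (center.z / neighbors - pos[i].z);
  }
  velNext[i] = v;  // write to a second buffer so reads stay consistent
}
```

The grid-based methods replace the inner loop over all N boids with a loop over only the boids in neighboring grid cells.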

### Quick Note ###
Before running any of the code in this repo, you may have to adjust the compute capability flag in `src/CMakeLists.txt`. To do so, change `-arch=sm_30` to match your GPU's compute capability: `sm_20` corresponds to compute capability 2.0, `sm_30` to 3.0, and so on. The relevant line is shown at the bottom of this page.

## Performance Analysis ##

### Number of Boids ###
To analyze the performance of each implementation, we use the frame rate (FPS) with visualization disabled as the metric. The first thing we would like to know is how performance scales with the number of boids.

![FPSvNumBoidPlot](https://github.com/xnieamo/Project1-CUDA-Flocking/blob/master/images/PerformanceVBoidNum.png)

In the graph above, it is clear that the brute-force method performs the worst. This is because the number of comparisons for each boid increases linearly with the number of boids, so the total work grows quadratically. The scattered uniform grid performs much better, as it drastically reduces the number of comparisons needed for each boid. What is surprising is the further dramatic increase in performance for the coherent uniform grid! Even though the coherent and scattered grids make the same number of comparisons, the difference in performance between them is similar to that between the scattered grid and the brute-force method. The only change is that we remove the intermediate array used for grid indexing. This suggests that reading from global memory is a significant bottleneck in our GPU implementations.
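A minimal sketch of the difference between the two grid variants is below. The array names are illustrative rather than the repo's exact identifiers, but the access patterns match the description above.

```cuda
// Scattered grid: each neighbor lookup goes through an index array, so reads
// of position data jump around global memory and coalesce poorly.
__device__ float3 neighborPosScattered(int sortedIdx,
                                       const int *particleArrayIndices,
                                       const float3 *pos) {
  int boidIdx = particleArrayIndices[sortedIdx];  // extra indirection
  return pos[boidIdx];                            // scattered read
}

// Coherent grid: positions are reshuffled into cell order beforehand, so
// boids in the same cell sit in contiguous memory and reads coalesce.
__device__ float3 neighborPosCoherent(int sortedIdx, const float3 *posSorted) {
  return posSorted[sortedIdx];                    // direct, contiguous read
}
```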

### Block Size and Count ###
We might also be interested in how each implementation performs for varying block sizes on the GPU. Below we see that performance is roughly equivalent across block sizes. This makes sense: increasing the block size (and thereby decreasing the block count) does not change the total number of threads launched, and each boid's update is an independent calculation. These graphs were generated using the coherent grid with 16,000 boids.

![FPSvBlockSize](https://github.com/xnieamo/Project1-CUDA-Flocking/blob/master/images/PerformanceVBlockSize.png)
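A sketch of the launch configuration explains why, assuming the usual CUDA idiom (variable names are illustrative, and `kernUpdateVelocityBruteForce` is the hypothetical kernel sketched earlier):

```cuda
// The block count is derived from the boid count, so the total number of
// threads launched (numBoids, rounded up) is the same for any block size.
int blockSize = 128;  // the parameter varied in the graph above
int blockCount = (numBoids + blockSize - 1) / blockSize;  // ceiling division
kernUpdateVelocityBruteForce<<<blockCount, blockSize>>>(
    numBoids, dev_pos, dev_vel, dev_velNext, neighborDist, cohesionWeight);
```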

### dT ###
Somewhat surprisingly, changing the time step parameter, dT, also affects performance. As dT increases, performance increases drastically, as shown in the graph below. This is possibly related to the fact that at high dT, all the boids quickly merge into one giant flock. This graph was generated using the coherent grid with 16,000 boids.

![FPSvdT](https://github.com/xnieamo/Project1-CUDA-Flocking/blob/master/images/PerformanceVdt.png)

![FastFlock](https://github.com/xnieamo/Project1-CUDA-Flocking/blob/master/images/dt1.6_particles16000.gif)
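For context, dT enters the simulation only through the position integration step. A minimal sketch, assuming simple forward-Euler integration (the kernel name and signature are stand-ins):

```cuda
// dT scales how far each boid moves per frame; at large dT the boids make
// bigger jumps and quickly collapse into a single dense flock.
__global__ void kernUpdatePos(int N, float dt, float3 *pos, const float3 *vel) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N) return;
  pos[i].x += vel[i].x * dt;  // forward-Euler position update
  pos[i].y += vel[i].y * dt;
  pos[i].z += vel[i].z * dt;
}
```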



The compute capability flag in `src/CMakeLists.txt` was changed from `-arch=sm_20` to `-arch=sm_30`:

```cmake
cuda_add_library(src
    ${SOURCE_FILES}
    OPTIONS -arch=sm_30
    )
```