diff --git a/PERFORMANCE ANALYSIS.md b/PERFORMANCE ANALYSIS.md
new file mode 100644
index 0000000..4ad6605
--- /dev/null
+++ b/PERFORMANCE ANALYSIS.md
@@ -0,0 +1,14 @@
+# Performance Analysis
+In the following section, we analyze the performance of our three approaches and their kernels. We use frame rate and timers around the velocity update to show how the three approaches differ; velocity update timings are averaged over 1000 iterations. The velocity calculation is highlighted specifically because that is where the algorithmic differences lie. The other kernels in the Coherent and Scattered grid approaches are trivial and embarrassingly parallel, so they provide no interesting insight.
+
+##### Naive Implementation
+From an algorithmic viewpoint, each boid must check the other N-1 boids, so increasing the number of boids increases the computation time by an N^2 factor. From a low-level perspective, the number of boids that can be computed in parallel is limited by the number of threads the hardware provides (more boids than threads means the updates must be computed in multiple passes). Since each boid accesses the data of every other boid, I would assume the locality of the data is poor, leading to access-time issues because reads range over the entire buffer. Both the algorithmic and low-level issues are reflected in the data as expected: the velocity calculation takes up the majority of the time per boid and grows at the same rate that the frame rate drops.
+
+##### Scattered with uniform grid
+The algorithmic speedup is a constant factor (each boid only inspects boids in neighboring grid cells), but the growth is still polynomial in the number of boids. At a low level, the number of boid look-ups is significantly reduced, but the locality of the data is still poor. In addition, data access is quite "jumpy" and would result in a lot of fetching (I'm assuming there is some form of paging on the GPU) once the number of boids increases. It is interesting to note that from 5k to 10k boids the frame rate increases (I have no idea why this happens) even though the time to calculate velocity updates also increases. The performance drop past that point is expected, showing the toll that unstructured data takes. Beyond 80k boids we see a sharp drop in performance; I'm assuming some low-level swapping occurs at that point. Taken together, the denser the grid cells are, the more boids we have to inspect, which further emphasizes the importance of spatial locality.
+
+##### Coherent with uniform grid
+The algorithmic speedup is the same constant factor, but at a low level the boid look-up is significantly reduced and the locality of the data is much better than in the scattered uniform grid approach. The Coherent grid approach shows roughly linear growth in the velocity update timings since it is not affected by these locality issues. However, increasing the number of boids beyond a certain point might still cause buffer issues, as a block might need to fetch vector buffers that exceed its memory capacity; this problem is shared with the Scattered grid approach. The performance improvement was expected, since the structured placement of data means less fetching, but I was surprised by its magnitude: velocity update timings went from growing polynomially to growing roughly linearly (wow).
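+
+For reference, the velocity update timings above were gathered with CUDA events placed around the velocity kernel and averaged over many frames. Below is a condensed sketch of that measurement (mirroring the `perfAnal` macro and the `cudaEventRecord` calls in `src/kernel.cu`; variable and kernel names are taken from there, and the exact reporting format differs slightly):
+
+```cpp
+// One-time setup
+cudaEvent_t cudaEventStart, cudaEventStop;
+cudaEventCreate(&cudaEventStart);
+cudaEventCreate(&cudaEventStop);
+float totalEventTime = 0.0f;
+int eventCount = 0;
+
+// Every frame: bracket only the velocity update kernel
+cudaEventRecord(cudaEventStart);
+kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid, blockSize>>>(/* ... */);
+cudaEventRecord(cudaEventStop);
+cudaEventSynchronize(cudaEventStop);
+
+float elapsed = 0.0f;
+cudaEventElapsedTime(&elapsed, cudaEventStart, cudaEventStop);  // milliseconds
+totalEventTime += elapsed;
+eventCount++;
+if (eventCount == 1000) {  // report the 1000-frame average, then reset
+  printf("avg velocity update: %f ms\n", totalEventTime / eventCount);
+  totalEventTime = 0.0f;
+  eventCount = 0;
+}
+```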
+
+##### On the topic of block count and block sizes
+Changing the block count and block size did not noticeably affect performance. This is likely because the problem we are solving is embarrassingly parallel: regardless of block size, each thread still has to perform the same velocity update, position update, and so on, and no task has to wait on another, so no stalls are introduced (yay, no stalls!). Given that, it is expected that changing the block size or count does not affect performance.
\ No newline at end of file
diff --git a/README.md b/README.md
index 98dd9a8..eedbdc1 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,26 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking**
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* David Liao
+* Tested on: Windows 7, Xeon E5-1630 v4 @ 3.70GHz 32GB, GTX 1070 4096MB (SigLab MOR 103-57)
-### (TODO: Your README)
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+# Flocking Simulation
+![boids](/images/5000boids.gif)
+
+# Performance Analysis
+In the following section, we analyze the performance of our three approaches and their kernels. We use frame rate and timers around the velocity update to show how the three approaches differ; velocity update timings are averaged over 1000 iterations. The velocity calculation is highlighted specifically because that is where the algorithmic differences lie. The other kernels in the Coherent and Scattered grid approaches are trivial and embarrassingly parallel, so they provide no interesting insight.
+
+![Table and Charts of Data](/images/tables.png)
+
+##### Naive Implementation
+From an algorithmic viewpoint, each boid must check the other N-1 boids, so increasing the number of boids increases the computation time by an N^2 factor. From a low-level perspective, the number of boids that can be computed in parallel is limited by the number of threads the hardware provides (more boids than threads means the updates must be computed in multiple passes). Since each boid accesses the data of every other boid, the locality of the data is poor, leading to access-time issues because reads range over the entire buffer. Both the algorithmic and low-level issues are reflected in the data as expected: the velocity calculation takes up the majority of the time per boid and grows at the same rate that the frame rate drops.
+
+##### Scattered with uniform grid
+The algorithmic speedup is a constant factor (each boid only inspects boids in neighboring grid cells), but the growth is still polynomial in the number of boids. At a low level, the number of boid look-ups is significantly reduced, but the locality of the data is still poor. In addition, data access is quite "jumpy" and would result in a lot of fetching (if there is some form of paging on the GPU) once the number of boids increases. It is interesting to note that from 5k to 10k boids the frame rate increases even though the time to calculate velocity updates also increases. The performance drop past that point is expected, showing the toll that unstructured data takes. Beyond 80k boids we see a sharp drop in performance; some low-level swapping presumably occurs at that point. Taken together, the denser the grid cells are, the more boids we have to inspect, which further emphasizes the importance of spatial locality.
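+
+The locality difference comes down to one extra level of indirection in the inner loop. Below is a simplified sketch contrasting the two neighbor-search kernels (the full versions are `kernUpdateVelNeighborSearchScattered` and `kernUpdateVelNeighborSearchCoherent` in `src/kernel.cu`; local variable names are simplified and the flocking-rule math is omitted). The coherent approach described next removes the indirection by reshuffling the boid data into grid-cell order first.
+
+```cpp
+// Scattered grid: boid data stays in its original order, so each neighbor
+// access goes through particleArrayIndices and jumps around in memory.
+for (int i = cellStart; i <= cellEnd; i++) {
+  int boidIndex = particleArrayIndices[i];  // extra indirection
+  glm::vec3 neighborPos = pos[boidIndex];   // scattered read
+  // ... apply the three flocking rules ...
+}
+
+// Coherent grid: pos/vel were reshuffled into cell order beforehand,
+// so boids in the same cell sit next to each other in memory.
+for (int i = cellStart; i <= cellEnd; i++) {
+  glm::vec3 neighborPos = pos[i];           // contiguous read
+  // ... apply the three flocking rules ...
+}
+```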
+
+##### Coherent with uniform grid
+The algorithmic speedup is the same constant factor, but at a low level the boid look-up is significantly reduced and the locality of the data is much better than in the scattered uniform grid approach. The Coherent grid approach shows roughly linear growth in the velocity update timings since it is not affected by these locality issues. However, increasing the number of boids beyond a certain point might still cause buffer issues, as a block might need to fetch vector buffers that exceed its memory capacity; this problem is shared with the Scattered grid approach. The performance improvement was expected, since the structured placement of data means less fetching, but I was surprised by its magnitude: velocity update timings went from growing polynomially to growing roughly linearly (wow).
+
+##### On the topic of block count and block sizes
+Changing the block count and block size did not noticeably affect performance. This is likely because the problem we are solving is embarrassingly parallel: regardless of block size, each thread still has to perform the same velocity update, position update, and so on, and no task has to wait on another, so no stalls are introduced (yay, no stalls!). Given that, it is expected that changing the block size or count does not affect performance.
diff --git a/images/5000boids.gif b/images/5000boids.gif
new file mode 100644
index 0000000..74dec9c
Binary files /dev/null and b/images/5000boids.gif differ
diff --git a/images/tables.png b/images/tables.png
new file mode 100644
index 0000000..6543296
Binary files /dev/null and b/images/tables.png differ
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index fdd636d..dff0113 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_50
     )
diff --git a/src/kernel.cu b/src/kernel.cu
index aaf0fbf..abbb989 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -6,7 +6,6 @@
 #include "utilityCore.hpp"
 #include "kernel.h"
-// LOOK-2.1 potentially useful for doing grid-based neighbor search
 #ifndef imax
 #define imax( a, b ) ( ((a) > (b)) ? (a) : (b) )
 #endif
@@ -30,7 +29,18 @@ void checkCUDAError(const char *msg, int line = -1) {
     exit(EXIT_FAILURE);
   }
 }
-
+/***********************
+* Performance Analysis *
+************************/
+#define perfAnal float elapsed = 0;\
+cudaEventElapsedTime(&elapsed, cudaEventStart, cudaEventStop);\
+totalEventTime += elapsed;\
+eventCount++;\
+if (!(skip++ % 1000)) {\
+  printf("%f %f\n", totalEventTime / eventCount, elapsed);\
+  totalEventTime = 0;\
+  eventCount = 0;\
+}\
 
 /*****************
 * Configuration *
 ******************/
@@ -39,8 +49,6 @@ void checkCUDAError(const char *msg, int line = -1) {
 /*! Block size used for CUDA kernel launch. */
 #define blockSize 128
-// LOOK-1.2 Parameters for the boids algorithm.
-// These worked well in our reference implementation.
 #define rule1Distance 5.0f
 #define rule2Distance 3.0f
 #define rule3Distance 5.0f
@@ -61,39 +69,34 @@ void checkCUDAError(const char *msg, int line = -1) {
 int numObjects;
 dim3 threadsPerBlock(blockSize);
-// LOOK-1.2 - These buffers are here to hold all your boid information.
-// These get allocated for you in Boids::initSimulation.
-// Consider why you would need two velocity buffers in a simulation where each -// boid cares about its neighbors' velocities. -// These are called ping-pong buffers. + glm::vec3 *dev_pos; glm::vec3 *dev_vel1; glm::vec3 *dev_vel2; -// LOOK-2.1 - these are NOT allocated for you. You'll have to set up the thrust -// pointers on your own too. -// For efficient sorting and the uniform grid. These should always be parallel. -int *dev_particleArrayIndices; // What index in dev_pos and dev_velX represents this particle? -int *dev_particleGridIndices; // What grid cell is this particle in? -// needed for use with thrust +int *dev_particleArrayIndices; +int *dev_particleGridIndices; thrust::device_ptr dev_thrust_particleArrayIndices; thrust::device_ptr dev_thrust_particleGridIndices; -int *dev_gridCellStartIndices; // What part of dev_particleArrayIndices belongs -int *dev_gridCellEndIndices; // to this cell? +int *dev_gridCellStartIndices; +int *dev_gridCellEndIndices; -// TODO-2.3 - consider what additional buffers you might need to reshuffle -// the position and velocity data to be coherent within cells. +glm::vec3 *dev_shuffledPos; -// LOOK-2.1 - Grid parameters based on simulation parameters. -// These are automatically computed for you in Boids::initSimulation int gridCellCount; int gridSideCount; float gridCellWidth; float gridInverseCellWidth; glm::vec3 gridMinimum; +// Performance Analysis +cudaEvent_t cudaEventStart, cudaEventStop; +float totalEventTime = 0; +int eventCount = 0; +int skip = 0; + /****************** * initSimulation * ******************/ @@ -108,10 +111,7 @@ __host__ __device__ unsigned int hash(unsigned int a) { return a; } -/** -* LOOK-1.2 - this is a typical helper function for a CUDA kernel. -* Function for generating a random vec3. -*/ + __host__ __device__ glm::vec3 generateRandomVec3(float time, int index) { thrust::default_random_engine rng(hash((int)(index * time))); thrust::uniform_real_distribution unitDistrib(-1, 1); @@ -119,10 +119,6 @@ __host__ __device__ glm::vec3 generateRandomVec3(float time, int index) { return glm::vec3((float)unitDistrib(rng), (float)unitDistrib(rng), (float)unitDistrib(rng)); } -/** -* LOOK-1.2 - This is a basic CUDA kernel. -* CUDA kernel for generating boids with a specified mass randomly around the star. -*/ __global__ void kernGenerateRandomPosArray(int time, int N, glm::vec3 * arr, float scale) { int index = (blockIdx.x * blockDim.x) + threadIdx.x; if (index < N) { @@ -140,8 +136,7 @@ void Boids::initSimulation(int N) { numObjects = N; dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize); - // LOOK-1.2 - This is basic CUDA memory management and error checking. - // Don't forget to cudaFree in Boids::endSimulation. + cudaMalloc((void**)&dev_pos, N * sizeof(glm::vec3)); checkCUDAErrorWithLine("cudaMalloc dev_pos failed!"); @@ -151,13 +146,14 @@ void Boids::initSimulation(int N) { cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3)); checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!"); - // LOOK-1.2 - This is a typical CUDA kernel invocation. 
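+  // dev_shuffledPos is the extra buffer for the coherent-grid path: each step,
+  // kernShuffleBuffer fills it with positions reordered into grid-cell order so
+  // the neighbor search reads contiguous memory.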
+  cudaMalloc((void**)&dev_shuffledPos, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_shuffledPos failed!");
+
  kernGenerateRandomPosArray<<<fullBlocksPerGrid, blockSize>>>(1, numObjects, dev_pos, scene_scale);
  checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!");
-  // LOOK-2.1 computing grid params
-  gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+  gridCellWidth = 1.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
  int halfSideCount = (int)(scene_scale / gridCellWidth) + 1;
  gridSideCount = 2 * halfSideCount;
@@ -168,7 +164,21 @@ void Boids::initSimulation(int N) {
  gridMinimum.y -= halfGridWidth;
  gridMinimum.z -= halfGridWidth;
-  // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+  cudaMalloc((void**)&dev_particleArrayIndices, sizeof(int) * numObjects);
+  checkCUDAErrorWithLine("cudaMalloc failed!");
+  cudaMalloc((void**)&dev_particleGridIndices, sizeof(int) * numObjects);
+  checkCUDAErrorWithLine("cudaMalloc failed!");
+  cudaMalloc((void**)&dev_gridCellStartIndices, sizeof(int) * gridCellCount);
+  checkCUDAErrorWithLine("cudaMalloc failed!");
+  cudaMalloc((void**)&dev_gridCellEndIndices, sizeof(int) * gridCellCount);
+  checkCUDAErrorWithLine("cudaMalloc failed!");
+
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+
+  cudaEventCreate(&cudaEventStart);
+  cudaEventCreate(&cudaEventStop);
+
  cudaThreadSynchronize();
}
@@ -223,105 +233,181 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
* stepSimulation *
******************/
-/**
-* LOOK-1.2 You can use this as a helper for kernUpdateVelocityBruteForce.
-* __device__ code can be called from a __global__ context
-* Compute the new velocity on the body with index `iSelf` due to the `N` boids
-* in the `pos` and `vel` arrays.
-*/
__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
-  // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
-  // Rule 2: boids try to stay a distance d away from each other
-  // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+
+  glm::vec3 thisPos = pos[iSelf];
+  glm::vec3 newVel = vel[iSelf];
+  glm::vec3 center, separate, cohesion;
+  int neighborCount = 0;
+  for (int i = 0; i < N; i++) {
+    if (i == iSelf) {
+      continue;
+    }
+
+    float distance = glm::distance(thisPos, pos[i]);
+    if (distance < rule1Distance) {
+      center += pos[i];
+      neighborCount++;
+    }
+    if (distance < rule2Distance) {
+      separate -= pos[i] - thisPos;
+    }
+    if (distance < rule3Distance) {
+      cohesion += vel[i];
+    }
+  }
+  if (neighborCount > 0) {
+    center /= neighborCount;
+    newVel += (center - thisPos) * rule1Scale;
+    newVel += cohesion * rule3Scale;
+  }
+  newVel += separate * rule2Scale;
+  if (glm::length(newVel) > maxSpeed) {
+    newVel = glm::normalize(newVel) * maxSpeed;
+  }
+  return newVel;
}
-/**
-* TODO-1.2 implement basic flocking
-* For each of the `N` bodies, update its position based on its current velocity.
-*/
__global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // Compute a new velocity based on pos and vel1
-  // Clamp the speed
-  // Record the new velocity into vel2. Question: why NOT vel1?
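+  // The result goes into vel2 rather than vel1: other threads in this launch may
+  // still be reading vel1, so the two buffers are ping-ponged (swapped) in
+  // stepSimulationNaive after the kernel completes.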
+ int index = threadIdx.x + (blockIdx.x * blockDim.x); + if (index >= N) { + return; + } + vel2[index] = computeVelocityChange(N, index, pos, vel1); } -/** -* LOOK-1.2 Since this is pretty trivial, we implemented it for you. -* For each of the `N` bodies, update its position based on its current velocity. -*/ + __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) { // Update position by velocity int index = threadIdx.x + (blockIdx.x * blockDim.x); if (index >= N) { return; } - glm::vec3 thisPos = pos[index]; - thisPos += vel[index] * dt; +glm::vec3 thisPos = pos[index]; +thisPos += vel[index] * dt; - // Wrap the boids around so we don't lose them - thisPos.x = thisPos.x < -scene_scale ? scene_scale : thisPos.x; - thisPos.y = thisPos.y < -scene_scale ? scene_scale : thisPos.y; - thisPos.z = thisPos.z < -scene_scale ? scene_scale : thisPos.z; +// Wrap the boids around so we don't lose them +thisPos.x = thisPos.x < -scene_scale ? scene_scale : thisPos.x; +thisPos.y = thisPos.y < -scene_scale ? scene_scale : thisPos.y; +thisPos.z = thisPos.z < -scene_scale ? scene_scale : thisPos.z; - thisPos.x = thisPos.x > scene_scale ? -scene_scale : thisPos.x; - thisPos.y = thisPos.y > scene_scale ? -scene_scale : thisPos.y; - thisPos.z = thisPos.z > scene_scale ? -scene_scale : thisPos.z; +thisPos.x = thisPos.x > scene_scale ? -scene_scale : thisPos.x; +thisPos.y = thisPos.y > scene_scale ? -scene_scale : thisPos.y; +thisPos.z = thisPos.z > scene_scale ? -scene_scale : thisPos.z; - pos[index] = thisPos; +pos[index] = thisPos; } -// LOOK-2.1 Consider this method of computing a 1D index from a 3D grid index. -// LOOK-2.3 Looking at this method, what would be the most memory efficient -// order for iterating over neighboring grid cells? -// for(x) -// for(y) -// for(z)? Or some other order? __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) { - return x + y * gridResolution + z * gridResolution * gridResolution; + return x + y * gridResolution + z * gridResolution * gridResolution; } __global__ void kernComputeIndices(int N, int gridResolution, - glm::vec3 gridMin, float inverseCellWidth, - glm::vec3 *pos, int *indices, int *gridIndices) { - // TODO-2.1 - // - Label each boid with the index of its grid cell. - // - Set up a parallel array of integer indices as pointers to the actual - // boid data in pos and vel1/vel2 + glm::vec3 gridMin, float inverseCellWidth, + glm::vec3 *pos, int *indices, int *gridIndices) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + glm::vec3 grid = (pos[index] - gridMin) * inverseCellWidth; + indices[index] = index; + gridIndices[index] = gridIndex3Dto1D(grid.x, grid.y, grid.z, gridResolution); + } -// LOOK-2.1 Consider how this could be useful for indicating that a cell -// does not enclose any boids __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) { - int index = (blockIdx.x * blockDim.x) + threadIdx.x; - if (index < N) { - intBuffer[index] = value; - } + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index < N) { + intBuffer[index] = value; + } } __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices, - int *gridCellStartIndices, int *gridCellEndIndices) { - // TODO-2.1 - // Identify the start point of each cell in the gridIndices array. - // This is basically a parallel unrolling of a loop that goes - // "this index doesn't match the one before it, must be a new cell!" 
+ int *gridCellStartIndices, int *gridCellEndIndices) { + int index = (blockIdx.x * blockDim.x) + threadIdx.x; + if (index >= N) { + return; + } + int curr = particleGridIndices[index]; + if (index == 0) { // first cell + gridCellStartIndices[curr] = index; + return; + } + int prev = particleGridIndices[index - 1]; + if (curr != prev) { // cell change + gridCellStartIndices[curr] = index; + gridCellEndIndices[prev] = index - 1; + } + if (index == N - 1) { // final cell + gridCellEndIndices[curr] = index; + } } __global__ void kernUpdateVelNeighborSearchScattered( - int N, int gridResolution, glm::vec3 gridMin, - float inverseCellWidth, float cellWidth, - int *gridCellStartIndices, int *gridCellEndIndices, - int *particleArrayIndices, - glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { - // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce - // the number of boids that need to be checked. - // - Identify the grid cell that this particle is in - // - Identify which cells may contain neighbors. This isn't always 8. - // - For each cell, read the start/end indices in the boid pointer array. - // - Access each boid in the cell and compute velocity change from - // the boids rules, if this boid is within the neighborhood distance. - // - Clamp the speed change before putting the new speed in vel2 + int N, int gridResolution, glm::vec3 gridMin, + float inverseCellWidth, float cellWidth, + int *gridCellStartIndices, int *gridCellEndIndices, + int *particleArrayIndices, + glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { + int index = threadIdx.x + (blockIdx.x * blockDim.x); + int boidIndex, start, end, neighborCount = 0; + glm::vec3 center, separate, cohesion; + if (index >= N) { + return; + } + glm::vec3 thisPos = pos[index]; + glm::ivec3 gridPos = (pos[index] - gridMin) * inverseCellWidth; + glm::vec3 newVel = vel1[index]; + //glm::ivec3 neighbor( + // (thisPos.x > (gridPos.x + 0.5f) * cellWidth) ? -1 : 0, + // (thisPos.y > (gridPos.y + 0.5f) * cellWidth) ? -1 : 0, + // (thisPos.z > (gridPos.z + 0.5f) * cellWidth) ? 
-1 : 0); + + glm::ivec3 neighbor; + + for (int z = -1; z <= 1; z++) { + for (int y = -1; y <= 1; y++) { + for (int x = -1; x <= 1; x++) { + glm::ivec3 offset(x, y, z); + neighbor = gridPos + offset; + if (neighbor.x < 0 || neighbor.y < 0 || neighbor.z < 0 + || neighbor.x >= gridResolution || neighbor.y >= gridResolution || neighbor.z >= gridResolution) { + continue; + } + int neighborNumber = gridIndex3Dto1D(neighbor.x, neighbor.y, neighbor.z, gridResolution); + start = gridCellStartIndices[neighborNumber]; + end = gridCellEndIndices[neighborNumber]; + for (int i = start; i <= end; i++) { + boidIndex = particleArrayIndices[i]; + + float distance = glm::distance(thisPos, pos[boidIndex]); + if (distance < rule1Distance) { + center += pos[boidIndex]; + neighborCount++; + } + if (distance < rule2Distance) { + separate -= pos[boidIndex] - thisPos; + } + if (distance < rule3Distance) { + cohesion += vel1[boidIndex]; + } + } + } + } + } + if (neighborCount > 0) { + center /= neighborCount; + newVel += (center - thisPos) * rule1Scale; + newVel += cohesion * rule3Scale; + } + newVel += separate * rule2Scale; + if (glm::length(newVel) > maxSpeed) { + newVel = glm::normalize(newVel) * maxSpeed; + } + //newVel = glm::vec3(1, 1, 1); + vel2[index] = newVel; + } __global__ void kernUpdateVelNeighborSearchCoherent( @@ -329,59 +415,134 @@ __global__ void kernUpdateVelNeighborSearchCoherent( float inverseCellWidth, float cellWidth, int *gridCellStartIndices, int *gridCellEndIndices, glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) { - // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered, - // except with one less level of indirection. - // This should expect gridCellStartIndices and gridCellEndIndices to refer - // directly to pos and vel1. - // - Identify the grid cell that this particle is in - // - Identify which cells may contain neighbors. This isn't always 8. - // - For each cell, read the start/end indices in the boid pointer array. - // DIFFERENCE: For best results, consider what order the cells should be - // checked in to maximize the memory benefits of reordering the boids data. - // - Access each boid in the cell and compute velocity change from - // the boids rules, if this boid is within the neighborhood distance. 
-  // - Clamp the speed change before putting the new speed in vel2
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  int start, end, neighborCount = 0;
+  glm::vec3 center, separate, cohesion;
+  if (index >= N) {
+    return;
+  }
+  glm::vec3 thisPos = pos[index];
+  glm::ivec3 gridPos = (pos[index] - gridMin) * inverseCellWidth;
+  glm::vec3 newVel = vel1[index];
+  glm::ivec3 neighbor;
+
+
+  for (int z = -1; z <= 1; z++) {
+    for (int y = -1; y <= 1; y++) {
+      for (int x = -1; x <= 1; x++) {
+        glm::ivec3 offset(x, y, z);
+        neighbor = gridPos + offset;
+        if (neighbor.x < 0 || neighbor.y < 0 || neighbor.z < 0
+          || neighbor.x >= gridResolution || neighbor.y >= gridResolution || neighbor.z >= gridResolution) {
+          continue;
+        }
+        int neighborNumber = gridIndex3Dto1D(neighbor.x, neighbor.y, neighbor.z, gridResolution);
+        start = gridCellStartIndices[neighborNumber];
+        end = gridCellEndIndices[neighborNumber];
+        for (int i = start; i <= end; i++) {
+          float distance = glm::distance(thisPos, pos[i]);
+          if (distance < rule1Distance) {
+            center += pos[i];
+            neighborCount++;
+          }
+          if (distance < rule2Distance) {
+            separate -= pos[i] - thisPos;
+          }
+          if (distance < rule3Distance) {
+            cohesion += vel1[i];
+          }
+        }
+      }
+    }
+  }
+  if (neighborCount > 0) {
+    center /= neighborCount;
+    newVel += (center - thisPos) * rule1Scale;
+    newVel += cohesion * rule3Scale;
+  }
+  newVel += separate * rule2Scale;
+  if (glm::length(newVel) > maxSpeed) {
+    newVel = glm::normalize(newVel) * maxSpeed;
+  }
+  //newVel = glm::vec3(1, 1, 1);
+  vel2[index] = newVel;
}

/**
* Step the entire N-body simulation by `dt` seconds.
*/
void Boids::stepSimulationNaive(float dt) {
-  // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
-  // TODO-1.2 ping-pong the velocity buffers
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  //cudaEventRecord(cudaEventStart);
+  kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+  //cudaEventRecord(cudaEventStop);
+  //cudaEventSynchronize(cudaEventStop);
+
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+
+  glm::vec3 *tmp = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = tmp;
+
+  //perfAnal
+
}

void Boids::stepSimulationScatteredGrid(float dt) {
-  // TODO-2.1
-  // Uniform Grid Neighbor search using Thrust sort.
-  // In Parallel:
-  // - label each particle with its array index as well as its grid index.
-  //   Use 2x width grids.
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed
+
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth,
+    dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+  cudaEventRecord(cudaEventStart);
+  kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth,
+    gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos, dev_vel1, dev_vel2);
+  cudaEventRecord(cudaEventStop);
+  cudaEventSynchronize(cudaEventStop);
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+
+  glm::vec3 *tmp = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = tmp;
+
+  //perfAnal
+}
+
+__global__ void kernShuffleBuffer(int N, int *ref, glm::vec3 *orig, glm::vec3 *dest) {
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) {
+    return;
+  }
+  dest[index] = orig[ref[index]];
+}

void Boids::stepSimulationCoherentGrid(float dt) {
-  // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
-  // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
-  // In Parallel:
-  // - Label each particle with its array index as well as its grid index.
-  //   Use 2x width grids
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
-  //   the particle data in the simulation array.
-  //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
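+  // Coherent path: label each boid with its grid cell, sort by cell, find each
+  // cell's start/end range, then reshuffle pos and vel1 into cell order
+  // (dev_shuffledPos / dev_vel2) so the velocity kernel reads contiguous data.
+  // dev_pos and dev_shuffledPos are swapped at the end instead of copying back.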
+
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth,
+    dev_pos, dev_particleArrayIndices, dev_particleGridIndices);
+
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices, dev_gridCellStartIndices, dev_gridCellEndIndices);
+
+  kernShuffleBuffer<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleArrayIndices, dev_pos, dev_shuffledPos);
+  kernShuffleBuffer<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleArrayIndices, dev_vel1, dev_vel2);
+  cudaEventRecord(cudaEventStart);
+  kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum, gridInverseCellWidth,
+    gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_shuffledPos, dev_vel2, dev_vel1);
+  cudaEventRecord(cudaEventStop);
+  cudaEventSynchronize(cudaEventStop);
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_shuffledPos, dev_vel1);
+
+  glm::vec3 *tmp = dev_pos;
+  dev_pos = dev_shuffledPos;
+  dev_shuffledPos = tmp;
+
+  perfAnal
}

@@ -389,12 +550,15 @@ void Boids::endSimulation() {
  cudaFree(dev_vel2);
  cudaFree(dev_pos);
-  // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+
+  cudaFree(dev_shuffledPos);
}

void Boids::unitTest() {
-  // LOOK-1.2 Feel free to write additional tests here.
-  // test unstable sort
  int *dev_intKeys;
  int *dev_intValues;
diff --git a/src/main.cpp b/src/main.cpp
index e416836..075ff1b 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,8 +14,8 @@
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
 // LOOK-1.2 - change this to adjust particle count in the simulation
 const int N_FOR_VIS = 5000;
@@ -220,7 +220,7 @@ void initShaders(GLuint * program) {
   double timebase = 0;
   int frame = 0;
-  Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure
+  //Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure
                      // your CUDA development setup is ready to go.
   while (!glfwWindowShouldClose(window)) {