CRPL students write about their week-long GPU hackathon experience at Brookhaven National Laboratory, 2018

November 7, 2018

Four students from CRPL@UDEL, Mauricio Ferrato (Ph.D. student), Eric Wright (Ph.D. student), Robert Searles (Ph.D. student), and Kyle Friedline (undergraduate student, BSCS), attended the 2018 Brookathon, the Brookhaven National Laboratory GPU hackathon held at BNL on September 17–21, 2018. Each of them served as one of the two mentors for one of the four hackathon teams.

They have also served as mentors at the Boulder, Oak Ridge National Laboratory (ORNL), Pawsey Supercomputing Center, and NASA hackathons.

What do they have to say?

Team: Lagrangian Particle Technologies (Mauricio Ferrato)

The Lagrangian Particles team came with an application that already had an OpenMP implementation and showed good potential to run on GPUs. The group had already profiled the application with NVIDIA's nvprof and their own timers, and had identified four different hotspots that could be improved by moving them to GPUs. The team was attracted to the idea of using OpenACC, a high-level, directive-based parallel programming model, because they were new to GPU programming and OpenACC offered them a gentler learning curve. On the first day we worked on getting the application to compile using an OpenACC compiler, PGI. We accomplished this step successfully, and during the rest of the week the team separated into smaller groups, each assigned to tackle a different part of the problem.
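
For readers new to OpenACC, here is a minimal, hypothetical sketch (not the team's code) of what the directive-based approach looks like: a simple particle-update loop offloaded with a single pragma, assuming a PGI-style build with managed memory (e.g., pgc++ -acc -ta=tesla:managed).

    #include <vector>

    // Hypothetical example: advance particle positions by one time step.
    void push_particles(std::vector<float>& x, const std::vector<float>& v, float dt) {
        const int n = static_cast<int>(x.size());
        float* xp = x.data();
        const float* vp = v.data();
        // One directive asks the compiler to generate a GPU kernel for this loop;
        // with managed memory, the arrays migrate between host and device automatically.
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            xp[i] += vp[i] * dt;
        }
    }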

One of the groups focused on constructing the data structure for the application's particle distribution. They managed to implement an OpenACC version of that part of the code using managed memory. The team also wanted to tackle the data sparsity created by the particle distribution, in the hope that this would improve performance. They attempted to use the CUDA Thrust library to sort the particle distribution vector, but we encountered a compiler error when using managed memory and the Thrust library together. With the help of our other mentor, Kwangmin Yu, the team then decided to write CUDA kernels, which proved easier to combine with the Thrust library. In the end, the group managed to profile the kernels using nvprof and successfully speed up that part of the application.
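
As a rough illustration of the sorting step described above, this is the kind of Thrust call involved; the key layout and names are hypothetical, not the team's actual code, and it sidesteps managed memory by using explicit device vectors (built with nvcc).

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>

    // Hypothetical sketch: sort particle indices by the spatial cell each particle
    // falls into, so that particles from the same region become contiguous in memory.
    void sort_by_cell(thrust::device_vector<int>& cell_keys,
                      thrust::device_vector<int>& particle_ids) {
        thrust::sort_by_key(cell_keys.begin(), cell_keys.end(), particle_ids.begin());
    }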

Another group focused on the part of the application related to octree searching and a linear solver. We quickly saw that the octree search was not a good candidate for GPU parallelization: that part of the code was not structured for GPUs, with a lot of branching and data structures that made it hard to parallelize. The linear solver, on the other hand, used the LAPACK library to perform QR factorization with pivoting. The group found a couple of candidate routines in the MAGMA library, but ran into difficulties because there did not seem to be a batched version of the QR factorization routine that supports pivoting.
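
For context, pivoted QR on the CPU typically comes from LAPACK's dgeqp3. The hedged sketch below (via the LAPACKE C interface, with a hypothetical column-major matrix) shows the kind of routine the team was hoping to find a batched GPU counterpart for in MAGMA.

    #include <lapacke.h>
    #include <algorithm>
    #include <vector>

    // Hypothetical example: QR factorization with column pivoting of an m x n
    // matrix A stored in column-major order. On exit, A holds R in its upper
    // triangle and the Householder vectors below it; jpvt records the pivoting.
    int pivoted_qr(int m, int n, std::vector<double>& A) {
        std::vector<lapack_int> jpvt(n, 0);          // 0 = column is free to be pivoted
        std::vector<double> tau(std::min(m, n));
        return LAPACKE_dgeqp3(LAPACK_COL_MAJOR, m, n, A.data(), m,
                              jpvt.data(), tau.data());
    }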

Overall, the team felt that they learned a lot and were much better prepared to approach GPU programming. Although not every task was ported to the GPU successfully, the team left with a plan for how to target the remaining tasks in the future. One of the team members said he will take over the project as part of his Ph.D. work and was excited about all that he had learned! What a fantastic outcome of a five-day hackathon event! :-)

Team: SBNBuggy (Eric Wright)

My team came prepared with a small portion of their code isolated. We still had problems with some library dependencies, but overall it went well. Early on we ran into some problems compiling with PGI, and honestly I don't think we would ever have gotten past them without help from Mat Colgrove and the rest of the PGI developer techs. After we got it to compile, we had little trouble getting it to work on multicore. We then tried to use managed memory on the GPU, but it wasn't working, so we had to go over some advanced data management concepts to get all of their STL vectors working on the GPU. For their first bit of code we saw a very nice speedup.
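
The gist of that "advanced data management" for STL vectors is that OpenACC directives operate on raw pointers, so the vector's underlying buffer has to be moved explicitly. Here is a hedged, hypothetical sketch of the pattern (the names and loop body are mine, not the team's code):

    #include <vector>

    void scale_hits(std::vector<double>& hits, double gain) {
        double* data = hits.data();
        const int n = static_cast<int>(hits.size());

        // Allocate the buffer on the device and copy the host contents in.
        #pragma acc enter data copyin(data[0:n])

        #pragma acc parallel loop present(data[0:n])
        for (int i = 0; i < n; ++i) {
            data[i] *= gain;
        }

        // Copy results back to the host and release the device copy.
        #pragma acc exit data copyout(data[0:n])
    }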

After that first success, each person got a different objective, and Mat and I went around the table helping whoever was stuck. One of the best "a-ha" moments came when Mat and I sat down with one of the members to optimize a semi-complex loop nest. The loop looked like a version of matrix multiply, but with a few extra steps. It was pretty cool to hear Mat explain how the optimizations were being applied on the GPU at a low level. One variant we tried ended up taking ~15 seconds on the GPU, while another took ~1 second, so the optimizations made a huge difference.
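
To make the loop-nest discussion concrete, here is a hedged sketch of a matrix-multiply-like nest with an explicit gang/vector mapping; the real loop had extra steps, so this only illustrates the pattern and is not the team's code.

    // Hypothetical matrix-multiply-like nest: the two outer loops are collapsed
    // across gangs, and the inner reduction loop is mapped to the vector lanes.
    void mat_mul_like(const float* A, const float* B, float* C, int n) {
        #pragma acc parallel loop gang collapse(2) \
            copyin(A[0:n*n], B[0:n*n]) copyout(C[0:n*n])
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                float sum = 0.0f;
                #pragma acc loop vector reduction(+:sum)
                for (int k = 0; k < n; ++k) {
                    sum += A[i*n + k] * B[k*n + j];
                }
                C[i*n + j] = sum;
            }
        }
    }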

Lastly, the team seemed very excited to go back after the event and continue working on the code, and I felt: mission accomplished!!

Team: Game of Threads (Robert Searles)

Game of Threads came ready to hack. With existing OpenMP, CUDA, and OpenACC implementations of their application, we focused primarily on optimization, with the goal of getting their OpenACC code to perform on par with their CUDA code. Specifically, we were looking at the GPP and FF kernels. These are the two main computational kernels in the BerkeleyGW code, which is used to calculate accurate electronic and optical properties of materials of different dimensionalities and complexity, from bulk semiconductors and metals to nanostructured materials and molecules.

Each of these kernels contains a few computational functions, and they all exhibited the same issue: the OpenACC versions were slower than the CUDA versions. Thanks to the relatively simple nature of OpenACC's syntax, we were able to explore a plethora of different parallel configurations within the functions' loop nests. Since the vector-level loops of the code contained a lot of computation, we also explored different block sizes and vector lengths, which is very easy to do with OpenACC. We went through this process for each of the computationally intensive functions in GPP and FF. In the end, we had an OpenACC implementation of each kernel that matched the performance of the CUDA code. It just took a bit of tweaking and some trial and error, but we got there!
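
The tuning itself amounts to sweeping a couple of clauses on an otherwise unchanged loop nest and timing each variant. A hedged, hypothetical sketch of the pattern (the loop body and the specific values are placeholders, not the GPP/FF kernels):

    // Hypothetical tuning example: gang count and vector length are the knobs
    // one would sweep (e.g. vector_length of 64, 128, 256) while profiling.
    void row_norms(const double* in, double* out, int n, int m) {
        #pragma acc parallel loop gang num_gangs(1024) vector_length(128) \
            copyin(in[0:n*m]) copyout(out[0:n])
        for (int i = 0; i < n; ++i) {
            double acc = 0.0;
            #pragma acc loop vector reduction(+:acc)
            for (int j = 0; j < m; ++j) {
                acc += in[i*m + j] * in[i*m + j];
            }
            out[i] = acc;
        }
    }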

My team members said they learned a lot through this process, and they were very excited to see that their code could match CUDA's performance using OpenACC. The portability and low programmer overhead OpenACC offers are very appealing to them moving forward. I was also able to teach them a bit about the default behaviors of PGI's compiler that they weren't aware of, so they now have a better understanding of how that works as well. They had prior experience using nvprof, but we also used PGProf at the hackathon. They were very happy to see that PGProf's output can be opened in NVIDIA's visual profiler, which proved to be a useful tool for us as well. After the hackathon, they plan to integrate this code back into BerkeleyGW.

Team: hpMUSIC (Kyle Friedline)

Throughout the week, the hpMUSIC team worked to port a CFD code for jet turbines to GPUs using CUDA. Within the first day, we saw that the size of the problem before us was beyond our ability to port within the week. Our two main challenges were templated data structures and a large number of compute functions. So instead of trying to get a full running port of the code on the GPU, we decided to focus on learning and on porting two functions into compute kernels. Throughout the week, I worked mostly on teaching about parallelizing the computation, while my co-mentor taught how to transfer the complicated data structures to the device and manage the data to minimize transfer overhead.

One of the functions employed nested for loops five layers deep, with only the fourth loop having a loop-carried dependence. Each loop also only iterated 3-6 times, resulting in a total loop iteration count of 810. I showed them how to collapse the nest, leaving the fourth loop with the dependence and moving the others into the CUDA thread index. Even though this wasn't an optimal situation for GPU execution and we didn't spend much time on optimizations, we were able to show an over 100x speedup over a single CPU core!! This was fantastic!

I continued to teach them about various optimizations to improve performance, such as loop tiling, minimizing memory accesses, and methods to reduce register pressure. Sadly, we couldn't cover everything; it turns out even a 5-day hackathon is not enough when things get this exciting! However, I hope that what they learned throughout the week will help them as they continue to port the remaining 99% of the code to the GPU.
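
Below is a hedged sketch of that collapse, with hypothetical loop bounds chosen so the nest multiplies out to 810 iterations; the independent loops become the CUDA thread index and the dependent loop stays sequential inside each thread. The update itself is a placeholder, not the hpMUSIC code.

    // Hypothetical bounds: 3 x 3 x 6 (independent) x 5 (dependent) x 3 (independent) = 810.
    __global__ void collapsed_kernel(double* field) {
        const int NI = 3, NJ = 3, NK = 6, NL = 5, NM = 3;
        const int tid = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per (i, j, k, m)
        if (tid >= NI * NJ * NK * NM) return;

        // Recover the original loop indices from the flattened thread id.
        const int m = tid % NM;
        const int k = (tid / NM) % NK;
        const int j = (tid / (NM * NK)) % NJ;
        const int i =  tid / (NM * NK * NJ);

        // The fourth loop carries a dependence, so it runs serially inside each thread.
        for (int l = 1; l < NL; ++l) {
            const int idx      = (((i * NJ + j) * NK + k) * NL + l) * NM + m;
            const int idx_prev = (((i * NJ + j) * NK + k) * NL + (l - 1)) * NM + m;
            field[idx] += 0.5 * field[idx_prev];   // placeholder update
        }
    }

    // Launch sketch: 162 threads cover every (i, j, k, m) combination.
    // collapsed_kernel<<<1, 162>>>(d_field);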

They loved the event and expressed a lot of interest in returning for the next hackathon, possibly to experiment with implementing the code using OpenACC.