% TODO: put in numbers and impact
% TODO: explain role in each
% TODO: check for grammar
% TODO: get checked by recruiters
\newcommand{\myExpOne}{
\item Conducted pathfinding and software engineering for Kokkos tools integrated with (1) HPC performance monitoring and feedback via LDMS and (2) PMPI and adaptive runtime systems for MPI.
\item Developed AI-assisted HPC tools using LLMs (coderosetta.com) and autotuning (TAU+APEX) for Kokkos applications running on NVIDIA GPUs, resulting in a poster presentation at GTC 2025.
\item Conducted research and pathfinding on the use of AI chips, e.g., Cerebras WSE-3, for science simulations.
\item Submitted two proposals on correctness tools for HPC, each with \$1.5M in funding for 3 years.
}
\textbf{Sandia National Laboratories}\\
\textit{Principal Member of Technical Staff II} \hfill \textit{July 2024 - Present}
%\vspace{-0.02in}
\noindent
\begin{itemize}[itemsep=-0.1em]\onlyitems[include={1,2}]
\myExpOne
\end{itemize}
\newcommand{\myExpTwo}{
\item Developed and maintained Kokkos Tools, covering the CMake and Spack build systems, tooling overheads, CI/CD, auto-tuning, and NVTX/ROCTx/VTune integration, leading to 15 merged GitHub PRs.
% \item Contributed to autotuning features to the Kokkos 4.5 release.
\item Developed a debugging tool that detected 7 common Kokkos user bugs by analyzing LLVM IR of Kokkos programs via symbolic execution, leading to a paper at SC24's Correctness workshop.
\item Implemented a prototype LLVM OpenMP feature for index set splitting of an OpenMP loop, leading to a 1.2x speedup for an OpenMP + CUDA benchmark and to OpenMP 6.0's new split directive.
% \item Developed and implemented CPU+GPU auto-tuning optimizations, leading to a 1.2x speedup.
\item Drafted standards for OpenMP multi-GPU features for NVIDIA DGX and for GPUs on AWS, Google Cloud, and OCI, leading to 19 proposed features for OpenMP versions 6.1 and 7.0.
%which led to technical report on the OpenMP Set object.
% \item Porting HPC Tools and Runtime Systems to exascale+multicloud, collaborating with AWS, Google Cloud and Oracle OCI.
% \item Impacted six different production-level scientific applications running on El Capitan and Frontier supercomputers.
}
\noindent
\textit{Senior Member of Technical Staff} \hfill \textit{August 2022 - June 2024}
%\vspace{-0.02in}
\begin{itemize}[itemsep=-0.1em]
\myExpTwo
\end{itemize}
\newcommand{\myExpThree}{
% \item Contributed to developing an LLVM OpenMP implementation, specifically the OpenMP implementation's compiler and its runtime, targeted for the Department of Energy's upcoming Exascale Supercomputer platforms.
\item Implemented OpenMP user-defined multi-GPU scheduling for LLVM, offering a 2.1x speedup over MPI parallelization, leading to papers at IWOMP 2020 and BCB 2021.
\item Implemented performance optimizations in LLVM for OpenMP asynchronous GPU offloading that achieved a 1.2x speedup, leading to a paper at SC22's HiPar workshop.
\item Developed performance benchmarks that evaluated 5 major vendor OpenMP GPU implementations, leading to an ACM journal paper and an IWOMP 2021 workshop paper.
% \item Developed benchmarks and evaluating OpenMP implementations, e.g., LLVM's OpenMP, NVIDIA's OpenMP, on Exascale Supercomputers.
\item Demonstrated technical leadership as technical project manager for the ECP SOLLVE project, submitting 12 ECP milestone reports, organizing 7 GPU hackathons, and defining 3 project KPIs.
%and voting in 5 OpenMP Committee meetings.
}
\noindent
\textbf{Brookhaven National Laboratory}\hfill
\textit{Assistant Computational Scientist} \hfill \textit{May 2019 - August 2022}
%\vspace{-0.02in}
\begin{itemize}[itemsep=-0.1em]
\myExpThree
\end{itemize}
\newcommand{\myExpFour}{
\item Implemented, tested and experimented with User-defined Loop Schedules (UDS) for OpenMP, leading to a paper at IWOMP 2018 and a prototype library for LLVM and GCC.
\item Added the UDS feature to RAJA and Charm++'s CkLoop, with 1 GitHub PR merged in Charm++.
}
\noindent
\textbf{Charmworks}\hfill
\textit{Software Engineer} \hfill \textit{May 2018 - May 2019}
\vspace{-0.01in}
\begin{itemize}[itemsep=-0.1em]
\myExpFour
\end{itemize}
\newcommand{\myExpFive}{
\item Analyzed and optimized the performance of a 3-D image reconstruction application on NVIDIA GPUs via CUPTI and auto-tuning, leading to a performance-enhanced CUDA version of the application.
\item Developed tuning support for coordinated loop scheduling and load balancing in Charm++, leading to a 1.2x speedup on a particle-in-cell benchmark code and a Best Poster Candidate at SC18.
}
\noindent
%\comments{
\textbf{USC - Information Sciences Institute}\hfill
\textit{Computer Scientist} \hfill \textit{Dec 2016 - May 2018}
\begin{itemize}[itemsep=-0.1em]
\myExpFive
\end{itemize}
%\vspace{-0.02in}
%\item Worked in team to manage computational performance aspects of running an application program for 3-D image reconstruction algorithms on NVIDIA GPUs.
%\item Ensured external network infrastructure to support transfer of application code's input data files were adequate for an application code's efficient execution using the Globus Toolkit.
%\item Translated an x-ray tomography code written in Matlab code to C code and then parallelizing it to run on a supercomputer
%having nodes with GPGPUs.
%\item \small Doing optimizations for MPI+CUDA application code involving low-overhead loop scheduling and loop optimizations such as loop unrolling.
%\item \small Working on transformations in LLVM.
\newcommand{\myExpSix}{
\item Extended Charm++ to offer a novel runtime system capability of coordinating inter-node load balancing and intra-node loop scheduling, leading to 2 GitHub PRs merged in Charm++.
%\item Provided PR reviews for CharmMPI, an MPI implementation providing adaptive runtime support.
%Implemented 2 loop schedules in the extended Charm++ library, leading to a 1.4x speedup of a 3-D Jacobi code on a node of LBL-NERSC's Cori.
%\item Helped to improve portability of Charm++ to desktop platforms.
}
\noindent
\textbf{Charmworks}\hfill
\textit{Software Developer} \hfill \textit{Jan 2016 - Dec 2016}
%\vspace*{-0.02in}
\begin{itemize}[itemsep=-0.1em]
\myExpSix
%TODO: consider adding 'including in cloud environments' at the end of
%the sentence.
%TODO: make paragraph
%\item Assisted with business aspects of a high-tech startup.
\end{itemize}
\noindent
\textbf{University of Illinois}\hfill
\textit{Postdoctoral Associate} \hfill \textit{Jul 2015 - Dec 2015}
%\vspace*{-0.02in}
\begin{itemize}[itemsep=-0.1em]
%\item Developed LLVM OpenMP lw-sched library that allows application programmers to use strategies from dissertation.
\item Sped up a plasma-physics Fortran MPI+OpenACC code by 1.2x via a combination of GPU offload optimizations and loop transformations on an NVIDIA K80 GPU.
%\item Incorporated over-decomposition and locality awareness into low-overhead OpenMP loop scheduling strategies.
\end{itemize}