My primary research interests are extracting maximum performance from next-generation supercomputers, developing efficient programming models for GPUs and multithreaded environments, and performance portability:
AMT on 1024 GPUs
Contributed to building the heterogeneous task scheduler in the Uintah Asynchronous Many-Task (AMT) runtime system, intended for exascale coal boiler simulations. Ported several simulation tasks to GPUs using Kokkos and fixed multiple race conditions in the scheduling logic. Successfully ran the simulation on 1024 NVIDIA V100 GPUs on LLNL's Lassen cluster.
MPI Endpoints to accelerate a linear equation solver
Improved the CPU performance of Hypre (a linear equation solver) by up to 2.4x on 256 KNLs (ANL's Bebop cluster) using MPI Endpoints and a new threading model. Optimized the CUDA version of Hypre for a performance improvement of up to 1.44x on 512 NVIDIA V100 GPUs on LLNL's Lassen cluster.
Related Publications:
Improving Performance of the Hypre Iterative Solver for Uintah Combustion Codes on Manycore Architectures Using MPI Endpoints and Kernel Consolidation. D. Sahasrabudhe, M. Berzins. In Computational Science -- ICCS 2020, 20th International Conference, Amsterdam.
Optimizing the Hypre solver for manycore and GPU architectures. D. Sahasrabudhe, R. Zambre, A. Chandramowlishwaran, M. Berzins. In Journal of Computational Science, Elsevier, pp. 101279. 2020. DOI: https://doi.org/10.1016/j.jocs.2020.101279
Portable SIMD Primitive in Kokkos
Developed a prototype of a portable SIMD primitive within Kokkos that achieves efficient vectorization on CPUs while remaining portable to GPUs (CUDA). The new primitive gives speedups of up to 7.8x on Intel's Knights Landing (KNL).
Related Publication: A Portable SIMD Primitive using Kokkos for Heterogeneous Architectures. D. Sahasrabudhe, E. T. Phipps, S. Rajamanickam, M. Berzins. In Sixth Workshop on Accelerator Programming Using Directives (WACCPD), 2019, DOI: https://doi.org/10.1007/978-3-030-49943-3_7
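The idea behind a portable SIMD primitive can be illustrated with a minimal sketch. This is not the Kokkos implementation: the names `Simd` and `axpy` are hypothetical, and the sketch assumes a fixed-width value type whose lane loops the compiler can vectorize on CPUs, with a width of 1 standing in for the one-element-per-thread case on GPUs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed-width vector type with N lanes of T (not the Kokkos API).
template <typename T, std::size_t N>
struct Simd {
  std::array<T, N> lane{};  // zero-initialized lanes

  Simd() = default;
  explicit Simd(T value) { lane.fill(value); }  // broadcast a scalar

  T& operator[](std::size_t i) {
    assert(i < N);
    return lane[i];
  }
  const T& operator[](std::size_t i) const {
    assert(i < N);
    return lane[i];
  }
};

// Lane-wise addition; the fixed trip count N makes the loop easy to vectorize.
template <typename T, std::size_t N>
Simd<T, N> operator+(const Simd<T, N>& a, const Simd<T, N>& b) {
  Simd<T, N> r;
  for (std::size_t i = 0; i < N; ++i) r[i] = a[i] + b[i];
  return r;
}

// Lane-wise multiplication.
template <typename T, std::size_t N>
Simd<T, N> operator*(const Simd<T, N>& a, const Simd<T, N>& b) {
  Simd<T, N> r;
  for (std::size_t i = 0; i < N; ++i) r[i] = a[i] * b[i];
  return r;
}

// A kernel written once against Simd; only the width N changes per platform.
template <typename T, std::size_t N>
Simd<T, N> axpy(T alpha, const Simd<T, N>& x, const Simd<T, N>& y) {
  return Simd<T, N>(alpha) * x + y;
}
```

In this sketch the same `axpy` source could be instantiated with, for example, `Simd<double, 8>` on a CPU with 512-bit vectors and `Simd<double, 1>` on a GPU, which is the kind of portability the primitive aims for.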
Resiliency using ULFM
Developed a resiliency component for Uintah that uses ULFM MPI (a variant of MPI that allows fault handling after a rank crashes) to recover from rank or node failure, rebuild the failed rank's patches, and continue normal execution. Implemented several interpolation schemes to rebuild patch data from coarse values. The new Algorithm-Based Fault Tolerance (ABFT) scheme achieved a 10x speedup on 128 nodes compared to traditional checkpointing.
Related Publication: Node failure resiliency for Uintah without checkpointing. D. Sahasrabudhe, M. Berzins, J. Schmidt. In Concurrency and Computation: Practice and Experience, pp. e5340. 2019. DOI: https://doi.org/10.1002/cpe.5340
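Rebuilding lost patch data from coarse values can be sketched in one dimension with linear interpolation. This is only an illustration under simplifying assumptions, not one of the schemes from the paper: the function name `rebuildFromCoarse`, the 1D node-centered layout, and the integer refinement ratio are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: reconstructs fine-level values for a lost 1D patch by
// linearly interpolating between surviving coarse-level values.
// Assumes node-centered data and `ratio` fine intervals per coarse interval.
std::vector<double> rebuildFromCoarse(const std::vector<double>& coarse,
                                      std::size_t ratio) {
  assert(coarse.size() >= 2 && ratio >= 1);
  const std::size_t nFine = (coarse.size() - 1) * ratio + 1;
  std::vector<double> fine(nFine);
  for (std::size_t i = 0; i < nFine; ++i) {
    // Index of the coarse interval containing fine node i (clamped so the
    // last fine node uses the last interval with weight t == 1).
    const std::size_t c = std::min(i / ratio, coarse.size() - 2);
    const double t =
        static_cast<double>(i - c * ratio) / static_cast<double>(ratio);
    fine[i] = (1.0 - t) * coarse[c] + t * coarse[c + 1];
  }
  return fine;
}
```

With coarse values {0, 2, 4} and a ratio of 2, the sketch fills in the fine nodes halfway between each coarse pair. The rebuilt data is an approximation of what was lost, which is the trade-off such a scheme makes to avoid reading a checkpoint.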
Asynchronous Multi-threaded Task Scheduler
Contributed to porting Uintah to Sunway TaihuLight. Adapted Uintah's task scheduler for the Sunway architecture. The scheduler offloads tasks to slave cores for asynchronous execution and meanwhile continues processing MPI communication, reduction tasks, etc. The scheduler achieved 96% parallel efficiency on Sunway.
Related Publication: A Preliminary Port and Evaluation of the Uintah AMT Runtime on Sunway TaihuLight. Z. Yang, D. Sahasrabudhe, A. Humphrey, M. Berzins. In 9th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2018), IEEE, May 2018.
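The offload pattern described above can be sketched with standard C++ threads. This is not Uintah's scheduler and does not use the Sunway programming interface: `computePatch`, `runTimestep`, and `RunResult` are hypothetical names, `std::async` stands in for offloading to slave cores, and a simple counter stands in for the MPI communication the managing thread would keep processing.

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical compute task: sums the cell values of one patch.
double computePatch(const std::vector<double>& patch) {
  return std::accumulate(patch.begin(), patch.end(), 0.0);
}

struct RunResult {
  double total;               // reduction over all patch results
  std::size_t commStepsDone;  // placeholder for communication work done
};

// Launches every patch task asynchronously, keeps the calling thread busy
// with other work, then collects the results.
RunResult runTimestep(const std::vector<std::vector<double>>& patches) {
  std::vector<std::future<double>> pending;
  pending.reserve(patches.size());
  for (const auto& p : patches) {
    pending.push_back(
        std::async(std::launch::async, computePatch, std::cref(p)));
  }

  // While the offloaded tasks run, the managing thread is free. Here it only
  // increments a counter; a real scheduler would progress communication.
  std::size_t commSteps = 0;
  for (std::size_t i = 0; i < patches.size(); ++i) ++commSteps;

  // Gather the results and reduce them on the managing thread.
  double total = 0.0;
  for (auto& f : pending) total += f.get();
  return {total, commSteps};
}
```

The design point the sketch tries to show is that the managing thread never blocks on a task immediately after launching it; it waits only once it has no other work left.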
Acknowledgment
I am very thankful to the DOE NNSA PSAAP-II program, the Intel Parallel Computing Center at the SCI Institute, the National Science Foundation, and Sandia National Laboratories for supporting this research.