Don Maxwell
Oak Ridge National Laboratory (US)
Research Areas
Parallel Computing and Optimization Techniques, Cloud Computing and Resource Management, Distributed and Parallel Computing Systems, Distributed systems and fault tolerance, Software System Performance and Reliability
Most-Cited Works
- Understanding GPU errors on large-scale HPC systems and the implications for system design and operation (2015), 179 cited
- The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems (2018), 164 cited
- Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility (2015), 86 cited
- Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems (2015), 65 cited
- TUE, a New Energy-Efficiency Metric Applied at ORNL’s Jaguar (2013), 36 cited
- GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability (2020), 31 cited
- Monitoring Tools for Large Scale Systems (2010)
- Scaling the Summit: Deploying the World’s Fastest Supercomputer (2019), 21 cited
- Wireless Temperature Monitoring in Remote Systems Analog (2002)
- Analyzing a Five-Year Failure Record of a Leadership-Class Supercomputer (2019), 16 cited