Strong Memory System Reliability
The size and sensitivity of computer memory make its protection the first order of business for a reliability-conscious designer. Despite the long and successful history of using error coding techniques to mitigate memory error rates, there is still a need for strong and flexible memory error protection techniques for large supercomputers and other high-performance systems. Correspondingly, much of my recent research focuses on techniques to provide very high levels of main memory reliability without exceeding the current industry standard storage footprint for error-correcting codes.
Efficient and Reliable Application-Specific Acceleration
Increasing levels of integration allow specialized hardware units to be placed on-chip cost-effectively. This, combined with the ever-increasing need for energy-efficient execution, will make the hardware acceleration of important applications and workloads more commonplace. Towards this end, some of my research has addressed the efficient acceleration of workloads that exhibit fine-grained gather/scatter memory access patterns, DRAM link compression, and the reliability characterization of DNN accelerators.
Arithmetic Error Detection
Rising levels of integration and decreasing component reliabilities make error protection increasingly important. At the same time, the need for energy efficiency necessitates the careful evaluation of resilience techniques. Arithmetic error protection is typically more expensive than the protection of memory or data movement, requiring large amounts of redundant logic. Protection of computer arithmetic has correspondingly been reserved for critical or high-availability applications. Current trends, however, indicate that low-cost arithmetic error detection will be necessary in the future across diverse application areas. Towards this end, some of my research is focused on providing strong, flexible, and low-cost error protection for arithmetic operations.
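As a concrete illustration of what low-cost arithmetic error detection can look like, the sketch below uses a mod-3 residue check on integer multiplication. This is a classic textbook technique chosen only for illustration, not necessarily the approach taken in this research. The function names and the `fault` parameter are hypothetical; `fault` is a bit mask used to simulate a hardware error in the product.

```python
def residue(x, m=3):
    """Return the residue of x modulo the check modulus m."""
    return x % m


def checked_mul(a, b, m=3, fault=0):
    """Multiply a and b, then verify the product with a residue check.

    fault is a hypothetical bit mask XORed into the product to
    simulate a hardware error; it is not part of a real design.
    """
    result = (a * b) ^ fault
    # The residue of a correct product equals the product of the
    # operand residues, reduced modulo m.
    expected = (residue(a, m) * residue(b, m)) % m
    ok = residue(result, m) == expected
    return result, ok


if __name__ == "__main__":
    print(checked_mul(12345, 6789))             # fault-free product
    print(checked_mul(12345, 6789, fault=1 << 4))  # single flipped bit
```

The check operates on small residues rather than a full duplicate of the multiplier, which is why residue codes are usually cheaper than replication. A mod-3 check catches every single-bit flip in the result, because a flip changes the value by a power of two and no power of two is divisible by 3; some multi-bit errors can still escape it.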
Reliability and resilience are major obstacles on the road to exascale computing. The growing number of components required for exascale systems and the decreasing inherent reliability of components in future fabrication technologies conspire to make reliability a first-order design concern. A strong system-level approach towards reliability is needed in order to efficiently handle errors at all scales. In addition, it is important to enable and exploit cross-layer reliability through system-level mechanisms: a plethora of different failures can occur in a large computer system, and superior efficiency can only be achieved by dealing with every error in the appropriate manner and at the appropriate system layer.
The speed and levels of integration of modern devices have risen to the point that arithmetic can be performed very fast and with high precision. While precise arithmetic is important for some applications, it comes at a hidden energy cost; by computing most results past the precision they require, systems use their resources inefficiently. These inefficiencies will become increasingly problematic for future power-constrained devices. I am interested in the co-design of precision-aware hardware and software algorithms to create systems with superior energy efficiency.