Fault Tolerant High Performance Computing Systems

Supercomputers are highly specialized, massively parallel systems that mostly run simulations for scientists and engineers. Those simulations help to model, explain, understand, and predict a wide range of phenomena. From atomic reactions that explain the origin of the universe, to material design for new mobile devices, supercomputers provide the necessary horsepower to run computationally demanding applications. The complexity of those applications has been steadily growing, forcing high performance systems to grow at an exponential rate. Every new supercomputer generation brings substantially bigger and more powerful machines than the previous generation.

Such sustained growth does not come without its downsides. A bigger and more complex supercomputer is more difficult to program and harder to operate. Part of the reason comes from the mere fact that assembling a huge number of components makes any system more fragile. An increase in failure frequency is inevitable if the reliability of the parts does not increase to match the speed at which those parts are assembled. Supercomputers of the future are expected to fail more often than their current counterparts. The figure below shows (on the left plot) a historical view of the size of the top 10 supercomputers (from the Top500 lists) over the last 20 years. The size of a system is measured by its number of sockets, a simplifying assumption that leaves out many important components of the system. Even by this measure, supercomputers have grown exponentially, and some studies predict an exascale system will have between 200K and 1M sockets. The matching MTTI (mean-time-to-interrupt) of such systems is presented in the figure below (on the right plot). For different values of failure frequency per socket per year (f), we get an approximation of the system’s reliability. Even if parts are highly reliable (f=0.01, i.e. they fail once every 100 years!), an exascale system with 1M sockets will fail every 52 minutes.
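The 52-minute figure follows from a simple back-of-the-envelope model: if socket failures are independent and each socket fails f times per year, a system of n sockets fails n·f times per year. A minimal sketch of that estimate (the independence assumption is the whole model here):

```python
# Back-of-the-envelope MTTI estimate, assuming independent socket failures:
# a system of n sockets, each failing f times per year, fails n*f times per year.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960

def system_mtti_minutes(n_sockets: int, f_per_socket_year: float) -> float:
    """Mean time to interrupt, in minutes, for n_sockets independent parts."""
    return MINUTES_PER_YEAR / (n_sockets * f_per_socket_year)

# Highly reliable parts: one failure per socket per 100 years (f = 0.01).
mtti = system_mtti_minutes(1_000_000, 0.01)
print(f"{mtti:.1f} minutes")  # ~52 minutes, matching the text
```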

[Figure: socket counts of the top 10 supercomputers over the last 20 years (left); system MTTI for several per-socket failure rates f (right)]


One of my research interests is to understand the problem of failures in supercomputers. I aim to design, analyze, and build fault tolerant HPC systems. Here are some of the research directions I have pursued:

  • Fault-tolerance protocol design, including rollback-recovery and replication strategies.
  • Failure data analysis, using data from real-world supercomputers to find interesting patterns.
  • Interplay of fault tolerance and energy efficiency, understanding the tradeoffs of two of the most important challenges in extreme-scale systems.
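Rollback-recovery, the first of these directions, is in its simplest form periodic checkpointing: save the application state at intervals, and after a failure restore the last checkpoint instead of restarting from scratch. A deliberately simplified in-memory sketch (the class, the failure injection, and the solver are illustrative, not taken from any particular system):

```python
import copy

class CheckpointedSolver:
    """Toy iterative solver with application-level checkpoint/rollback."""

    def __init__(self):
        self.step = 0
        self.state = 0.0
        self._checkpoint = None

    def checkpoint(self):
        # Save a copy of the full state (a real system would write it to
        # stable storage or to a partner node's memory).
        self._checkpoint = copy.deepcopy((self.step, self.state))

    def rollback(self):
        # Restore the last saved state after a failure.
        self.step, self.state = copy.deepcopy(self._checkpoint)

    def run(self, total_steps, interval, fail_at=None):
        self.checkpoint()
        while self.step < total_steps:
            if self.step == fail_at:      # injected failure
                fail_at = None            # fail only once
                self.rollback()           # recover from the last checkpoint
                continue
            self.state += 1.0             # one unit of "work"
            self.step += 1
            if self.step % interval == 0:
                self.checkpoint()
        return self.state

solver = CheckpointedSolver()
print(solver.run(total_steps=100, interval=10, fail_at=57))  # 100.0
```

Despite the failure at step 57, the run completes all 100 steps; only the work since the checkpoint at step 50 is repeated, which is the essential trade-off checkpoint interval selection tries to optimize.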

Check out my publications for a more comprehensive list of my work in this area.

Programming with Parallel-Objects

The von Neumann model of computation has been a cornerstone of sequential computer programming. There is no equivalent in parallel computing. That absence is part of the reason so few programmers are trained to use the plethora of available parallel architectures. To make things worse, the upcoming generation of extreme-scale systems will challenge the existing programming models of parallel computing. In a very large supercomputer, execution will face a great deal of variability. From the thermal properties of components to the architecture of the processing units, the programming model will have to be flexible enough to accommodate the requirements of this new kind of processing.

A promising programming and computational model is called parallel objects. This model draws ideas from agent-oriented programming, active messages, and object-oriented programming. A programmer decomposes the program into entities (objects) that each perform a specific function and hold a portion of the program’s data. Each object exports a list of methods that can be called remotely by other objects. It is private-memory, message-driven programming. Implementations such as Charm++ or Global Arrays attest to the potential of this model.
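The essence of the model can be illustrated without any particular runtime: each object owns private data, objects interact only by sending messages, and a scheduler delivers each message by invoking a method on the target object. A minimal sketch (the scheduler and the Accumulator object are illustrative; Charm++’s actual API differs):

```python
from collections import deque

class Scheduler:
    """Delivers messages to objects one at a time (message-driven execution)."""
    def __init__(self):
        self.queue = deque()

    def send(self, obj, method, *args):
        # Asynchronous remote method invocation: enqueue and return immediately.
        self.queue.append((obj, method, args))

    def run(self):
        while self.queue:
            obj, method, args = self.queue.popleft()
            getattr(obj, method)(*args)

class Accumulator:
    """A parallel object: private memory plus remotely invokable methods."""
    def __init__(self, sched):
        self.sched = sched
        self.total = 0  # private data, never touched directly by other objects

    def add(self, value):
        self.total += value

    def report(self, target):
        # Objects communicate only via messages, never via shared memory.
        self.sched.send(target, "add", self.total)

sched = Scheduler()
a, b = Accumulator(sched), Accumulator(sched)
sched.send(a, "add", 3)
sched.send(a, "add", 4)
sched.send(a, "report", b)
sched.run()
print(b.total)  # 7
```

Because each object encapsulates its own data and is driven purely by incoming messages, a runtime is free to place objects on any node and to move them, which is precisely what enables the adaptive features described next.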

The figure below presents a typical implementation of parallel objects. The top part shows the abstract view of a program with different objects interacting via messages. In the middle, an adaptive runtime system (ARTS) performs all dynamic performance-related functions. For example, the ARTS defines (and refines) the mapping of objects to computational nodes, seen at the bottom of the figure. The assignment of objects to nodes can be changed by the ARTS to provide load balancing and improve the performance of the system.
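One of the performance-related functions mentioned above, mapping objects to nodes, can be sketched as a greedy load balancer: place the heaviest unassigned object on the currently least-loaded node. This is an illustrative strategy under the assumption that each object's load is known (e.g. measured by the runtime); it is not the algorithm of any specific ARTS:

```python
import heapq

def greedy_map(object_loads, num_nodes):
    """Map object loads to nodes, greedily minimizing the maximum node load.

    Returns a list where assignment[i] is the node chosen for object i.
    """
    # Min-heap of (current load, node id): least-loaded node is always on top.
    nodes = [(0.0, n) for n in range(num_nodes)]
    heapq.heapify(nodes)
    assignment = [None] * len(object_loads)
    # Longest-processing-time order: heaviest objects are placed first.
    for i in sorted(range(len(object_loads)),
                    key=lambda i: object_loads[i], reverse=True):
        load, node = heapq.heappop(nodes)
        assignment[i] = node
        heapq.heappush(nodes, (load + object_loads[i], node))
    return assignment

loads = [5.0, 3.0, 2.0, 7.0, 1.0]
mapping = greedy_map(loads, 2)
per_node = [sum(l for l, n in zip(loads, mapping) if n == k) for k in range(2)]
print(per_node)  # [9.0, 9.0] -- total work of 18 split evenly across two nodes
```

A real ARTS would rerun a strategy like this periodically with measured loads and would also weigh migration cost and communication between objects, not just compute load.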

Exploring the parallel objects model is one of my research interests. Some of the topics I have worked on are:

  • Design of load balancing strategies, to fit particular parallel architectures.
  • Implementation of scientific applications, to benefit from the dynamic features of ARTS and parallel objects.
  • High-level language interfaces to ARTS, to provide the best of both worlds: expressivity and high performance.
  • Parallel objects for accelerators, to exploit the monumental computing capability of these devices.

Check out my publications for a more comprehensive list of my work in this area.


Program Committee Member

  • International European Conference on Parallel and Distributed Computing, EuroPar (2016)
  • IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID (2016)
  • Latin American High Performance Computing Conference, CARLA (2016, 2017)
  • International Workshop on Fault Tolerant Systems, FTS (2015, 2016, 2017)
  • International Conference of the Chilean Computer Science Society, SCCC (2014, 2015, 2016)
  • International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD (2014, 2015)
  • Jornadas Costarricenses de Investigación en Computación e Informática, JOCICI (2015, 2017)

Journal Reviewer

  • Sage International Journal of High Performance Computing Applications (IJHPCA)
  • ACM Transactions on Architecture and Code Optimization (TACO)
  • Springer Cluster Computing (CLUS)
  • Springer The Journal of Supercomputing (TJS)
  • IEEE Transactions on Parallel and Distributed Systems (TPDS)
  • IEEE Transactions on Sustainable Computing (TSUSC)
  • Elsevier Journal of Parallel and Distributed Computing (JPDC)
  • Elsevier Parallel Computing (ParCo)
  • IOP Journal of Physics: Conference Series (JPCS)


  • Here is a list of research ideas that may be a good starting point for a Master’s thesis.
  • Here is a list of publishing venues for HPC research.
  • Here is a list of HPC benchmark suites.
  • Here is a study guide for HPC first timers.