
High-Level Transformations with OP2

6.1 Checkpointing to improve resiliency

Due to the increasing number of components in modern computer systems, the mean time to failure is decreasing to the point where it may be smaller than the execution time of a large-scale simulation. More complex systems inherently require more complex software at every level (the operating system, parallel libraries and applications), which increases the probability of software errors. One of the advantages of using Domain Specific Languages such as OP2 is that the complexity of orchestrating parallelism is hidden from the application developer and rests solely with the parallel programming experts who develop and maintain the library; this helps minimise the number of bugs. Regardless of the level at which an error occurs, and whether it is a hard error (causing the application to crash) or a soft error (that might go unnoticed for a while), it is crucial to provide the means to recover from these errors and finish running the application.

The detection of soft errors is the target of intense research: at the lowest, hardware level, Error Checking & Correcting (ECC) circuitry is added to memory to detect bit flips due to e.g. cosmic radiation. Errors occurring during computations are more difficult to detect; hardware redundancy has been proposed, but duplicating circuitry would be very expensive [81]. Due to the complexity of detecting soft errors, here I only consider the case where the error was already detected by some means external to OP2.

Traditionally in scientific computing, developers had to manually implement some sort of checkpointing system: saving the entire state space to disk and, in case of failure, relaunching the application and providing some means of fast-forwarding to the point in execution where the last checkpoint was taken. Here, I show that this process can be automated in its entirety, and furthermore that it is possible to find a point during execution where the state space is minimal and only save the necessary data. I demonstrate three approaches: one that makes a “locally optimal” decision, one that attempts to make a “globally optimal” decision by assuming some form of periodic execution pattern, and a third which, given a test run, tells the user where the optimal point during execution is, so that checkpointing can be triggered there manually.

The basic idea comes from the checkpointing of transactions in database systems [82]; execution using OP2 can be considered as a sequence of parallel loops, each accessing a number of datasets and describing the type of access. This information can be used to reason about what data is required for any given loop; for example, if a loop reads A and writes B, then the prior contents of B are irrelevant, thus if a checkpoint were to be created before that loop, B would not have to be included in it. Such a transactional view of OP2 is valid because once OP2 takes control of data, the user can only access it via API calls that go through the library; the checkpointing algorithm can therefore take these calls into account when working out the state of the different datasets.
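As a minimal illustration, consider a direct loop written against the standard OP2 C API; the kernel, set and dataset names (copy_scale, cells, A, B) are made up for this example:

  // Kernel: overwrites b without reading its previous value.
  void copy_scale(const double *a, double *b) { b[0] = 2.0 * a[0]; }

  op_par_loop(copy_scale, "copy_scale", cells,
              op_arg_dat(A, -1, OP_ID, 1, "double", OP_READ),
              op_arg_dat(B, -1, OP_ID, 1, "double", OP_WRITE));
  // B is passed with OP_WRITE, so its prior contents are dead at this point:
  // a checkpoint created just before this loop does not need to include B,
  // whereas A (OP_READ) would have to be saved.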

To facilitate checkpointing, each dataset has a “dirty” flag to keep track of modifications to it. A high-level description of the checkpointing algorithm is as follows:

1. If a dataset was never modified (as might be the case with e.g. coordinates), then it is not saved at all.

2. Variables passed to loops through op_arg_gbl that are not read-only are saved at every occurrence of the loop, because data returned after a loop is out of OP2’s hands and may be used for control decisions.

3. When checkpoint creation is triggered due to a timeout, start looking for a loop with at least one OP_WRITE argument; if one is found, enter “checkpointing mode”, and before executing that loop:

(a) Save datasets used by the loop that are not OP_WRITE.

(b) Drop datasets that are OP_WRITE in the loop from the checkpoint.

(c) Flag datasets that are not used by this loop, but do not save them yet.

4. When already in “checkpointing mode” (previous point), continue executing subsequent loops and determine whether datasets that have been neither saved nor dropped (i.e. are flagged) have to be saved:

(a) If a flagged dataset is encountered, save it if it is not OP_WRITE; otherwise drop it and remove the flag.

(b) If a flagged dataset is not encountered within a reasonable timeframe allocated for the “checkpointing mode”, then save it.

The immediate benefit of this algorithm is that a checkpoint does not necessarily depend on the state of the collection of all datasets at any particular point during the execution; the “checkpointing mode” can cover several loops and gather information based on different datasets being accessed by different loops.
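To make the decision logic concrete, the per-loop bookkeeping while in “checkpointing mode” can be outlined roughly as below. This is an illustrative sketch rather than the actual OP2 implementation: the OP2 types (op_dat, op_access, OP_WRITE) are taken from its headers, while save_to_checkpoint is a hypothetical helper standing in for the HDF5 write.

  #include <map>
  #include <vector>
  #include "op_seq.h"                            // OP2 types: op_dat, op_access, OP_WRITE

  enum class CkptState { Flagged, Saved, Dropped };
  struct ArgInfo { op_dat dat; op_access acc; };  // one op_arg_dat of the loop

  void save_to_checkpoint(op_dat dat);            // hypothetical helper (e.g. HDF5 write)

  // Invoked for every parallel loop executed while in checkpointing mode.
  void on_loop_in_checkpointing_mode(const std::vector<ArgInfo> &args,
                                     std::map<op_dat, CkptState> &state) {
    for (const ArgInfo &a : args) {
      auto it = state.find(a.dat);
      if (it == state.end() || it->second != CkptState::Flagged)
        continue;                                 // already saved or dropped
      if (a.acc == OP_WRITE) {
        it->second = CkptState::Dropped;          // old contents are dead, skip saving
      } else {
        save_to_checkpoint(a.dat);
        it->second = CkptState::Saved;
      }
    }
  }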

The “locally optimal” algorithm starts looking for the first loop at which to enter “checkpointing mode”, based on the criterion that it has at least one OP_WRITE argument, but this may be a suboptimal decision if, for example, the next loop writes two other datasets and reads the one written by the first loop. However, OP2 cannot definitively predict which loops will come next, because control flow is in the hands of the user; this kind of decision may therefore result in a checkpoint of suboptimal size. There are two possible ways to improve upon this decision. The first is for OP2 to carry out a “dry run”, where it looks for potential places during execution where a globally minimal checkpoint could have been created and reports them to the user, who then places a call in the code that triggers checkpointing at the right place. The other possibility is to record the sequence of parallel loop calls and carry out a subsequence alignment analysis; assuming that some periodicity is found, this can determine the optimal location at which to begin checkpointing, so that entering the “checkpointing mode” can be postponed until that point is encountered again.
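As a simple sketch of the second option, a recorded trace of loop names could be tested for a repeating pattern; a real implementation would use a more robust subsequence alignment, but the idea is the same and the function below is purely illustrative:

  #include <string>
  #include <vector>

  // Returns the smallest period p such that the recorded trace of loop names
  // repeats with period p, or 0 if no repetition is found.
  int smallest_period(const std::vector<std::string> &trace) {
    for (int p = 1; p <= (int)trace.size() / 2; ++p) {
      bool periodic = true;
      for (size_t i = p; i < trace.size(); ++i)
        if (trace[i] != trace[i - p]) { periodic = false; break; }
      if (periodic) return p;
    }
    return 0;
  }

Once a period is established, the loop within one period with the smallest projected checkpoint size can be chosen, and checkpointing deferred until it is reached again.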

Figure 6.1. Example checkpointing scenario on the Airfoil application: for each loop, the access type of each dataset (bounds, rms, x, q, q_old, adt, res) and the units of data saved if checkpointing mode is entered at that loop.

These algorithms were implemented in the OP2 library, available at [83]; it currently supports single-node checkpointing into HDF5. There are some important practical considerations that have to be taken into account when implementing this for general unstructured grid computations: if an indirect write exists in a loop, then it has to be checked whether the mapping includes every element of the target set, and therefore whether the whole of the target dataset will be written to. I also assume that every field of a multidimensional dataset is written to when it appears in a parallel loop.
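The indirect-write check mentioned above can be sketched as follows, assuming the op_set/op_map layout of the OP2 core library (a map stores from->size * dim indices into its target set); this is illustrative rather than the library's actual code:

  #include <vector>
  #include "op_seq.h"   // OP2 types: op_map, op_set

  // Returns true if the mapping reaches every element of its target set, in
  // which case an indirect OP_WRITE through it overwrites the whole dataset.
  bool map_covers_target_set(op_map map) {
    std::vector<char> seen(map->to->size, 0);
    int covered = 0;
    for (int i = 0; i < map->from->size * map->dim; ++i) {
      int elem = map->map[i];
      if (!seen[elem]) { seen[elem] = 1; ++covered; }
    }
    return covered == map->to->size;
  }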

6.1.1 An example on Airfoil

Figure 6.1 shows an example of how the checkpointing algorithm works when applied to the Airfoil benchmark application. It shows, for each loop, which datasets are read (R), written (W), incremented (I) or read-and-written (RW), as well as the dimensionality, or number of components, of each dataset. It also indicates how many units of data would have to be saved if checkpointing mode were entered at the current loop. Colours are used to indicate when a dataset is saved (green), dropped (yellow) or flagged for further decision (blue).

If checkpointing were to be triggered right before the execution of adt_calc, then checkpointing mode would be entered upon reaching adt_calc, saving q and dropping adt immediately; subsequently, res would be saved when reaching res_calc, and q_old when reaching update. Since bounds and x were never modified, they are not saved, and the values of rms are saved whenever update has executed. Clearly, entering checkpointing mode at kernel adt_calc is still better than entering it before either res_calc or bres_calc, but it does not give as much data savings as if checkpointing mode were entered before either save_soln or update. Therefore either the user can add a call to trigger checkpointing before these two loops, or OP2 can apply the “speculative” algorithm and recognise that execution is likely periodic, because the sequence of kernels 1-9 repeats; it can thus delay entering checkpointing mode until either save_soln or update is reached.
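For reference, the relevant part of the Airfoil time-marching loop is sketched below with a user-placed trigger. The op_trigger_checkpoint call is purely hypothetical and only marks where such a call would go; the kernel and dataset names follow the standard Airfoil example.

  for (int iter = 1; iter <= niter; iter++) {
    op_trigger_checkpoint();   // hypothetical user-level trigger: the live state
                               // is smallest just before q_old is overwritten
    op_par_loop(save_soln, "save_soln", cells,
                op_arg_dat(p_q,    -1, OP_ID, 4, "double", OP_READ),
                op_arg_dat(p_qold, -1, OP_ID, 4, "double", OP_WRITE));
    // ... adt_calc, res_calc, bres_calc and update follow, as in Figure 6.1
  }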

In the event of a failure, the application is simply restarted, but until reaching the point in execution where the checkpoint was made, the op_par_loops do not carry out any computations; they only set the values of op_arg_gbl arguments, if any. This ensures that the same execution path is followed when fast-forwarding. Once the checkpoint is reached, all state is restored from the saved data and normal execution is resumed.
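The fast-forward behaviour can be outlined as below; this is an illustrative sketch with made-up helpers (in_fast_forward_mode, restore_gbl_from_checkpoint) and a simplified argument descriptor, not the library's actual code:

  #include <vector>

  struct LoopArg {
    bool is_gbl;       // true for op_arg_gbl arguments
    bool read_only;    // true if declared OP_READ
  };

  bool in_fast_forward_mode();                         // hypothetical query
  void restore_gbl_from_checkpoint(const LoopArg &a);  // hypothetical helper

  // Returns true if the loop's kernel should actually be executed.
  bool before_loop(const std::vector<LoopArg> &args) {
    if (in_fast_forward_mode()) {
      for (const LoopArg &a : args)
        if (a.is_gbl && !a.read_only)
          restore_gbl_from_checkpoint(a);              // reproduce control decisions
      return false;                                    // skip computation while replaying
    }
    return true;                                       // execute normally
  }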

In summary, this algorithm can deliver checkpointing to any application that uses OP2, relying only on domain-specific knowledge and the high-level specification of parallel loops. This is completely transparent to the application programmer, thereby providing resiliency to unstructured grid computations at no effort from the user.