A computer implemented method for providing fault tolerance to a plurality of instances in a system including a plurality of surviving instances includes: determining, for each of the surviving instances, an aggregate load by: retrieving a job load of each job assigned to the respective surviving instance; and summing the job loads of all of the jobs assigned to the respective surviving instance; and selecting to recover and perform, by one of the surviving instances, an orphaned job based upon the aggregate loads of the surviving instances.