Fault-tolerance of Distributed Multithreaded Applications in Shared-Nothing Systems

Report ID

1999-35

Report Authors

J. James and A. Singh

Report Date

1999-11-01

Abstract

The ubiquity of distributed systems has led to increasingly complex distributed applications. That complexity has been increased by multithreaded applications, shared-nothing environments like the Internet, and the use of nested transactions to access multiple sets of data atomically. Providing fault tolerance for such applications is complicated by the loss of the piecewise determinism assumption (due to multithreading), the necessity of replicating data (due to the shared-nothing environment), and the necessity of maintaining consistency for nested transactions.

Providing fault tolerance for such applications in an ad hoc manner is difficult. We explore a systematic approach to providing fault tolerance. We show that the assumption of data-race-freedom has some of the benefits of piecewise determinism, but allows multithreading. We develop a logical ring structure for the logging and recovery processes, and show how the ring simplifies those tasks. We discuss roll-forward versus roll-backward recovery and suggest the use of roll-forward techniques. Finally, we investigate a message combining technique that can reduce the number of messages sent during the logging process.