Credible execution of bounded-time parallel systems with delayed diagnosis

Computing - Tập 48 - Trang 21-37 - 1992
R. Shankar1, D. P. Miranker2
1Department of Electrical Engineering, University of Texas at Austin, Austin, U.S.A.
2Department of Computer Sciences, University of Texas at Austin, Austin, U.S.A.

Tóm tắt

This paper presents a forward recovery method for the fault-tolerant execution of parallel software systems on multicomputers such that faults are neither detected nor diagnosed until the fault prevents progress in the computation of the system. The method minimizes the communication and synchronization overhead required to verify the reliability of the system and consequently minimizes the impact of fault-tolerance on the throughput of the computation. We say the system is credible provided that the system is diagnosable and complete, where complete means that at least one copy of each process exists on a fault-free processor. We apply the method to the process structure deriving from parallel, bounded-time decision systems and show through an exact Markov analysis that the method will yield a very credible system. We then introduce a much simpler but approximate Markov model that facilitates credibility analysis over a larger range of parameters and applications.

Tài liệu tham khảo

[Agh86] Agha, G.: Actors: A model of computation. Cambridge, Mass.: MIT Press 1986.

[Agr85] Agrawal, P.: RAFT. A recursive algorithm for fault tolerance. In: Pro., International Conference on Parallel Processing, August 1985, pp. 814–821.

[AS82] Adams, G. B., Siegel, H. J.: The extra stage cube: A fault-tolerant interconnection network for supercomputers. IEEE Transactions on Computers1982, 443–454.

[BEG+90] Browne, J. C., Emerson, A., Gouda, M., Miranker, D. P., Mok, A., Rosier, L.: Bounded-time, fault-tolerance, rule-based systems. In: Proc Goddard Conference on Space Applications of Artificial Intelligence, 1990.

[BFKM85] Brownston L., Farrell, R., Kant, E., Martin, N.: Programming expert systems in OPS5. Reading Mass.: Addison-Wesley 1985.

[CKWC90] Mok, A., Wang, C. K., Cheng A.: MRL: A real-time rule-based production system. In: Proc. 11th Real-Time Systems Symposium, December 1990, pp. 267–276.

[CM88] Chandy, K. M., Misra, J.: Parallel program design: A foundation Reading, Mass.: Addison-Wesley 1988.

[Dij76] Dijkstra, E. W.: A discipline of programming: Englewood Cliffs, N.J.: Prentice-Hall 1976.

[DM84] Dabhura, A. T., Masson, G. M.: AnO(n2.5) Fault identification algorithm for diagnosable systems. IEEE Transactions on Computers1984, 486–492.

[ER89] Fussel, D., Rangarajan, S.: Probabilistic diagnosis of multiprocessor systems. In: Fault-Tolerant Conference Symposium, 1989.

[FWA87] Fuchs, W. K., Wu, K. L., Abraham, J. A.: Comparison and diagnosis of large replicated files. IEEE Transactions on Software Engineering1987, 15.

[Kle75] Kleinrock, L.: Queueing systems, Vol. 1: Theory: Wiley 1975.

[LFA90] Long, J., Fuchs, W. K., Abraham, J. A.: Forward recovery using checkpointing in parallel systems. In: Proc. IEEE International Conference on Parallel Processing, Vol. 1, pp. 272–275.

[Mal80] Malek, M.: A comparison connection assignment for diagnosis of multiprocessor systems, pp. 31–35, 1980.

[Mir88] Miranker, D. P.: A high level language approach to the Fault tolerant exectuation of AI expert systems. Artificial Intelligence Laboratory, University of Texas at Austin, March 1988, Technical Report AI88-71.

[MKB90] Miranker, D. P., Kuo, C. M., Browne, J. C.: Parallelizing compilation of rule-based programs. In: Proc. IEEE International Conference on Parallel Processing, 1990, Vol. 2, pp. 247–251.

[ML90] Miranker, D. P., Lofaso, B. J.: The organization and performance of a treat-based production system compiler. IEEE Trans. on Knowledge and Data Engineering1991, 3–9.

[Mok89] Mok, A.: Formal analysis of real-time equational rule-based. In: Proc., 10th Real-Time Systems Symposium, December 1989, pp. 308–318.

[PMC67] Preperate, F. P., Metze, G., Chien, R. T.: On the connection assignment problem of diagnosable systems. IEEE Transactions on Elec. ComputersEC-16: 848–854 (1967).

[Pra82] Pradhan, D. K.: On a class of Fault-tolerant multiprocessor network architecture. 1982.

[Sha90] Shankar, R.: Credible execution of parallel systems with delayed diagnosis. Master's thesis, Department of Electrical and Computer Engineering, University of Texas at Austin, 1990.

[Sto84] Stolfo, S. J.: Five parallel algorithms for production system execution on the DADO machine. In: Proc. National Conference on Artificial Intelligence, Austin, Texas, August 1984.