Adaptive and virtual reconfigurations for effective dynamic job scheduling in cluster systems

Songqing Chen1, Li Xiao1, Xiaodong Zhang1
1Department of Computer Science, College of William and Mary, Williamsburg, USA

Abstract

In a cluster system with dynamic load sharing support, whether a job is submitted or migrated to a workstation is determined by the CPU and memory resources available on that workstation at the time. In such a system, a small number of running jobs with unexpectedly large memory allocation requirements may significantly increase the queuing delays of the remaining jobs with normal memory requirements, slowing down the execution of individual jobs and decreasing system throughput. We call this phenomenon the job blocking problem, because the big jobs block the execution pace of the majority of jobs in the cluster. We propose a software method, incorporated with dynamic load sharing, that adaptively reserves a small set of workstations through virtual cluster reconfiguration to provide special services to the jobs demanding large memory allocations. This policy follows the principle of the shortest-remaining-processing-time policy. As soon as the blocking problem is resolved by the reconfiguration, the system adaptively switches back to the normal load sharing state. We present three contributions in this study: (1) the conditions that cause the job blocking problem; (2) the adaptive software method in a dynamic load sharing system; and (3) trace-driven simulations. We show that our method can effectively improve cluster computing performance by quickly resolving the job blocking problem. The effectiveness and performance insights are also analytically verified.
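The reserve-and-release idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Workstation`, `Job`, `reconfigure`, and `place`, the memory threshold that marks a job as "large", and the rule of reserving the workstations with the most free memory are all assumptions made for illustration. The abstract does not specify how blocking is detected, so the sketch takes that as a boolean input.

```python
from dataclasses import dataclass


@dataclass
class Workstation:
    name: str
    free_memory_mb: int
    reserved: bool = False  # True while set aside for large-memory jobs (assumed flag)


@dataclass
class Job:
    name: str
    memory_mb: int


def reconfigure(workstations, blocked, reserve_count):
    """Virtually reconfigure the cluster (hypothetical rule).

    While blocking is reported, reserve the `reserve_count` workstations
    with the most free memory; otherwise release every reservation so the
    cluster returns to normal load sharing.
    """
    for ws in workstations:
        ws.reserved = False
    if blocked:
        by_free = sorted(workstations, key=lambda ws: ws.free_memory_mb, reverse=True)
        for ws in by_free[:reserve_count]:
            ws.reserved = True


def place(job, workstations, large_threshold_mb):
    """Return the name of the workstation chosen for `job`, or None.

    While any workstation is reserved, large jobs go only to reserved
    workstations and normal jobs only to unreserved ones. With no
    reservation in effect, every workstation is a candidate.
    """
    is_large = job.memory_mb >= large_threshold_mb
    any_reserved = any(ws.reserved for ws in workstations)
    candidates = [
        ws for ws in workstations
        if ws.free_memory_mb >= job.memory_mb
        and (not any_reserved or ws.reserved == is_large)
    ]
    if not candidates:
        return None  # the job has to wait in the queue
    best = max(candidates, key=lambda ws: ws.free_memory_mb)
    best.free_memory_mb -= job.memory_mb
    return best.name
```

Calling `reconfigure(cluster, blocked=True, reserve_count=1)` before placing jobs separates large-memory jobs from normal ones, and calling it again with `blocked=False` restores ordinary load sharing. A real system would also need to handle migration and CPU load, which this sketch leaves out.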

Keywords

#Dynamic scheduling #Workstations #Optimal scheduling #Processor scheduling #Delay #Switches #Job design #Computer science #Educational institutions #Availability
