International Journal of Computer & Information Sciences
Featured scientific publications
* Data is for reference only
Revisiting Cache Resizing
International Journal of Computer & Information Sciences - Volume 43 - Pages 59-85 - 2013
We present a novel framework to dynamically reconfigure on-chip memory resources according to the changing behavior of the executing applications. Our framework enables smooth scaling (i.e., resizing) of the on-chip caches targeting both performance and power efficiency. In contrast to previous approaches, the resizing decisions in our framework are not tainted by transient events (e.g., misses) that are due to downsizing, which avoids swings in the cache size caused by trial-and-error resizing decisions. This minimizes both the execution time penalty induced by downsizing decisions and the effective cache size. In addition, an inherent property of our approach is that the actual invalidation of the cache blocks and the corresponding write-backs of the dirty cache blocks are asynchronous to resizing decisions, ensuring a smooth transition from one size to another. This makes it possible to apply our framework even on write-back caches, a major limitation in previous proposals. The proposed framework is simple to implement, requiring minimal hardware overhead (<1% of the target cache). Using cycle-accurate simulations and a wide range of applications, we evaluate our approach against previously proposed cache resizing schemes for various cache sizes and types. In all cases, our experimental findings show significant benefits across the board in both power and performance.
Foreword to the special issues
International Journal of Computer & Information Sciences - Volume 25 - Pages 243-244 - 1997
M-DRL: Deep Reinforcement Learning Based Coflow Traffic Scheduler with MLFQ Threshold Adaption
International Journal of Computer & Information Sciences - Volume 49 - Pages 646-657 - 2021
Coflow scheduling in data-parallel clusters can improve application-level communication performance. Existing coflow scheduling methods that work without prior knowledge usually use a multi-level feedback queue (MLFQ) with fixed threshold parameters, which is insensitive to coflow traffic characteristics. Manual adjustment of the threshold parameters for different application scenarios often has a long optimization period and a coarse optimization granularity. We propose M-DRL, a deep reinforcement learning based coflow traffic scheduler that dynamically sets the MLFQ thresholds to adapt to the coflow traffic characteristics and reduce the average coflow completion time. Trace-driven simulations on the public dataset show that coflow communication stages using M-DRL complete 2.08x (6.48x) and 1.36x (1.25x) faster in average (95th-percentile) coflow completion time than per-flow fairness and Aalo, respectively, and that M-DRL is comparable to SEBF, which uses prior knowledge.
Data-Centric Transformations for Locality Enhancement
International Journal of Computer & Information Sciences - Volume 29 - Pages 319-364 - 2001
On modern computers, the performance of programs is often limited by memory latency rather than by processor cycle time. To reduce the impact of memory latency, the restructuring compiler community has developed locality-enhancing program transformations such as loop permutation and tiling. These transformations work well for perfectly nested loops (loops in which all assignment statements are contained in the innermost loop), but their performance on codes such as matrix factorizations that contain imperfectly nested loops leaves much to be desired. In this paper, we propose an alternative approach called data-centric transformation. Instead of reasoning directly about the control structure of the program, a compiler using the data-centric approach chooses an order for the arrival of data elements in the cache, determines what computations should be performed when that data arrives, and generates the appropriate code. At runtime, program execution will automatically pull data into the cache in an order that corresponds approximately to the order chosen by the compiler; since statements that touch a data structure element are scheduled close together, locality is improved. The idea of data-centric transformation is very general, and in this paper, we discuss a particular transformation called data-shackling. We have implemented shackling in the SGI MIPSPro compiler which already has a sophisticated implementation of control-centric transformations for locality enhancement. We present experimental results on the SGI Octane comparing the performance of the two approaches, and show that for dense numerical linear algebra codes, data-shackling does better by factors of two to five.
Backtracking-Based Instruction Scheduling to Fill Branch Delay Slots
International Journal of Computer & Information Sciences - Volume 30 - Pages 397-418 - 2002
Conventional schedulers schedule operations in dependence order and never revisit or undo a scheduling decision on any operation. In contrast, backtracking schedulers may unschedule operations and can often generate better schedules. This paper develops and evaluates the backtracking approach to fill branch delay slots. We first present the structure of a generic backtracking scheduling algorithm and prove that it terminates. We then describe two more aggressive backtracking schedulers and evaluate their effectiveness. We conclude that aggressive backtracking-based instruction schedulers can effectively improve schedule quality by eliminating branch delay slots with a small amount of additional computation.
Fast, contention-free combining tree barriers for shared-memory multiprocessors
International Journal of Computer & Information Sciences - Volume 22 - Pages 449-481 - 1994
In a previous article,(1) Gupta and Hill introduced an adaptive combining tree algorithm for busy-wait barrier synchronization on shared-memory multiprocessors. The intent of the algorithm was to achieve a barrier in logarithmic time when processes arrive simultaneously, and in constant time after the last arrival when arrival times are skewed. A fuzzy(2) version of the algorithm allows a process to perform useful work between the point at which it notifies other processes of its arrival and the point at which it waits for all other processes to arrive. Unfortunately, adaptive combining tree barriers as originally devised perform a large amount of work at each node of the tree, including the acquisition and release of locks. They also perform an unbounded number of accesses to nonlocal locations, inducing large amounts of memory and interconnect contention. We present new adaptive combining tree barriers that eliminate these problems. We compare the performance of the new algorithms to that of other fast barriers on a 64-node BBN Butterfly 1 multiprocessor, a 35-node BBN TC2000, and a 126-node KSR 1. The results reveal scenarios in which our algorithms outperform all known alternatives, and suggest that both adaptation and the combination of fuzziness with tree-style synchronization will be of increasing importance on future generations of shared-memory multiprocessors.
An algorithmic analysis of simulation strategies
International Journal of Computer & Information Sciences - Volume 11, Issue 2 - Pages 101-122 - 1982
Compiler technology for machine-independent parallel programming
International Journal of Computer & Information Sciences - Volume 22 - Pages 79-98 - 1994
Historically, the principal achievement of compiler technology has been to make it possible to program in a high-level, machine-independent style. The absence of compiler technology to provide such a style for parallel computers is the main reason these systems have not found widespread acceptance. This paper examines the prospects for machine-independent parallel programming, concentrating on Fortran D and High Performance Fortran, which support machine-independent expression of “data parallelism.”
Removal of Conflicts in Hardware Transactional Memory Systems
International Journal of Computer & Information Sciences - Volume 42 - Pages 198-218 - 2012
This paper analyzes the sources of performance losses in hardware transactional memory and investigates techniques to reduce the losses. It dissects the root causes of data conflicts in hardware transactional memory (HTM) systems into four classes of conflicts: true sharing, false sharing, silent store, and write-write conflicts. These conflicts can cause performance and energy losses due to aborts and extra communication. To quantify losses, the paper proposes the 5C cache-miss classification model that extends the well-established 4C model with a new class of cache misses known as contamination misses. The paper also contributes two techniques for removal of data conflicts: one for removal of false sharing conflicts and another for removal of silent store conflicts. In addition, it revisits and adapts a technique that is able to reduce losses due to both true and false conflicts. All of the proposed techniques can be accommodated in a lazy versioning and lazy conflict resolution HTM built on top of a MESI cache-coherence infrastructure with quite modest extensions. Their ability to reduce performance losses is quantitatively established, individually as well as in combination. Performance and energy consumption are improved substantially.
Guest Editors’ Editorial: Special Issue on the Second International Workshop on Microgrids
International Journal of Computer & Information Sciences - Volume 38 - Pages 1-3 - 2009
Total: 1,017