Caching in Multiprocessors


Class Summary: Siva and Osman
Caching in Multiprocessors:

    In a shared-memory multiprocessor, the memory system provides access to the data to be processed and mechanisms for interprocessor communication.  The bandwidth of the memory system limits the speed of computation.  Caches are fast local memories that moderate a multiprocessor's memory bandwidth demands by holding copies of recently used data, providing a low-latency access path to the processor.  Because of locality in the memory access patterns of multiprocessors, the cache satisfies a large fraction of the processor accesses, thereby reducing both the average memory latency and the communication bandwidth requirements imposed on the system's interconnection network.

    Caching in a multiprocessing environment poses the cache-coherence problem.  When multiple processors maintain locally cached copies of a unique shared-memory location, any local modification of the location can result in a globally inconsistent view of memory.  Cache-coherence schemes prevent this problem by maintaining a uniform state for each cached block of data.
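The inconsistency described above can be sketched in a few lines of Python; the processor and variable names here are illustrative, not from any particular machine:

```python
# Two processors each keep a private cached copy of shared location X.
# With write-back caches and no coherence protocol, a local write leaves
# the other cache (and memory) stale.
memory = {"X": 0}

# Both processors read X, filling their private caches.
cache_p0 = {"X": memory["X"]}
cache_p1 = {"X": memory["X"]}

# P0 writes X in its own cache; memory is not yet updated.
cache_p0["X"] = 42

# P1 still sees the old value: a globally inconsistent view of memory.
print(cache_p0["X"], cache_p1["X"], memory["X"])  # 42 0 0
```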

    Several of today's commercially available multiprocessors use bus-based memory systems.  A bus is a convenient device for ensuring cache coherence because it allows all processors in the system to observe ongoing memory transactions.  If a bus transaction threatens the consistent state of a locally cached object, the cache controller can take appropriate action, such as invalidating the local copy.  Protocols that use this mechanism to ensure coherence are called snoopy protocols because each cache snoops on the transactions of other caches.  Unfortunately, bus-based cache coherence does not scale well because of the growing disparity between bus and processor clock speeds.
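A minimal sketch of a write-invalidate snoopy protocol, using the common three-state (Modified/Shared/Invalid) organization; the class and message names below are my own simplification, not taken from any specific machine:

```python
# Every cache observes ("snoops") each bus transaction and downgrades its
# own copy as needed.  States: Invalid, Shared, Modified (MSI).
INVALID, SHARED, MODIFIED = "I", "S", "M"

class SnoopyCache:
    def __init__(self, bus):
        self.state = {}           # address -> MSI state
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if self.state.get(addr, INVALID) == INVALID:
            self.bus.broadcast(self, "BusRd", addr)   # fetch a shared copy
            self.state[addr] = SHARED

    def write(self, addr):
        if self.state.get(addr, INVALID) != MODIFIED:
            self.bus.broadcast(self, "BusRdX", addr)  # invalidate other copies
            self.state[addr] = MODIFIED

    def snoop(self, op, addr):
        if addr not in self.state:
            return
        if op == "BusRdX":
            self.state[addr] = INVALID    # another cache wants exclusivity
        elif op == "BusRd" and self.state[addr] == MODIFIED:
            self.state[addr] = SHARED     # supply the data, drop to Shared

class Bus:
    def __init__(self):
        self.caches = []
    def broadcast(self, sender, op, addr):
        for c in self.caches:
            if c is not sender:
                c.snoop(op, addr)

bus = Bus()
p0, p1 = SnoopyCache(bus), SnoopyCache(bus)
p0.read(0x10); p1.read(0x10)           # both hold the block Shared
p0.write(0x10)                         # p0 becomes Modified, p1 is invalidated
print(p0.state[0x10], p1.state[0x10])  # M I
```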

    Consequently, scalable multiprocessor systems interconnect processors using short point-to-point wires in direct or multistage networks.  Communication along impedance-matched transmission-line channels can occur at high speeds, providing communication bandwidth that scales with the number of processors: unlike busses, the bandwidth of these networks increases as more processors are added to the system.  But such networks have no convenient snooping mechanism and no efficient broadcast capability.  In the absence of a systemwide broadcast mechanism, the cache-coherence problem can be solved in interconnection networks using some variant of a directory scheme.
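The directory idea can be sketched as a full-map directory: for each memory block, the home node keeps one presence bit per processor plus a dirty bit, so invalidations travel point-to-point only to the actual sharers, with no broadcast. The class below is an illustrative simplification, not the protocol of any particular paper:

```python
# Full-map directory sketch: per block, a set of sharer ids (the presence
# bits) and a dirty bit.
class Directory:
    def __init__(self):
        self.entries = {}  # block -> {"sharers": set of proc ids, "dirty": bool}

    def read(self, proc, block):
        e = self.entries.setdefault(block, {"sharers": set(), "dirty": False})
        # (if dirty, a real protocol would first fetch the block from its owner)
        e["dirty"] = False
        e["sharers"].add(proc)

    def write(self, proc, block):
        e = self.entries.setdefault(block, {"sharers": set(), "dirty": False})
        invalidations = e["sharers"] - {proc}  # messages go only to sharers
        e["sharers"] = {proc}
        e["dirty"] = True
        return invalidations

d = Directory()
d.read(0, "B"); d.read(2, "B")       # processors 0 and 2 cache block B
invalidations = d.write(1, "B")      # processor 1 writes B
print(invalidations)                 # invalidations sent only to 0 and 2
```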

    Orthogonal to the caching schemes themselves is the issue of memory consistency.  A consistency model is a contract between software and memory: if the software agrees to obey certain rules, the memory promises to work correctly.  A wide spectrum of contracts has been devised, ranging from those that place only minor restrictions on the software to those that make normal programming nearly impossible.  Some of the consistency models are: strict, sequential, causal, PRAM, processor, weak, release (lazy and eager), and entry consistency.
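The difference between these contracts can be made concrete with a classic litmus test, sketched here by enumerating interleavings. With x = y = 0 initially, P1 runs "x = 1; r1 = y" and P2 runs "y = 1; r2 = x". Sequential consistency forbids the outcome r1 = r2 = 0, while weaker models may allow it:

```python
# Enumerate all sequentially consistent executions of the litmus test by
# interleaving the four operations while preserving each processor's
# program order.
from itertools import permutations

def run(schedule):
    s = {"x": 0, "y": 0, "r1": None, "r2": None}
    ops = {
        "w_x": lambda s: s.__setitem__("x", 1),   # P1: x = 1
        "r_y": lambda s: s.__setitem__("r1", s["y"]),  # P1: r1 = y
        "w_y": lambda s: s.__setitem__("y", 1),   # P2: y = 1
        "r_x": lambda s: s.__setitem__("r2", s["x"]),  # P2: r2 = x
    }
    for op in schedule:
        ops[op](s)
    return s["r1"], s["r2"]

results = set()
for sched in permutations(["w_x", "r_y", "w_y", "r_x"]):
    # keep only schedules that preserve each processor's program order
    if sched.index("w_x") < sched.index("r_y") and \
       sched.index("w_y") < sched.index("r_x"):
        results.add(run(sched))

print((0, 0) in results)  # False: sequential consistency forbids r1 = r2 = 0
```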
 


Reviews

Paper: Directory-Based Cache Coherence in Large-Scale Multiprocessors
      A Survey of Cache Coherence Schemes for Multiprocessors

Reviewer: Vijay Sundaram

    Shared-memory multiprocessors use low-cost microprocessors economically interconnected with shared memory modules. This system organization faces the problems of memory contention, communication contention, and latency. These problems contribute to increased memory access times and hence slow down processors' execution speeds. Cache memories have served as an important way to reduce the average memory access time in multiprocessors. However, shared-memory multiprocessors can end up with several copies of a shared block in one or more caches at the same time. To maintain a coherent view of memory, these copies must be kept consistent. This is the cache coherence problem, or the cache consistency problem. This paper surveys schemes for cache coherence.

    Cache coherence poses a problem mainly for shared, read-write data structures. The two parallel applications investigated in the paper, which use shared data structures differently, are the bounded-buffer producer-consumer problem and a parallel algorithm for solving linear equations by iteration. The hardware-based protocols include snoopy cache protocols, directory schemes, and cache-coherent network architectures. The snoopy cache protocols outlined are the write-invalidate and write-update protocols. The software cache-coherence schemes attempt to avoid the need for complex hardware mechanisms.

    A note at the end of the paper is that, apart from the snoopy cache protocols, none of the other schemes described have been implemented. Also mentioned is that multiprocessor caches are, and will remain, a hot topic in the coming years.


Reviewer: Zhenlin Wang
 

    I'd like to summarize the software-based schemes we discussed in class, focusing on how compiler analyses classify accesses into four types: (1) read-only for an arbitrary number of processes; (2) read-only for an arbitrary number of processes but read-write for exactly one process; (3) read-write for exactly one process; and (4) read-write for an arbitrary number of processes. The compiler does this through dependence testing.

    Generally, there are four types of data dependences. Say $a$ occurs earlier than $b$ in the execution sequence. There is a flow dependence from $a$ to $b$ if $a$ writes a memory unit used by $b$. An anti-dependence from $a$ to $b$ means that $a$ uses a memory unit defined by $b$. An output dependence exists between $a$ and $b$ if they both write the same memory unit. An input dependence between $a$ and $b$ holds if they both read the same memory address.

    Now, an access is type 1 if it is a read and no flow or anti-dependences involve it. Apparently b[J], A[J,K], and X[K] are all type 1 in the first parallel loop; xtemp[J] in the second parallel loop is also type 1 (the paper seems incorrect here?). An access is type 2 if it has no flow, anti, or output dependence across processes; if we consider X[J] over the whole program in Figure 11, it is type 2. An access is type 3 if it has no dependence (including input dependence) across processes; in the second parallel loop of Figure 11, X[J] is accessed by only one process, so it is type 3. An access is type 4 if it has cross-process flow, anti, or output dependences.

    Dependences across parallel loops decide whether the enforcement scheme must use a memory read instead of a cache read: a memory read is required if there is a flow or anti-dependence across parallel loops. We can see the anti-dependence from X[K] to X[J], and the flow dependence from xtemp[J] in the first parallel loop to xtemp[J] in the second one.
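The four dependence types can be shown in a few lines of straight-line code; the variable and statement names below are mine, not those of the paper's Figure 11:

```python
# Each comment names the dependence the statement participates in.
a = [0] * 4
b = [1, 2, 3, 4]

a[0] = b[0]   # S1
t = a[0]      # S2: flow dependence S1 -> S2 (S1 writes a[0], S2 reads it)
a[0] = 7      # S3: anti-dependence S2 -> S3 (S2 reads a[0], S3 rewrites it)
a[0] = 8      # S4: output dependence S3 -> S4 (both write a[0])
u = b[0]      # S5: input dependence S1 -> S5 (both read b[0])
```

When such dependences cross process boundaries in a parallel loop, the compiler classifies the access as type 4 and must enforce coherence for it.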


Reviewer: Akash Jain

    The paper surveys hardware- and software-based cache coherence schemes for multiprocessors. For the hardware-based schemes, it discusses two policies: write-invalidate and write-update. In the write-invalidate policy, read requests are carried out locally if a copy of the block exists; when a processor updates a block, however, all other copies are invalidated. In the write-update policy, when a processor updates a block, all other copies are updated.
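The contrast with write-invalidate can be sketched briefly: on a write, a write-update cache pushes the new value to every cache that holds a copy instead of invalidating those copies. The class below is an illustrative simplification, not a protocol from the paper:

```python
# Write-update sketch: a write broadcasts the new value, and every cache
# that holds the block installs it.
class UpdateCache:
    def __init__(self, peers):
        self.data = {}
        self.peers = peers
        peers.append(self)

    def write(self, addr, value):
        self.data[addr] = value
        for c in self.peers:              # broadcast the new value
            if c is not self and addr in c.data:
                c.data[addr] = value      # only caches with a copy update it

peers = []
p0, p1 = UpdateCache(peers), UpdateCache(peers)
p0.data[0x10] = 0; p1.data[0x10] = 0  # both hold a copy of the block
p0.write(0x10, 5)
print(p1.data[0x10])                  # 5: the copy was updated, not invalidated
```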

    The survey explains these policies in the context of two examples: the bounded-buffer problem and a parallel algorithm for solving a linear system of equations by iteration. It is clear that different schemes might be preferred for different examples. The survey then describes hardware-based protocols for these policies. These protocols require consistency commands to be sent to the caches holding copies of a block, and the commands should be implemented in hardware. The survey discusses the snoopy cache protocol in detail.

    The survey also mentions software-based schemes for cache coherence. Though these schemes attempt to avoid the need for complex hardware mechanisms, it is not clear how they can be implemented without proper hardware support. One of the schemes has the compiler mark which variables might be cacheable. Cache coherence can then be enforced by indiscriminate invalidation, selective invalidation, or a timestamp-based mechanism.
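The timestamp idea can be sketched roughly as follows; this is my own simplified model, not the paper's exact scheme. Each shared structure has a version counter bumped when a parallel phase modifies it, and a cached entry is usable only if its timestamp is at least the structure's current counter:

```python
# Timestamp-based self-invalidation sketch: a stale cached entry is
# detected by comparing its timestamp against the structure's counter.
clock = {"X": 0}   # per-structure version counter
cache = {}         # addr -> (value, timestamp)

def read(addr, memory):
    entry = cache.get(addr)
    if entry is None or entry[1] < clock[addr]:   # stale: self-invalidate
        cache[addr] = (memory[addr], clock[addr])  # refetch from memory
    return cache[addr][0]

memory = {"X": 1}
print(read("X", memory))   # 1, now cached
memory["X"] = 2
clock["X"] += 1            # end of a parallel phase that modified X
print(read("X", memory))   # 2, the stale entry was refreshed
```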

    The directory scheme mentioned in the paper could be used in the context of Web caching: a system could maintain a directory recording the state of the blocks cached at different servers. Such a scheme might be useful in distributed environments, e.g., caching in distributed games.


Prashant Shenoy