
An Energy- and Performance-Aware DRAM Cache Architecture for Hybrid DRAM/PCM Main Memory Systems

RyoTTa 2022. 1. 6. 19:52

2. Background
A. Basics of Memory Devices and Systems
       Memory bandwidth and capacity requirements keep increasing.
       Typically, memory chips are integrated and installed as a Dual In-line Memory Module (DIMM).
       A DIMM's energy consumption is significant (20%–40% of entire system power consumption).
PCM pros:
       Ability to scale down
       Low power consumption (1/3 of DRAM in the operating state, zero in the idle state)
       Non-volatility
       Fast read performance
PCM cons:
       Low write performance (SET/RESET latency ~150 ns)
       Limited cell endurance (10^6 – 10^8 write cycles)


The focus of this paper is the pair of performance and energy consumption, so the endurance problem is outside the scope of this work.

3. DRAM/PCM Hybrid Main Memory Architecture


A. Using DRAM as a Write Buffer Memory
        If the DRAM buffer is large enough to absorb all incoming write data from the LLC without ever becoming completely full, overall performance is the same as that of a DRAM-only system.
       First challenge: the number of entries in the buffer that can absorb all incoming traffic should be accurately estimated.
       The DRAM buffer need not be infinitely sized (there is an upper bound in most cases).
       The hit ratio in the L1/L2 caches is high, so only a few memory requests get through to the DRAM buffer.
       But a complex indexing and data-searching algorithm is needed to handle the next incoming request: if the requested data is stored within the DRAM buffer and has not yet been flushed to the PCM, the request must be served from the DRAM buffer to maintain coherency.
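
A minimal sketch of that lookup path, assuming a simple address-indexed buffer (the paper gives no code, so the class and its method names are my own):

```python
# Hypothetical sketch of a DRAM write buffer with the coherency check
# described above. All names and structure are assumptions.

class DRAMWriteBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = {}              # address -> data not yet flushed to PCM

    def write(self, addr, data):
        """Absorb an LLC write-back; returns True when the buffer is full."""
        self.pending[addr] = data
        return len(self.pending) >= self.capacity

    def read(self, addr, pcm):
        # Coherency rule: data still waiting in the buffer must be served
        # from the buffer, not from the (stale) PCM copy.
        if addr in self.pending:
            return self.pending[addr]
        return pcm.read(addr)

    def flush_one(self, pcm):
        """Drain one entry into the slower PCM in the background."""
        if self.pending:
            addr, data = self.pending.popitem()
            pcm.write(addr, data)
```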

        Second challenge: a DRAM buffer system consumes less energy than a large DRAM system, but it only holds data temporarily until the slower PCM can eventually absorb it.
        One data-write request involves an initial DRAM write and potentially one DRAM read followed by one PCM write, so the duplicated write operation and the additional read operation increase energy consumption.
        The DRAM buffer hit ratio is not very high (0.6%–2%).

B. Using DRAM as a Cache

        Requests can be served directly from the DRAM cache (i.e., DRAM hits); the PCM is accessed only on a DRAM cache miss.
        If the hit ratio of the DRAM cache is high enough to reduce the number of PCM accesses, the additional DRAM read and PCM write operations due to cache flushes can be reduced. Consequently, the energy consumption decreases compared to the DRAM buffer architecture.
        Since PCM's read performance is similar to DRAM's, read misses are relatively cheap. But the performance penalty incurred when a cache miss results in a PCM write operation is quite severe and cannot be avoided; this is the flushing of a dirty block (a "dirty miss").
        The overall cache miss penalty is primarily determined by the number of dirty misses.
        Performance degradation: T_miss = N_wb × T_block, where N_wb is the total number of dirty misses that lead to a PCM write and T_block is the time to write one DRAM cache line to the PCM.
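
As a quick sanity check of that formula, here is a back-of-the-envelope computation; the 150 ns figure reuses the SET/RESET latency from Section 2, and the miss count is invented for illustration:

```python
# T_miss = N_wb * T_block, with assumed inputs for illustration only.
N_wb = 10_000        # dirty misses that trigger a PCM write (assumed)
T_block = 150e-9     # seconds to flush one 64 B cache line to PCM (assumed)

T_miss = N_wb * T_block
print(f"Stall time from dirty misses: {T_miss * 1e6:.0f} us")  # -> 1500 us
```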


        The hit ratio of the DRAM cache is higher than that of the DRAM buffer (Fig. 2).
        Increasing the cache size enhances the cache hit ratio without any side effect, but increasing the cache block size has an adverse effect on the miss penalty.
        Therefore, in regular cache configurations, the optimal block size (in terms of total miss penalty) is 64 bytes, even though its corresponding cache hit ratio is not the highest possible.
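
The tradeoff can be illustrated with toy numbers (mine, not the paper's): a larger block lowers the miss ratio a little, but every dirty miss now flushes more bytes to the slow PCM, so the total penalty N_wb × T_block can still grow.

```python
# Toy comparison of total dirty-miss penalty for two block sizes.
# All ratios and counts below are invented for illustration.
PCM_WRITE_BW = 64 / 150e-9        # bytes/s, assuming 150 ns per 64 B line

for block_bytes, miss_ratio in [(64, 0.020), (256, 0.015)]:
    accesses = 1_000_000          # memory requests (assumed)
    dirty_fraction = 0.5          # share of misses that are dirty (assumed)
    n_wb = accesses * miss_ratio * dirty_fraction
    t_block = block_bytes / PCM_WRITE_BW       # time to flush one block
    print(f"{block_bytes:>3} B blocks -> {n_wb * t_block * 1e3:.1f} ms stall")
# 64 B wins (1.5 ms vs 4.5 ms) despite its higher miss ratio.
```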

        We can employ a DRAM cache that is used only for write operations (this decreases both the hit ratio and the miss penalty).
        Read misses in the DRAM cache do not force a cache block replacement. Instead, the missed block is served directly from the PCM without allocating a block in the DRAM cache.
        The performance of this configuration is still not as high as that of the DRAM buffer architecture, due to the heavy miss penalty of dirty write-back operations.
        Moreover, a high miss rate in the DRAM cache causes the memory write queue to fill up.
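
A hedged sketch of that write-only policy (my reading of the text; the paper gives no code): writes allocate, read misses bypass the cache entirely, and the only PCM writes are dirty evictions.

```python
class WriteOnlyDRAMCache:
    """Toy model: only write requests may allocate a line; every resident
    line is dirty by construction."""

    def __init__(self):
        self.lines = {}                # address -> data (all dirty)

    def write(self, addr, data):
        self.lines[addr] = data        # writes allocate / update

    def read(self, addr, pcm):
        if addr in self.lines:         # must hit here for coherency
            return self.lines[addr]
        return pcm.read(addr)          # read miss: no allocation, no eviction

    def evict(self, addr, pcm):
        # The dirty write-back: this PCM write is the heavy miss penalty.
        pcm.write(addr, self.lines.pop(addr))
```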

C. A DRAM Cache Architecture Supporting Threshold-based Pre-Invalidation
In Fig. 3:
        DRAM buffer = better performance (large size), worse energy consumption (duplicated writes)
        DRAM cache = worse performance (dirty misses), better energy consumption


        Our proposed scheme utilizes the DRAM as a cache in order to reduce the number of memory accesses to the PCM, but with modifications targeting a reduction in the miss penalty due to dirty misses.
        Incoming data must wait until a possibly lengthy flush-out process is completed. So, in order to substantially hide this flush-out time, our proposed cache is designed to always have room for incoming data.


Threshold-Based Pre-Invalidation (TBPI):
Write-only DRAM cache
Two levels of flush-out queues
Threshold-based pre-invalidation
This methodology minimizes the stall time due to DRAM cache misses by increasing the memory bus utilization while considering the PCM’s poor write performance.
Each set always has at least one empty block

The Pre-Invalidation Controller (PIC), which is embedded within the memory controller, invalidates at least one data block from a particular set in the DRAM cache, before the set is full.
If the PCM module is busy when a DRAM cache block is invalidated, the PIC adds the physical address of the invalidated block into Queue 0 or 1, depending on the remaining number of empty blocks. If there is no available block in a set, the address is placed into Queue 0, the urgent queue, which has the highest flush-out priority. Otherwise, the address is placed into Queue 1, whose priority is lower than that of Queue 0.

The Background Flush Controller (BFC) evicts the data listed in the queues from the DRAM cache into the PCM. This process occurs during periods when the PCM and DRAM are not being used for regular data service.
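
Putting the PIC and BFC together, here is a hedged sketch of the flow as described above; the threshold value, the LRU victim policy, and all names are assumptions, since the paper gives no code:

```python
from collections import deque

THRESHOLD = 1                          # empty blocks to keep per set (assumed)
queue0, queue1 = deque(), deque()      # urgent / normal flush-out queues

class CacheSet:
    """Toy set: resident dirty-block addresses in LRU order."""
    def __init__(self, ways):
        self.ways = ways
        self.blocks = []               # oldest first

    def empty_blocks(self):
        return self.ways - len(self.blocks)

def pre_invalidate(cache_set, pcm_busy):
    """PIC: invalidate one victim before the set fills up."""
    if cache_set.empty_blocks() > THRESHOLD:
        return None
    was_full = cache_set.empty_blocks() == 0
    victim_addr = cache_set.blocks.pop(0)   # LRU victim (policy assumed)
    if pcm_busy:
        # A completely full set is urgent -> Queue 0; otherwise Queue 1.
        (queue0 if was_full else queue1).append(victim_addr)
    return victim_addr                 # caller flushes directly if PCM is idle

def background_flush(dram, pcm, bus_idle):
    """BFC: drain Queue 0 first, then Queue 1, while the buses are idle."""
    while bus_idle() and (queue0 or queue1):
        addr = queue0.popleft() if queue0 else queue1.popleft()
        pcm.write(addr, dram.read(addr))   # data stays in DRAM until flushed
```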

The requirement of maintaining at least one empty block per set may not always be satisfied. However, we found that this scenario is extreme and happens rarely, even if we set the threshold value to just one.


4. Experimental Evaluation

Performance
The DRAM buffer shows the second-best performance; the proposed TBPI shows very similar performance to the DRAM buffer.


The performance of the DRAM cache is inferior, due to its heavy miss penalty.
A small write queue is helpful in enhancing the performance of applications that are not so memory-intensive.
Energy consumption
Since the DRAM-only configuration uses a large DRAM, its energy consumption is higher than that of the other configurations.
The DRAM buffer configuration is the second-largest energy consumer, due to the duplicated read and write operations.
The DRAM cache configurations, including our TBPI-augmented one, consume less energy than the other configurations.

Energy × Delay
Although the DRAM-only configuration shows the highest performance, its energy×delay product is the highest.
TBPI cache shows the lowest energy×delay product values
These results clearly demonstrate that the proposed hybrid architecture can outperform all other configurations in terms of the energy-delay product.


5. Conclusion


PCM is emerging as a viable alternative to DRAM for the main memory system.
But PCM's low write performance necessitates a hybridized solution: pairing a DRAM cache with the PCM memory in order to maximize both the performance and energy efficiency of the resulting hybrid structure.
An average energy-delay product improvement of 42.2% over conventional hybrid structures is reported.