Background
-TLB = Cache for virtual to physical address translations
-A TLB miss is expensive, because the translation must be fetched with a page-table walk
-Hardware does not keep TLBs coherent across cores
-> The OS has to do it in software (= TLB Shootdown)
-Coherence is maintained with IPIs (Inter-Processor Interrupts)
-Waiting for the ACK from a remote core takes many cycles (>= 1000)
-It takes even longer with PTI, the mitigation for the Meltdown vulnerability
(PTI = Page Table Isolation: separate kernel and user page tables)
TLB Flush
-Full Flush
-> Flushes every TLB entry whose G (global) bit is not set
-> Triggered by writing the CR3 register (CR3 holds the physical address of the top-level page table)
-INVLPG
-> Unlike a full flush, selectively invalidates the TLB entry for one page
-> Its cost is weighed against a full flush: past a threshold number of entries, a full flush is used instead (FreeBSD: >= 4096, Linux: >= 33)
-INVLPCID
-> Uses an ASID (Address-Space Identifier; called PCID on x86)
-> Selectively invalidates TLB entries of an inactive address space
-Effect of the Meltdown mitigation (PTI)
-> The G bit is not used
-> The TLB must be flushed when exiting kernel mode
-> That is, each flush happens twice: once for the kernel address space and once for the user address space
-> More TLB misses and lower TLB utility
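The choice between per-page invalidation and a full flush can be sketched as below. This is only an illustration: `choose_flush` and `FULL_FLUSH_THRESHOLD` are hypothetical names, the thresholds are the ones quoted in these notes, and treating the boundary as `>=` is an assumption that real kernels may handle differently.

```python
# Hypothetical sketch; only the threshold values come from the notes.
FULL_FLUSH_THRESHOLD = {"linux": 33, "freebsd": 4096}


def choose_flush(num_pages, os_name="linux"):
    """Return the flush operations used to invalidate `num_pages` pages."""
    # Assumption: at or above the threshold, one full flush is cheaper
    # than invalidating every page individually.
    if num_pages >= FULL_FLUSH_THRESHOLD[os_name]:
        return ["write CR3 (full flush)"]
    # Below the threshold, invalidate each page on its own.
    return [f"INVLPG page {i}" for i in range(num_pages)]
```

For example, `choose_flush(2)` yields two INVLPG operations, while `choose_flush(33)` collapses into a single full flush under the Linux value.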
TLB Shootdown
-Spin wait (= busy wait)
-> The initiator sends a multicast IPI and spin-waits for the ACKs
-SW overhead
-> The initiator must select which remote cores need to take part in the TLB Shootdown
-> If a remote core receives a TLB Shootdown for an address space that is
inactive on it, it ignores the request; when that address space becomes active
again, it checks whether any PTE changed and, if so, flushes the address space
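One way to picture the "ignore now, check later" rule for inactive address spaces is a generation counter per address space. The sketch below is a toy model under that assumption; `AddressSpace`, `Core`, and `change_pte` are hypothetical names, not kernel APIs.

```python
class AddressSpace:
    def __init__(self):
        self.generation = 0  # bumped on every PTE change that needs a flush


class Core:
    def __init__(self):
        self.active = None
        self.seen = {}    # address space -> generation this core's TLB reflects
        self.flushes = 0

    def _flush(self, mm):
        self.flushes += 1
        self.seen[mm] = mm.generation

    def on_shootdown(self, mm):
        # A request for an address space that is not active here is ignored.
        if self.active is mm:
            self._flush(mm)

    def switch_to(self, mm):
        self.active = mm
        # On activation, catch up if PTEs changed while the space was inactive.
        if mm in self.seen and self.seen[mm] != mm.generation:
            self._flush(mm)
        else:
            self.seen[mm] = mm.generation


def change_pte(mm, cores):
    """Model a PTE change followed by a shootdown sent to `cores`."""
    mm.generation += 1
    for core in cores:
        core.on_shootdown(mm)
```

In this model a core pays for the flush only once when it switches back, however many PTE changes it missed in the meantime.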
Related Work
HW
-Focus on reducing the cost of TLB Shootdowns, or on managing TLB coherence in hardware instead of in the OS
-Add microcode for IPIs or add new HW
-> Higher cost, and the OS must change to support the new HW
SW
-Focus on avoiding TLB Shootdown
-ABIS : Tracking page access
-Barrelfish : Message Passing instead of IPI
-LATR : Lazy and Asynchronous TLB Shootdown
-RadixVM : Tracking page access using Radix tree
-> Drawbacks: POSIX incompatibility, can cause errors
Proposed Techniques
Idea
-Concurrent Flushes
-Early ACK
-Cacheline Consolidation
-In-context Page Flushes
-Avoiding TLB Flushes for CoW
-Userspace-safe Batching
Concurrent Flushes
-Linux and FreeBSD flush sequentially: the local TLB first, then the remote TLBs
-The local core spin-waits for the ACKs from the remote cores
-> Do the local TLB flush during that waiting time
-> A single TLB flush takes >= 100 ns; one shootdown can flush up to 33 entries, so the local flush alone can take >= 3 us
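The benefit of overlapping the local flush with the wait can be estimated with a simple model: sequential execution pays for both phases, concurrent execution pays only for the longer one. The function name is hypothetical, the 100 ns per entry is the figure from these notes, and the remote ACK time in the example is an arbitrary placeholder, not a measurement.

```python
def shootdown_latency_ns(entries, remote_ack_ns, flush_ns=100, concurrent=False):
    """Estimate initiator latency for flushing `entries` TLB entries."""
    local_ns = entries * flush_ns  # time spent flushing the local TLB
    if concurrent:
        # Local flush overlaps with waiting for remote ACKs.
        return max(local_ns, remote_ack_ns)
    # Local flush and remote wait happen one after the other.
    return local_ns + remote_ack_ns
```

With 33 entries and an assumed 4000 ns until the last ACK arrives, the model gives 7300 ns sequentially and 4000 ns concurrently, so the saving is bounded by the shorter of the two phases.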
Cacheline Consolidation
-Cacheline contention occurs because the TLB layer is abstracted from the SMP layer
(SMP = Symmetric multi-processing)
-> Inline related information to reduce cacheline contention.
Early Acknowledgment
-Currently, the remote core does not send an ACK until its part of the shootdown completes
-Simply sending the IPI and letting the local core continue asynchronously would cause errors
-> Instead, the remote core sends the ACK as soon as it receives the shootdown request
Exception 1)
- When page tables are released, early ACK cannot be used because of the speculative-execution vulnerability
Exception 2)
- When an NMI arrives, its handler could access user space through stale TLB entries
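The effect of acknowledging early, together with the first exception, can be sketched as an ordering of steps on the remote core. This is a toy model with hypothetical function names; it leaves out the NMI case and everything the real IPI handler does.

```python
def remote_steps(early_ack, frees_page_tables=False):
    """Return the order in which the remote core acknowledges and flushes."""
    # Assumption from the notes: early ACK is not allowed when page tables
    # are being released, so that case falls back to flush-then-ACK.
    if early_ack and not frees_page_tables:
        return ["ack", "flush"]
    return ["flush", "ack"]


def initiator_wait_ns(steps, ipi_ns, flush_ns):
    """Time until the initiator sees the ACK, given the remote step order."""
    elapsed = ipi_ns  # delivery of the IPI to the remote core
    for step in steps:
        if step == "ack":
            break
        elapsed += flush_ns  # the flush runs before the ACK is sent
    return elapsed
```

In this model the initiator waits `ipi_ns` with early ACK and `ipi_ns + flush_ns` without it, which is where the latency reduction comes from.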
In-context Page Flushes
-Under PTI, a TLB flush must be performed for both page tables (kernel and user)
-INVLPG is used for the active address space, INVLPCID for the inactive one
-When INVLPG is used, mode switches are frequent
-> Flush the user address space once it becomes active
-A stack is needed to hold the pending flush data for the user address space
-Sometimes TLB flushes can be ignored because there is no serializing instruction
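A minimal sketch of deferring the user address-space flush until that address space is active, assuming the pending entries are kept on a per-core stack as the notes describe. `PtiCore` and its methods are hypothetical names, and the log strings only label which address space each flush targets.

```python
class PtiCore:
    def __init__(self):
        self.pending_user = []  # stack of addresses still to flush for user space
        self.log = []           # flush operations in the order they were issued

    def flush_page(self, addr):
        """Handle a flush request while running in kernel mode."""
        # The kernel address space is active, so flush it right away.
        self.log.append(("INVLPG kernel", addr))
        # The user address space is inactive: remember the address for later.
        self.pending_user.append(addr)

    def return_to_user(self):
        """Switch to the user address space and run the deferred flushes."""
        while self.pending_user:
            self.log.append(("INVLPG user", self.pending_user.pop()))
```

Because the deferred flushes run once the user address space is active, the model never needs the variant that targets an inactive address space.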
Avoid TLB flush for CoW
-Copy-on-Write is a technique in which memory pages are write-protected and
copied only when they are modified
-A shootdown is required if different threads use the same mapping.
-But the shootdown may not be necessary, because the page fault handler specifies that the PTE is invalid
Userspace-safe Batching
-If the TLB flush is guaranteed to complete before the kernel accesses the user-space mapping, flushes can safely be batched.
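The batching condition can be sketched as a queue that is always drained before the kernel touches a user-space mapping. `FlushBatch` and its methods are hypothetical, and the model counts one shootdown per drained batch as a simplification.

```python
class FlushBatch:
    def __init__(self):
        self.pending = set()  # addresses whose flush has been deferred
        self.shootdowns = 0   # number of shootdowns actually issued

    def defer(self, addr):
        """Queue a flush instead of issuing a shootdown immediately."""
        self.pending.add(addr)

    def complete(self):
        """Issue one shootdown covering every queued address."""
        if self.pending:
            self.shootdowns += 1
            self.pending.clear()

    def kernel_access_user(self, addr):
        """Model a kernel access to a user-space mapping."""
        # Safety rule from the notes: all deferred flushes finish first.
        self.complete()
        return addr
```

Three deferred flushes followed by one kernel access cost a single shootdown in this model instead of three.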
Experiment
Safe Mode
-PTI On : For current architectures that use PTI
Unsafe Mode
-PTI Off : For future architectures that do not use PTI
Microbenchmark
-Allocate memory with “mmap” and call “madvise(DONTNEED)” to remove the page mappings
-Create an initiator thread and a responder thread
-The initiator thread reports the number of cycles each “madvise” call takes
-The responder thread reports the number of cycles for which it was interrupted
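The initiator side of such a microbenchmark can be approximated as below. This is a rough sketch, not the benchmark from the paper: it assumes Linux and Python 3.8 or later, measures wall-clock nanoseconds instead of cycles, and leaves out the responder thread. `time_madvise` is a hypothetical name.

```python
import mmap
import time

PAGE = mmap.PAGESIZE


def time_madvise(num_pages=10, rounds=100):
    """Return the duration in ns of each madvise(MADV_DONTNEED) call."""
    length = num_pages * PAGE
    buf = mmap.mmap(-1, length,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                    prot=mmap.PROT_READ | mmap.PROT_WRITE)
    samples = []
    try:
        for _ in range(rounds):
            # Touch every page so that it is mapped before the call.
            for offset in range(0, length, PAGE):
                buf[offset] = 1
            start = time.perf_counter_ns()
            buf.madvise(mmap.MADV_DONTNEED)  # drops the page mappings
            samples.append(time.perf_counter_ns() - start)
    finally:
        buf.close()
    return samples
```

Varying `num_pages` changes how many PTEs each call has to flush, which is the parameter the results below are sensitive to.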
Results
-For the initiator core, we observe that concurrent flushes and early acknowledgement consistently have the largest impact, reducing shootdown latency by 10%-20% in both safe and unsafe mode.
-These two techniques do not help responder cores.
-PTI flushing (§ 3.4) benefits responder cores, and this effect is more clearly highlighted when there are more PTE entries to flush
Safe vs Unsafe
-Due to PTI, each PTE needs to be flushed twice
-The security mitigation techniques against hardware vulnerabilities, specifically PTI, cause the latency of entering the kernel from userspace to be higher