Don’t Shoot Down TLB Shootdown

RyoTTa 2021. 2. 16. 20:12

Background

-TLB = Cache for virtual to physical address translations

-TLB miss -> expensive, since the page table must be walked

-Hardware does not keep TLBs coherent across cores

-> Needs OS support (= TLB shootdown)

-The OS maintains coherence using IPIs (Inter-Processor Interrupts)

-Waiting for the ACK from a remote core takes many cycles (>= 1000)

-It takes even longer with PTI, which is used to mitigate the Meltdown vulnerability

(PTI = Page Table Isolation: separate kernel and user page tables)

TLB Flush

-Full Flush

-> Flushes every TLB entry that does not have the G (global) bit set

-> Triggered by writing the CR3 register (CR3 holds the physical address of the top-level page table)

-INVLPG

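-> In contrast to a full flush, selectively invalidates a single TLB entry

-> The OS compares its cost with that of a full flush and switches to a full flush once the number of entries reaches a threshold (FreeBSD: >= 4096, Linux: >= 33)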

-INVLPCID

-> Uses the ASID (Address-Space Identifier)

-> Selectively invalidates TLB entries of an inactive address space

-Meltdown vulnerability (PTI mitigation)

-> The G bit is not used

-> The TLB must be flushed when exiting kernel mode

-> That is, each flush occurs twice: once for the kernel page table and once for the user page table

-> More TLB misses and lower TLB utility
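
The choice between selective invalidation and a full flush can be sketched as below. This is a minimal illustration, not kernel code: `choose_flush` and `FULL_FLUSH_THRESHOLD` are hypothetical names, and the default threshold is the Linux figure quoted in the INVLPG item above.

```python
# Hypothetical sketch of the "selective vs. full flush" decision.
# The default is the Linux figure from the notes; FreeBSD's would be 4096.
FULL_FLUSH_THRESHOLD = 33


def choose_flush(num_pages, threshold=FULL_FLUSH_THRESHOLD):
    """Return the list of flush operations for `num_pages` stale pages."""
    # The comparison follows the ">=" notation used in the notes.
    if num_pages >= threshold:
        # Many stale entries: a single CR3 write replaces many INVLPGs.
        return ["full_flush (write CR3)"]
    # Few stale entries: invalidate them one by one, keep the rest of the TLB.
    return [f"INVLPG page {i}" for i in range(num_pages)]


print(choose_flush(2))        # ['INVLPG page 0', 'INVLPG page 1']
print(choose_flush(40))       # ['full_flush (write CR3)']
```

The trade-off is that a full flush is one cheap operation but throws away entries that were still valid, which later shows up as extra TLB misses.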

TLB Shootdown

-Spin Wait( = Busy Wait)

-> The initiator sends a multicast IPI and spin-waits for the ACKs

-Software overhead

-> The OS must select which remote cores need to take part in the TLB shootdown

-> If a remote core receives a TLB shootdown for an address space that is inactive on it, it ignores the request; when that address space becomes active again, the core checks whether a PTE has changed and, if so, flushes the address space
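
The lazy handling of shootdowns for inactive address spaces can be modelled with a generation counter. This is only a sketch under assumptions: the `AddressSpace` and `Core` classes and the generation-counter mechanism are illustrative choices, not something these notes spell out.

```python
class AddressSpace:
    """Hypothetical address space; the generation grows on every PTE change."""

    def __init__(self):
        self.generation = 0

    def change_pte(self):
        self.generation += 1


class Core:
    """Hypothetical core that remembers which generation its TLB reflects."""

    def __init__(self):
        self.active = None   # address space currently loaded on this core
        self.seen = {}       # address space -> generation reflected in the TLB
        self.flushes = 0

    def _flush(self, mm):
        self.flushes += 1
        self.seen[mm] = mm.generation

    def switch_to(self, mm):
        self.active = mm
        if self.seen.get(mm, mm.generation) != mm.generation:
            self._flush(mm)                  # PTEs changed while inactive
        else:
            self.seen[mm] = mm.generation    # nothing stale to drop

    def on_shootdown(self, mm):
        if self.active is mm:
            self._flush(mm)                  # active: flush right away
        # inactive: ignore the request; switch_to() catches up later


a, b = AddressSpace(), AddressSpace()
core = Core()
core.switch_to(a)
core.switch_to(b)        # a is now inactive on this core
a.change_pte()
core.on_shootdown(a)     # ignored
print(core.flushes)      # 0
core.switch_to(a)        # generation mismatch, so the core flushes
print(core.flushes)      # 1
```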

Related Work

-Work that reduces the cost of TLB shootdowns, or manages the TLB in hardware instead of in the OS

-Adds microcode for IPIs or adds new hardware

-> Higher cost and OS changes due to the new hardware

-Work that focuses on avoiding TLB shootdowns

-ABIS : Tracking page access

-Barrelfish : Message Passing instead of IPI

-LATR : Lazy and Asynchronous TLB Shootdown

-RadixVM : Tracking page access using Radix tree

-> Incompatible with POSIX and can cause errors

Proposed Techniques

Idea

-Concurrent Flushes

-Early Acknowledgment

-Cacheline Consolidation

-In-context Page Flushes

-Avoiding TLB flushes on CoW faults

-Userspace-safe Batching

Concurrent Flushes

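-Linux and FreeBSD flush sequentially: the local TLB first, then the remote TLBs

-The local core then spin-waits for the ACKs from the remote cores

-> Perform the local TLB flush during that waiting time

-> A single-entry TLB flush takes >= 100 ns, and a single shootdown can cover up to 33 entries, so the local flush can take >= 3 us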
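
A back-of-the-envelope model shows why overlapping helps. The per-entry cost and the 33-entry case come from the notes; the additive model, the function name `shootdown_latency_ns`, and the 4000 ns remote wait are assumptions chosen only for illustration.

```python
def shootdown_latency_ns(entries, remote_wait_ns, per_entry_ns=100,
                         concurrent=False):
    """Initiator-side latency under a simple additive model (an assumption)."""
    local_flush_ns = entries * per_entry_ns
    if concurrent:
        # The local flush overlaps with waiting for the remote ACKs.
        return max(local_flush_ns, remote_wait_ns)
    # Baseline: flush locally first, then wait for the remote ACKs.
    return local_flush_ns + remote_wait_ns


# 33 entries at 100 ns each is 3300 ns of local work.
print(shootdown_latency_ns(33, 4000))                    # 7300
print(shootdown_latency_ns(33, 4000, concurrent=True))   # 4000
```

In this model the saving is at most the smaller of the two terms, so the benefit grows with the number of entries flushed locally.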

Cacheline Consolidation

-Cacheline contention occurs because the TLB layer is abstracted from the SMP layer

(SMP = Symmetric Multi-Processing)

-> Inline the related information to reduce cacheline contention

Early Acknowledgment

-Currently, a remote core does not send an ACK until its shootdown work completes

-Letting the local core send the IPI and continue fully asynchronously would cause errors

-> Instead, the remote core sends the ACK as soon as it receives the shootdown request

Exception 1)

-Early ACK cannot be used when a page table is freed, because of speculative-execution vulnerabilities

Exception 2)

-When an NMI arrives, user space could be accessed through stale TLB entries
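
The responder-side ordering, including the fallback for Exception 1, can be sketched as follows. The function name `responder_events` and its flags are hypothetical, and the NMI case (Exception 2) is not modelled here.

```python
def responder_events(frees_page_tables, early_ack_enabled=True):
    """Order of steps on a remote core handling one shootdown request."""
    if early_ack_enabled and not frees_page_tables:
        # Early ACK: release the waiting initiator first, then flush.
        return ["receive IPI", "send ACK", "flush TLB"]
    # Fallback (e.g. page tables are being freed): flush before acknowledging.
    return ["receive IPI", "flush TLB", "send ACK"]


print(responder_events(frees_page_tables=False))
# ['receive IPI', 'send ACK', 'flush TLB']
print(responder_events(frees_page_tables=True))
# ['receive IPI', 'flush TLB', 'send ACK']
```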

In-context Page Flushes

-With PTI, a TLB flush must be performed on two page tables

-The active address space is flushed with INVLPG, the inactive one with INVLPCID

-When INVLPG is used, mode switches are frequent

-> Flush the user address space once it becomes active

-Needs a stack to hold the data for the pending user address space flushes

-Sometimes TLB flushes can be ignored because there is no serializing instruction
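
The deferral can be pictured as a small stack of pending user-side flushes that is drained when the user address space becomes active. This is a sketch under assumptions: the class `DeferredUserFlushes`, its capacity, and the fall-back to a full flush on overflow are illustrative and not taken from the notes.

```python
class DeferredUserFlushes:
    """Hypothetical per-core stack of user-address-space flushes."""

    def __init__(self, capacity=4):
        self.capacity = capacity   # assumed stack size
        self.pending = []          # addresses awaiting a user-side flush
        self.overflowed = False

    def kernel_flush(self, addr, log):
        # The kernel address space is active, so flush it immediately.
        log.append(f"INVLPG {addr:#x} (kernel tables, active)")
        if len(self.pending) < self.capacity:
            self.pending.append(addr)   # defer the user-side flush
        else:
            self.overflowed = True      # assumed fallback: full flush later

    def return_to_user(self, log):
        # The user address space is about to become active: drain the stack.
        if self.overflowed:
            log.append("full flush (user tables)")
        else:
            while self.pending:
                addr = self.pending.pop()
                log.append(f"INVLPG {addr:#x} (user tables, now active)")
        self.pending.clear()
        self.overflowed = False


log = []
d = DeferredUserFlushes()
d.kernel_flush(0x1000, log)
d.kernel_flush(0x2000, log)
d.return_to_user(log)
print(log[-2:])
# ['INVLPG 0x2000 (user tables, now active)', 'INVLPG 0x1000 (user tables, now active)']
```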

Avoid TLB flush for CoW

-Copy-on-Write (CoW) is a technique in which memory pages are write-protected and copied only when they are modified

-A shootdown is required if different threads use the same mapping.

-However, the shootdown may not be necessary, because the page fault handler indicates that the PTE is invalid

Userspace-safe Batching

-If the TLB flush is guaranteed to complete before the kernel accesses the user space mapping, the flushes can safely be batched.

Experiment

Safe Mode

-PTI on: for current architectures that use PTI

Unsafe Mode

-PTI off: for future architectures that do not use PTI

Microbenchmark

-Allocate memory with “mmap”, then run “madvise(MADV_DONTNEED)” to remove the page mappings

-Create an initiator thread and a responder thread

-The initiator thread reports the number of cycles each “madvise” call takes

-The responder thread reports the number of cycles for which it was interrupted
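
The initiator side of such a microbenchmark can be approximated as follows. This is a rough sketch, not the paper's benchmark: it runs a single thread, measures wall-clock nanoseconds rather than cycles, and the function name `time_madvise` is hypothetical. It relies on `mmap.madvise`, which the Python standard library provides from version 3.8 on platforms that support it.

```python
import mmap
import time

PAGE = mmap.PAGESIZE


def time_madvise(pages=10):
    """Map `pages` anonymous pages, touch them, and time MADV_DONTNEED."""
    if not hasattr(mmap, "MADV_DONTNEED"):
        return None  # madvise is not exposed on this platform
    buf = mmap.mmap(-1, pages * PAGE, flags=mmap.MAP_PRIVATE)
    try:
        for i in range(pages):
            buf[i * PAGE] = 1              # fault every page in
        start = time.perf_counter_ns()
        buf.madvise(mmap.MADV_DONTNEED)    # remove the page mappings
        elapsed = time.perf_counter_ns() - start
        # On Linux, private anonymous pages read back as zero afterwards.
        zeroed = all(buf[i * PAGE] == 0 for i in range(pages))
        return {"elapsed_ns": elapsed, "zeroed": zeroed}
    finally:
        buf.close()


print(time_madvise(pages=10))
```

To observe remote shootdowns, a second thread of the same process would have to be running on another core during the call, which this sketch leaves out.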

Results

-For the initiator core, we observe that concurrent flushes and early acknowledgement consistently have the largest impact, reducing shootdown latency by 10%-20% in both safe and unsafe mode.

-These two techniques do not benefit the responder cores.

-PTI flushing (§ 3.4) benefits responder cores, and this effect is more pronounced when there are more PTEs to flush

Safe vs Unsafe

-Due to PTI, each PTE needs to be flushed twice

-The security mitigation techniques against hardware vulnerabilities, specifically PTI, cause the latency of entering the kernel from userspace to be higher
