Don’t Shoot Down TLB Shootdown

RyoTTa 2021. 2. 16. 20:12

Background

-TLB = Cache for virtual-to-physical address translations

-A TLB miss is expensive (it costs a page table walk)

-Hardware does not keep TLBs coherent

     -> Need OS support (= TLB Shootdown)

-Maintain Coherency Using IPI (Inter Processor Interrupt)

-Waiting for the ACK from a remote core takes many cycles (>= 1000)

-It takes even longer with PTI, the mitigation for the Meltdown vulnerability

     (PTI = Page Table Isolation: separate kernel and user page tables)

 

TLB Flush

 -Full Flush

  -> Flushes every TLB entry whose G (global) bit is not set

  -> Done by writing the CR3 register (CR3 holds the physical address of the top-level page table)

 -INVLPG

  -> Unlike a full flush, selectively invalidates a single TLB entry

  -> The OS compares the cost with a full flush and switches to a full flush above a threshold number of entries (FreeBSD: >= 4096, Linux: >= 33)

 -INVLPCID

  -> Uses the ASID (Address-Space Identifier), called PCID on x86

  -> Selectively invalidates TLB entries of an inactive address space
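
The choice between the three flush types above can be sketched as a small decision function. This is only an illustration under stated assumptions: `choose_flush` and its parameters are hypothetical names, not kernel APIs, and the thresholds (including the ">=" boundary) are copied from these notes as-is.

```python
# Thresholds as quoted in the notes; real kernels may define the boundary differently.
FULL_FLUSH_THRESHOLD = {"linux": 33, "freebsd": 4096}


def choose_flush(os_name: str, num_entries: int, address_space_active: bool) -> str:
    """Return the flush primitive a kernel might pick (hypothetical helper)."""
    if num_entries >= FULL_FLUSH_THRESHOLD[os_name]:
        # Too many entries: one full flush is assumed cheaper than many selective ones.
        return "full flush (CR3 write)"
    if address_space_active:
        # The target address space is the one currently loaded on this core.
        return "INVLPG per entry"
    # The target address space is not loaded, so it is addressed by its identifier.
    return "INVLPCID per entry"
```

For example, `choose_flush("linux", 10, True)` selects INVLPG, while the same request with 33 or more entries falls back to a full flush.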

 

-> Meltdown Vulnerability

  With PTI, the G bit is not used

  The TLB must be flushed when exiting kernel mode

  That is, a flush happens twice: once for the kernel-mode mappings and once for the user-mode mappings

  -> More TLB misses and lower TLB utility

 

TLB Shootdown

-Spin Wait( = Busy Wait)

 -> The initiator sends a multicast IPI and spin-waits for the ACKs

 

-SW overhead

 -> The OS must select which remote cores need to go through the TLB shootdown

 -> If a remote core receives a TLB shootdown for an address space that is inactive on it,

    it ignores the request; when that address space becomes active again, it checks

    whether any PTE changed and, if so, flushes the address space
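
The "ignore while inactive, check on activation" behaviour above can be modeled with a per-address-space generation counter. This is a toy sketch, not kernel code: the classes `AddressSpace` and `Core`, the function `change_pte`, and the counter itself are assumptions made for illustration.

```python
class AddressSpace:
    def __init__(self):
        self.generation = 0            # bumped on every PTE change


class Core:
    def __init__(self):
        self.active = None             # address space currently loaded on this core
        self.seen = {}                 # address space -> generation already flushed
        self.flushes = 0

    def flush(self, mm):
        self.flushes += 1
        self.seen[mm] = mm.generation

    def on_shootdown(self, mm):
        # Only flush if the address space is active here; otherwise ignore the request.
        if self.active is mm:
            self.flush(mm)

    def switch_to(self, mm):
        # On activation, flush only if a PTE change was missed (or on first load).
        self.active = mm
        if self.seen.get(mm, -1) != mm.generation:
            self.flush(mm)


def change_pte(mm, cores):
    """Model a PTE change followed by a shootdown sent to the given cores."""
    mm.generation += 1
    for core in cores:
        core.on_shootdown(mm)
```

In this model a core that missed a shootdown pays for exactly one flush when it next activates the address space, and nothing if no PTE changed in the meantime.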

 

Related Work

HW

 -Focus on decreasing the cost of TLB shootdown, or on managing it in HW instead of the OS

 -Add microcode for IPIs or add new HW

-> New HW increases cost and requires OS changes

 

SW

 -Focus on avoiding TLB Shootdown

 -ABIS : Tracking page access

 -Barrelfish : Message Passing instead of IPI

 -LATR : Lazy and Asynchronous TLB Shootdown

 -RadixVM : Tracking page access using Radix tree

-> Drawbacks: POSIX incompatibility, may cause errors

 

Proposed Techniques

Idea

 -Concurrent Flushes

 -Early ACK

 -Cacheline Consolidation

 -CoW fault

 -Userspace-safe Batching

 

Concurrent Flushes

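 -Linux and FreeBSD flush in order: the local TLB first, then the remote TLBs

 -The local core then spin-waits for the ACKs from the remote cores

 -> Perform the local TLB flush during that waiting time instead

 -> A single TLB entry flush takes >= 100 ns, and a single TLB shootdown can cover up to 33 entries, so the local flush alone can take >= 3 us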
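
The benefit of overlapping the two flushes can be estimated with a simple latency model. The function name `shootdown_latency_ns` is hypothetical, the 4000 ns remote ACK time is an assumed value chosen only for illustration, and the model ignores the cost of sending the IPI itself.

```python
def shootdown_latency_ns(entries: int, flush_ns_per_entry: int,
                         remote_ack_ns: int, concurrent: bool) -> int:
    """Estimate initiator latency for one shootdown (toy model)."""
    local_flush_ns = entries * flush_ns_per_entry
    if concurrent:
        # Local flush overlaps with waiting for remote ACKs: the longer one dominates.
        return max(local_flush_ns, remote_ack_ns)
    # Baseline: local flush first, then wait for the remote cores.
    return local_flush_ns + remote_ack_ns


sequential = shootdown_latency_ns(33, 100, 4000, concurrent=False)  # 3300 + 4000 = 7300
overlapped = shootdown_latency_ns(33, 100, 4000, concurrent=True)   # max(3300, 4000) = 4000
```

With these assumed numbers the local flush is hidden entirely behind the remote wait.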

 

Cacheline Consolidation

 -Cacheline contention occurs because the TLB layer is abstracted from the SMP layer

     (SMP = Symmetric Multi-Processing)

 -> Inline the related information together to reduce cacheline contention.
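
To see why inlining helps, one can count how many cachelines a responder has to read for one shootdown. The addresses and sizes below are made up for illustration, the 64-byte line size is an assumption, and `cachelines_touched` is a hypothetical helper.

```python
CACHELINE = 64  # bytes; assumed line size


def cachelines_touched(fields) -> int:
    """Count distinct cachelines covered by (address, size) pairs."""
    lines = set()
    for address, size in fields:
        first = address // CACHELINE
        last = (address + size - 1) // CACHELINE
        lines.update(range(first, last + 1))
    return len(lines)


# Hypothetical layouts: SMP call data (24 bytes) and TLB flush info (40 bytes).
separate = [(0x1000, 24), (0x2040, 40)]       # kept by two different layers
consolidated = [(0x1000, 24), (0x1018, 40)]   # inlined back to back

assert cachelines_touched(separate) == 2
assert cachelines_touched(consolidated) == 1
```

Each extra line is one more cacheline that can bounce between the initiator and the responders.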

 

Early Acknowledgment

 -Currently, the remote core does not send an ACK until its shootdown completes

 -If the local core sent the IPI and just continued asynchronously, errors could occur

 -> The remote core sends the ACK as soon as it receives the shootdown request

Exception 1)

 - Early ACK cannot be used when a page table is freed, because of speculative-execution hazards

Exception 2)

 - If an NMI arrives, its handler could access user space through the stale TLB entry
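
The ordering change on the responder side, including Exception 1, can be sketched as follows. `handle_shootdown` and its flags are hypothetical names, and the NMI case of Exception 2 is not modeled here.

```python
def handle_shootdown(frees_page_tables: bool, early_ack_enabled: bool) -> list:
    """Return the order of responder actions for one shootdown request (toy model)."""
    # Exception 1: fall back to the late ACK when page tables are being freed.
    early = early_ack_enabled and not frees_page_tables
    events = []
    if early:
        events.append("ack")    # the initiator can stop spinning here
    events.append("flush")
    if not early:
        events.append("ack")    # baseline: ACK only after the flush completes
    return events
```

With early ACK enabled, an ordinary request yields `["ack", "flush"]`, while a request that frees page tables still yields `["flush", "ack"]`.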

 

In-context Page Flushes

 -With PTI, a TLB flush must be executed on both page tables (kernel and user)

 -INVLPG is used for the active address space, INVLPCID for the inactive one

 -Relying on INVLPG alone would require frequent mode switching

 -> Defer the user address space flush until that address space is active, then flush it

 

 -A stack is needed to hold the data for the deferred user address space flush

 -Sometimes TLB flushes can be skipped because there is no serializing instruction
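
A minimal sketch of the deferral idea: the kernel-side entry is flushed right away, and the user-side flush is pushed on a small stack and replayed once the user address space is active. The class `DeferredUserFlush`, its capacity, and the fallback to a full flush on overflow are assumptions for illustration, not details taken from the paper.

```python
class DeferredUserFlush:
    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.pending = []               # stack of user addresses still to flush
        self.full_flush_needed = False
        self.log = []                   # records the modeled flush operations

    def kernel_flush(self, addr: int) -> None:
        # Kernel address space is active, so its entry is flushed in context.
        self.log.append(("INVLPG kernel", addr))
        if len(self.pending) < self.capacity:
            self.pending.append(addr)   # defer the user-side flush
        else:
            self.full_flush_needed = True   # assumed fallback when the stack is full

    def return_to_user(self) -> None:
        # User address space becomes active: replay the deferred flushes in context.
        if self.full_flush_needed:
            self.log.append(("full flush user", None))
        else:
            while self.pending:
                self.log.append(("INVLPG user", self.pending.pop()))
        self.pending.clear()
        self.full_flush_needed = False
```

The point of the model is only the ordering: no flush of the inactive address space is issued while the kernel is running.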

 

Avoid TLB flush for CoW

 -Copy-on-Write is a technique in which memory pages are write-protected and

     copied only when they are modified

 -Shootdown is required if different threads use the same mapping.

 -But the shootdown may not be necessary, because the page fault handler specifies that the PTE is invalid

 

Userspace-safe Batching

 -If the TLB flush is guaranteed to be done before the kernel accesses the user space mapping, the flushes can safely be batched.
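
The guarantee above can be expressed as a small invariant: deferred flushes are always drained before any kernel access to a user mapping. `FlushBatch` and its methods are hypothetical names used only to illustrate that rule.

```python
class FlushBatch:
    def __init__(self):
        self.pending = set()     # pages whose flush has been deferred
        self.flush_calls = 0     # number of batched flush operations issued

    def defer(self, page: int) -> None:
        self.pending.add(page)

    def flush_now(self) -> None:
        # One batched flush covers every pending page.
        if self.pending:
            self.flush_calls += 1
            self.pending.clear()

    def kernel_access_user(self, page: int) -> int:
        # Safety rule: drain the batch before touching any user space mapping.
        self.flush_now()
        return page
```

Three deferred pages followed by one user access cost a single batched flush in this model instead of three separate ones.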

 

 

Experiment

Safe Mode

 -PTI on: for current architectures that use PTI

Unsafe Mode

 -PTI off: for future architectures that do not use PTI

 

Microbenchmark

 -Allocate memory with “mmap”, then run “madvise(MADV_DONTNEED)” to remove the page mappings

 -Create an initiator thread and a responder thread

 -The initiator thread reports the number of cycles each “madvise” call takes

 -The responder thread reports the number of cycles for which it was interrupted
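
A rough sketch of the initiator side only, assuming Linux and Python 3.8 or later (for `mmap.madvise`). It measures wall-clock nanoseconds rather than cycles, has no responder thread, and the function name `time_madvise` is hypothetical; the original benchmark would more plausibly be written in C.

```python
import mmap
import time

PAGE = mmap.PAGESIZE


def time_madvise(pages: int, rounds: int) -> list:
    """Time madvise(MADV_DONTNEED) on an anonymous mapping, in nanoseconds."""
    length = pages * PAGE
    region = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    samples = []
    try:
        for _ in range(rounds):
            # Touch every page so that a mapping exists to be removed.
            for offset in range(0, length, PAGE):
                region[offset] = 1
            start = time.perf_counter_ns()
            region.madvise(mmap.MADV_DONTNEED)   # drops the page mappings
            samples.append(time.perf_counter_ns() - start)
    finally:
        region.close()
    return samples
```

A call such as `time_madvise(pages=10, rounds=100)` returns one latency sample per round; the number of pages corresponds to the number of PTEs that need flushing.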

 

Results

 -For the initiator core, we observe that concurrent flushes and early acknowledgement consistently have the largest impact, reducing shootdown latency by 10%-20% in both safe and unsafe mode.

 

 -These two techniques do not help responder cores.

 

 -PTI flushing (§ 3.4) benefits responder cores, and this effect is more clearly highlighted when there are more PTE entries to flush

 

Safe vs Unsafe

 -Due to PTI, each PTE needs to be flushed twice

 

 -The security mitigation techniques against hardware vulnerabilities, specifically PTI, cause the latency of entering the kernel from userspace to be higher

 
