# ARM SYSTEMS SEMANTICS

Ben Simner

University of Cambridge

November 28, 2022

### Declaration

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text.

This dissertation contains 55200 total words as counted by `detex | wc -w`.

This dissertation does not exceed the regulation length of 60 000 words, including tables and footnotes.

# **Contents**

- 1 Introduction
  - 1.1 Thesis Introduction
  - 1.2 Modern microprocessors and Arm
  - 1.3 Processor architecture and Armv8-A
  - 1.4 Semantics of these architectures
  - 1.5 Systems software
  - 1.6 Hypervisors
- Preface to relaxed memory
- 2 Modelling Armv8-A: background
  - 2.1 Litmus tests
    - 2.1.1 The litmus zoo
  - 2.2 Operational models
  - 2.3 Promising
  - 2.4 Axiomatic-style models
    - 2.4.1 Informal description and the Cat language
    - 2.4.2 The Armv8-A axiomatic model
- 3 …
  - 3.1 Introduction
  - 3.2 Industry Practice and the Existing ARMv8-A Prose
  - 3.7 Related Work
  - 3.8 Conclusion
- 4 Inst…
  - 4.1 Shape of the model
  - 4.2 Extending flat
- 5 Instruction fetch: axiomatically
- 6 Instruction fetch: validation
  - 6.2 Extension to isla-axiomatic
  - 6.3 Hardware testing
    - 6.3.1 Custom harness
    - 6.3.2 Extending herdtools
    - 6.3.3 Results from hardware
  - 6.4 Correspondence between the models
- 7 Pagetables and the VMSA
  - 7.1 …iction
  - 7.2 … Memory
  - 7.3 …ranslation Tables
    - 7.3.1 Translation table format
    - 7.3.2 The Arm translation table walk
  - 7.4 …isation and a second stage of translation
  - 7.5 …ation regimes
  - 7.6 …eudocode
    - 7.6.1 The lifecycle of a store
    - 7.6.2 Writes to memory
    - 7.6.3 Translation table walks
  - 7.7 …g in TLBs
  - 7.8 …SL Reference
    - 7.8.1 AArch64.TranslateAddress
    - 7.8.2 AArch64.FullTranslate
    - 7.8.3 AArch64.S1Translate
    - 7.8.4 AArch64.S1Walk
    - 7.8.5 AArch64.S2Translate
    - 7.8.6 AArch64.S2Walk
    - 7.8.7 AArch64.FetchDescriptor
- 8 Relaxed virtual memory
  - 8.1 … memory litmus tests
  - 8.2 … data memory
    - 8.2.1 Virtual coherence
    - 8.2.2 Aliasing different locations
    - 8.2.3 Might be same (physical) address
  - 8.3 …an be cached in TLBs
    - 8.3.1 Microarchitectural TLBs
    - 8.3.2 Model MMU
    - 8.3.3 Invalid entries
  - 8.4 … not from TLB
    - 8.4.1 Out-of-order execution
    - 8.4.2 Enforcing thread-local ordering
    - 8.4.3 Enhanced Translation Synchronization
    - 8.4.4 Forwarding to the translation table walker
    - 8.4.5 Speculative execution
    - 8.4.6 Single-copy atomicity
    - 8.4.7 Multi-copy atomicity
    - 8.4.8 Translation-table-walk intra-walk ordering
    - 8.4.9 Multiple translations within a single instruction
  - 8.5 …g of translations in TLBs
    - 8.5.1 Cached translations
    - 8.5.2 TLB fills
    - 8.5.3 $\mu$TLBs
    - 8.5.4 Partial caching of walks
    - 8.5.5 Reachability
  - 8.6 TLB maintenance
    - 8.6.1 Recovering coherence
    - 8.6.2 Thread-local ordering and TLBI
    - 8.6.3 Broadcast
    - 8.6.4 Virtualization
    - 8.6.5 Break-before-make
    - 8.6.6 ASIDs and VMIDs
    - 8.6.7 Access permissions
  - 8.7 Context synchronisation
    - 8.7.1 Relaxed system registers
  - 8.8 …butions
- 9 An axiomatic VMSA model
  - 9.1 Extended candidate executions
    - 9.1.1 Candidate events
    - 9.1.2 Candidate relations
  - 9.2 Cat model
  - 9.3 Axioms
  - 9.4 Relations
    - 9.4.1 obs
    - 9.4.2 dob
    - 9.4.3 bob
    - 9.4.4 tob
    - 9.4.5 ctxob
    - 9.4.6 obfault and obETS
    - 9.4.7 obtlbi
- 10 Validating the VMSA model
  - 10.1 Extending isla-axiomatic
  - 10.2 Running on hardware: system-litmus-harness
    - 10.2.1 Harness overview
    - 10.2.2 Results from hardware
- 11 …onal VMSA model
  - 11.1 …action
  - 11.2 Structure of the state
    - 11.2.1 The MMU
    - 11.2.2 Its TLB
  - 11.3 … memory axioms
  - 11.4 Break-before-make violation detection
  - 11.5 A weaker VMSA model
  - 11.6 Executing the models with Isla
- 12 …itations
  - 12.2 … interaction of instruction fetch and virtual memory
  - 12.3 Other architectures
- 13 Conclusion
- Glossary


# Introduction

Over the past years and decades, much work has gone into writing down what it is that the computers we use every day actually do: defining, mathematically and precisely, the *architecture* that the processors in our computers implement, such as Intel/AMD's x86, Arm's Armv8-A, IBM's Power, and RISC-V.

Architectures can be thought of as abstractions of the underlying hardware: as programming languages whose syntax is defined by the *ISA* (Instruction Set Architecture), and whose semantics is the composition of the sequential behaviours of the individual instructions and registers of the ISA with the machine execution model, that is, the thread and storage subsystems. Architecture can therefore be thought of as the *interface* between hardware and software, defining the guarantees that hardware must give and that software may rely upon. In theory, this interface is straightforward to define. One can give precise formal semantics to the individual instructions, as Arm does with its *Architecture Specification Language* (ASL), and then tie instructions together with a fetch-decode-execute loop. In practice, however, modern industrial architectures accumulate great complexity and subtlety. The Armv8-A and Intel reference manuals have 11,500 [1] and 4,922 [2] pages respectively, covering everything from the ISA to the interactions between the ISA and the thread and storage subsystems.

The complexity of these interfaces becomes most apparent in *multicore* systems. When multiple processors execute concurrently and communicate through shared memory, various hardware optimisations, which are usually invisible to the programmer outside of timing effects, can become architecturally visible; that is, they affect the semantics of the machine code. Over the years, these effects have been studied as part of the field of 'relaxed memory' research, resulting in numerous formal models for a variety of microprocessor architectures, giving a precise mathematical semantics to the concurrent behaviours of 'userland' machine code programs. We now seek to expand this body of work to cover not just those parts of the architectures used by userland processes, but also the features required by systems software to function.

In this work we will focus on the Armv8-A architecture: the *application*-class processors that we find powering a large proportion of modern mobile devices. There are a few reasons why we shall focus on Arm: (1) Arm processors are ubiquitous, and millions of people rely on software running on them every day; (2) Arm has a diverse ecosystem of implementations, meaning software must program to this abstract interface much more tightly than one might for other architectures; and (3) Arm has put a large amount of effort into precisely and formally defining its ISA.

Specifically, we will focus on key architectural features required by operating systems and hypervisors, which are not or only partially accessible to userland processes: instruction fetching, cache maintenance, virtual memory and TLB maintenance, and exceptions.

**Armv8-A** Armv8-A is a modern industrial reduced-instruction-set architecture. Execution of an Armv8-A processor is split into two modes: AArch64 (for 64-bit execution) and AArch32 (for 32-bit execution). AArch64 mode uses the A64 instruction set. AArch32 mode can use either the T32 or the A32 instruction set. This is illustrated in Figure 1.1.

A64 currently has 402 'base' instructions and another 1,205 vector, matrix and SIMD instructions. It has 31 general-purpose registers, accessible through 32-bit views as w0-w30 or through 64-bit views as x0-x30. It has a dedicated zero register (wzr/xzr) and a stack pointer register (sp). Instructions are fixed-width and in the typical RISC style, reading operands from registers and writing results back to registers, with only limited immediate values. Execution in AArch64 is split into four 'exception levels', which demarcate the levels of privilege that a process may have, ranging from EL0 (least privileged) to EL3 (most privileged). Typically userland processes execute at EL0, with very limited access to hardware features, with operating systems running at EL1, hypervisors running at EL2, and any firmware and secure monitor running at EL3.

Figure 1.1: Armv8-A structure.

So, each CPU has its own bank of registers; is executing in either AArch64 or AArch32 execution mode; is fetching, decoding and executing instructions from one of the A64, A32 or T32 ISAs; and is at one of EL0, EL1, EL2 or EL3. For this work, we will focus on AArch64 and its A64 ISA, and on hypervisors and below (EL2, EL1 and EL0).

**Systems software** When we use our computers on a daily basis, we are typically interacting with *userland*: unprivileged programs, with restricted access to hardware. These userland programs make up the bulk of the applications we use every day, from spreadsheets to web browsers, text editors and so on. They typically execute with the least privilege (at EL0), with the operating systems and hypervisors below them restricting their access to memory through the use of *virtual memory*.

Operating systems split userland execution into *processes*: instances of programs, with some associated dedicated virtual memory. It is the operating system, executing with more privilege (at EL1), that configures and schedules these processes.
**Thesis overview**

## 1.1 Thesis Introduction

What this thesis is trying to do.

## 1.2 Modern microprocessors and Arm

What/who are Arm and what do they do? How do they relate to others?

## 1.3 Processor architecture and Armv8-A

Architecture versus microarchitecture, and why we care about the A profile.

## 1.4 Semantics of these architectures

Briefly, how and why we write semantics, and what we can do with them.

## 1.5 Systems software

What is systems software, and where/why is its semantics lacking?

## 1.6 Hypervisors

Narrow in on hypervisors and their importance, and mention pKVM.

# Preface to relaxed memory

In days gone by, the implementations of our programming languages, the compilers and interpreters implemented in software and hardware, were straightforward and direct encodings of the desired semantics. As time progressed these implementations became legacy, acquiring multiple layers of abstraction, with ever more complex implementations being built upon other implementations with their own abstractions and complexity. Our compilers and our hardware rewrite our programs to be faster, use less space, and be more compact. They propagate and duplicate reads, subsume or outright delete writes, change the order we wrote the operations in, replace one computation with another, or even eliminate whole parts of the program entirely.

It is, perhaps, believed that such optimisations are *semantics preserving*: that, aside from the timing effects they are designed to cause, they are invisible to the programmer. This is, alas, untrue. Many of these optimisations are highly desirable, yet seem fundamentally incompatible with concurrency. Our multithreaded programs, and our multicore processors, make these incompatibilities impossible to hide.

Take, as an example, Intel's x86 microprocessor architecture, with which we are all familiar. It, quite sensibly, wants to allow its implementations to perform a common optimisation: to batch many smaller writes together. This store buffering optimisation is ubiquitous in the hardware world; it is, however, not semantics preserving. Multiple threads of execution, running on multiple cores, may have mutually inconsistent views of memory, where, at the same point in time, different cores have differing opinions on the value at a particular memory address. This disagreement poisons the program: if the program reads from that memory location, it will get different answers on different cores. This can break key invariants of our software, leading to critical bugs in our synchronisation primitives, our data structures and our software more generally, if the programmer was unaware of these behaviours and their mitigations.
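To make the store buffering behaviour concrete, the following is a minimal sketch in Python of two cores that each hold a private FIFO store buffer in front of shared memory. It is a toy under stated assumptions, not a model of any real x86 implementation: the names `Core` and `run_sb`, and the use of an explicit `drain` to stand in for a fence, are all hypothetical.

```python
from collections import deque


class Core:
    """A toy core with a private FIFO store buffer (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory
        self.buffer = deque()  # pending (address, value) writes

    def store(self, addr, value):
        # The write is buffered; other cores cannot see it yet.
        self.buffer.append((addr, value))

    def load(self, addr):
        # A core sees its own newest buffered write first, else shared memory.
        for a, v in reversed(self.buffer):
            if a == addr:
                return v
        return self.memory[addr]

    def drain(self):
        # Flush buffered writes to shared memory, oldest first.
        while self.buffer:
            a, v = self.buffer.popleft()
            self.memory[a] = v


def run_sb(drain_before_loads):
    """Each core writes one location, then reads the other core's location."""
    memory = {"x": 0, "y": 0}
    c0, c1 = Core(memory), Core(memory)
    c0.store("x", 1)
    c1.store("y", 1)
    if drain_before_loads:  # assumption: stands in for a fence after each store
        c0.drain()
        c1.drain()
    r0 = c0.load("y")
    r1 = c1.load("x")
    c0.drain()
    c1.drain()
    return r0, r1


print(run_sb(drain_before_loads=False))  # (0, 0)
print(run_sb(drain_before_loads=True))   # (1, 1)
```

Without draining, both cores read 0 even though both stores appear earlier in program order, an outcome no simple interleaving of the two threads can produce.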

Intel is not alone, and store buffering is not the only behaviour our hardware exhibits. Arm, RISC-V and IBM's Power architectures all exhibit their own behaviours, with their own mitigations. Each microprocessor architecture comes with a reference manual, comprising thousands, or tens of thousands, of pages of a mix of prose and pseudocode, attempting to describe these, and other, behaviours. We find that these architectures are incomparable: they have different sets of allowable behaviours which break the sequential consistency fantasy of the world. Reordering of instructions, prefetching and caching of data, buffering of writes and loads, hierarchical cache layouts, branch prediction, and speculation are, on some of those architectures but not others, allowed to become visible to the programmer. We find that it is not that some implementations perform these optimisations while others do not, but that architectures do not always require implementations to hide their consequences from the programmer, instead allowing the hardware to be looser with hazard checking, cache invalidation and so on, where the performance gains are considered a greater benefit than the semantic loss.

It is not just our hardware that has these concerns. A variety of software languages, including C, C++, Java, Rust, and Haskell, are all known to have comparable behaviours, both derived from similar optimisations done by their compilers and interpreters and inherited from the hardware they run upon.

It is, therefore, imperative that we, as a community, endeavour to understand the what, why, when and where of these behaviours, to write down precisely how they affect our programming languages, and to build tools and techniques to help identify, explore, test, check and verify that our programs are correct. This is, in a nutshell, the field of relaxed memory.


# Modelling Armv8-A: background

Now we turn our attention to the current, industry-standard methods of precisely and formally modelling relaxed memory behaviours. There are principally three ways this is done: with operational models that mimic the mechanisms we see on hardware, with axiomatic-style models that filter out whole-program executions based on some predicate, and with the promising model.

We shall see that the idea of *litmus testing* is central. Litmus tests provide a succinct and efficient way of describing and enumerating the behaviours that the various models should allow. We will start by looking at litmus testing in general, and some specific litmus tests of interest to the Armv8-A models, before looking at the models in detail.

## 2.1 Litmus tests

Much of the relaxed memory work is founded on *litmus tests*: small, self-contained, executable snippets of code. Each captures a simple pattern or shape one may find in software.

Take the classic MP ("Message passing") litmus test as an example. The code listing for the Armv8-A (AArch64) variant can be found in Figure 2.1. The 'MP' portion of the name captures the *shape*: a two-threaded test over two locations, with one thread (usually written first) writing to the locations and another thread reading them in the converse order. The second half of the name ('+pos') designates the variation on the shape. Typically the variations are defined as the sequence of orderings between events (separated by -) for each thread (separated by +). In this case, it is the variation with just program order (po) between each event, on both threads.

| MP+pos                                                                  | AArch64                    |
|-------------------------------------------------------------------------|----------------------------|
| Initial state: 0:X1=x, 0:X3=y, 1:X1=y, 1:X3=x, *x=0, *y=0               |                            |
| Thread 0                                                                | Thread 1                   |
| MOV X0,#1<br>STR X0,[X1]<br>MOV X2,#1<br>STR X2,[X3]                    | LDR X0,[X1]<br>LDR X2,[X3] |
| Allowed: 1:X0=1, 1:X2=0                                                 |                            |

Figure 2.1: MP test code listing.

The code listing follows the standard litmus format. The top line contains the name of the litmus test (MP+pos) and the architecture this variant is for (AArch64); the second section contains the initial register and memory state; the next section contains the literal code listing for each thread; and the final state at the bottom is the interesting outcome we wish to explore.
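The naming convention for litmus tests can be sketched as a small parser. This is only an illustration for two-thread shapes such as MP: the function `parse_litmus_name` is hypothetical, and the treatment of a trailing 's' as "the same ordering on every thread" is an assumption generalised from the 'pos' example above.

```python
def parse_litmus_name(name, threads):
    """Split a litmus test name into its shape and per-thread orderings.

    `threads` is the number of threads in the shape; it is needed to expand
    the plural shorthand, e.g. 'pos' meaning 'po' on every thread.
    """
    shape, *variations = name.split("+")
    # Assumption: a single variation ending in 's' applies to all threads.
    if len(variations) == 1 and threads > 1 and variations[0].endswith("s"):
        variations = [variations[0][:-1]] * threads
    # Within one thread, orderings between successive events are '-' separated.
    return shape, [v.split("-") for v in variations]


print(parse_litmus_name("MP+pos", threads=2))
# ('MP', [['po'], ['po']])
print(parse_litmus_name("MP+dmb.st+addr-rfi-addr", threads=2))
# ('MP', [['dmb.st'], ['addr', 'rfi', 'addr']])
```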

On Arm, this outcome is allowed. We can imagine that there may be many executions of the listed code, where the instructions of the two threads are interleaved in different ways. To see the highlighted outcome, with Thread 1 reading 1 for y but 0 for x, there is only one possible combination of reads: that the read of y reads from the write to y, and the read of x reads from the initial memory state. This execution can be represented as a graph of the key events (reads and writes) of the program, and their implicit orderings. The execution graph that corresponds with the allowed outcome can be found in Figure 2.2.



Figure 2.2: MP test execution diagram.

The nodes on the left, under the Thread 0 label, correspond to events from executing Thread 0 of the program, where the 'a: W x = 1' event (that is, the event labelled a is a write to x of the value 1) corresponds to the propagation of the first store in Thread 0 to memory, and event b corresponds to the second store being propagated. They are related by the po edge, saying that event a's instruction occurred before event b's instruction in the program. Similarly, under Thread 1 we see the event 'c: R y = 1' for the first load reading y and seeing the value 1. This value was read from the write event b, and so the event b is related to the read event c by the reads-from edge rf. Finally, the load of x reads from the initial value in memory, so we have another read event, labelled d, which reads 0. The read of x must be ordered before the write of x, so the read and write are related by fr (from-reads). On Arm, the writes and reads need not execute in the order they appear in the program. So, while this execution appears to have a cyclic dependency in the order events must have happened in, the cycle can be broken by re-ordering the execution of the reads or writes. The execution is, therefore, allowed, and we observe this outcome on hardware.
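The apparent cycle can be checked mechanically. The sketch below encodes the four events and the po, rf and fr edges described above as sets of pairs, and searches for a cycle with a depth-first traversal. The encoding and the helper `has_cycle` are illustrative assumptions, not part of any existing tool.

```python
# Events of the MP+pos execution, named as in the text:
# a: W x=1, b: W y=1 (Thread 0); c: R y=1, d: R x=0 (Thread 1).
po = {("a", "b"), ("c", "d")}   # program order within each thread
rf = {("b", "c")}               # c reads y from the write b
fr = {("d", "a")}               # d reads x before the write a


def has_cycle(edges):
    """Return True if the directed graph given by `edges` contains a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # reached a node already on the current path
        visiting.add(node)
        found = any(visit(n) for n in graph[node])
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(n) for n in graph)


print(has_cycle(po | rf | fr))             # True
print(has_cycle({("c", "d")} | rf | fr))   # False
print(has_cycle({("a", "b")} | rf | fr))   # False
```

With both po edges in place the graph is cyclic; removing either one, corresponding to re-ordering either the writes or the reads, leaves an acyclic graph.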

### 2.1.1 The litmus zoo


We generally use litmus tests to capture some *behaviour*: one particular pattern in code, or a specific hardware mechanism that is responsible for allowing or forbidding the test. A litmus test may exercise many microarchitectural mechanisms whose confluence leads to the final result, or there may be multiple different mechanisms that could independently lead to the final result. For example, in the MP+pos test we just saw, there are two well-understood microarchitectural explanations: that the stores propagate out-of-order, or that the loads satisfy out-of-order. Either explanation is sufficient, and one needs to prevent both to forbid the outcome.

Previous work has enumerated these various patterns to produce a large collection of litmus tests, for a range of architectures, each with an assortment of variations for different intra-thread orderings (for barriers, dependencies and so on). We will not do an exhaustive review of all the behaviours that are allowed and forbidden in Arm, instead referring the reader to test7 TODO: ?CITE?, Pulte et al TODO: ?CITE?, Flur TODO: ?CITE? and Alglave et al TODO: ?CITE?, but we will briefly look at some of the behaviours that the reader should understand before progressing to future chapters. Namely, coherence, barriers and dependencies, and multi-copy atomicity.

**Thread-local ordering** On Arm, instructions need not execute in the order they appear in the program. Reads and writes are free to be re-ordered with respect to each other, with few restrictions. This is in contrast to other architectures such as Intel's x86, where only writes can be re-ordered with respect to program-order later reads (through store buffering).

Not all re-orderings are permissible; Arm requires that single-threaded programs behave as if executed sequentially. This means that non-SC executions only come about through the interaction between multiple threads. We have already seen this with the MP test mentioned earlier. To forbid the outcome of that test, we can add barriers or dependencies to enforce thread-local ordering, preventing the events from being reordered. Two (forbidden) variations of MP can be found in Figure 2.3.
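The effect of enforcing thread-local ordering on MP can be explored with a deliberately crude sketch: each thread's two accesses may be swapped unless that thread is marked as ordered, and the resulting sequences are interleaved over a single shared memory. This is a toy under loud assumptions, not the Armv8-A model: the flags `writer_ordered` and `reader_ordered` merely stand in for a barrier or dependency, and the function names are hypothetical.

```python
from itertools import permutations


def interleavings(a, b):
    """Yield every merge of lists a and b that keeps each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcomes(writer_ordered, reader_ordered):
    """Final (X0, X2) values of Thread 1 reachable in this toy model of MP."""
    writer = [("W", "x", None), ("W", "y", None)]
    reader = [("R", "y", "X0"), ("R", "x", "X2")]
    # An unordered thread may perform its two accesses in either order.
    w_orders = [tuple(writer)] if writer_ordered else list(permutations(writer))
    r_orders = [tuple(reader)] if reader_ordered else list(permutations(reader))
    results = set()
    for w in w_orders:
        for r in r_orders:
            for trace in interleavings(list(w), list(r)):
                mem, regs = {"x": 0, "y": 0}, {}
                for kind, loc, reg in trace:
                    if kind == "W":
                        mem[loc] = 1
                    else:
                        regs[reg] = mem[loc]
                results.add((regs["X0"], regs["X2"]))
    return results


relaxed = (1, 0)
print(relaxed in outcomes(False, False))  # True
print(relaxed in outcomes(True, False))   # True
print(relaxed in outcomes(False, True))   # True
print(relaxed in outcomes(True, True))    # False
```

Ordering only one of the two threads still leaves the relaxed outcome reachable in this sketch; it disappears only when both threads are ordered.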

Control dependencies, however, do not necessarily enforce order on Arm. Speculation allows reads to happen 'early', but not writes, giving us the asymmetry in the outcomes in Figure 2.4.

**Coherence** A fundamental guarantee provided by most modern microprocessor architectures is *coherence*: that, for each location, there is a total order of the writes to that location which all threads agree on. Microarchitecturally this means that the storage subsystem (with all its caches) must remember which write is the 'most recent' (coherence-latest) write, and reads should always read from that.
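One consequence of coherence, that successive reads of the same location by one thread never move backwards in the coherence order, can be sketched as a small check. The helper `coherent_reads` and its list-based encoding of the coherence order are assumptions made for illustration, and the check covers only this read-read case.

```python
def coherent_reads(co, reads):
    """Check that program-ordered reads of one location never go back in `co`.

    `co` lists the writes to the location in coherence order (oldest first);
    `reads` lists, in program order, which write each read took its value from.
    """
    positions = [co.index(w) for w in reads]
    return all(earlier <= later
               for earlier, later in zip(positions, positions[1:]))


co_x = ["init", "a"]  # the initial write to x, then the write a: W x=1
print(coherent_reads(co_x, ["init", "a"]))  # True
print(coherent_reads(co_x, ["a", "init"]))  # False
```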



**Figure 2.3:** Two variants of MP with thread-local ordering. On the left: MP+dmbs with Arm DMB barrier between instructions. On the right: MP+dmb.st+addr with an address dependency between the reads.



Figure 2.4: Two litmus tests with speculation.

On the left: MP+dmb.st+ctrl with Arm DMB barrier between the writes, but a control dependency between the reads. On the right: LB+ctrls, a variant of the classic 'load buffering' litmus test, with control dependencies to both writes.



**Figure 2.5:** Two coherence litmus tests.

On the left: CoRR1, that two subsequent reads of the same location in the same thread should be consistent with the coherence order. On the right: CoWR, that a read of a location cannot skip over a newer program-order earlier write from the same thread.

The key litmus tests for coherence can be found in Figure 2.5.

**Multi-copy atomicity and write-forwarding** When combining more than two threads, coherence alone does not ensure that writes are propagated to all threads simultaneously. The property that they are is known as *multi-copy atomicity*.

Arm has a kind of partial multi-copy atomicity, known as *other*-multi-copy atomicity, where writes can be observed by the writing thread earlier than they can be seen by other threads, but once a write has propagated to another thread then all threads must see that write. Figure 2.6 shows the former, that writes can be observed locally (through *write forwarding*) before being propagated to other threads, even down speculative branches, and Figure 2.7 shows the classic independent-reads independent-writes (IRIW) litmus test, which demonstrates the latter point, that writes propagate to all threads simultaneously.

## 2.2 Operational models

Microarchitectural style and how they aren't "how the hardware works" But don't actually explain flat ...

## 2.3 Promising

TODO: talk about after axiomatic?



**Figure 2.6:** Two litmus tests with write forwarding. On the left: MP+dmb.st+addr-rfi-addr with write-forwarding down a non-speculative branch. On the right: PPOCA, with write-forwarding down a speculative branch.



Figure 2.7: IRIW+dmbs: a classic multi-copy atomicity litmus test.


### 2.4 Axiomatic-style models

A newer approach, devised by Alglave et al **TODO: ?CITE?**, describes the allowed behaviour with a set of *axioms* constraining *candidate* executions.

The candidates are whole-program executions, consistent with the intra-instruction semantics, but with no constraints on the values of reads or writes. The axioms then discard some of these executions as inconsistent, if they violate a semantic property the architecture enforces. Then, given a program and a final state, if there is any candidate execution that gives rise to that final state and is consistent with the axioms of the model, the model is said to *allow* that final state.
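As a rough illustration of this generate-then-filter recipe, the following Python sketch enumerates the candidates of the MP shape by choosing which write each read reads from, derives the from-reads relation, and then discards the candidates that fail an axiom. The axiom used here is deliberately *not* the Armv8-A one: it is plain sequential consistency (acyclicity of `po | rf | co | fr`), chosen only because it is short. The event names and helper functions are assumptions made for the illustration.

```python
from itertools import product

# Events of the MP shape: thread 0 writes x then y; thread 1 reads y then x.
# "ix" and "iy" are the initial writes of 0 to each location.
WRITES = {"ix": ("x", 0), "iy": ("y", 0), "a": ("x", 1), "b": ("y", 1)}
READS = {"c": "y", "d": "x"}
PO = {("a", "b"), ("c", "d")}


def acyclic(edges):
    """Return True if the directed graph given by `edges` has no cycle."""
    succ = {}
    for src, dst in edges:
        succ.setdefault(src, set()).add(dst)
    visiting, done = set(), set()

    def dfs(node):
        if node in done:
            return True
        if node in visiting:
            return False  # back edge: a cycle
        visiting.add(node)
        ok = all(dfs(n) for n in succ.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return ok

    return all(dfs(n) for n in list(succ))


def candidates():
    """Yield (read values, rf, co, fr) for every choice of reads-from."""
    # co is forced here: each location has the initial write and one other write.
    co = {("ix", "a"), ("iy", "b")}
    choices = [[w for w, (loc, _) in WRITES.items() if loc == READS[r]]
               for r in READS]
    for picked in product(*choices):
        rf = set(zip(picked, READS))
        # from-reads: r reads from w, and w is coherence-before w2
        fr = {(r, w2) for (w, r) in rf for (w1, w2) in co if w1 == w}
        values = {r: WRITES[w][1] for (w, r) in rf}
        yield values, rf, co, fr


def allowed_outcomes():
    """Final values of (c, d) that some axiom-consistent candidate gives."""
    return {(values["c"], values["d"])
            for values, rf, co, fr in candidates()
            if acyclic(PO | rf | co | fr)}
```

Under this stand-in axiom the outcome where the read of y sees 1 but the read of x sees 0 is discarded, because its candidate contains a cycle; a weaker axiom would keep it.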

### 2.4.1 Informal description and the Cat language

We can think of a candidate execution as a graph of the events of the program, with some intrinsic relations capturing the sequencing of those events.

Revisiting the MP example we saw earlier, Figure 2.2 illustrates a typical candidate execution.

### 2.4.2 The Armv8-A axiomatic model

We define the Armv8-A axiomatic-style model in two parts: (1) the generation of candidates given a program, and (2) the relations and axioms.

**Candidate generation** Candidates are generated using the intra-instruction semantics, defined by Arm's ASL pseudocode. Formally, we assign a denotation function $\llbracket - \rrbracket$ which relates an Arm machine code program with a set of candidate executions. We do not, in this work, try to give a precise definition of this function. Instead, we delegate to the Arm pseudocode for the intra-instruction semantics, and to software to compose instructions together, work out dependencies, and glue multiple threads together, into a whole program candidate execution.

TODO: Why not do that here? I could probably just write down.

### TODO: preexecution v candidate

A candidate execution is made up of a set of event names, a labelling function, and a collection of candidate relations, where the event labels are defined as follows:

```
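\begin{array}{lcl}
\mathit{Event} & \equiv & \mathit{Reads} \cup \mathit{Writes} \cup \mathit{Barriers} \\
\mathit{Reads} & \equiv & \{R, A, Q\} \times \mathit{Loc} \times \mathit{Val} \\
\mathit{Writes} & \equiv & \{W, L\} \times \mathit{Loc} \times \mathit{Val} \\
\mathit{Barriers} & \equiv & \{\mathrm{DMB.LD}, \mathrm{DMB.ST}, \mathrm{DMB.SY}, \mathrm{ISB}\} \\
\mathit{Loc} & \equiv & BV_{48} \\
\mathit{Val} & \equiv & BV_{64}
\end{array}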
```

The candidate relations are:


- $\triangleright$  program order:  $E_1$  po  $E_2$  iff the instruction for  $E_1$  occurs before the instruction for  $E_2$  in the source program.
- $\triangleright$  same-location:  $M_1$  loc  $M_2$  iff the address of  $M_1$  is the same location as used by  $M_2$ .
- $\triangleright$  address dependent:  $R_1$  addr  $RW_2$  iff the value read by  $R_1$  is used in the calculation of the address of  $RW_2$ .
- $\triangleright$  data dependent:  $R_1$  data  $W_2$  iff the value read by  $R_1$  is used in the calculation of the value to write in  $W_2$ .
- $\triangleright$  control dependent:  $R_1$  ctrl  $RW_2$  iff the value read by  $R_1$  is used to determine whether or not the instruction that  $RW_2$  originates from is executed at all.
- $\triangleright$  read-modify-write:  $R_1$  rmw  $W_2$  for the separate read and write events of an atomic update.
- $\triangleright$  external:  $E_1$  ext  $E_2$  iff  $E_1$  and  $E_2$  originate from different threads.

```
(* observed by *)
                                                                       [L]; po; [A]
let obs = rfe | fr | co
(* dependency-ordered-before *)
                                                                  | [A | Q]; po; [R | W]
| [R | W]; po; [L]
(* Ordered-before *)
let dob =
                                                                  let ob1 = obs | dob | aob | bob
      addr | data
   | (ctrl | (addr ; po)) ; [W]
| (addr | data); rfi
atomic-ordered-before *)
                                                                  let ob = ob1^+
                                                                  (* Internal visibility requirement *)
                                                                  acyclic po-loc | fr | co | rf as internal
let aob = rmw
                                                                  (* External visibility requirement *)
| [range(rmw)]; rfi; [A | Q]
(* barrier-ordered-before *)
let bob = [R]; po; [dmbld]
| [W]; po; [dmbst]
| [dmbst]; po; [W]
                                                                  irreflexive ob as external
                                                                  (* Atomic: Basic LDXR/STXR constraint to forbid intervening writes. *)
                                                                  empty rmw & (fre; coe) as atomic
   | [dmbld]; po; [R|W]
```

**Figure 2.8:** Deacon et al's Armv8-A MCA Axiomatic model.

**Axioms** The axiomatic model is then defined by its *axioms*. An axiom in the model is an assertion of the acyclicity (or, as in Figure 2.8, the irreflexivity or emptiness) of a relation over L. These relations are constructed using a relation algebra  $\mathcal{A}$ : composing the candidate relations of  $\mathcal{C}_R$  and the existentially quantified relations **co** (coherence-order) and **rf** (reads-from) and the restricted identity relation (**id**<sub>E</sub>, for identity over events with label E), with the standard relation operators: union (|), intersection (&), relation composition (using the flipped operator ;), transitive closure (+), reflexive-transitive closure (\*) and relation inverse ( $^{-1}$ ). The model is then, formally, a set of terms over  $\mathcal{A}$ .

```
\begin{array}{rcl}
\mathcal{A} &:& (\mathcal{C}_R \cup \{\mathbf{co},\mathbf{rf},\mathbf{id}_E\},\ \{\&, |, ;, +, *, ^{-1}\}) \\
\mathsf{Model} &\subseteq& \mathcal{T}(\mathcal{A})
\end{array}
```
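The operators of this algebra have direct set-based readings. The following Python sketch represents a relation as a set of pairs of event names and gives one possible reading of each operator; the helper names and the three-event example are assumptions made for illustration, not part of any tool.

```python
def seq(r, s):
    """Relation composition r; s."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def inverse(r):
    """Relation inverse r^-1."""
    return {(b, a) for (a, b) in r}


def ident(events):
    """Restricted identity [E]: the identity relation on the given events."""
    return {(e, e) for e in events}


def plus(r):
    """Transitive closure r^+, computed as a least fixed point."""
    closure = set(r)
    while True:
        extra = seq(closure, closure) - closure
        if not extra:
            return closure
        closure |= extra


def star(r, universe):
    """Reflexive-transitive closure r^* over the given universe of events."""
    return plus(r) | ident(universe)


def acyclic(r):
    """An acyclicity axiom holds iff r^+ relates no event to itself."""
    return all(a != b for (a, b) in plus(r))


# Hypothetical writer thread of an MP test: a = W x, f = DMB.ST, b = W y.
W, DMBST = {"a", "b"}, {"f"}
po = {("a", "f"), ("f", "b"), ("a", "b")}

# Two clauses in the style of the barrier-ordered-before relation:
#   [W]; po; [dmbst]  |  [dmbst]; po; [W]
bob = seq(seq(ident(W), po), ident(DMBST)) | seq(seq(ident(DMBST), po), ident(W))
```

Here `bob` relates the first write to the barrier and the barrier to the second write; taking `plus(bob)` then orders the two writes with respect to each other.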


Such models are usually defined in the Cat language, and the canonical Cat Armv8-A multi-copy atomic axiomatic model is given by Deacon et al TODO: ?CITE? and can be found in Figure 2.8.

# Instruction fetch: an introduction

### 3.1 Introduction

Computing relies on the *architectural abstraction*: the specification of an envelope of allowed hardware behaviour that hardware implementations should lie within, and that software should assume. These interfaces, defined by hardware vendors and relatively stable over time, notionally decouple hardware and software development; they are also, in principle, the foundation for software verification. In practice, however, industrial architectures have accumulated great complexity and subtlety: the ARMv8-A and Intel architecture reference manuals are now 7476 and 4922 pages [3, 2], and hardware optimisations, including out-of-order and speculative execution, result in surprising and poorly-understood programmer-observable behaviour. Architecture specifications have historically also been entirely informal, describing these complex envelopes of allowed behaviour solely in prose and pseudocode. This is problematic in many ways: such specifications do not serve as clear documentation, with the inevitable ambiguity and incompleteness of informal prose leaving major questions unanswered; without a specification that is executable as a test oracle (that can decide whether some observed behaviour is allowed or not), hardware validation relies on test suites that must be manually curated; without an architecturally-complete emulator (that can exhibit all allowed behaviour), it is very hard for software developers to "program to the specification" – they rely on test-and-debug development, and can only test above the hardware implementation(s) they have; and without a mathematically rigorous semantics, formal verification of hardware or software is impossible.

Over the last 10 years, much has been done to put architecture specifications on a more rigorous footing, so that a single specification can serve all those purposes. There are three main problems, two of which are now largely solved.

The first is the instruction-set architecture (ISA): the specification of the sequential behaviour of individual instructions. This is chiefly a problem of scale: modern industrial architectures such as Arm or x86 have large instruction sets, and each instruction involves many details, including its behaviour at different privilege levels, virtual-to-physical address translation, and so on – a single Arm instruction might involve hundreds of auxiliary functions. Recent work by Reid et al. within Arm [4, 5, 6] transitioned their internal ISA description into a mechanised form, used both for documentation and testing, and with him we automatically translated this into publicly available Sail definitions and thence into theorem-prover definitions [7, 8]. Other related work is in §3.7.

The second is the relaxed-memory concurrent behaviour of "user-mode" operations: memory writes and reads, and the mechanisms that architectures provide to enforce ordering and atomicity (dependencies, memory barriers, load-linked/store-conditional operations, etc.). In 2008, for ARMv7, IBM POWER, and x86, this was poorly understood, and the architects regarded even their own prose specifications as inscrutable. Now, following extensive work by many people [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25], ARMv8-A has a well-defined and simplified model as part of its specification [3, B2.3], including a prose transcription of a mathematical model [26], and an equivalence proof between operational and axiomatic presentations [9, 10]; RISC-V has adopted a similar model [27]; and IBM POWER and x86 have well-established de-facto-standard models. All of these are experimentally validated against hardware, and supported by tools for exhaustively running tests [28, 29]. The combination of these models and the ISA semantics above is enough to let one reason about or model-check concurrent algorithms.

That leaves the third part of the problem: the "system" semantics, of instruction-fetch and cache maintenance, exceptions and interrupts, and address translation and TLB (translation lookaside buffer) maintenance. Just as for "user-mode" relaxed memory, these are all areas where microarchitectural optimisations can have surprising programmer-visible effects, especially in the concurrent context. The mechanisms are relied on by all code, but they are explicitly managed only by systems code, in just-in-time (JIT) compilers, dynamic loaders, operating-system (OS) kernels, and hypervisors. This is, of course, exactly the security-critical computing base, currently trusted but not trustworthy, that is especially in need of verification – which requires a precise and well-validated definition of the architectural abstraction. Previous work has scarcely touched on this: none of seL4 [30], CertiKOS [31, 32], Komodo [33], or [34, 35], address realistic architecture concurrency, and they use (at best) idealised models of the sequential systems architecture. The CakeML [36, 37] and CompCert [38] verified compilers target only sequential user-mode ISA fragments.

In this paper we focus on one aspect of system semantics: instruction fetch and cache maintenance, for ARMv8-A. The ability to execute code that has previously been written to data memory is fundamental to computing: fine-grained self-modifying code is now rare, and (rightly) deprecated, but program loading, dynamic linking, JIT compilation, debugging, and OS configuration all rely on executing code from data writes. However, because these are relatively infrequent operations, hardware designers have been able to optimise by partially separating the instruction and data paths, e.g. with distinct instruction caching, which by default may not be coherent with data accesses. This can introduce programmer-visible behaviour analogous to that of user-mode relaxed-memory concurrency, and require specific additional synchronisation to correctly pick up code modifications. Exactly what these are is not entirely clear in the current ARMv8-A architecture text, just as pre-2018 user-mode concurrency was not.

Our main contribution is to clarify this situation, developing precise abstractions that bring the instruction-fetch part of ARMv8-A system behaviour into the domain of rigorous semantics. Arm have stated [private communication] that they intend to incorporate a version of this into their architecture. We aim thereby to enable future work on system software verification using the techniques of programming languages research: program analysis, model-checking, program logics, etc. We begin (§3.2) by recalling the informal architectural guarantees that Arm provide, and the ways in which real-world software systems such as Linux, JavaScript, and WebAssembly change instruction memory. Then:

(1) We explore the fundamental phenomena and architecture design questions with a series of examples (§3.3). We explore the interactions between instruction fetching, cache maintenance and the 'usual' relaxed memory stores and loads, showing that instruction fetches are more relaxed, and how even fundamental coherence guarantees for data memory do not apply to instruction fetches. Most of these questions arose during the development of our models, in detailed ongoing discussion with the Arm Chief Architect and other Arm staff. They include questions of several different kinds. Six are clear from the Arm prose specification. Of the others: two are not implied by the prose but are natural choices; five involved substantive new choices by Arm that had not previously been considered and/or documented; for two, either choice could be reasonable, and Arm chose the simpler (and weaker) option; and for one, Arm were independently already strengthening the architecture to accommodate existing software.

(2) We give an operational semantics for Arm instruction fetch and icache maintenance (§3.4). This is in an abstract-microarchitectural style that supports an operational intuition for how hardware actually works, while abstracting from the mass of detail and the microarchitectural variation of actual hardware implementations. We do so by extending the Flat model [10] with simple abstractions of instruction caches and the coherent data cache network, in a way that captures the architectural intent, defining the entire envelope of behaviours that implementations should be allowed to exhibit.

(3) We give a more concise presentation of the model in an axiomatic style (§3.5), extending the "user-mode" axiomatic model from previous work [10, 9, 26, 3], and intended to be functionally equivalent. We discuss how this too matches the architectural intent.

(4) **We validate all this** in two ways: by the extensive discussion with Arm staff mentioned above, and by experimental testing of hardware behaviour, on a selection of ARMv8-A cores designed by multiple vendors (§3.6). We run tests on hardware with a mild extension of the Litmus tool [39, 17]. We make the operational model executable as a test oracle by integrating it into the RMEM tool and its web interface [28], introducing optimisations that make it possible to exhaustively execute the examples. We make the axiomatic model executable as a test oracle with a new tool that takes litmus tests and uses a Sail [7] definition of a fragment of the ARMv8-A ISA to generate SMT problems for the model. We then compare hardware and the two models for the handwritten tests (modulo two tests not supported by the axiomatic checker), compare hardware and the operational model on a suite of 1456 tests, automatically generated with an extension of the diy tool [40], and check the operational and axiomatic models against sets of previous non-ifetch tests. In all this data our models are equivalent to each other and consistent with hardware observations, except for one case where our testing uncovered a hardware bug on a Qualcomm device.

Finally, we discuss other related work (§3.7) and conclude (§3.8). We do all this for ARMv8-A, but other relaxed architectures, e.g. IBM POWER and RISC-V, face similar issues; our tests and tooling should enable corresponding work there.

The models are too large to include or explain in full here, so we focus on explaining the motivating examples, the main intuition and style of the operational model, in a prose rendering of its executable mathematics, and the definition of the axiomatic model. Appendices provide additional examples, a complete prose description of the operational model, and additional explanation of the axiomatic model. The complete executable mathematics version, the web-interface tool for running it, and our test results are at https://www.cl.cam.ac.uk/~pes20/iflat/.

Caveats and Limitations Our executable models are integrated with a substantial fragment of the Sail ARMv8-A ISA (similar to that used for CakeML), but not yet with the full ISA model [7, 4, 5, 6]; this is just a matter of additional engineering. We only handle the 64-bit AArch64 part of ARMv8-A, not AArch32. We do not handle the interaction between instruction fetch and mixed-size accesses, or other variants of the cache maintenance instructions, e.g. those used for interaction with DMA engines, and variants by set or way instead of by virtual address. Finally, the equivalence between our operational and axiomatic models is validated experimentally. A proof of this equivalence is essential in the long term, but would be a major work in itself: the complexity makes mechanisation essential, but the operational model (in all its scale and complexity) has not yet been subject to mechanised proof. Without instruction fetch, a non-mechanised proof was the main result of an entire PhD thesis [9], and we expect the addition of instruction fetch to require global changes to the argument.

## 3.2 Industry Practice and the Existing ARMv8-A Prose

Computer architecture relies on a host of sophisticated techniques, including buffering, caching, prediction, and pipelining, for performance. For the normal memory reads and writes of "user-mode" concurrency, the programmer-visible relaxed-memory effects largely arise from store buffering and from out-of-order and speculative pipeline behaviour, not from the cache hierarchy (though some IBM POWER phenomena do arise from the interconnect, and from late processing of cache invalidates). All major architectures provide a strong per-location guarantee of *coherence*: for each memory location, different threads cannot observe the writes to that location in different orders. This is implemented in hardware by coherent cache protocols, ensuring (roughly) that each cache line is writable by at most one hardware thread at a time, and by additional machinery restricting store buffer and pipeline behaviour. Then each architecture provides additional synchronisation mechanisms to let the programmer enforce ordering properties involving multiple locations.

At first sight, one might expect instruction fetches to act like other memory reads but, because writes to instruction memory are relatively rare, hardware designers have adopted different caching mechanisms. The Arm architecture carefully does not mandate exactly what these must be, to allow a wide range of possible hardware implementations, but, for example, a high-performance Arm processor might have per-core separate L1 instruction and data caches, above a unified per-core L2 cache and an L3 cache shared between cores. There may also be additional structures, e.g. per-core fetch queues, and caching of decoded micro-operations. This instruction caching is not necessarily coherent with data memory accesses: "the architecture does not require the hardware to ensure coherency between instruction caches and memory" [3, B2.4.4 (B2-114)]; instead, programmers must use explicit cache maintenance instructions. The documentation gives a particular sequence of these: "If software requires coherency between instruction execution and memory, it must manage this coherency using Context synchronization events and cache maintenance instructions. The following code sequence can be used to allow a processing element (PE) to execute code that the same PE has written."

```
; Coherency example for data and instruction accesses [...]
; Enter this code with <Wt> containing a new 32-bit instruction,
; to be held in Cacheable space at a location pointed to by Xn.
STR Wt, [Xn]   ; Store new instruction
DC CVAU, Xn    ; Clean data cache by virtual address (VA) to PoU
DSB ISH        ; Ensure visibility of the data cleaned from cache
IC IVAU, Xn    ; Invalidate instruction cache by VA to PoU
DSB ISH        ; Ensure completion of the invalidations
ISB            ; Synchronize the fetched instruction stream
```

At first sight, this may be entirely mysterious. The remainder of the paper establishes precise semantics for each instruction, explaining why each is required, but as a rough intuition:

- 1. The DC CVAU, Xn cleans this core's data cache for address Xn, pushing the new write far enough down the hierarchy for an instruction fetch that misses in the instruction cache to be guaranteed to see the new value. This point is the *Point of Unification* (PoU) and is usually the point where the instruction and data caches become unified (L2 for most modern devices).
- 2. The DSB ISH waits for the clean to have happened before letting the later instructions execute (without this, the sequence itself can execute out-of-order, and the clean might not have pushed the write down far enough before the instruction cache is updated). The ISH makes this specific to the *Inner Shareable Domain*: the processor itself, not the system-on-chip. We do not model shareability domains in this paper, so this is equivalent to a DSB SY.
- 3. The IC IVAU, Xn invalidates any entry for that address in the instruction caches for all cores, forcing any future fetch to miss in the instruction cache, and instead read the new value from the data memory hierarchy; it also touches some fetch queue machinery.
- 4. The second DSB ISH ensures the invalidation completes.


- 5. The final ISB flushes this core's pipeline, forcing a re-fetch of all program-order-later instructions.
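The intuition for steps 1 and 3 can be sketched with a toy single-core cache model in Python. This is a deliberately crude illustration and not the operational model of §3.4: the class and method names are hypothetical, the barriers and the ISB are not modelled (the toy executes strictly in order and never prefetches), and where the toy deterministically returns the old instruction, real hardware may return either the old or the new one.

```python
class ToyCore:
    """Toy model: one core with separate instruction and data caches."""

    def __init__(self, memory):
        self.memory = dict(memory)  # contents at the Point of Unification
        self.dcache = {}            # data-cache lines written by this core
        self.icache = {}            # instruction-cache entries

    def store(self, addr, instr):
        """STR: in this toy the write lands in the data cache only."""
        self.dcache[addr] = instr

    def dc_cvau(self, addr):
        """DC CVAU: clean the line, copying it down to the Point of Unification."""
        if addr in self.dcache:
            self.memory[addr] = self.dcache[addr]

    def ic_ivau(self, addr):
        """IC IVAU: drop any instruction-cache entry for this address."""
        self.icache.pop(addr, None)

    def fetch(self, addr):
        """Instruction fetch: hit in the icache, else fill it from the PoU."""
        if addr not in self.icache:
            self.icache[addr] = self.memory[addr]
        return self.icache[addr]


def fetch_after(maintenance):
    """Overwrite an already-fetched instruction, run the given maintenance
    operations in order, and return what the next fetch sees."""
    core = ToyCore({0x1000: "old"})
    core.fetch(0x1000)            # warm the icache with the old instruction
    core.store(0x1000, "new")
    for op in maintenance:
        getattr(core, op)(0x1000)
    return core.fetch(0x1000)
```

In this toy, `fetch_after(["dc_cvau"])` still hits the stale instruction-cache entry, and `fetch_after(["ic_ivau"])` refills the instruction cache from a Point of Unification that the write has not yet reached; only the clean followed by the invalidate makes the new instruction visible to the fetch.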

Some hardware implementations provide extra guarantees, rendering the DC or IC instructions unnecessary. Arm allow software to discover this in an architectural way, by reading the CTR\_EL0 register's DIC and IDC bits. Our modelling handles this, but for brevity we only discuss the weakest case, with CTR\_EL0.DIC=CTR\_EL0.IDC=0, that requires full cache maintenance.
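As a small illustration of this discovery mechanism, the following Python sketch decodes a raw CTR\_EL0 value into the maintenance steps that are still required. The function name is hypothetical, and the bit positions (IDC at bit 28, DIC at bit 29) are stated from our reading of the register description and should be checked against the Arm reference manual before being relied upon.

```python
# Assumed bit positions within CTR_EL0 (to be checked against the manual).
CTR_EL0_IDC_BIT = 28  # set: data cache clean to PoU not required
CTR_EL0_DIC_BIT = 29  # set: instruction cache invalidate to PoU not required


def required_maintenance(ctr_el0):
    """Return which cache maintenance steps software must still perform."""
    idc = (ctr_el0 >> CTR_EL0_IDC_BIT) & 1
    dic = (ctr_el0 >> CTR_EL0_DIC_BIT) & 1
    return {"dc_cvau": idc == 0, "ic_ivau": dic == 0}
```

With both bits clear, the weakest case discussed in this chapter, both steps are reported as required.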

Arm make clear that instructions can be prefetched (perhaps speculatively): "How far ahead of the current point of execution instructions are fetched from is IMPLEMENTATION DEFINED. Such prefetching can be either a fixed or a dynamically varying number of instructions, and can follow any or all possible future execution paths. For all types of memory, the PE might have fetched the instructions from memory at any time since the last Context synchronization event on that PE."

Concurrent modification and instruction fetch require the same sequence, with an ISB on each thread that executes the new instructions, and the rest of the sequence on the modifying thread [3, B2.2.5 (B2-94)]. Concurrent modification without synchronisation is restricted to particular instructions (B (branch), BL (branch-and-link), BRK (break), SMC, HVC, SVC (secure monitor, hypervisor, and supervisor calls), ISB, and NOP), otherwise there could be constrained unpredictable behaviour: "any behavior that can be achieved by executing any sequence of instructions that can be executed from the same Exception level". Concurrent modification of conditional branches is allowed but can result in the old condition with the new target address or vice versa.
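The restriction on unsynchronised concurrent modification can be phrased as a simple predicate over the instruction before and after the change. The Python sketch below works on bare mnemonics only, which is an assumption made for brevity: it does not decode real instruction encodings, and it does not capture the mixing of old and new condition and target for conditional branches.

```python
# Instructions that the prose quoted above permits to be modified
# concurrently without explicit synchronisation.
CONCURRENTLY_MODIFIABLE = frozenset(
    {"B", "BL", "BRK", "HVC", "ISB", "NOP", "SMC", "SVC"}
)


def needs_explicit_synchronisation(old_mnemonic, new_mnemonic):
    """True unless both the old and the new instruction are in the permitted set."""
    return not (old_mnemonic in CONCURRENTLY_MODIFIABLE
                and new_mnemonic in CONCURRENTLY_MODIFIABLE)
```

For example, patching a `NOP` into a `B` falls inside the permitted set, whereas patching a `NOP` into an `ADD` does not.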

All this gives some guidance for programmers, but it leaves the exact semantics of instruction fetch and those cache maintenance instructions unclear, and in practice software typically does not use the above sequence verbatim. For example, it may synchronise a range of addresses at once, looping the DC and IC parts, or the final ISB may be subsumed by instruction synchronisation from exception entry or return. Linux has many places where it modifies code at runtime: in boot-time patching of alternatives, modifying kernel code to specialise it to the particular hardware being run on; when the kernel loads code (e.g. when the user calls dlopen); and in the ptrace system call, used e.g. by the GDB debugger to patch arbitrary instructions with breakpoints at runtime. In Google's Chrome web browser, its WebAssembly and JavaScript just-in-time (JIT) compilers are required to both write new code during execution and modify existing code at runtime. In JavaScript, this modification happens inside a single thread and so is quite straightforward. The WebAssembly case is more complex, as one thread is modifying the code of another. A software thread can also be moved (by the OS or hypervisor) from one hardware thread to another, perhaps while it is in the middle of some instruction cache maintenance. Moreover, for security reasoning, we have to be able to bound the possible behaviour of arbitrary code.
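The range variant mentioned above, looping the DC and IC parts, might be laid out as in the following Python sketch, which only generates a textual listing of the operations rather than executing anything. The function name, the default line size of 64 bytes, and the use of literal addresses in place of register operands are all assumptions for illustration; real code would derive the line sizes from the hardware.

```python
def sync_sequence(start, length, line_size=64):
    """List the maintenance operations for the byte range [start, start+length).

    `line_size` is assumed to be a power of two.
    """
    if length <= 0:
        return []
    first_line = start & ~(line_size - 1)  # round down to a line boundary
    lines = range(first_line, start + length, line_size)
    ops = [f"DC CVAU, {addr:#x}" for addr in lines]   # clean every line
    ops.append("DSB ISH")                             # wait for the cleans
    ops += [f"IC IVAU, {addr:#x}" for addr in lines]  # invalidate every line
    ops += ["DSB ISH", "ISB"]                         # wait, then resynchronise
    return ops
```

Batching in this way issues each barrier once per range instead of once per cache line, which is one reason the per-instruction semantics matters: the barriers must be understood independently of any particular single-address sequence.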

All this means that we cannot treat the above sequence as a whole, as an opaque black box. Instead, we need a precise semantics for each individual instruction, but the existing prose documentation does not provide that.

The problem we face is to give such a semantics, that correctly defines behaviour in arbitrary concurrent contexts, that captures the Arm architectural intent, that is strong enough for software, and that abstracts from the variety of hardware implementations (e.g. with differing cache structures) that the architecture intends to allow – but which programmers should not have to think about.

[3, B2.4.4 (B2-114)] Synchronization and coherency issues between data and instruction accesses

How far ahead of the current point of execution instructions are fetched from is IMPLEMENTATION DEFINED. Such prefetching can be either a fixed or a dynamically varying number of instructions, and can follow any or all possible future execution paths. For all types of memory:

- ▶ The PE might have fetched the instructions from memory at any time since the last Context synchronization event on that PE.
- ▶ Any instructions fetched in this way might be executed multiple times, if this is required by the execution of the program, without being refetched from memory. In the absence of a Context synchronization event, there is no limit on the number of times such an instruction might be executed without being refetched from memory.

The Arm architecture does not require the hardware to ensure coherency between instruction caches and memory, even for locations of shared memory.

If software requires coherency between instruction execution and memory, it must manage this coherency using Context synchronization events and cache maintenance instructions. The following code sequence can be used to allow a PE to execute code that the same PE has written.

```
; Coherency example for data and instruction accesses within the same Inner Shareable domain.
; Enter this code with <Wt> containing a new 32-bit instruction,
; to be held in Cacheable space at a location pointed to by Xn.
STR Wt, [Xn]
DC CVAU, Xn    ; Clean data cache by VA to point of unification (PoU)
DSB ISH        ; Ensure visibility of the data cleaned from cache
IC IVAU, Xn    ; Invalidate instruction cache by VA to PoU
DSB ISH        ; Ensure completion of the invalidations
ISB            ; Synchronize the fetched instruction stream
```

#### Note:


- ▶ For Non-cacheable or Write-Through accesses, the clean data cache by VA instruction is not required. However, the invalidate instruction cache instruction is required because the ARMv8-A AArch64 architecture allows Non-cacheable accesses to be held in an instruction cache. See Non-cacheable accesses and instruction caches on page D4-2359.
- ▶ This code can be used when the thread of execution modifying the code is the same thread of execution that is executing the code. The Armv8 architecture limits the set of instructions that can be executed by one thread of execution as they are being modified by another thread of execution without requiring explicit synchronization. See Concurrent modification and execution of instructions on page B2-94.
- ▶ The system software controls whether these cache maintenance instructions are available to the application level by setting SCTLR\_EL1.UCI.

Note: If this sequence is not executed between writing data to a location and executing the instruction at that location, the lack of coherency between instruction caches and memory means that the instructions that are executed might be the old instruction or the updated instruction, and which is used can arbitrarily vary during execution. It must not be assumed by software, before the synchronization sequence is executed, that when the updated instruction has been seen, the old instruction will not be seen again.

[3, B2.2.5 (B2-94)] Concurrent modification and execution of instructions

The Armv8 architecture limits the set of instructions that can be executed by one thread of execution as they are being modified by another thread of execution without requiring explicit synchronization.

Concurrent modification and execution of instructions can lead to the resulting instruction performing any behavior that can be achieved by executing any sequence of instructions that can be executed from the same Exception level, except where each of the instruction before modification and the instruction after modification is one of a B, BL, BRK, HVC, ISB, NOP, SMC, or SVC instruction.

For the B, BL, BRK, HVC, ISB, NOP, SMC, and SVC instructions the architecture guarantees that, after modification of the instruction, behavior is consistent with execution of either:

- ▶ The instruction originally fetched.
- ▶ A fetch of the modified instruction.

If one thread of execution changes a conditional branch instruction, such as B or BL, to another conditional instruction and the change affects both the condition field and the branch target, execution of the changed instruction by another thread of execution before the change is synchronized can lead to either:

- ▶ The old condition being associated with the new target address.
- ▶ The new condition being associated with the old target address.

These possibilities apply regardless of whether the condition, either before or after the change to the branch instruction, is the always condition.

For all other instructions, to avoid UNPREDICTABLE or CONSTRAINED UNPREDICTABLE behavior, instruction modifications must be explicitly synchronized before they are executed. The required synchronization is as follows:

1. No PE must be executing an instruction when another PE is modifying that instruction.
2. To ensure that the modified instructions are observable, a PE that is writing the instructions must issue the following sequence of instructions and operations:

```
; Coherency example for data and instruction accesses within the same Inner Shareable domain.
; Enter this code with <Wt> containing a new 32-bit instruction,
; to be held in Cacheable space at a location pointed to by Xn.
STR Wt, [Xn]
DC CVAU, Xn   ; Clean data cache by VA to point of unification (PoU)
DSB ISH       ; Ensure visibility of the data cleaned from cache
IC IVAU, Xn   ; Invalidate instruction cache by VA to PoU
DSB ISH       ; Ensure completion of the invalidations
```

Note:

- ▶ The DC CVAU operation is not required if the area of memory is either Non-cacheable or Write-Through Cacheable.
- ▶ If the contents of physical memory differ between the mappings, changing the mapping of VAs to PAs can cause the instructions to be concurrently modified by one PE and executed by another PE. If the modifications affect instructions other than those listed as being acceptable for modification, synchronization must be used to avoid UNPREDICTABLE or CONSTRAINED UNPREDICTABLE behavior.

3. In a multiprocessor system, the IC IVAU is broadcast to all PEs within the Inner Shareable domain of the PE running this sequence. However, when the modified instructions are observable, each PE that is executing the modified instructions must issue the following instruction to ensure execution of the modified instructions:

```
ISB ; Synchronize fetched instruction stream
```

For more information about the required synchronization operation, see Synchronization and coherency issues between data and instruction accesses on page B2-114.

Note: For information about memory accesses caused by instruction fetches, see Ordering relations on page B2-100.

### 3.3 Instruction Fetch Phenomena and Examples

We now describe the main instruction-fetch phenomena and architecture design questions for ARMv8-A, illustrated by handwritten litmus tests, to guide the following model design.

### 3.3.1 Instruction-Fetch Atomicity

The first point, as mentioned in §3.2, is that concurrent modification and fetch is only permitted if the original and modified instructions are in a particular set: various branches, supervisor/hypervisor/secure-monitor calls, the ISB instruction synchronisation barrier, and NOP. Otherwise, the architecture permits *constrained unpredictable* behaviour, meaning that the resulting machine state could be anything that would be reachable by arbitrary instructions at the same exception level. The following W+F test illustrates this.

| W+F | AArch64 |
|-----|---------|
| Initial state: 0:W0="SUB X0,X0,#1", 0:X1=1 | |
| Thread 0 | Thread 1 |
| STR W0,[X1] // modify Thread 1 at 1 | 1: ADD X0,X0,#1 // initial code |
| Allowed: constrained-unpredictable final state | |

In this test Thread 0 performs a memory store (with the STR instruction) to the code that Thread 1 is executing, overwriting the ADD X0, X0, #1 instruction with the 32-bit encoding of the SUB X0, X0, #1 instruction. If the fetch were atomic, the outcome of this test would be the result of executing either the ADD or the SUB instruction, but, since at least one of those is not in the set of the 8 atomically-fetchable instructions given previously, Thread 1 has constrained-unpredictable behaviour and the final state is very loosely constrained. Note, however, that this is nonetheless much stronger than the C/C++ whole-program undefined behaviour in the presence of a data race: unlike C/C++, a hardware architecture has to define a useful envelope of behaviour for arbitrary code, to provide guarantees for the rest of the system when one user thread has a race.
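The rule that W+F exercises can be sketched as a small check. This is only an illustration of the prose above, not part of our models: the function name and the representation of instructions as bare mnemonic strings are assumptions made for the example, and conditional branches (discussed next) are deliberately not handled.

```python
# The eight instructions the quoted text lists as safe to modify while
# another thread may be fetching them.
ATOMICALLY_FETCHABLE = frozenset(
    {"B", "BL", "BRK", "HVC", "ISB", "NOP", "SMC", "SVC"}
)


def fetch_outcomes(old, new):
    """Possible results of a fetch racing with a write replacing `old` by `new`.

    Both arguments are bare mnemonics such as "ADD" or "NOP"; this encoding
    is an illustrative assumption, not how the architecture names instructions.
    """
    if old in ATOMICALLY_FETCHABLE and new in ATOMICALLY_FETCHABLE:
        # The fetch behaves as one instruction or the other.
        return {old, new}
    # Otherwise the architecture only bounds the behaviour loosely.
    return {"CONSTRAINED UNPREDICTABLE"}


# W+F overwrites an ADD with a SUB; neither is in the set.
w_plus_f = fetch_outcomes("ADD", "SUB")
```

Under these assumptions, replacing a NOP by an unconditional B yields exactly the two instructions as outcomes, whereas W+F falls into the constrained-unpredictable case.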

#### Conditional Branches

For conditional branches, the Arm architecture provides a specific non-single-copy-atomic fetch guarantee: the execution will be consistent with either the old or new target, and either the old or new condition.

For example, this W+F+branches test can overwrite a B.EQ g with a B.NE h, and end up executing B.NE g or B.EQ h instead of one of those. Our future examples will only modify NOPs and unconditional branch instructions.

| W+F+branches | AArch64 |
|--------------|---------|
| Initial state: 0:W0="B.NE h", 0:X1=1 | |
| Thread 0 | Thread 1 |
| STR W0,[X1] | 1: B.EQ g |
| Allowed: execute "B.NE g" | |
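The conditional-branch guarantee amounts to choosing the condition and the target independently from the old and new instructions. A minimal sketch, assuming a branch is represented as a hypothetical (condition, target) pair:

```python
from itertools import product


def branch_fetch_outcomes(old, new):
    """All (condition, target) pairs a racing fetch may end up executing.

    `old` and `new` are (condition, target) tuples such as ("EQ", "g");
    this representation is an assumption made for the example.
    """
    (old_cond, old_target), (new_cond, new_target) = old, new
    return set(product({old_cond, new_cond}, {old_target, new_target}))


# W+F+branches: B.EQ g is overwritten with B.NE h.
outcomes = branch_fetch_outcomes(("EQ", "g"), ("NE", "h"))
```

For W+F+branches this gives four candidates: the two original instructions plus the mixed B.NE g and B.EQ h. If only the target changes, the set collapses to the old and the new instruction.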

### 3.3.2 Coherence

Data writes and reads are coherent, in Arm and in other major architectures: in any execution, for each address, the reads of each hardware thread must see a subsequence of the total *coherence order* of all writes to that address. The plain-data CoRR test [18] illustrates one case of this: it is forbidden for a thread to read a new write of x and then the initial state for x. However, instruction fetches are not necessarily coherent: one instruction fetch may be inconsistent with a program-order-previous fetch, and the data and instruction streams can become out-of-sync with each other. We explore three kinds of coherence:

- Instruction-to-Instruction Coherence: whether fetches of the same location must observe writes to the same location coherently.
- Data-to-Instruction Coherence: whether fetches and then reads of the same location must observe writes to the same location coherently.
- Instruction-to-Data Coherence: whether reads and then fetches of the same location must observe writes to the same location coherently.

#### Instruction-to-Instruction Coherence

Arm explicitly do not guarantee any consistency between fetches of the same location: fetching an instruction does not mean that a later fetch of that location will not see an older instruction [3, B2.4.4]. This is illustrated by CoFF, like CoRR but with fetches instead of reads.

Here Thread 1 makes two calls to address f (BL is branch-and-link), while Thread 0 overwrites the instruction at that address. The interesting potential execution is that in which the first call to f fetches and executes the newly-written B 11, but the second call fetches and executes the original B 10.

| CoFF | AArch64 | |
|------|---------|---|
| Initial state: 0:W0="B 11", 0:X1=f | | |
| Thread 0 | Thread 1 | Common |
| STR W0,[X1] //a | BL f<br>MOV X0,X10<br>BL f<br>MOV X1,X10 | f: B 10<br>11: MOV X10,#2<br>RET<br>10: MOV X10,#1<br>RET |
| Allowed: 1:X0=2, 1:X1=1 | | |

[Figure: execution graph for CoFF. Events: a: write f = B 11 (Thread 0); b: fetch f = B 11; c: fetch f = B 10. Edges: irf from a to b, fpo from b to c, irf from the initial state to c.]

We can view such executions as graphs, similar to previous axiomatic-model candidate executions but with new fetch events, one per instruction, and new edges. As usual, we use po and rf edges for the program-order and reads-from relations, together with:

- ▶ fe (fetch-to-execute), which relates the fetch event of an instruction to all the execution events (memory writes, reads or barriers) of the instruction;
- ▶ irf (instruction-read-from), relating a write to all fetches that read from it (analogous to reads-from, rf);
- ▶ fpo (fetch-program-order), relating fetches of instructions that are in program order (analogous to program order, po).

Edges from the initial state are drawn from a small circle. Since we do not modify the code of most locations, we usually omit the fetch events for those instructions, showing only a subgraph of the interesting events, e.g. as on the right above. For Arm, this execution is both architecturally allowed and experimentally observed.

Here, and in future tests, we assume some common code consisting of a function at address f which always has the same shape: a branch that might be overwritten, which selects a block that writes a value to register X10 before returning. This is sometimes duplicated at different addresses (f1, f2, ...) or extended to g, with three cases. We sometimes elide the common code.

#### Data-to-Instruction Coherence

Fetching from a particular write does imply that program-order-later reads from the same address will see that write (or a coherence successor thereof). This is a *data-to-instruction* coherence property, illustrated by CoFR below. Here Thread 1 fetches the newly-written B 11 at f and then, when reading from f with its LDR load instruction, cannot read the original B 10 instruction (it can only read the new B 11).

| CoFR | AArch64 | |
|------|---------|---|
| Initial state: 0:W0="B 11", 0:X1=f, 1:X2=f | | |
| Thread 0 | Thread 1 | Common |
| STR W0,[X1] | BL f<br>MOV X0,X10<br>LDR X1,[X2] | f: B 10<br>11: MOV X10,#2<br>RET<br>10: MOV X10,#1<br>RET |
| Forbidden: 1:X0=2, 1:X1="B 10" | | |



This is not clear in the existing prose specification, but the architectural intent that emerged during discussion with Arm is that the given execution should be forbidden, reflecting microarchitectural choices that (1) instructions decode in order, so the fetch b must occur before the read d, and (2) fetches that miss in the instruction cache must read from data storage, so the instruction cache cannot be ahead of the available data. Together these ensure that once any thread has fetched from a write, all threads are guaranteed to read from that write (or from another write coherence-after it).

#### Instruction-to-Data Coherence

In the other direction, reading from a particular write to some location does *not* imply that later fetches of that location will see that write (or a coherence successor), as in the following CoRF+ctrl-isb.

| CoRF+ctrl-isb | AArch64 | |
|---------------|---------|---|
| Initial state: 0:W0="B 11", 0:X1=f, 1:X2=f | | |
| Thread 0 | Thread 1 | Common |
| STR W0,[X1] | LDR X0,[X2]<br>CBNZ X0,1<br>1: ISB<br>BL f<br>MOV X1,X10 | f: B 10<br>11: MOV X10,#2<br>RET<br>10: MOV X10,#1<br>RET |
| Allowed: 1:X0="B 11", 1:X1=1 | | |



Here Thread 1 has a control dependency and an instruction synchronisation barrier (the CBNZ conditional branch, dependent on the value read by its LDR load, and ISB), abbreviated to ctrl+isb, between its load and the fetch from f. If the latter were a data load, this would ensure the two loads are satisfied in order. That the fetch can nonetheless see the old instruction is not explicit in the existing prose, but it is what one would expect, and it is observed in practice. Microarchitecturally, it is easily explained by an out-of-date entry for f in the instruction cache of Thread 1: if Thread 1 had previously fetched f (perhaps speculatively), and that instruction cache entry has not been evicted or explicitly invalidated since, then this fetch of f will simply read the old value from the instruction cache without going out to data memory. The ISB ensures that f is freshly fetched, but does not ensure that Thread 1's instruction cache is up-to-date with respect to data memory.

### 3.3.3 Instruction Synchronisation

Instruction fetches satisfy few guarantees, so explicit synchronisation must be performed when modifying the instruction stream.

#### Same-Thread Synchronisation

Test SM below shows the simplest self-modifying code case: without additional synchronisation, a write to program memory can be ignored by a program-order-later fetch.

| SM | AArch64 |
|----|---------|
| Initial state: 0:W0="B 11", 0:X1=f | |
| Thread 0 | Common |
| STR W0,[X1] // a<br>BL f<br>MOV X0,X10 | f: B 10<br>11: MOV X10,#2<br>RET<br>10: MOV X10,#1<br>RET |
| Allowed: 1:X0=1 | |



In this execution, the fetch b, fetching the instruction at f, fetches a value from a write coherence-before a, even though b is the fetch of an instruction program-order after a. We illustrate this with an *instruction from-reads* (ifr) edge. This is a derived relation, analogous to the usual *from-reads* (fr) relation, that relates each fetch to all writes that are coherence-after the write it read from; it is defined as $\mathit{ifr} = \mathit{irf}^{-1} ; \mathit{co}$. If the fetch were a data read, this would be a forbidden coherence shape (CoWR). As it is, it is architecturally allowed, as described explicitly by Arm [3, B2.4.4], and it is experimentally observed on all devices we have tested. Microarchitecturally, this too is simply due to fetches from old instruction cache entries.
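To make the derived relation concrete, the following sketch computes ifr for the SM execution from irf and co, with relations represented as sets of pairs. The event name `init` (the initial write of B 10 to f) and the helper names are assumptions introduced for the example.

```python
def inverse(rel):
    """Inverse of a binary relation given as a set of pairs."""
    return {(b, a) for (a, b) in rel}


def compose(r, s):
    """Sequential composition r ; s of two binary relations."""
    return {(a, c) for (a, b1) in r for (b2, c) in s if b1 == b2}


# SM execution: "init" is the initial write of B 10 to f, "a" is the STR
# writing B 11, and "b" is the fetch of f.
irf = {("init", "b")}   # b fetched the initial instruction
co = {("init", "a")}    # a is coherence-after the initial write

ifr = compose(inverse(irf), co)
```

Here ifr relates the fetch b to the write a, which is the edge that would close a forbidden CoWR cycle if b were a data read.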

#### Cache Maintenance

As we saw in §3.2, the Arm architecture provides cache maintenance instructions to synchronise the instruction and data streams: the DC data-cache clean and IC instruction-cache invalidate instructions. To forbid the relaxed outcome of SM, by forcing a fetch of the modified code, the specified sequence of cache maintenance instructions must be inserted, with an ISB.

```
SM+cachesync-isb                          AArch64
Initial state: 0:W0="B 11", 0:X1=f
Thread 0
STR W0,[X1]     //overwrite f with branch
DC CVAU, X1     //clean data cache
DSB ISH
IC IVAU, X1     //invalidate instruction cache
DSB ISH
ISB             //flush pipeline
BL f
MOV X0, X10
Forbidden: 1:X0=1
```

[Figure: execution graph for SM+cachesync-isb. Thread 0 events: a: write f = B 11; b: ISB; c: fetch f = B 10. Edges: cachesync from a to b, isb from b to c.]

Now the outcome is forbidden. The cache synchronisation sequence DC CVAU; DSB ISH; IC IVAU; DSB ISH (which we abbreviate to a single cachesync edge) ensures that by the time the ISB executes, the instruction and data memory have been made coherent with each other for f. The ISB then ensures the final fetch of f is ordered after this sequence. The microarchitectural intuition for this was given in §3.2; our §3.4 operational model will describe the semantics of each instruction.
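The intuition for why both halves of the sequence are needed can be illustrated with a deliberately tiny, sequential toy model of one core. It is not the operational model of §3.4: the class and method names are hypothetical, barriers (DSB, ISB) are not modelled because the toy executes strictly in order, and a single dictionary stands in for everything at or beyond the Point of Unification.

```python
class ToyCore:
    """One core with a data cache, an instruction cache and unified memory."""

    def __init__(self, memory):
        self.memory = dict(memory)  # contents at the point of unification
        self.dcache = {}            # dirty data-cache lines
        self.icache = {}            # possibly stale instruction-cache lines

    def store(self, addr, value):   # STR: the write lands in the data cache
        self.dcache[addr] = value

    def dc_cvau(self, addr):        # DC CVAU: clean the line to the PoU
        if addr in self.dcache:
            self.memory[addr] = self.dcache[addr]

    def ic_ivau(self, addr):        # IC IVAU: drop the instruction-cache line
        self.icache.pop(addr, None)

    def fetch(self, addr):          # an instruction-cache miss refills from the PoU
        if addr not in self.icache:
            self.icache[addr] = self.memory[addr]
        return self.icache[addr]


def run_sm(sync):
    """Run the SM shape with the given maintenance operations after the store."""
    core = ToyCore({"f": "B 10"})
    core.fetch("f")                 # an earlier (perhaps speculative) fetch of f
    core.store("f", "B 11")
    for op in sync:
        getattr(core, op)("f")
    return core.fetch("f")


plain = run_sm([])
synced = run_sm(["dc_cvau", "ic_ivau"])
```

In this toy, the final fetch sees the new branch only when the clean and the invalidate are both performed; with either one alone it still returns the old B 10, either from the stale instruction-cache line or from memory that was never updated.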

#### Cross-Thread Synchronisation

We now consider modifying code that can be fetched by other threads, using variants of the standard message-passing shape MP. MP checks whether two writes (to different locations) on one thread can be seen out-of-order by two reads on another thread; here we replace one or both of those reads by fetches, and ask what synchronisation is required to ensure that the relaxed outcome is forbidden. Consider first an MP variant where the first write is of a new instruction, and the second is just a simple data memory flag:





This test includes sufficient synchronisation on each thread to enforce thread-local ordering of data accesses: the DMB in Thread 0 ensures the writes a and b propagate to memory in program order, and the control-dependency into an ISB on Thread 1 ensures the read c and the fetch e happen in program order. However, as we saw in §3.2, this is not enough to synchronise concurrent modification and execution of code in ARMv8-A. Thread 0 needs the entire cache synchronization sequence (giving test MP.RF+cachesync+ctrl-isb, not shown), not just a DMB, to forbid this outcome.

Another variant of this MP-shaped test, in which the message passing itself is done by modifying code, gives a much stronger guarantee, as can be seen from the following MP.FR+dmb+fpo-fe test. This is not clear from the architecture manual, but this outcome is already forbidden with only the DMB. This is for similar reasons to the above CoFR test: since Thread 1 fetched the updated value for f, we know that value must have reached at least the data caches (since that is where the instruction cache reads from) and therefore multi-copy atomicity guarantees that a normal load instruction will observe it.

The final variant of these MP-shaped tests has both Thread 0 writes be of new instructions. This idiom is very common in practice; it is currently how Chrome's WebAssembly JIT synchronises the modified thread with the new code.

| MP.FF+dmb+fpo | AArch64 |
|---------------|---------|
| Initial state: 0:W0="B 11", 0:X1=f1, 0:W2="B 11", 0:X3=f2 | |
| Thread 0 | Thread 1 |
| STR W0,[X1]<br>DMB ISH<br>STR W2,[X3] | BL f2<br>MOV X0,X10<br>BL f1<br>MOV X1,X10 |
| Allowed: 1:X0=2, 1:X1=1 | |



Without the full cachesync sequence on Thread 0, this is an allowed outcome. Interestingly, adding the cachesync sequence to Thread 0 (Test MP.FF+cachesync+fpo, not shown) is sufficient to make the outcome forbidden, without an ISB in Thread 1, as the cachesync sequence is intended to make it appear that fetches occur in program order. Microarchitecturally, that could be ensured in two ways: either by actually fetching in-order, or by making the IC instruction not only invalidate all the instruction caches (for this address) but also clean stale entries (for this address) from any core's pre-fetch buffer. Architecturally, this is not clear in the current prose, but, concurrent with this work, Arm were independently strengthening their definition to make it so.

#### Incremental Synchronisation

The cache synchronisation sequence need not be contiguous, or even all in the same thread. So long as the sequence in its entirety has been performed by the time the fetch happens, then the instruction stream will have been made consistent with the data stream for that address.

This is demonstrated by the following test, where Thread 0 performs a write to f and then only a DC before synchronizing with Thread 1, which performs the IC, while Thread 2 observes the modified code. This can happen in practice when a software thread is migrated between hardware threads at runtime, by a hypervisor or OS. Thread 0 and Thread 1 may just represent the runtime scheduling of a single process, beginning execution on hardware Thread 0 but migrated to hardware Thread 1 between the DC and IC instructions. In the graph, the dcsync and icsync represent the DC;DSB ISH and DSB ISH;IC;DSB ISH combinations. The DC does not need a preceding DSB ISH because it is ordered w.r.t. the preceding store to the same cache line.





Here the IC gets broadcast to all threads [3, B2.2.5p3], and so the fact that it happens on a different thread to the DC does not affect the outcome. Similarly, if the DC were to happen on another thread first (to get the test MP.RF+[dc]-ic+ctrl-isb, not shown), then it would have the effect of ensuring consistency globally, for all threads.

### 3.3.4 Multi-Copy Atomicity

For data accesses, the question of whether they are *multi-copy atomic* is a crucial one for relaxed architectures. IBM POWER, ARMv7, and pre-2018 ARMv8-A are/were non-multi-copy atomic: two writes to different addresses could become visible to distinct other threads in different orders. Post-2018 ARMv8-A and RISC-V are multi-copy atomic (or "other multi-copy-atomic" in Arm terminology) [10, 9, 3]: the programmer can assume there is a single shared memory, with all relaxed-memory effects due to thread-local out-of-order execution.

However, for fetches, due to the lack of any fetch atomicity guarantee for most instructions (§3.3.1), and the lack of coherent fetches for the others (§3.3.2), the question of multi-copy atomicity is not particularly interesting. Tests are either trivially forbidden (by data-to-instruction coherence), or they are allowed and only the full cache synchronisation sequence provides enough guarantees to forbid them; and (§3.3.3) that sequence ensures all cores share the same consistent view of memory.

### 3.3.5 Strength of the IC Instruction

#### Multiple Points of Unification

Cleaning the data cache, using the DC instruction, makes a write visible to instruction memory. It does this by pushing the write past the Point of Unification. However, there may be multiple Points of Unification: one for each core, where its own instruction and data memory become unified, and one for the entire system (or shareability domain) where all the caches unify. Fetching from a write implies that it has reached the closest PoU, but does not imply it has reached any others, even if the write originated from a distant core. Consider:

Here Thread 0 modifies f, Thread 1 fetches the new value and performs just an IC and DSB, before signalling Thread 0 which also fetches f. That IC is not strong enough to ensure that the write is pulled into the instruction cache of Thread 0.

This is not clear in the existing prose, but the architectural intent is that it be allowed (i.e., that IC is weak in this respect). We have not so far observed it in practice. The write may have passed the Point of Unification for Thread 1, but not the shared Point of Unification for both threads. In other words, the write might reach Thread 1's instruction cache without being pushed down from Thread 0's data cache. Microarchitecturally this can be explained by *direct data intervention* (DDI), an optimisation allowing cache lines to be migrated directly from one thread's (data) cache to another. The line could be migrated from Thread 0 to Thread 1, then pushed past Thread 1's Point of Unification, making it visible to Thread 1's instruction memory without ever making it visible to Thread 0's own instruction memory. The lack of coherence between instruction and data caches would make this observable, even in multi-copy atomic machines.

#### Stale Fetches

So far, we have only talked about fetching from two distinct writes. But theoretically there is no limit to how far back we can fetch from, with insufficient synchronization. The MP.RF+dmb+ctrl-isb test (§3.3.3) required the full cachesync sequence to forbid the given behaviour. Below we give a test, FOW, similar to that MP-shaped test but allowing many consumer threads to independently and simultaneously see different values in their instruction memory, even after invalidating their caches.

This is not clear in the existing architecture text. It is a case where the architecture design is not very constrained. On the one hand, it has not been observed, and it is thought unlikely that hardware will ever exhibit this behaviour: it would require keeping multiple writes in the coherent part of the data caches, rather than a single dirty line, which would require more complex cache coherence protocols. On the other hand, there does not seem to be any benefit to software from forbidding it. Arm therefore prefer the choice that gives a simpler and weaker model (here the two happen to coincide), to make it easier to understand and to provide more flexibility for future microarchitectural optimisations. We therefore design our models to allow the above behaviour.





### 3.3.6 Strength of the DC Instruction

#### Instruction Cache Depth

Test CoFF (§3.3.2) showed that fetches can see "old" writes. In principle, there is no limit to the depth of the instruction-cache hierarchy: there could be many values for a single location cached in the instruction memory for each core, even if the data cache has been cleaned. The test below illustrates this, with Thread 1 able to see all three values for g.

| MP.RF+dc+ctrl-isb-isb | AArch64 | |
|-----------------------|---------|---|
| Initial state: 0:W0="B 11", 0:X2=g, 0:W1="B 12", 0:X3=1, 0:X4=x, [x]=0, 1:X4=x | | |
| Thread 0 | Thread 1 | Common |
| STR W0, [X2]<br>STR W1, [X2]<br>DSB ISH<br>DC CVAU, X2<br>DSB ISH<br>STR X3, [X4] | LDR X0, [X4]<br>CBNZ X0, 1<br>1: ISB<br>BL g<br>MOV X1,X10<br>ISB<br>BL g<br>MOV X2,X10<br>ISB<br>BL g<br>MOV X3,X10 | g: B 10<br>12: MOV X10,#3<br>RET<br>11: MOV X10,#2<br>RET<br>10: MOV X10,#1<br>RET |
| Allowed: 1:X0=1, 1:X1=3, 1:X2=2, 1:X3=1 | | |



This is similar to the preceding FOW case: it is thought unlikely that hardware will exhibit this in practice, but the desire for the simpler and weaker option means the architectural intent is to allow it, and we follow that in our models.

### 3.4 An Operational Semantics for Instruction Fetch

Previous work on operational models for IBM POWER and Arm "user-mode" concurrency [18, 16, 13, 12, 11, 10] has shown, surprisingly, that as far as programmer-visible behaviour is concerned, one can abstract from almost all hardware implementation details of data memory (store queues, the cache hierarchy, the cache protocol, etc.). For ARMv8-A, following their 2018 shift to a multicopy-atomic architecture, one can do so completely: the *Flat* model of [10] has a shared flat memory, with a per-thread out-of-order thread subsystem, modelling pipeline effects, responsible for all observable relaxed behaviour. For instruction-fetch, it is no longer possible to abstract completely from the data and instruction cache hierarchy, but we can still abstract from much of it.

#### The Flat Model

The Flat model is a small-step operational semantics for multi-copy atomic ARMv8-A, including the relaxed behaviours of loads and stores [10]. Its states are abstract machine states consisting of a tree of instructions for each thread, and a flat memory subsystem shared by all threads. Each instruction in each thread corresponds to a sequence of transitions, with some guards and a potential effect on the shared memory state. The Flat model is made executable in our RMEM tool, which can exhaustively interleave transitions to enumerate all the possible behaviours. The tree of instructions for each thread models out-of-order and speculative execution explicitly. Below we show an example for a thread that is executing 10 instruction instances. Some (grey) are finished, no longer subject to restart; others (pink) have run some but perhaps not all of their instruction semantics; instructions are not necessarily atomic. Those with multiple children are branch instructions with multiple potential successors speculated simultaneously.



For each state, the model defines the set of allowed transitions, each of which steps to a new machine state.

Transitions correspond to steps of single instructions, and individual instructions may give rise to many. Example transitions include Register Write, Propagate Write to Memory, etc.

#### iFlat Extension

Originally, Flat had a fixed instruction memory, with a single transition that can speculate the address of any program-order successor of any instruction in flight, fetch it from the fixed instruction memory, and decode it.

We now remove that fixed instruction memory, so that instructions can be fetched from data writes, and add the additional structures as shown on the right. These are all of unbounded size, as is appropriate for an architecture definition.

#### Fetch Queues (per-thread)

These are ordered buffers of pre-fetched entries, waiting to be decoded and begin execution. Entries are either a fetched 32-bit opcode, or an unfetched request. The fetch queues allow the model to speculate and pre-fetch many instructions ahead of where the thread is currently executing. The model's fetch queues abstract from multiple real-hardware structures: instruction queues, line-fill buffers, loop buffers, and slots objects. We keep a close relation to this underlying microarchitecture by allowing out-of-order fetches, but we believe this is not experimentally observable on real hardware.

#### Abstract Instruction Caches (per-thread)

These are just sets of writes. When the fetch queue requests a new entry, it gets satisfied from the instruction cache, either immediately (a *hit*) or at some later point in time (a *miss*). The instruction cache can contain many possible writes for each location (§3.3.6), and it can be spontaneously updated with new writes in the system at any time ([3, B2.4.4]). To manage IC instructions, each thread keeps a list of addresses yet to be invalidated by in-flight ICs.

#### Data Cache (global)

Above the single shared flat memory for the entire system, which sufficed for the multi-copy-atomic ARMv8-A data memory, we insert a shared buffer which is just a list of writes, abstracting from the many possible coherent data cache hierarchies. Data reads must be coherent, reading from the most recent write to the same address in the buffer, but instruction fetches are allowed to read from any such write in the buffer (§3.3.2).

#### Transitions

To accommodate instruction fetch and cache maintenance, we introduce new transitions: Fetch Request, Fetch Instruction, Fetch Instruction (Unpredictable), Fetch Instruction (B.cond), Decode Instruction, Begin IC, Propagate IC to Thread, Complete IC, Perform DC, and Update Instruction Cache. We also have to modify some Flat transitions: Commit ISB, Wait for DSB, Commit DSB, Propagate Memory Write, and Satisfy Read from Memory. These transitions define the lifecycle of each instruction: a request gets issued for the fetch, then at some later point the fetch gets satisfied from the instruction cache, the instruction is then decoded (in program-order) and then handed to the existing semantics to be executed. To give a flavour, we show just one, the *Propagate IC to Thread* transition, which is responsible for invalidation of the abstract instruction caches. This is a prose rendering of the rule in our executable mathematical model, which is expressed in the typed functional subset of Lem [41].

### Propagate IC to Thread

An instruction *i* (with ID *iiid*) in state WAIT\_IC(address, state\_cont) can do the relevant invalidate for any thread *tid*', modifying that thread's instruction cache and fetch queue, if there exists a pending entry (*iiid*, address) in that thread's *ic* writes. Action:

- 1. for any entry in the fetch queue for thread *tid*', whose *program\_loc* is in the same minimum-size instruction cache line as *address*, and is in Fetched(\_) state, set it to the Unfetched state;
- 2. for the instruction cache of thread *tid*', remove any write-slices which are in the same instruction cache line of minimum size as *address*.

This rule can be found under the same name in the full prose description, and in the handle\_ic\_ivau and flat\_propagate\_cache\_maintenance functions in machineDefThreadSubsystem.lem and machineDefFlatStorageSubsyst in the executable mathematics. Cache maintenance operations work over entire cache lines, not individual addresses. Each address is associated with at least one cache line for the data (and unified) caches, and one for the instruction caches. The cache line of minimum size is the (architected) smallest possible cache line for each of these.
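The two steps of the rule can be sketched in a few lines of Python. This is only an illustration, not the Lem definition: the names `ThreadState`, `FetchEntry`, `pending_ics` and `MIN_LINE_BYTES`, the 64-byte line size, and the removal of the pending entry at the end of the transition are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

MIN_LINE_BYTES = 64  # assumed minimum instruction cache line size, for illustration only


def same_min_line(a: int, b: int) -> bool:
    """True when both addresses fall in the same minimum-size cache line."""
    return a // MIN_LINE_BYTES == b // MIN_LINE_BYTES


@dataclass
class FetchEntry:
    program_loc: int
    opcode: Optional[int] = None  # None stands for the Unfetched state


@dataclass
class ThreadState:
    fetch_queue: List[FetchEntry] = field(default_factory=list)
    icache: Set[Tuple[int, int]] = field(default_factory=set)  # (address, value) write-slices
    pending_ics: Set[Tuple[int, int]] = field(default_factory=set)  # (iiid, address) entries


def propagate_ic_to_thread(thread: ThreadState, iiid: int, address: int) -> bool:
    """Take the transition for one thread; return False if it is not enabled."""
    if (iiid, address) not in thread.pending_ics:
        return False
    # Step 1: fetched entries in the invalidated line go back to Unfetched.
    for entry in thread.fetch_queue:
        if entry.opcode is not None and same_min_line(entry.program_loc, address):
            entry.opcode = None
    # Step 2: drop cached write-slices that lie in the invalidated line.
    thread.icache = {(a, v) for (a, v) in thread.icache if not same_min_line(a, address)}
    # Assumption: the pending entry is discharged once the invalidate has been done.
    thread.pending_ics.discard((iiid, address))
    return True
```

Because the comparison is made on line indices rather than exact addresses, an IC to one address also resets fetch-queue entries and cached writes for its neighbours in the same line, which is the point of the closing remark about cache lines.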

### Example

This model correctly explains all the behaviours of §3.3. We illustrate this by revisiting the cache synchronisation explanation of §3.2, which can now be re-interpreted with respect to our precise model, and by using this to explain the thread migration case of §3.3.3.

Given DC Xn; DSB; IC Xn; DSB, the model gives the sequence the following meaning (omitting uninteresting transitions). First, the DC CVAU causes a **Perform DC** transition. This pushes any write that might have been in the abstract data cache into memory. Now the first DSB's **Commit DSB** can be taken, allowing **Begin IC** to happen. This creates entries for each thread, which are discharged by each **Propagate IC to Thread** (see above). Once all entries are invalidated, a **Complete IC** can happen. Now, if any thread decodes an instruction for that address, it must have been fetched from the write the DC pushed, or something coherence-after it.

If the software thread performing this sequence is interrupted and migrated (by the OS) to a different hardware thread, then, so long as the OS includes the DSB to maintain the thread-local DC ordering, the DC will push the write in an identical way, since it only affects the global abstract data cache. The IC transitions can all be taken, and the sequence continues as before, just on a new hardware thread. So when the second DSB finishes, and the final **Commit DSB** transition is taken, the effect of the full sequence will be seen system-wide even if the thread was migrated.
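The walk-through above can be replayed on a toy state machine. The sketch below is a deliberate simplification rather than the model itself: `ToySystem` and its method names are hypothetical, cache lines are collapsed to single addresses, barriers are left out, and `fetchable` over-approximates what a fetch may return by assuming the instruction cache may be refilled at any moment from memory or from the abstract data cache.

```python
class ToySystem:
    """Flat memory, a global abstract data cache, and per-thread instruction caches."""

    def __init__(self, memory, n_threads):
        self.memory = dict(memory)   # address -> value
        self.data_cache = []         # buffered writes as (address, value), oldest first
        self.icaches = [set() for _ in range(n_threads)]  # sets of (address, value)

    def write(self, addr, value):
        # A data write first lands in the abstract data cache.
        self.data_cache.append((addr, value))

    def visible(self, addr):
        # Writes an instruction cache could pick up right now.
        buffered = {(a, v) for (a, v) in self.data_cache if a == addr}
        return {(addr, self.memory[addr])} | buffered

    def update_icache(self, tid, addr):
        # Spontaneous instruction cache update (here: take every visible write).
        self.icaches[tid] |= self.visible(addr)

    def perform_dc(self, addr):
        # Push buffered writes for addr to memory, in order, and drop them from the buffer.
        for a, v in self.data_cache:
            if a == addr:
                self.memory[a] = v
        self.data_cache = [(a, v) for (a, v) in self.data_cache if a != addr]

    def propagate_ic(self, tid, addr):
        # Invalidate one thread's instruction cache entries for addr.
        self.icaches[tid] = {(a, v) for (a, v) in self.icaches[tid] if a != addr}

    def fetchable(self, tid, addr):
        # Values a fetch by thread tid could return for addr.
        cached = {v for (a, v) in self.icaches[tid] if a == addr}
        return cached | {v for (_, v) in self.visible(addr)}


system = ToySystem({0x1000: "old"}, n_threads=2)
system.update_icache(1, 0x1000)      # thread 1 caches the old instruction
system.write(0x1000, "new")          # new instruction written as data
before = system.fetchable(1, 0x1000)
system.perform_dc(0x1000)            # DC: the write reaches memory
after_dc = system.fetchable(1, 0x1000)
for tid in range(2):                 # IC propagated to every thread
    system.propagate_ic(tid, 0x1000)
after_ic = system.fetchable(1, 0x1000)
```

In this toy run the stale value is still fetchable after the DC alone, because thread 1's instruction cache still holds it; only once the IC has been propagated to every thread is the new value the sole candidate. Nothing in `perform_dc` depends on which thread issued it, which mirrors the migration argument.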

### 3.5 An Axiomatic Semantics for Instruction Fetch

Based on the operational model, we develop an axiomatic semantics, as an extension of the ARMv8 axiomatic reference model [26, 10]. Since that does not have mixed-size support, we do not model the concurrent modification of conditional branches (§3.3.1), as this would require mixed-size machinery.

The existing axiomatic model is a predicate on candidate executions, hypothetical complete executions of the given program that satisfy some basic well-formedness conditions, defining the set of valid executions to be those satisfying its axioms. Each candidate execution abstractly captures a particular concrete execution of the program in terms of events and relations over them. This model is expressed in the herd language [14, 21, 29].

The events of these executions are memory reads (the set R), memory writes (W), and memory barrier/fence events (F). The relations are: program order (po), capturing the sequencing of events by the same thread in the execution's control-flow unfolding; reads-from (rf), relating a write event w with any read event r that reads from it; the coherence order (co), recording the execution's sequencing of same-address writes in memory; and read-modify-write (rmw), capturing which load/store exclusive instructions form a successful exclusive pair in the execution. The derived relation from-reads $\mathsf{fr} = \mathsf{rf}^{-1}; \mathsf{co}$ relates a read r with a write w' if r reads from a write w coherence before w'. In addition, candidate executions also have relations capturing dependencies between events: address (addr), data (data), and control dependencies (ctrl). The relation loc relates any two read/write events that are to the same memory address. The model also has relations suffixed "i" and "e": rfi/rfe, coi/coe, fri/fre. These are the restrictions of the relations rf, co, and fr, to same-thread/"internal" event pairs or different-thread/"external" event pairs.

The model is defined in relational algebra. In herd, R; S stands for sequential composition of relations R and S, R^-1 for the inverse of relation R, R|S and R&S for the union and intersection of R and S, and [A];R;[B] for the restriction of R to the domain A and range B.
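For readers less familiar with this notation, the operators can be mimicked with Python sets of pairs. The sketch below is illustrative only: the helper names and the tiny candidate execution (two writes and one read of the same location, with an assumed thread assignment) are invented for the example and are not part of herd.

```python
def seq(r, s):
    """Sequential composition R; S."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}


def inverse(r):
    """Inverse relation R^-1."""
    return {(b, a) for (a, b) in r}


def restrict(dom, r, rng):
    """Restriction [A]; R; [B] of R to domain A and range B."""
    return {(a, b) for (a, b) in r if a in dom and b in rng}


def internal(r, thread_of):
    """Same-thread pairs of R (the "i" suffix)."""
    return {(a, b) for (a, b) in r if thread_of[a] == thread_of[b]}


def external(r, thread_of):
    """Different-thread pairs of R (the "e" suffix)."""
    return {(a, b) for (a, b) in r if thread_of[a] != thread_of[b]}


# Hypothetical execution: w1 (thread 0) and w2 (thread 1) write x; r1 (thread 1) reads w1.
thread_of = {"w1": 0, "w2": 1, "r1": 1}
rf = {("w1", "r1")}
co = {("w1", "w2")}

fr = seq(inverse(rf), co)        # r1 read a write that is coherence-before w2
rfe = external(rf, thread_of)
fri = internal(fr, thread_of)
```

Here `fr` contains the single pair `("r1", "w2")`; since `r1` and `w2` are on the same thread in this example, that pair is in `fri`, while the `rf` edge crosses threads and so is in `rfe`.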

Handling instruction fetch requires extending the notion of candidate execution. We add new events: an *instruction-fetch* (IF) event for each executed instruction; a DC event for each DC CVAU instruction; and an IC event for each IC IVAU and IC IALLU instruction. We replace po with *fetch-program-order* (fpo), which orders the IF event of an instruction before any program-order later IF events. We add a relation *same-cache-line* (scl), relating reads, writes, fetches, DC and IC events to addresses in the same cache line. We add an acyclic transitively closed relation wco, which extends co with orderings for cache maintenance (DC or IC) events: it includes an ordering (e, e') or (e', e) for any cache maintenance event e and same-cache-line event e' if e' is a write or another cache maintenance event; where co = ([W];wco;[W]) & loc. The loc, addr, and ctrl relations are all extended to include DC and IC events. We add a *fetch-to-execute* relation (fe), relating an IF event to any event generated by the execution of that instruction; and an *instruction-read-from* relation (irf), which relates a write to any IF event that fetches from it. Finally, we add a boolean *constrained-unpredictable* (CU) to detect badly behaved programs.

Now we derive the following relations: the standard po relation, as po = fe^-1; fpo; fe (two events e and e' are po-related if their fetch-events are fpo-related); and *instruction-from-reads* (ifr), the analogue of fr for instruction fetches, relating a fetch to all writes coherence-after the one it fetched from: ifr = irf^-1; co.
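The two derived relations can be checked on a small hand-built example. The following Python sketch is a toy under stated assumptions: relations are sets of pairs, and the event names (`IF1`, `IF2`, `a`, `b`, `w1`, `w2`) are made up for the illustration.

```python
def seq(r, s):
    """Sequential composition of two relations given as sets of pairs."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}


def inverse(r):
    """Inverse of a relation."""
    return {(b, a) for (a, b) in r}


# Two instructions on one thread: IF1 fetches the instruction producing event a,
# and IF2 fetches the instruction producing event b.
fpo = {("IF1", "IF2")}
fe = {("IF1", "a"), ("IF2", "b")}

# IF2 fetched from write w1, and w2 is coherence-after w1.
irf = {("w1", "IF2")}
co = {("w1", "w2")}

po = seq(seq(inverse(fe), fpo), fe)   # po = fe^-1; fpo; fe
ifr = seq(inverse(irf), co)           # ifr = irf^-1; co
```

With these inputs `po` relates `a` to `b`, recovering program order from fetch order, and `ifr` relates `IF2` to `w2`, the write it did not fetch from.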

We then make two semantics-preserving rewrites of the existing model to make adding instruction fetches easier (described in the appendix); and make the following changes and additions to the model. The full model is shown in Figure 3.1, with comments pointing to the relevant locations in the model definition. For lack of space we only describe the main addition, the iseq relation, in detail (including its correspondence with the operational model of §3.4); for the others we give an overview and refer to the appendix for the full description.

We define the relation iseq, relating some write w to address x to an IC event completing a cache synchronisation sequence (not necessarily on a single thread): w is followed by a same-cache line DC event, which is in turn followed by a same-cache line IC event. In operational model terms, this captures traces that propagated w to memory, subsequently performed a same-cache-line DC, and then began an IC (and eagerly propagated the IC to all threads). In any state after this sequence it is guaranteed that w, or a coherence-newer same-address write, is in the instruction cache of all threads: performing the DC has cleared the abstract data cache of writes to x, and the subsequent IC has removed old instructions for location x from the instruction caches, so that any subsequent updates to the instruction caches have been with w, or co-newer writes. Adding ifr; iseq to the observed-by relation (obs) (4) relates an instruction fetch i to location x to an IC ic if: i fetched from a write w to x, some write w' to x is coherence-after w, and ic completes a cache synchronisation sequence (iseq) starting from w'. Then the irreflexive ob axiom requires that i must be ordered-before ic (because it would otherwise have fetched $w'$).

We now briefly overview other changes made to the axiomatic model and their intuition. We include irf in obs (3): for an instruction to be fetched from a write, the write has to have been done before. We add a relation fetch-ordered-before (fob) (5-7), which is included in ordered-before. The relation fob includes fpo and fe; including fpo (5) requires fetches to be ordered according to their position in the control-flow unfolding of the execution, and including the fe (fetch-to-execute) relation (6) captures the idea that an instruction must be fetched before it can execute; fetches program-order-after an ISB happen after the ISB (or else are restarted) (7).

For DSB ISH instructions the edge [R|W|F|DC|IC];po;[dsb.ish] is included in ob (9): DSB ISHs are ordered with all program-order-preceding non-fetch events. Symmetrically, all non-IF events are ordered after program-order-preceding dsb.ish events (10). DCs wait for preceding dmb.sy events (11). We include the relation cache-op-ordered-before (cob) in ob. This relation orders DC instructions with program-order previous reads/writes and other DCs to the same cache line (12,13).

Finally, could-fetch-from (cff) (14) captures, for each fetch i, the writes it could have fetched from (including the one it did fetch from), which we use to define the constrained unpredictable axiom cff\_bad (not given) (15).

```
let iseq = [W]; (wco&scl); [DC];            (*1*)
           (wco&scl); [IC]

(* Observed-by *)
let obs = ...
  | irf                                     (*3*)
  | (ifr; iseq)                             (*4*)

(* Fetch-ordered-before *)
let fob = [IF]; fpo; [IF]                   (*5*)
  | [IF]; fe                                (*6*)
  | [ISB]; fe^-1; fpo                       (*7*)

(* Dependency-ordered-before *)
let dob = addr | data
  | ctrl; [W]
  | (ctrl | (addr; po)); [ISB]
  | [ISB]; po; [R]                          (*8*)
  | addr; po; [W]
  | (addr | data); rfi

(* Atomic-ordered-before *)
let aob = rmw
  | [range(rmw)]; rfi; [A|Q]

(* Barrier-ordered-before *)
let bob = [R|W]; po; [dmb.sy]
  | [dmb.sy]; po; [R|W]
  | [L]; po; [A]
  | [R]; po; [dmb.ld]
  | [dmb.ld]; po; [R|W]
  | [A|Q]; po; [R|W]
  | [W]; po; [dmb.st]
  | [dmb.st]; po; [W]
  | [R|W]; po; [L]
  | [R|W|F|DC|IC]; po; [dsb.ish]            (*9*)
  | [dsb.ish]; po; [R|W|F|DC|IC]            (*10*)
  | [dmb.sy]; po; [DC]                      (*11*)

(* Cache-op-ordered-before *)
let cob = [R|W]; (po&scl); [DC]             (*12*)
  | [DC]; (po&scl); [DC]                    (*13*)

(* Ordered-before *)
let ob = (obs|fob|dob|aob|bob|cob)+

(* Internal visibility requirement *)
acyclic (po-loc|fr|co|rf) as internal

(* External visibility requirement *)
irreflexive ob as external

(* Atomic *)
empty rmw & (fre; coe) as atomic

(* Constrained unpredictable *)
let cff = ([W];loc;[IF]) \ ob^-1 \ (co;iseq;ob)   (*14*)
cff_bad cff ≡ CU                                  (*15*)
```

Figure 3.1: Axiomatic model
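To see how iseq interacts with the irreflexive ob axiom, the following Python sketch evaluates a cut-down version of the check on one hand-built candidate execution. It is an illustration under several assumptions, not the model: only the irf and ifr; iseq parts of obs are included, the single edge `("ic", "f")` stands in for whatever barrier and fetch ordering places the IC before the later fetch, and all event names are hypothetical.

```python
def seq(r, s):
    """Sequential composition of relations given as sets of pairs."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}


def inverse(r):
    return {(b, a) for (a, b) in r}


def restrict(dom, r, rng):
    """[A]; R; [B]"""
    return {(a, b) for (a, b) in r if a in dom and b in rng}


def transitive_closure(r):
    closure = set(r)
    while True:
        extra = seq(closure, closure) - closure
        if not extra:
            return closure
        closure |= extra


# Events: initial write w0 and new write w1 to x, then a DC and an IC for x's line,
# and an instruction fetch f of x. All events are in the same cache line.
W, DC, IC = {"w0", "w1"}, {"dc"}, {"ic"}
chain = ["w0", "w1", "dc", "ic"]
wco = {(chain[i], chain[j]) for i in range(len(chain)) for j in range(i + 1, len(chain))}
events = chain + ["f"]
scl = {(a, b) for a in events for b in events}
co = {("w0", "w1")}

# let iseq = [W];(wco&scl);[DC];(wco&scl);[IC]
iseq = seq(restrict(W, wco & scl, DC), restrict(DC, wco & scl, IC))


def externally_consistent(irf, other_ob):
    """Irreflexivity of (a fragment of) ob for a given choice of irf."""
    ifr = seq(inverse(irf), co)
    obs_fragment = irf | seq(ifr, iseq)
    ob = transitive_closure(obs_fragment | other_ob)
    return all(a != b for (a, b) in ob)


ic_before_fetch = {("ic", "f")}  # assumed ordering of the IC before the fetch
stale_allowed = externally_consistent({("w0", "f")}, ic_before_fetch)
fresh_allowed = externally_consistent({("w1", "f")}, ic_before_fetch)
```

Fetching the stale write `w0` yields the edge from `f` to `ic` through ifr; iseq, which closes a cycle with the assumed `ic`-before-`f` edge, so that candidate is rejected; fetching `w1` produces no such edge and is accepted.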

### 3.6 Validation

To gain confidence in the presented models we validated the models against the Arm architectural intent, against each other, and against real hardware.

#### Validation against the Architecture

To ensure our models correctly captured the architectural intent we engaged in detailed discussions with Arm, including the Arm chief architect. These involved inventing litmus tests (including those described in §3.3 and many others) and discussing what the architecture should allow in each case.

#### Validating against hardware

To run instruction-fetch tests on hardware, we extended the litmus tool [17]. The most significant extension is the handling of code that can be modified, and thus has to be restored between experiments. To that end, the tests execute copies of the code, which reside in mmap'd memory with execute permission granted. Copies are made from "master" copies, in effect C functions whose contents consist essentially of gcc extended inline assembly. Of course, such code has to be position independent, and explicit code addresses in test initialisation sections (such as in 0:X1=1 in the test of §3.3.1) are specific to each copy. All the cache handling instructions used in our experiments are allowed to execute at exception level 0 (user-mode), and therefore no additional privilege is needed to run the tests.

To automatically generate families of interesting instruction-fetch tests, we extended the diy test generation tool [40] to support instruction-fetch reads-from (irf) and instruction-fetch from-reads (ifr) edges, in both internal (same-thread) and external (inter-thread) forms, and the cachesync edge. We used this to generate 1456 tests involving those edges together with po, rf, fr, addr, ctrl, ctrlisb, and dmb.sy. diy does not currently support bare DC or IC instructions, locations which are both fetched and read from, or repeated fetches from the same location.

We then ran the diy-generated test suite on a range of hardware implementations, to collect a substantial sample of actual hardware behaviour.

#### Correspondence between the models

We experimentally test the equivalence of the operational and axiomatic models on the above hand-written and diy-generated tests, checking that the models give the same sets of allowed final states, and that these are consistent with the hardware observations.

#### Making the operational model executable as a test oracle

To make the operational model executable as a test oracle, capable of computing the set of all allowed executions of a litmus test, we must be able to *exhaustively enumerate* all possible traces. For the model as presented, doing this naively is infeasible: for each instruction it is theoretically possible to speculate any of the  $2^{64}$  addresses as potential next address, and the interleaving of the new fetch transitions with others leads to an additional combinatorial explosion.

We address these with two new optimisations. First, we extend the fixed-point optimisation in RMEM (incrementally computing the set of possible branch targets) [10] to keep track not only of indirect branches but also the successors of every program location, and only allow speculating from this set of successors. Additionally, we track during a test which locations were both fetched and modified during the test, and eagerly take fetch and decode transitions for all other locations. As before, the search then runs until the set of branch targets *and* the set of modified program-locations reaches a fixed point. We also take some of the transitions eagerly to reduce the search space, in cases where this cannot remove behaviour: Wait for IC, Complete IC, Fetch Request, and Update Instruction Cache.
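The fixed-point structure of the search can be sketched generically in Python. This is not RMEM's implementation: `search_to_fixed_point` and `run_search` are hypothetical names, and the single set of "facts" stands in for both the branch targets and the modified program locations that the real search tracks.

```python
def search_to_fixed_point(run_search, initial_facts):
    """Re-run an exhaustive search until it discovers nothing new.

    run_search takes the facts known so far (for example, possible successor
    addresses) and returns the facts observed during that run.
    """
    known = set(initial_facts)
    while True:
        discovered = run_search(frozenset(known))
        if discovered <= known:
            return known          # fixed point reached: no new facts
        known |= discovered       # widen the set and search again


# Toy stand-in for one exhaustive run: exploring from a known address reveals
# the successors recorded in this hypothetical table.
reveals = {0x0: {0x4}, 0x4: {0x8}, 0x8: {0x0}}


def toy_run(known):
    found = set()
    for address in known:
        found |= reveals.get(address, set())
    return found


targets = search_to_fixed_point(toy_run, {0x0})
```

Each iteration only speculates from addresses already in the known set, which is what keeps a single run finite; termination of the outer loop relies on the set of discoverable facts being finite, as it is for a litmus test.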

#### Making the axiomatic model executable as a test oracle

The axiomatic model is expressed in a herd-like form, but the herd tool does not support instruction fetch and cache maintenance instructions. To make the model executable as a test oracle, we built a new tool that takes litmus tests and uses a Sail [7] definition of a fragment of the ARMv8-A ISA to generate SMT problems for the model. Using the Sail instruction semantics, we generate a Sail program that corresponds to each thread within a litmus test. The tool then partially evaluates these programs using the concrete values for addresses and registers specified in the litmus file, while allowing memory values and arbitrary addresses to remain symbolic. Using a Sail to SMT-LIB backend, these are translated into SMT definitions that include all possible behaviours of each thread as satisfiable solutions. The rules for the axiomatic model are then applied as assertions restricting the possible behaviours to just those allowed by the axiomatic model. The tool also derives the addr and data relations, using the syntactic dependencies within the instruction semantics to derive the syntactic dependencies between instructions.

For litmus tests, where we can know up-front which instructions may be modified, we would like to avoid generating IF events for instructions that cannot be modified. If we naively removed certain IF events, however, we would break the correspondence between po and fe^-1; fpo; fe. This can be worked around by ensuring that every modifiable instruction generates an event which appears in po, allowing fpo between the modifiable instructions to instead be derived as fe; po; fe^-1. Branches emit a special branch address announce event for this purpose, which is also used to derive the ctrl relation. The fob relation can then be modified, replacing [ISB]; fe^-1; fpo with [ISB]; po; fe^-1 and adding [ISB]; po. The second change ensures that all the transitive edges generated by [ISB]; fe^-1; fpo followed by [IF]; fe remain within fob and hence ob.

A limitation of this approach is that it cannot support cases where two threads both attempt to execute the same possibly-modified instruction, as in the SM.F+ic and FOW tests.

#### Validation results

First, to check for regressions, we ran the operational model on all the 8950 non-mixed-size tests used for developing the original Flat model (without instruction fetch or cache maintenance). The results are identical, except for 23 tests which did not terminate within two hours. We used a 160 hardware-thread POWER9 server to run the tests.

We have also run the axiomatic model on the 90 basic two-thread tests that do not use Arm release/acquire instructions (not supported by the ISA semantics used for this); the results are all as they should be. This takes around 30 minutes on 8 cores of a Xeon Gold 6140.

Then, for the key handwritten tests mentioned in this paper, together with some others (that have also been discussed with Arm), we ran them on various hardware implementations and in the operational and axiomatic models. The models' results are identical to the Arm architectural intent in all cases, except for two tests which are not currently supported by the axiomatic checker.

| Test                                | Arm intent | op. model | ax. model   | hardware obs.                    |
|-------------------------------------|------------|-----------|-------------|----------------------------------|
| CoFF                                | allow      | =         | =           | 42.6k/13G                        |
| CoFR                                | forbid     | =         | =           | 0/13G                            |
| CoRF+ctrl-isb                       | allow      | =         | =           | 3.02G/13G                        |
| SM                                  | allow      | =         | =           | 25.8G/25.9G                      |
| SM+cachesync-isb                    | forbid     | =         | =           | 0/25.9G                          |
| MP.RF+dmb+ctrl-isb                  | allow      | =         | =           | 480M/6.36G                       |
| MP.RF+cachesync+ctrl-isb            | forbid     | =         | =           | 0/13G                            |
| MP.FR+dmb+fpo-fe                    | forbid     | =         | =           | 0/13G                            |
| MP.FF+dmb+fpo                       | allow      | =         | =           | 447M/13G                         |
| MP.FF+cachesync+fpo                 | forbid     | =         | =           | <sup>F</sup>2.3k/13G             |
| ISA2.F+dc+ic+ctrl-isb               | forbid     | =         | =           | 0/6.98G                          |
| SM.F+ic                             | allow      | =         | unsupported | <sup>U</sup> 0/12.9G             |
| FOW                                 | allow      | =         | unsupported | <sup>U</sup> 0/7G                |
| MP.RF+dc+ctrl-isb-isb               | allow      | =         | =           | <sup>U</sup>0/12.94G             |
| MP.R.RF+addr-cachesync+dmb+ctrl-isb | forbid     | =         | =           | 0/6.97G                          |
| MP.RF+dmb+addr-cachesync            | allow      | =         | =           | <sup>U</sup>0/6.34G              |

[The hardware observations are the sum of testing seven devices: a Snapdragon 810 (4x Arm A53 + 4x Arm A57 cores), Tegra K1 (2x NVIDIA Denver cores), Snapdragon 820 (4x Qualcomm Kryo cores), Exynos 8895 (4x Arm A53 + 4x Samsung Mongoose 2 cores), Snapdragon 425 (4x Arm A53), Amlogic 905 (4x Arm A53 cores), and Amlogic 922X (4x Arm A73 + 2x Arm A53 cores). U: allowed but unobserved. F: forbidden but observed.]

Our testing revealed a hardware bug in a Snapdragon 820 (4 Qualcomm Kryo cores). A version of the first cross-thread synchronisation test of §3.3.3 but with the full cache synchronisation (MP.RF+cachesync+ctrl-isb) exhibited an illegal outcome in 84/1.1G runs (not shown in the table), which we have reported. We have also seen an anomaly for MP.FF+cachesync+fpo, currently under investigation by Arm. Apart from these, the hardware observations are all allowed by the models. As usual, specific hardware implementations are sometimes stronger.

Finally, we ran the 1456 new instruction-fetch diy tests on a variety of hardware, for around 10M iterations each, and in the operational model. The model is sound with respect to the observed hardware behaviour except for that same Snapdragon 820 device.

### 3.7 Related Work

To the best of our knowledge, no previous work establishes well-validated rigorous semantics for any systems aspects, of any current production architecture, in a realistic concurrent setting.

The closest is Raad et al.'s work on non-volatile memory, which models the required cache maintenance for persistent storage in ARMv8-A [42], as an extension to the ARMv8-A axiomatic model, and for Intel x86 [43] as an operational model, but neither is validated against hardware. In the sequential case, Myreen's JIT compiler verification [44] models x86 icache behaviour with an abstract cache that can be arbitrarily updated, cleared on a jmp. For address translation, the authoritative Arm-internal ASL model [4, 5, 6], and Sail model derived from it [7] cover this, and other features sufficient to boot an OS (Linux), as do the handwritten Sail models for RISC-V (Linux and FreeBSD) and MIPS/CHERI-MIPS (FreeBSD, CheriBSD), but without any cache effects. Goel et al. [45, 46] describe an ACL2 model for much of x86 that covers address translation; and the Forvis [47] and RISCV-PLV [48] Haskell RISC-V ISA models are also complete enough to boot Linux. Syeda and Klein [49, 50] provide a somewhat idealised model for ARMv7 address translation and TLB maintenance. Komodo [33] uses a handwritten model for a small part of ARMv7, as do Guanciale et al. [34, 35]. Romanescu et al. [51, 52] do discuss address translation in the concurrent setting, but with respect to idealised models. Lustig et al. [53] describe a concurrent model for address translation based on the Intel Sandy Bridge microarchitecture, combined with a synopsis of some of the relevant Linux code, but not an architectural semantics for machine-code programs.

### 3.8 Conclusion

The mainstream architectures are the most important programming languages used in practice, and their systems aspects are fundamental to the security (or lack thereof) of our computing infrastructure. We have established a robust semantics for one of those systems aspects, soundly abstracting the hardware complexities to a manageable model that captures the architectural intent. This enables future work on reasoning, model-checking, and verification for real systems code.

#### Acknowledgements

This work would not have been possible without generous technical assistance from Arm. We thank Richard Grisenthwaite, Will Deacon, Ian Caulfield, and Dave Martin for this. We also thank Hans Boehm, Stephen Kell, Jaroslav Ševčík, Ben Titzer, and Andrew Turner, for discussions of how instruction cache maintenance is used in practice, and Alastair Reid for comments on a draft. This work was partially supported by EPSRC grant EP/K008528/1 (REMS), ERC Advanced Grant 789108 (ELVER), an ARM iCASE award, and ARM donation funding. This work is part of the CIFV project sponsored by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL), under contract FA8650-18-C-7809. The views, opinions, and/or findings contained in this paper are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the Department of Defense or the U.S. Government.

# Instruction fetch: operationally

- 4.1 Shape of the model
  - Need for DC+IC to be separate components that compose together nicely
  - Modelling caches and buffers explicitly (e.g. a uarch-style model)
- 4.2 Extending flat
  - Draw out the actual state
  - List all the new transitions
  - Justify each

# Instruction fetch: axiomatically

- 5.1 Candidate model
  - New events and built-in relations
- 5.2 The axioms
  - Describe, in detail, the diff from Will's axiomatic model including Christopher's semantics-preserving transformation of the model
- 5.3 Executing the model in Isla
  - (Probably just cite the cav isla paper here)

# Instruction fetch: validation

- 6.1 Executable operational semantics
  - 6.1.1 Making the model executable
- 6.2 Extension to isla-axiomatic
  - Mostly a re-cap of Alasdair's CAV'21 paper.
- 6.3 Hardware testing
  - 6.3.1 Custom harness
    - A brief re-cap of the original harness I built, and how it worked.
    - https://github.com/bensimner/rems-ifetch-harness
  - 6.3.2 Extending herdtools
    - Mostly a re-cap of work done by Luc Maranget to get diy7 and litmus7 running.
  - 6.3.3 Results from hardware
- 6.4 Correspondence between the models

# Pagetables and the VMSA

This chapter is based, in part, on: Chapter D5 of the Arm Architecture Reference Manual DDI 0487H.a; and, Relaxed virtual memory in Armv8-A [54] by Ben Simner, Alasdair Armstrong, Jean Pichon-Pharabod, Christopher Pulte, Richard Grisenthwaite, and Peter Sewell, published in the proceedings of the 31st European Symposium on Programming (ESOP, 2022).

## 7.1 Introduction

Modern computers heavily rely on *virtual memory* to enforce security boundaries: hypervisors and operating systems manage mappings from virtual to physical addresses in order to restrict the access individual processes and guest operating systems have to the underlying physical memory, and to memory-mapped devices. With the endemic use of memory-unsafe languages, even for critical infrastructure, understanding and verifying the programs which manage virtual memory mappings is more vital than ever, driving current interests in hypervisors. The virtual machines those hypervisors enable are the key pieces of software which have become solely responsible for implementing such critical security properties.

The following chapters focus on these aspects of the architecture, on virtual memory and virtualisation and the software they enable, with the aim of giving a precise formal semantics for the purpose of verifying real systems software which use those features.

I first give a description of the sequential behaviour of Arm's virtual memory (this chapter); then describe the *relaxed* behaviours and any open questions about Arm's virtual memory (§8); give our precise axiomatic semantics that capture these behaviours (§9); give an overview of the tooling and validation of the given model(s) (§10); and, finally, a sketch of an equivalent operational semantics (§11).

**Chapter overview.** The remainder of this chapter gives: a brief overview of Arm's virtual memory systems architecture (§7.2); a detailed description of the Arm translation table format (§7.3); an overview of the multiple stages of translation (§7.4), and the different translation regimes (§7.5); a detailed explanation of the official Arm translation table walk pseudocode (§7.6); and finally a discussion on the existence and purpose of translation lookaside buffers (§7.7). This chapter does not present any new contributions or novel research; instead, it is a brief but necessary overview of the required architectural features.

## 7.2 Virtual Memory

Armv8-A's *virtual memory system architecture* (or *VMSA*) defines the virtual memory and virtualisation features of the Arm architecture. Its structure is described, in detail, in Chapter D5 of the Arm Architecture Reference Manual [1].

Conventionally, we think of memory as being a flat array of bytes, indexed by *physical addresses*. For smaller trusted devices, such as microcontrollers, this may be the end of the story. However, larger 'application' class processors rely heavily on virtual addressing: interposing one or more layers of indirection between the accesses using the *virtual* addresses of the program and the 'true' physical addresses of memory. This indirection allows systems running on those processors to:

- 1. partition the physical resources between different programs, giving access to only those resources that each program needs, and protecting those resources from other programs that do not need to access them;
- 2. indirect accesses through specific ranges of addresses with convenient numeric values; and
- 3. update those indirections at runtime to add, remove, or otherwise modify, the mappings to physical memory, to support techniques such as copy-on-write and paging.

To manage all this, typical operating systems split programs into distinct *processes* and associate each process with its own virtual to physical mapping. These mappings take the form of partial functions from the process's own (virtual) addresses to the real hardware physical addresses along with some permissions:

 $\texttt{translate}: \texttt{VirtualAddress} \rightharpoonup \texttt{PhysicalAddress} \times 2^{\{\texttt{Read},\texttt{Write},\texttt{Execute}\}}$ 

Note that this is a simplification. A more accurate translate function is given later on TODO: ?REF?.

Typically an operating system would create one such mapping for every process, partitioning the physical memory into disjoint subsets of physical addresses (the *range* of the translate function), and would allocate some convenient numeric values to be the virtual addresses the process interacts with (the *domain* of the translate function). Having this separation allows the processes to be given conveniently aligned contiguous chunks of virtual address space even if the underlying physical resources are highly fragmented, or, in the case of paging, potentially not present in memory at all. Additionally, operating systems can provide many processes with mappings to the same physical resource (such as memory-mapped devices) and control which processes have access to such devices at any point in time.

These mappings give rise to separate *address spaces* for each process. The diagram in Figure 7.1 illustrates an example, with two processes named P0 and P1 each with their own virtual address space. The left-hand side shows a representation of the 'memory' as the processes see it, with the memory split into *pages* (fixed-size blocks of contiguous addresses). The right-hand side is the equivalent representation of the actual physical memory, with each *physical* page of the available RAM. Note that this diagram shows the virtual address space as being smaller than the physical one, but in general, they may be the same size, or the virtual address space may be even larger than the physical space.

If we assume each page has size 0x1000 then page 1 contains addresses 0x1000 to 0x1FFF inclusive, and we can interpret the diagram like so:

- ▷ For P0:
  - virtual addresses in pages 1 and 3 are unmapped.
  - virtual addresses in pages 0 and 2 map to physical addresses in physical page 1.
  - virtual addresses in page 4 map to physical addresses in physical page 5.
- ▷ For P1:
  - virtual addresses in pages 0 and 4 are unmapped.
  - virtual addresses in page 1 map to physical addresses in physical page 5.
  - virtual addresses in page 2 map to physical addresses in physical page 7.
  - virtual addresses in page 3 map to physical addresses in physical page 8.

For example, if process P0 reads the address 0x2305, it will actually read from the physical location 0x1305, since virtual page 2 was mapped to physical page 1 in P0's address space.
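The page-level reading of the diagram above can be written down as a small partial function. The following is a minimal sketch, assuming the 0x1000 page size used in the example; the names `P0_PAGES`, `P1_PAGES` and `translate` are illustrative only, and permissions are left out.

```python
from typing import Optional

PAGE_SIZE = 0x1000  # page size assumed in the example above

# Virtual page -> physical page, transcribed from the description of Figure 7.1.
# A virtual page that is absent from the dictionary is unmapped.
P0_PAGES = {0: 1, 2: 1, 4: 5}
P1_PAGES = {1: 5, 2: 7, 3: 8}


def translate(pages: dict, va: int) -> Optional[int]:
    """Return the physical address for va, or None if va is unmapped."""
    vpage, offset = divmod(va, PAGE_SIZE)
    ppage = pages.get(vpage)
    if ppage is None:
        return None  # partial function: no mapping for this page
    return ppage * PAGE_SIZE + offset
```

With these tables, `translate(P0_PAGES, 0x2305)` gives `0x1305`, matching the example, and `translate(P0_PAGES, 0x1000)` gives `None` because virtual page 1 is unmapped in P0.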

Each address space corresponds to a distinct translate function. Note that these mappings may be: non-injective (contain *aliasing* of multiple virtual addresses to the same physical address); partial (where some virtual addresses do not map to a physical address at all); or overlapping with other processes' address spaces, in either the range (for example, the physical page 5 is mapped in both P0 and P1), or the domain (for example, the virtual page 2 is mapped in both P0 and P1 but to different physical pages), or both.

Large application-class processor architectures, such as Armv8-A, often provide hardware support in the form of
the *memory management unit* (the MMU), which, once configured by software, will automatically perform the
translation from virtual to physical addresses. Software is then required to manage a set of translation functions,
and is responsible for ensuring the correct translation function is being used by the MMU whenever a context
switch occurs, and for handling any *faults* that the MMU generates.



Figure 7.1: Example virtual and physical address spaces for two processes.

## 7.3 Arm Translation Tables

Software configures the MMU through the creation and modification of sets of *translation tables* (also referred to as *page tables*) for each of the translation functions.

The translation tables form an in-memory tree data structure which encodes the (partial) translate function.

Software creates and maintains these trees, and tells the MMU which tree (and so which translation function) to use at runtime. The hardware then reads from this tree structure to perform the translation, or from one of the various caching structures described in TODO: ?REF?, whenever the process reads from, or writes to, memory.

A pointer to the root of the tree is stored in the TTBR ("Translation table base register") register (or rather, one of the various base registers described in more detail in TODO: ?REF?), and this determines which translation function is currently in use by that processor.

Each node in the tree is a page-aligned chunk of memory which is treated as an array of 64-bit entries. Each entry is responsible for mapping some fixed part of the domain of the translation function, with the root table mapping the entire address space.

The tree is separated into different *levels*, with a root table pointed to by the base register, and with each child table one level deeper than its parent. Typically the root is at level 0 with a maximum depth of 4 (down to level 3), but the various configurations are discussed in the next section.

#### 7.3.1 Translation table format

Arm's virtual memory system architecture is highly configurable. Writing to the SCTLR ("System control register") and TCR ("Translation control register") system registers allows the software developer to choose a configuration from a whole host of various options. To give a flavour of this configurability I list some of the configuration bits, some of which will be discussed in more detail in the next chapter; these include: the size of virtual addresses; the number of levels in the tree; the starting level; the size of a single page (or in Arm terminology, the size of the *translation granule*); the number of ASIDs and VMIDs; alignment requirements; memory attributes for hardware walks; enabling hardware management of access flags and dirty bits; write-execute-never permissions; and so on. To simplify things, in this dissertation, we consider just one common configuration, the one currently used by the Linux kernel: a tree of translation tables with maximum depth 4, with 4KiB pages and 48-bit addresses, unless explicitly stated otherwise.

Each node is a table of 512 64-bit entries, bound as one 4096-byte block of memory. Each table controls the mapping of a fixed range of the virtual address space. This range is split into 512 equal slices, with each entry responsible for its slice. Each of those entries can be one of:

- 1. An *invalid* entry, which indicates that this slice of the domain is unmapped;
- 2. A table entry, pointing to a next-level table (a child tree) which recursively maps this slice of the domain; or
- 3. A page (last-level) or block (non-last-level) entry which defines a single fixed-size mapping for this slice of the domain.
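To make the slicing concrete, the size of the region covered by a single entry can be computed from its level. This is a small sketch under the configuration assumed above (4KiB pages, 512 entries per table, levels 0 to 3); the function name `slice_size` is illustrative.

```python
PAGE_SHIFT = 12      # 4KiB pages: 12 bits of page offset
BITS_PER_LEVEL = 9   # 512 entries per table: 9 index bits per level


def slice_size(level: int) -> int:
    """Bytes of virtual address space covered by one entry at the given level (0 to 3)."""
    if level not in (0, 1, 2, 3):
        raise ValueError("this sketch only covers levels 0 to 3")
    return 1 << (PAGE_SHIFT + BITS_PER_LEVEL * (3 - level))
```

Under these assumptions a level 3 entry covers 4KiB, a level 2 entry 2MiB, a level 1 entry 1GiB, and a level 0 entry 512GiB, so the 512 entries of the root table together cover the full 48-bit address space.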

**Invalid entries** An invalid entry is defined by bit[0] of the entry being 0. The top 63 bits are ignored by hardware, and software is free to use those bits to store any metadata it wishes. Invalid entries may exist at any level in the tree.



**Block or page entries** Block and page entries are similar to each other; both create a mapping for a contiguous slice of the domain mapped by the entry, encoded as an output address (OA) with some metadata (including access permissions, memory type, and some software-defined bits).

The OA is aligned to the size of the slice of the domain being mapped. For page entries, the OA is aligned on a page boundary. A block entry's OA at level 2 would be 2MiB aligned, and a block entry's OA at level 1 would be 1GiB aligned. This corresponds to the hardware reserving bits[n−1:12] of the entry to be 0, where n depends on how deep the entry is: at level 1 n==30; at level 2 n==21; and at level 3 n==12 (so a page entry has no such reserved bits).

Block entries can exist at levels 1 and 2. Page entries can only exist at level 3.

For block entries bit[1] is 0, for page entries bit[1] is 1.

Metadata (access permissions, shareability, memory type) are encoded into the attrs bits.

| 63:50 | 49:48 | 47:n           | n−1:12  | 11:2  | 1 | 0 |
|-------|-------|----------------|---------|-------|---|---|
| attrs | 00    | output address | ignored | attrs | P | 1 |

**Table entries** A table entry contains a page-aligned pointer to a child table, but can also contain metadata similar to that of a block or page entry, including access permissions (read/write/execute), which are combined with any permissions from the child table.

Table entries are allowed only at levels 0−2.

| 63:50 | 49:48 | 47:12         | 11:2             | 1:0 |
|-------|-------|---------------|------------------|-----|
| attrs | 00    | table pointer | Res0<sup>1</sup> | 11  |
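The three kinds of entry can be told apart from bit[0], bit[1], and the level at which the entry sits. The sketch below follows the rules given in this section; the function name and the string labels it returns are illustrative, and `"reserved"` stands for the encodings that hardware treats as invalid.

```python
def classify_entry(entry: int, level: int) -> str:
    """Classify a 64-bit translation table entry found at the given level (0 to 3)."""
    if entry & 1 == 0:
        return "invalid"            # bit[0] == 0: the rest is ignored by hardware
    if (entry >> 1) & 1 == 1:
        # bit[1] == 1: a table entry at levels 0 to 2, a page entry at level 3
        return "table" if level < 3 else "page"
    # bit[1] == 0: a block entry, which may only exist at levels 1 and 2
    if level in (1, 2):
        return "block"
    return "reserved"               # no blocks at level 0; reserved encoding at level 3
```

For example, an entry whose low two bits are `0b11` is a table entry at level 0 but a page entry at level 3.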

### 7.3.2 The Arm translation table walk

When the processor executes an instruction which takes an address, such as the Arm LDR or STR instructions, that address is always a virtual address. The hardware must convert each virtual address into a physical address, and it is the MMU that performs this conversion.

To do this, the MMU reads the TTBR to get the currently in-use tree of translation tables. Then the MMU itself reads memory and walks the tree (except when it can read from a previously cached translation, as described in the next chapter) effectively computing the partial translate function the tree encodes, producing the physical address and any permissions, or reporting a fault back to the processor if the virtual address was unmapped, or if the permissions forbid the requested operation.

**Walk overview** The hardware walker first slices up the input virtual address into chunks: the most-significant bit is used to determine which base register to use (see §7.5); the next 15 bits are typically ignored by hardware; the rest of the address is split into 9-bit fields which we refer to as fields a-d, with the remaining bits as field e. Fields a-d are used for indexing into the tables; and field e is the offset in the page, which is added to the final output address.

<sup>1</sup>The Arm architecture requires that these bits are 0; they are reserved for future use.

| VA bits | 63 | 62:48   | 47:39 | 38:30 | 29:21 | 20:12 | 11:0 |
|---------|----|---------|-------|-------|-------|-------|------|
| field   |    | ignored | a     | b     | c     | d     | e    |
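The slicing of the input address can be expressed with a few shifts and masks. This is a minimal sketch for the 48-bit, 4KiB configuration assumed in this chapter; the type `VAFields` and the field name `top` (for bit 63) are illustrative names, not architectural ones.

```python
from typing import NamedTuple


class VAFields(NamedTuple):
    top: int  # bit 63: selects the base register
    a: int    # bits 47:39: index into the level 0 table
    b: int    # bits 38:30: index into the level 1 table
    c: int    # bits 29:21: index into the level 2 table
    d: int    # bits 20:12: index into the level 3 table
    e: int    # bits 11:0: offset within the page


def split_va(va: int) -> VAFields:
    """Slice a 64-bit virtual address into the fields used by the walk."""
    return VAFields(
        top=(va >> 63) & 1,
        a=(va >> 39) & 0x1FF,
        b=(va >> 30) & 0x1FF,
        c=(va >> 21) & 0x1FF,
        d=(va >> 12) & 0x1FF,
        e=va & 0xFFF,
    )
```

Bits 62:48 are deliberately dropped, mirroring the fact that they are typically ignored by hardware.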

The walk then proceeds, with the MMU taking the following steps:

- 1 Read the TTBR register.
- 2 Perform a 64-bit single-copy atomic read of Mem[TTBR+8\*a] to read the entry in the Level 0 table, which we will call L0entry.
  - a If L0entry[0] is 0 (that is, it's an invalid entry) then report a fault back to the processor.
  - b Otherwise, if L0entry[1] is 0 then report a fault back to the processor (top-level tables cannot have block mappings).
- 3 Perform a 64-bit single-copy atomic read of Mem[L0entry.table\_pointer+8\*b] to read the entry in the Level 1 table, which we will call L1entry.
  - a If L1entry[0] is 0 then report a fault back to the processor.
  - b If L1entry[1] is 0 (it's a block entry):
    - i If the access is not permitted (see §7.3.2 "**Access permissions**"), report a fault to the processor.
    - ii Otherwise, return the output address (see §7.3.2 "**Computing the final output address**") back to the processor.
  - c Otherwise, L1entry is a table entry, and the walk continues with the next step.
- 4 Perform a 64-bit single-copy atomic read of Mem[L1entry.table\_pointer+8\*c] to read the entry in the Level 2 table, which we will call L2entry.
  - a If L2entry[0] is 0 then report a fault back to the processor.
  - b If L2entry[1] is 0 (it's a block entry):
    - i If the access is not permitted, report a fault to the processor.
    - ii Otherwise, return the output address back to the processor.
  - c Otherwise, L2entry is a table entry, and the walk continues with the next step.
- 5 Perform a 64-bit single-copy atomic read of Mem[L2entry.table\_pointer+8\*d] to read the entry in the Level 3 table, which we will call L3entry.
  - a If L3entry[0] is 0 then report a fault back to the processor.
  - b Else if L3entry[1] is 0, report a fault back to the processor (this encoding is reserved and is treated as invalid).
  - c Otherwise, L3entry[1] is 1 (it's a page entry):
    - i If the access is not permitted, report a fault to the processor.
    - ii Otherwise, return the output address back to the processor.
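The steps above can be sketched as a single loop over the four levels. This is an illustrative model only, under several assumptions that are not part of the architecture: memory is a Python dictionary from addresses to 64-bit entries, locations missing from it read as 0, the `ttbr` argument is a plain table address, and permission checks are omitted. The names `walk`, `Fault` and `ADDR_MASK` are hypothetical.

```python
ADDR_MASK = 0x0000_FFFF_FFFF_F000  # bits[47:12]: table pointer or output address


class Fault(Exception):
    """Stands in for 'report a fault back to the processor'."""


def walk(mem: dict, ttbr: int, va: int) -> int:
    """Walk the tables in mem and return the physical address for va."""
    table = ttbr
    for level in range(4):
        shift = 39 - 9 * level                 # fields a, b, c, d start at bits 39, 30, 21, 12
        index = (va >> shift) & 0x1FF
        entry = mem.get(table + 8 * index, 0)  # assumption: absent locations read as 0
        if entry & 1 == 0:
            raise Fault(f"level {level}: invalid entry")
        bit1 = (entry >> 1) & 1
        if level < 3 and bit1 == 1:
            table = entry & ADDR_MASK          # table entry: descend one level
            continue
        if level == 0 or (level == 3 and bit1 == 0):
            raise Fault(f"level {level}: reserved encoding")
        size = 1 << shift                      # 1GiB, 2MiB or 4KiB
        oa = entry & ADDR_MASK & ~(size - 1)   # output address of the block or page
        return oa | (va & (size - 1))          # append the untranslated low bits
    raise AssertionError("unreachable: level 3 always returns or faults")
```

The loop visits at most one entry per level, which corresponds to the four single-copy atomic reads in steps 2 to 5.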



**Computing the final output address** The output address (OA) of the final descriptor is the start of the range mapped by the entry. The low order bits are all 0 in the output address, and need to be added on to compute the final output address of the translation.

To compute this final output address the MMU takes the OA from the entry, and the level in the tree the entry is at, and 'completes' the address by bitwise appending the remaining fields to create the complete 48-bit output address. Recall that the OA field of the block mappings gets wider the deeper in the tree you are, and so for a 1GiB entry the OA field is only 18 bits wide but for a 4KiB page entry its OA field is the full 36 bits.

- ▷ For a 1GiB (level 1) block entry: PA = OA::c::d::e
- ▷ For a 2MiB (level 2) block entry: PA = OA::d::e
- ▷ For a 4KiB (level 3) page entry: PA = OA::e

Note that this process means that the least-significant 12 bits of the input VA are unchanged and remain the same in the final output PA, regardless of how the translation function is configured.
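The concatenations above amount to shifting the OA field left and filling the vacated low bits from the input address. The following sketch assumes `oa_field` is the value of the entry's OA field on its own (18, 27 or 36 bits wide, depending on the level); the function name is illustrative.

```python
def complete_output_address(oa_field: int, level: int, va: int) -> int:
    """Compute PA = OA::(remaining VA fields) for a leaf entry at level 1, 2 or 3."""
    if level not in (1, 2, 3):
        raise ValueError("leaf entries only exist at levels 1, 2 and 3")
    low_bits = 12 + 9 * (3 - level)        # 30 for 1GiB, 21 for 2MiB, 12 for 4KiB
    untranslated = va & ((1 << low_bits) - 1)
    return (oa_field << low_bits) | untranslated
```

For the page case, `complete_output_address(0x1, 3, 0x2305)` gives `0x1305`, and at every level the low 12 bits of the result equal the low 12 bits of the input address.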

**Access permissions** Once the walk is complete, and the final output address calculated, the MMU checks to see whether the requested access is permitted. Each level of the table can contain some access permissions and those permissions get combined at the end to calculate the final permissions.

For data accesses (reading and writing), table entries have an APTable field (bits[62:61]), and block/page entries have an AP[2:1]<sup>1</sup> field (bits[7:6]). These fields can be decoded using the following table:

| Field      | When set (1)               | When unset (0)                |
|------------|----------------------------|-------------------------------|
| AP[2]      | Read-only                  | Read&Write                    |
| AP[1]      | Allow at EL1&0             | Allow at EL1 only             |
| APTable[1] | Force read-only            | No effect on permissions.     |
| APTable[0] | Force forbid access at EL0 | No effect on EL0 permissions. |

For executable permissions, which permit or forbid instruction fetching from some region of memory, there are no dedicated encodings of the access permission bits. Instead, all mappings are executable by default, unless one of the following applies: the region is mapped writeable at EL0, as writeable EL0 regions are never executable at EL1; a global WXN ("Write-execute-never") configuration bit is set, and the entry is writeable; or one of the various translation table entry XN ("Execute-never") bits is set. For simplicity, this chapter assumes that execute-never bits are always disabled; see the full description in the Arm ARM TODO: ?REF? for more information.

To combine access permissions from the whole walk, the MMU takes the bitwise union of each of the APTable fields from each table entry, and then intersects the result with the final AP[2:1] field to produce a final set of permissions. Figure 7.2 contains a decoding table for a given table and leaf access permissions, for testing whether a requested access is permitted. If the requested access is not permitted, then the MMU generates a permission fault, which is reported back to the processor.
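For data accesses, the combination described above can be sketched as follows. This is a simplified model covering read and write permissions only (execute permissions are left out); it assumes each APTable field and the leaf AP[2:1] field are passed as 2-bit integers with the higher-numbered bit in the more significant position. The function name is illustrative.

```python
def data_permissions(aptables: list, ap: int) -> dict:
    """Combine the APTable fields of the table entries with the leaf AP[2:1] field.

    aptables: one 2-bit value per table entry on the walk, (APTable[1] << 1) | APTable[0].
    ap:       the leaf's 2-bit value, (AP[2] << 1) | AP[1].
    """
    force_read_only = any(t & 0b10 for t in aptables)   # any APTable[1] set
    forbid_el0 = any(t & 0b01 for t in aptables)        # any APTable[0] set
    read_only = bool(ap & 0b10) or force_read_only      # AP[2] set, or forced by a table
    el0_allowed = bool(ap & 0b01) and not forbid_el0    # AP[1] set, and not forbidden
    el1 = {"R"} if read_only else {"R", "W"}
    el0 = set(el1) if el0_allowed else set()
    return {"EL1": el1, "EL0": el0}
```

For example, with no restrictions from the tables and AP[2:1] = 0b01 both exception levels get read and write access, whereas setting APTable[1] in any table entry reduces both to read-only.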

**Faults** The MMU may emit one of several fault types during a translation table walk (these are referred to by Arm as the *MMU fault* types):

- ▷ Translation fault.
  - These are caused by the mapping being invalid, either because bit[0] was 0, or because the descriptor encoding was reserved-as-invalid. Translation faults also result from trying to translate an address that is outside the 48-bit input address range.
- ▷ Permission fault.
  - For when the mapping was valid, but the access permissions do not permit the requested access (for example, trying to write to a read-only address).
- ▷ Access flag fault.
  - These are generated when hardware management of access flags is disabled and the access flag bit is not set.
- ▷ TLB conflict abort (see TODO: ?REF?).
- ▷ Alignment fault.
  - Generated when an operation expects an aligned memory address, but is given a misaligned one, and alignment checking is enabled in the SCTLR.
- ▷ Address size fault.
  - For when the OA, or TTBR, has a value that is out of the physical address range.
- ▷ Synchronous external abort on a translation table walk.
  - These are *external aborts* (that come from the system, not from the MMU) that happen due to accesses that the MMU generated. For example, if the next-level table field pointed to an address for which there was no memory or device, the system-on-chip would return a fault to the processor.

These faults lead to processor exceptions. The fault type is stored in the ESR\_ELn ("exception syndrome register") register's EC ("exception class") field, and any supplementary information is stored in its ISS ("instruction specific syndrome") field (such as which level in the tree the fault came from, and whether the originating instruction was a read or a write). Exception handling code can read the ESR register to determine the fault type and cause, can read the FAR\_ELn ("fault address register") to determine the virtual address which triggered the fault, and can then handle the fault appropriately.

<sup>1</sup>Block/page entries do not store the entire AP field but only AP[2:1]. AP[0] is not present in AArch64.

| APTable[1] | APTable[0] | AP[2] | AP[1] | EL1 R | EL1 W | EL1 X | EL0 R | EL0 W | EL0 X |
|------------|------------|-------|-------|-------|-------|-------|-------|-------|-------|
| 0          | 0          | 0     | 0     | ✓     | ✓     | ✓     | ×     | ×     | ✓     |
| 0          | 0          | 0     | 1     | ✓     | ✓     | ×     | ✓     | ✓     | ✓     |
| 0          | 0          | 1     | 0     | ✓     | ×     | ×     | ×     | ×     | ✓     |
| 0          | 0          | 1     | 1     | ✓     | ×     | ✓     | ✓     | ×     | ×     |
| 0          | 1          | 0     | 0     | ✓     | ✓     | ✓     | ×     | ×     | ✓     |
| 0          | 1          | 0     | 1     | ✓     | ✓     | ×     | ×     | ×     | ✓     |
| 0          | 1          | 1     | 0     | ✓     | ×     | ×     | ×     | ×     | ✓     |
| 0          | 1          | 1     | 1     | ✓     | ×     | ✓     | ×     | ×     | ×     |
| 1          | 0          | 0     | 0     | ✓     | ×     | ✓     | ×     | ×     | ✓     |
| 1          | 0          | 0     | 1     | ✓     | ×     | ×     | ✓     | ×     | ✓     |
| 1          | 0          | 1     | 0     | ✓     | ×     | ×     | ×     | ×     | ✓     |
| 1          | 0          | 1     | 1     | ✓     | ×     | ✓     | ✓     | ×     | ×     |
| 1          | 1          | 0     | 0     | ✓     | ×     | ✓     | ×     | ×     | ✓     |
| 1          | 1          | 0     | 1     | ✓     | ×     | ×     | ×     | ×     | ✓     |
| 1          | 1          | 1     | 0     | ✓     | ×     | ×     | ×     | ×     | ✓     |
| 1          | 1          | 1     | 1     | ✓     | ×     | ✓     | ×     | ×     | ×     |

**Figure 7.2:** Merging Access Permissions (Stage 1, EL1&0). Entries in **red** highlight differences from the APTable=00.

**Memory Attributes** The processor does not necessarily know what is located at any physical address. There may be some dynamic random-access memory (DRAM, what one would generally consider 'memory'), but there may also be other memory-mapped devices, or non-volatile memory, or other peripherals, or possibly nothing at all.

To help accommodate this, hardware allows software to mark regions of memory as one of either *device* memory, *normal cacheable* memory, or normal *non-cacheable* memory, using the translation tables.

The desired memory type is determined from the AttrIndx field (bits[4:2]) in block and page entries. Instead of being directly encoded into this field, Arm chose to have the actual attributes stored in a separate register: the MAIR ("Memory attribute indirection register") register. The MAIR stores an array of eight 8-bit fields each of which contains an encoding of a memory type. The AttrIndx field in the entry is an integer in the range 0-7, which is the index of the field in the MAIR register to use.

This indirection means that the final result of translation depends not only on the value of the final leaf entry in memory, but on the value of certain system registers, such as the MAIR, at the time of the translation table walk.

Below are the three most common encodings for a MAIR field, and the ones that will be useful later when discussing tests:

- ▷ 0b0000\_0000: device memory.
- ▷ 0b0100\_0100: normal non-cacheable memory.
- ▷ 0b1111\_1111: normal cacheable memory, inner&outer write-back non-transient, read&write-allocating.
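The AttrIndx indirection is a simple byte lookup. The sketch below assumes that field 0 of the MAIR occupies its least-significant byte, and uses an example MAIR value built from the three encodings listed above; the function name and the example value are illustrative.

```python
def memory_attribute(mair: int, attr_indx: int) -> int:
    """Return the 8-bit memory type encoding selected by AttrIndx (0 to 7)."""
    if not 0 <= attr_indx <= 7:
        raise ValueError("AttrIndx is a 3-bit field")
    return (mair >> (8 * attr_indx)) & 0xFF


# Example MAIR value (an assumption for illustration):
# field 0 = device, field 1 = normal non-cacheable, field 2 = normal cacheable.
EXAMPLE_MAIR = 0x0000_0000_00FF_4400
```

With this value, an entry with AttrIndx 1 selects `0b0100_0100`; changing the MAIR changes the memory type of every entry using that index without touching the tables.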

Memory locations marked as device tell the hardware that reads or writes to those locations may have side-effects. This means hardware treats those locations differently: there will be no speculative instruction fetches, reads, or writes to those locations; writes to those locations will not *gather* into larger writes; reads and writes to those locations will not re-order with respect to others; those locations generally will not get cached; and other thread-local optimizations get disabled. Note that Arm define a wide range of device memory types, allowing the systems programmer to selectively re-enable some of the previously described behaviours to enable better performance where they deem it safe to do so.

For normal memory the software can choose between *cacheable* or non-*cacheable* memory. Arm provide a range of different options for the cacheability:

- ▷ non-cacheable
- ▷ write-back cacheable
- ▷ write-through cacheable

As with other features, there is a wide scope for configuration: separately configuring inner (L1,L2) and outer (L3) caches, and adding cache allocation hints (allocating on reads, writes or both).

As we will see later (TODO: ?REF?), the ability to change cacheability, or even have multiple aliases with different cacheability attributes, gives rise to interesting behaviours and security considerations.

## 7.4 Virtualisation and a second stage of translation

So far this chapter has focused on operating systems and processes. However, modern systems isolate not just processes within an operating system but entire operating systems from one another, within a hypervisor.

To do this, software uses the virtual memory abstraction again, adding an extra layer. This layer, like the previous one, is supported by hardware. Processes use virtual addresses which are converted to *intermediate physical* (also sometimes known as *guest physical*) addresses using the operating system's configured translation tables, but these intermediate physical addresses (IPAs) then go through another round of translation to convert them into the final physical addresses.

Arm calls these *stages* of translation, and the MMU supports both stages and can perform the full translation from virtual to physical (via the intermediate physical) address.

This means software must manage two sets of translation tables: operating systems manage the *stage 1* tables to convert VAs to IPAs; and hypervisors manage *stage 2* tables to convert those IPAs to PAs; this gives two separate translate functions, which the MMU composes together at runtime:

```
translate_stage1 : VirtualAddress → IPA × Permissions × MemoryType
translate_stage2 : IPA → PhysicalAddress × Permissions × MemoryType
```

Hypervisors (running at EL2) can configure the stage 2 translate function by creating translation tables with a similar format as before and then storing a pointer to the root of this tree in the VTTBR ("Virtualization translation table base register") register. The MMU will read the VTTBR whenever it needs to perform a second-stage translation to convert an IPA to a PA, and will do the translation table walk over that tree in much the same way as described earlier for (what we can now call) the first-stage translation.

This results in two address spaces, a virtual address space and an intermediate-physical address space. Figure 7.3 contains an example layout of these address spaces for a machine running three processes (P0,P1,P2) in two operating systems (OS0,OS1). As with the earlier diagram in Figure 7.1, each column is a (set of) address spaces, with transformations between them defined by their respective translation functions. On the left-hand side are the virtual address spaces of the various processes, whose virtual addresses are translated (using the translation tables pointed to by the TTBR register) into intermediate-physical addresses in the central address spaces (for the respective OS). Those IPAs are then translated (using the VTTBR) into the final physical address.

Concretely, if P1 reads from address 0x1001, it will be translated into the IPA 0x3001 in OS0's address space, which then gets translated again, and the processor will actually read from RAM at location 0x6001.
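The composition of the two stages can be sketched at page granularity. The two tables below contain only the single mapping stated in the example above (everything else in Figure 7.3 is omitted), a page size of 0x1000 is assumed, and all names are illustrative.

```python
from typing import Optional

PAGE_SIZE = 0x1000  # assumed page size

P1_STAGE1 = {1: 3}   # VA page 1 -> IPA page 3, managed by the operating system
OS0_STAGE2 = {3: 6}  # IPA page 3 -> PA page 6, managed by the hypervisor


def translate(pages: dict, addr: int) -> Optional[int]:
    """One stage of translation; None models an unmapped address."""
    page, offset = divmod(addr, PAGE_SIZE)
    out_page = pages.get(page)
    return None if out_page is None else out_page * PAGE_SIZE + offset


def translate_two_stage(stage1: dict, stage2: dict, va: int) -> Optional[int]:
    """Compose stage 1 (VA to IPA) with stage 2 (IPA to PA)."""
    ipa = translate(stage1, va)
    if ipa is None:
        return None                 # unmapped at stage 1
    return translate(stage2, ipa)   # None here means unmapped at stage 2
```

Here `translate_two_stage(P1_STAGE1, OS0_STAGE2, 0x1001)` gives `0x6001`, passing through the IPA `0x3001` on the way.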

**Differences in the translation table format from stage 1** Stage 2 translation tables are similar to their stage 1 counterparts, but there are some minor differences:

- ▷ Stage 2 table entries do not have any additional attributes, and so do not have an APTable field.
- ▷ The stage 2 AP field (called S2AP) has a slightly different (and simpler) format; see Figure 7.4.
- ▷ Stage 2 block and page entries do not have an AttrIndx field but rather encode the memory type directly into the MemAttr field, bits[5:2] (see the full description in the Arm ARM [1, D5-4874] for all possible encodings):
  - 0b0000: Device memory.
  - 0b0101: Normal non-cacheable.
  - 0b1111: Normal write-back inner&outer cacheable.

**Figure 7.3:** Example virtual, intermediate physical, and physical address spaces for three processes running on two operating systems.

| Field   | When set (1) | When unset (0) |
|---------|--------------|----------------|
| S2AP[1] | Writeable    | Not writeable  |
| S2AP[0] | Readable     | Not readable   |

Figure 7.4: S2AP field encoding.

These are interesting as they mean that the stage 1 and stage 2 attributes (permissions and memory types) must be *combined* in order to produce the final output. This combination is not just a case of letting stage 2 overrule the stage 1 settings but rather that both stages get a veto: if stage 1 sets the memory type to be device or non-cacheable then it overrules what stage 2 sets. Similarly, if stage 1 permissions forbid an access then the stage 2 permissions cannot overrule that.
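The 'both stages get a veto' rule can be illustrated with a deliberately simplified model: permissions are intersected, and the more restrictive of the two memory types wins. The ordering of memory types below, the three type names, and both function names are assumptions made for this sketch; the complete combining rules are those of the Arm ARM.

```python
# Assumed restrictiveness ordering for this sketch: device > non-cacheable > cacheable.
RESTRICTIVENESS = {"cacheable": 0, "non-cacheable": 1, "device": 2}


def combine_memory_type(stage1: str, stage2: str) -> str:
    """Either stage can force a more restrictive memory type."""
    return max(stage1, stage2, key=RESTRICTIVENESS.__getitem__)


def combine_permissions(stage1: set, stage2: set) -> set:
    """An access is permitted only if both stages permit it."""
    return stage1 & stage2
```

For example, a stage 1 mapping marked device stays device even if stage 2 marks the same location as normal cacheable, and a write is forbidden as soon as either stage forbids it.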

**Second-stage translations during a first-stage walk** There is a complication with the story so far. The stage 1 tables are created by the operating system, which is using an intermediate physical address space, not a physical one. The writes the OS does to the tables will be translated, as they are normal data writes. However, the tables themselves contain references to other tables, and those references, like the value of the TTBR itself, are intermediate physical addresses, and so they must also be translated.

In our assumed configuration of 4KiB pages and 4 levels of translation, this leads to a maximum of 24 memory accesses to perform the translation: 4 reads of stage 1 translation tables, 16 reads of stage 2 translation tables during those stage 1 walks, and a final 4 reads of the stage 2 translation tables to translate the output IPA into the final PA.
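The count of 24 can be reproduced with a short calculation. This sketch assumes a full-depth walk at both stages with no cached translations; the function and parameter names are illustrative.

```python
def two_stage_walk_reads(stage1_levels: int = 4, stage2_levels: int = 4) -> int:
    """Worst-case number of memory reads for one two-stage translation."""
    stage1_reads = stage1_levels
    # The TTBR and each of the three table pointers is an IPA, so every one of the
    # stage 1 table reads is preceded by a full stage 2 walk.
    stage2_reads_during_stage1 = stage1_levels * stage2_levels
    # The IPA produced by stage 1 needs one final stage 2 walk.
    stage2_reads_final = stage2_levels
    return stage1_reads + stage2_reads_during_stage1 + stage2_reads_final
```

With four levels at each stage this gives 4 + 16 + 4 = 24 reads.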

## 7.5 Translation regimes

As mentioned earlier, there are multiple translation table base registers. Each of them defines a translation function, pointing to the root of the tree of translation tables which define it. These translation functions are then composed together into various translation *regimes*, each defining the set of translation functions (and therefore which translation table base registers) which will be used for translations done by the processor.

Arm define a set of these translation regimes. Figure 7.5 gives an overview of three of the most common regimes, which are:

- ▷ EL1&0 (two-stage)
  - For programs executing at EL0 or EL1 when virtualisation (at EL2) is enabled.
  - VAs with the high bit set are translated into IPAs using the EL1-configured register, TTBR1\_EL1.
     VAs are typically split into 'high' and 'low' regions with different translations, primarily used for separate kernel and user address spaces.
  - VAs without the high bit set are translated into IPAs using the EL1-configured register, TTBR0\_EL1.
  - IPAs are translated to PAs using the EL2-configured VTTBR\_EL2 register.
- ▷ EL1&0 (single-stage)
  - For programs executing at EL0 or EL1 when virtualisation (at EL2) is disabled.
  - VAs with the high bit set are translated into PAs using the EL1-configured register, TTBR1\_EL1.
  - VAs without the high bit set are translated into PAs using the EL1-configured register, TTBR0\_EL1.
- ▷ EL2
  - For programs executing at EL2.
  - VAs without the high bit set are translated into PAs using the EL2-configured register, TTBR0\_EL2.
  - VAs with the high bit set are always unmapped.

Which translation regime is being used is defined by various system registers and the current system state.

- ▷ Translations at EL1 or EL0 use one of the EL1&0 regimes.
- ▷ Translations at EL2 use the EL2 regime.
- ▷ HCR\_EL2 (set at EL2) determines whether the EL1&0 regime is single-stage or two-stage.
- ▷ TTBR0\_EL1, TTBR1\_EL1 determine the stage 1 of the EL1&0 regimes, and can be set at EL1 or higher.
- ▷ TTBR0\_EL2 determines the stage 1 of the EL2 regime, and can only be set at EL2 or higher.
- ▷ VTTBR\_EL2 determines the stage 2 of the EL1&0 regime, and can only be set at EL2 or higher.
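As a rough illustration, the choice of base registers for the three regimes described above can be written as a small Python function. The function name, the `two_stage` flag, and the use of bit 63 as "the high bit" are simplifying assumptions of this sketch, and secure states and EL3 are ignored.

```python
def translation_registers(el: int, va: int, two_stage: bool):
    """Return (stage 1 base register, stage 2 base register) for a VA.

    None means that no register applies: either there is no second
    stage, or the address is unmapped in that regime.
    """
    high = (va >> 63) & 1 == 1  # assumption: "high bit" modelled as bit 63
    if el in (0, 1):
        stage1 = "TTBR1_EL1" if high else "TTBR0_EL1"
        stage2 = "VTTBR_EL2" if two_stage else None
        return stage1, stage2
    if el == 2:
        if high:
            return None, None  # high VAs are always unmapped in the EL2 regime
        return "TTBR0_EL2", None
    raise ValueError("only EL0, EL1 and EL2 are modelled")
```

A kernel address at EL1 under virtualisation, for instance, selects TTBR1\_EL1 for stage 1 and VTTBR\_EL2 for stage 2.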

Arm define a wide range of other regimes, see the Arm ARM TODO: ?REF?. For simplicity, we ignore secure modes, including all of EL3.







Figure 7.5: Translation regimes that apply to EL0, EL1, and EL2.

## 7.6 Arm pseudocode

It is now useful to examine the official Arm pseudocode, especially those parts that relate to memory events.

We will do this in three steps: first, by looking at the pseudocode that is executed for an Arm store instruction; then, by following the memory accesses it performs down to any translations they require; and finally, by looking at the Arm translation table walker in full. There is a lot of detail throughout the Arm pseudocode, so in this section we shall focus on the most pertinent parts, and give some idea of what detail is omitted.

### 7.6.1 The lifecycle of a store

Arm give precise executable semantics for every instruction in their domain-specific Architecture Specification
Language (ASL). This ASL code defines the sequential intra-instruction behaviour of each instruction, including
memory accesses, and any translation table walks they perform.

```
 1  bits(64) address;
 2  bits(datasize) data;

 7  if n == 31 then
 8      if memop != MemOp_PREFETCH then CheckSPAlignment();
 9      address = SP[];
10  else
11      address = X[n];

21          data = X[t];
22          Mem[address, datasize DIV 8, acctype] = data;
```

Figure 7.6: Arm "STR (immediate)" ASL code.

## TODO: the importance of the ASL, and of sequential v concurrent behaviour will already be explained, but recap here anyway?

Figure 7.6 shows the Arm ASL for the "STR (Immediate)" instruction: STR Xt,[Xn]. This instruction writes the value contained in register Xt into the memory location stored in register Xn. The figure has some uninteresting (for this thesis) parts greyed out: those parts that deal with optional extensions such as memory tagging; unknown register values; register writeback; and, the load and prefetch instructions which use the same ASL code.

The ASL code first reads the virtual address either from the stack pointer (line 9) or by reading register Xn (line 11). It then reads the data from the register Xt (line 21), which will be written to memory. Finally, it performs the store itself using the Mem[] function (line 22).

### 7.6.2 Writes to memory

The Mem[] function is responsible for checking alignment and performing each memory access the instruction does. The ASL for Mem[] can be found in Figure 7.7.

- It does some alignment checks, and then calls MemSingle[] once for each single-copy atomic write the access performs.
- For example, for a fully aligned store, it calls MemSingle[] just once (lines 37 or 57), and, for a misaligned store, it will call MemSingle[] once for each byte (line 51).
- The MemSingle[] call then performs the translation and, if successful, the actual write to memory. Its ASL can be found in Figure 7.8, with parts for extensions and store pair greyed out. On line 12, it calls AArch64.TranslateAddress to do the translation table walk. If the translation succeeds, then the code calls PhysMemWrite (on line 40), an uninterpreted function with no behaviour in ASL, which represents the actual write to memory. After handling any external aborts from the write, the function returns.
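The splitting behaviour described in the list above can be sketched as follows. This is a simplification under stated assumptions: the function name is hypothetical, store-pair and vector accesses are ignored, and "fully aligned" is taken to mean that the address is a multiple of the access size.

```python
def mem_single_calls(address: int, size: int):
    """List the (address, size) single-copy atomic writes for one store."""
    if address % size == 0:
        # Aligned: one MemSingle[] call covers the whole access.
        return [(address, size)]
    # Misaligned: one MemSingle[] call per byte.
    return [(address + i, 1) for i in range(size)]
```

An aligned 8-byte store therefore yields a single write, while a 4-byte store to an odd address yields four single-byte writes.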

```
 1  Mem[bits(64) address, integer size, AccType acctype, boolean ispair] = bits(size*8) value_in
 2      boolean iswrite = TRUE;
 3      constant halfsize = size DIV 2;
 4      bits(size*8) value = value_in;
 5      bits(halfsize*8) lowhalf, highhalf;
 6      boolean atomic;
 7      boolean aligned;

11      if ispair then
12          // check alignment on size of element accessed, not overall access size
13          aligned = AArch64.CheckAlignment(address, halfsize, acctype, iswrite);
14      else
15          aligned = AArch64.CheckAlignment(address, size, acctype, iswrite);

17      if ispair then
18          atomic = CheckAllInAlignedQuantity(address, size, 16);
19      elsif size != 16 || !(acctype IN {AccType_VEC, AccType_VECSTREAM}) then
20          atomic = aligned;

31      if !atomic && ispair && address == Align(address, halfsize) then
32          single_is_aligned = TRUE;
33          <highhalf, lowhalf> = value;
34          AArch64.MemSingle[address, halfsize, acctype, single_is_aligned, ispair] = lowhalf;
35          AArch64.MemSingle[address + halfsize, halfsize, acctype, single_is_aligned, ispair] = highhalf;
36      elsif atomic && ispair then
37          AArch64.MemSingle[address, size, acctype, aligned, ispair] = value;
38      elsif !atomic then
39          assert size > 1;
40          AArch64.MemSingle[address, 1, acctype, aligned] = value<7:0>;

42          // For subsequent bytes it is CONSTRAINED UNPREDICTABLE whether an unaligned Device memory
43          // access will generate an Alignment Fault, as to get this far means the first byte did
44          // not, so we must be changing to a new translation page.
45          if !aligned then
46              c = ConstrainUnpredictable(Unpredictable_DEVPAGE2);
47              assert c IN {Constraint_FAULT, Constraint_NONE};
48              if c == Constraint_NONE then aligned = TRUE;

50          for i = 1 to size-1
51              AArch64.MemSingle[address+i, 1, acctype, aligned] = value<8*i+7:8*i>;

57          AArch64.MemSingle[address, size, acctype, aligned, ispair] = value;
58      return;
```

```
 1  AArch64.MemSingle[bits(64) address, integer size, AccType acctype, boolean aligned, boolean ispair] = bits(size*8) value
 2      assert size IN {1, 2, 4, 8, 16};
 3      constant halfsize = size DIV 2;

            assert address == Align(address, size);

 9      AddressDescriptor memaddrdesc;
10      iswrite = TRUE;

12      memaddrdesc = AArch64.TranslateAddress(address, acctype, iswrite, aligned, size);
13      // Check for aborts or debug exceptions
14      if IsFault(memaddrdesc) then
15          AArch64.Abort(address, memaddrdesc.fault);

17      // Effect on exclusives
18      if memaddrdesc.memattrs.shareability != Shareability_NSH then
19          ClearExclusiveByAddress(memaddrdesc.paddress, ProcessorID(), size);

21      // Memory array access
22      AccessDescriptor accdesc;

29          accdesc = CreateAccessDescriptor(acctype);

37      PhysMemRetStatus memstatus;
38      (atomic, splitpair) = CheckSingleAccessAttributes(address, memaddrdesc.memattrs, size, acctype, iswrite, aligned, ispair);

40          memstatus = PhysMemWrite(memaddrdesc, size, accdesc, value);
41          if IsFault(memstatus) then
42              HandleExternalWriteAbort(memstatus, memaddrdesc, size, accdesc);

61      return;
```

### 7.6.3 Translation table walks

It is the AArch64.TranslateAddress function which begins the process that performs the actual translation table walk, converting the input virtual address to the physical one. The full ASL code is too long to fit in a single figure, and so it can be found in §7.8 at the end of this chapter. This section will reference the relevant lines from the translation table walk ASL.

Figure 7.9 is an example trace of the execution of the STR Xt, [Xn] instruction, as it would happen if we were to execute it from EL1 in the EL1&0 two-stage regime. Each node represents an event in the trace (a memory or register access), and the arrows between them represent control flow. TODO: Generate from an actual isla trace rather than by hand? at least to be proper... TODO: Give labels to each event?

As described before, the instruction starts by reading the Xt and Xn registers, before beginning the call to AArch64.TranslateAddress.

The events drawn inside the dotted box come from accesses during the call to the translation table walk functions. It first calls FullTranslate (in AArch64.TranslateAddress, page 58, line 2), which calls S1Translate (in AArch64.FullTranslate, page 59, line 12), which calls S1Walk (in AArch64.S1Translate, page 60, line 29) to do the actual first-stage translation table walk. It begins by reading the relevant TTBR register to get the root table address (in AArch64.S1Walk, page 63, line 9). This is stored in a walkstate struct, which the ASL code uses to keep track of the state that changes as the walk progresses, notably, the next-level table address and any accumulated permissions. It then begins the loop to do the walk, starting from the table address read from the TTBR. On each iteration of the loop, the intermediate-physical address of the entry to be read is computed (in AArch64.S1Walk, page 63, line 38), and passed through a second stage of translation (in AArch64.S1Walk, page 63, line 47).

This second stage translation calls S2Walk, which behaves similarly to the S1Walk function, taking the following steps: it reads the VTTBR (in AArch64.S2Walk, page 67, line 11); computes the (now) physical address of the entry to read (in AArch64.S2Walk, page 67, line 41); and reads it (in AArch64.S2Walk, page 67, line 44), eventually calling PhysMemRead (in AArch64.FetchDescriptor, page 69, line 23), which appears as the first R S2 L0 node in Figure 7.9.

S2Walk continues to loop, each time updating the running walkstate with the next-level table address from the decoded descriptor (in AArch64.S2Walk, page 67, line 53), until a leaf entry is found. It is either invalid (in AArch64.S2Walk, page 68, line 65), or, a valid page or block entry (in AArch64.S2Walk, page 68, line 70). These correspond to the next three R S2 Ln events in the figure.

Assuming the walk did not fail with a fault, the S2Translate function returns with the physical address of the stage 1 level 0 table. S1Walk can continue, performing a read of the physical memory in the table (in AArch64.S1Walk, page 63, line 52). From there, S1Walk continues in much the same way as the stage 2 walk did: computing the current table intermediate-physical address, translating it to get the physical address, performing the read of memory to get the descriptor, until a leaf entry is found.

This process generates all the events up to, and including, the final stage 1 entry read (the R S1 L3 event), returning the intermediate-physical address that S1Walk computed.

Finally, FullTranslate calls S2Translate one last time (in AArch64.FullTranslate, page 59, line 22) on the intermediate-physical address, generating the last Rreg(VTTBR) and R S2 Ln events, and producing the final PA of the translation.

This output PA is what is passed to the PhysMemWrite of the MemSingle[] call we saw earlier, generating the final W [pa]=data event in the trace.
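The shape of this trace can be reproduced with a short Python sketch that lists the events in the order described above. The labels `Rreg(Xn)`, `Rreg(Xt)` and `Rreg(TTBR)` are naming assumptions made for this sketch; the other labels follow the text. It assumes four levels at each stage, no faults, and no caching.

```python
def s2_walk(levels: int = 4):
    """Events for one stage 2 walk: read VTTBR, then one read per level."""
    return ["Rreg(VTTBR)"] + [f"R S2 L{lvl}" for lvl in range(levels)]


def store_trace(levels: int = 4):
    """Events for STR Xt, [Xn] in the EL1&0 two-stage regime."""
    events = ["Rreg(Xn)", "Rreg(Xt)", "Rreg(TTBR)"]
    for lvl in range(levels):
        events += s2_walk(levels)      # translate the IPA of the stage 1 entry
        events.append(f"R S1 L{lvl}")  # read the stage 1 entry itself
    events += s2_walk(levels)          # translate the output IPA to a PA
    events.append("W [pa]=data")
    return events
```

Counting the `R S1` and `R S2` events in the result gives the 24 translation reads mentioned earlier, followed by the single write.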

## 7.7 Caching in TLBs

Hardware does not simply perform the (up to) 24 additional memory accesses for every instruction-fetch, read, or write. This would have an unacceptable performance penalty. Instead, the results of previous translations of the same address are cached, in specialised structures called *Translation Lookaside Buffers*, or simply TLBs. These TLBs can store whole translation results, or the separate virtual and intermediate-physical mappings, or individual translation table entries, or a mix of the above, which we will explore more in the next chapter.



Figure 7.9: Memory and register accesses during a 'STR Xt, [Xn]' instruction.

When the processor translates a virtual address, it first looks for it in the TLB. If there is no entry, then this is called a TLB miss and a translation table walk must be performed. The results of this walk are typically then cached in the TLB, so future translations of the same address can directly grab the physical address, memory attributes, and permissions, without needing to do another translation table walk. This process and the various microarchitectural structures are explored more in §8.3.1.

If there is an entry, this is referred to as a TLB hit. In this case, the result can be taken directly from the TLB.
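The hit and miss paths can be illustrated with a toy cache in Python. This is a sketch under simplifying assumptions: the class and its `walk` callback are hypothetical, entries are whole translation results keyed by page, and there is no eviction or invalidation.

```python
class TLB:
    """Toy TLB that caches whole translation results per virtual page."""

    def __init__(self, walk):
        self._walk = walk    # callback standing in for the translation table walk
        self._entries = {}   # virtual page -> cached walk result
        self.misses = 0

    def translate(self, va_page):
        if va_page in self._entries:
            return self._entries[va_page]  # TLB hit: no walk needed
        self.misses += 1                   # TLB miss: walk, then cache the result
        result = self._walk(va_page)
        self._entries[va_page] = result
        return result
```

Translating the same page twice invokes the walk only once; the second call is served from the cached entry.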

Under normal circumstances, the TLB is invisible to userspace programs. However, systems code is expected to manage the TLBs explicitly, using a set of instructions which Arm provide specifically for this purpose: the family of TLBI TLB-maintenance instructions. When context switching, the systems software must manually manage the TLB, invalidating stale entries for old mappings out of the cache. The behaviours that arise from reading from potentially stale TLB entries are explored in detail in §8.5.

**Address space identifiers** TLB misses and TLB maintenance are both expensive operations, and so to reduce the burden, Arm provide a mechanism to permit multiple processes' address spaces to be loaded into the TLB at the same time, by allowing the software to mark each address space with a numeric label. Arm call these address space identifiers (or ASIDs).

Entries in the TLB are tagged with the current ASID, and so only that process will see entries in the TLB with that ASID.

The current ASID is encoded in the high order bits of the current TTBR. During a context switch, the system software needs only switch to the new translation tables for the new address space of the other process, without doing TLB maintenance, so long as it ensures the ASIDs are distinct.

There are only finitely many ASIDs available (typically it is an 8-bit field), and so eventually TLB maintenance is required to re-use a previously allocated ASID for a new address space. But this happens far less frequently than the context switches themselves. The provided TLB maintenance instructions can target specific ASIDs, avoiding the need to over-invalidate other cached address space translations, preventing a cascade of TLB misses in other processes, further improving the runtime performance for a small amount of additional effort on the software side.

**VMIDs** Address space identifiers are used only for stage 1 translations. Stage 2 has virtual machine identifiers (VMIDs).

As before, the current VMID is encoded in the VTTBR\_EL2 register, and the TLB entries are additionally tagged with the current VMID (as well as the ASID), and a translation will only use TLB entries that match the current ASID and VMID.
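A minimal Python sketch of this tagging, assuming a TLB that stores whole translations keyed by VMID, ASID and virtual page. The class and method names are hypothetical.

```python
class TaggedTLB:
    """Toy TLB whose entries are tagged with (VMID, ASID)."""

    def __init__(self):
        self._entries = {}  # (vmid, asid, va_page) -> pa_page

    def fill(self, vmid, asid, va_page, pa_page):
        self._entries[(vmid, asid, va_page)] = pa_page

    def lookup(self, vmid, asid, va_page):
        # Only entries matching the current VMID and ASID can hit.
        return self._entries.get((vmid, asid, va_page))
```

Two processes with different ASIDs can then have the same virtual page cached at once without interfering, which is what lets a context switch skip TLB maintenance.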

**TLB maintenance instructions** Arm define a whole family of instructions under the TLBI mnemonic.

The format for a TLBI instruction is a product of fields:

```
TLBI <type><level><broadcast>{,<reg>}

        <type> =
          ALL | VMALL | ASID | VA{A|L} | IPAS2
        <level> =
          E1 | E2
        <broadcast> =
          {IS}
        <reg> =
          X0 | X1 | ... | X30
```

Again, see the Arm manual for a more complete description [1, D5-4915].

The most common, and the ones that will be discussed in the following chapters, are as follows:

- ▷ TLBI VAE1, Xn: Invalidate this CPU's cached copies of entries used to translate the virtual address in register Xn, for the EL1&0 regime, for the current ASID and VMID.
- ▷ TLBI VALE1, Xn: Invalidate this CPU's cached copies of any last-level entries used to translate the virtual address in register Xn, for the EL1&0 regime, for the current ASID and VMID.
- ▷ TLBI VAAE1, Xn: Invalidate this CPU's cached copies of entries used to translate the virtual address in register Xn, for the EL1&0 regime, for the current VMID, for any ASID.
- ▷ TLBI VAE1IS, Xn: Invalidate all CPUs' cached copies of entries used to translate the virtual address in register Xn, for the EL1&0 regime, for the current ASID and VMID.
- (...and equivalent TLBI VAE2, TLBI VALE2, TLBI VAE2IS instructions for virtual addresses in the EL2 regime)
- ▷ TLBI IPAS2LE1, Xn: Invalidate this CPU's cached copies of any last-level entries used to translate the intermediate physical address in register Xn, for the EL1&0 regime, for the current VMID.
- ▷ TLBI IPAS2E1IS, Xn: Invalidate all CPUs' cached copies of entries used to translate the intermediate physical address in register Xn, for the EL1&0 regime, for the current VMID.
- ▷ TLBI VMALLE1: Invalidate this CPU's cached copies of entries for the EL1&0 regime, for the current VMID.
- ▷ TLBI VMALLE1IS: Invalidate all CPUs' cached copies of entries for the EL1&0 regime, for the current VMID.
- ▷ TLBI ALLE1: Invalidate this CPU's cached copies of entries for the EL1&0 regime, for any ASID or VMID.
- ▷ TLBI ALLE1IS: Invalidate all CPUs' cached copies of entries for the EL1&0 regime, for any ASID or VMID.
- (...and equivalent TLBI ALLE2, and TLBI ALLE2IS instructions for the EL2 regime)
- ▷ TLBI ASIDE1, Xn: Invalidate this CPU's cached copies of entries for the EL1&0 regime, for the ASID specified in register Xn.
- ▷ TLBI ASIDE1IS, Xn: Invalidate all CPUs' cached copies of entries for the EL1&0 regime, for the ASID specified in register Xn.
- (Note that the EL2 regime does not have ASIDs)
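To make the scoping of these instructions concrete, the following Python sketch models a single CPU's TLB as a dictionary keyed by (VMID, ASID, virtual page) and gives the effect of a few of the non-broadcast variants. The function names and the dictionary model are assumptions of this sketch. It ignores last-level-only variants, stage 2 entries, and broadcast, and it assumes that TLBI ASIDE1 is scoped to the current VMID.

```python
def tlbi_vae1(tlb, vmid, asid, va_page):
    """Drop entries for one VA under the current ASID and VMID."""
    return {k: v for k, v in tlb.items() if k != (vmid, asid, va_page)}


def tlbi_vaae1(tlb, vmid, va_page):
    """Drop entries for one VA under the current VMID, for any ASID."""
    return {k: v for k, v in tlb.items() if not (k[0] == vmid and k[2] == va_page)}


def tlbi_aside1(tlb, vmid, asid):
    """Drop all entries for one ASID (assumed scoped to the current VMID)."""
    return {k: v for k, v in tlb.items() if not (k[0] == vmid and k[1] == asid)}


def tlbi_vmalle1(tlb, vmid):
    """Drop all entries for the current VMID."""
    return {k: v for k, v in tlb.items() if k[0] != vmid}
```

Each function returns a new dictionary with the matching entries removed, so the progressively wider scope of VAE1, VAAE1, ASIDE1 and VMALLE1 can be seen by comparing which keys survive.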

## 7.8 Arm ASL Reference

Here I include the actual Arm ASL for the various parts of the translation machinery. This listing contains a verbatim subset of the ASL pseudocode for the translation table walk.

The sources have per-function line numbers and are annotated to direct the reader to those parts highlighted in §7.6.3. Lines which handle out-of-scope features (access flags, dirty bits, shareability domains, debugging, realms, secure states, atomics) are greyed out. Key lines have <u>coloured annotations</u>.

The ASL code listed here (minus the annotations) is copyright Arm, publicly available on Arm's webpage [55], and is restricted to only those parts of the translation table walk we are discussing. It is included here under fair dealing, for the purpose of criticism and review [56, s. 30].

### 7.8.1 AArch64.TranslateAddress

```
AddressDescriptor AArch64.TranslateAddress(bits(64) va, AccType acctype, boolean iswrite, boolean aligned, integer size)
    result = AArch64.FullTranslate(va, acctype, iswrite, aligned);  ← Do the translation

    if !IsFault(result) && acctype != AccType_IFETCH then
        result.fault = AArch64.CheckDebug(va, acctype, iswrite, size);

    if HaveRME() && !IsFault(result) && (acctype != AccType_DC || boolean IMPLEMENTATION_DEFINED "GPC Fault on DC operations") then
        accdesc = CreateAccessDescriptor(acctype);
        result.fault.gpcf = GranuleProtectionCheck(result, accdesc);

        if result.fault.gpcf.gpf != GPCF_None then
            result.fault.statuscode = Fault_GPCFOnOutput;
            result.fault.paddress = result.paddress;
            result.fault.acctype = acctype;
            result.fault.write = iswrite;

    if !IsFault(result) && acctype == AccType_IFETCH then
        result.fault = AArch64.CheckDebug(va, acctype, iswrite, size);

    // Update virtual address for abort functions
    result.vaddress = ZeroExtend(va);

    return result;
```

### 7.8.2 AArch64.FullTranslate

```
 1  AddressDescriptor AArch64.FullTranslate(bits(64) va, AccType acctype, boolean iswrite, boolean aligned)

 3      fault = NoFault();
 4      fault.acctype = acctype;
 5      fault.write = iswrite;

 8      regime = TranslationRegime(PSTATE.EL, acctype);

11      AddressDescriptor ipa;
12      (fault, ipa) = AArch64.S1Translate(fault, regime, ss, va, acctype, aligned, iswrite, ispriv);  ← Do the first stage of translation

14      if fault.statuscode != Fault_None then  ← Check for stage 1 translation fault
15          return CreateFaultyAddressDescriptor(va, fault);

18      if regime == Regime_EL10 && EL2Enabled() then
19          s1aarch64 = TRUE;
20          s2fs1walk = FALSE;
21          AddressDescriptor pa;
22          (fault, pa) = AArch64.S2Translate(fault, ipa, s1aarch64, ss, s2fs1walk, acctype, aligned, iswrite, ispriv);  ← Do the second stage of translation

24          if fault.statuscode != Fault_None then  ← Check for stage 2 translation fault
25              return CreateFaultyAddressDescriptor(va, fault);
26          else
27              return pa;
28      else
29          return ipa;
```

### 7.8.3 AArch64.S1Translate

```
 1  (FaultRecord, AddressDescriptor) AArch64.S1Translate(FaultRecord fault_in, Regime regime, SecurityState ss, bits(64) va, AccType acctype, boolean aligned_in, boolean iswrite_in, boolean ispriv)
 2      FaultRecord fault = fault_in;
 3      boolean aligned = aligned_in;
 4      boolean iswrite = iswrite_in;
 5      // Prepare fault fields in case a fault is detected
 6      fault.secondstage = FALSE;
 7      fault.s2fs1walk = FALSE;

 9      if !AArch64.S1Enabled(regime) then
10          return AArch64.S1DisabledOutput(fault, regime, ss, va, acctype, aligned);

12      walkparams = AArch64.GetS1TTWParams(regime, va);

14      if (AArch64.S1InvalidTxSZ(walkparams) ||

18          AArch64.VAIsOutOfRange(va, acctype, regime, walkparams)) then  ← Check VA is valid
19          fault.statuscode = Fault_Translation;
20          fault.level = 0;
21          return (fault, AddressDescriptor UNKNOWN);

23      AddressDescriptor descaddress;
24      TTWState walkstate;
25      bits(64) descriptor;
26      bits(64) new_desc;
27      bits(64) mem_desc;

29          (fault, descaddress, walkstate, descriptor) = AArch64.S1Walk(fault, walkparams, va, regime, ss, acctype, iswrite, ispriv);  ← Do the translation table walk

31          if fault.statuscode != Fault_None then  ← Check for S1 translation fault
32              return (fault, AddressDescriptor UNKNOWN);

49          elsif AArch64.S1HasPermissionsFault(regime, ss, walkstate, walkparams, ispriv, acctype, iswrite) then  ← Check for permission fault
50              fault.statuscode = Fault_Permission;

52          new_desc = descriptor;
```

```
 93      // Output Address
 94      oa = StageOA(va, walkparams.tgx, walkstate);  ← Compute IPA
 95      MemoryAttributes memattrs;

112          memattrs = walkstate.memattrs;
```

```
        memattrs.shareability = walkstate.memattrs.shareability;
    else
        memattrs.shareability = EffectiveShareability(memattrs);

    if acctype == AccType_ATOMICLS64 && memattrs.memtype == MemType_Normal then
        if memattrs.inner.attrs != MemAttr_NC || memattrs.outer.attrs != MemAttr_NC then
            fault.statuscode = Fault_Exclusive;
            return (fault, AddressDescriptor UNKNOWN);

    ipa = CreateAddressDescriptor(va, oa, memattrs);
    return (fault, ipa);  ← Return IPA and Memory Attributes
```

### 7.8.4 AArch64.S1Walk

```
 1  (FaultRecord, AddressDescriptor, TTWState, bits(64)) AArch64.S1Walk(FaultRecord fault_in, S1TTWParams walkparams, bits(64) va, Regime regime, SecurityState ss, AccType acctype, boolean iswrite_in, boolean ispriv)
 2      FaultRecord fault = fault_in;
 3      boolean iswrite = iswrite_in;

 9      walkstate = AArch64.S1InitialTTWState(walkparams, va, regime, ss);  ← read TTBR

11      // Detect Address Size Fault by TTB
12      if AArch64.OAOutOfRange(walkstate, walkparams.ps, walkparams.tgx, va) then
13          fault.statuscode = Fault_AddressSize;
14          fault.level = 0;
15          return (fault, AddressDescriptor UNKNOWN, TTWState UNKNOWN, bits(64) UNKNOWN);

17      bits(64) descriptor;
18      AddressDescriptor walkaddress;

20      walkaddress.vaddress = va;

25          walkaddress.memattrs = walkstate.memattrs;

35      DescriptorType desctype;
36      repeat  ← For each level in {0,1,2,3}
37          fault.level = walkstate.level;
38          FullAddress descaddress = AArch64.TTEntryAddress(walkstate.level, walkparams.tgx, walkparams.txsz, va, walkstate.baseaddress);  ← Get IPA of entry to read

40          walkaddress.paddress = descaddress;

42          if regime == Regime_EL10 && EL2Enabled() then
43              s1aarch64 = TRUE;
44              s2fs1walk = TRUE;
45              aligned = TRUE;
46              iswrite = FALSE;
47              (s2fault, s2walkaddress) = AArch64.S2Translate(fault, walkaddress, s1aarch64, ss, s2fs1walk, AccType_TTW, aligned, iswrite, ispriv);  ← Do S2 translation to get the PA of the entry

49              if s2fault.statuscode != Fault_None then  ← Check for S2 fault
50                  return (s2fault, AddressDescriptor UNKNOWN, TTWState UNKNOWN, bits(64) UNKNOWN);

52              (fault, descriptor) = FetchDescriptor(walkparams.ee, s2walkaddress, fault);  ← Read memory to get descriptor
53          else
54              (fault, descriptor) = FetchDescriptor(walkparams.ee, walkaddress, fault);

56          if fault.statuscode != Fault_None then  ← Check for external abort
57              return (fault, AddressDescriptor UNKNOWN, TTWState UNKNOWN, bits(64) UNKNOWN);
```

```
59          desctype = AArch64.DecodeDescriptorType(descriptor, walkparams.ds, walkparams.tgx, walkstate.level);

61          case desctype of
62              when DescriptorType_Table
63                  walkstate = AArch64.S1NextWalkStateTable(walkstate, regime, walkparams, descriptor);  ← Extract next level table address

65                  // Detect Address Size Fault by table descriptor
66                  if AArch64.OAOutOfRange(walkstate, walkparams.ps, walkparams.tgx, va) then
67                      fault.statuscode = Fault_AddressSize;
68                      return (fault, AddressDescriptor UNKNOWN, TTWState UNKNOWN, bits(64) UNKNOWN);

70              when DescriptorType_Page, DescriptorType_Block
71                  walkstate = AArch64.S1NextWalkStateLast(walkstate, regime, ss, walkparams, descriptor);  ← Extract page start address

73              when DescriptorType_Invalid
74                  fault.statuscode = Fault_Translation;
75                  return (fault, AddressDescriptor UNKNOWN, TTWState UNKNOWN, bits(64) UNKNOWN);  ← Return fault if invalid

77              otherwise
78                  Unreachable();

80      until desctype IN {DescriptorType_Page, DescriptorType_Block};

87      // Detect Address Size Fault by final output
88      elsif AArch64.OAOutOfRange(walkstate, walkparams.ps, walkparams.tgx, va) then
89          fault.statuscode = Fault_AddressSize;

95      return (fault, walkaddress, walkstate, descriptor);
```

### 7.8.5 AArch64.S2Translate

```
 1  (FaultRecord, AddressDescriptor) AArch64.S2Translate(FaultRecord fault_in, AddressDescriptor ipa, boolean s1aarch64, SecurityState ss, boolean s2fs1walk, AccType acctype, boolean aligned, boolean iswrite, boolean ispriv)
 2      walkparams = AArch64.GetS2TTWParams(ss, ipa.paddress.paspace, s1aarch64);
 3      FaultRecord fault = fault_in;

 5      // Prepare fault fields in case a fault is detected
 6      fault.statuscode = Fault_None; // Ignore any faults from stage 1
 7      fault.secondstage = TRUE;
 8      fault.s2fs1walk = s2fs1walk;
 9      fault.ipaddress = ipa.paddress;

11      if walkparams.vm != '1' then  ← Check if in a two-stage regime
12          // Stage 2 translation is disabled
13          return (fault, ipa);

15      if (AArch64.S2InvalidTxSZ(walkparams, s1aarch64) ||

18          AArch64.IPAIsOutOfRange(ipa.paddress.address, walkparams)) then
19          fault.statuscode = Fault_Translation;
20          fault.level = 0;
21          return (fault, AddressDescriptor UNKNOWN);

23      AddressDescriptor descaddress;
24      TTWState walkstate;
25      bits(64) descriptor;
26      bits(64) new_desc;
27      bits(64) mem_desc;

29          (fault, descaddress, walkstate, descriptor) = AArch64.S2Walk(fault, ipa, walkparams, ss, acctype, iswrite, s1aarch64);  ← Do translation table walk

31          if fault.statuscode != Fault_None then  ← Check for stage 2 translation fault
32              return (fault, AddressDescriptor UNKNOWN);

48          elsif AArch64.S2HasPermissionsFault(s2fs1walk, walkstate, ss, walkparams, ispriv, acctype, iswrite) then  ← Check for stage 2 permission fault
49              fault.statuscode = Fault_Permission;

51          new_desc = descriptor;
```

```
 78      ipa_64 = ZeroExtend(ipa.paddress.address, 64);

 80      // Output Address
 81      oa = StageOA(ipa_64, walkparams.tgx, walkstate);  ← Compute final PA
 82      MemoryAttributes s2_memattrs;

 93          s2_memattrs = walkstate.memattrs;

100      MemoryAttributes memattrs;
101          memattrs = S2CombineS1MemAttrs(ipa.memattrs, s2_memattrs);  ← Merge memory attributes

106      pa = CreateAddressDescriptor(ipa.vaddress, oa, memattrs);
```

### 7.8.6 AArch64.S2Walk

```
(FaultRecord, AddressDescriptor, TTWState, bits(64)) AArch64.S2Walk(
        FaultRecord fault_in, AddressDescriptor ipa, S2TTWParams walkparams,
        SecurityState ss, AccType acctype, boolean iswrite, boolean s1aarch64)

    FaultRecord fault = fault_in;
    ipa_64 = ZeroExtend(ipa.paddress.address, 64);

    TTWState walkstate;
        walkstate = AArch64.S2InitialTTWState(ss, walkparams);    // Read VTTBR

    // Detect Address Size Fault by TTB
    if AArch64.OAOutOfRange(walkstate, walkparams.ps, walkparams.tgx, ipa_64) then
        fault.statuscode = Fault_AddressSize;
        fault.level      = 0;
        return (fault, AddressDescriptor UNKNOWN, TTWState UNKNOWN, bits(64) UNKNOWN);

    bits(64) descriptor;
    AddressDescriptor walkaddress;

    walkaddress.vaddress = ipa.vaddress;
    ...
        walkaddress.memattrs = walkstate.memattrs;

    DescriptorType desctype;
    repeat    // For each level in {0,1,2,3}
        fault.level = walkstate.level;

        FullAddress descaddress;
        ...
            ipa_64 = ZeroExtend(ipa.paddress.address, 64);
            // Get PA of entry to read
            descaddress = AArch64.TTEntryAddress(walkstate.level, walkparams.tgx,
                walkparams.txsz, ipa_64, walkstate.baseaddress);

        walkaddress.paddress = descaddress;
        // Read descriptor from memory
        (fault, descriptor) = FetchDescriptor(walkparams.ee, walkaddress, fault);

        // Check for external abort
        if fault.statuscode != Fault_None then
            return (fault, AddressDescriptor UNKNOWN, TTWState UNKNOWN, bits(64) UNKNOWN);

        desctype = AArch64.DecodeDescriptorType(descriptor, walkparams.ds, walkparams.tgx,
            walkstate.level);

        case desctype of
            when DescriptorType_Table
                // Extract next level table address
                walkstate = AArch64.S2NextWalkStateTable(walkstate, walkparams, descriptor);

                // Detect Address Size Fault by table descriptor
                if AArch64.OAOutOfRange(walkstate, walkparams.ps, walkparams.tgx, ipa_64) then
                    fault.statuscode = Fault_AddressSize;
                    return (fault, AddressDescriptor UNKNOWN, TTWState UNKNOWN, bits(64) UNKNOWN);

            when DescriptorType_Page, DescriptorType_Block
                // Extract page start address
                walkstate = AArch64.S2NextWalkStateLast(walkstate, ss, walkparams, ipa,
                    descriptor);

            when DescriptorType_Invalid
                // Return fault if invalid
                fault.statuscode = Fault_Translation;
                return (fault, AddressDescriptor UNKNOWN, TTWState UNKNOWN, bits(64) UNKNOWN);

            otherwise
                Unreachable();

    until desctype IN {DescriptorType_Page, DescriptorType_Block};
    ...
        fault.statuscode = Fault_AddressSize;    // Check output address is within bounds
    ...
    return (fault, walkaddress, walkstate, descriptor);
```

### 7.8.7 AArch64.FetchDescriptor

```
(FaultRecord, bits(N)) FetchDescriptor(bit ee, AddressDescriptor walkaddress,
                                       FaultRecord fault_in)
    // 64-bit descriptors for AArch64
    bits(N) descriptor;

    FaultRecord fault = fault_in;
    AccessDescriptor walkacc;

    walkacc.acctype = AccType_TTW;
    ...
    PhysMemRetStatus memstatus;
    (memstatus, descriptor) = PhysMemRead(walkaddress, N DIV 8, walkacc);

    if IsFault(memstatus) then
        fault = HandleExternalTTWAbort(memstatus, fault.write, walkaddress, walkacc,
                                       N DIV 8, fault);
        if IsFault(fault.statuscode) then
            return (fault, bits(N) UNKNOWN);

    return (fault, descriptor);
```

## Relaxed virtual memory

This chapter is based, in part, on: Relaxed virtual memory in Armv8-A [54] by Ben Simner, Alasdair Armstrong, Jean Pichon-Pharabod, Christopher Pulte, Richard Grisenthwaite, and Peter Sewell. Published in the proceedings of the 31st European Symposium on Programming (ESOP, 2022).

Now we will introduce the main concurrency architecture design questions that arise for virtual memory in
Arm. As usual, the architecture defines the *envelope* of behaviours which hardware must guarantee and on which
software may rely. This envelope must be tight enough to give the guarantees software needs to function, but still
loose enough to admit the range of existing and conceivable microarchitectures whose optimization techniques
are necessary for performance.

This chapter therefore will discuss both the relevant microarchitecture as we understand it, and also the behaviours which it is believed software relies upon. The discussion will touch on points of several kinds: some which are clear in the current Arm prose documentation; some where Arm are in the process of architecting a change; some that are not documented but where the semantics is (perhaps, after discussion with Arm) clear or constrained by current hardware or software practice; and, some where their modelling raised questions for which the architecture is not yet well-defined, and Arm must make an architectural decision.

Ideally, we would be able to specify which points belong to which kind. It is, however, not so easy. There is no clean separation between aspects that are clearly defined in the architecture reference and those that are not; instead, the manual has a shallow covering of many of the behaviours described here. In other places, the reference may have been updated or changed over the course of the work, clarifying parts of the architecture, and while this may have happened concurrently with discussing those and other points with Arm, the reference text itself is solely the responsibility of Arm. In §8.8 we will return to this question, and more directly address the kinds of each point discussed.

**Chapter overview** The body of this chapter will explore a sequence of key behaviours, some of which the architecture guarantees and some of which it does not. Each contains a description of the behaviour, including whether software relies on it or known hardware guarantees it; a short discussion of the architectural intent as we understand it; and any associated litmus tests.

This chapter will discuss a variety of interesting behaviours. In an attempt to make this chapter more approachable, it is broken down into a logical progression: slowly building up from the simplest and most fundamental parts of the architecture to increasingly complex cases.

We will first discuss (in §8.2) how translation affects the prior 'data memory' **TODO: PS: user-mode?** tests covered in previous work. Then, we shall see how the caching of translation entries is limited (§??) and the fundamental behaviours of the translation table walk (§8.4). Building upon that, we will see that these translation table walks may be cached and re-used in later translations, which is explored in detail in §8.5. Then (in §8.6), we will explore how the various kinds of TLB maintenance interact with those cached translations, and other translation table walks. Finally, we touch on how all of the above fit together with system registers and other context changing and synchronising operations in §8.7.

### Chapter Contents

| Section | Title                                            |
|---------|--------------------------------------------------|
| §8.2    | Aliased data memory                              |
| §8.2.1  | Virtual coherence                                |
| §8.2.2  | Aliasing different locations                     |
| §8.2.3  | Might be same (physical) address                 |
| §8.3    | What can be cached in TLBs                       |
| §8.4    | Reads not from TLB                               |
| §8.4.1  | Out-of-order execution                           |
| §8.4.2  | Enforcing thread-local ordering                  |
| §8.4.3  | Enhanced Translation Synchronization             |
| §8.4.4  | Forwarding to the translation table walker       |
| §8.4.5  | Speculative execution                            |
| §8.4.6  | Single-copy atomicity                            |
| §8.4.7  | Multi-copy atomicity                             |
| §8.4.8  | Translation-table-walk intra-walk ordering       |
| §8.4.9  | Multiple translations within a single instruction |
| §8.5    | Caching of translations in TLBs                  |
| §8.5.1  | Cached translations                              |
| §8.5.2  | TLB fills                                        |
| §8.5.3  | μTLBs                                            |
| §8.5.4  | Partial caching of walks                         |
| §8.5.5  | Reachability                                     |
| §8.6    | TLB maintenance                                  |
| §8.6.1  | Recovering coherence                             |
| §8.6.2  | Thread-local ordering and TLBI                   |
| §8.6.3  | Broadcast                                        |
| §8.6.4  | Virtualization                                   |
| §8.6.5  | Break-before-make                                |
| §8.6.6  | ASIDs and VMIDs                                  |
| §8.6.7  | Access permissions                               |
| §8.7    | Context synchronisation                          |
| §8.7.1  | Relaxed system registers                         |
| §8.8    | Contributions                                    |

## 8.1 Virtual memory litmus tests


As previously discussed, one fundamental idea to come out of the field of relaxed memory is the concept of litmus tests. Virtual memory is no different, and exploring the architectural intent is best done through the creation, discussion and evaluation of small programs which are representative examples of common patterns.

However, as we explore more of the system semantics, more and more of the system state plays an integral role in the behaviours we see. For this reason we need a new language for describing the state of the system, in particular the translation table state, with features beyond those of the test language used by the previous litmus, rmem, herd, and diy tools [?, 28, ?, ?].

The litmus tests here are given in the isla-axiomatic test format. I describe the isla tool itself, and the extended test format syntax, in more detail in TODO: ?REF?.

**A virtual memory litmus test** To illustrate this isla test format, Figure 8.1 contains the test listing for a non-trivial virtual memory litmus test called CoW (or "Copy-on-Write").

This test is derived from the sequence of operations the Linux kernel takes when performing copy-on-write. Thread 0 tries to write to a location (call it x) that is currently read-only (line 1 in the thread code). When the fault is taken, the Linux exception handler begins executing (line 1 in the handler). Linux performs some checks that it is okay to copy and that it has not already done so (not part of the test), and then copies the physical page (lines 3 and 4 in the handler, although the test here only copies one value as demonstration), before flushing the data caches (line 5) so that later reads will be guaranteed to see the copied values. Then Linux needs to swap over the pagetable entry for x from a read-only view on the original page to a writeable mapping on the freshly copied page. It does this by first 'breaking' the entry, making it invalid (line 7), then performing the necessary TLB maintenance (line 9), before writing a new mapping to the new page (line 11). Now Linux can return from the handler (line 13) and re-try the store instruction, this time hopefully writing successfully to the new page.

The test format is split into 4 main parts:

- ▶ The initial state, comprised of:
  - the per-thread register state.
  - the global memory and pagetable state.
- ▶ The thread code and any exception-handler code.
- ▶ The interesting final state, as a predicate over the final register and memory state.
- ▶ And, optionally, whether the outcome is allowed or forbidden by the model.

```
AArch64 CoW
Initial State
 0:R0=0x2
 0:R1=x
 0:R3=y
 0:R4=z
 0:R5=z
 0:R6=0b0
 0:R7=pte3(x)
 0:R8=page(x)
 0:R9=mkdesc3(oa=pa2)
 0:R10=pte3(x)
 0:R20=0b0
 0:VBAR_EL1=0x1000
 0:PSTATE.EL=0b00
 virtual x y z;
 physical pa1 pa2;
 x ↦ pa1 with [AP = 0b11] and default;
 x ↦ invalid;
 x ↦ pa2 with [AP = 0b01] and default;
 y ↦ pa1;
 z ↦ pa2;
 identity 0x1000 with code;
 *pa1 = 1;
 *pa2 = 0;
Thread 0
 01. STR X0,[X1]
Thread 0 EL1 Handler
 01. 0x1400:
 02.     CBNZ X20,exit
 03.     LDR X2,[X3]
 04.     STR X2,[X4]
 05.     DC CIVAC,X5
 06.     DSB SY
 07.     STR X6,[X7]
 08.     DSB SY
 09.     TLBI VALE1IS,X8
 10.     DSB SY
 11.     STR X9,[X10]
 12.     MOV X20,#1
 13.     ERET
 14. exit:
 15.     MRS X21,ELR_EL1
 16.     ADD X21,X21,#4
 17.     MSR ELR_EL1,X21
 18.     ERET
Final State
 pa1=1 & pa2=2
```

**Figure 8.1:** Test CoW: code listing

**Initial state** The initial state has three virtual addresses (x, y and z), and two physical addresses (pa1 and pa2). Initial register values are written like 0:R4=z, meaning register R4 on Thread 0 initially contains the value z (in this case, a virtual address). Helper functions like pte3, page and mkdesc3 are used to get the address of the leaf entry, to get the page number, and to create a new valid descriptor with the given OA; a more detailed description of these functions is given below.

Behind the scenes, isla creates a full instantiation of the Arm translation tables, but with some holes for symbolic values where the test may modify the tables. There is a default translation table, where the code and the tables themselves are mapped by default and everything else is invalid.

The pagetable setup is then defined in a small DSL which defines a delta to that default table, specifying that certain pages should be mapped or unmapped initially, as well as being able to specify the set of locations and their initial memory values the test will need.

Figure 8.2: Test CoW: execution diagram

Fundamentally we categorise those locations as either virtual, intermediate, or physical. The line virtual x y z in the CoW test allocates three contiguous virtual pages, and labels their page-aligned addresses as x, y, and z. It then allocates two physical pages with addresses pa1 and pa2. Next, the setup defines the initial value of the translation tables, as well as specifying the set of potential translation table entries that may be in use by the test (for isla to create symbolic 'holes' for those). Namely, the initial state starts with x mapped to pa1 with the access permission bits set to 0b11 (read-only). The next two lines tell isla that during the test x may become unmapped (the descriptor may be invalid), or mapped to pa2 with AP=0b01. The test also defines two other variables, y and z, as aliases to the two physical pages, to help with copying the data between them, just as Linux would. Since there is an exception handler in this test, we need to ensure that the code page of the handler is mapped executable at EL1, which is what the identity 0x1000 with code line does (note that the handler section starts within the 0x1000 page). Finally, we say that the initial values of pa1 and pa2 are 1 and 0 respectively.

**Register translation helpers** The initial register state can reference parts of the initial state related to pagetables through the use of helper functions. Here are the helpers used by CoW, and most of the tests in this section. The full description of this format is given in **TODO:** ?REF?if more information is needed.

- ▷ pte<N>(va): The (intermediate) physical address of the level N entry in the default translation tables that maps va.
- ▷ page(va): The page number that va is in (equivalently: va ≫ 12).
- ▷ mkdesc<N>(oa=pa): A fresh 64-bit descriptor for a valid leaf entry at level N where the output address is given by the oa parameter.

Entries listed as f<N> mean a family of functions f1, f2, f3 and so on.
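
As a rough illustration of what these helpers compute, the following Python sketch mirrors `page` and `mkdesc3`, together with the per-level table index that a `pte<N>` lookup would use (the entry address being the table base plus eight times that index). This is not isla's implementation. It assumes a 4KiB granule with four levels (0 to 3) and 9 index bits per level; `table_index` and the `ap` parameter are hypothetical names; and the descriptor sets only the valid/page bits, the access flag, AP[2:1] and the output address, whereas isla's `default` attributes may set more.

```python
PAGE_SHIFT = 12   # assumption: 4KiB translation granule
INDEX_BITS = 9    # assumption: 512 entries per translation table


def page(va):
    """Page number that va is in, i.e. va >> 12."""
    return va >> PAGE_SHIFT


def table_index(va, level):
    """Index into the level-`level` table (0..3) selected by va (hypothetical helper)."""
    shift = PAGE_SHIFT + INDEX_BITS * (3 - level)
    return (va >> shift) & ((1 << INDEX_BITS) - 1)


def mkdesc3(oa, ap=0b01):
    """A minimal level 3 page descriptor with output address oa.

    Only a subset of the descriptor fields is modelled here.
    """
    assert oa % (1 << PAGE_SHIFT) == 0, "output address must be page-aligned"
    valid_page = 0b11        # bits[1:0]: valid, and a page descriptor at level 3
    access_flag = 1 << 10    # AF set, so the first access does not fault on it
    return oa | access_flag | (ap << 6) | valid_page
```

For example, under these assumptions `mkdesc3(0x5000)` sets bits 0, 1, 6 and 10 on top of the output address, and passing `ap=0b11` instead yields the read-only encoding used for the initial mapping of x in CoW.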

**Execution diagrams** Figure 8.2 is the isla-generated execution diagram for the CoW test. It illustrates a candidate execution which isla found (with any symbolic holes filled with concrete values) which matched the final state of the execution, and was consistent with the axioms of the model (given in Chapter 9).

The execution is rendered as a diagram, with separate traces for each thread, with multiple columns per thread, for translations and explicit events. In the diagram, there is one thread (Thread 0), and all events belong to its trace. There are two columns; the right-hand side are the explicit events rendered in program-order, and the left-hand side contains translation events alongside any explicit events from the same instruction. Not all events from the trace are displayed in the execution diagram; many uninteresting events, of register reads and writes, and translation reads of unchanged entries, are suppressed. The execution displayed here is one where the initial store's translation table walk (event a1) reads a valid entry from the initial state, but one which did not have permissions to do a write, and so generates a Fault event (a2). The execution continues, copying the memory over to a new page (events b-c), before updating the translation tables to point to the new page (d-h, see §8.6.5), before returning from the exception handler (j) and re-trying the store, which succeeds in writing to the new page (k2), giving a final state consistent with the expected final state from the test listing in Figure 8.1.

In general, while there could be multiple executions that correspond to the final state, the tests are usually written in a way to ensure that there is only one consistent candidate execution which corresponds to the final state. In cases where the test is forbidden by the model, we still have isla induce a concrete candidate, and render a diagram of the interesting forbidden execution.

# 8.2 Aliased data memory

Much of the previous work on relaxed memory has been concerned with what we shall call 'data memory': the weak behaviour of concurrent loads and stores to memory. For Arm, we shall see that these previous models were implicitly assuming that all locations in the test were virtual addresses, with well-formed, constant, and injective, address translation mappings, which mapped all locations as readable, writable, and executable, normal cacheable memory.

Consider a non-injective mapping. Such mappings give rise to *aliasing*: the situation where two distinct virtual addresses in the same address space map to the same output physical address. This section will explore how the behaviours of those data memory tests change in the presence of aliasing.

#### 8.2.1 Virtual coherence

For data memory accesses, one of the most fundamental guarantees that architectures provide is *coherence*: in any execution, for each memory location, there is a total order of the accesses to that location, consistent with the program order of each thread, with reads reading from the most recent write in that order. Hardware implementations provide this, despite their elaborate cache hierarchies and out-of-order pipelines, by a combination of coherent cache protocols and pipeline hazard checking, identifying and restarting instructions when possible coherence violations are detected.

For Arm, coherence is with respect to physical addresses [1, B2.3.1 (p157)] [1, D5.11.1 (p4931)]. This means that if two virtual addresses alias to the same physical address, then:

▷ a load from one virtual address cannot ignore a program-order previous store to the other, as seen in the following CoWR.alias test [Figure 8.3]:




This test is a variation on the standard CoWR test, where the VA is replaced with two distinct VAs, which both alias to the same PA.

The initial state is a configuration with two virtual addresses, x and y, which are both mapped to the physical address pa1, whose initial value is 0. The thread then stores 1 to x, then loads y. It is then forbidden for this load to read 0.

While the Armv8-A architecture reference manual describes data caches as being physically-indexed [1, D5.11.1 (p4931)], and so accesses via the same PA are 'fully coherent', further discussions with Arm clarify that this implies not just this coherence test, but that all of the data memory behaviours previously examined still apply when subjected to aliasing.

Figure 8.3: CoWR.alias test

- ▶ a load from one virtual address cannot ignore the write that a program-order previous load of the other address saw (CoRR0.alias+po [Figure 8.4], CoRR2.alias+po [Figure 8.5]).
- ▶ a load from one virtual address can have its value forwarded from a store to the other, and similarly on a speculative branch (MP.alias3+rfi-data+dmb [Figure 8.6], PPOCA.alias [Figure 8.6]).

| AArch64          | CoRR0.alias+po   |
|------------------|------------------|
| Initial State    |                  |
| 0:R0=0b1         | 1:R1=x           |
| 0:R1=x           | 1:R3=y           |
|                  | 1:PSTATE.SP=0b0  |
|                  | 1:PSTATE.EL=0b00 |
| physical pa1;    |                  |
| x ↦ pa1;         |                  |
| y ↦ pa1;         |                  |
| *pa1 = 0;        |                  |
| Thread 0         | Thread 1         |
| STR X0, [X1]     | LDR X0, [X1]     |
|                  | LDR X2, [X3]     |
| Final State      |                  |
| 1:X0=1 & 1:X2=0  |                  |
| Forbid           |                  |



This test is a variation of the data memory CoRR0 test, where one of the loads has been replaced with a load of a distinct virtual address which aliases to the same underlying physical address.

Note that, like the original test, it is forbidden to read from the initial state in the later load, as this would violate coherence: exactly what the earlier text from the manual explicitly forbade.

Figure 8.4: CoRR0.alias+po test



This test is a variation of the data memory CoRR2 test. Here there are many options for adding aliasing, so we choose the maximally aliased version where each individual store and load uses a distinct virtual address but where all those virtual addresses alias to the same physical one.

This gives us a classic coherence shape, where it is forbidden for different threads to observe writes to the same physical location in different orders.

Figure 8.5: CoRR2.alias+po test

| AArch64           | MP.alias3+rfi-data+dmb |
|-------------------|------------------------|
| Initial State     |                        |
| 0:R0=0x1          | 1:R1=y                 |
| 0:R1=x            | 1:R3=x                 |
| 0:R3=z            |                        |
| 0:R5=y            |                        |
| physical pa1 pa2; |                        |
| x ↦ pa1;          |                        |
| y ↦ pa2;          |                        |
| z ↦ pa1;          |                        |
| *pa1 = 0;         |                        |
| *pa2 = 0;         |                        |
| Thread 0          | Thread 1               |
| STR X0, [X1]      | LDR X0, [X1]           |
| LDR X2, [X3]      | DMB SY                 |
| STR X2, [X5]      | LDR X2, [X3]           |
| Final State       |                        |
| 1:X0=1 & 1:X2=0   |                        |
| Allow             |                        |

| AArch64                    | PPOCA.alias      |
|----------------------------|------------------|
| Initial State              |                  |
| 0:R0=0x1                   | 1:R1=y           |
| 0:R1=z                     | 1:R2=0x1         |
| 0:R2=0x1                   | 1:R3=x           |
| 0:R3=y                     | 1:R5=w           |
|                            | 1:R7=z           |
| physical pa1 pa2 pa3;      |                  |
| w ↦ pa1;                   |                  |
| x ↦ pa1;                   |                  |
| y ↦ pa2;                   |                  |
| z ↦ pa3;                   |                  |
| *pa1 = 0;                  |                  |
| *pa2 = 0;                  |                  |
| *pa3 = 0;                  |                  |
| Thread 0                   | Thread 1         |
| STR X0, [X1]               | LDR X0, [X1]     |
| DMB SY                     | CBNZ X0, L0      |
| STR X2, [X3]               | L0:              |
|                            | STR X2, [X3]     |
|                            | LDR X4, [X5]     |
|                            | EOR X8, X4, X4   |
|                            | LDR X6, [X7, X8] |
| Final State                |                  |
| 1:X0=1 & 1:X4=1 & 1:X6=0   |                  |
| Allow                      |                  |





These tests are variations of the standard PPOCA and MP+rfi-data+dmb tests, but with some aliasing. Both are examples of *forwarding*: a thread-local read of a write before that write has been propagated to memory. These two tests, determined to be allowed architecturally from our discussions with Arm, show that the processor can forward from a write even if the read was for a different virtual address so long as the physical addresses match, even down a speculative path.

**Figure 8.6:** PPOCA.alias and MP.alias3+rfi-data+dmb tests.

### 8.2.2 Aliasing different locations


In the previous section, we explored taking tests over a single location, and rewriting the test to use many locations, which all alias to the same address. One can also take a test that has multiple locations and make some of them alias to the same address.

Multi-location data memory tests, which are architecturally allowed, may become forbidden in the presence of aliasing. For example, taking the traditional MP+pos test, when the two locations are aliased to the same physical address then we get the forbidden MP.alias+pos test [Figure 8.7]. This new test is, essentially, equivalent to the old CoRR0 test: coherence with two writes and two reads to the same location; just using different aliases.

| AArch64         | MP.alias+pos |
|-----------------|--------------|
| Initial State   |              |
| 0:R0=0x1        | 1:R1=y       |
| 0:R1=x          | 1:R3=x       |
| 0:R2=0x1        |              |
| 0:R3=y          |              |
| physical pa1;   |              |
| x ↦ pa1;        |              |
| y ↦ pa1;        |              |
| *pa1 = 0;       |              |
| Thread 0        | Thread 1     |
| STR X0, [X1]    | LDR X0, [X1] |
| STR X2, [X3]    | LDR X2, [X3] |
| Final State     |              |
| 1:X0=1 & 1:X2=0 |              |
| Forbid          |              |



Because x and y alias to the same physical address pa1, the two loads (c and d) read the same location, and so cannot read different writes out-of-order.

Figure 8.7: Test MP.alias+pos
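
The coherence argument for this shape can be checked mechanically. The Python sketch below enumerates the reads-from and coherence-order choices for the MP shape and asks whether po-loc ∪ rf ∪ co ∪ fr is acyclic, with locations identified by physical address rather than virtual address. It checks per-location coherence only and is not the Armv8-A model of Chapter 9; the event names a to d, the function names, and the encoding of events are assumptions made for illustration.

```python
from itertools import permutations, product

# Events of the MP shape: (name, thread, kind, virtual address, value written).
EVENTS = [
    ("a", 0, "W", "x", 1),
    ("b", 0, "W", "y", 1),
    ("c", 1, "R", "y", None),
    ("d", 1, "R", "x", None),
]


def acyclic(nodes, edges):
    """Depth-first search for a cycle in the directed graph (nodes, edges)."""
    succ = {n: [] for n in nodes}
    for src, dst in edges:
        succ[src].append(dst)
    state = {}

    def visit(n):
        if state.get(n) == "active":
            return False          # back edge: a cycle exists
        if state.get(n) == "done":
            return True
        state[n] = "active"
        ok = all(visit(m) for m in succ[n])
        state[n] = "done"
        return ok

    return all(visit(n) for n in nodes)


def coherence_allows(mapping, wanted):
    """Is there a coherent execution in which each read r returns wanted[r]?

    mapping: virtual address -> physical address (the translation is fixed).
    """
    pa = {name: mapping[va] for name, _, _, va, _ in EVENTS}
    value = {name: val for name, _, kind, _, val in EVENTS if kind == "W"}
    inits = {p: "init:" + p for p in set(pa.values())}
    for p, name in inits.items():     # one initial write of 0 per physical location
        pa[name] = p
        value[name] = 0
    reads = [e[0] for e in EVENTS if e[2] == "R"]
    nodes = list(pa)
    # po-loc: same thread, program order, same *physical* location.
    po_loc = [(e1[0], e2[0]) for i, e1 in enumerate(EVENTS) for e2 in EVENTS[i + 1:]
              if e1[1] == e2[1] and pa[e1[0]] == pa[e2[0]]]
    # Each read may read from any same-location write with the wanted value.
    rf_choices = [[w for w in value if pa[w] == pa[r] and value[w] == wanted[r]]
                  for r in reads]
    # Coherence order: per location, the initial write first, then any permutation.
    co_choices = []
    for p, init in inits.items():
        ws = [w for w in value if pa[w] == p and w != init]
        co_choices.append([(init,) + perm for perm in permutations(ws)])
    for rf_pick in product(*rf_choices):
        rf = list(zip(rf_pick, reads))
        for orders in product(*co_choices):
            co = [(o[i], o[j]) for o in orders
                  for i in range(len(o)) for j in range(i + 1, len(o))]
            fr = [(r, w2) for w, r in rf for w1, w2 in co if w1 == w]
            if acyclic(nodes, po_loc + rf + co + fr):
                return True
    return False
```

With both x and y mapped to pa1, `coherence_allows({"x": "pa1", "y": "pa1"}, {"c": 1, "d": 0})` finds no coherent execution, matching the Forbid verdict above. With x and y mapped to distinct physical addresses the same query succeeds, since coherence alone places no constraint across different locations.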

### 8.2.3 Might be same (physical) address


There is a corner case that we now should consider. For load and store instructions, when the last register used in the calculation of the address is read, the address becomes known. This allows, in the flat model, for program-order later instructions to begin execution (or at least, know they will not be restarted) at that point.

With the introduction of address translation, however, this point happens much later, after the whole translation table walk is performed. Between the read of the register and the completion of the translation table walk, other instructions may perform some part of their functionality. This may include reading from a different virtual address, before the physical address of a program-order previous instruction is known, but after the virtual address is known.

One might expect that, when deciding whether to propagate a store, if the page offset of the virtual address is different to that of the in-flight program-order earlier instructions, then the write could go ahead early, knowing that the access could not be to the same physical address as any of those instructions. However, this is not the case. Although the accesses definitely will not access the same physical address, the program-order earlier access may still fault, meaning the write will not be reached. This means that writes must wait for program-order earlier translations to finish (or at least, be known to not fault) before they can be propagated to other threads.
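
The resulting condition can be stated as a small predicate. The Python sketch below is only an illustration of the rule in the paragraph above, not part of any model in this thesis: the `Walk` states, the `Access` record and `may_propagate` are hypothetical names, and it additionally assumes that a store's own translation must have completed before the store can propagate.

```python
from dataclasses import dataclass
from enum import Enum


class Walk(Enum):
    PENDING = "pending"   # translation table walk still in flight
    DONE = "done"         # walk finished without a fault
    FAULT = "fault"       # walk finished with a fault


@dataclass
class Access:
    kind: str    # "load" or "store"
    walk: Walk   # state of this access's translation


def may_propagate(accesses, index):
    """May the store at accesses[index] be propagated to other threads?

    accesses is one thread's in-flight accesses in program order. Differing
    page offsets are deliberately ignored: an earlier access might still
    fault, in which case the store is never reached.
    """
    store = accesses[index]
    if store.kind != "store":
        raise ValueError("only stores are propagated")
    earlier_ok = all(a.walk is Walk.DONE for a in accesses[:index])
    return store.walk is Walk.DONE and earlier_ok
```

For instance, a store whose own walk is complete still returns `False` while a program-order earlier load has a pending walk, whatever the two virtual addresses are.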

### 8.3 What can be cached in TLBs

As was described in §7.7, Arm hardware can have TLBs, caching previously seen translations. But there are some restrictions to this, both in what information a TLB must cache when it does so, and in what kind of information it is not permitted to cache at all.

#### 8.3.1 Microarchitectural TLBs

Here we must make a clear distinction between the actual microarchitectural translation caching one may encounter inspecting hardware, and the architectural model being discussed here.

While there are possibly many different ways to describe the same architectural intent, here we carefully choose one which will make building tooling, extending the model, discussions with architects, and explaining individual tests easier. We will first look at a specific example to pin down terminology and gain some intuition for hardware, before giving a model MMU and TLB that abstracts away from the details.

**Microarchitectural MMU - A53** Let us explore more closely how the actual hardware fill and walk works on a modern microprocessor. The Arm Cortex A53 is an Arm-designed application class processor. Previous relaxed memory work included exercising this core design extensively during litmus testing validation of the models, finding it to be relaxed, exhibiting many relaxed behaviours, but not aggressively so. This makes the A53 a good candidate as a demonstrator of an average relaxed processor design. While other processors by Arm are more aggressive in their optimisations, the MMU and TLB layout of the A53 seems typical: other cores, such as the A57 TODO: ?CITE?, A72 TODO: ?CITE?, A76 TODO: ?CITE?, A78 TODO: ?CITE? and A715 TODO: ?CITE? all have comparable, or simpler, TLB configurations.

The Arm A53 Technical Reference Manual (TRM) describes, in detail, the structure of the Memory Management Unit [57, 5-2] of the A53, and its constituent parts. Figure 8.8 shows a hand-written block diagram representing the key information from the TRM. 1926

We see that each core has its own MMU, and that each MMU contains a unit that will perform the translation table walk, in addition to a selection of translation caching structures:

- ▷ one instruction micro-TLB;
- ▷ one data micro-TLB;
- ▷ one unified TLB;
- ▷ one walk cache; and,
- ▷ one IPA cache.

The microarchitectural TLBs store whole translations: virtual to physical mappings, plus permissions and so on, tagged with their context. The TLBs are arranged hierarchically, with small, 10-entry 'micro' TLBs for the instruction and data streams separately, and one large 512-entry unified TLB.



Figure 8.8: A53 Memory Management Block Diagram.

On a TLB miss, the MMU performs a translation table walk using the walker, following the Arm translation table walk ASL code which we previously explored in §7.6.

When it begins this walk, the MMU first checks the walk cache for a matching entry. Walk cache entries are mappings from virtual address to the physical address of the last-level translation table. If an entry is present, the MMU can skip most of the walk entirely, performing just the very last read to read the leaf entry.

If a second stage of translation is required during the walk, the IPA cache is used (and may, or may not, be used many times during the same walk). The IPA cache stores mappings from intermediate physical to physical memory, with no associated virtual address, which can be used during both the final stage 2 walk and any intermediate stage 2 walks during a stage 1 walk.

TODO: PS: walk cache s1 only? BS: that is one of thibaut's questions to RG

The MMU is free to save the result of any translation table walk into these structures, including for walks due to speculation, prefetching, or architectural execution. This, essentially, allows the MMU to perform a walk for any arbitrary VA or IPA, at any point in time.
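The lookup order described above (whole-translation TLB first, then the walk cache, then a full walk) can be illustrated with a minimal sketch. This is not a description of the A53 itself: the four-level layout, the 9-bit indices, the single-stage walk, and every name in the code (`ToyMMU`, `walk_cache`, and so on) are assumptions made purely for illustration. The `reads` counter shows how a walk-cache hit reduces a walk to its final leaf read.

```python
from dataclasses import dataclass, field

LEVELS = 4  # assumption: a four-level, stage 1 only walk (levels 0..3)


@dataclass
class ToyMMU:
    """Toy MMU with a whole-translation TLB and a walk cache (illustrative only)."""
    memory: dict   # (table id, index) -> next table id, or output page at the leaf
    root: int      # id of the level-0 table
    tlb: dict = field(default_factory=dict)         # VA page -> PA page
    walk_cache: dict = field(default_factory=dict)  # VA prefix -> last-level table
    reads: int = 0                                  # memory reads performed by walks

    def _index(self, va_page, level):
        # Hypothetical layout: 9 bits of index per level, most significant first.
        return (va_page >> (9 * (LEVELS - 1 - level))) & 0x1FF

    def _read(self, table, index):
        self.reads += 1
        return self.memory[(table, index)]

    def translate(self, va_page):
        if va_page in self.tlb:            # TLB hit: no walk at all
            return self.tlb[va_page]
        prefix = va_page >> 9              # everything except the leaf index
        if prefix in self.walk_cache:      # walk-cache hit: skip to the leaf table
            table = self.walk_cache[prefix]
        else:                              # full walk through levels 0..2
            table = self.root
            for level in range(LEVELS - 1):
                table = self._read(table, self._index(va_page, level))
            self.walk_cache[prefix] = table
        pa_page = self._read(table, self._index(va_page, LEVELS - 1))
        self.tlb[va_page] = pa_page        # the MMU is free to cache the result
        return pa_page
```

With two neighbouring pages that share a last-level table, the first translation costs four reads, the second costs one (a walk-cache hit), and repeating the first costs none (a TLB hit).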

#### 8.3.2 Model MMU

To abstract away from any specific microarchitecture, we will model the MMU as if it were a separate asynchronous unit, one for each thread, each with an overapproximate 'TLB'.

Later, we will see tests that justify and ground this particular choice of abstraction, and we will explore this model and the mathematics which corresponds to it in more rigorous detail. But for now, we can imagine this model MMU as a set of (concurrently) executing translation table walks and a 'model TLB' cache of translation table entries.

**Model TLB entries** In general, the architecture permits hardware to cache whatever information from the translation process the hardware sees fit. This may include the output of whole translation table walks (complete virtual to physical mappings), individual translation table entries, or even the result of partial walks (the address of the last-level table, for example).

It would not be feasible to even attempt to enumerate all the possible shapes of TLBs and the kinds of information they can cache. Instead, we will define a *model* TLB. This model will treat the TLB as a cache of writes of translation table entries, each tagged with some context. This allows the model to cache any combination of entries read from a translation table walk, making it weak enough to allow all known TLB implementations, but strong enough to not break any of the guarantees Arm require of those TLB implementations. These guarantees are explored, in detail, in §8.4 and §8.5.

$$
\begin{aligned}
\mathsf{TranslationTableEntry} &\equiv \mathsf{u64} \\
\mathsf{Context} &\equiv \mathsf{ArchContext} \times \mathsf{Stage} \times \mathsf{option}\ \mathsf{VA} \times \mathsf{option}\ \mathsf{IPA} \times \mathsf{PA} \times \mathsf{Level} \\
\mathsf{ArchContext} &\equiv \mathsf{VMID} \times \mathsf{ASID} \times \mathsf{Regime} \\
\mathsf{CachedTranslationTableEntry} &\equiv \mathsf{PA} \times \mathsf{TranslationTableEntry} \times \mathsf{Context} \\
\mathsf{TLB} &\equiv \mathsf{set}\ \mathsf{CachedTranslationTableEntry}
\end{aligned}
$$

**Figure 8.9:** Model TLB type definitions.

Each entry in the model TLB contains the information about the write itself: the physical address of the entry, and the cached 64-bit entry. But it must also be tagged with some contextual information, some used during TLB lookup and some used to identify cached entries during TLB invalidation. Figure 8.9 gives a concise summary of the model TLB definition in some pseudo-type-definitions.

This contextual information includes:

- ▶ the architectural context information of the translation: the VMID, ASID (or a "global indicator"), and the translation regime;
- ▶ some *extended context* information, required for implementing TLB maintenance:
  - the virtual address, intermediate physical address, and/or physical address of the translation;
  - the translation stage and level at which the write was used;
  - the system register values used in the translation (those which can be cached); and,
  - for an entry used for a Stage 1 translation, whether it has been invalidated at both stages.

The model MMU then performs all translations by doing a full translation table walk, but may optionally satisfy any read during that walk from an entry in the model TLB which matches the architectural context and input address.

We imagine that any behaviour exhibited by a specific micro-architectural MMU and TLB configuration would also be explainable in this model.
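To make the shape of the model TLB concrete, the pseudo-type-definitions of Figure 8.9 can be transcribed into a small sketch. The field names, the use of `None` for the "global" indicator, and the matching rule in `tlb_candidates` are all assumptions made for illustration: in particular, the rule compares whole (assumed page-aligned) addresses and ignores the ASID at stage 2, which is a simplification rather than the model's actual definition.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ArchContext:
    vmid: int
    asid: Optional[int]   # None stands in for the "global" indicator (assumption)
    regime: str


@dataclass(frozen=True)
class Context:
    arch: ArchContext
    stage: int            # 1 or 2
    va: Optional[int]
    ipa: Optional[int]
    pa: int
    level: int


@dataclass(frozen=True)
class CachedTranslationTableEntry:
    pa: int               # physical address of the translation table entry
    entry: int            # the cached 64-bit translation table entry
    context: Context


TLB = FrozenSet[CachedTranslationTableEntry]


def tlb_candidates(tlb, arch, stage, level, input_address):
    """Entries that a walk read at this stage and level could be satisfied from."""
    def matches(e):
        c = e.context
        addr = c.va if stage == 1 else c.ipa
        # Simplification: the ASID only participates in stage 1 matching.
        same_asid = stage == 2 or c.arch.asid is None or c.arch.asid == arch.asid
        return (c.arch.vmid == arch.vmid and c.arch.regime == arch.regime
                and same_asid and c.stage == stage and c.level == level
                and addr == input_address)
    return {e for e in tlb if matches(e)}
```

A walk read is then free to use any element of `tlb_candidates(...)`, or to ignore the TLB and read memory instead; the choice is left open, which is what makes the model TLB an overapproximation.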

**TLB fills** Hardware has a variety of mechanisms which may lead to a translation table walk: direct architectural execution of instructions, pre-fetching of data or instructions, and speculation down branches. These translation table walks may result in TLB misses, and those misses then result in reads from memory and the MMU 'filling' the TLB with a copy of the information it can use in future.

Arm do not wish to enumerate all the possible speculation machinery or prefetchers, so instead opt for a model that is weaker: at any point in time, any thread's MMU can spontaneously perform a translation table walk for any virtual or intermediate-physical address for the current architectural context (VMID, ASID, etc., as in §8.3.2). Each read that the translation table walk performs can either read from an existing TLB entry, or perform a non-TLB read of memory and then potentially cache a copy of the write it reads from in the TLB, tagged with the extended context information from the walk. The behaviour of those non-TLB reads is explored more in §8.4.
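The choices open to a single walk read can be enumerated in a few lines. The sketch below is only an illustration under simplifying assumptions: memory is a plain dictionary with one visible value per location (so the relaxed behaviour of non-TLB reads is not modelled), the extended context is collapsed into an opaque `tag`, and no restriction on which entries may be cached is applied. The function name and representation are hypothetical.

```python
def walk_read_outcomes(tlb, memory, pa, tag):
    """Enumerate every (value, new_tlb) that a single walk read of `pa` may produce.

    tlb    : frozenset of (pa, value, tag) triples, standing in for the model TLB
    memory : dict mapping pa -> the value a non-TLB read would currently see
    tag    : hashable stand-in for the extended context of this walk
    """
    outcomes = []
    # Option 1: satisfy the read from any matching TLB entry; the TLB is unchanged.
    for (e_pa, value, e_tag) in tlb:
        if e_pa == pa and e_tag == tag:
            outcomes.append((value, tlb))
    # Option 2: perform a non-TLB read of memory ...
    value = memory[pa]
    outcomes.append((value, tlb))                       # ... without caching it,
    outcomes.append((value, tlb | {(pa, value, tag)}))  # ... or caching a copy.
    return outcomes
```

With one stale entry already cached, the read has three possible outcomes: the stale cached value, the fresh value with the TLB unchanged, or the fresh value with a new entry added alongside the old one.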

#### 8.3.3 Invalid entries

It is architecturally forbidden to cache information from attempted translations which result in translation faults, access flag faults, or address size faults. (Note that a translation table walk may give rise to other faults as well, as discussed in §7.3.2, such as permission faults and alignment faults, which do not impose restrictions on TLB caching.) More specifically, a TLB entry cannot be a write of a translation table entry which is the *direct* cause of such a fault. In particular, the TLB cannot cache translation table entries whose valid bit is not set.

This is important, as it gives software a mechanism by which it can safely update a mapping without potentially having multiple entries in the TLB for the same virtual address. These problems are described in more detail during the exploration of break-before-make in §8.6.5.
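The restriction can be phrased as a predicate over translation table entries. The following is a minimal sketch, assuming the VMSAv8-64 descriptor layout (valid bit at bit 0, access flag at bit 10 of leaf descriptors) and assuming that hardware management of the access flag is not in use; the address size check is abstracted into a boolean parameter, and the function name is hypothetical.

```python
VALID_BIT = 1 << 0     # descriptor bit 0: valid
ACCESS_FLAG = 1 << 10  # descriptor bit 10: access flag (AF), leaf descriptors only


def may_cache(descriptor, is_leaf, output_address_in_range=True):
    """Return True if a model TLB may hold this translation table entry."""
    if not descriptor & VALID_BIT:
        return False   # would be the direct cause of a translation fault
    if is_leaf and not descriptor & ACCESS_FLAG:
        return False   # would be the direct cause of an access flag fault
    if not output_address_in_range:
        return False   # would be the direct cause of an address size fault
    return True
```

Permission faults are deliberately absent from the predicate: an entry that denies access may still be cached.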

TODO: PS: no forward refs to tests?





Thread local re-ordering lets the translation (b1) of the load instruction happen earlier than the write to the translation table (a). This allows the load to trigger a data abort (a translation fault, b2).

Figure 8.10: Test CoWTf.inv+po

# 8.4 Reads not from TLB

The requirement that invalid entries are not cached in the TLB gives us a way to directly observe non-TLB reads: translation table reads which result in a translation fault *must* have come from a non-TLB read.

We will see that these reads have some important properties that software can rely on, but that some of those properties will depend on certain architecture features being enabled (namely FEAT\_ETS).

In this section we will explore the properties these reads have, and the guarantees software can rely on. We shall see that these reads are affected by thread-local re-ordering, even to a greater extent than data memory reads, and we shall see the synchronization that recovers the sequential semantics. We will see how these reads from the translation table walk relate to data memory reads, with respect to coherence, multi-copy atomicity, write forwarding and so on. Finally, we will see how the FEAT\_ETS architectural feature can change the synchronization that software needs to perform.

### 8.4.1 Out-of-order execution

First, let us consider whether reads that do not come from the TLB preserve the original program order.

**po-previous writes** One of the simplest questions one might ask is whether a translation-table-walk non-TLB read can ignore a program-order previous store.

This scenario is captured by the CoWTf.inv+po test [Figure 8.10]. The test starts with a VA x that is initially invalid at level 3, and so cannot have its level 3 entry cached in any TLB (directly or indirectly). The test then overwrites the invalid entry with a new valid entry pointing to the physical address pa1. Program-order later, the same thread attempts to read x.

We see that the thread can take a translation fault. This fault is caused by reading an invalid entry, which was read from a stale entry in memory, ignoring the program-order previous store to the translation table entry's location.

One explanation that suffices to allow this outcome is that the instructions can be locally re-ordered; the translation table walk of the later load instruction can happen much earlier than the program-order previous store, and satisfy its read from memory first.

**po-previous reads** Similarly, the reads of a translation table walk can be locally re-ordered with respect to program-order earlier loads of the translation table entry, as demonstrated in the CoRpteTf.inv+po test [Figure 8.11].

| AArch64                       | CoRpteTf.inv+po      |
|-------------------------------|----------------------|
| Initial State                 |                      |
| 0:R0=desc3(y)                 | 1:R1=pte3(x)         |
| 0:R1=pte3(x)                  | 1:R3=x               |
|                               | 1:VBAR_EL1=0x1000    |
|                               | 1:PSTATE.SP=0b0      |
|                               | 1:PSTATE.EL=0b00     |
| option default_tables = true; |                      |
| physical pa1;                 |                      |
| intermediate ipa1;            |                      |
| x \|-> invalid;               |                      |
| x ?-> pa1;                    |                      |
| y \|-> pa1;                   |                      |
| identity 0x1000 with code;    |                      |
| *pa1 = 1;                     |                      |
| Thread 0                      | Thread 1             |
| STR X0, [X1]                  | LDR X0, [X1]         |
|                               | LDR X2, [X3]         |
|                               | Thread 1 EL1 Handler |
|                               | 0x1400:              |
|                               | MOV X2,#0            |
|                               | MRS X13,ELR_EL1      |
|                               | ADD X13,X13,#4       |
|                               | MSR ELR_EL1,X13      |
|                               | ERET                 |
| Final State                   |                      |
| 1:X0=desc3(y) & 1:X2=0        |                      |
| Allow                         |                      |

The translation read (event c1) can be re-ordered with respect to the program-order previous load of l3pte(x) (b), even though the load read the new translation table entry, for the same location the translation reads from.



Figure 8.11: Test CoRpteTf.inv+po

| AArch64                    | LB.TT.inv+pos        |
|----------------------------|----------------------|
| Initial State              |                      |
| 0:R1=x                     | 1:R1=y               |
| 0:R2=mkdesc3(oa=pa1)       | 1:R2=mkdesc3(oa=pa1) |
| 0:R3=pte3(y)               | 1:R3=pte3(x)         |
| 0:VBAR_EL1=0x1000          | 1:VBAR_EL1=0x2000    |
| 0:PSTATE.SP=0b0            | 1:PSTATE.SP=0b0      |
| 0:PSTATE.EL=0b00           | 1:PSTATE.EL=0b00     |
| physical pa1;              |                      |
| x \|-> invalid;            |                      |
| y \|-> invalid;            |                      |
| x ?-> pa1;                 |                      |
| y ?-> pa1;                 |                      |
| *pa1 = 1;                  |                      |
| identity 0x1000 with code; |                      |
| identity 0x2000 with code; |                      |
| Thread 0                   | Thread 1             |
| MOV X0,#0                  | MOV X0,#0            |
| LDR X0, [X1]               | LDR X0, [X1]         |
| STR X2, [X3]               | STR X2, [X3]         |
| Thread 0 EL1 Handler       | Thread 1 EL1 Handler |
| 0x1400:                    | 0x2400:              |
| MRS X13,ELR_EL1            | MRS X13,ELR_EL1      |
| ADD X13,X13,#4             | ADD X13,X13,#4       |
| MSR ELR_EL1,X13            | MSR ELR_EL1,X13      |
| ERET                       | ERET                 |
| Final State                |                      |
| 0:X0=1 & 1:X0=1            |                      |
| Forbid                     |                      |

The writes to the translation tables (b and d) are forbidden from propagating to other threads before the program-order earlier translations (a1 and c1) are satisfied, forbidding them from reading from each other's writes.



Figure 8.12: Test LB.TT.inv+pos

**po-future writes** A translation table walk read may not, in general, be re-ordered with program-order later stores.

This is consistent with the description in §8.2.3, as the program-order later store might not architecturally happen if the translation table walk read were to fault. So, the later writes are speculative until the translation has finished, preventing the write from propagating until then.

This forbids both the general re-ordering of the propagation of the write to other threads (LB.TT.inv+pos [Figure 8.12]) with program-order earlier translation table walks, and, translations reading from program-order later writes (CoTW1.inv [Figure 8.13]).
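One way to see why LB.TT.inv+pos is forbidden is to draw the ordering edges of the candidate execution in which both loads succeed, and observe that they form a cycle. The sketch below does this with a small depth-first search. It assumes, as is usual in axiomatic-style models, that an execution is forbidden when its ordering relation is cyclic; the edge list is our reading of the test using the event names from Figure 8.12, and the helper `has_cycle` is hypothetical.

```python
def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph given as (src, dst) pairs."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    state = {}  # node -> "visiting" or "done"

    def visit(node):
        if state.get(node) == "visiting":
            return True            # reached a node already on the current path
        if state.get(node) == "done":
            return False
        state[node] = "visiting"
        found = any(visit(n) for n in graph[node])
        state[node] = "done"
        return found

    return any(visit(n) for n in graph)


# Candidate execution of LB.TT.inv+pos in which both loads succeed.
edges = [
    ("a1", "b"),   # Thread 0: write b may not propagate before translation read a1
    ("c1", "d"),   # Thread 1: write d may not propagate before translation read c1
    ("d", "a1"),   # a1 reads the descriptor written by d
    ("b", "c1"),   # c1 reads the descriptor written by b
]
```

Dropping the two reads-from edges leaves an acyclic graph, which corresponds to the allowed executions in which at least one translation reads the initial invalid entry.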

### 8.4.2 Enforcing thread-local ordering

Since non-TLB reads do not necessarily preserve the program order, it appears that there are no coherence guarantees one can make about them. However, by introducing some thread-local ordering constructs, we can recover some of the strong guarantees we are used to.

To force a non-TLB read to happen after some program-order earlier event we can insert the two-instruction sequence DSB SY ; ISB between them. The DSB ("Data Synchronization Barrier") waits for all loads to satisfy and for all stores to have finished and be visible to translation table walkers, before the ISB ("Instruction Synchronization Barrier") flushes the pipeline and restarts any program-order later instructions, including any translation table walks they perform.

The store to the translation table (b) cannot be re-ordered with the program-order earlier translation table walk (a), preventing that walk from reading from the store.

Figure 8.13: Test CoTW1.inv

**Locally-ordered-previous writes** If we introduce this sequence into the previous CoWTf.inv+po test we obtain the CoWTf.inv+dsb-isb test [Figure 8.14], which is forbidden by Arm. This is because the non-TLB reads, in the absence of non-coherent TLB caching structures (discussed more in §8.6.1), will read from the coherent storage subsystem, and so will be required to see the new write, or something coherence after it.
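The difference between CoWTf.inv+po and CoWTf.inv+dsb-isb can be illustrated by enumerating the thread-local orders of the two events involved. This is a deliberately crude sketch: it treats the thread in isolation, models memory as a single flag, and uses the hypothetical event names `W` (the store of the valid entry) and `T` (the non-TLB translation read) rather than the event labels of the figures.

```python
from itertools import permutations


def outcomes(events, ordered_before, initial_valid=False):
    """Enumerate the local orders of `events` permitted by `ordered_before`,
    and report what the translation read observes in each of them."""
    results = set()
    for order in permutations(events):
        pos = {e: i for i, e in enumerate(order)}
        if any(pos[x] > pos[y] for x, y in ordered_before):
            continue                   # violates a required local ordering
        valid = initial_valid
        for e in order:
            if e == "W":               # W: store of a valid entry to pte3(x)
                valid = True
            elif e == "T":             # T: non-TLB translation read of pte3(x)
                results.add("ok" if valid else "fault")
    return results


unordered = outcomes(["W", "T"], set())               # as in CoWTf.inv+po
with_dsb_isb = outcomes(["W", "T"], {("W", "T")})     # as in CoWTf.inv+dsb-isb
```

Without any ordering both outcomes are possible, including the translation fault; once `W` is ordered before `T` only the non-faulting outcome remains.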

**Locally-ordered-previous reads** If a program-order previous load has already seen some other-thread write, either through a translation (CoTTf.inv+dsb-isb [Figure 8.15]), or through a normal data load of the translation table (CoRpteTf.inv+dsb-isb [Figure 8.16]), then translation table non-TLB reads which are ordered after that read must also see that write, or a write coherence after it. These tests use the DSB; ISB sequence previously described, but any ordering to the translation table walk (described in §8.4.3) will suffice.

Microarchitecturally this is because translation table walkers are 'separate observers'. The idea is that the MMU performs reads of memory the same way any of the other observers (threads) do, meaning that those reads behave almost exactly like normal data memory reads.

This 'separate observers' principle is a reasonable model. However, we will see later, in §8.4.4, where it begins to break down.

**Instruction synchronization barrier and control dependencies** The ISB instruction naturally orders all translation table walks of program-order later instructions with the ISB itself. This is because the ISB effectively restarts all program-order later instructions, including any translations they do.

However, an ISB is not naturally ordered with respect to program-order *earlier* instructions. That is why in the previous tests we introduced a DSB. A control dependency into the ISB would also work (CoTTf.inv+ctrl-isb [Figure 8.17]).

**Address dependencies** In previous work, address dependencies were assumed fundamental, but now we can define what an address dependency is: a register dataflow dependency into the translation table walk reads.

Address dependencies remain a strong way to order events. Arm, here and in general, avoid speculation of the values and addresses of the explicit reads and writes to memory. This means that a translation table walk will not start until after its address dataflow dependent registers are fully determined. Note that this does not mean that pre-fetching and caching of the walk cannot happen; rather, the architectural translation table walk must retrieve any cached values after it is known what the address will be, see §TODO: ?REF?.

For non-TLB translation reads this means that a non-TLB read is locally ordered after any read whose value flows into the non-TLB read, as in CoRpteTf.inv+addr [Figure 8.18].

**Memory barriers** Much of the earlier work in relaxed-memory concurrency was dedicated to the behaviour of *barriers*. The Arm data memory barrier (DMB) creates ordering between memory events program-order earlier than the barrier and memory events program-order after the barrier.





The write to the translation table (a) is ordered before the non-TLB read of the entry (d1) because of the intervening DSB; ISB sequence, creating local order. This ordering ensures that the non-TLB read respects the coherence order up to the point of the write a, preventing the non-TLB read from reading from a write coherence-before a.

Figure 8.14: Test CoWTf.inv+dsb-isb

| AArch64                    | CoTTf.inv+dsb-isb    |
|----------------------------|----------------------|
| Initial State              |                      |
| 0:R0=desc3(y)              | 1:R1=x               |
| 0:R1=pte3(x)               | 1:R3=x               |
|                            | 1:VBAR_EL1=0x1000    |
|                            | 1:PSTATE.SP=0b0      |
|                            | 1:PSTATE.EL=0b00     |
| physical pa1;              |                      |
| x \|-> invalid;            |                      |
| x ?-> pa1;                 |                      |
| y \|-> pa1;                |                      |
| *pa1 = 1;                  |                      |
| identity 0x1000 with code; |                      |
| Thread 0                   | Thread 1             |
| STR X0, [X1]               | LDR X2, [X1]         |
|                            | MOV X0,X2            |
|                            | DSB SY               |
|                            | ISB                  |
|                            | LDR X2, [X3]         |
|                            | Thread 1 EL1 Handler |
|                            | 0x1400:              |
|                            | MOV X2,#0            |
|                            | MRS X13,ELR_EL1      |
|                            | ADD X13,X13,#4       |
|                            | MSR ELR_EL1,X13      |
|                            | ERET                 |
| Final State                |                      |
| 1:X0=1 & 1:X2=0            |                      |
| Forbid                     |                      |

The second translation-table non-TLB read of x (e1) is locally ordered after the first translation table walk (b1) because of the intervening dsb; isb sequence, and so cannot see a write coherence-before the write the earlier (b1) translation-read read from.



Figure 8.15: Test CoTTf.inv+dsb-isb

| AArch64                       | CoRpteTf.inv+dsb-isb |
|-------------------------------|----------------------|
| Initial State                 |                      |
| 0:R0=desc3(y)                 | 1:R1=pte3(x)         |
| 0:R1=pte3(x)                  | 1:R3=x               |
|                               | 1:VBAR_EL1=0x1000    |
|                               | 1:PSTATE.SP=0b0      |
|                               | 1:PSTATE.EL=0b00     |
| option default_tables = true; |                      |
| physical pa1;                 |                      |
| intermediate ipa1;            |                      |
| x \|-> invalid;               |                      |
| x ?-> pa1;                    |                      |
| y \|-> pa1;                   |                      |
| identity 0x1000 with code;    |                      |
| *pa1 = 1;                     |                      |
| Thread 0                      | Thread 1             |
| STR X0, [X1]                  | LDR X0, [X1]         |
|                               | DSB SY               |
|                               | ISB                  |
|                               | LDR X2, [X3]         |
|                               | Thread 1 EL1 Handler |
|                               | 0x1400:              |
|                               | MOV X2,#0            |
|                               | MRS X13,ELR_EL1      |
|                               | ADD X13,X13,#4       |
|                               | MSR ELR_EL1,X13      |
|                               | ERET                 |
| Final State                   |                      |
| 1:X0=desc3(y) & 1:X2=0        |                      |
| Forbid                        |                      |

The final translation table walk of x (e1) cannot be reordered with the program-order previous load of pte3(x) (b), because of the intervening DSB; ISB sequence. The non-TLB translation read of pte3(x) (e1) therefore must read from the same write as the earlier load, or something coherence-after it.



Figure 8.16: Test CoRpteTf.inv+dsb-isb

| AArch64                    | CoTTf.inv+ctrl-isb   |
|----------------------------|----------------------|
| Initial State              |                      |
| 0:R0=desc3(y)              | 1:R1=x               |
| 0:R1=pte3(x)               | 1:R3=x               |
|                            | 1:VBAR_EL1=0x1000    |
|                            | 1:PSTATE.SP=0b0      |
|                            | 1:PSTATE.EL=0b00     |
| physical pa1;              |                      |
| x \|-> invalid;            |                      |
| x ?-> pa1;                 |                      |
| y \|-> pa1;                |                      |
| *pa1 = 1;                  |                      |
| identity 0x1000 with code; |                      |
| Thread 0                   | Thread 1             |
| STR X0, [X1]               | MOV X0,#0            |
|                            | LDR X0, [X1]         |
|                            | EOR X4,X0,X0         |
|                            | CBNZ X4, LC00        |
|                            | LC00:                |
|                            | ISB                  |
|                            | MOV X2,#0            |
|                            | LDR X2, [X3]         |
|                            | Thread 1 EL1 Handler |
|                            | 0x1400:              |
|                            | MRS X13,ELR_EL1      |
|                            | ADD X13,X13,#4       |
|                            | MSR ELR_EL1,X13      |
|                            | ERET                 |
| Final State                |                      |
| 1:X0=1 & 1:X2=0            |                      |
| Forbid                     |                      |

Control-ISB locally-orders the later translation table walk (d1) after the resolution of the control flow, which happens only after the satisfaction of the read b2.



Figure 8.17: Test CoTTf.inv+ctrl-isb

| AArch64                       | CoRpteTf.inv+addr    |
|-------------------------------|----------------------|
| Initial State                 |                      |
| 0:R0=desc3(y)                 | 1:R1=pte3(x)         |
| 0:R1=pte3(x)                  | 1:R3=x               |
|                               | 1:VBAR_EL1=0x1000    |
|                               | 1:PSTATE.SP=0b0      |
|                               | 1:PSTATE.EL=0b00     |
| option default_tables = true; |                      |
| physical pa1;                 |                      |
| intermediate ipa1;            |                      |
| x \|-> invalid;               |                      |
| x ?-> pa1;                    |                      |
| y \|-> pa1;                   |                      |
| identity 0x1000 with code;    |                      |
| *pa1 = 1;                     |                      |
| Thread 0                      | Thread 1             |
| STR X0, [X1]                  | LDR X0, [X1]         |
|                               | EOR X4,X0,X0         |
|                               | LDR X2, [X3, X4]     |
|                               | Thread 1 EL1 Handler |
|                               | 0x1400:              |
|                               | MOV X2,#0            |
|                               | MRS X13,ELR_EL1      |
|                               | ADD X13,X13,#4       |
|                               | MSR ELR_EL1,X13      |
|                               | ERET                 |
| Final State                   |                      |
| 1:X0=desc3(y) & 1:X2=0        |                      |
| Forbid                        |                      |

The address dependency from the load b to the second load orders the reads due to the translation table walk of that load (c1) after b. Since c1 is a non-TLB read, it cannot read from a write coherence-before the write b read from.



Figure 8.18: Test CoRpteTf.inv+addr





The non-TLB read c1 is not locally ordered after the write a, despite the intervening dmb sy barrier (b).

Figure 8.19: Test CoWTf.inv+dmb

We will see that this applies to *explicit* memory events only: the principal reads and writes that load and store instructions perform, not the implicit reads and writes they do during translations (or instruction fetching, TODO: ref: ifetch chapter).

Ordering of the explicit memory events does not, automatically, induce ordering between those explicit events and any reads due to translation table walks performed by those instructions. In the next subsection, we will see how FEAT\_ETS (§8.4.3) extends the architecture to include more orderings between translations and other memory events in the same thread.

Figure 8.19 shows a simple coherence test, with a data memory barrier between a store to the translation tables and a load whose translation table walk might read from that. We can see that the barrier does not enforce that the translation table walk sees the update to the translation tables. From the previous tests, we know this means that the translation table walk happened (microarchitecturally) before the store was propagated to memory.

#### The Arm DMB vs DSB instructions

TODO: PS: discuss DMB v DSB

The architectural intent for DMB's ordering with respect to translation table walkers in the absence of FEAT\_ETS is still tentative, so we shall focus on the fragment with FEAT\_ETS **TODO**: ... and continue.

### 8.4.3 Enhanced Translation Synchronization

TODO: PS: litmus tests?

Recent versions of the Arm architecture require support for FEAT\_ETS: Enhanced Translation Synchronization.
This feature does not change the ISA, but instead, requires implementations to enforce extra ordering.

The Arm Architecture Reference Manual says the following [1, D5.2.5 (p4802)]:

If FEAT\_ETS is implemented, and a memory access RW1 is Ordered-before a second memory access RW2, then RW1 is also Ordered-before any translation table walk generated by RW2 that generates any of the following:

- ▷ A Translation fault.
- ▷ An Address size fault.
- ▷ An Access flag fault.

This prose description is a little ambiguous and, we feel, needs some clarification. The scenario being described here is a case with two instructions, $I_1$ and $I_2$, each either a load or a store. Imagine $I_1$ and $I_2$ both executing to completion, without generating any translation, address size, or access flag faults. Then, each instruction would have generated one or more explicit memory events. For example, a store might generate up to 8 separate write events (one for each byte). Call those events $E_{ij}$ for the $j$th explicit event of instruction $I_i$.

Each explicit event $E_{ij}$ would have required a translation table walk, generating translation read events which we can call $T_{ijk}$ for the $k$th translation-table-walk read for the $j$th explicit memory event for instruction $I_i$.

Then, if $I_2$ generates a translation, address size, or access flag fault, $E_{1n}$ would have been locally-ordered-before $E_{2m}$ in the imagined execution without the fault, and FEAT\_ETS is enabled, then $E_{1n}$ is locally ordered before every translation table read $T_{2mk}$ in the execution with the fault.

The intuition here is that, microarchitecturally, on implementations that support FEAT\_ETS, when an instruction takes an exception, the access that caused it is re-tried once the prefix of instructions is non-restartable. This reduces *spurious aborts*: faults that come from an out-of-order read of a value from memory that is by now stale.
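The rule above can be read as a function from the local ordering of the imagined fault-free execution to a set of extra ordering edges. The sketch below is one such reading, not the model's definition: events are represented as opaque strings, and the parameter names (`lob`, `translation_reads`, `faulting_events`) are assumptions introduced for illustration.

```python
def ets_edges(lob, translation_reads, faulting_events):
    """Extra local ordering edges contributed by FEAT_ETS (illustrative reading).

    lob               : set of (E1, E2) pairs, the local ordering between explicit
                        events in the imagined fault-free execution
    translation_reads : dict mapping an explicit event to its translation reads T
    faulting_events   : explicit events whose walk takes a translation, address
                        size, or access flag fault in the actual execution
    """
    return {
        (e1, t)
        for (e1, e2) in lob
        if e2 in faulting_events              # only walks that fault are affected
        for t in translation_reads.get(e2, ())
    }
```

For example, with `lob = {("E1_0", "E2_0")}`, two translation reads for `E2_0`, and `E2_0` faulting, the function returns an edge from `E1_0` to each of the two translation reads; if `E2_0` does not fault, it returns no edges.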

**Other effects** For ETS to have the desired effect of forbidding spurious aborts with standard local orderings such as barriers, it must implicitly enforce more than just the aforementioned ordering constraints.

Specifically, TLBI instructions must have stronger thread-local orderings to translation-table walks (described in more detail later); translation table walks must be (other) multi-copy atomic; and, translation table walk reads must be coherent and single-copy atomic.

**Non-ETS fragment** There is a question here as to whether we should consider the non-ETS behaviours of the architecture. On the one hand, hardware in use today is from a pre-ETS version of the architecture, and so we cannot assume that the behaviours of those devices are consistent with ETS. On the other hand, ETS is a feature that is widely assumed by software, even if not present on hardware.

Linux, for example, assumes implementations are ETS compatible even when they are not. Building models that capture the full extent of the non-ETS fragment would have questionable benefits, as one would have to assume an ETS model when verifying software anyway. Additionally, as ETS is becoming a mandatory feature, the concerns over non-ETS hardware will diminish over time; perhaps even by the publication of this thesis they will be questions of the past. Finally, the semantics of this non-ETS fragment is still unclear: there are numerous questions, especially around forwarding and multi-copy atomicity generally, which are grey areas that Arm have yet to explicitly decide one way or another.

For these reasons we will assume FEAT\_ETS is present and enabled, unless explicitly stated otherwise.

**Ordering to the translation table walk** We can now define which constructs give rise to local ordering into a translation table walk. Address dependencies, and locally-ordered context-synchronisation (in particular, the DSB; ISB sequence) always give rise to ordering to the translation table walks. Control dependencies, on their own, never give rise to such ordering. If using FEAT\_ETS, then a plain DSB orders translation table walks of program-order later instructions after it. TODO: BS: even if there's no fault? Other barriers may give ordering to the translation table walker, if using FEAT\_ETS and the translation results in a translation fault, and those barriers would have ordered the event that would have happened otherwise.

### 8.4.4 Forwarding to the translation table walker


Writes take time to propagate out to memory to other cores. One common performance optimization is *gathering*: collecting multiple writes together in a store buffer and propagating them all out together.

To maintain uniprocessor semantics, the core can read from its own store buffer, in effect, allowing it to read from writes before they've been propagated out to other cores. This behaviour is referred to as write forwarding.

Although the translation table walker is described as a 'separate' observer, it is also part of the core that hosts it, and is allowed to read from that core's store buffer, effectively allowing writes to be 'forwarded' to the walker, as shown in the R.TR.inv+dmb+trfi test [Figure 8.20].

The simplest model here is one where non-TLB translation reads behave as a normal data memory read, reading either from forwarding from the store buffer, or from the coherence-latest write in the storage subsystem.
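This simplest model can be sketched as a toy in Python. It is an illustration under assumptions, not the operational model of this thesis: the `Core` class, its method names, and the string-valued descriptors are all hypothetical, and the store buffer is a plain FIFO that drains all at once.

```python
class Core:
    """Toy core with a FIFO store buffer in front of shared storage."""

    def __init__(self, storage):
        self.storage = storage      # shared dict: address -> value
        self.store_buffer = []      # (address, value) pairs, oldest first

    def write(self, addr, value):
        # Writes are buffered locally before they propagate.
        self.store_buffer.append((addr, value))

    def propagate_all(self):
        # Gathered writes drain to shared storage in buffer order.
        for addr, value in self.store_buffer:
            self.storage[addr] = value
        self.store_buffer.clear()

    def read(self, addr):
        # The newest buffered write to addr wins (write forwarding);
        # otherwise read the coherence-latest value from storage.
        for buffered_addr, value in reversed(self.store_buffer):
            if buffered_addr == addr:
                return value
        return self.storage[addr]

    # In this simplest model a non-TLB translation read is just a data read.
    translation_read = read


storage = {"pte3(w)": "invalid"}
core0, core1 = Core(storage), Core(storage)
core1.write("pte3(w)", "desc(pa1)")
forwarded = core1.translation_read("pte3(w)")  # sees its own buffered write
remote = core0.translation_read("pte3(w)")     # write not yet propagated
```

Here `forwarded` is the new descriptor while `remote` is still the old invalid entry, which is the shape of behaviour the forwarding test relies on.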

| AArch64                                  |                      | R.TR.inv+dmb+trfi |
|------------------------------------------|----------------------|-------------------|
|                                          | Initial State        |                   |
| 0:R0=0x2                                 | 1:R0=mkdesc3(oa=pa1) | 2:R1=pte3(w)      |
| 0:R1=x                                   | 1:R1=pte3(w)         |                   |
| 0:R2=0x2                                 | 1:R3=w               |                   |
| 0:R3=pte3(w)                             | 1:VBAR_EL1=0x1000    |                   |
|                                          | 1:PSTATE.SP=0b0      |                   |
|                                          | 1:PSTATE.EL=0b00     |                   |
| physical pa1;                            |                      |                   |
| w  -> invalid;                           |                      |                   |
| $w \mapsto pa1;$                         |                      |                   |
| $w \mapsto raw(2);$                      |                      |                   |
| x  -> pa1;                               |                      |                   |
| *pa1 = 0;                                |                      |                   |
| identity 0x1000 with code;               |                      |                   |
| Thread 0                                 | Thread 1             | Thread 2          |
| STR X0, [X1]                             | STR X0, [X1]         | LDR X0, [X1]      |
| DMB SY                                   | MOV X2,#1            | LDR X2, [X1]      |
| STR X2, [X3]                             | LDR X2, [X3]         | LDR AZ, [A1]      |
|                                          | Thread 1 EL1 Handler |                   |
|                                          | 0x1400:              |                   |
|                                          | MRS X13,ELR_EL1      |                   |
|                                          | ADD X13,X13,#4       |                   |
|                                          | MSR ELR_EL1,X13      |                   |
|                                          | ERET                 |                   |
| Final State                              |                      |                   |
| 1:X2=0 & 2:X0=2 & 2:X2=mkdesc3(oa=pa1)   |                      |                   |
| Allow                                    |                      |                   |

The write of the new valid entry (d) can be forwarded locally to the translation of w (e1) allowing the read of w (e2) to satisfy early. TODO: PS: Thread2 needs explaining



Figure 8.20: Test R.TR.inv+dmb+trfi

| AArch64                    | MP.RTf.inv+dmb+ctrl  |
|----------------------------|----------------------|
| Initial State              |                      |
| 0:R0=desc3(z)              | 1:R1=y               |
| 0:R1=pte3(x)               | 1:R3=x               |
| 0:R2=0b1                   | 1:VBAR_EL1=0x1000    |
| 0:R3=y                     | 1:PSTATE.SP=0b0      |
|                            | 1:PSTATE.EL=0b00     |
| physical pa1 pa2;          |                      |
| x  -> invalid;             |                      |
| $x \mapsto pa1;$           |                      |
| z  -> pa1;                 |                      |
| *pa1 = 1;                  |                      |
| y  -> pa2;                 |                      |
| identity 0x1000 with code; |                      |
| Thread 0                   | Thread 1             |
| STR X0, [X1]               | LDR X0, [X1]         |
| DMB SY                     | CBNZ X0, L0          |
| STR X2, [X3]               | L0:                  |
|                            | LDR X2, [X3]         |
|                            | Thread 1 EL1 Handler |
|                            | 0x1400:              |
|                            | MOV X2,#0            |
|                            | MRS X13,ELR_EL1      |
|                            | ADD X13, X13, #4     |
|                            | MSR ELR_EL1,X13      |
|                            | ERET                 |
| Final State                |                      |
| 1:X0=1 & 1:X2=0            |                      |
| Allow                      |                      |

The non-TLB read in Thread 1 (e1) is not locally ordered after the earlier load (d), despite the control dependency. This is because the processor can speculatively perform the translation table walk, before the earlier read is satisfied.



Figure 8.21: Test MP.RTf.inv+dmb+ctrl

### 8.4.5 Speculative execution

To facilitate fast out-of-order pipelines, the machine has to begin fetching and executing the next instruction before the earlier instructions are finished. But those earlier instructions might control the flow of execution through the program. Executing later instructions before the earlier ones have finished means that those later instructions are being executed *speculatively*: they may, if the predicted flow turns out to be incorrect, need to be discarded, **TODO: PS: what about restarting on coherence violations?** to avoid the need for rollback across threads.

When executing down a speculative path like this, there are additional restrictions that the core must adhere to. For example, stores should not be propagated out to memory, although they can still be read from by program-order later reads in the same thread.

Since we know reads and writes can be performed speculatively, their associated translations must also be allowed to have been performed speculatively. This is what allows the MP.RTf.inv+dmb+ctrl test [Figure 8.21] to see an old value for the translation table entry, as the translation can be performed speculatively. TODO: PS: If this were a "user" test, I'd say that e1 was satisfied out-of-order w.r.t. d, not that e1 was "performed speculatively". Or I'd expect to see a test with control-flow speculation, or argument that the second instruction is speculative until the first is known not to fault. Are you not distinguishing between out-of-order and speculative execution any more? TODO: BS: but speculation implies OoO?

However, forwarding from a speculative write to the translation table walker is disallowed. Since reads to read-sensitive locations (such as devices) can have side-effects, software can protect those locations by marking them as device memory in the translation tables, or leaving them unmapped altogether. A speculative write could update the translation tables arbitrarily, including allowing reads to read-sensitive locations, so it must be forbidden for a translation read to read from a still-speculative write. The MP.RT.inv+dmb+ctrl-trfi test [Figure 8.22] demonstrates this, requiring that the translation table walk on the speculative path cannot read from the still-speculative store to the translation tables.

The non-TLB read of the translation table entry (f1) cannot read from a forwarded thread-local write (event e) when on a speculative path, requiring that f1 be ordered after d. TODO: PS: manual layout this

Figure 8.22: Test MP.RT.inv+dmb+ctrl-trfi
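The difference between forwarding to data reads and forwarding to the walker can be sketched as follows. This is a minimal illustration under assumptions: the `BufferedWrite` record, its `speculative` flag, and both function names are hypothetical, and a real pipeline tracks speculation in far more detail.

```python
from dataclasses import dataclass


@dataclass
class BufferedWrite:
    addr: str
    value: str
    speculative: bool  # True while the store is still on an unresolved path


def data_read(store_buffer, storage, addr):
    """Program-order-later data reads may forward from any buffered write."""
    for w in reversed(store_buffer):
        if w.addr == addr:
            return w.value
    return storage[addr]


def translation_read(store_buffer, storage, addr):
    """The walker may only forward from writes that are no longer speculative."""
    for w in reversed(store_buffer):
        if w.addr == addr and not w.speculative:
            return w.value
    return storage[addr]


storage = {"pte3(x)": "invalid"}
store_buffer = [BufferedWrite("pte3(x)", "desc(pa1)", speculative=True)]
seen_by_data_read = data_read(store_buffer, storage, "pte3(x)")
seen_by_walker = translation_read(store_buffer, storage, "pte3(x)")
```

While the store is speculative, the data read sees the new descriptor but the walker still sees the invalid entry; once the `speculative` flag is cleared, the walker may forward from it too.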

**Instruction restarts** A related but separate concept is that of instruction restarts. In the **TODO: PS: user-mode?** base memory model a read might be satisfied early, out-of-order with respect to program-order previous instructions, even before those instructions' access addresses are known. If such an earlier access turned out to be to the same address, and the later access is not a read of the same write, then the later access must be restarted to avoid coherence violations.

Translation table walk reads, while they are reads, do not do this hazard checking, and so are not required to be restarted to recover coherence. See §8.2 for more discussion on this. TODO: PS: 8.2 has a lot of stuff, point to specifics?

### 8.4.6 Single-copy atomicity


In the base memory model, there are two key guarantees on the *atomicity* of reads and writes: single-copy and multi-copy atomicity.

Recall that a single-copy atomic read always reads as much as it can from a single-copy atomic write; in particular, a 64-bit single-copy atomic read never partially reads from a 64-bit single-copy atomic write.

Translation table walk reads are 64-bit single-copy-atomic reads of memory. This means that each of the reads generated by a translation table walk will read the entire descriptor in one shot. This causes the CoWroW.inv+dsb-isb test [Figure 8.23] to be forbidden, disallowing reading the output address obtained from one write, and access permissions from another.
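To make the forbidden mix concrete, the sketch below compares the values a single-copy atomic read may return with those a hypothetical byte-wise (torn) read could return. The descriptor layout is a deliberate simplification assumed for this example (output address in bits [47:12], access permissions in bits [7:6], low two bits set for a valid entry), and all function names are ours.

```python
from itertools import product


def bytes_of(value):
    """Split a 64-bit value into its 8 little-endian bytes."""
    return list(value.to_bytes(8, "little"))


def single_copy_atomic_reads(old, new):
    """A single-copy atomic read sees one whole write or the other."""
    return {old, new}


def torn_reads(old, new):
    """A byte-wise (non-atomic) read may take each byte from either write."""
    choices = zip(bytes_of(old), bytes_of(new))
    return {int.from_bytes(bytes(pick), "little") for pick in product(*choices)}


def descriptor(oa, ap):
    """Simplified valid descriptor: OA in bits [47:12], AP in bits [7:6]."""
    return (oa & 0x0000_FFFF_FFFF_F000) | ((ap & 0b11) << 6) | 0b11


old = descriptor(oa=0x4000_0000, ap=0b00)
new = descriptor(oa=0x8000_0000, ap=0b11)
mixed = descriptor(oa=0x8000_0000, ap=0b00)  # new output address, old permissions
```

In this toy, `mixed` is a possible result of `torn_reads(old, new)` but not of `single_copy_atomic_reads(old, new)`, which is the kind of outcome single-copy atomicity of the walker's reads rules out.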

| AArch64                                 | CoWroW.inv+dsb-isb |
|-----------------------------------------|--------------------|
| Initial State                           |                    |
| 0:R0=mkdesc3(oa=pa1, AP=0b11)           |                    |
| 0:R1=pte3(x)                            |                    |
| 0:R2=0x1                                |                    |
| 0:R3=x                                  |                    |
| 0:VBAR_EL1=0x1000                       |                    |
| 0:PSTATE.SP=0b0                         |                    |
| physical pa1;                           |                    |
| x  -> invalid;                          |                    |
| x → pa1 with [AP = 0b11] and default;   |                    |
| *pa1 = 0;                               |                    |
| identity 0x1000 with code;              |                    |
| Thread 0                                |                    |
| STR X0, [X1]                            |                    |
| DSB SY                                  |                    |
| ISB                                     |                    |
| STR X2, [X3]                            |                    |
| Thread 0 EL1 Handler                    |                    |
| 0x1400:                                 |                    |
| MRS X20, ELR_EL1                        |                    |
| ADD X20, X20, #4                        |                    |
| MSR ELR_EL1, X20                        |                    |
| ERET                                    |                    |
| Final State                             |                    |
| pa1=1                                   |                    |
| Forbid                                  |                    |

The translation table walk of the second store must read from the entire write from the earlier store, or not at all, forbidding its translation walk from reading a mix of both the initial state and the earlier write. This means there should be no way the final store can happen, as the entry it reads must be either invalid or read-only.

Note that isla does not generate candidates with non-atomic reads which are supposed to be single-copy atomic, and so the diagram is hand-drawn. TODO: Draw it.

Figure 8.23: Test CoWroW.inv+dsb-isb

### 8.4.7 Multi-copy atomicity

Multi-copy atomicity is a guarantee that requires any update to memory to propagate to all other threads simultaneously. This is one of the core guarantees Armv8 and RISC-V give, but earlier versions of Arm and IBM's current Power architectures do not. This has a caveat for Armv8, which is described as *other*-multi-copy atomic: threads can observe their own writes early (through write forwarding).

Microarchitecturally, a thread can only read another thread's write by reading from a global coherent storage subsystem. This ensures that after the thread reads from that write, any other thread must also see that write, or something coherence after it. While this is a property that the base model seems to have, whether it is true for accesses during translation table walks is a separate question.

The non-TLB reads during a translation table walk, in fact, do seem to respect this property: if one other thread has observed a write through a translation table walk then future translation table walk non-TLB reads by other threads will also observe that write (or something newer). Axiomatically, if one thread translation-reads-from a write, then all translation-table-walk reads locally-ordered after another memory event, which is itself ordered after the other thread's translation-table-walk read, will be ordered after that translation-table-walk read.

There are three combinations of multi-thread reads of interest, where a weaker architecture (with separate pagetable and data memory storage) might have mixed non-multi-copy atomic behaviours. The first of these is the most basic: translation-read to translation-read. That is, the pagetable accesses are multi-copy atomic, and this is what forbids reading the old translation table value in Thread 2 in the WRC.TRTf.inv+po+dsb-isb test [Figure ??]. The other two are the combinations read-to-translation-read and translation-read-to-read. These show us that the translation accesses and explicit data accesses are architecturally unified: information about the memory state learned through one kind of access applies to accesses of the other. This is what forbids the WRC.RRTf.inv+dmb+dsb-isb [Figure ??] and WRC.TRR.inv+po+dsb [Figure ??] tests from reading the old value from memory at the end of Thread 2.

TODO: PS: these all need text captions WRC.TRTf.inv+po+dsb-isb WRC.RRTf.inv+dmb+dsb-isb TODO: PS: why DSB not just any R/R ordering. WRC.TRR.inv+po+dsb TODO: PS: why DSB not just any R/R ordering.


### 8.4.8 Translation-table-walk intra-walk ordering


All the tests so far have been concerned with changes to at most one of the translation table entries during a single walk. However, as we saw in §7, a translation table walk may perform many reads for a single translation.

The ASL for the translation table walker performs each translation read in order, starting with the root and ending with the leaf entry.

While reads in a thread can be re-ordered, translation-reads within a translation table walk cannot, as this would require the hardware to do value speculation on the next-level table address, and as discussed in §8.4.5 reading from speculative values in a translation table walk is generally forbidden.
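The dependency between successive reads of a walk can be seen in a minimal sketch. This is not the ASL: the function name, the flat `memory` dictionary, the 8-byte descriptors with a valid bit in bit 0, and the absence of block descriptors are all simplifying assumptions.

```python
def walk(memory, root, indices):
    """Read one descriptor per level; each read's address comes from the last."""
    table = root
    trace = []                        # addresses read, in walk order
    for level, index in enumerate(indices):
        addr = table + 8 * index      # 8-byte descriptors
        desc = memory.get(addr, 0)    # unmapped locations read as invalid
        trace.append(addr)
        if desc & 1 == 0:             # bit 0 clear: invalid, fault at this level
            return ("fault", level, trace)
        table = desc & ~0xFFF         # next table (or final output) address
    return ("ok", table, trace)


memory = {0x1008: 0x2003, 0x2010: 0x3003, 0x3018: 0x4003, 0x4020: 0x9003}
result = walk(memory, root=0x1000, indices=(1, 2, 3, 4))
```

The address of each read is computed from the value returned by the previous one, so satisfying a later read first would need a guess at that value.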

Requiring the translation reads from a translation table walk to be satisfied in translation walk order has an observable effect. For example, in the ROT.inv+dsb test [Figure ??], the translation table walk of the read in Thread 1 must see the writes to the translation table done by Thread 0 in the order they were propagated out to memory, and so reading from the old level 3 entry is forbidden.

ROT.inv+dsb The translation-table walk from the read of x in Thread 1 must perform its translation non-TLB reads in the order they appear in the walk, forbidding reading from the new level 2 table entry in d1, but then reading the stale initial value for that entry from memory.

The test listing contains some concrete values to make it executable in isla, namely fixing the location of the new table at 0x280000 so it's not symbolic, and the exact location of the level 3 entry within the new table will be at 0x283000 (known from the fixed isla configuration). Whether the exception comes from the level 2 or the level 3 entry can be determined by reading the ISS field of the ESR\_EL1 register, which the exception handler does.

### 8.4.9 Multiple translations within a single instruction

Some instructions generate multiple explicit memory events, such as for the load pair and store pair instructions, or misaligned accesses, or potentially some read-modify-writes. When there are multiple explicit memory events, there will be a dedicated translation for each of them, with its own translation table walk.

Here the architecture as it is written today is overly sequentialised: the ASL for these cases performs each translation (and the respective access) in some order, but the architectural intent is that the separate translations should be unordered with respect to each other.

Misaligned accesses, and the load and store pair instructions, should generate explicit memory events and associated translations which are unordered with respect to each other.

TODO: PS: litmus test with misaligned?





The translation read (b1) of the last-level entry for x can be re-ordered with respect to the program-order earlier store (a) to pte3(x).

Figure 8.28: Test CoWinvT+po

# 8.5 Caching of translations in TLBs

We have seen in §8.4 that, while non-TLB reads do not necessarily preserve the program-order without additional synchronisation due to the out-of-order execution of instructions, those translation table reads get satisfied from the coherent storage subsystem or from forwarding from earlier stores, much like the normal explicit data reads do. This section will explore what happens when translation table walk reads may instead be satisfied from the TLB.

Unfortunately for the programmer, the TLB need not be coherent with memory: it can have stale values. This section explores the behaviours that arise from this caching of stale values.

### 8.5.1 Cached translations


In the previous section we carefully constructed tests which began with an initially invalid translation, to avoid TLB caching issues. Here, we will generally start with entries that are valid, and so might be present in the TLB.

The following CoWinvT+po test [Figure 8.28] begins with an *initially valid* (and therefore potentially initially cached in the TLB) translation for the virtual address x. It then updates the last-level translation table entry for x, setting it to 0, making it invalid (and thus unmapping x). Then, program order later, the same thread tries to read x.

The read can succeed, as its translation can read from the old value from memory. We saw earlier that translation table walks can be re-ordered with respect to program order, but even inserting thread-local ordering to the translation, such as in test CoWinvT+dsb-isb [Figure 8.29], does not forbid it.





The translation read (d1) of the last-level entry for x is required to be satisfied after the earlier store (a) to the entry's location because of the intervening dsb sy; isb sequence, but can be satisfied from a cached value in the TLB, allowing d1 to read from a stale value.

Figure 8.29: Test CoWinvT+dsb-isb

### 8.5.2 TLB fills


Translation table walks can be requested by the core in two different ways: (1) through the architectural execution of an instruction; or, (2) from a spontaneous translation table walk (for example, due to speculation and prefetching of data or instructions). In either case, the result of that walk can be cached in the TLB and recalled for other translation table walks.

Architecturally a TLB fill is no different to a normal translation table walk; each fill originates from a non-TLB read, with all the behaviours described in the previous sections. Later translation table walks are allowed, however, to recall an earlier value and then reuse that rather than doing a fresh read.

**Spontaneous walks** The hardware may, at any time, try to prefetch or speculatively read some address. Architecturally these appear as spontaneous translation table walks. Those spontaneous walks may be cached. We can see this occurring in the following MP.RT.inv+poloc-dmb+ctrl-isb test [Figure 8.30], where a spontaneous translation and the resulting TLB fill allows a future translation table walk to see a stale value.
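A spontaneous fill followed by a stale hit can be sketched with a toy TLB that remembers every valid value it has ever read. The function names and the string-valued descriptors are assumptions made for illustration, and the rule that invalid entries are never cached is hard-coded.

```python
def fill(tlb, memory, addr):
    """A walk (architectural or spontaneous) caches what it read, if valid."""
    value = memory[addr]
    if value != "invalid":            # invalid entries are never cached
        tlb.setdefault(addr, set()).add(value)


def possible_translation_reads(tlb, memory, addr):
    """All values a translation read of addr may return in this toy model."""
    results = {memory[addr]}          # a fresh non-TLB read
    results |= tlb.get(addr, set())   # or any previously cached value
    return results


memory = {"pte3(x)": "invalid"}
tlb = {}
memory["pte3(x)"] = "desc(pa1)"       # first write: entry made valid
fill(tlb, memory, "pte3(x)")          # spontaneous walk caches it
memory["pte3(x)"] = "invalid"         # second write: entry invalidated again
later = possible_translation_reads(tlb, memory, "pte3(x)")
```

After the sequence above, `later` contains both the invalid entry from memory and the stale valid descriptor, even though no architectural walk ran while the valid entry was visible.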

**Speculative paths** Since translation table walks, and therefore TLB fills from the result of those walks, can happen at any point, there is no need to consider TLB fills of architectural translation table walks down speculative paths as any such behaviour is subsumed by a spontaneous fill.

However, as we saw earlier (§8.4.5), writes cannot be forwarded to translation table walks when down speculative paths, as this would lead to security violations. This naturally excludes TLB fills of still-speculative writes: since a speculative write cannot be used in the result of a translation table walk, it cannot end up cached in a TLB.

### 8.5.3 $\mu$TLBs

So far we have covered the idea of something either being in the TLB, or not. But hardware may have multiple micro-TLBs, each with their own potential cached value.

In effect, these micro-TLBs together behave like a larger non-deterministic TLB with potentially many values.

The presence of these smaller caching structures in a superscalar machine means that different instructions may be accessing different TLBs at the same time. This allows later instructions to 'skip' over a previously seen cached entry, and then see it again later.

This is most obvious in the CoTfT+dsb-isb test [Figure 8.31], where the presence of these micro-TLBs (or other distributed caching structures) allows later events (even locally-ordered later ones) to see old cached entries after earlier events witnessed a TLB miss.
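The 'miss, then stale hit' pattern can be illustrated by treating the micro-TLBs as a collection of independent caches, any of which a given translation read may consult. The names and values below are assumptions for this sketch only.

```python
from itertools import product


def read_options(micro_tlbs, memory, addr):
    """Values one translation read may see: memory, or any micro-TLB hit."""
    options = {memory[addr]}
    for utlb in micro_tlbs:
        if addr in utlb:
            options.add(utlb[addr])
    return options


micro_tlbs = [{"pte3(x)": "valid(pa1)"}, {}]  # one holds a stale entry, one is empty
memory = {"pte3(x)": "invalid"}               # the invalidating write has propagated

# Two locally-ordered translation reads choose their source independently.
outcomes = set(product(read_options(micro_tlbs, memory, "pte3(x)"), repeat=2))
```

The pair `("invalid", "valid(pa1)")` is among the outcomes: the earlier read misses and sees the new invalid entry, while the later read still hits the stale cached one.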

| AArch64                    | MP.RT.inv+poloc-dmb+ctrl-isb |
|----------------------------|------------------------------|
| Initial State              |                              |
| 0:R0=mkdesc3(oa=pa1)       | 1:R1=y                       |
| 0:R1=pte3(x)               | 1:R3=x                       |
| 0:R2=0b0                   | 1:VBAR_EL1=0x1000            |
| 0:R3=pte3(x)               | 1:PSTATE.SP=0b0              |
| 0:R4=0b1                   | 1:PSTATE.EL=0b00             |
| 0:R5=y                     |                              |
| physical pa1 pa2;          |                              |
| x  -> invalid;             |                              |
| $x \mapsto pa1;$           |                              |
| y  -> pa2;                 |                              |
| *pa1 = 0;                  |                              |
| *pa2 = 0;                  |                              |
| identity 0x1000 with code; |                              |
| Thread 0                   | Thread 1                     |
| STR X0, [X1]               | LDR X0, [X1]                 |
| STR X2, [X3]               | CBNZ X0,L0                   |
| DMB SY                     | L0:                          |
| STR X4, [X5]               | ISB                          |
|                            | MOV X2,#1                    |
|                            | LDR X2, [X3]                 |
|                            | Thread 1 EL1 Handler         |
|                            | 0x1400:                      |
|                            | MRS X13,ELR_EL1              |
|                            | ADD X13,X13,#4               |
|                            | MSR ELR_EL1,X13              |
|                            | ERET                         |
| Final State                |                              |
| 1:X0=1 & 1:X2=0            |                              |
| Allow                      |                              |

A spontaneous walk and fill can happen on Thread 1 after the write of the valid entry to pte3(x) (a), but before the immediate re-invalidation of that entry (b), allowing the later translation table walk to see the old cached entry (g1), even though the architectural translation table walk could not have happened while the valid entry was visible.



Figure 8.30: Test MP.RT.inv+poloc-dmb+ctrl-isb

| AArch64                    | CoTfT+dsb-isb        |
|----------------------------|----------------------|
| Initial State              |                      |
| 0:R0=0b0                   | 1:R1=x               |
| 0:R1=pte3(x)               | 1:R3=x               |
|                            | 1:VBAR_EL1=0x1000    |
|                            | 1:PSTATE.SP=0b0      |
|                            | 1:PSTATE.EL=0b00     |
| physical pa1;              |                      |
| x  -> pa1;                 |                      |
| $x \mapsto invalid;$       |                      |
| y  -> pa1;                 |                      |
| *pa1 = 0;                  |                      |
| identity 0x1000 with code; |                      |
| Thread 0                   | Thread 1             |
| STR X0, [X1]               | LDR X2, [X1]         |
|                            | MOV X0,X2            |
|                            | DSB SY               |
|                            | ISB                  |
|                            | LDR X2, [X3]         |
|                            | Thread 1 EL1 Handler |
|                            | 0x1400:              |
|                            | MOV X2,#1            |
|                            | MRS X13,ELR_EL1      |
|                            | ADD X13,X13,#4       |
|                            | MSR ELR_EL1, X13     |
|                            | ERET                 |
| Final State                |                      |
| 1:X0=1 & 1:X2=0            |                      |
| Allow                      |                      |



The earlier translation read (b1) reads from the new invalid entry, reading from memory (as it cannot have been in the TLB), but a later translation read (f1) of the same location can still potentially see a stale cached entry.

Figure 8.31: Test CoTfT+dsb-isb

| AArch64                           | MP.RTT.inv3+dmb-dmb+dsb-isb |
|-----------------------------------|-----------------------------|
| Initial State                     |                             |
| 0:R0=0b0                          | 1:R1=y                      |
| 0:R1=pte2(x)                      | 1:R3=x                      |
| 0:R2=mkdesc3(oa=pa1)              | 1:VBAR_EL1=0x1000           |
| 0:R3=pte3(x)                      | 1:PSTATE.SP=0b0             |
| 0:R4=0b1                          | 1:PSTATE.EL=0b00            |
| 0:R5=y                            |                             |
| virtual x y;                      |                             |
| physical pa1 pa2;                 |                             |
| assert x[4821] = y[4821];!        |                             |
| x  -> invalid;                    |                             |
| $x \mapsto pa1;$                  |                             |
| $x \mapsto invalid at level 2;$   |                             |
| y  -> pa2;                        |                             |
| *pa1 = 0;                         |                             |
| *pa2 = 0;                         |                             |
| identity 0x1000 with code;        |                             |
| Thread 0                          | Thread 1                    |
| STR X0, [X1]                      | LDR X0, [X1]                |
| DMB SY                            | DSB SY                      |
| STR X2, [X3]                      | ISB                         |
| DMB SY                            | MOV X2,#1                   |
| STR X4, [X5]                      | LDR X2, [X3]                |
|                                   | Thread 1 EL1 Handler        |
|                                   | 0x1400:                     |
|                                   | MRS X13,ELR_EL1             |
|                                   | ADD X13,X13,#4              |
|                                   | MSR ELR_EL1,X13             |
|                                   | ERET                        |
| Final State                       |                             |
| 1:X0=1 & 1:X2=0                   |                             |
| Allow                             |                             |

The translation-read of the level 2 entry for x (i1) can read from stale writes from a translation that the subsequent level 3 translation-read (i2) does not read from, as the level 2 entry could have been cached in the 'TLB' (in this case, a co-located 'walk cache' structure), while the level 3 entry gets read from memory.



Figure 8.32: Test MP.RTT.inv3+dmb-dmb+dsb-isb

### 8.5.4 Partial caching of walks


TLBs need not cache entire virtual to physical translations. Instead, they are free to cache any subset of the reads from the walk separately.

**Caching of last-level table** The most common kind of caching structure we see in microarchitecture is the walk cache (see §8.3.1). These structures allow a translation table walk to read a stale value for all the translation reads up to the last level, but then do a separate access for the last-level read. This can be seen in the MP.RTT.inv3+dmb-dmb+dsb-isb test [Figure 8.32], where a walk cache could allow the table entry to be cached separately from the last-level entry, allowing the last translation read to read from a much newer value.
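A walk cache of this kind can be sketched as a two-read walk in which only the table entry may come from the cache. This is a toy under assumptions: the function name and addresses are hypothetical, `None` stands for an invalid entry, and the sketch always prefers a cached table entry where real hardware may choose either source.

```python
def walk_with_walk_cache(walk_cache, memory, table_entry_addr, leaf_index):
    """Toy two-read walk: the table entry may come from the walk cache,
    but the last-level entry is always read from memory."""
    table_desc = walk_cache.get(table_entry_addr, memory[table_entry_addr])
    if table_desc is None:                  # None models an invalid entry
        return "fault at table level"
    leaf_desc = memory[table_desc + 8 * leaf_index]
    return "fault at last level" if leaf_desc is None else leaf_desc


# Memory after the writer has invalidated the table entry and then
# written a fresh, valid last-level entry.
memory = {0x2000: None, 0x3028: 0x9000}
walk_cache = {0x2000: 0x3000}               # stale, still-valid table entry

with_cache = walk_with_walk_cache(walk_cache, memory, 0x2000, leaf_index=5)
without_cache = walk_with_walk_cache({}, memory, 0x2000, leaf_index=5)
```

With the stale table entry cached, the walk succeeds using the new last-level entry; without it, the same walk faults at the table level.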

**Caching of whole translation** A common configuration for the TLB is to cache whole translation walks, from virtual to physical. This kind of caching has an important caveat: there is no requirement for the TLB to remember the intermediate physical address of any stage 2 translations that were done during the walk. This includes the final stage 2 walk of the access address itself.

**Independent caching of IPAs** In a two-stage regime, virtual addresses are first translated into intermediate physical addresses. The secondary translations based on the intermediate physical addresses, either of the final output address or of any of the intermediate table addresses, may be cached in the TLB without remembering the originating virtual address. This means these cached translations may be recalled for translations of different virtual addresses. Pre-fetching may perform translations of arbitrary IPAs, and so these cached translations might not correspond to any valid whole translation table walk, but may still be used during such walks. This is most clear in ROT.invs1+dmb2 [Figure 8.33], where, although the IPA was never reachable from the stage 1 translations, the old IPA to PA mapping was cached and used later.

**Caching of individual entries** Architecturally, Arm wish to allow many more implementations of TLBs and translation caching structures than currently known hardware contains. The weakest variation on this is allowing each individual translation table entry to be cached separately and independently.

One could construct litmus tests for each of the possible combinations of translation table entries, or even a 'most relaxed' version where every translation table entry comes from a different previous translation. But these tests would be overwhelmingly large, so for simplicity I give just one of them, ROT.inv2+dmb [Figure 8.34], where the last-level entry came from a newer value than the previous levels.
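The independent caching of individual entries can be illustrated with a small executable sketch. This is a toy model and not the model developed in this thesis: the names `walk_outcomes`, `memory` and `cache` are hypothetical, tables are identified by strings, and the sketch assumes that whatever is in `cache` was legitimately cached earlier (it does not check reachability, see §8.5.5). Each read of the walk may be satisfied either from a possibly stale cached copy or from memory, and the function enumerates every result such a walk could produce.

```python
INVALID = "invalid"


def walk_outcomes(memory, cache, root, indices):
    """Enumerate the results of a table walk in which every entry read may
    be satisfied either from `cache` (a possibly stale copy) or `memory`.

    `memory` and `cache` map (table, index) locations to a value: the name
    of a next-level table, a physical address, or INVALID.
    """
    results = set()
    frontier = {root}
    for depth, index in enumerate(indices):
        last = depth == len(indices) - 1
        next_frontier = set()
        for table in frontier:
            location = (table, index)
            for store in (cache, memory):
                if location not in store:
                    continue
                value = store[location]
                if value == INVALID:
                    results.add("fault")       # translation fault
                elif last:
                    results.add(value)         # output address of the walk
                else:
                    next_frontier.add(value)   # descend into the next table
        frontier = next_frontier
    return results


# Hypothetical two-level example: memory now holds an invalid last-level
# entry, but an older valid copy of that entry is still cached.
memory = {("root", 0): "new_table", ("new_table", 0): INVALID}
cache = {("new_table", 0): "pa1"}
outcomes = walk_outcomes(memory, cache, "root", [0, 0])
```

With the stale cached entry, the walk can either fault or produce `pa1`; with an empty cache only the fault remains.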

### 8.5.5 Reachability

One key property that the TLB must have is that it can only cache translation table entries which are *reachable*. That is, it can only cache an entry which is the result of a valid translation table walk, either using values from memory or other valid translation table entries from the TLB. In effect, the TLB cannot synthesize translation table walks that are not valid.

This means that writes coherence-before the most recent write at the time a translation table entry location becomes reachable are not visible to the walker, and cannot have been cached in any TLB. Importantly, it is not allowed for the TLB to contain entries for a translation table entry's location from a time when that location was not a valid translation table entry location.

Notably, writes from memory before that memory was reachable by a translation table walk of any VA should not be visible once the location becomes reachable. This is captured in the RUE+isb [Figure 8.35] ("Read-unreachable-entry") test, which is forbidden as the write to the translation table from before the time the location becomes reachable by translation table walkers cannot have been cached in any TLBs, or read from by any spontaneous walks.

This area is currently under discussion with Arm.
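As a rough illustration of what reachable means here, the following sketch computes the set of entry locations that some walk starting from the current table bases could read. It is only a toy under stated assumptions: `reachable_locations` and the `(table, index)` encoding are hypothetical, and a value counts as a pointer to a next-level table exactly when it names a table that appears in `memory`.

```python
def reachable_locations(memory, roots):
    """Return every (table, index) location that a walk from one of
    `roots` could read, following table pointers found in `memory`."""
    tables = {table for (table, _index) in memory}
    reachable, seen, todo = set(), set(), list(roots)
    while todo:
        table = todo.pop()
        if table in seen:
            continue
        seen.add(table)
        for (owner, index), value in memory.items():
            if owner != table:
                continue
            # Every entry of a reachable table is a reachable location,
            # whether or not the entry itself is currently valid.
            reachable.add((owner, index))
            if value in tables:
                todo.append(value)
    return reachable


# Hypothetical state in the shape of RUE+isb: new_table exists in memory
# but is not yet installed as a translation table base.
memory = {("old_root", 0): "pa0", ("new_table", 0): "invalid"}
before_switch = reachable_locations(memory, ["old_root"])
after_switch = reachable_locations(memory, ["new_table"])
```

In this toy reading, a location outside `before_switch` could not have been cached before the switch of translation table base, which is the intuition behind forbidding RUE+isb.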

# 8.6 TLB maintenance

Recovering coherence for translation reads in the presence of TLB caching can be achieved through the use of the TLB *maintenance* instruction: TLBI.

TLB maintenance generally causes two microarchitectural effects: to erase stale entries from the TLB, ensuring future TLB fills (for example, due to a translation read) will see the coherent value from memory; and to discard any partially executed instructions, on other cores, which had already begun execution using a stale entry but had not yet finished executing. We will explore both of these effects and the subtle interaction with other parts of the virtual memory systems architecture in more detail throughout this section.

# 8.6.1 Recovering coherence

By inserting the correct TLBI in the previous CoWinvT+dsb-isb test [Figure 8.29], we can produce a new test, CoWinvT.EL1+dsb-tlbi-dsb-isb [Figure 8.36], which is forbidden.

There are many flavours of TLBI that could have been inserted into this test; the one in the figure is TLBI VAE1, that is, TLB invalidation by virtual address for the EL1&0 translation regime. Using a TLBI-by-VA means the programmer

| AArch64                    | ROT.invs1+dmb2       |
|----------------------------|----------------------|
| Initial State              |                      |
| 0:R0=mkdesc3(oa=pa1)       | 1:R1=x               |
| 0:R1=pte3(x, s2_table)     | 1:VBAR_EL1=0x1000    |
| 0:R2=0b0                   | 1:VBAR_EL2=0x2000    |
| 0:R3=pte3(x, s2_table)     |                      |
| 0:R4=mkdesc3(oa=ipa1)      |                      |
| 0:R5=pte3(x)               |                      |
| 0:PSTATE.EL=0b01           |                      |
| physical pa1;              |                      |
| intermediate ipa1;         |                      |
| x \|-> invalid at level 2; |                      |
| x ↦ ipa1;                  |                      |
| ipa1 \|-> pa1;             |                      |
| ipa1 ↦ invalid;            |                      |
| *pa1 = 1;                  |                      |
| identity 0x1000 with code; |                      |
| identity 0x2000 with code; |                      |
| Thread 0                   | Thread 1             |
| STR X0, [X1]               |                      |
| DMB SY                     | MOV X0,#0            |
| STR X2, [X3]               | LDR X0, [X1]         |
| DMB SY                     |                      |
| STR X4, [X5]               |                      |
|                            | Thread 1 EL1 Handler |
|                            | 0x1400:              |
|                            | MRS X13,ELR_EL1      |
|                            | ADD X13,X13,#4       |
|                            | MSR ELR_EL1,X13      |
|                            | Thread 1 EL2 Handler |
|                            | 0x2400:              |
|                            | MRS X13,ELR_EL2      |
|                            | ADD X13,X13,#4       |
|                            | MSR ELR_EL2,X13      |
| Final State                |                      |
| Allow                      |                      |

The translation read of the stage 2 leaf entry for x (f2) can read from an old cached version, from a write (a) that was not reachable by any translation table walk for any VA, but only from an orphan IPA.



Figure 8.33: Test ROT.invs1+dmb2

| AArch64                         | ROT.inv2+dmb         |
|---------------------------------|----------------------|
| Initial State                   |                      |
| 0:R0=0b0                        | 1:R1=x               |
| 0:R1=pte3(x, new_table)         | 1:VBAR_EL1=0x1000    |
| 0:R2=mkdesc2(table=0x283000)    |                      |
| 0:R3=pte2(x)                    |                      |
| 0:PSTATE.EL=0b01                |                      |
| physical pa1;                   |                      |
| intermediate ipa1;              |                      |
| assert pa1 == ipa1;             |                      |
| ipa1 \|-> pa1;                  |                      |
| x \|-> invalid at level 2;      |                      |
| x ↦ table(0x283000) at level 2; |                      |
| s1table new_table 0x280000 {    |                      |
| x \|-> ipa1;                    |                      |
| x ↦ invalid;                    |                      |
| };                              |                      |
| identity 0x1000 with code;      |                      |
| Thread 0                        | Thread 1             |
| STR X0, [X1]                    | MOV X0,#1            |
| DMB SY                          | LDR X0, [X1]         |
| STR X2, [X3]                    |                      |
|                                 | Thread 1 EL1 Handler |
|                                 | 0x1400:              |
|                                 | MRS X13,ELR_EL1      |
|                                 | ADD X13,X13,#4       |
|                                 | MSR ELR_EL1,X13      |
|                                 | ERET                 |
| Final State                     |                      |
| 1:X0=0                          |                      |
| Allow                           |                      |

The translation-read of the level 3 entry (d2) can read from a stale cached translation, which was cached before the write to the level 2 entry (c). Note that this test assumes that the original new\_table was reachable (and therefore could be cached) before the write c. See §8.5.5 for a discussion on this.



Figure 8.34: Test ROT.inv2+dmb

```
AArch64 RUE+isb
Initial State
0:R0=mkdesc3(oa=pa1)
0:R1=0x0
0:R2=pte3(x, new_table)
0:R3=ttbr(asid=0x01, base=new_table)
0:R4=x
0:VBAR_EL1=0x1000
0:PSTATE.EL=0b01
0:PSTATE.SP=0b1
intermediate ipa1;
physical pa1;
*pa1 = 0;
s1table new_table 0x2C0000 {
    identity 0x1000 with code;
    x |-> invalid;
    x ↦ pa1;
};
identity 0x1000 with code;
Thread 0
STR X0, [X2]
STR X1, [X2]
MSR TTBR0_EL1, X3
ISB
MOV X1, #1
LDR X1, [X4]
Thread 0 EL1 Handler
0x1200:
MRS X20, ELR_EL1
ADD X20, X20, #4
MSR ELR_EL1, X20
ERET
Final State
0:X1=0
```



The write to the new\_table translation table entry for x (a) is not visible at the point of the change of TTBR (c), and so the later translation table walk (e1) cannot read from it.

Note that isla currently does not do any kind of reachability analysis, and so does not forbid this test.

Figure 8.35: Test RUE+isb





The read of the translation table entry for x (f1) is required to happen after the earlier store (a), because of the intervening dsb sy; isb sequence (d and e), and cannot be satisfied from the TLB, because of the TLBI (c), forbidding it from still seeing a stale value. Note that TLBI instructions can only be executed from EL1, so this test starts execution at EL1 rather than the usual default of EL0.

Figure 8.36: Test CoWinvT.EL1+dsb-tlbi-dsb-isb





The TLBI (b) can be re-ordered with program-order earlier events, due to the lack of DSBs ordering it after them, allowing the store (a) to happen later, letting the final translation read (e1) still see the old stale translation.

Figure 8.37: Test CoWinvT.EL1+tlbi-dsb-isb

has to provide the virtual page to invalidate, and the TLBI only affects addresses for that specific invalidated entry, not all of them.

Using the incorrect TLBI leads to insufficient invalidation occurring. For example, if in the aforementioned CoWinvT.EL1+dsb-tlbi-dsb-isb the TLBI had the wrong page, then it would have no effect and the test would remain allowed.
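The effect of a by-VA invalidation with the right or the wrong page can be sketched with a toy TLB. This is purely illustrative: `ToyTLB` and its methods are hypothetical names, the model assumes 4KiB pages and a single address space, and it ignores all ordering questions discussed in the rest of this section.

```python
PAGE_SHIFT = 12  # assumption: 4KiB pages


class ToyTLB:
    """Toy TLB caching one descriptor per virtual page."""

    def __init__(self):
        self.entries = {}  # virtual page number -> cached descriptor

    def fill(self, va, descriptor):
        """Cache the descriptor used to translate `va`."""
        self.entries[va >> PAGE_SHIFT] = descriptor

    def lookup(self, va):
        """Return the cached descriptor for `va`, or None on a miss."""
        return self.entries.get(va >> PAGE_SHIFT)

    def invalidate_va(self, va):
        """Modelled on a TLBI by VA: drop only the entry for va's page."""
        self.entries.pop(va >> PAGE_SHIFT, None)


tlb = ToyTLB()
tlb.fill(0x4000, "old descriptor")
tlb.invalidate_va(0x5000)            # wrong page: nothing is removed
still_cached = tlb.lookup(0x4000)
tlb.invalidate_va(0x4ABC)            # any address within the right page
after_correct_tlbi = tlb.lookup(0x4000)
```

After the invalidation of the wrong page the stale descriptor is still returned; only the invalidation naming the right page forces the next lookup to miss.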

#### FEAT\_nTLBPA

Armv8.4-A introduced a new optional Arm feature, FEAT\_nTLBPA [1, A2.2.1 (p79)].

This feature adds a field to the memory model feature register (ID\_AA64MMFR1\_EL1) which identifies whether the current processor's TLB (and related microarchitectural caching structures) may contain non-coherent copies of stage 1 entries indexed by those entries' intermediate physical address. Microarchitecturally, this corresponds to there being non-coherent caches associated with the TLB, which must be flushed on a TLBI.

These caches would allow TLB misses to read from a non-coherent cache, thus not seeing the most up-to-date value from the coherent storage subsystem as described in §8.4.

Note that the text in the reference manual is a little ambiguous: the entry in A2.2.1 describes it as a "mechanism to identify if [TLB caching] does not include non-coherent caches [of old translation entries] since the last completed TLBI". This change adds a field to the register, whose reserved value in Armv8.0 corresponds to the non-coherent caches existing. This implies that implementation of the feature is not only the existence of the runtime identification register's field, but additionally that its value is 0b0001 (that is, that non-coherent caches do not exist). This further implies that in processors without FEAT\_nTLBPA one should assume that TLBs may contain non-coherent caching structures.
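Under the reading above, software would interpret the identification field roughly as in the following sketch. The function names are hypothetical, and the bit position of the field (bits [51:48] of the memory model feature register) is my assumption and should be checked against the Arm ARM; only the interpretation of the values 0b0000 and 0b0001 comes from the discussion above.

```python
NTLBPA_SHIFT = 48  # assumption: field occupies bits [51:48]
NTLBPA_MASK = 0xF  # ID register fields are 4 bits wide


def ntlbpa_field(mmfr1_value: int) -> int:
    """Extract the nTLBPA field from a raw 64-bit register value."""
    return (mmfr1_value >> NTLBPA_SHIFT) & NTLBPA_MASK


def may_have_noncoherent_tlb_caches(mmfr1_value: int) -> bool:
    """A zero field (the pre-feature reserved value) means non-coherent
    caches of old translation entries may exist; 0b0001 means they do not."""
    return ntlbpa_field(mmfr1_value) == 0b0000


without_feature = may_have_noncoherent_tlb_caches(0x0)
with_feature = may_have_noncoherent_tlb_caches(0b0001 << NTLBPA_SHIFT)
```

Treating the zero value conservatively matches the conclusion that processors without the feature must be assumed to have such caches.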

### 8.6.2 Thread-local ordering and TLBI

TLB maintenance instructions are not naturally locally ordered with respect to other instructions in the instruction stream; this means that they can be re-ordered with other instructions. To ensure they are synchronized with other instructions, the programmer can use the DSB barrier instruction to order instructions before and after it.

Leaving out one, or both, of the DSBs around the TLBI leads to insufficient ordering around the TLBI and allows the invalidation to occur at the wrong time. For example, the CoWinvT.EL1+tlbi-dsb-isb test [Figure 8.37] is allowed as the initial write and TLBI may be re-ordered, negating the architectural effect of the TLBI.

TODO: talk about FEAT\_ETS





The broadcast TLBI on Thread 0 (c) ensures that the earlier unmapping (a) is seen by the ordered later translation read on Thread 1 (i1), by ensuring Thread 1's local TLB is cleaned of any stale entries for x.

Figure 8.38: Test MP.RT.EL1+dsb-tlbiis-dsb+dsb-isb

### 8.6.3 Broadcast

Arm provide broadcast variants of the TLBI instructions. These are generally suffixed with the letters IS (for "Inner-shareable").

Broadcast TLBIs, sometimes referred to as TLB *shootdowns*, allow one processor to perform maintenance on another core's TLB.

This is in contrast to other systems, such as IBM's Power architecture, where maintenance of other cores must be achieved in software through the use of only thread-local invalidation instructions.

**TLB invalidation on another core** One of the simplest examples is a message passing invalidation pattern, where the old entry is removed and a message is sent to another core. This can be seen in the MP.RT.EL1+dsb-tlbiis-dsb+dsb-isb test [Figure 8.38].

**Instruction restarts** Broadcast TLBIs must do more than touch the other thread's TLB. If the other processor had already performed translation, using the old stale value, but has not yet finished execution, then that instruction must be restarted.

This ensures that Arm broadcast TLBIs have the same behaviour as the traditional software IPI-based shootdown (with context synchronization), but also provides a needed security guarantee.

If a mapping is taken away from a process, then future writes to the physical location it used to map to should not be visible to that process anymore.

This guarantee is captured in the RBS+dsb-tlbiis-dsb [Figure 8.39] (**Read-broken-secret**) test. Once a mapping has been *broken*, and sufficient TLB maintenance performed, any future reads or writes to the original physical location will not be visible through that mapping anymore. Note, however, that this does not mean that instructions which have already completed their execution will be restarted, even if they occur after an earlier restarted instruction.

This can be seen in the RBS+dsb-tlbiis-dsb+poloc test [Figure 8.40], where the program-order later load can see the old value, even after the first faults.

While here I describe things in terms of instruction restarting, these behaviours can be (and presumably are) implemented in terms of waiting: instead of the TLBI forcibly restarting instructions that already started but haven't finished, the TLBI can simply wait for them to complete. This phrasing of waiting for completion is how this process is described in the Arm ARM [1, D5.10.2 (p4928)].

| AArch64                    | RBS+dsb-tlbiis-dsb   |
|----------------------------|----------------------|
| Initial State              |                      |
| 0:R0=0b0                   | 1:R1=x               |
| 0:R1=pte3(x)               | 1:VBAR_EL1=0x1000    |
| 0:R5=page(x)               |                      |
| 0:R2=0x2                   |                      |
| 0:R3=y                     |                      |
| 0:PSTATE.EL=0b01           |                      |
| physical pa1;              |                      |
| x \|-> pa1;                |                      |
| x ↦ invalid;               |                      |
| y \|-> pa1;                |                      |
| *pa1 = 0;                  |                      |
| identity 0x1000 with code; |                      |
| Thread 0                   | Thread 1             |
| STR X0, [X1]               |                      |
| DSB SY                     |                      |
| TLBI VAE1IS, X5            | LDR X0, [X1]         |
| DSB SY                     |                      |
| STR X2, [X3]               |                      |
|                            | Thread 1 EL1 Handler |
|                            | 0x1400:              |
|                            | MOV X0,#1            |
|                            | MRS X13,ELR_EL1      |
|                            | ADD X13,X13,#4       |
|                            | MSR ELR_EL1,X13      |
|                            | ERET                 |
| Final State                |                      |
| 1:X0=2                     |                      |
| Forbid                     |                      |



The broadcast TLBI of x (c) ensures that the execution of the load of x in Thread 1 either entirely executes using the old translation and finishes before the TLBI does, or begins execution after the TLBI finishes.

Figure 8.39: Test RBS+dsb-tlbiis-dsb

| AArch64                    | RBS+dsb-tlbiis-dsb+poloc |
|----------------------------|--------------------------|
| Initial State              |                          |
| 0:R0=0b0                   | 1:R1=x                   |
| 0:R1=pte3(x)               | 1:R3=x                   |
| 0:R5=page(x)               | 1:VBAR_EL1=0x1000        |
| 0:R2=0x2                   |                          |
| 0:R3=y                     |                          |
| 0:PSTATE.EL=0b01           |                          |
| physical pa1;              |                          |
| x \|-> pa1;                |                          |
| x ↦ invalid;               |                          |
| y \|-> pa1;                |                          |
| *pa1 = 0;                  |                          |
| identity 0x1000 with code; |                          |
| Thread 0                   | Thread 1                 |
| STR X0, [X1]               | MOV X0,#1                |
| DSB SY                     | LDR X0, [X1]             |
| TLBI VAE1IS, X5            | MOV X2,#1                |
| DSB SY                     | LDR X2, [X3]             |
| STR X2, [X3]               |                          |
|                            | Thread 1 EL1 Handler     |
|                            | 0x1400:                  |
|                            | MRS X13,ELR_EL1          |
|                            | ADD X13,X13,#4           |
|                            | MSR ELR_EL1,X13          |
|                            | ERET                     |
| Final State                |                          |
| 1:X0=1 & 1:X2=0            |                          |



Even though the broadcast TLBI on Thread 0 (c) ensures that not-yet-completed instructions using the old mapping are restarted, it does not require that the second load of x in Thread 1 (h) be restarted if it has already satisfied its value, as that value must have come from a write before the TLBI.

Figure 8.40: Test RBS+dsb-tlbiis-dsb+poloc

**Atomic TLBIs** In the previous RBS-shaped tests, I describe the behaviour in terms of writes that occur 'before' the TLBI.

Microarchitecturally a TLBI instruction is very non-atomic: it sends messages to all other cores, performs some action, and sends messages back to the originating core. The program-order earlier DSB ensures that program-order earlier instructions are complete before sending the messages. The program-order later DSB ensures that all program-order later instructions wait for those messages to return.

The presence of these DSBs ensures that the TLBI's effect happens entirely at that point in the instruction stream, and cannot be broken up and re-ordered amongst the other instructions in the stream. This, coupled with the fact that these messages *strengthen* and never weaken the behaviour of other cores, means that you cannot observe a partial TLBI effect, so long as the programmer takes care to maintain the required thread-local ordering.

Because of this, we can think of the TLBI as executing either before an instruction or after an instruction, but do not need to consider a TLBI executing in the middle of another instruction. This allows us to simplify things, fitting TLBIs into a (generalised) coherence order, with other writes occurring either before or after.

#### 8.6.4 Virtualization

Throughout this section we have considered tests for stage 1 translation with virtual mappings. But many of these questions and behaviours also apply to the stage 2 intermediate physical mappings, with some key differences.

**Virtual to physical and IPA caches** The existence of TLBs that cache virtual to physical mappings (§8.5.4) complicates the TLB maintenance sequence required for changes to the intermediate physical mappings.

When invalidating stale second stage entries from the TLB, the programmer is required to do *two* sets of invalidations: first, one TLB invalidation to remove any of the old entries for the old IPA to PA mapping; then, perhaps surprisingly, a second TLB invalidation to remove any stale whole-translation VA to PA mappings, or any combination thereof, as these could have indirectly cached the result of a second stage translation without remembering the IPA.

This can be seen in MP.RT.EL2+dsb-tlbiipais-dsb+dsb-isb [Figure 8.41], where invalidation of *just* the IPA is not enough. Adding an invalidation of the VA (or all VAs), as in MP.RT.EL2+dsb-tlbiipais-dsb-tlbiis-dsb+dsb-isb [Figure 8.42], ensures that later translations cannot see the stale value anymore.
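The need for two sets of invalidations can be sketched with a toy TLB that holds two independent caching structures. The class and method names are hypothetical, and the model is deliberately crude: it only assumes, as described above, that a whole-translation entry does not remember the IPA it was derived from.

```python
class ToyTwoStageTLB:
    """Toy TLB with two independent caching structures."""

    def __init__(self):
        self.ipa_to_pa = {}  # stage 2 only: IPA -> PA
        self.va_to_pa = {}   # whole translations: VA -> PA, IPA forgotten

    def fill(self, va, ipa, pa):
        """Record a completed two-stage walk in both structures."""
        self.ipa_to_pa[ipa] = pa
        self.va_to_pa[va] = pa

    def invalidate_ipa(self, ipa):
        """Modelled on a TLBI by IPA: only stage 2 entries are removed."""
        self.ipa_to_pa.pop(ipa, None)

    def invalidate_all_va(self):
        """Modelled on a TLBI of all stage 1 and whole-translation entries."""
        self.va_to_pa.clear()

    def lookup(self, va, ipa):
        """Return a cached PA if either structure hits, else None (a miss)."""
        if va in self.va_to_pa:
            return self.va_to_pa[va]
        return self.ipa_to_pa.get(ipa)


tlb = ToyTwoStageTLB()
tlb.fill("x", "ipa1", "pa1")
tlb.invalidate_ipa("ipa1")
after_ipa_only = tlb.lookup("x", "ipa1")   # stale whole translation survives
tlb.invalidate_all_va()
after_both = tlb.lookup("x", "ipa1")       # miss: must walk memory again
```

The first lookup corresponds to the allowed outcome of Figure 8.41, the second to the forbidden one of Figure 8.42.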

| AArch64                    | MP.RT.EL2+dsb-tlbiipais-dsb+dsb-isb |
|----------------------------|-------------------------------------|
| Initial State              |                                     |
| 0:R0=0b0                   | 1:R1=y                              |
| 0:R1=pte3(ipa1, s2_table)  | 1:R3=x                              |
| 0:R2=0b1                   | 1:VBAR_EL2=0x2000                   |
| 0:R3=z                     | 1:PSTATE.EL=0b00                    |
| 0:R4=page(ipa1)            |                                     |
| 0:PSTATE.EL=0b10           |                                     |
| physical pa1 pa2;          |                                     |
| intermediate ipa1 ipa2;    |                                     |
| x \|-> ipa1;               |                                     |
| ipa1 \|-> pa1;             |                                     |
| ipa1 ↦ invalid;            |                                     |
| y \|-> ipa2;               |                                     |
| ipa2 \|-> pa2;             |                                     |
| z \|-> pa2;                |                                     |
| identity 0x2000 with code; |                                     |
| *pa1 = 0;                  |                                     |
| *pa2 = 0;                  |                                     |
| Thread 0                   | Thread 1                            |
| STR X0, [X1]               | LDR X0, [X1]                        |
| DSB SY                     | DSB SY                              |
| TLBI IPAS2E1IS,X4          | ISB                                 |
| DSB SY                     | MOV X2,#1                           |
| STR X2, [X3]               | LDR X2, [X3]                        |
|                            | Thread 1 EL2 Handler                |
|                            | 0x2400:                             |
|                            | MRS X13,ELR_EL2                     |
|                            | ADD X13,X13,#4                      |
|                            | MSR ELR_EL2,X13                     |
|                            | ERET                                |
| Final State                |                                     |
| 1:X0=1 & 1:X2=0            |                                     |
| Allow (if not ETS)         |                                     |

Despite the TLB invalidation of the stale IPA (c), a later stage 2 translation-read of that IPA (i1) can still see the old stale value.



Figure 8.41: Test MP.RT.EL2+dsb-tlbiipais-dsb+dsb-isb

| AArch64                    | MP.RT.EL2+dsb-tlbiipais-dsb-tlbiis-dsb+dsb-isb |
|----------------------------|------------------------------------------------|
| Initial State              |                                                |
| 0:R0=0b0                   | 1:R1=y                                         |
| 0:R1=pte3(ipa1, s2_table)  | 1:R3=x                                         |
| 0:R2=0b1                   | 1:PSTATE.EL=0b00                               |
| 0:R3=z                     | 1:PSTATE.SP=0b0                                |
| 0:R4=page(ipa1)            | 1:VBAR_EL2=0x2000                              |
| 0:PSTATE.EL=0b10           |                                                |
| physical pa1 pa2;          |                                                |
| intermediate ipa1 ipa2;    |                                                |
| x \|-> ipa1;               |                                                |
| ipa1 \|-> pa1;             |                                                |
| ipa1 ↦ invalid;            |                                                |
| y \|-> ipa2;               |                                                |
| ipa2 \|-> pa2;             |                                                |
| z \|-> pa2;                |                                                |
| identity 0x2000 with code; |                                                |
| *pa1 = 0;                  |                                                |
| *pa2 = 0;                  |                                                |
| Thread 0                   | Thread 1                                       |
| STR X0, [X1]               |                                                |
| DSB SY                     | LDR X0, [X1]                                   |
| TLBI IPAS2E1IS,X4          | DSB SY                                         |
| DSB SY                     | ISB                                            |
| TLBI VMALLE1IS             | LDR X2, [X3]                                   |
| DSB SY                     |                                                |
| STR X2, [X3]               |                                                |
|                            | Thread 1 EL2 Handler                           |
|                            | 0x2400:                                        |
|                            | MOV X2,#1                                      |
|                            | MRS X13,ELR_EL2                                |
|                            | ADD X13,X13,#4                                 |
|                            | MSR ELR_EL2,X13                                |
|                            | ERET                                           |
| Final State                |                                                |
| 1:X0=1 & 1:X2=0            |                                                |
| Forbid                     |                                                |

By performing TLB invalidation of the stage 1 entries (e) after invalidating the stage 2 ones (c1), it is guaranteed that the later translation-read (k1) cannot see the old stale value anymore.



Figure 8.42: Test MP.RT.EL2+dsb-tlbiipais-dsb-tlbiis-dsb+dsb-isb

#### 8.6.5 Break-before-make

TLBs are not required to store only a single cached translation for a given address. There may, in general, be multiple valid translations cached in the TLB.

To avoid this possibility, the architecture provides a *break-before-make* sequence, which will ensure that there cannot be two cached translations existing in the TLB at the same time.

The architecture requires break-before-make when writing to the translation tables to update an already valid entry with a new valid entry, and the change involves any of the following<sup>1</sup>:

- A change in output address, if the new or old entry is writeable.
- A change in output address, if the new and old locations have different contents.
- A change in memory type.
- A change in block size (e.g. replacing a 4KiB leaf page with a 2MiB block mapping).

For those cases where break-before-make is required, the programmer must:

- (1) write an invalid entry to overwrite the currently valid translation table entry in memory;
- (2) perform a dsb sy (or equivalent);
- (3) perform any TLB maintenance required to sufficiently invalidate the old entry from any TLB(s) required;
- (4) perform a dsb sy (or equivalent);
- (5) write the new valid translation table entry, overwriting the old invalid entry.
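The five steps above can be written down as a small generator of the corresponding instruction sequence. This is only a sketch: `bbm_sequence` and its parameters are hypothetical, and the choice of `TLBI VAE1IS` for step (3) is an assumption that matches a stage 1 last-level update as in Figure 8.43; other kinds of update need different or additional TLB maintenance.

```python
def bbm_sequence(invalid_reg, entry_addr_reg, page_reg, new_desc_reg):
    """Return the break-before-make steps as AArch64 assembly lines.

    invalid_reg    -- register holding an invalid descriptor (e.g. zero)
    entry_addr_reg -- register holding the address of the table entry
    page_reg       -- register holding the TLBI operand for the page
    new_desc_reg   -- register holding the new valid descriptor
    """
    return [
        f"STR {invalid_reg}, [{entry_addr_reg}]",   # (1) break the entry
        "DSB SY",                                   # (2) complete the write
        f"TLBI VAE1IS, {page_reg}",                 # (3) invalidate stale copies
        "DSB SY",                                   # (4) wait for the TLBI
        f"STR {new_desc_reg}, [{entry_addr_reg}]",  # (5) make the new entry
    ]


# Register allocation following Thread 0 of BBM+dsb-tlbiis-dsb.
sequence = bbm_sequence("X0", "X1", "X6", "X2")
```

With the registers shown, the generated lines coincide with Thread 0 of the BBM+dsb-tlbiis-dsb test.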

**Litmus test** For completeness, BBM+dsb-tlbiis-dsb [Figure 8.43] gives possibly the simplest valid-to-valid concurrent update test.

#### Break-before-make violations

Architecturally, there is no hard requirement to perform break-before-make. Failure to do so simply leads to a degraded state, defined by ConstrainedUnpredictable behaviour.

The Arm reference manual does make it clear that failure to perform break-before-make when required can lead to failure of single-copy atomicity, coherence or even the full breakdown of uniprocessor semantics. While the reference manual does not give motivation for this, we can speculate that this is to allow hardware to perform multiple translations during execution of the instruction, for example, during hazard checking. As such, we do not try to give a full picture of ConstrainedUnpredictable behaviour arising from break-before-make not being followed.

Understanding ConstrainedUnpredictable in full is future work, but a quick summary might be 'any behaviour that this program could have performed if it wanted to'. That is, an instantaneous change in the state to a random new state that would have been reachable by executing arbitrary code at that same exception level, security state and translation regime.

#### 8.6.6 ASIDs and VMIDs

In an effort to reduce the expense of TLB maintenance the architecture provides a mechanism to separate out the address spaces by tagging translations with *address space identifiers* (or ASIDs). These ASIDs allow TLB entries to be tagged with only the address space they are used with, and allow TLB maintenance operations to selectively target only the address space being updated.

Crucially, this allows software to switch between address spaces without having to invalidate the TLB.

This idea is extended not just to address spaces at EL1 (used primarily for the operating system and its processes), but to EL2 with *virtual machine identifiers* (or VMIDs). These VMIDs serve the same function as ASIDs, giving IDs to address spaces, except that in this case the IDs are given to second-stage IPA to PA address spaces.
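The effect of tagging can be sketched with a toy TLB whose entries are keyed by an identifier as well as by the page. The names here are hypothetical and the model covers only a single level of tagging; the same shape would apply to VMIDs for second-stage entries.

```python
class AsidTaggedTLB:
    """Toy TLB whose entries are tagged with an address space identifier."""

    def __init__(self):
        self.entries = {}  # (asid, virtual page) -> cached descriptor

    def fill(self, asid, vpage, descriptor):
        self.entries[(asid, vpage)] = descriptor

    def lookup(self, asid, vpage):
        """Only entries tagged with the current ASID can hit."""
        return self.entries.get((asid, vpage))

    def invalidate_asid(self, asid):
        """Drop every entry of one address space, leaving the others."""
        self.entries = {
            key: descriptor
            for key, descriptor in self.entries.items()
            if key[0] != asid
        }


tlb = AsidTaggedTLB()
tlb.fill(1, 0x4, "pa1")   # process with ASID 1
tlb.fill(2, 0x4, "pa2")   # same virtual page in the address space of ASID 2
seen_by_asid_2 = tlb.lookup(2, 0x4)   # no invalidation needed on the switch
tlb.invalidate_asid(1)
```

Switching from ASID 1 to ASID 2 needs no invalidation because lookups only hit entries with the matching tag, and invalidating ASID 1 leaves the entries of ASID 2 in place.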

<sup>1</sup>See the Arm ARM "TLB maintenance requirements and the TLB maintenance instructions" [1, D5.10.1 (p4913)] for the full list of conditions.

| AArch64                    | BBM+dsb-tlbiis-dsb   |
|----------------------------|----------------------|
| Initial State              |                      |
| 0:R0=0b0                   | 1:R1=x               |
| 0:R1=pte3(x)               | 1:VBAR_EL1=0x1000    |
| 0:R2=mkdesc3(oa=pa2)       | 1:PSTATE.SP=0b0      |
| 0:R4=0b1                   | 1:PSTATE.EL=0b00     |
| 0:R6=page(x)               |                      |
| 0:PSTATE.EL=0b01           |                      |
| physical pa1 pa2;          |                      |
| x \|-> pa1;                |                      |
| x ↦ invalid;               |                      |
| x ↦ pa2;                   |                      |
| identity 0x1000 with code; |                      |
| *pa2 = 2;                  |                      |
| Thread 0                   | Thread 1             |
| STR X0, [X1]               |                      |
| DSB SY                     |                      |
| TLBI VAE1IS, X6            | LDR X0, [X1]         |
| DSB SY                     |                      |
| STR X2, [X1]               |                      |
|                            | Thread 1 EL1 Handler |
|                            | 0x1400:              |
|                            | MOV X0,#1            |
|                            | MRS X13,ELR_EL1      |
|                            | ADD X13,X13,#4       |
|                            | MSR ELR_EL1,X13      |
|                            | ERET                 |
| Final State                |                      |
| 1:X0=0                     |                      |
| Allow                      |                      |

The update of the translation table entry for x in Thread 0 follows the break-before-make sequence, first *breaking* x (a), then performing the necessary TLBI sequence (b-c-d), before *making* x be the new mapping (e). This ensures the concurrent access in Thread 1 is guaranteed to see either the old value, the intermediate broken page (and so a page fault), or the new value. This test is the variant whose final state asserts that the old value was read.



Figure 8.43: Test BBM+dsb-tlbiis-dsb

#### 8.6.7 Access permissions


Accesses which result in permission faults can have been satisfied from the TLB, and writes which update the AP field of a translation table entry can be cached in the TLB.

Translations can give rise to permission faults. These are unlike translation faults in that they depend not just on the descriptor read, but also on the *kind* of access requested: a read or a write.

Accesses which result in permission faults cause exceptions, much like translation faults do, but the descriptor they fault on may have been read from the TLB. This can be seen in the CoWinvTp.ro+dsb-isb test [Figure 8.44], where a permission fault is experienced even though it is ordered after a write to the translation tables, and so its descriptor must have come from the TLB.

**Multiple cached entries** Changes of access permissions are not necessarily break-before-make violations, which allows us to observe multiple cached entries within the TLB. It is permitted for these entries to exist simultaneously.

When a translation reads from the TLB and there exist multiple entries for the same input address, the hardware is allowed to generate a *TLB conflict abort*. These aborts are reported as data aborts.

If the hardware does not generate a conflict abort, then translation reads of that address are ConstrainedUnpredictable, nondeterministically able to read one or the other or an "amalgamation" of the values [1, K1.2.3 (p11243)].

Here there seems to be a contradiction: it is not required to perform break-before-make, but there is no requirement that only one entry be cached in the TLB. We can sidestep this issue by constructing a test that only changes a single bit of the descriptor, in a way that is not a break-before-make violation, thereby avoiding any questions about how 'amalgamation' of entries happens. This can be seen with the MP.RTpT.ro+dmb-dmb+dsb-isb-dsb-isb test [Figure 8.45], where the existence of multiple cached entries in the TLB allows multiple translation-reads to read from different stale writes.

**Atomic TLB reads** Existence of multiple cached translation table entries in the TLB, without break-before-make violations, introduces the question of whether those TLB fills and subsequent TLB reads must read from entire single-copy atomic writes of the original translation table entries (much like a read of memory would) or whether a translation read can read from a mix of different writes. RMD+dmb [Figure 8.46] ("Read-mixed-descriptor") shows that translation reads cannot partially read from a write: they must read from the entire write or none of it.

| AArch64 | CoWinvTp.ro+dsb-isb |
|---------|---------------------|
| Initial State | |
| 0:R0=0x0 | |
| 0:R1=pte3(x) | |
| 0:R2=0x1 | |
| 0:R3=x | |
| 0:VBAR_EL1=0x1000 | |
| 0:PSTATE.SP=0b0 | |
| physical pa1; | |
| x \|-> pa1 with [AP = 0b11] and default; | |
| $x \mapsto invalid;$ | |
| *pa1 = 0; | |
| identity 0x1000 with code; | |
| Thread 0 | |
| STR X0, [X1] | |
| DSB SY | |
| ISB | |
| MOV X13,#0 | |
| STR X2, [X3] | |
| Thread 0 EL1 Handler | |
| 0x1400: | |
| // read ESR_EL1.ISS to see if Permission or Ti | |
| MRS X14, ESR_EL1 | |
| AND X14, X14, #0b1111 | |
| CMP X14,#0b1111 | |
| MOV X15, #1 // Permission | |
| MOV X16, #2 // Translation | |
| CSEL X13, X15, X16, eq | |
| MRS X20, ELR_EL1 | |
| ADD X20,X20,#4 | |
| MSR ELR_EL1,X20 | |
| ERET | |
| Final State | |
| 0:X13=1 | |
| Allow | |



The translation-read (d1) of x, which happens after the program-order earlier store to the translation tables (a) because of the intervening dsb; isb sequence (b-c), can read from a stale value and result in a permission fault, as the read-only entry from the initial state may be cached in the TLB.

Figure 8.44: Test CoWinvTp.ro+dsb-isb

| AArch64                        | MP.RTpT.ro+dmb-dmb+dsb-isb-dsb-isb |  |
|--------------------------------|------------------------------------|--|
| Initial State                  |                            |  |
| 0:R0=mkdesc3(oa=pa1, AP=DbRD+y |                            |  |
| 0:R1= <b>pte3</b> (x)          | 1:R4=x                     |  |
| 0:R2=0b0                       | 1:VBAR_EL1=0x1000          |  |
| 0:R3=pte3(x)                   | 1:PSTATE.SP=0b0            |  |
| 0:R4=0b1                       |                            |  |
| 0:R5=y                         |                            |  |
| physical pa1 pa2;              |                            |  |
| x \|-> pa1 with [AP = 0b11] and default; |                  |  |
| $x \mapsto pa1$ with [AP = 0b1 | [0] and default;           |  |
| $x \mapsto invalid;$           |                            |  |
| y  -> pa2;                     |                            |  |
| *pa1 = 0;                      |                            |  |
| identity 0x1000 with code;     |                            |  |
| Thread 0                       | Thread 1                   |  |
|                                | LDR X0, [X1]               |  |
|                                | DSB SY                     |  |
| <b>STR</b> X0, [X1]            | ISB                        |  |
| DMB SY                         | LDR X13, [X4]              |  |
| <b>STR</b> X2, [X3]            | MOV X2,X13                 |  |
| DMB SY                         | DSB SY                     |  |
| STR X4, [X5]                   | ISB                        |  |
|                                | LDR X13, [X4]              |  |
|                                | MOV X3,X13                 |  |
|                                | Thread 1 EL1 Handler       |  |
|                                | 0x1400:                    |  |
|                                | // read ESR_EL1.ISS to see |  |
|                                | MRS X14,ESR_EL1            |  |
|                                | <b>AND</b> X14,X14,#0b1    |  |
|                                | CMP X14,#0b1111            |  |
|                                | MOV X15,#1 // Pern         |  |
|                                | MOV X16,#2 // Tran         |  |
|                                | CSEL X13, X15, X16         |  |
|                                | MRS X20, ELR_EL1           |  |
|                                | ADD X20, X20, #4           |  |
|                                | MSR ELR_EL1, X20           |  |
|                                | ERET                       |  |
| Final State                    |                            |  |
| 1:X0=1 & 1:X2=1 & 1:X3=0       |                            |  |
| Allow                          |                            |  |

The first translation-read of x (i1) reads from the write that removes read permissions (a) and this write must have come from the TLB because of the intervening invalidation (c), message pass (e-f), and dsb; isb sequence (g-h). The later translation-read of x (m1) can still see an even older value with read permissions, from the initial state, as it may *also* have been cached in the TLB.



Figure 8.45: Test MP.RTpT.ro+dmb-dmb+dsb-isb-dsb-isb

| AArch64                                       | RMD+dmb              |  |
|-----------------------------------------------|----------------------|--|
| Initial State                                 |                      |  |
| 0:R0=mkdesc3(oa=pa2,                          | AP=DbRD+x            |  |
| 0:R1=pte3(x)                                  | 1:VBAR_EL1=0x1000    |  |
| 0:R2=0x1                                      | 1:PSTATE.SP=0b0      |  |
| 0:R3=y                                        |                      |  |
| physical pa1 pa2;                             |                      |  |
| x \|-> pa1 with [AP = 0b11] and default;      |                      |  |
| $x \mapsto pa2$ with [AP = 0b10] and default; |                      |  |
| y  -> pa2;                                    |                      |  |
| *pa1 = 0;                                     |                      |  |
| *pa2 = 1;                                     |                      |  |
| identity 0x1000 with code;                    |                      |  |
| Thread 0                                      | Thread 1             |  |
| STR X0, [X1]                                  | MOV X0,#0            |  |
| DMB SY                                        | LDR X0, [X1]         |  |
| STR X2, [X3]                                  |                      |  |
|                                               | Thread 1 EL1 Handler |  |
|                                               | 0x1400:              |  |
|                                               | MRS X20,ELR_EL1      |  |
|                                               | ADD X20, X20, #4     |  |
|                                               | MSR ELR_EL1, X20     |  |
|                                               | ERET                 |  |
| Final State                                   |                      |  |
| 1:X0=1                                        |                      |  |
| Forbid                                        |                      |  |

The translation-read of x (d1) cannot read from both the 64-bit single-copy atomic write a as well as from the initial state. Note that this test does not, as far as we can see, violate the break-before-make requirements, as currently prescribed by the Arm manual, as the contents in memory of both locations pa1 and pa2 are the same at the time of the write to the translation tables.

This diagram was generated by hand, as isla does not generate a candidate execution of this shape.



Figure 8.46: Test RMD+dmb

# 8.7 Context synchronisation

There are many operations which change the current context the system is in. We will focus in on two of these: taking and returning from exceptions, and writing to system registers.

These actions can change the context that the system is executing in: the current exception level, the translation regime, the translation table base, the ASID or VMID, and a variety of other system configuration state.

#### 8.7.1 Relaxed system registers


So far, in this and previous work, register reads and writes have been completely coherent: instructions program-order after a write to a register will always read from that write (or an intervening write) when they read that register. System registers break this guarantee.

Arm System registers may require the programmer to insert explicit synchronization, as stated clearly in the Arm reference manual [1, D13.1.2 (p5235)]:

Reads of the System registers can occur out of order with respect to earlier instructions executed on the same PE, provided that both:

- ▶ Any data dependencies between the instructions, including read-after-read dependencies, are respected.
- ▶ The reads to the register do not occur earlier than the most recent Context synchronization event to its architectural position in the instruction stream.

This means a read of a system register might not read from the most recent write to that system register.

To ensure that writes to system registers are seen by program-order later reads, the programmer can ensure that a *Context synchronization* event occurs. These are typically things which flush the pipeline, causing future instructions to restart: the ISB instruction, and taking and returning from exceptions.

There are two important caveats: (1) this does not apply to non-System registers, such as special-purpose or general-purpose registers, and they never require synchronization; and (2), the synchronization required for System registers depends on the *kind* of accesses.

There are typically two kinds of accesses to System registers: direct, and indirect. Direct accesses are the way we think of registers: instructions which specifically read or write to those registers. Indirect accesses happen when an instruction which does not explicitly mention the register by name performs an access, a read or a write, to that register, during the execution of its behaviour.

Because of the out-of-order nature of the pipeline, these indirect register reads and writes may occur out-of-order with respect to any program-order earlier direct reads or writes of that register. This means that before any direct read, and after any direct write, the programmer must perform a context-synchronizing event to ensure that these direct accesses occur in-order with respect to other indirect accesses. The programmer does not have to insert context-synchronization *after* any direct read, as it is guaranteed that register reads or writes cannot be affected by program-order later accesses.

**System register ASL** In the previous chapter we explored the Arm ASL code for the translation table walk and for one of the store instructions. We saw that this ASL code reads from system registers (as indirect reads).

A naive attempt at a first interpretation of the relaxed semantics is to allow these reads to read-from the most recent indirect write and any program-order later direct writes since the last context synchronization event.

However, this would not give the correct behaviour.

The Arm ASL is not written to accommodate relaxed system register behaviours. It leaves questions open about whether these registers can be redundantly re-read during execution, whether the instruction reads the entire register at once or piecemeal over the course of execution, and whether repeated accesses to the same register within an instruction are able to read-from different writes. These questions, and others, are still under discussion with Arm.

We will see later in **§TODO: ?REF?** that we give a simple incomplete (and possibly unsound) interpretation in our model in the *pointed set* semantics of system registers, which allows the model to observe some of the known behaviours in this area, without yet fully exploring the architecture.

**Caching of system registers in TLBs** In addition to being out-of-order due to pipeline effects, some system registers may be indirectly cached within the TLB.

We have already seen one of these: the MAIR register. Direct writes to the MAIR may not be seen by program-order later translations, even after context-synchronization, as the translations may get their value from the TLB and the TLB may have stored a result which depended on the previous value of the MAIR, effectively causing a stale read of it at that point in the instruction stream.

To ensure that an update to the MAIR is observed by later translations therefore requires both TLB maintenance and context synchronization, in that order.

The registers which can be cached in this way, and the behaviours that arise from this caching, are still under current investigation with Arm.

# 8.8 Contributions


We have now covered all the relaxed memory behaviours, and will, in the next chapter, move on to discuss the extant models created to capture those behaviours. But before that, it may at this point be unclear what the *contribution* of this chapter is. The contributions come in three forms: (1) the attempt at some systematic coverage of the kinds of behaviours which systems software must account for; (2) the precise, formal description (in prose, and as litmus tests) of those behaviours; and (3) the clarification of the architecture where such behaviours were otherwise unclear.

**Coverage of behaviours** While this chapter attempts to systematically cover the behaviours we imagine software may try to rely on, starting from the basics of translation table walks and exploring the effects of out-of-order pipelines, caching, and barriers, we cannot claim it is *exhaustive*. As this is a manually compiled and curated list of behaviours, from reading the text and talking with architects, there are surely corner cases missed and software patterns overlooked. However, we believe that, for the features we cover, we have covered the known patterns well enough to support software verification efforts for microkernels and hypervisors.

**Clarification of architecture** Attempts to clarify the architecture come primarily from the confidential discussions with architects. The behaviours discussed usually fell into one of three categories: those that were already clear, those that needed further exploration, and those that are still under investigation by Arm.

The first major category is those behaviours which were already clear and potentially covered in the architecture text. As alluded to right at the start of this chapter, these are not whole sections or sub-sections or even necessarily whole tests. The most obvious cases are §8.3.3 ('Invalid entries'), §8.2.1 ('Virtual coherence'), and §8.6.5 ('Break-before-make'). These are behaviours fundamental to the correctness of all modern systems software, and for which the architecture reference manual has clear words (at least, enough to cover the basic sequences software relies upon).

Most of the subsections fall into a more general category, of things that either had some associated reference materials, or were otherwise clear from discussion with architects, but for which further investigation was needed. This includes: forwarding (§8.4.4) and speculation (§8.4.5) for translation table walks; multi-copy atomic translation table walks (§8.4.7); intra-instruction ordering (§8.4.8, §8.4.9); micro-TLBs (§8.5.3) and partial walk caching (§8.5.4); a variety of TLBI questions (§8.6); and system register accesses (§8.7.1).

Despite the work conducted here, from reading the architecture reference text, discussions with architects, and the testing of existing hardware, there are still many questions which are under current investigation with Arm. These include further questions about the scope of TLBIs, interaction with exceptions and interrupts, changes in cacheability, translations for instruction fetching, and relaxed system register accesses. Those areas will require more work before giving a concrete semantics.


# An axiomatic VMSA model

This chapter is based, in part, on: Relaxed virtual memory in Armv8-A [54] by Ben Simner, Alasdair Armstrong, Jean Pichon-Pharabod, Christopher Pulte, Richard Grisenthwaite, and Peter Sewell. Published in the proceedings of the 31st European Symposium on Programming (ESOP, 2022).

We now define a semantic model for Armv8-A relaxed virtual memory that, to the best of our knowledge, captures the Arm architectural intent for the questions discussed in §??, including Stage 1 and Stage 2 translation-table walks and the required TLB maintenance.

In §8 we described the design issues in microarchitectural terms, discussing the behaviour of translation table walks and TLB caching, along with the needs of system software. We now abstract from microarchitecture: constructing a model based on ordering between translation-read events and others, avoiding modelling TLBs and out-of-order pipelines directly.

This chapter will present this model, as an extension to the 'user-mode' Armv8-A axiomatic model presented in §TODO: ?REF?.

# 9.1 Extended candidate executions

The base Armv8-A axiomatic model is defined as a predicate over *candidate executions*, each of which is a graph with various events (reads, writes, barriers) and relations over them, notably the per-thread program order po, the per-location coherence order co, the reads-from relation rf from writes to reads, the addr, data, and ctrl-dependency subsets of po, and others.

We extend these candidates with both new events and new relations over those events, as well as modifying some of the original ones.

#### 9.1.1 Candidate events

In addition to the events of the original model, we add the following events to the candidates:

- ▶ T for reads originating from architected translation-table walks.
  - These roughly correspond to the actual satisfaction from memory, which with TLBs may happen very early.
- ▶ TLBI events for each TLBI instruction, with a single such event per TLBI instruction, corresponding to the TLBI being completed on all relevant cores.
- ▶ TE and ERET events for taking and returning from an exception (these might not correspond to changes in exception level).
- ▶ MSR events for writes to relevant system registers, such as the TTBR.
- ▶ DSB events for DSB instructions.
- ▶ Fault events for translation and permission faults.

**Translation-reads** During execution of the ASL TranslateAddress function (§7.6) there will be many reads, which would usually generate R events. When those reads happen during the TranslateAddress call, they instead generate T events. This means that each translation table walk may generate up to 24 T events, before the instruction generates the (explicit) R|W event.

Alternative representations were explored, including leaving them as R events or collecting all reads into a single large translation event. But these options did not give the clarity and fine granularity we desired in the model, and would require more relations and axioms than presented here.

We also choose not to include TLB hits and misses in the model directly, but instead model the TLB as a relaxation of the values the walk can read from, much like normal R data memory read events and modelling load buffering, write gathering and caches.

We add a helper set, T\_f, for translation reads which read-from a write whose value is even, that is, an entry whose valid bit is 0. If a translation read results in a fault, either because it read an invalid entry (giving a translation fault), or because the access permissions of the resulting translation do not permit the kind of access requested (giving a permission fault), the candidate will contain a Fault event (partitioned into Fault\_t and Fault\_p for translation and permission faults) in po order where the explicit memory event would have been. See text on obETS for more discussion of these 'ghost' fault events.
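The evenness check behind T\_f can be sketched in a few lines of Python. This is a minimal illustration, assuming that bit 0 of a descriptor is its valid bit, so that an even value denotes an invalid entry; the function name, event names and descriptor values are hypothetical.

```python
def reads_invalid_entry(descriptor: int) -> bool:
    """True if the descriptor value is even, i.e. bit 0 (the valid bit) is clear."""
    return (descriptor & 1) == 0


# Hypothetical values read by three translation-read events.
values = {
    "t1": 0x0,                    # all-zero descriptor: invalid
    "t2": 0x0000_0000_8000_0003,  # bit 0 set: valid
    "t3": 0x0000_0000_8000_0002,  # bit 0 clear: invalid
}

# The analogue of T_f: the translation reads whose value is even.
T_f = {event for event, value in values.items() if reads_invalid_entry(value)}
```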

We partition the T set into two subsets: Stage1 and Stage2 for translation read events from a stage 1 or stage 2 walk respectively (stage 2 reads during a stage 1 walk are marked as stage 2, not stage 1).

Finally, we leave the M set unchanged, which contains only the explicit reads and writes performed by instructions.

**TLBIs** As described in §7.7 Arm have a variety of TLBI instructions, with varying arguments. All of these TLBIs generate a single TLBI event.

To aid in modelling, we define a set of subsets of TLBI, one for each kind of TLBI:

- ▶ TLBI-S1 for invalidations of Stage 1 entries.
- ▶ TLBI-S2 for invalidations of Stage 2 entries.
- ▶ TLBI-IPA for invalidations by intermediate physical address.
- ▶ TLBI-VA for invalidations by virtual address.
- ▶ TLBI-ASID for invalidations by ASID.
- ▶ TLBI-VMID for invalidations by VMID.
- ▶ TLBI-ALL for the TLBI ALL instructions.
- ▶ TLBI-IS for broadcast TLBIs.
- ▶ TLBI-EL1 for invalidations of the EL1&0 regime.
- ▶ TLBI-EL2 for invalidations of the EL2 regime.

These subsets do not partition the TLBI set: any TLBI event may belong to several of them. For example, a TLBI VAE1IS event would belong to TLBI-VA, TLBI-VMID, TLBI-EL1, and TLBI-IS.
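The overlap between these subsets can be pictured as a mapping from each TLBI event to the subsets that contain it. In the Python sketch below, only the membership of the TLBI VAE1IS event is taken from the text; the event name and the helper function are hypothetical.

```python
# Subsets containing each TLBI event; the TLBI VAE1IS row follows the text.
subsets_of = {
    "tlbi_vae1is": {"TLBI-VA", "TLBI-VMID", "TLBI-EL1", "TLBI-IS"},
}


def events_in(subset):
    """Recover a subset (for example TLBI-VA) as the set of events it contains."""
    return {event for event, names in subsets_of.items() if subset in names}


# The same event appears in several subsets, so the subsets are not disjoint.
overlap = events_in("TLBI-VA") & events_in("TLBI-IS")
```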

We also include all TLBIs in a general C ("Cache maintenance") set.

**Exceptions** Despite not modelling exceptions in general in this work, we do need to include some exception machinery in the model to capture the minimal ordering requirements arising from both their context synchronisation effects and also behaviours from crossing exception level boundaries.

To support this we add two new events to capture taking and returning from exceptions: TE ("Take-exception") and ERET.

**Barriers** The Arm DSB ("Data synchronization barrier") instruction is required for TLB maintenance, as was seen in the previous chapter. We include DSB events, one for each kind of DSB instruction:

- ▶ DSB.SY and DSB.ISH (here equivalent, as we do not model shareability domains)
- ▶ DSB.NSH
- ▶ DSB.ST
- ▶ DSB.LD

Arm define a hierarchy of barriers where, for example: DMB.LD < DMB.SY < DSB.SY. That is, any ordering imposed by a DMB.LD is also imposed by a DMB.SY, and therefore also by a DSB.SY.

To help capture this, and reduce the explosion in the number of relations in the model later on, we simplify and update the barrier story in the Arm model and include the helper sets given in Figure 9.1.

```
let dsbsy = DSB.ISH | DSB.SY | DSB.NSH
let dsbst = dsbsy | DSB.ST | DSB.ISHST | DSB.NSHST
let dsbld = dsbsy | DSB.LD | DSB.ISHLD | DSB.NSHLD
let dsbnsh = DSB.NSH
let dmbsy = dsbsy | DMB.SY
let dmbst = dmbsy | dsbst | DMB.ST | DSB.ST | DSB.ISHST | DSB.NSHST
let dmbld = dmbsy | dsbld | DMB.LD | DSB.ISHLD | DSB.NSHLD
let dmb = dmbsy | dmbst | dmbld
let dsb = dsbsy | dsbst | dsbld
```

**Figure 9.1:** Barrier helper sets.
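To see how these helper sets encode the barrier hierarchy, the following Python sketch transcribes one reading of Figure 9.1 as plain sets of barrier kinds (members that a Cat definition repeats redundantly are omitted). Each helper set contains the named barrier together with every stronger barrier, so the hierarchy DMB.LD < DMB.SY < DSB.SY shows up as set inclusion. The use of strings for event kinds is an assumption of the sketch, not part of the model.

```python
# Each helper set contains a barrier kind and all barriers at least as strong.
dsbsy = {"DSB.ISH", "DSB.SY", "DSB.NSH"}
dsbst = dsbsy | {"DSB.ST", "DSB.ISHST", "DSB.NSHST"}
dsbld = dsbsy | {"DSB.LD", "DSB.ISHLD", "DSB.NSHLD"}
dmbsy = dsbsy | {"DMB.SY"}
dmbst = dmbsy | dsbst | {"DMB.ST"}
dmbld = dmbsy | dsbld | {"DMB.LD"}
dmb = dmbsy | dmbst | dmbld
dsb = dsbsy | dsbst | dsbld

# A DSB.SY counts wherever a DMB.SY or DMB.LD would do, but not the reverse.
dsb_sy_acts_as_dmb_ld = "DSB.SY" in dmbld
dmb_ld_acts_as_dmb_sy = "DMB.LD" in dmbsy
```

An axiom that asks for ordering through `dmbld` is therefore also satisfied by any stronger barrier, without needing a separate clause for each barrier kind.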

**Context changing and synchronisation** Finally, we add events for context-changing and context-synchronising operations. Context changes involve updates to system registers which change the current translation regime, which generate MSR events. We add a general context-synchronisation event set CSE which includes ISB, TE and ERET.

Changes to system registers may have relaxed behaviours, as described in §8.7.1, but full relaxation of the system register reads done by the Arm pseudocode is unlikely to be valid, consistent or meaningful. Instead, we introduce a *pointed-set semantics*: when generating a candidate, we keep a per-system-register set of writes to that register, remembering which one is the most recent. On a write to that system register, we add it to the set. On a read of that system register, we generate one candidate for each value in the set, and then 'lock' the remainder of the execution of that instruction to that value, so that repeated reads see the same value. When a context-synchronization event is generated (that is, an event that will be in the CSE set) all the sets are reduced to singleton sets containing only the most recent write.

This gives us some relaxed behaviours, enough to see relaxed behaviours around changes to the TTBR, but we note that this is unlikely to be the full story for relaxed system registers.
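The pointed-set bookkeeping can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions: it tracks a single register, represents writes by their values, and omits the per-instruction 'lock'. The class and method names are hypothetical and do not correspond to any tool's implementation.

```python
class PointedSet:
    """Set of writes to one system register, with the most recent distinguished."""

    def __init__(self, initial):
        self.values = {initial}
        self.latest = initial

    def write(self, value):
        # A write joins the set and becomes the distinguished (most recent) point.
        self.values.add(value)
        self.latest = value

    def read_candidates(self):
        # One candidate execution is generated per value a read may observe.
        return set(self.values)

    def context_sync(self):
        # A CSE event collapses the set to only the most recent write.
        self.values = {self.latest}


ttbr = PointedSet(initial=0x1000)
ttbr.write(0x2000)
before_sync = ttbr.read_candidates()  # either the old or the new value
ttbr.context_sync()                   # e.g. an ISB
after_sync = ttbr.read_candidates()   # only the most recent write remains
```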

#### 9.1.2 Candidate relations


In addition to those new events, we introduce new relations over those (and other) events:

- ▶ trf and tfr as analogues to rf and fr but for translation-read (T) events.
- ▶ iio ("intra-instruction order") which relates events of the same instruction in the order they occur during execution of that instruction's intra-instruction semantics as defined by the Arm ASL.
- ▶ same-va, same-ipa, same-pa relations which relate events whose virtual, intermediate physical or physical address of the associated explicit memory access are the same.
- ▶ wco, a generalised coherence order which includes TLBIs.

**Addresses, ASIDs and VMIDs** Each translation table walk will read from registers and system registers and get a value for the (input) address, the current ASID and current VMID. We then relate each T with any other T where the translation associated with it is for the same virtual address (with same-va), the same intermediate-physical address (with same-ipa), or the same resulting physical address (same-pa). This means that all T events within a translation have the same same-\* relations. We also include relations which match translations' virtual, intermediate physical and physical addresses if they are in the same page rather than exactly equal, with the same rules, but as a same-\*-page relation.

If two translations are for the same ASID, their translation reads are related by same-asid. If two translations are for the same VMID, their translation reads are related by same-vmid.

To use these relations we also include TLBI events. A TLBI-X event is related to a T event by same-X if the parameter of the TLBI instruction (the page, VMID, or ASID), whether passed by register, by immediate or through the current context, matches the corresponding X of the T event's associated translation. For example, a TLBI-IPA event would be same-ipa-page related to a T whose translation was for an intermediate physical address in the page provided as the parameter to the TLBI IPA instruction.

**Generalised coherence order** We create an extended coherence order wco, which contains the usual co (a per-location total order of writes to that location) as well as the ordering of those writes relative to all TLBI events.

One might be concerned about the validity of doing this, on two fronts. Firstly, extending coherence to a total order over all locations is, in general, sound [9, §10.5 p174], and so there is no issue in doing this. Secondly, for broadcast TLBIs, microarchitecture will implement these with message passing to and from each core separately, and so there is no single moment the TLBI 'happens'. However, as described in §8.6.7 we seem to be able to consider TLBI instructions as executing 'atomically' so long as there are no break-before-make violations. This is a similar justification to that for including DC and IC events in a similar generalised coherence order for instruction fetching [58, §5 p29].

**Dependencies** A candidate execution consists not only of events and reads-from relations, but also of a set of dependencies: addr, data, ctrl, po and loc. We add iio and tdata to these.

The intra-instruction ordering iio relation relates two events in the same instruction in the order the Arm pseudocode generated the events. This relation therefore captures a total order over all events within an instruction, regardless of the intra-instruction dependencies (control, data) or unordered accesses (for example, for misaligned accesses). We are currently investigating a relaxation of this ordering, and associated changes in the underlying Arm pseudocode definitions, to enable a more relaxed definition of the ordering within an instruction to handle these cases.

We make loc relate events with the same physical address (for T events, this is the physical location of the translation table entry).

Program order (po) is restricted to explicit events: R, W, F, C, CSE and MSR. Implicit translation reads (T) and any indirect reads or writes of registers are not included in po.

Address dependencies were once fundamental, but now we can define address dependencies in the presence of address translation as dependencies into the translation table walk. To do this, we include a new relation tdata that relates reads with the translation read events of a translation which reads from the register written by that read to compute the address. The traditional addr can be recovered as tdata; iio\*; [M].
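As a small worked instance of recovering addr, relations can be modelled in Python as sets of pairs. The helper functions and the two-instruction execution below are hypothetical, chosen only to illustrate the composition tdata; iio\*; [M].

```python
def compose(r, s):
    """Relational composition r; s."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}


def refl_trans_closure(r, events):
    """r*: reflexive-transitive closure of r over the given events."""
    closure = {(e, e) for e in events} | set(r)
    while True:
        bigger = closure | compose(closure, closure)
        if bigger == closure:
            return closure
        closure = bigger


def identity_on(events):
    """[S]: the identity relation restricted to the set S."""
    return {(e, e) for e in events}


# Hypothetical execution: a load R0 whose result is the address of a second load.
# The second instruction performs a translation read T1, then the explicit read R1.
events = {"R0", "T1", "R1"}
M = {"R0", "R1"}          # explicit memory accesses
tdata = {("R0", "T1")}    # the register written by R0 feeds the translation
iio = {("T1", "R1")}      # intra-instruction order

addr = compose(compose(tdata, refl_trans_closure(iio, events)), identity_on(M))
```

In this execution the only recovered address dependency is from R0 to R1; the intermediate translation read T1 is filtered out by [M].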

## 9.2 Cat model

The base Arm axiomatic model had three axioms: internal, external, and atomic. These were acyclicity and emptiness checks over unions of a set of relations: obs, dob, aob and bob. We will slightly modify three of these relations (obs, bob and dob), and add five new ones (tob, obtlbi, ctxob, obfault, obETS) to handle the ordering between translations and TLBIs, and include them in the external acyclicity check. Then we will introduce one final new axiom, translation-internal.

Figure 9.2 contains the axioms and relations for the updated Armv8-A relaxed virtual memory axiomatic model (RVM). Unchanged parts from the original are greyed out. Note that some helper relations are elided here, and will be described in more detail later.

## 9.3 Axioms

The RVM model axioms are, mostly, a syntactic extension to the original Armv8-A axiomatic model presented in §TODO: ref intro. This is deliberate. Although there may be other, perhaps even nicer or more succinct, ways of phrasing the given model, the variation presented here is designed to be syntactically as close as possible to the original. This helps with readability for those familiar with the original; it allows us to present the differences to the original in an easier form; it makes recovery of the original model easier; and, it makes it easier to prove equivalence of the axiomatic models in the presence of constant address translation, increasing the confidence we have in the model.

The model has 3 kinds of axioms: internal ones for per-location guarantees, an external axiom for the global happens-before ordering, and the atomic axiom for RMWs (untouched in this work).

**Internal axioms** The new model has two per-location axioms: internal and translation-internal.

```
let speculative =
    ctrl
    | addr; po
    | [T]; instruction-order

(* translation-ordered-before *)
let tob =
    [T_f]; tfre
    | [T]; iio; [R|W]; po; [W]
    | speculative; trfi

(* observed by *)
let obs = rfe | fr | wco
    | trfe

(* dependency-ordered-before *)
let dob =
    addr | data
    | speculative; [W]
    | addr; po; [W]
    | (addr | data); rfi
    | (addr | data); trfi

(* atomic-ordered-before: aob is unchanged from the original model *)

(* barrier-ordered-before *)
let bob =
    [R]; po; [dmbld]
    | [W]; po; [dmbst]
    | [dmbst]; po; [W]
    | [dmbld]; po; [R|W]
    | [L]; po; [A]
    | [A | Q]; po; [R | W]
    | [R | W]; po; [L]
    | [F | C]; po; [dsbsy]
    | [dsb]; po

(* translate ordered-before TLBI *)
let obtlbi_translate =
    [T & Stage1]; tlb_barriered; [TLBI-S1]
    | ([T & Stage2]; tlb_barriered; [TLBI-S2])
        & (same-translation; [T & Stage1]; trf^-1; wco^-1)
    | (([T & Stage2]; tlb_barriered; [TLBI-S2]); wco?; [TLBI-S1])
        & (same-translation; [T & Stage1]; maybe_TLB_cached)

(* ordered-before TLBI *)
let obtlbi =
    obtlbi_translate
    | [R|W|Fault_T]; iio^-1; (obtlbi_translate & ext); [TLBI]

(* context-change ordered-before *)
let ctxob =
    speculative; [MSR]
    | [CSE]; instruction-order
    | [ContextChange]; po; [CSE]
    | speculative; [CSE]
    | po; [ERET]; instruction-order; [T]

(* ordered-before a translation fault *)
let obfault =
    data; [Fault_T & IsFromW]
    | speculative; [Fault_T & IsFromW]
    | [dmbst]; po; [Fault_T & IsFromW]
    | [dmbld]; po; [Fault_T & (IsFromW | IsFromR)]
    | [A|Q]; po; [Fault_T & (IsFromW | IsFromR)]
    | [R|W]; po; [Fault_T & IsFromW & IsReleaseW]

(* ETS-ordered-before *)
let obETS =
    (obfault; [Fault_T]); iio^-1; [T_f]
    | ([TLBI]; po; [dsb]; instruction-order; [T]) & tlb-affects

(* Ordered-before *)
let ob = (obs | dob | aob | bob
    | iio | tob | obtlbi | ctxob | obfault
    | obETS)^+

(* Internal visibility requirement *)
acyclic po-loc | fr | co | rf as internal

(* External visibility requirement *)
irreflexive ob as external
```

Figure 9.2: RVM axioms and relations

```
(* Internal visibility requirement *)
acyclic po-loc | fr | co | rf as internal
(* Writes cannot forward to po-future translates *)
acyclic (po-pa | trfi) as translation-internal
```

Unchanged from the original, the internal axiom captures the SC-per-location guarantee briefly discussed in §TODO: ?REF?. Translations, however, do not have the same per-location guarantees. To account for this, we introduce a second axiom, translation-internal, which captures the weaker per-location guarantee for translation table walks. Since translation reads, in the presence of TLB caching and out-of-order pipelines, do not guarantee even coherence, the only behaviour this axiom ends up preventing is translation reads reading from program-order later stores.

**External axiom** The external axiom asserts acyclicity of the global happens-before ordering for Arm. The happens-before (called ob, 'ordered-before', in Arm) relation is the union of all the ordering relations, given in §9.4.

```
(* Ordered-before *)
let ob = (obs | dob | aob | bob | iio | tob | obtlbi | ctxob | obfault |
    obETS)^+
(* External visibility requirement *)
irreflexive ob as external
```

We choose to include all the pipeline and TLB effects as ordering requirements, rather than introducing new ordering axioms just for translation and TLB invalidation. This produces a model that is more consistent with the previous Arm memory models, and ensures that ordering information gained through observing translation table walks is respected by non-translation-table accesses.
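To make the shape of the external check concrete, the following Python sketch computes the transitive closure of a union of ordering edges and tests it for irreflexivity. The event names and the particular edges are hypothetical and do not correspond to any specific litmus test in this thesis.

```python
def transitive_closure(r):
    """r^+: keep adding two-step paths until nothing new appears."""
    closure = set(r)
    while True:
        step = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if step <= closure:
            return closure
        closure |= step


def irreflexive(r):
    """True when no event is related to itself."""
    return all(a != b for (a, b) in r)


# Hypothetical ordering edges between four events.
trfe = {("W_pte", "T")}       # T reads a translation-table write from another thread
tob = {("T", "W_flag")}       # the write of T's instruction waits for T
rfe = {("W_flag", "R_flag")}  # another thread reads that write
dob = {("R_flag", "W_pte")}   # and the translation-table write depends on that read

ob_allowed = transitive_closure(trfe | tob)
ob_forbidden = transitive_closure(trfe | tob | rfe | dob)
```

The first candidate satisfies the check; in the second the four edges form a cycle, so every event is ob-before itself and the candidate is rejected.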

**Atomic axiom** The atomic axiom remains unchanged. In this work we do not consider the interaction of translation with atomic accesses.

```
(* Atomic requirement *)
empty rmw & (fre; coe) as atomic
```

## 9.4 Relations

The RVM model modifies some of the original, and introduces some new, ordering relations. This section goes through each in detail, describing the mechanism and justifying the existence or non-existence of particular clauses.

#### 9.4.1 obs

```
(* observed by *)
let obs = rfe | fr | wco | trfe
```

The 'observed-by' relation includes the original rfe and fr (over physical locations), the *generalised coherence* order wco (§9.1.2), and the trfe (translation-reads-from-external) relation.

**Generalised coherence** Including wco, which is existentially quantified over the candidates, fixes some global order the writes and TLBIs happen in. Consider, informally, some microarchitectural execution. It would propagate writes to the coherent storage subsystem, and would complete TLBI instructions, and these events would be interleaved in some whole-machine trace. The generalised wco relation captures the relative ordering of these events in the axiomatic model, as they would have happened in the traces of machine executions. The model is then quantified over all such orderings, accounting for any interleaving of these events.

**External translation reads** Inclusion of trfe enforces that translation-table-walk translation reads, which could not come from forwarding, must have originally come from the coherent storage subsystem and so the write must have been globally propagated before the translation read happened (§8.4.2, §8.4.7).

However, the translation read might have happened much later, either due to extreme out-of-order execution (§8.4.1) or TLB caching (§8.5.1), and so we do not include tfre (translation-from-reads-external) in ob.

Additionally, writes may be propagated to that thread's translation table walker before they are propagated to the coherent storage subsystem (§8.4.4), in other words, they can be forwarded. Therefore we do not include trfi (translation-reads-from-internal) in obs.

#### 9.4.2 dob

```
let dob =
    addr | data
    | speculative; [W]
    | addr; po; [W]
    | (addr | data); rfi
    | (addr | data); trfi
```

The dependency-ordered-before relation is mostly unchanged: we add a single (addr | data); trfi clause to the end to forbid thin-air creation of values (§8.4.1, §8.4.2, TODO: need dedicated thin air paragraph/test in prev chapter), similarly to the original model for data memory reads.

#### 9.4.3 bob

```
let bob =
    [R]; po; [dmbld]
    | [W]; po; [dmbst]
    | [dmbst]; po; [W]
    | [dmbld]; po; [R|W]
    | [L]; po; [A]
    | [A | Q]; po; [R | W]
    | [R | W]; po; [L]
    | [F | C]; po; [dsbsy]
    | [dsb]; po
```

We rewrite the original barrier-ordered-before relation to use the barrier helpers defined in Figure 9.1. This does not change the underlying model for DMB instructions, but allows those same clauses to capture the barrier hierarchy imposing the same ordering when using stronger barriers (namely, DSBs).

The Arm DSB instruction has some extra ordering, however. Firstly, a DSB SY orders TLBI instructions (§8.6.2), and so we include [F|C];po;[dsbsy]. Secondly, all program-order later events must wait for an earlier DSB to finish before performing their explicit memory events, so we also include [dsb];po in ob.

#### 9.4.4 tob

```
let tob =
   [T_f]; tfre
   | [T]; iio; [R|W]; po; [W]
   | speculative; trfi
```

Translation table walks themselves impose ordering on the surrounding events.

**Invalid writes** The first of these is one of the key behaviours described in §8.3.3: reads of invalid entries must not have come from the TLB. We add the [T\_f]; tfre edge to capture this: any translation read which reads an invalid entry must happen before any writes that are coherence-after the one it read from.

There is a major caveat here: write forwarding to the translation table walker. We cannot simply have [T\_f]; tfr as a thread-local write may be forwarded to the translation table walker before it's propagated to memory (§8.4.4).

However, a translation read should not be able to read, via forwarding, from a write that is too old, or (with FEAT\_ETS) from one that is behind a DSB, although there may be other intervening writes in between. For now, we are unable to give a precise bound on the ordering for thread-local [T\_f]; tfr, and this area is still under investigation with Arm.

**Speculation** As we saw earlier, speculation interacts with translation in two ways: first, it is forbidden to read-from a still speculative write (§8.4.5), and, secondly, events program-order after an instruction which does a translation table walk are speculative until the translation table walk completes (§8.4.1).

To capture these we first define when one event is considered speculative until another event happens, with a new speculative relation, defined as follows:

```
let speculative = ctrl | addr; po | [T]; instruction-order
```

This captures all the control-flow dependencies that we model here, the classic ctrl and addr; po, as well as a new general [T]; instruction-order which says that all events ordered (iio|po)+ after a translation read are speculative until the translation read satisfies. We can then include speculative; trfi to succinctly forbid any forwarding of still-speculative writes to translation table walks.
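The following Python sketch evaluates the speculative relation and the speculative; trfi clause on a small hypothetical execution. It assumes that instruction-order is the (iio|po)+ closure described above; the event names and edges are illustrative only.

```python
def compose(r, s):
    """Relational composition r; s."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}


def transitive_closure(r):
    """r^+: keep adding two-step paths until nothing new appears."""
    closure = set(r)
    while True:
        step = compose(closure, closure)
        if step <= closure:
            return closure
        closure |= step


# Hypothetical thread: a store W1 (with translation read T1), then a store W2
# to a translation-table entry, which a later translation read T3 reads by forwarding.
T = {"T1", "T3"}
iio = {("T1", "W1")}
po = {("W1", "W2")}
ctrl, addr = set(), set()
trfi = {("W2", "T3")}

instruction_order = transitive_closure(iio | po)
speculative = (
    ctrl
    | compose(addr, po)
    | {(a, b) for (a, b) in instruction_order if a in T}  # [T]; instruction-order
)
tob_speculative = compose(speculative, trfi)
```

Both W1 and W2 are speculative until T1 completes, so the forwarding to T3 is ordered after T1.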

Finally, we include [T]; iio; [R|W]; po; [W], which captures that writes cannot propagate until program-order earlier instructions have their physical address (so, do not fault). Although this edge is subsumed by the speculative; [W] edge in dob, it is kept here for clarity.

#### 9.4.5 ctxob

```
let ctxob =
    speculative; [MSR]
    | [CSE]; instruction-order
    | [ContextChange]; po; [CSE]
    | speculative; [CSE]
    | po; [ERET]; instruction-order; [T]
```

The ctxob relation captures the orderings required from context changing and synchronising operations, without trying to capture the full extent of the relaxed behaviours. As such, these orderings are likely to be incomparable to the real semantics.

**Speculation** The first guarantee we see is that context changes and synchronisation should not happen speculatively. Speculative context changes may end up creating translation table roots and therefore translation table walks using unreachable writes (§8.5.5). To prevent this we ensure that context changing operations only happen once they are non-speculative, by enforcing speculative; [MSR] in ob. Forbidding speculative execution of context synchronisation is done through the inclusion of speculative; [CSE] in ob.

**Context synchronising** A context synchronisation event (such as an ISB or ERET instruction) should ensure that program-order earlier context-changing events are seen by program-order later instructions. Microarchitecturally this is achieved by having context-synchronisation events flushing the pipeline, restarting all program-order later instructions. For now this effect seems fixed in the architecture (§TODO: the CSE section needs expanding in prev chapter), and so we get [CSE]; instruction-order in ob subsuming the earlier ISB orderings.

To ensure that context changes are seen after the synchronisation we include [ContextChange]; po; [CSE], and the union of these two relations ensures the context change is ordered before any program-order later events.

**Exceptions** Taking and returning from exceptions are context synchronising (§TODO: the CSE section needs expanding in prev chapter), and so those are captured by the previous clauses. However, translation reads of a lower exception level should not satisfy during execution at a higher exception level. We over-approximate this with po; [ERET]; instruction-order; [T], ensuring all translation reads after an ERET wait.

#### 9.4.6 obfault and obETS

```
(* ordered-before a translation fault *)
let obfault =
    data; [Fault_T & IsFromW]
    | speculative; [Fault_T & IsFromW]
    | [dmbst]; po; [Fault_T & IsFromW]
    | [dmbld]; po; [Fault_T & (IsFromW|IsFromR)]
    | [A|Q]; po; [Fault_T & (IsFromW | IsFromR)]
    | [R|W]; po; [Fault_T & IsFromW & IsReleaseW]

(* ETS-ordered-before *)
let obETS =
    (obfault; [Fault_T]); iio^-1; [T_f]
    | ([TLBI]; po; [dsb]; instruction-order; [T]) & tlb-affects
```

To capture the specific guarantees described by FEAT\_ETS (§8.4.3, §8.6.2), we include 'ghost' Fault events in the candidate executions. These events sit in the execution (in po order) where the explicit memory event would have been had there been no fault, and are tagged with the kind of fault (translation or permission).

**Ordering to a fault** To fully capture the strength of FEAT\_ETS we keep track of syntactic dependencies *into* the instruction which faulted, and apply those dependencies to the Fault event itself. obfault is then the syntactic subset of bob and dob where the right-hand side of each clause is substituted with a Fault\_T (a translation fault).

Using obfault we can then keep track of the (syntactic) subset of ob that would have ordered the explicit event after, and associate those relations with the Fault\_T event instead. obETS's first clause then adds to ob this ordering, but attached to the translation read of the invalid entry itself, as architected by FEAT\_ETS.

Note that dependencies and orderings *from* a faulting instruction do not seem to be respected, and so we do not induce orderings out of a Fault\_T.

**FEAT\_ETS and TLBI** The second clause of obETS captures the second architected behaviour of FEAT\_ETS (§TODO: TLBI ordering needs ETS explained), that faults after a thread-local TLBI do not need context synchronisation to be ordered after the TLBI. Note that one still needs a DSB to complete the TLBI in that case.

#### 9.4.7 obtlbi

```
(* ordered-before TLBI *)
let obtlbi =
    obtlbi_translate
    | [R|W|Fault_T]; iio^-1; (obtlbi_translate & ext); [TLBI]

(* translate ordered-before TLBI *)
let obtlbi_translate =
    [T & Stage1]; tlb_barriered; [TLBI-S1]
    | (([T & Stage2]; tlb_barriered; [TLBI-S2]); wco?; [TLBI-S1])
        & (same-translation; [T & Stage1]; maybe_TLB_cached)
    | ([T & Stage2]; tlb_barriered; [TLBI-S2])
        & (same-translation; [T & Stage1]; trf^-1; wco^-1)
```

Finally, there is the obtlbi relation, which captures the ordering between translations (and their explicit memory events) and the TLB invalidations which affect them. The relation is split in two: the obtlbi\_translate clause enforces order between stale translations and the TLBIs they are invalidated by; the second clause covers broadcast TLBIs.

**Capturing stale TLB entries** When a translation read happens, it is allowed to read from a stale write (§8.5.1). That is, the translation need not be ordered before writes which come after the write it actually reads from. Consequently the tfre relation is not included in ob.

We strengthen this by including some edges from translations to TLBIs when there is an interposing newer write.

The general shape of this ordering is illustrated in Figure 9.3.



Figure 9.3: General obtlbi\_translate shape.

This shape is succinctly captured by the tlb\_barriered auxiliary relation, which relates any translate-read that reads from a write which is wco before another write which is wco before a TLBI which targets the address, ASID or VMID of the translation:

```
let tlb_barriered =
  ([T] ; tfr ; wco ; [TLBI]) & tlb-affects^-1
```
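A small Python rendering of this shape may help. It assumes, by analogy with fr, that tfr can be computed as trf^-1 followed by coherence over writes; the event names and the execution are hypothetical.

```python
def compose(r, s):
    """Relational composition r; s."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}


def inverse(r):
    """r^-1: swap the two ends of every edge."""
    return {(b, a) for (a, b) in r}


# Hypothetical execution: two writes to one translation-table entry, then a TLBI.
writes = {"W_old", "W_new"}
wco = {("W_old", "W_new"), ("W_new", "TLBI"), ("W_old", "TLBI")}
trf = {("W_old", "T_stale"), ("W_new", "T_fresh")}
tlb_affects = {("TLBI", "T_stale"), ("TLBI", "T_fresh")}

# Coherence between writes only (assumption: tfr = trf^-1; co, as for fr).
co = {(a, b) for (a, b) in wco if a in writes and b in writes}
tfr = compose(inverse(trf), co)

# ([T]; tfr; wco; [TLBI]) & tlb-affects^-1
chain = {(t, e) for (t, e) in compose(tfr, wco) if e == "TLBI"}
tlb_barriered = chain & inverse(tlb_affects)
```

Only the translation read of the stale write is tlb\_barriered; the read of the newest write has no interposing newer write and so is unaffected.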

We cannot simply include tlb\_barriered in ob, however. Instead, we must consider the orderings for stage 1 and stage 2 translation reads separately.

**Stale stage 1 reads** For stage 1 translation reads, either in single-stage regimes or as part of a two-stage regime, we can include a variant of tlb\_barriered specialised to stage 1 translation-reads and TLBIs which affect stage 1 entries.

**Stale stage 2 reads** Stage 2 walks are more subtle. The requirement to perform stage 1 invalidation (§8.6.4) means that, in those instances, we do not get tlb\_barriered directly.

Instead, we have to case split on the execution: either, (1) the translation table walk does a stage 1 translation read which reads-from an older write, in which case there may have been a whole cached translation that must be invalidated; or, (2) one of the stage 1 translation reads of the translation table walk reads from a write that is newer than the stage 2 TLBI and so there cannot have been any cached whole translation entries in the TLB and so, logically, we only need the stage 2 invalidation. These cases are illustrated in Figure 9.4, and correspond to the two clauses of obtlbi\_translate which match on stage 2 translation reads.



Figure 9.4: obtlbi stage 2 scenarios.

We capture the general shape of (1), where a translation-read may have been cached in the TLB, with the following maybe\_TLB\_cached relation:

```
let maybe_TLB_cached = ([T]; trf^-1; wco; [TLBI]) & tlb-affects^-1
```

We then use this relation to add ordering from a stage 2 translation-read to the stage 1 TLBI, wco-after a stage 2 TLBI that removed any stale IPA mappings, which would remove any cached whole-translation any stage 1 translation-read might have read from, and after which any fresh translation table walk would be required to not see the stale stage 2 entry the translation-read read from.

**Broadcast TLBIs** Recall that broadcast TLBIs impose restrictions on other threads (§8.6.3). When a broadcast TLBI's invalidation affects a translation on another core, then it must also affect the explicit memory effect associated with it. This shape is illustrated in Figure 9.5, and corresponds to the final clause of obtlbi.



Figure 9.5: obtlbi broadcast TLBI shape.

**Connecting TLB invalidations to translation reads** The final part of the puzzle is how to relate TLBI events with translations which may be affected by the invalidation. Recall that the TLBIs are grouped into subsets of TLBI-S1, TLBI-VA, and so on. We define a tlb\_might\_affect that is the cross-product of these with the same-\* relations:

```
let tlb_might_affect =
    [TLBI-S1 & ~TLBI-S2 & TLBI-VA & TLBI-ASID & TLBI-VMID]; (same-va-page & same-asid & same-vmid); [T & Stage1]
    | [TLBI-S1 & ~TLBI-S2 & ~TLBI-VA & TLBI-ASID & TLBI-VMID]; (same-asid & same-vmid); [T & Stage1]
    | [TLBI-S1 & ~TLBI-S2 & ~TLBI-VA & ~TLBI-ASID & TLBI-VMID]; same-vmid; [T & Stage1]
    | [~TLBI-S1 & TLBI-S2 & TLBI-IPA & ~TLBI-ASID & TLBI-VMID]; (same-ipa-page & same-vmid); [T & Stage2]
    | [TLBI-S1 & TLBI-S2 & ~TLBI-IPA & ~TLBI-ASID & TLBI-VMID]; same-vmid; [T & Stage2]
    | [TLBI-S1 & TLBI-S2 & ~TLBI-IPA & ~TLBI-ASID & TLBI-VMID]; same-vmid; [T]
    | (TLBI-S1 & ~TLBI-IPA & ~TLBI-ASID & ~TLBI-VMID) * (T & Stage1)
    | (TLBI-S2 & ~TLBI-IPA & ~TLBI-ASID & ~TLBI-VMID) * (T & Stage2)
```

Finally, we get tlb-affects by attaching tlb\_might\_affect to events in the same thread, and if a TLBI-IS, to ones in other threads too:

```
let tlb-affects =
  [TLBI-IS]; tlb_might_affect
| ([~TLBI-IS]; tlb_might_affect) & int
```
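The effect of the two clauses can be illustrated with a small Python sketch. The events, their thread assignment and the contents of `tlb_might_affect` are hypothetical; `int` is modelled as "both events are on the same thread".

```python
# Hypothetical events and the thread each one belongs to.
thread = {"TLBI_local": 0, "TLBI_is": 0, "T_same": 0, "T_other": 1}
tlbi_is = {"TLBI_is"}  # the broadcast (inner-shareable) TLBIs

# Assume every TLBI here matches every translation by address, ASID and VMID.
tlb_might_affect = {
    (i, t) for i in ("TLBI_local", "TLBI_is") for t in ("T_same", "T_other")
}

# [TLBI-IS]; tlb_might_affect  |  ([~TLBI-IS]; tlb_might_affect) & int
tlb_affects = {
    (i, t)
    for (i, t) in tlb_might_affect
    if i in tlbi_is or thread[i] == thread[t]
}
```

The non-broadcast TLBI only reaches the translation on its own thread, while the broadcast one reaches both.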

# Validating the VMSA model

## 10.1 Extending isla-axiomatic

## 10.2 Running on hardware: system-litmus-harness

#### 10.2.1 Harness overview

#### 10.2.2 Results from hardware

# An operational VMSA model

## 11.1 Introduction

In the previous chapters I gave a formalisation of the Arm virtual memory systems architecture (VMSA) as an extension to Arm's own official memory consistency model. Here I describe an informal sketch of a microarchitectural-style structural operational semantics, as an extension to Flat; let us call this extension PFlat (for Flat over *Physical* memory).

## 11.2 Structure of the state

PFlat's state is much like the state of Flat, except with two additions:

- ▷ A per-thread MMU.
- ▷ A per-MMU TLB.

#### 11.2.1 The MMU

In Flat, each thread executes the intra-instruction semantics as defined by each instruction's Sail code sequentially. We alter this slightly to remove the sequential execution of the address translation function, and instead add a dedicated MMU. When the thread wishes to perform a translation of some given virtual address, it sends a *translation request* to the MMU which may return a *translation result* (or simply a *translation*) in response.

The MMU is a set of in-progress calls to the translation function plus a TLB. At any point in time, the MMU can spontaneously begin

#### 11.2.2 Its TLB

## 11.3 Virtual memory axioms

Describe each new 'ob' edge in detail, and the new axioms.



Figure 11.1: Operational VM machine.

## 11.4 Break-before-make violation detection

## 11.5 A weaker VMSA model

## 11.6 Executing the models with Isla

Just briefly describe (post-CAV) extensions to Isla and optimisations for virtual memory.

# Limitations

## 12.1 Exceptions and Interrupts

We care about this part of the architecture, we give an under/overapproximate model earlier but it's not correct. Describe Ohad work and a brief overview of the issue and some tests perhaps.

## 12.2 Cross-interaction of instruction fetch and virtual memory

Ifetches need translation, why can't we just copy/paste our vmsa model for ifetch talk about non-pipt icaches and aliasing, talk about "po".

## 12.3 Other architectures

Here we discussed just Armv8, but we could do the same for x86, Power or RISC-V. Describe possible extensions and differences.

# **Conclusion**

Recap of contributions, limitations and remaining open questions.

# Glossary

# **Bibliography**

- [1] Arm Limited. Arm architecture reference manual. Armv8, for Armv8-A architecture profile. https://developer.arm.com/documentation/ddi0487/ha/?lang=en, February 2022. H.a Armv9 EAC. ARM DDI 0487H.a (ID020222). 11530pp.
- [2] Intel Corporation. Intel 64 and IA-32 architectures software developer's manual combined volumes: 1, 2a, 2b, 2c, 2d, 3a, 3b, 3c, 3d and 4. https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-sdm-combined-volumes-1-2a-2b-2c-2d-3a-3b-3c-3d-and-4, accessed 2019-06-30, May 2019. 325462-070US.
- [3] ARM Limited. ARM architecture reference manual. ARMv8, for ARMv8-A architecture profile, October 2018. v8.4. ARM DDI 0487D.a (ID103018).
- [4] Alastair Reid. Trustworthy specifications of ARM v8-A and v8-M system level architecture. In *FMCAD 2016*, pages 161–168, October 2016.
- [5] Alastair Reid. ARM releases machine readable architecture specification. https://alastairreid.github.io/ARM-v8a-xml-release/, April 2017.
- [6] Alastair Reid, Rick Chen, Anastasios Deligiannis, David Gilday, David Hoyes, Will Keen, Ashan Pathirane, Owen Shepherd, Peter Vrabel, and Ali Zaidi. End-to-end verification of processors with ISA-Formal. In Swarat Chaudhuri and Azadeh Farzan, editors, *Computer Aided Verification 28th International Conference, CAV 2016, Toronto, ON, Canada, July 17-23, 2016, Proceedings, Part II*, volume 9780 of *Lecture Notes in Computer Science*, pages 42–58. Springer, 2016.
- [7] Alasdair Armstrong, Thomas Bauereiss, Brian Campbell, Alastair Reid, Kathryn E. Gray, Robert M. Norton, Prashanth Mundkur, Mark Wassell, Jon French, Christopher Pulte, Shaked Flur, Ian Stark, Neel Krishnaswami, and Peter Sewell. ISA semantics for ARMv8-A, RISC-V, and CHERI-MIPS. In *Proc. 46th ACM SIGPLAN Symposium on Principles of Programming Languages*, January 2019. Proc. ACM Program. Lang. 3, POPL, Article 71.
- [8] Alasdair Armstrong, Thomas Bauereiss, Brian Campbell, Shaked Flur, Jon French, Kathryn E. Gray, Gabriel Kerneis, Neel Krishnaswami, Prashanth Mundkur, Robert Norton-Wright, Christopher Pulte, Alastair Reid, Peter Sewell, Ian Stark, and Mark Wassell. Sail. https://www.cl.cam.ac.uk/~pes20/sail/, 2019.
- [9] Christopher Pulte. *The Semantics of Multicopy Atomic ARMv8 and RISC-V*. PhD thesis, University of Cambridge, 2019. https://doi.org/10.17863/CAM.39379.
- [10] Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and Peter Sewell. Simplifying ARM Concurrency: Multicopy-atomic Axiomatic and Operational Models for ARMv8. In *Proceedings of the 45th ACM SIGPLAN Symposium on Principles of Programming Languages*, January 2018.
- [11] Shaked Flur, Susmit Sarkar, Christopher Pulte, Kyndylan Nienhuis, Luc Maranget, Kathryn E. Gray, Ali Sezgin, Mark Batty, and Peter Sewell. Mixed-size concurrency: ARM, POWER, C/C++11, and SC. In *The 44th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Paris, France*, pages 429–442, January 2017.
- [12] Shaked Flur, Kathryn E. Gray, Christopher Pulte, Susmit Sarkar, Ali Sezgin, Luc Maranget, Will Deacon, and Peter Sewell. Modelling the ARMv8 architecture, operationally: Concurrency and ISA. In *Proceedings of POPL: the 43rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages*, 2016.
- [13] Kathryn E. Gray, Gabriel Kerneis, Dominic Mulligan, Christopher Pulte, Susmit Sarkar, and Peter Sewell.

  An integrated concurrency and core-ISA architectural envelope definition, and test oracle, for IBM POWER

- multiprocessors. In *Proc. MICRO-48*, the 48th Annual IEEE/ACM International Symposium on Microarchitecture,
  December 2015.
- <sup>2995</sup> [14] Jade Alglave, Luc Maranget, and Michael Tautschnig. Herding Cats: Modelling, Simulation, Testing, and Data Mining for Weak Memory. *ACM TOPLAS*, 36(2):7:1–7:74, July 2014.
- <sup>2997</sup> [15] Luc Maranget, Susmit Sarkar, and Peter Sewell. A tutorial introduction to the ARM and POWER relaxed memory models. Draft available from http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7. pdf, 2012.
- Susmit Sarkar, Kayvan Memarian, Scott Owens, Mark Batty, Peter Sewell, Luc Maranget, Jade Alglave, and
   Derek Williams. Synchronising C/C++ and POWER. In Proceedings of PLDI 2012, the 33rd ACM SIGPLAN
   conference on Programming Language Design and Implementation (Beijing), pages 311–322, 2012.
- [17] Jade Alglave, Luc Maranget, Susmit Sarkar, and Peter Sewell. Litmus: running tests against hardware. In
   Proceedings of TACAS 2011: the 17th international conference on Tools and Algorithms for the Construction and
   Analysis of Systems, pages 41–44, Berlin, Heidelberg, 2011. Springer-Verlag.
- [18] Susmit Sarkar, Peter Sewell, Jade Alglave, Luc Maranget, and Derek Williams. Understanding POWER
   multiprocessors. In Proceedings of PLDI 2011: the 32nd ACM SIGPLAN conference on Programming Language
   Design and Implementation, pages 175–186, 2011.
- [19] Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O. Myreen. x86-TSO: A
   rigorous and usable programmer's model for x86 multiprocessors. *Communications of the ACM*, 53(7):89–97,
   July 2010. (Research Highlights).
- [20] Scott Owens, Susmit Sarkar, and Peter Sewell. A better x86 memory model: x86-TSO. In *Proceedings of TPHOLs 2009: Theorem Proving in Higher Order Logics, LNCS 5674*, pages 391–407, 2009.
- [21] Jade Alglave, Luc Maranget, Susmit Sarkar, and Peter Sewell. Fences in weak memory models. In *Proc. CAV*, 2010.
- [22] Jade Alglave, Anthony Fox, Samin Ishtiaq, Magnus O. Myreen, Susmit Sarkar, Peter Sewell, and Francesco
   Zappa Nardelli. The semantics of Power and ARM multiprocessor machine code. In *Proc. DAMP 2009*, January
   2009.
- [23] Susmit Sarkar, Peter Sewell, Francesco Zappa Nardelli, Scott Owens, Tom Ridge, Thomas Braibant, Magnus
   Myreen, and Jade Alglave. The semantics of x86-CC multiprocessor machine code. In *Proceedings of POPL* 2009: the 36th annual ACM SIGPLAN-SIGACT symposium on Principles of Programming Languages, pages
   379–391, January 2009.
- [24] Nathan Chong and Samin Ishtiaq. Reasoning about the ARM weakly consistent memory model. In *MSPC*, 2008.
- [25] Allon Adir, Hagit Attiya, and Gil Shurek. Information-flow models for shared memory with an application to the PowerPC architecture. *IEEE Trans. Parallel Distrib. Syst.*, 14(5):502–515, 2003.
- [26] Will Deacon. The ARMv8 application level memory model. https://github.com/herd/herdtools7/blob/master/herd/libdir/aarch64.cat (accessed 2019-07-01), 2016.
- [27] Andrew Waterman and Krste Asanović, editors. The RISC-V Instruction Set Manual Volume I: Unprivileged ISA. December 2018. Document Version 20181221-Public-Review-draft. Contributors: Arvind, Krste Asanović, Rimas Avižienis, Jacob Bachmeyer, Christopher F. Batten, Allen J. Baum, Alex Bradbury, Scott Beamer, Preston Briggs, Christopher Celio, Chuanhua Chang, David Chisnall, Paul Clayton, Palmer Dabbelt, Roger Espasa, Shaked Flur, Stefan Freudenberger, Jan Gray, Michael Hamburg, John Hauser, David Horner, Bruce Hoult, Alexandre Joannou, Olof Johansson, Ben Keller, Yunsup Lee, Paul Loewenstein, Daniel Lustig, Yatin Manerkar, Luc Maranget, Margaret Martonosi, Joseph Myers, Vijayanand Nagarajan, Rishiyur Nikhil, Jonas Oberhauser, Stefan O'Rear, Albert Ou, John Ousterhout, David Patterson, Christopher Pulte, Jose Renau, Colin Schmidt, Peter Sewell, Susmit Sarkar, Michael Taylor, Wesley Terpstra, Matt Thomas, Tommy Thorn, Caroline Trippel, Ray VanDeWalker, Muralidaran Vijayaraghavan, Megan Wachs, Andrew Waterman, Robert Watson, Derek Williams, Andrew Wright, Reinoud Zandijk, and Sizhuo Zhang.
- [28] Shaked Flur, Jon French, Kathryn Gray, Christopher Pulte, Susmit Sarkar, and Peter Sewell. rmem. www.cl.cam.ac.uk/~pes20/rmem/, 2017.

- [29] Jade Alglave and Luc Maranget. The herd7 tool. http://diy.inria.fr/doc/herd.html/, 2019. Accessed 2019-07-08.
- [30] Gerwin Klein, June Andronick, Kevin Elphinstone, Toby Murray, Thomas Sewell, Rafal Kolanski, and Gernot Heiser. Comprehensive formal verification of an OS microkernel. *ACM Transactions on Computer Systems*, 32(1):2:1–2:70, February 2014.
- [31] Ronghui Gu, Zhong Shao, Jieung Kim, Xiongnan (Newman) Wu, Jérémie Koenig, Vilhelm Sjöberg, Hao
   Chen, David Costanzo, and Tahina Ramananandro. Certified concurrent abstraction layers. In Proceedings
   of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018,
   Philadelphia, PA, USA, June 18-22, 2018, pages 646-661, 2018.
- [32] Ronghui Gu, Zhong Shao, Hao Chen, Xiongnan (Newman) Wu, Jieung Kim, Vilhelm Sjöberg, and David
   Costanzo. CertiKOS: An extensible architecture for building certified concurrent OS kernels. In 12th USENIX
   Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4,
   2016, pages 653–669, 2016.
- [33] Andrew Ferraiuolo, Andrew Baumann, Chris Hawblitzel, and Bryan Parno. Komodo: Using verification to disentangle secure-enclave hardware from software. In *Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China, October 28-31, 2017*, pages 287–305, 2017.
- [34] Roberto Guanciale, Hamed Nemati, Mads Dam, and Christoph Baumann. Provably secure memory isolation for linux on ARM. *J. Comput. Secur.*, 24(6):793–837, 2016.
- [35] Christoph Baumann, Oliver Schwarz, and Mads Dam. Compositional verification of security properties for
   embedded execution platforms. In PROOFS@CHES 2017, 6th International Workshop on Security Proofs for
   Embedded Systems, Taipei, Taiwan, Friday September 29th, 2017, pages 1–16, 2017.
- [36] Yong Kiam Tan, Magnus O. Myreen, Ramana Kumar, Anthony C. J. Fox, Scott Owens, and Michael Norrish.
   The verified CakeML compiler backend. J. Funct. Program., 29:e2, 2019.
- [37] Ramana Kumar, Magnus O. Myreen, Michael Norrish, and Scott Owens. CakeML: a verified implementation of ML. In *The 41st Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '14, San Diego, CA, USA, January 20-21, 2014*, pages 179–192, 2014.
- [38] Xavier Leroy. A formally verified compiler back-end. J. Autom. Reasoning, 43(4):363-446, 2009.
- [39] Jade Alglave, Luc Maranget, Kate Deplaix, Keryan Didier, and Susmit Sarkar. The litmus7 tool. http://diy.inria.fr/doc/litmus.html/, 2019. Accessed 2019-07-08.
- [40] Jade Alglave and Luc Maranget. The diy7 tool. http://diy.inria.fr/, 2019. Accessed 2021-07-01.
- [41] Dominic P. Mulligan, Scott Owens, Kathryn E. Gray, Tom Ridge, and Peter Sewell. Lem: reusable engineering of real-world semantics. In *Proceedings of ICFP 2014: the 19th ACM SIGPLAN International Conference on Functional Programming*, pages 175–188, 2014.
- [42] Azalea Raad, John Wickerson, and Viktor Vafeiadis. Weak persistency semantics from the ground up: Formalising the persistency semantics of ARMv8 and transactional models. *Proc. ACM Program. Lang.*, 3(OOPSLA):135:1–135:27, October 2019.
- [43] Azalea Raad, John Wickerson, Gil Neiger, and Viktor Vafeiadis. Persistency semantics of the Intel-x86 architecture. *PACMPL*, 4(POPL):11:1–11:31, 2020.
- [44] Magnus O. Myreen. Verified just-in-time compiler on x86. In *Proceedings of the 37th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages*, POPL '10, pages 107–118, New York, NY, USA, 2010. ACM.
- [45] Shilpi Goel, Warren A. Hunt, Matt Kaufmann, and Soumava Ghosh. Simulation and formal verification of
   x86 machine-code programs that make system calls. In *Proceedings of the 14th Conference on Formal Methods* in Computer-Aided Design, FMCAD '14, pages 18:91–18:98, Austin, TX, 2014. FMCAD Inc.
- [46] Shilpi Goel. The x86isa books: Features, usage, and future plans. In *Proceedings 14th International Workshop on the ACL2 Theorem Prover and its Applications, Austin, Texas, USA, May 22-23, 2017.*, pages 1–17, 2017. arXiv version: https://arxiv.org/abs/1705.01225.
- [47] Rishiyur S. Nikhil and Niraj Nayan Sharma. Forvis: A formal RISC-V ISA specification. https://github.com/rsnikhil/Forvis\_RISCV-ISA-Spec, 2019. Accessed 2019-07-01.

- [48] Ian J Clester, Thomas Bourgeat, Andy Wright, Samuel Gruetter, and Adam Chlipala. riscv-plv risc-v isa formal specification. https://github.com/mit-plv/riscv-semantics, 2019. Accessed 2019-07-01.
- [49] Hira Syeda and Gerwin Klein. Reasoning about translation lookaside buffers. In LPAR-21, 21st International
   Conference on Logic for Programming, Artificial Intelligence and Reasoning, Maun, Botswana, May 7-12, 2017,
   pages 490-508, 2017.
- [50] Hira Taqdees Syeda and Gerwin Klein. Program verification in the presence of cached address translation.
   In Interactive Theorem Proving 9th International Conference, ITP 2018, Held as Part of the Federated Logic
   Conference, FloC 2018, Oxford, UK, July 9-12, 2018, Proceedings, pages 542-559, 2018.
- [51] Bogdan F. Romanescu, Alvin R. Lebeck, and Daniel J. Sorin. Specifying and dynamically verifying address
   translation-aware memory consistency. In *Proceedings of the Fifteenth Edition of ASPLOS on Architectural* Support for Programming Languages and Operating Systems, ASPLOS XV, pages 323–334, New York, NY, USA,
   2010. ACM.
- [52] Bogdan Romanescu, Alvin Lebeck, and Daniel J. Sorin. Address translation aware memory consistency. *IEEE Micro*, 31(1):109–118, January 2011.
- [53] Daniel Lustig, Geet Sethi, Margaret Martonosi, and Abhishek Bhattacharjee. COATCheck: Verifying memory ordering at the hardware-OS interface. SIGOPS Oper. Syst. Rev., 50(2):233–247, March 2016.
- [54] Ben Simner, Alasdair Armstrong, Jean Pichon-Pharabod, Christopher Pulte, Richard Grisenthwaite, and Peter
   Sewell. Relaxed virtual memory in Armv8-A. In *Proceedings of the 31st European Symposium on Programming*,
   ESOP 2022, April 2022.
- [55] Arm Limited. Exploration Tools Arm Developer. https://developer.arm.com/downloads/-/exploration-tools, 2022. Accessed May 2022.
- [56] UK. Copyright, designs and patents act. c. 48, 1988.
- [57] Arm Limited. Arm Cortex-A53 MPCore Processor Technical Reference Manual, 2022. ARM DDI 0500J.
- [58] Ben Simner, Shaked Flur, Christopher Pulte, Alasdair Armstrong, Jean Pichon-Pharabod, Luc Maranget, and
   Peter Sewell. ARMv8-A system semantics: instruction fetch in relaxed architectures. In *Proceedings of the* 29th European Symposium on Programming, ESOP 2020, 2020.