## Static Cache Analysis and Design for Predictable Multi-Core

Claire Maïza (Burguière)

Department of Computer Science Saarland University

#### LIP6 2010



## **Timing Analysis**





## **Timing Analysis**





Restriction: uninterrupted execution of the task on a single-core



- Preemptive System:
  - Example of cache analysis: Resilience analysis
- Multi-Core:
  - Predictability on a single-core level
  - Design principles for predictable multi-core

## **Context - Preemptive Systems**



Cache related preemption delay (CRPD):

- Impact of preemption on the cache content
- Overall cost of additional reloads due to preemption
- ⇒ Analysis aims at bounding the number of additional cache misses due to preemption



## CRPD for Set-Associative Caches - LRU



#### CRPD computation:

- preempted task: Useful Cache Blocks (UCB)
- preempting task: Evicting Cache Blocks (ECB)
- CRPD from UCB and ECB:
  - Previous combination overestimates
  - $\Rightarrow$  Resilience analysis

## Useful Cache Block - [Lee et al., 1996]



#### Definition (Useful Cache Block)

A memory block m at program point P is called a useful cache block, if

- a) m may be cached at P
- b) m may be reused at program point P' that may be reached from P with no eviction of m on this path.



## Useful Cache Block - [Lee et al., 1996]



#### Definition (Useful Cache Block)

A memory block m at program point P is called a useful cache block, if

- a) m may be cached at P
- b) *m* may be reused at program point P' that may be reached from P with no eviction of m on this path.



## Useful Cache Block - [Lee et al., 1996]



#### Definition (Useful Cache Block)

A memory block m at program point P is called a useful cache block, if

- a) m may be cached at P
- b) m may be reused at program point P' that may be reached from P with no eviction of m on this path.



## Evicting Cache Blocks [Tomiyama & Dutt, 2000]



#### Definition (Evicting Cache Blocks (ECB))

A memory block of the preempting task is called an evicting cache block, if it may be accessed during the execution of the preempting task.



e additional miss due to preemption (CRPD)

## Evicting Cache Blocks [Tomiyama & Dutt, 2000]



#### Definition (Evicting Cache Blocks (ECB))

A memory block of the preempting task is called an evicting cache block, if it may be accessed during the execution of the preempting task.



## Impact of the Preempting Task on the Preempted Task



#### CRPD (using UCB and ECB)

$$CRPD_{UCB\&ECB} = \sum_{s=1}^{c} min(CRPD_{UCB}^{s}, CRPD_{ECB}^{s})$$

## Impact of the Preempting Task on the Preempted Task (Example)





$$[c, b, a, x] \xrightarrow{a} [a, c, b, x] \xrightarrow{b} [b, a, c, x] \xrightarrow{c} [c, b, a, x]$$
 no miss

## Impact of the Preempting Task on the Preempted Task (Example)



$$\mathsf{ECBs} \left( \begin{matrix} [c,b,a,x] \xrightarrow{a} [a,c,b,x] \xrightarrow{b} [b,a,c,x] \xrightarrow{c} [c,b,a,x] \end{matrix} \right)$$
 no miss

$$= \{\mathbf{e}\} \left( \underbrace{[\mathbf{e}, c, b, a]}_{a} \xrightarrow{a} [a, \mathbf{e}, c, b] \xrightarrow{b} [b, a, \mathbf{e}, c] \xrightarrow{c} [c, b, a, \mathbf{e}] \right)$$
 no miss

• 
$$\mathsf{CRPD}_{\mathsf{UCB}} \Rightarrow |\mathsf{ucb}| = 3$$

• CRPD<sub>ECB</sub>  $\Rightarrow$  n = 4

0x0**a** 0x0**b** 

0x0c

- $\blacksquare \ \mathsf{CRPD}_{\mathsf{UCB&ECB}} = \mathit{min}(\mathsf{CRPD}_{\mathsf{UCB}},\mathsf{CRPD}_{\mathsf{ECB}}) \Rightarrow 3$ 
  - Overestimation: number of additional misses= 0 < 3</p>

## Impact of the Preempting Task on the Preempted Task (Example)



$$\mathsf{ECBs} \xrightarrow{[c, b, a, x]}{\overset{a}{\longrightarrow}} [a, c, b, x] \xrightarrow{b} [b, a, c, x] \xrightarrow{c} [c, b, a, x] \qquad \text{no miss}$$

$$= \{\mathbf{e}\} \left( \underbrace{[\mathbf{e}, c, b, a]}_{a} \xrightarrow{a} [a, \mathbf{e}, c, b] \xrightarrow{b} [b, a, \mathbf{e}, c] \xrightarrow{c} [c, b, a, \mathbf{e}] \right)$$
 no miss

• 
$$\mathsf{CRPD}_{\mathsf{UCB}} \Rightarrow |\mathsf{ucb}| = 3$$

- $\mathsf{CRPD}_{\mathsf{ECB}} \Rightarrow n = 4$
- $\blacksquare CRPD_{\tt UCB\&ECB} = \textit{min}(CRPD_{\tt UCB}, CRPD_{\tt ECB}) \Rightarrow 3$ 
  - Overestimation: number of additional misses= 0 < 3</p>
- Why?

0x0**a** 0x0**b** 

0x0c

- |ECB| to evict a UCB = 2
- ▶ but, |ЕСВ| = 1
- One single ECB is not sufficient to evict a UCB

## Refinement



Determining max ECB s.t. no additional cache miss occur

$$m \in UCB \left[ \begin{array}{c} & -m & [m, -, -, -, -, -, -] \\ & -a_1 & [m, -, -, -, -, -] \\ & -a_2 & \\ & -a_3 & [a_3, a_2, a_1, m, -, -, -, -] \\ & -m & [m, a_3, a_2, a_1, -, -, -, -] \end{array} \right]$$

## Refinement



Determining max ECB s.t. no additional cache miss occur

## **Resilience Analysis**



#### **Definition (I-Resilience)**

A memory block m is called I-resilient at program point P, if all possible next accesses to m that would be hits, would still be hits in case of a preemption at P with I accesses.

## **Resilience Analysis**



#### **Definition (I-Resilience)**

A memory block m is called I-resilient at program point P, if all possible next accesses to m that would be hits, would still be hits in case of a preemption at P with I accesses.



## **CRPD** using Resilience



#### CRPD (combining UCB and ECB by using resilience)













- a, b and c are 1-resilient
- $CRPD_{UCB\&ECB}^{res} = BRT \times |UCB \setminus \{m \mid m \text{ is } |_{ECB}|\text{-resilient}\}| = 0$





- a, b and c are 1-resilient
- $CRPD_{UCB\&ECB}^{res} = BRT \times |UCB \setminus \{m \mid m \text{ is } |_{ECB}|\text{-resilient}\}| = 0$

Instead of:  $CRPD_{UCB&ECB} = min(CRPD_{UCB}, CRPD_{ECB}) = 3 \times BRT$ 





- a, b and c are 1-resilient
- $CRPD_{UCB\&ECB}^{res} = BRT \times |UCB \setminus \{m \mid m \text{ is } |_{ECB}|\text{-resilient}\}| = 0$
- Instead of:  $CRPD_{UCB\&ECB} = min(CRPD_{UCB}, CRPD_{ECB}) = 3 \times BRT$
- Resilience analysis leads to a more precise CRPD

## Limitations of CRPD Estimation



Cache Policy: LRU

- ► FIFO and PLRU cannot be "directly" analysed using UCB/ECB.
- Architecture: Need a constant bound on the block reload time
  - The number of additional misses is used to bound the interferences.

## Predictable Multi-Core Architecture



- Predictability: hard to quantify
- Predictable cores are a prerequisite for predictable multi-cores
- General problem: Sharing of resources
  - Main memory, caches
  - Busses
  - I/O
  - Flash memory
- Sharing may be
  - fundamental (necessary access to application global variables)
  - incidental (processors happen to use the same bus for access to non-shared devices)

## Predictability on the Single-Core Level: Caches



Several new notions regarding cache replacement policies:

Predictability:

*quantitative* measure of how fast information about cache state can be gained

Competitiveness:

*quantitative* measure of how numbers of hits and misses of different policies relate

Sensitivity:

*quantitative* measure of how the number of hits and misses are influenced by intial cache state

Gives a sound and precise quantitative definition of predictability of caches

## Predictability on the Single-Core Level: Caches



Predictability of Caches depends on their replacement policy:

- Least-Recently-Used (LRU) highly predictable, precise and efficient abstractions exist
- First-In First-Out (FIFO) inherently less predictable
- Pseudo Round-Robin as in Motorola ColdFire, extremely unpredictable

## Predictability on the Single-Core Level: Pipelines



Classification of architectures:

 Timing compositional (e.g., ARM7): No timing anomalies present, local worst-case behaviour safely approximates global worst-case behaviour

 Compositional with bounded effects (e.g., TriCore (probably)): Timing anomalies but no domino effects, local worst-case behaviour safely approximates global worst-case behaviour up to a constant, additive factor

 Non-compositional architectures (e.g., PPC 755): Timing anomalies, domino effects, all global paths have to be considered

from Wilhelm et al.: Memory Hierarchies, Pipelines, and Buses for Future Architectures in Time-critical Embedded Systems, IEEE TCAD, July 2009

## Predictability of Multi-Core Architectures:



## What One Should Avoid in a Multi-Core Architecture

- a complex architecture, not fully-compositional:
  - high complexity of the analysis
  - bound on additional delays are not constant (access to a shared resource, preemption)
- a cache with a non-LRU policy
  - less precise WCET bounds
  - more complicated computation of less precise CRPD bounds
- a unified cache
  - interferences impair precision
  - more complex analysis
- shared bus protocol with unbounded access delay
  - unbounded execution time
  - no guarantees on timing constraints
- frequently used shared resources
  - may lead to an unschedulable system
  - TDMA-like protocol: limited by idle-time

## PROMPT



Minimise sharing in multi-processor architectures:

- Interferences might be huge (bus contention, cache pollution)
- Huge overestimation when analysis is possible
  - Set of tasks that might be executed in parallel
  - Cache contents
- PROMPT (PRedictability Of Multi-Processor Timing)
  - Start with a generic, parameterisable architecture with predictable (fully timing compositional) cores
  - Instantiate architecture for given set of applications, based on their resource requirements

## **Design Principles**



- Simplifications of components of the architecture
- Elimination of interferences on shared resources:
  - Wherever it is not absolutely needed
  - Private resources for private uses
  - Data sharing for global state
- Accesses to the shared global state
  - Determination of delays, or
  - Cumulative analyses of WCET, bus arbiter and scheduling

## Conclusions



#### Preemptive systems:

- Precise analysis of cache interferences
- Requires LRU cache and constant-bounded cache access delay
- Predictability: not a boolean property
  - Introduction of shared resources decreases predictability but may benefit efficiency
- PROMPT:
  - Predictable components of the architecture (core)
  - Reduction of interferences (when possible)
  - Determination of access delay

## Further reading - resilience





#### Altmeyer, S. & Burguière, C. (2009).

In Proceedings of the 21st Euromicro Conference on Real-Time Systems (ECRTS '09) pp. 109–118, IEEE Computer Society.



Altmeyer, S., Maiza, C. & Reineke, J. In LCTES'10.



Lee, C.-G., Hahn, J., Min, S. L., Ha, R., Hong, S., Park, C. Y., Lee, M. & Kim, C. S. (1996). In RTSS'96 p. 264, IEEE Computer Society.



Negi, H. S., Mitra, T. & Roychoudhury, A. (2003). In CODES+ISSS'03 ACM.



Reineke, J. (2008).

Caches in WCET Analysis. PhD thesis, Universität des Saarlandes, Saarbrücken.



Tan, Y. & Mooney, V. (2004). In SCOPES'04 pp. 182–199,.



Tomiyama, H. & Dutt, N. D. (2000). In CODES'00 ACM.

## Steps of the Design Process



- 1 Hierarchical privatization
  - decomposition of the set of applications according to the sharing relation on the global state
  - allocation of private resources for non-shared code and state
  - sound (and precise) determination of delays for accesses to the shared global state
- 2 Sharing of lonely resources
  - Costly lonely resources will be shared
  - Access rate is low compared to CPU and memory bandwidth
  - Accesses happen infrequently  $\Rightarrow$  access delay contributes little
- 3 Controlled socialization
  - introduction of additional sharing to reduce costs
  - controlling loss of predictability

## *I*-resilience analysis





## CPRD using ECB: Pitfall





$$\begin{array}{l} {\sf ECBs} \\ = \{ {\bf e} \} \end{array} \begin{pmatrix} [b,a,9,8] \xrightarrow{8} [8,b,a,9] \xrightarrow{9} [9,8,b,a] \xrightarrow{a} [a,9,8,b] \xrightarrow{b} [b,a,9,8] & 0 \text{ misses} \\ \\ [{\bf e},b,a,9] \xrightarrow{8^*} [8,{\bf e},b,a] \xrightarrow{9^*} [9,8,{\bf e},b] \xrightarrow{a^*} [a,9,8,{\bf e}] \xrightarrow{b^*} [b,a,9,8] & 4 \text{ misses} \\ \end{array}$$

- |UCB(s)| = 4
- |ECB(*s*)| = 1
- *n* = 4
- number of additional misses= 4

Upper-bound on the CRPD - direct-mapped cachesity

■ using UCB [Lee et al., 1996]:

 $\mathsf{CRPD}_{\mathsf{UCB}} = \mathsf{BRT} \cdot |\{s_i \mid \exists m \in \mathsf{UCB} : m \bmod c = s_i\}|$ 

using ECB [Tomiyama & Dutt, 2000]:

 $\mathsf{CRPD}_{\mathsf{ECB}} = \mathsf{BRT} \cdot |\{s_i \mid \exists m \in \mathsf{ECB} : m \bmod c = s_i\}|$ 

using UCB and ECB [Negi et al., 2003, Tan & Mooney, 2004]:

$$\begin{array}{ll} \mathsf{CRPD}_{\mathsf{UCB&ECB}} &= \mathsf{BRT} \cdot |\{s_i \mid \exists m \in \mathsf{UCB} : m \ \mathsf{mod} \ c = s_i \\ & \land \exists m' \in \mathsf{ECB} : m' \ \mathsf{mod} \ c = s_i \}| \end{array}$$

## **CRPD** for FIFO: Pitfalls



# $\begin{array}{l} \mathsf{ECBs} \\ = \{\mathbf{X}\} \end{array} \begin{pmatrix} [b,a] \xrightarrow{a} [b,a] \xrightarrow{e^*} [e,b] \xrightarrow{b} [e,b] \xrightarrow{c^*} [c,e] \xrightarrow{e} [c,e] & 2 \text{ misses} \\ \\ [\mathbf{x},b] \xrightarrow{a^*} [a,\mathbf{x}] \xrightarrow{e^*} [e,a] \xrightarrow{b^*} [b,e] \xrightarrow{c^*} [c,b] \xrightarrow{e^*} [e,c] & 5 \text{ misses} \end{array}$

## **CRPD** for FIFO: Pitfalls



## $\begin{array}{l} \mathsf{ECBs} \\ = \{\mathbf{x}\} \end{array} \begin{pmatrix} [b, a] \xrightarrow{a} [b, a] \xrightarrow{e^*} [e, b] \xrightarrow{b} [e, b] \xrightarrow{c^*} [c, e] \xrightarrow{e} [c, e] & 2 \text{ misses} \\ \\ [\mathbf{x}, b] \xrightarrow{a^*} [a, \mathbf{x}] \xrightarrow{e^*} [e, a] \xrightarrow{b^*} [b, e] \xrightarrow{c^*} [c, b] \xrightarrow{e^*} [e, c] & 5 \text{ misses} \end{array}$

- |UCB(*s*)| = 2
- |ECB(*s*)| = 1
- n = 2
- But: number of additional misses= 3

## **CRPD** for PLRU: Pitfalls





But: number of additional misses= 5