Introduction to Multicore
Why Multicore? Is it worth the additional work?
One way to increase the throughput of a system is to raise the clock frequency. However, this is only possible up to a point and has negative side effects: a higher frequency improves performance, but it also increases power consumption, which in turn generates more heat, requires more advanced cooling, and shortens the longevity of the device.
Instead, modern processors and micro-controllers add additional processing units (cores). In the early 2010s, the first multicore micro-controllers entered the market, and the ETAS Basic Software (BSW) stack was the first AUTOSAR stack that supported multicore and was used in multicore projects in production.
To make use of this additional processing power, it is necessary to distribute application and base software across cores.
Types of Multicore System
- Homogeneous - multiple identical cores
- Heterogeneous - multiple cores with different instruction sets
The market today consists of a mix of these two types: most devices have a number of identical "main" cores, plus some specialized cores to accelerate certain functions, for example a built-in HSM (hardware security module) that offers performant AES encryption.
Multicore Memory Topology
Type | Description |
---|---|
Distributed Memory | Each core typically has its own private memory; communication between cores is performed over a high-speed network |
Shared Memory | The ECU contains a public memory that is shared by multiple cores |
Hybrid Design | There is a shared memory resource, but each core has its own private memory as well |
The vast majority of current micro-controllers have a hybrid design. The different memory areas have vastly different access times; therefore, the memory layout has a big influence on the performance of the system. It is common to gain around 10% of runtime by optimizing memory access patterns.
Multicore overhead
Running software on multiple cores is usually faster than running it on a single core, but two cores are never twice as fast as one core: the parallel efficiency is always below 100%. How close the speedup gets to the core count depends on the nature of the task, which is expressed in Amdahl's law. It focuses on the fact that there is a certain part of the program that cannot be parallelized. In practice, there are additional factors limiting speedup:
- Synchronization and communication overhead - some part of the program might require data from another core, or it might have to synchronize (e.g. one part of a program has to run immediately after a program on another core, or one core has to wait until a lock is available). In both cases, the result will be that the processor is blocked without doing any work (wait states)
- Context switch overhead - there is always an overhead when a CPU has to do a context switch. This can be as little as two cycles (in certain situations on certain microcontrollers) and as much as millions of cycles (if a processor has to flush its caches, perform a page table walk and reload RAM contents from disk)
- Contention - if cores share a hardware resource, as long as one is using it, the other cores have to wait until it is available
- Interference effects - an example is bus transport - even if two cores access two separate peripherals, these two peripherals might be attached to a common bus, and that bus can then become a bottleneck
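Amdahl's law mentioned above can be made concrete with a small calculation. The following function is an illustrative sketch (the name `amdahl_speedup` is ours, not from any standard):

```c
/* Amdahl's law: the speedup on n cores when a fraction p of the program
 * (0.0 <= p <= 1.0) can be parallelized. The serial part (1 - p) limits
 * the achievable speedup no matter how many cores are added. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

For example, with 95% of the work parallelizable, two cores give a speedup of about 1.9 rather than 2.0, and even an unlimited number of cores cannot exceed 1 / (1 - p) = 20.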
The previous section focused on the run time / load aspects of distributed computing. In hard real time systems, reaction time jitter is a second important issue that stems from the same root causes. These systems are usually more concerned with worst case execution time than with average core loads. The more components that are involved, the harder it gets to define, reproduce and measure this worst case.
These facts influence the distribution of software to cores and are project and device specific. This also applies to the Basic Software - some BSW modules should be moved to another core, some should be distributed across multiple cores and some should be kept on a single core.
Design for performance
It is only possible to design for performance if one knows the performance goals of the system. In some systems, the most important property might be CPU load, with the goal of loading all cores equally. In others, it might be worst case latency - for example, a gateway ECU that should not take longer than 1 ms to forward a message. In yet others, it is most important to have a highly predictable system with low jitter.
When discussing performance, the only way to get a reliable answer is to measure. While intuition can sometimes give you an idea of which pieces of your code are relevant to performance, only a good measurement can give a definitive answer. However, especially at the beginning of a project, design decisions have to be made before there is software that can be measured.
That being said, a common goal in a multicore system is to use the cores efficiently, and the biggest source of inefficiency is waiting. This takes two main forms: waiting for memory, and waiting for other cores. Waits for memory can occur on any memory access, but the slower the access the longer the wait. So these waits can be reduced by primarily using local memory - the memory that is fastest to access, because it is closest to the core running the code. Waits for other cores mostly occur when there are dependencies between software running on different cores. These waits can be reduced by a design that reduces dependencies. For example, sometimes tasks are chained to get predictable response times in a multithreaded system - task2 runs directly after task1 - but such a chaining also means that everything in task2 can only run after task1. If these tasks run on different cores, the cores have to wait for one another.
So the general rules of thumb are:
- Know your performance goals
- Measure
- Keep data local
- Avoid dependencies, especially between cores
Concurrency and Parallelism
Wikipedia defines concurrent computing as: "several computations are executed concurrently—during overlapping time periods—instead of sequentially—with one completing before the next starts". So multiple threads of execution can interrupt each other (multithreading). Parallel computing refers to the case where more than one core is present, so as many pieces of code as there are cores can run at the same time. Microcontrollers are often used for real-time, interrupt-heavy workloads, and aside from the most basic systems they always use multithreading. So in a sense, many of the issues mentioned in chapter 2, Concurrency issues, are already familiar. However, adding more cores makes it more likely that existing concurrency issues occur, removes some mitigations (on a single-core system, many concurrency problems can be avoided by a clever choice of thread priorities) and also adds some additional issues (mostly related to memory access).
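The memory-access issues mentioned above can be sketched with a shared counter. This is an illustrative sketch, not an AUTOSAR API: POSIX threads stand in for OS tasks, and the names are ours. With a plain `int`, the increment is a read-modify-write sequence that can interleave across cores and lose updates; the C11 `atomic_fetch_add` makes each increment indivisible.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Two threads increment a shared counter 100000 times each. The atomic
 * operation guarantees that no increment is lost, even when the threads
 * run truly in parallel on different cores. */
static atomic_int shared_count;

static void *atomic_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++)
        atomic_fetch_add(&shared_count, 1);  /* indivisible increment */
    return (void *)0;
}

/* Runs two concurrent workers and returns the final counter value. */
int run_atomic_counter_demo(void)
{
    pthread_t t1, t2;
    atomic_store(&shared_count, 0);
    pthread_create(&t1, 0, atomic_worker, 0);
    pthread_create(&t2, 0, atomic_worker, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return atomic_load(&shared_count);
}
```

Replacing `atomic_int` with a plain `int` would make the result nondeterministic on a multicore machine - exactly the kind of issue that becomes more likely with additional cores.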
AUTOSAR Multicore
Basic Terms
OsApplication
These are used to define the privilege level and the protection boundary of the tasks, ISRs and locks (OsResources) that they own. In essence, they are a container for the operating system to manage things that belong together. There is exactly one OsApplication for each EcucPartition. Detailed information about OsApplications and related concepts can be found in the ETAS RTA-OS User Guide, Chapter 19.1 OS-Applications:
AUTOSAR OS provides a higher-level abstraction that allows OS objects (Tasks, ISRs, Events, Resources, Alarms, Schedule Tables and Counters etc.) to be grouped together into a cohesive functional unit called an OS-Application
OsApplications are used to manage access rights and execution privileges:
Trusted OS-Applications run in privileged (supervisor) mode. They have unrestricted access to memory, all configured OS objects and the complete OS API. [...]
Untrusted OS-Applications run in non-privileged (user) mode [...] Tasks and ISRs in an OS-Application only have access to objects owned by the same OS-Application by default
Trusted-With-Protection OS-Applications run in privileged (supervisor) mode. Memory protection is applied to them in the same way as untrusted code, but otherwise they are the same as Trusted OS-Applications.
ASIL separation is realized by the configuration of timing, memory and service protection for each OS Application.
The following picture shows the properties of OsApplications in an overview:
EcucPartition
Each OsApplication refers to one EcucPartition (1:1 mapping via OsAppEcucPartitionRef). The OsApplication is used at runtime by the OS, whereas the EcucPartition is used during configuration, mainly to describe the relation to SW-Cs as well as the interaction with other BSW modules (like the ComMUser).
Mapping of SW-Cs
There is a two-way mapping of SW-Cs to OsApplications.
Source: after AUTOSAR_CP_EXP_LayeredSoftwareArchitecture, slide id 11eer
System Level concepts - EcuPartition ApplicationPartition
Overall status of Multicore support in AUTOSAR
Multicore support was introduced to AUTOSAR in release 4.0.1, consisting of extensions to the OS, the RTE and the EcuM. There were, however, performance issues with the initial approach of having the complete BSW run on one core and using remote procedure calls from the other cores. Therefore, a concept called "Enhanced BSW allocation in partitioned systems" was introduced in release 4.1.1.
The BSW distribution scenario supported by RTA-BSW is based upon the AUTOSAR concept "Enhanced BSW allocation in partitioned systems", also known as Master-Satellite:
Scheduling of tasks and execution of services
According to the BSW Module Description Template there are 3 types of executable entities in the BSW:
- BswSchedulableEntity: e.g. the module main function, which is designed to be controlled by the BSW Scheduler => called by an OS task and executed on a specific core
  - the OsApplication in which the function is executed is determined by the RteBswEventToTaskMapping, if it exists. If it does not exist, the function is executed in the context of the caller.
- BswInterruptEntity: Interrupt Service Routine (ISR) => triggered by an interrupt and executed on a specific core
- BswCalledEntity: service (API) which is designed to be called from another BSW module => executed in the context and thus on the core of the caller
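The scheduling of BswSchedulableEntities can be sketched as a task body that simply calls the mapped main functions in order; wherever the task runs, the main functions run. This is a hypothetical sketch: the task and function names follow the AUTOSAR naming pattern, but the bodies are invented stand-ins, not generated RTE code.

```c
/* Stand-in main functions; real ones would poll hardware, process
 * queues, etc. The counters only make the calls observable here. */
static int Com_MainFunctionRx_calls;
static int CanIf_MainFunction_calls;

static void Com_MainFunctionRx(void) { Com_MainFunctionRx_calls++; }
static void CanIf_MainFunction(void) { CanIf_MainFunction_calls++; }

typedef void (*MainFunction)(void);

/* List of schedulable entities mapped to this task (in a real system,
 * derived from the RteBswEventToTaskMapping). The task's core mapping
 * decides on which core these main functions execute. */
static const MainFunction Task_Bsw_5ms_Core0_entities[] = {
    Com_MainFunctionRx,
    CanIf_MainFunction,
};

/* One activation of the task: call every mapped entity once.
 * Returns the number of entities executed. */
int Task_Bsw_5ms_Core0(void)
{
    int n = (int)(sizeof(Task_Bsw_5ms_Core0_entities)
                  / sizeof(Task_Bsw_5ms_Core0_entities[0]));
    for (int i = 0; i < n; i++)
        Task_Bsw_5ms_Core0_entities[i]();
    return n;
}
```

Moving a main function to another core then amounts to mapping it into a task that runs on that core.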
Additional concepts
Inter-OS-Application Communication / Communicator (IOC)
The "IOC" is responsible for the communication between OS-Applications and in particular for the communication crossing core or memory protection boundaries.
[...]
The IOC provides communication services which can be accessed by clients which need to communicate across OS-Application boundaries on the same ECU or Software Cluster.
The RTE uses IOC services to communicate across such boundaries. All communication must be routed through the RTE on sender (or client) and on receiver (or server) side.
Source: AUTOSAR_CP_SWS_OS, ch. 7.10 "Inter-OS-Application Communicator (IOC)"
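Queued cross-core communication of the kind the IOC provides is often built on ring buffers where sender and receiver never write the same index. The following single-producer/single-consumer sketch illustrates the idea; it is our own illustration (names modeled on, but not identical to, the generated IOC API), not the RTA-OS implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Single-producer / single-consumer ring buffer: the sending core only
 * writes 'head', the receiving core only writes 'tail', so no lock is
 * needed across the core boundary. */
#define IOC_QUEUE_SIZE 8u           /* must be a power of two */

typedef struct {
    uint32_t buf[IOC_QUEUE_SIZE];
    atomic_uint head;               /* written by the sender only */
    atomic_uint tail;               /* written by the receiver only */
} IocQueue;

bool Ioc_Send(IocQueue *q, uint32_t data)
{
    unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head - tail == IOC_QUEUE_SIZE)
        return false;               /* queue full */
    q->buf[head % IOC_QUEUE_SIZE] = data;
    /* release: publish the payload before the new head becomes visible */
    atomic_store_explicit(&q->head, head + 1u, memory_order_release);
    return true;
}

bool Ioc_Receive(IocQueue *q, uint32_t *data)
{
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (head == tail)
        return false;               /* queue empty */
    *data = q->buf[tail % IOC_QUEUE_SIZE];
    atomic_store_explicit(&q->tail, tail + 1u, memory_order_release);
    return true;
}

/* Round-trip demo: send one value and receive it back. */
uint32_t Ioc_demo(void)
{
    static IocQueue q;
    uint32_t v = 0u;
    (void)Ioc_Send(&q, 42u);
    (void)Ioc_Receive(&q, &v);
    return v;
}
```

The acquire/release ordering is what makes this safe across cores: the receiver is guaranteed to see the payload before it sees the updated `head`.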
Service Component / Service Proxy
A BSW module (specified by a BSW Module Description - BSWMD) cannot have ports. This is intentional so that ASW components cannot circumvent the layered software architecture and access lower level BSW modules. But there are BSW services that the ASW has to interact with via ports using RTE, for example Dem. The solution to this problem is the ServiceSwComponent. A BSW module that offers a service to the ASW has both a BSWMD and a ServiceSwComponent. The ServiceSwComponent owns the ports that are accessed by the ASW.
Since the ServiceSwComponent is related to the BSW, it is not part of the System Description and only added to the ECUExtract.
The ServiceProxySwComponent is a special case for mode management. In general, BSW services can only be called locally, and calls are not forwarded to other ECUInstances. The one exception is mode management, which sometimes requires coordinating modes with other ECUInstances. For this purpose, the local BswM service is connected to a service proxy, which forwards the calls over the network.
More details can be found in AUTOSAR_CP_TPS_SoftwareComponentTemplate, ch. 11.2 "Service Software Component Type" and 11.3 "Service Proxy Component Type"
Master core / slave core
Typically, the hardware only starts one core, referred to as the master core, while the other cores (slaves) remain in halt state until they are activated by the software. On systems where cores start independently from each other, it is necessary to emulate master-slave behavior by software.
On architectures with a sequential start of cores, there is one designated master core, in which the boot loader starts the master EcuM via EcuM_init.
The EcuM in the master core starts some drivers, determines the Post Build configuration and starts all remaining cores with all their satellite EcuMs.
Source: 2.4.4 "Configuring the EcuM (per Core)"

Currently, RTA-CAR supports a single EcuM and BswM per EcuInstance. In AUTOSAR, there is exactly one EcuM per core, while there can be multiple BSW Mode Managers (BswM), one for each OsApplication that contains BSW code. For information regarding the configuration, refer to 2.4.4 "Configuring the EcuM (per Core)".

EcuM and BswM multicore use cases
Although there is only a single BswM and EcuM instance, it is usually possible to cover the use cases that would require multiple instances, for example by using multiple schedule tables.
Exclusive areas
An exclusive area is a piece of code that must not be executed concurrently. If one thread is executing in that section, another thread must not enter it. To ensure this, locks (for example an OsResource) are used - when entering, the first thread takes a lock. When a second thread tries to enter the section, it also tries to take the lock. Since the lock is already taken, the second thread cannot enter until the first thread releases the lock.
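The lock pattern described above can be sketched as follows. This is an illustrative sketch only: the `SchM_Enter`/`SchM_Exit` names follow the AUTOSAR naming pattern, but here they are stand-ins backed by a POSIX mutex rather than an OsResource, and POSIX threads stand in for tasks.

```c
#include <pthread.h>

static pthread_mutex_t ea_lock = PTHREAD_MUTEX_INITIALIZER;
static long ea_counter;

/* Entering the exclusive area takes the lock; a second thread trying to
 * enter blocks here until the first thread has exited. */
static void SchM_Enter_Demo_COUNTER(void) { pthread_mutex_lock(&ea_lock); }
static void SchM_Exit_Demo_COUNTER(void)  { pthread_mutex_unlock(&ea_lock); }

static void *ea_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        SchM_Enter_Demo_COUNTER();
        ea_counter++;               /* code inside the exclusive area */
        SchM_Exit_Demo_COUNTER();
    }
    return (void *)0;
}

/* Runs two concurrent workers; because every increment happens inside
 * the exclusive area, no update is lost. */
long run_exclusive_area_demo(void)
{
    pthread_t t1, t2;
    ea_counter = 0;
    pthread_create(&t1, 0, ea_worker, 0);
    pthread_create(&t2, 0, ea_worker, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return ea_counter;
}
```

Without the enter/exit calls, the two read-modify-write sequences could interleave and the final count would be unpredictable.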
Memory Mapping
In the AUTOSAR classic platform, the addresses of code, variables and static data (constants / parameters) are determined at build time, when the binary is created during the linker run. To make it possible to locate code, variables and data at the right addresses, AUTOSAR has a memory mapping concept. For more information see AUTOSAR_CP_SWS_MemoryMapping, especially Examples 7.2 to 7.4.
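The memory mapping concept follows a fixed section-wrapping pattern; the fragment below is a sketch of that pattern with an example module name (the actual section names and MemMap header contents are tool- and project-specific):

```c
/* The MemMap header translates the START/STOP defines into
 * compiler-specific pragmas that place the variable into the named
 * section (e.g. core-local RAM); the linker script then assigns that
 * section its address. Module name and variable are examples. */
#define COM_START_SEC_VAR_CLEARED_8
#include "Com_MemMap.h"
static uint8 Com_ExampleFlag;       /* placed into the mapped section */
#define COM_STOP_SEC_VAR_CLEARED_8
#include "Com_MemMap.h"
```

In a multicore system, this mechanism is what lets a project place a module's data into the fast local memory of the core that uses it, in line with the "keep data local" rule above.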