Robust Design Patterns - Part 1
Introduction
In this codelab, we address the problem of unresponsive applications. “Unresponsive” refers to the situation in which a program installed on an MCU becomes unavailable for various reasons (e.g. a CPU-gobbling task, a deadlock, …).
What you’ll build
In this codelab, you will reuse the application from the previous codelabs (or that of the project) and add a watchdog to it, ensuring that the system “always” returns to a known, running state no matter what goes wrong in the software.
What you’ll learn
- What types of Watchdogs various platforms offer and what their main differences are.
- How to dimension and implement a Watchdog - the solution of last resort for safe and reliable systems.
- The inner workings of the watchdog on the target we are using.
- How to recover the reset reason of a system.
- How to keep values in memory so that they survive a reboot.
- How to make an idle system visible… 🙃
What you’ll need
- You need to have finished the Digging into Zephyr RTOS codelab.
- You need to have completed the Scheduling of periodic tasks codelab.
- You need to have finalized the Robust Development Methodologies (I) codelab and the Robust Development Methodologies (II) codelab.
- The codelabs related to scheduling (part 1 and part 2) must be concluded.
Introduction to Watchdog
In the first part of the codelab, we will learn about the watchdog options available on our board. The board we are using for this course (nRF5340) offers multiple independent watchdogs, called WDT, that are used to detect and resolve malfunctions due to software failures. A watchdog works by triggering a reset sequence when it is not refreshed within the expected time window. As can be seen below, the board offers two watchdogs for the application core and one for the network core.

The watchdogs rely on an independent 32.768 kHz low-speed clock
(LFCLK - for which multiple sources are possible) and thus remain active even
if the main clock (HFCLK) fails, as can be seen in the picture below.

As stated, there are two watchdogs that can be activated on the application core. In the present codelab we are going to put WDT0 into service.
Questions
Can you explain:
- what the interest of having two system watchdogs is?
- whether it is possible to avoid a reset while debugging with the WDT activated?
Dimensioning the watchdog
So, the first question one asks is: “How long should I wait before the reset is triggered?”. That is, how big should the refresh window be? This corresponds to the maximum time allowed without a refresh from the application before a reset is issued. The calculation assumes that the idle task shall run at least every second engine-task period (see Phase A - Periodic tasks), because at design time it was calculated that the system shall have spare capacity. Not being able to run the idle task therefore means that there is no such spare capacity, which is deemed an unsustainable situation.
If one reads the nRF5340 manual, one sees that the formula for calculating the maximum time allowed without a refresh is the following:
\(timeout [s] = \frac{ CRV_{reg} + 1 }{32768}\)
Questions
- First of all: calculate the parameters of the watchdog so as to comply with the timing requirements stated above.
- Secondly, although a watchdog is very important, it can be a burden while debugging, as it triggers resets at speeds a human being cannot handle (in the milliseconds to single-digit seconds range). So: how could we tackle this?
Danger
The present solution assumes that the application cannot withstand a situation in which the idle task is not served for a whole period, as this would imply the system has no free capacity left - something deemed impossible at design time. Obviously, this is not how the watchdog is applied most of the time. In fact, the watchdog is more often used as a solution of last resort: its application is scrutinized carefully, resulting in a slower triggering of such a measure, for instance by applying other recovery measures first before resetting the system.
To avoid landing in a difficult situation, the chosen timeout is 10x the period of the idle task. The reason is that the watchdog survives a reboot of the system, so the boot sequence shall have time to execute before the watchdog bites.
Note
In reality this computation is already done by Zephyr RTOS; we have the luxury of simply specifying the time in ms in our code.
Implementing a watchdog
First of all, we need to ensure that a task with a very low priority exists. For this, we define a new task with the second-lowest priority, just above the idle task. Concretely,
- activate watchdog support in your `prj.conf`

  ```conf
  # configure system watchdog
  CONFIG_WATCHDOG=y
  ```

- add an `overlay` granting access to the watchdog

  ```dts
  /*
   * Copyright 2024 Nordic Semiconductor ASA
   * SPDX-License-Identifier: Apache-2.0
   */
  &wdt0 {
      status = "okay";
  };
  ```

- include the following into your `main.cpp`:

  ```cpp
  using namespace std::chrono_literals;

  static constexpr auto kWdtTimeOut{500ms};
  static constexpr auto k50msTimeOut{50ms};
  static constexpr auto kIdleThreadStackSize{512};

  // Get the watchdog device from the Devicetree alias and prepare channel ID
  static const struct device *const wdt = DEVICE_DT_GET(DT_ALIAS(watchdog0));
  static int wdt_channel_id;

  // Function for refreshing the watchdog
  static void wdt_feed_thread(void *p1, void *p2, void *p3) {
      while (1) {
          wdt_feed(wdt, wdt_channel_id);
          LOG_DBG("System watchdog refreshed");
          k_sleep(K_MSEC(k50msTimeOut.count()));
      }
  }
  ```

More elegant and, especially, more secure
In the above code, we are forced to convert the `std::chrono` duration into an untyped value for its use in `K_MSEC(k50msTimeOut.count())`. If we used the appropriate `zpp_lib` method, this would - elegantly, and more securely, since the type would be checked - become:

```cpp
zpp_lib::ThisThread::sleep_for(k50msTimeOut);
```

Decide whether you want a `c-style` or a `zpp_lib` idle thread definition.

**c-style**

```cpp
// Thread for refreshing the watchdog when there is nothing else to do
K_THREAD_DEFINE(wdt_feeder, kIdleThreadStackSize, wdt_feed_thread,
                wdt, &wdt_channel_id, nullptr,
                CONFIG_NUM_PREEMPT_PRIORITIES - 1, 0, -1);
```

**zpp_lib**

```cpp
#include "zpp_include/thread.hpp"
#include "zpp_include/types.hpp"

zpp_lib::Thread wdt_feeder(
    zpp_lib::PreemptableThreadPriority::PriorityIdle,
    "wdt_feeder"  // optional name
);
```

and then instantiate what is needed within an
`initSystemWatchdog()` function.

**c-style**

```cpp
static int initSystemWatchdog() {
    int err;

    // 1. Check that the watchdog device is ready
    if (!device_is_ready(wdt)) {
        LOG_ERR("Watchdog device not ready");
        return -ENODEV;
    }

    // 2. Configure the timeout behavior
    struct wdt_timeout_cfg wdt_config = {
        .window = {
            .min = 0U,                   // No minimum wait time
            .max = kWdtTimeOut.count(),  // Maximum window
        },
        .callback = nullptr,             // No callback (direct reset)
        .flags = WDT_FLAG_RESET_SOC,     // Reset the whole chip on timeout
    };

    // 3. Install the timeout and get a channel ID
    wdt_channel_id = wdt_install_timeout(wdt, &wdt_config);
    if (wdt_channel_id < 0) {
        LOG_ERR("System Watchdog install error: %d", wdt_channel_id);
        return wdt_channel_id;
    }

    // 4. Start the watchdog with optional pause-on-debug
    err = wdt_setup(wdt, WDT_OPT_PAUSE_HALTED_BY_DBG);
    if (err < 0) {
        LOG_ERR("System Watchdog setup error: %d", err);
        return err;
    }

    // 5. Start the watchdog refresh thread
    k_thread_start(wdt_feeder);

    LOG_INF("System Watchdog started! Feed it within %lld ms...", kWdtTimeOut.count());
    return 0;
}
```

**zpp_lib**
```cpp
static int initSystemWatchdog() {
    int err;

    // 1. Check that the watchdog device is ready
    if (!device_is_ready(wdt)) {
        LOG_ERR("Watchdog device not ready");
        return -ENODEV;
    }

    // 2. Configure the timeout behavior
    struct wdt_timeout_cfg wdt_config = {
        .window = {
            .min = 0U,                   // No minimum wait time
            .max = kWdtTimeOut.count(),  // Maximum window
        },
        .callback = nullptr,             // No callback (direct reset)
        .flags = WDT_FLAG_RESET_SOC,     // Reset the whole chip on timeout
    };

    // 3. Install the timeout and get a channel ID
    wdt_channel_id = wdt_install_timeout(wdt, &wdt_config);
    if (wdt_channel_id < 0) {
        LOG_ERR("Watchdog install error: %d", wdt_channel_id);
        return wdt_channel_id;
    }

    // 4. Start the watchdog with optional pause-on-debug
    err = wdt_setup(wdt, WDT_OPT_PAUSE_HALTED_BY_DBG);
    if (err < 0) {
        LOG_ERR("System Watchdog setup error: %d", err);
        return err;
    }

    // 5. Start the watchdog refresh - attention to the lifetime of the objects used by the lambda
    err = wdt_feeder.start([&]() {
        wdt_feed_thread(static_cast<void *>(const_cast<struct device *>(wdt)),
                        &wdt_channel_id, nullptr);
    });
    if (err < 0) {
        LOG_ERR("System Watchdog start error: %d", err);
        return err;
    }

    LOG_INF("System Watchdog started! Feed it within %lld ms...", kWdtTimeOut.count());
    return 0;
}
```

Danger
Pay attention to the lifetime of the objects used by the lambda - since they are captured by reference, this would become an issue if the enclosing scope exited before the lambda finished executing. (Here both `wdt` and `wdt_channel_id` have static storage duration, so they outlive the thread.)
Note
Obviously, the initSystemWatchdog() function shall be called from within main().
From this point onwards, every time your very own idle task
is active, you should be able to spot it if you set the right
LOG level 😎.
Warning
In order to make it work, you obviously need to:

- define the constant `k50msTimeOut`
- add the necessary `#include` references as applicable (e.g. `<zephyr/kernel.h>`, `<zephyr/device.h>`, `<zephyr/drivers/watchdog.h>`)
Question
Answer the following questions:
- What would one need to do to install a callback?
- What does `WDT_FLAG_RESET_SOC` mean? What other options are there?
Info
Zephyr RTOS does not expose the idle task directly, hence the solution we use here. However, we could have used other options, like ensuring that the most important task refreshes the feed, relaxing the time constraints, and so forth.
Understanding what triggered the reset of the board
At times, the reset happens so suddenly that we do not really understand (nor do we have the time to log anything) what caused the reset in the first place.
So, the question arises: is there a way to know the reset reason? Luckily, there is.
This information is oftentimes vendor specific, but there is a Zephyr RTOS implementation described in the Hardware Info chapter. However, as an example, the Nordic Semiconductor solution is used here.
```cpp
static void resetReason() {
    /* 1. Read and clear the reset reason */
    uint32_t reason = nrfx_reset_reason_get();
    nrfx_reset_reason_clear(reason);

    if (reason & NRFX_RESET_REASON_DOG_MASK) {
        LOG_ERR("Reboot Cause: WATCHDOG RESET");
    } else if (reason & NRFX_RESET_REASON_RESETPIN_MASK) {
        LOG_ERR("Reboot Cause: PIN RESET");
    } else if (reason & NRFX_RESET_REASON_SREQ_MASK) {
        LOG_ERR("Reboot Cause: SOFTWARE RESET");
    } else {
        LOG_ERR("Reboot Cause: POWER-ON / OTHER (0x%08x)", reason);
    }
}
```
Warning
One shall include the following for it to work:

```cpp
#include <helpers/nrfx_reset_reason.h>  // Include for reset reason
```
For more information about reset reasons, read the corresponding vendor information under https://docs.nordicsemi.com/bundle/ncs-2.7.0/page/nrfx/nrfx_api/reset_reason.html.
Memorizing values surviving a reset
So, now that we know we can retrieve the reset reason, we can ask ourselves the next question: how am I to know how often this has occurred? In an ideal world, we would store the information in a file, a database, or some other form of storage, but this is not always possible when there is very little time. One solution is to store it in dedicated chips. Another is to have a part of the RAM that is not re-initialized at start, so that we can store and retrieve information from it. This facility is called Retained Memory in Zephyr RTOS, and it provides a way of reading from/writing to memory areas whose contents are retained as long as the device is powered.
Critical
Data may be lost in low-power modes and will certainly be lost in case of a power outage.
For this to work, one needs to
- activate the functionality in `prj.conf`

  ```conf
  CONFIG_RETAINED_MEM=y
  CONFIG_RETAINED_MEM_ZEPHYR_RAM=y
  ```
- define the corresponding information in the overlay so as to carve out a memory block (a 4 KB block in this example) at the end of SRAM for retained memory, register it as a Zephyr retained memory device and ensure the rest of SRAM does not overlap with it. Concretely:

  ```dts
  / {
      sram0_retained@2006f000 {
          compatible = "zephyr,memory-region", "mmio-sram";
          reg = <0x2006f000 0x1000>;
          zephyr,memory-region = "RetainedMem";
          status = "okay";

          retained_mem0: retainedmem {
              compatible = "zephyr,retained-ram";
              status = "okay";
          };
      };

      chosen {
          zephyr,retained-mem = &retained_mem0;
      };

      aliases {
          retainedmemdevice = &retained_mem0;
      };
  };

  &sram0_image {
      /* Shrink SRAM to avoid overlap with retained memory region */
      reg = <0x20000000 0x6f000>;
  };
  ```

  Here is what each part of the devicetree overlay does:

  - `/ { ... };`: this block defines new nodes or properties at the root of the devicetree.
  - `sram0_retained@2006f000 { ... };`: this node defines a memory region starting at address `0x2006f000` with size `0x1000` (4 KB).
    - `compatible = "zephyr,memory-region", "mmio-sram";`: declares this node as a generic memory region and memory-mapped SRAM.
    - `reg = <0x2006f000 0x1000>;`: sets the base address and size.
    - `zephyr,memory-region = "RetainedMem";`: names this region “RetainedMem” for Zephyr.
    - `status = "okay";`: enables this node.
  - `retained_mem0: retainedmem { ... };`: this is a child node representing the actual retained RAM device.
    - `compatible = "zephyr,retained-ram";`: marks it as a Zephyr retained RAM device.
    - `status = "okay";`: enables this device.
  - `chosen { zephyr,retained-mem = &retained_mem0; };`: tells Zephyr to use `retained_mem0` as the system’s retained memory device.
  - `aliases { retainedmemdevice = &retained_mem0; };`: creates an alias so you can refer to the retained memory device as `retainedmemdevice` in code.
  - `&sram0_image { reg = <0x20000000 0x6f000>; };`: shrinks the main SRAM region to avoid overlapping with the retained memory region you just defined.
- create a function that initializes the memory and reads its content. In the example below, one not only reads the data but also increments it by one, the goal being to store a value that is incremented every time the memory is initialized.

  ```cpp
  static uint32_t initAndIncreasePseudoStaticMemory() {
      const struct device *const ret_mem = DEVICE_DT_GET(DT_CHOSEN(zephyr_retained_mem));
      uint32_t saved_val = 0;

      /* Check Retained Memory first */
      if (!device_is_ready(ret_mem)) {
          return 0;
      }
      retained_mem_read(ret_mem, 0,
                        static_cast<uint8_t *>(static_cast<void *>(&saved_val)),
                        sizeof(saved_val));
      saved_val++;

      /* Save the value to memory before we let the watchdog bite */
      retained_mem_write(ret_mem, 0,
                         static_cast<uint8_t *>(static_cast<void *>(&saved_val)),
                         sizeof(saved_val));
      return saved_val;
  }
  ```

- define a function that clears the content of the memory section
  ```cpp
  static void clearRetainedMemory() {
      const struct device *const ret_mem = DEVICE_DT_GET(DT_CHOSEN(zephyr_retained_mem));
      if (device_is_ready(ret_mem)) {
          retained_mem_clear(ret_mem);
      }
  }
  ```
Question
How can you modify `resetReason()` so that it

- reads the value, prints out the number of times the watchdog has hit sequentially (see below) and increments it every time there is such a watchdog reset

  ```
  [00:00:00.250,244] <inf> car_system: Program started
  [00:00:00.250,305] <err> car_system: Reboot Cause: WATCHDOG RESET (35 times in a row)
  [00:00:00.250,305] <inf> car_system: Starting system watchdog
  [00:00:00.250,335] <inf> car_system: Watchdog started! Feed it within 2000 ms...
  ```

- resets the memory value to 0 every time there has been a reset initiated by anything but a watchdog?
Task watchdog
A hardware watchdog can only monitor one thing: “is anything feeding me?” In a multi-threaded RTOS, that is not enough. If thread A is stuck but thread B keeps feeding the watchdog, the system never resets — even though it is broken.
The task watchdog (task_wdt) is a software layer that monitors individual
threads. Each thread gets its own channel and must call task_wdt_feed() within
its deadline. The hardware watchdog is only fed when all channels are healthy.
One stuck thread → hardware starves → system resets.
Thread A Thread B Thread C
| | |
v v v
channel 0 channel 1 channel 2
\ | /
\ | /
+------ task_wdt (software) ----+
single k_timer
|
v
hardware watchdog
(optional fallback)
All channels share a single k_timer. On every task_wdt_feed(), the timer is
reprogrammed to the nearest deadline across all channels. If no thread misses
its deadline, the timer ISR never fires. This makes it very lightweight — no
periodic polling, no extra threads.
In order to activate this task watchdog framework, one
- adds the following to `prj.conf`

  ```conf
  CONFIG_TASK_WDT=y
  CONFIG_TASK_WDT_HW_FALLBACK=n  # no hardware wdt as safety net in this codelab
  ```

  When the hardware fallback is enabled, the hardware watchdog timeout is set to `MIN_TIMEOUT + HW_FALLBACK_DELAY`. This ensures the software layer always has a chance to detect the failure before the hardware forces a reset.

  Hint

  Obviously, the values need to be adapted to your timing needs.
- implements the following APIs:

  - `task_wdt_init(hw_wdt)` in `main()`. If a system watchdog is available, either you remove it and pass its handle to `task_wdt_init`, or you pass `nullptr`. For this codelab and the `car-sim` project, this shall be set to `nullptr`:

    ```cpp
    int err = task_wdt_init(nullptr);
    if (err < 0) {
        LOG_ERR("Task watchdog init failed: %d", err);
        return err;
    }
    ```
  - `task_wdt_add(timeout_ms, callback, user_data)` in the defined tasks one would like to monitor. For example:

    ```cpp
    LOG_DBG("Thread %d starting at time %lld ms", taskIndex, nextPeriodStartTime.count());

    /* When registering: */
    int wd_id = task_wdt_add(500, my_task_wd_callback, NULL);
    if (wd_id < 0) {
        LOG_ERR("Failed to add task watchdog channel");
        return;
    }
    ```

    Danger
    If no `callback` is specified, a reset takes place.
  - `task_wdt_feed(channel_id)` for feeding the watchdog. Very straightforward, as in:

    ```cpp
    task_wdt_feed(wd_id);
    ```

  - `task_wdt_delete(channel_id)` for unregistering the task.
- adds a callback for the timeouts - an example may be

  ```cpp
  static void my_task_wd_callback(int channel_id, void *user_data) {
      LOG_ERR("Thread on channel %d is stuck!", channel_id);
      /* System will reset after this returns if CONFIG_TASK_WDT_HW_FALLBACK=y.
       * One may choose to signal a resume of the task duty if so desired.
       * IMPORTANT: this runs in ISR context. */
  }
  ```
Note
How the Timer Works Internally
- `task_wdt_add()` records an absolute tick deadline for the channel.
- `task_wdt_feed()` updates that channel’s deadline, then scans all channels to find the soonest one. It reprograms the single k_timer to fire at that soonest deadline. If hardware fallback is enabled, it also feeds the hardware watchdog.
- If a thread stops feeding, the timer eventually fires. The ISR invokes the expired channel’s callback (or resets the system).
- During normal operation, the timer ISR never actually runs — it keeps getting pushed forward.
Details available under Task Watchdog Source code.
Question
What would setting `CONFIG_TASK_WDT_HW_FALLBACK=y` imply? What is the
meaning of the following in that context?

```conf
CONFIG_TASK_WDT_MIN_TIMEOUT=500          # minimum timeout across all channels (ms)
CONFIG_TASK_WDT_HW_FALLBACK_DELAY=1000   # extra margin before hw wdt fires (ms)
```
Task Watchdog vs. Hardware Watchdog
| | Hardware Watchdog | Task Watchdog |
|---|---|---|
| Monitors | Whole system | Individual threads |
| Backed by | Dedicated hardware peripheral | k_timer (software) |
| Channels | Usually 1 | As many as you configure |
| Failure mode | System reset | Callback or system reset |
| Catches | Hard faults, scheduler hang | Thread deadlocks, starvation |
They are complementary. The task watchdog sits on top of the hardware watchdog, using it as a fallback safety net.
Implement Reset Reason, System and Task Watchdogs for all threads of your car-sim project
Now that you have all the elements required, go ahead and implement

- a system watchdog fulfilling the requirements as stated in the exercise. Make sure that the reset reason is recorded and that the number of watchdog-caused resets is available in an error log like

  ```
  [00:00:00.250,305] <err> car_system: Reboot Cause: WATCHDOG RESET (35 times in a row)
  ```

- task watchdogs for all your car components. Ensure that

  - the deadlines are those specified by the requirements
  - a `LOG_ERR("Thread on channel %d is stuck!", channel_id)` is implemented in a task-specific callback
Going beyond / References
- nRF5340 reference manual, check out chapter 7.41
- Nordic Semiconductor nRF5340 reset reasons
- [Retained Memory](https://docs.zephyrproject.org/latest/hardware/peripherals/retained_mem.html)
- Task Watchdog Official docs
- Task Watchdog API reference
- Task Watchdog Source code (~200 lines, very readable)
- Task Watchdog Original PR with design discussion