Mars Pathfinder system resets due to priority inversion
NASA · Mars Pathfinder flight software
The Mars Pathfinder spacecraft experienced unexpected system resets a few days after its landing in July 1997. The resets occurred when the high-priority bc_sched task detected that the bc_dist task, responsible for data distribution, had not completed its execution by its hard deadline. Each reset reinitialized hardware and software, terminating current ground-commanded activities and delaying daily operations, though no data that had already been collected was lost.
The root cause was identified as a priority inversion in the flight software's use of the VxWorks operating system. A low-priority ASI/MET task acquired a mutual exclusion semaphore but was then preempted by several medium-priority tasks before releasing it. When the high-priority bc_dist task subsequently attempted to acquire the same semaphore, it blocked. ASI/MET could not run to release the semaphore while the medium-priority tasks occupied the processor, so bc_dist was effectively stalled behind tasks of lower priority than its own. It missed its deadline, and bc_sched triggered the reset.
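The sequence described above can be reproduced in a toy scheduler. The following is a minimal sketch, not the flight software: the tick-based model, the step lists, the priority numbers, the release times, and the deadline value are all assumptions chosen for illustration. Only the task names come from the incident. Larger numbers mean higher priority here, which is the opposite of the VxWorks convention, where 0 is the most urgent.

```python
from dataclasses import dataclass
from typing import List, Optional

DEADLINE_TICK = 8  # hypothetical hard deadline for bc_dist, in ticks


@dataclass(eq=False)
class Task:
    name: str
    priority: int                      # larger number = more urgent (illustrative scale)
    release: int                       # tick at which the task becomes ready
    steps: List[str]                   # each step costs one tick: "lock", "work", "unlock"
    pc: int = 0                        # index of the next step to execute
    finished_at: Optional[int] = None  # tick by which the last step had completed


def simulate(tasks: List[Task], max_ticks: int = 50) -> List[str]:
    """Fixed-priority preemptive scheduling around one mutex, with no inheritance."""
    holder: Optional[Task] = None
    timeline = []
    for tick in range(max_ticks):
        ready = [t for t in tasks if t.release <= tick and t.finished_at is None]
        # A task about to lock a mutex held by someone else is blocked.
        runnable = [
            t for t in ready
            if not (t.steps[t.pc] == "lock" and holder is not None and holder is not t)
        ]
        if not runnable:
            timeline.append("idle")
            continue
        task = max(runnable, key=lambda t: t.priority)  # highest priority wins the tick
        step = task.steps[task.pc]
        if step == "lock":
            holder = task
        elif step == "unlock":
            holder = None
        task.pc += 1
        if task.pc == len(task.steps):
            task.finished_at = tick + 1
        timeline.append(task.name)
    return timeline


def bc_dist_finish_tick(with_medium_load: bool) -> int:
    """Run the scenario and report when bc_dist completes."""
    asi_met = Task("asi_met", priority=1, release=0,
                   steps=["lock", "work", "work", "unlock"])
    bc_dist = Task("bc_dist", priority=3, release=2,
                   steps=["lock", "work", "unlock"])
    tasks = [asi_met, bc_dist]
    if with_medium_load:
        # Stand-in for the "several medium-priority tasks" in the incident.
        tasks.append(Task("medium", priority=2, release=1, steps=["work"] * 6))
    simulate(tasks)
    return bc_dist.finished_at


if __name__ == "__main__":
    for load in (False, True):
        done = bc_dist_finish_tick(load)
        status = "met" if done <= DEADLINE_TICK else "MISSED"
        print(f"medium load={load}: bc_dist done at tick {done}, deadline {status}")
```

In this model bc_dist meets its deadline when only ASI/MET contends for the semaphore, and misses it once the medium-priority work is added, even though that work never touches the semaphore.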
Engineers diagnosed the problem by reproducing the failure in a lab environment, using debug and trace facilities built into the flight software. Once the failure was reproduced, the underlying priority inversion became evident.
The solution involved remotely patching the spacecraft’s flight software. The patch enabled priority inheritance for the semaphore used by the select() mechanism within VxWorks. With priority inheritance enabled, when a high-priority task blocks on a semaphore held by a lower-priority task, the lower-priority task temporarily inherits the higher priority. Medium-priority tasks can then no longer preempt it, so it finishes its critical section and releases the semaphore promptly.
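The inheritance rule itself is small enough to sketch. The class below is an illustrative model under assumptions, not VxWorks code: the Task and InheritanceMutex names and the priority numbers are hypothetical, larger numbers mean higher priority, and each task is assumed to hold at most one mutex (which is why release can simply restore the base priority). In VxWorks the corresponding behavior is requested with the SEM_INVERSION_SAFE option when a mutex semaphore is created with semMCreate.

```python
from typing import List, Optional


class Task:
    def __init__(self, name: str, base_priority: int) -> None:
        self.name = name
        self.base_priority = base_priority
        self.effective_priority = base_priority  # what the scheduler would use


class InheritanceMutex:
    """Mutex bookkeeping with priority inheritance (no real blocking)."""

    def __init__(self) -> None:
        self.holder: Optional[Task] = None
        self.waiters: List[Task] = []

    def acquire(self, task: Task) -> bool:
        """Return True if the task now holds the mutex, False if it must wait."""
        if self.holder is None:
            self.holder = task
            return True
        self.waiters.append(task)
        self._boost_holder()
        return False

    def release(self) -> Optional[Task]:
        """Drop the boost, hand the mutex to the most urgent waiter, return it."""
        self.holder.effective_priority = self.holder.base_priority
        self.holder = None
        if self.waiters:
            nxt = max(self.waiters, key=lambda t: t.effective_priority)
            self.waiters.remove(nxt)
            self.holder = nxt
            self._boost_holder()  # the new holder inherits from remaining waiters
        return self.holder

    def _boost_holder(self) -> None:
        # The holder runs at least as urgently as its most urgent waiter.
        top = max((w.effective_priority for w in self.waiters),
                  default=self.holder.effective_priority)
        self.holder.effective_priority = max(self.holder.effective_priority, top)


if __name__ == "__main__":
    asi_met, medium, bc_dist = Task("asi_met", 1), Task("medium", 2), Task("bc_dist", 3)
    mutex = InheritanceMutex()
    mutex.acquire(asi_met)
    mutex.acquire(bc_dist)  # bc_dist must wait, so asi_met is boosted
    print(asi_met.effective_priority > medium.effective_priority)
    mutex.release()
    print(asi_met.effective_priority == asi_met.base_priority)
```

While bc_dist waits, the holder's effective priority exceeds that of the medium-priority task, so a fixed-priority scheduler would keep running the holder until it releases the mutex.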
Extensive testing was conducted on the ground to verify the fix and to assess any performance impact or behavioral change. The remote update sent only the differences between the onboard and desired software versions; custom software on the spacecraft then applied those differences. The fix stopped the system resets, allowing the mission to continue pursuing its scientific objectives.
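The idea of uploading only the differences can be illustrated with a generic delta encoding. This is a sketch under assumptions: the actual patch format and tooling used on Pathfinder are not described here, and the function names make_delta and apply_delta, as well as the "copy"/"data" instruction layout, are hypothetical. The ground side compares the old and new images and emits either references to byte ranges the spacecraft already has or the new bytes themselves; the onboard side rebuilds the new image from its own copy plus that delta.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def make_delta(old: bytes, new: bytes) -> List[Tuple]:
    """Ground side: describe `new` as copies from `old` plus literal new bytes."""
    delta: List[Tuple] = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))       # bytes already on board
        elif tag in ("replace", "insert"):
            delta.append(("data", new[j1:j2]))   # bytes that must be uplinked
        # "delete": the old bytes are simply not copied
    return delta


def apply_delta(old: bytes, delta: List[Tuple]) -> bytes:
    """Onboard side: rebuild the new image from the old image and the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def uplink_payload_size(delta: List[Tuple]) -> int:
    """Count only the literal bytes, ignoring instruction overhead."""
    return sum(len(op[1]) for op in delta if op[0] == "data")


if __name__ == "__main__":
    old_image = bytes(range(200))                            # stand-in for onboard software
    new_image = old_image[:50] + b"\xff\xfe" + old_image[52:]  # two bytes changed
    delta = make_delta(old_image, new_image)
    assert apply_delta(old_image, delta) == new_image
    print(f"image: {len(new_image)} bytes, literal payload: {uplink_payload_size(delta)} bytes")
```

A real uplink format would also need integrity checks and a way to confirm that the onboard image matches the one the delta was computed against; those are omitted from this sketch.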