SpaceX Mars transportation software: Rad hardened or rad tolerant?

He does have a point about the registers in the processor itself, but at least some Intel Xeon processors have ECC protection on the registers as well (I'm not sure what other processors do).

For reference, here is the Xeon E7 white paper: http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-e7-family-ras-server-paper.pdf

Some highlights:

ECC: ECC is used to protect processor registers, processor caches, and system memory from transient faults that can corrupt program data without damaging the hardware. The increasing density of modern processors increases the likelihood of such faults.
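To make the "detect and correct" idea concrete, here is a minimal sketch of a single-error-correcting, double-error-detecting (SECDED) code in Python. It is a toy extended Hamming(8,4) code protecting 4 data bits, chosen only because it is small enough to read; it is not Intel's implementation, and real hardware uses much wider code words. The names `encode` and `decode` are made up for this illustration.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    bits = [0] * 8                      # index 0: overall parity, 1..7: Hamming positions
    bits[3], bits[5], bits[6], bits[7] = d
    bits[1] = bits[3] ^ bits[5] ^ bits[7]   # parity over positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]   # parity over positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]   # parity over positions 4, 5, 6, 7
    bits[0] = sum(bits[1:]) & 1             # make the whole word even parity
    return sum(b << i for i, b in enumerate(bits))


def decode(word: int):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos             # XOR of set positions is 0 for a valid word
    odd = sum(bits) & 1                 # 1 if the overall parity check fails
    if syndrome == 0 and not odd:
        status = "ok"
    elif odd:
        bits[syndrome] ^= 1             # single-bit error at position `syndrome`
        status = "corrected"
    else:
        return None, "uncorrectable"    # parity holds but syndrome != 0: two bits flipped
    data = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return data, status


word = encode(0b1011)
print(decode(word))                 # (11, 'ok')
print(decode(word ^ 0b00100000))    # (11, 'corrected')
print(decode(word ^ 0b00100100))    # (None, 'uncorrectable')
```

The third case is the interesting one for the rest of this post: the hardware knows the word is bad but cannot fix it on its own.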

Memory Demand and Patrol Scrub: These features provide the ability to find and correct memory errors, either reactively (demand) or proactively (patrol) addressing memory problems. In all cases, whenever the system detects an ECC error, it will attempt to correct the data and write it back, if possible. When correcting the data is not possible, as is the case with a permanent memory error, the corresponding memory is tagged as failed or “poisoned.” Demand scrubbing is the attempt to correct a corrupted read transaction. Patrol scrubbing involves proactively sweeping and searching system memory and attempting to repair any errors found. Patrol scrubbing errors may activate the Machine Check Architecture Recovery (MCA recovery) mechanism described later.
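The difference between demand and patrol scrubbing can be sketched in a few lines of Python. This is a software toy model, not how the memory controller is actually built: `ScrubbedMemory`, `check`, and the 8-bit extended Hamming codewords are all assumptions made for the illustration, and the sketch poisons a word as soon as it sees an uncorrectable (double-bit) error rather than modelling a permanent hardware fault.

```python
def check(word: int):
    """Check an 8-bit extended Hamming codeword; return (word, status)."""
    syndrome = 0
    for pos in range(1, 8):
        if (word >> pos) & 1:
            syndrome ^= pos
    odd = bin(word).count("1") & 1
    if syndrome == 0 and not odd:
        return word, "ok"
    if odd:                                  # single-bit error: flip it back
        return word ^ (1 << syndrome), "corrected"
    return word, "uncorrectable"             # double-bit error: detect only


class ScrubbedMemory:
    """Toy memory holding raw codewords, with demand and patrol scrubbing."""

    def __init__(self, words):
        self.words = list(words)
        self.poisoned = set()

    def _scrub(self, addr: int) -> str:
        fixed, status = check(self.words[addr])
        if status == "corrected":
            self.words[addr] = fixed         # write the corrected word back
        elif status == "uncorrectable":
            self.poisoned.add(addr)          # tag the location as failed
        return status

    def read(self, addr: int) -> int:
        """Demand scrub: check and repair only the word being read."""
        if addr in self.poisoned or self._scrub(addr) == "uncorrectable":
            raise RuntimeError(f"poisoned word at address {addr}")
        return self.words[addr]

    def patrol(self) -> dict:
        """Patrol scrub: sweep every non-poisoned word, repairing as we go."""
        report = {"ok": 0, "corrected": 0, "uncorrectable": 0}
        for addr in range(len(self.words)):
            if addr not in self.poisoned:
                report[self._scrub(addr)] += 1
        return report


mem = ScrubbedMemory([0x00, 0x0F, 0xF0, 0xFF])   # four valid codewords
mem.words[1] ^= 0x10                             # inject a single-bit upset
mem.words[2] ^= 0x03                             # inject a double-bit upset
print(mem.patrol())   # {'ok': 2, 'corrected': 1, 'uncorrectable': 1}
```

The point of the patrol pass is that the single-bit upset in word 1 gets repaired before a second upset can land in the same word and turn it into an uncorrectable error.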

Enhanced DRAM Single Device Data Correction (SDDC), Enhanced DRAM Double Device Data Correction (DDDC+1): Protect the system from memory chip failure. SDDC can correct any single memory chip failure as well as multi-bit errors from any portion of a single memory chip. It can reconstruct memory contents on the fly, even in the event of the complete failure of one chip. DDDC enables a memory DIMM to continue operation even in the event of two sequential DRAM device hard-errors. Enhanced DDDC (DDDC+1) adds the capability to detect and correct an additional single bit error on top of DDDC. DDDC+1 is a new feature unavailable in previous-generation processors. The ability to recover from two DRAM failures improves uptime and extends the time between service calls, lowering overall service costs.

Fine Grained Memory Mirroring: A method of keeping a duplicate (secondary or mirrored) copy of the contents of select memory that serves as a backup if the primary memory fails. The Intel Xeon processor E7 family supports more flexible memory mirroring configurations than previous generations, allowing the mirroring of just a portion of memory, leaving the rest of memory un-mirrored. The benefit to IT is more cost-effective mirroring for just the critical portion of memory versus mirroring the entire memory space. Failover to the mirrored memory does not require a reboot, and is transparent to the OS and applications.

QPI Viral Mode: Viral mode notifies the system of an uncorrectable error, with all packets having the viral bit set to indicate the presence of such errors. Viral mode causes the CPU and QPI to go into viral state, blocking QPI to PCIe messages. Software can detect this condition and respond to it appropriately. The system configuration agent will stay in that state until software changes the state or is reset.

There are a lot of other features in the paper that work together with these to make these systems very reliable.

With these sorts of features, it shouldn't be hard to build a system in which data is ECC-protected all the way from the registers to main memory, with all memory mirrored across multiple modules, much like a RAID 1 mirror. This lets the system recover from even "uncorrectable" errors: ECC still detects the bad word, and the same data word is very unlikely to have bit errors in both mirrored copies at the same time, so the good copy can be used to repair the bad one.
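Here is a rough Python sketch of that mirror idea, under some loud assumptions: `MirroredMemory` and `UncorrectableError` are names invented for the example, each word is stored in two "modules", and a CRC-32 from the standard `zlib` module stands in for the hardware ECC's ability to detect (but not fix) a multi-bit error. Real memory mirroring is done by the memory controller, not by software like this.

```python
import zlib


class UncorrectableError(Exception):
    """Raised when every mirrored copy of a word fails its check."""


def _check(value: int) -> int:
    # Stand-in for the ECC check bits: only used here to *detect* corruption.
    return zlib.crc32(value.to_bytes(8, "little"))


class MirroredMemory:
    """Toy model: every 64-bit word is kept in two independent modules."""

    def __init__(self, size: int):
        blank = (0, _check(0))
        self.modules = [[blank] * size, [blank] * size]

    def write(self, addr: int, value: int) -> None:
        cell = (value, _check(value))
        for module in self.modules:
            module[addr] = cell

    def read(self, addr: int) -> int:
        for module in self.modules:
            value, check = module[addr]
            if _check(value) == check:
                # Found a good copy: repair all copies from it, then return it.
                for other in self.modules:
                    other[addr] = (value, check)
                return value
        raise UncorrectableError(f"all copies of address {addr} are corrupt")

    def flip_bits(self, module: int, addr: int, mask: int) -> None:
        """Fault injection: corrupt the data but leave the stored check alone."""
        value, check = self.modules[module][addr]
        self.modules[module][addr] = (value ^ mask, check)


mem = MirroredMemory(16)
mem.write(3, 0xDEADBEEF)
mem.flip_bits(0, 3, 0b11)           # multi-bit upset in the primary copy
print(hex(mem.read(3)))             # 0xdeadbeef, served from the mirror
```

Only when both copies of the same word are hit before a read or scrub repairs one of them does the read fail, which is the case where the node would have to be taken out of service.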

If a node somehow still can't recover, it is taken out of action, rebooted, and resynchronized with the other nodes. Alternatively, with QPI Viral Mode, Corrupt Data Containment Mode, and Electronically Isolated Partitioning (see the PDF for more), recovery can be performed without rebooting the system, provided the software is properly designed to take advantage of these features.

/r/spacex Thread