How to detect and monitor for ECC RAM errors on Debian Linux.
ECC RAM can detect and correct simple bit-flip errors, such as those caused by cosmic radiation.
It is useful to be able to detect and monitor for ECC errors.
Linux supports logging ECC errors via the Error Detection and Correction (EDAC) drivers.
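As a rough illustration of what EDAC exposes: the kernel's EDAC sysfs interface keeps per-memory-controller counters of corrected (CE) and uncorrected (UE) errors in the files ce_count and ue_count under /sys/devices/system/edac/mc/. The sketch below assumes that layout and that an EDAC driver for your memory controller is loaded. The function name edac_counts and its optional directory argument are made up for this example.

```shell
#!/bin/sh
# Print corrected (CE) and uncorrected (UE) error counts per memory controller,
# read from the EDAC sysfs tree.
# $1 (optional, hypothetical): alternative base directory, handy for testing.
edac_counts() {
    base="${1:-/sys/devices/system/edac/mc}"
    for mc in "$base"/mc*; do
        # Skip when no controller is registered (the glob stays unexpanded).
        [ -r "$mc/ce_count" ] || continue
        printf '%s CE=%s UE=%s\n' "${mc##*/}" \
            "$(cat "$mc/ce_count")" "$(cat "$mc/ue_count")"
    done
}
```

Calling edac_counts with no argument reads the live counters. If it prints nothing, no EDAC memory controller is registered on the machine.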
In Debian, the following tools are available:
- edac-utils (although it is mostly deprecated)
- rasdaemon
rasdaemon monitors for and reports ECC errors.
ECC error messages typically include:
- Memory address at which the error was detected
- DRAM address (column, row, rank, bank)
- Channel
- DIMM slot number
The amount of detail varies significantly between CPU models and memory controllers.
Installing rasdaemon on Debian
apt install rasdaemon
Querying for ECC errors
~ # ras-mc-ctl --summary
No Memory errors.
No PCIe AER errors.
No Extlog errors.
No MCE errors.
~ # ras-mc-ctl --error-count
Label                   CE      UE
mc#0csrow#1channel#0    0       0
mc#0csrow#0channel#1    0       0
mc#0csrow#3channel#0    0       0
mc#0csrow#2channel#0    0       0
mc#0csrow#0channel#0    0       0
mc#0csrow#3channel#1    0       0
mc#0csrow#2channel#1    0       0
mc#0csrow#1channel#1    0       0
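
For unattended monitoring, the per-row counts above can be reduced to two totals: CE (corrected errors) and UE (uncorrected errors). The following is a minimal sketch, assuming the three-column "Label CE UE" layout shown above with a single header line. The helper names sum_ecc_errors and ecc_is_clean are invented for this example and are not part of rasdaemon.

```shell
#!/bin/sh
# Sum the CE and UE columns of `ras-mc-ctl --error-count` output read on stdin.
# Assumes the last two fields of each row are CE and UE; the header line is skipped.
sum_ecc_errors() {
    awk 'NR > 1 && NF >= 3 { ce += $(NF-1); ue += $NF }
         END { printf "%d %d\n", ce, ue }'
}

# Succeed (exit status 0) only if both totals are zero.
ecc_is_clean() {
    [ "$(sum_ecc_errors)" = "0 0" ]
}

# Possible use from a cron job on a host with rasdaemon installed:
#   ras-mc-ctl --error-count | ecc_is_clean || echo "ECC errors detected" >&2
```

Note that empty input also sums to "0 0", so this check cannot tell "no errors" apart from "no memory controller reported". Pair it with a check that the EDAC driver is actually loaded.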