[Gene] has a project that writes a lot of settings to a PIC microcontroller’s Flash memory. Flash only survives a limited number of erase/write cycles, and although the resulting bit errors can be mitigated with error correction codes, it’s a good idea to figure out how Flash fails before picking a particular ECC. That turned the project into a matter of banging on PICs until they puked, and mapping out the failure pattern of the Flash memory in these chips.
The chip on the chopping block for this experiment was a PIC32MX150, with 128K of NOR Flash and 3K of extra Flash for a bootloader. There’s hardware support for erasing all the Flash, erasing one page, programming one row, and programming one word. Because [Gene] expected that a bit might work again after it had failed, and vice versa, the testing protocol used RAM buffers to compare the last state and the new state of each bit tested in the Flash. 2K of RAM was tested at a time, with a total of 16K of Flash testable. The code basically cycles through a loop that erases all the pages (which should set all bits to ‘1’), reads the pages to check that all bits are ‘1’, writes ‘0’ to all pages, and reads the pages again to check that all bits are ‘0’. The output of the test was a 4.6 GB text file that looked something like this:
Pass 723466, frame 0, offset 00000000, time 908b5feb, errors 824483
ERROR: (E) offset 0000001E read FFFFFFFB desired FFFFFF7B.
ERROR: (E) offset 00000046 read FFFFFFFF desired 7FFFFFFF.
ERROR: (E) offset 00000084 read EFFFFFFF desired FFFFFFFF.
ERROR: (E) offset 0000008E read FFEFFFFF desired FFFFFFFF.
ERROR: (E) offset 000000B7 read FFFFFFDF desired FFFFFFFF.
ERROR: (E) offset 000000C4 read FFFBFFFF desired FFFFFFFF.
ERROR: (E) offset 000001B8 read FF7FFFFF desired 7F7FFFFF.
ERROR: (E) offset 000001BE read 7FFFFFFF desired FFFFFFFF.
ERROR: (E) offset 000001D2 read FFFFFF7F desired FFFFFFFF.
Pass 723467, frame 0, offset 00000000, time 90aea31f, errors 824492
ERROR: (E) offset 00000046 read 7FFFFFFF desired FFFFFFFF.
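To make the loop a little more concrete, here is a minimal Python sketch that runs the same erase, check, write, check cycle against a simulated block of Flash. This is not [Gene]’s firmware: `SimulatedFlash`, `stuck_bits`, `check`, `run_pass` and the `W` phase label are all hypothetical names made up for illustration, and the failure is injected by hand rather than modeled on real silicon. It also assumes one plausible reading of the log, namely that `(E)` marks the post-erase check and that `desired` is the state remembered from the previous pass.

```python
WORD_MASK = 0xFFFFFFFF
N_WORDS = 512  # 2K bytes viewed as 32-bit words (assumption)


class SimulatedFlash:
    """Toy stand-in for a block of NOR Flash, not a model of a real PIC32."""

    def __init__(self, n_words):
        self.words = [WORD_MASK] * n_words
        # Hypothetical fault model: word offset -> mask of bits that
        # refuse to return to 1 on erase.
        self.stuck_bits = {}

    def erase(self):
        # An erase should set every bit to 1; stuck bits stay at 0.
        self.words = [WORD_MASK & ~self.stuck_bits.get(i, 0)
                      for i in range(len(self.words))]

    def program_zero(self):
        # Programming can only clear bits (1 -> 0).
        self.words = [0] * len(self.words)


def check(flash, last_state, phase):
    """Report every word that differs from the state seen on the last pass."""
    reports = []
    for offset, read in enumerate(flash.words):
        if read != last_state[offset]:
            reports.append((phase, offset, read, last_state[offset]))
            last_state[offset] = read  # remember the new state
    return reports


def run_pass(flash, last_erased, last_written):
    """One erase/check/write/check cycle; returns the list of reports."""
    flash.erase()
    reports = check(flash, last_erased, "E")
    flash.program_zero()
    reports += check(flash, last_written, "W")
    return reports


flash = SimulatedFlash(N_WORDS)
last_erased = [WORD_MASK] * N_WORDS   # RAM buffer: expected erased state
last_written = [0] * N_WORDS          # RAM buffer: expected written state

first = run_pass(flash, last_erased, last_written)    # healthy pass
flash.stuck_bits[0x46] = 0x80000000                   # inject a failed bit
second = run_pass(flash, last_erased, last_written)   # bit fails to erase
flash.stuck_bits.clear()                              # the bit "recovers"
third = run_pass(flash, last_erased, last_written)    # change reported again

for phase, offset, read, desired in second + third:
    print(f"ERROR: ({phase}) offset {offset:08X} "
          f"read {read:08X} desired {desired:08X}.")
```

Keeping the last state in a buffer means a bit that fails and later recovers is reported twice, once in each direction, which is the behavior the RAM buffers were meant to capture.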
The hypothesis tested in this experiment was, “each bit is independently likely to fail, with exponential dependence on number of erase/write cycles”. A number of interesting observations led [Gene] to reject this hypothesis: there were millions of instances where an erase did not reset a bit to ‘1’, but none where a write did not change a ‘1’ bit to ‘0’. That’s great for developing an error correction code scheme, since the code only has to cope with errors in one direction.
There was also a bias in which bits in a word produced errors – bits 31 and 32 were slightly more likely to have an error versus other bits in a word. The most inexplicable finding was a bias in the number of failures per row:
A row of Flash in a PIC is 128 bytes, and if all rows were equally likely to produce an error, the above graph would be a little more boring. It’s odd, and [Gene] has no idea how to interpret that data. Only decapping one of these PICs and looking at it with a microscope will tell anyone why this is the case.
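For anyone who wants to run this kind of per-bit and per-row tally on a log in the format shown earlier, here is a rough Python sketch. It is not [Gene]’s analysis code. It assumes that `offset` counts 32-bit words (so a 128-byte row holds 32 of them), that the bits that changed are the XOR of the `read` and `desired` values, and that bits are numbered 0 to 31 from the least significant end. `tally` and `ERROR_RE` are names invented here.

```python
import re
from collections import Counter

# Matches lines such as:
# ERROR: (E) offset 0000001E read FFFFFFFB desired FFFFFF7B.
ERROR_RE = re.compile(
    r"ERROR: \((\w)\) offset ([0-9A-Fa-f]{8}) "
    r"read ([0-9A-Fa-f]{8}) desired ([0-9A-Fa-f]{8})"
)

WORDS_PER_ROW = 128 // 4  # 128-byte row, 4-byte words (offset unit assumed)


def tally(log_text):
    """Count changed bits by bit position and by Flash row."""
    by_bit = Counter()
    by_row = Counter()
    for _phase, offset, read, desired in ERROR_RE.findall(log_text):
        word_offset = int(offset, 16)
        changed = int(read, 16) ^ int(desired, 16)  # 1 where the bits differ
        for bit in range(32):
            if (changed >> bit) & 1:
                by_bit[bit] += 1
                by_row[word_offset // WORDS_PER_ROW] += 1
    return by_bit, by_row


sample = (
    "ERROR: (E) offset 0000001E read FFFFFFFB desired FFFFFF7B.\n"
    "ERROR: (E) offset 00000046 read FFFFFFFF desired 7FFFFFFF.\n"
    "ERROR: (E) offset 00000084 read EFFFFFFF desired FFFFFFFF.\n"
)
by_bit, by_row = tally(sample)
print(sorted(by_bit.items()))  # [(7, 1), (28, 1), (31, 1)]
print(sorted(by_row.items()))  # [(0, 1), (2, 1), (4, 1)]
```

For a 4.6 GB file, the same regular expression could be applied line by line while iterating over the open file instead of reading everything into memory at once.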
At the very least, Microchip is severely underrating the number of Flash erase/write cycles on this microcontroller; this chip was rated for 20,000 cycles, and the very first failed bit happened on cycle 229,038. In a separate run, the first failure was around cycle 400,000.
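As a quick back-of-the-envelope check, the margin over the rated figure can be worked out from the numbers above. The variable names are just for illustration, and with only two runs this is a pair of data points rather than a statistic.

```python
RATED_CYCLES = 20_000

# First observed failure in each run, as reported above.
first_failures = {
    "first run": 229_038,
    "second run (approximate)": 400_000,
}

# Ratio of the observed first-failure cycle to the rated cycle count.
margins = {run: cycles / RATED_CYCLES for run, cycles in first_failures.items()}

for run, margin in margins.items():
    print(f"{run}: first failure at about {margin:.1f}x the rated cycle count")
```

That works out to roughly 11 times the rating for the first run and about 20 times for the second.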