Samsung’s Leap Month Bug Teaches Not To Skimp On Testing

Date and time handling is hard. That’s an ugly truth about software development we’ll all learn the hard way one day. Sure, it might seem like some trivial everyday thing that you can easily implement yourself without relying on a third-party library. I mean, it’s basically just adding seconds on top of one another, rolling them over to minutes, and from there rolling on to hours, days, months, and finally years. Throw in the occasional extra day every fourth February, and you’re good to go, right?

Well, obviously not. Assuming you thought about leap years in the first place (which sadly isn’t a given), there are a few exceptions that, for instance, cause the years 1900 and 2100 to be regular years, while the year 2000 was still a leap year. And then there are leap seconds, which occur irregularly. But there are still more gotchas lying in wait. Case in point: back in May, faulty handling of a lunar leap month in the Chinese calendar turned Samsung phones all over China into bricks. And while you may not plan to ever add support for non-Gregorian calendars to your own project, it’s just one more example of unanticipated peculiarities gone wild. Except that Samsung actually got the calendar logic right here.
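As a quick aside, the Gregorian exception mentioned above fits in a single expression. The following is just a minimal sketch with a class name of my own choosing; in real code, java.time.Year.isLeap already does this for you, which is rather the point of not rolling your own.

```java
final class GregorianLeapYear {
    private GregorianLeapYear() {}

    /** Divisible by 4, except century years, unless they are also divisible by 400. */
    static boolean isLeapYear(int year) {
        return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
    }

    public static void main(String[] args) {
        // The years named in the text, plus one ordinary leap year for comparison.
        for (int year : new int[] {1900, 2000, 2020, 2100}) {
            System.out.println(year + " leap? " + isLeapYear(year));
        }
    }
}
```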

So what happened?

A Tale Of Many Calendars

Apart from the Gregorian (no relation), a variety of calendars have been in existence since the literal invention of time, and are generally separated into three categories:

  • solar calendars that are entirely based on the sun’s annual cycle, like the Gregorian calendar
  • lunar calendars that are entirely based on the moon’s monthly cycle, like the Islamic calendar
  • lunisolar calendars that are based on the moon’s monthly cycle, but aligned within the sun’s annual cycle, like most historical calendars, including the traditional Chinese calendar

Considering the (roughly) extra quarter day of a solar year that we accumulate and squeeze into February every four years to keep the seasons from drifting too far away, things are certainly a tad more complex when we want to do the same with the ~29.5-day cycles of the moon. It just doesn’t add up nicely, and leap months are the result in lunisolar calendars, adding an entire extra month to the year every now and then. Purely lunar calendars on the other hand don’t care about the sun at all, and therefore ignore any corrections to keep the seasons in sync.
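To put rough numbers on “doesn’t add up nicely”, here is a back-of-the-envelope sketch. The two constants are commonly quoted mean values that I’m adding for illustration, and the class and method names are made up. It only shows the average drift; real lunisolar calendars, the Chinese one included, place their leap months by astronomical rules rather than by a fixed interval.

```java
final class LunisolarDrift {
    private LunisolarDrift() {}

    // Mean lengths in days; approximate values assumed for this sketch.
    static final double SYNODIC_MONTH = 29.530588;
    static final double TROPICAL_YEAR = 365.2422;

    /** Days by which twelve lunar months fall short of one solar year. */
    static double yearlyShortfall() {
        return TROPICAL_YEAR - 12 * SYNODIC_MONTH;
    }

    /** Years until the accumulated shortfall amounts to one whole lunar month. */
    static double yearsPerLeapMonth() {
        return SYNODIC_MONTH / yearlyShortfall();
    }

    public static void main(String[] args) {
        System.out.printf("12 lunar months: %.2f days%n", 12 * SYNODIC_MONTH);
        System.out.printf("shortfall per solar year: %.2f days%n", yearlyShortfall());
        System.out.printf("one leap month roughly every %.2f years%n", yearsPerLeapMonth());
    }
}
```

In other words, twelve moon cycles come up about eleven days short of a solar year, so a bit more often than every third year an extra month is due.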

If you ever wondered why Chinese New Year or Easter doesn’t have a fixed date, but is still around the same period within the year, lunisolar calendars are your answer — just like lunar calendars explain why Ramadan is always at a completely different time each year.

Bugs And Stones May Brick Your Phones

So what happened? [Gao Ye] took a close look at the issue and wrote an extensive description of the findings. The write-up is in Chinese, and the page doesn’t seem to like being loaded through Google Translate, so in case your browser won’t translate it for you, here’s a PDF of it. Well, let’s have a closer look, and add a few details in case you’re not too familiar with Android.

As the write-up shows, the issue itself originated in the Always-On Display (AOD) service, a feature some Android devices provide to display basic information on the screen while the phone is otherwise asleep — like the current time and date. In case of the affected Samsung phones, this was the date according to the Chinese calendar. Come midnight, the leap month began and the AOD service unexpectedly crashed.

Since the AOD is a system service, Android simply restarted it, and to no surprise, that didn’t fix the problem, causing an immediate crash again. After enough rounds of rinse and repeat, the operating system eventually decided that one of its crucial components was unable to run, and as mitigation shut down and rebooted into its recovery mode. Had this happened in a regular app, the crash would have been more of a nuisance, as the app would just keep on crashing, but this way, the phone was essentially bricked.
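The restart loop can be pictured with a toy model like the one below. To be clear, this is not how Android implements it: the class, the enum, and the restart limit are all hypothetical, and only the overall shape (restart, crash again, give up, fall back to recovery) comes from the description above.

```java
final class CrashLoopModel {
    private CrashLoopModel() {}

    enum State { RUNNING, RECOVERY_MODE }

    /** Hypothetical limit; the threshold the OS really uses is not part of the analysis. */
    static final int MAX_RESTARTS = 5;

    /** Starts the service, restarts it after every crash, and gives up after MAX_RESTARTS attempts. */
    static State supervise(Runnable service) {
        for (int attempt = 1; attempt <= MAX_RESTARTS; attempt++) {
            try {
                service.run();
                return State.RUNNING;
            } catch (Throwable crash) {
                // A restart cannot help while the cause (here: today's date) stays the same,
                // so the loop simply moves on to the next attempt.
            }
        }
        return State.RECOVERY_MODE;
    }
}
```

A service that throws on every start, as the AOD service did once the leap month had begun, ends up in RECOVERY_MODE no matter how high the limit is set.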

After all the talk about home-brew date-and-time implementations in the beginning, the obvious assumption would be that a flaw in the leap month handling caused this crash. However, based on the code [Gao Ye] dug out, the logic actually worked out here:

chinaLunar = (sSolarLunarConverter.isLeapMonth()
        ? context.getResources().getString(R.string.common_date_leap_month) + months[convertMonth]
        : months[convertMonth])
    + days[convertDay];

As complex as handling (luni)solar calendars may seem, the Chinese calendar has a pragmatic solution for naming leap months: it simply prefixes the equivalent of the word “leap” to the name of the month the leap month follows. This year, the leap month falls after the fourth month, so the regular “fourth month” is followed by the “leap fourth month”. The code reflects that logic.

Looking at the stack trace in the analysis, the crash was actually caused by a failure to retrieve the string resource for that “leap” prefix, common_date_leap_month. While that explains the issue on a high level, it actually raises more questions than it answers, primarily how the service got into such a situation in the first place. To make sense of it, let’s have a look at how Android handles strings and other resources.

Android Resource Handling

Along with arrays, integers, anything related to the UI layout, and pretty much everything else that isn’t hardcoded, strings are defined in XML resource files that are separated from the app’s Java (or Kotlin) source code. Each resource type has a common file used for default values, and additional files to optionally override these defaults with values more specific to things like different UI layouts, device features, or locales.

As an example, a string could be defined with English as default option in a res/values/strings.xml file like this:

<string name="tongue_twister">Peter Piper picked a peck of pickled peppers.</string>

And its localized German counterpart could be defined the same way in res/values-de/strings.xml (note the language code in the values path):

<string name="tongue_twister">Fischers Fritz fischt frische Fische.</string>

During the build process, the entire collection of resources is gathered and compiled into a single resource object file, the some.package.R file. The intermediate Java file generated for our example would then contain the tongue twister string like this:

package some.package;

public final class R {
  ...
  public static final class string {
    ...
    public static final int tongue_twister=0x7f0d001d;
    ...
  }
  ...
}

As you can see, this is neither an actual string, nor does it contain information regarding translations, but is just one integer serving as a reference. In the app code, we would simply access the string as R.string.tongue_twister and the system will fill in the rest at runtime. Depending on the device’s settings, one of three things will happen:

  1. the translated string is shown if the device settings ask for it and a language-specific string exists
  2. the default string is displayed if there’s no language-specific translation available (or needed)
  3. a NoSuchFieldError is thrown if there’s no reference for the resource at all
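The three outcomes can be modelled in plain Java along these lines. This is a deliberately simplified sketch and not Android’s actual Resources implementation: the class and its methods are invented, and a NoSuchElementException merely stands in for the third case, since a real NoSuchFieldError is raised by the runtime when the R class itself lacks the field, before any lookup takes place.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

final class ResourceTableModel {
    /** Key for the default entry, i.e. what res/values/strings.xml would provide. */
    static final String DEFAULT = "";

    // resource id -> (language code -> text)
    private final Map<Integer, Map<String, String>> table = new HashMap<>();

    void put(int id, String language, String text) {
        table.computeIfAbsent(id, key -> new HashMap<>()).put(language, text);
    }

    String getString(int id, String language) {
        Map<String, String> variants = table.get(id);
        // Case 1: use the translation. Case 2: fall back to the default entry.
        String text = (variants == null)
                ? null
                : variants.getOrDefault(language, variants.get(DEFAULT));
        if (text == null) {
            // Case 3 (stand-in): nothing usable is known for this id.
            throw new NoSuchElementException("no resource for id 0x" + Integer.toHexString(id));
        }
        return text;
    }
}
```

With the two tongue twisters registered under the same id, asking for "de" returns the German one, asking for "fr" falls back to the English default, and asking for an unknown id throws.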

Getting back to the stack trace in [Gao Ye]’s analysis, the third case is exactly what happens in the affected devices for the common_date_leap_month string resource. Meaning, the code uses that string resource, but the executable running on the device doesn’t contain a valid entry for it.

You might be thinking now that this is something a compiler could and should already catch — and you are absolutely right. This is where the issue becomes both interesting and sobering.

Failing To Fail

Usually, using a non-existing R.string.something resource value will indeed cause an error, just like trying to access any other non-existing field or method would. It’s the same way a C compiler would throw an error about an undeclared identifier, or warn about an implicitly declared function and have the linker throw an error in the end.

Some time around October 2018, the string entry in question went missing from the resources, but remained in use inside the source code — i.e. the code wasn’t changed, but the string value it uses was removed from the XML file. At this point, compilation should have failed, and yet it didn’t. [Gao Ye] concludes that a version mismatch in the packages and the calendar’s dependency chain might cause a situation where the build system is satisfied and happily compiles, but the app then fails during runtime.

Checking the occurrences of common_date_leap_month within the entire app (both resources and source code) shows that in the correct version, the resource is available in two different packages:

  1. com.samsung.android.app.aodservice
  2. com.samsung.android.uniform

The second package also contains the logic for the leap month handling shown earlier, and is the origin of the NoSuchFieldError causing the crash. Now, I don’t have the source code at my disposal, nor do I have details on the build process, so I cannot say for sure, but it seems plausible to me that the build can still succeed after the resource entry has been removed from both packages.

Let’s assume a scenario where the second, uniform package is compiled before recompiling the first, aodservice package: while the uniform part won’t find the common_date_leap_month resource in its own package, thanks to caching and other mechanisms found in complex build systems to speed up the build time, it can succeed in finding a reference in the old aodservice part. Since the aodservice package on the other hand doesn’t use common_date_leap_month itself, recompiling it afterwards won’t mind that it’s missing now.

In other words, both parts independently are correct at the time of compilation, but once they are packaged together and installed on a device, the missing resource has become a ticking time bomb that has now gone off.

The good news is that the original issue was already resolved in June 2019, so the impact wasn’t as bad as it could have been. The bad news is that people were still affected, implying that they either didn’t, or couldn’t update their phones in the 9+ months since the fix. Without further details, it’s difficult to say how the string resources went missing in the first place.

In the end, the entire issue was essentially a wild series of unfortunate events, and despite people’s opinions on Java or smartphone vendors, asking “who’s to blame?” won’t have a simple answer (not that such an answer would be of much use anyway). What we should focus on is the fact that it happened in the first place, which shows once again that software is hard, and that nothing is ever as clear and obvious as one might think or hope.

While it won’t have a simple answer either, the better question is: what can we learn from this?

Lessons To Learn

After I read through the bug analysis, I started to wonder how an error like this could have been prevented. Best case, it happens in a common enough place, and a quick check to verify one’s changes would detect it right away, but that’s unlikely if it requires an out-of-the-ordinary situation like a leap month. Automated testing comes to mind, and considering the article’s title, it’s clearly where I’m trying to get with this. Not to say that Samsung didn’t have proper testing in place, I’m sure they did, at least to a certain extent.

Would I have written a test case myself to check if the string resource existed? No, not for the sole purpose of checking its existence. The issue at hand doesn’t change that either, no matter how obvious the error is in hindsight. And that’s the thing, this specific error is obvious now, but the next case wouldn’t be. Essentially, we’d have to verify every single resource value to be sure we’d catch a similar scenario in the future, and that’s certainly a step in the wrong direction. We might as well start verifying that constants have the value we assigned to them.

Obviously, we’d need a better way. The usual choice would be unit tests, since that’s pretty much all the testing the average developer would want to undertake anyway, on a good day. Comparing the strings returned by the calendar method against expected values for a selected set of dates would be a textbook test case. If leap months are included in those dates, that test might have even crashed the same way the devices did in the wild, and detected the issue early enough. That is, unless the test is run within the uniform component and uses the same outdated aodservice part, which brings us back to square one.
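Such a textbook test could look roughly like the following sketch. Everything in it is an assumption on my part: the class, the formatter interface, the English labels, and the stand-in formatter that imitates the bug are all invented, and I don’t know what Samsung’s actual tests look like. The two dates are, to my understanding, the first day of the regular fourth month and of the leap fourth month in 2020. The one detail worth noting is that the harness catches LinkageError, the parent of NoSuchFieldError, so a crash is reported as a failure instead of taking the whole test run down.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class LunarLabelCheck {
    private LunarLabelCheck() {}

    /** Runs the formatter on every date and collects mismatches as well as crashes. */
    static List<String> failures(Function<LocalDate, String> formatter,
                                 Map<LocalDate, String> expected) {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<LocalDate, String> entry : expected.entrySet()) {
            try {
                String actual = formatter.apply(entry.getKey());
                if (!entry.getValue().equals(actual)) {
                    failures.add(entry.getKey() + ": expected " + entry.getValue() + " but got " + actual);
                }
            } catch (RuntimeException | LinkageError crash) {
                // NoSuchFieldError is a LinkageError, so it is recorded here as well.
                failures.add(entry.getKey() + ": crashed with " + crash);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        Map<LocalDate, String> expected = new LinkedHashMap<>();
        expected.put(LocalDate.of(2020, 4, 23), "fourth month, day 1");
        expected.put(LocalDate.of(2020, 5, 23), "leap fourth month, day 1");

        // Hypothetical stand-in for the real formatter: fine in a regular month,
        // but it blows up as soon as the leap month prefix is needed.
        Function<LocalDate, String> buggy = date -> {
            if (date.isBefore(LocalDate.of(2020, 5, 23))) {
                return "fourth month, day 1";
            }
            throw new NoSuchFieldError("common_date_leap_month");
        };

        failures(buggy, expected).forEach(System.out::println);
    }
}
```

Whether a check like this catches the real bug still depends on running it against the same binaries that end up on the device.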

Still, unit testing is a better solution than checking if resource values exist, and while the calendar bug could still slip through, it’s a solid test to have. Of course, running that kind of unit test using the final, combined application would be the best solution — but then again, it’s easy to say that now when we know what we’re looking for. There is no universal answer here except that testing is a complex subject and requires an entirely different mindset compared to development.

The problem is that as developers, we like to convince ourselves that our code is flawless, so writing unbiased test cases can prove difficult. Not to mention the general attitude towards testing, which alongside code documentation and error handling forms the holy trinity of annoyances people mainly do to keep others off their back, instead of utilizing it for their own benefit.

The general discrepancy between developers and testers doesn’t help either. We see testers as hostile killjoys who try to break our creations and are convinced we screwed up somewhere, making it their mission to prove it, and I’m sure the aversion goes both ways here. Sure, in a sense, testers and developers do work against each other, but that doesn’t mean it can’t happen in a collaborative way. Instead of trying to prove who is to blame, working out well-thought-out test cases together, and learning from each other along the way, will certainly bear more fruitful results.

If anything, the Samsung issue should show us that bugs are lurking in the strangest of places, and tackling them might require some thinking outside the box. Combining the perspectives of those who build the software and those who are determined to break it might spark inspiration and help to consider angles neither might have seen on their own. As with error handling, instead of thinking “this should never happen”, we might be better off to think about ways to add an extra check or test case to make sure it really won’t — and stay a step ahead for the day when it does happen.

23 thoughts on “Samsung’s Leap Month Bug Teaches Not To Skimp On Testing”

    1. Pfft lol. I’m sure getting tired of like every year being declared the worst one, even though there’s definitely some accuracy in the statement. It’s just like… okay, we’ve been having progressively more awful years for a while now and they will probably just continue as the climate begins to fall apart and also the US loses superpower status and becomes more and more of a post-Soviet failed state. But all we can do is go online and post “omg worst year ever right?”

      Maybe we should do-over our political and economic systems.

  1. “BUGS AND STONES MAY BRICK YOUR PHONES“
    But sticks will never find them.

    I certainly hope that we are over black box testing. Any testing should be grey box or white box. If you are relying on black box testing you may as well just be poking at things with a stick.

    1. Also the idea that any modern software house really gives any significant amount of a damn about finding and fixing bugs. Nah, they’re too busy trying to push new features to monetize. Maintenance and quality are not profitable or marketable enough. Nearly all software today is so low-quality it’s pathetic.

  2. Rather than trying to avoid all possibility of this kind of error, surely it would be more sensible to design the system as a whole not to fully crash just because the date displaying code does. Something like this should have been an app, not an OS service. And the OS should have been set up to recognise that if this service crashed it should just try to run like normal without it rather than completely keeling over.

    1. You’d think that with something named after a Pope it would be a fair bet that nobody could be a direct relation, but Gregory XIII didn’t keep it in his pants and actually had a son, who went on to have 14 kids. Then by the time you’ve gone back 4 centuries on a family tree it seems to be crawling with notables. So I’d only give anyone with any Italian blood evens on not being related to him.

      1. I characterise those as more like failed conversion therapies. younger proto-perverts confess dark desires to local priest, priest recommends prayer, or if sexual desires cannot conform to social norms then devoting one’s life to god, then of course it’s ineffectively pushed down in the psyche, and comes exploding out again a few years later.

  3. Android on top of Java on top of C on top of assembly … what could possibly go wrong. I’m amazed that they actually found out what the problem is.

    Too much complexity breeds bugs.

    1. Phones do complex jobs, there’s not much of an alternative. If you write the entire UI in assembly or something you will get waaaaay more bugs, and probably exciting fun security problems too, and it’ll be much harder to work on.

  4. And now imagine the poor phone owner in China, at the moment, where you’re less than nothing if s/he can’t display a Covid-19 app saying “all green, I’m OK”.

    Shudder.

    1. Uh, yeah, no. Samsung is a Korean company which to date hasn’t chosen any government, and the poor phone owner in China didn’t choose his government either. It’s so cute that you think people choose their governments.

  5. And that it somehow got to be a system app so could brick the phone instead of having maybe a samcorp user separate from the user, but if they failed they wouldn’t cause the system to self-destruct.

    So much easier to run everything as root, those pesky permission problems all go away.

  6. And I thought countries using wacky measurement units different from the rest of the world was stubborn… but using a different calendar… wow! Must be a lot of work calculating that stuff back and forth all the time when developing for international markets. On the other hand, the Gregorian calendar doesn’t feel like a particularly clean, straightforward solution to the problem either. Perhaps there is some country using a better system and we should adopt that instead.

    I wonder if there’s countries using their own measurements for the time of day?

  7. I’d assume the phones that got bricked were older models that hadn’t seen an Android version update for quite a while. Did the bug get fixed for older Android versions on older phones or did they just tell all those people to chuck it and buy a new phone?
