Fail Of The Week: Roboracer Meets Wall

There comes a moment when our project sees the light of day, publicly presented to people who are curious to see the results of all our hard work, only for it to fail in a spectacularly embarrassing way. This is the dreaded “Demo Curse” and it recently befell the SIT Acronis Autonomous team. Their Roborace car gained social media infamy as it was seen launching off the starting line and immediately into a wall. A team member explained what happened.

A few explanations had started circulating, but only in the vague terms of a “steering lock” without much technical detail until this emerged. Steering lock? You mean like The Club? Well, sort of. While there was no steering wheel immobilization steel bar on the car, a software equivalent did take hold within the car’s systems. During initialization, while a human driver was at the controls, one of the modules sent out NaN (Not a Number) instead of a valid numeric value. This was never seen in testing, and it wreaked havoc at the worst possible time.

A module whose job was to ensure numbers stay within expected bounds said “not a number, not my problem!” That NaN value propagated through to the vehicle’s CAN data bus, which didn’t define the handling of NaN, so it was arbitrarily translated into a very large number, causing further problems. This cascade of events resulted in a steering control system locked to full right before the algorithm was given permission to start driving. It desperately tried to steer the car back on course, without effect, for the few short seconds until it met the wall.
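The failure mode described above is easy to reproduce: in IEEE 754 arithmetic, every ordered comparison with NaN evaluates to false, so a bounds check written as a pair of comparisons silently passes NaN through. Here is a minimal C sketch of the idea; the function names are illustrative, not the team’s actual code:

```c
#include <math.h>

/* A naive clamp: every ordered comparison with NaN is false, so a NaN
 * input falls through both branches and is returned unchanged --
 * "not a number, not my problem!" */
double clamp_naive(double v, double lo, double hi) {
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;   /* NaN ends up here */
}

/* A defensive clamp that rejects NaN explicitly and substitutes a
 * safe neutral value instead of passing the poison along. */
double clamp_safe(double v, double lo, double hi, double fallback) {
    if (isnan(v)) return fallback;
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}
```

`clamp_naive(nan(""), -1.0, 1.0)` hands the NaN straight through, while `clamp_safe` substitutes the fallback; the entire difference is one `isnan()` call at the boundary.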

While embarrassing and not the kind of publicity the Schaffhausen Institute of Technology or their sponsor Acronis was hoping for, the team dug through logs to understand what happened and taught their car to handle NaN properly. Driving a backup car, round two went very well and the team took second place. So they had a happy ending after all. Congratulations! We’re very happy this problem was found and fixed on a closed track and not on public roads.

[via Engadget]

52 thoughts on “Fail Of The Week: Roboracer Meets Wall”

  1. And this is precisely the reason that self driving cars should never be allowed on public roads.

    A coding error buried so deep that it is not until a crash into a wall (or a dead pedestrian) that it is even realised there is a problem. Law of averages says you won’t be that dead pedestrian. So sorry you were.

    1. And that brings up the standard ethics question of whether that “one” dead pedestrian is “worth it” if finding that coding error enables self-driving cars to improve. It’s not hard to imagine that with such constant improvements, almost all such deadly incidents will soon be eliminated. Following that, the self-driving cars have many fewer such incidents than with human drivers. (Obviously, those 10,000 *annual* deaths due to drunk driving are all but completely eliminated.)

      But it IS an ethics question of whether it’s appropriate to pursue this end. :-)

      1. Our software has been “constantly improving” for 70 years now, and it’s still full of errors.

        I say every creator of a self driving car must sit in his creation and be driven around by it for 1 year.

        Those who survived will be few.

        1. I have a couple of responses to this:

          1) “Our software has been “constantly improving” for 70 years now, and it’s still full of errors.”
          Sorry – I don’t think there’s actually any software from 1950 that’s still in use and being actively debugged. If you’re implying that new software IN GENERAL is always being developed and debugged, this is true, but this is NEW software that might have bugs.

          Also, consider that the “self-driving code” doesn’t have to be 100% bug-free to be essentially a flawless driver. If the software makes a decision that causes the car to begin to accelerate, but then, 10 milliseconds later, realizes it made a mistake (for whatever reason) and begins to apply the brakes, then there has very likely been no real-world harm. Meanwhile, the software will flag that “exceptional incident” to go back to the development group for review and appropriate remediation. And thus a true bug causes no harm, and gets fixed anyway.

          2) “I say every creator of a self driving car must sit in his creation and be driven around by it for 1 year.
          Those who survived will be few.”

          Interesting. I wonder if the same logic should compel every single driving instructor to be driven around for a year by his/her students.

          As to the “Those who survived will be few” comment, I suspect you don’t have intimate knowledge of the state of affairs of self-driving. There are a number of companies (at least 4 that I’m familiar with) that have self-driving code that works exceptionally well.

          And, to address the hyperbole, the *vast* majority of car accidents do not result in death or even severe injury. :-)

          1. By “our software” he was probably referring to our squishy wetware up there…

            (BTW, “I don’t think there’s actually any software from 1950 that’s still in use and being actively debugged” — tell that to the COBOL folks from the banking industry :-P)

          2. @victroniko
            Some interesting COBOL was found during the search for Y2K bugs in an unnamed English bank, which was using pounds, shillings and pence as the fundamental units in its system; that code had been patched in 1971, when the UK decimalised, to handle the new money.
            (FYI: There are 12 pence in a shilling, and 20 shillings or 240 pence in a pound.)

            So oddball code can hang around for nearly 30 years.

          3. to answer the comments above – I’ve worked on software that was written in the 60’s (COBOL), only a couple of years ago. ie it was over 50 years old…

            But the bigger problem is 100% of drivers think they are above average.

      1. At least in the software case you can patch out the bug across the entire fleet once it’s discovered. Humans just keep failing in the same damn ways and nobody does anything about it.

    2. Self-driving cars would make more sense on dedicated roads with excellent pavement markings. I’m not comfortable in the city. I’m sure time will lead to improvements, and if it helps avoid deaths due to drunk driving, I’m for it.

    3. I agree that a self-driving vehicle will (in the future sometime) outperform a human, but who takes liability?
      Is it the driver, who had nothing to do with writing the code?
      Is it the manufacturer, who isn’t able to control the fact mud splashed up onto the sensor?
      Is it the software engineer who signed off on the code, but has since moved on (or been run over by his own ‘approved’ code!)?

      I think it is unrealistic to expect that the driver “must always be *fully* ready to take over” (as all autopilot car manufacturers are now adopting) when the autopilot throws a fit.
      What is the point of having a self-driving car?

      Well, the reason is legal liability: to shift the blame from the manufacturer/designer to someone else.

        1. How has that worked out for safer forms of travel, like air and train? Nope. Liability is important and the software industry, which is protected from it, makes lousy products as a result.

      1. Whenever you own and drive a car you are responsible for the risk that poses; that’s the default position. This changes when the manufacturer takes control away from you without your consent, e.g. a software bug occurs and the brake pedal is now an accelerator. Self-driving and cruise control, however, require your consent to activate, so you are still the one in control and therefore the responsibility lies with you. This will be the case until either one of those features doesn’t deactivate when you try to regain control, or human-facing controls disappear altogether, which I could only see being a thing on taxi cabs.

        1. > so you are still the one in control and therefore the responsibility lies with you

          Automatic driving may become mandatory on the point that it’s “safer than average”, which is forcing the average risk onto everybody. Then the responsibility cannot be on the driver anymore, because they had no choice.

        2. > therefore the responsibility lies with you.

          Nope, already settled in court (volvo).
          In this case Volvo could not prove that a mandatory recall due to brake failure was done on a car involved in a fatality (the driver lost control of the car and killed a child).
          Note that the driver was also held responsible, as she didn’t take measures to avoid the group of children (which was not on the street, btw).

    4. > A coding error buried so deep that it is not until a crash into a wall (or a dead pedestrian) that it is even realised there is a problem.

      That’s why automotive code must be as simple as possible. It also must be completely provable. There must be no place where a coding error could be “buried”. It must be fully readable, understandable, and analyzable by a single person in a day. So, no neural networks or other unpredictable and unstable “modern” crap. If your code, including the libraries used, takes more than a few thousand lines or needs some “new, innovative language”, it should not be run in a crucial module of a car. Never. There is the entertainment system for different bloatware, and that should not be connected to the vehicle network.

      Our ancestors from both continents sent spaceships to the Moon and back with 40 kloc or less; don’t tell me that shitty cars need more. If they do, you’re definitely doing something completely wrong.

      1. I’m actually employed by a company working on the autonomous vehicle problem in a role where we are developing and proving the safety of our software and hardware. My work is all at the foundational layers of the software so I can’t say much about the safety process for the DNNs.

        We are implementing an ISO26262 process which covers a lot of ground for how to write software. If you just want to talk about code complexity, that is covered. But a simple metric like lines of code is a terrible one for ascertaining complexity. ISO26262 recommends adopting a metric like cyclomatic code complexity (CCM). Combined with code coverage requirements (really only realistically done via a combination of unit, integration, and system testing) and security requirements (that generally involve techniques such as fuzzing) the result is something that is very punishing unless the code is low complexity and fairly easy to understand.

        The DNN problem is being looked at very seriously as well. We understand that testing alone is not the best way (even years of it, done with both simulators and on the road). Because I’m no expert, and because so much of this is still proprietary research, I can’t comment. Every company in this space is highly aware of the failures of Tesla and Uber and no one is eager to add their name to that list.
        But the public should absolutely demand proof of safety, and to examine the methodology behind those claims of safety when it comes to the DNNs. The other stuff in these cars is all known quantities at this point and there are decades of practices to pull from.

  2. If you read the article, this is exactly why you DON’T put low-end coders behind something of this magnitude. So many coders these days only focus on getting from point A to point B, when where they should put equal if not more effort is in the failure points. Uh, if I want to steer left and I don’t get the feedback that we are, then stop. How simple is that?
    Thus the reason why you can see these people on the software team suck.

    1. Oh piss off. Nobody writes defect free code. Not me, not these guys, and certainly not you. Crashes are rarely this spectacular but either way it doesn’t deserve some armchair “engineer” making bullshit generalisations about how everyone else but them is stupid.

      The fact that they could track down the full failure path from defect to cause shows they know what they are doing.

  3. I wonder if the team used Coverity or a similar tool to check whether the code complied with the applicable MISRA, AUTOSAR, and CERT specifications. I would also wonder if they applied an ASIL process to their work. At ASIL-B they should at least have had range checks on their interfaces that would have caught this.
    Yes, all these things come with a lot of overhead and being a race car, strictly adhering to best practices for the automotive industry probably seemed too onerous. But these things exist for reasons like this.

    1. Such code-checking tools neither verify algorithms, nor problems using floating point. And if you read the summary, the problem was in range-checking code!

      I think I’m more concerned that they didn’t have an E-stop button in their design. Yes, I know it would have to be wireless, with all the extra problems that brings. And the human would have to have been fast enough to push it in time! But this is why I, as a human driver, just ease off the brakes, or put on the gas only a little bit, and verify that the car is going in the right direction before stepping on it. Sometimes it doesn’t go the way you expect, which is why you need to confirm the feedback.

      1. It’s not really in range-checking code; in fact it is really close to the Ariane 5 failure: one system goes out-of-limits (NaN or overflow), then other systems tied to it do not handle the issue well because it wasn’t explicitly specified.

        This is basically a system architect failure, and a bad one.

        1. “This is basically a system architect failure, and a bad one.”
          One that if missed in my job would get the team doing the FMEAs in a lot of hot water. Anyone doing an FTA or DFA would also catch a lot of flak.
          Because this is a race situation, I doubt the team is following automotive safety processes though.

          And as I state in my reply to 8bitwiz, this failure sounds like a blatant violation of CERT FLP04-C. If they were using Coverity with checkers for the CERT C rule set, this should have been flagged and the code should have at least been reviewed.


      2. There are indeed rules and checkers around this sort of issue.
        From the post: “So during this initialization lap something happened which apparently caused the steering control signal to go to NaN and subsequently the steering locked to the maximum value to the right.”

        They allowed the NaN to propagate through multiple interfaces in the system.
        At the very least CERT FLP04-C was violated somewhere.

      3. Looks like my reply didn’t make it through the spam filter because I had a number of URLs in it. There are a number of MISRA and CERT rules around floating point numbers and NaN. Coverity advertises support for them. We use Coverity at my workplace and what we have observed is that Coverity doesn’t miss things, but it can produce a lot of false positives.

        One specific CERT C rule, FLP04-C, specifically says to check for NaN. The post says the NaN was propagated through software interfaces in the car, and it sounds like none of them checked for this. If they were targeting even ASIL A they should have been checking their data at their interfaces.
        What I find extra ironic is that this is a team from Acronis, which is a cybersecurity company; they should be aware of the CERT coding guidelines.

        You can see some of the measures required for the different ASIL levels if you search for “Analysis of ISO26262 standard application in development of steer-by-wire systems”. The first link should be a PDF with that title. In that PDF, look for table 6.

        Had the team followed ISO 26262 and aimed for even the lowest ASIL level of A then the error would almost certainly have been caught.

        However, being a race car without humans around on a closed course with safety features to keep the cars separated from the people, perhaps there are no mandated safety certifications teams must meet.
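As a concrete illustration of the kind of interface check FLP04-C calls for, here is a sketch of a steering-angle encoder at a CAN boundary. The signal name, scaling, and travel limits are hypothetical, not taken from the team’s system; but note that in C, converting a NaN to an integer type is undefined behavior, which is consistent with the “arbitrarily translated into a very large number” behavior described in the article:

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical encoder for a steering-angle CAN signal: scale degrees
 * into a signed 16-bit raw value (0.01 deg/bit). Converting NaN to an
 * integer is undefined behavior in C; on many targets the result is a
 * saturated or garbage value -- i.e. steering "locked to full right". */
int16_t encode_steering_unsafe(double angle_deg) {
    return (int16_t)(angle_deg * 100.0);   /* UB if angle_deg is NaN */
}

/* FLP04-C-style fix: validate at the interface boundary and report
 * failure instead of emitting a bogus raw value onto the bus. */
int encode_steering_safe(double angle_deg, int16_t *out) {
    if (!isfinite(angle_deg)) return -1;        /* reject NaN and infinities */
    if (angle_deg >  45.0) angle_deg =  45.0;   /* clamp to assumed travel limit */
    if (angle_deg < -45.0) angle_deg = -45.0;
    *out = (int16_t)(angle_deg * 100.0);
    return 0;
}
```

The point is that the check lives at the interface, so a poisoned value is turned into an explicit, handleable error rather than a plausible-looking but wrong command on the bus.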

  4. The simple solution is machine-only roads: if a machine destroys a machine, the people inside the cars/lorries/busses knew the risks and were willing to hand off their safety to what is in effect an advanced data-harvesting company. That way the company is innocent if people die during the testing phase. And every employee of all these companies must legally be on these machine-only roads in their companies’ cars for at least 7 hours a week.

    Problems will be rapidly fixed.

  5. I completely understand the software bug BUT… was there no collision avoidance algorithm that could actuate the brakes??

    Like, ok, the steering algorithm failed. But I would think at some point a lidar sensor would be like “Hey, a wall” and the control algorithm would be like “Hmmm, brakes I guess”.

    Perhaps the algorithm was assuming the car should easily be able to steer away from the obstacle but you can’t always rely on steering. Cars can understeer and oversteer. The other systems should be prepared to compensate.
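An independent last-resort braking check of the kind this comment asks for is straightforward to sketch. This is a toy model under stated assumptions (a single hypothetical lidar range reading, a constant braking deceleration of about 8 m/s², a fixed safety margin), not how production collision-avoidance systems are built:

```c
#include <stdbool.h>

/* Toy emergency-brake check: brake if the obstacle is closer than the
 * estimated stopping distance v^2 / (2a) plus a safety margin.
 * Assumes ~8 m/s^2 of available braking deceleration. */
bool should_emergency_brake(double range_m, double speed_mps) {
    const double decel_mps2 = 8.0;   /* assumed braking capability */
    const double margin_m   = 2.0;   /* assumed safety margin */
    double stop_dist_m = (speed_mps * speed_mps) / (2.0 * decel_mps2);
    return range_m < stop_dist_m + margin_m;
}
```

At 20 m/s the estimated stopping distance is 25 m, so a wall 20 m ahead would trigger braking regardless of what the steering module is commanding; the design point is that this check takes no input from the (possibly poisoned) steering path.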

  6. Why was this ever allowed to happen? What kind of programmer doesn’t design his or her input handling to be tolerant of a simple NaN error?
    Sheesh, I figured out how to do that in TI-99/4A BASIC when I was just a young kid. If a stupid kid like me could figure it out on his own, surely a group of adults with degrees should be able to manage it!

  7. The steering got stuck, that’s fine, it could happen mechanically as well. It didn’t look like that was the only problem though.
    Does it not have any means to detect an object in front of the car and perhaps take multiple actions? If this had been a human driver and, let’s say, the steering wheel somehow detached from the wheels, there’s a pretty good chance they would have thought: ooh, there’s a thing coming towards me unexpectedly, but I’m steering away from it; can I perhaps press quite hard on the brake and see what happens, instead of trying to accelerate and just steer around it?
