Catching A Rogue Train With Data

If you have been a regular traveler on one of the world’s mass transit systems over the last few decades, you will have witnessed something of a technological revolution. Not necessarily in the trains themselves, though they have certainly changed, but in the signalling and system automation. Nineteenth- and twentieth-century human and electromechanical systems have been replaced by up-to-date computers, and in some cases the trains even operate autonomously without a driver. The position of every train is known exactly at all times, and with far less possibility for human error, the networks are both safer and more efficient.

As you might expect, the city-state of Singapore has a metro with every technological advance possible, recently built and with new equipment. It was thus rather unfortunate for the Singaporean metro operators that trains on their Circle Line started to experience disruption. Without warning, trains would lose their electronic signalling, and their safety systems would then apply the brakes and bring them to a halt. Engineers had laid the blame on electrical interference, but despite their best efforts no culprit could be found.

Eventually the problem found its way to the Singaporean government’s data team, and their story of how they identified the source of the interference makes for a fascinating read. It’s a minor departure from Hackaday’s usual hardware and open source fare, but there is still plenty to be learned from their techniques.

They started with the raw train incident data and, working in a Jupyter notebook, imported, cleaned, and consolidated it before producing analyses by time, location, and train ID. None of these graphs offered any clues, as the incidents happened regardless of location, time, or train.
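
For a rough idea of what that first pass looks like, here is a minimal sketch in plain Python. The incident records, station names, and train IDs are all made up for illustration, and the helper names (`clean`, `tally`, `largest_share`) are our own rather than anything from the data team's notebook. It tidies the records, then counts incidents per hour, per station, and per train to see whether any single bucket dominates.

```python
from collections import Counter
from datetime import datetime

# Hypothetical incident log: (timestamp, nearest station, train ID).
# Made-up values for illustration only, not the real Circle Line data.
RAW_INCIDENTS = [
    ("2016-01-04 07:12", "Station A", "T01"),
    ("2016-01-04 13:40", "station c ", "T02"),
    ("2016-01-05 18:05", "Station B", "T03"),
    ("2016-01-05 09:31", "Station D", "T04"),
    ("2016-01-06 21:50", "Station A", "T05"),
    ("2016-01-06 11:18", "Station C", "T06"),
]

def clean(records):
    """Parse timestamps and normalise station names."""
    return [
        (datetime.strptime(ts, "%Y-%m-%d %H:%M"), station.strip().title(), train)
        for ts, station, train in records
    ]

def tally(records):
    """Count incidents per hour of day, per station, and per train."""
    by_hour = Counter(ts.hour for ts, _, _ in records)
    by_station = Counter(station for _, station, _ in records)
    by_train = Counter(train for _, _, train in records)
    return by_hour, by_station, by_train

def largest_share(counter):
    """Fraction of all incidents that fall in the single busiest bucket."""
    return max(counter.values()) / sum(counter.values())

by_hour, by_station, by_train = tally(clean(RAW_INCIDENTS))
for name, counts in (("hour", by_hour), ("station", by_station), ("train", by_train)):
    print(f"{name}: largest share {largest_share(counts):.2f}")
```

When no hour, station, or train accounts for a large share of the incidents, as in this toy data, the simple breakdowns have nothing to say and a different view of the data is needed.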

They then plotted each train on a Marey chart, a graph in which the vertical axis represents time and the horizontal axis represents stations along a line. (Incidentally, Étienne-Jules Marey’s Wikipedia entry is a fascinating read in itself.) Since the chart shows the positions of multiple trains simultaneously, they were able to see that the incidents happened when two trains were passing, hence their lack of correlation with location or time. The prospect of a rogue train as the source of the interference was raised, and by analyzing video recordings from metro stations to spot the passing train’s number they were able to identify the unit in question. We hope that the repairs included a look at the susceptibility of the signalling system to interference as well as the faulty parts on one train.
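
The reasoning behind the Marey chart can be sketched without drawing anything. The snippet below is a toy model under heavy assumptions: every train moves at constant speed along a single line, and the timetable, the incident list, the 1 km radius, and the train ID `T99` are all invented for illustration. The real team read the pattern off the chart and confirmed it with video. For each incident it asks which other train was close by at that moment, then counts how often each one turns up.

```python
from collections import Counter

# Hypothetical timetable: train ID -> (departure minute, start km, km per minute).
# Positive speed runs up the line, negative speed runs down it.
TRAINS = {
    "T01": (0, 0.0, 0.5),
    "T02": (10, 0.0, 0.5),
    "T03": (20, 0.0, 0.5),
    "T99": (0, 30.0, -0.5),  # the invented oncoming train
}

# Hypothetical incidents: (minute, train that lost its signal).
INCIDENTS = [(30, "T01"), (35, "T02"), (40, "T03")]

def position(train_id, minute):
    """Position in km of a train at a given minute (constant-speed assumption)."""
    depart, start_km, speed = TRAINS[train_id]
    return start_km + speed * (minute - depart)

def nearby_trains(minute, affected, radius_km=1.0):
    """Other trains within radius_km of the affected train at that minute."""
    here = position(affected, minute)
    return [
        other for other in TRAINS
        if other != affected and abs(position(other, minute) - here) <= radius_km
    ]

def suspects(incidents, radius_km=1.0):
    """Count how often each train was passing when an incident occurred."""
    counts = Counter()
    for minute, affected in incidents:
        counts.update(nearby_trains(minute, affected, radius_km))
    return counts

print(suspects(INCIDENTS).most_common(1))
```

In this toy data the three incidents hit three different trains at three different times and places, yet the same oncoming train is alongside every time, which is the kind of pattern a Marey chart makes visible at a glance.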

We’ve been known to cover a few stories here with a railway flavor over the years. Mostly though they’ve been older ones, such as this film of a steam locomotive’s construction, or this tale of narrow gauge preservation.

[via Hacker News]

[Main image source: Singapore MRT Circle line trains image: 9V-SKA [CC BY 3.0], via Wikimedia Commons]

16 thoughts on “Catching A Rogue Train With Data”

  1. I saw this posted elsewhere a day or two ago – a really good story of catching an intermittent fault with data analysis. They were lucky, of course, to have information to work with, but still the whole exercise must have been very satisfying for all of those involved.

    1. I’m a HUGE terminal fan/user, and do all my Python dev work in GNU Screen and Vim. However, I’ve started using Jupyter for quick exploration of data over the last year or so.

      I’m loving it! It installs and runs in Python virtual environments, which is a definite plus for me. The integration with NumPy/SciPy and embedded plots via Matplotlib is great. The ability to run a “server” instance (since Jupyter is web based) is pretty nice, so I can run workshops and lead people through things without the need for a whole extra workshop on installing/setting up Python :)

      The “cell” execution (similar to Mathematica or Sage) makes it easy to write a walkthrough on data analysis (check out the LIGO and CERN notebooks!), and I’ve used them to add interactive “documentation” on some internal projects.

      I’d definitely give Jupyter a shot. It’s a pretty fun tool.

      1. If you want to take it up a notch, try using Jupyter widgets!

        I use the first few cells to import Python functions, the next few as a widget-driven GUI to control lab equipment (function generators, xyz axis, cameras etc), then immediately process the collected data, and finally save the processed data with the %store magic.

        So, you have a record of the equipment you used, and the raw and processed data in a single file, together with the algorithms you used for the data analysis!

  2. Cannot confirm this story’s accuracy but it sounds amusingly similar.

    In the 1980s, my mentor Sergei was writing software for an SM-1800, a Soviet clone of the PDP-11. The microcomputer was just installed at a railroad station near Sverdlovsk, a major shipping center for the U.S.S.R. at the time. The new system was designed to route train cars and cargo to their intended destinations, but there was a nasty bug that was causing random failures and crashes. The crashes would always occur once everyone had gone home for the night, but despite extensive investigation, the computer always performed flawlessly during manual and automatic testing procedures the next day. Usually this indicates a race condition or some other concurrency bug that only manifests itself under certain circumstances. Tired of late night phone calls from the station, Sergei decided to get to the bottom of it, and his first step was to learn exactly which conditions in the rail yard were causing the computer to crash.

  3. Love transient problems. Back in the dark ages we had a full HP3000 installation that would go out every night…people left in the evening and came back to a dead computer. Since it took a while to get it back on its feet, this was a pretty substantial thing to work out. IIRC the solution was to have it write the current time to a file that could be read after the event.

    The failures happened consistently at about 3:45 AM. Since nothing failed at any other time (and code wasn’t running overnight with any consistency) that left extrinsic factors. Voltage. As it turned out, the voltage in the building dropped precipitously just ahead of the failures…always at the same time.

    What causes huge voltage drops in an industrial building in a business park at 3:45 AM? Not a bakery…the printing operation for the local newspaper was cranking up its presses for the morning edition and the grid couldn’t take it. At that time, there wasn’t enough installed computing equipment to cause anyone else to howl about it (and our analytical section powered things to standby overnight).

    1. Me too; after I’ve solved them. What fascinates me about this kind of puzzle is how we humans solve them. As far as I can tell it’s a combination of past experience, educated guesstimating and logical thinking that enables us to tackle “new” problems. Plus a bit of brute force every now & then :)

    2. My HP3000 circuit simulation failure was a lot simpler. We’d run long simulations overnight, and they’d often fail with no indication of why. (The software didn’t log anything useful.) Short jobs often worked, but occasionally failed. Long simulations were more likely to fail.

      There were no detailed logs, but working backwards we realized they were all failing somewhat related to how late in the day they were started.

      Doing a little math revealed they were all failing at midnight. Clock roll-over.

  4. Many years ago, one of our clients experienced severe network and subsequent database errors on a very regular basis. After months of fruitless bug-searching at the application level, with custom debugging builds and everything, one of my co-workers was on site when it happened again and he discovered the source. The flooring in that building was wooden and sometimes, when a heavily built person would walk by, the boards would flex enough to move the network cabling, causing the Ethernet connector to wiggle a bit and break the connection.
