Personally, I’m a fan of trains. They’re a nice, albeit slow, way to get around the country. Canada isn’t the best candidate for rail transit, given the rather large space between coasts, but Via Rail does operate regular train service in their corridor between Windsor and Quebec City.
Unfortunately, passenger rail has to yield to commercial rail in Canada, which often causes delays. After noticing that some trains are delayed very frequently, I thought it would be useful to know the average performance of each Via train. Via does not provide this data publicly.
However, they do provide some data about arrival and departure times. Digging into the data available to any browser viewing the Via Rail site, I found it was possible to query for past scheduled and actual arrival data. The result is TrainStats.ca, a display of Via’s on-time performance. Join me after the break as I discuss how this all works, and how to pick a winner when buying your next train ticket.
Getting the Data
Via does provide schedule data for the previous, current, and next day on their status page. This would let us build up a set of trip data, but only one day at a time. Fortunately, we can fire up Chrome’s inspector and find this GET request:
There are a few juicy parameters here.
TsiTrainNumber is obviously the train number we’re looking at.
DepartureDate is the date the train left.
ArrivalDate is the date it arrived.
TrainInstanceDate also appears to be set to the date the train left.
With this in mind, it’s time to jump into Python and use the fantastic requests library to forge some requests.
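A minimal sketch of such a request with requests follows. Only the four parameter names come from the inspected request; the endpoint in `STATUS_URL`, the `YYYY-MM-DD` date format, and the helper names are placeholders assumed for illustration, so substitute whatever the inspector actually shows.

```python
from datetime import date

import requests

# Placeholder endpoint (assumption): copy the real URL from the browser inspector.
STATUS_URL = "https://example.invalid/train-status"


def build_params(train_number, departure, arrival=None):
    """Build the query parameters seen in the inspected request."""
    # Assumption: if no arrival date is given, the trip ends the day it started.
    arrival = arrival or departure
    return {
        "TsiTrainNumber": str(train_number),
        "DepartureDate": departure.isoformat(),  # date format is an assumption
        "ArrivalDate": arrival.isoformat(),
        "TrainInstanceDate": departure.isoformat(),  # appears to match the departure date
    }


def fetch_trip_html(train_number, departure, arrival=None, timeout=30):
    """Fetch the status page for one train on one date and return its HTML."""
    response = requests.get(
        STATUS_URL,
        params=build_params(train_number, departure, arrival),
        timeout=timeout,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text
```

Keeping the parameter building separate from the network call makes it easy to check the query string without hitting Via’s servers.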
This code allows us to fetch data for any train number on any date. After some testing, we found that Via’s data goes back to April 2015, which gives us over 6 months of data. For each trip, we get the scheduled and actual arrival and departure times for every station. With that information, we can easily calculate how delayed the trains are.
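The delay calculation itself is just a subtraction of timestamps. A small sketch, assuming the scheduled and actual times have already been parsed into `datetime` objects (the function name and the sample times are made up for illustration):

```python
from datetime import datetime


def delay_minutes(scheduled, actual):
    """Return how many minutes late a train was; negative means early."""
    return (actual - scheduled).total_seconds() / 60


# Example values are invented, not real Via data.
scheduled = datetime(2015, 10, 1, 17, 30)
actual = datetime(2015, 10, 1, 17, 52)
print(delay_minutes(scheduled, actual))  # 22.0
```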
With the page data fetched as HTML, a script was hacked together using BeautifulSoup to extract all the values. This script then creates objects for the trip data and stores them in a PostgreSQL database using SQLAlchemy. This makes it easy and efficient to access the data later.
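The shape of that step can be sketched as follows. The HTML snippet, the column order, the `StopTime` model, and the in-memory SQLite database (used here instead of PostgreSQL to keep the example self-contained) are all assumptions for illustration; the real page markup and schema will differ.

```python
from datetime import datetime

from bs4 import BeautifulSoup
from sqlalchemy import Column, DateTime, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class StopTime(Base):
    """One station stop on one trip (hypothetical schema)."""
    __tablename__ = "stop_times"
    id = Column(Integer, primary_key=True)
    train_number = Column(String, nullable=False)
    station = Column(String, nullable=False)
    scheduled_arrival = Column(DateTime, nullable=False)
    actual_arrival = Column(DateTime, nullable=False)


# Made-up markup standing in for the fetched page.
SAMPLE_HTML = """
<table id="stops">
  <tr><td>Ottawa</td><td>2015-10-01 17:30</td><td>2015-10-01 17:52</td></tr>
  <tr><td>Toronto</td><td>2015-10-01 21:55</td><td>2015-10-01 22:40</td></tr>
</table>
"""


def parse_stops(html, train_number):
    """Turn each table row into a StopTime object."""
    soup = BeautifulSoup(html, "html.parser")
    stops = []
    for row in soup.select("table#stops tr"):
        station, scheduled, actual = [
            td.get_text(strip=True) for td in row.find_all("td")
        ]
        stops.append(StopTime(
            train_number=train_number,
            station=station,
            scheduled_arrival=datetime.strptime(scheduled, "%Y-%m-%d %H:%M"),
            actual_arrival=datetime.strptime(actual, "%Y-%m-%d %H:%M"),
        ))
    return stops


# In-memory SQLite for the sketch; swap in a PostgreSQL URL in practice.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all(parse_stops(SAMPLE_HTML, "45"))
    session.commit()
    stored = session.query(StopTime).count()
```

Storing one row per station stop keeps later queries simple, since the delay at any station is just the difference between two columns.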
The last step was to iterate over all the train numbers and days to pull the data. A backfill script just uses some nested loops to grab the data and store it. Another script grabs the previous day’s data and stores it in the database. It runs as a cron job, so the database stays fresh.
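The nested loops might look something like this. The `fetch_and_store` callback, the function names, and the pause between requests are my own placeholders; the pause is an addition meant to keep the load on Via’s servers low, not something taken from the original scripts.

```python
import time
from datetime import date, timedelta


def iter_days(start, end):
    """Yield every date from start to end, inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


def backfill(train_numbers, start, end, fetch_and_store, pause_seconds=1.0):
    """Call fetch_and_store(train, day) for every train on every day."""
    count = 0
    for train in train_numbers:
        for day in iter_days(start, end):
            fetch_and_store(train, day)
            count += 1
            time.sleep(pause_seconds)  # be polite to the remote server
    return count


def yesterday(today=None):
    """Return the date a daily cron job should fetch."""
    return (today or date.today()) - timedelta(days=1)
```

The daily job would then call the same callback once per train with `yesterday()` as the date.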
Building a (Cheap) Website
At this point, we have arrival data on over 12,000 trips. While we can manually run queries and write scripts to generate plots, it’s far more fun to put the data online. That means it’s time to build a website. Making things look good on the internet is not my forte, so [Phil Everson] jumped in to do some web development.
To add a constraint, we wanted to make the site as cheap as possible to run. Platform as a Service offerings like Heroku ran about $20 a month. A Virtual Private Server from DigitalOcean would cost at least $5 a month. The cheapest option was to make a static site.
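One way to make a static site work is to precompute the statistics and write them to a file that the pages load, so no server-side code runs per visit. A sketch of that idea follows; the JSON layout, the file name, and the definition of on time (arriving no later than scheduled, adjustable through `tolerance_min`) are my assumptions rather than details of how TrainStats.ca is actually built.

```python
import json
from collections import defaultdict


def on_time_percentages(delays, tolerance_min=0):
    """Map each train number to the percentage of its trips that were on time.

    delays is an iterable of (train_number, delay_in_minutes) pairs.
    """
    totals = defaultdict(int)
    on_time = defaultdict(int)
    for train, delay in delays:
        totals[train] += 1
        if delay <= tolerance_min:
            on_time[train] += 1
    return {
        train: round(100 * on_time[train] / totals[train], 1)
        for train in totals
    }


def write_stats(delays, path="stats.json"):
    """Dump the precomputed stats to a file the static pages can read."""
    stats = on_time_percentages(delays)
    with open(path, "w") as handle:
        json.dump(stats, handle, indent=2, sort_keys=True)
    return stats
```

Regenerating this file after each nightly scrape would keep the pages current without any always-on application server.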
This hack was mostly built for fun, but it has a few interesting findings. On my usual Ottawa to Toronto route, I’m more likely to opt for the train that’s on time 84% of the time, versus the one that only rolls into the station without delay on 28% of trips. Some other travellers might find the stats useful as well. Either way, it was an interesting exercise in scraping up a dataset and providing a web service on the cheap.
If you’re interested in the source, it’s all up on GitHub for the taking. We kindly request that you don’t DDoS Via Rail with it.