The OpenStreetMap project is an excellent example of how powerful crowdsourced data can be, but that’s not to say the system is perfect. Invalid data, added intentionally or otherwise, can sometimes slip through the cracks and lead to some interesting problems — a fact that developer Asobo Studio is becoming keenly aware of as players explore their recently released Microsoft Flight Simulator 2020.
Like a wiki, OpenStreetMap can be edited by anyone, and about a year ago user nathanwright120 marked a 2 story building near Melbourne, Australia as having an incredible 212 floors (we think it’s this commit). The rest of his edits seem legitimate enough, so it’s a safe bet that it was simply a typo made in haste. The sort of thing that could happen to anyone. Not long after, thanks to the beauty of open source, another user picked up on the error and got it fixed up.
But not before a script written by Asobo Studio went through and sucked up the OpenStreetMap data for Australia, incorporating it into their virtual recreation of the planet. The result is that the hotly anticipated flight simulator now features a majestic structure in the Melbourne skyline that rises far above…everything.
The whole thing is great fun, and honestly, players probably wouldn’t even mind if it got left in as an Easter egg. It’s certainly providing them with some free publicity; in the video below you can see a player by the name of Conor O’Kane land his aircraft on the dizzying edifice, a feat which has earned him nearly 100,000 views in just a few days.
But it does have us thinking about filtering crowdsourced data. If you ask random people to, say, identify flying saucers in NASA footage, how do you filter that? You probably don’t want to take one person’s input as authoritative. What about 10 people? Or a hundred?
The Army Marches on Data
When you think about geospatial data, what heuristics could you use to at least identify areas to look at closer? In this case, the fact that the tallest building in the world only has 163 floors would have been a good clue. Even if the building had 100 floors, the fact that nothing else near it has even a quarter of that number would be another clue. In either case, the Great Tower of Melbourne could have been avoided with a single line of code validating the height data pulled from OpenStreetMap.
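That one-line check needn’t be anything fancy. Here’s a minimal sketch, assuming building records arrive as dictionaries carrying OSM’s `building:levels` tag (the cap of 163 floors, the record layout, and the function names are our own choices, not anything from Asobo’s actual pipeline):

```python
# Minimal sanity check on crowdsourced building data. Each building is
# assumed to be a dict carrying an OSM-style "building:levels" tag.
MAX_PLAUSIBLE_LEVELS = 163  # the Burj Khalifa, currently the world's tallest

def plausible_levels(building: dict) -> bool:
    """Reject any building claiming more floors than the tallest on Earth."""
    try:
        levels = int(building.get("building:levels", 0))
    except ValueError:
        return False  # a non-numeric tag value is suspect too
    return 0 <= levels <= MAX_PLAUSIBLE_LEVELS

buildings = [
    {"name": "plausible tower", "building:levels": "55"},
    {"name": "Great Tower of Melbourne", "building:levels": "212"},
]
accepted = [b for b in buildings if plausible_levels(b)]
```

Running the 212-floor Melbourne entry through a filter like this would have quietly dropped it before it ever reached the renderer.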
For terrain, rapid changes in elevation might be another indicator of bad data. That would have prevented the wall of ice that guards us from the White Walkers. We wondered if anyone had given this any thought before. Turns out, the US Army has. They even mention OpenStreetMap and many other sources, some of which we didn’t know about.
Section 4 of the aptly named Crowdsourced Geospatial Data talks about how to vet crowdsourced data and address errors due to sensor variability, language, and other technical factors. Errors from logical inconsistency, however, should be moderately simple to filter out, and the paper identifies efforts to automate that for geospatial data. For example, the angle between two intersecting roads typically falls within a relatively narrow range.
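As a rough illustration of that intersection-angle heuristic (the 15-degree threshold and function names here are our own assumptions, not values from the paper):

```python
# Sketch of an intersection-angle plausibility check: given the compass
# bearings (in degrees) of two road segments meeting at a node, flag
# junctions that meet at an implausibly shallow angle.
def intersection_angle(bearing_a: float, bearing_b: float) -> float:
    """Smallest angle between two road directions, in [0, 90] degrees."""
    diff = abs(bearing_a - bearing_b) % 180
    return min(diff, 180 - diff)

def suspicious_junction(bearing_a, bearing_b, min_angle=15.0) -> bool:
    """True if the roads cross at a shallower angle than min_angle."""
    return intersection_angle(bearing_a, bearing_b) < min_angle

print(suspicious_junction(10, 100))  # right-angle crossing: False
print(suspicious_junction(10, 14))   # near-parallel "intersection": True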
According to the paper, several researchers have validated data and found high error rates in public information sources. For example, in the United Kingdom and Ireland, OpenStreetMap data with more than 15 edits contained errors about 8% of the time. In France, about 5% of the crossroads had geometric inaccuracies.
Of course, that assumes the errors are the result of honest mistakes. Protecting against malicious data entry is an entirely different problem, and one that’s potentially much harder to identify and fix.
This situation is also addressed in the Army’s report, but only briefly. It stands to reason that if the military has some particular tips and tricks they use to sniff out this sort of thing, they probably don’t want them to become public knowledge.
With crowdsourced data growing in popularity, it’s easy to imagine an adversary wanting to displace key targets slightly, or even significantly. A bunker “known” to be at the center of a facility might survive if the data says that the facility is a few hundred meters to the right of its actual position. Disinformation has always been a powerful tool, and it’s only amplified in the era of Big Data.
That said, some of it isn’t too hard to find. People actually use GPS tracks to spell out graffiti in OpenStreetMap, for example. So if you stumble upon any mile-wide letters written in the countryside, it’s probably safe to leave them out of your flight simulator.
Closer To Home
This doesn’t just apply to geospatial data, either. How often do you take data from a pressure or temperature sensor? Do you validate it? For high-reliability data, you might need multiple redundant sensors with some voting logic. That’s common in aircraft and spacecraft. You might have three sensors and take the average of the three if they read close together or reject one if it is way off compared to its counterparts.
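That three-sensor voting scheme can be sketched in a few lines. This is a simplified illustration, assuming the tolerance value and function names of our choosing rather than any particular avionics standard:

```python
# Triple-redundant sensor voting: average when all three readings agree,
# otherwise discard the outlier and average the pair that agrees.
def vote(a: float, b: float, c: float, tolerance: float = 2.0) -> float:
    readings = [a, b, c]
    if max(readings) - min(readings) <= tolerance:
        return sum(readings) / 3.0  # all three agree: take the average
    # One sensor disagrees: keep the pair closest together and average
    # just those two, voting out the outlier.
    pairs = [(abs(a - b), (a, b)), (abs(a - c), (a, c)), (abs(b - c), (b, c))]
    _, (x, y) = min(pairs)
    return (x + y) / 2.0

print(vote(100.1, 100.3, 100.2))   # healthy sensors: ~100.2
print(vote(100.1, 100.3, 4096.0))  # one failed sensor is voted out: ~100.2
```

Real systems add more nuance — persistent disagreement should mark a sensor as failed rather than quietly outvoting it forever — but the core idea is just this.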
I once had a commercial drone suddenly decide it was 4,096 feet below sea level due to a failed pressure sensor. The resulting rapid ascent to try to correct the altitude was both amazing and terrifying since it was a big drone. The firmware should have made some simple assumptions about data quality, such as realizing that it wasn’t likely to suddenly find itself thousands of feet below sea level, or that the data wasn’t trending in the expected way as it tried to gain altitude. It certainly would have made my day easier, not to mention the pilot’s.
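Those two assumptions — a plausible envelope and a plausible rate of change — take only a few lines to express. A hypothetical sketch (the limits and names here are invented for illustration, not from any real drone firmware):

```python
# Hypothetical altitude sanity filter of the kind the firmware lacked:
# reject a reading that falls outside a plausible envelope, or that
# jumps faster than the aircraft could physically climb or descend.
def sane_altitude(new_ft: float, last_ft: float, dt_s: float,
                  max_rate_fps: float = 50.0,
                  floor_ft: float = -100.0,
                  ceiling_ft: float = 20000.0) -> bool:
    if not (floor_ft <= new_ft <= ceiling_ft):
        return False  # 4,096 feet below sea level is not a real place
    return abs(new_ft - last_ft) / dt_s <= max_rate_fps

print(sane_altitude(210.0, 205.0, 0.1))    # normal climb: True
print(sane_altitude(-4096.0, 205.0, 0.1))  # failed sensor: False
```

A rejected reading would then be ignored or replaced with an estimate from another source, rather than fed straight into the altitude controller.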
What’s your favorite data validation trick? How much do you trust crowdsourced data? Wikipedia is usually right over the long term, but there are certainly cases where bad data slips through until someone catches it.
Thanks [ptkwilliams] for the tip about Flight Simulator.