Engineers tend to worry about uptime, whether it’s at a corporate server farm or just our own little hobby servers at home. Every now and then, something will go wrong and take a box offline, which requires a little human intervention to fix. Ideally, you’ll still have a command link that stays up so you can fix the problem. Lose that, though, and you’re in a whole heap of trouble.
That’s precisely what happened to Australia’s second largest telecommunications provider earlier this month. Systems went down, millions lost connectivity, and company techs were left scrambling to put the pieces back together. Let’s dive in and explore what happened on Optus’s most embarrassing day in recent memory.
Where to Go?
It all went down in the wee small hours of November 8, around 4:05 AM, when a routine software upgrade was scheduled. As part of the upgrade, a change to Border Gateway Protocol (BGP) routing information reached Optus’s network from an international peering network. According to the company’s analysis after the event, “These routing information changes propogated through multiple layers in our network and exceeded preset safety levels on key routers which could not handle these. This resulted in those routers disconnecting from the Optus IP Core network to protect themselves.”
That’s all a bit of a mess, so what does it mean? Well, fundamentally, the BGP routing information tells Optus’s routers where to find other machines on the internet. The routing information updates came from a Singtel internet exchange, STiX, which Optus uses to access the global internet. What happened is that the updates overwhelmed Optus’s own routers, which shut down once they hit a preset threshold for the number of route updates they would accept. These limits are pre-configured into the router equipment from the factory. Because this occurred in routers on Optus’s core network, when they went offline they took down the telco’s entire national network, affecting voice, mobile, and internet customers.
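What tripped here is conceptually similar to BGP’s maximum-prefix safeguard. The sketch below is purely illustrative — it is not Cisco’s actual implementation, and the limit is made up — but it shows the idea: once the routes learned from a peer cross a preset ceiling, the router tears the session down rather than risk being overwhelmed.

```python
# Minimal sketch of a "maximum prefix" style safeguard (hypothetical,
# not vendor code): a router drops its peering session once the number
# of routes learned from a peer exceeds a preset ceiling.

MAX_PREFIXES = 500_000  # illustrative default limit, not Optus's real figure


class Router:
    def __init__(self, name, max_prefixes=MAX_PREFIXES):
        self.name = name
        self.max_prefixes = max_prefixes
        self.routes = set()
        self.session_up = True

    def receive_update(self, prefixes):
        """Accept a batch of route announcements from a peer."""
        if not self.session_up:
            return
        self.routes.update(prefixes)
        if len(self.routes) > self.max_prefixes:
            # Self-isolate: tear down the session rather than risk
            # exhausting memory/CPU on an unbounded routing table.
            self.session_up = False
            print(f"{self.name}: prefix limit exceeded, session torn down")


# A flood of updates from an upstream peer trips the safeguard.
core = Router("PE-router", max_prefixes=1000)
core.receive_update({f"10.{i // 256}.{i % 256}.0/24" for i in range(2000)})
print("Session up?", core.session_up)   # -> False
```

The safeguard protects the individual box; the trouble comes when every core router hits the ceiling at once.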
Engineers spent the first six hours investigating various causes of the incident, while millions were waking up to dead internet connections and phones without signal. Crews rolled back recent changes made by Optus itself and looked into whether they were under some kind of DDoS attack. In the end, the engineers determined that the routers had self-isolated to protect themselves from the flood of routing information updates that had propagated through the network. Resetting routing back to normal was enough to get networks back online, with engineers carefully reintroducing traffic to Optus’s backbone to avoid any unseemly surprises during the process. Optus eventually put the blame on the automatic safety mechanisms, stating “It is now understood that the outage occurred due to approximately 90 PE routers automatically self-isolating in order to protect themselves from an overload of IP routing information. These self-protection limits are default settings provided by the relevant global equipment vendor (Cisco).” This perhaps implies that the self-protection limits are unduly cautious and took the network offline when it was not really necessary to do so.
According to Optus, 150 engineers were directly involved in investigating the problem and restoring service, with another 250 staff and 5 vendors working in support. Meanwhile, the efforts to get back online were frustrated by the fact that, with Optus’s network down, it was difficult for technicians to actually access machines on the network to fix the problem. Ultimately, it would take a full fourteen hours for Optus to get its systems fully back online, with technicians having to attend some equipment in person.
Optus isn’t the only company to have had issues with a major BGP meltdown. Facebook famously disappeared from the internet in 2021 for a few hours when it got the settings wrong on a few of its own backbone routers.
The Aftermath
The result of this unprecedented outage was Optus temporarily becoming public enemy #1 in the Australian media. Millions across the nation had spent the day with no internet connection, no mobile connectivity, and little to no word from Optus about what was actually going on. Customers had to get their updates via conventional media like newspapers, radio, and television—as they had no way to access the internet or receive calls via their own devices. Thankfully, cellular users were at least able to contact emergency services via alternate cellular networks, but landline users were cut off.
Businesses relying on EFTPOS payment terminals with Optus SIM cards were unable to take payments, while banks, hospitals, and even some train services were affected. The Melbourne train network underwent a one-hour shutdown as drivers could not communicate with the control centre, with hundreds of trains cancelled throughout the day. As for Optus itself, it shed $2 billion in value on the stock market as the day wore on, with CEO Kelly Bayer Rosmarin resigning a few weeks later due to the outage. Thus far, the company has offered customers 200 GB of free data as restitution for the outage. It’s proven cold comfort for many, particularly those in small businesses who lost out on hundreds or thousands of dollars in trade during the period.
The only winners in the scenario were Optus’s main competitors, namely Telstra and Vodafone. The two companies run competing cellular networks as well as offering home internet connections across the nation. With this disaster occurring only a year after a major data breach at Optus saw customers compromised en masse, the two companies will be seeing dollar signs when it comes to stealing their rival’s customer base.
Ultimately, there’s a lesson to be learned from Optus’s downfall. Crucial systems should be able to handle a routine update without collapsing en masse, even if something goes wrong. In 2023, customers simply won’t accept losing connectivity for 14 hours, especially if it’s due to some poorly-configured equipment. Connectivity is now almost as important to people as the air they breathe and the water they drink. Take that away and they get very upset, very quickly indeed.
HA! It’s BGP, once again. (It’s always BGP, unless it’s DNS. :: snickers :: )
This isn’t the first time a downstream connection flooded the BGP tables and took out large portions of the Internet. The actual fault is with the network architects that allow any BGP update without any sort of sanity check, not with default settings to prevent large outages.
If Optus “fixed” the core issue by removing those default safeguards entirely, the next time this happens (and history shows it occurs every few years) it won’t just be Optus that goes down, but their peering neighbors who don’t have safeguards of their own.
Canada had a similar problem in 2022. Rogers did a software upgrade to their internal network and it took their entire system down for most of the day.
Just to make things worse, the outage also disconnected the payment processor that handled most of the credit and debit transactions in the entire country!
I had $50 cash in my pocket that day, and I felt so rich for being able to pay for my stuff in cash.
This is why I use ca$h only. It never goes down. I haven’t used a debit card in over 15 years.
In the USA, consumers have a financial incentive to not use cash – credit cards and electronic payment processors often give buyers a portion of their fee as an incentive. Sometimes it’s a few percent of the transaction. Even a few debit cards do that, though rarely as they don’t collect as high of fees to pay for the incentives.
At least we still have pennies… :P
I predict the biggest IT disaster will be if a hostile party releases a rogue AI whose only purpose is to infiltrate the Intel ME and wreak havoc. The Intel ME runs a closed-source version of MINIX 3 at a Ring -3 level, and is always active, even if the computer is off. It has full access and can’t be disabled. What little could be determined about it has shown no shortage of major security flaws. AMD has an equivalent, the PSP, of which even less is known.
You can cause far more havoc by disrupting the internet. Unfortunately, more and more is handled over the internet these days: the telephone network, public transportation, money exchange, traffic systems, power plants, etc.
Some laptops disable it.
https://news.ycombinator.com/item?id=33345040
The more realistic and dismal scenario is simply people’s own governments sabotaging and dismantling the networks while at war with their own people. No AI needed
ROUGE AI ? BEWARE THE CLEANING LADY !!!
I remember a fairly regular power outage, where even the backups failed, whenever a cleaner plugged in a hoover at an international insurance company I once worked for.
In the end, they flew experts in from IBM because this happened a couple of times. They surmised what was happening and placed people around the building to watch. Having found the culprit, they added screws to the socket in question. Problem solved.
Cost the company an absolute fortune. It was all very exciting at the time. Like being in an episode of the X-Files.
I can’t say it’s unfair to call it too sensitive in this instance, but… you can do a lot more harm by accepting wrong or malicious routes than by rejecting the right routes. They should have done this that and the other thing, but you do have to balance both risks against each other.
IMO blaming it on Cisco is a cop out. Optus bought them, configured them, runs them and maintains them. They literally admitted they don’t check their own infrastructure.
There are thousands of configuration settings and deeply complex logic present. You can check the obvious stuff but you can’t check every possible consequence of every possible input.
The only practical approach is to do as much testing as you can, and *occasionally* have a very bad day where a swiss-cheese error occurs and you have to clean up, then add that failure mode to the list of things to defend against
And we shouldn’t be so critically dependent on a system that being denied access for 12 hours causes us to lose our collective shit.
how long can you last without oxygen? :P
Modern ever increasingly fast life needs fast connectivity, you either fork out the hard $$$ for backups or live dangerously without them.
Medical systems like that are infinitely simpler, as well as being discrete independent units. It’s actually feasible to test them in isolation and to swap them out when one fails. So it’s not even *close* to being a valid comparison.
For a telco, ensuring your remote equipment is reachable via an alternative method when primary connectivity goes offline for any given reason is a pretty basic concept. Back in the T1 days, all of our Cisco routers had a modem hanging off of the console port connected to a phone line provided by an alternative provider on disparate infrastructure. This ensured that if a borked firmware upgrade or settings change were to happen for any reason, we could manually dial into the modem and access the router via a direct serial connection to the console port (roughly the sort of thing sketched below). This worked even if the unit was unable to boot due to a corrupt firmware image or failed flash media. Only a physical failure such as a failed media card or entirely failed unit could take a node offline to the point it required physical access to correct, and we generally kept a spare flash card onsite that any untrained local employee could be walked through physically swapping out for us in the event of a media failure.
Furthermore, for an INTERNET outage to take down VOICE service, is a travesty and betrayal of everything the POTS network once stood for.
I’m sure there are business reasons to chase a cheaper network core, but when it compromises redundancy and independent systems, it’s a bridge too far. Where’s the regulatory oversight to make sure that these systems don’t depend on each other?
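For illustration, that dial-in console setup is easy enough to script even today. Here’s a rough sketch using pyserial and standard Hayes AT commands — the device path and phone number are placeholders, and a real deployment would add authentication and logging:

```python
# Rough sketch of out-of-band console access over a dial-up modem
# (illustrative only; device path and number are placeholders).
import time

import serial  # pyserial


def dial_console(port="/dev/ttyUSB0", number="0000000000", baud=9600):
    ser = serial.Serial(port, baud, timeout=5)
    ser.write(b"ATZ\r")                     # reset the modem (Hayes command set)
    time.sleep(1)
    ser.write(f"ATDT{number}\r".encode())   # dial the out-of-band line
    deadline = time.time() + 60
    banner = b""
    while time.time() < deadline:
        banner += ser.read(64)
        if b"CONNECT" in banner:
            # We now have a raw serial path to the router's console port,
            # independent of the IP network that just fell over.
            ser.write(b"\r")                # wake the console, get a prompt
            return ser
    ser.close()
    raise RuntimeError("modem did not connect")
```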
I remember reading a research paper that argued the internet shouldn’t be IP based in the first place, because the routing tables will eventually grow to become unmanageable no matter what you do, and the network will have to split and branch further and further until getting from one end point to another in a different branch will have so many routing devices in between that the processing lag becomes an issue.
They suggested that the internet routing should be like the phone network, where the subscriber number doesn’t identify the device but the path to the device from some known reference point. That way, if you know a better route, or you want to send your packets down multiple routes, you can, and there’s no limit to how many devices can be connected because the network doesn’t have to know where everyone is. Each packet contains information about what path it’s going to take, and if the router knows a shortcut then it can modify that path, otherwise it’s just going to toss it along to the next router.
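As a toy illustration of that path-addressed idea (my own sketch, not the paper’s actual scheme): each packet carries the remaining hops as its address, and every node simply pops the next one off — no global routing table required.

```python
# Toy sketch of path-based ("source routed") forwarding, where the
# address is the route itself rather than an endpoint identifier.
# Purely illustrative; not taken from the paper being described.

NETWORK = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}

def forward(packet):
    """Walk a packet along the path embedded in its header."""
    node = packet["at"]
    while packet["path"]:
        next_hop = packet["path"].pop(0)
        if next_hop not in NETWORK[node]:
            raise RuntimeError(f"{node} has no link to {next_hop}")
        node = next_hop
    return node

pkt = {"at": "A", "path": ["B", "D"], "payload": "hello"}
print(forward(pkt))  # -> "D"; no node needed to know the full topology
```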
Wow – the authors just described UUCP. :-)
(I know… I’m old, and all you kids need to get off my lawn. Also, bang paths sucked, well-known hosts weren’t always well known, yada yada, etc. Yeah, I get it. Relax. But when ethernet cards were priced like cars and VAXes were priced like the houses you parked them at, UUCP rocked.)
Damn, I’m old. I remember UUCP, and I remember having to configure it on an early Linux box because of a sadistic *nix instructor.
! paths!!! AAAARG!!!
One argument about why the internet was built on IP rather than other routing schemes is that it required the use of a central authority (IANA, originally under the US Department of Defense) to assign the numbers to avoid collisions. That way you could police who gets to be on the internet.
Most core routers can handle a full routing table with no problem. BTW, a full routing table doesn’t mean you know all the exact routes; it boils down to what you describe in your second paragraph. You send the packets to the next hop that is best (for you), and that hop sends them on to its best next hop, etc., until they reach the destination. The return packets follow a similar pattern (which is not per se the same route). BGP (or any other routing protocol) helps to learn routes more dynamically.
IP is just a number, just as a telephone number.
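That hop-by-hop “best next hop” forwarding boils down to a longest-prefix-match lookup against whatever routing table the router holds. A minimal sketch using Python’s ipaddress module, with made-up prefixes and next hops:

```python
# Minimal longest-prefix-match lookup, the core of hop-by-hop IP
# forwarding. Prefixes and next hops below are made up for illustration.
import ipaddress

ROUTES = {
    ipaddress.ip_network("0.0.0.0/0"):        "upstream-peer",   # default route
    ipaddress.ip_network("203.0.113.0/24"):   "core-router-1",
    ipaddress.ip_network("203.0.113.128/25"): "core-router-2",
}

def next_hop(dst):
    """Pick the most specific matching prefix for a destination."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in ROUTES if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]

print(next_hop("203.0.113.200"))  # -> core-router-2 (the /25 beats the /24)
print(next_hop("198.51.100.7"))   # -> upstream-peer (only the default matches)
```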
IP is a unique endpoint identifier, whereas a phone number is actually a route description. Many different phone numbers can end up in the same receiver, whereas different IP addresses will never end up in the same computer (without special network translation measures).
Back in the day, if you knew what numbers to dial, you could make loops around the network to hide the originating number, because it would take time for the authorities to contact each exchange and check where the line is connecting from.
How do you think IP numbers with their subnets and route tables actually work? The host part of a subnet translates to a DDI range, which is reached through a country and area code and some extra numbers (the subnet). The other way around is the table in the PABX that routes, for example, the call to the outside world when a 9 or 0 is dialed (i.e. the default gateway in networking). And internal phones without a DDI are handled like NAT.
Both are similar technology but have a different way of representing the numbers/addresses.
Regarding the unique endpoint, both have exceptions, like 8.8.8.8, which is handled geographically locally, so everyone gets a fast answer. Otherwise your DNS query could take a while if you are located in the ‘wrong’ part of the world (because of the routing to the unique server). This is comparable to 911 (or 112, depending on which part of the world you are located in). Both are not unique, nor the same endpoints. Another example is 085 or 088 numbers in the Netherlands (which are not redirect numbers like 0800 or 0900 numbers, which actually point to a regular subscriber number); those ‘route’ to different geographical locations: 0851234567 can be located in the north and 0851234568 in the south, and so on. Kinda like how an IP subnet isn’t recognizable (anymore) as a specific geographical area.
Also you can assign multiple IPs to the same interface and they don’t have to be on the same subnet either. I’m working for a Dutch ISP and I can assign multiple IPs to the same customer on the same wire which are handled by the same host (ie a server or firewall/router).
As a disclaimer I have also worked a lot with PABX and telco before I started working with the ISP.
Well, actshually… (from somebody formerly responsible for the call routing software and services for a major VoIP provider)
First of all, phone numbers have never been routes. They haven’t even directly been endpoint locations for a very long time. There are shortcuts, but in general just finding out where you want the call to go can entail finding the official owner (Local Exchange Carrier) of the number, discovering to which LEC the number has been loaned (“ported”), and querying a location server to find out where that endpoint is currently being handled. (It can vary widely for cell numbers, and even varies for land lines depending on call routing features.) For load-balancing, failover, and feature reasons you can even end up with a *list* of endpoints. What you end up with are the identities of one or more big server- or router-like boxes; often they are literal IP addresses, but can be something equivalent for the switched-circuit network.
The next step is to figure out how to contact that big box. Usually you dump the call to one of your big exchange peers, who may directly forward it to the end provider, and may pass it along to one of their peers.
In all cases, you know how to route the call using… wait for it… routing tables. If you’re a big enough operator, part of that probably literally involves BGP. If not, it’s something equivalent to it.
I suspect what you have in mind is how each voice packet travels over the network. There’s two flavors. The first is the traditional “switched circuit network”. Once the two endpoint servers are identified, a dedicated path between them is allocated. It’s probably a virtual circuit, but the bandwidth is guaranteed. From the standpoint of the two servers there’s no contention with any other call. It’s like there’s a direct wire for that call. Physically, there are still lots of routers in between, and the virtual circuit is bundled up with other traffic, split out, and rebundled. It’s just that every packet (almost) always follows the same route, and has its own dedicated reservation on particular frequencies or timeslots.
The newer method is the “packet switched network”–plain old IP. Theoretically each packet can follow a different route from one end to the other, depending upon instantaneous load from other competing users. Packet-switched is *much* more efficient than circuit-switched and a bit more flexible. The downside is you have to handle redundant or dropped packets, variable latency, and more possibilities that something can fatally interrupt the call mid-session.
In both cases, control-plane and voice packets must travel across multiple routers. Specific high-volume routes may be determined in advance, but in the general case will require step-by-step decisions.
Oh, and multiple IP addresses for a host are not unusual. Not only can you have multiple interface cards (e.g. hard-wired and WiFi), but a single interface can easily be assigned multiple IP addresses, and your network may in effect assign even more (e.g. LAN addresses plus ephemeral WAN address/port combinations).
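To make the “a number is not a route” point concrete, here’s a very rough sketch of the lookup chain described above — ported-number lookup, then a location lookup, then a routing decision. All names and tables are invented; real LNP and location infrastructure is far more involved:

```python
# Very rough sketch of the call-routing chain described above:
# ported-number lookup -> location lookup -> route selection.
# All data and names are invented for illustration.

PORTED = {"+15551230001": "carrier-B"}           # number -> current carrier
DEFAULT_CARRIER = "carrier-A"                    # official number-block owner
LOCATION = {                                     # carrier -> handling endpoints
    "carrier-A": ["sbc1.carrier-a.example"],
    "carrier-B": ["sbc3.carrier-b.example", "sbc4.carrier-b.example"],
}
PEERING = {"carrier-A": "peer-exchange-1", "carrier-B": "peer-exchange-2"}

def route_call(dialed_number):
    carrier = PORTED.get(dialed_number, DEFAULT_CARRIER)
    endpoints = LOCATION[carrier]                # where the call is handled now
    return {"endpoints": endpoints, "via": PEERING[carrier]}

print(route_call("+15551230001"))
# The number never described a path; the route is computed at call time.
```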
> phone numbers have never been routes.
Yes they were. That’s literally how the POTS system worked when it was still circuit switched from end to end. The old relay switches and then crossbar switches were made with the explicit point that the user of the phone would connect themselves across the network by dialing numbers.
What came after digitalization is what you describe. The original system wasn’t computerized and didn’t have any provisions for lookup tables and number substitutions on the fly – each exchange would just follow what the phone user would input and connect accordingly.
(Replying to sibling comment about crossbar routing)
The routing tables were hard-coded in the crossbar wiring. But it’s true that the phone number was a nice hierarchical location which made that much easier. The actual route taken, though, depended upon where you were calling from *and* which circuit legs were free. You could compare it to an address such as “United States, Iowa, Podunk, Main St., 1234”. That’s a location, not a route. How you get there differs for someone from next door vs. New York City vs. New Orleans vs. London, and whether you’re traveling in the middle of the night or during rush-hour.
Also, because a phone number is a route instead of an ID, using the same number in different parts of the network is no problem. If I’m 1234 and you’re 1234, the way we address each other would be prefixed by the route towards each other, so you might be 011234 while I’m 021234 and so forth.
The clever bit is that you can add to the route at both ends. If the first number means “Go to the global internet”, then the next number might mean “Go to Brazil”, and then subsequent numbers specify where in Brazil you want to connect. If you don’t want to leave the country, you might specify “Go to the local exchange” for the first number, and then “Connect to my neighbor” for the next. If both numbers are globally referenced, the local exchange can notice that the beginning of both routes is the same, so it can just drop that part and connect you directly without knowing how the rest of the network is structured.
The most similar dialing I’ve done is internal calls being shorter than local calls being shorter than long distance calls being shorter than international calls. I can’t control the routing by adjusting the number, while still reaching the same endpoint. It’s just omitting the digits that are the same, while having completely specific addresses otherwise, no matter how they’re assigned.
I believe while there was a time before my memory where you might say “Get me San Francisco” and get sent along a chain of operators according to what you ask them one piece at a time, most of the time that there has been a specific 10-digit number for someone in the U.S. the routing was not in your control after you chose the destination.
These days the scheme is messed up because of cellphones, because you can’t know where they are at any given moment. You specify country, carrier, and the rest is more or less a unique identifier tied to the IMEI of the phone. That means the network has to know how to reach each endpoint at the location they’re currently at.
In the old days, at least where I’m at, the first zero you dialed would put you at the local exchange, and the next number would decide whether you would be switched to other local exchanges, international exchange, special service numbers etc. The exchange would throw you to the next line along that branch, if available, but otherwise you chose where you wanted to go. The point of “phreaking” was that you’d have these secret numbers and sequences that would patch you through to places where you shouldn’t be going.
> a specific 10-digit number
Each subscriber in such a network would have a “global” route that specifies things like country, state, county, locality, area, subscriber. If you were dialing your neighbor, the exchange would ignore the part that refers back to itself once it notices that happening. In theory you could dial in only the relevant part, but the order of operations meant that the first numbers would have a different meaning. In effect, if you wanted to “short dial”, there should have been a special digit to skip ahead in the sequence.
Dude. Don’t forget the three illegal touch tones.
Just numerical progressions of the normal ones, but woe unto the fool that got caught making those noises into a phone receiver.
Later they made it harder to direct dial Cheyenne Mountain. The three tones were available w a C-64 app. You were still playing w fire.
Though in the case of major network failure, it would probably take quite a long time for the systems to scramble around and find alternate routes, since the assumption is that nobody needs to know the full topology of the network.
So what would happen is, you’d occasionally have to send broadcast packets that go everywhere and those that reach the destination would be sent back the same way to report what route they took and how long it took to get there. If everyone’s doing that, it would spam the network full.
IPv6 solves that…
ATM is dead, deal with it.
It was not a scalable network architecture because the routes had to be pre-defined when the architecture was designed. That made modifications or expansions messy and prone to human (architect) error.
Gladys as CEO. Oh no…
Just what Optus needs, a CEO who effs up everything and people still think she’s great. Might work.
And absolutely didn’t lose her last job because of corruption of course!
C’mon mate, everyone knows she stuck her fingers in her ears and went “la la la I can’t hear you” whenever Dazza started crapping on about his totally not dodgy deals.
We don’t always test. But when we do, we do it in production.
Everyone has a test environment. Some lucky snots get to have it separate from their production environment!
What happened to updating only a small portion of a network initially?
My wife was in hospital with our 4-month-old son overnight when this went down.
I woke up and was frantically trying to get in contact with her, but both of us were with Optus so we couldn’t call, text, etc.
She had the same experience, she was standing out the front of the hospital in a gown with her butt hanging out trying to see if getting a better signal would fix it.
Our home internet is with Telstra so I could get online, and the hospital wifi was still up too. One of the nurses saw on social media that people were talking about the Optus outage, but there wasn’t any official statement from Optus.
We ended up talking on messenger, I found out they were both okay, what room they were in, etc.
It was a terrifying experience needing to communicate and Optus letting us down
Terrifying? Placing your ENTIRE trust into a single point of failure is terrifying. If you couldn’t get a hold of her, then maybe GO to where you knew she was. Not being able to function because one means of communication is down is terrifying. Don’t stray too far from cell towers and wi-fi. I don’t want to imagine the terror you’ll experience if YOUR only means of communication fails.
Terrifying, Just like how it is for the people caught up in the wars in the Middle East and the Ukraine when they can’t communicate…… except it’s not really is it…..
As others have pointed out, this is almost identical to the Canadian Rogers outage in 2022. A software upgrade to their BGP edge peering routers let route updates propagate throughout their internal systems, overwhelming them and bringing the entire network to a halt.
Their larger problem wasn’t so much the outage, but the woeful recovery. They didn’t have remote access or control of the routers so they had to dispatch people to numerous remote sites. 6 hours into the collapse of the network they still didn’t know what was wrong or caused it. It took them 12-14 hours to fully recover. Making it even worse, the company communicated poorly, including the CEO. Apparently they’d only simulated partial network failures and had never contemplated a full network collapse. This is after a massive security breach a year earlier under the same CEO that resulted in states re-issuing and redesigning drivers licenses, passports being renewed and a huge amount of personal data being leaked. That’s when the CEO’s poor communications skills came to light.
“These routing information changes propogated through multiple layers in our network and exceeded preset safety levels on key routers which could not handle these. This resulted in those routers disconnecting from the Optus IP Core network to protect themselves.”
That is the most bullshit Star Trek technobabble explanation I have heard in a LONG time.
We need to stop accepting it when PR people write about stuff they don’t understand. If 1 paragraph from an engineer is too much for the general public to understand, then they don’t understand it anyway. Telling them the flobberbobin exceeded its hourly PKT threshold does nothing but make people THINK they understand things and make bad assumptions.
Makes perfect sense to me; there’s a maximum amount of changes that can be safely made to the routing tables before the device removes itself from the network – minimising corruption to said tables (& other devices).
What’s so hard?
Starlink ? …