LAX Meltdown Caused By A Single Network Interface Card
According to the LA Times, the LAX computer meltdown that stranded 20,000 international passengers was the work of a single malfunctioning network interface card on a single desktop computer in the LAX international terminal. From the LA Times:
The card, which allows computers to connect to a local area network, experienced a partial failure that started about 12:50 p.m. Saturday, slowing down the system, said Jennifer Connors, a chief in the office of field operations for the Customs and Border Protection agency.
As data overloaded the system, a domino effect occurred with other computer network cards, eventually causing a total system failure a little after 2 p.m., Connors said.
"All indications are there was no hacking, no tampering, no terrorist link, nothing like that," she said. "It was an internal problem" contained to the Los Angeles International Airport system.
LAX outage is blamed on a single computer [LA Times]
(Photo: Kenny Miller)
PREVIOUSLY: 20,000+ International Passengers Stranded At LAX
Comments:
Strange. You would think an important system like this would have some sort of redundancy or backup capacity built in.
Of course, this is America, where we don't invest in infrastructure, give contracts to the cheapest bidder, and don't worry about fixing things until they are very, very broken. So I guess I shouldn't be too surprised.
Isn't this what error messages are for?
I find it hard to believe that they somehow managed to set up a system with no notifications for hardware failure. Christ, even the crappiest versions of Windows will throw up a dialog box when something's not working right.
(Linux, in my experience, will go into kernel panic and shit itself. Maybe they were running Debian.)
I'm with the others who are shocked (and horrified) that a single bad NIC can bring down an airport. That sort of thing may be fine for a small business, but anything close to critical should have some redundancy.
Cumaeansibyl: I'll give them SOME (maybe 0.00001 inch) of credit on this one -- Windows only pops up error messages for a total lack of connectivity and when an IP can't be resolved. If you have a truly malfunctioning NIC, Windows might see it as working properly. Living in lightning country here in Florida, I've seen countless NIC & networking failures, including some odd partial failures where everything appears right but packets aren't being passed properly.
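To make the "looks fine but isn't passing packets" case concrete, here is a minimal sketch of judging a link by measured probe loss instead of by link status alone. The function name, the probe counters, and the 20% loss threshold are all hypothetical, chosen only for illustration; nothing here describes how the LAX system or Windows actually monitors NICs.

```python
def classify_link(link_up: bool, probes_sent: int, probes_answered: int,
                  max_loss: float = 0.2) -> str:
    """Classify a NIC as 'down', 'degraded', or 'healthy'.

    link_up is what the OS reports; the probe counters are assumed to come
    from some periodic test traffic (for example, pings to a gateway).
    """
    if not link_up:
        return "down"          # the only case a link-status check would catch
    if probes_sent <= 0:
        raise ValueError("need at least one probe to judge a link")
    loss = 1 - probes_answered / probes_sent
    # Link light is on, but too much test traffic is vanishing.
    return "degraded" if loss > max_loss else "healthy"


print(classify_link(True, 100, 100))   # healthy
print(classify_link(True, 100, 35))    # degraded
print(classify_link(False, 0, 0))      # down
```

The second call is the partial-failure case: a check that only looks at link status would report that card as working.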
Fortunately, this wasn't malicious, just incompetence, but while just sitting here, I've thought of a few ways in which someone could attack such a network for their own gain or mayhem. Worse, you could cause an airport to "ground" flights or run everyone back through security, then let a bomb go in the screening area queue. Not good.
Kinda sad that this happened with US Customs. I expected better from them. If this said TSA, sadly, I don't think anyone would have been surprised.
@dieselbug: Come play with some Florida lightning or a serious case of static electricity. Then you'll see it happen with Ethernet.
Not "corporate" LAN related, but if you really want to cause some major outages, hook up a cheap 27MHz transmitter to your CATV line. That's the reverse channel for many cable modems. Let's just say you'll end up pissing off a lot of people.
@FLConsumer:
My point was that it's a known issue with TR that a failed NIC on ANY device in the segment can pull the segment down for extended periods of time. This is one of the reasons (apart from the ridiculous price) that T-R has been replaced by Ethernet as a "standard". The industry has many legacy technologies in use (don't ask about ATCs - you'd never fly if you knew how old some of their technology is...).
I didn't RTFA, but I could see this happening if the "computer" they were referring to was part of the infrastructure (i.e. a router or a switch). If a blade of a router "partially failed" and started a broadcast storm, that could easily bring a network to its knees. It might be unlikely, but it's not impossible.
It seems most likely that this was a Token Ring network -- they are semi-infamous for this. OTOH, even normal Ethernet can have similar issues, but it should be mitigated if not eliminated by a switch -- a NIC can flood a hub (if its cross-talk detection fails, for example) and easily keep it flooded, but a properly configured switch should confine the issue to that network segment.
The bad NIC may have been what caused it, but why it happened was inadequate/improper setup on the network (or possibly a well-designed but already damaged/degraded network (old hw/hw needing work) being driven over the edge).
Additionally, any even meso-competent network person should be able to fix this fairly quickly.
My bet is that it's an older Token Ring network -- failures on those can be much more of a pain to fix. (This of course leads us to the question of why do it that way.)
However, even if a blade goes bad, it's easy to isolate and replace once you notice the storm (assuming you have spares on site -- but I'd assume that this is a mission-critical system, so they'd better have them).
My first thought was also "Token ring network". Followed shortly by "They're on a token ring network?!!!!"
If the network was actually ethernet, whoever set it up needs to be sued. An ethernet network should not fail like that. Hubs, switches, etc, should be keeping flooding from occurring (to the point that a single network card is bringing down an entire airport network). It was a NIC on a desktop computer, which indicates that it was a leaf on the network. If it had the ability to bring down the network, that's design so crappy, I can't even comprehend.
I'm not sure what's better. The possibility the network is token ring, or that an ethernet network was set up so poorly.
I was thinking the same thing. I was thinking, "How did they get Ethernet to fail so spectacularly?" I do know some municipal networks around here are still on Token Ring (i.e. if it ain't broke, don't fix it), so I guess it's still in use. God knows why.
@Malethos: Aha! You know what happens when you ass|u|me...
My little off-the-cuff hypothetical diagnosis also ignored the fact that, as others had mentioned, any network engineer worth their salt would have taken router failure into account and had redundant circuits out the wazoo. If they were too shortsighted to build a highly available network, they were probably too shortsighted to have hardware stockpiled too.
Just my 2 cents. I'm probably WAY wrong.
@CumaeanSibyl: (Linux, in my experience, will go into kernel panic and shit itself. Maybe they were running Debian.)
Nice troll!
it sounds like a chatty NIC making broadcast storms on a flat ethernet network with no VLAN'ing to segment the end user desktops from the core switches and servers.
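A rough sketch of the kind of per-port broadcast limit that would keep one chatty NIC from swamping a flat network. Many managed switches offer a feature along these lines, but the class, the method names, and the limit of 100 broadcasts per window below are hypothetical, made up for illustration, and do not reflect any vendor's implementation or the actual LAX setup.

```python
from collections import Counter


class StormControl:
    """Drop broadcast frames from a port once it exceeds a per-window budget."""

    def __init__(self, max_broadcasts_per_window: int):
        self.limit = max_broadcasts_per_window
        self.seen = Counter()          # broadcasts counted per port this window

    def start_window(self) -> None:
        """Reset the counters (a real device would do this on a timer)."""
        self.seen.clear()

    def allow(self, port: int, is_broadcast: bool) -> bool:
        if not is_broadcast:
            return True                # unicast traffic is not limited here
        self.seen[port] += 1
        return self.seen[port] <= self.limit


def simulate(limit: int = 100) -> dict:
    """One window: port 1 is the faulty NIC, port 2 is a normal desktop."""
    ctl = StormControl(limit)
    forwarded = Counter()
    for _ in range(10_000):            # the storm
        if ctl.allow(1, True):
            forwarded[1] += 1
    for _ in range(5):                 # ordinary background broadcasts
        if ctl.allow(2, True):
            forwarded[2] += 1
    return dict(forwarded)


print(simulate())   # {1: 100, 2: 5}
```

Of the 10,000 broadcasts from the bad port, only the first 100 in the window get through, while the well-behaved port is untouched. Segmenting desktops into their own VLAN would additionally keep those 100 away from the core switches and servers.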
The wifi networks at the airports are completely isolated from the airport/airline functions. They probably have 1000 times better hardware and configuration for just the casual internet surfer waiting on their plane.
this is not good news. hopefully/luckily this incident was an accident and unintentional.
However, any 9th grader (or terrorist) these days could use a flood script on an exposed network jack in the airport to do the same thing...
the airport has a large data network but it probably isn't as large and complex as one might think. I'm guessing about 1000 endpoints total for just the airport/airline internal systems.
@gibsonic: One would still assume, however (I know, see my previous ass|u|me post) that the network that the travelers use would be completely segregated from the mission-critical networks that the airport requires to operate (ticketing, flight data, etc).
the reason i say they are separate is b/c i know they are, from being in the industry.
the airports/airlines are using ancient systems that have been around for decades.
3rd party companies such as t-mobile and their partners ran all new infrastructure cabling in the airports they service with their own firewalls, routers, servers, etc. connected to their own circuits from the CO.
parallel systems in the same building.
The failure itself was a relatively common incident. What's really interesting is the set of management failures that led to this failure. I've blogged about this over at ZDNet:
Michael Krigsman
[projectfailures.com]
@pestie: Yeah, but nobody went for it! What gives?
Actually, I speak from personal experience. I have a laptop with a faulty hard-drive connector that occasionally causes errors in Windows. When I ran Debian, the damn thing crashed at least once a day. Windows just seems to be more forgiving of that kind of thing.
I do wonder what software they're using. It might be some godawful decades-old proprietary thing that they paid the programmer for in beer and Cheetos.
It's called a broadcast storm. Sounds like LAX still has some old hubs in their network.
This time it's not TSA's fault, it's the dumb-asses at LAX who manage their IT infrastructure. Geez, hubs went out 10 years ago with the introduction of switch-based networking.
It amazes me when stupid people running multibillion-dollar corporations have ONE-SINGLE-PC take down an entire system!!!
It's US Customs, not FAA/TSA/NTSB. They're running Ethernet. The problem here is how their network was designed with no redundancy.
I wonder just how dated the ATC equipment is compared to other gov't agencies. The last time I checked, NOAA was looking for 8088's to handle their upper air soundings. The WSR88D's (doppler radars) still run on Fortran. Nothing wrong with the old stuff as long as you've designed it well.
"As data overloaded the system, a domino effect occurred with other computer network cards, eventually causing a total system failure a little after 2 p.m."
WOW... what a load of CRAP.
Also, anything stranding 20,000 passengers is NOT a "partial" failure.