Frame relay outage causes disruption in AT&T network

Officials blamed three distinct circumstances for the April 13 failure, which kept thousands of transaction-oriented customers, such as banks and credit card centers, off the network for between six and 26 hours, depending on the customer. First, the officials said, the procedure for upgrading firmware on a circuit card inside a Cisco switch was "inadequate," partly because it allowed maintenance while the switch was still connected to the network. No customers were hooked up to the switch, the officials stressed, so the device was considered inactive.

Second, the officials said, a command in the procedure triggered software flaws on the circuit card being upgraded, and the switch started to loop. In that state it sent a surge of false signals to the other switches, swamping the network and eventually bringing it to a halt.

Third, software installed to monitor system health failed to recognize the signal storm as false messages and so did not cut off the offending source. The reason, the officials said, is that administrative messages of the sort sent out in last week's failure are often sent in great volumes, especially after a node has gone down and is being brought back online.
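
The monitoring difficulty the officials describe, telling a legitimate burst of administrative messages from a runaway source, can be illustrated with a small sketch. This is not AT&T's or Cisco's software. The class name `StormMonitor`, its parameters and its thresholds are all hypothetical, and the approach (tolerating a high message rate for a bounded grace period, then flagging a source whose rate stays high) is only one assumed way such a check might be built.

```python
from collections import defaultdict, deque


class StormMonitor:
    """Hypothetical sketch: flag a source whose message rate stays high too long.

    A short burst (for example, a node coming back online) is tolerated for
    up to `grace_s` seconds; a rate that stays above the limit longer than
    that is treated as a storm. All thresholds here are illustrative.
    """

    def __init__(self, window_s=10.0, max_msgs_per_window=1000, grace_s=60.0):
        self.window_s = window_s
        self.max_msgs = max_msgs_per_window
        self.grace_s = grace_s
        self._times = defaultdict(deque)  # source -> timestamps inside the window
        self._over_since = {}             # source -> time the limit was first exceeded

    def record(self, source, now):
        """Record one message from `source` at time `now` (seconds).

        Returns True if the source should be isolated, otherwise False.
        """
        times = self._times[source]
        times.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while times and now - times[0] > self.window_s:
            times.popleft()
        if len(times) <= self.max_msgs:
            # Rate is back within the limit, so any earlier burst has ended.
            self._over_since.pop(source, None)
            return False
        # Rate is over the limit: remember when that started and flag the
        # source only once the grace period has been used up.
        start = self._over_since.setdefault(source, now)
        return now - start >= self.grace_s
```

In this sketch, a switch that sends heavily for a short time after a restart is never flagged, while one that keeps sending at the same rate past the grace period is. Picking the grace period is the hard part: too short and normal recovery traffic is cut off, too long and a storm has time to spread.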

The combination of the three faults seriously disrupted the network and drew national attention.