T-Mobile errors caused a nationwide outage, but the FCC isn't penalizing the carrier
T-Mobile advertisement in Times Square, New York, on October 15, 2020.
Getty Images | SOPA images
The Federal Communications Commission has closed its investigation into a T-Mobile network outage that Chairman Ajit Pai described as "unacceptable." Rather than penalize the wireless carrier, the FCC is simply issuing a public notice to remind phone companies of "industry best practices" that could have prevented the T-Mobile outage.
After the 12-hour nationwide outage on June 15 disrupted text messaging and calling services, including 911 calls, Pai wrote that "the T-Mobile network outage is unacceptable" and that "the FCC is investigating. We demand answers – and so do American consumers."
Pai has a history of talking tough to carriers without imposing penalties that might have a greater deterrent effect than sternly worded warnings. That appears to have happened again yesterday, when the FCC announced the results of its investigation into T-Mobile. Pai said that "T-Mobile's outage was a failure" because the carrier was not following best practices that could have prevented or minimized it, but he did not announce any punishment. The matter appears to be closed based on yesterday's announcement, but we contacted Chairman Pai's office today to ask whether T-Mobile is facing any pending punishment. We'll update this article when we get a response.
FCC describes T-Mobile errors
The staff's investigation report identified several mistakes T-Mobile made during the outage, which began while T-Mobile was installing new routers in the southeastern United States. When a fiber optic link in the region failed, T-Mobile's network should have routed traffic over a different link. However, the operator "had misconfigured the weight of the connections to one of its routers," which "prevented traffic from flowing to the new active router as intended." T-Mobile had not implemented a fail-safe process to prevent the misconfiguration or to alert network technicians to the problem.
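To see why a single wrong link weight matters, consider a minimal sketch of shortest-path routing, the principle behind Open Shortest Path First. The topology, router names, and weight values below are hypothetical and are not taken from the FCC report; the sketch only shows how an inflated weight can steer traffic away from the router that was meant to take over.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm over {node: {neighbor: weight}}.

    Returns (total_cost, path), or (inf, []) if the target is unreachable.
    """
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []


def build_topology(new_router_weight):
    """Hypothetical links that remain after the primary fiber link has failed."""
    links = {
        ("market", "new_router"): new_router_weight,  # the weight under test
        ("new_router", "core"): 10,
        ("market", "legacy_router"): 50,
        ("legacy_router", "core"): 50,
    }
    graph = {}
    for (a, b), weight in links.items():
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph


# Intended weight: traffic fails over through the new router.
intended = shortest_path(build_topology(10), "market", "core")
# Misconfigured (assumed far too high) weight: traffic avoids the new router.
misconfigured = shortest_path(build_topology(1000), "market", "core")

print(intended)
print(misconfigured)
```

In this toy network the misconfigured weight does not break any link. It simply makes the intended backup path look more expensive than an alternative, which is why this kind of error can stay invisible until a failure forces the routers to recompute their paths.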
As a result, the Atlanta market was "isolated" from the rest of the network, causing all LTE users in the area to lose connectivity. A software bug worsened the situation by preventing mobile devices in the Atlanta area from re-registering with the IP Multimedia Subsystem over Wi-Fi. Instead of forwarding device registration attempts to another node, "the registration system repeatedly forwarded re-registration attempts for each mobile device to the last node in its records that was not available due to market isolation."
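The behavior the FCC describes can be illustrated with a small sketch. Everything here is hypothetical (the function names, node names, and retry limit are illustrative assumptions, not T-Mobile's software): one node picker always returns the last node on record, and the other falls back to a reachable node when the recorded one is unavailable.

```python
def pick_last_recorded(device, last_node, reachable_nodes):
    """Buggy behavior: always forward to the last node on record."""
    return last_node[device]


def pick_with_failover(device, last_node, reachable_nodes):
    """Assumed fix: use the recorded node only if it is still reachable."""
    recorded = last_node[device]
    if recorded in reachable_nodes:
        return recorded
    # Deterministic fallback for the example: first reachable node by name.
    return min(reachable_nodes) if reachable_nodes else None


def register(device, picker, last_node, reachable_nodes, max_attempts=3):
    """Try to register a device; return (succeeded, attempts_made)."""
    attempts = 0
    for _ in range(max_attempts):
        attempts += 1
        node = picker(device, last_node, reachable_nodes)
        if node in reachable_nodes:
            last_node[device] = node  # remember where the device registered
            return True, attempts
    return False, attempts


# The recorded node sits in the isolated market and cannot be reached.
reachable = {"dallas-node", "chicago-node"}

records_buggy = {"phone-1": "atlanta-node"}
buggy_result = register("phone-1", pick_last_recorded, records_buggy, reachable)

records_fixed = {"phone-1": "atlanta-node"}
fixed_result = register("phone-1", pick_with_failover, records_fixed, reachable)

print(buggy_result, fixed_result, records_fixed)
```

With the buggy picker, every retry lands on the same unreachable node, so each device keeps generating failed attempts. With the fallback, the first attempt succeeds and the record is updated.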
The software error had existed in the T-Mobile network for months. "This software bug probably didn't cause any problems before this outage occurred, as the outage was the first notable market isolation since T-Mobile added this software to its network," said the FCC. Regular testing "may have discovered the software bug and misconfiguration of routing before it could affect live calls," the FCC also said.
After problems started on June 15, T-Mobile engineers "exacerbated the impact [of the failure] because they misdiagnosed the problem." The FCC report continued:
T-Mobile believed the fiber optic transport link that went down earlier in the day continued to cause the ongoing outage. Based on this belief, T-Mobile manually closed the connection in order to divert traffic away from it. However, due to the still misconfigured Open Shortest Path First weights, these steps restored the initial conditions of the failure. LTE customers in the Atlanta market were again disconnected from the LTE network and had to make calls via WiFi. Their registration attempts failed again, causing a registration storm that further overloaded T-Mobile's IP multimedia subsystem.
T-Mobile engineers realized almost immediately that they had misdiagnosed the problem. However, they could not fix the problem by re-establishing the connection because the network management tools required to do so remotely relied on the same paths they had just disabled. When T-Mobile engineers were able to access the devices on site an hour later and correct their errors by re-establishing the connection, customers in the Atlanta market were able to try again to register with VoLTE [Voice over LTE]. However, this again led to an additional overload, as the engineers at T-Mobile had not yet corrected the software error that was preventing the registrations from being completed.
Failure goes nationwide
The FCC report explained how the outage in the Atlanta market spread nationwide. The external traffic destined for the Atlanta system was redirected to other regions, "causing these registration systems to become overloaded enough to cause the T-Mobile network to send the registration attempts to other nodes. The software bug redirected the registration attempts again to the last remaining node in its records, which was probably already severely overloaded." Shortly thereafter, "IP Multimedia Subsystem, VoLTE and Voice over Wi-Fi registrations failed nationwide."
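The way one overloaded region can pull down the rest is easier to follow with a toy model. The sketch below assumes that a failed region's load is split evenly among the surviving regions, which is a simplification and not the redistribution logic described in the FCC report; the region names, loads, and capacities are hypothetical.

```python
def cascade(loads, capacity, first_failure):
    """Return the regions that fail, in order, after an initial failure.

    Assumption for illustration: a failed region's load is shared equally
    by all surviving regions, and any region above capacity fails next.
    """
    loads = dict(loads)  # work on a copy so the caller's data is untouched
    failed = []
    pending = [first_failure]
    while pending:
        region = pending.pop()
        if region in failed:
            continue
        failed.append(region)
        survivors = [r for r in loads if r not in failed]
        if not survivors:
            break
        share = loads[region] / len(survivors)
        loads[region] = 0
        for r in survivors:
            loads[r] += share
        # Queue every survivor that is now over capacity.
        pending.extend(r for r in survivors if loads[r] > capacity)
    return failed


regions = {"atlanta": 60, "dallas": 70, "chicago": 70, "seattle": 70}

# Enough headroom: the failure stays local.
contained = cascade(regions, capacity=100, first_failure="atlanta")
# Too little headroom: every region is eventually pushed over capacity.
nationwide = cascade(regions, capacity=80, first_failure="atlanta")

print(contained)
print(sorted(nationwide))
```

The only difference between the two runs is spare capacity. Once redistributed load exceeds what the survivors can absorb, each new failure adds more load to the regions that remain.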
The vast majority of T-Mobile customers could not connect to Voice over LTE or Voice over Wi-Fi networks, so they fell back to T-Mobile's 3G and 2G circuit-switched networks to make and receive calls while their devices continued their registration attempts to the VoLTE network. This overloaded the 3G and 2G networks, causing many phone calls to fail. Network nodes continued to hold resources for those call sessions after the calls ended, overloading the nodes' computing resources and causing even more call failures.
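The effect of holding resources after a call ends can be shown with a short sketch. The class, its slot-based capacity model, and the numbers are hypothetical and only illustrate the general failure mode: a node that never frees session slots eventually rejects every new call.

```python
class CallNode:
    """Toy network node with a fixed number of call-session slots."""

    def __init__(self, capacity, release_on_hangup):
        self.capacity = capacity
        self.release_on_hangup = release_on_hangup
        self.held = 0  # slots currently reserved

    def start_call(self):
        """Reserve a slot; return False if the node is out of slots."""
        if self.held >= self.capacity:
            return False
        self.held += 1
        return True

    def end_call(self):
        """Free the slot, unless this node models the leak."""
        if self.release_on_hangup and self.held > 0:
            self.held -= 1


def count_failed_calls(node, calls):
    """Place calls one after another, hanging up each before the next."""
    failed = 0
    for _ in range(calls):
        if node.start_call():
            node.end_call()
        else:
            failed += 1
    return failed


healthy_failures = count_failed_calls(CallNode(5, release_on_hangup=True), 20)
leaky_failures = count_failed_calls(CallNode(5, release_on_hangup=False), 20)

print(healthy_failures, leaky_failures)
```

Even though no two calls overlap in this example, the leaky node runs out of slots after its fifth call and fails the remaining fifteen.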