Post-Incident Analysis - Telephone System Issue:
Because we recognize that system service interruptions impact campus business, we are communicating this post-incident analysis of the telephone issue that occurred on Friday, June 13, 2014. This analysis provides an explanation of what occurred as well as an explanation of changes being made as a result.
Incident Summary and Cause:
Based on a small number of telephone issues reported during the week of June 9-12, IS&T identified a possible hardware/software failure on a circuit card on a server within the campus Avaya telephone system. During emergency maintenance scheduled for the night of Thursday, June 12, testing confirmed the issue. The card did not recover after this testing. Beginning at 4:00 a.m. on Friday, June 13, 2014, failure of the circuit card left about ten percent of campus telephones unable to make or receive calls. IS&T immediately requested a replacement circuit card from the vendor.
To reduce the impact of the service interruption, IS&T communicated this issue to campus users via the status page at around 7:15 a.m. and via campus broadcast at around 9:00 a.m., directing affected users to the IS&T Help Center for assistance forwarding phones to alternate extensions. When a solution for temporarily restoring phone service was discovered, IS&T began to remotely reset the campus phones that the system reported as affected. Because reporting could not identify all affected phones, IS&T posted a status page notice at about 10:15 a.m. notifying users that manually resetting their phones could resolve their issue. After confirming that a manual reset worked in most cases, IS&T sent a campus broadcast at around 12:45 p.m. to reach any remaining affected users.
IS&T received a replacement circuit card at 1 p.m. At 5:30 p.m., IS&T replaced the affected card and began monitoring the system to confirm the incident was resolved. A short interruption of phone service, lasting about 15 minutes, occurred after business hours as IS&T made the repairs.
So that we can diagnose and repair affected phones more quickly should a similar incident occur in the future, we are putting in place additional reporting software that will help us identify and remotely reset affected phones faster. This should minimize the time users experience impacted service and reduce the need for users to manually reset phones.