Finding Common Ground to Resolve Outages
Susana Schwartz
01/31/2008
With increasingly complex convergent services, the incidence of outages is expected to grow. In 2007, a variety of high-profile outages proved difficult to resolve because the stakeholders involved in delivering services lacked consistent language and data.
Persistent problems with outages and with the reporting of network outages drove AT&T Inc., Alcatel-Lucent, Qwest Communications International Inc., Sprint, T-Mobile USA Inc., U.S. Cellular and Verizon Communications Inc. to participate in the ATIS Network Reliability Steering Committee (NRSC). After 18 months of concentrated work, the NRSC has released its Outage Classification Standard. The standard is designed to give carriers, vendors and the FCC a template for uniformity in outage data and reporting models. The hope is to resolve inconsistencies that plague network analysis and reporting.
The project was driven initially by AT&T, whose principals realized consistency was paramount after the company’s divestiture: "We realized a common view of processes and data among disparate pieces of the company would be necessary to ensure network reliability," says Archie McCain, chairman of the NRSC and director of Six Sigma programs for AT&T.
AT&T and the other NRSC members recognized that outage analysts and coordinators grappled with different protocols and methodologies for notifications, updates and modifications to outages. The disparity arose because service providers, government agencies and suppliers each had their own methods and systems for classifying outages. As a result, most outages had to be re-classified or converted as they worked their way through carriers’ systems and on to supplier and/or government systems.
"We all needed outages to be classified just once, even in instances where changes and modifications had to be made to outage information," says Jay Bennett, principal scientist at Telcordia Technologies Inc., one of the lead editors of the standard’s documentation for enumerating possible outage classifications. The ATIS standard was designed to classify outages within three categories: cause of outage failure (i.e, hardware, software, cable, wireless transmission or capacity); reason for outage (including a primary and secondary descriptive protocol); and responsibility for outage, (such as acts of nature, reporting service provider, government, or vendor). The creation of these broader categories was meant to ensure that the classification system would be applicable to all network types, and to a wide range of statistical analysis and outage causes. "The template helps answer questions around what failed, why and who is responsible," says McCain, noting there are lists and sub-lists under each of those questions. "You can have primary reasons and then secondary reasons for an outage. A primary may be damage, design problems or engineering issues. Then within that, there can be more granular detail such as whether it was an accident, an intentional act, a documentation problem, or a supervision problem," explains McCain. Once the what and why are known, it’s easier to understand the who, such as whether it was the service provider, the utility company, or the individual that was responsible. The ATIS standard provides guidance through documentation of certain outage examples and ensuing classifications for the what, why and who. (See table below.)
The ATIS system distinguishes primary and secondary reasons for outages. For example, an invalid pointer added to an office retrofit tape indicates that a trunk group may experience a failure. The "what failed" question would yield an answer of "software," under which a primary reason of "design" could be determined. That would point to the vendor as the party the carrier must go to for resolution. The process would also lead to a subgroup, or "secondary" reason, of "accident," so that further rules about the actions to take can be implemented.
Having a standard template for mapping issues and resolutions is a first step, but the next challenge is writing the software to implement the classifications in carriers’ databases. "To do this effectively, service providers may have to establish a dual system in their databases so that they can classify the old way while starting to classify the new way," concedes McCain.
By allowing the traditional and new methods to run side by side for a period of time, data would be allowed to build up so that service providers could conduct operational reporting without disruption. "You don’t want to do a clean sweep because you still need more than a few days’ worth of data to report on things like card failures or digital switches," notes Bennett. He adds that some companies may choose a phased approach. "Some may implement the software for re-classification of a year’s worth of records to start off, so they are mapping the old classification into the new," says Bennett.
The approach to conversion will depend on the methodologies and systems each provider used originally. A standard that addresses the what, why and who of outages gives service providers a chance to prioritize the outage issues worth pursuing, as opposed to exhausting resources on those where improvements cannot be made.
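The dual-system and phased re-classification ideas that McCain and Bennett describe could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the legacy codes (`SW-DSGN`, `CBL-CUT`), the mapping table and the function names are hypothetical and do not come from the standard or from any carrier’s system.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class NewClass(NamedTuple):
    """New-style classification: what failed, why, who is responsible."""
    cause: str
    primary_reason: str
    responsibility: str


# Hypothetical mapping from a provider's legacy codes to the new scheme.
LEGACY_TO_NEW: Dict[str, NewClass] = {
    "SW-DSGN": NewClass("software", "design", "vendor"),
    "CBL-CUT": NewClass("cable", "damage", "reporting service provider"),
}


class OutageRecord(NamedTuple):
    outage_id: str
    legacy_code: str                # kept so existing reports keep working
    new_class: Optional[NewClass]   # None until a mapping is available


def reclassify(outage_id: str, legacy_code: str) -> OutageRecord:
    """Store both classifications side by side; never drop the legacy code."""
    return OutageRecord(outage_id, legacy_code, LEGACY_TO_NEW.get(legacy_code))


def backfill(
    rows: List[Tuple[str, str]]
) -> Tuple[List[OutageRecord], List[OutageRecord]]:
    """Map a batch of (outage_id, legacy_code) rows into the new scheme.

    Returns the converted records and the records that still need
    manual review because their legacy code has no mapping.
    """
    converted: List[OutageRecord] = []
    needs_review: List[OutageRecord] = []
    for outage_id, legacy_code in rows:
        record = reclassify(outage_id, legacy_code)
        (converted if record.new_class else needs_review).append(record)
    return converted, needs_review


converted, needs_review = backfill(
    [("A1", "SW-DSGN"), ("A2", "CBL-CUT"), ("A3", "MISC-99")]
)
print(len(converted), "converted;", len(needs_review), "need review")
```

Keeping the legacy code on every record is what lets old-style operational reports continue to run while data under the new classification accumulates.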
"Rather than throw resources at issues where further improvements are difficult to attain, the template will help identify places where there is room for improvement," says McCain. "The outages caused by human error may require more attention to procedure than those which are more straightforward, like a cable cut." With the Outage Classification standard, he believes service providers can target areas where rules are lacking.