The disruption in services of the National Stock Exchange (NSE) has triggered a widespread discussion about why the Exchange did not switch operations to its disaster recovery (DR) site when it faced a major technical issue on 24th February. The DR is expected to get started in 45 minutes after a disaster. The NSE is an extraordinarily technology-intensive operation, which allows traders to conduct extremely high frequency trading (HFT) in milliseconds through its co-location systems. On paper, NSE’s DR is tested and run from the back-up site several times a year as part of a regulatory requirement mandated by the Securities and Exchange Board of India (SEBI).
The question being asked by investors is pertinent, since the investment community collectively lost crores of rupees when their trades were forcibly squared off by their brokers before NSE suddenly announced an extended session that evening.
Usually, when a disruption happens, the management has to declare a disaster based on the extent of its impact on trading. In theory, the exchange is then supposed to start operating seamlessly from the DR site while the tech teams repair and restore the original site in the fastest possible time.
Most high-tech operations today have DR sites which are expected to protect them from disruption through a seamless switch. In my experience, in most cases, top management hesitates to declare a disaster, as they seldom have the confidence of actually running operations from a near site or a remote disaster recovery site. This is mainly due to the complex technology linkages between different elements of the IT service.
In addition, people are the most critical part of the switchover process. Having the right resources available during a crisis is crucial. While internal IT teams perform disaster drills and claim they are able to run operations from the remote site, the fact is that new problems come to the fore when a disaster actually strikes, and the top management team is usually oblivious to these issues.
These so-called drills are often just a farce put on for external and internal auditors, who seldom get to the core issues that may have occurred during the drill. Most auditor reports are checklists and they are happy to tick them off and present a rosy picture to the management.
It is important to remember that NSE’s press release also said that it did not invoke the disaster recovery site based on management consultations.
So what actually is a disaster recovery site? In layman's terms, it is an alternate site (which can be within the same city or in another city) that is capable of running all the operations designed to run at the primary site.
The genesis of a disaster recovery site is the business continuity policy (BCP). This document lists all the IT systems that hold an organisation's data, their dependencies on other systems, and all the elements necessary to run each system (for example, a trading system will have servers, databases, third-party software, connectivity links, third-party applications to which it sends data for reporting, and human resources).
Once the list is ready, every IT system is graded on three parameters: confidentiality, integrity and availability (CIA). Together, these grades form a matrix which drives the need for a redundant site.
In this case, the CIA rating would have been the highest, since NSE systems are accessed by users across the globe (investors and mutual funds) and involve transactions of very high value and in large volumes.
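To make the grading concrete, here is a minimal sketch of how such a CIA matrix could be scored. The 1 to 3 grade for each parameter, the rule of taking the highest grade, and the threshold are all my assumptions for illustration; the names `CiaRating` and `needs_redundant_site` are hypothetical and do not come from any standard or from NSE.

```python
from dataclasses import dataclass


@dataclass
class CiaRating:
    """Hypothetical CIA grades for one IT system (1 = low, 3 = high)."""
    confidentiality: int
    integrity: int
    availability: int

    def __post_init__(self) -> None:
        # Reject grades outside the assumed 1..3 range.
        for grade in (self.confidentiality, self.integrity, self.availability):
            if grade not in (1, 2, 3):
                raise ValueError("each grade must be 1, 2 or 3")

    def overall(self) -> int:
        # Assumption: one critical parameter is enough to make
        # the whole system critical, so take the highest grade.
        return max(self.confidentiality, self.integrity, self.availability)


def needs_redundant_site(rating: CiaRating, threshold: int = 3) -> bool:
    """Return True when the overall grade reaches the assumed threshold."""
    return rating.overall() >= threshold


# Illustrative systems, not real assessments.
trading_system = CiaRating(confidentiality=3, integrity=3, availability=3)
office_intranet = CiaRating(confidentiality=1, integrity=2, availability=1)

print(needs_redundant_site(trading_system))
print(needs_redundant_site(office_intranet))
```

Under these assumptions a trading system graded at the top on all three parameters qualifies for a redundant site, while a low-graded internal system does not.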
Once the CIA is clear, plans are drawn up for the elements of each IT system that need redundancies (servers, databases, third-party software, connectivity links, human resources). Once this is clear, budgets are drawn up and approval is sought from the management.
Besides this, the business continuity plan defines the teams that decide who in the organisation will declare a disaster, and sets the recovery time objective (RTO) and the recovery point objective (RPO).
RTO is the time taken to invoke the disaster recovery site after a disaster has been announced. RPO is the point in time up to which data can be recovered, that is, how much recent data the organisation can afford to lose. Investment decisions vary with RTO and RPO: the lower the RTO and RPO, the higher the investment required.
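As a rough illustration of the two objectives, the sketch below checks a recovery against an RTO and an RPO. The 45-minute RTO is the figure quoted earlier for the DR site; the one-minute RPO and every timestamp are hypothetical values chosen for the example, not actual figures from the incident, and `meets_objectives` is a name I have made up.

```python
from datetime import datetime, timedelta


def meets_objectives(declared_at, dr_live_at, failed_at, last_replicated_at,
                     rto, rpo):
    """Return (rto_met, rpo_met) for one recovery.

    Recovery time is measured from the declaration of the disaster to the
    DR site going live. The data-loss window is the gap between the last
    data replicated to the DR site and the moment of failure.
    """
    recovery_time = dr_live_at - declared_at
    data_loss_window = failed_at - last_replicated_at
    return recovery_time <= rto, data_loss_window <= rpo


# Hypothetical timeline for illustration only.
failed_at = datetime(2024, 1, 1, 11, 40)
last_replicated_at = datetime(2024, 1, 1, 11, 38)
declared_at = datetime(2024, 1, 1, 11, 50)
dr_live_at = datetime(2024, 1, 1, 12, 30)

rto_met, rpo_met = meets_objectives(
    declared_at, dr_live_at, failed_at, last_replicated_at,
    rto=timedelta(minutes=45),  # figure quoted for the DR site
    rpo=timedelta(minutes=1),   # assumed value
)
print(rto_met, rpo_met)
```

In this made-up timeline the DR site comes up 40 minutes after the declaration, so the RTO is met, but two minutes of data were never replicated, so the assumed RPO is missed.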
In this case, NSE claims that it runs a quarterly drill and also operates twice a year from the DR site. It also says that it was a connectivity issue.
Unlike the good old days, India today has multiple connectivity service providers (Tata Communications, Airtel, Vodafone Idea and Sify, to name a few) who use each other’s last mile connectivity and provide services to the customer. The probability of all service-providers failing at the same time is low but not nil.
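The "low but not nil" point can be put into numbers with a back-of-the-envelope sketch. The failure probabilities below are invented for illustration, and the model is a simplifying assumption of mine: each provider fails independently, except for one shared last-mile segment whose failure takes every link down at once.

```python
from math import prod


def all_links_down(provider_failure_probs, shared_last_mile_prob=0.0):
    """Probability that every connectivity link is down at the same time.

    Assumes providers fail independently of one another, apart from a
    single shared last-mile segment that takes all links down if it fails.
    """
    independent_failure = prod(provider_failure_probs)
    return shared_last_mile_prob + (1 - shared_last_mile_prob) * independent_failure


# Four providers, each assumed to be down 1% of the time (made-up figure).
providers = [0.01, 0.01, 0.01, 0.01]

fully_independent = all_links_down(providers)
with_shared_last_mile = all_links_down(providers, shared_last_mile_prob=0.001)

print(f"{fully_independent:.2e}")
print(f"{with_shared_last_mile:.2e}")
```

With fully independent providers the chance of a simultaneous failure is vanishingly small. Once they share last-mile infrastructure, the combined risk is dominated by that shared segment, however many providers are contracted.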
The other part of the exercise is the DR drill itself. There has to be an investigation into which authority actually reviewed the drill report and what observations were made during the drill.
Finally, the NSE management team, which decided not to invoke the DR site, needs to be scrutinised, as the extent of losses to the investment community may be huge. Delivering a robust, resilient, secure and fault-tolerant system requires a lot of commitment from the people responsible for it.
This incident must also act as a caution to corporate India, across verticals, to review their respective BCP and DR plans. In a scenario where there are no uniform standards set, it is entirely up to companies to realise that their reputation is at stake in the event of a disaster.
Listed below are a few points that need to be rigorously followed at a defined frequency as per the business vertical. It must also be noted that this is an evolving process due to continuous changes in technology as well as the people in charge of it, both internal and external.
1. Interviewing key stakeholders and participants in the programme.
2. Reviewing business case, planning and IT-related documents.
3. Reviewing individual BCP and DR plans by ensuring that they are complete, accurate, and up-to-date.
4. Looking for defined recovery times and whether there is evidence that they can be met in a crisis.
5. Examining training materials, procedures, guidelines, and so forth, plus any management communications regarding BCP and DR situations that might occur and what employees should do.
6. Reviewing testing plans and the results of any tests already conducted.
7. Evaluating relevant employee preparedness and familiarity with procedures.
8. Reviewing the impact of new regulations on the plan.
9. Reviewing contractor and service-provider 'readiness' efforts.
Another source for business continuity planning is https://www.thebci.org/. It has both corporates and individuals as members, and it also conducts training programmes.
One can only hope that the recent incident will lead to a proper root-cause analysis and to the issues being fixed, while also holding out a model for others with DR sites to follow.
(The writer has worked in the IT industry for over 22 years having worked with data centres, IT applications, connectivity and products.)