Barry Alford, head of ICT at Paragon Community Housing, explains how the housing provider carried out a live simulation of a full ‘DR day’ to test its business continuity strategy and what it learned from the experience.
It was in one of our senior management team meetings (in December 2012) that I presented a DR options report. I didn’t really think that the management team would go along with one particular recommendation and so, when they apparently did, I had to double-check: “Just to be certain on this, you do realise that this means closing down the head office for the day, shipping some staff to our fall-back site, and other staff working from our other premises or at home?”
There were a few moments of silence as the full significance dawned on them, and we went on to reach a (qualified) agreement on how and when this could be carried out. In the event, we didn’t progress it until the idea was resurrected by our new finance director, Paul Rickard, who proved very supportive. ‘DR day’ was provisionally planned for November 2015, thus missing the worst of the winter months.
It was surprisingly difficult to pin down an exact date in November; we had to avoid events such as a board away-day, a staff conference and a group-wide ‘amalgamation’ (restructuring). A Friday was favourite for DR day, generally being a ‘lighter’ working day than others and with a weekend available for staff to catch up on work (this was for re-entering data and for systems to synchronise; more of that later). We selected 20 November, which later slipped to 27 November.
Main objectives and success criteria
Our main objectives can be summarised as:
To ensure that the DR environment initialised and was ready for use;
To validate all procedures and the recovery of all Paragon-protected servers, capturing any failing processes;
To document a repeatable set of procedural testing activities;
To be able to carry out all normal day-to-day business operations;
To re-synchronise from our production environment to our DR environment immediately after the end of the test.
With the help of our audit team, we drew up 19 success criteria, of which these were the most important:
Paragon has validated all recovery procedures, and where changes are identified, the appropriate actions are captured for follow up;
Hewlett Packard Enterprise (HPE) has recovered all Paragon-protected equipment, with internet access available;
The Case House (HQ) switchboard has successfully diverted to HPE;
The DR servers have been resynchronised post-test within 48 hours.
At this point, it’s probably worth briefly describing our historic DR arrangements. Until early 2014, our business continuity (BC) strategy had been based on the availability of a second head-office building five miles away. This was a convenient, ready-made mutual back-up situation, which we used with a server ‘ship to site’ contract. However, we lost the use of the second head-office site around the same time as our old DR contract came to an end so we decided to use this as an opportunity to improve our BC and DR capabilities.
We took out a contract for data-streaming and the provision of office accommodation with HPE at its datacentre in Reading. With the new arrangements, we were able to improve our RTO and RPO to hours and minutes respectively. The HPE datacentre would function as Paragon’s headquarters until our relocation back to our normal head office or other alternative accommodation.
We did our ‘on-boarding’ (an initial data load of 21 virtual machines, one physical server and 7TB of SAN data) in record time, using a data circuit that had been temporarily increased in size for a few weeks.
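For a sense of scale, a rough transfer-time calculation shows why a temporarily enlarged circuit matters for an initial load of this size. The sketch below is illustrative only: the 7TB figure comes from our on-boarding, but the circuit speeds and the 70% efficiency factor are assumptions, not the actual contract figures.

```python
# Back-of-envelope transfer-time estimate for an initial DR data load.
# The 7TB figure is from the on-boarding described above; the circuit
# speeds and efficiency factor are assumed purely for illustration.

def transfer_days(data_tb: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Days to move `data_tb` terabytes over a `link_mbps` circuit,
    allowing for protocol overhead via `efficiency`."""
    bits = data_tb * 1e12 * 8                       # terabytes -> bits
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 86_400                         # seconds -> days

for mbps in (100, 500, 1000):
    print(f"{mbps:>5} Mbit/s: {transfer_days(7, mbps):.1f} days")
```

At an assumed 100Mbit/s the load would take over a week; a short-term upgrade to a faster circuit brings it down to a day or two, which is why we only needed the larger circuit for a few weeks.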
During the early part of our contract we carried out increasingly comprehensive tests, extending them each time, and reached a situation where we wanted to replicate a real disaster situation.
Preparation and planning
There was considerable negotiation with HPE on the subject of risk; we eventually reached an agreement and started detailed planning with them. A tremendous amount of work and detailed thought was carried out by our networks manager, Dave Anthony, to whom great credit is due for this successful exercise. He covered questions and activities such as:
How close could we go in simulation or replication of a real-life disaster event?
Should we plan to resynchronise the data entered at Reading to our live servers after the day of the test? No, we decided that such entered data would be discarded.
Holding workshop sessions with staff to prepare them to work on the day, whether at Reading, at home or in a car parked on one of our estates. We covered the sequence of events and the documentation needed for the day, and got them familiar with Mimecast (our internet email fall-back system).
Do we organise a coach to take staff to Reading (which, of course, would not be available in the real-life situation)? We decided to do so because it would mean that a good number of staff would arrive together and give us a good start to the test.
Corporate BC and ICT DR plans needed to be updated, which included the checking of several emergency ‘battle boxes’ at some backup sites on our housing schemes.
Can we take payments? Payments are all taken over the internet, so there was no reason why not.
Could staff without their own equipment at home work effectively at public locations?
There would probably be a dip in KPIs.
What would our co-located contractors want to do?
Frequent planning meetings and communications were necessary during the weeks leading up to the actual test day. Meetings were held with our senior managers group to explain what we wanted to do and to get their buy-in, and presentations were given by Dave Anthony to staff at our chief executive’s briefing sessions, demonstrating commitment and support from the top.
PR and communications with stakeholders
We had an interesting discussion about how much to tell tenants and stakeholder organisations. A full-blown campaign would cause a lot of questions and unease, so we decided that it was probably better to take a low-key approach.
We advised staff not to say “We’re having a disaster recovery day!” as it would be an irrelevant and perhaps strange concept to tenants. Calls would be answered in the normal way and if it was necessary to add any further information, we’d just say something like “Hello, it’s Tom from Paragon and just to let you know, we’re operating from another office today” or “I might need to follow that up on Monday”.
Pre-DR day preparations
A copy of the live replica at Reading was taken at 6.30pm on Thursday (a ‘snapshot’), which would form our live data for the test day, although it would all be discarded at the end of the Friday.
We normally have two-factor authentication fronting our internet-facing Citrix system but we turned it off for the day to simplify log-in by remote workers.
We had to choose the optimum time to switch over our internet outward-facing addresses from the head-office to Reading; in good time so that our (new) network location was known to the outside world but not so early that work on the Thursday was seriously affected. We chose 7pm and by 8.40pm all of our HPE server environment was up and running.
We turned off two routers and isolated the head office from the world.
DR day arrives
Isolating the head office did have one unfortunate effect; it meant that we couldn’t check, by any means of communication, that our overnight backup had been successful. That meant an early trip up the A3 to check that the job had worked successfully (I got up by mistake at 4.30am and promptly went back to bed for a couple of hours).
The big red banner at the top of the backup job results email did not bode well! Fortunately, it only concerned one minor admin server; if it had related to any significant live server, we had already agreed that we would abort the test.
Dave Anthony had gone straight to Reading with an early start to check the integrity of the domain, servers and PCs and, apart from a few tweaks, it was all looking good.
We were go!
Staff started to trickle into our head office for the coach that we’d arranged to take them to Reading; others were making their own way and a few brave souls were on public transport. There was a nervous moment when the coach was a little late but then it was away…
Most of the DR ICT staff went directly to Reading to check the kit and help settle in our operational teams, while one member of our team stayed at the head office to support some training courses and run a special disc-to-tape back-up job. Willmott Dixon (our main maintenance contractor) staff are normally co-located with our repairs team, and several accompanied us to Reading.
The coach arrived at Reading as planned at 10.30am, and the first hour turned out to be rather frantic. Many phones started ringing at once when we pointed our main NGN (non-geographic number) to Reading and switched over from our overnight control centre. It took staff a while, of course, to get used to their new equipment and desktops, and it was quite a shock for some, with everybody working in an open-plan environment.
As the day progressed, our staff settled in and thoughts turned to lunch. A sandwich run was organised, which had a miraculously calming effect (with M&S doing well that day).
At the end of the day, we closed down the HPE environment, opened up the head office communications to the internet and started the resynchronisation of data from the head office to Reading, which was completed by 1.45pm on the Saturday. This was a period of risk, with no off-site replica of known integrity available until the resynchronisation was complete.
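The length of that risk window is easy to put a number on. The sketch below is illustrative: the Saturday 1.45pm completion time is from the account above, but the Friday evening shutdown time is an assumption, as is the mapping to the 27/28 November dates.

```python
from datetime import datetime

# Window with no verified off-site replica: from closing down the HPE
# environment on the Friday evening to completion of the
# resynchronisation on the Saturday afternoon.
shutdown = datetime(2015, 11, 27, 18, 0)    # assumed end of DR day (Friday)
resynced = datetime(2015, 11, 28, 13, 45)   # completion time from the account above

hours_at_risk = (resynced - shutdown).total_seconds() / 3600
print(f"No verified off-site replica for {hours_at_risk:.2f} hours")
```

On these assumptions the exposure is under 20 hours, comfortably within the 48-hour resynchronisation success criterion.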
In the event, only a few staff came in on the Saturday to rekey data from the test day and most preferred to wait until their return to the office on the Monday.
What worked well and not so well
Getting staff involved before the event was key to their willingness to make it work, and one of the delights for me was the way our staff came together and devised innovative ways of coping. Two examples were the use of WhatsApp to communicate within teams, and nominated individuals leading on retrieving data from our housing system to avoid multiple Citrix logins.
The only significant problem found during the day was that our live finance system would not come up, although, surprisingly, there was no problem at all with the test system (we think this was due to an oddity with the overnight internal backup on ‘live’). A couple of other systems did not come up straight away but were fixed within a few hours.
On the day, we met 17 of our 19 success criteria, one was partly met and we’re still completing the final one.
HPE staff again proved to be very helpful and technically excellent.
Would we do anything differently?
After DR day, we surveyed our staff on what worked well and what could be improved:
Most of them said that they were enthused by it and were able to do what they needed to do;
Those directly involved in the DR test felt very involved;
Others working remotely felt ‘out of the loop’ to some degree;
‘Tethered’ phones generally worked well;
Enhancements are needed to our ‘hubs’ (each almost a small district office), with more facilities.
It highlighted the heavy reliance of our staff on email, which may be well served by moving Microsoft Exchange to the cloud.
It was business as usual on the Monday after the test, with no ill effects and staff using our live systems quite normally.
On our ‘things to do better’ list:
We need clear (DR) leads for each department, with plans of the immediate requirements for the day;
Better organisation of lunch.
Without doubt, it was a worthwhile exercise to carry out and confirmed our confidence in our plans, the back-up environment and HPE.
The most important result was proving we could successfully transfer and use our systems on up-to-date data at our fall-back accommodation.
Would we carry it out again? Most certainly, yes. However, we would concentrate on different areas not covered in this first test, such as remote staff and the hubs, and test restoration of those systems not streamed to Reading.
Barry Alford is head of ICT at Paragon Community Housing.