The Staging Phase of Deployment

Despite some humorous examples of deployments gone wrong, failures are not funny. William Brewer explains why staging is so important and how it can help avoid the types of disasters he recalls in this article.

Staging is a vital part of deploying any application, particularly a database, quickly, efficiently, and with minimum risk. Though vital, it only gets noticed by the wider public when things go horribly wrong.

On 2nd February 1988, Wells Fargo EquityLine customers noticed this message at the bottom of a statement:

“You owe your soul to the company store. Why not owe your home to Wells Fargo? An equity advantage loan can help you spend what would have been your children’s inheritance.”

A few days later, the company followed this with an apology:

“This message was not a legitimate one. It was developed as part of a test program by a staff member, whose sense of humor was somewhat misplaced, and it was inadvertently inserted in that day’s statement mailing. The message in no way conveys the opinion of Wells Fargo Bank or its employees.

James G. Jones, Executive Vice President, South Bay Service Center”

This mishap was an accident in Staging. It is so easy to do. It could have been worse. In early 1993, a small UK-based company was working with one of the largest UK telecom companies in launching a new ‘gold’ credit card for their wealthier customers. They needed a mailshot. The mailshot application was in Staging and, as not all of the addresses were ready, the programmer flippantly inserted the place-holder ‘Rich Bastard’ wherever there was a NULL name, before leaving the project. Sadly, the mail addresses that the test mailshot went to were from the live data. His successor unwittingly ran the test. The rest of the story is in IT History. The blameless developer running the test, nonetheless, left the profession and became a vet.

Staging rarely causes a problem for the business: far more frequently, it can save the costly repercussions of a failed deployment. Had the UK’s TSB (Trustee Savings Bank) engaged in conventional, properly conducted staging practices in 2018, they’d have saved their reputation and costs of 330 million pounds. Staging allows potential issues and concerns with a new software release to be checked, and the process provides the final decision as to whether a release can go ahead. The fact that this vital process can occasionally itself cause a problem to the business is ironic. Mistakes in Staging can be far-reaching. It requires dogged attention to detail and method.

Staging

It is often said that the Customers of an application will think it is finished when they see a Smoke and Mirrors demo. Developers think it is done when it works on their machine. Testers think it is complete when it passes their tests. It is only in Staging, when a release candidate is tested out in the production environment, that it can be said to be ready for release.

The team that conducts the Staging process within an organisation have a special responsibility. If a production application that is core to the organisation’s business is being tested, the senior management have a responsibility to ensure that the risk of changes to the business is minimised. They usually devolve this responsibility, via the CIO, to the technical team in charge of Staging, and it is highly unusual for management to ignore the recommendations of that team. If the two don’t communicate effectively, things go wrong. In the spectacular case of TSB, the release was allowed to proceed despite there being 2,000 defects relating to testing at the time the system went live.

Whether you are developing applications or upgrading customised bought-in packages, these must be checked out in Staging. The Staging function has a uniquely broad perspective on a software release, pulling together the many issues of compliance, maintenance, security, usability, resilience and reliability. Staging is designed to resemble a production environment as closely as possible, and it may need to connect to other production services and data feeds. The task involves checking code, builds, and updates to ensure quality in a production-like environment before the application is deployed. Staging needs to be able to share the same configurations of hardware, servers, databases, and caches as the production system.
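One way to keep an eye on whether Staging really does share production's configuration is to compare the two sets of settings mechanically. What follows is only a minimal sketch in Python: the function name config_drift, the setting names and the values are all invented for illustration, and a real check would read its settings from whatever configuration management tool the organisation already uses.

```python
from typing import Any


def config_drift(production: dict[str, Any], staging: dict[str, Any],
                 allowed: frozenset[str] = frozenset()) -> dict[str, tuple[Any, Any]]:
    """Return the settings whose values differ between the two environments.

    Keys listed in `allowed` (hostnames or credentials that are expected to
    differ) are ignored. A key missing on one side is reported with the
    placeholder string "<missing>".
    """
    missing = "<missing>"
    drift = {}
    for key in sorted(set(production) | set(staging)):
        if key in allowed:
            continue
        prod_value = production.get(key, missing)
        stage_value = staging.get(key, missing)
        if prod_value != stage_value:
            drift[key] = (prod_value, stage_value)
    return drift


# Hypothetical settings, invented purely for illustration.
production = {"db_version": "15.4", "cache_size_mb": 2048, "host": "db-prod"}
staging = {"db_version": "15.2", "cache_size_mb": 2048, "host": "db-stage"}

report = config_drift(production, staging, allowed=frozenset({"host"}))
for key, (prod_value, stage_value) in report.items():
    print(f"{key}: production={prod_value!r} staging={stage_value!r}")
```

An empty report does not prove that the environments match in every respect, only that the settings being tracked agree.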

The primary role of a staging environment is to check out all the installation, configuration and migration scripts and procedures before they’re applied to a production environment. This ensures that all major and minor upgrades to a production environment are completed reliably, without errors, and in a minimum of time. It is only in staging that some of the most crucial tests can be done. For example, servers will be run on remote machines, rather than locally (as on a developer’s workstation during dev, or on a single test machine during test), which tests the effects of networking on the system.
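The idea of rehearsing a migration script before it touches production can be sketched in a few lines. This example is an illustration under stated assumptions rather than a recommended tool: it uses Python's built-in sqlite3 module and an in-memory database as a stand-in for a real staging database, and the function name rehearse_migration, the table and the scripts are hypothetical.

```python
import sqlite3
import time


def rehearse_migration(setup_sql: str, migration_sql: str,
                       check_sql: str) -> tuple[bool, float]:
    """Apply a migration to a throwaway database and report (ok, seconds)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)  # build the 'before' state
        started = time.perf_counter()
        try:
            conn.executescript(migration_sql)   # the script under test
            conn.execute(check_sql).fetchall()  # post-migration sanity query
            ok = True
        except sqlite3.Error:
            ok = False
        return ok, time.perf_counter() - started
    finally:
        conn.close()


# Hypothetical schema and migration, invented for illustration.
setup = """
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO customer (name) VALUES ('Ada');
"""
migration = "ALTER TABLE customer ADD COLUMN email TEXT;"
check = "SELECT id, name, email FROM customer"

ok, seconds = rehearse_migration(setup, migration, check)
print("migration ok" if ok else "migration failed", f"in {seconds:.4f}s")
```

The elapsed time is only meaningful when the rehearsal runs against data of production size on production-like hardware, which is exactly what a staging environment is there to provide.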

Can Staging be Avoided?

I’ve been challenged in the past by accountants on the cost of maintaining a staging environment. The simplest answer is that it is a free by-product of the essential task of proving your disaster-recovery strategy. Even if no significant developments of database applications are being undertaken in an organisation, you still need a staging environment to prove that, in the case of a disaster, you can recover services quickly and effectively, and that an action such as a change in network or storage has no unforeseen repercussions for the systems that you, in operations, have to support in production. You can only prove that you can re-create the entire production environment, its systems and data in a timely manner by actually doing it and repeating it. It makes sense to use this duplicate environment for the final checks for any releases that need to be hosted by the business. Not all the peripheral systems need to be recreated in their entirety if it is possible to ‘mock’ them with a system with exactly the same interface that behaves in exactly the same way. It isn’t ideal, though: the more reality that you can provide in staging, the better.
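To make the idea of a ‘mock’ with the same interface concrete, here is a small Python sketch. Everything in it is hypothetical: the PaymentGateway interface, the MockPaymentGateway class and the fixed authorisation rule are assumptions made for illustration, and a real mock would need to reproduce the behaviour of the peripheral system far more faithfully.

```python
from typing import Protocol


class PaymentGateway(Protocol):
    """The interface that the application depends on (hypothetical)."""

    def authorise(self, account_id: str, amount_pence: int) -> bool: ...


class MockPaymentGateway:
    """Stand-in for the real service: same method, simple fixed rule."""

    def __init__(self, limit_pence: int) -> None:
        self.limit_pence = limit_pence
        self.calls: list[tuple[str, int]] = []  # recorded for later inspection

    def authorise(self, account_id: str, amount_pence: int) -> bool:
        self.calls.append((account_id, amount_pence))
        return 0 < amount_pence <= self.limit_pence


def take_payment(gateway: PaymentGateway, account_id: str, amount_pence: int) -> str:
    """Application code that neither knows nor cares which gateway it is given."""
    return "accepted" if gateway.authorise(account_id, amount_pence) else "declined"


gateway = MockPaymentGateway(limit_pence=10_000)
print(take_payment(gateway, "ACC-1", 5_000))
print(take_payment(gateway, "ACC-1", 50_000))
print(gateway.calls)
```

Because take_payment depends only on the interface, the real gateway can be substituted in Staging without changing the application code.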

Staging and Security

Before the Cloud blurred the lines, it was the custom in IT that Staging was done entirely by the operational team in the production setting, which meant a separate Ops office or data-centre. This meant that security for Staging was identical to Production. In a retail bank, for example, where I once worked as a database developer, the actual client data would be used. As my introductory stories illustrated, this could lead to highly embarrassing mistakes. However, security was excellent: to get into the data centre where Staging was done, you needed a key fob, and your movements were logged. There was close supervision, video surveillance, and nobody got a key fob without individual security vetting. It was ops territory, though I was able to call in to check a release because I’d been security-checked. The vetting was a rigorous process, conducted over several weeks by a private investigator, an avuncular ex-cop in my case with the eye of a raptor. I explain this to emphasise the point that if the organisation has the security and organisational disciplines, you can use customer data within a production environment for Staging. Without stringent disciplines and supervision, it simply isn’t legally possible. The GDPR makes the responsible curation of personal data into a legal requirement. It doesn’t specify precisely how you do it.

It isn’t likely that you’d want a straightforward copy of the user data, though. Leaving to one side the responsibility that any organisation has to the owners of restricted, personal or sensitive data, the staging data really mustn’t ever contain such data as client email addresses, phone numbers, or anything else that is used for messaging, alerts, or push notifications. Any data, such as XML documents of patients’ case histories for example, or anything else that is peripheral in any way to the objectives of Staging, ought to be pseudonymized. The Staging environment is as close as possible to production so that the final checks to the system and the database update scripts for the release are done realistically, but without being on production. As Staging is under the same regime as production, there isn’t a risk to data above that of the production system.
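One common approach to pseudonymizing contact details, though certainly not the only one, is to replace each value with a keyed hash so that the same input always maps to the same token, which keeps joins and uniqueness constraints working. The Python sketch below uses the standard hmac and hashlib modules. The function names and the key are invented for illustration, and the .invalid domain is used because it is reserved and can never receive mail. Whether this is sufficient for a particular regulatory regime is a question for the compliance team, not for a code sample.

```python
import hashlib
import hmac


def pseudonymise(value: str, secret_key: bytes, prefix: str = "user") -> str:
    """Map a value to a stable token that cannot be reversed without the key."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:16]}"


def pseudonymise_email(email: str, secret_key: bytes) -> str:
    """Replace an email address with an undeliverable but consistent one."""
    token = pseudonymise(email.strip().lower(), secret_key)
    return f"{token}@example.invalid"  # reserved domain: mail cannot be delivered


# Hypothetical key; in practice it would be held by ops, not stored in code.
key = b"hypothetical-secret"
print(pseudonymise_email("Alice@Example.com", key))
```

Because the output address can never be delivered, a test mailshot run against this data cannot reach a real customer.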

What is In-scope for Staging?

It is difficult to make any hard-and-fast rules about this because so much depends on the size of the organisation, the scale of the application, and the ongoing methodology within IT. Staging is often used to preview new features to a select group of customers or to check the integrations with live versions of external dependencies. Sometimes Staging is suggested as the best place for testing the application under a high load, or for ‘limit’ testing, where the resilience of the production system is tested out. However, there are certain unavoidable restrictions. Once a release enters Staging, there can only be two outcomes: either the release is rejected or it goes into production. Staging cannot therefore easily be used for testing for integrity, performance or scalability for that particular release because that requires interaction. It can, of course, feed the information back for the next release, but generally such testing is much better done interactively in development, using standard data sets, before release, so that any issues can be fixed without halting the staging process. Sometimes there is no alternative. If the tests done in Staging show up a problem that is due to the pre-production environment, such as data feeds or edge cases within the production data, then the release has to go back to development. Nowadays, where releases are often continuous, Staging can be far more focused on the likely problems of going to production since there are more protections against a serious consequence, by using safety practices such as blue/green, feature toggles, canary or rolling releases.
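As an illustration of how a feature toggle can be combined with a canary-style release, the following Python sketch assigns each user to a stable bucket and switches the feature on for a chosen percentage of them. The function name in_canary, the feature name and the user identifiers are hypothetical, and production-grade toggle systems usually add such things as central configuration and audit logging.

```python
import hashlib


def in_canary(user_id: str, feature: str, percent: int) -> bool:
    """Return True if this user falls inside the rollout percentage.

    The bucket is derived from a hash of the feature name and the user id,
    so a given user gets the same answer on every call, and raising the
    percentage only ever adds users to the canary group.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable value 0..99
    return bucket < percent


# Hypothetical feature and users, invented for illustration.
users = [f"user-{n}" for n in range(1000)]
enabled = [u for u in users if in_canary(u, "new-statement-layout", 10)]
print(f"{len(enabled)} of {len(users)} users would see the new feature")
```

Setting the percentage to zero switches the feature off for everyone without a rollback, which is the protection that the paragraph above refers to.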

How to Make it Easier for a Deployment in Staging

The most obvious ways of making Staging easy, even in the absence of DevOps, are to involve operations as early as possible in the design and evolution of a development, to tackle configuration and security issues as early as possible, and for the development team to develop techniques for automating as much as possible of the deployment of the application. The ops teams I’ve worked with like the clarity that comes with documentation. Some people find documentation irksome, but documents are the essential lubricant of any human endeavour because they remove elements of doubt, confusion and misunderstanding. As far as development goes, the documentation needs to cover at least the interfaces, dependencies and configuration data. I always include a training manual and a clear statement of the business requirements of the application. Because the Staging team have to judge the ‘maintainability’ of the system, they will want to see the instructions that you provide for the first-line responders to a problem for the production system.

Staging Issues

Several different factors have conspired to make the tasks of the Staging team more complex. These include the demands for continuous integration and rapid deployment. This has made the requirement for automation of at least the configuration management of the Staging environment more immediate. The shift to Cloud-native applications has changed the task of Staging, especially if the architecture is a hybrid one, with components that are hosted, and the presence of legacy systems that contribute data or take data.

As always, there are developments in technology such as the use of containers, perhaps in conjunction with a microservices architecture, that can provide additional challenges. However, these changes generally conspire to force organisations to maintain a permanent staging environment for major applications in active development. If there is a considerable cloud component, this can amplify the charges for cloud services unless ways can be found to do rapid build and tear-down.

Against this rise in the rate of releases and the complexity of the infrastructure, there is the fact that a regularly performed process will usually tend to become more efficient, as more and more opportunities are found for automation and cooperative teamwork.

It is in staging that DevOps processes, and a consistent infrastructure, can help. However, it is crucial to make sure that the specialists who have to check security and compliance have plenty of warning of a release candidate in Staging and are able to do their signoffs quickly. They also may need extra resources and documentation provided for them, so it is a good idea to get these requirements clarified in good time.

An acid test for a manager is to see how long a new team member takes to come up to speed. The faster this happens, the more likely it is that the documentation, learning materials, monitoring tools, and teamwork are all appropriate. There must be as few mysteries and irreplaceable team members as possible.

Conclusion

Despite the changes in technology and development methodology over the years, the Staging phase of release and deployment is still recognisable because the overall objectives haven’t changed. The most significant change is from big, infrequent releases to small, continuous releases. The problems of compliance have grown, and the many issues of security have ballooned, but the opportunities for automation of configuration management, teamwork and application checks have become much greater. Because releases have a much narrower scope, the checks can be reduced to a manageable size by assessing the risks presented by each release and targeting the checks to the areas of greatest risk. The consequences of a failed release can be minimised by providing techniques such as feature-switching for avoiding rollbacks. However, seasoned ops people will remember with a shudder when a small software update caused a company to lose $400 million in assets in 45 minutes. Staging is important and should never be taken for granted.