Weekend maintenance kicks an Italian bank offline for days

👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover topics related to Big Tech and startups through the lens of engineering managers and senior engineers. In this article, we cover one out of four topics from today’s subscriber-only The Pulse issue. To get full issues twice a week, subscribe here.

Sella is a mid-sized bank in Italy with around $14B in assets under management. Since Sunday (7 April), most of its services have been inaccessible. The Sella banking app, the Sella Invest app, and internet banking are all down. Two trading apps, Sella XTrading and Sella Trader, continue to operate normally. So, it seems all the bank’s digital properties, except those for trading, have gone down hard.

“Something, something Oracle.” What caused this issue? From Sella’s status page:

“Following the installation of an update to the operating system and related firmware which led to an unstable situation. Our technicians have been collaborating with Oracle, a multinational and supplier of the Sella group's operating system for over twenty years, to restore full operations: several progresses have been made but at the time we are writing to you the situation has not yet returned to normal.

This type of intervention takes time, which is why the activities carried out so far have not completely solved the problem.”

It’s hard to get too much out of these vague reports, but here’s my attempt at decrypting what might have happened:

An Oracle version was updated, and/or database schema changes were made. There’s a much smaller possibility that both happened at the same time.
The changes messed up all major databases in some unexpected way.
Sella needs Oracle’s help to figure things out. This suggests the problem is probably some edge case in how they use Oracle or how they define schemas, or that a widely-used Oracle feature was changed, etc.

You might expect something catastrophic like this to happen if Sella was trying to “jump ahead” by updating across several versions of Oracle from a very old version. After all, if it was “just” a schema change, it should be straightforward enough to revert.

Still, I’m puzzled by how long the system has been down. If it was a risky update, just restore the backups! Every sensible tech company has a backup strategy to bring back its systems – and performing a backup before a major update is common practice. If you perform a risky update, this is a good reminder to start by doing a backup, or inserting a “rollback point” that you can revert to, should things go wrong.
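To make the “rollback point” idea concrete, here is a minimal sketch. It is not what Sella or Oracle did (we don’t know their setup). It uses SQLite from Python’s standard library as a stand-in database, and the function names, the `accounts` table and the health check are all hypothetical, chosen only for illustration.

```python
import sqlite3


def apply_with_rollback_point(conn, statements, is_healthy):
    """Apply statements after setting a savepoint; revert if anything fails.

    `conn` must be opened with isolation_level=None so that this function,
    not the sqlite3 module, controls transaction boundaries.
    """
    conn.execute("SAVEPOINT before_change")  # the "rollback point"
    try:
        for sql in statements:
            conn.execute(sql)
        if not is_healthy(conn):
            raise RuntimeError("post-change health check failed")
    except Exception:
        # Undo everything since the savepoint, then close it.
        conn.execute("ROLLBACK TO SAVEPOINT before_change")
        conn.execute("RELEASE SAVEPOINT before_change")
        return False
    conn.execute("RELEASE SAVEPOINT before_change")  # keep the change
    return True


def column_names(conn, table):
    # PRAGMA table_info returns one row per column; index 1 is the name.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO accounts (balance) VALUES (100)")

    ok = apply_with_rollback_point(
        conn,
        ["ALTER TABLE accounts ADD COLUMN currency TEXT",
         "UPDATE accounts SET balance = balance * 100"],
        is_healthy=lambda c: False,  # simulate a failed verification
    )
    print(ok, column_names(conn, "accounts"))
```

In this toy version, a failed check leaves both the schema and the data exactly as they were before the change. A real operating system or firmware update has no equivalent one-line undo, which is part of why a full backup before the update matters.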

If it was an update to Oracle, or to the operating system, then why not roll back the update? This is also why banks utilize a 2-day “blackout period” which creates space to revert failed migrations or updates. Of course, if it was that simple, then Sella would surely have already done so. It all suggests there was an unexpected edge case.

The problem might be caused by an external (but kind of internal) platform owned by the banking group. Fabrick is the platform that Sella operates on top of. It was created in 2018 by the Sella group, and supports open banking and open payment services. The platform is used by clients outside of the Sella group as well. What confirms the suspicion that this is a Fabrick-level issue is Fabrick’s status page, which says Fabrick’s tech staff are involved:

“We would like to inform you that the event is currently still ongoing, continuous monitoring is in place, and any slowdowns are affecting only the services provided jointly with the Sella group. There are still ongoing slowdowns in bank transfers, payments with debit and prepaid cards and in accessing and using online services, including Internet Banking, Smart Business Sella and apps. The resolution of the event has top priority for all of us. (…)

These events occurred after the installation of an update to the Sella group's IT systems, which led to an instability affecting Fabrick's recalled services.”

We can already see a problem: a web of dependencies. So far, Sella mentioned a problem with Oracle. The platform they use, Fabrick, says there’s an “instability” caused by the update to Sella's own systems. So who is at fault here?

Would this issue have occurred if Sella owned its own platform, instead of creating a platform that it spun off, where something now falls through the cracks? It might well have happened anyway, but at least it would have been easier to identify the issue.

A software engineer I talked with, who has visibility into the issue, shared that the engineering team is grappling with built-up tech debt, which is making it hard to identify and fix the issue. This engineer speculated that the common practice of subcontracting several layers within banks could also lead to an “it’s not my problem” situation. That is what we see, as the Fabrick platform seems to blame the “update to Sella group’s IT systems” for causing Fabrick instability. But Fabrick is, or should be, Sella group as well!

Are 2-day “blackout” windows good for an engineering culture? Banks regularly do maintenance during weekends; it’s when data migration projects are run, along with schema updates, rollouts of new systems, and other riskier changes.

In some ways, banks are lucky to have these “blackout periods” when money movements are frozen, and engineering teams have up to 48 hours to make changes, test them, and roll them back if needed. In contrast, most other industries like ecommerce, utilities, airlines, etc, have no such luxury and must invest in zero-downtime migrations. This means these companies need to get very good at detecting and resolving incidents in real time. Banks rarely do!

So, it’s fair to ask: is it a net good or a net bad that banks don’t need to plan for short-downtime or zero-downtime maintenance and migrations? I’d argue this approach unintentionally helps create a weaker engineering culture, and a worse work-life balance. Banks can simply have engineers work at weekends on risky migrations. They don’t need to make rollback plans, because they already have one: “if it goes wrong on Saturday, we have another 24 hours to fix it on Sunday.”

Good luck to the Sella and Fabrick teams in resolving this outage. Perhaps we’ll learn what caused this serious incident.

As related reading, here are past articles touching on the topic of migrations and tricky outages:

Why it took Roblox 73 hours to get their systems back online in 2021. An interesting deep dive, with details on why it took so long to find the root cause of the outage.
Migrations done well: a guide for executing migrations well, at both small and large scales.

Update on 11 April: clarified that Fabrick is, in fact, owned by Sella group.

This was one out of the four topics covered in this week’s The Pulse. The full edition additionally covers:

Industry pulse. The first mass layoffs at Apple since 1997 (or not?); amateurish URL rewrite at X (formerly Twitter); never-ending job interviews for engineering executives, and more.
The end of Hopin. It took Hopin just two years to become the fastest-ever growing European startup by valuation. Four years later, the company is no more. The final valuable parts of Hopin are being sold, and all staff are expected to be let go. Exclusive details on the StreamYard sale.
Adyen, the only major Fintech with zero mass layoffs? All major Fintech startups have let go of some staff over the past two years, except Adyen. Meanwhile, the business has quietly become one of Stripe’s biggest competitors. A close look at this curious phenomenon.

Read the full The Pulse here.

Subscribe to my weekly newsletter to get articles like this in your inbox. It's a pretty good read - and the #1 tech newsletter on Substack.
