
# Introduction

This long chapter intends to answer why all the technologies and techniques covered in this text are seldom used in an enterprise environment. There has been a substantial effort to present the most efficient techniques for implementing data management pipelines. It's humbling to find out that what we see in clients points in the opposite direction:

  1. The use of open-source DBMS tends to be marginal, and Oracle still dominates the market.
  2. Automation is managed by old-school enterprise management tools like Control-M.
  3. No-code solutions are preferred, and DevOps and code-first tools sound much like a startup-led fad.
  4. Tools that are good enough are often abused. If you can do X with SAP, you will do it.
  5. Business automation is still mostly developed in SQL, PL/SQL, and Java.
  6. There's a real resistance to change, and some core systems written in COBOL in the nineties are still alive and kicking.
  7. There are strong barriers between "critical" systems (transactional systems, core automation) and "non-critical" components (data warehouse, CRM).
  8. Things that work are not touched unless strictly necessary, and new features are discussed at length, planned and executed carefully, and thoroughly tested.
  9. Benign neglect is the norm. If a new reality forces a change, sometimes it's preferable to adapt reality with wrappers, translation layers, interfaces...
  10. Complexity coming from technical debt is seen as a trade-off between risks: complexity is a manageable risk, while reimplementation is a very hard-to-manage one.
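The wrapper approach mentioned in point 9 is essentially the classic adapter pattern: the legacy component keeps its interface untouched, and a thin translation layer bends the new reality to fit it. A minimal sketch, where every name (`LegacyBillingSystem`, `BillingAdapter`, the date format) is a hypothetical illustration, not a real system:

```python
from datetime import date


class LegacyBillingSystem:
    """Stands in for an untouchable core system: it only accepts
    dates as 'DD/MM/YYYY' strings, just as it did in the nineties."""

    def invoice(self, customer_id: str, day: str) -> str:
        # The legacy logic stays exactly as it is; nobody dares touch it.
        return f"INV-{customer_id}-{day.replace('/', '')}"


class BillingAdapter:
    """Translation layer: the rest of the company talks in modern
    types, and the adapter converts them for the legacy core."""

    def __init__(self, legacy: LegacyBillingSystem):
        self._legacy = legacy

    def invoice(self, customer_id: str, day: date) -> str:
        # Adapt the modern date object to the string the core expects.
        return self._legacy.invoice(customer_id, day.strftime("%d/%m/%Y"))


adapter = BillingAdapter(LegacyBillingSystem())
print(adapter.invoice("42", date(2024, 1, 31)))  # INV-42-31012024
```

Each wrapper like this is cheap and low-risk on its own; the fragility discussed below comes from accumulating dozens of them instead of ever touching the core.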

It's common to see software not as an asset but as a liability: something you have to do to run your company and create profits, something you only remember exists when it breaks or when a team of expensive consultants suggests a feature that can't be implemented. Companies that follow this line of thought end up treating software mostly as a source of risk, and it's very hard to put a price tag on a risk. You seldom see teams of risk experts estimating that, unless a major rewrite of a particular component takes place, the probability of a total meltdown will be around 50% five years down the road.

In addition, the effect of aging and unfit systems is often either too slow to notice or too fast to react to. Sometimes the current systems can keep operations alive with no major incidents, but implementing new features is too hard; competitors who don't face these limitations take market share one customer at a time. Or the aging automation simply melts down in the face of an event never seen before, like COVID, the emergence of generative AI, or a huge snowstorm caused by climate change, and a relevant portion of the customer base leaves and never comes back.

!!! example

    Southwest's scheduling meltdown in 2022 was so severe that it has its own [entry on Wikipedia](https://en.wikipedia.org/wiki/2022_Southwest_Airlines_scheduling_crisis). Outdated software running on outdated hardware caused all kinds of disasters, like losing track of where crews were and whether they were available. The consequence was more than fifteen thousand flights cancelled in ten days. Razor-thin operating margins were continuously blamed, but Southwest has historically paid [significant dividends](https://www.nasdaq.com/market-activity/stocks/luv/dividend-history) to shareholders, and its EBITDA was above $4B/year from 2015 to 2019. Southwest announced its intention to allocate $1.3B to upgrade its systems, which, considering that the investment will probably be spread across a decade, is not a particularly ambitious plan. Southwest had strong reasons to update its software and systems, but it was never a priority until it was too late.

Pragmatism dominates decision making in enterprise IT, and touching things as little as possible, in the simplest way, tends to be the norm. You've probably heard that when all you have is a hammer, everything looks like a nail; but many things work like a nail if you hit them with a hammer hard enough. There's sometimes an enormous distance between the ideal solution and something that Just Works™, and pragmatic engineers tend to choose the latter. The only things you need are a hammer and a sense of what can pass for a nail. But if you keep hitting things with a hammer as hard as you can, you open cracks that may lead to a major structural failure under fatigue.

This is why startups tend to disrupt markets more than well-established corporations: it's easier to put a price tag on features, predict the impact of disruptions, and simulate meltdown scenarios when you're starting with a blank sheet of paper.

You may love or hate the ideas of N. N. Taleb, but it's interesting to bring his concepts of fragility and antifragility into play. You create fragility by constantly making suboptimal but pragmatic choices, which leave scar tissue all across the company. With a cramped body you can't try new things; you just make it to the next day. Antifragile systems can have major components redesigned and rearchitected because each piece is robust enough to withstand the additional stress of constant change.

The following chapters describe the implementation of a digital twin of a retail corporation that operates a chain of grocery stores. The implementation is of course limited, but comprehensive enough to extract some key insights about why corporate IT looks the way it does. Anyone who intends to create antifragile data systems should first study how fragile systems come to be.