Deploy documentation too
parent
ba4bcfc8d9
commit
237cf81fe6
54
.github/workflows/docs.yml
vendored
Normal file
@@ -0,0 +1,54 @@
name: Build documentation

on:
  push:
    branches:
      - "main"

env:
  PYTHON_VERSION: 3.9

permissions:
  contents: read
  pages: write
  id-token: write

concurrency:
  group: "pages"
  cancel-in-progress: false

jobs:
  build-docs:
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}

    steps:
      - id: checkout
        uses: actions/checkout@v3
      - id: setup-python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ env.PYTHON_VERSION }}

      - id: install-deps-and-build
        name: Install dependencies and build the documentation
        run: |
          pip install -r requirements.txt
          mkdocs build --site-dir build

      - name: Setup Pages
        id: configure-pages
        uses: actions/configure-pages@v3

      - name: Upload artifact
        id: upload-artifact
        uses: actions/upload-pages-artifact@v1
        with:
          # Upload the site generated by mkdocs
          path: './build'

      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v2
19
docs/automation.md
Normal file
@@ -0,0 +1,19 @@
# How automation is implemented.

Automation is a key topic in enterprise data systems. When certain conditions are met, you want things to happen. There are automations to remove checked-out items from the stock, to send a message to the stockers' terminals when an item changes price, to make a stock request to a warehouse when a store is about to run out of a particular item, to apply discounts at check-out after changing the price tag on the shelves...

Automation is at the heart of enterprise systems. Any issue related to a critical automation will impact the business, and there may be hundreds or thousands of those automations. This is the reason why corporations and governments spend large sums of money to build platforms that are as robust as possible. This is why some migrations to a cloud platform take years to complete with costs that sometimes triple the initial budget.

!!! example

    There are tons of examples of unsuccessful transformations caused by the enormous complexity of business automation. Birmingham City Council wanted to [migrate](https://www.datacenterdynamics.com/en/news/uks-birmingham-city-to-spend-465m-fixing-oracle-cloud-issue/) from on-premises systems to the cloud. The budget went from £19M at the beginning of the project to £38M two years later, and £100M four years after the start. At the time of writing, the systems were not fully functional, and there was no estimate for the final delivery date. The officials responded that these kinds of delays were *not unusual* for migrations of such magnitude, and they're absolutely right.

For each automation you should decide the execution time, with many possible choices:

* Immediately when the condition occurs
* Periodically, each hour or every day at midnight
* On demand after a required authorization
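
The first choice, reacting immediately when the condition occurs, is what database triggers are for. The following is a minimal sketch with Python's built-in `sqlite3` module, not the digital twin's implementation: the `stock` and `restock_requests` tables, the threshold of 5 units and the order size of 20 are all assumptions made up for illustration.

```python
import sqlite3

# Hypothetical schema for illustration only; not the digital twin's tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stock (item INTEGER PRIMARY KEY, quantity INTEGER NOT NULL);
    CREATE TABLE restock_requests (item INTEGER NOT NULL, requested INTEGER NOT NULL);

    -- Fires right after an update that takes the stock below 5 units.
    CREATE TRIGGER request_restock
    AFTER UPDATE OF quantity ON stock
    WHEN NEW.quantity < 5 AND OLD.quantity >= 5
    BEGIN
        INSERT INTO restock_requests (item, requested) VALUES (NEW.item, 20);
    END;

    INSERT INTO stock (item, quantity) VALUES (1, 6);
    """
)


def check_out(connection, item, units=1):
    """Remove units from the stock; the trigger reacts on its own."""
    with connection:  # commits on success, rolls back on error
        connection.execute(
            "UPDATE stock SET quantity = quantity - ? WHERE item = ?", (units, item)
        )


check_out(conn, 1)  # 6 -> 5, threshold not crossed
check_out(conn, 1)  # 5 -> 4, the trigger inserts a restock request
check_out(conn, 1)  # 4 -> 3, already below the threshold, no new request
requests = conn.execute("SELECT item, requested FROM restock_requests").fetchall()
print(requests)
```

Only the second check-out crosses the threshold, so a single restock request is recorded. A periodic or on-demand automation would run the same `INSERT` from a scheduled job or after an approval step instead of from a trigger.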

# Triggers to act on a given condition

Let's open the `stock` terminal and run a little experiment.
28
docs/buyvsbuild.md
Normal file
@@ -0,0 +1,28 @@
# Buy vs Build

## Build vs buy an off-the-shelf solution.

Building enterprise data systems is hard and expensive. This is why many companies decide to purchase an ERP like SAP, which is able to handle financials, purchasing, stock, customer relationships, reporting... This seems to be the safest choice by far:

* Popular applications tend to be less buggy, more secure and feature-rich.
* Third party applications by popular vendors tend to add synergies, like easier integrations and a wider range of vendor choices for support and maintenance.
* The total cost of ownership tends to be lower for the buyer, even considering that companies like Oracle and Salesforce are immensely profitable.

!!! info

    Some software vendors are so profitable that many wonder if they're charging excessive margins for their products. Larry Ellison, co-founder and former CEO of Oracle, owns almost all of [the sixth largest island in Hawaii](https://en.wikipedia.org/wiki/Lanai), and the 2022 Formula One Championship season winning team was "Oracle Red Bull Racing". The partnership between Oracle and Red Bull is somewhat funny, since many are convinced that companies selling soda make huge margins on a bad product mostly thanks to marketing.

But appearances can be deceiving.

* If the problem is complex, the solution will be complex as well. It may take multiple years of effort and dozens of consultants to fully deploy one of these products.
* Proprietary software is almost always bundled with some additional vendor lock-in, like supporting only a small subset of proprietary storage systems.
* Buying software implies leaving knowledge in the hands of the vendor. If there's a critical issue with one of these components and the vendor is not able to provide support, or it goes bankrupt and disappears, the engineers within the corporation won't be able to fix the issue no matter how smart they are.

The most common scenario is to run a mix of applications: mission-critical operations are run by custom, in-house-developed applications maintained by the IT department, while other less critical operations like Marketing use third party tools. I've encountered many corporations that run most of their operations on custom-built software, but decided to buy a proprietary CRM like Salesforce to support their Marketing and Sales departments. Most projects executed by IT departments are related to data integrations between two existing tools, or between an existing tool and some new tool the leadership decided to purchase.

## Bespoke software and the false buy vs build dichotomy.

There's a third option, which is somewhere in between buy and build: hiring a third party with a solid track record to develop a custom application, with the goal of getting the best of both worlds. Bespoke software by a solid vendor may provide:

1. Lower risk, since the vendor has built similar applications for previous clients.
2. A fully customized application whose knowledge stays in-house, since the development can be closely followed by the staff engineers.
211
docs/data.md
Normal file
@@ -0,0 +1,211 @@
# How data is stored

## How data is stored physically

Enterprise data systems are heavily distributed because large corporations are distributed in nature: a chain of grocery stores consists of a set of locations that are relatively independent from each other. Distributed systems, when designed properly, are very robust. If every location can operate independently, items can be sold even under a major system failure, like losing the connection between a site and the central database that stores the available stock for each item. But at the same time, disconnecting each site from a central information repository is a challenging data engineering problem.

!!! info

    A key property for data stored in an enterprise context is consistency, which is very difficult to guarantee when data is spread across multiple nodes in a network that can be faulty at times.

The CAP theorem, sometimes called Brewer's theorem, is a great tool to provide some theoretical insight about distributed systems design. The CAP acronym stands for Consistency, Availability and Partition tolerance. The theorem proves that no distributed storage system can offer these three guarantees **at the same time**:

1. Consistency: all nodes in the network see the exact same data at the same time.
2. Availability: all nodes can fetch any piece of information whenever they need it.
3. Partition tolerance: you can lose connection to any node without affecting the integrity of the system.

Relational databases like PostgreSQL need to be consistent and available all the time. This is why a standard PostgreSQL deployment doesn't spread (shard) data across multiple servers in a network. If there are multiple nodes involved, they are just secondary copies to speed up queries or increase availability.

If data is spread across multiple nodes, and consistency can't be traded off, availability is the only variable engineers can play with. Enterprise data systems run a set of *batch* processes that synchronize data across the network when business operations are offline. While these batch processes are running, data may be in a temporarily inconsistent state, and the database cannot guarantee that adding new data records is a [transaction](https://en.wikipedia.org/wiki/Database_transaction). The safest thing to do in that case is to block parts of the database, sacrificing availability.

This is why the terms *transactional* and *batch* are used so often in enterprise data systems. When a database records information from some actual event as it happens, like someone checking out a can of soup at a counter, that piece of information is usually called a *transaction*, and the database recording the event a *transactional system* because its primary role is to record events. The word *transactional* is also used to denote that integrity is a key requirement: we don't want a hiccup in the database to mistakenly check out the can of soup twice because the first transaction was temporarily lost for any reason. Any activity that may disrupt its operations has to be executed while the system is (at least partially) offline.
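
The can-of-soup requirement can be illustrated with a small sketch. It uses Python's built-in `sqlite3` and two hypothetical tables, `stock` and `sales`, with a made-up `receipt` identifier. The point is only to show a sale and its stock update being recorded as a single transaction, so that a retried or invalid check-out leaves no partial trace.

```python
import sqlite3

# Hypothetical tables for illustration; the digital twin uses a richer schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stock (
        item TEXT PRIMARY KEY,
        quantity INTEGER NOT NULL CHECK (quantity >= 0)
    );
    CREATE TABLE sales (receipt TEXT PRIMARY KEY, item TEXT NOT NULL);
    INSERT INTO stock VALUES ('soup', 1);
    """
)


def check_out(connection, receipt, item):
    """Record a sale and decrement the stock as one atomic unit."""
    try:
        with connection:  # commit if both statements succeed, roll back otherwise
            connection.execute("INSERT INTO sales VALUES (?, ?)", (receipt, item))
            connection.execute(
                "UPDATE stock SET quantity = quantity - 1 WHERE item = ?", (item,)
            )
        return True
    except sqlite3.IntegrityError:
        # Duplicate receipt or negative stock: nothing was written.
        return False


first = check_out(conn, "r-001", "soup")
retry = check_out(conn, "r-001", "soup")  # same receipt sent again after a hiccup
sold_out = check_out(conn, "r-002", "soup")  # the only can is already gone
remaining = conn.execute("SELECT quantity FROM stock").fetchone()[0]
sales = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(first, retry, sold_out, remaining, sales)
```

The retried receipt is rejected by the primary key, and the sale of a can that is no longer in stock is rolled back together with its `sales` record, so the can of soup is checked out exactly once.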

## How data is modelled

Data models are frequently [normalized](https://en.wikipedia.org/wiki/Database_normalization) to minimize ambiguity. There are probably three different kinds of 500g bags of white bread, but each one will have a different product id, a different [Universal Product Code](https://en.wikipedia.org/wiki/Universal_Product_Code), a different supplier... As mentioned in the section about relational databases, relations are as important as data: each item can be related to the corresponding supplier, an order, and a particular batch. Stores and warehouses have to be modelled as well to always keep stock at optimal levels. Discounts have to be modelled too, and they're particularly tricky because at some point the discount logic has to be applied at check-out.

Data models should be able to track every single item with the least possible ambiguity. If there's an issue with a product batch we should be able to locate every single item with precision, remove it from the shelves, and know exactly how many of those items were purchased. Any source of ambiguity requires manual intervention. For instance, it's possible that a store receives multiple batches with different expiry dates at the same time. In that case stockers have to make sure that the oldest batch is more visible on the shelves, put the newest batch at the bottom of the stack, and record when all items of each batch are sold or returned.

The digital twin has fewer than [twenty data models](https://github.com/bcgx-gp-atlasCommunity/data-engineering-fundamentals/blob/main/src/dengfun/retail/models.py), which is an extreme simplification of an actual retailer. This is the entity-relationship diagram, which may get outdated.

![relational_diagram.png](img/relational_diagram.png)

!!! tip

    If you have to handle databases with dozens of normalized data models you should use a proper database management tool like [DBeaver](https://dbeaver.io/) or [DataGrip](https://www.jetbrains.com/datagrip/).

    The entity-relationship diagram above was generated by DBeaver.

Note how each model is related to a physical entity like an item, a batch, a cart, a location... Each table has a primary key that identifies a single record and is used to build relationships. For instance, an item in a cart ready to check out is related to a cart and a location.

## Users, authorizations, and permissions

Probably the hardest feature to implement in enterprise contexts is access control, in other words, making sure that anyone who's supposed to see, add, modify, and delete some information can do it in practice, while blocking everyone else. Some changes may require managerial approval, like changing the available stock of a particular item in a warehouse. In that case the system should issue the approval request to the correct person.

There are two main strategies to achieve this:

1. Access Control Lists (ACL), a set of rules in a database that implement the logic of who's allowed to see, add, modify, and delete what. In this strategy *multiple users have access to the same application* with a different *role*. Each object has a set of rules attached that are applied for every single operation. This is common in financial institutions where all customer-facing employees have access to the same terminal but each operation on each object requires different levels of authority. For instance, any employee may be able to see the balance of a customer, but not to approve the conditions of a mortgage. ACL tends to be so hard to implement that corporations seldom build their own solution, and buy one from a popular vendor.

2. Instead of attaching rules to each data object, one can create a *different application for each role*. This is common in businesses where each worker works in a different location. Cashiers in a grocery store have access only to points of sale, stockers will have a handheld device with stocking information, managers will have access to a web application with privileges to return items, modify stocking information... In this case access control is implemented at a system level. Each one of those applications will have its own authentication profile, and will be authorized to access a subset of APIs and other data resources.
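
As a rough illustration of the first strategy, the sketch below attaches a set of rules to each kind of object and checks every operation against the role of the user. The roles, objects and operations are hypothetical, and a real ACL system would persist the rules in a database and handle far more cases.

```python
from dataclasses import dataclass, field

# Hypothetical roles, objects and operations; a sketch of strategy 1 only.


@dataclass
class AccessControlList:
    """Rules attached to one kind of object: operation -> roles allowed."""

    rules: dict = field(default_factory=dict)

    def allow(self, operation, *roles):
        self.rules.setdefault(operation, set()).update(roles)

    def is_allowed(self, role, operation):
        # Anything that is not explicitly allowed is denied.
        return role in self.rules.get(operation, set())


balance_acl = AccessControlList()
balance_acl.allow("read", "teller", "manager")

mortgage_acl = AccessControlList()
mortgage_acl.allow("read", "teller", "manager")
mortgage_acl.allow("approve", "manager")

teller_reads_balance = balance_acl.is_allowed("teller", "read")
teller_approves_mortgage = mortgage_acl.is_allowed("teller", "approve")
manager_approves_mortgage = mortgage_acl.is_allowed("manager", "approve")
print(teller_reads_balance, teller_approves_mortgage, manager_approves_mortgage)
```

Denying by default is a deliberate choice in this sketch: a missing rule blocks the operation instead of silently allowing it.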

None of these strategies is infallible or universal. The final implementation will probably be a mixture of the two. Workers in stores and warehouses may have role-specific terminals, while members of the HR department may have access to an ERP (Enterprise Resource Planning) system that implements ACL underneath. Other common data operations like transfers, migrations, backups and audit logs may add more complexity to the final design. IT departments typically have full access to all data stored within the company, or parts of it, and while they can jump in to fix issues, they can break stuff too.

!!! example

    Customer communications like emails are data too, and may be required by the regulator to investigate any suspicious behaviour. [JPMorgan had to pay a $4M fine](https://www.reuters.com/legal/jpmorgan-chase-is-fined-by-sec-over-mistaken-deletion-emails-2023-06-22/) to the regulator after a vendor deleted 47M emails from their retail banking branch. The vendor was trying to clean up very old communications from the 1970s and 1980s, which are no longer required by the regulator, but ended up deleting emails from 2018 instead. In enterprise data systems some data models and resources have an *audit lock* property that prevents deletion at a system level.

The digital twin doesn't implement ACL, and each role will have a separate terminal instead.

!!! warning

    Data access issues may be considered security threats if they allow a user to escalate privileges. It's common for large corporations and software vendors to deploy red teams that try to find these kinds of vulnerabilities. There are also bounty programs intended to motivate independent security researchers to report these issues instead of selling them on a "black security market".

## Governance: metadata management.

Data Governance is the discipline of providing the definitions, policies, procedures, and tools to manage data *as a resource* in the most efficient way possible. It's a very wide topic that involves managing both data and people. The goal is to create sustainable data environments that thrive as an integral part of the organization.

!!! danger

    Beware of experts on data governance. There are orders of magnitude more experts on data governance on LinkedIn than successful implementations. I've witnessed talks about best practices on data governance from "experts" who were unsuccessful at implementing them at their own company. Of all experts, the most dangerous are the ones selling you the X or Y application that will get Data Governance sorted out for you.

The topic of data governance will be split into multiple sections across this case study. The reason is that Data Governance has to be implemented across the entire lifetime of the data, from its initial design to the dashboard that shows a handful of dials to the CFO. There's no silver-bullet technology that just implements Data Governance. It's the other way round: governance defines and enforces:

1. How schemas are defined and documented, including standard patterns to name columns.
2. Who owns what, and which are the protocols to access, modify and delete each piece of data.
3. How data transformations are instrumented, executed, maintained and monitored.
4. In case something goes wrong, who can sort out what is going on, and who can fix it.

Let's talk a bit about point 1, which is related to metadata. Metadata is all the additional information that is relevant to understand the complete context of a piece of information. If a table has a field with the name `volume_unpacked` there should be one and only one definition of volume across all databases, and a single definition of what an unpacked item is. The same database that stores the data can store this additional information too, for instance that the field *volume* in the *item* entity has units of liters. This is how the model `Item` is defined as a SQLAlchemy model:

```python
class Item(Base):
    __tablename__ = "items"
    sku: Mapped[int] = mapped_column(primary_key=True)
    upc: Mapped[int] = mapped_column(BigInteger, nullable=False)
    provider: Mapped[int] = mapped_column(ForeignKey("providers.id"))
    name: Mapped[str] = mapped_column(nullable=False)
    package: Mapped[str] = mapped_column(unique=False)
    current: Mapped[bool] = mapped_column(
        comment="True if the item can be still requested to a provider, "
        "False if it has been discontinued"
    )
    volume_unpacked: Mapped[int] = mapped_column(
        comment="Volume of the item unpacked in cubic decimeters"
    )
    volume_packed: Mapped[int] = mapped_column(
        comment="Volume of each unit item when packaged in cubic decimeters"
    )
```

And this is how it's reflected in the ER diagram centered on the `items` table:

![comments.png](img/comments.png)

Metadata management can be implemented as a data governance policy:

1. All fields that could be ambiguous have to be annotated with a clear definition.
2. These schemas have to be published in a tool that allows anyone to search those definitions and find where the data is stored.
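
A policy like this can be checked automatically. The sketch below assumes a hypothetical catalogue of columns and comments (hard-coded here, in practice it would be read from the database) and a made-up rule for deciding which names are ambiguous. It illustrates the idea and is not a real data catalog.

```python
# Hypothetical column catalogue: column name -> comment (None when missing).
# In practice this would be read from the database schema.
ITEM_COLUMNS = {
    "sku": None,
    "upc": None,
    "current": "True if the item can still be requested from a provider",
    "volume_unpacked": "Volume of the item unpacked in cubic decimeters",
    "volume_packed": None,
}

# Assumed naming convention: these prefixes flag fields that need a definition.
AMBIGUOUS_PREFIXES = ("volume", "current")


def missing_definitions(columns, prefixes=AMBIGUOUS_PREFIXES):
    """Return the ambiguous columns that have no comment attached."""
    return sorted(
        name
        for name, comment in columns.items()
        if name.startswith(prefixes) and not comment
    )


def search(columns, term):
    """Very small stand-in for a data catalogue: search names and comments."""
    term = term.lower()
    return sorted(
        name
        for name, comment in columns.items()
        if term in name.lower() or term in (comment or "").lower()
    )


violations = missing_definitions(ITEM_COLUMNS)
hits = search(ITEM_COLUMNS, "volume")
print(violations, hits)
```

In this made-up catalogue `volume_packed` breaks the first rule because it has no definition, while `search` plays the role of the second rule by finding every field that mentions a term.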

Point number 2 is more important than it seems, and its implementation is usually called "Data Discovery". There are tools like [Amundsen](https://www.amundsen.io/) (open source) or [Collibra](https://www.collibra.com/us/en) (proprietary) that implement data catalogs: you connect them to your data sources, they extract all the metadata those sources contain, and they archive it to create a searchable index, similarly to what Internet search engines do. Some organizations implement a simplified form of metadata management, where only the fields in the data warehouse (more on this later) are annotated. In this case they tend to use tools that are specific to the database technology, like [Oracle's data catalog](https://www.oracle.com/big-data/data-catalog/).

This allows you to make sure that every time the term `sku` is used it actually refers to a Stock Keeping Unit and the storage resource is using it correctly.

## Bootstrapping the database

The first step is to create a new database in an existing PostgreSQL database server with:

```bash
createdb -h host.docker.internal -U postgres retail
```
If you're using an ATLAS Core instance you may want to use a different database name.

The package includes a set of convenience scripts to create the tables that support the digital twin. They can be accessed with the `retailtwin` command once the package has been installed. The `init` command will persist the schemas in the database.

```bash
retailtwin init postgresql://postgres:postgres@host.docker.internal/retail
```

And the `bootstrap` command will fill the database with some dummy data:

```bash
retailtwin bootstrap postgresql://postgres:postgres@host.docker.internal/retail
```

After running these two commands a stocked chain of grocery stores will be available:

```bash
psql -h host.docker.internal -U postgres -c "select * from customers limit 10" retail
```

!!! success "Output"

    ```
     id | document | info
    ----+----------+-------------------------------------
      1 | 59502033 | {"name": "Nicholas William James"}
      2 | 32024229 | {"name": "Edward Jeffrey Roth"}
      3 | 40812760 | {"name": "Teresa Jason Mcgee"}
      4 | 52305886 | {"name": "Emily Jennifer Lopez"}
      5 | 92176879 | {"name": "Joseph Leslie Torres"}
      6 | 60956977 | {"name": "Brandon Carmen Leonard"}
      7 | 04707863 | {"name": "Richard Kathleen Torres"}
      8 | 74587935 | {"name": "Emily Anne Pugh"}
      9 | 78857405 | {"name": "James Rachel Rodriguez"}
     10 | 80980264 | {"name": "Paige Kiara Chavez"}
    ```

## Normalized data, functions and procedures

If data models are normalized, many tables will include many references to other models. These are some of the contents of the model `Itemsonshelf`, which contains the items that are available in one particular location, and their quantity.

!!! success "Output (truncated)"

    ```
    id |batch|discount|quantity|location|
    ---+-----+--------+--------+--------+
      1|    1|        |      31|       1|
      2|    2|        |      31|       1|
      3|    3|        |      31|       1|
      4|    4|        |      31|       1|
      5|    5|        |      31|       1|
      6|    6|        |      31|       1|
      7|    7|       4|      31|       1|
      8|    8|        |      31|       1|
      9|    9|       7|      31|       1|
    ```

This table only contains foreign keys, and quantities are provided on a per-batch basis. Obtaining very simple metrics, like a location's current stock of a particular item, requires joining multiple tables. This is why databases tend to bundle data and logic. Modern Database Management Systems (DBMS) are programmable, and users can define functions and procedures to simplify queries and automate data operations. The query that gets the stock of an item in a given location can be expressed as a function in SQL as:

```sql
{!../src/retailtwin/sql/stock_on_location.sql!}
```
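
To give a rough idea of the kind of join involved, here is a minimal sketch in Python with the built-in `sqlite3` module. It is not the contents of `stock_on_location.sql`: the two-table schema is a simplified assumption, and the argument order `(location, item)` of the helper is made up for this example.

```python
import sqlite3

# Simplified, hypothetical schema; the real models live in the package.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE batches (id INTEGER PRIMARY KEY, item INTEGER NOT NULL);
    CREATE TABLE itemsonshelf (
        id INTEGER PRIMARY KEY,
        batch INTEGER NOT NULL REFERENCES batches (id),
        quantity INTEGER NOT NULL,
        location INTEGER NOT NULL
    );
    INSERT INTO batches (id, item) VALUES (1, 1), (2, 1), (3, 2);
    INSERT INTO itemsonshelf (batch, quantity, location)
    VALUES (1, 31, 1), (2, 10, 1), (3, 31, 1), (1, 5, 2);
    """
)


def stock_on_location(connection, location, item):
    """Add up the quantities of every batch of an item on one location."""
    row = connection.execute(
        """
        SELECT COALESCE(SUM(s.quantity), 0)
        FROM itemsonshelf AS s
        JOIN batches AS b ON b.id = s.batch
        WHERE s.location = ? AND b.item = ?
        """,
        (location, item),
    ).fetchone()
    return row[0]


stock = stock_on_location(conn, 1, 1)
empty = stock_on_location(conn, 3, 1)
print(stock, empty)
```

Item 1 has two batches on location 1 (31 and 10 units), so the helper adds them up, and a location with no shelves for that item reports zero instead of `NULL`.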

Functions can be called both as values and as tables, since functions may return one or multiple records:

```sql
select * from stock_on_location(1,1)
```

is equivalent to

```sql
select stock_on_location(1,1)
```

and both calls return the same result:

!!! success "Output"

    ```
    stock_on_location|
    -----------------+
                   31|
    ```

This package contains a set of functions, procedures, triggers, and other helpers that can be recorded into the database with:

```bash
retailtwin sync postgresql://postgres:postgres@host.docker.internal/retail
```
4
docs/dl.md
Normal file
@@ -0,0 +1,4 @@
# The Data Lake

So far, this case study has only covered how corporations deal with structured data, but data comes in many shapes and forms. An invoice in a PDF file is data too, and

29
docs/dw.md
Normal file
@@ -0,0 +1,29 @@
# The Data Warehouse

## What is a data warehouse?

A Data Warehouse (DW) is a database or a set of databases that:

1. Implements a resource as close as possible to a *single source of truth* for the entire corporation.
2. Provides relevant aggregated metrics and Key Performance Indicators (KPI).
3. Stores historical records of relevant metrics to assess the performance of the corporation.

DW are hubs of data. On the input side, data is periodically fetched from all transactional systems by a set of batch processes. These processes don't just copy the data from transactional systems verbatim: they execute a set of transformations and aggregations to make the final outcome easier to work with, and generate the KPI that are relevant to high-level analysis.
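
A batch process of this kind can be sketched in a few lines of plain Python. The records, field names and metrics below are hypothetical, and a list stands in for the warehouse table. The only goal is to show the three steps: fetch the transactions of one day, aggregate them per location, and append the result as historical rows.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactional records: one row per item checked out.
SALES = [
    {"day": date(2023, 9, 1), "location": 1, "sku": 10, "amount": 2.5},
    {"day": date(2023, 9, 1), "location": 1, "sku": 11, "amount": 4.0},
    {"day": date(2023, 9, 1), "location": 2, "sku": 10, "amount": 2.5},
    {"day": date(2023, 9, 2), "location": 1, "sku": 10, "amount": 2.5},
]


def extract(records, day):
    """Fetch the transactions recorded on a given day."""
    return [r for r in records if r["day"] == day]


def transform(records):
    """Aggregate revenue and number of items sold per location."""
    totals = defaultdict(lambda: {"revenue": 0.0, "items": 0})
    for r in records:
        totals[r["location"]]["revenue"] += r["amount"]
        totals[r["location"]]["items"] += 1
    return dict(totals)


def load(warehouse, day, aggregates):
    """Append one historical row per location to the warehouse table."""
    for location, kpi in sorted(aggregates.items()):
        warehouse.append((day.isoformat(), location, kpi["revenue"], kpi["items"]))


warehouse = []  # stand-in for a table in the data warehouse
batch_day = date(2023, 9, 1)
load(warehouse, batch_day, transform(extract(SALES, batch_day)))
print(warehouse)
```

Four transactional rows become two aggregated rows, one per location, which is the shape analysts usually query.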

On the output side, DW provide a unique and aggregated vision of the corporation that is used across the board. DW are not critical to keep the operations up and running, but they are key to assess and improve performance across the entire corporation. DW are also leveraged to implement a wealth of use cases that generate more value for customers and stakeholders, like:

* Supply chain management and control.
* Campaign performance analytics.
* Executive dashboards.
* Churn and upsell scoring.
* Demand forecasts.

DW tend to be implemented with analytical databases because data is recorded in batches, and queries are mostly aggregations of sets of historical records. Depending on the size of the corporation and the number of data sources, DW can be pretty large and expensive to build and to maintain. Data warehouses may contain thousands of tables with thousands of batch processes fetching and transforming data to its final shape.

## Extract, Transform, and Load (ETL)


## Data Governance: Lineage

It's common to use the data warehouse as a scratch space for users doing analytics.

BIN
docs/img/comments.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 139 KiB |
BIN
docs/img/injection.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 223 KiB |
BIN
docs/img/relational_diagram.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 165 KiB |
BIN
docs/img/terminal.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 74 KiB |
BIN
docs/img/webterminal.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 276 KiB |
30
docs/introduction.md
Normal file
@@ -0,0 +1,30 @@
# Introduction

This long chapter intends to answer why all the technologies and techniques covered in this text are seldom used in an enterprise environment. There has been a substantial effort to introduce the techniques to implement data management pipelines in the most efficient way. It's humbling to find out that what we see in clients points in the opposite direction:

1. The use of open-source DBMS tends to be marginal, and Oracle still dominates the market.
2. Automation is managed by old-school enterprise management tools like Control-M, or
3. No-code solutions are preferred, and DevOps and code-first tools sound much like a startup-led fad.
4. Tools that are good enough are often abused. If you can do X with SAP, you will do it.
5. Business automation is still mostly developed in SQL, PL/SQL, and Java.
6. There's real resistance to change, and some core systems written in COBOL in the nineties are still alive and kicking.
7. There are strong barriers between "critical" systems (transactional systems, core automation) and "non-critical" components (Data warehouse, CRM).
8. Things that work are not touched unless strictly necessary, and new features are discussed at length, planned and executed carefully, and thoroughly tested.
9. *Benign neglect* is the norm. If a new reality forces a change, sometimes it's preferable to adapt reality with wrappers, translation layers, interfaces...
10. Complexity coming from technical debt is seen as a trade-off between risks. Complexity is a manageable risk, while reimplementation is a very hard-to-manage risk.
|
||||||
|
|
||||||
|
It's common to see software not as an asset, but as a liability. It's something you have to do to run your company and create profits, and you only remember about its existence when it breaks or when a team of expensive consultants suggests a feature that can't be implemented. Companies that follow this line of thought end up considering software mostly as a source of risk, and it's very hard to put a value tag at a risk. You seldom see teams of risk experts estimating that, unless a major rewrite of a particular component takes place, probability of total meltdown will be around 50% five years down the road.
|
||||||
|
|
||||||
|
In addition, the effect of aging and unfit systems and software is commonly too slow to notice or too fast to react to. Sometimes current systems are capable of keeping the operations alive with no major incidences but implementing new features is too hard. Competitors may not face these limitations and market share will be lost one customer at a time. Maybe these aging automations just melt down when an event never seen before, like COVID, the emergence of generative AI, or a huge snowstorm caused by climate change can't be handled and a relevant portion of the customer base leaves and never comes back.
|
||||||
|
|
||||||
|
!!! example
|
||||||
|
|
||||||
|
Southwest's scheduling meltdown in 2022 was so severe that it has its own [entry on Wikipedia](https://en.wikipedia.org/wiki/2022_Southwest_Airlines_scheduling_crisis). The outdated software running on outdated hardware caused all kinds of disasters, like not being able to track where crews were or whether they were available. The consequence was more than fifteen thousand flights cancelled in ten days. Razor-thin operating margins were continuously blamed, but Southwest has historically paid [significant dividends](https://www.nasdaq.com/market-activity/stocks/luv/dividend-history) to shareholders, and EBITDA was higher than $4B/year from 2015 to 2019. Southwest announced their intention to allocate $1.3B to upgrade their systems which, considering that the investment will probably be spread across a decade, is not a particularly ambitious plan. Southwest had strong reasons to update their software and their systems, but they never considered it a priority until it was too late.
|
||||||
|
|
||||||
|
Pragmatism dominates decision making in enterprise IT, and touching things as little as possible in the simplest way tends to be the norm. You've probably heard that *when all you have is a hammer, everything looks like a nail*; but many things work like a nail if you hit them with a hammer hard enough. There's sometimes an enormous distance between the ideal solution and something that *Just Works™*, and pragmatic engineers tend to choose the latter. The only thing you need is a hammer, and knowing what can work as a nail. But if you keep hitting things with a hammer as hard as you can, you start cracks that may cause a major structural failure through fatigue.
|
||||||
|
|
||||||
|
This is why startups tend to disrupt markets more than well-established corporations. It's easier to put a price tag on features, predict the impact of disruptions, and simulate meltdown scenarios when you're starting with a blank sheet of paper.
|
||||||
|
|
||||||
|
You may love or hate the ideas of N. N. Taleb, but I think it's interesting to bring the concepts of fragility and antifragility into play. You create fragility by constantly making suboptimal but pragmatic choices, which create scar tissue all across the company. With a cramped body you can't try new things out; you just make it to the next day. Antifragile systems can have major components redesigned and rearchitected because each piece is robust enough to withstand the additional stress of constant change.
|
||||||
|
|
||||||
|
The following chapters describe the implementation of a digital twin of a retail corporation that operates a chain of grocery stores. The implementation is of course limited, but comprehensive enough to provide some key insights into why corporate IT looks the way it does. If anyone intends to create antifragile data systems, it's important to study how fragile systems come to be in the first place.
|
2
docs/organization.md
Normal file
|
@ -0,0 +1,2 @@
|
||||||
|
# People in the data ecosystem
|
||||||
|
|
114
docs/terminals.md
Normal file
|
@ -0,0 +1,114 @@
|
||||||
|
# Terminals to access data
|
||||||
|
|
||||||
|
Terminals are dedicated applications or devices to interact with data. This is a very wide definition: a terminal can be an actual device like a point of sale, or a web application that shows a store manager the current stock for each location.
|
||||||
|
|
||||||
|
## Command-line interface terminals
|
||||||
|
|
||||||
|
This case study includes some simple terminals with a command-line interface (CLI) that are installed along with the package:
|
||||||
|
|
||||||
|
1. `pos`: a point of sale.
|
||||||
|
2. `tasks`: a stocker terminal to assist operations at the store.
|
||||||
|
3. `stock`: a stock management terminal.
|
||||||
|
|
||||||
|
Let's start a session as store manager with the last of the listed terminals:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
stock postgresql://postgres:postgres@host.docker.internal/retail 1
|
||||||
|
```
|
||||||
|
|
||||||
|
The first argument to the `stock` command is the connection to the database, and the second is the ID of the location. The terminal greets us with the following message and a prompt:
|
||||||
|
|
||||||
|
```
|
||||||
|
Fetching data...
|
||||||
|
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
|
||||||
|
┃ Retail twin stock management CLI ┃
|
||||||
|
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
|
||||||
|
|
||||||
|
|
||||||
|
This is a simple terminal to manage stock. Enter a single-letter command followed by . The available
|
||||||
|
commands are:
|
||||||
|
|
||||||
|
• l: Lists all the current items stocked in any location
|
||||||
|
• s: Enters search mode. Search an item by name
|
||||||
|
• q: Store query mode. Queries the stock of an item by UPC in the current location
|
||||||
|
• w: Warehouse query mode. Queries the stock of an item by UPC in all warehouses. Requires connection to the database
|
||||||
|
• c: Cancel mode. Retires a batch giving a UPC. Requires connection to the database
|
||||||
|
• b: Batch mode. Requests a given quantity from an item to the warehouse. Requires connection to the database
|
||||||
|
• r: Refresh data from the stock database
|
||||||
|
• h: Print this help message
|
||||||
|
• x: Exit this terminal
|
||||||
|
#>
|
||||||
|
```
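The invocation and the single-letter commands shown above suggest a simple structure: parse two positional arguments, then dispatch each input line to a handler. The following is only a minimal sketch of that pattern, not the actual implementation of `stock`; the function names and the placeholder handlers are hypothetical.

```python
import argparse


def parse_args(argv):
    """Parse the two positional arguments: connection string and location ID."""
    parser = argparse.ArgumentParser(prog="stock")
    parser.add_argument("connection", help="database connection string")
    parser.add_argument("location", type=int, help="ID of the location")
    return parser.parse_args(argv)


def make_commands(location):
    """Map single-letter commands to handlers (hypothetical placeholders)."""
    return {
        "l": lambda: "listing all stocked items",
        "q": lambda: f"querying stock in location {location}",
        "h": lambda: "printing help",
    }


def dispatch(commands, line):
    """Run the handler for one input line; unknown commands are rejected."""
    handler = commands.get(line.strip().lower())
    if handler is None:
        return "unknown command, enter h for help"
    return handler()


def repl(commands, read=input):
    """Prompt until the user enters x, the exit command."""
    while True:
        line = read("#> ")
        if line.strip().lower() == "x":
            break
        print(dispatch(commands, line))


if __name__ == "__main__":
    args = parse_args(
        ["postgresql://postgres:postgres@host.docker.internal/retail", "1"]
    )
    print(dispatch(make_commands(args.location), "q"))
```

A closed dictionary of commands means that anything the terminal does not explicitly know how to do is simply rejected.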
|
||||||
|
|
||||||
|
This terminal, like any other terminal, provides a set of commands that interact with data in a limited set of ways. The command `s` allows us to enter a keyword, and the terminal will return a set of related items:
|
||||||
|
|
||||||
|
```
|
||||||
|
#> s
|
||||||
|
s> coffee
|
||||||
|
Items
|
||||||
|
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
|
||||||
|
┃ upc ┃ name ┃ package ┃
|
||||||
|
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
|
||||||
|
│ 566807566244 │ Instant coffee │ 200g │
|
||||||
|
│ 212350582030 │ Coffee Beans │ 1lb bag │
|
||||||
|
│ 415996582616 │ Island Blend Coffee │ 1lb bag │
|
||||||
|
│ 167369617163 │ Ground Coffee │ 1 lb │
|
||||||
|
│ 982157811808 │ Coffee Beans │ 250g │
|
||||||
|
│ 86856869931 │ Ground coffee │ 12 oz bag │
|
||||||
|
│ 520101823089 │ French Roast Coffee │ 250 gram pack │
|
||||||
|
│ 240563892573 │ Fresh coffee beans │ 500g package │
|
||||||
|
│ 389837389865 │ Instant coffee │ 200g jar │
|
||||||
|
│ 940827785911 │ Pumpkin Spice Coffee │ 12 oz Bag │
|
||||||
|
│ 375920191429 │ Premium black coffee beans │ 500 grams │
|
||||||
|
│ 926526200297 │ Dark Roast Coffee │ 500 grams │
|
||||||
|
└──────────────┴────────────────────────────┴───────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
The query mode looks up the stock of a particular item in the current location:
|
||||||
|
|
||||||
|
```
|
||||||
|
#> q
|
||||||
|
q> 566807566244
|
||||||
|
Item 566807566244 on location 1
|
||||||
|
┏━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
|
||||||
|
┃ upc ┃ batch ┃ name ┃ package ┃ received ┃ best_until ┃ quantity ┃
|
||||||
|
┡━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
|
||||||
|
│ 566807566244 │ 20 │ Instant coffee │ 200g │ 2023-08-16 17:23:08.224154 │ 2023-09-15 17:23:08.224154 │ 193 │
|
||||||
|
└──────────────┴───────┴────────────────┴─────────┴────────────────────────────┴────────────────────────────┴──────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
## If it's smart it's vulnerable
|
||||||
|
|
||||||
|
It's frequent to assume that CLI terminals are outdated and that more modern web-based user interfaces are the most common. But there are many old-school terminals still around. Point of sale terminals tend to be very basic as well, with displays only capable of showing a handful of characters, and a button for each command. Since this is a digital twin, with complete freedom to implement anything we want, building a dumb terminal is important to introduce the following point:
|
||||||
|
|
||||||
|
The most important constraint when designing enterprise data systems is information security, and the dumber the terminal, the more secure it is. PoS terminals tend to be dumb because there's money inside. One key concept in information security is the *attack surface* of a system. A console with no graphical interface and a handful of commands connected to a database is inherently more secure than a web interface that needs a browser, an HTTP connection, a web server, and a database. I can't recommend enough the book [If it's smart it's vulnerable](https://www.ifitssmartitsvulnerable.com/) by the veteran information security researcher Mikko Hypponen. Maybe that $200 cloud-connected PoS with a fancy screen from Alibaba is the door someone exploits to start a ransomware attack, or that simple web terminal that the cheapest bidder implemented is vulnerable to SQL injection.
|
||||||
|
|
||||||
|
![https://xkcd.com/327/](img/injection.png)
|
||||||
|
|
||||||
|
[From XKCD](https://xkcd.com/327/)
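To make the SQL injection risk concrete, here is a small sketch using Python's built-in `sqlite3` module and an in-memory table. The table layout and function names are assumptions made for illustration, and the case study itself uses PostgreSQL, but the principle is the same: user input pasted into the SQL text can change the query, while a bound parameter cannot.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory table with two sample items."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (upc TEXT, name TEXT)")
    conn.executemany(
        "INSERT INTO items VALUES (?, ?)",
        [("566807566244", "Instant coffee"), ("212350582030", "Coffee Beans")],
    )
    return conn


def search_unsafe(conn, keyword):
    """Vulnerable: the keyword becomes part of the SQL text itself."""
    sql = f"SELECT name FROM items WHERE name LIKE '%{keyword}%'"
    return [row[0] for row in conn.execute(sql)]


def search_safe(conn, keyword):
    """Safer: the keyword is sent as a bound parameter, never as SQL."""
    sql = "SELECT name FROM items WHERE name LIKE ?"
    return [row[0] for row in conn.execute(sql, (f"%{keyword}%",))]


conn = make_db()
payload = "zzz' OR 1=1 --"
# The unsafe version returns every row even though nothing matches "zzz".
print(len(search_unsafe(conn, payload)))
# The safe version treats the payload as a plain string and finds nothing.
print(search_safe(conn, payload))
```

The payload closes the string literal, adds a condition that is always true, and comments out the rest of the statement. With a bound parameter the same characters are only data.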
|
||||||
|
|
||||||
|
CLI terminals also run everywhere, and require almost no support from the operating system. Here's the Windows command prompt running the `stock` terminal application.
|
||||||
|
|
||||||
|
![terminal.png](img/terminal.png)
|
||||||
|
|
||||||
|
There's a 99% chance that the future Windows version released in 2033 is still able to run this application. That may not be valid for a web-based application developed with today's technologies. The most popular browser technology in corporate clients ten years ago was still Internet Explorer, and web applications had to implement support for it.
|
||||||
|
|
||||||
|
## API-based web applications
|
||||||
|
|
||||||
|
Some terminals, like PoS, run on specific hardware with a dedicated display and user interface. CLI terminals' display is the operating system's console. Web applications' display is a browser, which is today almost as capable as an operating system. The entire Microsoft Office suite can now run in a browser.
|
||||||
|
|
||||||
|
Web applications require the following components.
|
||||||
|
|
||||||
|
* A database. Data has to be stored somewhere.
|
||||||
|
* A web server that runs the business logic and interfaces data with presentation.
|
||||||
|
* An application, or presentation logic, that runs on the browser.
|
||||||
|
|
||||||
|
The code running on the database and the web server is frequently called *backend*, while the code running on the browser is frequently called *frontend*. It's obvious that implementing a terminal as a web application will require significantly more effort to develop, deploy and secure than a CLI terminal.
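The three components listed above can be sketched with nothing but the Python standard library. This is an illustrative toy, not the web terminal of this case study: the `stock` table, the `/stock` endpoint and every function name are assumptions, and an in-memory SQLite database stands in for PostgreSQL.

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def make_db():
    """Database tier: an in-memory table with hypothetical columns."""
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute(
        "CREATE TABLE stock (upc TEXT, location INTEGER, quantity INTEGER)"
    )
    conn.execute("INSERT INTO stock VALUES ('566807566244', 1, 193)")
    return conn


def stock_response(conn, url):
    """Business logic: turn a request URL into a status code and a JSON body."""
    parsed = urlparse(url)
    if parsed.path != "/stock":
        return 404, json.dumps({"error": "not found"})
    location = parse_qs(parsed.query).get("location", [""])[0]
    if not location.isdecimal():
        return 400, json.dumps({"error": "location must be a number"})
    rows = conn.execute(
        "SELECT upc, quantity FROM stock WHERE location = ?", (int(location),)
    ).fetchall()
    return 200, json.dumps([{"upc": u, "quantity": q} for u, q in rows])


def make_handler(conn):
    """Web server tier: expose the business logic over HTTP."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, body = stock_response(conn, self.path)
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body.encode("utf-8"))

    return Handler


def serve(port=8000):
    """Start the server; a browser frontend would fetch /stock?location=1."""
    HTTPServer(("127.0.0.1", port), make_handler(make_db())).serve_forever()
```

Even this toy has more moving parts than the CLI terminal, and it still lacks authentication, encryption and the frontend itself.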
|
||||||
|
|
||||||
|
![webterminal.png](img/webterminal.png)
|
||||||
|
|
||||||
|
The previous image shows a web application that implements a set of features analogous to those of the previous CLI terminal. It can list the items that are available in each location, retire batches, list the available stock in terminals, search products... The implementation took roughly *ten times longer* than the CLI terminal, but it's clearly more powerful, feature-rich and easier to use.
|
||||||
|
|
||||||
|
Web development is a field in constant change, which makes technological choices both harder and more relevant. New frameworks and libraries are published every year, and the ecosystem is so fragmented that a software engineer will be fluent in only a handful of technologies in a landscape of hundreds of competing ones.
|
||||||
|
|
||||||
|
The design and implementation of effective user interfaces is as important as it is hard and time-consuming. Don't assume users can be trained to use any user interface that *makes sense*. There should be constant usability tests to gather feedback from users and tune the user experience. In the end, the web frontend is the only component of a very large ecosystem that the final user sees. Web applications face the risk of not being successful because of a bad user experience. User interfaces are also important to prevent users from entering wrong input and causing operational issues.
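As a sketch of how an interface can stop wrong input before it reaches the database, the following helpers validate a UPC and a requested quantity. The rules are assumptions chosen for illustration (digits only, at most 12 digits, and an arbitrary maximum quantity of 1000); a real system would take them from its own business rules.

```python
def parse_upc(text):
    """Return a cleaned UPC string or raise ValueError with a readable message."""
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("UPC is empty")
    if not (cleaned.isascii() and cleaned.isdigit()):
        raise ValueError("UPC must contain only digits")
    if len(cleaned) > 12:  # assumed upper bound, not taken from the case study
        raise ValueError("UPC is longer than 12 digits")
    return cleaned


def parse_quantity(text, maximum=1000):
    """Return a positive integer quantity no larger than the assumed maximum."""
    cleaned = text.strip()
    if not (cleaned.isascii() and cleaned.isdigit()):
        raise ValueError("quantity must be a positive whole number")
    value = int(cleaned)
    if value == 0 or value > maximum:
        raise ValueError(f"quantity must be between 1 and {maximum}")
    return value


if __name__ == "__main__":
    for raw in ["566807566244", "12a", ""]:
        try:
            print("accepted", parse_upc(raw))
        except ValueError as error:
            print("rejected:", error)
```

Rejecting input with a clear message at the edge is cheaper than cleaning up a batch request for a million units after the fact.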
|
62
mkdocs.yml
Normal file
|
@ -0,0 +1,62 @@
|
||||||
|
site_name: My Docs
|
||||||
|
repo_url: https://github.com/bcgx-gp-atlasCommunity/retailtwin
|
||||||
|
edit_uri: edit/main/docs/
|
||||||
|
|
||||||
|
nav:
|
||||||
|
- "Introduction": "introduction.md"
|
||||||
|
- "People in the data ecosystem": "organization.md"
|
||||||
|
- "How data is stored": "data.md"
|
||||||
|
- "Terminals to access data": "terminals.md"
|
||||||
|
- "How automation is implemented": "automation.md"
|
||||||
|
- "Buy vs Build": "buyvsbuild.md"
|
||||||
|
- "The Data Warehouse": "dw.md"
|
||||||
|
- "The Data Lake": "dl.md"
|
||||||
|
theme:
|
||||||
|
name: "material"
|
||||||
|
palette:
|
||||||
|
# Palette toggle for automatic mode
|
||||||
|
- media: "(prefers-color-scheme)"
|
||||||
|
primary: teal
|
||||||
|
toggle:
|
||||||
|
icon: material/brightness-auto
|
||||||
|
name: Switch to light mode
|
||||||
|
|
||||||
|
- media: "(prefers-color-scheme: light)"
|
||||||
|
scheme: default
|
||||||
|
primary: teal
|
||||||
|
toggle:
|
||||||
|
icon: material/brightness-7
|
||||||
|
name: Switch to dark mode
|
||||||
|
|
||||||
|
# Palette toggle for dark mode
|
||||||
|
- media: "(prefers-color-scheme: dark)"
|
||||||
|
scheme: slate
|
||||||
|
primary: teal
|
||||||
|
toggle:
|
||||||
|
icon: material/brightness-4
|
||||||
|
name: Switch to light mode
|
||||||
|
|
||||||
|
features:
|
||||||
|
- content.action.edit
|
||||||
|
- content.code.copy
|
||||||
|
icon:
|
||||||
|
edit: material/pencil
|
||||||
|
view: material/eye
|
||||||
|
|
||||||
|
plugins:
|
||||||
|
- search
|
||||||
|
# - with-pdf
|
||||||
|
|
||||||
|
markdown_extensions:
|
||||||
|
- md_in_html
|
||||||
|
- admonition
|
||||||
|
- tables
|
||||||
|
- pymdownx.highlight:
|
||||||
|
anchor_linenums: true
|
||||||
|
line_spans: __span
|
||||||
|
pygments_lang_class: true
|
||||||
|
- pymdownx.inlinehilite
|
||||||
|
- pymdownx.snippets
|
||||||
|
- pymdownx.superfences
|
||||||
|
- markdown_include.include:
|
||||||
|
base_path: docs
|