Why analytic projects need at least two databases
Guillem Borrell Nogueras edited this page 1 year ago

I started a fairly long thread on my work's Slack when I posted benchmarks showing how much faster DuckDB and Polars are compared to Pandas and PostgreSQL. One of the first comments mentioned that, while the comparison with Pandas made sense, the comparison with PostgreSQL didn't: it's pretty obvious that embedded columnar query engines are faster than a row-based OLTP DBMS. This made me reflect on why I thought the comparison made sense anyway. Many times, and it's something I've done myself, projects use a single DBMS because having a single source of truth is far more important than maximizing performance, and transactional workloads take precedence over analytical queries.

And let's be honest, PostgreSQL's performance in OLAP workloads is sometimes good enough.

I'm struggling to find the best possible way to keep two separate databases, one for OLTP and one for OLAP. The first stores the transactional information, while the second acts as a read-only analytical replica. Of course, you can't just have the two databases; you need a way to keep them in sync, and suggesting Airflow as part of the stack outright closes the discussion about two databases for minimalistic projects.

I'm thinking about two alternatives today:

  • PostgreSQL + DuckDB
  • PostgreSQL + Clickhouse

It's striking how similar these two options are in terms of features. Both ClickHouse and DuckDB have connectors for PostgreSQL, and mirror tables can be refreshed by executing a query with a cron job or similar.
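As a rough sketch of what that refresh could look like (connection strings, database, table, and user names here are made up for illustration), DuckDB's `postgres` extension lets you attach the OLTP database and rebuild a local columnar mirror with a single query, which is exactly the kind of thing a cron job can run:

```sql
-- DuckDB: attach the transactional Postgres database read-only
INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=app host=127.0.0.1 user=analytics' AS pg (TYPE postgres, READ_ONLY);

-- Refresh the local columnar mirror of a transactional table
CREATE OR REPLACE TABLE orders AS SELECT * FROM pg.public.orders;
```

ClickHouse can do something similar with its `PostgreSQL` database engine, which exposes the Postgres tables for querying without copying them up front:

```sql
-- ClickHouse: map the whole Postgres database (hypothetical credentials)
CREATE DATABASE pg ENGINE = PostgreSQL('127.0.0.1:5432', 'app', 'analytics', 'secret');
-- Tables are now queryable as pg.orders, pg.customers, and so on
```

In both cases the "sync" is just a scheduled query rather than a full orchestration layer, which is what makes the setup viable for minimalistic projects.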

Maybe that's the way to go, and it's an idea I'm willing to test in a real environment in some of my future projects. Has anyone tried something similar?