guillem/dr

Command-line data file processing in Rust https://git.guillemborrell.es/guillem/dr

Go to file

Guillem Borrell Nogueras 9f7b3605f4 All checks were successful ci/woodpecker/push/woodpecker Pipeline was successful Details Added note on performance		2022-11-21 12:13:37 +01:00
notebooks	Got to read from stdin, write to stdoutgit	2022-11-19 15:36:24 +00:00
python	Added cicd script	2022-11-20 22:00:58 +00:00
queries	Added cicd script	2022-11-20 22:00:58 +00:00
src	Added cicd script	2022-11-20 22:00:58 +00:00
.gitignore	Added cicd script	2022-11-20 22:00:58 +00:00
.woodpecker.yml	Upload package and publish new version	2022-11-21 00:26:22 +01:00
Cargo.toml	Update version	2022-11-21 00:27:32 +01:00
LICENSE	Initial commit	2022-11-13 09:13:25 +01:00
README.md	Added note on performance	2022-11-21 12:13:37 +01:00
wine.csv	Got to read from stdin, write to stdoutgit	2022-11-19 15:36:24 +00:00

README.md

dr.rs

| Download

A toolkit to process data files (csv and parquet) using the command line, inspired by csvkit, with blazing speed, and powered by Rust.

You may wonder why I'm implementing this, since there's already xsv. There are two reasons for that:

This what I'm implementing to learn Rust
The Rust data ecosystem has evolved immensely since xsv was sarted. Now we can add things like SQL commands to filter csv files, or translate results to parquet files.

Example

$ head wine.csv
Wine,Alcohol,Malic.acid,Ash,Acl,Mg,Phenols,Flavanoids,Nonflavanoid.phenols,Proanth,Color.int,Hue,OD,Proline
1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480
1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735
1,14.2,1.76,2.45,15.2,112,3.27,3.39,.34,1.97,6.75,1.05,2.85,1450
1,14.39,1.87,2.45,14.6,96,2.5,2.52,.3,1.98,5.25,1.02,3.58,1290
1,14.06,2.15,2.61,17.6,121,2.6,2.51,.31,1.25,5.05,1.06,3.58,1295
1,14.83,1.64,2.17,14,97,2.8,2.98,.29,1.98,5.2,1.08,2.85,1045

$ cat wine.csv | dr sql "select Wine, avg(Alcohol) from this group by Wine" | dr print
shape: (3, 2)
┌──────┬───────────┐
│ Wine ┆ Alcohol   │
│ ---  ┆ ---       │
│ i64  ┆ f64       │
╞══════╪═══════════╡
│ 3    ┆ 13.15375  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1    ┆ 13.744746 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2    ┆ 12.278732 │
└──────┴───────────┘

Performance

dr is implemented in Rust with the goal of achieving the highest possible performance. Take for instance a simple read, groupby, and aggregate operation with ~30MB of data:

$ time cat data/walmart_train.csv | ./target/release/dr sql "select Dept, avg("Weekly_Sales") from this group by Dept" | ./target/release/dr print
shape: (81, 2)
┌──────┬──────────────┐
│ Dept ┆ Weekly_Sales │
│ ---  ┆ ---          │
│ i64  ┆ f64          │
╞══════╪══════════════╡
│ 30   ┆ 4118.197208  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 16   ┆ 14245.63827  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 56   ┆ 3833.706211  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 24   ┆ 6353.604562  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...  ┆ ...          │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 31   ┆ 2339.440287  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 59   ┆ 694.463564   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 27   ┆ 1583.437727  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 77   ┆ 328.9618     │
└──────┴──────────────┘

real    0m0.089s
user    0m0.116s
sys     0m0.036s

Let's compare that with the followint Python script that leverages Pandas to read the data, and compute the aggregation:

#!/usr/bin/env python3

import sys
import pandas as pd

df = pd.read_csv(sys.stdin)
print(df.groupby("Dept", sort=False, as_index=False).Weekly_Sales.mean())

$ time cat data/walmart_train.csv | ./python/group.py 
    Dept  Weekly_Sales
0      1  19213.485088
1      2  43607.020113
2      3  11793.698516
3      4  25974.630238
4      5  21365.583515
..   ...           ...
76    99    415.487065
77    39     11.123750
78    50   2658.897010
79    43      1.193333
80    65  45441.706224

[81 rows x 2 columns]

real    0m0.717s
user    0m0.627s
sys     0m0.282s

Note that there's roughly a 6x speedup. This considering that this operation in particular is heavily optimized in Pandas and most of the run time is spent in parsing and reading from stdin.

Built standing on the shoulders of giants

None of this would be possible without Polars