Added note on performance
parent
c16b9cfb4d
commit
9f7b3605f4
69
README.md
@ -41,6 +41,75 @@ shape: (3, 2)
## Performance
`dr` is implemented in Rust with the goal of achieving the highest possible performance. Take for instance a simple read, groupby, and aggregate operation with ~30MB of data:

```bash
$ time cat data/walmart_train.csv | ./target/release/dr sql "select Dept, avg("Weekly_Sales") from this group by Dept" | ./target/release/dr print
shape: (81, 2)
┌──────┬──────────────┐
│ Dept ┆ Weekly_Sales │
│ ---  ┆ ---          │
│ i64  ┆ f64          │
╞══════╪══════════════╡
│ 30   ┆ 4118.197208  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 16   ┆ 14245.63827  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 56   ┆ 3833.706211  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 24   ┆ 6353.604562  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...  ┆ ...          │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 31   ┆ 2339.440287  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 59   ┆ 694.463564   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 27   ┆ 1583.437727  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 77   ┆ 328.9618     │
└──────┴──────────────┘

real    0m0.089s
user    0m0.116s
sys     0m0.036s
```

Let's compare that with the following Python script, which uses Pandas to read the data and compute the aggregation:

```python
#!/usr/bin/env python3

import sys
import pandas as pd

df = pd.read_csv(sys.stdin)
print(df.groupby("Dept", sort=False, as_index=False).Weekly_Sales.mean())
```

```bash
$ time cat data/walmart_train.csv | ./python/group.py
    Dept  Weekly_Sales
0      1  19213.485088
1      2  43607.020113
2      3  11793.698516
3      4  25974.630238
4      5  21365.583515
..   ...           ...
76    99    415.487065
77    39     11.123750
78    50   2658.897010
79    43      1.193333
80    65  45441.706224

[81 rows x 2 columns]

real    0m0.717s
user    0m0.627s
sys     0m0.282s
```

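A single `time` run can be noisy, so it may be worth repeating each pipeline a few times before drawing conclusions. The following is a minimal sketch of one way to do that, not part of `dr` itself: the helper `time_pipeline` and the `PIPELINES` dictionary are hypothetical names introduced for illustration, and the commands are copied verbatim from the two runs shown here. The measured time includes shell startup, so it will differ slightly from what `time` reports.

```python
import statistics
import subprocess
import time

# Shell pipelines copied from the runs above (paths are those used in this README).
PIPELINES = {
    "dr": 'cat data/walmart_train.csv | ./target/release/dr sql '
          '"select Dept, avg("Weekly_Sales") from this group by Dept" '
          '| ./target/release/dr print',
    "pandas": "cat data/walmart_train.csv | ./python/group.py",
}


def time_pipeline(command: str, runs: int = 10) -> list[float]:
    """Run a shell pipeline `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard stdout so terminal rendering does not influence the measurement.
        subprocess.run(command, shell=True, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings: list[float]) -> str:
    """Format the median and minimum of a list of timings in seconds."""
    return f"median {statistics.median(timings):.3f}s, min {min(timings):.3f}s"
```

From the repository root, `print(summarize(time_pipeline(PIPELINES["dr"])))` and the same call with `PIPELINES["pandas"]` would give one summary line per pipeline. The median is used because it is less sensitive to an occasional slow run than the mean.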
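Going by these timings, `dr` is roughly 8x faster in wall-clock time (0.089 s versus 0.717 s) and roughly 6x faster in combined user and sys CPU time. This is despite the fact that this particular operation is heavily optimized in Pandas and that most of the run time is spent reading and parsing the data from stdin.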
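The speedup can be checked directly from the `time` output of the two runs. The short sketch below does the arithmetic; the helper names `parse_time` and `speedups` are made up for this example, and the values are copied from the timings reported in this section.

```python
import re


def parse_time(value: str) -> float:
    """Convert a bash `time` value such as '0m0.089s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def speedups(fast: dict, slow: dict) -> tuple[float, float]:
    """Return (wall-clock ratio, CPU-time ratio) of the slow run over the fast run."""
    f = {key: parse_time(value) for key, value in fast.items()}
    s = {key: parse_time(value) for key, value in slow.items()}
    wall = s["real"] / f["real"]
    # CPU time is user + sys; it can exceed wall-clock time when several cores are used.
    cpu = (s["user"] + s["sys"]) / (f["user"] + f["sys"])
    return wall, cpu


# Timings copied from the two runs in this section.
dr_run = {"real": "0m0.089s", "user": "0m0.116s", "sys": "0m0.036s"}
pandas_run = {"real": "0m0.717s", "user": "0m0.627s", "sys": "0m0.282s"}

wall, cpu = speedups(dr_run, pandas_run)
print(f"wall-clock speedup: {wall:.1f}x")  # prints "wall-clock speedup: 8.1x"
print(f"CPU-time speedup:   {cpu:.1f}x")   # prints "CPU-time speedup:   6.0x"
```

These figures come from a single run of each pipeline on one machine, so they are indicative rather than a rigorous benchmark.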
## Built standing on the shoulders of giants