Improved documentation
	
		
			
	
		
	
	
		
	
		
			All checks were successful
		
		
	
	
		
			
				
	
				ci/woodpecker/push/woodpecker Pipeline was successful
				
			
		
		
	
	
				
					
				
			
		
		
parent 9f7b3605f4
commit c76150948d
					
				
							
								
								
									
										6
									
								
								.gitignore
									
									
									
									
										vendored
									
									
								
							
							
						
						
									
									
									
								
							|  | @ -19,8 +19,12 @@ Cargo.lock | |||
| # Added by cargo | ||||
| 
 | ||||
| /target | ||||
|   | ||||
| 
 | ||||
| .vscode | ||||
| .ipynb_checkpoints | ||||
| 
 | ||||
| /data | ||||
| 
 | ||||
| /vendor | ||||
| 
 | ||||
| .cargo | ||||
							
								
								
									
										48
									
								
								README.md
									
									
									
									
									
								
							
							
						
						
									
									
									
									
									
									
								
							|  | @ -2,7 +2,7 @@ | |||
| 
 | ||||
| [](https://ci.guillemborrell.es/guillem/dr) | [Download](https://git.guillemborrell.es/guillem/-/packages/generic/dr) | ||||
| 
 | ||||
| A toolkit to process data files (csv and parquet) using the command line, inspired by [csvkit](https://github.com/wireservice/csvkit), with blazing speed, and powered by Rust.  | ||||
| A toolkit to process data files (csv and parquet) using the command line, inspired by [csvkit](https://github.com/wireservice/csvkit), with blazing speed, and powered by Rust. | ||||
| 
 | ||||
| You may wonder why I'm implementing this, since there's already [xsv](https://github.com/BurntSushi/xsv). There are two reasons for that: | ||||
| 
 | ||||
|  | @ -39,6 +39,50 @@ shape: (3, 2) | |||
| └──────┴───────────┘ | ||||
| ``` | ||||
| 
 | ||||
| ## Howto | ||||
| 
 | ||||
| The `dr` command offers a set of subcommands, each with a different function. You can list the available subcommands with: | ||||
| 
 | ||||
| ```bash | ||||
| $ dr --help | ||||
| Command-line data file processing in Rust | ||||
| 
 | ||||
| Usage: dr [COMMAND] | ||||
| 
 | ||||
| Commands: | ||||
|   sql    Runs a sql statement on the file | ||||
|   print  Pretty prints the table | ||||
|   rpq    Read parquet file | ||||
|   help   Print this message or the help of the given subcommand(s) | ||||
| 
 | ||||
| Options: | ||||
|   -h, --help     Print help information | ||||
|   -V, --version  Print version information | ||||
| ``` | ||||
| 
 | ||||
| Subcommands can be pipelined unless they read from a file, write to a file, or pretty-print data. What goes through the pipeline is plain-text comma-separated values with a header. While this may not be the best choice in terms of performance, it allows `dr` subcommands to be combined with the usual Unix-style command-line tools like `cat`, `head`, `grep`, `awk`, and `sed`: | ||||
| 
 | ||||
| ```bash | ||||
| $ cat wine.csv | head -n 5 | dr print | ||||
| shape: (4, 14) | ||||
| ┌──────┬─────────┬────────────┬──────┬─────┬───────────┬──────┬──────┬─────────┐ | ||||
| │ Wine ┆ Alcohol ┆ Malic.acid ┆ Ash  ┆ ... ┆ Color.int ┆ Hue  ┆ OD   ┆ Proline │ | ||||
| │ ---  ┆ ---     ┆ ---        ┆ ---  ┆     ┆ ---       ┆ ---  ┆ ---  ┆ ---     │ | ||||
| │ i64  ┆ f64     ┆ f64        ┆ f64  ┆     ┆ f64       ┆ f64  ┆ f64  ┆ i64     │ | ||||
| ╞══════╪═════════╪════════════╪══════╪═════╪═══════════╪══════╪══════╪═════════╡ | ||||
| │ 1    ┆ 14.23   ┆ 1.71       ┆ 2.43 ┆ ... ┆ 5.64      ┆ 1.04 ┆ 3.92 ┆ 1065    │ | ||||
| ├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤ | ||||
| │ 1    ┆ 13.2    ┆ 1.78       ┆ 2.14 ┆ ... ┆ 4.38      ┆ 1.05 ┆ 3.4  ┆ 1050    │ | ||||
| ├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤ | ||||
| │ 1    ┆ 13.16   ┆ 2.36       ┆ 2.67 ┆ ... ┆ 5.68      ┆ 1.03 ┆ 3.17 ┆ 1185    │ | ||||
| ├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤ | ||||
| │ 1    ┆ 14.37   ┆ 1.95       ┆ 2.5  ┆ ... ┆ 7.8       ┆ 0.86 ┆ 3.45 ┆ 1480    │ | ||||
| └──────┴─────────┴────────────┴──────┴─────┴───────────┴──────┴──────┴─────────┘ | ||||
| ``` | ||||
| 
 | ||||
| Note that when `dr` loads csv data, it also tries to guess the data type of each field. | ||||
| 
 | ||||
| 
 | ||||
| ## Performance | ||||
| 
 | ||||
| `dr` is implemented in Rust with the goal of achieving the highest possible performance. Take for instance a simple read, groupby, and aggregate operation with ~30MB of data: | ||||
|  | @ -88,7 +132,7 @@ print(df.groupby("Dept", sort=False, as_index=False).Weekly_Sales.mean()) | |||
| ``` | ||||
| 
 | ||||
| ```bash | ||||
| $ time cat data/walmart_train.csv | ./python/group.py  | ||||
| $ time cat data/walmart_train.csv | ./python/group.py | ||||
|     Dept  Weekly_Sales | ||||
| 0      1  19213.485088 | ||||
| 1      2  43607.020113 | ||||
|  |  | |||
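The README text added in this commit says that `dr` passes plain-text CSV with a header through the pipeline and guesses the data type of each field on load. The sketch below illustrates that idea only; it is not `dr`'s implementation, which is not shown in this diff. The function names `guess_type` and `guess_schema` and the `"str"` fallback label are assumptions made for illustration; the `i64` and `f64` labels mirror the column types shown in the `dr print` output above.

```python
import csv
import io


def guess_type(values):
    """Return "i64", "f64", or "str" for a column of raw CSV strings.

    Hypothetical helper: try the narrowest type first and fall back
    to the next one as soon as a value fails to parse.
    """
    for label, cast in (("i64", int), ("f64", float)):
        try:
            for value in values:
                cast(value)
            return label
        except ValueError:
            continue
    return "str"


def guess_schema(text):
    """Map each header name to the guessed type of its column."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    columns = zip(*body)  # transpose rows into columns
    return {name: guess_type(col) for name, col in zip(header, columns)}


# A few fields taken from the wine.csv rows shown in the README example.
sample = "Wine,Alcohol,Proline\n1,14.23,1065\n1,13.2,1050\n"
print(guess_schema(sample))
```

Because every value arrives as text, any consumer of the pipeline has to repeat this kind of inference, which is part of the performance cost the README acknowledges.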