CLI utilities for text preprocessing and corpus management
```shell
$ pip install nltools-cli
```
A set of composable command-line tools for common text processing tasks: tokenization, cleanup, format conversion, corpus sampling. Designed to work with Unix pipes.
| Command | Description |
|---|---|
| nl-tokenize | Tokenize text by words, sentences, or paragraphs |
| nl-clean | Strip HTML, normalize Unicode, remove diacritics |
| nl-freq | Frequency analysis, n-grams, stopword filtering |
| nl-split | Split corpus into train/val/test with ratio control |
| nl-convert | Convert between txt, jsonl, csv, conll formats |
| nl-sample | Stratified sampling from large corpora |
| nl-stats | Text statistics: lengths, char distribution, language detection |
Each tool reads from stdin or a file and writes to stdout.
```shell
# tokenize and get word frequencies
$ cat corpus.txt | nl-tokenize --mode word | nl-freq --top 20
the 14523
of 9841
and 8712
to 7654
a 6998
...
```
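For a sense of what the frequency step computes, here is a minimal Python sketch of the equivalent logic. Whitespace tokenization and lowercasing are assumptions for illustration; nl-tokenize's actual word segmentation rules may differ.

```python
from collections import Counter

def top_words(text: str, n: int = 20) -> list[tuple[str, int]]:
    """Rough equivalent of `nl-tokenize --mode word | nl-freq --top N`."""
    # Assumes simple whitespace tokenization and lowercasing.
    counts = Counter(text.lower().split())
    return counts.most_common(n)

print(top_words("the cat and the dog and the bird", 2))
# [('the', 3), ('and', 2)]
```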
```shell
# clean up scraped HTML and convert to jsonl
$ nl-clean --strip-tags --normalize raw_pages/ | nl-convert --to jsonl -o clean.jsonl
```
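The clean-and-convert step can be sketched with the standard library alone. The use of NFC normalization and the `{"text": ...}` record shape are assumptions here, not necessarily what nl-clean and nl-convert do internally.

```python
import json
from html.parser import HTMLParser
from unicodedata import normalize

class TextExtractor(HTMLParser):
    """Collects text content from HTML, discarding all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def clean_to_jsonl(html: str) -> str:
    # Strip tags, collapse whitespace, then apply NFC normalization
    # (the exact normalization form used by nl-clean is an assumption).
    parser = TextExtractor()
    parser.feed(html)
    text = normalize("NFC", " ".join(s.strip() for s in parser.parts if s.strip()))
    return json.dumps({"text": text})

print(clean_to_jsonl("<p>Hello <b>world</b></p>"))
# {"text": "Hello world"}
```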
```shell
# split dataset with reproducible seed
$ nl-split --ratio 0.8 0.1 0.1 --seed 42 corpus.jsonl -o splits/
train.jsonl 24,819 lines
val.jsonl 3,102 lines
test.jsonl 3,103 lines
```
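A seeded split like the one above can be sketched as follows: shuffle once with a fixed seed, then slice by the ratios. This is an illustrative equivalent, not nl-split's actual implementation (for instance, how it assigns remainder lines may differ).

```python
import random

def split(lines, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle with a fixed seed, then slice into train/val/test by ratio."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    rng = random.Random(seed)       # fixed seed makes the split reproducible
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n = len(shuffled)
    train_end = int(n * ratios[0])
    val_end = train_end + int(n * ratios[1])
    # Remainder lines fall into the test slice.
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]

train, val, test = split(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

Because the seed is fixed, running the split twice on the same input yields identical partitions.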
Run any command with --help for the full option reference. See the documentation for detailed guides.
Requires Python 3.10 or later. Core functionality has no external dependencies. Optional extras: icu for advanced Unicode normalization, fasttext for language detection in nl-stats.