nltools

CLI utilities for text preprocessing and corpus management

Version 0.4.2 · Python ≥ 3.10 · MIT license

Install

$ pip install nltools-cli

Overview

A set of composable command-line tools for common text-processing tasks: tokenization, cleanup, format conversion, and corpus sampling. Designed to work with Unix pipes.

Tools

Command      Description
nl-tokenize  Tokenize text by words, sentences, or paragraphs
nl-clean     Strip HTML, normalize unicode, remove diacritics
nl-freq      Frequency analysis, n-grams, stopword filtering
nl-split     Split a corpus into train/val/test with ratio control
nl-convert   Convert between txt, jsonl, csv, and conll formats
nl-sample    Stratified sampling from large corpora
nl-stats     Text statistics: lengths, char distribution, language detection
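For intuition, word-, sentence-, and paragraph-mode tokenization can be approximated with the standard library alone. This is a simplified sketch of the concept, not the actual nl-tokenize implementation (the function name and regexes here are illustrative assumptions):

```python
import re

def tokenize(text: str, mode: str = "word") -> list[str]:
    """Split text into tokens by 'word', 'sentence', or 'paragraph' mode."""
    if mode == "word":
        # Word characters plus internal apostrophes and hyphens
        return re.findall(r"[\w'-]+", text)
    if mode == "sentence":
        # Naive split on terminal punctuation followed by whitespace
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if mode == "paragraph":
        # Paragraphs are separated by one or more blank lines
        return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    raise ValueError(f"unknown mode: {mode}")
```

A real tokenizer handles abbreviations, quotes, and unicode word boundaries that these regexes ignore.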

Usage

Each tool reads from stdin or a file and writes to stdout.

# tokenize and get word frequencies
$ nl-tokenize --mode word < corpus.txt | nl-freq --top 20

the        14523
of          9841
and         8712
to          7654
a           6998
...
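Conceptually, the frequency step is a counter over the token stream. A minimal Python equivalent of the pipeline above (illustrative only; the helper name is an assumption, not part of the nl-freq API):

```python
from collections import Counter

def top_words(text: str, n: int = 20) -> list[tuple[str, int]]:
    """Return the n most common lowercase whitespace-separated tokens."""
    words = text.lower().split()
    return Counter(words).most_common(n)

# Most frequent words first, with counts
print(top_words("the cat and the dog and the bird", n=2))  # [('the', 3), ('and', 2)]
```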
# clean up scraped HTML and convert to jsonl
$ nl-clean --strip-tags --normalize raw_pages/ | nl-convert --to jsonl -o clean.jsonl
# split dataset with reproducible seed
$ nl-split --ratio 0.8 0.1 0.1 --seed 42 corpus.jsonl -o splits/

train.jsonl   24,819 lines
val.jsonl      3,102 lines
test.jsonl     3,103 lines
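A reproducible ratio split like the one above reduces to a seeded shuffle followed by index cuts. A sketch of that idea, assuming nothing about nl-split's internals (function and parameter names here are hypothetical):

```python
import random

def split_corpus(lines: list[str], ratios=(0.8, 0.1, 0.1), seed: int = 42):
    """Shuffle reproducibly, then cut into train/val/test by ratio."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    rng = random.Random(seed)      # same seed -> same shuffle -> same split
    shuffled = lines[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])   # test takes the remainder
```

Giving the test split the remainder is why the counts above differ by one line: rounding leftovers accumulate in the last partition.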
All tools support --help for a full option reference. See the documentation for detailed guides.

Requirements

Python 3.10 or later. The core functionality has no external dependencies. Optional: icu for advanced unicode normalization, and fasttext for language detection in nl-stats.
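For reference, diacritic stripping of the kind nl-clean offers can be done with the standard library's unicodedata module; icu adds more thorough normalization on top. A sketch of the stdlib approach (illustrative, not the actual nl-clean code path):

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Decompose to NFD, drop combining marks, recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

print(strip_diacritics("café naïve Zürich"))  # cafe naive Zurich
```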