CLI utilities for text preprocessing and corpus management
```shell
$ pip install nltools-cli
```
A set of composable command-line tools for common text processing tasks: tokenization, cleanup, format conversion, corpus sampling. Designed to work with Unix pipes.
| Command | Description |
|---|---|
| nl-tokenize | Tokenize text by words, sentences, or paragraphs |
| nl-clean | Strip HTML, normalize Unicode, remove diacritics |
| nl-freq | Frequency analysis, n-grams, stopword filtering |
| nl-split | Split corpus into train/val/test with ratio control |
| nl-convert | Convert between txt, jsonl, csv, conll formats |
| nl-sample | Stratified sampling from large corpora |
| nl-stats | Text statistics: lengths, char distribution, language detection |
Each tool reads from stdin or a file and writes to stdout.
```shell
# tokenize and get word frequencies
$ cat corpus.txt | nl-tokenize --mode word | nl-freq --top 20
the 14523
of 9841
and 8712
to 7654
a 6998
...
```
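For a sense of what the frequency step computes, here is a minimal Python sketch of the equivalent logic. Whitespace tokenization and lowercasing are assumptions for illustration; nl-tokenize's actual word segmentation rules may differ.

```python
from collections import Counter

def top_words(text: str, n: int = 20) -> list[tuple[str, int]]:
    """Rough equivalent of `nl-tokenize --mode word | nl-freq --top N`."""
    # Assumes simple whitespace tokenization and lowercasing.
    counts = Counter(text.lower().split())
    return counts.most_common(n)

print(top_words("the cat and the dog and the bird", 2))
# [('the', 3), ('and', 2)]
```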
```shell
# clean up scraped HTML and convert to jsonl
$ nl-clean --strip-tags --normalize raw_pages/ | nl-convert --to jsonl -o clean.jsonl
```
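The clean-and-convert step can be sketched with the standard library alone. The use of NFC normalization and the `{"text": ...}` record shape are assumptions here, not necessarily what nl-clean and nl-convert do internally.

```python
import json
from html.parser import HTMLParser
from unicodedata import normalize

class TextExtractor(HTMLParser):
    """Collects text content from HTML, discarding all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def clean_to_jsonl(html: str) -> str:
    # Strip tags, collapse whitespace, then apply NFC normalization
    # (the exact normalization form used by nl-clean is an assumption).
    parser = TextExtractor()
    parser.feed(html)
    text = normalize("NFC", " ".join(s.strip() for s in parser.parts if s.strip()))
    return json.dumps({"text": text})

print(clean_to_jsonl("<p>Hello <b>world</b></p>"))
# {"text": "Hello world"}
```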
```shell
# split dataset with reproducible seed
$ nl-split --ratio 0.8 0.1 0.1 --seed 42 corpus.jsonl -o splits/
train.jsonl 24,819 lines
val.jsonl 3,102 lines
test.jsonl 3,103 lines
```
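A seeded split like the one above can be sketched as follows: shuffle once with a fixed seed, then slice by the ratios. This is an illustrative equivalent, not nl-split's actual implementation (for instance, how it assigns remainder lines may differ).

```python
import random

def split(lines, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle with a fixed seed, then slice into train/val/test by ratio."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    rng = random.Random(seed)       # fixed seed makes the split reproducible
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n = len(shuffled)
    train_end = int(n * ratios[0])
    val_end = train_end + int(n * ratios[1])
    # Remainder lines fall into the test slice.
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]

train, val, test = split(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

Because the seed is fixed, running the split twice on the same input yields identical partitions.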
Run any command with --help for the full option reference. See the documentation for detailed guides.
Requires Python 3.10 or later. Core functionality has no external dependencies. Optional extras: icu for advanced Unicode normalization, fasttext for language detection in nl-stats.