The "Results" section is generated from this script. It reports the task-clock for each command (which can differ from elapsed time). The script provides options for sorting and uniqueness. The defaults were used, but different options lead to different results. The matrix of possibilities would be very large, so only the defaults are shown below.
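The difference between CPU time and elapsed time can be illustrated with a minimal sketch. This is not the benchmark script; it assumes Python's standard `time` module, where `process_time` counts CPU time for the current process (loosely analogous to task-clock) and `perf_counter` measures wall-clock time. The `measure` helper is a hypothetical name used only for illustration.

```python
import time

def measure(fn):
    """Return (cpu_ms, elapsed_ms) for a single call of fn."""
    cpu_start = time.process_time()
    wall_start = time.perf_counter()
    fn()
    cpu_ms = (time.process_time() - cpu_start) * 1000
    elapsed_ms = (time.perf_counter() - wall_start) * 1000
    return cpu_ms, elapsed_ms

# Sleeping uses almost no CPU, so the two measurements diverge.
cpu_ms, elapsed_ms = measure(lambda: time.sleep(0.2))
print(f"cpu: {cpu_ms:.2f} ms, elapsed: {elapsed_ms:.2f} ms")
```

A command that spends most of its time waiting (for example on I/O) would show a small CPU figure next to a larger elapsed figure, which is why the two are not interchangeable.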
Also note the compile-time option called `DISABLE_FIELD`. It disables the `--field` arg, and removes some information associated with each token. This provides a small boost to sorting and uniqueness since there's a smaller memory footprint. `--field` is not disabled by default, and in line with the above, the benchmarks are run on the default options.
`pcre2grep` and `choose` have about the same speed.
`sed` reads input until it reaches a newline character, and puts the content read so far in a buffer where it is then manipulated. Because of this, `sed` performs extremely poorly on input files that contain many small lines (for the `no_duplicates` case below, `sed` with the default newline delimiter is 50x to 100x slower than `choose`). To normalize the performance, the `-z` option was used with `sed` (to change the delimiter to a null char, which never occurs in the input). `choose` doesn't use delimiters in this way, and can't come across this type of pathological case. After this normalization, `sed` is faster than `choose` except in cases where there are few substitutions to apply.
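The effect of the record delimiter can be modeled with a small sketch. This is an illustration only, not how `sed` is implemented: the hypothetical `substitute_per_record` helper splits the input on a delimiter, substitutes within each record, and reports how many records it had to handle.

```python
def substitute_per_record(data: bytes, delimiter: bytes, old: bytes, new: bytes):
    """Split data on delimiter, substitute within each record, and rejoin.

    Returns the transformed bytes and the number of records handled.
    """
    records = data.split(delimiter)
    out = delimiter.join(record.replace(old, new) for record in records)
    return out, len(records)

data = b"test\n" * 1000

# Newline delimiter: one record per line, plus the empty tail after the last newline.
by_line, n_line = substitute_per_record(data, b"\n", b"test", b"banana")

# Null delimiter (absent from the input): the whole input is a single record.
by_null, n_null = substitute_per_record(data, b"\0", b"test", b"banana")

print(n_line, n_null)
```

Both calls produce the same output bytes; only the number of records differs, which is the per-record overhead the null delimiter avoids.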
`choose` is faster than `awk` except in cases where there are few duplicates.
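As a rough illustration of filtering by uniqueness, here is a minimal hash-set sketch. It is an assumption for illustration and does not reflect how `choose` or `awk` are implemented; `unique_in_order` is a hypothetical name.

```python
def unique_in_order(tokens):
    """Yield each token the first time it is seen, preserving input order."""
    seen = set()
    for token in tokens:
        if token not in seen:
            seen.add(token)
            yield token

many_duplicates = ["test"] * 5
no_duplicates = [str(i) for i in range(1, 6)]

print(list(unique_in_order(many_duplicates)))
print(list(unique_in_order(no_duplicates)))
```

In this sketch the set stays tiny when the input is mostly duplicates, and grows to hold every token when there are none, so the two extremes stress the data structure very differently.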
`sort` is using naive byte order (via `LC_ALL=C`), as this is the fastest. `sort` is faster than `choose` at sorting. If truncation is leveraged, or if there are many duplicates (when applying uniqueness as well), then `choose` is faster than `sort`.
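To show why truncation can help in general, the sketch below compares a full sort followed by a slice against selecting only the smallest `k` elements. It uses Python's standard `heapq.nsmallest` and is not a description of the algorithm `choose` uses; the function names are hypothetical. Tokens are `bytes`, so comparisons are in plain byte order.

```python
import heapq

def smallest_k_full_sort(tokens, k):
    """Sort everything, then keep the first k (like sorting and then taking the head)."""
    return sorted(tokens)[:k]

def smallest_k_truncated(tokens, k):
    """Keep only the k smallest elements without fully sorting the input."""
    return heapq.nsmallest(k, tokens)

tokens = [str(i).encode() for i in range(1, 1001)]

full = smallest_k_full_sort(tokens, 5)
truncated = smallest_k_truncated(tokens, 5)
print(full == truncated)
```

Both functions return the same five tokens, but the truncated version never needs to order the elements that will be discarded.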
Each input file is the same size (50 million bytes), but the type of data is different.
This file (`plain_text` in the results) represents an average random workload: text from a novel, repeated.
The Project Gutenberg eBook of Pride and prejudice, by Jane Austen
This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
...
This file (`test_repeated` in the results) has the line "test" repeated. "test" is the match target used throughout the benchmarks below, whether grepping for the word "test" or substituting "test" with "banana".
test
test
test
test
test
...
For filtering by uniqueness, there are two extremes. One is where the entire file consists of the same element repeatedly, which is in `test_repeated.txt`. The other is when every element is different. This file (`no_duplicates` in the results) counts upwards from 1, one number per line:
1
2
3
4
5
...
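A file like this could be produced with a short generator. The sketch below is an assumption, not the script that produced the benchmark inputs: the hypothetical `write_counting_lines` helper writes whole lines until the next one would exceed a byte limit, so the result may be slightly smaller than the limit.

```python
import io

def write_counting_lines(out, size_limit):
    """Write b"1\n2\n3\n..." to out, stopping before size_limit bytes is exceeded.

    Returns the number of bytes written.
    """
    written = 0
    n = 1
    while True:
        line = f"{n}\n".encode()
        if written + len(line) > size_limit:
            break
        out.write(line)
        written += len(line)
        n += 1
    return written

# Small demonstration with an in-memory buffer and a 20 byte limit.
buf = io.BytesIO()
written = write_counting_lines(buf, 20)
print(written, buf.getvalue())
```

For a real input file, `out` would be a file opened in binary mode and `size_limit` would be set to the desired size in bytes.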
This file (`csv_field` in the results) has a field where sorting and uniqueness should be applied. Note that even though the field is numeric, both `sort` and `choose` use a lexicographical comparison in the benchmark.
garbage,1,garbage
garbage,2,garbage
garbage,3,garbage
garbage,4,garbage
garbage,5,garbage
...
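The difference between lexicographical and numeric comparison on the middle field can be seen in a small sketch. It is illustrative only and does not reproduce the benchmark commands; the `field` helper is a hypothetical name.

```python
rows = ["garbage,10,garbage", "garbage,2,garbage", "garbage,1,garbage"]

def field(row):
    """Return the second comma-separated field as text."""
    return row.split(",")[1]

# Lexicographical: fields are compared as text, so "10" sorts before "2".
lexicographic = sorted(rows, key=field)

# Numeric: fields are compared as integers.
numeric = sorted(rows, key=lambda row: int(field(row)))

print(lexicographic)
print(numeric)
```

Under the lexicographical comparison used in the benchmark, the row containing `10` lands between the rows containing `1` and `2`.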
choose 0.3.0, ncurses 6.2.20200212, pcre2 10.43
pcre2grep version 10.43-DEV 2023-04-14
sed (GNU sed) 4.7
GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)
sort (GNU coreutils) 8.30
5.15.90.1-microsoft-standard-WSL2
AMD Ryzen 7 3800X 8-Core Processor
ram: 16331032 kB
(ms) | choose | pcre2grep |
---|---|---|
plain_text | 247.17 | 269.69 |
test_repeated | 1620.34 | 1583.77 |
no_duplicates | 323.59 | 370.16 |
(ms) | choose | sed |
---|---|---|
plain_text | 179.57 | 135.57 |
test_repeated | 2725.50 | 1157.39 |
no_duplicates | 5.10 | 44.00 |
(a cherry-picked great case for choose compared to sed)
(ms) | choose | sed (with newline delimiter) |
---|---|---|
no_duplicates | 5.13 | 543.93 |
(a special case, where choose cheats by using a literal replacement string)
(ms) | choose (delimiter sub) | sed |
---|---|---|
test_repeated | 1521.93 | 1156.64 |
(sorting)
(ms) | choose | sort |
---|---|---|
plain_text | 1628.88 | 448.06 |
test_repeated | 1850.89 | 1616.13 |
no_duplicates | 3714.94 | 1036.93 |
(a special case that leverages truncation)
(ms) | choose -s --out 5 | sort \| head -n 5 |
---|---|---|
no_duplicates | 354.20 | 1059.81 |
(ms) | choose | awk |
---|---|---|
plain_text | 111.95 | 214.41 |
test_repeated | 565.31 | 1147.75 |
no_duplicates | 2340.37 | 1496.42 |
(sorting and uniqueness)
(ms) | choose | sort |
---|---|---|
plain_text | 122.80 | 440.57 |
test_repeated | 558.86 | 1640.79 |
no_duplicates | 5742.11 | 1168.84 |
(sorting and uniqueness based on a field)
(ms) | choose | sort |
---|---|---|
csv_field | 2770.27 | 474.02 |