Fix "leading surrogate" in UTF-8 (actually CESU-8) #607

gperciva · 2024-01-04T03:58:08Z


gperciva · 2024-01-04T03:58:53Z

Amusingly, the code which is described as:

This is a leading surrogate; some idiot has...

has a typo: 0xDC00 should be 0xD800.

The comment mentions a "leading surrogate", which is a synonym for a high-surrogate code unit:

A 16-bit code unit in the range D800₁₆ to DBFF₁₆, used in UTF-16
as the leading code unit of a surrogate pair. Also known as a
leading surrogate.
https://unicode.org/glossary/#high_surrogate_code_unit
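To make the two ranges concrete, here is a minimal Python sketch of the classification. The function names are hypothetical (they are not from libarchive or tarsnap), and U+1F600 is used only as a sample code point outside the Basic Multilingual Plane.

```python
def is_leading_surrogate(unit: int) -> bool:
    """High (leading) surrogate code unit: 0xD800..0xDBFF."""
    return 0xD800 <= unit <= 0xDBFF


def is_trailing_surrogate(unit: int) -> bool:
    """Low (trailing) surrogate code unit: 0xDC00..0xDFFF."""
    return 0xDC00 <= unit <= 0xDFFF


# Split the UTF-16 encoding of U+1F600 into its two 16-bit code units.
utf16 = "\U0001F600".encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in (0, 2)]

for unit in units:
    print(hex(unit), is_leading_surrogate(unit), is_trailing_surrogate(unit))
```

Note that 0xDC00 is the first trailing surrogate, not a leading one, which is why a check written against 0xDC00 cannot recognise the start of a pair.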

What libarchive is doing here is adjusting for an invalid conversion of UTF-16 to UTF-8; this adjustment is now known as the Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8), published as Unicode Technical Report #26 [1].

[1] https://www.unicode.org/reports/tr26/

Essentially, if libarchive detects a surrogate pair (not allowed in UTF-8 [2]), it tries to construct the intended Unicode code point (as per CESU-8).

[2] That code point should be encoded in UTF-8 with 4 octets, whereas the surrogate pair takes 6 octets (3 per surrogate).
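As an illustration of that reconstruction, the sketch below decodes a 6-octet CESU-8 surrogate pair into a single code point. It is written from the CESU-8 and UTF-16 definitions, not copied from libarchive; all function names are hypothetical.

```python
def decode_3_octets(seq: bytes) -> int:
    """Decode one 3-octet sequence of the form 1110xxxx 10xxxxxx 10xxxxxx."""
    b0, b1, b2 = seq  # raises ValueError unless exactly 3 octets
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a 3-octet sequence")
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def combine_surrogates(high: int, low: int) -> int:
    """Combine a leading and a trailing surrogate into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("expected a leading surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("expected a trailing surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def cesu8_pair_to_code_point(data: bytes) -> int:
    """Decode a 6-octet CESU-8 surrogate pair into a code point."""
    if len(data) != 6:
        raise ValueError("expected 6 octets")
    return combine_surrogates(decode_3_octets(data[:3]),
                              decode_3_octets(data[3:]))


code_point = cesu8_pair_to_code_point(bytes.fromhex("eda0bdedb880"))
proper_utf8 = chr(code_point).encode("utf-8")
print(hex(code_point), proper_utf8.hex(" "))
```

For the sample input, the 6 octets collapse to one code point whose proper UTF-8 form is 4 octets long.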

Caveat:

I used the term CESU-8 because later libarchive calls it that -- in fact, they moved this functionality into cesu8_to_unicode() in:

2011-04-20 Add a check for surrogate pairs in UTF-8...
libarchive c319a15a798653faae34a1f861ab74fbcf053e11

However, "cesu" does not appear in the tarsnap git repo.

gperciva · 2024-01-04T04:16:24Z

tl;dr

There's a typo in libarchive-2.7 which stopped the code from doing what it was supposed to.

Modern libarchive uses a different method of doing the same thing (so the typo wasn't explicitly "fixed", but it's not broken any more).

Experiments

To investigate this, I created utf8-good.tar and utf8-bad.tar (in the attached zip file, because github doesn't allow us to upload tar files).

tar-utf8-experiments.zip

Both tar files contain a single 0-byte file called filename-😀 (that's U+1F600, a smile emoji).

In the "good" version, the tar file begins:

00000000: 5061 7848 6561 6465 722f 6669 6c65 6e61  PaxHeader/filena
00000010: 6d65 2df0 9f98 8000 0000 0000 0000 0000  me-.............

whereas the bad one encodes the smile emoji as a surrogate pair, and begins:

00000000: 5061 7848 6561 6465 722f 6669 6c65 6e61  PaxHeader/filena
00000010: 6d65 2ded a0bd edb8 8000 0000 0000 0000  me-.............

The diff on those two lines is:

-00000010: 6d65 2df0 9f98 8000 0000 0000 0000 0000  me-.............
+00000010: 6d65 2ded a0bd edb8 8000 0000 0000 0000  me-.............

(The full diff has more: the filename is repeated 3 times, and the checksum and path length change. But those aren't important.)
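The two byte sequences in that diff can be reproduced with a few lines of Python. This is only a sketch of how the filename bytes differ, not of how the tar files themselves were built; it relies on Python's "surrogatepass" error handler to emit each UTF-16 code unit as its own 3-octet sequence.

```python
name = "filename-\U0001F600"

# Proper UTF-8: U+1F600 becomes 4 octets.
good = name.encode("utf-8")

# CESU-8 style: split into UTF-16 code units first, then encode each
# unit on its own, so the surrogate pair becomes 3 + 3 octets.
utf16 = name.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big")
         for i in range(0, len(utf16), 2)]
bad = b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)

print(good.hex(" "))
print(bad.hex(" "))
```

The two outputs differ only in the trailing octets: f0 9f 98 80 versus ed a0 bd ed b8 80.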

tar programs

I tested libarchive-2.7, libarchive 3.6.0 (the default in FreeBSD 12.4), and GNU tar 1.35, with tar -tf utf8-good.tar and tar -tf utf8-bad.tar.

Everybody was happy with utf8-good.tar.
Modern libarchive was happy with utf8-bad.tar.
GNU tar and libarchive 2.7 were not happy with utf8-bad.tar.

$ gtar -tf utf8-bad.tar 
filename-\355\240\275\355\270\200

(those are the bytes in octal)

$ ~/src/libarchive-2.7/b/bsdtar -tf utf8-bad.tar 
bsdtar: Pathname in pax header can't be converted to current locale.
filename-\355\240\275\355\270\200
bsdtar: Error exit delayed from previous errors.
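For comparison, Python's strict UTF-8 decoder also rejects those six bytes, while a lenient decode followed by a UTF-16 round trip recovers the intended character. This is only an analogy using Python's codecs; it makes no claim about how GNU tar or bsdtar process the name internally.

```python
# The six bytes printed above, written in octal as in the tar output.
bad = bytes([0o355, 0o240, 0o275, 0o355, 0o270, 0o200])

# Strict UTF-8 decoding refuses encoded surrogates.
try:
    bad.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient path: keep the two surrogates, then let a UTF-16 round trip
# pair them up into a single code point.
surrogates = bad.decode("utf-8", "surrogatepass")
recovered = surrogates.encode("utf-16", "surrogatepass").decode("utf-16")

print(strict_ok, [hex(ord(c)) for c in surrogates], hex(ord(recovered)))
```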

When I tried applying this fix to libarchive-2.7, it worked fine:

$ ~/src/libarchive-2.7/b/bsdtar -tf utf8-bad.tar 
filename-😀

(that's the modified libarchive-2.7)

cperciva merged commit 697ebd6 into master Jan 12, 2024
2 checks passed

gperciva deleted the fix-utf8-leading-surrogate branch January 12, 2024 21:59
