Fix "leading surrogate" in UTF-8 (actually CESU-8) #607

Merged: 1 commit, Jan 12, 2024

@gperciva gperciva commented Jan 4, 2024

Amusingly, the code which is described as:
    This is a leading surrogate; some idiot has...
has a typo: 0xDC00 should be 0xD800.

The comment mentions a "leading surrogate", which is a synonym for a
high-surrogate code unit:
    A 16-bit code unit in the range D800_16 to DBFF_16, used in UTF-16
    as the leading code unit of a surrogate pair.  Also known as a
    leading surrogate.
    https://unicode.org/glossary/#high_surrogate_code_unit
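
The glossary range makes the typo easy to see. As a minimal sketch in Python (not the libarchive C code; the function names and the mask-based test are assumptions for illustration), masking off the low 10 bits of a 16-bit code unit and comparing against 0xD800 selects leading surrogates, whereas comparing against 0xDC00 selects trailing ones:

```python
def is_leading_surrogate(unit: int) -> bool:
    """True for high (leading) surrogates, D800..DBFF."""
    return (unit & 0xFC00) == 0xD800


def is_trailing_surrogate(unit: int) -> bool:
    """True for low (trailing) surrogates, DC00..DFFF."""
    return (unit & 0xFC00) == 0xDC00


# A check written with 0xDC00 would accept only trailing surrogates,
# so a genuine leading surrogate such as 0xD83D would never match it.
print(is_leading_surrogate(0xD83D), is_trailing_surrogate(0xD83D))
```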

What libarchive is doing here is compensating for an invalid conversion
of UTF-16 to UTF-8; the encoding that such a conversion produces is now
known as the Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8),
published as Unicode Technical Report #26 [1].

[1] https://www.unicode.org/reports/tr26/

Essentially, if libarchive detects a surrogate pair (not allowed in
UTF-8 [2]), it tries to reconstruct the intended Unicode code point (as
per CESU-8).

[2] That code point should be encoded with 4 octets in UTF-8, whereas
the surrogate pair requires 6 octets (3 per surrogate).
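
The reconstruction can be sketched as follows. This is an illustration in Python under stated assumptions, not a port of libarchive's C code: the helper names `decode_3byte` and `cesu8_pair_to_code_point` are hypothetical, and U+1F600 is just an arbitrary code point outside the Basic Multilingual Plane.

```python
def decode_3byte(seq: bytes) -> int:
    """Decode one 3-byte sequence of the form 1110xxxx 10xxxxxx 10xxxxxx."""
    if (len(seq) != 3 or (seq[0] & 0xF0) != 0xE0
            or (seq[1] & 0xC0) != 0x80 or (seq[2] & 0xC0) != 0x80):
        raise ValueError("not a 3-byte sequence")
    return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)


def cesu8_pair_to_code_point(six: bytes) -> int:
    """Combine a 6-octet encoded surrogate pair into one code point."""
    if len(six) != 6:
        raise ValueError("expected 6 octets")
    lead = decode_3byte(six[:3])
    trail = decode_3byte(six[3:])
    if (lead & 0xFC00) != 0xD800:
        raise ValueError("first unit is not a leading surrogate")
    if (trail & 0xFC00) != 0xDC00:
        raise ValueError("second unit is not a trailing surrogate")
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


cp = cesu8_pair_to_code_point(bytes.fromhex("eda0bdedb880"))
print(hex(cp))                        # 0x1f600
print(len(chr(cp).encode("utf-8")))   # 4 octets in valid UTF-8
```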

Caveat:
- I used the term CESU-8 because later libarchive calls it that -- in
  fact, they moved this functionality into cesu8_to_unicode() in:
      2011-04-20 Add a check for surrogate pairs in UTF-8...
      libarchive c319a15a798653faae34a1f861ab74fbcf053e11

  However, "cesu" does not appear in the tarsnap git repo.

gperciva commented Jan 4, 2024

tl;dr

There's a typo in libarchive-2.7 which stopped the code from doing what it was supposed to.

Modern libarchive uses a different method of doing the same thing (so the typo wasn't explicitly "fixed", but it's not broken any more).

Experiments

To investigate this, I created utf8-good.tar and utf8-bad.tar (in the attached zip file, because GitHub doesn't allow uploading tar files).

tar-utf8-experiments.zip

Both tar files contain a single 0-byte file called filename-😀 (that's U+1F600, a smile emoji).

In the "good" version, the tar file begins:

00000000: 5061 7848 6561 6465 722f 6669 6c65 6e61  PaxHeader/filena
00000010: 6d65 2df0 9f98 8000 0000 0000 0000 0000  me-.............

whereas the bad one encodes the smile emoji as a surrogate pair, and begins:

00000000: 5061 7848 6561 6465 722f 6669 6c65 6e61  PaxHeader/filena
00000010: 6d65 2ded a0bd edb8 8000 0000 0000 0000  me-.............

The diff on those two lines is:

-00000010: 6d65 2df0 9f98 8000 0000 0000 0000 0000  me-.............
+00000010: 6d65 2ded a0bd edb8 8000 0000 0000 0000  me-.............

(The full diff has more: the filename is repeated 3 times, and the checksum and path length change. But those aren't important.)
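
The two byte sequences in the diff can be reproduced from the code point alone. A minimal sketch in Python, assuming a hypothetical helper `cesu8_encode` (this is not how the tar files were actually produced, which isn't described here):

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: each supplementary code point becomes two
    3-byte sequences, one per UTF-16 surrogate."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            out += ch.encode("utf-8")
            continue
        offset = cp - 0x10000
        lead = 0xD800 + (offset >> 10)
        trail = 0xDC00 + (offset & 0x3FF)
        for unit in (lead, trail):
            # "surrogatepass" lets Python emit the 3-byte form of a surrogate.
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


name = "filename-\U0001F600"
good = name.encode("utf-8")
bad = cesu8_encode(name)
print(good[9:].hex(" "))  # f0 9f 98 80
print(bad[9:].hex(" "))   # ed a0 bd ed b8 80
```

The first line matches the bytes after `me-` in utf8-good.tar and the second matches utf8-bad.tar.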

tar programs

I tested libarchive-2.7, libarchive 3.6.0 (the default in FreeBSD 12.4), and GNU tar 1.35, with tar -tf utf8-good.tar and tar -tf utf8-bad.tar.

  • Everybody was happy with utf8-good.tar.
  • Modern libarchive was happy with utf8-bad.tar.
  • GNU tar and libarchive 2.7 were not happy with utf8-bad.tar.

$ gtar -tf utf8-bad.tar
filename-\355\240\275\355\270\200

(that's the bytes in octal)
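
To confirm that the octal escapes are the same six bytes as in the hex dump, they can be converted back. A small sketch in Python; the helper name `octal_escapes_to_bytes` is made up for this illustration:

```python
import re


def octal_escapes_to_bytes(listing: str) -> bytes:
    """Turn \\NNN octal escapes (as printed by tar) back into raw bytes."""
    return re.sub(
        rb"\\([0-3][0-7]{2})",
        lambda m: bytes([int(m.group(1).decode("ascii"), 8)]),
        listing.encode("ascii"),
    )


raw = octal_escapes_to_bytes(r"filename-\355\240\275\355\270\200")
print(raw[9:].hex(" "))  # ed a0 bd ed b8 80
```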

$ ~/src/libarchive-2.7/b/bsdtar -tf utf8-bad.tar 
bsdtar: Pathname in pax header can't be converted to current locale.
filename-\355\240\275\355\270\200
bsdtar: Error exit delayed from previous errors.

When I tried applying this fix to libarchive-2.7, it worked fine:

$ ~/src/libarchive-2.7/b/bsdtar -tf utf8-bad.tar 
filename-😀

(that's the modified libarchive-2.7)

@cperciva cperciva merged commit 697ebd6 into master Jan 12, 2024
2 checks passed
@gperciva gperciva deleted the fix-utf8-leading-surrogate branch January 12, 2024 21:59