harmonise function incorrectly adds latex escaping bibtex fields #74

Open
cpmpercussion opened this issue Sep 9, 2024 · 1 comment

@cpmpercussion
Contributor

As observed by @stefanofasciani:

All capital letters in the title are wrapped in {}, which is something I have not found in 2023.bib and earlier, and all the non-ASCII characters are replaced with LaTeX code (also something I have not found in previous bib files). Also, the harmonise function messes up the URL: for example, {http://nime.org/proceedings/2024/nime2024_11.pdf} becomes {http://nime.org/proceedings/2024/nime2024\_11.pdf}, which is deadly for the Zenodo upload tool.

This is incorrect behaviour:

  • Special characters in the .bib file should be written as UTF-8 characters (not LaTeX symbol representations).
  • URLs need to contain the actual URL, not an escaped LaTeX representation.
  • Titles should have their normal text representation, not an escaped LaTeX representation.

This matters because the .bib file is in BibTeX format but is used to create other text representations of the papers (e.g., NIME individual paper webpages and Zenodo entries). So we need the text in the BibTeX fields to be a "plain" UTF-8 representation that could go into an HTML document or an API call, not something tuned to show up correctly in a LaTeX document.

The todo here is:

  • test and update the harmonise function so that it doesn't do the above bad things.
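
One way to approach the "test" half of this todo is a small check that flags the three symptoms described above in a parsed entry. This is only a sketch: `find_latex_residue` and the two regular expressions are hypothetical names, not part of the existing harmonise code, and it assumes each entry is a plain dict of field name to string.

```python
import re

# Hypothetical helpers for illustration; not part of the existing harmonise code.
LATEX_ESCAPE = re.compile(r"\\\S")        # a backslash followed by any non-space, e.g. \_ or \'e
BRACED_CAPITAL = re.compile(r"\{[A-Z]\}")  # a single capital wrapped in braces, e.g. {N}


def find_latex_residue(entry):
    """Return (field, problem) pairs for one entry dict of field name -> string."""
    problems = []
    for field, value in entry.items():
        if LATEX_ESCAPE.search(value):
            problems.append((field, "latex command or escape"))
        if field == "title" and BRACED_CAPITAL.search(value):
            problems.append((field, "braced capital letter"))
    return problems
```

Running this over every entry before and after harmonising would show whether the function introduces escaping that was not in the input. The patterns are deliberately broad, so a field that legitimately contains a backslash would also be flagged.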

Ultimately we may want to move away from .bib files as a storage system, but they have an advantage of ubiquity within academic publishing and if the processes here break down at some point, the .bib files could easily be used in a different ad hoc system by other future maintainers.

@stefanofasciani
Collaborator

It seems that the harmonise function is doing what we are asking it to do.

The current version of the harmonise function uses BibTexParser at line 36 with customization=homogenize_latex_encoding. So the behavior with respect to character encoding is correct for that setting, while what happens to the title and URL is weird. Apparently the only built-in customizations BibTexParser offers for encoding are homogenize_latex_encoding and convert_to_unicode. If we use the latter, the strange behaviors disappear, and there are no apparent changes in the .bib file, as the text is already Unicode.
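
For reference, the swap described here could look roughly like the following with bibtexparser 1.x. This is a sketch, not the project's actual code: `load_entries` and `SAMPLE_ENTRY` are made-up names, the entry key and title are invented, and only the URL comes from this issue.

```python
# Sample input for illustration; the key and title are invented.
SAMPLE_ENTRY = """@inproceedings{nime2024_11,
  title = {Example Title for NIME},
  url = {http://nime.org/proceedings/2024/nime2024_11.pdf}
}
"""


def load_entries(bibtex_text):
    """Parse BibTeX text into a list of entry dicts using convert_to_unicode."""
    # Imported here so this module still imports when bibtexparser 1.x is absent.
    import bibtexparser
    from bibtexparser.bparser import BibTexParser
    from bibtexparser.customization import convert_to_unicode

    # convert_to_unicode replaces homogenize_latex_encoding as the customization.
    parser = BibTexParser(customization=convert_to_unicode)
    return bibtexparser.loads(bibtex_text, parser=parser).entries
```

Calling `load_entries(SAMPLE_ENTRY)` and inspecting the `url` and `title` fields of the first entry would be a quick way to confirm the observation above on a real file.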

So we need either to develop a 'custom' customization (possible?) or to see whether migrating from BibTexParser 1.4 to 2.0 is a viable way to get UTF-8 output.
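
On the "(possible?)" question: as far as I understand bibtexparser 1.x, a customization is just a function that takes one record dict and returns it, so a custom one seems feasible. A minimal sketch, assuming the only repairs needed are the two reported in this issue; `plain_text_fields` is a hypothetical name and has not been run against the real NIME files.

```python
import re

# Matches one capital letter wrapped in braces, e.g. {N}.
BRACED_CAPITAL = re.compile(r"\{([A-Z])\}")


def plain_text_fields(record):
    """Hypothetical customization: undo LaTeX-oriented escaping in one record dict."""
    if "url" in record:
        # Underscores are legal in URLs and should not be backslash-escaped.
        record["url"] = record["url"].replace("\\_", "_")
    if "title" in record:
        # Turn {N}{I}{M}{E} back into NIME; longer braced groups are left alone.
        record["title"] = BRACED_CAPITAL.sub(r"\1", record["title"])
    return record
```

Because it has the same shape as the built-in customizations, it could presumably be assigned to `parser.customization` directly, or composed with `convert_to_unicode` in a small wrapper function. Non-ASCII characters are not handled here; that part would still rely on `convert_to_unicode`.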
