Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix handling of Unicode characters in String theory #431

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

daniel-raffler
Copy link
Contributor

@daniel-raffler daniel-raffler commented Jan 16, 2025

Hello,
this MR patches several minor issues remaining after #422:

  • When escaping a String we now also escape \ to make sure the backslash is preserved and doesn't get captured later when unescaping again
  • We now always apply unescape first when creating a new String constant. This is needed even when the solver does not support plain Unicode and we'll have to escape the String again later. This change closes an issue with Z3 and CVC4 where broken escape sequences were not handled correctly.
  • Added two more tests and fixed a bug in another

As mentioned in #412 I think that makeString should not be translating SMTLIB escape sequences at all. However, this is an API breaking change and can still be considered some other time

The test uses Strings.replaceAll and compares it to the result of str.replace_all in SMTLIB. However, the two functions behave different when the "matching" String is empty, and we need a special case for that.
This is needed to protect the backslash from substitution later when getting the results from the model.
…capeUnicodeForSmtlib() as backslashes (= codepoint 5c) are considered special characters by Matcher.appendReplacement()
…constant.

This is needed even for solvers like CVC4+5 or Z3 that expect Unicode characters to be escaped. We first need to unescape the String to resolve any escape sequences from the user, and then apply escape again before sending the String to the solver.
@daniel-raffler daniel-raffler linked an issue Jan 16, 2025 that may be closed by this pull request
Copy link
Member

@kfriedberger kfriedberger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am undecided whether our String escaping is useful.

@@ -29,8 +29,8 @@ class CVC4StringFormulaManager extends AbstractStringFormulaManager<Expr, Type,

@Override
protected Expr makeStringImpl(String pValue) {
// The boolean enables escape characters!
return exprManager.mkConst(new CVC4String(escapeUnicodeForSmtlib(pValue), true));
String str = escapeUnicodeForSmtlib(unescapeUnicodeForSmtlib(pValue));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would assume that escape and unescape are exactly reversed operations. This line is then a no-op and not required.

Copy link
Contributor Author

@daniel-raffler daniel-raffler Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true for a single character (or a single escape sequence), but here we might have "mixed" Strings where some characters are escaped and others aren't. Take "\u{4321}\n" for instance, where the first character is already written as an escape sequence, but the newline following it isn't. By applying unescape we get "䌡\n" and then escape will give us the properly escaped String "\u{4321}\u{a}".

(Just using escape no longer works since we're now also escaping \ to prevent it from capturing an escape sequence when unescaping back from the model. In the example escape("\u{4321}\n") would return "\u{5c}u{4321}\u{a}" which is not what we want as the String now has 9 characters)

Copy link
Member

@kfriedberger kfriedberger Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could handle \u{4321}\n as String with 9 chars like ['\','u','{','4','3','2','1','}','\n'] and simply not handle it as escaped unicode char? what happens if a user would want to use it that way, e.g. like "\u{43" + "21" + "}\n"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if a user would want to use it that way, e.g. like "\u{43" + "21" + "}\n"?

They would have to escape the String themselves and use StringFormulaManager.makeString("\u{5c}u{4321}\n"). However, I agree that this can be quite confusing when assembling several Strings. For instance StringFormulaManager.length(StringFormulaManager.makeString("\u{a")) is 4 and StringFormulaManager.length(StringFormulaManager.makeString("}"))is 1, but StringFormulaManager.length(StringFormulaManager.makeString("\u{a" + "}")) will give you 1 again.

@@ -27,7 +27,8 @@ class CVC5StringFormulaManager extends AbstractStringFormulaManager<Term, Sort,

@Override
protected Term makeStringImpl(String pValue) {
return solver.mkString(escapeUnicodeForSmtlib(pValue), true);
String str = escapeUnicodeForSmtlib(unescapeUnicodeForSmtlib(pValue));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would assume that escape and unescape are exactly reversed operations. This line is then a no-op and not required.

public void testInputEscape() throws SolverException, InterruptedException {
// Test if SMTLIB Unicode literals are recognized and converted to their Unicode characters.
assertEqual(smgr.length(smgr.makeString("Ξ")), imgr.makeNumber(1));
assertEqual(smgr.length(smgr.makeString("\\u{39E}")), imgr.makeNumber(1));
Copy link
Member

@kfriedberger kfriedberger Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical question: The documentation (https://github.com/sosy-lab/java-smt/blob/master/src/org/sosy_lab/java_smt/api/StringFormulaManager.java#L22-L34) says that escaped strings are recognized. Additionally, the Java-supported Unicode range (0x00000-0x10FFFF) is larger than SMTLIB-supported range (0x00000-0x2FFFF).
Are we sure, we want to implement in this way, e.g, we already use Java-Strings? Why would a user need to use escaping in Java? Wouldn't it be nicer to ignore escaping and see a string like \\u{1234} as 8 individual chars ['\','u','{','1','2','3','4','}']?

Copy link
Contributor Author

@daniel-raffler daniel-raffler Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to just have Java Strings and only use escape/unescape when communicating with the SMT solver. One downside is interoperability as some users might expect SMTLIB escape sequences to work. Maybe there is a way to add escape/unescape to the StringFormulaManager so that users can convert their own Strings when needed?

The other problem is that this is an API breaking change and that we would have to fix quite a few of the tests in StringFormulaManagerTest that expect SMTLIB escape sequences to work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was no release that JavaSMT officially supports unicode strings. The API change happens mostly on developer stage. Also, changing test-code should not be an issue.

Perhaps we could get some third opinion on this topic?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistent handling of Unicode characters in String theory
2 participants