Fix handling of Unicode characters in String theory #431

daniel-raffler · 2025-01-16T16:34:27Z

Hello,
this MR patches several minor issues remaining after #422:

- When escaping a String we now also escape \ to make sure the backslash is preserved and doesn't get captured later when unescaping again
- We now always apply unescape first when creating a new String constant. This is needed even when the solver does not support plain Unicode and we'll have to escape the String again later. This change closes an issue with Z3 and CVC4 where broken escape sequences were not handled correctly.
- Added two more tests and fixed a bug in another

As mentioned in #412 I think that makeString should not be translating SMTLIB escape sequences at all. However, this is an API breaking change and can still be considered some other time

…constant. This is needed even for solvers like CVC4+5 or Z3 that expect Unicode characters to be escaped. We first need to unescape the String to resolve any escape sequences from the user, and then apply escape again before sending the String to the solver.

kfriedberger

I am undecided whether our String escaping is useful.

kfriedberger · 2025-01-16T18:41:40Z

src/org/sosy_lab/java_smt/solvers/cvc4/CVC4StringFormulaManager.java

@@ -29,8 +29,8 @@ class CVC4StringFormulaManager extends AbstractStringFormulaManager<Expr, Type,

  @Override
  protected Expr makeStringImpl(String pValue) {
-    // The boolean enables escape characters!
-    return exprManager.mkConst(new CVC4String(escapeUnicodeForSmtlib(pValue), true));
+    String str = escapeUnicodeForSmtlib(unescapeUnicodeForSmtlib(pValue));


I would assume that escape and unescape are exactly reversed operations. This line is then a no-op and not required.

That's true for a single character (or a single escape sequence), but here we might have "mixed" Strings where some characters are escaped and others aren't. Take "\u{4321}\n" for instance, where the first character is already written as an escape sequence, but the newline following it isn't. By applying unescape we get "䌡\n" and then escape will give us the properly escaped String "\u{4321}\u{a}".

(Just using escape no longer works since we're now also escaping \ to prevent it from capturing an escape sequence when unescaping back from the model. In the example escape("\u{4321}\n") would return "\u{5c}u{4321}\u{a}" which is not what we want as the String now has 9 characters)
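The difference between the two orders can be sketched in a few lines of Java. This is only an illustration: `EscapeSketch`, `escape` and `unescape` are hypothetical stand-ins for `escapeUnicodeForSmtlib` and `unescapeUnicodeForSmtlib`, and the sketch assumes that only the brace form of the escape sequence is handled and that everything outside printable ASCII (plus the backslash) gets escaped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-ins for the escape/unescape helpers; not the JavaSMT implementation. */
final class EscapeSketch {
  // Brace-style escape sequence with 1 to 5 hex digits (assumption: no other form is handled).
  private static final Pattern ESCAPE = Pattern.compile("\\\\u\\{([0-9a-fA-F]{1,5})\\}");

  /** Replaces every brace-style escape sequence by the code point it denotes. */
  static String unescape(String input) {
    Matcher m = ESCAPE.matcher(input);
    StringBuilder out = new StringBuilder();
    int last = 0;
    while (m.find()) {
      out.append(input, last, m.start()); // copy the text before the match
      out.appendCodePoint(Integer.parseInt(m.group(1), 16));
      last = m.end();
    }
    return out.append(input.substring(last)).toString();
  }

  /** Escapes the backslash and every code point outside printable ASCII. */
  static String escape(String input) {
    StringBuilder out = new StringBuilder();
    input.codePoints().forEach(cp -> {
      if (cp == '\\' || cp < 0x20 || cp > 0x7E) {
        out.append("\\u{").append(Integer.toHexString(cp)).append('}');
      } else {
        out.appendCodePoint(cp);
      }
    });
    return out.toString();
  }
}
```

With the mixed input from the example (written `"\\u{4321}\n"` in Java source), `escape(unescape(input))` yields `\u{4321}\u{a}`, whereas `escape(input)` alone yields `\u{5c}u{4321}\u{a}`.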

We could handle \u{4321}\n as a String with 9 chars like ['\','u','{','4','3','2','1','}','\n'] and simply not treat it as an escaped Unicode char. What happens if a user would want to use it that way, e.g. like "\u{43" + "21" + "}\n"?

what happens if a user would want to use it that way, e.g. like "\u{43" + "21" + "}\n"?

They would have to escape the String themselves and use StringFormulaManager.makeString("\u{5c}u{4321}\n"). However, I agree that this can be quite confusing when assembling several Strings. For instance StringFormulaManager.length(StringFormulaManager.makeString("\u{a")) is 4 and StringFormulaManager.length(StringFormulaManager.makeString("}")) is 1, but StringFormulaManager.length(StringFormulaManager.makeString("\u{a" + "}")) will give you 1 again.

kfriedberger · 2025-01-16T18:41:57Z

src/org/sosy_lab/java_smt/solvers/cvc5/CVC5StringFormulaManager.java

@@ -27,7 +27,8 @@ class CVC5StringFormulaManager extends AbstractStringFormulaManager<Term, Sort,

  @Override
  protected Term makeStringImpl(String pValue) {
-    return solver.mkString(escapeUnicodeForSmtlib(pValue), true);
+    String str = escapeUnicodeForSmtlib(unescapeUnicodeForSmtlib(pValue));


I would assume that escape and unescape are exactly reversed operations. This line is then a no-op and not required.

kfriedberger · 2025-01-16T18:53:08Z

src/org/sosy_lab/java_smt/test/StringFormulaManagerTest.java

+  public void testInputEscape() throws SolverException, InterruptedException {
+    // Test if SMTLIB Unicode literals are recognized and converted to their Unicode characters.
+    assertEqual(smgr.length(smgr.makeString("Ξ")), imgr.makeNumber(1));
+    assertEqual(smgr.length(smgr.makeString("\\u{39E}")), imgr.makeNumber(1));


Critical question: The documentation (https://github.com/sosy-lab/java-smt/blob/master/src/org/sosy_lab/java_smt/api/StringFormulaManager.java#L22-L34) says that escaped strings are recognized. Additionally, the Java-supported Unicode range (0x00000-0x10FFFF) is larger than the SMTLIB-supported range (0x00000-0x2FFFF).
Are we sure we want to implement it this way, given that we already use Java Strings? Why would a user need to use escaping in Java? Wouldn't it be nicer to ignore escaping and see a string like \\u{1234} as 8 individual chars ['\','u','{','1','2','3','4','}']?
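To make the range mismatch concrete, here is a minimal sketch of a check for Java Strings that cannot be represented in SMTLIB. `CodePointRange` and `fitsSmtlib` are hypothetical names chosen for illustration; nothing here implies that JavaSMT has or needs such a method.

```java
/** Hypothetical helper, shown only to illustrate the two ranges. */
final class CodePointRange {
  // Largest code point of the SMTLIB range quoted above.
  static final int SMTLIB_MAX_CODE_POINT = 0x2FFFF;

  /** Returns true if every code point of the Java String lies in the SMTLIB range. */
  static boolean fitsSmtlib(String s) {
    return s.codePoints().allMatch(cp -> cp <= SMTLIB_MAX_CODE_POINT);
  }
}
```

Java's own upper bound is `Character.MAX_CODE_POINT` (0x10FFFF), so a String containing, say, code point 0x30000 is a valid Java String but fails this check.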

I'd prefer to just have Java Strings and only use escape/unescape when communicating with the SMT solver. One downside is interoperability as some users might expect SMTLIB escape sequences to work. Maybe there is a way to add escape/unescape to the StringFormulaManager so that users can convert their own Strings when needed?

The other problem is that this is an API breaking change and that we would have to fix quite a few of the tests in StringFormulaManagerTest that expect SMTLIB escape sequences to work.

There has been no release in which JavaSMT officially supports Unicode strings, so the API change mostly happens at the development stage. Also, changing test code should not be an issue.

Perhaps we could get some third opinion on this topic?

daniel-raffler added 6 commits January 16, 2025 16:51

Strings: Added more tests for Unicode escaping

646db34

Strings: Patch a bug in testConstStringReplaceAll

a723ab4

The test uses Strings.replaceAll and compares it to the result of str.replace_all in SMTLIB. However, the two functions behave differently when the "matching" String is empty, and we need a special case for that.
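The special case can be sketched as follows. This is an illustration rather than the code from the test: `ReplaceAllSketch` and `smtlibReplaceAll` are hypothetical names, and Java's `String.replace` is used as a stand-in for whichever Java replacement function the test actually calls.

```java
/** Hypothetical reference implementation for comparing against str.replace_all. */
final class ReplaceAllSketch {
  /** Mirrors str.replace_all: an empty pattern leaves the input unchanged. */
  static String smtlibReplaceAll(String s, String pattern, String replacement) {
    if (pattern.isEmpty()) {
      // Java would insert the replacement at every position instead.
      return s;
    }
    return s.replace(pattern, replacement);
  }
}
```

For comparison, `"abc".replace("", "x")` returns `"xaxbxcx"` in Java, while `smtlibReplaceAll("abc", "", "x")` returns `"abc"`.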

Strings: Fix pattern for \u{...} escape sequences. The final '}' in t…

9db5646

…he pattern should also be escaped.

Strings: Escape backslashes when creating String literals.

4f99430

This is needed to protect the backslash from substitution later when getting the results from the model.

Strings: Add a separate case for the escape sequence "\u{5c}" in unes…

16cb3bf

…capeUnicodeForSmtlib() as backslashes (= codepoint 5c) are considered special characters by Matcher.appendReplacement()
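The problem with Matcher.appendReplacement() can be reproduced in isolation. The sketch below is not the code from this commit; `BackslashReplacement` and `decodeBackslash` are hypothetical names, and `Matcher.quoteReplacement` is just one way to make the backslash literal (the commit instead mentions a separate case).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical example; only decodes the escape sequence for code point 5c. */
final class BackslashReplacement {
  // Literal escape sequence for code point 5c; the closing brace is escaped as well.
  private static final Pattern BACKSLASH_ESCAPE = Pattern.compile("\\\\u\\{5c\\}");

  static String decodeBackslash(String input) {
    Matcher m = BACKSLASH_ESCAPE.matcher(input);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      // Passing a bare "\\" here throws IllegalArgumentException, because the
      // replacement string treats a backslash as an escape character.
      m.appendReplacement(out, Matcher.quoteReplacement("\\"));
    }
    m.appendTail(out);
    return out.toString();
  }
}
```

Calling `decodeBackslash` on the six characters `\u{5c}` returns a single backslash.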

daniel-raffler linked an issue Jan 16, 2025 that may be closed by this pull request

Inconsistent handling of Unicode characters in String theory #412

Open

kfriedberger reviewed Jan 16, 2025
