-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix handling of Unicode characters in String theory #431
base: master
Are you sure you want to change the base?
Fix handling of Unicode characters in String theory #431
Conversation
The test uses Strings.replaceAll and compares it to the result of str.replace_all in SMTLIB. However, the two functions behave different when the "matching" String is empty, and we need a special case for that.
…he pattern should also be escaped.
This is needed to protect the backslash from substitution later when getting the results from the model.
…capeUnicodeForSmtlib() as backslashes (= codepoint 5c) are considered special characters by Matcher.appendReplacement()
…constant. This is needed even for solvers like CVC4+5 or Z3 that expect Unicode characters to be escaped. We first need to unescape the String to resolve any escape sequences from the user, and then apply escape again before sending the String to the solver.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am undecided whether our String escaping is useful.
@@ -29,8 +29,8 @@ class CVC4StringFormulaManager extends AbstractStringFormulaManager<Expr, Type, | |||
|
|||
@Override | |||
protected Expr makeStringImpl(String pValue) { | |||
// The boolean enables escape characters! | |||
return exprManager.mkConst(new CVC4String(escapeUnicodeForSmtlib(pValue), true)); | |||
String str = escapeUnicodeForSmtlib(unescapeUnicodeForSmtlib(pValue)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would assume that escape
and unescape
are exactly reversed operations. This line is then a no-op and not required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true for a single character (or a single escape sequence), but here we might have "mixed" Strings where some characters are escaped and others aren't. Take "\u{4321}\n"
for instance, where the first character is already written as an escape sequence, but the newline following it isn't. By applying unescape
we get "䌡\n"
and then escape
will give us the properly escaped String "\u{4321}\u{a}"
.
(Just using escape
no longer works since we're now also escaping \
to prevent it from capturing an escape sequence when unescaping back from the model. In the example escape("\u{4321}\n")
would return "\u{5c}u{4321}\u{a}"
which is not what we want as the String now has 9 characters)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could handle \u{4321}\n
as String with 9 chars like ['\','u','{','4','3','2','1','}','\n']
and simply not handle it as escaped unicode char? what happens if a user would want to use it that way, e.g. like "\u{43" + "21" + "}\n"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if a user would want to use it that way, e.g. like "\u{43" + "21" + "}\n"?
They would have to escape the String themselves and use StringFormulaManager.makeString("\u{5c}u{4321}\n")
. However, I agree that this can be quite confusing when assembling several Strings. For instance StringFormulaManager.length(StringFormulaManager.makeString("\u{a"))
is 4
and StringFormulaManager.length(StringFormulaManager.makeString("}"))
is 1
, but StringFormulaManager.length(StringFormulaManager.makeString("\u{a" + "}"))
will give you 1
again.
@@ -27,7 +27,8 @@ class CVC5StringFormulaManager extends AbstractStringFormulaManager<Term, Sort, | |||
|
|||
@Override | |||
protected Term makeStringImpl(String pValue) { | |||
return solver.mkString(escapeUnicodeForSmtlib(pValue), true); | |||
String str = escapeUnicodeForSmtlib(unescapeUnicodeForSmtlib(pValue)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would assume that escape
and unescape
are exactly reversed operations. This line is then a no-op and not required.
public void testInputEscape() throws SolverException, InterruptedException { | ||
// Test if SMTLIB Unicode literals are recognized and converted to their Unicode characters. | ||
assertEqual(smgr.length(smgr.makeString("Ξ")), imgr.makeNumber(1)); | ||
assertEqual(smgr.length(smgr.makeString("\\u{39E}")), imgr.makeNumber(1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical question: The documentation (https://github.com/sosy-lab/java-smt/blob/master/src/org/sosy_lab/java_smt/api/StringFormulaManager.java#L22-L34) says that escaped strings are recognized. Additionally, the Java-supported Unicode range (0x00000-0x10FFFF) is larger than SMTLIB-supported range (0x00000-0x2FFFF).
Are we sure, we want to implement in this way, e.g, we already use Java-Strings? Why would a user need to use escaping in Java? Wouldn't it be nicer to ignore escaping and see a string like \\u{1234}
as 8 individual chars ['\','u','{','1','2','3','4','}']
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to just have Java Strings and only use escape
/unescape
when communicating with the SMT solver. One downside is interoperability as some users might expect SMTLIB escape sequences to work. Maybe there is a way to add escape
/unescape
to the StringFormulaManager
so that users can convert their own Strings when needed?
The other problem is that this is an API breaking change and that we would have to fix quite a few of the tests in StringFormulaManagerTest
that expect SMTLIB escape sequences to work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was no release that JavaSMT officially supports unicode strings. The API change happens mostly on developer stage. Also, changing test-code should not be an issue.
Perhaps we could get some third opinion on this topic?
Hello,
this MR patches several minor issues remaining after #422:
\
to make sure the backslash is preserved and doesn't get captured later when unescaping againunescape
first when creating a new String constant. This is needed even when the solver does not support plain Unicode and we'll have to escape the String again later. This change closes an issue with Z3 and CVC4 where broken escape sequences were not handled correctly.As mentioned in #412 I think that
makeString
should not be translating SMTLIB escape sequences at all. However, this is an API breaking change and can still be considered some other time