Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(ASAN report-converter) CodeChecker cmd diff finds new issues even though the reports are the same #3632

Open
jimis opened this issue Mar 18, 2022 · 10 comments

Comments

@jimis
Copy link
Contributor

jimis commented Mar 18, 2022

I compile the exact same code and same test with Address Sanitizer enabled. Both of the days, address sanitizer reports the same issue while running the test. But the memory addresses are different. It seems CodeChecker considers this a significant difference and the issues get a different report_hash.

I append snippets of CodeChecker cmd results -o json for both runs. Notice that the reports are for the same bug, but the report-hash is different

First day:

  {
    "runId": 1,
    "checkerId": "AddressSanitizer",
    "bugHash": "ca8e725f19d848dc2e1f7947f34a5cc6",
    "checkedFile": "tests/auto/widgets/styles/qstylesheetstyle/tst_qstylesheetstyle.cpp",
    "checkerMsg": "heap-use-after-free on address 0x60200000c030 at pc 0x000000544718 bp 0x7ffe1e48a350 sp 0x7ffe1e48a348",
    "reportId": 1,
    "fileId": 26,
    "line": 1495,
    "column": 27,
    "severity": 40,
    "reviewData": {
      "status": 0,
      "comment": null,
      "author": null,
      "date": null
    },
    "detectionStatus": 0,
    "detectedAt": "2022-03-04 16:30:35.008930",
    "fixedAt": null,
    "bugPathLength": 38,
    "details": null,
    "analyzerName": "asan"
  },

Second day:

  {
    "runId": 2,
    "checkerId": "AddressSanitizer",
    "bugHash": "29ad512916070f415c93c6f01afc139f",
    "checkedFile": "tests/auto/widgets/styles/qstylesheetstyle/tst_qstylesheetstyle.cpp",
    "checkerMsg": "heap-use-after-free on address 0x60200000c030 at pc 0x000000544718 bp 0x7ffd8ab14770 sp 0x7ffd8ab14768",
    "reportId": 5,
    "fileId": 26,
    "line": 1495,
    "column": 27,
    "severity": 40,
    "reviewData": {
      "status": 0,
      "comment": null,
      "author": null,
      "date": null
    },
    "detectionStatus": 0,
    "detectedAt": "2022-03-07 12:44:59.403302",
    "fixedAt": null,
    "bugPathLength": 38,
    "details": null,
    "analyzerName": "asan"
  },

As a result, CodeChecker cmd diff reports new issues found every day, without really having any new issue detected.

@jimis jimis changed the title (ASAN report-converter) "CodeChecke cmd diff" finds new issues even though the reports are the same (ASAN report-converter) "CodeChecker cmd diff" finds new issues even though the reports are the same Mar 18, 2022
@jimis jimis changed the title (ASAN report-converter) "CodeChecker cmd diff" finds new issues even though the reports are the same (ASAN report-converter) CodeChecker cmd diff finds new issues even though the reports are the same Mar 18, 2022
@bruntib
Copy link
Contributor

bruntib commented Mar 25, 2022

Hi,

This is a real issue to solve in the future. The problem is that the has which is identifying the report type contains the message emitted by the sanitizer. So yes, unfortunately they will be identified as different reports every time.
Thanks for reporting it!

@jimis
Copy link
Contributor Author

jimis commented Mar 25, 2022

We could execute a regex replace on the message, right before generating the hash. Using sed for example this would be:

sed -E 's/\b0x[0-9a-fA-F]+\b/[address]/g'

Example:

$ echo "heap-use-after-free on address 0x60200000c030 at pc 0x000000544718 bp 0x7ffd8ab14770 sp 0x7ffd8ab14768"  | sed -E 's/\b0x[0-9a-fA-F]+\b/[address]/g'
heap-use-after-free on address [address] at pc [address] bp [address] sp [address]

In CodeChecker python code, where would the best place be to do this replacement? I can submit a patch if it's not too complicated.

@bruntib
Copy link
Contributor

bruntib commented Mar 25, 2022

There is a report-converter tool which parses the output of some analyzer tools and convert them to a CodeChecker report. Sanitizers can be found here: https://github.com/Ericsson/codechecker/tree/master/tools/report-converter/codechecker_report_converter/analyzers/sanitizers

@jimis
Copy link
Contributor Author

jimis commented Mar 25, 2022

There is a report-converter

Isn't the report-hash computation in CodeChecker itself, rather than in report-converter? Such a change might have to be generic.

@bruntib
Copy link
Contributor

bruntib commented Mar 25, 2022

CodeChecker is using this report-converter for converting sanitizer results to .plist format. These are consumed by CodeChecker which computes the report hash based on the analyzer's message among others. The problem is that this checker message is constantly changing in the plists.

What command did you use for generating sanitizer reports?

@jimis
Copy link
Contributor Author

jimis commented Mar 25, 2022

I used report-converter followed by CodeChecker store and finally CodeChecker cmd results -o json.

My point is that there are 2 places where this can be fixed:

  • In report-converter: every sanitizer that might be printing addresses in its messages needs to "sanitize" the messages (remove the addresses) before generating the plist files. Disadvantage is that the original addresses are lost, all that one can see in the message in the CodeChecker web UI is the [address] placeholder.
  • In CodeChecker's generate_report_hash(filename, linenumber, message) function (this function does not really exist, it's a guess of mine about how CodeChecker generates the report_hash from a plist file). This function would only locally edit the message argument to replace the addresses with [address] so that the hash is unaffected by addresses.
    • Advantage: The message with the original addresses is in the plist file, and presumably CodeChecker UI can print that one.
    • Advantage: All CodeChecker reports benefit from that: addresses are automatically removed for all sanitizer messages, clang-tidy messages, analyzer messages etc.
    • Disadvantage: Some checkers might print a hex number that is significant in their message, like "buffer of size 0x20". These kind of numbers will also get mangled.

@bruntib
Copy link
Contributor

bruntib commented Mar 28, 2022

I would vote on the first option, i.e. changing the checker message in the report-converter. I think we shouldn't change the hash generation algorithm only for sanitizers in CodeChecker. Is there some use-case where it would be important for the user to see those addresses? Maybe in the future we could somehow include the original message in a separate "text bubble" or something which is not the part of the message thus doesn't participate in hash generation. We are doing something similar for ClangSA results, having a "bug path". The context insensitive hash generation excludes this bug path.

@jimis
Copy link
Contributor Author

jimis commented Mar 29, 2022

We should also think if there is a way to handle the already stored reports by sanitizers. I have daily reports printing a ton of addresses each, all with different report hash. Even though many reports (even in the same run) refer to the same issue, selecting the "unique reports" filter doesn't help. All these reports would have to be deleted and re-converted and then re-stored?

Is there some use-case where it would be important for the user to see those addresses?

There are messages that tell you "read of size 8 at Address1" and then print a whole stacktrace and then give more information on that specific address by printing "Address1 is located X bytes inside a 32-byte region freed by thread T0 here" and print another stacktrace. Apparently it helps to see that it is talking about the same address.

Maybe in the future we could somehow include the original message in a separate "text bubble"

Yeah I see this need more and more. We don't even need a "text bubble", we need a more clear way to display a lot of text. The message I mentioned above might print 150 lines of stacktraces (of different threads) and helper messages. report-converter only catches the top stacktrace and ignores the rest, but they are vital to debugging the issue. I'll file a separate ticket for that.

@bruntib
Copy link
Contributor

bruntib commented Apr 7, 2022

Some time we were also thinking of the case of same reports, changing their hash. This might happen either because the user chose another hash algorithm at CodeChecker analyze command, or the change of analyzer tool, etc. We thought that we could somehow store the old and new hashes of all reports. This introduced more issues than it solved, so we advise re-analyzing the project.

We had a discussion in the team and we thought that the we could move hash generation to the report converter. It could be a better solution to this current issue.

@jimis
Copy link
Contributor Author

jimis commented Apr 8, 2022

We had a discussion in the team and we thought that the we could move hash generation to the report converter. It could be a better solution to this current issue.

I like the idea, that should be the best indeed! (even if a bit complicated to implement) Thank you for discussing it further.

Regarding the extra messages and stacktraces I have opened #3639 to keep track of it, since parsing, storing and presenting all that info is a separate and complicated issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants