Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bumpy road complexity metrics #714

Merged
merged 35 commits into from
Apr 28, 2024
Merged

Bumpy road complexity metrics #714

merged 35 commits into from
Apr 28, 2024

Conversation

dbukki
Copy link
Collaborator

@dbukki dbukki commented Feb 29, 2024

Fixes #684

Formula

The bumpy road metric of a function is computed as the function's total bumpiness divided by the number of statements considered. The total bumpiness of a function is the sum of the depth of each considered statement, where depth is the level of the statement's indentation. (How many parent scopes does it have?)

bumpy_road(F) = ( SUM {s in S} (depth(s)) ) / count(S)

where S is the set of considered statements in function F.

Note: Functions with count(S)=0 are considered empty. In this case the result of the formula is 1.

Domain

The bumpy road metric only considers

  • control flow (e.g if, for, ...),
  • scope (e.g. { ... }, function body), and
  • top-level statements (e.g. int a;, x += y + 8;, ...) (basically only the "proper" statements terminated by a semicolon)

when traversing the AST. Anything else (e.g. labels, sub-expressions, ...) is not counted towards this metric.

Range

For every function F, let max(D) be the depth of its deepest statement. The bumpy road metric of F falls within the range:

[1, max(D)]

where

  • 1 means completely flat (=good), and
  • max(D) means completely nested/bumpy (=bad).

Changes

In order to integrate this metric into CodeCompass, the total bumpiness and statement count had to be computed during parsing (since we do not store the necessary statement info in the database that the metrics plugin could utilize).

For these values, the CppFunction table now has two extra fields: bumpiness and statementCount.
The metrics parser is then responsible for computing the quotient that makes up the final metric.

In the parser, most of our existing Traverse* functions had to be restructured. As this also impacted existing features (notably: McCabe and destructor usage via statement stacks), some refactoring also had to be done to ensure they still work like before.

Instead of the old template decorator approach, we now use scope objects (StatementScope, TypeScope, EnumScope, FunctionScope, CtxStmtScope and ScopedValue) to perform pre- and post- actions around Base::Traverse* calls with their ctors and dtors. Combined with some further generalizations at the root TraverseDecl and TraverseStmt functions, this not only reduces the number of specialized Traverse* functions (which usually contained duplicate bodies), but also shortens/condenses the code of the parser.

In order to be able to track nested statements, scopes, and therefore the depth of the current statement during traversal, a new _stmtStack member has been introduced to the parser. This member is of a special StatementStack type that is built up from StatementScope objects (one per each statement during traversal) to form the "parent chain" of the currently inspected statement.
Individual statement scopes can then be further configured by the specialized Traverse* functions to describe how that particular statement affects the depth of further statements on the stack. (Note: I did my best to add documentation/comments to the non-trivial parts of each scope type.)
With this new mechanism, the old _mcCabeStack and _statements stacks could be successfully folded into this logic, thus further reducing complexity (and unnecessary duplication of the same statement stack pattern).

Testing

Unit tests for bumpy road have been added into to the test directory. Test cases have been written in a similar style as with the McCabe metric. Further McCabe test cases have also been added to check previously untested cases that I have discovered during my attempts at manual regression testing.

Elimination of duplicate Traverse* bodies via decorator functions.
Minor refactors in the McCabe metric calculation, in anticipation of a similar approach with bumpy road metrics.
Additional test cases for McCabe.
Order of Stmt and Expr traversal functions changed so that they are grouped together more coherently.
More centralized/flexible type-based scope creation spanning all potential cases; not just specialized functions.
Unification of the statement scope stack and the statement stack.
…ame, different ctors look the same; therefore they cannot be distinguished during tests.
Elimination of duplicate Traverse* bodies via decorator functions.
Minor refactors in the McCabe metric calculation, in anticipation of a similar approach with bumpy road metrics.
Additional test cases for McCabe.
Order of Stmt and Expr traversal functions changed so that they are grouped together more coherently.
More centralized/flexible type-based scope creation spanning all potential cases; not just specialized functions.
Unification of the statement scope stack and the statement stack.
…ame, different ctors look the same; therefore they cannot be distinguished during tests.
@dbukki
Copy link
Collaborator Author

dbukki commented Feb 29, 2024

For reference, here is a list of McCabe and bumpy road metrics exported from the Xerces-C project using the current state:
xerces-bumpyroad.log
xerces-mccabe.log

@mcserep mcserep requested review from intjftw and mcserep March 6, 2024 14:46
@mcserep mcserep added Kind: Enhancement 🌟 Plugin: C++ Issues related to the parsing and presentation of C++ projects. Plugin: Metrics Issues related to the code metrics plugin. labels Mar 6, 2024
These types are only intended for scope-like usage (ctor-dtor pair matters), not the relocation of data.
@dbukki
Copy link
Collaborator Author

dbukki commented Mar 7, 2024

I found an interesting phenomenon in the parser:

In C++, we know that records and functions can be either just a declaration or an actual definition. However, our policy towards storing these in the parsed database is very different:

  • For record types, VisitRecordDecl returns early if the visited record declaration is not a definition. This causes the corresponding CppRecord entity to remain initialized with astNodeId==0, which eventually causes the record traversal functions to omit it from _types (and therefore the database):
if (_type->astNodeId)
  _visitor->_types.push_back(_type);

Note: In this PR, this call happens in TypeScope::~TypeScope, but the behavior itself is not new to this PR. In the current master, the same behavior can be found in the TraverseRecordDecl (and similar functions), so this is not a regression.

  • For functions, VisitFunctionDecl will initialize the CppFunction entity regardless of whether the function is a declaration or definition, which causes every occurrence of it to be added to _functions (and therefore the database):
if (_curFun->astNodeId)
  _visitor->_functions.push_back(_curFun);

Note: In this PR, this call happens in FunctionScope::~FunctionScope, but the behavior itself is not new to this PR. In the current master, the same behavior can be found in the TraverseFunctionDecl (and similar functions), so this is not a regression.


Essentially, this means that:

  1. The following code persists only one CppRecord entity in the database (but 3 CppAstNode entities for each declaration):
class A;
class A {};
class A;
  1. The following code persists no CppRecord entities in the database at all (only the 3 CppAstNode entities):
class NoDef0;
class NoDef1;
class NoDef2;
  1. The following code persists 3 different (!) CppFunction entities in the database (along with the 3 CppAstNode entities):
void f();
void f() {}
void f();

Note: The 3 different CppFunction entities are nearly identical in content, with the exception of fields computed from the function's body (e.g. mccabe and bumpiness metrics). The metrics fields of the declaration entities will retain their default values assigned in VisitFunctionDecl; only the definition (which has the body) will contain the actual metrics.


As far as I can see:

  • Scenario 1 above is what we would normally expect: One record entity for the one definition.
  • Scenario 2 is a bit weird, since no actual type information is ever recorded for types that have only ever been declared, but never defined. Nonetheless, I can still image this being consistent with what we want to use our database for. (Since an AST node is still stored even for declarations, navigation via the UI is still possible here.)
  • Scenario 3 definitely smells fishy though. I would expect the same "one definition = one entity" rule to apply here as with record types.

The "multiple function entities" problem is also the root cause of why the McCabe and BumpyRoad tests fail for functions that are declared more than once: For such functions, _db->query_value<model::CppFunction>(...) runs into an assert as the query yields more than one results for the given name, out of which only one is truly meaningful.

I tried resolving this ambiguity by applying the same logic to functions as what we already utilize with records. In my local branch, I added a guard condition to only store the CppFunction entity for the definition.
However, this actually causes an existing test (namely: CppParserTest::FunctionDeclarationOnly) to fail. So this got me confused, and I now see two possibilities:

  1. Either this "multiple function entities" phenomenon is the actual proper/expected behavior, and this test is right. That would mean that there has been reason in the past (and still is) to retain the current behavior and have a CppFunction record for every declaration of a function. But then by design, this also means that the metrics-related fields are also stored for each declaration, even though they have no true meaning there, which raises a database design concern.
  2. Or, both our current policy for function storage and this test are wrong, and we should instead fix the database to only store every function entity once. By extension, this would cause functions without a definition to have no CppFunction entity stored in the database (as is already the case with records in Scenario 2). Thus the meaning of the FunctionDeclarationOnly test would have to change, but metrics (and relevant function info) would only be stored for every function once.

The second option seems like the more rational alternative to me, both from a database design perspective, and from a metrics-query perspective. It would also cut back on the size of the CppFunction table, which apparently contains a lot of "duplicates" right now. But I don't know if this problem deserves an issue ticket of its own or not. I am also uncertain about the regressions this second option would introduce, so I definitely don't want to rush ahead with the development until the situation is clear.

@mcserep Could you advise me on which of the above two options is the correct approach?

@mcserep
Copy link
Collaborator

mcserep commented Mar 14, 2024

@dbukki As we discussed yesterday on the weekly meeting, please continue as follows:

  1. Do not deal with the found multiple function entities issue in this PR. Instead implement a simple solution, which can be used, e.g.:
    • either insert the correct metric value for each CppFunction record; or
    • insert the value only for the definition record.
  2. Handle the multiple CppFunction issue in a separate PR, see Multiple CppFunction entities #720.

@intjftw
Copy link
Collaborator

intjftw commented Apr 24, 2024

If I execute incremental parsing on a project, parsing fails immediately with a segmentation fault. @dbukki can you please check if this is coming from your modifications?

@dbukki
Copy link
Collaborator Author

dbukki commented Apr 26, 2024

If I execute incremental parsing on a project, parsing fails immediately with a segmentation fault. @dbukki can you please check if this is coming from your modifications?

I checked the current master (8e84d84) and the segfault is also present there.
I put some logging into various places, including the constructor of ClangASTVisitor, and no logs are printed before the segfault happens. This leads me to believe that the crash happens at a much earlier step, long before any actual C++ parsing can take place.

In any case, this is a separate problem.
I have opened an issue for it: #735


bool TraverseFunctionDecl(clang::FunctionDecl* fd_)
class TypeScope final
Copy link
Collaborator

@mcserep mcserep Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any special reason these helper classes are marked final?

@mcserep
Copy link
Collaborator

mcserep commented Apr 28, 2024

I checked the current master (8e84d84) and the segfault is also present there. I put some logging into various places, including the constructor of ClangASTVisitor, and no logs are printed before the segfault happens. This leads me to believe that the crash happens at a much earlier step, long before any actual C++ parsing can take place.

In any case, this is a separate problem. I have opened an issue for it: #735

@dbukki I have checked and branch release/gershwin is not affected by this bug, so it is highly likely to be related to the cppmetrics plugin, as that is the only new plugin in the upcoming release with a parser component. Nevertheless, it should be investigated in a separate issue (#735), as it is not directly related to the bumpy road metric.

Copy link
Collaborator

@intjftw intjftw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine, LGTM! 🚀

@intjftw intjftw merged commit 3279e4b into Ericsson:master Apr 28, 2024
11 checks passed
@dbukki dbukki deleted the bumpy-road branch May 11, 2024 07:05
@mcserep
Copy link
Collaborator

mcserep commented May 14, 2024

@dbukki The bumpy road metrics was not added to the service.

Thrift interface:

enum CppAstNodeMetricsType
{
ParameterCount = 1,
McCabe = 2,
LackOfCohesion = 3,
LackOfCohesionHS = 4
}

Service implementation:

void CppMetricsServiceHandler::getCppAstNodeMetricsTypeNames(
std::vector<CppAstNodeMetricsTypeName>& _return)
{
CppAstNodeMetricsTypeName typeName;
typeName.type = CppAstNodeMetricsType::ParameterCount;
typeName.name = "Number of function parameters";
_return.push_back(typeName);
typeName.type = CppAstNodeMetricsType::McCabe;
typeName.name = "McCabe metric";
_return.push_back(typeName);
typeName.type = CppAstNodeMetricsType::LackOfCohesion;
typeName.name = "Lack of cohesion of function";
_return.push_back(typeName);
typeName.type = CppAstNodeMetricsType::LackOfCohesionHS;
typeName.name = "Lack of cohesion of function (Henderson-Sellers variant)";
_return.push_back(typeName);
}

Please add this to the codebase in a new fixing PR, so the bumpy road metric become queryable through the web API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Kind: Enhancement 🌟 Plugin: C++ Issues related to the parsing and presentation of C++ projects. Plugin: Metrics Issues related to the code metrics plugin.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bumpy Road Complexity at Function Level for C++
3 participants