Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add annotations #654

Closed
wants to merge 2 commits into from
Closed

Add annotations #654

wants to merge 2 commits into from

Conversation

zurk
Copy link
Contributor

@zurk zurk commented Feb 28, 2019

Note that all work in progress. However, I created this PR to discuss the proposed solution, find all possible drawbacks and fix them on the early stage. Please note, that the code is raw in minor details. Let's discuss high-level questions about implemented architecture/interface/API decisions.

Let me remind you the main idea of annotations:
Separate data (code) and all it annotations (like UAST nodes, Virtual Token, y, file path, etc) handling.

This PR contains:

  • Annotation tooling.
    • Will be moved to lookout-sdk-ml when I create ready-to-merge PR. For now, I do not want to split the code because it is harder to review.
    • I cut all possible corners, did not implement a lot of useful functions, did not have good tests, etc.
  • Backward capability layer is added. to_vnode() function.
  • FeatureExtractor._parse_vnodes(),
    FeatureExtractor._parse_file(),
    VirtualNode.from_node()
    functions are rewritten using Annotations to give a feeling of how it will look like. To make everything works I use to_vnode() Backward capability function.

I also checked that I get exactly the same vnodes that we have in current master.

Related PR: #521

cc @src-d/machine-learning and specially @m09 PTAL
Any comments and feedback appreciated.

@zurk zurk requested a review from m09 February 28, 2019 20:06
@zurk zurk force-pushed the annotations branch 4 times, most recently from 42b8603 to a7be057 Compare February 28, 2019 21:52
@zurk
Copy link
Contributor Author

zurk commented Mar 1, 2019

The biggest problem I am aware of is dealing with null-intervals [x, x) aka NOOPs. I write [ and ) because it is semi-intervals like in python slicing.

There is a lot of cases which require different behavior from null-intervals:

  1. Null-intervals should be not covered by the UAST Node annotation next to it. For example, we have var t = 0; and there is a NOOP [9, 9) between 0 and ;. Also, there is an UASTNode annotation for 0 [8, 9). As soon as we do not have UAST annotation for NOOP [9, 9) we should consider these intervals as disjoint ones.

  2. NOOP in the beginning and end of the file. Should be considered as intersecting with file annotations like PATH and LANGUAGE. The same for LINE annotation (annotates the file line number).

  3. There are two identical NOOPs between the files: one from the beginning of the first file and second from the end of the next file. However, it is easy to fix. We should add document separator symbol and do not concatenate files directly to each other. So we have the end NOOP at [len(file1), len(file1)) and the start NOOP at [len(file1)+1, len(file1))+1.

    3.1. That means I should move _check_interval_crossing from AnnotatedData to Annotated class and make it dependent from annotation type.

    3.2. There is also another option to store different files in different AnnotatedData classes, so RawData takes only one document. I feel that this is a bad solution. Once we want to add cross-file annotations (like containing directory) it is better to have all files in one place.

@m09
Copy link
Contributor

m09 commented Mar 5, 2019

Thanks for the first draft @zurk 👍

Some feedback I can give from UIMA experience:

  • it's better to KISS and have a single file per AnnotedData. Then if/when we need cross-files info sharing, we can use external objects.
  • we should store per annotation type and not per offset: the goal is to use as few annotation types as possible in a given piece of code, the architecture of the code should reflect that
  • for the point about UAST not being applicable to noops: we can query related annotations per covering/covered/equal spans when we want / when it's appropriate, but we can also just point to another annotation in an annotation (this way we specify explicitely which annotation is related):
    class TokenAnnotation(Annotation):
        UASTNode: Optional[UASTNodeAnnotation] = None
    then when we iterate TokenAnnotations we can retrieve the UAST node this way.
  • better not to have ValuedAnnotation and specify a meaningful name for each kind of annotation instead of value
  • better not to have names but just use the classes: names make the code less typesafe
  • an annotation containing the complete uast should not be necessary if we have proper UASTNodeAnnotations: only a point to the root UASTNodeAnnotation should be enough (in one of the document wide annotations for example).

@zurk zurk force-pushed the annotations branch 2 times, most recently from 0e55837 to 2239da4 Compare March 7, 2019 19:03
@zurk
Copy link
Contributor Author

zurk commented Mar 7, 2019

First feedback addressed, the code becomes more verbose but I and Hugo agreed that it is better to be more explicit then short.
@m09 hare a day off tomorrow and I think to rewrite couple more functions, add a test, add doc, etc.
So make PR closer to be merged.

@zurk
Copy link
Contributor Author

zurk commented Mar 11, 2019

I was busy with #666 so I do all my plans after a review.

Copy link
Contributor

@m09 m09 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the improvements @zurk 👍

From the discussion we had, the last big point left to implement seems to be the UAST to annotations transformation (that will allow to ditch many code smells from current codebase).

@m09
Copy link
Contributor

m09 commented Mar 14, 2019

This starts to look ready 👍. It should be better to have parent attribute in UASTAnnotation and remove the UASTParentAnnotation though, it will make the code much simpler!

@zurk
Copy link
Contributor Author

zurk commented Mar 15, 2019

It should be better to have parent attribute in UASTAnnotation and remove the UASTParentAnnotation though

I agree. However, when I start implementing it I see that it is not so easy. The problem is that we add the UAST and parents annotation in different places. It looks like we can move parents annotation code to uast annotation, but we have a uast_fixers that can change some nodes (which is not happening right now, but it is possible). So only second option left: move UAST annotation code to parents annotation. But it breaks AnnotatedData.from_file function because it requires a UAST and it will be really unobvious if you do not add it inside this function.

I hope once v3 python client will be introduced we will have parent feature included into UAST itself and remove UASTParentAnnotation at all. So let's keep it as is for now. I do not see big drawbacks for such solution.

@zurk zurk force-pushed the annotations branch 2 times, most recently from 14ee11d to ca0638b Compare March 15, 2019 11:39
@zurk
Copy link
Contributor Author

zurk commented Mar 15, 2019

I think we can consider this PR as done and ready to merge, so please review carefully and let's merge if it is fine.
I move out all untested and raw changes to the next PRs.

I am also doing a regression test for this part one more time to be sure that I have exactly the same VNodes in the end (zurk@612887e). Here is the commit to run the test (I run make quality-report and check the logs).
So far no errors (13/20).

@m09
Copy link
Contributor

m09 commented Mar 15, 2019

So only second option left: move UAST annotation code to parents annotation. But it breaks AnnotatedData.from_file function because it requires a UAST and it will be really unobvious if you do not add it inside this function.

This option seems perfectly fine to me. I feel we should do it to avoid complication of the rest of the codebase.

@zurk
Copy link
Contributor Author

zurk commented Mar 15, 2019

Ok, I think I know the answer.
Actually, we should not allow changing Node in the uast_fixers. It is too much.
Because it actually breaks the UAST itself because you did not replace the old node with a new node.
That means that the old node is still available when you traverse UAST (and you miss a new one).

My plan is:

  1. Do not return Node from uast_fixers.
  2. Extract parents calculation from _parse_file to separate function.
  3. Add parents annotation to AnnotatedData.from_file

Logic was not broken, code improved and everyone seems to be happy. :)
WDYT @m09?

@m09
Copy link
Contributor

m09 commented Mar 15, 2019

yes it seems like great solution: then the code will be easy to write everywhere else 👍

zurk added 2 commits March 15, 2019 18:01
Signed-off-by: Konstantin Slavnov <[email protected]>
Signed-off-by: Konstantin Slavnov <[email protected]>
@zurk
Copy link
Contributor Author

zurk commented Mar 15, 2019

Done.

Copy link
Collaborator

@vmarkovtsev vmarkovtsev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lookout/style/format/feature_extractor.py Show resolved Hide resolved
lookout/style/format/feature_extractor.py Show resolved Hide resolved
lookout/style/format/feature_extractor.py Show resolved Hide resolved
lookout/style/format/feature_extractor.py Show resolved Hide resolved
lookout/style/format/feature_extractor.py Show resolved Hide resolved
@zurk
Copy link
Contributor Author

zurk commented Apr 16, 2019

As soon as there is a lot of conflicts I am going to close this one and open new PR.
All feedback from the review will be addressed.

@zurk zurk closed this Apr 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants