Add annotations #654

zurk · 2019-02-28T20:06:08Z

Note that all work in progress. However, I created this PR to discuss the proposed solution, find all possible drawbacks and fix them on the early stage. Please note, that the code is raw in minor details. Let's discuss high-level questions about implemented architecture/interface/API decisions.

Let me remind you the main idea of annotations:
Separate data (code) and all it annotations (like UAST nodes, Virtual Token, y, file path, etc) handling.

This PR contains:

Annotation tooling.
- Will be moved to lookout-sdk-ml when I create ready-to-merge PR. For now, I do not want to split the code because it is harder to review.
- I cut all possible corners, did not implement a lot of useful functions, did not have good tests, etc.
Backward capability layer is added. to_vnode() function.
FeatureExtractor._parse_vnodes(),
FeatureExtractor._parse_file(),
VirtualNode.from_node()
functions are rewritten using Annotations to give a feeling of how it will look like. To make everything works I use to_vnode() Backward capability function.

I also checked that I get exactly the same vnodes that we have in current master.

Related PR: #521

cc @src-d/machine-learning and specially @m09 PTAL
Any comments and feedback appreciated.

zurk · 2019-03-01T10:16:15Z

The biggest problem I am aware of is dealing with null-intervals [x, x) aka NOOPs. I write [ and ) because it is semi-intervals like in python slicing.

There is a lot of cases which require different behavior from null-intervals:

Null-intervals should be not covered by the UAST Node annotation next to it. For example, we have var t = 0; and there is a NOOP [9, 9) between 0 and ;. Also, there is an UASTNode annotation for 0 [8, 9). As soon as we do not have UAST annotation for NOOP [9, 9) we should consider these intervals as disjoint ones.
NOOP in the beginning and end of the file. Should be considered as intersecting with file annotations like PATH and LANGUAGE. The same for LINE annotation (annotates the file line number).
There are two identical NOOPs between the files: one from the beginning of the first file and second from the end of the next file. However, it is easy to fix. We should add document separator symbol and do not concatenate files directly to each other. So we have the end NOOP at [len(file1), len(file1)) and the start NOOP at [len(file1)+1, len(file1))+1.

3.1. That means I should move _check_interval_crossing from AnnotatedData to Annotated class and make it dependent from annotation type.

3.2. There is also another option to store different files in different AnnotatedData classes, so RawData takes only one document. I feel that this is a bad solution. Once we want to add cross-file annotations (like containing directory) it is better to have all files in one place.

lookout/style/format/annotations/annotated_data.py

m09 · 2019-03-05T10:55:00Z

Thanks for the first draft @zurk 👍

Some feedback I can give from UIMA experience:

it's better to KISS and have a single file per AnnotedData. Then if/when we need cross-files info sharing, we can use external objects.
we should store per annotation type and not per offset: the goal is to use as few annotation types as possible in a given piece of code, the architecture of the code should reflect that
for the point about UAST not being applicable to noops: we can query related annotations per covering/covered/equal spans when we want / when it's appropriate, but we can also just point to another annotation in an annotation (this way we specify explicitely which annotation is related):
```
class TokenAnnotation(Annotation):
    UASTNode: Optional[UASTNodeAnnotation] = None
```
then when we iterate TokenAnnotations we can retrieve the UAST node this way.
better not to have ValuedAnnotation and specify a meaningful name for each kind of annotation instead of value
better not to have names but just use the classes: names make the code less typesafe
an annotation containing the complete uast should not be necessary if we have proper UASTNodeAnnotations: only a point to the root UASTNodeAnnotation should be enough (in one of the document wide annotations for example).

zurk · 2019-03-07T19:31:12Z

First feedback addressed, the code becomes more verbose but I and Hugo agreed that it is better to be more explicit then short.
@m09 hare a day off tomorrow and I think to rewrite couple more functions, add a test, add doc, etc.
So make PR closer to be merged.

zurk · 2019-03-11T09:42:59Z

I was busy with #666 so I do all my plans after a review.

m09

Thanks for the improvements @zurk 👍

From the discussion we had, the last big point left to implement seems to be the UAST to annotations transformation (that will allow to ditch many code smells from current codebase).

m09 · 2019-03-14T10:38:23Z

This starts to look ready 👍. It should be better to have parent attribute in UASTAnnotation and remove the UASTParentAnnotation though, it will make the code much simpler!

zurk · 2019-03-15T10:07:30Z

It should be better to have parent attribute in UASTAnnotation and remove the UASTParentAnnotation though

I agree. However, when I start implementing it I see that it is not so easy. The problem is that we add the UAST and parents annotation in different places. It looks like we can move parents annotation code to uast annotation, but we have a uast_fixers that can change some nodes (which is not happening right now, but it is possible). So only second option left: move UAST annotation code to parents annotation. But it breaks AnnotatedData.from_file function because it requires a UAST and it will be really unobvious if you do not add it inside this function.

I hope once v3 python client will be introduced we will have parent feature included into UAST itself and remove UASTParentAnnotation at all. So let's keep it as is for now. I do not see big drawbacks for such solution.

zurk · 2019-03-15T11:49:10Z

I think we can consider this PR as done and ready to merge, so please review carefully and let's merge if it is fine.
I move out all untested and raw changes to the next PRs.

I am also doing a regression test for this part one more time to be sure that I have exactly the same VNodes in the end (zurk@612887e). Here is the commit to run the test (I run make quality-report and check the logs).
So far no errors (13/20).

m09 · 2019-03-15T12:34:23Z

So only second option left: move UAST annotation code to parents annotation. But it breaks AnnotatedData.from_file function because it requires a UAST and it will be really unobvious if you do not add it inside this function.

This option seems perfectly fine to me. I feel we should do it to avoid complication of the rest of the codebase.

zurk · 2019-03-15T13:20:52Z

Ok, I think I know the answer.
Actually, we should not allow changing Node in the uast_fixers. It is too much.
Because it actually breaks the UAST itself because you did not replace the old node with a new node.
That means that the old node is still available when you traverse UAST (and you miss a new one).

My plan is:

Do not return Node from uast_fixers.
Extract parents calculation from _parse_file to separate function.
Add parents annotation to AnnotatedData.from_file

Logic was not broken, code improved and everyone seems to be happy. :)
WDYT @m09?

m09 · 2019-03-15T14:08:45Z

yes it seems like great solution: then the code will be easy to write everywhere else 👍

Signed-off-by: Konstantin Slavnov <[email protected]>

zurk · 2019-03-15T15:23:55Z

Done.

vmarkovtsev

lookout/style/format/annotations/annotated_data.py

lookout/style/format/feature_extractor.py

zurk · 2019-04-16T11:47:41Z

As soon as there is a lot of conflicts I am going to close this one and open new PR.
All feedback from the review will be addressed.

zurk requested a review from m09 February 28, 2019 20:06

zurk force-pushed the annotations branch 4 times, most recently from 42b8603 to a7be057 Compare February 28, 2019 21:52

vmarkovtsev reviewed Mar 5, 2019

View reviewed changes

lookout/style/format/annotations/annotated_data.py Outdated Show resolved Hide resolved

zurk force-pushed the annotations branch 2 times, most recently from 0e55837 to 2239da4 Compare March 7, 2019 19:03

m09 suggested changes Mar 13, 2019

View reviewed changes

zurk force-pushed the annotations branch from 2239da4 to ee7494e Compare March 13, 2019 17:50

zurk force-pushed the annotations branch 2 times, most recently from 14ee11d to ca0638b Compare March 15, 2019 11:39

zurk added 2 commits March 15, 2019 18:01

Add annotations

ad6da8f

Signed-off-by: Konstantin Slavnov <[email protected]>

Separate uast parents calculation

04a55e9

Signed-off-by: Konstantin Slavnov <[email protected]>

zurk force-pushed the annotations branch from ca0638b to 04a55e9 Compare March 15, 2019 15:01

zurk mentioned this pull request Mar 15, 2019

Use Annotations in _parse_vnodes() #708

Closed

vmarkovtsev requested changes Mar 19, 2019

View reviewed changes

zurk closed this Apr 16, 2019

zurk mentioned this pull request Apr 22, 2019

Add annotations to FeatureExtractor #761

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add annotations #654

Add annotations #654

zurk commented Feb 28, 2019

zurk commented Mar 1, 2019

m09 commented Mar 5, 2019 •

edited

Loading

zurk commented Mar 7, 2019

zurk commented Mar 11, 2019

m09 left a comment

m09 commented Mar 14, 2019

zurk commented Mar 15, 2019

zurk commented Mar 15, 2019 •

edited

Loading

m09 commented Mar 15, 2019

zurk commented Mar 15, 2019

m09 commented Mar 15, 2019

zurk commented Mar 15, 2019

vmarkovtsev left a comment

zurk commented Apr 16, 2019

Add annotations #654

Add annotations #654

Conversation

zurk commented Feb 28, 2019

zurk commented Mar 1, 2019

m09 commented Mar 5, 2019 • edited Loading

zurk commented Mar 7, 2019

zurk commented Mar 11, 2019

m09 left a comment

Choose a reason for hiding this comment

m09 commented Mar 14, 2019

zurk commented Mar 15, 2019

zurk commented Mar 15, 2019 • edited Loading

m09 commented Mar 15, 2019

zurk commented Mar 15, 2019

m09 commented Mar 15, 2019

zurk commented Mar 15, 2019

vmarkovtsev left a comment

Choose a reason for hiding this comment

zurk commented Apr 16, 2019

m09 commented Mar 5, 2019 •

edited

Loading

zurk commented Mar 15, 2019 •

edited

Loading