Bug fix for inter-node traffic #339

Closed · wants to merge 1 commit into main from bug_fix
Conversation

yuntanghsu
Contributor

@yuntanghsu yuntanghsu commented Dec 14, 2023

In this PR, we do:

  1. Set ReadyToSend to true for flows that don’t need correlation.
  2. Set areCorrelatedFieldsFilled to true for flows that don’t belong to inter-node traffic. For flows that need correlation, areCorrelatedFieldsFilled is set to true once the correlation job is finished.
  3. Ignore or overwrite the correlationRequired record if a non-correlationRequired record with the same flowKey arrives before or after it.
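The first two changes can be sketched as follows. This is a minimal, self-contained illustration, not the real go-ipfix code: the struct and constants are pared-down stand-ins for the types in pkg/intermediate/aggregate.go, and applyCorrelationFlags is a hypothetical helper name.

```go
package main

import "fmt"

// FlowType stands in for the registry flow-type constants; values are illustrative.
type FlowType int

const (
	FlowTypeIntraNode FlowType = iota
	FlowTypeInterNode
	FlowTypeToExternal
)

// AggregationFlowRecord is a pared-down stand-in for the real record struct.
type AggregationFlowRecord struct {
	ReadyToSend               bool
	areCorrelatedFieldsFilled bool
}

// applyCorrelationFlags sketches changes 1 and 2: flows that do not need
// correlation are immediately ready to send, and only inter-node flows leave
// areCorrelatedFieldsFilled unset until the correlation job finishes.
func applyCorrelationFlags(r *AggregationFlowRecord, correlationRequired bool, flowType FlowType) {
	if !correlationRequired {
		r.ReadyToSend = true
	}
	if flowType != FlowTypeInterNode {
		r.areCorrelatedFieldsFilled = true
	}
}

func main() {
	// Inter-node deny record (from PacketIn): sendable, but correlated fields stay unfilled.
	deny := &AggregationFlowRecord{}
	applyCorrelationFlags(deny, false, FlowTypeInterNode)
	fmt.Println(deny.ReadyToSend, deny.areCorrelatedFieldsFilled) // true false

	// Intra-node record: both flags are set right away.
	intra := &AggregationFlowRecord{}
	applyCorrelationFlags(intra, false, FlowTypeIntraNode)
	fmt.Println(intra.ReadyToSend, intra.areCorrelatedFieldsFilled) // true true
}
```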


codecov bot commented Dec 14, 2023

Codecov Report

Merging #339 (f369d83) into main (837050b) will decrease coverage by 0.14%.
The diff coverage is 53.84%.

❗ Current head f369d83 differs from pull request most recent head e7eda4e. Consider uploading reports for the commit e7eda4e to get more accurate results

Additional details and impacted files


@@            Coverage Diff             @@
##             main     #339      +/-   ##
==========================================
- Coverage   72.87%   72.73%   -0.14%     
==========================================
  Files          19       19              
  Lines        2853     2861       +8     
==========================================
+ Hits         2079     2081       +2     
- Misses        602      605       +3     
- Partials      172      175       +3     
Flag               Coverage Δ
integration-tests  52.89% <38.46%> (-0.02%) ⬇️
unit-tests         71.61% <53.84%> (-0.14%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files                          Coverage Δ
pkg/intermediate/aggregate.go  71.93% <53.84%> (-0.68%) ⬇️

Member

@antoninbas antoninbas left a comment


can you rebase main?

// In this case, the record from the conntrack table will be considered for correlation, and ReadyToSend
// stays false until we receive another record via PacketIn from another Node. However, the record from PacketIn
// is not correlationRequired, so we will never send this kind of connection due to this conflict.
aggregationRecord.ReadyToSend = true
Member


I see similar logic later in the function, can we unify?

Contributor Author


Yes. I've moved it to the bottom and set ReadyToSend and areCorrelatedFieldsFilled to true if the flow is not correlationRequired.

Comment on lines 369 to 372
// For inter-node traffic with a deny NP or drop ANP, we may receive the record from the conntrack table first.
// In this case, the record from the conntrack table will be considered for correlation, and ReadyToSend
// stays false until we receive another record via PacketIn from another Node. However, the record from PacketIn
// is not correlationRequired, so we will never send this kind of connection due to this conflict.
Member


I don't fully understand what's going on and I'll probably defer to @heanlan and @dreamtalen, but I do not like that the comment is so specific to the Antrea FE implementation, with mentions of conntrack and PacketIn. While I know that this code is pretty specific to Antrea, this seems too specific.

Contributor Author


I've updated the comment. Thanks.

@yuntanghsu yuntanghsu force-pushed the bug_fix branch 2 times, most recently from 11cda21 to 9d8d3ac on December 15, 2023 at 18:40
@yuntanghsu yuntanghsu marked this pull request as draft December 15, 2023 18:46
Contributor

@heanlan heanlan left a comment


I don't fully understand the code changes. Based on my understanding of the PR description, the conntrack records without deny NP metadata should not be sent out by the FA, right? However, their correlationRequired is true, and they will be marked as ReadyToSend in your code, which contradicts the PR's purpose?

	// For flows that do not need correlation, ReadyToSend should always be true.
	if !correlationRequired {
		aggregationRecord.ReadyToSend = true
	}

// For intra-node and external traffic, areCorrelatedFieldsFilled is always true.
// For inter-node traffic with an allow NP/ANP action, areCorrelatedFieldsFilled will be set to true once the correlation job is finished.
if flowType != registry.FlowTypeInterNode {
aggregationRecord.areCorrelatedFieldsFilled = true
Contributor


It seems like areCorrelatedFieldsFilled is only being used in unit tests? If that's the case, do we still want to keep it? Maybe I missed something.

Contributor Author


It will be set to true at line #355. And areCorrelatedFieldsFilled will also be used in flowaggregator.go in the Antrea repo.

@yuntanghsu
Contributor Author

yuntanghsu commented Dec 15, 2023

I don't fully understand the code changes. Based on my understanding of the PR description, the conntrack records without deny NP metadata should not be sent out by the FA, right? However, their correlationRequired is true, and they will be marked as ReadyToSend in your code, which contradicts the PR's purpose?

No, for a deny NP or drop ANP, we will receive two records (one from PacketIn [correlationRequired false] and one from the conntrack table [correlationRequired true]).
We should set ReadyToSend to true whenever we receive the record from PacketIn. For this kind of connection, we don't need to do correlation.

I think the logic should be:

  1. Do we need to do correlation?
    A. Yes (inter-node allow) [ReadyToSend: false, areCorrelatedFieldsFilled: false]
      a. We receive another record from another Node -> [ReadyToSend: true, areCorrelatedFieldsFilled: true]

    B. No (inter-node deny, external, intra-node)
      a. Intra-node: [ReadyToSend: true, areCorrelatedFieldsFilled: true]
      b. External: [ReadyToSend: true, areCorrelatedFieldsFilled: true]
      c. Inter-node deny: do we receive the record from PacketIn first?
        I. Yes: [ReadyToSend: true, areCorrelatedFieldsFilled: false]. Ignore the record from the conntrack table -> [ReadyToSend: true, areCorrelatedFieldsFilled: false].
        II. No: [ReadyToSend: false, areCorrelatedFieldsFilled: false]. When we receive the record from PacketIn, we overwrite the conntrack record -> [ReadyToSend: true, areCorrelatedFieldsFilled: false].
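The decision tree above can be encoded compactly. This is a sketch under stated assumptions, not the real implementation: classifyFlow and its boolean parameters are hypothetical names, and the real code works on a richer record struct.

```go
package main

import "fmt"

// flowState captures the two flags discussed in this thread.
type flowState struct {
	ReadyToSend               bool
	areCorrelatedFieldsFilled bool
}

// classifyFlow encodes the outline above: only inter-node allowed traffic
// waits for correlation before it is ready to send; intra-node, external,
// and inter-node deny traffic never correlates. `correlated` means the
// matching record from the other Node has already been merged in.
func classifyFlow(interNode, denied, correlated bool) flowState {
	if interNode && !denied {
		// Case A: inter-node allow traffic needs correlation first.
		return flowState{ReadyToSend: correlated, areCorrelatedFieldsFilled: correlated}
	}
	// Case B: no correlation needed. Correlated fields are only ever
	// filled for intra-node/external traffic.
	return flowState{ReadyToSend: true, areCorrelatedFieldsFilled: !interNode}
}

func main() {
	fmt.Println(classifyFlow(true, false, false))  // {false false}: waiting for the peer record
	fmt.Println(classifyFlow(true, false, true))   // {true true}: correlation finished
	fmt.Println(classifyFlow(true, true, false))   // {true false}: inter-node deny
	fmt.Println(classifyFlow(false, false, false)) // {true true}: intra-node or external
}
```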

@yuntanghsu yuntanghsu marked this pull request as ready for review December 15, 2023 19:50
@yuntanghsu yuntanghsu force-pushed the bug_fix branch 2 times, most recently from 0ac4515 to 847c6bc on December 15, 2023 at 20:43
heanlan
heanlan previously approved these changes Dec 15, 2023
@yuntanghsu yuntanghsu marked this pull request as draft December 15, 2023 23:10
@yuntanghsu yuntanghsu force-pushed the bug_fix branch 3 times, most recently from 051c676 to f369d83 on December 16, 2023 at 03:25
@yuntanghsu
Contributor Author

I think I made a mistake.

For inter-node traffic with an egress/ingress NP with action drop, we will receive records from PacketIn and from the conntrack table. For the ingress case, there is no issue, as we will receive the records from both Nodes and these two records are both correlationRequired.
For the egress case, the record from the conntrack table is correlationRequired and the record from PacketIn is not. If the record from the conntrack table arrives at the FA first, that record will wait for correlation at the FA, and its ReadyToSend will stay false forever as it keeps waiting.

For the egress case, we could overwrite the existing record if it is from the conntrack table. But I think it would be enough to change the order in which we append expired records before exporting them from the exporter. The chance that a record from conntrack is exported earlier than a record from PacketIn is then very low.

For these reasons, I think we can close this PR, as these changes might not be really helpful.
@dreamtalen @heanlan What do you think?

In this commit, we do:

1. Set ReadyToSend to true for flows that don’t need correlation.
2. Set areCorrelatedFieldsFilled to true for flows that don’t belong to inter-node traffic. For flows that need correlation, areCorrelatedFieldsFilled is set to true once the correlation job is finished.

Signed-off-by: Yun-Tang Hsu <[email protected]>
@yuntanghsu yuntanghsu marked this pull request as ready for review December 18, 2023 18:31
@antoninbas
Member

@yuntanghsu I want to confirm my understanding with you, because I feel like I am missing something.
For inter-Node flows with a drop policy, my understanding is as follows:

  • egress drop (at source Node): we will only get a record from the source Node. That record is "generated" by a PacketIn event. This connection should never be committed to conntrack in the Antrea ct_zone (not on the source Node and of course not on the destination Node). If we see the connection in conntrack, this sounds like a bug to me.
  • ingress drop (at destination Node): we will get 2 records, one from each Node. the record from the source Node comes from conntrack polling, the record from the destination Node is generated by a PacketIn event. The connection will never be committed to conntrack on the destination Node. The connection will be committed to conntrack on the source Node, but it will never make it to established state.

Let me know if these statements don't match your understanding or observations.

@yuntanghsu
Contributor Author

yuntanghsu commented Dec 18, 2023

egress drop (at source Node): we will only get a record from the source Node. That record is "generated" by a PacketIn event. This connection should never be committed to conntrack in the Antrea ct_zone (not on the source Node and of course not on the destination Node). If we see the connection in conntrack, this sounds like a bug to me

If the traffic is not Service traffic, that's correct. But I observed that we receive records from both PacketIn and the conntrack table for Service traffic (I observed this behavior for the egress case only).

I1218 21:30:37.187849 1 deny_connections.go:115] "New deny connection added" connection={"ID":0,"Timeout":0,"StartTime":"2023-12-18T21:30:37.187791036Z","StopTime":"2023-12-18T21:30:37.187791036Z","LastExportTime":"2023-12-18T21:30:37.187791036Z","IsActive":true,"IsPresent":false,"ReadyToDelete":false,"Zone":0,"Mark":19,"StatusFlag":0,"Labels":null,"LabelsMask":null,"FlowKey":{"SourceAddress":"10.244.0.7","DestinationAddress":"10.244.1.8","Protocol":6,"SourcePort":57252,"DestinationPort":5201},"OriginalPackets":1,"OriginalBytes":60,"SourcePodNamespace":"testflowaggregator-0e73pgnx","SourcePodName":"perftest-a","DestinationPodNamespace":"","DestinationPodName":"","DestinationServicePortName":"testflowaggregator-0e73pgnx/perftest-e:","OriginalDestinationAddress":"10.96.176.70","OriginalDestinationPort":9999,"IngressNetworkPolicyName":"","IngressNetworkPolicyNamespace":"","IngressNetworkPolicyType":0,"IngressNetworkPolicyRuleName":"","IngressNetworkPolicyRuleAction":0,"EgressNetworkPolicyName":"test-flow-aggregator-anp-egress-drop","EgressNetworkPolicyNamespace":"testflowaggregator-0e73pgnx","EgressNetworkPolicyType":2,"EgressNetworkPolicyRuleName":"test-egress-rule-name","EgressNetworkPolicyRuleAction":2,"PrevPackets":0,"PrevBytes":0,"ReversePackets":0,"ReverseBytes":0,"PrevReversePackets":0,"PrevReverseBytes":0,"TCPState":"","PrevTCPState":"","FlowType":0,"EgressName":"","EgressIP":""}

I1218 21:30:38.106123 1 conntrack_connections.go:284] "New Antrea flow added" connection={"ID":2467408996,"Timeout":119,"StartTime":"2023-12-18T21:30:37.187206191Z","StopTime":"2023-12-18T21:30:38.097239785Z","LastExportTime":"2023-12-18T21:30:37.187206191Z","IsActive":true,"IsPresent":true,"ReadyToDelete":false,"Zone":65520,"Mark":19,"StatusFlag":424,"Labels":null,"LabelsMask":null,"FlowKey":{"SourceAddress":"10.244.0.7","DestinationAddress":"10.244.1.8","Protocol":6,"SourcePort":57252,"DestinationPort":5201},"OriginalPackets":1,"OriginalBytes":60,"SourcePodNamespace":"testflowaggregator-0e73pgnx","SourcePodName":"perftest-a","DestinationPodNamespace":"","DestinationPodName":"","DestinationServicePortName":"testflowaggregator-0e73pgnx/perftest-e:","OriginalDestinationAddress":"10.96.176.70","OriginalDestinationPort":9999,"IngressNetworkPolicyName":"","IngressNetworkPolicyNamespace":"","IngressNetworkPolicyType":0,"IngressNetworkPolicyRuleName":"","IngressNetworkPolicyRuleAction":0,"EgressNetworkPolicyName":"","EgressNetworkPolicyNamespace":"","EgressNetworkPolicyType":0,"EgressNetworkPolicyRuleName":"","EgressNetworkPolicyRuleAction":0,"PrevPackets":0,"PrevBytes":0,"ReversePackets":0,"ReverseBytes":0,"PrevReversePackets":0,"PrevReverseBytes":0,"TCPState":"SYN_SENT","PrevTCPState":"","FlowType":0,"EgressName":"","EgressIP":""}

@antoninbas
Member

@yuntanghsu thanks, you are correct. For Service traffic, we will do DNAT (which commits to CT) before policy enforcement in Antrea. Although I wonder if this is the best approach, maybe I will bring it up at the next community meeting.

For ingress drop, I assume my understanding is correct?

With regards to your PR, I don't know what the best approach is. To me it seems that we should avoid exporting 2 separate records from the same Node in the FlowExporter in the first place. It is not right IMO to have the same connection in both connection stores. Maybe @heanlan and @dreamtalen have some inputs here.
I didn't give it too much thought, but would it make sense to omit connections that were never established from the connection store? For the ingress drop case, it would mean that we would only get 1 record (from the destination Node, triggered by PacketIn). It would make things a bit more symmetric in a way, and my assumption is that aggregation doesn't provide much value in this case.

@yuntanghsu
Contributor Author

yuntanghsu commented Dec 18, 2023

For ingress drop, I assume my understanding is correct?

Yes, that's correct for non-service and service traffic.

I agree with you. I think the best approach is to avoid exporting the conntrack record in this case.
However, I found this is kind of difficult to do, as we don't know whether a record from conntrack belongs to an allow connection or a deny connection until we see the record from PacketIn.

I think another acceptable solution is changing the order in which we append expired records before exporting them from our exporter. We should prioritize the records in the deny connection store; then the chance that a record from conntrack is exported earlier than a record from PacketIn is very low.
I already made this change in antrea-io/antrea#5770. The test is stable now.
Maybe we can release a new version of the IPFIX collector to unblock antrea-io/antrea#5770?

I didn't give it too much thought, but would it make sense to omit connections that were never established from the connection store?

I think it's better to keep it? This kind of connection is still useful for detecting DoS attacks?
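The ordering fix discussed above (prioritizing the deny connection store when appending expired records) can be sketched as follows. The type and function names here are illustrative stand-ins, not the actual Antrea FlowExporter API.

```go
package main

import "fmt"

// expiredRecord is a stand-in for an expired flow record. FromDenyStore marks
// records generated via PacketIn (deny connection store) as opposed to
// records obtained by conntrack polling.
type expiredRecord struct {
	FlowKey       string
	FromDenyStore bool
}

// appendExpired sketches the reordering idea: deny-store records are appended
// ahead of conntrack records, so for the same connection the
// non-correlationRequired record reaches the aggregator first and the
// conntrack record can be ignored or overwritten there.
func appendExpired(deny, conntrack []expiredRecord) []expiredRecord {
	out := make([]expiredRecord, 0, len(deny)+len(conntrack))
	out = append(out, deny...)      // deny-store records first
	out = append(out, conntrack...) // conntrack records after
	return out
}

func main() {
	deny := []expiredRecord{{FlowKey: "10.244.0.7->10.244.1.8:5201", FromDenyStore: true}}
	ct := []expiredRecord{{FlowKey: "10.244.0.7->10.244.1.8:5201", FromDenyStore: false}}
	for _, r := range appendExpired(deny, ct) {
		fmt.Println(r.FlowKey, r.FromDenyStore)
	}
}
```

With this ordering, a conntrack record for a denied connection is very unlikely to be exported before its PacketIn counterpart, which is the property the alternative fix in antrea-io/antrea#5770 relies on.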

@antoninbas
Member

@heanlan could you release vmware/go-ipfix 0.9.0, with Yun-Tang's improvements to the "reference" IPFIX collector

@dreamtalen

@heanlan could you release vmware/go-ipfix 0.9.0, with Yun-Tang's improvements to the "reference" IPFIX collector

Anlan is on PTO this week; I can help do a release.

@yuntanghsu
Contributor Author

yuntanghsu commented Dec 19, 2023

Salvatore already released vmware/go-ipfix 0.8.2:
https://github.com/vmware/go-ipfix/releases/tag/v0.8.2

@antoninbas
Member

@dreamtalen can you do the changelog PR?

@yuntanghsu should we close this PR then?

@dreamtalen

@dreamtalen can you do the changelog PR?

Yanjun has finished that :)

@yuntanghsu yuntanghsu closed this Dec 19, 2023
@yuntanghsu
Contributor Author

Closing this PR, as we made an alternative fix in antrea-io/antrea#5770.
