-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve IPFIX collector #338
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #338 +/- ##
==========================================
- Coverage 72.87% 72.73% -0.15%
==========================================
Files 19 19
Lines 2853 2853
==========================================
- Hits 2079 2075 -4
- Misses 602 604 +2
- Partials 172 174 +2
Flags with carried forward coverage won't be shown. Click here to find out more. |
w.Write(jsonData) | ||
} | ||
|
||
func resetRecordHandler(w http.ResponseWriter, r *http.Request) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, but you should request the HTTP method to be POST in this handler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should this method be POST?
Do you think we still need this Handler. After disabling printing the log, the retrieve time is decreased from ~4s to ~80ms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@antoninbas
I think this concern hasn't been solved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry I didn't see the question before
It should be POST because it does a mutation the server. That's an HTTP convention. You can lookup which HTTP verbs are most appropriate for different type of APIs.
Do you think we still need this Handler. After disabling printing the log, the retrieve time is decreased from ~4s to ~80ms.
I thought you wrote 0.8s in the PR description, which would be 800ms?
Regardless, let's add the reset method, it's easy to do and if we don't end up using it in Antrea tests, so be it.
It's easier than opening a new PR in this repo later and publishing a new collector image tag...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0.8s to 4s will be the range when we use "kubectl logs" to get the logs.
Under the current changes, this time will be approximately 80ms regardless of the number of records inside the IPFIX collector.
In this commit, we do: 1. Changed the order where we append expired records before exporting them from our exporter. For inter-node traffic with egress/ingress np with action drop, we will receive records from PacketIn and the conntrack table. If the record from the conntrack table is exported first, then the record will need to do correlation at FA. From the egress case, there is no issue as we will receive the records from both nodes. But the record won't be sent to the collector as it keeps waiting to do correlation if both records come from the same node. Similar approach is added in vmware/go-ipfix#338 as well. 2. Add check to verify if Flow Exporters can successfully resolve the Flow Aggregator Service address before sending traffic. 3. Add check to verify if Flow Aggregator can successfully connect to the ClickHouse before sending traffic. 4. Add labels to External subtest to filter useless logs from the IPFIX collector Pod. 5. Confirm the correct addition of a label to a specific Pod after updating the Pod. 6. Adjust the flow-visibility end-to-end test by disabling the octetDeltaCount check. This modification is necessary because, when the dual-stack cluster is enabled, the time taken to retrieve logs from the IPFIX collector Pod is significantly longer (around 4 seconds). In the e2e test, we regularly checked the logs every 500 milliseconds to ensure that we didn't receive the last record (where octetDeltaCount is 0). However, due to the delay, the PollImmediately() function doesn't execute every 500 milliseconds. Therefore, we have removed the octetDeltaCount check and, instead, filter out all records with octetDeltaCount=0 when retrieving records from the IPFIX collector Pod. 7. Use new image from go-ipfix PR ( vmware/go-ipfix#338). We improve the IPFIX collector by: a. Disable printing records whenever we receive it. Instead, we store records in a string array. b. Add http listener and handler to receive request to return or reset records. In this way, we can reduce the retrieving log time from ~4s to ~80ms when we have ~1900 records inside it. Signed-off-by: Yun-Tang Hsu <[email protected]>
4d30427
to
071a3eb
Compare
In this commit, we do: 1. Changed the order where we append expired records before exporting them from our exporter. For inter-node traffic with egress/ingress np with action drop, we will receive records from PacketIn and the conntrack table. If the record from the conntrack table is exported first, then the record will need to do correlation at FA. From the egress case, there is no issue as we will receive the records from both nodes. But the record won't be sent to the collector as it keeps waiting to do correlation if both records come from the same node. Similar approach is added in vmware/go-ipfix#338 as well. 2. Add check to verify if Flow Exporters can successfully resolve the Flow Aggregator Service address before sending traffic. 3. Add check to verify if Flow Aggregator can successfully connect to the ClickHouse before sending traffic. 4. Add labels to External subtest to filter useless logs from the IPFIX collector Pod. 5. Confirm the correct addition of a label to a specific Pod after updating the Pod. 6. Adjust the flow-visibility end-to-end test by disabling the octetDeltaCount check. This modification is necessary because, when the dual-stack cluster is enabled, the time taken to retrieve logs from the IPFIX collector Pod is significantly longer (around 4 seconds). In the e2e test, we regularly checked the logs every 500 milliseconds to ensure that we didn't receive the last record (where octetDeltaCount is 0). However, due to the delay, the PollImmediately() function doesn't execute every 500 milliseconds. Therefore, we have removed the octetDeltaCount check and, instead, filter out all records with octetDeltaCount=0 when retrieving records from the IPFIX collector Pod. 7. Use new image from go-ipfix PR ( vmware/go-ipfix#338). We improve the IPFIX collector by: a. Disable printing records whenever we receive it. Instead, we store records in a string array. b. Add http listener and handler to receive request to return or reset records. In this way, we can reduce the retrieving log time from ~4s to ~80ms when we have ~1900 records inside it. Signed-off-by: Yun-Tang Hsu <[email protected]>
071a3eb
to
7ca0084
Compare
7ca0084
to
ff5363a
Compare
In this commit, we do: 1. Changed the order where we append expired records before exporting them from our exporter. For inter-node traffic with egress/ingress np with action drop, we will receive records from PacketIn and the conntrack table. If the record from the conntrack table is exported first, then the record will need to do correlation at FA. From the egress case, there is no issue as we will receive the records from both nodes. But the record won't be sent to the collector as it keeps waiting to do correlation if both records come from the same node. Similar approach is added in vmware/go-ipfix#338 as well. 2. Add check to verify if Flow Exporters can successfully resolve the Flow Aggregator Service address before sending traffic. 3. Add check to verify if Flow Aggregator can successfully connect to the ClickHouse before sending traffic. 4. Add labels to External subtest to filter useless logs from the IPFIX collector Pod. 5. Confirm the correct addition of a label to a specific Pod after updating the Pod. 6. Adjust the flow-visibility end-to-end test by disabling the octetDeltaCount check. This modification is necessary because, when the dual-stack cluster is enabled, the time taken to retrieve logs from the IPFIX collector Pod is significantly longer (around 4 seconds). In the e2e test, we regularly checked the logs every 500 milliseconds to ensure that we didn't receive the last record (where octetDeltaCount is 0). However, due to the delay, the PollImmediately() function doesn't execute every 500 milliseconds. Therefore, we have removed the octetDeltaCount check and, instead, filter out all records with octetDeltaCount=0 when retrieving records from the IPFIX collector Pod. 7. Use new image from go-ipfix PR ( vmware/go-ipfix#338). We improve the IPFIX collector by: a. Disable printing records whenever we receive it. Instead, we store records in a string array. b. Add http listener and handler to receive request to return or reset records. In this way, we can reduce the retrieving log time from ~4s to ~80ms when we have ~1900 records inside it. Signed-off-by: Yun-Tang Hsu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I don't remember how tagging works for the collector image. @heanlan @dreamtalen will we need a new release of go-ipfix? |
Yes, we need a new release |
ff5363a
to
fa19580
Compare
In this commit, we do: 1. We disable printing records whenever we receive it. Instead, we store records in a string array. 2. Add http listener and handler to receive request to return or reset records. Signed-off-by: Yun-Tang Hsu <[email protected]>
fa19580
to
5a11b76
Compare
1. Disable printing records whenever we receive them. Instead, we store records in a string array. 2. Add http listener and handler to receive request to return or reset records. Signed-off-by: Yun-Tang Hsu <[email protected]>
1. Disable printing records whenever we receive them. Instead, we store records in a string array. 2. Add http listener and handler to receive request to return or reset records. Signed-off-by: Yun-Tang Hsu <[email protected]>
In this PR:
The reason we do this change is because we want to reduce the time we use to retrieve logs.
Previously, using
kubectl logs
will take lots of time to get the logs. And the time depends on the number of records inside the IPFIX collector Pod. The time range is ~0.8s to 4s. After above changes, the time is reduced to ~80ms.Functionality has been tested in this PR: https://github.com/antrea-io/antrea/actions/runs/7201229562/job/19617317869?pr=5770