-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add per-process network data to system/process #37121
base: main
Are you sure you want to change the base?
Add per-process network data to system/process #37121
Conversation
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
|
for _, key := range keys { | ||
found, _ := gcPidFetch(ctx, int32(key)) | ||
if !found { | ||
keysToDelete = append(keysToDelete, key) | ||
} | ||
} | ||
if len(keysToDelete) > 0 { | ||
track.dataMut.Lock() | ||
for _, key := range keysToDelete { | ||
delete(track.procData, key) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see a possible scenario here when:
- The map is locked and copied
- We identify a key for a PID that does not exist anymore
- Right before the moment when we delete expired keys a new process started with the same re-used PID and it was added to the map overwriting the previous record.
- We delete a record for an existing/valid process.
I think we cannot separate locks here because of that. I'm not sure how often a PID re-use happens though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, yah. In my experience, PID reuse does happen on some containerized workflows. I'm a bit reluctant to carry the lock through that entire operation, since the gcPidFetch
operation will take some amount of time, and can potentially bottleneck things. The worst case here is that the data for the "reused" PID is off by a few bytes, as it will just get re-added the next time it reports data. Still not a great solution, though.
Other idea might be to add some kind of updated
tag to the PacketData
object, to check if the updated time is newer than the "check" time.
if track.loopWaiter != nil { | ||
track.loopWaiter <- struct{}{} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really necessary for testing? It looks like we have test code in the production code, I'd rather avoid this if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm, maybe we can wrap the ticker.C
somehow, so we only have one channel that we can put in sort of a stop-start mode for testing...
❕ Build Aborted
Expand to view the summary
Build stats
Test stats 🧪
Steps errors
Expand to view the steps failures
|
metricbeat/docs/fields.asciidoc
Outdated
Incoming network counters | ||
|
||
|
||
*`system.process.network.incoming.tcp`*:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be using the best matching field from ECS for new fields.
In this case I would follow https://www.elastic.co/guide/en/ecs/current/ecs-network.html#field-network-direction
if err != nil { | ||
return nil, fmt.Errorf("error creating network tracker: %w", err) | ||
} | ||
err = m.networkMonitoring.Track(context.Background()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is context.Background correct? Is the only way for this to be cancelled the process exiting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I added a Close()
implementation to the metricset, but since the reporter interface doesn't give us any kind of parent context, not sure what use we would get from our own.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should there be two ways to stop the Tracker? It seems a little unclear with the API accepting a context.Context
and having a Stop()
And BTW (for another day), adding a Context() context.Context
to MetricSet
sounds like reasonable thing to do.
💔 Build Failed
Expand to view the summary
Build stats
Pipeline error
❕ Flaky test reportNo test was executed to be analysed. 🤖 GitHub commentsExpand to view the GitHub comments
To re-run your PR in the CI, just comment with:
|
💔 Tests Failed
Expand to view the summary
Build stats
Test stats 🧪
Test errors
Expand to view the tests failures
|
💔 Tests Failed
Expand to view the summary
Build stats
Test stats 🧪
Test errors
Expand to view the tests failures
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These messages coming out of the Packetbeat watcher are going to be very noisy. We should probably modify that code to be less noisy. Any short-lived process like curl is going to cause this message.
{"log.level":"info","@timestamp":"2023-11-28T16:21:31.921-0500","log.origin":{"file.name":"procs/procs_linux.go","file.line":100},"message":"FindSocketsOfPid: open /proc/11118/fd: no such file or directory","ecs.version":"1.6.0"}
I think you would also get this message for any process that has its own network namespace, which is going to be most containers.
if err != nil { | ||
return nil, fmt.Errorf("error creating network tracker: %w", err) | ||
} | ||
err = m.networkMonitoring.Track(context.Background()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should there be two ways to stop the Tracker? It seems a little unclear with the API accepting a context.Context
and having a Stop()
And BTW (for another day), adding a Context() context.Context
to MetricSet
sounds like reasonable thing to do.
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
//go:build integration_dev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC this isn't a tag that we ever run on CI. Perhaps this file was not meant to be committed? The 10 minute sleep is another indicator that perhaps this was just for your local testing purposes.
@andrewkroh regarding the messages coming from packetbeat's process watcher---should we block this PR on a change to make that less noisy? |
❕ Build Aborted
Expand to view the summary
Build stats
Test stats 🧪
Test errors
Expand to view the tests failures
|
I think you could take that up in a separate PR. You should consider documenting the limitations of this to prevent surprised users. Like document the visibility limitations relating to short-lived processes and network namespaces (this one I'm assuming so do verify). |
💔 Build Failed
Expand to view the summary
Build stats
Pipeline error
❕ Flaky test reportNo test was executed to be analysed. 🤖 GitHub commentsExpand to view the GitHub comments
To re-run your PR in the CI, just comment with:
|
This pull request is now in conflicts. Could you fix it? 🙏
|
💔 Tests Failed
Expand to view the summary
Build stats
Test stats 🧪
Test errors
Expand to view the tests failures
|
❕ Build Aborted
Expand to view the summary
Build stats
🤖 GitHub commentsExpand to view the GitHub comments
To re-run your PR in the CI, just comment with:
|
❕ Build Aborted
Expand to view the summary
Build stats
🤖 GitHub commentsExpand to view the GitHub comments
To re-run your PR in the CI, just comment with:
|
❕ Build Aborted
Expand to view the summary
Build stats
Test stats 🧪
🤖 GitHub commentsExpand to view the GitHub comments
To re-run your PR in the CI, just comment with:
|
💔 Build Failed
Expand to view the summary
Build stats
Pipeline error
❕ Flaky test reportNo test was executed to be analysed. 🤖 GitHub commentsExpand to view the GitHub comments
To re-run your PR in the CI, just comment with:
|
💔 Build Failed
Expand to view the summary
Build stats
Test stats 🧪
Steps errors
Expand to view the steps failures
|
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
|
|
This pull request is now in conflicts. Could you fix it? 🙏
|
Proposed commit message
Addresses part of #7461 by adding a series of fields that report network usage in bytes for a process.
This ended up being a tad more complex than I had first thought, as there's two moving pieces.
/proc/net/tcp
andproc/net/udp
to monitor for network connections, then match the connection tuple to an inode, which can then be mapped to a process.The latter part is accomplished by using packetbeat's
ProcessWatcher
component.Getting this to work as-is has required a few tradeoffs:
AF_PACKET
, which is linux-only. I'm not sure if there's a "clever" way to swap between the two, as the libpcap dependency is introduced when the package is imported.ProcessWatcher
does not implement ICMP, meaning we're not tracking byte counts for any ICMP network traffic, just TCP and UDPChecklist
CHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.How to test this PR locally
pull down and build, set the
track_network_data
flag. Simply check incoming data for network trafficAlternatively, perform a large network operation, for example, download something with
wget
, and ensure the process is reported with the correct amount of traffic.Relates Metricbeat doesnt give network io and disk io per process #7461
Performance
I ran two tests with this metricset running under agent, then gathered a CPU profile via agent diagnostics. One test ran while downloading a 20GB file (the wikipedia archive at https://dumps.wikimedia.org/enwiki/20231120/), the other ran with just "ambient" network I/O.
In both cases, the
Tracker
thread is responsible for around 50% of agent's total CPU time. The majority of this time is spent in syscalls;This seems to be coming from packetbeat's
ProcessWatcher
:The processWatcher should cache results, so I'm a little baffled why this is using so much CPU time. Will continue investigating.
EDIT: The issue has something to do with caching in the ProcessWatcher. The
findProc
method will attempt to look up a address:port combination, and refresh its own mapping if it can't find it. That refresh is where we're getting hit. However, after some instrumenting, it looks like we're trying to look up the same address:port combination multiple times, which means that the internal map isn't getting updated when it should. Continuing to investigate.EDIT2: issue here: #37266