Potential Memory Leak in Micrometer Tracing 1.4 #974

Open

rethab opened this issue Feb 17, 2025 · 2 comments

rethab commented Feb 17, 2025

Hi,

After upgrading to Spring Boot 3.4, we have noticed a memory leak. I can reproduce it by running performance tests in a production-like environment for about 30 minutes, but I cannot reproduce it on my local machine, so please accept my apologies in advance for not providing a reproducer. Perhaps you can give me some hints and then I can dig further on my own and provide one.

The application is based on Spring Cloud Gateway with Spring Boot 3.4.2 and Spring Cloud 2024.0.0. I can make the issue go away by force-downgrading both io.micrometer:micrometer-tracing-bridge-brave and io.micrometer:micrometer-tracing to 1.3.5 (we were on Spring Boot 3.3 before). I did this by forcing that version in my build and providing custom beans for BraveTracer and BaggagePropagation.FactoryBuilder, because the two versions are not compatible (a new constructor was added).

I have captured a heap dump (can't share it unfortunately, but happy to provide pieces), which points to io.micrometer.observation.SimpleObservation:

[image: heap dump screenshot]

Shortest path:

[image]

Incoming references:

[image]

Furthermore, when investigating these observations, we found that all the metrics in them were of a specific type. We collect many kinds of metrics in several different ways, but only the ones we collect via Micrometer.observation seem to show up there. Importantly, when we remove these metrics, the memory leak also goes away.

This is how we collect them:

@Override
public Mono<String> get(String key) {
    return redisTemplate.opsForValue().get(fullKey(key)).transform(addDurationObservation(METHOD_GET));
}

// Wraps the Mono so that reactor-core's Micrometer support records an Observation
// for the subscription, tagged with the service and method.
private <T> Function<Mono<T>, Mono<T>> addDurationObservation(String method) {
    return mono -> mono.name(DURATION_METRIC_NAME)
            .tag(TAG_SERVICE, serviceId)
            .tag(TAG_METHOD, method)
            .tap(Micrometer.observation(observationRegistry));
}

Also, we get tons of warnings in the logs like the following:

Warning message example
Observation <{
    name=redis.cache.latency(redis.cache.latency), 
    error=null,
    context=
    name='redis.cache.latency', 
    contextualName='redis.cache.latency', 
    error='null', 
    lowCardinalityKeyValues=[method='get', reactor.status='completed', reactor.type='Mono', serviceId='LoginWebClientBean'], 
    highCardinalityKeyValues=[], 
    map=[
        class io.micrometer.core.instrument.Timer$Sample='io.micrometer.core.instrument.Timer$Sample@6db2ad5d', 
        class io.micrometer.tracing.handler.TracingObservationHandler$TracingContext='TracingContext{span=679b469bcb6ed8a9787039b7703fb61b/027e17a438ddf74d}', 
        class io.micrometer.core.instrument.LongTaskTimer$Sample='SampleImpl{duration(seconds)=-1.0E-9, duration(nanos)=-1.0, startTimeNanos=3342086237708105}'
    ], 
    parentObservation={
        name=spring.security.filterchains(security filterchain after), 
        error=null, 
        context=name='spring.security.filterchains', 
        contextualName='security filterchain after', 
        error='null', 
        lowCardinalityKeyValues=[spring.security.filterchain.position='13', spring.security.filterchain.size='13', spring.security.reached.filter.name='ServerWebExchangeReactorContextWebFilter', spring.security.reached.filter.section='after'], 
        highCardinalityKeyValues=[], 
        map=[
            class io.micrometer.core.instrument.Timer$Sample='io.micrometer.core.instrument.Timer$Sample@20ce500d', 
            class io.micrometer.tracing.handler.TracingObservationHandler$TracingContext='TracingContext{span=679b469bcb6ed8a9787039b7703fb61b/799bf843490626c5}', 
            class io.micrometer.core.instrument.LongTaskTimer$Sample='SampleImpl{duration(seconds)=-1.0E-9, duration(nanos)=-1.0, startTimeNanos=3342086239679203}'
        ], 
        parentObservation={
            name=http.server.requests(http get), 
            error=null, context=name='http.server.requests', 
            contextualName='http get', 
            error='null', 
            lowCardinalityKeyValues=[client.id='frontend', exception='none', method='GET', outcome='CLIENT_ERROR', status='401', uri='UNKNOWN'], 
            highCardinalityKeyValues=[http.url='/user'], 
            map=[
                class io.micrometer.core.instrument.Timer$Sample='io.micrometer.core.instrument.Timer$Sample@665e5579', 
                class io.micrometer.tracing.handler.TracingObservationHandler$TracingContext='TracingContext{span=679b469bcb6ed8a9787039b7703fb61b/787039b7703fb61b}', 
                class io.micrometer.core.instrument.LongTaskTimer$Sample='SampleImpl{duration(seconds)=-1.0E-9, duration(nanos)=-1.0, startTimeNanos=3342086234182238}'
            ],
            parentObservation=null
        }
    }
}> 

to which we're restoring is not the same as the one set as this scope's parent observation

<{
    name=redis.cache.latency(redis.cache.latency), 
    error=null, 
    context=name='redis.cache.latency', 
    contextualName='redis.cache.latency', 
    error='null', 
    lowCardinalityKeyValues=[method='get', reactor.status='completed', reactor.type='Mono', serviceId='LoginWebClientBean'], 
    highCardinalityKeyValues=[], 
    map=[
        class io.micrometer.core.instrument.Timer$Sample='io.micrometer.core.instrument.Timer$Sample@6433cb08', 
        class io.micrometer.tracing.handler.TracingObservationHandler$TracingContext='TracingContext{span=5d7ed346ab277fea4f3a20599dce3820/0b6e3e4ddc8eed49}', 
        class io.micrometer.core.instrument.LongTaskTimer$Sample='SampleImpl{duration(seconds)=-1.0E-9, duration(nanos)=-1.0, startTimeNanos=3342090767706628}'
    ], 
    parentObservation={
        name=spring.security.filterchains(null), 
        error=null, 
        context=name='spring.security.filterchains', 
        contextualName='null', 
        error='null', 
        lowCardinalityKeyValues=[spring.security.filterchain.position='0', spring.security.filterchain.size='0', spring.security.reached.filter.name='none', spring.security.reached.filter.section='before'], 
        highCardinalityKeyValues=[], 
        map=[
            class io.micrometer.core.instrument.Timer$Sample='io.micrometer.core.instrument.Timer$Sample@3f6d09bc', 
            class io.micrometer.tracing.handler.TracingObservationHandler$TracingContext='TracingContext{span=5d7ed346ab277fea4f3a20599dce3820/064d8ada01512255}', 
            class io.micrometer.core.instrument.LongTaskTimer$Sample='SampleImpl{duration(seconds)=0.001866874, duration(nanos)=1866874.0, startTimeNanos=3342090766784193}'
        ], 
        parentObservation={
            name=http.server.requests(null), 
            error=null, 
            context=name='http.server.requests', 
            contextualName='null', 
            error='null', 
            lowCardinalityKeyValues=[client.id='schema', exception='none', method='GET', outcome='SUCCESS', status='200', uri='UNKNOWN'], 
            highCardinalityKeyValues=[http.url='/user/preference'], 
            map=[class io.micrometer.core.instrument.Timer$Sample='io.micrometer.core.instrument.Timer$Sample@e7b3687', class io.micrometer.tracing.handler.TracingObservationHandler$TracingContext='TracingContext{span=5d7ed346ab277fea4f3a20599dce3820/853e50085c3180f2}', class io.micrometer.core.instrument.LongTaskTimer$Sample='SampleImpl{duration(seconds)=0.003921056, duration(nanos)=3921056.0, startTimeNanos=3342090764743947}'], 
            parentObservation=null
        }
    }
}>

Most likely a manually created Observation has a scope opened that was never closed. This may lead to thread polluting and memory leaks.

My suspicion is this PR 😜 #688

If anybody could give me some hints on how to further investigate, that would be much appreciated. Thanks!
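
For reference, this is my (possibly wrong) understanding of the scope contract the warning refers to, as a minimal sketch rather than our actual code; the class and method names are made up:

import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;

// Minimal sketch (not our production code) of the scope contract the warning is about, as I
// understand it: a scope opened with openScope() must be closed on the same thread, and scopes
// must be closed in reverse order of opening. An unclosed scope keeps the Observation reachable
// from a ThreadLocal, which would match SimpleObservation instances piling up in the heap.
class ScopeContractSketch {

    void timedWork(ObservationRegistry registry) {
        Observation observation = Observation.start("redis.cache.latency", registry);
        try (Observation.Scope ignored = observation.openScope()) {
            // work that should see this observation as the "current" one
        } finally {
            observation.stop();
        }
    }
}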


rethab commented Feb 18, 2025

FYI, I have added the logs to the original message.

@marcingrzejszczak
Contributor

So I have no idea why this is happening, but I understand what it means. Most likely one of the observations is put in scope and then the scope is never closed. With so many moving pieces it's hard to guess why: Redis, Micrometer, Tracing, Security and WebFlux could all be the culprit...

Is there an option to, e.g., remove Spring Security from the classpath and see if the problem is still there? That would help us narrow down the most probable pair of culprits.
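
If pulling the dependency out of the build is too invasive, a rough approximation for a single test run could be to disable the reactive security auto-configuration instead. Note that this only switches off the security filter chain and its observations rather than actually removing the jar from the classpath, and the application class name below is just a placeholder:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.autoconfigure.security.reactive.ReactiveSecurityAutoConfiguration;

// Disables Spring Security's reactive auto-configuration (and with it the
// spring.security.filterchains observations) while leaving the jar on the classpath.
@SpringBootApplication(exclude = ReactiveSecurityAutoConfiguration.class)
public class GatewayApplication {

    public static void main(String[] args) {
        SpringApplication.run(GatewayApplication.class, args);
    }
}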
