Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Data race if the provider set few times #307

Open
erka opened this issue Nov 26, 2024 · 12 comments
Open

[BUG] Data race if the provider set few times #307

erka opened this issue Nov 26, 2024 · 12 comments
Assignees
Labels
bug Something isn't working Needs Triage

Comments

@erka
Copy link

erka commented Nov 26, 2024

Observed behavior

I have several tests with a similar setup: the provider is created, then configured, the tests are executed, and openfeature.Shutdown() is called for cleanup. However, when these tests are run with the race detector enabled, they consistently fail.

Expected Behavior

No response

Steps to reproduce

package openfeature_test

import (
	"context"
	"sync"
	"testing"

	of "github.com/open-feature/go-sdk/openfeature"
)

func TestDataRace(t *testing.T) {
	of.SetEvaluationContext(of.NewTargetlessEvaluationContext(map[string]any{}))

	err := of.SetProviderAndWait(newRaceProvider())
	if err != nil {
		t.Fatal(err)
	}

	of.Shutdown()

	err = of.SetProviderAndWait(newRaceProvider())
	if err != nil {
		t.Fatal(err)
	}
}

func newRaceProvider() of.FeatureProvider {
	return &raceProvider{
		state: of.NotReadyState,
	}
}

var (
	_ of.FeatureProvider = (*raceProvider)(nil) // ensure implements FeatureProvider
	_ of.StateHandler    = (*raceProvider)(nil) // ensure implements StateHandler
)

type raceProvider struct {
	state of.State
	mu    sync.RWMutex
}

func (p *raceProvider) Metadata() of.Metadata {
	return of.Metadata{
		Name: "racing",
	}
}

func (p *raceProvider) Status() of.State {
	p.mu.RLock()
	defer p.mu.RUnlock()

	return p.state
}

func (p *raceProvider) Init(evalCtx of.EvaluationContext) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.state = of.ReadyState
	return nil
}

func (p *raceProvider) Shutdown() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.state = of.NotReadyState
}

func (p *raceProvider) BooleanEvaluation(ctx context.Context, flag string, defaultValue bool, evalCtx of.FlattenedContext) of.BoolResolutionDetail {
	return of.BoolResolutionDetail{}
}

func (p *raceProvider) StringEvaluation(ctx context.Context, flag string, defaultValue string, evalCtx of.FlattenedContext) of.StringResolutionDetail {
	return of.StringResolutionDetail{}
}

func (p *raceProvider) FloatEvaluation(ctx context.Context, flag string, defaultValue float64, evalCtx of.FlattenedContext) of.FloatResolutionDetail {
	return of.FloatResolutionDetail{}
}

func (p *raceProvider) IntEvaluation(ctx context.Context, flag string, defaultValue int64, evalCtx of.FlattenedContext) of.IntResolutionDetail {
	return of.IntResolutionDetail{}
}

func (p *raceProvider) ObjectEvaluation(ctx context.Context, flag string, defaultValue interface{}, evalCtx of.FlattenedContext) of.InterfaceResolutionDetail {
	return of.InterfaceResolutionDetail{}
}

func (p *raceProvider) Hooks() []of.Hook {
	return []of.Hook{}
}

Create a file with test and run go test -race ./...

@erka erka added bug Something isn't working Needs Triage labels Nov 26, 2024
@beeme1mr
Copy link
Member

Hey @erka, could you please try the test provider and see if that resolves your issue?

@erka
Copy link
Author

erka commented Nov 26, 2024

@beeme1mr I am trying to implement the provider and write some e2e tests for it. Using another provider doesn't help much. The issue is related to the providers which implement StateHandler interface.

@beeme1mr
Copy link
Member

In the next release, stage management moves to the provider. This is a non-breaking change but may address this problem. FYI @toddbaert

@toddbaert
Copy link
Member

There have been substantial changes to the SDK recently like @beeme1mr said, which are yet unreleased.

The changes are significant enough that there's a chance this bug could be resolved. I would have liked to see the released already but there's a few more issues we have to iron out with it, namely the one you pointed out here: #296 (review)

@erka
Copy link
Author

erka commented Dec 10, 2024

Just keep this alive, the release 1.14.0 doesn't fix the issue. go-sdk uses reflect.DeepEqual() to compare provider(s) internally and that doesn't play very well if provider has sync.Mutex probably.

@toddbaert
Copy link
Member

Just keep this alive, the release 1.14.0 doesn't fix the issue. go-sdk uses reflect.DeepEqual() to compare provider(s) internally and that doesn't play very well if provider has sync.Mutex probably.

Thanks for the update @erka ! If you have some ideas for a fix, I'm open to that. Otherwise, we have some pressing issues internally so I probably won't be able to dig into this until next week.

🙏

@blkt
Copy link

blkt commented Dec 13, 2024

Hi @toddbaert, we're having the same issue with our project.
After looking into this, the problem seems to be that this invocation of Shutdown happens while the subsequent SetProviderAndWait is running.
https://github.com/open-feature/go-sdk/blob/main/openfeature/openfeature_api.go#L254-L256

I think that just running v.Shutdown() should solve the problem.

cc @rdimitrov

@warber
Copy link
Contributor

warber commented Dec 14, 2024

I think that just running v.Shutdown() should solve the problem.

@blkt i gave it a quick shot and it seems to fix the problem. At least all tests pass, even with the race detector enabled. Cool.

go-sdk uses reflect.DeepEqual() to compare provider(s) internally and that doesn't play very well if provider has sync.Mutex probably.

@erka yea using reflect.DeepEqual() to compare provider implementations is a problem...
Here are just some ideas:

  • Require the feature flag implementation to include a UniqueName() or Id() method that can be used to internally identify providers. Also not an ideal solution.
  • Serialize provider implementations into a format like JSON, which naturally excludes unexported fields and requires omitting problematic field types like sync.Mutex. However, this approach is also far from ideal.
  • Avoid directly using the FeatureProvider implementation within the SDK. Instead, create a providerReference wrapper during registration, assigning it a self-generated identifier. Subsequently, the rest of the code should work with providerReference values, enabling easy comparison using their IDs. This approach seems like the most promising. The effort strongly depends on how easy it is to change the code to not use FeatureProvider implementations directly.

@erka
Copy link
Author

erka commented Dec 16, 2024

@warber We could add a kind field to providerReference and set it to reflect.TypeOf(FeatureProvider).Kind(). If the kind is ptr, we can perform a direct comparison of FeatureProvider. For struct, fallback to reflect.DeepEqual.

@toddbaert
Copy link
Member

@warber We could add a kind field to providerReference and set it to reflect.TypeOf(FeatureProvider).Kind(). If the kind is ptr, we can perform a direct comparison of FeatureProvider. For struct, fallback to reflect.DeepEqual.

I'll give this a shot this week unless I see a PR from somebody else first.

@toddbaert
Copy link
Member

toddbaert commented Dec 24, 2024

My findings so far:

  • In addition to just doing v.Shutdown() as @blkt suggested here, another simple (but inadequate, IMO) solution is to RLock the api mutex during shutdown:
	go func(forShutdown StateHandler) {
		api.mu.RLock()
		defer api.mu.RUnlock()
		forShutdown.Shutdown()
	}(v)
  • This is probably a bit better than just running shutdown synchronously, but it still means that we won't be able to register providers until the old provider is shut down, which is bad, we want the shutdown not to block anything - it's for side effects such as flushing telemetry, etc. This solution also hides the real issue, IMO...
  • The root problem of this entire issue, as far as I can tell, is that we are using DeepEqual, which is a problem specifically because it reads unexported fields (in this case the state of @erka 's provider), bypassing the mutex in that provider and causing the race (this is confirmed by the trace in the DATA RACE warning). I think this is a more general issue and we should not do it, or we risk a lot of this sort of issue.
  • I think the only real solution is to find a safe way to compare providers without using DeepEqual at all, so that provider's unexported fields are never read (perhaps the way @warber suggested in their last point here)

@toddbaert
Copy link
Member

toddbaert commented Dec 31, 2024

  • I think the only real solution is to find a safe way to compare providers without using DeepEqual at all, so that provider's unexported fields are never read (perhaps the way @warber suggested in their last point here)

I'm going to give this a shot. Somebody stop me if you think I'm barking up the wrong tree.

@toddbaert toddbaert self-assigned this Dec 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Needs Triage
Projects
None yet
Development

No branches or pull requests

5 participants