Skip to content

Commit

Permalink
Add a structured error for non-determinism failures (cadence-workflow…
Browse files Browse the repository at this point in the history
…#1269)

This is being added for a few reasons:

1. it's useful information, and an unstructured error has no way to safely communicate it
2. previously these were un-typed errors, so this custom type can be added without breaking backwards compatibility
3. we want to use it to expose more information about shadow-test failures, which are currently far more opaque than they need to be

Shadow tests are opaque for a lot of reasons that need to be improved, but enhancing the error is a pretty straightforward first step.  And it seems useful in general.

This intentionally does *not* change the `.Error()` text, largely just because I didn't see any reason to do so.  The error strings are reasonably well-known and there doesn't seem to be any major benefit that can be gained by improving them... currently.
That said, these strings *must not* be considered stable in general, and we should consider changing them in the future.
  • Loading branch information
Groxx authored Aug 31, 2023
1 parent c0f5bb0 commit 9fdcd4f
Show file tree
Hide file tree
Showing 6 changed files with 112 additions and 15 deletions.
3 changes: 3 additions & 0 deletions error.go
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,9 @@ type (

// CanceledError returned when operation was canceled.
CanceledError = internal.CanceledError

// NonDeterministicError is returned when a workflow's replay was non-deterministic, and it could not be resumed safely.
NonDeterministicError = internal.NonDeterministicError
)

// ErrNoData is returned when trying to extract strong typed data while there is no data available.
Expand Down
3 changes: 1 addition & 2 deletions internal/client.go
Original file line number Diff line number Diff line change
Expand Up @@ -26,15 +26,14 @@ import (
"fmt"
"time"

"go.uber.org/cadence/internal/common/isolationgroup"

"github.com/opentracing/opentracing-go"
"github.com/uber-go/tally"
"go.uber.org/zap"

"go.uber.org/cadence/.gen/go/cadence/workflowserviceclient"
s "go.uber.org/cadence/.gen/go/shared"
"go.uber.org/cadence/internal/common/auth"
"go.uber.org/cadence/internal/common/isolationgroup"
"go.uber.org/cadence/internal/common/metrics"
)

Expand Down
6 changes: 2 additions & 4 deletions internal/common/isolationgroup/service_wrapper_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -25,14 +25,12 @@ import (
"testing"
"time"

"github.com/stretchr/testify/assert"

"go.uber.org/cadence/.gen/go/cadence/workflowserviceclient"

"github.com/golang/mock/gomock"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/suite"
"github.com/uber/tchannel-go/thrift"

"go.uber.org/cadence/.gen/go/cadence/workflowserviceclient"
"go.uber.org/cadence/.gen/go/cadence/workflowservicetest"
"go.uber.org/cadence/.gen/go/shared"
)
Expand Down
100 changes: 100 additions & 0 deletions internal/error.go
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@ import (
"strings"

"go.uber.org/cadence/.gen/go/shared"
"go.uber.org/cadence/internal/common/util"
)

/*
Expand Down Expand Up @@ -125,6 +126,51 @@ type (
stackTrace string
}

// NonDeterministicError contains some structured data related to a non-deterministic
// replay failure, and is primarily intended for allowing richer error reporting.
//
// WorkflowType, WorkflowID, RunID, TaskList, and DomainName will likely be long-term stable
// and included in some form in future library versions, but the rest of these fields may
// change at any time, or be removed in a future major version change.
NonDeterministicError struct {

// Reason is a relatively free-form description of what kind of non-determinism
// was detected.
//
// You are *strongly* encouraged to not rely on these strings for behavior, only
// explanation, for a few reasons. More will likely appear in the future, they may
// change, and there is little that can be safely decided on in an automated way.
//
// Currently, values roughly match the historical error strings, and are:
// - "missing replay decision" (The error will contain HistoryEventText, as there
// is at least one history event that has no matching replayed decision)
// - "extra replay decision" (The error will contain DecisionText, as there is
// at least one decision from replay that has no matching history event)
// - "mismatch" (Both HistoryEventText and DecisionText will exist, as there
// are issues with both. This was previously shown as "history event is ...,
// replay decision is ..." error text.)
Reason string

WorkflowType string
WorkflowID string
RunID string
TaskList string
DomainName string

// intentionally avoiding "history event" and "decision" names
// because we *do* have types for them, but they are in thrift and should
// not be exposed directly.
// we should consider doing that eventually though, or providing a
// simplified object for richer failure information.

// HistoryEventText contains a String() representation of a history
// event (i.e. previously recorded) that is related to the problem.
HistoryEventText string
// DecisionText contains a String() representation of a replay decision
// event (i.e. created during replay) that is related to the problem.
DecisionText string
}

// ContinueAsNewError contains information about how to continue the workflow as new.
ContinueAsNewError struct {
wfn interface{}
Expand Down Expand Up @@ -419,3 +465,57 @@ func (b ErrorDetailsValues) Get(valuePtr ...interface{}) error {
}
return nil
}

// NewNonDeterminsticError constructs a new *NonDeterministicError.
//
// - reason should be a documented NonDeterminsticError.Reason value
// - info is always required. only a portion of it is used, but it is a convenient
// and currently always-available object.
// - history and decision may each be present or nil at any time
func NewNonDeterminsticError(reason string, info *WorkflowInfo, history *shared.HistoryEvent, decision *shared.Decision) error {
var historyText string
if history != nil {
historyText = util.HistoryEventToString(history)
}
var decisionText string
if decision != nil {
decisionText = util.DecisionToString(decision)
}
return &NonDeterministicError{
Reason: reason,

WorkflowType: info.WorkflowType.Name,
WorkflowID: info.WorkflowExecution.ID,
RunID: info.WorkflowExecution.RunID,
TaskList: info.TaskListName,
DomainName: info.Domain,

HistoryEventText: historyText,
DecisionText: decisionText,
}
}

func (e *NonDeterministicError) Error() string {
switch e.Reason {
case "missing replay decision":
// historical text
return "nondeterministic workflow: " +
"missing replay decision for " + e.HistoryEventText
case "extra replay decision":
// historical text
return "nondeterministic workflow: " +
"extra replay decision for " + e.DecisionText
case "mismatch":
// historical text
return "nondeterministic workflow: " +
"history event is " + e.HistoryEventText + ", " +
"replay decision is " + e.DecisionText
default:
// should not occur in practice, but it's basically fine if it does.
// ideally this should crash in internal builds / tests, to prevent mismatched values.
return fmt.Sprintf(
"unknown reason %q, history event is: %s, replay decision is: %s",
e.Reason, e.HistoryEventText, e.DecisionText,
)
}
}
4 changes: 2 additions & 2 deletions internal/internal_task_handlers.go
Original file line number Diff line number Diff line change
Expand Up @@ -932,7 +932,7 @@ ProcessEvents:
var nonDeterministicErr error
if !skipReplayCheck && !w.isWorkflowCompleted || isReplayTest {
// check if decisions from reply matches to the history events
if err := matchReplayWithHistory(replayDecisions, respondEvents); err != nil {
if err := matchReplayWithHistory(w.workflowInfo, replayDecisions, respondEvents); err != nil {
nonDeterministicErr = err
}
}
Expand All @@ -947,7 +947,7 @@ ProcessEvents:
nonDeterministicErr = panicErr
} else {
// Since we know there is an error, we do the replay check to give more context in the log
replayErr := matchReplayWithHistory(replayDecisions, respondEvents)
replayErr := matchReplayWithHistory(w.workflowInfo, replayDecisions, respondEvents)
w.wth.logger.Error("Ignored workflow panic error",
zap.String(tagWorkflowType, task.WorkflowType.GetName()),
zap.String(tagWorkflowID, task.WorkflowExecution.GetWorkflowId()),
Expand Down
11 changes: 4 additions & 7 deletions internal/workflow_replayer_utils.go
Original file line number Diff line number Diff line change
Expand Up @@ -22,15 +22,13 @@ package internal

import (
"bytes"
"fmt"
"reflect"
"strings"

s "go.uber.org/cadence/.gen/go/shared"
"go.uber.org/cadence/internal/common/util"
)

func matchReplayWithHistory(replayDecisions []*s.Decision, historyEvents []*s.HistoryEvent) error {
func matchReplayWithHistory(info *WorkflowInfo, replayDecisions []*s.Decision, historyEvents []*s.HistoryEvent) error {
di := 0
hi := 0
hSize := len(historyEvents)
Expand Down Expand Up @@ -60,16 +58,15 @@ matchLoop:
}

if d == nil {
return fmt.Errorf("nondeterministic workflow: missing replay decision for %s", util.HistoryEventToString(e))
return NewNonDeterminsticError("missing replay decision", info, e, nil)
}

if e == nil {
return fmt.Errorf("nondeterministic workflow: extra replay decision for %s", util.DecisionToString(d))
return NewNonDeterminsticError("extra replay decision", info, nil, d)
}

if !isDecisionMatchEvent(d, e, false) {
return fmt.Errorf("nondeterministic workflow: history event is %s, replay decision is %s",
util.HistoryEventToString(e), util.DecisionToString(d))
return NewNonDeterminsticError("mismatch", info, e, d)
}

di++
Expand Down

0 comments on commit 9fdcd4f

Please sign in to comment.