-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(evpn-bridge): fix system behaviour for pending objects #391
base: main
Are you sure you want to change the base?
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -31,8 +31,9 @@ | |||
objectType string | ||||
resourceVersion string | ||||
subIndex int | ||||
retryTimer time.Duration | ||||
subs []*eventbus.Subscriber | ||||
// systemTimer is used only when we want to retry a Task due to unavailability of the Subscriber or not receiving a TaskStatus | ||||
systemTimer time.Duration | ||||
subs []*eventbus.Subscriber | ||||
} | ||||
|
||||
// TaskStatus holds info related to the status that has been received | ||||
|
@@ -60,6 +61,7 @@ | |||
objectType: objectType, | ||||
resourceVersion: resourceVersion, | ||||
subIndex: 0, | ||||
systemTimer: 1 * time.Second, | ||||
subs: subs, | ||||
} | ||||
} | ||||
|
@@ -94,13 +96,18 @@ | |||
// StatusUpdated creates a task status and sends it for handling | ||||
func (t *TaskManager) StatusUpdated(name, objectType, resourceVersion, notificationID string, dropTask bool, component *common.Component) { | ||||
taskStatus := newTaskStatus(name, objectType, resourceVersion, notificationID, dropTask, component) | ||||
|
||||
// Do we need to make this call happen in a goroutine in order to not make the | ||||
// StatusUpdated function stuck in case that nobody reads what is written on the channel ? | ||||
// Is there any case where this can happen | ||||
// (nobody reads what is written on the channel and the StatusUpdated gets stuck) ? | ||||
t.taskStatusChan <- taskStatus | ||||
log.Printf("StatusUpdated(): New Task Status has been created and sent to channel: %+v\n", taskStatus) | ||||
log.Printf("StatusUpdated(): New Task Status has been created: %+v\n", taskStatus) | ||||
|
||||
// We need to have a default case here so the call is not stuck if we send to channel but there is nobody reading from the channel. | ||||
// e.g. a subscriber got stuck and doesn't reply. The task will be requeued after the timer gets expired. In the meanwhile | ||||
// the subscriber replies and a taskStatus is sent to channel but the call gets stuck there as the previous task has not been requeued yet | ||||
// as the timer has not expired and the queue is empty (We assume that there is only one task in the queue). | ||||
select { | ||||
case t.taskStatusChan <- taskStatus: | ||||
log.Printf("StatusUpdated(): Task Status has been sent to channel: %+v\n", taskStatus) | ||||
default: | ||||
log.Printf("StatusUpdated(): Task Status has not been sent to channel. Channel not available: %+v\n", taskStatus) | ||||
} | ||||
} | ||||
|
||||
// processTasks processes the task | ||||
|
@@ -123,7 +130,18 @@ | |||
// (e.g. Maybe you have a timeout on the subscribers and you got the notification after the timeout have passed) | ||||
NotificationID: uuid.NewString(), | ||||
} | ||||
eventbus.EBus.Publish(objectData, sub) | ||||
if err := eventbus.EBus.Publish(objectData, sub); err != nil { | ||||
log.Printf("processTasks(): Notification not sent to subscriber %+v with data %+v. Subscriber is busy. The Task %+v will be requeued.\n", sub, objectData, task) | ||||
// We keep this subIndex in order to know from which subscriber to start iterating after the requeue of the Task | ||||
// so we do start again from the subscriber that returned an error or was unavailable for any reason. | ||||
task.subIndex += i | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you are at subscriber with index There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No this will not happen as the next time we will take a sublist of the subscribers based on the subIndex. Then we will iterate on that sublist and that means that the i will start from 0 again. check here: https://github.com/opiproject/opi-evpn-bridge/blob/main/pkg/infradb/taskmanager/taskmanager.go#L114 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sounds a bit error-prone There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To be honest I like the solution as it is today. It looks cleaner to me. It is tested and it works so nice so I would not like to change it if that is ok with you. I can put a comment if you want so people can understand what this index calculation is all about. I do not like so much to be honest to keep sublists all the time for the remaining subscribers to me that is a bit more error prone. Maybe we can revisit this issue in the future as this bug fix here is not related to the subIndex. The SubIndex was allready there from the beggining There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For me it is not. I already misinterpreted what is happening here. We could create an issue and discuss it there. I don't know if we can add a Subscriber in the middle and if we need to regard that a new one and other details
Please do. I am also wondering if it could be a single place like a function to "re-queue task"
Please create an issue There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done |
||||
task.systemTimer *= 2 | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The flow you described is clear. I want to know if we ever stop increasing it. Will it make sense to wait for hours? days? years? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Currently we never stop increasing it. But this is a known issue that we have just not addressed do far. We are planning to address this in the future. I can open an issue to track it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Pls create an issue There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done |
||||
log.Printf("processTasks(): The Task will be requeued after %+v\n", task.systemTimer) | ||||
time.AfterFunc(task.systemTimer, func() { | ||||
t.taskQueue.Enqueue(task) | ||||
}) | ||||
break loopTwo | ||||
} | ||||
log.Printf("processTasks(): Notification has been sent to subscriber %+v with data %+v\n", sub, objectData) | ||||
|
||||
loopThree: | ||||
|
@@ -143,11 +161,17 @@ | |||
log.Printf("processTasks(): received notification id %+v doesn't equal the sent notification id %+v\n", taskStatus.notificationID, objectData.NotificationID) | ||||
|
||||
// We need a timeout in case that the subscriber doesn't update the status at all for whatever reason. | ||||
// If that occurs then we just take a note which subscriber need to revisit and we requeue the task without any timer | ||||
// If that occurs then we just requeue the task with a timer | ||||
case <-time.After(30 * time.Second): | ||||
log.Printf("processTasks(): No task status has been received in the channel from subscriber %+v. The task %+v will be requeued immediately Task Status %+v\n", sub, task, taskStatus) | ||||
log.Printf("processTasks(): No task status has been received in the channel from subscriber %+v. The task %+v will be requeued. Task Status %+v\n", sub, task, taskStatus) | ||||
// We keep this subIndex in order to know from which subscriber to start iterating after the requeue of the Task | ||||
// so we do start again from the subscriber that returned an error or was unavailable for any reason. | ||||
task.subIndex += i | ||||
go t.taskQueue.Enqueue(task) | ||||
task.systemTimer *= 2 | ||||
log.Printf("processTasks(): The Task will be requeued after %+v\n", task.systemTimer) | ||||
time.AfterFunc(task.systemTimer, func() { | ||||
t.taskQueue.Enqueue(task) | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Issue is needed
Sounds like a sequential flow. Do we need channels then? Pls consider. It would make all the things much simpler There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think channels and notifications make the whole system scale better. The implementation that we have so far it just uses one task queue and one process that drains the queue. Maybe in the future if we hit any limitation we could use more processes to drain the queue of the tasks and I think this will scale easier when we are using channels and notifications. So I would prefer keeping it this way. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
But we didn't hit, and we don't know if we really need it in the future, but already complicating the design. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe I didn't get your question correct at the beggining. I apologize for this The channels are not so much related to the sequential flow as to keep the core system as much as was possible agnostic to the plugins that we need to send and receive notifications. The plugins are responsible to configuring the underline system of their responsibility (e.g. FRRModule for FRR, LinuxModule for Linux etc...). The core system doesn't need to know what are those plugins or what they exactly do but the only thing that they need to know is towards which subscribers they need to send notifications. The channels help in this agnostic notion as well as to the communication between the different go routines as each subscriber runs as a different go routine. Also the sequential flow it is there because the different plugins have some sort of dependency into eachother. That dependency is expressed by the sequential flow and that is why we want the first plugin to succeed before we move to the second one. Because for instance if the Linux Vendor Module wants to attach an interface into a bridge which bridge has been created before by General Linux Module if the sequential flow is not there the Linux Vendor module call will fail as it has a dependency to the General Linux module operation first to create the bridge. This is a significant design choice which has been presented to the TSC before the implementation and we have decided that was resonable to move forward with it . Also we think that is a good design and works well so far and I do not really agree that the channels make the system complicated as they serve the architecture well. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The same level of agnosticism can be achieved with a function calling. The channels do not add more agnosticism here.
They can be run in different routines, even if we call them by a function (you can use mutex to sync or send to channel or execute right in the same thread context)
We send a msg, and go to sleep waiting for a plugin to complete its job transforming our flow into sequential
What if we do not register General Linux Module at all so it won't receive notifications at all?
It might be. But from the chunk I see, it looks like you need a sequential flow and channels add complexity. Apparently this PR is not the place to make such decisions, but pls consider |
||||
}) | ||||
break loopThree | ||||
} | ||||
} | ||||
|
@@ -159,19 +183,27 @@ | |||
break loopTwo | ||||
} | ||||
|
||||
// We re-initialize the systemTimer every time that we get a taskStatus. That means that the subscriber is available and has responded | ||||
task.systemTimer = 1 * time.Second | ||||
|
||||
switch taskStatus.component.CompStatus { | ||||
case common.ComponentStatusSuccess: | ||||
log.Printf("processTasks(): Subscriber %+v has processed the task %+v successfully\n", sub, task) | ||||
continue loopTwo | ||||
default: | ||||
case common.ComponentStatusError: | ||||
log.Printf("processTasks(): Subscriber %+v has not processed the task %+v successfully\n", sub, task) | ||||
log.Printf("processTasks(): The Task will be requeued after %+v\n", taskStatus.component.Timer) | ||||
// We keep this subIndex in order to know from which subscriber to start iterating after the requeue of the Task | ||||
// so we do start again from the subscriber that returned an error or was unavailable for any reason. | ||||
task.subIndex += i | ||||
task.retryTimer = taskStatus.component.Timer | ||||
log.Printf("processTasks(): The Task will be requeued after %+v\n", task.retryTimer) | ||||
time.AfterFunc(task.retryTimer, func() { | ||||
time.AfterFunc(taskStatus.component.Timer, func() { | ||||
t.taskQueue.Enqueue(task) | ||||
}) | ||||
break loopTwo | ||||
default: | ||||
artek-koltun marked this conversation as resolved.
Show resolved
Hide resolved
|
||||
log.Printf("processTasks(): Subscriber %+v has not provided designated status for the task %+v\n", sub, task) | ||||
log.Printf("processTasks(): The task %+v will be dropped\n", task) | ||||
break loopTwo | ||||
} | ||||
} | ||||
} | ||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm... if write into a channel, why do we use read lock here? With the read lock many threads can publish at the same time. We apparently want to use
Lock
hereThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are right. The question though is if we actually need a Lock.
In a case where two goroutines try to write on the same channel I think golang is smart enough to block the second goroutine until the first one has finished writing. In that case we do not need any Lock. is that correct ?
WDYT ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we need the lock only to lock on the channel, then we can remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think the Lock is not needed here we can remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls remove then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done