
draft: debugging PTAG and ABS messages #312

Closed
wants to merge 15 commits into from

Conversation

byeonggiljun
Collaborator

This PR aims to debug the PTAG and ABS messages. Currently, on this branch (reactor-c/enclaves3), a PTAG is not sent when a federate is not in a zero-delay cycle (ZDC). However, the semantics of PTAG should be correct even for federates that are not in a ZDC, and there is currently a problem with PTAG handling. I made this PR to preserve the behavior that does not skip PTAGs so that it can be debugged.

@byeonggiljun byeonggiljun added bug Something isn't working federated labels Nov 24, 2023
@byeonggiljun byeonggiljun marked this pull request as draft November 24, 2023 08:33
@byeonggiljun byeonggiljun changed the title draft: debugging PRAG and ABS messages draft: debugging PTAG and ABS messages Nov 24, 2023
@byeonggiljun
Collaborator Author

byeonggiljun commented Dec 1, 2023

@edwardalee @hokeun @lhstrh I tried to resolve the problem in AfterDelays.lf and encountered additional difficulties. Here is a summary of my efforts; I will explain it in the next meeting.

The reason a federate sends a wrong LTC (described in this issue) is that federates only consider zero-delay actions when advancing the MLAA (Max Level Allowed to Advance).

So I changed federates to look up every network action in this commit (reactor-c/pull/312/commits/9307777). However, this commit exposed other, previously masked problems.

  • Problem on AfterDelays.lf

    There is a non-deterministic error in AfterDelays.lf. Let's compare the two cases in the images below. We only need to look at NET(100 msec, 0) from sw2 (second from the right). In the successful trace, NET(100 msec, 0) is sent; in the failed trace, it is not. This is caused by the timing of the PTAG.

    if (lf_tag_compare(_fed.last_TAG, tag) >= 0) {
        LF_PRINT_DEBUG("Granted tag " PRINTF_TAG " because TAG or PTAG has been received.",
                       _fed.last_TAG.time - start_time, _fed.last_TAG.microstep);
        return _fed.last_TAG;
    }

    In the failed trace, a federate received PTAG(82 msec) before it completed tag 81 msec. It then tried to send NET(82 msec) but did not, because last_TAG = 82 msec >= tag = 82 msec. However, NET(82 msec) must be sent, because the federate is stalled by the MLAA and is still waiting for TAG(82 msec).

    Successfully executed trace
    image

    Deadlock occurred trace
    image

  • Solution of the problem on AfterDelays.lf
    Thus, I changed the code above as follows.

    if (lf_tag_compare(_fed.last_TAG, tag) > 0
            || (!_fed.is_last_TAG_provisional && lf_tag_compare(_fed.last_TAG, tag) == 0)
            || (_fed.is_last_TAG_provisional && lf_tag_compare(env->current_tag, _fed.last_TAG) < 0)) {
        LF_PRINT_DEBUG("Granted tag " PRINTF_TAG " because TAG or PTAG has been received.",
                       _fed.last_TAG.time - start_time, _fed.last_TAG.microstep);
        return _fed.last_TAG;
    }

    This code will not send a NET when:

    1. TAG >= intended NET (a full TAG at or beyond the intended NET has been received),
    2. PTAG > intended NET (a PTAG strictly beyond the intended NET has been received), or
    3. PTAG > current tag (a PTAG beyond the federate's current tag has been received).

    In the trace below, you can see that a federate sends NET(62 msec) even though it had received PTAG(62 msec) earlier. That NET(62 msec) allowed the RTI to send TAGs, so the deadlock did not happen. I suspect this is not an optimal solution, although it works, because duplicate NETs are being sent. I'm trying to devise a more elegant solution.

    The trace after fixing the problem
    image

  • Problem on ChainWithDelay.lf
    A problem occurred after I made this commit (reactor-c/pull/312/commits/9307777) to make federates consider every network action.
    image
    Let's assume we're at tag 33 msec. The RTI sent PTAG(33 msec) to everyone, so p:PhysicalPlant sends a T_MSG with tag 33 msec. However, c:Controller cannot execute reaction_2, because reaction_1 comes first. We know there is no message with tag 33 msec for reaction_1, since pl:Planner's NET is 33 msec and there is a delay on the connection; but c:Controller has no way to know this, so it waits for an ABS, T_MSG, or TAG at 33 msec.
    Note that when we only consider zero-delay actions when advancing the MLAA, this isn't a problem, because we don't wait for reaction_1.

  • Solution of the problem on ChainWithDelay.lf

    I'm still trying to shape a solution.

    The bottom line is that c:Controller cannot know the status of reaction_1. The RTI has that information but has no way to communicate it. pl:Planner also has that information, and sending ABS(33 msec) would be the simplest solution. However, pl:Planner cannot recognize that c:Controller has an event at 33 msec and is waiting for its input at 33 msec. Maybe NDT could help to solve this problem?

Base automatically changed from enclaves3 to main December 2, 2023 00:03