Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Jetty 11 adapter #447

Closed
wants to merge 11 commits into from

Conversation

alexandergunnarson
Copy link
Contributor

No description provided.

ptaoussanis and others added 11 commits September 5, 2023 11:22
Requires further review.

This is a minor but potentially BREAKING change.
Zero urgency, consider holding till next breaking set.

Before this commit:

  Client-side `[:chsk/ws-ping]` events are wrapped to
  `[:chsk/recv [:chsk/ws-ping]]` when `wrap-recv-evs?` is true.

After this commit:

  Client-side `[:chsk/ws-ping` events are NOT wrapped, even
  when `wrap-recv-evs?` is true.

MOTIVATION:

  While `:chsk/ping` is indeed pushed from server, `:chsk/recv` is
  in practice reserved for user-level events. Keeping it that way
  may be better (more convenient for users).
@ptaoussanis
Copy link
Member

Thanks so much for this Alex! I've got my hands a bit full atm with some other open source tasks, but will definitely get this reviewed+merged the next time I'm on batched Sente work.

Will be really nice to have this in 👍

@ptaoussanis ptaoussanis self-assigned this May 9, 2024
@alexandergunnarson
Copy link
Contributor Author

No worries! Thanks :)

@awb99
Copy link

awb99 commented Oct 8, 2024

@ptaoussanis any idea when this will be merged? Thanks!

@ptaoussanis
Copy link
Member

@awb99 Are you blocking on this? Would you need a full release, or just a merge?

@awb99
Copy link

awb99 commented Oct 8, 2024

I am stuck to a very old sente version that still has jetty support. I would like to update to a more uptodate version of sente that again supports jetty. I have two reasons: 1) old jetty needs websocket handlers to be marked in the config as such (which means my routing table needs to be done effectively twice, or at least my config becomes more convoluted). 2) old sente versions sometimes to not receive disconnected client notification, so I believe I am sometimes sending push event's to clients that are already gone. I could just use a git commit ID to refer to sente via deps.edn if this is what you refer to. I really would like to upgrade ... but I prefer to do it once it is clear the new code works. Thanks!

@ptaoussanis
Copy link
Member

Thanks, and just to clarify - would you need a full release, or just a review and merge of this PR?

@awb99
Copy link

awb99 commented Oct 8, 2024

If you just review and merge, then I can use it in deps.edn projects fine. And if it is helpful to you I could try it in the next 2-4 weeks on some of my apps and report here in case there are issues before you do a full release.

@ptaoussanis
Copy link
Member

ptaoussanis commented Oct 8, 2024

@awb99

And if it is helpful to you I could try it in the next 2-4 weeks on some of my apps and report here in case there are issues before you do a full release.

That'd be great, thank you! 🙏 I've just pushed v1.20.0-SNAPSHOT without fully reviewing or testing this PR except for adding missing dependencies, etc.

Please let me know if/when you've had an opportunity to test.

The reference example includes tools for testing both WebSocket and AJAX features.

@awb99
Copy link

awb99 commented Oct 12, 2024

Unfortunately I cannot add snapshot dependencies; most of my libraries are on clojars, and I cannot depend
on snapshot dependencies. I was looking for a commit somewhere, a branch or something that I could use?
But I cound not find any commits here. What do I overlook? Thanks

ptaoussanis added a commit that referenced this pull request Oct 12, 2024
@ptaoussanis
Copy link
Member

I've just added the relevant (temporary, experimental) commits to master 👍

@ptaoussanis
Copy link
Member

@awb99 Have you had any opportunity to test this adapter? I'm aiming to cut a new stable release shortly, and it'd be nice to have this in if possible.

ptaoussanis pushed a commit that referenced this pull request Oct 28, 2024
…narson)

Not officially tested, waiting on confirmation from users that this works as intended
@ptaoussanis
Copy link
Member

Actually, I'll just go ahead and merge this as v1.20 RC1. Hopefully someone can try the adapter out and confirm that it works as expected before v1.20 final - will keep the PR open till then.

@p-himik
Copy link
Contributor

p-himik commented Nov 11, 2024

AFAICT, the code doesn't handle timeouts properly. I.e. ajax-read! will return :ajax-resp-timeout by default, but then it'll be treated as a string which will result in a much less transparent ClassCastException.

@ptaoussanis
Copy link
Member

@p-himik Hi Eugene, please note that you'll need to look here for the code that was actually committed. I changed a few things while reviewing, including removing :ajax-resp-timeout-val.

Thanks for taking a look, it definitely helps to get extra eyes on this 👍

@p-himik
Copy link
Contributor

p-himik commented Nov 11, 2024

@ptaoussanis Right, but that only substitutes CCE with NPE. Perhaps a custom exception makes sense, or java.util.concurrent.TimeoutException or some other existing one.

On a separate topic - surprisingly, I can't quite get it to work with Jetty 12, even with the aforementioned change.
This line keeps getting triggered and at the moment I don't have enough background knowledge to understand why exactly it happens.

@ptaoussanis
Copy link
Member

ptaoussanis commented Nov 11, 2024

Unfortunately can't offer any advice since I've little familiarity with this myself.

Have just marked the issue - hopefully someone else (Alex / Stefan?) might be able to make a suggestion! Otherwise I'll try investigate next time I'm on batched Sente work.

Edit to add: re: CCE/NPE - would be happy to see a PR for the Jetty adapter (and presumably also Undertow?) for a clearer error, assuming that's the best we can do. Would also be nice to update the docstrings to make it clear what actually happens on timeout (exception is thrown, etc.).

@alexandergunnarson
Copy link
Contributor Author

alexandergunnarson commented Nov 11, 2024

@p-himik

AFAICT, the code doesn't handle timeouts properly. I.e. ajax-read! will return :ajax-resp-timeout by default, but then it'll be treated as a string which will result in a much less transparent ClassCastException.

Ah, so this is why there are some random Keyword to String cast complaints in our logs :D The stacktrace is suppressed by Sente IIRC, by an errorf.

@p-himik

On a separate topic - surprisingly, I can't quite get it to work with Jetty 12 ...

Yeah I'm not sure about interaction with Jetty 12. We're running ring/ring-jetty-adapter {:mvn/version "1.13.0"} in production, and that uses Jetty 11. Would need more testing/debugging to get to the bottom of what's happening to you.

@stefanroex
Copy link

@p-himik I’m only running with WebSockets, no Ajax fallback, on Jetty 12. I suspect the ClassCastException and WritePendingException are specific to the Ajax fallback. Unfortunately, that means I’m not much help here. But if you think this pops up with WebSockets, let me know. I’m happy to dig in and see if it shows up on our end!

@p-himik
Copy link
Contributor

p-himik commented Nov 11, 2024

I think I figured it out, PEBKAC is to blame for at least 80% of it:

  • One of my middlewares was throwing an exception specifically for WS responses
  • In my current logging setup, exceptions logged with clojure.tools.logging get printed into stdout, but exceptions logged with Timbre get printed into stderr. It seems that it sometimes result in the exceptions being reported out of order, so I was blaming the exception from Jetty even though it was my middleware
  • The exception returns a plain HTTP 500 response with the exception as the body, but Chrome DevTools Network panel for some reason doesn't print it at all and just reports "Status: Finished" with no payload or messages (had to use ngrep to actually see what's going on)
  • Sente doesn't know about the issue on the server side (after all, ajax-get-or-ws-handshake-fn was called and returned a proper result) and proceeds as usual. No clue what actually goes on in the internals of Jetty here in such a degenerate case, but in the end it results in that obscure WritePendingException

I'm not sure what I can improve here to make such errors more tractable. Perhaps close the Sente connection on the server side when there's an error downstream of Sente? @ptaoussanis But what would be a proper way to do it, given (= ev [:chsk/close]) ; Currently undocumented?

As a second-degree consequence, a different potential issue has manifested itself: after receiving HTTP 500 from the WS request, Sente on the client side keeps spamming the server with plain GET requests with no upgrade headers.

And due to a potential error in the adapter implementation, Sente on the server side keeps responding with HTTP 200 with an empty body, which after a delay (most likely, the one specified by ajax-resp-timeout-ms) results in a deluge of the aforementioned ClassCastException from Keyword to String.

@ptaoussanis
Copy link
Member

@p-himik Thanks for the update! [:chsk/close] will continue to be supported and should be documented, just haven't done so yet. Unfortunately can't comment offhand - would need to take a closer look, and might not have a chance for a while.

Is there any urgency from your side to get feedback from me on this? If so, can try make some time this weekend.

@p-himik
Copy link
Contributor

p-himik commented Nov 11, 2024

@ptaoussanis No urgency at all, and thanks for confirming that I can safely use [:chsk/close] - I just found that I've actually used it before in a different project.

@p-himik
Copy link
Contributor

p-himik commented Nov 12, 2024

@stefanroex Do you not experience timeouts? After fixing the initial issue, I'm getting lines like this in the logs, approximately once every 30 seconds:

[taoensso.sente:784] - [ws/on-error] Server sch error for u_2/c_5d8640/n_374957: org.eclipse.jetty.websocket.api.exceptions.WebSocketTimeoutException: Connection Idle Timeout

It's triggered by org.eclipse.jetty.websocket.core.WebSocketConnection#onIdleExpired, with the default timeout specified by org.eclipse.jetty.websocket.core.WebSocketConstants#DEFAULT_IDLE_TIMEOUT, which is exactly 30 seconds.

Even though Sente has a built-in ping-pong mechanism that by default is set to 25 seconds for the server side, the actual ping messages seem to have a longer delay - 28 seconds on the picture, that one time it took less than 30 seconds so I didn't get a timeout exception with reconnection:
image

@p-himik
Copy link
Contributor

p-himik commented Nov 12, 2024

@ptaoussanis Not urgent - maybe I misunderstand something but to me it seems that the documentation of the ws-kalive-ms option is inaccurate.

The current description in make-channel-socket-server!: "Ping to keep a WebSocket conn alive if no activity w/in given msecs." So it makes it sound as if when there's no activity within the specified time, a ping is immediately sent.

However, here you describe it differently:
"Check every ws-kalive-ms if there's been connection activity. Noop if yes. Otherwise attempt to send server->client ping." So, if at T0 there's a check, at T0+1ms there's a message, at T0+ws-kalive-ms another check - that check would result in noop, so only the check at T0+ws-kalive-ms*2 would result in a ping, effectively making the pings happen half as frequently as ws-kalive-ms specifies.

I haven't analyzed Sente's code enough to confirm that the above scenario is exactly what's going on, but at least it corresponds to what I see - after setting ws-kalive-ms to 10 s, I get pings every 20 s:
image

@ptaoussanis
Copy link
Member

ptaoussanis commented Nov 12, 2024

@p-himik Hi Eugene, thanks for checking on this-

It's late here and I'm about to sign off for the night - so I might also be missing something quite obvious ^^ - but it seems the two descriptions are compatible?

The actual implementation is (IIRC) as you linked here. There's a fixed-interval check every kalive msecs. If there was no activity in the previous window, then a ping will be fired.

So the kalive setting effectively sets the maximum time allowed between activity.

Let's say that kalive is 10s and there's NO other activity. I'd then expect a ping at t=10s, t=20s, t=30s, ...

So, if at T0 there's a check, at T0+1ms there's a message, at T0+ws-kalive-ms another check - that check would result in noop, so only the check at T0+ws-kalive-ms*2 would result in a ping, effectively making the pings happen half as frequently as ws-kalive-ms specifies.

So there should be opportunities to ping at T0+ws-kalive-ms, T0+ws-kalive-ms*2, T0+ws-kalive-ms*3, ...

Could it be that you're observing the result of pong's coming back and therefore registering as activity?

I.e.:

Period1 (t0->t1): no activity    => PING at t1
Period2 (t1->t2): pong received  => noop at t2
Period3 (t2->t3): no activity    => PING at t3
Period4 (t3->t4): pong received  => noop at t4

etc.

This'll show as a ping every second interval if there's a successful pong in between each.

Does that make sense? Could well be I'm confused, or could be there's some issue with the implementation - will check back in the morning.

I'll note that there does seem to be a potential ambiguity in "Ping to keep a WebSocket conn alive if no activity w/in given msecs" since it's not stated whether it's talking about a rolling or fixed window/period. I'll adjust to make that clearer.

@p-himik
Copy link
Contributor

p-himik commented Nov 12, 2024

Yeah, same, about to log off.

I think the behavior that I've observed and your description are perfectly compatible. It's just that the docs are vague enough to allow for different interpretations.

Your description operates with separate discrete time windows, and I was initially thinking in the context of a sliding window of sorts. Because that's how I've seen timeouts generally being implemented - actions reset a timer, and when the timer reaches 0, a timeout exception is triggered. So the longest possible period of inactivity is exactly the setting of the timer.
But with how ws-kalive-ms operates on non-intersecting periods, that's no longer the case, and the longest possible period of inactivity becomes ws-kalive-ms * 2 - dT where dT is the shortest time between a ping and a corresponding pong.

Alas, I can't quite think of an alternative description of the argument that's unambiguous and doesn't take a whole paragraph.

@ptaoussanis
Copy link
Member

ptaoussanis commented Nov 13, 2024

Your description operates with separate discrete time windows, and I was initially thinking in the context of a sliding window of sorts. Because that's how I've seen timeouts generally being implemented - actions reset a timer, and when the timer reaches 0, a timeout exception is triggered. So the longest possible period of inactivity is exactly the setting of the timer.

Makes sense 👍

But with how ws-kalive-ms operates on non-intersecting periods, that's no longer the case, and the longest possible period of inactivity becomes ws-kalive-ms * 2 - dT where dT is the shortest time between a ping and a corresponding pong.

Unless I'm missing something, I'm not sure this part is right though. I'd assert that the longest period of inactivity remains ws-kalive-ms even with the fixed (rather than sliding) window.

I'm intentionally considering a pong as activity on the connection - do you not? The purpose of ws-kalive-ms was to ensure that connections don't idle longer than the specified amount - and I do believe that that holds true?

A pong will still keep the connection open.

So if we have a 10s ws-kalive-ms, I assert (believe) that the connection won't idle for longer than 10 seconds.

In your example the pong is received very shortly after the ping - that's common, but not universal - especially on sketchier mobile connections, etc.

A sliding window would indeed trigger pings more frequently, but it's not immediately obvious to me that that would necessarily be preferable.

To clarify:

  • I'm definitely happy to detail specifics in the docstring!
  • I'm also happy to consider a different mechanism for the keep-alive if there's a reason to

Just want to make sure that I understand what the motivations are for pursuing this. Is there a specific reason you're concerned about the ping interval? (Context was given here)

@ptaoussanis
Copy link
Member

ptaoussanis commented Nov 13, 2024

Okay after re-reading my post, I finally think I understand the point that you've been making - sorry about the slow uptake ^^

The current behaviour is indeed unintentional, my assertion here was wrong:

I'd assert that the longest period of inactivity remains ws-kalive-ms even with the fixed (rather than sliding) window.

Your original example is accurate:

But with how ws-kalive-ms operates on non-intersecting periods, that's no longer the case, and the longest possible period of inactivity becomes ws-kalive-ms * 2 - dT where dT is the shortest time between a ping and a corresponding pong.

Would be happy to see a PR that switched to a sliding window (and documented that), otherwise I'll do that myself next time I'm on batched Sente work.

Still want to double check on context though - is there a specific reason you're concerned about the ping interval? (Context was given here, apologies for missing that)

@ptaoussanis
Copy link
Member

Created a separate issue for the keep-alive bug here 👍

Apologies for noise on this, just juggling too many things atm.

@stefanroex
Copy link

Thanks for clarifying how ws-kalive-ms works. We've run into timeouts as well and weren't entirely sure about its behavior. To be safe, we've reduced the value to ping every 10 seconds, keeping it well below Jetty's default 30-second timeout. We did add a ticket to our backlog to get a better grasp of its functionality.

We initially thought ws-kalive-ms acted like a sliding window: it would trigger activity exactly after ws-kalive-ms had passed since the last message. That behavior would make it easier to fine-tune Sente to match server timeouts.

Turns out I was seeing the exact same thing as @p-himik but hadn't started debugging it yet. Thanks @p-himik and @ptaoussanis for clarifying!

@awb99
Copy link

awb99 commented Nov 25, 2024

I got it to work with the official ring version that uses jetty 11.
The last working release of jetty adapter used the jetty adapter from sunng87.
Am I correct that we now use official ring version and no longer sunng87 version?
I ask because it seems that ring has now updated at least to jetty 11, but
jetty 12 seems to be only in sunng87.
Thanks a lot!

@ptaoussanis
Copy link
Member

Hi everyone,

Just to update on this - the present issue has gotten overly long, and seems to involve at least a few different topics.

To prevent this from blocking the release of Sente v1.20, I'm going to try and pull these topics apart.

Summary below:

  • [PR in general] - From what I understand the present PR seems more or less useful to at least some folks, so I'll retain it for the v1.20 release.
  • [Behaviour on Ajax read timeouts] - The current behaviour seems to be confusing and tough to identify in logs. To hopefully help improve this, I'll throw a clear Exception on these read timeouts. PR or suggestions very welcome if someone has a better idea!
  • [Keep-alive behaviour] - This isn't specific to the current PR or adapter, there's now a dedicated issue for this here to be addressed later (e.g. v1.21)
  • [Jetty 12 support] - I'll check now if this'd be simple to add immediately, otherwise it could wait for a separate issue after v1.20.

Hope that covers everything / seems reasonable?

For anything I may have missed, please feel free to create an individual issue!
Thanks everyone 🙏

@ptaoussanis ptaoussanis added this to the v1.20 milestone Dec 31, 2024
@ptaoussanis
Copy link
Member

@p-himik Hi Eugene! If/when convenient - would you please create an issue to clarify your experience here:

a different potential issue has manifested itself: after receiving HTTP 500 from the WS request, Sente on the client side keeps spamming the server with plain GET requests with no upgrade headers.

I'll try look into this separately (along with #455) for v1.21. Thanks! 🙏

ptaoussanis added a commit that referenced this pull request Dec 31, 2024
…timeouts

There may well be a better way of handling Ajax timeouts, but I'm unfortunately
not familiar with the involved servers.

PR or suggestions very welcome!
@p-himik
Copy link
Contributor

p-himik commented Dec 31, 2024

Done: #460.

@ptaoussanis
Copy link
Member

Preparing v1.20 release now, closing.

ptaoussanis added a commit that referenced this pull request Dec 31, 2024
…timeouts

There may well be a better way of handling Ajax timeouts, but I'm unfortunately
not familiar with the involved servers.

PR or suggestions very welcome!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants