workspace: Reload hangs for several minutes before restarting Zed on Linux (blocking lsof call) #22666

Open
1 task done
JaagupAverin opened this issue Jan 4, 2025 · 4 comments
Labels
bug [core label] installer / updater Feedback for installation and update process linux upstream

Comments

@JaagupAverin
Contributor

Check for existing issues

  • Completed

Describe the bug / provide steps to reproduce it

When triggering workspace: Reload, it takes several minutes for Zed to restart. After a bit of debugging, the culprit appears to be the lsof loop at zed/crates/gpui/src/platform/linux/platform.rs:192:

while lsof -nP -iTCP -a -p {pid} 2>/dev/null; do
    sleep 0.1
done

From htop I can verify that the lsof process hangs for several minutes before Zed is restarted.
Running the exact same command from another terminal exits immediately with status 1, as expected (only the lsof spawned in the background hangs).

Workarounds I've found that fix the issue:

  1. Adding -bw to the lsof arguments, so that lsof avoids the kernel calls that might block (-b) and suppresses the resulting warnings (-w);
  2. Adding -O to the lsof arguments, so that lsof no longer runs potentially blocking kernel calls in forked child processes;
    However, since I don't fundamentally understand the root cause of the issue, I won't be attempting a proper fix with a PR. Wiser minds, please consider this :)
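The first workaround could be sketched as a change to the generated restart script. This is only an illustration, not a proposed patch: `restart_script` is a hypothetical helper that does not exist in Zed, and whether `-b -w` avoids the hang on other machines is untested. Only flags already mentioned in this issue are used.

```rust
/// Hypothetical helper: builds the shell script that waits for the old
/// process to exit and then relaunches the app.
/// Compared to the script quoted in this issue, the only change is the
/// added `-b -w` on the lsof call.
fn restart_script(pid: u32, app_path: &str) -> String {
    format!(
        r#"
while kill -0 {pid} 2>/dev/null; do
    sleep 0.1
done
while lsof -nP -b -w -iTCP -a -p {pid} 2>/dev/null; do
    sleep 0.1
done
{app_path}
"#
    )
}
```

Calling `restart_script(1234, "/usr/bin/zed")` returns the script text with the pid and path substituted; how it is spawned is left unchanged.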

Zed Version and System Specs

Zed: v0.168.0 (Zed Preview)
OS: Linux Wayland ubuntu 24.10
Memory: 30 GiB
Architecture: x86_64
GPU: AMD Radeon Graphics (RADV GFX1103_R1) || radv || Mesa 24.2.3-1ubuntu1

If applicable, add screenshots or screencasts of the incorrect state / behavior

No response

If applicable, attach your Zed.log file to this issue.

Zed.log

@JaagupAverin JaagupAverin added admin read Pending admin review bug [core label] triage Maintainer needs to classify the issue labels Jan 4, 2025
@notpeter
Member

notpeter commented Jan 6, 2025

@JaagupAverin This was reported previously here:

You've correctly identified this as the code in question:

log::info!("Restarting process, using app path: {:?}", app_path);
// Script to wait for the current process to exit and then restart the app.
// We also wait for possibly open TCP sockets by the process to be closed,
// since on Linux it's not guaranteed that a process' resources have been
// cleaned up when `kill -0` returns.
let script = format!(
r#"
while kill -0 {pid} 2>/dev/null; do
sleep 0.1
done
while lsof -nP -iTCP -a -p {pid} 2>/dev/null; do
sleep 0.1
done
{app_path}
"#,
pid = app_pid,

CC: @afossa @majutsushi You are no longer alone in experiencing this issue

Thanks for the suggestions of -bw and -O.
Man page for lsof 4.95.0 from Ubuntu24 copied here for reference:

    -O       directs lsof to bypass the strategy it  uses  to  avoid  being
            blocked by some kernel operations - i.e., doing them in forked
            child  processes.   See  the  BLOCKS AND TIMEOUTS and AVOIDING
            KERNEL BLOCKS sections for more information on  kernel  opera‐
            tions that may block lsof.

            While use of this option will reduce lsof startup overhead, it
            may also cause lsof to hang when the kernel doesn't respond to
            a function.  Use this option cautiously.

   -b       causes  lsof  to  avoid  kernel  functions  that might block -
            lstat(2), readlink(2), and stat(2).

            See the BLOCKS AND TIMEOUTS and AVOIDING  KERNEL  BLOCKS  sec‐
            tions for information on using this option.

   +|-w     Enables (+) or disables (-) the suppression  of  warning  mes‐
            sages.

            The  lsof builder may choose to have warning messages disabled
            or enabled by default.  The default warning message  state  is
            indicated  in  the  output of the -h or -?  option.  Disabling
            warning messages when they are already  disabled  or  enabling
            them when already enabled is acceptable.

BLOCKS AND TIMEOUTS
   Lsof  can  be blocked by some kernel functions that it uses - lstat(2),
   readlink(2), and stat(2).  These functions are stalled in  the  kernel,
   for  example,  when the hosts where mounted NFS file systems reside be‐
   come inaccessible.

   Lsof attempts to break these blocks with timers  and  child  processes,
   but  the  techniques are not wholly reliable.  When lsof does manage to
   break a block, it will report the break with  an  error  message.   The
   messages may be suppressed with the -t and -w options.

   The  default  timeout value may be displayed with the -h or -?  option,
   and it may be changed with the -S [t] option.  The minimum for t is two
   seconds, but you should avoid small values, since slow  system  respon‐
   siveness  can  cause  short timeouts to expire unexpectedly and perhaps
   stop lsof before it can produce any output.

   When lsof has to break a block during its access of mounted file system
   information, it normally  continues,  although  with  less  information
   available to display about open files.

   Lsof  can  also be directed to avoid the protection of timers and child
   processes when using the kernel functions that might block by  specify‐
   ing  the  -O  option.  While this will allow lsof to start up with less
   overhead, it exposes lsof completely  to  the  kernel  situations  that
   might block it.  Use this option cautiously.

AVOIDING KERNEL BLOCKS
   You  can use the -b option to tell lsof to avoid using kernel functions
   that would block.  Some cautions apply.

   First, using this option usually requires that your system  supply  al‐
   ternate  device  numbers in place of the device numbers that lsof would
   normally obtain with the lstat(2) and stat(2)  kernel  functions.   See
   the  ALTERNATE DEVICE NUMBERS section for more information on alternate
   device numbers.

   Second, you can't specify names for lsof to locate unless they're  file
   system  names.  This is because lsof needs to know the device and inode
   numbers of files listed with names in the lsof options, and the -b  op‐
   tion  prevents lsof from obtaining them.  Moreover, since lsof only has
   device numbers for the file systems that have alternates,  its  ability
   to  locate files on file systems depends completely on the availability
   and accuracy of the alternates.  If no alternates are available, or  if
   they're incorrect, lsof won't be able to locate files on the named file
   systems.

   Third,  if  the names of your file system directories that lsof obtains
   from your system's mount table are symbolic links, lsof won't  be  able
   to  resolve  the  links.   This is because the -b option causes lsof to
   avoid the kernel readlink(2)  function  it  uses  to  resolve  symbolic
   links.

   Finally, using the -b option causes lsof to issue warning messages when
   it  needs  to use the kernel functions that the -b option directs it to
   avoid.  You can suppress these messages by specifying  the  -w  option,
   but  if  you do, you won't see the alternate device numbers reported in
   the warning messages.

@notpeter notpeter added installer / updater Feedback for installation and update process linux and removed triage Maintainer needs to classify the issue admin read Pending admin review labels Jan 6, 2025
@JaagupAverin
Contributor Author

After stripping away all the layers, this actually appears to be a general issue with lsof.
It would be interesting to hear whether @majutsushi still has this issue too, and that it's not some individual edge case.
lsof-org/lsof#328

@majutsushi

I haven't actually been able to reproduce this recently. I think it started working properly after a reboot, but I can't be entirely certain of the timing with this. Of course this could also just be a complete coincidence.

@mrnugget
Member

mrnugget commented Jan 7, 2025

Hey, just dropping some context here: the original reason for lsof was that Zed keeps a TCP socket open to check that only a single instance is running. We ran into a race condition where restarting would fail because the old process hadn't cleaned up its TCP socket yet. See #11488:

When running with these release channels, Zed tries to ensure that there's only one instance of Zed running.

It does that by listening on a TCP socket to which other instances can connect on start. If the other instance receives a message, it knows that another Zed instance is running and exits.

On Linux, though, we ran into a race condition:

kill -0, which checks whether a process is still running, returns an error, signalling that the old Zed process has exited
BUT: the process was still listening on the TCP port.
It seems that on Linux, process resources aren't guaranteed to be cleaned up as soon as signal handling stops working for a process.

The fix is to wait until the process is no longer listening on any TCP sockets.
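To make the mechanism described above concrete, here is a minimal sketch of a TCP-based single-instance check. It is not Zed's actual implementation: `Instance`, `claim_instance`, and the choice of port are assumptions made for illustration, and the message exchange between instances is left out.

```rust
use std::io;
use std::net::{Ipv4Addr, SocketAddrV4, TcpListener, TcpStream};

/// Hypothetical result of the check: either we own the port, or someone else does.
enum Instance {
    /// We bound the port, so we are the only running instance.
    Primary(TcpListener),
    /// The port was taken; this stream is connected to the existing instance.
    Secondary(TcpStream),
}

/// Tries to claim `port` on localhost. If it is already in use,
/// connects to whoever is listening there instead.
fn claim_instance(port: u16) -> io::Result<Instance> {
    let addr = SocketAddrV4::new(Ipv4Addr::LOCALHOST, port);
    match TcpListener::bind(addr) {
        Ok(listener) => Ok(Instance::Primary(listener)),
        Err(err) if err.kind() == io::ErrorKind::AddrInUse => {
            TcpStream::connect(addr).map(Instance::Secondary)
        }
        Err(err) => Err(err),
    }
}
```

The race in the quoted description corresponds to the restarted process landing in the `Secondary` branch because the old process's listener has not yet been released.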

If someone has a better way to run lsof (or a better way to check in general!), that'd be great!
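One possible direction, offered as a rough sketch and not a tested fix: instead of polling lsof from the restart script, the newly started process could retry binding the port for a short window. `wait_for_port`, the 100 ms polling interval, and the idea of doing this at startup are all assumptions, not anything Zed does today.

```rust
use std::io;
use std::net::{Ipv4Addr, SocketAddrV4, TcpListener};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper: polls until `port` on localhost can be bound,
/// or until `timeout` has elapsed. Returns the bound listener so the
/// caller keeps holding the port.
fn wait_for_port(port: u16, timeout: Duration) -> io::Result<TcpListener> {
    let addr = SocketAddrV4::new(Ipv4Addr::LOCALHOST, port);
    let deadline = Instant::now() + timeout;
    loop {
        match TcpListener::bind(addr) {
            Ok(listener) => return Ok(listener),
            // Port still held (e.g. by the exiting process): retry until the deadline.
            Err(err) if err.kind() == io::ErrorKind::AddrInUse && Instant::now() < deadline => {
                sleep(Duration::from_millis(100));
            }
            // Deadline passed or a different error: give up and report it.
            Err(err) => return Err(err),
        }
    }
}
```

The trade-off is that a restarted process cannot tell "old instance still shutting down" from "another instance is genuinely running" until the timeout expires, so the window would need to stay short.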
