Skip to content
This repository has been archived by the owner on Dec 23, 2024. It is now read-only.

ARE issue: init_fs_encoding: failed to get the Python codec #40

Open
rbeucher opened this issue Aug 8, 2023 · 11 comments
Open

ARE issue: init_fs_encoding: failed to get the Python codec #40

rbeucher opened this issue Aug 8, 2023 · 11 comments

Comments

@rbeucher
Copy link

rbeucher commented Aug 8, 2023

Hi @dsroberts

We have been encountering that issue in the last few weeks:

Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
Python runtime state: core initialized

Here is the full log output:
Looks like a pb with the file system. Any idea?

u=rwx,g=rwx,o=rx
Initialising new SSH agent...
succeeded
Script starting...
Waiting for Jupyter Notebook server to open port 52454...
TIMING - Starting wait at: Tue Aug  8 15:45:31 AEST 2023
TIMING - Starting main script at: Tue Aug  8 15:45:31 AEST 2023
WARNING: Symbolic version 'conda/access-med' already defined

Loading conda/access-med-0.3
  Loading requirement: singularity
/g/data/xp65/public/apps/med_conda_scripts/access-med-0.3.d/bin/jupyter
/g/data/xp65/public/apps/med_conda_scripts/access-med-0.3.d/bin/jupyter
TIMING - Starting jupyter at: Tue Aug  8 15:45:31 AEST 2023
+ ssh-agent -- jupyter lab --config=/home/563/rb5533/ondemand/data/sys/dashboard/batch_connect/sys/jupyter/ncigadi/output/eeb148c4-3438-490c-9b31-fd5d179684cc/config.py --FileContentsManager.root_dir=/home/563/rb5533/.jupyter-root
Python path configuration:
  PYTHONHOME = (not set)
  PYTHONPATH = ':/home/563/rb5533/ondemand/data/sys/dashboard/batch_connect/sys/jupyter/ncigadi/output/eeb148c4-3438-490c-9b31-fd5d179684cc/lib/python3'
  program name = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3/bin/python'
  isolated = 0
  environment = 1
  user site = 1
  import site = 1
  sys._base_executable = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3/bin/python'
  sys.base_prefix = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3'
  sys.base_exec_prefix = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3'
  sys.platlibdir = 'lib'
  sys.executable = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3/bin/python'
  sys.prefix = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3'
  sys.exec_prefix = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3'
  sys.path = [
    '',
    '/home/563/rb5533/ondemand/data/sys/dashboard/batch_connect/sys/jupyter/ncigadi/output/eeb148c4-3438-490c-9b31-fd5d179684cc/lib/python3',
    '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3/lib/python310.zip',
    '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3/lib/python3.10',
    '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3/lib/python3.10/lib-dynload',
  ]
Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
Python runtime state: core initialized
Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 879, in exec_module
  File "<frozen importlib._bootstrap_external>", line 1016, in get_code
  File "<frozen importlib._bootstrap_external>", line 1074, in get_data
OSError: [Errno 5] Input/output error
+ die 'ERROR: Jupyter Lab exited with non-zero code (1)'
+ MSG='ERROR: Jupyter Lab exited with non-zero code (1)'
+ '[' '!' -z 'ERROR: Jupyter Lab exited with non-zero code (1)' ']'
+ echo '<div class="error-msg">ERROR: Jupyter Lab exited with non-zero code (1)</div>'
+ kill -9 1269893
-/bin/bash: line 1: 1269893 Killed                  /local/spool/pbs/mom_priv/jobs/92187277.gadi-pbs.SC
++ date
+ echo 'TIMING - Finished jupyter at: Tue Aug  8 15:45:35 AEST 2023'
TIMING - Finished jupyter at: Tue Aug  8 15:45:35 AEST 2023

======================================================================================
                  Resource Usage on 2023-08-08 15:45:49:
   Job Id:             92187277.gadi-pbs
   Project:            iq82
   Exit Status:        137 (Linux Signal 9 SIGKILL Kill, unblockable)
   Service Units:      0.00
   NCPUs Requested:    2                      NCPUs Used: 2               
                                           CPU Time Used: 00:00:01        
   Memory Requested:   9.0GB                 Memory Used: 105.45MB        
   Walltime requested: 05:00:00            Walltime Used: 00:00:07        
   JobFS requested:    100.0MB                JobFS used: 0B              
======================================================================================

Thanks!

@rbeucher rbeucher changed the title ARE issue: ARE issue: init_fs_encoding: failed to get the Python codec Aug 8, 2023
@dsroberts
Copy link
Member

Known issue I'm afraid. The root cause is this:

OSError: [Errno 5] Input/output error

There is something broken about how the linux kernel is handling the squashfs loopback device. As far as I can tell, its some kind of unhandled cache miss. I've been trying on and off to figure out a workaround for it, but nothing has helped yet. The shared loop device setting in singularity reduced the occurrence of this error, but it still shows up from time to time, as you've seen. I used to have a semi-consistent reproducer, but when NCI changed the shared loopback setting reduced the OSError rate to about 1/1000 python launches. Insofar as I can tell, it only happens during python imports. I've managed to get into a jupyter session that was failing to import some module for this reason once. I did manage to clear it, but I'm not sure exactly what I did that fixed it. That was a few months ago and I've not managed to get an interactive session with this problem since, in spite of using the squashfs envs pretty much every day. How often do you see this?

@rbeucher
Copy link
Author

rbeucher commented Aug 9, 2023

Thanks.

I only use the environment via PBS and have never encountered that pb.
It's a different story with ARE. @dougiesquire and @headmetal have had regular issues.

@dougiesquire and @headmetal any comments?

@headmetal
Copy link

headmetal commented Aug 9, 2023

Over the last month or so, I'd say I'm getting this error (or variants of this error that contain the OSError: [Errno 5] Input/output error) around 1/5 ARE Jupyter sessions using the same conda environment as in @rbeucher's post above.

Admittedly i haven't been running these sessions that much over the last week or so, so haven't had any recently - but will track it over the next few weeks and report back.

@rbeucher
Copy link
Author

rbeucher commented Aug 9, 2023

Andy mentioned the issue too.

@dsroberts
Copy link
Member

Interesting. That's way more prevalent than in the hh5 squashfs envs. Can I join xp65 and do some tests with your environment?

@rbeucher
Copy link
Author

rbeucher commented Aug 9, 2023

Please 🙏 😁

@dougiesquire
Copy link

I've had this happen multiple times today so it should be fairly easy to reproduce. Even if it doesn't occur on start-up, the error will sometimes occur partway through a JupyterLab session when a cell is run.

@dsroberts
Copy link
Member

@dougiesquire I've noticed the prevalence of this issue is dependent on the workflow. I thnk (but am not certain) the more modules get imported, especially in parallel e.g. during dask cluster startup, the more likely it is to show up. Would I be able to have a copy of whatever you were running yesterday when this kept happening? Doesn't need to be a minimal reproducer, in fact, the more complicated the better. Can you also let me know which version of conda/access-med and what size ARE instance you were using?

@dougiesquire
Copy link

Sure - I was getting it fairly often when running this notebook: https://github.com/ACCESS-NRI/NRI-Workshop2023-MED/blob/main/Intake_tutorial_p1.ipynb

I was using conda/are with an X-Large instance on normalbw.

@dsroberts
Copy link
Member

OK, I've made some progress on this. I managed to set up a stress test that would fail with this error about 80% of the time. dmesg kept complaining about the fragment cache, so I figured lets disable fragmentation in the squashfs and also simplify the internal structure of the squashfs as much as possible and see what happens. That dropped the failure rate to about 20%, but the failures became SIGBUS errors where data was not able to be read. I don't know if this is better or not, but I've updated the arguments to mksquashfs in a174719 and I'll see what effect that has over time. I'm wondering if there are mount options I can change that may have an effect as well. Perhaps I can switch from an overlay to a bind and that might help? I'm just flipping switches at this point, I really don't have any clue on the underlying mechanism that's causing this.

@headmetal
Copy link

Thanks Dale - 20% sounds a lot better than 80% already

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants