ARE issue: init_fs_encoding: failed to get the Python codec #40

rbeucher · 2023-08-08T12:17:13Z

We have been encountering that issue in the last few weeks:

Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
Python runtime state: core initialized

Here is the full log output:
Looks like a pb with the file system. Any idea?

u=rwx,g=rwx,o=rx
Initialising new SSH agent...
succeeded
Script starting...
Waiting for Jupyter Notebook server to open port 52454...
TIMING - Starting wait at: Tue Aug  8 15:45:31 AEST 2023
TIMING - Starting main script at: Tue Aug  8 15:45:31 AEST 2023
WARNING: Symbolic version 'conda/access-med' already defined

Loading conda/access-med-0.3
  Loading requirement: singularity
/g/data/xp65/public/apps/med_conda_scripts/access-med-0.3.d/bin/jupyter
/g/data/xp65/public/apps/med_conda_scripts/access-med-0.3.d/bin/jupyter
TIMING - Starting jupyter at: Tue Aug  8 15:45:31 AEST 2023
+ ssh-agent -- jupyter lab --config=/home/563/rb5533/ondemand/data/sys/dashboard/batch_connect/sys/jupyter/ncigadi/output/eeb148c4-3438-490c-9b31-fd5d179684cc/config.py --FileContentsManager.root_dir=/home/563/rb5533/.jupyter-root
Python path configuration:
  PYTHONHOME = (not set)
  PYTHONPATH = ':/home/563/rb5533/ondemand/data/sys/dashboard/batch_connect/sys/jupyter/ncigadi/output/eeb148c4-3438-490c-9b31-fd5d179684cc/lib/python3'
  program name = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3/bin/python'
  isolated = 0
  environment = 1
  user site = 1
  import site = 1
  sys._base_executable = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3/bin/python'
  sys.base_prefix = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3'
  sys.base_exec_prefix = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3'
  sys.platlibdir = 'lib'
  sys.executable = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3/bin/python'
  sys.prefix = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3'
  sys.exec_prefix = '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3'
  sys.path = [
    '',
    '/home/563/rb5533/ondemand/data/sys/dashboard/batch_connect/sys/jupyter/ncigadi/output/eeb148c4-3438-490c-9b31-fd5d179684cc/lib/python3',
    '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3/lib/python310.zip',
    '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3/lib/python3.10',
    '/g/data/xp65/public/apps/med_conda/envs/access-med-0.3/lib/python3.10/lib-dynload',
  ]
Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
Python runtime state: core initialized
Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 879, in exec_module
  File "<frozen importlib._bootstrap_external>", line 1016, in get_code
  File "<frozen importlib._bootstrap_external>", line 1074, in get_data
OSError: [Errno 5] Input/output error
+ die 'ERROR: Jupyter Lab exited with non-zero code (1)'
+ MSG='ERROR: Jupyter Lab exited with non-zero code (1)'
+ '[' '!' -z 'ERROR: Jupyter Lab exited with non-zero code (1)' ']'
+ echo '<div class="error-msg">ERROR: Jupyter Lab exited with non-zero code (1)</div>'
+ kill -9 1269893
-/bin/bash: line 1: 1269893 Killed                  /local/spool/pbs/mom_priv/jobs/92187277.gadi-pbs.SC
++ date
+ echo 'TIMING - Finished jupyter at: Tue Aug  8 15:45:35 AEST 2023'
TIMING - Finished jupyter at: Tue Aug  8 15:45:35 AEST 2023

======================================================================================
                  Resource Usage on 2023-08-08 15:45:49:
   Job Id:             92187277.gadi-pbs
   Project:            iq82
   Exit Status:        137 (Linux Signal 9 SIGKILL Kill, unblockable)
   Service Units:      0.00
   NCPUs Requested:    2                      NCPUs Used: 2               
                                           CPU Time Used: 00:00:01        
   Memory Requested:   9.0GB                 Memory Used: 105.45MB        
   Walltime requested: 05:00:00            Walltime Used: 00:00:07        
   JobFS requested:    100.0MB                JobFS used: 0B              
======================================================================================

Thanks!

The text was updated successfully, but these errors were encountered:

dsroberts · 2023-08-09T00:16:16Z

Known issue I'm afraid. The root cause is this:

OSError: [Errno 5] Input/output error

There is something broken about how the linux kernel is handling the squashfs loopback device. As far as I can tell, its some kind of unhandled cache miss. I've been trying on and off to figure out a workaround for it, but nothing has helped yet. The shared loop device setting in singularity reduced the occurrence of this error, but it still shows up from time to time, as you've seen. I used to have a semi-consistent reproducer, but when NCI changed the shared loopback setting reduced the OSError rate to about 1/1000 python launches. Insofar as I can tell, it only happens during python imports. I've managed to get into a jupyter session that was failing to import some module for this reason once. I did manage to clear it, but I'm not sure exactly what I did that fixed it. That was a few months ago and I've not managed to get an interactive session with this problem since, in spite of using the squashfs envs pretty much every day. How often do you see this?

rbeucher · 2023-08-09T01:13:24Z

Thanks.

I only use the environment via PBS and have never encountered that pb.
It's a different story with ARE. @dougiesquire and @headmetal have had regular issues.

@dougiesquire and @headmetal any comments?

headmetal · 2023-08-09T02:24:47Z

Over the last month or so, I'd say I'm getting this error (or variants of this error that contain the OSError: [Errno 5] Input/output error) around 1/5 ARE Jupyter sessions using the same conda environment as in @rbeucher's post above.

Admittedly i haven't been running these sessions that much over the last week or so, so haven't had any recently - but will track it over the next few weeks and report back.

rbeucher · 2023-08-09T03:03:46Z

Andy mentioned the issue too.

dsroberts · 2023-08-09T03:07:15Z

Interesting. That's way more prevalent than in the hh5 squashfs envs. Can I join xp65 and do some tests with your environment?

rbeucher · 2023-08-09T04:01:06Z

Please 🙏 😁

dougiesquire · 2023-08-09T06:41:20Z

I've had this happen multiple times today so it should be fairly easy to reproduce. Even if it doesn't occur on start-up, the error will sometimes occur partway through a JupyterLab session when a cell is run.

dsroberts · 2023-08-10T01:19:41Z

@dougiesquire I've noticed the prevalence of this issue is dependent on the workflow. I thnk (but am not certain) the more modules get imported, especially in parallel e.g. during dask cluster startup, the more likely it is to show up. Would I be able to have a copy of whatever you were running yesterday when this kept happening? Doesn't need to be a minimal reproducer, in fact, the more complicated the better. Can you also let me know which version of conda/access-med and what size ARE instance you were using?

dougiesquire · 2023-08-10T01:26:36Z

Sure - I was getting it fairly often when running this notebook: https://github.com/ACCESS-NRI/NRI-Workshop2023-MED/blob/main/Intake_tutorial_p1.ipynb

I was using conda/are with an X-Large instance on normalbw.

dsroberts · 2023-08-16T00:42:38Z

OK, I've made some progress on this. I managed to set up a stress test that would fail with this error about 80% of the time. dmesg kept complaining about the fragment cache, so I figured lets disable fragmentation in the squashfs and also simplify the internal structure of the squashfs as much as possible and see what happens. That dropped the failure rate to about 20%, but the failures became SIGBUS errors where data was not able to be read. I don't know if this is better or not, but I've updated the arguments to mksquashfs in a174719 and I'll see what effect that has over time. I'm wondering if there are mount options I can change that may have an effect as well. Perhaps I can switch from an overlay to a bind and that might help? I'm just flipping switches at this point, I really don't have any clue on the underlying mechanism that's causing this.

headmetal · 2023-08-16T00:51:39Z

Thanks Dale - 20% sounds a lot better than 80% already

rbeucher changed the title ~~ARE issue:~~ ARE issue: init_fs_encoding: failed to get the Python codec Aug 8, 2023

dsroberts mentioned this issue Aug 14, 2023

Missing packages for full UM file support in iris ACCESS-NRI/ACCESS-Analysis-Conda#51

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARE issue: init_fs_encoding: failed to get the Python codec #40

ARE issue: init_fs_encoding: failed to get the Python codec #40

rbeucher commented Aug 8, 2023

dsroberts commented Aug 9, 2023

rbeucher commented Aug 9, 2023

headmetal commented Aug 9, 2023 •

edited

Loading

rbeucher commented Aug 9, 2023

dsroberts commented Aug 9, 2023

rbeucher commented Aug 9, 2023

dougiesquire commented Aug 9, 2023

dsroberts commented Aug 10, 2023

dougiesquire commented Aug 10, 2023

dsroberts commented Aug 16, 2023

headmetal commented Aug 16, 2023

ARE issue: init_fs_encoding: failed to get the Python codec #40

ARE issue: init_fs_encoding: failed to get the Python codec #40

Comments

rbeucher commented Aug 8, 2023

dsroberts commented Aug 9, 2023

rbeucher commented Aug 9, 2023

headmetal commented Aug 9, 2023 • edited Loading

rbeucher commented Aug 9, 2023

dsroberts commented Aug 9, 2023

rbeucher commented Aug 9, 2023

dougiesquire commented Aug 9, 2023

dsroberts commented Aug 10, 2023

dougiesquire commented Aug 10, 2023

dsroberts commented Aug 16, 2023

headmetal commented Aug 16, 2023

headmetal commented Aug 9, 2023 •

edited

Loading