
Replace EnhancedFileSystem with fsspec #128

Merged 3 commits on Nov 20, 2024

Conversation

@npennequin (Collaborator) commented on Nov 12, 2024

The current implementation of EnhancedFileSystem is based on the legacy pyarrow filesystem interface that was removed in pyarrow 16.0.0 (apache/arrow#39825).

We can entirely replace EnhancedFileSystem with fsspec. For HDFS fsspec relies on the new pyarrow filesystem interface.

Resolves #87
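As a hedged sketch of what the fsspec-based replacement looks like (using the "memory" protocol here so it runs without an HDFS cluster; in cluster-pack the protocol would typically be "hdfs" or "viewfs"):

```python
import fsspec

# Sketch only: resolve a filesystem and path from a URL with fsspec.
# "memory://" stands in for "hdfs://..." so the example runs locally.
fs, path = fsspec.url_to_fs("memory://tmp/demo.pex")

# The same generic interface (open/put/get/cat_file, ...) then applies
# regardless of the underlying filesystem implementation.
with fs.open(path, "wb") as f:
    f.write(b"payload")
assert fs.cat_file(path) == b"payload"
```

For HDFS, fsspec dispatches to the new pyarrow filesystem interface under the hood, which is what makes EnhancedFileSystem removable.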

@npennequin force-pushed the fsspec branch 4 times, most recently from d142367 to af290d6 on November 13, 2024
@npennequin changed the title from "Replace EnhancedFileSystem/EnhancedHdfsFile with fsspec" to "Replace EnhancedFileSystem with fsspec" on Nov 14, 2024
PyArrow stopped supporting it in 13.0.0.

The mocking setup for packaging.pack_spec_in_pex caused upload_spec to
copy the pex file with the same source and destination. This doesn't
fail with EnhancedFileSystem, but it raises an exception with fsspec
filesystems.
The current implementation of EnhancedFileSystem is based on the legacy
pyarrow filesystem interface that was removed in pyarrow 16.0.0
(apache/arrow#39825).

We can entirely replace EnhancedFileSystem with fsspec. For HDFS fsspec
relies on the new pyarrow filesystem interface.

Behavior change note: for put, fsspec doesn't preserve file permissions.

Resolves #87
@@ -137,6 +137,7 @@ def test_put():

remote_file = f"{temp_dir}/copied_script.sh"
fs.put(file, remote_file)
fs.chmod(remote_file, 0o755)
Contributor

I think this will probably lead to issues when users want to retrieve a pex from HDFS and execute it, as this was previously handled by preserve_acls in EnhancedFileSystem.

Since pyarrow no longer provides a chmod, I don't have a simple way to ensure that the original permissions are preserved.

Another way to solve the issue would be to do the chmod locally when the pex is retrieved from HDFS by cluster_pack, but depending on the usage, this solution would be different for each use case.

For example, in spark_example.py:

    # This creates an executable pex and uploads it to HDFS
    # (previously preserving permissions)
    package_path, _ = cluster_pack.upload_env()

    # This adds the HDFS path of the pex to "spark.yarn.dist.files" so it is
    # downloaded by Spark executors, and sets PYSPARK_PYTHON to the local pex
    # file retrieved from HDFS in the executors.
    spark_config_builder.add_packaged_environment(ssb, package_path)

It seems that if we lose the executable permission when going from local to HDFS, the pex won't be executable on the executors' side.
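A minimal sketch of the local-chmod workaround discussed above, under the assumption that the executable bits are simply restored after download (the file path here is hypothetical, standing in for the pex retrieved from HDFS):

```python
import os
import stat
import tempfile

# Hedged sketch: since fsspec's put/get don't preserve permissions,
# restore the executable bits locally after downloading the pex.
# `local_pex` is a hypothetical path standing in for the downloaded file.
local_pex = tempfile.NamedTemporaryFile(suffix=".pex", delete=False).name

mode = os.stat(local_pex).st_mode
os.chmod(local_pex, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
assert os.stat(local_pex).st_mode & stat.S_IXUSR

os.unlink(local_pex)
```

As noted in the comment, where exactly this chmod belongs depends on the use case, since each consumer retrieves the pex differently.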

@jcuquemelle jcuquemelle marked this pull request as ready for review November 20, 2024 16:59
@npennequin npennequin merged commit fe938c1 into master Nov 20, 2024
5 checks passed
npennequin added a commit that referenced this pull request Nov 26, 2024
Using `fsspec.url_to_fs` resulted in using an incorrect host value
for instantiating HDFS filesystems.

For example `fsspec.url_to_fs("viewfs://root/user/someone")` would
call `fsspec.filesystem("viewfs", host="root")`, which could cause
errors. Instead in this case we need the host to be `viewfs://root`, so
we restore most of the code of that function from before #128.

However since we're using fsspec we can generalize to all its supported
file system implementations.
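The host handling described above can be illustrated with a stdlib-only sketch (not the actual cluster-pack code):

```python
from urllib.parse import urlparse

# For a viewfs URL, the filesystem "host" must keep the scheme prefix.
url = "viewfs://root/user/someone"
parsed = urlparse(url)

naive_host = parsed.netloc                         # "root" -- what url_to_fs passed
full_host = f"{parsed.scheme}://{parsed.netloc}"   # "viewfs://root" -- what HDFS needs

assert (full_host, parsed.path) == ("viewfs://root", "/user/someone")
```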
Successfully merging this pull request may close these issues:

use fsspec instead of custom generic filesystem