-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace EnhancedFileSystem with fsspec #128
Conversation
d142367
to
af290d6
Compare
PyArrow has stopped supporting it since 13.0.0
The mocking setup for packaging.pack_spec_in_pex caused upload_spec to copy the pex file with the same source and destination. This doesn't fail with EnhancedFileSystem, but it raises an exception with fsspec filesystems
The current implementation of EnhancedFileSystem is based on the legacy pyarrow filesystem interface that was removed in pyarrow 16.0.0 (apache/arrow#39825). We can entirely replace EnhancedFileSystem with fsspec. For HDFS fsspec relies on the new pyarrow filesystem interface. Behavior change note: for put, fsspec doesn't preserve file permissions Resolves #87
@@ -137,6 +137,7 @@ def test_put(): | |||
|
|||
remote_file = f"{temp_dir}/copied_script.sh" | |||
fs.put(file, remote_file) | |||
fs.chmod(remote_file, 0o755) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will probably lead to issues when users will want to retrieve a pex from hdfs and execute it, as this was handled by the preserve_acls in the EnhancedFileSystem
.
Since pyarrow doesn't provide a chmod any more, I don't have a simple way to ensure that original permissions are preserved.
Another way to solve the issue would be to to the chmod locally when the pex is retrieved from hdfs by cluster_pack, but depending on the usage, this solution would be different for each use case.
for example in the spark_example.py:
#this creates an executable pex, and uploads it on hdfs (previously preserving permissions)
package_path, _ = cluster_pack.upload_env()
#this adds the hdfs path of the pex in "spark.yarn.dist.files" so it is downloaded by spark executors, and
#sets PYSPARK_PYTHON to the local pex file retrieved from hdfs in the executors.
spark_config_builder.add_packaged_environment(ssb, package_path)
It seems that if we have lost the executable permission when going from local to hdfs, it won't be executable on executor's side
Using `fsspec.url_to_fs` resulted in using an incorrect host value for instantiating HDFS filesystems. For example `fsspec.url_to_fs("viewfs://root/user/someone")` would call `fsspec.filesystem("viewfs", host="root")`, which could cause errors. Instead in this case we need the host to be `viewfs://root`, so we restore most of the code of that function from before #128. However since we're using fsspec we can generalize to all its supported file system implementations.
Using `fsspec.url_to_fs` resulted in using an incorrect host value for instantiating HDFS filesystems. For example `fsspec.url_to_fs("viewfs://root/user/someone")` would call `fsspec.filesystem("viewfs", host="root")`, which could cause errors. Instead in this case we need the host to be `viewfs://root`, so we restore most of the code of that function from before #128. However since we're using fsspec we can generalize to all its supported file system implementations.
Using `fsspec.url_to_fs` resulted in using an incorrect host value for instantiating HDFS filesystems. For example `fsspec.url_to_fs("viewfs://root/user/someone")` would call `fsspec.filesystem("viewfs", host="root")`, which could cause errors. Instead in this case we need the host to be `viewfs://root`, so we restore most of the code of that function from before #128. However since we're using fsspec we can generalize to all its supported file system implementations.
The current implementation of
EnhancedFileSystem
is based on the legacy pyarrow filesystem interface that was removed in pyarrow 16.0.0 (apache/arrow#39825).We can entirely replace
EnhancedFileSystem
with fsspec. For HDFS fsspec relies on the new pyarrow filesystem interface.Resolves #87