-
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; no need to wait for approval.
-
Can you please explain what executor and what log volume configuration you have? I believe it might have something to do with the volume you are using to store the logs. This looks very much like the volume does not allow files to be written and read concurrently. I think it would be great if you could check that and see what type of volume you have there. Or @dstandish -> do you think that might be related to the logging change you implemented? Maybe that rings a bell too?
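If it helps to narrow that down, here is a minimal sketch (my own, not from Airflow or the provider) that writes to a file on the logs volume from one handle while reading it back from another; the `/opt/airflow/logs` path is an assumption based on the `base_log_folder` quoted later in the thread:

```python
# Hypothetical concurrency probe: append to a file on the logs volume with one
# handle while a second, independent handle reads it back.
import os

LOG_DIR = "/opt/airflow/logs"               # assumed mount path of the logs volume
probe = os.path.join(LOG_DIR, "concurrency_probe.log")

with open(probe, "a") as writer, open(probe, "r") as reader:
    writer.write("first line\n")
    writer.flush()
    os.fsync(writer.fileno())               # force the write down to the volume
    print("read back:", reader.read())      # should print the line just written

os.remove(probe)
```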
-
Could this issue be related to #45529? The error log traceback also references
-
Hi @potiuk,
As you can see in the stacktrace in the middle of my issue, the executor is

Here is our logging config section:

```
base_log_folder = /opt/airflow/logs
remote_logging = True
remote_log_conn_id = s3_airflow_logs
delete_local_logs = False
google_key_path =
remote_base_log_folder = s3://the-bucket
```

Here is how it's run by the k8s cluster:

```yaml
# ...
spec:
  containers:
    - volumeMounts:
        - mountPath: /opt/airflow/logs
          name: logs
      # ...
  volumes:
    - emptyDir:
        sizeLimit: 10Gi
      name: logs
```
I don't think my problem is purely a configuration issue. If I downgrade my airflow instance to version 2.4.3, the problem disappears.
-
Hi @jason810496
Sorry, I was not clear in my previous message: the problem only occurs when a job and its tasks of interest are being run, which assumes that its state is running.
-
Thanks, that saves a bit of searching through a stack trace. In the future it might be better to specify it explicitly rather than leave a chance that someone will find it. It allows people who look at it and try to help to quickly assess whether they can help, or whether the case "rings a bell", without actually spending time digging into such details. It simply optimizes the time of those who are trying to help you solve your problem.

I was thinking more about the properties of the volume you have. That is something you can look at in the way your K8S handles volumes of the specific kind you use. The error indicates that somewhere, while receiving logs, you get a "resource unavailable" error. After looking at this, it seems that somewhere k8s reads logs from the remote pod and something does not let it read them, and I think in this case it's something in your K8S configuration. There is a similar issue, kubernetes/kubernetes#105928, which indicates that somewhere logs are filling up space - for example the containerd version used has a problem.

And yes - I think the way logs are read has changed between versions of the k8s provider - you can take a look at the changelog - so maybe you have uncovered a configuration issue or another issue in your K8S. Maybe you can try to look at your k8s logs correlated with the events and see if you have some other errors in other components of K8S that indicate the root cause. Unfortunately k8s has hundreds of moving parts and sometimes you need to dig deeper to find the root causes (for example, often very strange problems occur when your DNS does not have enough resources to respond on time), and the only way to see what's going on is to look broadly at what happens in your K8S and find potential issues that are correlated with the event.

But I am mostly guessing here - I wanted to help and direct the discussion, but I have no deep knowledge of this particular part.
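If it helps to isolate whether the raw log read itself is the problem, here is a standalone debugging sketch (my own, not Airflow or provider code) that reads the pod's log directly through the Kubernetes API; the pod name and namespace are placeholders:

```python
# Read a pod's log directly from the Kubernetes API, bypassing Airflow, to
# check whether the raw read already fails. Pod name and namespace are
# placeholders and must be replaced with real values.
from kubernetes import client, config

config.load_incluster_config()      # use config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()

text = v1.read_namespaced_pod_log(
    name="my-task-pod",             # placeholder
    namespace="airflow",            # placeholder
    tail_lines=100,
)
print(text)
```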
-
Hi @potiuk, Thanks for your reply.
No problem; as the subject is very complex, I didn't know how to present the key elements.
I've actually researched this problem quite a bit. Firstly, if I roll back to Airflow 2.4.3, the problem disappears. Another thing is that I've patched the Airflow code with
As you can see, all lines of code are slightly offset.
-
Yeah, I think I know what it is. It looks like your processes have a too-low ulimit set. The error usually happens when there are not enough sockets. This is an "OS"-level error and is often caused by such limits being set too low. So you have to look at the configuration of your Kubernetes and kernel to see how you can increase the numbers. Converting it into a discussion in case more is needed.
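For a quick check of those limits from inside the webserver/worker container, a small sketch (my own, not from the thread) using Python's standard `resource` module:

```python
# Print the soft/hard limits most relevant to "resource temporarily
# unavailable" errors: open file descriptors and processes/threads per user.
import resource

for label, which in [
    ("RLIMIT_NOFILE (open files)", resource.RLIMIT_NOFILE),
    ("RLIMIT_NPROC (processes/threads)", resource.RLIMIT_NPROC),
]:
    soft, hard = resource.getrlimit(which)
    print(f"{label}: soft={soft} hard={hard}")
```

You can compare the values with what `ulimit -n` and `ulimit -u` report on the node itself.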
-
Here is an example SO question where similar behaviour is explained: https://stackoverflow.com/questions/70599628/ubuntu-16-gives-fork-retry-resource-temporarily-unavailable-ubuntu-20-doesn
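For what it's worth, the error in that question and the `BlockingIOError` above share the same errno (`EAGAIN`). A Linux-only sketch (my own illustration, not something from the thread; it temporarily squeezes the per-user process limit and will not trigger when running as root):

```python
# Illustration: when the per-user process limit is exhausted, fork() fails
# with EAGAIN, which Python surfaces as BlockingIOError
# ("Resource temporarily unavailable").
import errno
import os
import resource

print(os.strerror(errno.EAGAIN))    # -> "Resource temporarily unavailable"

soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
resource.setrlimit(resource.RLIMIT_NPROC, (1, hard))    # temporarily lower the soft limit
try:
    pid = os.fork()                 # expected to fail with EAGAIN for non-root users
    if pid == 0:
        os._exit(0)
    os.waitpid(pid, 0)
except BlockingIOError as exc:
    print("fork failed:", exc)      # [Errno 11] Resource temporarily unavailable
finally:
    resource.setrlimit(resource.RLIMIT_NPROC, (soft, hard))    # restore the limit
```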
-
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.10.3
What happened?
Since our migration from Airflow 2.4.3 to 2.9.3 and then to 2.10.3, we have noticed that it has become impossible to access logs via the web UI or the REST API for a running Task instance.
We run our Airflow instance within the in-house k8s infrastructure, using S3 as our remote logging backend.
When the Task instance completes its run, the remote log is visible through the web UI. In v2.4.3, with the same parameters, we never encountered similar issues. Here is our logging config section:
When we try to access the logs for the running task, we see the following text with no content:
Same result for already finalized task attempts:
When we try to get the logs via the REST API (`/api/v1/dags/MY-DAG1/dagRuns/manual__DATE/taskInstances/MY-TASK/logs/8?full_content=false`), we get a time-out exception after a long wait, together with the following page:
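For reference, a minimal sketch of such a call with an explicit client-side timeout (the endpoint path is the one quoted above; the host, credentials and timeout value are assumptions):

```python
# Call the task-instance log endpoint with a client-side timeout so the
# request fails fast instead of hanging. Host and credentials are placeholders.
import requests

BASE_URL = "http://localhost:8080"          # assumed webserver address
ENDPOINT = ("/api/v1/dags/MY-DAG1/dagRuns/manual__DATE"
            "/taskInstances/MY-TASK/logs/8?full_content=false")

resp = requests.get(
    BASE_URL + ENDPOINT,
    auth=("airflow", "airflow"),            # assumed basic-auth credentials
    timeout=30,
)
print(resp.status_code)
print(resp.text[:500])
```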
What you think should happen instead?
If we check the webserver logs we notice the following exceptions:
What we notice is that the `s3_task_handler` does its part of the job correctly: for a running task it gets the S3 content, and if there is no content it clearly says `No logs found on s3 for ti=<TaskInstance: ...`.
The problem starts when we try to get stdout for the running k8s pod; as shown above, it ends with `BlockingIOError - Resource temporarily unavailable`. It all fails in `file_task_handler`, within the `_read` method:

It looks like this problem has been around for several minor releases.
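To show the class of error involved (my own standalone sketch, not the Airflow code path): a read on a file descriptor that is in non-blocking mode and has no data ready raises exactly this `BlockingIOError` with errno `EAGAIN`:

```python
# Reading a non-blocking fd with no data ready raises BlockingIOError
# (errno 11 / EAGAIN, "Resource temporarily unavailable").
import os

read_fd, write_fd = os.pipe()
os.set_blocking(read_fd, False)     # put the read end into non-blocking mode

try:
    os.read(read_fd, 1024)          # nothing has been written yet
except BlockingIOError as exc:
    print("got:", exc)              # [Errno 11] Resource temporarily unavailable
finally:
    os.close(read_fd)
    os.close(write_fd)
```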
How to reproduce
You need to deploy an instance of Airflow within a k8s cluster with remote logs activated; that should be enough. To solve another issue related to remote logging, we set up the following env vars (not sure if it's relevant):
Operating System
Debian GNU/Linux trixie/sid
Versions of Apache Airflow Providers
Deployment
Official Apache Airflow Helm Chart
Deployment details
Kube version:
v1.30.4
Helm:
version.BuildInfo{Version:"v3.15.2", GitCommit:"1a500d5625419a524fdae4b33de351cc4f58ec35", GitTreeState:"clean", GoVersion:"go1.22.4"}
Anything else?
No response
Are you willing to submit PR?
Code of Conduct