Error handling and retry behaviour improved #42

Open: dmichal wants to merge 8 commits into main

dmichal commented Jul 15, 2020

Several changes to error handling:

  • Added support for infinite retries. This better fits the Logstash philosophy of not dropping events when errors occur, which preserves data integrity.
  • 'retry_times' is now honoured during failovers. Previously the parameter was ignored in that case, which led to infinite retries.
  • Reduced unnecessary failovers caused by connection errors. Resolves "Unnecessary failovers for HDFS namenode" (#36).
  • Added an optional limit for the retry interval. This is useful with infinite retries, because the interval increases with each attempt and could otherwise reach extremely high values.
  • Improved the docs with a better description of retry behaviour and 'retry_interval' increments.
  • Errors during file creation are now handled properly. Previously they were not handled at all; now they are treated like any other write error.
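To illustrate how a retry limit, infinite retries, and a capped retry interval could fit together, here is a minimal Python sketch. It is not the plugin's code. It assumes the interval grows linearly with the attempt number (the description only says it increases), that a negative `retry_times` means "retry forever", and that the cap is called `max_interval`; all three are assumptions made for illustration.

```python
import time
from typing import Callable, Optional


def retry_delay(attempt: int, retry_interval: float,
                max_interval: Optional[float] = None) -> float:
    """Delay before retry number `attempt` (1-based).

    Assumption: linear growth (retry_interval * attempt).
    `max_interval` is a hypothetical name for the optional cap.
    """
    delay = retry_interval * attempt
    if max_interval is not None:
        delay = min(delay, max_interval)
    return delay


def should_retry(failures: int, retry_times: int) -> bool:
    """Assumption: a negative retry_times means retry forever."""
    return retry_times < 0 or failures <= retry_times


def write_with_retries(write: Callable[[], str], retry_times: int,
                       retry_interval: float,
                       max_interval: Optional[float] = None,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Call `write` until it succeeds or the retry limit is exhausted."""
    failures = 0
    while True:
        try:
            return write()
        except OSError:
            failures += 1
            if not should_retry(failures, retry_times):
                raise  # limit reached: surface the last error
            sleep(retry_delay(failures, retry_interval, max_interval))
```

Without a cap, an endlessly retried write would wait longer on every attempt; with `max_interval` set, the wait stops growing once it reaches the cap.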

dmichal added 7 commits July 9, 2020 14:56
Previously, the 'retry_times' parameter was ignored after failovers, causing unconditional and possibly infinite retries.
Previously, the plugin performed failovers on datanode connection errors. This has been changed: the plugin now checks whether the host and port for which the connection error occurred match the namenode host and port.
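The host and port check described in the last commit message can be sketched as follows. This is a hedged Python illustration, not the plugin's implementation; the function name, its parameters, and the example hosts and ports are hypothetical.

```python
def is_namenode_error(error_host: str, error_port: int,
                      namenode_host: str, namenode_port: int) -> bool:
    """Return True only if the failed connection targeted the namenode.

    Hypothetical helper: a connection error against any other host or port
    (for example a datanode) should not trigger a namenode failover.
    """
    same_host = error_host.lower() == namenode_host.lower()
    same_port = int(error_port) == int(namenode_port)
    return same_host and same_port


# Illustrative values only.
if is_namenode_error("nn1.example.com", 50070, "nn1.example.com", 50070):
    print("namenode unreachable: fail over to the standby")
if not is_namenode_error("dn7.example.com", 50075, "nn1.example.com", 50070):
    print("datanode error: retry without failover")
```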

dmichal commented Jul 15, 2020

One more change I'm considering is adding a sleep before the retry that follows a failover, just as is done for other errors. This may prove beneficial when both namenodes are down or inaccessible: in the current implementation, retries are performed immediately after the error, which wastes resources, produces a lot of unnecessary log messages, and quickly uses up the retry limit. On the other hand, when one namenode is working properly, sleeping for a while after a failover won't have a significant impact on performance, since it happens only once.
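A rough Python sketch of that proposal, under stated assumptions: two namenodes tried in turn, a linearly growing pause, and an injectable `sleep` so the behaviour can be observed without waiting. The function and its parameters are hypothetical and do not mirror the plugin's internals.

```python
import time
from typing import Callable, Sequence


def write_with_failover(namenodes: Sequence[str],
                        write: Callable[[str], str],
                        retry_times: int,
                        retry_interval: float,
                        sleep: Callable[[float], None] = time.sleep) -> str:
    """Try the active namenode; on a connection error, fail over and pause."""
    active = 0
    failures = 0
    while True:
        try:
            return write(namenodes[active])
        except ConnectionError:
            failures += 1
            if failures > retry_times:
                raise  # retry limit reached
            # Switch to the other namenode, then wait before retrying
            # (the proposed change; assumes linear growth of the pause).
            active = (active + 1) % len(namenodes)
            sleep(retry_interval * failures)
```

With both namenodes down, the pauses spread the attempts over time instead of exhausting the limit at once. With one healthy namenode, only a single pause occurs before the write succeeds.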
