Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sr3 status fixes #1350 and #1352. #1355

Merged
merged 7 commits into from
Dec 20, 2024
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions docs/source/Explanation/CommandLineGuide.rst
Original file line number Diff line number Diff line change
Expand Up @@ -544,6 +544,8 @@ will be:

* cpuS: process is expensive in CPU usage (runStateThreshold_cpuSlow)
* disa: disabled, configured not to run.
* disc: disconnected, cannot reach the message broker.
* down: cannot connect or exchange data with remote data source or sink.
* hung: processes appear hung, not writing anything to logs.
* idle: all processes running, but no data or message transfers for too long (runStateThreshold_idle)
* lag: all processes running, but messages being processed are too old ( runStateThreshold_lag )
Expand Down
7 changes: 7 additions & 0 deletions docs/source/Reference/sr3_options.7.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1766,6 +1766,13 @@ Examples that could contribute to this:

It defaults to inactive, but may be set to identify transient issues.

runStateThreshold_disconnected <interval> (default: 80)
------------------------------------------------

If a flow is connected less than the *runStateThreshold_disconnected* percent of the time,
sr3 status should show the state as *disconnected* (when the broker connection is the issue),
or *down* when it is the data connections that are broken.

runStateThreshold_hung <interval> (default: 450)
------------------------------------------------

Expand Down
4 changes: 4 additions & 0 deletions docs/source/fr/Explication/GuideLigneDeCommande.rst
Original file line number Diff line number Diff line change
Expand Up @@ -544,6 +544,10 @@ La deuxième rangée donne des détails sur les en têtes de chacune des catégo
Les configurations sont répertoriées sur la gauche. Pour chaque configuration, l’état.
sera :

* cpuS : le processus est coûteux en termes d'utilisation du processeur (runStateThreshold_cpuSlow)
* disa : disabled (désactivé), configuré pour ne pas s'exécuter.
* disc : déconnecté, impossible d'atteindre le courtier de messages. (runStateThreshold_disconnected.)
* down : impossible de se connecter ou d'échanger des données avec une source ou un récepteur de données distant.
* hung : les processus semblent bloqués et n'écrivent rien dans les journaux.
* idle : tous les processus en cours d'exécution, mais ne transfert pas depuis trop longtemps (runStateThreshold_idle.)
* lag : tous les processus en cours d'exécution, mais les messages en cours de traitement sont trop anciens ( runStateThreshold_lag )
Expand Down
6 changes: 6 additions & 0 deletions docs/source/fr/Reference/sr3_options.7.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1746,6 +1746,12 @@ Exemples qui pourraient y contribuer :

Par défaut, il est inactif, mais peut être défini pour identifier des problèmes temporaires.

runStateThreshold_disconnected <intervalle> (par défaut : 80)
------------------------------------------------

Si un flux est connecté moins de fois que le pourcentage de temps *runStateThreshold_disconnected*,
le statut sr3 affichera l'état comme *déconnecté* (lorsque la connexion au broker est le problème),
ou *down* lorsque ce sont les connexions de données qui sont interrompues.

runStateThreshold_hung <intervalle> (défaut: 450s)
--------------------------------------------------
Expand Down
2 changes: 2 additions & 0 deletions sarracenia/config/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -114,6 +114,7 @@ def __repr__(self) -> str:
'realpathPost': False,
'recursive' : True,
'runStateThreshold_reject': 80,
'runStateThreshold_disconnected': 80,
'report': False,
'retryEmptyBeforeExit': False,
'retry_refilter': False,
Expand All @@ -135,6 +136,7 @@ def __repr__(self) -> str:
count_options = [
'batch', 'count', 'exchangeSplit', 'instances', 'logRotateCount', 'no',
'post_exchangeSplit', 'prefetch', 'messageCountMax', 'runStateThreshold_cpuSlow',
'runStateThreshold_disconnected',
'runStateThreshold_reject', 'runStateThreshold_retry', 'runStateThreshold_slow',
]

Expand Down
28 changes: 22 additions & 6 deletions sarracenia/sr.py
Original file line number Diff line number Diff line change
Expand Up @@ -55,13 +55,14 @@

logger = logging.getLogger(__name__)

empty_metrics={ "byteRate":0, "cpuTime":0, "rejectCount":0, "last_housekeeping":0, "messagesQueued": 0,
"lagMean": 0, "latestTransfer": 0, "rejectPercent":0, "transferRxByteRate":0, "transferTxByteRate": 0,
empty_metrics={ "byteRate":0, "cpuTime":0, "connected": True, "rejectCount":0, "last_housekeeping":0, "messagesQueued": 0,
"lagMean": 0, "latestTransfer": 0, "rejectPercent":0, "transferConnected": True, "transferRxByteRate":0,
"transferTxByteRate": 0,
"rxByteCount":0, "rxGoodCount":0, "rxBadCount":0, "txByteCount":0, "txGoodCount":0, "txBadCount":0,
"lagMax":0, "lagTotal":0, "lagMessageCount":0, "disconnectTime":0, "transferConnectTime":0,
"transferRxLast": 0, "transferTxLast": 0, "rxLast":0, "txLast":0,
"transferRxBytes":0, "transferRxFiles":0, "transferTxBytes": 0, "transferTxFiles": 0,
"msgs_in_post_retry": 0, "msgs_in_download_retry":0, "brokerQueuedMessageCount": 0,
"msgs_in_post_retry": 0, "msgs_in_download_retry":0, "msgs_in_post_retry":0, "brokerQueuedMessageCount": 0,
MagikEh marked this conversation as resolved.
Show resolved Hide resolved
petersilva marked this conversation as resolved.
Show resolved Hide resolved
'time_base': 0, 'byteTotal': 0, 'byteRate': 0, 'msgRate': 0, 'msgRateCpu': 0, 'retry': 0,
'messageLast': 0, 'transferLast': 0, 'connectPercent': 0, 'byteConnectPercent': 0
}
Expand Down Expand Up @@ -942,10 +943,13 @@ def _resolve(self):
#print( f"k={k}" )
if k in metrics:
newval = self.states[c][cfg]['instance_metrics'][i][j][k]
#print( f"k={k}, newval={newval}" )
#print( f"k={k}, type={type(newval)} newval={newval}" )
if k in [ "lagMax" ]:
if newval > metrics[k]:
metrics[k] = newval
elif k in [ "connected", "transferConnected" ]:
if not newval:
metrics[k] = False
elif k in [ "last_housekeeping" ]:
if metrics[k] == 0 or newval < metrics[k] :
metrics[k] = newval
Expand All @@ -970,7 +974,6 @@ def _resolve(self):

if 'transferConnectTime' in metrics:
metrics['transferConnectTime'] = metrics['transferConnectTime'] / len(self.states[c][cfg]['instance_metrics'])

if 'disconnectTime' in metrics:
metrics['disconnectTime'] = metrics['disconnectTime'] / len(self.states[c][cfg]['instance_metrics'])

Expand Down Expand Up @@ -1086,6 +1089,17 @@ def _resolve(self):
self.states[c][cfg]['hung_instances'].append(i)

flow_status = 'unknown' if self.configs[c][cfg]['status'] != 'disabled' else 'disabled'
if hasattr(self.configs[c][cfg]['options'],'download') and self.configs[c][cfg]['options'].download and \
(self.states[c][cfg]['metrics']['retry']+self.states[c][cfg]['metrics']['messagesQueued'] > 0 ) :
if not self.states[c][cfg]['metrics']['transferConnected']:
flow_status='down'
elif (self.states[c][cfg]['metrics']['connectPercent']< self.configs[c][cfg]['options'].runStateThreshold_disconnected/100):
flow_status='disconnected'
elif not self.states[c][cfg]['metrics']['connected']:
flow_status='disconnected'
elif (self.states[c][cfg]['metrics']['byteConnectPercent']>0) and (self.states[c][cfg]['metrics']['byteConnectPercent']< self.configs[c][cfg]['options'].runStateThreshold_disconnected/100):
flow_status='down'

if hung_instances > 0 and (observed_instances > 0):
flow_status = 'hung'
elif observed_instances < int(self.configs[c][cfg]['instances']):
Expand Down Expand Up @@ -1123,6 +1137,8 @@ def _resolve(self):
flow_status = 'reject'
elif self.configs[c][cfg]['options'].attempts == 0:
flow_status='standby'
elif flow_status in [ 'down', 'disconnected' ]:
pass
elif hasattr(self.configs[c][cfg]['options'],'post_broker') and self.configs[c][cfg]['options'].post_broker \
and (now-self.states[c][cfg]['metrics']['txLast']) > self.configs[c][cfg]['options'].runStateThreshold_idle:
flow_status = 'idle'
Expand Down Expand Up @@ -1312,7 +1328,7 @@ def __init__(self, opt, config_fnmatches=None):
'sender', 'shovel', 'subscribe', 'watch', 'winnow'
]
# active means >= 1 process exists on the node.
self.status_active = ['cpuSlow', 'hung', 'idle', 'lagging', 'partial', 'reject', 'retry', 'running', 'slow', 'standby', 'waitVip' ]
self.status_active = ['cpuSlow', 'disconnected', 'down', 'hung', 'idle', 'lagging', 'partial', 'reject', 'retry', 'running', 'slow', 'standby', 'waitVip' ]
self.status_values = self.status_active + [ 'disabled', 'include', 'missing', 'stopped', 'unknown' ]

self.bin_dir = os.path.dirname(os.path.realpath(__file__))
Expand Down
Loading