feat: health status implementation #1406

kohlisid · 2023-12-06T10:11:31Z

Health status definitions

It is divided into two parts:

Pipeline Resource Health: It is based on the health of each vertex in the pipeline
Data Criticality: It is based on the data movement of the pipeline

Resource Health can be "healthy (0) | unhealthy (1) | paused (3) | unknown (4)".

Resource Health purely indicates whether the pipeline is up and running.
Resource health is max(health) taken over every vertex's health, so the pipeline reports the highest status code among its vertices.

Resource health checks whether all the pods for the pipeline are in the running state, and also accounts for paused and unknown pipelines.
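
A minimal sketch of the max(health) rule, assuming the numeric codes listed above. The type, constant, and function names are hypothetical and are not taken from the PR; treating a pipeline with no vertices as unknown is also an assumption.

```go
package main

import "fmt"

// resourceHealth mirrors the codes from the description.
// The Go identifiers are hypothetical, chosen for illustration only.
type resourceHealth int

const (
	healthy   resourceHealth = 0
	unhealthy resourceHealth = 1
	paused    resourceHealth = 3
	unknown   resourceHealth = 4
)

// pipelineResourceHealth returns the highest code found among the vertices.
// Assumption: a pipeline with no vertices is reported as unknown.
func pipelineResourceHealth(vertices map[string]resourceHealth) resourceHealth {
	if len(vertices) == 0 {
		return unknown
	}
	highest := healthy
	for _, h := range vertices {
		if h > highest {
			highest = h
		}
	}
	return highest
}

func main() {
	vertices := map[string]resourceHealth{
		"in":  healthy,
		"cat": unhealthy,
		"out": healthy,
	}
	fmt.Println(pipelineResourceHealth(vertices)) // prints 1
}
```

With this rule a single unhealthy vertex is enough to mark the whole pipeline unhealthy.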

Data Criticality, on the other hand, shows whether the pipeline is working as expected.
It represents the pending messages, lags, etc.
Data Criticality can be "ok (0) | warning (1) | critical (2)".
A backlogged pipeline can be healthy even though it has an increasing back-pressure.

For data criticality, timeline data is populated for the pipeline, and if the average usage lies above the thresholds, the corresponding state is assigned. For critical states there is an option to do a lookback and assign critical only if a predefined number of critical states is seen within the lookback window.
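
The flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: the smoothing factor, both thresholds, the window size, the required count, and all identifiers are hypothetical, and downgrading an unconfirmed critical state to warning is a guess at one reasonable behaviour.

```go
package main

import "fmt"

type criticality int

const (
	ok criticality = iota
	warning
	critical
)

// All values below are hypothetical and chosen only for the example.
const (
	alpha             = 0.5  // EWMA smoothing factor
	warnThreshold     = 50.0 // usage percent above which the state is warning
	criticalThreshold = 80.0 // usage percent above which the state is critical
	lookbackWindow    = 5    // number of recent states inspected
	requiredCritical  = 3    // critical states needed inside the window
)

// ewma folds the usage samples into an exponentially weighted moving average.
func ewma(samples []float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	avg := samples[0]
	for _, s := range samples[1:] {
		avg = alpha*s + (1-alpha)*avg
	}
	return avg
}

// classify maps an average usage value to a state using the thresholds.
func classify(avg float64) criticality {
	switch {
	case avg > criticalThreshold:
		return critical
	case avg > warnThreshold:
		return warning
	default:
		return ok
	}
}

// withLookback confirms a critical state only if enough critical states
// appear in the most recent window; otherwise it reports warning (assumption).
func withLookback(history []criticality) criticality {
	if len(history) == 0 {
		return ok
	}
	latest := history[len(history)-1]
	if latest != critical {
		return latest
	}
	start := len(history) - lookbackWindow
	if start < 0 {
		start = 0
	}
	count := 0
	for _, s := range history[start:] {
		if s == critical {
			count++
		}
	}
	if count >= requiredCritical {
		return critical
	}
	return warning
}

func main() {
	avg := ewma([]float64{40, 60, 100})
	fmt.Println(avg, classify(avg)) // prints 75 1
	fmt.Println(withLookback([]criticality{ok, critical, warning, critical})) // prints 1
}
```

The lookback keeps a single spike in usage from flipping the pipeline straight to critical.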

go.mod

whynowy

partial review

pkg/daemon/server/service/healthStatus.go

pkg/daemon/server/daemon_server.go

pkg/daemon/server/service/healthStatus.go

vigith

awesome job on the well-commented code :)

pkg/daemon/server/service/healthStatus.go

pkg/shared/ewma/interface.go

pkg/shared/ewma/simple_ewma.go

pkg/shared/health-status-code/code_map.go

server/apis/v1/health.go

pkg/daemon/server/service/healthStatus.go

pkg/shared/ewma/interface.go

whynowy · 2023-12-13T19:32:28Z

pkg/daemon/server/service/healthStatus.go

+func (hc *HealthChecker) StartHealthCheck(ctx context.Context) {
+	// Goroutine to listen for ticks
+	// At every tick, check and update the health status of the pipeline.
+	go func(ctx context.Context) {


Shouldn't StartHealthCheck be blocking? The daemon server already starts it in a goroutine.

kohlisid · 2023-12-15

Updated. Please check if it looks fine now.
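
For readers following the thread, here is a rough sketch of the blocking shape being asked for. It is not the code from the PR: the type, its fields, and the `demo` helper are hypothetical, and the counter stands in for the real status update.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// healthChecker is a hypothetical stand-in for the PR's HealthChecker.
type healthChecker struct {
	interval time.Duration
	checks   atomic.Int64 // number of completed checks
}

// startHealthCheck blocks until ctx is cancelled and runs one check per tick.
// It spawns no goroutine itself; the caller decides how to run it.
func (hc *healthChecker) startHealthCheck(ctx context.Context) {
	ticker := time.NewTicker(hc.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			hc.checks.Add(1) // placeholder for updating the health status
		case <-ctx.Done():
			return
		}
	}
}

// demo plays the role of the caller (for example a daemon server):
// it owns the goroutine and waits for the checker to exit.
func demo() int64 {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	hc := &healthChecker{interval: 10 * time.Millisecond}
	done := make(chan struct{})
	go func() {
		defer close(done)
		hc.startHealthCheck(ctx)
	}()
	<-done
	return hc.checks.Load()
}

func main() {
	fmt.Println("checks run:", demo())
}
```

Keeping the function blocking leaves the goroutine's lifetime in one place, the caller, which makes shutdown on context cancellation easier to reason about.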

server/apis/v1/health.go

yhl25

I see a lot of methods and types exposed in healthStatus.go. Avoid exposing types and methods if they are not used in other packages.

pkg/daemon/server/service/healthStatus.go

whynowy · 2023-12-19T17:35:35Z

@kohlisid - resolve the conflicts?

kohlisid marked this pull request as ready for review December 11, 2023 19:29

kohlisid requested review from whynowy and vigith as code owners December 11, 2023 19:29

kohlisid requested a review from a team December 11, 2023 19:29

vigith reviewed Dec 11, 2023

go.mod (outdated, resolved)

whynowy reviewed Dec 11, 2023

pkg/daemon/server/service/healthStatus.go (outdated, resolved)

pkg/daemon/server/service/healthStatus.go (outdated, resolved)

pkg/daemon/server/service/healthStatus.go (resolved)

whynowy reviewed Dec 11, 2023

pkg/daemon/server/daemon_server.go (outdated, resolved)

kohlisid requested review from whynowy and vigith December 13, 2023 00:50

vigith reviewed Dec 13, 2023

pkg/daemon/server/service/healthStatus.go (outdated, resolved)

vigith reviewed Dec 13, 2023

whynowy reviewed Dec 13, 2023

yhl25 reviewed Dec 14, 2023

pkg/daemon/server/service/healthStatus.go (outdated, resolved)

kohlisid requested review from yhl25, whynowy and vigith December 15, 2023 19:23

vigith approved these changes Dec 16, 2023

health impl (475ea96)

kohlisid added 11 commits December 19, 2023 09:37

add cache to status check (49cc212)
return message and status (d899680)
Add status code mapping (af1e468)
rename to resource health (2f6299b)
add unit tests (c5cee89)
clean up (adc0692)
cleanup (58cfecc)
comments (4e66a64)
refactor (bf1293a)
refactor (b48852c)
refactor (f3a52e3)

Each commit above carries the trailer "Signed-off-by: Sidhant Kohli".

kohlisid force-pushed the health branch from 947502c to f3a52e3 Compare December 19, 2023 17:45

Merge branch 'main' into health

fb4d5a5

whynowy approved these changes Dec 19, 2023

whynowy merged commit bca1b3b into numaproj:main Dec 19, 2023
19 checks passed

kohlisid deleted the health branch June 25, 2024 17:53
