Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stability] Survive vips sigabrt #75

Merged
merged 4 commits into from
Jan 28, 2025
Merged

[stability] Survive vips sigabrt #75

merged 4 commits into from
Jan 28, 2025

Conversation

blefaudeux
Copy link
Contributor

@blefaudeux blefaudeux commented Jan 24, 2025

For some payloads vips will sigabrt on a function call, without an obvious way to catch this beforehand (initial loadbuffer doesn't error out, image characteristics have nothing specific). Happening very often on a given cluster, guess is that network could be less reliable and broken packets would not be caught somehow.

Adding a relatively ugly but seemingly reliable defer(recover) pattern which will skip this payload for now, testing overnight. Not sure how to unit test this

Note that this PR enables img.Close()-ing the vips image when done, which lowers the RAM use quite dramatically (not waiting for the go GC)

@@ -0,0 +1,90 @@
package main
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a binary to stress test the lib in the vectorDB path

@@ -29,11 +29,9 @@ func main() {
sourceConfig.Rank = 0
sourceConfig.WorldSize = 1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a binary to stress test the lib in the filesystem path

@@ -161,12 +167,8 @@ type DatagoClient struct {

// GetClient is a constructor for the DatagoClient, given a JSON configuration string
func GetClient(config DatagoConfig) *DatagoClient {
// Make sure that the GC is run more often than usual
// VIPS will allocate a lot of memory and we want to make sure that it's released as soon as possible
os.Setenv("GOGC", "10") // Default is 100, we're running it when heap is 10% larger than the last GC
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was ugly, not required when we properly release the buffers (and it doesn't crash on us)

// Initialize the vips library
err := os.Setenv("VIPS_DISC_THRESHOLD", "5g")
err := os.Setenv("VIPS_DISC_THRESHOLD", "10g")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in other terms: never go do disk. Could be changed depending on where this runs, 2TB servers for Photoroom so ram is fine

}

err = img.AutoRotate()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not required anymore, can be done at load time, a little cleaner

pkg/serdes.go Outdated
// Decode the image payload using vips
img, err := vips.NewImageFromBuffer(buffer)
// Decode the image payload using vips, using bulletproof settings
importParams := vips.NewImportParams()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explicit and visible load settings, didn't make anything more stable but still a good thing probably

// If the image is 4 channels, we need to drop the alpha channel
if img.Bands() == 4 {
err = img.Flatten(&vips.Color{R: 255, G: 255, B: 255}) // Flatten with white background
if err != nil {
fmt.Println("Error flattening image:", err)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code style, reporting these messages directly in the error, nicer

return nil, -1., fmt.Errorf("error converting to sRGB: %w", err)
}
} else {
// // FIXME: maybe that we could recover these still. By default throws an error, sRGB and PNG not supported
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleaner conversion, vips would complain that we were doing something illegal for these images so skipping it now

// Optionally crop and resize the image on the fly. Save the aspect ratio in the process for future use
originalWidth, originalHeight := img.Width(), img.Height()

if transform != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the crux of the stability fix when direct-to-cluster, the subsequent vips call can trigger a sigabrt and I couldn't find any way to catch this case beforehand.
Flipping this around, this can happen but we recover from it. Guess is that some network transmission was broken and we didn't catch it as we should have

if err != nil {
return nil, -1., err
}
imgBytes, _, err = img.ExportPng(&vips.PngExportParams{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as in the loading case, be explicit above the params used here instead of using ones which would depend on the library defaults

// Define bit depth de facto, not exposed in the vips interface
bitDepth = len(imgBytes) / (width * height * channels) * 8 // 8 bits per byte
}

defer img.Close() // release vips buffers when done
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this helps a lot in high throuhgput situations, lowers the RAM use and just generally makes sense, we're done with this vips intermediate at this point

client_config.SourceConfig = source_config
client = datago.GetClient(client_config)

client_config = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previous implementation was buggy, in that the update to the config object in the python space wouldn't be reflected in go. This json/dict approach is readable and not broken

@blefaudeux blefaudeux force-pushed the ben/img_close branch 3 times, most recently from 012c695 to e8b826a Compare January 24, 2025 22:50
@blefaudeux blefaudeux marked this pull request as draft January 25, 2025 07:31
@blefaudeux
Copy link
Contributor Author

can still crash eventually

@blefaudeux blefaudeux force-pushed the ben/img_close branch 2 times, most recently from be2d5e6 to 678303c Compare January 25, 2025 15:32
- now reproed the issue, looks like it's around the autorotate part
- fixing this hopefully
@blefaudeux
Copy link
Contributor Author

can still be made to hard crash, when trying long enough.

@blefaudeux blefaudeux closed this Jan 26, 2025
@blefaudeux
Copy link
Contributor Author

reviving this for testing

@blefaudeux blefaudeux reopened this Jan 28, 2025
@blefaudeux blefaudeux marked this pull request as ready for review January 28, 2025 19:34
@blefaudeux blefaudeux merged commit a46e601 into main Jan 28, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant