Best practices for large data set backfill #24232
Unanswered
mitchpaulus asked this question in Q&A
I have a large dataset that I am trying to backfill: 5-minute interval data from approximately 2,500 sensors over 6 years. That works out to roughly 288 pts/day * 365 days/yr * 6 yrs * 2,500 sensors ≈ 1.6 billion records.
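As a sanity check on that arithmetic, here is the count worked out, plus a rough ingest-rate figure to get a feel for how long a clean backfill would take (the rate is an assumed number for illustration, not a measured InfluxDB figure):

```python
# Back-of-the-envelope sizing for the backfill. The ingest rate below is an
# assumption for illustration, not a measured InfluxDB number.
points_per_day = 24 * 60 // 5                  # 288 five-minute intervals/day
total_points = points_per_day * 365 * 6 * 2500
print(f"total points: {total_points:,}")       # 1,576,800,000 (~1.6 billion)

assumed_rate = 250_000                         # points/sec, assumed
print(f"hours at {assumed_rate:,} pts/s: {total_points / assumed_rate / 3600:.1f}")
```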
I have tried to upload this through the CLI, but I have run into issues where the ./influxd process crashes, with no error on stderr, after uploading approximately 100 of these files (~60 million records; the writes were not rate limited).
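For context, a client-side batched write along these lines might put less pressure on the server than pushing whole files at once. This is only a sketch, assuming InfluxDB 2.x and the influxdb-client Python package; the URL, token, org, bucket, and file name are placeholders:

```python
from influxdb_client import InfluxDBClient, WriteOptions

# Stream a line-protocol file through the client's batching write API so the
# server sees bounded batches instead of one huge request. All connection
# details below are placeholders.
with InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org") as client:
    opts = WriteOptions(batch_size=5_000, flush_interval=10_000, max_retries=5)
    with client.write_api(write_options=opts) as write_api:
        with open("data.lp") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines in the line-protocol file
                    write_api.write(bucket="my-bucket", record=line)
```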
Looking through the documentation, I could not find examples or a list of best practices for one-time backfilling of large amounts of data. I did find a couple of recommendations for optimizing writes, but neither of them helped. I don't have a retention policy, since we analyze this dataset in various ways and would like the entire data set to remain available.
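Beyond those generic write optimizations, one option I am looking at is breaking the job into small, resumable chunks and throttling between them. This is a sketch, assuming the InfluxDB 2.x `influx write` CLI, with placeholder file and bucket names:

```python
import pathlib
import subprocess
import time

# Throttled, resumable backfill over many line-protocol files. Uploading
# oldest-first keeps writes roughly in time order; the pause between files
# and the progress log are assumptions meant to reduce server pressure and
# allow restarting after a crash.
DONE = pathlib.Path("uploaded.txt")
done = set(DONE.read_text().splitlines()) if DONE.exists() else set()

for lp_file in sorted(pathlib.Path("backfill").glob("*.lp")):
    if lp_file.name in done:
        continue  # already uploaded on a previous run
    subprocess.run(
        ["influx", "write", "--bucket", "my-bucket", "--file", str(lp_file)],
        check=True,  # stop immediately if a chunk fails
    )
    with DONE.open("a") as f:
        f.write(lp_file.name + "\n")
    time.sleep(2)  # give compaction / cache flushes a chance to keep up
```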
So I have some general questions that I think would be useful to have answered in a single place.
Replies: 1 comment · 3 replies

- @mitchpaulus What version of InfluxDB are you using?
  3 replies