
File inventory finishes without updating FileInventoryRequest #1331

Closed
hectorcorrea opened this issue Feb 28, 2025 · 7 comments

hectorcorrea commented Feb 28, 2025

There have been a few instances in which a FileInventoryRequest job "finishes" but does not update the record in the database with the file that it produced.

Here are the details on a job that behaved like this today.

Notice that the FileInventoryRequest record in the database was never updated, neither with a file in request_details nor with a completion_time:

#<FileInventoryRequest:0x00007fafb49761a0
  id: 10,
  user_id: 152,
  project_id: 8,
  job_id: "b20cbbc2-b910-454d-a53e-cd7a93de7b48",
  completion_time: nil,
  state: "pending",
  type: "FileInventoryRequest",
  request_details: {"project_title"=>"Princeton Prosody Archive"},
  created_at: Fri, 28 Feb 2025 17:52:15.985014000 UTC +00:00,
  updated_at: Fri, 28 Feb 2025 17:52:15.985014000 UTC +00:00>

The file that was produced was last updated at 18:56 and there is nothing in the log to indicate that there was an error during the process:

-rw-rw-r-- 1 nobody nogroup 280257895 Feb 28 18:56 b20cbbc2-b910-454d-a53e-cd7a93de7b48.csv

Yet, the record in the database was never updated to point to the file, and there is no entry in the log showing that the process got to that point, i.e. the code should have logged "Export file generated" but it never did.
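For context, the completion step that apparently never ran should do roughly the following. This is a hypothetical sketch inferred from the record fields above (state, completion_time, request_details); the variable names, the "completed" state value, and the "output_file" key are placeholders, not the actual job code:

```ruby
# Hypothetical sketch of the missing completion step, inferred from the record
# fields shown above. Variable names, the "completed" state value, and the
# "output_file" key are placeholders, not the actual job code.
request = FileInventoryRequest.find_by(job_id: job_id)
request.update(
  state: "completed",
  completion_time: Time.current,
  request_details: request.request_details.merge("output_file" => output_path.to_s)
)
Rails.logger.info("Export file generated #{output_path}")
```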

There are no jobs in any of the Sidekiq queues either, so the job did finish (or died), but again there is nothing in the logs to track it.
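For reference, this is roughly how the queues can be checked from a Rails console using the standard Sidekiq API; a sketch only, with illustrative output formatting:

```ruby
# Sketch of inspecting Sidekiq state from a Rails console via the standard
# Sidekiq API; the puts formatting here is just illustrative.
require "sidekiq/api"

# Enqueued jobs per queue -- all of these were empty.
Sidekiq::Queue.all.each { |q| puts "#{q.name}: #{q.size} enqueued" }

# Jobs currently being executed by any Sidekiq process.
Sidekiq::Workers.new.each { |process_id, thread_id, work| puts "#{process_id}: #{work.inspect}" }

# Jobs waiting to be retried, or jobs that exhausted their retries.
puts "retries: #{Sidekiq::RetrySet.new.size}, dead: #{Sidekiq::DeadSet.new.size}"
```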

There were a few CheckMK alerts around this time (1:55 PM EST) regarding tigerdata-prod2 (the server where this job ran). All of these alerts were short-lived and seem to have recovered... but I wonder if they interfered with the job.

(see also PR #1330 and issue #1274)

@hectorcorrea

Re-running the File Inventory job at 2:37 PM went fine for a while, but then we saw some memory alerts in CheckMK for the server where the job was running (tigerdata-prod2) around 3:20 PM. As of 3:25 PM the job is still running.

@hectorcorrea

Memory according to top:

[Image: output of top showing memory usage]

hectorcorrea commented Feb 28, 2025

It does look like the kernel killed our job:

~$ sudo dmesg -T | egrep -i 'killed process'
[Fri Feb 28 17:05:31 2025] Out of memory: Killed process 971 (bundle) total-vm:8878208kB, anon-rss:6443516kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:12936kB oom_score_adj:0
[Fri Feb 28 18:58:11 2025] Out of memory: Killed process 2098426 (bundle) total-vm:8072908kB, anon-rss:6623936kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:13256kB oom_score_adj:0

$ date
Fri Feb 28 20:33:07 UTC 2025

Notice that the file for the job (b20cbbc2-b910-454d-a53e-cd7a93de7b48.csv) was last updated at 18:56:

280257895 Feb 28 18:56 b20cbbc2-b910-454d-a53e-cd7a93de7b48.csv

@hectorcorrea

The Linux kernel is indeed killing our job:

sudo dmesg -T | egrep -i 'killed process'
[Fri Feb 28 17:05:31 2025] Out of memory: Killed process 971 (bundle) total-vm:8878208kB, anon-rss:6443516kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:12936kB oom_score_adj:0
[Fri Feb 28 18:58:11 2025] Out of memory: Killed process 2098426 (bundle) total-vm:8072908kB, anon-rss:6623936kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:13256kB oom_score_adj:0
[Fri Feb 28 20:41:29 2025] Out of memory: Killed process 2149691 (bundle) total-vm:7982372kB, anon-rss:6587632kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:13204kB oom_score_adj:0

Notice that the process ID 2149691 matches the PID of our job in the top screenshot above.

@hectorcorrea

I am surprised that we are running out of memory here.

We are probably keeping the entire file inventory in memory before saving it to a file, and for 2 million records that does not seem to be a good idea. We should look into saving the file as we go, so that we only keep one or a few batches in memory.
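Something along these lines is what I have in mind; a rough sketch only, where fetch_inventory_batches, the column names, and the output path are placeholders rather than the real job code:

```ruby
require "csv"

# Rough sketch of streaming the inventory to disk batch by batch instead of
# building the whole CSV in memory first. fetch_inventory_batches, the column
# names, and the output path are placeholders, not the real job code.
def write_inventory(job_id)
  path = Rails.root.join("tmp", "#{job_id}.csv")
  CSV.open(path, "w") do |csv|
    csv << ["path", "size", "modified_at"]           # header row
    fetch_inventory_batches do |batch|                # e.g. 1,000 rows at a time
      batch.each { |row| csv << [row[:path], row[:size], row[:modified_at]] }
    end
  end
  # Only one batch is ever held in memory; earlier rows are already on disk.
  path
end
```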

@hectorcorrea

@kayiwa has bumped up the memory to 32 GB in tigerdata-prod1 and tigerdata-prod2. We'll run the jobs again and cross our fingers :D

@hectorcorrea

The job finished with the new memory in the servers! It used almost 25% of the 32 GB (roughly 8 GB), so I can see how it was crashing before when the server had only 8 GB.

[Image: memory usage of the completed job]
