
File inventory finishes without updating FileInventoryRequest #1331

Closed
hectorcorrea opened this issue Feb 28, 2025 · 7 comments

hectorcorrea commented Feb 28, 2025

There have been a few instances in which a FileInventoryRequest job "finishes" but does not update the record in the database with the file that it produced.

Here are the details on a job that behaved like this today.

Notice that the FileInventoryRequest record in the database was never updated, neither with a file in request_details nor with a completion_time:

#<FileInventoryRequest:0x00007fafb49761a0
  id: 10,
  user_id: 152,
  project_id: 8,
  job_id: "b20cbbc2-b910-454d-a53e-cd7a93de7b48",
  completion_time: nil,
  state: "pending",
  type: "FileInventoryRequest",
  request_details: {"project_title"=>"Princeton Prosody Archive"},
  created_at: Fri, 28 Feb 2025 17:52:15.985014000 UTC +00:00,
  updated_at: Fri, 28 Feb 2025 17:52:15.985014000 UTC +00:00>

The file that was produced was last updated at 18:56 and there is nothing in the log to indicate that there was an error during the process:

-rw-rw-r-- 1 nobody nogroup 280257895 Feb 28 18:56 b20cbbc2-b910-454d-a53e-cd7a93de7b48.csv

Yet, the record in the database was never updated to point to the file, and there is no entry in the log showing that the process got to that point, i.e. the code should have logged "Export file generated" but it never did.
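For context, the completion step that apparently never ran should do roughly the following. This is a hypothetical sketch inferred from the record fields above (state, completion_time, request_details); the variable names, the "completed" state value, and the "output_file" key are placeholders, not the actual job code:

```ruby
# Hypothetical sketch of the missing completion step, inferred from the record
# fields shown above. Variable names, the "completed" state value, and the
# "output_file" key are placeholders, not the actual job code.
request = FileInventoryRequest.find_by(job_id: job_id)
request.update(
  state: "completed",
  completion_time: Time.current,
  request_details: request.request_details.merge("output_file" => output_path.to_s)
)
Rails.logger.info("Export file generated #{output_path}")
```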

There are no jobs in any of the Sidekiq queues either, so the job did finish (or died), but again there is nothing in the logs to track it.
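For reference, this is roughly how the queues can be checked from a Rails console using the standard Sidekiq API; a sketch only, with illustrative output formatting:

```ruby
# Sketch of inspecting Sidekiq state from a Rails console via the standard
# Sidekiq API; the puts formatting here is just illustrative.
require "sidekiq/api"

# Enqueued jobs per queue -- all of these were empty.
Sidekiq::Queue.all.each { |q| puts "#{q.name}: #{q.size} enqueued" }

# Jobs currently being executed by any Sidekiq process.
Sidekiq::Workers.new.each { |process_id, thread_id, work| puts "#{process_id}: #{work.inspect}" }

# Jobs waiting to be retried, or jobs that exhausted their retries.
puts "retries: #{Sidekiq::RetrySet.new.size}, dead: #{Sidekiq::DeadSet.new.size}"
```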

There were a few CheckMK alerts around this time (1:55 PM EST) regarding tigerdata-prod2 (the server where this job ran). All of these alerts were short-lived and seem to have recovered... but I wonder if they interfered with the job.

(see also PR #1330 and issue #1274)

@hectorcorrea

Re-running the File Inventory job at 2:37 PM went fine for a while, but then we saw some memory alerts in CheckMK for the server where the job was running (tigerdata-prod2) around 3:20 PM. As of 3:25 PM the job is still running.

@hectorcorrea

Memory according to top:

[Image: output of top showing memory usage]

hectorcorrea commented Feb 28, 2025

It does look like the kernel killed our job:

~$ sudo dmesg -T | egrep -i 'killed process'
[Fri Feb 28 17:05:31 2025] Out of memory: Killed process 971 (bundle) total-vm:8878208kB, anon-rss:6443516kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:12936kB oom_score_adj:0
[Fri Feb 28 18:58:11 2025] Out of memory: Killed process 2098426 (bundle) total-vm:8072908kB, anon-rss:6623936kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:13256kB oom_score_adj:0

$ date
Fri Feb 28 20:33:07 UTC 2025

Notice that the file for the job (b20cbbc2-b910-454d-a53e-cd7a93de7b48.csv) was last updated at 18:56:

280257895 Feb 28 18:56 b20cbbc2-b910-454d-a53e-cd7a93de7b48.csv

@hectorcorrea

The Linux kernel is indeed killing our job:

sudo dmesg -T | egrep -i 'killed process'
[Fri Feb 28 17:05:31 2025] Out of memory: Killed process 971 (bundle) total-vm:8878208kB, anon-rss:6443516kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:12936kB oom_score_adj:0
[Fri Feb 28 18:58:11 2025] Out of memory: Killed process 2098426 (bundle) total-vm:8072908kB, anon-rss:6623936kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:13256kB oom_score_adj:0
[Fri Feb 28 20:41:29 2025] Out of memory: Killed process 2149691 (bundle) total-vm:7982372kB, anon-rss:6587632kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:13204kB oom_score_adj:0

Notice that the process ID 2149691 matches the PID of our job in the top screenshot above.

@hectorcorrea

I am surprised that we are running out of memory here.

We are probably keeping the entire file inventory in memory before saving it to a file, and for 2 million records that does not seem to be a good idea. We should look into saving the file as we go, so that we only keep one or a few batches in memory.
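Something along these lines is what I have in mind; a rough sketch only, where fetch_inventory_batches, the column names, and the output path are placeholders rather than the real job code:

```ruby
require "csv"

# Rough sketch of streaming the inventory to disk batch by batch instead of
# building the whole CSV in memory first. fetch_inventory_batches, the column
# names, and the output path are placeholders, not the real job code.
def write_inventory(job_id)
  path = Rails.root.join("tmp", "#{job_id}.csv")
  CSV.open(path, "w") do |csv|
    csv << ["path", "size", "modified_at"]           # header row
    fetch_inventory_batches do |batch|                # e.g. 1,000 rows at a time
      batch.each { |row| csv << [row[:path], row[:size], row[:modified_at]] }
    end
  end
  # Only one batch is ever held in memory; earlier rows are already on disk.
  path
end
```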

@hectorcorrea

@kayiwa has bumped up the memory to 32 GB in tigerdata-prod1 and tigerdata-prod2. We'll run the jobs again and cross our fingers :D

@hectorcorrea

The job finished with the new memory in the servers! It used almost 25% of the 32 GB (roughly 8 GB), so I can see how it was crashing before when the server had only 8 GB.

[Image: memory usage of the completed job]
