Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why do file inventory jobs take so long for some users #1274

Open
hectorcorrea opened this issue Jan 31, 2025 · 18 comments
Open

Why do file inventory jobs take so long for some users #1274

hectorcorrea opened this issue Jan 31, 2025 · 18 comments
Assignees

Comments

@hectorcorrea
Copy link
Member

hectorcorrea commented Jan 31, 2025

While fixing issue #1263 we noticed that the file inventory job takes a very long time when Matt (user_id 186) runs it in QA. The file inventory for project_id 44 (which has 189,000 files) takes 65 minutes when Matt runs it, but as shown below it only takes 5 minutes when Hector user_id 192 runs it.

Below is what the user_request table shows in QA.

#<FileInventoryRequest:0x00007ff447b7da20
  id: 16,
  user_id: 192,
  project_id: 44,
  job_id: "fa74522e-c720-4ed5-b30b-741bdeb3b804",
  completion_time: Fri, 31 Jan 2025 17:15:35.378825000 UTC +00:00,
  state: "completed",
  type: "FileInventoryRequest",
  request_details: {"file_size"=>22090419, "output_file"=>"/mnt/nfs/tigerdata/fa74522e-c720-4ed5-b30b-741bdeb3b804.csv", "project_title"=>"Test 06-12-24"},
  created_at: Fri, 31 Jan 2025 17:10:12.237105000 UTC +00:00,
  updated_at: Fri, 31 Jan 2025 17:15:35.382571000 UTC +00:00>,
 #<FileInventoryRequest:0x00007ff447b7d660
  id: 17,
  user_id: 186,
  project_id: 44,
  job_id: "fdb79db4-e6b6-41a3-842a-b63afa619163",
  completion_time: Fri, 31 Jan 2025 20:05:34.709517000 UTC +00:00,
  state: "completed",
  type: "FileInventoryRequest",
  request_details: {"file_size"=>22090419, "output_file"=>"/mnt/nfs/tigerdata/fdb79db4-e6b6-41a3-842a-b63afa619163.csv", "project_title"=>"Test 06-12-24"},
  created_at: Fri, 31 Jan 2025 18:59:57.304534000 UTC +00:00,
  updated_at: Fri, 31 Jan 2025 20:05:34.719367000 UTC +00:00>]

We think this could be related to the fact that Hector is an admin on MediaFlux QA but Matt isn't.

@hectorcorrea hectorcorrea self-assigned this Feb 13, 2025
@hectorcorrea
Copy link
Member Author

hectorcorrea commented Feb 18, 2025

In the QA MediaFlux instance my permissions are very different from Matt's permissions:

> user.authority.describe :domain princeton :user hc8719
    :user -domain "princeton" -user "hc8719"
        :role -type "domain" "princeton"
        :role -type "role" "user"
        :role -type "role" "standard-user"
        :role -type "role" "desktop-experimental"
        :role -type "role" "pu-lib:developer"
        :role -type "role" "system-administrator"
> user.authority.describe :domain princeton :user mjc12
    :user -domain "princeton" -user "mjc12"
        :role -type "domain" "princeton"
        :role -type "role" "user"
        :role -type "role" "standard-user"
        :role -type "role" "desktop-experimental"
        :role -type "role" "pu-hpc-group:cses"
        :role -type "role" "asset-accumulator"

You can also get permissions via the user.describe command but you must pass :permissions true. This command gives even more details:

> user.describe :domain princeton :user hc8719 :permissions true :show-grant true

@matthewjchandler
Copy link

We now have a user complaint about this:

RT #56623: Issue with manifest

Last Friday, we had a use case for the manifest (checking the IDs of files we received from an external source and put in TigerData against our internal list of IDs to ensure we received all the files we expected). I requested the manifest download Friday morning, and when I checked back this morning the status still says “Preparing Now.” I am wondering whether this is a bug, or whether it is really taking that long to prepare the manifest? It has been three days.

@hectorcorrea
Copy link
Member Author

@matthewjchandler do you think you can provide some details about the issue, for example the user that had the issue and/or the project that they are trying to download?

I tried to view the ticket in RT (guessing this is the right link https://cses.princeton.edu/tickets/Ticket/Display.html?id=56623) but I get "No permission to view ticket"

@matthewjchandler
Copy link

@hectorcorrea I don't have the project ID, but the user is mnaydan, who I believe is a data manager on the project in question.

@hectorcorrea
Copy link
Member Author

hectorcorrea commented Feb 25, 2025

The job and project in question in Production:

> User.where(uid:"mnaydan").first.id
77

> FileInventoryRequest.where(user_id:77)
[#<FileInventoryRequest:0x00007f2d73695350
  id: 4,
  user_id: 77,
  project_id: 8,
  job_id: "e2976b40-869b-4404-adfc-56bec9c9616d",
  completion_time: Wed, 05 Feb 2025 19:07:25.331081000 UTC +00:00,
  state: "completed",
  type: "FileInventoryRequest",
  request_details:
   {"file_size"=>241201911, "output_file"=>"/mnt/nfs/tigerdata/e2976b40-869b-4404-adfc-56bec9c9616d.csv", "project_title"=>"Princeton Prosody Archive"},
  created_at: Wed, 05 Feb 2025 17:00:10.825831000 UTC +00:00,
  updated_at: Wed, 05 Feb 2025 19:07:25.340914000 UTC +00:00>,
 #<FileInventoryRequest:0x00007f2d800af780
  id: 7,
  user_id: 77,
  project_id: 8,
  job_id: "14f48af7-ffd6-4234-8885-decb5b62ac1e",
  completion_time: nil,
  state: "pending",
  type: "FileInventoryRequest",
  request_details: {"project_title"=>"Princeton Prosody Archive"},
  created_at: Fri, 21 Feb 2025 15:47:31.733365000 UTC +00:00,
  updated_at: Fri, 21 Feb 2025 15:47:31.733365000 UTC +00:00>]

Project 8 is the "Princeton Prosody Archive" with MediaFlux id 555360115

@hectorcorrea
Copy link
Member Author

I re-ran the job in production at 9:46 AM and as of 10:13 AM it has exported 1 million files and it's still going.

$ grep Inventory log/production.log
I, [2025-02-25T14:46:00.071209 #971]  INFO -- : [ActiveJob] [FileInventoryJob] [e431e6d7-81a2-46b2-a2a3-55311ef35858] Performing FileInventoryJob (Job ID: e431e6d7-81a2-46b2-a2a3-55311ef35858) from Sidekiq(default) enqueued at 2025-02-25T14:46:00Z with arguments: {:user_id=>51, :project_id=>8, :mediaflux_session=>"..."}
I, [2025-02-25T14:46:00.389606 #971]  INFO -- : [ActiveJob] [FileInventoryJob] [e431e6d7-81a2-46b2-a2a3-55311ef35858] [dd.env=production dd.service=tigerdata dd.version=1.0.0 dd.trace_id=276613953404838195 dd.span_id=3321320978877536597 ddsource=ruby] Exporting file list to /mnt/nfs/tigerdata/e431e6d7-81a2-46b2-a2a3-55311ef35858.csv for project 8

$ wc -l /mnt/nfs/tigerdata/e431e6d7-81a2-46b2-a2a3-55311ef35858.csv
1070001 /mnt/nfs/tigerdata/e431e6d7-81a2-46b2-a2a3-55311ef35858.csv

@hectorcorrea
Copy link
Member Author

My file inventory finished in about an hour which is not great but it's better than not finishing. It was strange that the file that it produced only had 2,115,517 files but the TigerData app shows 2,596,619.

To make things more interesting, the MediaFlux desktop app show also 2,115,517 files. I guess the way we are getting the count that we show in the app is picking something else. Not sure what that is.

@hectorcorrea
Copy link
Member Author

Aterm also shows 2115517 when we query for a count (the :where clause used is the one that the MediaFlux desktop suggests in the little "info" icon):

> asset.count :where asset in static collection or subcollection of 555360115
    :total "2115517"

@hectorcorrea
Copy link
Member Author

File count issue now tracked here: #1327

@hectorcorrea
Copy link
Member Author

With the logging information introduced in PR #1328 it's obvious the slowness is in the iterator.

Fetching each page of data takes less than one second when running under my user (hc8719):

I, [2025-02-26T01:25:37.282053 #189072]  INFO -- : ... file_list_to_file CMCTBoqn 10.80021/gwqd-2d50: STARTED, 0.00 s
I, [2025-02-26T01:25:37.937056 #189072]  INFO -- : ... file_list_to_file CMCTBoqn 10.80021/gwqd-2d50: FETCHED page 1 from iterator, 0.64 s
I, [2025-02-26T01:25:39.625114 #189072]  INFO -- : ... file_list_to_file CMCTBoqn 10.80021/gwqd-2d50: FETCHED page 2 from iterator, 0.83 s
I, [2025-02-26T01:25:41.163220 #189072]  INFO -- : ... file_list_to_file CMCTBoqn 10.80021/gwqd-2d50: FETCHED page 3 from iterator, 0.71 s
I, [2025-02-26T01:25:42.789327 #189072]  INFO -- : ... file_list_to_file CMCTBoqn 10.80021/gwqd-2d50: FETCHED page 4 from iterator, 0.80 s
I, [2025-02-26T01:25:44.283935 #189072]  INFO -- : ... file_list_to_file CMCTBoqn 10.80021/gwqd-2d50: FETCHED page 5 from iterator, 0.72 s
I, [2025-02-26T01:25:46.116625 #189072]  INFO -- : ... file_list_to_file CMCTBoqn 10.80021/gwqd-2d50: FETCHED page 6 from iterator, 1.00 s
I, [2025-02-26T01:25:47.638003 #189072]  INFO -- : ... file_list_to_file CMCTBoqn 10.80021/gwqd-2d50: FETCHED page 7 from iterator, 0.75 s
I, [2025-02-26T01:25:49.139052 #189072]  INFO -- : ... file_list_to_file CMCTBoqn 10.80021/gwqd-2d50: FETCHED page 8 from iterator, 0.74 s
I, [2025-02-26T01:25:50.745654 #189072]  INFO -- : ... file_list_to_file CMCTBoqn 10.80021/gwqd-2d50: FETCHED page 9 from iterator, 0.80 s
I, [2025-02-26T01:25:52.338491 #189072]  INFO -- : ... file_list_to_file CMCTBoqn 10.80021/gwqd-2d50: FETCHED page 10 from iterator, 0.73 s
...
I, [2025-02-26T01:30:34.252466 #189072]  INFO -- : ... file_list_to_file CMCTBoqn 10.80021/gwqd-2d50: ENDED, 296.97 s

Whereas for Matt's user fetching each page of data takes almost 10 seconds:

I, [2025-02-26T13:01:16.745530 #189072]  INFO -- : ... file_list_to_file 4EFyInvo 10.80021/gwqd-2d50: STARTED, 0.00 s
I, [2025-02-26T13:01:25.063007 #189072]  INFO -- : ... file_list_to_file 4EFyInvo 10.80021/gwqd-2d50: FETCHED page 1 from iterator, 8.31 s
I, [2025-02-26T13:01:35.729090 #189072]  INFO -- : ... file_list_to_file 4EFyInvo 10.80021/gwqd-2d50: FETCHED page 2 from iterator, 9.82 s
I, [2025-02-26T13:01:46.201057 #189072]  INFO -- : ... file_list_to_file 4EFyInvo 10.80021/gwqd-2d50: FETCHED page 3 from iterator, 9.73 s
I, [2025-02-26T13:01:56.827202 #189072]  INFO -- : ... file_list_to_file 4EFyInvo 10.80021/gwqd-2d50: FETCHED page 4 from iterator, 9.86 s
I, [2025-02-26T13:02:07.402020 #189072]  INFO -- : ... file_list_to_file 4EFyInvo 10.80021/gwqd-2d50: FETCHED page 5 from iterator, 9.81 s
I, [2025-02-26T13:02:18.148202 #189072]  INFO -- : ... file_list_to_file 4EFyInvo 10.80021/gwqd-2d50: FETCHED page 6 from iterator, 9.89 s
I, [2025-02-26T13:02:28.694064 #189072]  INFO -- : ... file_list_to_file 4EFyInvo 10.80021/gwqd-2d50: FETCHED page 7 from iterator, 9.82 s
I, [2025-02-26T13:02:39.290140 #189072]  INFO -- : ... file_list_to_file 4EFyInvo 10.80021/gwqd-2d50: FETCHED page 8 from iterator, 9.84 s
I, [2025-02-26T13:02:50.036018 #189072]  INFO -- : ... file_list_to_file 4EFyInvo 10.80021/gwqd-2d50: FETCHED page 9 from iterator, 9.94 s
I, [2025-02-26T13:03:00.296505 #189072]  INFO -- : ... file_list_to_file 4EFyInvo 10.80021/gwqd-2d50: FETCHED page 10 from iterator, 9.50 s
...
I, [2025-02-26T13:37:02.678011 #189072]  INFO -- : ... file_list_to_file 4EFyInvo 10.80021/gwqd-2d50: ENDED, 2145.93 s

@hectorcorrea
Copy link
Member Author

Updating my user hc8719 to have the same permissions as Matt's user makes the issue happen for me to. Now each page of data takes 15 seconds to be fetch (instead of less than one second).

It was taking 10 seconds for Matt rather than 15 seconds but the experiment shows that the user permissions are having a huge impact on the iterator performance.

@hectorcorrea
Copy link
Member Author

hectorcorrea commented Feb 26, 2025

Unfortunately, running the asset.query, asset.query.iterate directly on Mediaflux via aTerm does not have the same issues. The following script executes within 1 or 2 seconds for my user with the same permissions as Matt (with the premise that the permissions was the culprit the execution time should have been 20-30 seconds since we are iterating twice on this script.)

# define the query parameters
set collection 1083
set user [xvalue user/name [user.self.get]]


# create the iterator
set iterator [xvalue iterator [asset.query :where asset in static collection or subcollection of $collection :action get-values :xpath name -ename name :xpath path -ename path :xpath content/@total-size -ename total-size :xpath @collection -ename collection :xpath mtime -ename mtime :as iterator]]


# iterate twice over the iterator (and gather execution time)
set startTime [clock seconds]
asset.query.iterate :id $iterator
asset.query.iterate :id $iterator
set endTime [clock seconds]
set elapsed [expr {$endTime - $startTime}]

# Log the execution time
puts "BEGIN User $user querying collection $collection"
puts "Iterator started at $startTime"
puts "Iterator ended at at $endTime"
puts "Iterator took $elapsed seconds"
puts "END User $user querying collection $collection"


# Destroy the iterator
asset.query.iterator.destroy :id $iterator

return $elapsed

The log file shows:

:line -nb "498" "BEGIN User Hector Correa querying collection 1083"
:line -nb "499" "Iterator started at 1740598635"
:line -nb "500" "Iterator ended at at 1740598637"
:line -nb "501" "Iterator took 2 seconds"
:line -nb "502" "END User Hector Correa querying collection 1083"

@hectorcorrea
Copy link
Member Author

Running the File Inventory from the application with the reduced permissions still takes 10-15 per iteration so this it is not a case where the server might have been too busy at one time but not the other.

Could the issue be related to how we make the requests from Ruby or how we manage the authentication/session in those requests?

@hectorcorrea
Copy link
Member Author

Re-adding the admin role to my account gets the the job to run fast:

> actor.describe :type user :name princeton:hc8719
    :actor -id "46" -type "user" -name "princeton:hc8719"
        :role -id "35" -type "domain" "princeton"
        :role -id "4" -type "role" "user"
        :role -id "6" -type "role" "standard-user"
        :role -id "37" -type "role" "desktop-experimental"
        :role -id "1" -type "role" "system-administrator"

I removed the pu-oit-group roles from the display to keep things tidy.

@hectorcorrea
Copy link
Member Author

hectorcorrea commented Feb 27, 2025

With Matt as an administrator he gets the same performance as my user does, each batch of records it's fetched in less than a second.

I, [2025-02-27T15:40:52.764190 file_list_to_file hlgcR4BS 10.80021/gwqd-2d50: STARTED, 0.00 s
I, [2025-02-27T15:40:53.333033 file_list_to_file hlgcR4BS 10.80021/gwqd-2d50: FETCHED page 1 from iterator, 0.56 s
I, [2025-02-27T15:40:54.708107 file_list_to_file hlgcR4BS 10.80021/gwqd-2d50: FETCHED page 2 from iterator, 0.66 s
I, [2025-02-27T15:40:56.045156 file_list_to_file hlgcR4BS 10.80021/gwqd-2d50: FETCHED page 3 from iterator, 0.70 s
I, [2025-02-27T15:40:57.270689 file_list_to_file hlgcR4BS 10.80021/gwqd-2d50: FETCHED page 4 from iterator, 0.59 s

His current permissions (plus the pu-oit-group roles) are:

> actor.describe :type user :name princeton:mjc12
    :actor -id "165" -type "user" -name "princeton:mjc12"
        :role -id "35" -type "domain" "princeton"
        :role -id "4" -type "role" "user"
        :role -id "239" -type "role" "pu-hpc-group:cses"
        :role -id "37" -type "role" "desktop-experimental"
        :role -id "6" -type "role" "standard-user"
        :role -id "12" -type "role" "asset-accumulator"
        :role -id "1" -type "role" "system-administrator"
> 

@hectorcorrea
Copy link
Member Author

Issue about FileInventoryJob crashing: #1331

@hectorcorrea
Copy link
Member Author

hectorcorrea commented Mar 3, 2025

Deployed version v0.52.3 to production which includes better logging. Re-run the File Inventory for the Prosody Archive under my account.

When querying MediaFlux directly my account fetches a smaller number of files (2,115,517) than Chuck's (2,116,016) or the original user but close enough to benchmark as a non-admin.

As of 5 PM the File Inventory job has exported half the file inventory and should end around 6:30 PM. Fetching pages is slowing down, it used to be 0.5 seconds per page and now it's hovering around the 1 second per page but hopefully does not slow down much more. Memory utilization is about 6% which is good.

Update: The job finished in 62 minutes and my account is not an administrator in production.

@hectorcorrea
Copy link
Member Author

hectorcorrea commented Mar 5, 2025

Deployed a new version of the code to Staging and QA that lets us run a FileInventory for a specific user without using CAS authentication and noticed that the same problem persist. Good news the problem is not CAS, bad news I still don't know what the issue is.

Here is what I did:

  • Ran the FileInventory for project_id 44 in QA both times with my account hc8719 and bypassing CAS authentication (instead I issued the credentials directly on the request).
  • On one run my account had the system:administrator role granted in MediaFlux and on the second role it didn't.

When running with system:administrator fetching each batch of files takes 0.45 seconds but when my account is not administrator it takes 9 seconds per batch.

My user with system-administrator permission

> actor.describe :type user :name princeton:hc8719
 :actor -id "46" -type "user" -name "princeton:hc8719"
        :role -id "35" -type "domain" "princeton"
        :role -id "4" -type "role" "user"
        :role -id "6" -type "role" "standard-user"
        :role -id "37" -type "role" "desktop-experimental"
        :role -id "1" -type "role" "system-administrator"

When running the inventory fetching each page takes 0.46 seconds

$ bundle exec rake file_inventory:run[44,hc8719]
Scheduling file inventory using credentials for: princeton:hc8719, session: owdouZhs...

I, [2025-03-05T20:11:34.315841 #2994093]  INFO -- ... file_list_to_file owdouZhs 10.80021/gwqd-2d50: FETCHED page 12 from iterator, 0.46 s
I, [2025-03-05T20:11:35.558012 #2994093]  INFO -- ... file_list_to_file owdouZhs 10.80021/gwqd-2d50: FETCHED page 13 from iterator, 0.48 s
...
I, [2025-03-05T20:15:29.137617 #2994093]  INFO -- ... file_list_to_file owdouZhs 10.80021/gwqd-2d50: ENDED, 249.14 s

Setting my permissions to match those of Matt

> actor.revoke :type user :name princeton:hc8719 :role -type role system-administrator
> actor.revoke :type user :name princeton:hc8719 :role -type role pu-lib:developer
> actor.grant :type user :name princeton:hc8719 :role -type role pu-hpc-group:cses
> actor.grant :type user :name princeton:hc8719 :role -type role asset-accumulator
> actor.describe :type user :name princeton:hc8719

:actor -id "46" -type "user" -name "princeton:hc8719"
        :role -id "35" -type "domain" "princeton"
        :role -id "4" -type "role" "user"
        :role -id "6" -type "role" "standard-user"
        :role -id "37" -type "role" "desktop-experimental"
        :role -id "239" -type "role" "pu-hpc-group:cses"
        :role -id "12" -type "role" "asset-accumulator"

Re-running inventory with the new permissions and now it takes 9 seconds per page!

$ bundle exec rake file_inventory:run[44,hc8719]
Scheduling file inventory using credentials for: princeton:hc8719, session: C7RRWHwl...

I, [2025-03-05T20:18:31.788155 #2994093]  INFO -- ...  file_list_to_file C7RRWHwl 10.80021/gwqd-2d50: FETCHED page 4 from iterator, 8.59 s
I, [2025-03-05T20:18:41.096222 #2994093]  INFO -- ... file_list_to_file C7RRWHwl 10.80021/gwqd-2d50: FETCHED page 5 from iterator, 8.54 s
...
I, [2025-03-05T20:49:14.461361 #2994093]  INFO -- : ... file_list_to_file C7RRWHwl 10.80021/gwqd-2d50: ENDED, 1877.91 s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants