From 33026a845ddbb64372301d118110378aea9d5c25 Mon Sep 17 00:00:00 2001
From: Sakib Rahman
Date: Wed, 10 May 2023 12:30:21 -0500
Subject: [PATCH 01/12] Create faq.md

Create a document for questions and answers that came up during train production runs

---
 docs/faq.md | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 docs/faq.md

diff --git a/docs/faq.md b/docs/faq.md
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/docs/faq.md
@@ -0,0 +1 @@
+

From 83289d8edb5fa9863af7134f34e9ba4d37bc02be Mon Sep 17 00:00:00 2001
From: Sakib Rahman
Date: Wed, 10 May 2023 12:32:08 -0500
Subject: [PATCH 02/12] Create campaignSetupInstructions.md

---
 docs/campaignSetupInstructions.md | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 docs/campaignSetupInstructions.md

diff --git a/docs/campaignSetupInstructions.md b/docs/campaignSetupInstructions.md
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/docs/campaignSetupInstructions.md
@@ -0,0 +1 @@
+

From 46396a4c988c839ac962cc4f422f9d76061dce3f Mon Sep 17 00:00:00 2001
From: Sakib Rahman
Date: Wed, 10 May 2023 12:32:59 -0500
Subject: [PATCH 03/12] Rename campaignSetupInstructions.md to campaign_setup_instructions.md

---
 ...ampaignSetupInstructions.md => campaign_setup_instructions.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename docs/{campaignSetupInstructions.md => campaign_setup_instructions.md} (100%)

diff --git a/docs/campaignSetupInstructions.md b/docs/campaign_setup_instructions.md
similarity index 100%
rename from docs/campaignSetupInstructions.md
rename to docs/campaign_setup_instructions.md

From ab7d088cff471e2b2de70de744f2f9a5095334ad Mon Sep 17 00:00:00 2001
From: Sakib Rahman
Date: Wed, 10 May 2023 12:36:22 -0500
Subject: [PATCH 04/12] Update campaign_setup_instructions.md

---
 docs/campaign_setup_instructions.md | 32 +++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/docs/campaign_setup_instructions.md b/docs/campaign_setup_instructions.md
index 8b13789..c3764b6 100644
--- a/docs/campaign_setup_instructions.md
+++ b/docs/campaign_setup_instructions.md
@@ -1 +1,33 @@
+# Steps to set up a campaign on the Open Science Grid
+
+We have been focusing on OSG to run simulation campaigns, through login05.osgconnect.net (a single point of access, no lab account needed, accessible to foreign users, and chosen because at the time we started there was no official VOMS for EIC, which was required for lab submit nodes). Login can be requested through the formal OSG open pool account request process.
+
+OSG typically has many times more free nodes than the combined JLab and BNL allocation to EIC.
+
+OSG does impose conditions on jobs: jobs should be short (about 2 hours) and ideally self-contained, not talking to the generic internet (xrootd is OK).
+
+OSG uses HTCondor:
+
+HTCondor takes care of the S3 transfer of simulation products, which greatly simplifies job management compared to slurm.
+HTCondor puts failed jobs in a hold state, which makes it easy to triage failures and simply resubmit when the failure was transient (as is typical).
+JLab and BNL access nodes to OSG are a bit problematic: JLab now has an access node that can also support the EIC VO (instead of us submitting jobs pretending to be GlueX). BNL's setup was broken and jobs only got farmed out to a single site at UCSD, which has likely not been fixed yet. BNL is also a patchwork of many individual interactive nodes, and you have to know which one to go to for any specific task.
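As a hedged illustration of the HTCondor points above, a submit description for a containerized OSG job might look roughly like the sketch below. Everything in it is a placeholder (wrapper script name, image path, resource numbers); the real submit files are generated by the submission scripts listed next.

```
# Hypothetical submit description -- the actual files are generated by submit_csv.sh.
universe          = vanilla
executable        = run_job.sh                 # placeholder wrapper script
# Run inside the pinned eic-shell image distributed via CVMFS (image name is a placeholder):
+SingularityImage = "/cvmfs/singularity.opensciencegrid.org/<eic-shell-image>"
request_cpus      = 1
request_memory    = 2 GB                       # sized from the dataset benchmarks
request_disk      = 4 GB
output            = logs/job_$(Cluster)_$(Process).out
error             = logs/job_$(Cluster)_$(Process).err
log               = logs/job_$(Cluster).log
queue 1
```

Submission is then a single condor_submit call on this file, and a failed job lands in the hold state described above rather than disappearing.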
+
+Job scripts, whether HTCondor or slurm, all start jobs inside the eic-shell container on /cvmfs/singularity.opensciencegrid.org/, the same image that users see in eic-shell (though typically a pinned stable version). Here are the job submission scripts:
+
+https://github.com/eic/job_submission_condor: scripts/submit_csv.sh
+https://github.com/eic/job_submission_slurm: scripts/submit.sh
+Job submitters on OSG will want to git clone the first repo.
+
+Here are the job scripts that run inside the container (installed inside the container, no need to clone):
+
+https://github.com/eic/simulation_campaign_single: scripts/run.sh (see CI for examples)
+https://github.com/eic/simulation_campaign_hepmc3: scripts/run.sh (see CI for examples)
+These are different for historical reasons, but they do essentially the same thing. These job scripts assume input data is accessible and just leave output data where they produce it (no attempt to upload). That allows them to be used by slurm and condor alike. Modifications to these scripts are likely only needed when the actual underlying calling syntax of the reconstruction changes.
+
+Because we target 2 hours per job and because running time varies across data sets, we run benchmarks on all data sets that we simulate:
+
+https://github.com/eic/simulation_campaign_datasets, which is a mirror of https://eicweb.phy.anl.gov/EIC/campaigns/datasets for CI reasons (benchmarking all the datasets takes a few hundred core-hours, which can't fit in GitHub CI).
+The benchmarks produce a simple csv file with info about each data set: running time per event, number of events, etc. Then submit_csv.sh (for condor) takes that and submits jobs for a specific target duration, ensuring the disk space and memory requests are appropriate.
+
+When submitting 10k to 100k jobs, there are two options for dealing with failing jobs: don't care about failures, or look at them with a semi-automated approach. scripts/hold_release_and_review.sh looks at stdout, stderr, and the condor log, greps for known patterns, and resubmits automatically. It's useful to keep an eye on new failures so we can document and fix them. The most common error is a failure to write to S3 at the end of a job, which just needs a resubmit (though it does mean we ran the job for nothing).

From aaa7b62b79f48cb042f79340b96bd2d5ccf5e182 Mon Sep 17 00:00:00 2001
From: Sakib Rahman
Date: Wed, 24 May 2023 12:01:06 -0500
Subject: [PATCH 05/12] Update faq.md

---
 docs/faq.md | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/docs/faq.md b/docs/faq.md
index 8b13789..1b62d32 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -1 +1,50 @@
+### Frequently Asked Questions ###
+
+## Common Errors ##
+
+# Example 1
+
+* Scenario
+
+I downloaded a full simulation output file from S3. When I run eicrecon, I see the following error.
+```
+[EcalBarrelImagingRecHits] [warning] Failed to load ID decoder for EcalBarrelImagingHits
+[WARN] Parameter 'BEMC:ecalbarrelimagingrawhits:timeResolution' with value '0' loses equality with itself after stringificatin
+[FATAL] Segfault detected! Printing backtraces and exiting.
+
+Thread model: pthreads
+139636488283840:
+ `- JSignalHandler::handle_sigsegv(int, siginfo_t*, void*) (0x7effbbb2e296)
+ `- /lib/x86_64-linux-gnu/libc.so.6 (0x7effbb61bf90)
+ `- ImagingPixelReco::execute() (0x7effb7887511)
+ `- CalorimeterHit_factory_EcalBarrelImagingRecHits::Process(std::shared_ptr const&) (0x7effb78879c9)
+ `- eicrecon::JFactoryPodioT::Create(std::shared_ptr const&) (0x7effb7908c7e)
+ `- std::vector > JEvent::Get(std::__cxx11::basic_string, std::allocator > const&) const (0x7effb790d28e)
+ `- ProtoCluster_factory_EcalBarrelImagingProtoClusters::Process(std::shared_ptr const&) (0x7effb78677fa)
+ `- eicrecon::JFactoryPodioT::Create(std::shared_ptr const&) (0x7effb79098be)
+ `- std::vector > JEvent::Get(std::__cxx11::basic_string, std::allocator > const&) const (0x7effb790c45e)
+ `- Cluster_factory_EcalBarrelImagingClusters::Process(std::shared_ptr const&) (0x7effb7871701)
+ `- eicrecon::JFactoryPodioT::Create(std::shared_ptr const&) (0x7effb790a4fe)
+ `- JEvent::GetCollectionBase(std::__cxx11::basic_string, std::allocator >) const (0x7effb7ca85f6)
+ `- JEventProcessorPODIO::Process(std::shared_ptr const&) (0x7effb7165a87)
+ `- JEventProcessor::DoMap(std::shared_ptr const&) (0x7effbbab47bd)
+ `- JEventProcessorArrow::execute(JArrowMetrics&, unsigned long) (0x7effbba9c6c5)
+ `- JWorker::loop() (0x7effbbaa74d7)
+ `- /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7effbb8b54a3)
+ `- /lib/x86_64-linux-gnu/libc.so.6 (0x7effbb668fd4)
+ `- __clone (0x7effbb6e8820)
+```
+
+* Explanation
+
+eicrecon is trying to access a collection that doesn't exist for the detector config with which the original simulation was run.
+
+* Solution
+
+Make sure the correct tagged detector geometry environment is sourced and the DETECTOR_CONFIG variable is defined.
+
+```
+source /opt/detector/epic-23.05.2/setup.sh
+DETECTOR_CONFIG=epic_brycecanyon eicrecon -Ppodio:output_file=.eicrecon.tree.edm4eic.root -Pjana:warmup_timeout=0 -Pjana:timeout=0 -Pplugins=janadot .edm4hep.root
+```

From f441957b2c363d9c7944d2cd67f593af0fb30eb6 Mon Sep 17 00:00:00 2001
From: Sakib Rahman
Date: Wed, 24 May 2023 12:54:49 -0500
Subject: [PATCH 06/12] Update campaign_setup_instructions.md

---
 docs/campaign_setup_instructions.md | 30 ++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/docs/campaign_setup_instructions.md b/docs/campaign_setup_instructions.md
index c3764b6..5b0160e 100644
--- a/docs/campaign_setup_instructions.md
+++ b/docs/campaign_setup_instructions.md
@@ -1,17 +1,33 @@
-# Steps to set up a campaign on the Open Science Grid
+# Instructions for Running a Simulation Campaign on the Open Science Grid
+## 1 Getting an account
-We have been focusing on OSG to run simulation campaigns, through login05.osgconnect.net (a single point of access, no lab account needed, accessible to foreign users, and chosen because at the time we started there was no official VOMS for EIC, which was required for lab submit nodes). Login can be requested through the formal OSG open pool account request process.
+### 1.1 OSGconnect
+Follow the instructions at
+https://portal.osg-htc.org/documentation/overview/account_setup/connect-access/
+to get an osgconnect account and add yourself to the osg.EIC project.
+
+Once your account is set up, you can log in to login05.osgconnect.net. This method is still in use for historical reasons (single point of access, no lab account needed, accessible to foreign users).
However, once we set up the new project under the EIC Virtual Organization, we will stop using this.
+
+### 1.2 EIC Virtual Organization
+JLab has now has an access node that can support the EIC Virtual Organization.
+
+BNL's setup was broken and jobs only got farmed out to a single site at UCSD, which has likely not been fixed yet. BNL is also a patchwork of many individual interactive nodes, and you have to know which one to go to for any specific task.
-OSG typically has many times more free nodes than the combined JLab and BNL allocation to EIC.
-OSG does impose conditions on jobs: jobs should be short (about 2 hours) and ideally self-contained, not talking to the generic internet (xrootd is OK).
+## 2 Features of OSG
-OSG uses HTCondor:
+### 2.1 Node Availability
+OSG typically has many times more free nodes than the combined JLab and BNL allocation to EIC.
+
+### 2.2 Restrictions
+OSG does impose conditions on jobs: jobs should be short (about 2 hours) and ideally self-contained, not talking to the generic internet (xrootd is OK).
-HTCondor takes care of the S3 transfer of simulation products, which greatly simplifies job management compared to slurm.
+### 2.3 HTCondor
+OSG uses HTCondor for job submissions. HTCondor takes care of the S3 transfer of simulation products, which greatly simplifies job management compared to slurm. HTCondor puts failed jobs in a hold state, which makes it easy to triage failures and simply resubmit when the failure was transient (as is typical).
-JLab and BNL access nodes to OSG are a bit problematic: JLab now has an access node that can also support the EIC VO (instead of us submitting jobs pretending to be GlueX). BNL's setup was broken and jobs only got farmed out to a single site at UCSD, which has likely not been fixed yet. BNL is also a patchwork of many individual interactive nodes, and you have to know which one to go to for any specific task.
+
+
 Job scripts, whether HTCondor or slurm, all start jobs inside the eic-shell container on /cvmfs/singularity.opensciencegrid.org/, the same image that users see in eic-shell (though typically a pinned stable version). Here are the job submission scripts:

From d390ccb3508a12268b4df9d661366b9f9a107883 Mon Sep 17 00:00:00 2001
From: Sakib Rahman
Date: Thu, 25 May 2023 00:13:44 -0500
Subject: [PATCH 07/12] Update campaign_setup_instructions.md

---
 docs/campaign_setup_instructions.md | 32 ++++++++++++-----------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/docs/campaign_setup_instructions.md b/docs/campaign_setup_instructions.md
index 5b0160e..379e22c 100644
--- a/docs/campaign_setup_instructions.md
+++ b/docs/campaign_setup_instructions.md
@@ -1,34 +1,38 @@
 # Instructions for Running a Simulation Campaign on the Open Science Grid
-## 1 Getting an account
+## 1 OSGConnect
-### 1.1 OSGconnect
+### 1.1 Getting an account
 Follow the instructions at
 https://portal.osg-htc.org/documentation/overview/account_setup/connect-access/
 to get an osgconnect account and add yourself to the osg.EIC project.
 
-Once your account is set up, you can log in to login05.osgconnect.net. This method is still in use for historical reasons (single point of access, no lab account needed, accessible to foreign users). However, once we set up the new project under the EIC Virtual Organization, we will stop using this.
+Once your account is set up, you can log in to login05.osgconnect.net. This method is still in use for historical reasons (single point of access, no lab account needed, accessible to foreign users).
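As a quick sanity check after the account is approved (the username is a placeholder; condor_version merely confirms the HTCondor client tools are present on the login node):

```
ssh <username>@login05.osgconnect.net
condor_version   # prints the HTCondor version if the client tools are available
```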
### 1.2 Node Availability
OSG typically has many times more free nodes than the combined JLab and BNL allocation to EIC.

### 1.3 Restrictions
OSG does impose conditions on jobs: jobs should be short (about 2 hours) and ideally self-contained, not talking to the generic internet (xrootd is OK).

### 1.4 HTCondor
OSG uses HTCondor for job submissions. HTCondor takes care of the S3 transfer of simulation products, which greatly simplifies job management compared to slurm.
HTCondor puts failed jobs in a hold state, which makes it easy to triage failures and simply resubmit when the failure was transient (as is typical).
+
+
+## EIC Virtual Organization
+
+JLab has now has an access node that can support the EIC Virtual Organization.
+
+BNL's setup was broken and jobs only got farmed out to a single site at UCSD, which has likely not been fixed yet. BNL is also a patchwork of many individual interactive nodes, and you have to know which one to go to for any specific task.
+
+
+## 2 Features of OSG
+
+
 Job scripts, whether HTCondor or slurm, all start jobs inside the eic-shell container on /cvmfs/singularity.opensciencegrid.org/, the same image that users see in eic-shell (though typically a pinned stable version). Here are the job submission scripts:

From 3f5701720e303102217df8eefed43c87ba1b94c4 Mon Sep 17 00:00:00 2001
From: Sakib Rahman
Date: Thu, 25 May 2023 07:49:59 -0500
Subject: [PATCH 08/12] Update campaign_setup_instructions.md

---
 docs/campaign_setup_instructions.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/campaign_setup_instructions.md b/docs/campaign_setup_instructions.md
index 379e22c..1052ddc 100644
--- a/docs/campaign_setup_instructions.md
+++ b/docs/campaign_setup_instructions.md
@@ -19,11 +19,11 @@ OSG does impose conditions on jobs: jobs should be short (about 2 hours) and ideal
 OSG uses HTCondor for job submissions. HTCondor takes care of the S3 transfer of simulation products, which greatly simplifies job management compared to slurm.
 HTCondor puts failed jobs in a hold state, which makes it easy to triage failures and simply resubmit when the failure was transient (as is typical).
 
+### 1.5 Setting up campaign
+To set up the running environment (cloning [this repository](https://github.com/eic/job_submission_condor) and setting up the secret access key), ask the production WG for help in the sim-prod mattermost channel.
-
-
-## EIC Virtual Organization
+## 2 EIC Virtual Organization
 
 JLab has now has an access node that can support the EIC Virtual Organization.
From 5a211f5f2b520670ff0744bbbb4315ebed0ea86a Mon Sep 17 00:00:00 2001 From: Sakib Rahman Date: Thu, 25 May 2023 08:01:46 -0500 Subject: [PATCH 09/12] Update campaign_setup_instructions.md --- docs/campaign_setup_instructions.md | 32 ++++++++++++----------------- 1 file changed, 13 insertions(+), 19 deletions(-) diff --git a/docs/campaign_setup_instructions.md b/docs/campaign_setup_instructions.md index 1052ddc..4dc2c7a 100644 --- a/docs/campaign_setup_instructions.md +++ b/docs/campaign_setup_instructions.md @@ -22,32 +22,26 @@ Htcondor puts failed jobs in a hold state which greatly facilitates triaging fai ### 1.5 Setting up campaign To set up the running environment (cloning [this repository](https://github.com/eic/job_submission_condor) and setting up the secret access key), ask the production WG for help in the sim-prod mattermost channel. - -## 2 EIC Virtual Organization - -JLab has now has an access node that can support the EIC Virtual Organization. - -BNL had their setup messed up and jobs only got farmed to a single site at UCSD, which is likely not been fixed yet. BNL is also a messed up system of many individual interactive nodes and you have to know which one to go to to do any specific thing. - - -## 2 Features of OSG - - -Job scripts, whether htcondor or slurm, are all starting jobs inside the eic-shell container on /cvmfs/singularity.opensciencegrid.org/, the same image that users see in eic-shell (though a pinned stable version, typically). Here are the job submission scripts: - +### 1.6 Running a campaign +The relevant submission script is: https://github.com/eic/job_submission_condor: scripts/submit_csv.sh -https://github.com/eic/job_submission_slurm: scripts/submit.sh -Job submitters on OSG will want to git clone the first repo. -Here are the job scripts that run inside the container (installed inside the container, no need to clone): +It starts jobs inside the eic-shell container on /cvmfs/singularity.opensciencegrid.org/, the same image that users see in eic-shell (though a pinned stable version, typically). +Here are the job scripts that run inside the container (installed inside the container, no need to clone): https://github.com/eic/simulation_campaign_single: scripts/run.sh (see CI for examples) https://github.com/eic/simulation_campaign_hepmc3: scripts/run.sh (see CI for examples) These are different for historical reason but they do mostly the exact same thing. These job scripts assume input data is accessible and just leave output data where they produce it (no attempt to upload). That allows them to be used by slurm and condor alike. Modifications to these scripts are likely only needed when the actual underlying calling syntax of the reconstruction needs changes. Because we target 2 hours per job and because that varies for different data sets, we run benchmarks on all data sets that we simulate: +https://github.com/eic/simulation_campaign_datasets, but that's a mirror of https://eicweb.phy.anl.gov/EIC/campaigns/datasets for CI reasons (takes a few 100 core hours to benchmark all the datasets, can't fit in github CI). The data sets produce a simple csv file with info about that data sets: running time per event, number of events, etc. Then submit_csv.sh (for condor) takes that and submits it for a specific target job duration, ensuring disk space and memory request are appropriate. 
-https://github.com/eic/simulation_campaign_datasets, but that's a mirror of https://eicweb.phy.anl.gov/EIC/campaigns/datasets for CI reasons (takes a few 100 core hours to benchmark all the datasets, can't fit in github CI). -The data sets produce a simple csv file with info about that data sets: running time per event, number of events, etc. Then submit_csv.sh (for condor) takes that and submits it for a specific target job duration, ensuring disk space and memory request are appropriate. - +### 1.7 Monitoring Failures When submitting 10k to 100k jobs, dealing with failing jobs has two options: don't care about failures, or look a them with a semi-automated approach. scripts/hold_release_and_review.sh looks at stdout, stderr, and condor log, greps for patterns, does automatic resubmit. It's useful to keep an eye on now failures so we can document and fix them. Most common error is failure to write to S3 at the end of a job, which just needs a resubmit (but does mean we ran the job for nothing). + + +## 2 EIC Virtual Organization + +JLab has now has an access node that can support the EIC Virtual Organization. + +BNL had their setup messed up and jobs only got farmed to a single site at UCSD, which is likely not been fixed yet. BNL is also a messed up system of many individual interactive nodes and you have to know which one to go to to do any specific thing. From 0c32229bb6c63a20d581869fcf001a31d5a44362 Mon Sep 17 00:00:00 2001 From: Sakib Rahman Date: Thu, 25 May 2023 08:02:19 -0500 Subject: [PATCH 10/12] Update campaign_setup_instructions.md --- docs/campaign_setup_instructions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/campaign_setup_instructions.md b/docs/campaign_setup_instructions.md index 4dc2c7a..b75537d 100644 --- a/docs/campaign_setup_instructions.md +++ b/docs/campaign_setup_instructions.md @@ -42,6 +42,6 @@ When submitting 10k to 100k jobs, dealing with failing jobs has two options: don ## 2 EIC Virtual Organization -JLab has now has an access node that can support the EIC Virtual Organization. +JLab now has an access node that can support the EIC Virtual Organization. BNL had their setup messed up and jobs only got farmed to a single site at UCSD, which is likely not been fixed yet. BNL is also a messed up system of many individual interactive nodes and you have to know which one to go to to do any specific thing. From 49da2a779a7ceec91a4dadb51ee569c0e3168a71 Mon Sep 17 00:00:00 2001 From: Sakib Rahman Date: Sun, 28 May 2023 18:07:47 -0500 Subject: [PATCH 11/12] Update campaign_setup_instructions.md --- docs/campaign_setup_instructions.md | 34 +++++++++-------------------- 1 file changed, 10 insertions(+), 24 deletions(-) diff --git a/docs/campaign_setup_instructions.md b/docs/campaign_setup_instructions.md index b75537d..c1992f9 100644 --- a/docs/campaign_setup_instructions.md +++ b/docs/campaign_setup_instructions.md @@ -1,28 +1,21 @@ # Instructions for Running a Simulation Campaign on the Open Science Grid -## 1 OSGConnect -### 1.1 Getting an account -Follow the instructions at -https://portal.osg-htc.org/documentation/overview/account_setup/connect-access/ -to get an osgconnect account and add yourself to the osg.EIC project. +## 1 Getting an account +Create an account at https://www.ci-connect.net/. +Once logged in, request for membership of https://www.ci-connect.net/groups/root.collab and https://www.ci-connect.net/groups/root.collab.ePIC. 
### 2 Restrictions
OSG does impose conditions on jobs: jobs should be short (about 2 hours) and ideally self-contained, not talking to the generic internet (xrootd is OK).

### 3 HTCondor
OSG uses HTCondor for job submissions. HTCondor takes care of the S3 transfer of simulation products, which greatly simplifies job management compared to slurm. HTCondor puts failed jobs in a hold state, which makes it easy to triage failures and simply resubmit when the failure was transient (as is typical).

### 4 Setting up campaign
To set up the running environment (cloning [this repository](https://github.com/eic/job_submission_condor) and setting up the secret access key), ask the production WG for help in the sim-prod mattermost channel.

### 5 Running a campaign
The relevant submission script is:
https://github.com/eic/job_submission_condor: scripts/submit_csv.sh

It starts jobs inside the eic-shell container on /cvmfs/singularity.opensciencegrid.org/, the same image that users see in eic-shell (though typically a pinned stable version).

Here are the job scripts that run inside the container (installed inside the container, no need to clone):
https://github.com/eic/simulation_campaign_single: scripts/run.sh (see CI for examples)
https://github.com/eic/simulation_campaign_hepmc3: scripts/run.sh (see CI for examples)
These are different for historical reasons, but they do essentially the same thing. These job scripts assume input data is accessible and just leave output data where they produce it (no attempt to upload). That allows them to be used by slurm and condor alike. Modifications to these scripts are likely only needed when the actual underlying calling syntax of the reconstruction changes.

Because we target 2 hours per job and because running time varies across data sets, we run benchmarks on all data sets that we simulate:
https://github.com/eic/simulation_campaign_datasets, which is a mirror of https://eicweb.phy.anl.gov/EIC/campaigns/datasets for CI reasons (benchmarking all the datasets takes a few hundred core-hours, which can't fit in GitHub CI). The benchmarks produce a simple csv file with info about each data set: running time per event, number of events, etc. Then submit_csv.sh (for condor) takes that and submits jobs for a specific target duration, ensuring the disk space and memory requests are appropriate.

### 6 Monitoring Failures
When submitting 10k to 100k jobs, there are two options for dealing with failing jobs: don't care about failures, or look at them with a semi-automated approach. scripts/hold_release_and_review.sh looks at stdout, stderr, and the condor log, greps for known patterns, and resubmits automatically. It's useful to keep an eye on new failures so we can document and fix them. The most common error is a failure to write to S3 at the end of a job, which just needs a resubmit (though it does mean we ran the job for nothing).
-
-
-## 2 EIC Virtual Organization
-
-JLab now has an access node that can support the EIC Virtual Organization.
-
-BNL's setup was broken and jobs only got farmed out to a single site at UCSD, which has likely not been fixed yet. BNL is also a patchwork of many individual interactive nodes, and you have to know which one to go to for any specific task.
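For a by-hand version of the triage loop described in the Monitoring Failures section above, the commands below sketch the idea; the grep pattern is hypothetical, and the actual pattern matching and resubmission logic live in scripts/hold_release_and_review.sh.

```
# List held jobs along with the reason HTCondor recorded for each hold.
condor_q -hold
# Look for a suspected transient failure in the job logs (pattern is illustrative).
grep -l "failed to write to S3" logs/*.err
# Release a held job so it is rescheduled, e.g. after a transient S3 failure.
condor_release <cluster_id>
```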
From 59bb098ff4c0082b26df62f86f94f6211171677b Mon Sep 17 00:00:00 2001
From: Sakib Rahman
Date: Sun, 28 May 2023 18:09:36 -0500
Subject: [PATCH 12/12] Update faq.md

---
 docs/faq.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/faq.md b/docs/faq.md
index 1b62d32..85790bc 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -1,9 +1,9 @@
-### Frequently Asked Questions ###
+# Frequently Asked Questions ###
 
 ## Common Errors ##
 
-# Example 1
+### Example 1
 
 * Scenario