3 feature update documentation #5

Open
wants to merge 12 commits into
base: main
from
33 changes: 33 additions & 0 deletions docs/campaign_setup_instructions.md
@@ -0,0 +1,33 @@
# Instructions for Running a Simulation Campaign on the Open Science Grid


## 1 Getting an account
Create an account at https://www.ci-connect.net/.
Once logged in, request membership in https://www.ci-connect.net/groups/root.collab and https://www.ci-connect.net/groups/root.collab.ePIC.
Once your account is approved, you can set up a private and public key pair after clicking Edit Profile. Store the public key in your ci-connect account and keep the private key on your machine. You can then ssh to login.collab.ci-connect.net.

## 2 Restrictions
OSG imposes conditions on jobs: they should be short (2 hours) and ideally self-contained, without talking to the general internet (xrootd is OK).

## 3 HTCondor
OSG uses HTCondor for job submission. HTCondor takes care of the S3 transfer of simulation products, which greatly simplifies job management compared to Slurm. HTCondor also puts failed jobs in a hold state, which makes it easy to triage failures and to simply resubmit when the failure was transient (as is typical).

## 4 Setting up a campaign
To set up the running environment (cloning [this repository](https://github.com/eic/job_submission_condor) and setting up the secret access key), ask the production WG for help in the sim-prod Mattermost channel.

## 5 Running a campaign
The relevant submission script is:
https://github.com/eic/job_submission_condor: scripts/submit_csv.sh

It starts jobs inside the eic-shell container on /cvmfs/singularity.opensciencegrid.org/, which is the same image that users see in eic-shell (though typically a pinned stable version).

Here are the job scripts that run inside the container (they are installed inside the container, so there is no need to clone them):

* https://github.com/eic/simulation_campaign_single: scripts/run.sh (see CI for examples)
* https://github.com/eic/simulation_campaign_hepmc3: scripts/run.sh (see CI for examples)

These differ for historical reasons, but they do mostly the same thing. The job scripts assume that input data is accessible and leave output data where they produce it (no attempt to upload), which allows them to be used by Slurm and HTCondor alike. Modifications to these scripts are likely only needed when the underlying calling syntax of the reconstruction changes.

Because we target 2 hours per job and the running time per event varies between data sets, we run benchmarks on all data sets that we simulate:
https://github.com/eic/simulation_campaign_datasets. That repository is a mirror of https://eicweb.phy.anl.gov/EIC/campaigns/datasets for CI reasons (benchmarking all the data sets takes a few hundred core hours, which does not fit in GitHub CI). The benchmarks produce a simple CSV file with information about the data sets: running time per event, number of events, etc. Then submit_csv.sh (for HTCondor) takes that file and submits jobs for a specific target job duration, ensuring that the disk space and memory requests are appropriate.
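
The sizing arithmetic described above can be sketched as follows. This is only an illustration under stated assumptions, not the logic of submit_csv.sh: the CSV column layout (data set name, seconds per event, number of events) and the function name `jobs_for_target` are hypothetical.

```shell
# Sketch only: the column layout below is hypothetical and is not taken from
# the real benchmark CSV files.
#   column 1: data set name
#   column 2: benchmarked running time per event, in seconds
#   column 3: number of events in the data set

# Print "<data set> <events per job> <number of jobs>" for each CSV row,
# given a target job duration in seconds (default: 7200, i.e. 2 hours).
jobs_for_target() {
  awk -F, -v target="${2:-7200}" '
    NF >= 3 && $2 + 0 > 0 {
      per_job = int(target / $2)                 # events that fit in one job
      if (per_job < 1) per_job = 1               # always run at least one event
      njobs = int(($3 + per_job - 1) / per_job)  # ceiling division
      print $1, per_job, njobs
    }
  ' "$1"
}
```

For example, a data set benchmarked at 4 seconds per event fits 1800 events into a 2 hour job, so 10000 events would be split into 6 jobs.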

## 6 Monitoring Failures
When submitting 10k to 100k jobs, there are two options for dealing with failing jobs: ignore the failures, or look at them with a semi-automated approach. scripts/hold_release_and_review.sh looks at stdout, stderr, and the condor log, greps for patterns, and resubmits automatically. It's useful to keep an eye on new failures so we can document and fix them. The most common error is a failure to write to S3 at the end of a job, which just needs a resubmit (but does mean we ran the job for nothing).
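
The grep-based triage idea can be illustrated with a small sketch. It does not reproduce scripts/hold_release_and_review.sh: the patterns, the function name `classify_failure`, and the file names in the usage comment are all hypothetical.

```shell
# Sketch only: these patterns are hypothetical examples, not the ones used by
# scripts/hold_release_and_review.sh.
TRANSIENT_PATTERNS='Failed to upload|Connection timed out|Connection reset'

# Print "resubmit" if any of the given log files matches a transient-failure
# pattern, otherwise print "review" so that a person looks at the job.
classify_failure() {
  if grep -E -q -- "$TRANSIENT_PATTERNS" "$@" 2>/dev/null; then
    echo resubmit
  else
    echo review
  fi
}

# Possible use for one held job (condor_release is the HTCondor command that
# releases a held job; the file names and job ID are hypothetical):
#   [ "$(classify_failure job.out job.err job.log)" = resubmit ] && condor_release 1234.0
```

Keeping the pattern list in one variable makes it straightforward to add a new pattern once a new failure mode has been documented.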
50 changes: 50 additions & 0 deletions docs/faq.md
@@ -0,0 +1,50 @@
# Frequently Asked Questions #


## Common Errors ##

### Example 1

* Scenario

I downloaded a full simulation output file from S3. When I run eicrecon, I see the following error.
```
[EcalBarrelImagingRecHits] [warning] Failed to load ID decoder for EcalBarrelImagingHits
[WARN] Parameter 'BEMC:ecalbarrelimagingrawhits:timeResolution' with value '0' loses equality with itself after stringificatin
[FATAL] Segfault detected! Printing backtraces and exiting.

Thread model: pthreads
139636488283840:
`- JSignalHandler::handle_sigsegv(int, siginfo_t*, void*) (0x7effbbb2e296)
`- /lib/x86_64-linux-gnu/libc.so.6 (0x7effbb61bf90)
`- ImagingPixelReco::execute() (0x7effb7887511)
`- CalorimeterHit_factory_EcalBarrelImagingRecHits::Process(std::shared_ptr<JEvent const> const&) (0x7effb78879c9)
`- eicrecon::JFactoryPodioT<edm4eic::CalorimeterHit>::Create(std::shared_ptr<JEvent const> const&) (0x7effb7908c7e)
`- std::vector<edm4eic::CalorimeterHit const*, std::allocator<edm4eic::CalorimeterHit const*> > JEvent::Get<edm4eic::CalorimeterHit>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const (0x7effb790d28e)
`- ProtoCluster_factory_EcalBarrelImagingProtoClusters::Process(std::shared_ptr<JEvent const> const&) (0x7effb78677fa)
`- eicrecon::JFactoryPodioT<edm4eic::ProtoCluster>::Create(std::shared_ptr<JEvent const> const&) (0x7effb79098be)
`- std::vector<edm4eic::ProtoCluster const*, std::allocator<edm4eic::ProtoCluster const*> > JEvent::Get<edm4eic::ProtoCluster>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const (0x7effb790c45e)
`- Cluster_factory_EcalBarrelImagingClusters::Process(std::shared_ptr<JEvent const> const&) (0x7effb7871701)
`- eicrecon::JFactoryPodioT<edm4eic::Cluster>::Create(std::shared_ptr<JEvent const> const&) (0x7effb790a4fe)
`- JEvent::GetCollectionBase(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) const (0x7effb7ca85f6)
`- JEventProcessorPODIO::Process(std::shared_ptr<JEvent const> const&) (0x7effb7165a87)
`- JEventProcessor::DoMap(std::shared_ptr<JEvent const> const&) (0x7effbbab47bd)
`- JEventProcessorArrow::execute(JArrowMetrics&, unsigned long) (0x7effbba9c6c5)
`- JWorker::loop() (0x7effbbaa74d7)
`- /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7effbb8b54a3)
`- /lib/x86_64-linux-gnu/libc.so.6 (0x7effbb668fd4)
`- __clone (0x7effbb6e8820)
```

* Explanation

eicrecon is trying to access a collection that doesn't exist in the detector configuration with which the original simulation was run.

* Solution

Make sure that the correct tagged detector geometry environment is sourced and that the DETECTOR_CONFIG variable is defined.

```
source /opt/detector/epic-23.05.2/setup.sh
DETECTOR_CONFIG=epic_brycecanyon eicrecon -Ppodio:output_file=<prefix>.eicrecon.tree.edm4eic.root -Pjana:warmup_timeout=0 -Pjana:timeout=0 -Pplugins=janadot <prefix>.edm4hep.root
```
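
A small guard can catch a missing variable before eicrecon starts. The following is a minimal sketch; the helper name `require_detector_config` is hypothetical and is not part of eicrecon or eic-shell.

```shell
# Sketch only: require_detector_config is a hypothetical helper.
# Returns 0 and reports the value if DETECTOR_CONFIG is set and non-empty;
# otherwise prints a hint to stderr and returns 1.
require_detector_config() {
  if [ -z "${DETECTOR_CONFIG:-}" ]; then
    echo "DETECTOR_CONFIG is not set; source the detector setup.sh and define it first" >&2
    return 1
  fi
  echo "Using DETECTOR_CONFIG=${DETECTOR_CONFIG}"
}

# Possible use, with the eicrecon options from the example above:
#   require_detector_config && eicrecon ... <prefix>.edm4hep.root
```

The guard only checks that the variable is defined. It cannot tell whether the value matches the configuration used for the original simulation.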