CBM Integration

This repository contains the topology and configuration files needed for running CBM FairMQ devices with ODC/DDS.

Steps to run on the Virgo cluster with Slurm

We use the lxbk0600 node here for convenience, but you can use any other node or a random lxlogin node. You will need to know the name of the node where odc-grpc-server runs in order to connect to it with odc-grpc-client.

We use the container /cvmfs/cbm.gsi.de/debian10/containers/debian10_master_v18.8.0_nov22p1_online_flesnet.sif, where CBM, ODC and DDS are installed under /opt/.

Adjust the host names and paths for your environment.

Environment setup (for both ODC server and client)

  1. Launch the CBM+ODC+DDS container on Virgo (both for the ODC server and client)
SINGULARITY_CONTAINER=/cvmfs/cbm.gsi.de/debian10/containers/debian10_master_v18.8.0_nov22p1_online_flesnet.sif ssh -o SendEnv=SINGULARITY_CONTAINER lxbk0600
  2. Initialize the CBM, ODC and DDS environment
source /opt/cbmroot/master_v18.8.0_nov22p1_online_flesnet/bin/CbmRootConfig.sh -a
  3. Disable http_proxy for gRPC
export no_grpc_proxy=lxbk0600
  4. Set HOME to a path on Lustre, to be used for DDS/ODC session files and logs (can be configured separately)
export HOME=/lustre/rz/orybalch/
  5. Set SDE_HOME to the folder where the clone of this repo was placed

  6. Change the SDE_HOME value also in env.sh (see the sketch after this list)
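Steps 5 and 6 come without a command above; a minimal sketch, assuming the repository was cloned into the Lustre home directory set in step 4 (the path is only an example):

# Example only: point SDE_HOME at your clone of this repository (adjust the path)
export SDE_HOME=/lustre/rz/orybalch/cbm-integration
# then edit env.sh in the repository so that its SDE_HOME value matches
vi $SDE_HOME/env.sh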

ODC gRPC server

odc-grpc-server --sync --host "*:6667" --rp "slurm:/opt/fairsoft/nov22p1_nodds/bin/odc-rp-epn-slurm --zones online:5:/lustre/rz/orybalch/cbm-integration/slurm-main.cfg:" --timeout 120
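The options used above, annotated (meanings inferred from the two server invocations in this README; treat this as a sketch and check odc-grpc-server --help for the authoritative description):

# odc-grpc-server options as used above (inferred, not authoritative):
#   --sync      handle requests synchronously
#   --host      gRPC listen address ("*:6667" = any interface, port 6667)
#   --rp        register a resource plugin named "slurm" together with its command line;
#               the --zones value appears to be <zone name>:<device slots>:<path to slurm-main.cfg>:<empty last field>
#   --timeout   request timeout in seconds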

ODC gRPC client

You can run the client interactively, or have it execute a given set of commands sequentially.

odc-grpc-client --host lxbk0600:6667

For an interactive run, the following client commands start a session, launch the topology, bring it from Idle to Running and back, and terminate the session.

.run -p slurm -r "{\"zone\":\"online\",\"n\":1}" --topo /lustre/rz/orybalch/cbm-integration/topology.xml
.config
.start
.stop
.reset
.term
.down

To check the state of the devices at any point (when they already exist):

.state --detailed
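Instead of typing the commands interactively, the whole sequence can also be scripted; a sketch assuming this ODC build provides the batch options of the common ODC command-line helper (--batch, --cmds), which should be verified with odc-grpc-client --help:

# Non-interactive sketch (assumes --batch/--cmds are available in this ODC version)
odc-grpc-client --host lxbk0600:6667 --batch --cmds \
  '.run -p slurm -r "{\"zone\":\"online\",\"n\":1}" --topo /lustre/rz/orybalch/cbm-integration/topology.xml' \
  '.config' '.start' '.stop' '.reset' '.term' '.down'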

Steps to run the "full" test case with data flowing from the mFLES cluster at close to real bandwidth

  1. Same environment setup as for the chain described above
  2. Set SDE_HOME to the folder where the clone of this repo was placed and go to it
export SDE_HOME=/lustre/cbm/users/ploizeau/sde2023
cd $SDE_HOME
  3. Change the SDE_HOME value in env.sh
  4. Start the ODC gRPC server with 8 possible device slots (remember to check with ps aux | grep 6668 whether the port is already in use on this lxbk node, and increase it if needed)
odc-grpc-server --sync --host "*:6668" --rp "slurm:/opt/fairsoft/nov22p1_online/bin/odc-rp-epn-slurm --zones online:8:$SDE_HOME/cbm-integration/slurm-main.cfg:" --timeout 120
  5. Start the ODC client
odc-grpc-client --host lxbk0600:6668
  6. In the client, start the topology and move it to the READY state
.run -p slurm -r "{\"zone\":\"online\",\"n\":1}" --topo /lustre/cbm/users/ploizeau/sde2023/cbm-integration/topology_mfles.xml
.config
  7. In another console connected to Virgo, check with squeue on which node (lxbk<XXXX>) the topology was started
  8. Connect a web browser to lxbk<XXXX>:8080 (from within the GSI trusted network, e.g. an lxg or lxi node; you may need to use Chromium instead of Firefox depending on the versions)
  9. Connect to the mFLES cluster (cbmfleslogin01, you will need an account there) and start the replay (it does not matter if the first timeslices are not caught)
spm-run /home/loizeau/scripts/replayers_node_2022_Ni_Au.spm
  10. In the ODC client, move the topology to the RUNNING state
.start
  11. Wait until it succeeds, then check that all devices are still alive after receiving the first TS, then try to refresh the web-browser page (it may take 10-20 s to receive the first batch of plots)
  12. In the folder explorer in the left tab, click on canvases and then on either cEvBSummary or cSampSummary_ to see some of the plots, then tick Monitoring to get live updates
  13. When done, first stop the topology in the ODC client (go from RUNNING to READY state)
.stop
  14. Once the state transition has returned (not before, otherwise the Sampler will hang!), stop the replayer on mFLES with CTRL + C
  15. Then tear down the topology in the ODC client with the following sequence (the output and histogram ROOT files should then be cleanly written to disk)
.reset
.term
.down

Steps to run the "full" test case with data replayed on Virgo at close to real bandwidth

Same as for the mFLES replay, except that:

  1. The spm-run command is replaced with commands executed on Virgo:
ssh -Y virgo-debian10.hpc.gsi.de
cd <cbm-integration-folder>
sbatch flesnet_replay_2022_ni.sbatch 2391 5557
  2. The sbatch command should be used to start the Virgo job before running the topology, so that the timeslice source node can be updated in the topology xml file (check the node with squeue -u $USER)
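A sketch of that preparation step (how the source host is stored in the topology file is not spelled out here, so the last step is left as a manual edit):

# start the replay job first, then find the node it landed on
sbatch flesnet_replay_2022_ni.sbatch 2391 5557
squeue -u $USER          # note the lxbk<XXXX> node assigned to the replay job
# then edit the timeslice source node in the topology xml used by .run before starting the topology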

Logs

The ODC log is written to $HOME/.ODC/log/odc_<date>.log.

The logs for individual devices are stored under $HOME/.DDS/sessions/<session_id>/wrk/<submission_id>/<job_id_and_node>.

For anything larger than a handful of devices, the logs and session files of DDS agents and user tasks should be moved to node-local storage (for example /tmp/). This can be configured in $HOME/.DDS/DDS.cfg under the key agent.work_dir.
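A sketch of the corresponding DDS.cfg entry, assuming the INI-style [section]/key layout used by the DDS user defaults (verify the key names for your DDS version, e.g. with dds-user-defaults):

# excerpt of $HOME/.DDS/DDS.cfg (assumed layout)
[agent]
# keep DDS agent and user-task work files on node-local storage instead of Lustre
work_dir=/tmp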

Known Issues

  1. Sometimes agent submission gets stuck on worker package creation. Investigated in FairRootGroup/DDS#468.

  2. RepReqTsSampler does not return out of the Running state on the Stop transition. Possibly an effect of missing input, but it should still be fixed.
    => Confirmed to be due to missing input: the sampler asks for the next timeslice through a flesnet library call, and in network mode this call never returns if no input is present.
    This leads to a hangup if a state transition is attempted at this stage (e.g. stopping before the source starts emitting, or stopping after the source stopped emitting).
    => To be discussed with the flesnet team.

  3. Environment variable expansion does not work when pointing to the topology in the ODC .run command, even when the variable is defined on both the client and server sides.

  4. The stop sequence is not controlled, sometimes leading to errors because clients try to contact servers which are already in the READY state.

  5. DigiEventSink crashes on termination. The cause is not clear, as the output file is properly closed just before; it seems there is one push/publish too many after the sockets are closed.
    => To be investigated on the Cbmroot side
