This repository contains the topology and configuration files needed for running CBM FairMQ devices with ODC/DDS.

We use the `lxbk0600` node here for convenience, but you can take any other or a random lxlogin node. You will need to know the name of the node where `odc-grpc-server` runs in order to connect to it with `odc-grpc-client`.

We use the container `/cvmfs/cbm.gsi.de/debian10/containers/debian10_master_v18.8.0_nov22p1_online_flesnet.sif`, where CBM, ODC and DDS are installed under `/opt/`. Adjust the host names and paths for your environment.
- Launch the CBM+ODC+DDS container on Virgo (both for the ODC server and the client)
  ```
  SINGULARITY_CONTAINER=/cvmfs/cbm.gsi.de/debian10/containers/debian10_master_v18.8.0_nov22p1_online_flesnet.sif ssh -o SendEnv=SINGULARITY_CONTAINER lxbk0600
  ```
- Initialize the CBM, ODC and DDS environment
  ```
  source /opt/cbmroot/master_v18.8.0_nov22p1_online_flesnet/bin/CbmRootConfig.sh -a
  ```
- Disable the HTTP proxy for gRPC
  ```
  export no_grpc_proxy=lxbk0600
  ```
- Set `HOME` to a path on Lustre, to be used for the DDS/ODC session files and logs (can be configured separately)
  ```
  export HOME=/lustre/rz/orybalch/
  ```
- Set `SDE_HOME` to the folder where the clone of this repo was placed
- Change the `SDE_HOME` value also in `env.sh`
- Start the ODC gRPC server
  ```
  odc-grpc-server --sync --host "*:6667" --rp "slurm:/opt/fairsoft/nov22p1_nodds/bin/odc-rp-epn-slurm --zones online:5:/lustre/rz/orybalch/cbm-integration/slurm-main.cfg:" --timeout 120
  ```
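The setup steps above can be collected into a single sketch. The paths follow the examples in this README; the `SDE_HOME` value is a hypothetical clone location, so adjust both for your environment:

```shell
#!/bin/bash
# Sketch of the environment setup above; run inside the container on the lxbk node.
# The first step (launching the container via ssh) is interactive, shown as a comment:
#   SINGULARITY_CONTAINER=/cvmfs/cbm.gsi.de/debian10/containers/debian10_master_v18.8.0_nov22p1_online_flesnet.sif \
#     ssh -o SendEnv=SINGULARITY_CONTAINER lxbk0600

# Initialize the CBM, ODC and DDS environment (guarded so the sketch is safe to run elsewhere)
CBM_CONFIG=/opt/cbmroot/master_v18.8.0_nov22p1_online_flesnet/bin/CbmRootConfig.sh
[ -f "$CBM_CONFIG" ] && source "$CBM_CONFIG" -a

export no_grpc_proxy=lxbk0600          # disable the HTTP proxy for gRPC
export HOME=/lustre/rz/orybalch/       # DDS/ODC session files and logs land under $HOME
export SDE_HOME=/lustre/rz/orybalch    # hypothetical: parent folder of the cbm-integration clone
```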
You can run the client interactively, or have it execute a given set of commands sequentially.

```
odc-grpc-client --host lxbk0600:6667
```

For an interactive run, the following client commands start a session, launch the topology, bring it from Idle to Running and back, and terminate the session.
```
.run -p slurm -r "{\"zone\":\"online\",\"n\":1}" --topo /lustre/rz/orybalch/cbm-integration/topology.xml
.config
.start
.stop
.reset
.term
.down
```
To check the state of the devices at any point (once they exist):

```
.state --detailed
```
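A non-interactive run of the same command sequence might look like the sketch below. The `--batch`/`--cmds` options are an assumption about the ODC client build, so check `odc-grpc-client --help` first:

```shell
# Hedged sketch: run the client command sequence non-interactively.
# --batch / --cmds are assumed options; verify with `odc-grpc-client --help`.
ODC_HOST="lxbk0600:6667"
ODC_CMDS='.run -p slurm -r "{\"zone\":\"online\",\"n\":1}" --topo /lustre/rz/orybalch/cbm-integration/topology.xml .config .start .stop .reset .term .down'
# Only attempt the call when the client binary is actually on the PATH:
if command -v odc-grpc-client >/dev/null 2>&1; then
  odc-grpc-client --host "$ODC_HOST" --batch --cmds "$ODC_CMDS"
fi
```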
Steps to run the "full" test case with data flowing from the mFLES cluster at close to real bandwidth:

- Same environment setting as for the chain described above
- Set `SDE_HOME` to the folder where the clone of this repo was placed and go to it
  ```
  export SDE_HOME=/lustre/cbm/users/ploizeau/sde2023
  cd $SDE_HOME
  ```
- Change the `SDE_HOME` value in `env.sh`
- Start the ODC gRPC server with 8 possible device slots (remember to check with `ps aux | grep 6668` whether the port is already in use on this lxbk node, and increase the port number if needed)
  ```
  odc-grpc-server --sync --host "*:6668" --rp "slurm:/opt/fairsoft/nov22p1_online/bin/odc-rp-epn-slurm --zones online:8:$SDE_HOME/cbm-integration/slurm-main.cfg:" --timeout 120
  ```
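The port check mentioned above can also be automated; a minimal sketch, assuming `ss` from iproute2 is available on the node:

```shell
# Start from the default port and bump it while it is already in use.
PORT=6668
while ss -tln 2>/dev/null | grep -q ":${PORT} "; do
  PORT=$((PORT + 1))
done
echo "free port: ${PORT}"
# Pass it on, e.g.: odc-grpc-server --sync --host "*:${PORT}" ...
```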
- Start the ODC client
  ```
  odc-grpc-client --host lxbk0600:6668
  ```
- In the client, start the topology and move it to the `READY` state
  ```
  .run -p slurm -r "{\"zone\":\"online\",\"n\":1}" --topo /lustre/cbm/users/ploizeau/sde2023/cbm-integration/topology_mfles.xml
  .config
  ```
- In another console connected to Virgo, check with `squeue` on which node (`lxbk<XXXX>`) the topology was started
- Connect a web browser to `lxbk<XXXX>:8080` (from within the GSI trusted network, e.g. an lxg or lxi node; you may need to use Chromium instead of Firefox depending on the versions)
- Connect to the mFLES cluster (`cbmfleslogin01`, you will need an account there) and start the replay (it does not matter if the first timeslices are not caught)
  ```
  spm-run /home/loizeau/scripts/replayers_node_2022_Ni_Au.spm
  ```
- In the ODC client, move the topology to the `RUNNING` state
  ```
  .start
  ```
- Wait until it succeeds, then check that all devices are still alive after receiving the first TS, then try to refresh the web-browser page (it may take 10-20 s to receive the first batch of plots)
- In the folder explorer in the left tab, click on `canvases` and then on either `cEvBSummary` or `cSampSummary_` to see some of the plots, then tick `Monitoring` to get live updates
- When done, first stop the topology in the ODC client (go from the `RUNNING` to the `READY` state)
  ```
  .stop
  ```
- When the state transition has returned (not before, otherwise the Sampler will hang!), stop the replayer on mFLES with `CTRL + C`
- Then tear down the topology in the ODC client with the following sequence (the output and histogram root files should then be cleanly written to disk)
  ```
  .reset
  .term
  .down
  ```
Same as for the mFLES replay, except that
- the `spm-run` command is replaced with a command executed on Virgo:
  ```
  ssh -Y virgo-debian10.hpc.gsi.de
  cd <cbm-integration-folder>
  sbatch flesnet_replay_2022_ni.sbatch 2391 5557
  ```
- this command should be used to start the Virgo job before running the topology, so that the timeslice source node can be changed in the xml file (check the node with `squeue -u $USER`)
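Looking up the assigned node can be scripted; a sketch, where the `sed` pattern for the source host in the xml is a hypothetical placeholder to adapt to your topology file:

```shell
# Read the first node of the user's running job from Slurm (empty if none).
NODE=$(squeue -u "$USER" -h -o "%N" 2>/dev/null | head -n 1)
# Hypothetical: substitute the timeslice source host in the topology file.
if [ -n "$NODE" ]; then
  sed -i "s/lxbk[0-9]*/${NODE}/g" topology_mfles.xml
fi
```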
The ODC log is written to `$HOME/.ODC/log/odc_<date>.log`.
The logs for the individual devices are stored under `$HOME/.DDS/sessions/<session_id>/wrk/<submission_id>/<job_id_and_node>`.
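To quickly locate the most recent logs, something like this sketch can help:

```shell
# Newest ODC log file and newest DDS session directory, if any.
LATEST_ODC_LOG=$(ls -t "$HOME"/.ODC/log/odc_*.log 2>/dev/null | head -n 1)
LATEST_SESSION=$(ls -t "$HOME/.DDS/sessions" 2>/dev/null | head -n 1)
if [ -n "$LATEST_ODC_LOG" ]; then
  tail -n 20 "$LATEST_ODC_LOG"
fi
if [ -n "$LATEST_SESSION" ]; then
  ls "$HOME/.DDS/sessions/$LATEST_SESSION"
fi
```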
For anything larger than a handful of devices, the location of the logs and session files for DDS agents and user tasks should be moved to the local node (for example to `/tmp/`). This can be configured in `$HOME/.DDS/DDS.cfg` under the key `agent.work_dir`.
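A minimal sketch of the relevant fragment, assuming the usual INI-style layout of `DDS.cfg` (section `[agent]`, key `work_dir`); check your installed file for the exact format:

```ini
; $HOME/.DDS/DDS.cfg -- only the relevant key shown
[agent]
work_dir=/tmp
```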
- Sometimes agent submission gets stuck on worker package creation. Investigated in FairRootGroup/DDS#468.
- RepReqTsSampler does not return out of the Running state on the Stop transition. Possibly an effect of missing input, but it should still be fixed.
  => Confirmed to be due to missing input: the sampler asks for the next timeslice through a flesnet library call, and this call never returns in network mode if no input is present. This leads to a hangup if a state transition is attempted at this stage (e.g. stopping before the source starts emitting, or stopping after the source stopped emitting).
  => To be discussed with the `flesnet` team.
- Environment variable expansion does not work when pointing to the topology in the ODC `.run` command, even when the variable is defined on both the client and the server side.
- The stop sequence is not controlled, sometimes leading to errors because clients try to contact servers which are already in the `READY` state.
- DigiEventSink crashes on termination. The cause is not clear, as the output file is properly closed just before; it seems there is one push/publish too many after the sockets are closed.
  => To be investigated on the Cbmroot side.