-
Notifications
You must be signed in to change notification settings - Fork 100
Debugging CUBE Services without Containers
CUBE is a complex software infrastructure, comprising the main core backend, as well as a constellation of ancillary services. Usually, if things break, the first place to start debugging is to make sure the ancillary services are all OK. This page documents a workflow that can be useful in debugging these CUBE services. Essentially, instructions on starting each core service directly (i.e. not in containers) is provided.
Due to the distributed and containerized nature of CUBE, debugging the ancillary services, pfcon
, pfioh
, and pman
can be particularly difficult. The most effective method of debugging is to run the services non-containerized, but preserving the pattern of volume mounts using symbolic links in the host filesystem, followed by a typical CUBE directive.
Create the following paths on the host FS -- in any directory. I typically do this in the ChRIS_ultron_backend
dir, but for debugging the services outside of ChRIS, any directory (like ~/tmp
) will be fine:
mkdir -p ./FS/remote # Simulate remote FS
chmod 777 ./FS/remote
# To debug "local" services, uncomment the following
# (Note that at first this is probably not necessary)
# mkdir -p ./FS/local # Simulate local FS
# chmod 777 ./FS/local
sudo mkdir /hostFS
sudo ln -s $(pwd)/FS/remote /hostFS/storeBase # Simulates the FS within the pman/pfioh
# containers
# If you are trying to debug *all* services, including 'local' ones,
# uncomment the below
#sudo ln -s $(pwd)/FS/local /hostFS/pfconFS # Simulates the FS within the pfcon/cube
# containers
sudo mkdir -p /usr/users # Explicitly maps the CUBE internal path
sudo chmod 777 /usr/users
The storeBase
will be used by pfioh
and pman
and simulates the remote FS. The pfconFS
and /usr/users
simulates the local and CUBE DB trees.
You must export an environment variable, HOST_IP to the terminal running pfurl
and ``pfcon```. On Linux this is:
export HOST_IP=$(ip route | grep -v docker | awk '{if(NF==11) print $9}')
A source of some frustration can be the pfurl
module which is used internally by pfcon
. If debugging pfurl
code while debugging the interaction of the services, it is a good idea to link the pfurl.py
module in the virtualenv to the file being edited. In my setup, this is effected by:
cd ~/src/python-env/chris_env/lib/python3.5/site-packages/pfurl
mv pfurl.py pfurl.orig.py
ln -s ~/src/pfurl/pfurl/pfurl.py .
Alternatively, if you make changes to pfurl
and propagate these all the way through the git repo, docker hub, and up to PiPy, then you can install this laters pfurl
in the virtualenv for pfcon
to use:
pip3 install pfurl==X.Y.Z
where X.Y.Z
is the version of pfurl
to install.
The simplest way to debug is to open four terminals. Each terminal will be used to run a specific service. Typically from left to right the terminals should be anchored in the following source repositories:
Terminal 1 | Terminal 2 | Terminal 3 | Terminal 4 |
---|---|---|---|
pfurl | pfcon | pfioh | pman |
In each terminal, cd
to the bin
dir of each, and run the respective service from terminal
./pfcon --forever --httpResponse
./pfioh --forever --httpResponse --createDirsAsNeeded --storeBase /hostFS/storeBase
rm -fr /tmp/pman ; ./pman --rawmode 1 --http --port 5010 --listeners 12
It is important that all the services need to stop and be restarted each time a change is made. In particular, the pman
service needs its database cleared if the same job is being submitted repeatedly during debugging.
If you don't rm -fr /tmp/pman before each call to pman
you WILL get unpredictable results and behaviour!
In debugging one often will repeat the same pfurl
command and associated JSON over and over again. Within this JSON is a directive specifying the service name to the swarm manager. Sometimes, the service remains in the swarm scheduler and if the same pfurl
is resent with the same service name, pman
will throw an internal exception that is hard to notice.
The solution is to check on the swarm service using
dsl
and if it exists, to remove the service
dss <service>
where dsl
(docker service list) is a shell alias
alias dsl="docker service ls "
and dss
(docker service stop) another alias
alias dss="docker service rm "
Assuming satisfied preconditions, let's say hello
to pfcon
. It will in turn ask each of pfioh
and pman
hello
and return the response.
./pfurl --verb POST --raw --http ${HOST_IP}:5005/api/v1/cmd --httpResponseBodyParse --jsonwrapper 'payload' --msg \
'{ "action": "hello",
"meta": {
"askAbout": "sysinfo",
"echoBack": "Hi there!",
"service": "host"
}
}'
To find the status on a job, say job 89
as in the examples below, use
./pfurl --verb POST --raw --http ${HOST_IP}:5005/api/v1/cmd --httpResponseBodyParse --jsonwrapper 'payload' --msg \
'{ "action": "hello",
"meta": {
"askAbout": "sysinfo",
"echoBack": "Hi there!",
"service": "host"
}
}'
In this call, be sure that the HOST_IP
env variable is set correctly.
./pfurl --verb POST --raw --http ${HOST_IP}:5005/api/v1/cmd --httpResponseBodyParse --jsonwrapper 'payload' --msg '
{
"action": "status",
"meta": {
"key": "jid",
"value": "89"
},
"service": "host"
'
In this call, be sure that the HOST_IP
env variable is set correctly.
pfurl --verb POST --raw --http ${HOST_IP}:5005/api/v1/cmd --httpResponseBodyParse --jsonwrapper 'payload' --msg '
{ "action": "coordinate",
"threadAction": true,
"meta-store": {
"meta": "meta-compute",
"key": "jid"
},
"meta-data": {
"remote": {
"key": "%meta-store"
},
"localSource": {
"path": "/neuro/users/rudolphpienaar/Pictures"
},
"localTarget": {
"path": "/home/rudolph/tmp/Pictures",
"createDir": true
},
"specialHandling": {
"op": "plugin",
"cleanup": true
},
"transport": {
"mechanism": "compress",
"compress": {
"encoding": "none",
"archive": "zip",
"unpack": true,
"cleanup": true
}
},
"service": "host"
},
"meta-compute": {
"cmd": "$execshell $selfpath/$selfexec --prefix test- --sleepLength 0 /share/incoming /share/outgoing",
"auid": "rudolphpienaar",
"jid": "89",
"threaded": true,
"container": {
"target": {
"image": "fnndsc/pl-simpledsapp",
"cmdParse": true
},
"manager": {
"image": "fnndsc/swarm",
"app": "swarm.py",
"env": {
"meta-store": "key",
"serviceType": "docker",
"shareDir": "%shareDir",
"serviceName": "89"
}
}
},
"service": "host"
}
}
'
Running all the services containerized can result in a lag in debugging, mostly because log files sometimes need to be fully flushed. At times, it is better to run the services non-containerized.
In such instances, add a breakpoint using pudb.set_trace()
typically in charm.py
. Then start the CUBE dev environment in a containerized fashion:
sudo rm -fr /hostFS/storeBase/* ; sudo rm -fr /usr/users/* ; *make*
When execution stops at the breakpoint, kill all the ancillary containers
dkrm pfcon
dkrm pfioh
dkrm pman
where dkrm
is actually a function I have in .bashrc
dkrm ()
{
NAME=$1;
ID=$(dkl | grep $NAME | awk '{print $1}');
docker stop $ID && docker rm -vf $ID
}
and then restart these services directly as per instructions above.