HTCondor batch system
Computing jobs that run on individual nodes with up to 32 CPU cores per node can be submitted to the CERN batch service (lxbatch). The jobs are submitted and managed using the HTCondor platform.
All users with access to the CERN Linux service and the AFS filesystem (which can be self-enabled at https://resources.web.cern.ch) can submit jobs to HTCondor, but they have a rather low priority by default.
ABP users working on computationally intensive tasks can be granted higher priority by being added to one of the following e-groups (based on their section):
| Section name | e-group name |
| --- | --- |
The section leaders and the ABP-CP members have admin rights to add users to the e-group of their section. All these e-groups are mapped to a single computing group called `group_u_BE.ABP.NORMAL`.
Detailed documentation managed by the IT department can be found here:
A quick start guide can be found here:
GUIs for monitoring the clusters can be found here:
Graphics Processing Units are available in the system. To use GPUs please follow the instructions available here:
An example submit file is:
```
executable = job_name/job_name.sh
arguments = $(ClusterId) $(ProcId)
output = job_name/htcondor.out
error = job_name/htcondor.err
log = job_name/htcondor.log
transfer_input_files = job_name
requirements = regexp("V100", Target.CUDADeviceName)
request_GPUs = 1
request_CPUs = 1
+MaxRunTime = 86400
queue
```
The line `requirements = regexp("V100", Target.CUDADeviceName)` selects only nodes that have a V100 GPU. Nodes with T4 GPUs also exist.
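For instance, to target the T4 nodes instead, the same pattern can be used. This is a sketch of a submit-file fragment; it assumes "T4" appears as a substring of `Target.CUDADeviceName` on those nodes:

```
# Submit-file fragment (not a shell script): select nodes advertising a T4 GPU.
requirements = regexp("T4", Target.CUDADeviceName)
request_GPUs = 1
```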
Nodes with large memory
Some nodes are equipped with a larger number of cores and more memory, namely (as of 31 Oct 2019):
- a few nodes with 24 physical cores and 1 TB of memory
- 6 nodes with 48 cores (hyperthreaded) and 512 GB of memory
These can be used via HTCondor by adding the appropriate lines to the submit file, e.g.:
```
RequestCpus = 24
+BigMemJob = True
```
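Putting it together, a minimal submit file for a big-memory job might look like the following. This is a sketch: the executable name and the run time are placeholders, only the `RequestCpus` and `+BigMemJob` lines come from the instructions above:

```
# Hypothetical submit file for a 24-core big-memory job.
executable = bigmem_job.sh
output = htcondor.out
error = htcondor.err
log = htcondor.log
RequestCpus = 24
+BigMemJob = True
+MaxRunTime = 86400
queue
```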
Examples of how HTCondor is used to manage PyECLOUD, PyHEADTAIL and PLACET simulations can be found here:
The shares of the different groups can be monitored on https://haggis.cern.ch/ (available only inside the CERN network). Search for "be" to see our shares.
Scheduler Not Replying
From time to time it happens that the scheduler does not reply. In general this is a temporary problem; if it persists, open an IT ticket. At the same time, you may try changing the scheduler you are assigned by default. This can be accomplished in one of two ways:
setting the two environment variables, for tcsh:

```
setenv _condor_SCHEDD_HOST bigbird02.cern.ch
setenv _condor_CREDD_HOST bigbird02.cern.ch
```

or, for bash:

```
export _condor_SCHEDD_HOST="bigbird02.cern.ch"
export _condor_CREDD_HOST="bigbird02.cern.ch"
```
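To make the choice survive across sessions, the exports can be appended to your shell startup file. This is a sketch for bash; `bigbird02.cern.ch` is only an example scheduler name, use the one you actually chose:

```shell
# Append the scheduler choice to ~/.bashrc so every new bash session picks it up.
# bigbird02.cern.ch is an example scheduler; substitute your own.
echo 'export _condor_SCHEDD_HOST="bigbird02.cern.ch"' >> ~/.bashrc
echo 'export _condor_CREDD_HOST="bigbird02.cern.ch"' >> ~/.bashrc
```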
The scheduler name appears in the output of a simple call to `condor_q`. If you don't set these variables, the reported scheduler name is the one assigned to you by default; otherwise, it is the one you have set via the variables above.
Please keep in mind that these statements, if typed in a terminal, apply only to that session. For instance, if you log out or the lxplus session expires, you have to set those two variables again if you want them in the new session. So please remember which scheduler you requested, otherwise you won't be able to retrieve the results from it via the `-name` parameter. You can also use a scheduler other than your default one by addressing it directly in your commands, e.g.:
```
condor_q -name bigbird15.cern.ch
```
This should work for any of the HTCondor commands (`condor_q`, `condor_submit`, etc.). The scheduler is then used only for that command.
Jobs Being Taken Very Slowly
It may happen that your jobs queue for too long. This might simply be due to an overload of the batch system (please check the batch GUI); more rarely, it can also be a problem with priorities: your jobs may have been assigned (by mistake) an accounting group with very low priority. You can check whether your jobs carry the wrong accounting group via the following command (an example output is shown):
```
$ condor_q owner $LOGNAME -long | grep '^AccountingGroup' | sort | uniq -c
      9 AccountingGroup = "group_u_ATLAS.u_zp.nkarast"
   1496 AccountingGroup = "group_u_BE.UNIX.u_pz.nkarast"
```
You can force the use of the high-priority accounting group by adding to your `.sub` file:

```
+AccountingGroup = "group_u_BE.ABP.NORMAL"
```
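In context, this might look as follows. A sketch: the executable name and run time are placeholders, only the `+AccountingGroup` line is prescribed above:

```
# Hypothetical submit file forcing the high-priority accounting group.
executable = myjob.sh
+AccountingGroup = "group_u_BE.ABP.NORMAL"
+MaxRunTime = 86400
queue
```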
The `-spool` option can be used at `condor_submit` level, e.g.:

```
condor_submit -spool htcondor.sub
```
In this case, the output files (those listed in `transfer_output_files`) as well as the `output`, `error` and `log` files are not returned automatically once the job finishes, but only when requested by the user after the job is over. Retrieval can be done via the following command:
```
condor_transfer_data $LOGNAME -const 'JobStatus == 4'
```
In the above example, the files from all completed jobs in each cluster will be retrieved.
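If you submitted via a specific scheduler, the `-name` mechanism described above should apply to the retrieval as well. A sketch; `bigbird02.cern.ch` is only an example scheduler name:

```
condor_transfer_data -name bigbird02.cern.ch $LOGNAME -const 'JobStatus == 4'
```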
In this case, the jobs may not automatically disappear from `condor_q`. Job removal takes place 10 days after the job has finished.
Submitting Jobs to HTCondor from a local Machine
The recommended way of using HTCondor is to submit jobs by logging in to lxplus.cern.ch.
It is also possible to configure your own computer to manage HTCondor jobs, as described in this guide.