= PSNC DRMAA for Torque/PBS Pro =

== Introduction ==
DRMAA for Torque/PBS Pro is an implementation of the Open Grid Forum [http://drmaa.org/ DRMAA] (Distributed Resource Management Application API) specification for submitting and controlling jobs on PBS systems: [http://www.clusterresources.com/products/torque-resource-manager.php Torque] and [http://www.pbsworks.com/Product.aspx?id=1 PBS Professional]. Using DRMAA, grid application builders, portal developers and ISVs can use the same high-level API to link their software with different cluster/resource management systems.

This software also enables the integration of [http://apps.man.poznan.pl/trac/qcg-computing QCG-Computing] with the underlying Torque/PBS Pro system for remote multi-user job submission and control over Web Services.

== Download ==
DRMAA for Torque/PBS Pro is distributed as a source package, which can be downloaded directly from [download:1 here] or via the [http://apps.man.poznan.pl/trac/pbs-drmaa/downloads Downloads] section.

== SVN access ==
{{{
$ svn co https://apps.man.poznan.pl/svn/pbs-drmaa/
}}}
Please note the `./autogen.sh` and `./autoclean.sh` scripts, which call the autotools command chain in the appropriate order.

**Note:** You need some [#dev_tools developer tools] to compile the SVN version. Also, the trunk version may not always compile.

== Installation ==
To compile the library, go to the main source directory and type:
{{{
$ ./configure [--prefix=/installation/directory] && make
}}}
If PBS is installed in a non-standard directory, pass that directory in the `--with-pbs` configure parameter. Basic use of the library has no unusual requirements: an ANSI C compiler and standard make should suffice (if linking against PBS Professional, you will also need the OpenSSL library). If you have taken the sources directly from the SVN repository, you will need additional [#dev_tools developer tools]. For further information regarding the GNU build system, see the INSTALL file.

For [http://www.clusterresources.com/products/torque-resource-manager.php Torque] it is advised to configure queues so that jobs are kept for some time after completion. To achieve this, run the following command for every queue that is intended to be used with PBS DRMAA:
{{{
# qmgr -c "set queue QUEUE_NAME keep_completed = 60"
}}}
or set it as a global server parameter:
{{{
# qmgr -c "set server keep_completed = 60"
}}}
The value of the `keep_completed` parameter is the number of seconds that jobs remain in the queue after completion (it should be greater than the `pool_delay` value in the PBS DRMAA configuration). This allows the DRMAA library to retrieve information about finished jobs. Alternatively, you can configure the DRMAA library to use the Torque server daemon logs as the information source for terminated jobs (consult the next section for details).

== Configuration ==
During DRMAA session initialization (`drmaa_init`) the library tries to read its configuration parameters from the following locations:
 * `PREFIX/etc/pbs_drmaa.conf`,
 * `~/.pbs_drmaa.conf`,
 * the file given in the `PBS_DRMAA_CONF` environment variable (if set to a non-empty string).

If multiple configuration sources are present, all configurations are merged, with values from user-defined files taking precedence (in the following order: `PBS_DRMAA_CONF`, `~/.pbs_drmaa.conf`, `PREFIX/etc/pbs_drmaa.conf`).

Currently recognized configuration parameters are:

 pool_delay::
   Amount of time (in seconds) between successive checks of unfinished jobs. Type: integer, default: 5
 wait_thread::
   A value of 1 enables a single "wait thread" for updating job status. When `pbs_home` is also set, the wait thread reads PBS log files instead of polling PBS daemons. Type: integer, default: 0
 pbs_home::
   Path to the Torque/PBS Pro spool directory that contains server logs (e.g. `/var/spool/pbs`). Type: string, default: not set
 job_categories::
   Dictionary of job categories.
   Its keys are job category names mapped to native specification strings (see the Native specification section). Attributes set by a job category can be overridden by the corresponding DRMAA attributes or by the native specification. The special category name `default` is used when the `drmaa_job_category` job attribute is not set.
 cache_job_state::
   According to the DRMAA specification, every `drmaa_job_ps()` call should query the DRM system for the job state. This option makes it possible to reduce communication with the DRM. If set to a positive integer, `drmaa_job_ps()` returns the remembered job state without contacting the DRM for `cache_job_state` seconds after the last update. By default the library conforms to the specification (no caching is performed). Type: integer, default: 0

=== Different modes of operation ===
||= wait_thread =||= pbs_home =||= mode =||= keep_completed needed =||= comments =||
|| 0 || not set || polling || yes || default configuration ||
|| 1 || not set || polling || yes || more efficient than the above ||
|| 1 || set || triggered || no || read access to server logs needed ||

=== Configuration file syntax ===
The configuration file has the form of a dictionary. A dictionary is a set of zero or more key-value pairs. A key is a string, while a value can be a string, an integer or another dictionary.
{{{
configuration: dictionary | dictionary_body
dictionary: '{' dictionary_body '}'
dictionary_body: (string ':' value ',')*
value: integer | string | dictionary
string: unquoted-string | single-quoted-string | double-quoted-string
unquoted-string: [^ \t\n\r:,0-9][^ \t\n\r:,]*
single-quoted-string: '[^']*'
double-quoted-string: "[^"]*"
integer: [0-9]+
}}}

=== Configuration file example ===
{{{
# pbs_drmaa.conf - Sample pbs_drmaa configuration file.

wait_thread: 0,
#pool_delay: 5,

job_categories: {
  #default: "-k n", # delete output files from execution hosts
  longterm: "-p -100 -l nice=5",
  amd64: "-l arch=amd64",
  python: "-l software=python",
  java: "-l software=java,vmem=500mb -v PATH=/opt/sun-jdk-1.6:/usr/bin:/bin",
  #test: "-u test -q testing",
},
}}}

== Native specification ==
The DRMAA interface allows DRM-dependent job submission options to be passed. These options are specified by setting `drmaa_native_specification`, which accepts space-delimited `qsub` options. `qsub` options which do not set job attributes (`-b`, `-z`, `-C`), as well as those meant for submission of interactive jobs (`-I`, `-X`) or for specifying directories (`-d`, `-D`), are ''not'' supported. Also, instead of the `-W` option, the following long options are accepted within the native specification: `--depend`, `--group-list`, `--stagein` and `--stageout`. For a detailed description of each option see the PBS documentation.

Attributes set in the native specification override the corresponding DRMAA job attributes. The table below lists native specification strings with their corresponding DRMAA attributes.

||= DRMAA attribute =||= PBS attribute =||= PBS resource native specification =||
||||||= Attributes which get overridden =||
|| drmaa_job_name || Job_Name || `-N` job name ||
|| drmaa_output_path || Output_Path || `-o` output path ||
|| drmaa_error_path || Error_Path || `-e` error path ||
|| drmaa_join_files || Join_Path || `-j` join options ||
|| drmaa_block_email || Mail_Points || `-m` mail options ||
|| drmaa_start_time || Execution_Time || `-a` start time ||
|| drmaa_js_state || Hold_Types || `-h` ||
|| .. || Account_Name || `-A` account string ||
|| .. || Checkpoint || `-c` interval ||
|| .. || Keep_Files || `-k` keep ||
|| .. || Priority || `-p` priority ||
|| .. || destination || `-q` queue ||
|| .. || Rerunable || `-r` y/n ||
|| .. || Shell_Path_List || `-S` path list ||
|| .. || User_List || `-u` user list ||
|| .. || group_list || `--group_list=`''groups'' ||
|| drmaa_v_env || Variable_List || `-v` variable list ||
|| .. || Variable_List || `-V` ||
|| drmaa_v_email || Mail_Users || `-M` user list ||
|| drmaa_duration_hlimit || Resource_List cput || `-l cput=`''limit'' ||
|| drmaa_wct_hlimit || Resource_List walltime || `-l walltime=`''limit'' ||
|| .. || Resource_List || `-l` resources ||

== Release notes ==

=== Changes in 1.0.5 release ===
 * made DRMAA tolerant of Torque restarts
 * `-lmem` can now be used in the native specification attribute

=== Changes in 1.0.4 release ===
 * fixed "mtime" date parsing ('triggered' mode)
 * fixed the "submit_args" attribute bug (PBS Professional only)

=== Changes in 1.0.3 release ===
 * new implementation of the "wait thread" which reads PBS log files (increased scalability)
 * support for the native specification attribute
 * memory leak fixes
 * test suite passed on PBS Pro 10
 * exit codes 126-127 cause `drmaa_wifaborted()` to return true
 * other bug fixes

=== Changes in 1.0.2 release ===
 * automatic reconnect on PBS connection errors
 * static linkage with DRMAA utilities
 * other bug fixes

=== Changes in 1.0.1 release ===
 * a number of attributes implemented:
   * `drmaa_start_time`
   * `drmaa_duration_hlimit`
   * `drmaa_wct_hlimit`
   * `drmaa_native_specification`
   * `drmaa_job_category`
 * configuration file(s)
 * separate wait thread
 * many bug fixes
 * more robust code
 * separate DRMAA utilities library
 * Python-driven test suite

=== Known bugs and limitations ===
The library covers nearly all of the DRMAA 1.0 [#specification specification], with the exceptions listed below. It passes the [#testsuite official DRMAA test-suite] except for tests which require the job termination status. All mandatory and some optional job attributes (namely: transfer files, wall clock time hard limit, job run duration hard limit) are implemented.

Known limitations imposed by the PBS API:
 * With PBS Pro (and OpenPBS) retrieving the job termination status is impossible.
   For this DRM, finished jobs are marked as done with a return code of 0, unless the job was terminated through the library, in which case it is treated as aborted and killed after receiving SIGTERM.
 * The library accepts job identifiers only of those jobs which were submitted in the current session (the specification says it should also accept job identifiers from previous sessions, and even of jobs submitted in a former execution of a DRMAA-enabled application). This could only be partially fixed, as the job state needs to be kept by the library in order to cope with PBS shortcomings.
 * Job termination (when the job is running) is carried out by PBS by sending SIGTERM and/or SIGKILL; therefore receipt of those signals cannot be distinguished from an abort requested with `drmaa_control(DRMAA_CONTROL_TERMINATE)`. In that case the job termination state is marked as "aborted" and "signaled" regardless of the actual state.
 * `drmaa_wcoredump()` always returns `false`.
 * Waiting functions (`drmaa_wait()` and `drmaa_synchronize()`) must poll the DRM to find out whether a job has finished.

[=#dev_tools]
=== Developer tools ===
Although not needed by users of the library, the following tools may be required if you intend to develop PSNC DRMAA for Torque/PBS Pro:
 * GNU autotools
   * autoconf (tested with version 2.67)
   * automake (tested with version 1.11)
   * libtool (tested with version 2.2.8)
   * m4 (tested with version 1.4.14)
 * [http://www.gnu.org/software/bison/ Bison] parser generator,
 * [http://www.complang.org/ragel/ RAGEL] State Machine Compiler,
 * [http://www.gnu.org/software/gperf/ gperf], a perfect hash function generator.

=== Links ===
[=#drmaa] DRMAA: http://www.drmaa.org/ \\
[=#open_grid_forum] Open Grid Forum: http://www.gridforum.org/ \\
[=#specification] DRMAA 1.0 specification: http://www.ogf.org/documents/GFD.133.pdf \\
[=#testsuite] Official DRMAA test-suite: http://drmaa.org/testsuite.php \\
[=#smoa_comp] Smoa Computing: http://apps.man.poznan.pl/trac/smoa-comp \\
[=#bison] Bison: http://www.gnu.org/software/bison/ \\
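
=== Example: reading the configuration syntax (sketch) ===
The grammar given under "Configuration file syntax" can be read with a small recursive-descent parser. The Python sketch below is only an illustration of that grammar and is not the parser used by the library. The names `tokenize` and `parse_config` are hypothetical, and treating `#` as the start of a comment is an assumption taken from the sample file rather than from the grammar itself.

```python
import re

# Token patterns follow the grammar; '#' comments are an assumption
# based on the sample pbs_drmaa.conf, not part of the grammar.
_TOKEN = re.compile(r"""
    (?P<skip>\s+|\#[^\n]*)
  | (?P<punct>[{}:,])
  | (?P<int>[0-9]+)
  | '(?P<sq>[^']*)'
  | "(?P<dq>[^"]*)"
  | (?P<bare>[^\s:,0-9{}'"\#][^\s:,{}]*)
""", re.VERBOSE)


def tokenize(text):
    """Yield (kind, value) pairs; kind is 'punct', 'int' or 'str'."""
    pos = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        pos = match.end()
        kind = match.lastgroup
        if kind == "skip":
            continue
        if kind == "int":
            yield ("int", int(match.group()))
        elif kind == "punct":
            yield ("punct", match.group())
        else:  # quoted or unquoted string
            yield ("str", match.group(kind))


def parse_config(text):
    """Parse configuration text into nested dicts of str/int values."""
    tokens = list(tokenize(text))
    index = 0

    def peek():
        return tokens[index] if index < len(tokens) else ("eof", None)

    def take(kind, exact=None):
        nonlocal index
        tok = peek()
        if tok[0] != kind or (exact is not None and tok[1] != exact):
            raise ValueError(f"expected {exact or kind}, got {tok[1]!r}")
        index += 1
        return tok[1]

    def parse_body():
        # dictionary_body: (string ':' value ',')*
        result = {}
        while peek()[0] == "str":
            key = take("str")
            take("punct", ":")
            result[key] = parse_value()
            take("punct", ",")
        return result

    def parse_value():
        # value: integer | string | dictionary
        tok = peek()
        if tok == ("punct", "{"):
            take("punct", "{")
            nested = parse_body()
            take("punct", "}")
            return nested
        if tok[0] in ("int", "str"):
            return take(tok[0])
        raise ValueError(f"expected a value, got {tok[1]!r}")

    # configuration: dictionary | dictionary_body
    result = parse_value() if peek() == ("punct", "{") else parse_body()
    if peek()[0] != "eof":
        raise ValueError(f"unexpected trailing token {peek()[1]!r}")
    return result


if __name__ == "__main__":
    sample = 'wait_thread: 0,\njob_categories: { amd64: "-l arch=amd64", },\n'
    print(parse_config(sample))
```

Note that the grammar requires a comma after every key-value pair, including the last one, so the sketch rejects input such as `a: 1` without the trailing comma.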