Version 2 (modified by mmamonski, 11 years ago)

Configuration file example
--------------------------

::

  # pbs_drmaa.conf - Sample pbs_drmaa configuration file.

  wait_thread: 0,

  #pool_delay: 5,

  job_categories: {
    #default: "-k n", # delete output files from execution hosts
    longterm: "-p -100 -l nice=5",
    amd64: "-l arch=amd64",
    python: "-l software=python",
    java: "-l software=java,vmem=500mb -v PATH=/opt/sun-jdk-1.6:/usr/bin:/bin",
    #test: "-u test -q testing",
  },
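
The ``job_categories`` section above maps category names to qsub-style option strings. As a rough illustration only, the lookup can be modelled in Python as a plain dictionary. The names ``JOB_CATEGORIES`` and ``options_for_category`` are hypothetical and not part of pbs_drmaa, and returning an empty string for an unknown category is an assumption of this sketch, not documented library behaviour.

```python
# Hypothetical mirror of the uncommented job_categories entries
# from the sample pbs_drmaa.conf shown above.
JOB_CATEGORIES = {
    "longterm": "-p -100 -l nice=5",
    "amd64": "-l arch=amd64",
    "python": "-l software=python",
    "java": "-l software=java,vmem=500mb -v PATH=/opt/sun-jdk-1.6:/usr/bin:/bin",
}


def options_for_category(category, categories=JOB_CATEGORIES):
    """Return the option string configured for a job category.

    Unknown categories yield an empty string; this is an assumption of
    the sketch, not a statement about how the library behaves.
    """
    return categories.get(category, "")
```

For example, ``options_for_category("amd64")`` yields the string configured for the ``amd64`` category.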

Native specification
====================

The DRMAA interface allows DRM-dependent job submission options to be passed. These options may be specified by setting drmaa_native_specification, which accepts space-delimited qsub options. qsub options which do not set job attributes (-b, -z, -C), as well as those meant for submitting interactive jobs (-I, -X) or for specifying directories (-d, -D), are *not* supported. Also, instead of the -W option, the following long options are accepted within the native specification: --depend, --group-list, --stagein and --stageout. For a detailed description of each option see the PBS documentation.

Attributes set in the native specification override the corresponding DRMAA job attributes.
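
The restrictions described above can be checked before a string is handed to the library. The following is a minimal sketch, not part of pbs_drmaa: the helper name ``find_unsupported_options`` is hypothetical, and the check is deliberately naive (it inspects only the first two characters of each token, so an option *value* that happens to start with one of these flags would be misreported).

```python
import shlex

# Short options the text lists as not supported in a native specification,
# plus -W, which is replaced by the long options (--depend, --stagein, ...).
UNSUPPORTED = {"-b", "-z", "-C", "-I", "-X", "-d", "-D", "-W"}


def find_unsupported_options(native_spec):
    """Return the unsupported short options found in a native specification."""
    found = []
    for token in shlex.split(native_spec):
        # Long options such as --depend are accepted, so skip them.
        if token.startswith("--"):
            continue
        # Compare only the flag part so combined forms like "-d/tmp" match too.
        if token[:2] in UNSUPPORTED:
            found.append(token[:2])
    return found
```

A call such as ``find_unsupported_options("-I -N job1 -d /tmp")`` reports the interactive and directory options, while a string made only of supported options returns an empty list.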

.. table:: Native specification strings with corresponding DRMAA attributes.

  ===================== =============== ============ ====================
  DRMAA attribute       PBS attribute   PBS resource native specification
  ===================== =============== ============ ====================
  Attributes which get overridden
  -----------------------------------------------------------------------
  drmaa_job_name        Job_Name                     -N job name
  drmaa_output_path     Output_Path                  -o output path
  drmaa_error_path      Error_Path                   -e error path
  drmaa_join_files      Join_Path                    -j join options
  drmaa_block_email     Mail_Points                  -m mail options
  drmaa_start_time      Execution_Time               -a start time
  drmaa_js_state        Hold_Types                   -h
  ..                    Account_Name                 -A account string
  ..                    Checkpoint                   -c interval
  ..                    Keep_Files                   -k keep
  ..                    Priority                     -p priority
  ..                    destination                  -q queue
  ..                    Rerunable                    -r y/n
  ..                    Shell_Path_List              -S path list
  ..                    User_List                    -u user list
  ..                    group_list                   --group_list=\groups
  drmaa_v_env           Variable_List                -v variable list
  ..                    Variable_List                -V
  drmaa_v_email         Mail_Users                   -M user list
  drmaa_duration_hlimit Resource_List   cput         -l cput=\limit
  drmaa_wct_hlimit      Resource_List   walltime     -l walltime=\limit
  ..                    Resource_List                -l resources
  ..                    depend                       --depend=\dependency
  ..                    stagein                      --stagein=\stagein
  ..                    stageout                     --stageout=\stageout
  ===================== =============== ============ ====================
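
To see which DRMAA attributes a given native specification would take precedence over, the table above can be turned into a small lookup. This is an illustrative sketch only: ``OPTION_TO_DRMAA`` covers just the short options that have a DRMAA counterpart in the table (the ``-l cput=`` and ``-l walltime=`` resource forms are left out for brevity), and ``overridden_attributes`` is a hypothetical helper, not a library function. Like any token-based check, it can be fooled by option values that start with a dash.

```python
import shlex

# Subset of the table: short qsub option -> DRMAA attribute it overrides.
OPTION_TO_DRMAA = {
    "-N": "drmaa_job_name",
    "-o": "drmaa_output_path",
    "-e": "drmaa_error_path",
    "-j": "drmaa_join_files",
    "-m": "drmaa_block_email",
    "-a": "drmaa_start_time",
    "-h": "drmaa_js_state",
    "-v": "drmaa_v_env",
    "-M": "drmaa_v_email",
}


def overridden_attributes(native_spec):
    """List DRMAA attributes that the native specification would override."""
    names = []
    for token in shlex.split(native_spec):
        if token.startswith("--"):
            continue  # long options have no DRMAA counterpart in the table
        name = OPTION_TO_DRMAA.get(token[:2])
        if name is not None and name not in names:
            names.append(name)
    return names
```

For instance, a specification that sets ``-N`` and ``-o`` would override drmaa_job_name and drmaa_output_path, while ``-q`` has no DRMAA counterpart and is not reported.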

Release notes
=============

Changes in 1.0.5 release
------------------------

  • make drmaa tolerant to torque restarts
  • now one can use '-lmem' in native specification attribute

Changes in 1.0.4 release
------------------------

  • fix "mtime" date parsing ('triggered' mode)
  • fix "submit_args" attribute bug (PBS Professional only)

Changes in 1.0.3 release
------------------------

  • new implementation of the "wait thread" which reads PBS log files (increased scalability)
  • support for native specification attribute
  • memleak fixes
  • testsuite passed on PBS Pro 10
  • exit codes 126-127 cause drmaa_wifaborted() to return true
  • other bug fixes

Changes in 1.0.2 release
------------------------

  • automatic reconnect on PBS connection errors
  • static linkage with DRMAA utilities
  • other bug fixes

Changes in 1.0.1 release
------------------------

  • number of attributes implemented:

    • drmaa_start_time
    • drmaa_duration_hlimit
    • drmaa_wct_hlimit
    • drmaa_native_specification
    • drmaa_job_category

  • configuration file(s)
  • separate wait thread
  • lots of bug fixes
  • more robust code
  • separated DRMAA utilities library
  • Python-driven test-suite

Known bugs and limitations
--------------------------

The library covers nearly all of the DRMAA 1.0 specification_, with the exceptions listed below. It passes the official DRMAA test-suite_ except for tests which require job termination status. All mandatory and some optional job attributes (namely: transfer files, wall clock time hard limit, job run duration hard limit) are implemented.

Known limitations imposed by PBS API:

  • With PBS Pro_ (and OpenPBS_), retrieving the job termination status is impossible. For these DRMs, finished jobs are marked as done with a return code of 0, unless the job was terminated through the library, in which case it is treated as aborted and killed after receiving SIGTERM.
  • The library accepts only identifiers of jobs which were submitted within the current session (the specification says it should also accept job identifiers from previous sessions, and even of jobs submitted in a former execution of a DRMAA-enabled application). This could only be partially fixed, as job state needs to be kept by the library in order to cope with PBS shortcomings.
  • Job termination (when the job is running) is realized by PBS by sending SIGTERM and/or SIGKILL; therefore a job that received those signals cannot be distinguished from one aborted using drmaa_control(DRMAA_CONTROL_TERMINATE). In that case the job termination state is marked as "aborted" and "signaled", whatever the actual state is.
  • drmaa_wcoredump() always returns false.
  • Waiting functions (drmaa_wait() and drmaa_synchronize()) must poll the DRM to find out whether a job has finished.
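
The polling behaviour mentioned in the last item can be pictured with a short loop. This is a conceptual sketch under stated assumptions, not the library's implementation: ``get_state`` is a hypothetical callable standing in for a DRM status query, the state names ``"done"`` and ``"failed"`` are placeholders, and the default delay of 5 seconds merely echoes the ``pool_delay`` value from the sample configuration file.

```python
import time


def wait_by_polling(get_state, job_id, poll_delay=5.0, timeout=None,
                    sleep=time.sleep):
    """Poll get_state(job_id) until the job finishes or the timeout expires.

    Returns the final state, or None if the timeout was reached first.
    """
    finished = {"done", "failed"}  # placeholder state names
    waited = 0.0
    while True:
        state = get_state(job_id)
        if state in finished:
            return state
        if timeout is not None and waited >= timeout:
            return None
        # No notification is available, so sleep and ask again.
        sleep(poll_delay)
        waited += poll_delay
```

The ``sleep`` parameter exists only so the loop can be exercised without real delays, for example by passing ``sleep=lambda seconds: None`` together with a fake ``get_state``.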

Test-suite
----------

The DRMAA for Torque/PBS Pro library was successfully tested with PBS Pro_ 10 (10.0.0.82981) and Torque_ 2.5.1 on Linux. The following tables present the results of tests from the official DRMAA test-suite_ (originally developed for Sun Grid Engine).

.. table:: Mode - Polling

  =============================================== =========== ============
  Test name                                       PBS Pro 10  Torque 2.5.1
  =============================================== =========== ============
  test_mt_exit_during_submit                      passed      passed
  test_mt_exit_during_submit_or_wait              passed      passed
  test_mt_submit_before_init_wait                 passed      passed
  test_mt_submit_mt_wait                          passed      passed
  test_mt_submit_wait                             passed      passed
  test_st_attribute_change                        passed      passed
  test_st_bulk_singlesubmit_wait_individual       passed      passed
  test_st_bulk_submit_in_hold_session_delete      passed      passed
  test_st_bulk_submit_in_hold_session_release     passed      passed
  test_st_bulk_submit_in_hold_single_delete       passed      passed
  test_st_bulk_submit_in_hold_single_release      passed      passed
  test_st_bulk_submit_wait                        passed      passed
  test_st_contact                                 passed      passed
  test_st_drm_system                              passed      passed
  test_st_drmaa_impl                              passed      passed
  test_st_empty_session_control                   passed      passed
  test_st_empty_session_synchronize_dispose       passed      passed
  test_st_empty_session_synchronize_nodispose     passed      passed
  test_st_empty_session_wait                      passed      passed
  test_st_error_file_failure                      FAILED [1]_ passed
  test_st_exit_status                             FAILED [1]_ passed
  test_st_input_file_failure                      FAILED [1]_ passed
  test_st_mult_exit                               passed      passed
  test_st_mult_init                               passed      passed
  test_st_output_file_failure                     FAILED [1]_ passed
  test_st_submit_in_hold_delete                   passed      passed
  test_st_submit_in_hold_release                  passed      passed
  test_st_submit_kill_sig                         FAILED [1]_ passed
  test_st_submit_polling_synchronize_timeout      passed      passed
  test_st_submit_polling_synchronize_zerotimeout  passed      passed
  test_st_submit_polling_wait_timeout             passed      passed
  test_st_submit_polling_wait_zerotimeout         passed      passed
  test_st_submit_suspend_resume_wait              passed      passed
  test_st_submit_wait                             passed      passed
  test_st_submitmixture_sync_all_dispose          passed      passed
  test_st_submitmixture_sync_all_nodispose        passed      passed
  test_st_submitmixture_sync_allids_dispose       passed      passed
  test_st_submitmixture_sync_allids_nodispose     passed      passed
  test_st_supported_attr                          passed      passed
  test_st_supported_vattr                         passed      passed
  test_st_usage_check                             passed      passed
  test_st_version                                 passed      passed
  =============================================== =========== ============

.. [1] Due to unavailability of job termination status.

.. table:: Mode - Triggered

  =============================================== =========== ============
  Test name                                       PBS Pro 10  Torque 2.5.1
  =============================================== =========== ============
  test_mt_exit_during_submit                      passed      passed
  test_mt_exit_during_submit_or_wait              passed      passed
  test_mt_submit_before_init_wait                 passed      passed
  test_mt_submit_mt_wait                          passed      passed
  test_mt_submit_wait                             passed      passed
  test_st_attribute_change                        passed      passed
  test_st_bulk_singlesubmit_wait_individual       passed      passed
  test_st_bulk_submit_in_hold_session_delete      passed      passed
  test_st_bulk_submit_in_hold_session_release     passed      passed
  test_st_bulk_submit_in_hold_single_delete       passed      passed
  test_st_bulk_submit_in_hold_single_release      passed      passed
  test_st_bulk_submit_wait                        passed      passed
  test_st_contact                                 passed      passed
  test_st_drm_system                              passed      passed
  test_st_drmaa_impl                              passed      passed
  test_st_empty_session_control                   passed      passed
  test_st_empty_session_synchronize_dispose       passed      passed
  test_st_empty_session_synchronize_nodispose     passed      passed
  test_st_empty_session_wait                      passed      passed
  test_st_error_file_failure                      passed      passed
  test_st_exit_status                             passed      passed
  test_st_input_file_failure                      passed      passed
  test_st_mult_exit                               passed      passed
  test_st_mult_init                               passed      passed
  test_st_output_file_failure                     passed      passed
  test_st_submit_in_hold_delete                   passed      passed
  test_st_submit_in_hold_release                  passed      passed
  test_st_submit_kill_sig                         passed      passed
  test_st_submit_polling_synchronize_timeout      passed      passed
  test_st_submit_polling_synchronize_zerotimeout  passed      passed
  test_st_submit_polling_wait_timeout             passed      passed
  test_st_submit_polling_wait_zerotimeout         passed      passed
  test_st_submit_suspend_resume_wait              passed      passed
  test_st_submit_wait                             passed      passed
  test_st_submitmixture_sync_all_dispose          passed      passed
  test_st_submitmixture_sync_all_nodispose        passed      passed
  test_st_submitmixture_sync_allids_dispose       passed      passed
  test_st_submitmixture_sync_allids_nodispose     passed      passed
  test_st_supported_attr                          passed      passed
  test_st_supported_vattr                         passed      passed
  test_st_usage_check                             passed      passed
  test_st_version                                 passed      passed
  =============================================== =========== ============