Version 8 (modified by pkopta, 6 years ago)


Introduction

The QCG-Computing service is an open-source computing provider that exposes on-demand access to computing resources and jobs over a Web Services interface compliant with the HPC Basic Profile. In addition, QCG-Computing offers a remote interface for managing advance reservations.

Within QosCosGrid, the QCG-Notification service is widely used for brokering notification messages related to the state of a job (e.g. a predefined job status or a snippet from the job's output file).

This document describes the installation of two QCG services: QCG-Computing and QCG-Notification. Both should be deployed on the same machine (or virtual machine), which:

  • has at least 1 GB of memory (recommended value: 2 GB)
  • has 10 GB of free disk space (most of the space will be used by the log files)
  • has any modern CPU (if you plan to use a virtual machine, you should dedicate one or two cores of the host machine to it)
  • is running under:
    • CentOS 7 (in most cases the provided RPMs should work with any operating system based on Red Hat Enterprise Linux 7)
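
The requirements above can be checked with a short script before installation. The following is a minimal sketch and not part of the QCG distribution: the function names are hypothetical, and it assumes a Linux host that provides /proc/meminfo and keeps the service logs under /var/log.

```shell
# Hypothetical pre-flight helpers; not shipped with QCG.

# Print total memory in MB, read from a meminfo-style file
# (defaults to /proc/meminfo on Linux).
mem_total_mb() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Print free space in GB on the file system that holds the given path.
free_disk_gb() {
    df -Pk "${1:-/var/log}" | awk 'NR == 2 { print int($4 / 1048576) }'
}

# Compare this host against the minimum values from the list above.
# MemTotal excludes memory reserved by the kernel, so the memory check
# allows some slack below the nominal 1 GB (1024 MB).
check_host() {
    mem=$(mem_total_mb)
    disk=$(free_disk_gb /var/log)
    rc=0
    if [ "$mem" -lt 900 ]; then
        echo "memory: ${mem} MB (at least 1 GB required)"
        rc=1
    fi
    if [ "$disk" -lt 10 ]; then
        echo "disk: ${disk} GB free under /var/log (10 GB required)"
        rc=1
    fi
    return "$rc"
}
```

Running check_host prints one line per unmet requirement and returns a non-zero status if any requirement is not met.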

Prerequisites

We assume that the local resource manager/scheduler is already installed. The QCG host would typically be a submit machine for the scheduling system.

Since version 2.4, the QCG-Computing service discovers installed applications using the Environment Modules package. For this reason you should install Environment Modules on the QCG-Computing host, mount the directories that contain all module files used on your cluster, and make sure that the qcg-comp user can see all modules.

The QosCosGrid services do not require you to install any QCG component on the worker nodes. However, the application wrapper scripts need the following software to be available on the worker nodes:

  • bash,
  • rsync,
  • zip/unzip,
  • dos2unix,
  • python.

These tools are usually available out of the box on most HPC systems.
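
Whether a worker node provides these tools can be verified with a few lines of shell. This is a minimal sketch; the helper name missing_tools and the variable QCG_WORKER_TOOLS are hypothetical and only illustrate the check.

```shell
# Hypothetical helper: print every command from the argument list
# that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools the application wrapper scripts rely on (taken from the list above).
QCG_WORKER_TOOLS="bash rsync zip unzip dos2unix python"
```

Running missing_tools $QCG_WORKER_TOOLS on a worker node (for example inside a short test job) prints nothing when every tool is present.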

Shared file system

Deployment of QCG-Computing usually requires two shared file systems in the cluster:

  • "Users' directories" - shared between the QCG host and all worker nodes. Used for storing job sandbox directories. It can be either the HOME or the scratch file system. You can read more about this here.
  • "Applications scripts" - shared between all worker nodes. Used for storing Application Scripts.

Firewall configuration

In order to expose the QosCosGrid services externally you need to open the following incoming ports in the firewall:

  • 19000 (TCP) - QCG-Computing
  • 19001 (TCP) - QCG-Notification
  • 2811 (TCP) - GridFTP server
  • 20000-25000 (TCP) - GridFTP port-range (if you want to use different port-range adjust the GLOBUS_TCP_PORT_RANGE variable in the /etc/xinetd.d/gsiftp file)

You may also want to allow SSH access from white-listed machines (for administration purposes only).
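
As an illustration, the incoming ports listed above could be opened with firewalld. This is only a sketch under the assumption that firewalld manages the firewall on the host (the CentOS 7 default); the helper name is hypothetical, and it prints the commands instead of executing them so they can be reviewed first.

```shell
# Hypothetical helper: print (not execute) the firewalld commands that open
# the incoming TCP ports listed above. Assumes firewalld is in use.
qcg_firewall_commands() {
    for port in 19000 19001 2811 20000-25000; do
        printf 'firewall-cmd --permanent --add-port=%s/tcp\n' "$port"
    done
    # Permanent rules take effect only after a reload.
    printf 'firewall-cmd --reload\n'
}
```

After reviewing the output of qcg_firewall_commands, it can be piped to sh as root. If you changed GLOBUS_TCP_PORT_RANGE, adjust the port range accordingly.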

The following outgoing traffic should generally be allowed:

  • NTP, DNS, HTTP, HTTPS services
  • GridFTP (TCP port 2811 and port range 20000-25000)

QCG-Computing

Preparation of the environment

Database

  • Install the database backend (PostgreSQL). On CentOS Linux this can be done with:
    yum install postgresql postgresql-server
    
  • Install UnixODBC and the PostgreSQL ODBC driver:
    yum install unixODBC postgresql-odbc
    

CA and host certificates

First, install all the needed trusted CA certificates (instruction). We also assume that the X.509 host certificate (signed by your local Certificate Authority) and its key are already installed in the following locations:

  • /etc/grid-security/qcg-compcert.pem
  • /etc/grid-security/qcg-compkey.pem

If QCG-Computing is run from an unprivileged account, these files must be owned by that account. Because the qcg-comp account is created during installation, we suggest using it as the owner of the certificate and key files.
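
The ownership of the certificate and key files can be checked from the shell. The sketch below uses hypothetical helper names and GNU stat (as found on CentOS). The mode check is an assumption that goes beyond this document: Globus-based tooling typically expects the private key to be readable by its owner only (mode 400 or 600).

```shell
# Print the owner of a file, to compare against the service account.
file_owner() {
    stat -c '%U' "$1"
}

# Hypothetical check (assumption, not stated in this guide): succeed only if
# the file exists and its mode is 400 or 600, i.e. owner-only access.
key_mode_ok() {
    mode=$(stat -c '%a' "$1" 2>/dev/null) || return 1
    [ "$mode" = "400" ] || [ "$mode" = "600" ]
}
```

For a single-account setup, file_owner /etc/grid-security/qcg-compkey.pem would be expected to print qcg-comp.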

Other

Most grid services and security infrastructures are sensitive to time skew. We therefore recommend installing a Network Time Protocol daemon or using any other solution that provides accurate clock synchronization. Also disable automatic package updates, as they may disrupt a running system.

Installation

First, install the unstable repository:

rpm -Uvh http://www.qoscosgrid.org/qcg-packages/centos7/x86_64/qcg-repo-unstable-1.0.0-1.centos7.noarch.rpm

Now the packages can be installed:

yum install qcg-comp qcg-comp-client qcg-comp-logrotate

Database initialization

  • Set up the QCG-Computing database using the provided script:
    /usr/share/qcg-comp/tools/qcg-comp-install.sh
    Welcome to qcg-comp installation script!
     
    This script will guide you through process of configuring proper environment
    for running the QCG-Computing service. You have to answer few questions regarding
    parameters of your database. If you are not sure just press Enter and use the
    default values.
      
    Use local PostgreSQL server? (y/n) [y]: y
    Database [qcg-comp]: 
    User [qcg-comp]: 
    Password [RAND-PASSWD]: MojeTajneHaslo
    Create database? (y/n) [y]: y
    Create user? (y/n) [y]: y
      
    Checking for system user qcg-comp...OK
    Checking whether PostgreSQL server is installed...OK
    Checking whether PostgreSQL server is running...OK
      
    Performing installation
    * Creating user qcg-comp...OK
    * Creating database qcg-comp...OK
    * Creating database schema...OK
    * Checking for ODBC data source qcg-comp...
    * Installing ODBC data source...OK
        
    Remember to add appropriate entry to /var/lib/pgsql/data/pg_hba.conf (as the first rule!) to allow user qcg-comp to
    access database qcg-comp. For instance:
      
    host    qcg-comp       qcg-comp       127.0.0.1/32    md5
      
    and reload Postgres server.
    

Add a new rule to the pg_hba.conf as requested:

vim /var/lib/pgsql/data/pg_hba.conf 
systemctl reload postgresql
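
Instead of editing the file by hand, the rule can be placed at the top of pg_hba.conf by a script. This is a minimal sketch with a hypothetical function name; it inserts exactly the rule printed by the installation script and does nothing if that line is already present anywhere in the file.

```shell
# Hypothetical helper: make the rule requested by the installer the first
# line of a pg_hba.conf-style file, unless that exact line already exists.
# Note: an existing copy of the rule further down the file is not moved.
prepend_hba_rule() {
    hba_file=$1
    rule='host    qcg-comp       qcg-comp       127.0.0.1/32    md5'
    if grep -qxF -- "$rule" "$hba_file"; then
        return 0
    fi
    tmp_copy=$(mktemp) || return 1
    # Write back through the original path so its owner and mode are kept.
    { printf '%s\n' "$rule"; cat "$hba_file"; } > "$tmp_copy" &&
        cat "$tmp_copy" > "$hba_file"
    rc=$?
    rm -f "$tmp_copy"
    return "$rc"
}
```

A possible invocation, as root: prepend_hba_rule /var/lib/pgsql/data/pg_hba.conf && systemctl reload postgresql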

Authorization modules

For testing purposes, or if your user community is small enough to maintain manually, you can use a plain grid-mapfile, which provides a static mapping between a user's certificate Distinguished Name and a local account:

# for test purposes only: add a mapping for your own account
echo '"MyCertDN" myaccount' >> /etc/grid-security/grid-mapfile

For the single-account submit configuration, all DNs should be mapped onto the same qcg-comp account.

A special entry for the QCG-Broker must also be added to the grid-mapfile:

echo '"/C=PL/O=GRID/O=PSNC/CN=qcg-broker/broker.compat.qcg.psnc.pl"  qcg-comp' >> /etc/grid-security/grid-mapfile

This is the DN that the QCG-Broker service uses to periodically obtain a report about currently available resources and accounts.
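
Because the echo commands above append a line on every run, repeated executions leave duplicate entries in the grid-mapfile. The following sketch, with a hypothetical function name, adds a mapping only when the identical line is not yet present.

```shell
# Hypothetical helper: append '"DN" account' to a grid-mapfile unless an
# identical line is already there.
add_gridmap_entry() {
    mapfile_path=$1
    dn=$2
    account=$3
    entry="\"$dn\" $account"
    grep -qxF -- "$entry" "$mapfile_path" 2>/dev/null ||
        printf '%s\n' "$entry" >> "$mapfile_path"
}
```

For example: add_gridmap_entry /etc/grid-security/grid-mapfile 'MyCertDN' myaccount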

Getting the DRMAA library

The QCG-Computing service uses a DRMAA-compliant interface for batch job submission, so you need to install the library appropriate for your system. The latest version of the SLURM DRMAA library can be downloaded from the Git repository.

Prerequisites

The following packages should be installed to build the SLURM DRMAA library:

yum install autoconf automake libtool m4 bison gperf ragel hiredis-devel

Build & install

git clone https://git.man.poznan.pl/stash/scm/qcg/slurm-drmaa.git
cd slurm-drmaa
./configure --prefix=/opt/qcg/dependencies --sysconfdir=/opt/qcg/dependencies/etc CFLAGS=-fstack-protector-all
make clean all
sudo make install

Configuration

An example configuration file is created in the destination directory. Copy it to create the active configuration:

sudo cp /opt/qcg/dependencies/etc/slurm_drmaa.conf.example /opt/qcg/dependencies/etc/slurm_drmaa.conf 

The default settings should be appropriate for most installations.

Slurm notifications

The Slurm DRMAA library tracks the status of jobs submitted to the scheduling system by polling Slurm for the current status of each job. To minimize the number of queries, the qcg-comp-slurm-redis-notifier package has been developed. It contains a script that follows the Slurm controller logs and pushes a notification to the local Redis database whenever a job changes its state. The Slurm DRMAA library registers for Redis notifications and waits until they arrive. To use this mechanism, the following packages must be installed:

yum install redis python-redis
systemctl enable redis
systemctl start redis
yum install qcg-slurm-redis-notifier

The path to the Slurm controller logs should be configured in the /etc/qcg/qcg-comp/qcg-slurm-redis-notifier.json file. The service is then ready to start:

systemctl start qcg-slurm-redis-notifier

The log file of the service is stored in /var/log/qcg/qcg-comp/qcg-slurm-redis-notifier.log.

Service configuration

Edit the preinstalled service configuration file (/etc/qcg/qcg-comp/qcg-compd.xml):

<?xml version="1.0" encoding="UTF-8"?>
<sm:QCGCore xmlns:sm="http://schemas.qoscosgrid.org/core/2011/04/config" xmlns="http://schemas.qoscosgrid.org/comp/2011/04/config" xmlns:smc="http://schemas.qoscosgrid.org/comp/2011/04/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Configuration>
    <sm:ModuleManager>
      <sm:Directory>/usr/lib64/qcg-core/modules/</sm:Directory>
      <sm:Directory>/usr/lib64/qcg-comp/modules/</sm:Directory>
    </sm:ModuleManager>
    <sm:Service xsi:type="qcg-compd" description="QCG Computing">
      <sm:WorkingDirectory>/var/log/qcg/qcg-comp/</sm:WorkingDirectory>
      <sm:Logger>
        <sm:Filename>/var/log/qcg/qcg-comp/qcg-compd.log</sm:Filename>
        <sm:Level>INFO</sm:Level>
      </sm:Logger>
      <sm:Transport>
        <sm:Module xsi:type="sm:ecm_gsoap.service">
          <sm:Host>frontend.example.com</sm:Host>
          <sm:Port>19000</sm:Port>
          <sm:KeepAlive>false</sm:KeepAlive>
          <sm:Authentication>
            <sm:Module xsi:type="sm:atc_transport_gsi.service">
              <sm:X509CertFile>/etc/grid-security/hostcert.pem</sm:X509CertFile>
              <sm:X509KeyFile>/etc/grid-security/hostkey.pem</sm:X509KeyFile>
            </sm:Module>
          </sm:Authentication>
          <sm:Authorization>
            <sm:Module xsi:type="sm:atz_mapfile">
              <sm:Mapfile>/etc/grid-security/grid-mapfile</sm:Mapfile>
            </sm:Module>
          </sm:Authorization>
        </sm:Module>
        <sm:Module xsi:type="smc:qcg-comp-service"/>
      </sm:Transport>
      <sm:Module xsi:type="slurm_jsdl_filter"/>
      <!--<sm:Module xsi:type="atz_ardl_filter"/>-->
      <sm:Module xsi:type="sm:general_python" path="/usr/lib64/qcg-comp/modules/python/monitoring.py"/>
      <sm:Module xsi:type="sm:general_python" path="/usr/lib64/qcg-comp/modules/python/modules_info.py"/>
      <!--<sm:Module xsi:type="sm:general_python" path="/usr/lib64/qcg-comp/modules/python/plgrid_info.py"/>-->
      <!--<sm:Module xsi:type="sm:general_python" path="/usr/lib64/qcg-comp/modules/python/node_types.py"/>-->
      <sm:Module xsi:type="submission_drmaa" path="/opt/qcg/dependencies/lib/libdrmaa.so"/>
      <!--sm:Module xsi:type="reservation_python" path="/usr/lib64/qcg-comp/modules/python/reservation_maui.py"/-->
      <sm:Module xsi:type="notification_wsn">
        <sm:Module xsi:type="sm:ecm_gsoap.client">
          <sm:ServiceURL>http://frontend.example.com:19001/</sm:ServiceURL>
          <sm:Authentication>
            <sm:Module xsi:type="sm:atc_transport_http.client"/>
          </sm:Authentication>
          <sm:Module xsi:type="sm:ntf_client"/>
        </sm:Module>
      </sm:Module>
      <sm:Module xsi:type="application_mapper">
        <ApplicationMapFile>/etc/qcg/qcg-comp/application_mapfile</ApplicationMapFile>
      </sm:Module>
      <Database>
        <DSN>qcg-comp</DSN>
        <User>qcg-comp</User>
        <Password>qcg-comp</Password>
      </Database>
      <UnprivilegedUser>qcg-comp</UnprivilegedUser>

      <!--<SetuidEnabled>false</SetuidEnabled>-->

      <!--UseScratch>true</UseScratch> uncomment this if scratch is the only file system shared between the worker nodes and this machine -->
  
      <FactoryAttributes>
        <CommonName>hpc.example.com</CommonName>
        <LongDescription>QCG enabled cluster</LongDescription>
      </FactoryAttributes>
    </sm:Service>
  </Configuration>
</sm:QCGCore>

Common

In most cases it should be enough to change only the following elements:

Transport/Module/Host
the hostname of the machine where the service is deployed. You can put here 0.0.0.0 if you want to listen on all interfaces.
Transport/Module/Authentication/Module/X509CertFile and Transport/Module/Authentication/Module/X509KeyFile
Path to the certificate and key files (for single submit account these files must be owned by the qcg-comp user).
Module[type="smc:notification_wsn"]/Module/ServiceURL
the localhost URL of the QCG-Notification service (In most cases this is the same address as the QCG-Computing service)
Module[type="submission_drmaa"]/@path
path to the DRMAA library (the libdrmaa.so).
Module[type="general_python"]/usr/lib64/qcg-comp/modules/python/monitoring.py
path to the monitoring module, which gathers general information about currently available modules
Module[type="general_python"]/usr/lib64/qcg-comp/modules/python/modules_info.py
path to the plugin that gathers information about currently available environment modules
Module[type="general_python"]/usr/lib64/qcg-comp/modules/python/plgrid_info.py
path to the module that gathers extended information about the system, such as available and default grants and users' scratch directories; developed for the PL-Grid infrastructure (integrated with qcg-gridmapfilegenerator), but it can also be used in other infrastructures, for example to report users' scratch directories - in this scenario the setup-plgrid-plugin.sh script should be used to generate the necessary files
Module[type="general_python"]/usr/lib64/qcg-comp/modules/python/node_types.py
path to the module reporting available node types (developed for Compat infrastructure); requires the configuration file /etc/qcg/qcg-comp/qcg-compd/nodes.conf
Module[type="reservation_python"]/@path
path to the reservation module. Change this if you are using a scheduler other than Maui (e.g. use reservation_moab.py for Moab, reservation_pbs.py for PBS Pro)
Database/Password
the qcg-comp database password generated in the earlier step by the qcg-comp-install.sh script
SetuidEnabled
set this to false if the QCG-Computing service should run from an unprivileged account and all jobs should be submitted to the scheduling system from this single account; if set to false, the service startup script (/usr/lib/systemd/system/qcg-compd.service) should be modified to launch the service from the qcg-comp account instead of the root account
UseScratch
set this to true if you set QCG_SCRATCH_DIR_ROOT in sysconfig, so that every job is started from the scratch directory (instead of the default home directory)
FactoryAttributes/CommonName
a common name of the cluster (e.g. reef.man.poznan.pl). You can use any name that is unique among all systems (e.g. cluster name + domain name of your institution)
FactoryAttributes/LongDescription
a human readable description of the cluster
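
After editing, it may be useful to confirm that no sample values are left in the configuration file. The sketch below uses hypothetical helper names and simply searches for the example.com placeholder hosts and the sample database password shown in the listing above.

```shell
# Hypothetical check: list (with line numbers) the lines of a configuration
# file that still contain the example.com placeholder hosts.
find_placeholders() {
    grep -n 'example\.com' "$1"
}

# Hypothetical check: succeed if the sample database password from the
# listing above is still in place.
sample_password_in_use() {
    grep -q '<Password>qcg-comp</Password>' "$1"
}
```

For example, find_placeholders /etc/qcg/qcg-comp/qcg-compd.xml prints nothing (and returns a non-zero status) once all placeholder hosts have been replaced.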

Module plgrid_info

To report users' scratch directories, this module should be uncommented. For non-PL-Grid sites, the script /usr/share/qcg-comp/tools/setup-plgrid-plugin.sh should be executed to create the necessary files.

Module node_types

Every QCG-Computing service communicating with a Compat QCG-Broker instance should have this module enabled. The configuration file (/etc/qcg/qcg-comp/nodes.conf) contains the mapping between node type names and Slurm node features. In Compat, a node type is a class of nodes that have a similar configuration (and similar performance).

SetuidEnabled element

This element should be set to false if QCG-Computing is to run from an unprivileged account. All users' jobs will then be submitted to the scheduling system from the same unprivileged account that the service runs under. Setting this element to false requires modifying the service startup script so that the service starts from an account other than root.

UseScratch

This element should be set to true if jobs are to start in a directory other than the user's home directory. When SetuidEnabled is set to false, this also means that all jobs will be started from a subdirectory of the qcg-comp user's home directory (/var/log/qcg/qcg-comp by default). The QCG_SCRATCH_DIR_ROOT environment variable should be set in the /etc/sysconfig/qcg-compd file and point to the root directory of the users' scratch directories. For example, if QCG_SCRATCH_DIR_ROOT=/var/scratch and SetuidEnabled is set to false (with the default qcg-comp as the unprivileged QCG-Computing account), all jobs will be started in the /var/scratch/qcg-comp directory.
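
The example above can be reproduced with a small helper that reads QCG_SCRATCH_DIR_ROOT from the sysconfig file and prints the resulting start directory. This is a sketch only: the function name is hypothetical, and it assumes the variable is set on a plain, unquoted KEY=value line.

```shell
# Hypothetical helper: read QCG_SCRATCH_DIR_ROOT from a sysconfig-style file
# and print the start directory for the given account.
# Assumes a plain, unquoted line such as QCG_SCRATCH_DIR_ROOT=/var/scratch.
scratch_start_dir() {
    sysconfig_file=$1
    account=$2
    root=$(sed -n 's/^QCG_SCRATCH_DIR_ROOT=//p' "$sysconfig_file" | tail -n 1)
    # Fail if the variable is not set in the file.
    [ -n "$root" ] || return 1
    # Drop a trailing slash, if any, before appending the account name.
    printf '%s/%s\n' "${root%/}" "$account"
}
```

With QCG_SCRATCH_DIR_ROOT=/var/scratch in /etc/sysconfig/qcg-compd, scratch_start_dir /etc/sysconfig/qcg-compd qcg-comp prints /var/scratch/qcg-comp.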

Creating applications' script space

A common case for the QCG-Computing service is that an application is accessed using an abstract application name rather than an absolute executable path. The mappings from application name/version to executable path are stored in the file /etc/qcg/qcg-comp/application_mapfile:

cat /etc/qcg/qcg-comp/application_mapfile
# ApplicationName ApplicationVersion Executable

bash * /opt/exp_soft/qcg/qcg-app-scripts/apps/bash.app

It is also common to provide wrapper scripts here rather than target executables. A wrapper script can handle aspects of the application lifetime such as environment initialization, copying files from/to scratch storage, and application monitoring. It is recommended to create a separate directory for these wrapper scripts (e.g. on the application partition). This directory must be readable by all users and from every worker node (the application partition usually fulfils those requirements). Please read more on Application Scripts. You must provide at least a mapping for the 'bash' application.
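
The application mapfile can be sanity-checked with two small helpers. This is a minimal sketch with hypothetical function names; it assumes the whitespace-separated three-column format shown above, with comment lines starting with #.

```shell
# Hypothetical helpers for an application_mapfile with lines of the form:
#   ApplicationName ApplicationVersion Executable

# Succeed if the file contains the mandatory mapping for 'bash'.
has_bash_mapping() {
    awk '$1 == "bash" { found = 1 } END { exit !found }' "$1"
}

# Print every mapped executable that is missing or not executable
# on the machine where the check runs.
missing_executables() {
    awk 'NF >= 3 && $1 !~ /^#/ { print $3 }' "$1" |
    while IFS= read -r exe; do
        [ -x "$exe" ] || printf '%s\n' "$exe"
    done
}
```

Since the wrapper scripts must be visible from every worker node, missing_executables /etc/qcg/qcg-comp/application_mapfile is best run on a worker node as well as on the QCG host.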

Starting the service

As root type:

/etc/init.d/qcg-compd start

The service logs can be found in:

/var/log/qcg/qcg-comp/qcg-compd.log

Note: In the current version, whenever you restart the PostgreSQL server you also need to restart the QCG-Computing and QCG-Notification services:

/etc/init.d/qcg-compd restart
/etc/init.d/qcg-ntfd restart

Stopping the service

The service can be stopped using the following command:

/etc/init.d/qcg-compd stop

Verifying the installation

  • Edit the QCG-Computing client configuration file (/etc/qcg/qcg-comp/qcg-comp.xml):
    • set the Host and Port to reflect the changes in the service configuration file (qcg-compd.xml).
      <?xml version="1.0" encoding="UTF-8"?>
      <sm:QCGCore
             xmlns:sm="http://schemas.qoscosgrid.org/core/2011/04/config"
             xmlns="http://schemas.qoscosgrid.org/comp/2011/04/config"
             xmlns:smc="http://schemas.qoscosgrid.org/comp/2011/04/config"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        
             <Configuration>
                     <sm:ModuleManager>
                             <sm:Directory>/usr/lib64/qcg-core/modules/</sm:Directory>
                             <sm:Directory>/usr/lib64/qcg-comp/modules/</sm:Directory>
                     </sm:ModuleManager>
       
                     <sm:Client xsi:type="qcg-comp" description="QCG-Computing client">
                             <sm:Transport>
                                     <sm:Module xsi:type="sm:ecm_gsoap.client">
                                             <sm:ServiceURL>httpg://frontend.example.com:19000/</sm:ServiceURL>
                                             <sm:Authentication>
                                                     <sm:Module xsi:type="sm:atc_transport_gsi.client"/>
                                             </sm:Authentication>
                                             <sm:Module xsi:type="smc:qcg-comp-client"/>
                                     </sm:Module>
                             </sm:Transport>
                     </sm:Client>
             </Configuration>
      </sm:QCGCore>
      
  • Initialize your credentials:
    grid-proxy-init -rfc
    Your identity: /C=PL/O=GRID/O=PSNC/CN=Mariusz Mamonski
    Enter GRID pass phrase for this identity:
    Creating proxy .................................................................. Done
    Your proxy is valid until: Wed Apr  6 05:01:02 2012
    
  • Query the QCG-Computing service:
    qcg-comp -G | xmllint --format - # the xmllint is used only to present the result in more pleasant way
      
    <bes-factory:FactoryResourceAttributesDocument xmlns:bes-factory="http://schemas.ggf.org/bes/2006/08/bes-factory">
        <bes-factory:IsAcceptingNewActivities>true</bes-factory:IsAcceptingNewActivities>
        <bes-factory:CommonName>IT cluster</bes-factory:CommonName>
        <bes-factory:LongDescription>IT department cluster for public use</bes-factory:LongDescription>
        <bes-factory:TotalNumberOfActivities>0</bes-factory:TotalNumberOfActivities>
        <bes-factory:TotalNumberOfContainedResources>1</bes-factory:TotalNumberOfContainedResources>
        <bes-factory:ContainedResource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="bes-factory:BasicResourceAttributesDocumentType">
            <bes-factory:ResourceName>worker.example.com</bes-factory:ResourceName>
            <bes-factory:CPUArchitecture>
                <jsdl:CPUArchitectureName xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl">x86_32</jsdl:CPUArchitectureName>
            </bes-factory:CPUArchitecture>
            <bes-factory:CPUCount>4</bes-factory:CPUCount><bes-factory:PhysicalMemory>1073741824</bes-factory:PhysicalMemory>
        </bes-factory:ContainedResource>
        <bes-factory:NamingProfile>http://schemas.ggf.org/bes/2006/08/bes/naming/BasicWSAddressing</bes-factory:NamingProfile> 
        <bes-factory:BESExtension>http://schemas.ogf.org/hpcp/2007/01/bp/BasicFilter</bes-factory:BESExtension>
        <bes-factory:BESExtension>http://schemas.qoscosgrid.org/comp/2011/04</bes-factory:BESExtension>
        <bes-factory:LocalResourceManagerType>http://example.com/SunGridEngine</bes-factory:LocalResourceManagerType>
        <smcf:NotificationProviderURL xmlns:smcf="http://schemas.qoscosgrid.org/comp/2011/04/factory">http://localhost:2211/</smcf:NotificationProviderURL>
    </bes-factory:FactoryResourceAttributesDocument>
    
  • Submit a sample job:
    qcg-comp -c -J /usr/share/qcg-comp/doc/examples/jsdl/sleep.xml
    Activity Id: ccb6b04a-887b-4027-633f-412375559d73
    
  • Query its status:
    qcg-comp -s -a ccb6b04a-887b-4027-633f-412375559d73
    status = Executing
    qcg-comp -s -a ccb6b04a-887b-4027-633f-412375559d73
    status = Executing
    qcg-comp -s -a ccb6b04a-887b-4027-633f-412375559d73
    status = Finished
    exit status = 0
    
  • Submit a job which produces some output:
    $ qcg-comp -c -J /usr/share/qcg-comp/doc/examples/jsdl/date.xml 
    Activity Id: 591effa9-143d-4cae-9dd9-02e40f760448
    $ qcg-comp -s -a 591effa9-143d-4cae-9dd9-02e40f760448
    status = Queued
    $ qcg-comp -s -a 591effa9-143d-4cae-9dd9-02e40f760448
    status = Finished (exit status = 0)
    $ qcg-comp -o -J /usr/share/qcg-comp/doc/examples/jsdl/date.xml
    File /tmp/date.staged.out staged out.
    All files staged out.
    $ cat /tmp/date.staged.out 
    Mon Jul 29 02:23:33 HST 2013
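
Rather than repeating the status query by hand, the check can be wrapped in a polling loop. The sketch below is not part of the QCG client: the function name and the QCG_COMP variable are hypothetical, and it relies only on the "qcg-comp -s -a" call and the "status = Finished" output shown above. Other terminal states are not recognised, so such a job is polled until the limit is reached.

```shell
# Hypothetical polling helper; QCG_COMP may point at a different client binary.
QCG_COMP=${QCG_COMP:-qcg-comp}

# Poll "qcg-comp -s -a ID" until the output reports "status = Finished".
# Arguments: activity id, maximum number of polls (default 60),
# seconds between polls (default 5).
# Returns 0 when finished, 1 when the poll limit is reached,
# 2 if the client itself fails.
wait_for_activity() {
    activity_id=$1
    max_polls=${2:-60}
    interval=${3:-5}
    polls=0
    while [ "$polls" -lt "$max_polls" ]; do
        out=$("$QCG_COMP" -s -a "$activity_id") || return 2
        case $out in
            *"status = Finished"*)
                printf '%s\n' "$out"
                return 0
                ;;
        esac
        polls=$((polls + 1))
        sleep "$interval"
    done
    return 1
}
```

For example, wait_for_activity ccb6b04a-887b-4027-633f-412375559d73 would poll every 5 seconds for up to 5 minutes and print the final status output.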
    

If possible, perform a "reboot test", i.e. reboot the machine and check that all services are operational without manual intervention. You can also run the extended UMD Verification Procedure.

Maintenance

Historic usage information is stored in two relations of the QCG-Computing database: jobs_acc and reservations_acc. You can archive old usage data to a file and delete it from the database at any time using the psql client:

psql -h localhost qcg-comp qcg-comp 
Password for user qcg-comp: 
Welcome to psql 8.1.23, the PostgreSQL interactive terminal.
  
Type:  \copyright for distribution terms
     \h for help with SQL commands
     \? for help with psql commands
     \g or terminate with semicolon to execute query
     \q to quit

qcg-comp=> \o jobs.acc
qcg-comp=> SELECT * FROM jobs_acc where end_time < date '2010-01-10';
qcg-comp=> \o reservations.acc
qcg-comp=> SELECT * FROM reservations_acc where end_time < date '2010-01-10';
qcg-comp=> \o
qcg-comp=> DELETE FROM jobs_acc where end_time < date '2010-01-10';
qcg-comp=> DELETE FROM reservations_acc where end_time < date '2010-01-10';
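
The interactive session above can also be scripted. The following sketch uses a hypothetical function name and prints a psql script for a given cut-off date; it uses the two relation names given in the text (jobs_acc and reservations_acc) and the same output file names as the session above.

```shell
# Hypothetical helper: print a psql script that archives and then deletes
# accounting rows older than the given date (YYYY-MM-DD).
archive_sql() {
    cutoff=$1
    case $cutoff in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
        *) echo "usage: archive_sql YYYY-MM-DD" >&2; return 1 ;;
    esac
    # First dump the old rows: jobs_acc -> jobs.acc,
    # reservations_acc -> reservations.acc.
    for table in jobs_acc reservations_acc; do
        printf '\\o %s.acc\n' "${table%_acc}"
        printf "SELECT * FROM %s WHERE end_time < date '%s';\n" "$table" "$cutoff"
    done
    # Restore output to the terminal, then delete the archived rows.
    printf '\\o\n'
    for table in jobs_acc reservations_acc; do
        printf "DELETE FROM %s WHERE end_time < date '%s';\n" "$table" "$cutoff"
    done
}
```

A possible invocation is: archive_sql 2010-01-10 | psql -v ON_ERROR_STOP=1 -h localhost qcg-comp qcg-comp. The ON_ERROR_STOP setting makes psql stop at the first SQL error, so the DELETE statements are not reached if a SELECT fails. Check the resulting files before relying on them as the only copy of the data.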

You should also install the logrotate configuration for QCG-Computing:

yum install  qcg-comp-logrotate

Important: After any update or restart of the PostgreSQL database you must also restart the qcg-compd and qcg-ntfd services.

/etc/init.d/qcg-compd restart
/etc/init.d/qcg-ntfd restart

During scheduled downtimes we recommend disabling submission in the service configuration file:

...
   <AcceptingNewActivities>false</AcceptingNewActivities>
<FactoryAttributes>

GOCDB

Please remember to register the QCG-Computing and QCG-Notification services in the GOCDB using the QCG.Computing and QCG.Notification service types, respectively.