Version 41 (modified by bartek, 6 years ago)
Introduction
The QCG-Computing service is an open source service that acts as a computing provider, offering on-demand access to computing resources and jobs over an HPC Basic Profile compliant Web Services interface. In addition, QCG-Computing offers a remote interface for Advance Reservations management.
Within QosCosGrid the QCG-Notification service is widely used for brokering various types of notification messages related to the state of a job (e.g. a predefined job status or a snippet from the job's output file).
This document describes the installation of both QCG services: QCG-Computing and QCG-Notification. These services should be deployed on the same machine (or virtual machine) that:
- has at least 1 GB of memory (recommended value: 2 GB)
- has 10 GB of free disk space (most of the space will be used for the log files)
- has any modern CPU (if you plan to use a virtual machine, you should dedicate one or two cores of the host machine to it)
- is running under:
- CentOS 7 (in most cases the provided RPMs should work with any operating system based on Red Hat Enterprise Linux 7)
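As a quick sanity check, the memory and disk requirements above can be compared against the host with a short shell sketch. The function names, the /var/log default and the 950 MB memory threshold (the kernel reports slightly less than the installed RAM) are assumptions made for illustration; they are not part of QCG.

```shell
#!/usr/bin/env bash
# Sketch only: compare this host with the minimum requirements listed above.

# Succeeds when the first number is greater than or equal to the second.
at_least() {
    [ "$1" -ge "$2" ]
}

# Total memory in MB, read from /proc/meminfo (Linux only).
total_memory_mb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' /proc/meminfo
}

# Free space in MB on the file system that holds the given directory.
free_disk_mb() {
    df -Pm "$1" | awk 'NR == 2 { print $4 }'
}

# Print a warning for every requirement that is not met.
check_host() {
    local log_dir="${1:-/var/log}"   # assumed location of the log files
    at_least "$(total_memory_mb)" 950 ||
        echo "WARNING: less than 1 GB of memory"
    at_least "$(free_disk_mb "$log_dir")" 10240 ||
        echo "WARNING: less than 10 GB free in $log_dir"
}
```

Calling check_host with no arguments prints nothing when both requirements are met.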
Prerequisites
We assume that you have a local resource manager/scheduler already installed. The QCG services are typically installed on a submit machine for the scheduling system.
Since version 2.4 the QCG-Computing service discovers installed applications using the Environment Modules package. For this reason you should install modules on the QCG-Computing host, mount the directories that contain all module files used on your cluster, and make sure that the qcg-comp user can see all these modules.
The QCG services do not require you to install any QCG component on the worker nodes. However, the provided application wrapper scripts, which are typically used by QCG, need the following software to be available on the worker nodes:
- bash,
- rsync,
- zip/unzip,
- dos2unix,
- python.
These packages are usually available out of the box on most of the HPC systems.
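To confirm this on a given worker node, a small helper such as the one below can be used. It is a sketch: the function name is hypothetical, and it assumes that each item in the list corresponds to a command of the same name (zip and unzip being two separate commands).

```shell
#!/usr/bin/env bash
# Print every command from the argument list that is not found in PATH.
# Returns non-zero when at least one command is missing.
missing_tools() {
    local status=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

# Usage on a worker node:
#   missing_tools bash rsync zip unzip dos2unix python
```

The function only checks that the commands exist; it does not verify their versions.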
Both services, QCG-Notification and QCG-Computing, require access to a database over ODBC. In most cases, the database is located on the same system as the services. Currently the PostgreSQL database and UnixODBC are supported. To install them on CentOS Linux invoke:
yum install postgresql postgresql-server
yum install unixODBC postgresql-odbc
Shared file system
Deployment of QCG-Computing usually requires two shared file systems in the cluster:
- "Users' directories" - shared between the QCG host and all worker nodes. Used for storing jobs' sandbox directories. It can be either HOME or scratch file system. You can read more about this here.
- "Applications scripts" - shared between all worker nodes. Used for storing Applications Scripts
Firewall configuration
In order to expose the QosCosGrid services externally you need to open the following incoming ports in the firewall:
- 19000 (TCP) - QCG-Computing
- 19001 (TCP) - QCG-Notification
- 2811 (TCP) - GridFTP server
- 20000-25000 (TCP) - GridFTP port-range (if you want to use different port-range adjust the GLOBUS_TCP_PORT_RANGE variable in the /etc/xinetd.d/gsiftp file)
You may also want to allow SSH access from white-listed machines (for administration purposes only).
The following outgoing traffic should be allowed in general:
- NTP, DNS, HTTP, HTTPS services
- gridftp (TCP ports: 2811 and port-ranges: 20000-25000)
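If the host uses firewalld (an assumption; the page does not say which firewall is in use), the incoming ports listed above could be opened as sketched below. The function only prints the firewall-cmd commands so that they can be reviewed first; its name is hypothetical.

```shell
#!/usr/bin/env bash
# Print (not execute) the firewalld commands that open the incoming ports
# listed above. Adapt the approach if iptables or another tool is used.
print_firewall_commands() {
    local port
    for port in 19000 19001 2811 20000-25000; do
        echo "firewall-cmd --permanent --add-port=${port}/tcp"
    done
    echo "firewall-cmd --reload"
}

# Review the output, then run it as root, for example:
#   print_firewall_commands | sh
```

If a different GridFTP port range is configured in GLOBUS_TCP_PORT_RANGE, the last entry of the loop has to be adjusted to match.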
QCG-Notification
Installation
QCG-Notification may be installed using Yum Package Manager from RPMs. The procedure is as follows:
- First, install the QCG repository:
rpm -Uvh http://www.qoscosgrid.org/qcg-packages/centos7/x86_64/qcg-repo-unstable-1.0.0-1.centos7.noarch.rpm
- Then install QCG-Notification using the Yum package manager:
yum install qcg-ntf qcg-ntf-logrotate
Configuration
The first step is to configure the QCG-Notification database using the provided script:
/usr/share/qcg-ntf/tools/qcg-ntf-install.sh
Welcome to qcg-ntf installation script!
This script will guide you through process of configuring proper environment for running the QCG-Notification service. You have to answer few questions regarding parameters of your database. If you are not sure just press Enter and use the default values.
Use local PostgreSQL server? (y/n) [y]: y
Database [qcg-ntf]:
User [qcg-ntf]:
Password [qcg-ntf]: MojeTajneHaslo
Create database? (y/n) [y]: y
Create user? (y/n) [y]: y
Checking for system user qcg_ntf...OK
Checking whether PostgreSQL server is installed...OK
Checking whether PostgreSQL server is running...OK
Performing installation
* Creating user qcg-ntf...OK
* Creating database qcg-ntf...OK
* Creating database schema...OK
* Checking for ODBC data source qcg-ntf...
* Installing ODBC data source...OK
The newly established database settings must be reflected in the Database section of the QCG-Notification configuration file (by default /etc/qcg/qcg-ntf/qcg-ntfd.xml)
Remember to add appropriate entry to /var/lib/pgsql/data/pg_hba.conf (as the first rule!) to allow user qcg-ntf to access database qcg-ntf. For instance:
host qcg-ntf qcg-ntf 127.0.0.1/32 md5
and reload Postgres server.
Add a new rule to the pg_hba.conf as requested and reload Postgres:
vim /var/lib/pgsql/data/pg_hba.conf
systemctl reload postgresql
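The installer asks for the new entry to be the first rule in pg_hba.conf. Instead of editing the file by hand, the rule can be prepended with a helper like the following sketch. The function name and the .bak backup suffix are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Insert a rule as the first line of a pg_hba.conf-style file.
# A backup copy is kept, and nothing changes if the rule is already first.
prepend_hba_rule() {
    local file="$1" rule="$2" tmp status
    if [ "$(head -n 1 "$file")" = "$rule" ]; then
        return 0
    fi
    cp -p "$file" "$file.bak" || return 1
    tmp="$(mktemp)" || return 1
    # Build the new content, then overwrite in place to keep owner and mode.
    { printf '%s\n' "$rule"; cat "$file"; } > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Usage (as root), followed by a reload of the server:
#   prepend_hba_rule /var/lib/pgsql/data/pg_hba.conf \
#       'host qcg-ntf qcg-ntf 127.0.0.1/32 md5'
#   systemctl reload postgresql
```

The same helper can be reused later for the qcg-comp database rule.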
Minor updates should now also be applied to the QCG-Notification main configuration file, located in /etc/qcg/qcg-ntf/qcg-ntfd.xml. You will probably need to change the Host parameter (in most cases it must be an external address; do not use the 0.0.0.0 wildcard address) as well as the Password parameter for the database connection. The part of the configuration file with the key parameters is presented below:
<sm:QCGCore
<Configuration>
....
  <sm:Module xsi:type="sm:ecm_gsoap.service">
    <sm:Host>host.example.com</sm:Host>
    <sm:Port>19001</sm:Port>
    <sm:UseWsa>true</sm:UseWsa>
  </sm:Module>
....
  <smn:Database>
    <smn:DatabaseEnabled>true</smn:DatabaseEnabled>
    <smn:DSN>qcg-ntf</smn:DSN>
    <smn:User>qcg-ntf</smn:User>
    <smn:Password>qcg-ntf</smn:Password>
    <smn:CleanAtStart>false</smn:CleanAtStart>
  </smn:Database>
....
</Configuration>
</sm:QCGCore>
Running the service
The QCG-Notification startup script is available in standard systemd paths:
systemctl start qcg-ntfd
The service logs can be found in:
/var/log/qcg/qcg-ntf/qcg-ntfd.log
The service can then be stopped with the following command:
systemctl stop qcg-ntfd
Note: qcg-ntfd will be started with the qcg_ntf user permissions.
Log management
You may also wish to install logrotate configuration for QCG-Notification:
yum install qcg-ntf-logrotate
QCG-Computing
Preparation of the environment
CA and host certificates
First, install all needed trusted CA certificates (instruction). Moreover, we assume that the X.509 host certificate (signed by your local Certificate Authority) and key are already installed in the following locations:
- /etc/grid-security/qcg-compcert.pem
- /etc/grid-security/qcg-compkey.pem
If QCG-Computing is run from an unprivileged account, these files must be owned by that account. Because the qcg-comp account is created during the installation, we suggest using this account as the owner of the certificate and key files.
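The ownership requirement can be verified with a short check such as the sketch below. It relies on GNU stat (as shipped with CentOS); the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Succeed only if every listed file is owned by the given account.
owned_by() {
    local user="$1" file status=0
    shift
    for file in "$@"; do
        if [ "$(stat -c %U "$file" 2>/dev/null)" != "$user" ]; then
            echo "not owned by $user: $file"
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   owned_by qcg-comp /etc/grid-security/qcg-compcert.pem \
#       /etc/grid-security/qcg-compkey.pem
```

A missing file is reported in the same way as a file with the wrong owner.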
Other
Most grid services and security infrastructures are sensitive to time skews. We therefore recommend installing a Network Time Protocol daemon or using any other solution that provides accurate clock synchronization. Also disable automatic package updates, as they may disrupt a running system.
Installation
If it is not yet installed, install the QCG repository:
rpm -Uvh http://www.qoscosgrid.org/qcg-packages/centos7/x86_64/qcg-repo-unstable-1.0.0-1.centos7.noarch.rpm
Install the qcg-comp packages:
yum install qcg-comp qcg-comp-client qcg-comp-logrotate
Database initialization
Set up the QCG-Computing database using the provided script:
/usr/share/qcg-comp/tools/qcg-comp-install.sh
Welcome to qcg-comp installation script!
This script will guide you through process of configuring proper environment for running the QCG-Computing service. You have to answer few questions regarding parameters of your database. If you are not sure just press Enter and use the default values.
Use local PostgreSQL server? (y/n) [y]: y
Database [qcg-comp]:
User [qcg-comp]:
Password [RAND-PASSWD]: MojeTajneHaslo
Create database? (y/n) [y]: y
Create user? (y/n) [y]: y
Checking for system user qcg-comp...OK
Checking whether PostgreSQL server is installed...OK
Checking whether PostgreSQL server is running...OK
Performing installation
* Creating user qcg-comp...OK
* Creating database qcg-comp...OK
* Creating database schema...OK
* Checking for ODBC data source qcg-comp...
* Installing ODBC data source...OK
Remember to add appropriate entry to /var/lib/pgsql/data/pg_hba.conf (as the first rule!) to allow user qcg-comp to access database qcg-comp. For instance:
host qcg-comp qcg-comp 127.0.0.1/32 md5
and reload Postgres server.
Add a new rule to the pg_hba.conf as requested:
vim /var/lib/pgsql/data/pg_hba.conf
systemctl reload postgresql
Authorization modules
For testing purposes, or if your user community is small enough to maintain manually, you can use a plain grid mapfile, which provides a static mapping between a user's certificate Distinguished Name (DN) and a local account:
# for test purposes only: add a mapping for your account
echo '"MyCertDN" myaccount' >> /etc/grid-security/grid-mapfile
For the single account submit configuration, all DNs should be mapped onto the same qcg-comp account.
Additionally the special entry for QCG-Broker must be put to the grid mapfile:
echo '"/C=PL/O=GRID/O=PSNC/CN=qcg-broker/broker.compat.qcg.psnc.pl" qcg-comp' >> /etc/grid-security/grid-mapfile
This is a DN that will be used by the QCG-Broker service to periodically obtain a report about currently available resources & accounts.
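Because the echo commands above append unconditionally, running them twice leaves duplicate entries in the mapfile. The sketch below adds an entry only when it is not present yet. The function name is hypothetical, and the entry format is the one used in the commands above.

```shell
#!/usr/bin/env bash
# Append a '"DN" account' line to a grid mapfile unless it is already there.
add_mapping() {
    local mapfile="$1" dn="$2" account="$3" entry
    entry="\"$dn\" $account"
    # -F: fixed string, -x: match the whole line, -q: no output
    if ! grep -Fxq -- "$entry" "$mapfile" 2>/dev/null; then
        printf '%s\n' "$entry" >> "$mapfile"
    fi
}

# Usage:
#   add_mapping /etc/grid-security/grid-mapfile \
#       '/C=PL/O=GRID/O=PSNC/CN=qcg-broker/broker.compat.qcg.psnc.pl' qcg-comp
```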
Getting the DRMAA library
The QCG-Computing service uses a DRMAA-compliant interface for batch job submission. Thus you need to install a library appropriate for your system. The latest version of the SLURM DRMAA library can be downloaded from the Git repository.
Prerequisites
The following packages should be installed to build the SLURM DRMAA library:
yum install autoconf automake libtool m4 bison gperf ragel hiredis-devel
Build & install
git clone https://git.man.poznan.pl/stash/scm/qcg/slurm-drmaa.git
cd slurm-drmaa
./configure --prefix=/opt/qcg/dependencies --sysconfdir=/opt/qcg/dependencies/etc CFLAGS=-fstack-protector-all
make clean all
sudo make install
Configuration
An example configuration file is created in the destination directory. Copy it to obtain the active configuration file:
sudo cp /opt/qcg/dependencies/etc/slurm_drmaa.conf.example /opt/qcg/dependencies/etc/slurm_drmaa.conf
The default settings should be appropriate for most installations.
Slurm notifications
The Slurm DRMAA library tracks the status of jobs submitted to the scheduling system by polling Slurm for the current status of each job. To minimize the number of queries, the qcg-comp-slurm-redis-notifier package has been developed. It contains a script that follows the Slurm controller logs and pushes notifications about jobs that changed their state to the local Redis database. The Slurm DRMAA library registers for the Redis notifications and waits until they arrive. To use this mechanism, the following packages must be installed:
yum install redis python-redis
systemctl enable redis
systemctl start redis
yum install qcg-slurm-redis-notifier
The path to the Slurm controller logs should be configured in the /etc/qcg/qcg-comp/qcg-slurm-redis-notifier.json file. Now, the notifier service is ready to start:
systemctl start qcg-slurm-redis-notifier
The log file of the notifier service is stored in /var/log/qcg/qcg-comp/qcg-slurm-redis-notifier.log.
Service configuration
Edit the preinstalled service configuration file (/etc/qcg/qcg-comp/qcg-compd.xml):
<?xml version="1.0" encoding="UTF-8"?>
<sm:QCGCore
    xmlns:sm="http://schemas.qoscosgrid.org/core/2011/04/config"
    xmlns="http://schemas.qoscosgrid.org/comp/2011/04/config"
    xmlns:smc="http://schemas.qoscosgrid.org/comp/2011/04/config"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Configuration>
    <sm:ModuleManager>
      <sm:Directory>/usr/lib64/qcg-core/modules/</sm:Directory>
      <sm:Directory>/usr/lib64/qcg-comp/modules/</sm:Directory>
    </sm:ModuleManager>
    <sm:Service xsi:type="qcg-compd" description="QCG Computing">
      <sm:WorkingDirectory>/var/log/qcg/qcg-comp/</sm:WorkingDirectory>
      <sm:Logger>
        <sm:Filename>/var/log/qcg/qcg-comp/qcg-compd.log</sm:Filename>
        <sm:Level>INFO</sm:Level>
      </sm:Logger>
      <sm:Transport>
        <sm:Module xsi:type="sm:ecm_gsoap.service">
          <sm:Host>frontend.example.com</sm:Host>
          <sm:Port>19000</sm:Port>
          <sm:KeepAlive>false</sm:KeepAlive>
          <sm:Authentication>
            <sm:Module xsi:type="sm:atc_transport_gsi.service">
              <sm:X509CertFile>/etc/grid-security/hostcert.pem</sm:X509CertFile>
              <sm:X509KeyFile>/etc/grid-security/hostkey.pem</sm:X509KeyFile>
            </sm:Module>
          </sm:Authentication>
          <sm:Authorization>
            <sm:Module xsi:type="sm:atz_mapfile">
              <sm:Mapfile>/etc/grid-security/grid-mapfile</sm:Mapfile>
            </sm:Module>
          </sm:Authorization>
        </sm:Module>
        <sm:Module xsi:type="smc:qcg-comp-service"/>
      </sm:Transport>
      <sm:Module xsi:type="slurm_jsdl_filter"/>
      <!--<sm:Module xsi:type="atz_ardl_filter"/>-->
      <sm:Module xsi:type="sm:general_python" path="/usr/lib64/qcg-comp/modules/python/monitoring.py"/>
      <sm:Module xsi:type="sm:general_python" path="/usr/lib64/qcg-comp/modules/python/modules_info.py"/>
      <!--<sm:Module xsi:type="sm:general_python" path="/usr/lib64/qcg-comp/modules/python/plgrid_info.py"/>-->
      <!--<sm:Module xsi:type="sm:general_python" path="/usr/lib64/qcg-comp/modules/python/node_types.py"/>-->
      <sm:Module xsi:type="submission_drmaa" path="/opt/qcg/dependencies/lib/libdrmaa.so"/>
      <!--sm:Module xsi:type="reservation_python" path="/usr/lib64/qcg-comp/modules/python/reservation_maui.py"/-->
      <sm:Module xsi:type="notification_wsn">
        <sm:Module xsi:type="sm:ecm_gsoap.client">
          <sm:ServiceURL>http://frontend.example.com:19001/</sm:ServiceURL>
          <sm:Authentication>
            <sm:Module xsi:type="sm:atc_transport_http.client"/>
          </sm:Authentication>
          <sm:Module xsi:type="sm:ntf_client"/>
        </sm:Module>
      </sm:Module>
      <sm:Module xsi:type="application_mapper">
        <ApplicationMapFile>/etc/qcg/qcg-comp/application_mapfile</ApplicationMapFile>
      </sm:Module>
      <Database>
        <DSN>qcg-comp</DSN>
        <User>qcg-comp</User>
        <Password>qcg-comp</Password>
      </Database>
      <UnprivilegedUser>qcg-comp</UnprivilegedUser>
      <!--<SetuidEnabled>false</SetuidEnabled>-->
      <!--UseScratch>true</UseScratch> uncomment this if scratch is the only file system shared between the worker nodes and this machine -->
      <FactoryAttributes>
        <CommonName>hpc.example.com</CommonName>
        <LongDescription>QCG enabled cluster</LongDescription>
      </FactoryAttributes>
    </sm:Service>
  </Configuration>
</sm:QCGCore>
In most cases it should be enough to change only the following elements:
- Transport/Module/Host
- the hostname of the machine where the service is deployed. You can use 0.0.0.0 here if you want to listen on all interfaces.
- Transport/Module/Authentication/Module/X509CertFile and Transport/Module/Authentication/Module/X509KeyFile
- Path to the certificate and key files (for single submit account these files must be owned by the qcg-comp user).
- Module[type="smc:notification_wsn"]/Module/ServiceURL
- the localhost URL of the QCG-Notification service (In most cases this is the same address as the QCG-Computing service)
- Module[type="submission_drmaa"]/@path
- path to the DRMAA library (the libdrmaa.so).
- Module[type="general_python"]/usr/lib64/qcg-comp/modules/python/monitoring.py
- path to the monitoring module which gathers general information about currently available modules
- Module[type="general_python"]/usr/lib64/qcg-comp/modules/python/modules_info.py
- path to the plugin which gathers information about currently available environment modules
- Module[type="general_python"]/usr/lib64/qcg-comp/modules/python/plgrid_info.py
- path to the module which gathers extended information about system, such as: available and default grants, users scratch directory; developed for PL-Grid infrastructure (integrated with qcg-gridmapfilegenerator) but can also be used in other infrastructures to report for example user scratch directories. To report users scratch directory, this module should be uncommented. For non-PL-Grid sites, the script /usr/share/qcg-comp/tools/setup-plgrid-plugin.sh should be executed to create necessary files.
- Module[type="general_python"]/usr/lib64/qcg-comp/modules/python/node_types.py
- path to the module reporting available node types (developed for the Compat infrastructure). Every QCG-Computing service communicating with a Compat QCG-Broker instance should have this module enabled. The configuration file (/etc/qcg/qcg-comp/nodes.conf) contains the mapping between node type names and Slurm node features. In Compat, a node type is a class of nodes that have a similar configuration (and similar performance).
- Module[type="reservation_python"]/@path
- path to the reservation module. Change this if you are using a scheduler other than Maui (e.g. use reservation_moab.py for Moab, reservation_pbs.py for PBS Pro)
- Database/Password
- the qcg-comp database password generated in the earlier step by the qcg-comp-install.sh script
- SetuidEnabled
- set this to false if the QCG-Computing service should run under an unprivileged account and all jobs should be submitted to the scheduling system from this single account; if set to false, the service startup script (/usr/lib/systemd/system/qcg-compd.service) should be modified to launch the service from an account other than root.
- UseScratch
- This element should be set to true if jobs should start in a directory other than the user's home directory. When SetuidEnabled is set to false, this also means that all jobs will be started from a subdirectory of the qcg-comp user's home directory (/var/log/qcg/qcg-comp by default). The QCG_SCRATCH_DIR_ROOT environment variable should be set in the /etc/sysconfig/qcg-compd file and point to the root directory of the users' scratch directories. For example, if QCG_SCRATCH_DIR_ROOT=/var/scratch and SetuidEnabled is set to false (with the default qcg-comp as the unprivileged QCG-Computing account), all jobs will be started in the /var/scratch/qcg-comp directory.
- FactoryAttributes/CommonName
- a common name of the cluster (e.g. reef.man.poznan.pl). You can use any name that is unique among all systems (e.g. cluster name + domain name of your institution)
- FactoryAttributes/LongDescription
- a human readable description of the cluster
Creating applications' script space
A common case for the QCG-Computing service is that an application is accessed using an abstract application name rather than by specifying an absolute executable path. The mappings from application name/version to executable path are stored in the file /etc/qcg/qcg-comp/application_mapfile:
cat /etc/qcg/qcg-comp/application_mapfile
# ApplicationName ApplicationVersion Executable
bash * /opt/exp_soft/qcg/qcg-app-scripts/apps/bash.app
It is also common to provide wrapper scripts here rather than target executables. A wrapper script can handle aspects of the application lifetime such as environment initialization, copying files from/to scratch storage, and application monitoring. It is recommended to create a separate directory for those wrapper scripts (e.g. on the application partition). This directory must be readable by all users and from every worker node (the application partition usually fulfils those requirements). You must provide at least a mapping for the 'bash' application.
To install the basic set of application scripts (including 'bash' application):
yum install qcg-appscripts
Edit the configuration file /etc/qcg/qcg-comp/app-scripts/config and set cluster_shared_path to the path of the created directory, which must be accessible from all worker nodes. To deploy the scripts to the shared path, execute:
qcg-appscripts-deploy
The last step is to edit /etc/qcg/qcg-comp/application_mapfile file and set a proper path to deployed *.app files.
Please read more on Application Scripts.
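To check which executable a given application name resolves to, the mapfile can be queried with a small awk-based helper. This is a sketch under assumptions: the function name is hypothetical, the three whitespace-separated columns follow the header line of the mapfile, and treating * as "any version" is inferred from the bash entry rather than taken from the QCG documentation.

```shell
#!/usr/bin/env bash
# Print the executable mapped to an application name (and optional version).
# Returns non-zero when no matching entry exists.
lookup_app() {
    local mapfile="$1" name="$2" version="${3:-*}"
    awk -v name="$name" -v version="$version" '
        /^[[:space:]]*#/ || NF < 3 { next }   # skip comments and short lines
        $1 == name && ($2 == "*" || $2 == version) {
            print $3
            found = 1
            exit
        }
        END { exit (found ? 0 : 1) }
    ' "$mapfile"
}

# Usage:
#   lookup_app /etc/qcg/qcg-comp/application_mapfile bash
```

Executable paths that contain spaces are not handled by this sketch.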
Starting the service
As root type:
systemctl start qcg-compd
The service logs can be found in:
/var/log/qcg/qcg-comp/qcg-compd.log
Note: In the current version, whenever you restart the PostgreSQL server, you also need to restart the QCG-Computing and QCG-Notification services:
systemctl restart qcg-compd
systemctl restart qcg-ntfd
Stopping the service
The service can be stopped using the following command:
systemctl stop qcg-compd
Verifying the installation
- Edit the QCG-Computing client configuration file (/etc/qcg/qcg-comp/qcg-comp.xml):
- set the host and port (in the ServiceURL element) to reflect the changes made in the service configuration file (qcg-compd.xml).
<?xml version="1.0" encoding="UTF-8"?>
<sm:QCGCore
    xmlns:sm="http://schemas.qoscosgrid.org/core/2011/04/config"
    xmlns="http://schemas.qoscosgrid.org/comp/2011/04/config"
    xmlns:smc="http://schemas.qoscosgrid.org/comp/2011/04/config"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Configuration>
    <sm:ModuleManager>
      <sm:Directory>/usr/lib64/qcg-core/modules/</sm:Directory>
      <sm:Directory>/usr/lib64/qcg-comp/modules/</sm:Directory>
    </sm:ModuleManager>
    <sm:Client xsi:type="qcg-comp" description="QCG-Computing client">
      <sm:Transport>
        <sm:Module xsi:type="sm:ecm_gsoap.client">
          <sm:ServiceURL>httpg://frontend.example.com:19000/</sm:ServiceURL>
          <sm:Authentication>
            <sm:Module xsi:type="sm:atc_transport_gsi.client"/>
          </sm:Authentication>
          <sm:Module xsi:type="smc:qcg-comp-client"/>
        </sm:Module>
      </sm:Transport>
    </sm:Client>
  </Configuration>
</sm:QCGCore>
- Initialize your credentials:
grid-proxy-init -rfc
Your identity: /C=PL/O=GRID/O=PSNC/CN=Mariusz Mamonski
Enter GRID pass phrase for this identity:
Creating proxy .................................................................. Done
Your proxy is valid until: Wed Apr 6 05:01:02 2012
- Query the QCG-Computing service:
qcg-comp -G | xmllint --format - # the xmllint is used only to present the result in more pleasant way
<bes-factory:FactoryResourceAttributesDocument xmlns:bes-factory="http://schemas.ggf.org/bes/2006/08/bes-factory">
  <bes-factory:IsAcceptingNewActivities>true</bes-factory:IsAcceptingNewActivities>
  <bes-factory:CommonName>IT cluster</bes-factory:CommonName>
  <bes-factory:LongDescription>IT department cluster for public use</bes-factory:LongDescription>
  <bes-factory:TotalNumberOfActivities>0</bes-factory:TotalNumberOfActivities>
  <bes-factory:TotalNumberOfContainedResources>1</bes-factory:TotalNumberOfContainedResources>
  <bes-factory:ContainedResource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="bes-factory:BasicResourceAttributesDocumentType">
    <bes-factory:ResourceName>worker.example.com</bes-factory:ResourceName>
    <bes-factory:CPUArchitecture>
      <jsdl:CPUArchitectureName xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl">x86_32</jsdl:CPUArchitectureName>
    </bes-factory:CPUArchitecture>
    <bes-factory:CPUCount>4</bes-factory:CPUCount>
    <bes-factory:PhysicalMemory>1073741824</bes-factory:PhysicalMemory>
  </bes-factory:ContainedResource>
  <bes-factory:NamingProfile>http://schemas.ggf.org/bes/2006/08/bes/naming/BasicWSAddressing</bes-factory:NamingProfile>
  <bes-factory:BESExtension>http://schemas.ogf.org/hpcp/2007/01/bp/BasicFilter</bes-factory:BESExtension>
  <bes-factory:BESExtension>http://schemas.qoscosgrid.org/comp/2011/04</bes-factory:BESExtension>
  <bes-factory:LocalResourceManagerType>http://example.com/SunGridEngine</bes-factory:LocalResourceManagerType>
  <smcf:NotificationProviderURL xmlns:smcf="http://schemas.qoscosgrid.org/comp/2011/04/factory">http://localhost:2211/</smcf:NotificationProviderURL>
</bes-factory:FactoryResourceAttributesDocument>
- Submit a sample job:
qcg-comp -c -J /usr/share/qcg-comp/doc/examples/jsdl/sleep.xml
Activity Id: ccb6b04a-887b-4027-633f-412375559d73
- Query its status:
qcg-comp -s -a ccb6b04a-887b-4027-633f-412375559d73
status = Executing
qcg-comp -s -a ccb6b04a-887b-4027-633f-412375559d73
status = Executing
qcg-comp -s -a ccb6b04a-887b-4027-633f-412375559d73
status = Finished
exit status = 0
- Submit a job which produces some output:
$ qcg-comp -c -J /usr/share/qcg-comp/doc/examples/jsdl/date.xml
Activity Id: 591effa9-143d-4cae-9dd9-02e40f760448
$ qcg-comp -s -a 591effa9-143d-4cae-9dd9-02e40f760448
status = Queued
$ qcg-comp -s -a 591effa9-143d-4cae-9dd9-02e40f760448
status = Finished (exit status = 0)
$ qcg-comp -o -J /usr/share/qcg-comp/doc/examples/jsdl/date.xml
File /tmp/date.staged.out staged out.
All files staged out.
$ cat /tmp/date.staged.out
Mon Jul 29 02:23:33 HST 2013
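The repeated status queries in the steps above can be automated with a polling loop. The following is a minimal sketch: it uses only the qcg-comp -s -a invocation and the "status = Finished" output shown above. The function name, the QCG_COMP override variable, and the default interval and attempt count are assumptions. Other terminal states are not recognized, because their names do not appear on this page; such jobs simply run into the attempt limit.

```shell
#!/usr/bin/env bash
# Poll the status of an activity until it reports "Finished".
# Arguments: activity id, seconds between polls, maximum number of polls.
# QCG_COMP may name an alternative client command (used here for testing).
wait_for_finished() {
    local activity="$1" interval="${2:-5}" attempts="${3:-60}" output
    while [ "$attempts" -gt 0 ]; do
        output="$("${QCG_COMP:-qcg-comp}" -s -a "$activity")" || return 1
        case "$output" in
            *"status = Finished"*)
                printf '%s\n' "$output"
                return 0
                ;;
        esac
        attempts=$((attempts - 1))
        sleep "$interval"
    done
    return 1   # gave up: the job did not finish within the allowed polls
}

# Usage:
#   wait_for_finished ccb6b04a-887b-4027-633f-412375559d73
```

On success the function prints the final status output, which includes the exit status line.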