Version 1 (modified by pkopta, 6 years ago)
- Introduction
- Prerequisites
- Firewall configuration
- Preparation of the environment
- Installation
- Integration with QCG-Broker
- Database initialization
- Authorization modules
- Getting the DRMAA library
- Service configuration
- Configuring QCG-Accounting
- Creating applications' script space
- Starting the service
- Stopping the service
- Verifying the installation
- Maintenance
- GOCDB
Introduction
The QCG-Computing service is an open source service that acts as a computing provider, exposing on-demand access to computing resources and jobs over an HPC Basic Profile compliant Web Services interface. In addition, QCG-Computing offers a remote interface for Advance Reservations management.
This document describes the installation of the QCG-Computing service. The service should be deployed on a machine (or virtual machine) that:
- has at least 1GB of memory (recommended value: 2 GB)
- has 10 GB of free disk space (most of the space will be used by the log files)
- has any modern CPU (if you plan to use a virtual machine, you should dedicate one or two cores of the host machine to it)
- is running under:
- Scientific Linux 5/6 (in most cases the provided RPMs should work with any operating system based on Red Hat Enterprise Linux, e.g. CentOS)
- Debian 6.X
- any other modern Linux distribution, if you are ready to install QCG from source packages
Prerequisites
We assume that you have the local resource manager/scheduler already installed. This would typically be a frontend machine (i.e. the machine where, for example, the pbs_server Torque daemon is running). If you want to install the QCG-Computing service on a separate submit host, you should read this note.
Since version 2.4 the QCG-Computing service discovers installed applications using the Environment Modules package. For this reason you should install modules on the QCG-Computing host, mount the directories that contain all module files used at your cluster, and make sure that the user qcg-comp can see all modules.
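A quick way to verify the last requirement is to list the modules as the service user. This is only a sketch; it assumes Environment Modules is installed, and the qcg-comp user may not exist yet at this stage of the installation:

```shell
# Check that the qcg-comp account can actually see the module files.
if id qcg-comp >/dev/null 2>&1; then
    msg=$(su - qcg-comp -s /bin/bash -c 'module avail' 2>&1 | head -n 5)
    msg=${msg:-"no modules visible to qcg-comp"}
else
    msg="user qcg-comp does not exist yet"
fi
echo "$msg"
```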
The QosCosGrid services do not require you to install any QCG component on the worker nodes; however, the application wrapper scripts need the following software to be available on the worker nodes:
- bash,
- rsync,
- zip/unzip,
- dos2unix,
- python.
These are usually available out of the box on most HPC systems.
Shared file system
Deployment of QCG-Computing usually requires two shared file systems in the cluster:
- "Users' directories" - shared between the QCG host and all worker nodes. Used for storing jobs sandbox directories. It can be either HOME or scratch file system. You can read more about this here.
- "Applications scripts" - shared between all worker nodes. Used for storing Applications Scripts
Firewall configuration
In order to expose the QosCosGrid services externally you need to open the following incoming ports in the firewall:
- 19000 (TCP) - QCG-Computing
- 19001 (TCP) - QCG-Notification
- 2811 (TCP) - GridFTP server
- 20000-25000 (TCP) - GridFTP port-range (if you want to use different port-range adjust the GLOBUS_TCP_PORT_RANGE variable in the /etc/xinetd.d/gsiftp file)
You may also want to allow SSH access from white-listed machines (for administration purposes only).
The following outgoing traffic should be allowed in general:
- NTP, DNS, HTTP, HTTPS services
- gridftp (TCP ports: 2811 and port-ranges: 20000-25000)
In addition, the PL-Grid QCG-Accounting publisher plugin (BAT) needs access to the following machine and port (PL-Grid only):
- acct.plgrid.pl 61616 (TCP)
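The incoming rules above can be expressed, for example, as iptables commands. This is an illustrative sketch, not a complete firewall policy; chain names and default policies differ per site, and it only runs when executed as root with iptables available:

```shell
# Open the QCG-Computing, QCG-Notification and GridFTP control ports,
# plus the GridFTP data port-range.
if [ "$(id -u)" -eq 0 ] && command -v iptables >/dev/null 2>&1; then
    for port in 19000 19001 2811; do
        iptables -A INPUT -p tcp --dport "$port" -j ACCEPT
    done
    # GridFTP port-range; keep in sync with GLOBUS_TCP_PORT_RANGE
    iptables -A INPUT -p tcp --dport 20000:25000 -j ACCEPT
    result="rules added"
else
    result="skipped (not root or iptables missing)"
fi
echo "$result"
```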
Preparation of the environment
Database
- Install the database backend (PostgreSQL). On Scientific Linux this can be done with:
yum install postgresql postgresql-server
- Install UnixODBC and the PostgreSQL ODBC driver:
yum install unixODBC postgresql-odbc
CA and host certificates
First install all needed trusted CA certificates ( instruction). Moreover, we assume that the X.509 host certificate (signed by your local Certificate Authority) and key are already installed in the following locations:
- /etc/grid-security/hostcert.pem
- /etc/grid-security/hostkey.pem
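A quick sanity check (sketch): verify the credentials are in place and inspect the certificate's subject and validity dates with openssl before continuing.

```shell
cert=/etc/grid-security/hostcert.pem
key=/etc/grid-security/hostkey.pem
if [ -r "$cert" ]; then
    report=$(openssl x509 -in "$cert" -noout -subject -dates)
else
    report="$cert not installed yet"
fi
# the private key must be readable by root only (mode 0400 or 0600)
[ -e "$key" ] && ls -l "$key"
echo "$report"
```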
Other
Most grid services and security infrastructures are sensitive to time skews. We therefore recommend installing a Network Time Protocol daemon, or using any other solution that provides accurate clock synchronization. Also disable automatic package updates, as they may disrupt the running system.
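One way to set this up on a RHEL-family system is sketched below; package and service names vary between distributions and init systems, so treat this as an example rather than the required procedure:

```shell
# Install and enable an NTP daemon (SysV-style init, RHEL-family only).
if [ "$(id -u)" -eq 0 ] && command -v yum >/dev/null 2>&1; then
    yum -y install ntp
    chkconfig ntpd on
    service ntpd start
    ntp_status="ntpd installed and started"
else
    ntp_status="skipped - install an NTP daemon with your package manager"
fi
echo "$ntp_status"
```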
Installation
First you need to install the appropriate repository, or get the newest source package and use this guide to compile it manually.
CentOS 7, Scientific Linux 5/6
yum install qcg-comp qcg-comp-client qcg-comp-logrotate
Debian
apt-get install qcg-comp qcg-comp-client qcg-comp-doc
Integration with QCG-Broker
In order to enable this QCG-Computing endpoint to be accessible by the QCG-Broker service you need to:
- add a qcg-broker user - the user that the service will be mapped to:
useradd -r -d /var/log/qcg/ qcg-broker
- install the GridFTP server using this instruction.
Database initialization
- Set up the QCG-Computing database using the provided script:
/usr/share/qcg-comp/tools/qcg-comp-install.sh
Welcome to qcg-comp installation script!

This script will guide you through process of configuring proper
environment for running the QCG-Computing service.
You have to answer few questions regarding parameters of your database.
If you are not sure just press Enter and use the default values.

Use local PostgreSQL server? (y/n) [y]: y
Database [qcg-comp]:
User [qcg-comp]:
Password [RAND-PASSWD]: MojeTajneHaslo
Create database? (y/n) [y]: y
Create user? (y/n) [y]: y

Checking for system user qcg-comp...OK
Checking whether PostgreSQL server is installed...OK
Checking whether PostgreSQL server is running...OK

Performing installation
* Creating user qcg-comp...OK
* Creating database qcg-comp...OK
* Creating database schema...OK
* Checking for ODBC data source qcg-comp...
* Installing ODBC data source...OK

Remember to add appropriate entry to /var/lib/pgsql/data/pg_hba.conf
(as the first rule!) to allow user qcg-comp to access database qcg-comp.
For instance:

host    qcg-comp    qcg-comp    127.0.0.1/32    md5

and reload Postgres server.
Add a new rule to the pg_hba.conf as requested:
vim /var/lib/pgsql/data/pg_hba.conf
/etc/init.d/postgresql reload
Authorization modules
The next subsections describe the three most common modes of authorization.
Manually created grid mapfile
For testing purposes, or if your user community is small enough to maintain it manually, you can use a plain grid mapfile, which provides a static mapping between a user's certificate Distinguished Name and a local account:
# for test purposes only: add a mapping for your account
echo '"MyCertDN" myaccount' >> /etc/grid-security/grid-mapfile
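To find the Distinguished Name to put in the mapfile you can read it from the user's certificate with openssl. In the sketch below a throwaway self-signed certificate stands in for the user's real usercert.pem; on a production system point openssl at that file instead (the DN and the myaccount mapping are illustrative):

```shell
# Generate a disposable certificate and extract its subject DN.
tmp=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -subj "/C=PL/O=GRID/O=Example/CN=Test User" \
    -keyout "$tmp/key.pem" -out "$tmp/cert.pem" 2>/dev/null
dn=$(openssl x509 -in "$tmp/cert.pem" -noout -subject | sed 's/^subject= *//')
# the grid-mapfile line pairs the quoted DN with a local account
echo "\"$dn\" myaccount"
rm -rf "$tmp"
```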
LDAP generated grid mapfile (PL-Grid only)
In PL-Grid, grid-mapfiles are generated automatically, based on information available in local LDAP replicas. You need to install and configure the gridmap-file-generator.
Finally, add a mapping in grid-mapfile.local for the purposes of QCG-Broker:
"/C=PL/O=GRID/O=PSNC/CN=qcg-broker/qcg-broker.man.poznan.pl" qcg-broker
VOMS
Since version 3.0 of the service it is possible to authorize and map users based on their VO membership ( instruction).
Getting the DRMAA library
The QCG-Computing service uses a DRMAA compliant interface for batch job submission. Thus you need to install the library appropriate for your system:
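Whichever DRMAA implementation you install, it is worth checking that the library exists and that its dependencies resolve before pointing the service at it (sketch; the path below is the one used later in qcg-compd.xml and may differ on your system):

```shell
drmaa_lib=/usr/local/lib/libdrmaa.so
if [ -e "$drmaa_lib" ]; then
    # list the shared-library dependencies; unresolved ones show as "not found"
    ldd "$drmaa_lib"
    check="found $drmaa_lib"
else
    check="$drmaa_lib not found - adjust the submission_drmaa path"
fi
echo "$check"
```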
Service configuration
Edit the preinstalled service configuration file (/etc/qcg/qcg-comp/qcg-compd.xml):
<?xml version="1.0" encoding="UTF-8"?>
<sm:QCGCore
   xmlns:sm="http://schemas.qoscosgrid.org/core/2011/04/config"
   xmlns="http://schemas.qoscosgrid.org/comp/2011/04/config"
   xmlns:smc="http://schemas.qoscosgrid.org/comp/2011/04/config"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

  <Configuration>
    <sm:ModuleManager>
      <sm:Directory>/usr/lib64/qcg-core/modules/</sm:Directory>
      <sm:Directory>/usr/lib64/qcg-comp/modules/</sm:Directory>
    </sm:ModuleManager>

    <sm:Service xsi:type="qcg-compd" description="QCG-Computing">
      <sm:Logger>
        <sm:Filename>/var/log/qcg/qcg-comp/qcg-compd.log</sm:Filename>
        <sm:Level>INFO</sm:Level>
      </sm:Logger>

      <sm:Transport>
        <sm:Module xsi:type="sm:ecm_gsoap.service">
          <sm:Host>frontend.example.com</sm:Host>
          <sm:Port>19000</sm:Port>
          <sm:KeepAlive>false</sm:KeepAlive>
          <sm:Authentication>
            <sm:Module xsi:type="sm:atc_transport_gsi.service">
              <sm:X509CertFile>/etc/grid-security/hostcert.pem</sm:X509CertFile>
              <sm:X509KeyFile>/etc/grid-security/hostkey.pem</sm:X509KeyFile>
            </sm:Module>
          </sm:Authentication>
          <sm:Authorization>
            <sm:Module xsi:type="sm:atz_mapfile">
              <sm:Mapfile>/etc/grid-security/grid-mapfile</sm:Mapfile>
            </sm:Module>
          </sm:Authorization>
        </sm:Module>
        <sm:Module xsi:type="smc:qcg-comp-service"/>
      </sm:Transport>

      <sm:Module xsi:type="pbs_jsdl_filter"/>
      <sm:Module xsi:type="atz_ardl_filter"/>
      <sm:Module xsi:type="sm:general_python" path="/usr/lib64/qcg-comp/modules/python/monitoring.py"/>
      <!--sm:Module xsi:type="sm:general_python" path="/opt/qcg/lib/qcg-comp/modules/python/plgrid_info.py"/ this module is Mandatory in PL-Grid-->
      <sm:Module xsi:type="sm:general_python" path="/usr/lib64/qcg-comp/modules/python/modules_info.py"/>
      <sm:Module xsi:type="submission_drmaa" path="/usr/local/lib/libdrmaa.so"/>
      <sm:Module xsi:type="reservation_python" path="/usr/lib64/qcg-comp/modules/python/reservation_maui.py"/>

      <sm:Module xsi:type="notification_wsn">
        <PublishedBrokerURL>https://frontend.example.com:19011/</PublishedBrokerURL>
        <sm:Module xsi:type="sm:ecm_gsoap.client">
          <sm:ServiceURL>http://localhost:19001/</sm:ServiceURL>
          <sm:Authentication>
            <sm:Module xsi:type="sm:atc_transport_http.client"/>
          </sm:Authentication>
          <sm:Module xsi:type="sm:ntf_client"/>
        </sm:Module>
      </sm:Module>

      <sm:Module xsi:type="application_mapper">
        <ApplicationMapFile>/etc/qcg/qcg-comp/application_mapfile</ApplicationMapFile>
      </sm:Module>

      <Database>
        <DSN>qcg-comp</DSN>
        <User>qcg-comp</User>
        <Password>qcg-comp</Password>
      </Database>

      <UnprivilegedUser>qcg-comp</UnprivilegedUser>
      <!--UseScratch>true</UseScratch> uncomment this if scratch is the only file system shared between the worker nodes and this machine -->

      <FactoryAttributes>
        <CommonName>hpc.mydomain.org</CommonName>
        <LongDescription>Cluster description</LongDescription>
      </FactoryAttributes>
    </sm:Service>
  </Configuration>
</sm:QCGCore>
Common
In most cases it should be enough to change only the following elements:
- Transport/Module/Host
- the hostname of the machine where the service is deployed. You can put 0.0.0.0 here if you want to listen on all interfaces.
- Transport/Module/Authentication/Module/X509CertFile and Transport/Module/Authentication/Module/X509KeyFile
- if you installed the certificate and key files in the recommended locations you do not need to edit these fields.
- Module[type="smc:notification_wsn"]/PublishedBrokerURL
- the external URL of the QCG-Notification service (You can do it later, i.e. after installing the QCG-Notification service)
- Module[type="smc:notification_wsn"]/Module/ServiceURL
- the localhost URL of the QCG-Notification service (You can do it later, i.e. after installing the QCG-Notification service)
- Module[type="submission_drmaa"]/@path
- path to the DRMAA library (the libdrmaa.so).
- Module[type="reservation_python"]/@path
- path to the reservation module. Change this if you are using a scheduler other than Maui (e.g. use reservation_moab.py for Moab, reservation_pbs.py for PBS Pro).
- Database/Password
- the qcg-comp database password
- UseScratch
- set this to true if you set QCG_SCRATCH_DIR_ROOT in sysconfig, so that every job is started from a scratch directory (instead of the default home directory).
- FactoryAttributes/CommonName
- a common name of the cluster (e.g. reef.man.poznan.pl). You can use any name that is unique among all systems (e.g. cluster name + domain name of your institution)
- FactoryAttributes/LongDescription
- a human readable description of the cluster
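After editing it is worth checking that the configuration file is still well-formed XML before restarting the service. A minimal sketch using xmllint (which ships with libxml2):

```shell
cfg=/etc/qcg/qcg-comp/qcg-compd.xml
if command -v xmllint >/dev/null 2>&1 && [ -r "$cfg" ]; then
    if xmllint --noout "$cfg"; then
        verdict="$cfg is well-formed"
    else
        verdict="$cfg contains XML errors"
    fi
else
    verdict="xmllint or $cfg not available"
fi
echo "$verdict"
```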
Torque
- Module[type="reservation_python"]/@path
- path to the reservation module. Change this if you are using a scheduler other than Maui (e.g. use reservation_moab.py for Moab).
PBS Professional
- Module[type="reservation_python"]/@path
- path to the reservation module. Change this to reservation_pbs.py.
SLURM
- first, replace:
<sm:Module xsi:type="pbs_jsdl_filter"/>
- with:
<sm:Module xsi:type="slurm_jsdl_filter"/>
- also, if you want to offer the advance reservation interface, you need to set:
- Module[type="reservation_python"]/@path
- path to the reservation module. Change this to reservation_slurm.py.
- finally, make sure that the qcg-comp user exists on the service node and that it can list all nodes:
scontrol show node -o | wc -l
Configuring QCG-Accounting
Please use the QCG-Accounting agent.
Creating applications' script space
A common case for the QCG-Computing service is that an application is accessed using an abstract application name rather than an absolute executable path. The application name/version to executable path mappings are stored in the file /etc/qcg/qcg-comp/application_mapfile:
cat /etc/qcg/qcg-comp/application_mapfile
# ApplicationName ApplicationVersion Executable
bash * /opt/exp_soft/qcg/qcg-app-scripts/app-scripts/bash.qcg
It is also common to provide wrapper scripts here rather than target executables. A wrapper script can handle such aspects of the application lifetime as environment initialization, copying files from/to scratch storage, and application monitoring. It is recommended to create a separate directory for those wrapper scripts (e.g. on the application partition). This directory must be readable by all users and from every worker node (the application partition usually fulfils those requirements). Please read more on Application Scripts. You must provide at least a mapping for the 'bash' application.
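A hypothetical minimal wrapper for the mandatory 'bash' application is sketched below; real wrapper scripts typically also handle scratch staging and monitoring. /tmp is used here only for illustration - on a real system the script lives on the shared application partition and the mapfile points at it:

```shell
# Create a minimal bash.qcg wrapper script (illustrative location).
mkdir -p /tmp/qcg-app-scripts
cat > /tmp/qcg-app-scripts/bash.qcg <<'EOF'
#!/bin/bash
# Execute the user's command, propagating its exit status to the batch system.
exec /bin/bash "$@"
EOF
chmod 755 /tmp/qcg-app-scripts/bash.qcg
```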
Starting the service
As root type:
/etc/init.d/qcg-compd start
The service logs can be found in:
/var/log/qcg/qcg-comp/qcg-compd.log
Note: In the current version, whenever you restart the PostgreSQL server you need to also restart the QCG-Computing and QCG-Notification services:
/etc/init.d/qcg-compd restart
/etc/init.d/qcg-ntfd restart
Stopping the service
The service can be stopped using the following command:
/etc/init.d/qcg-compd stop
Verifying the installation
- Edit the QCG-Computing client configuration file (/etc/qcg/qcg-comp/qcg-comp.xml):
- set the Host and Port to reflect the changes in the service configuration file (qcg-compd.xml).
<?xml version="1.0" encoding="UTF-8"?>
<sm:QCGCore
   xmlns:sm="http://schemas.qoscosgrid.org/core/2011/04/config"
   xmlns="http://schemas.qoscosgrid.org/comp/2011/04/config"
   xmlns:smc="http://schemas.qoscosgrid.org/comp/2011/04/config"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

  <Configuration>
    <sm:ModuleManager>
      <sm:Directory>/usr/lib64/qcg-core/modules/</sm:Directory>
      <sm:Directory>/usr/lib64/qcg-comp/modules/</sm:Directory>
    </sm:ModuleManager>

    <sm:Client xsi:type="qcg-comp" description="QCG-Computing client">
      <sm:Transport>
        <sm:Module xsi:type="sm:ecm_gsoap.client">
          <sm:ServiceURL>httpg://frontend.example.com:19000/</sm:ServiceURL>
          <sm:Authentication>
            <sm:Module xsi:type="sm:atc_transport_gsi.client"/>
          </sm:Authentication>
          <sm:Module xsi:type="smc:qcg-comp-client"/>
        </sm:Module>
      </sm:Transport>
    </sm:Client>
  </Configuration>
</sm:QCGCore>
- Initialize your credentials:
grid-proxy-init -rfc
Your identity: /C=PL/O=GRID/O=PSNC/CN=Mariusz Mamonski
Enter GRID pass phrase for this identity:
Creating proxy .................................................................. Done
Your proxy is valid until: Wed Apr  6 05:01:02 2012
- Query the QCG-Computing service:
qcg-comp -G | xmllint --format -
# the xmllint is used only to present the result in a more pleasant way
<bes-factory:FactoryResourceAttributesDocument xmlns:bes-factory="http://schemas.ggf.org/bes/2006/08/bes-factory">
  <bes-factory:IsAcceptingNewActivities>true</bes-factory:IsAcceptingNewActivities>
  <bes-factory:CommonName>IT cluster</bes-factory:CommonName>
  <bes-factory:LongDescription>IT department cluster for public use</bes-factory:LongDescription>
  <bes-factory:TotalNumberOfActivities>0</bes-factory:TotalNumberOfActivities>
  <bes-factory:TotalNumberOfContainedResources>1</bes-factory:TotalNumberOfContainedResources>
  <bes-factory:ContainedResource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="bes-factory:BasicResourceAttributesDocumentType">
    <bes-factory:ResourceName>worker.example.com</bes-factory:ResourceName>
    <bes-factory:CPUArchitecture>
      <jsdl:CPUArchitectureName xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl">x86_32</jsdl:CPUArchitectureName>
    </bes-factory:CPUArchitecture>
    <bes-factory:CPUCount>4</bes-factory:CPUCount>
    <bes-factory:PhysicalMemory>1073741824</bes-factory:PhysicalMemory>
  </bes-factory:ContainedResource>
  <bes-factory:NamingProfile>http://schemas.ggf.org/bes/2006/08/bes/naming/BasicWSAddressing</bes-factory:NamingProfile>
  <bes-factory:BESExtension>http://schemas.ogf.org/hpcp/2007/01/bp/BasicFilter</bes-factory:BESExtension>
  <bes-factory:BESExtension>http://schemas.qoscosgrid.org/comp/2011/04</bes-factory:BESExtension>
  <bes-factory:LocalResourceManagerType>http://example.com/SunGridEngine</bes-factory:LocalResourceManagerType>
  <smcf:NotificationProviderURL xmlns:smcf="http://schemas.qoscosgrid.org/comp/2011/04/factory">http://localhost:2211/</smcf:NotificationProviderURL>
</bes-factory:FactoryResourceAttributesDocument>
- Submit a sample job:
qcg-comp -c -J /usr/share/qcg-comp/doc/examples/jsdl/sleep.xml
Activity Id: ccb6b04a-887b-4027-633f-412375559d73
- Query its status:
qcg-comp -s -a ccb6b04a-887b-4027-633f-412375559d73
status = Executing
qcg-comp -s -a ccb6b04a-887b-4027-633f-412375559d73
status = Executing
qcg-comp -s -a ccb6b04a-887b-4027-633f-412375559d73
status = Finished
exit status = 0
- Submit a job which produces some output:
$ qcg-comp -c -J /usr/share/qcg-comp/doc/examples/jsdl/date.xml
Activity Id: 591effa9-143d-4cae-9dd9-02e40f760448
$ qcg-comp -s -a 591effa9-143d-4cae-9dd9-02e40f760448
status = Queued
$ qcg-comp -s -a 591effa9-143d-4cae-9dd9-02e40f760448
status = Finished (exit status = 0)
$ qcg-comp -o -J /usr/share/qcg-comp/doc/examples/jsdl/date.xml
File /tmp/date.staged.out staged out.
All files staged out.
$ cat /tmp/date.staged.out
Mon Jul 29 02:23:33 HST 2013
If possible, perform a "reboot test", i.e. reboot the machine and check that all services are operational without manual intervention. You can also run the extended UMD Verification Procedure.
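For the reboot test to pass, both services must be enabled at boot. A sketch for SysV-style systems such as Scientific Linux (adapt for systemd or other init systems):

```shell
# Enable qcg-compd and qcg-ntfd at boot (requires root and chkconfig).
if [ "$(id -u)" -eq 0 ] && command -v chkconfig >/dev/null 2>&1; then
    chkconfig qcg-compd on
    chkconfig qcg-ntfd on
    boot_cfg="services enabled at boot"
else
    boot_cfg="skipped - enable qcg-compd and qcg-ntfd in your init system"
fi
echo "$boot_cfg"
```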
Maintenance
Historic usage information is stored in two relations of the QCG-Computing database: jobs_acc and reservations_acc. You can always archive old usage data to a file and delete it from the database using the psql client:
psql -h localhost qcg-comp qcg-comp
Password for user qcg-comp:
Welcome to psql 8.1.23, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

qcg-comp=> \o jobs.acc
qcg-comp=> SELECT * FROM jobs_acc WHERE end_time < date '2010-01-10';
qcg-comp=> \o reservations.acc
qcg-comp=> SELECT * FROM reservations_acc WHERE end_time < date '2010-01-10';
qcg-comp=> \o
qcg-comp=> DELETE FROM jobs_acc WHERE end_time < date '2010-01-10';
qcg-comp=> DELETE FROM reservations_acc WHERE end_time < date '2010-01-10';
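The same archiving can be scripted non-interactively, e.g. from cron. This is a sketch: it assumes the qcg-comp password is supplied via ~/.pgpass and a PostgreSQL version whose \copy accepts a query; the function name is ours, not part of QCG:

```shell
# Archive rows older than a cutoff date to a file, then delete them.
archive_and_purge() {
    table="$1"; cutoff="$2"; outfile="$3"
    psql -h localhost -U qcg-comp -d qcg-comp \
        -c "\copy (SELECT * FROM $table WHERE end_time < date '$cutoff') TO '$outfile'" &&
    psql -h localhost -U qcg-comp -d qcg-comp \
        -c "DELETE FROM $table WHERE end_time < date '$cutoff'"
}
# usage: archive_and_purge jobs_acc 2010-01-10 jobs.acc
#        archive_and_purge reservations_acc 2010-01-10 reservations.acc
```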
You should also install the logrotate configuration for QCG-Computing:
yum install qcg-comp-logrotate
Important: On any update/restart of the PostgreSQL database you must also restart the qcg-compd and qcg-ntfd services:
/etc/init.d/qcg-compd restart
/etc/init.d/qcg-ntfd restart
During scheduled downtimes we recommend disabling submission in the service configuration file:
...
<AcceptingNewActivities>false</AcceptingNewActivities>
<FactoryAttributes>
...
GOCDB
Please remember to register the QCG-Computing and QCG-Notification services in the GOCDB, using the QCG.Computing and QCG.Notification service types respectively.