Version 24 (modified by bartek, 13 years ago) (diff)

--

QosCosGrid

QosCosGrid could be viewed as a quasi-opportunistic supercomputer whose computational performance exceeds the power offered by a single supercomputing or data center. QosCosGrid is designed as a multi-layered architecture that is capable of dealing with computationally intensive large-scale, complex and parallel simulations that are usually too complex to run within a single computer cluster or machine. The QosCosGrid middleware enables computing resources (at the level of processor cores) from different administrative domains to be combined into a single powerful computing resource via Internet. Clearly, bandwidth and latency characteristics of the Internet may have an affect on overall application performance of QosCosGrid-enabled applications. However, the ability to connect and efficiently control advanced applications executed in parallel over the Internet is a feature that is highly appreciated by QosCosGrid users.

The QosCosGrid framework is highly flexible as it is composed of pluggable components that can be easily modified to support different scheduling and access policies to better maximize a diversity of utility functions. Furthermore, the framework exploits novel algorithms for topology-aware co-allocations that are required by parallel programming and execution set-ups in production-level high-performance computing environments, such as the Message Passing Interfaces (MPI), ProActive, or their hybrid extensions linking programming models like OpenMP or CUDA.

No image "QCG-Architecture-v4.png" attached to main_old

QCG BES/AR

QCG BES/AR (the successor of the OpenDSP project) is an open architecture implementation of SOAP Web service for multi-user access and policy-based job control routines by various queuing and batch systems managing local computational resources. This key service is using Distributed Resource Management Application API (DRMAA) to communicate with the underlying queuing systems. QCG BES/AR has been designed to support a variety of plugins and modules for external communication as well as to handle a large number of concurrent requests from external clients and services. Consequently, it can be used and integrated with various authentication, authorization and accounting services or to extend capabilities of existing e-infrastructures based on Unicore, gLite, Globus Toolkit, and others. QCG BES/AR service is compliant with the OGF HPC Basic Profile specification, which serves as a profile over Open Grid Forum standards like JSDL and OGSA Basic Execution Service. In addition, it offers remote interfaces for advance reservation management, and supports basic file transfer mechanisms. QCG BES/AR was successfully tested with the following queuing systems: Sun Grid Engine (SGE), Platform LSF, Torque/Maui, PBS Pro, Condor, Apple XGrid and LoadLeveler. Therefore, as a crucial component in QosCosGrid, it can be easily set up on the majority of computing clusters and supercomputers running the aforementioned queuing systems. Currently, advance reservation capabilities in QCG BES/AR are exposed for SGE, Platform LSF and Maui (a scheduler that is typically used in conjunction with Torque). Moreover, generic extensions for advance reservation have been proposed for the next DRMAA standard release.

QCG Notification and Data Movement services

QCG Notification is an open-source implementation of the family of WS-Notification standards (version 1.3). In the context of QosCosGrid, it is used to extend features provided by QCG BES/AR by adding standards-based synchronous and asynchronous notification features. QCG Notification supports the topic-based publish/subscribe pattern for asynchronous message exchange among Web services and other entities, in particular services or clients that want to be integrated with QosCosGrid. The main part of QCG Notification is based on a highly efficient, extended version of the NotificationBroker, managing all items participating in notification events. Today, QCG Notification offers sophisticated notification capabilities, e.g., topic and message content notification filtering, pull-and-push styles of transporting messages. QCG Notification has been integrated with a number of communication protocols as well as various Web services security mechanisms. The modular architecture of QCG Notification makes it relatively straightforward to develop new extensions and plugins to meet new requirements.

As many other e-infrastructures controlled by middleware services, QosCosGrid takes advantage of the GridFTP protocol for large data transfer operations, in particular to stage in and stage out files for advanced simulations. GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide area networks. It is a de facto standard for all data transfers in grid and cloud environments and extends the standard FTP protocol with functions such as third-party transfer, parallel and striped data transfer, self-tuning capabilities, X509 proxy certificate-based security, support for reliable and restartable data transfers. The development of GridFTP is coordinated by the GridFTP Working Group under the hood of the Open Grid Forum community.

QCG Broker

The QCG Broker is based on the Grid Resource Management System (GRMS) framework. GRMS was designed to be an open-source meta-scheduling framework that allows developers to build and easily deploy resource management systems to control large-scale distributed computing infrastructures running queuing or batch systems locally. Based on dynamic resource selection, advance reservation and various scheduling methodologies, combined with feedback control architecture, QCG Broker deals efficiently with various meta-scheduling challenges, e.g., co-allocation, load-balancing among clusters, remote job control, file staging support or job migration. The main goal of QCG Broker was to manage the whole process of remote job submission and advance reservation to various batch queuing systems and subsequently to underlying clusters and computational resources. It has been designed as an independent core component for resource management processes which can take advantage of various low-level core and grid services and existing technologies, such as QCG BES/AR or QCG Notification, as well as various grid middleware services such as gLite, Globus or Unicore. Addressing various demanding computational needs of large-scale complex simulations, which in many cases can exceed capabilities of a single cluster, the QCG Broker can flexibly distribute and control applications onto many computing clusters or supercomputers on behalf of end users. Moreover, owing to some built-in metascheduling procedures it can optimize and run efficiently a wide range of applications while at the same time increasing the overall throughput of computing e-infrastructures. Advance reservation mechanisms are used to create, synchronize and simultaneously manage the co-allocation of computing resources located at different Administrative Domains. The XML-based job definition language Job Profile makes it relatively easy to specify the requirements of large-scale parallel applications together with the complex parallel communication topologies. Consequently, application developers and end users are able to run their experiments in parallel over multiple clusters as well to perform various benchmark-based experiments as alternative topologies are taken into account during meta-scheduling processes in QCG Broker.

QCG Libraries and Cross-cluster communication

From a development perspective, the QosCosGrid applications were grouped into two classes: (a) Java applications taking advantage of ProActive library as the parallelization technology, and (b) applications based on ANSI C or similar codes, which rely on the message passing paradigm. Based on these groups, QosCosGrid was designed to support two parallel programming and execution environments, namely: QCG-OpenMPI (aiming at C/C++ and Fortran parallel applications developers) and QCG-ProActive (aiming at Java parallel application developers).

Additional services were required in order to support the spawning of parallel application processes on co-allocated computational resources. The main reason for this was that standard deployment methodologies used in OpenMPI and ProActive relied on either RSH/SSH or specific local queuing functionalities. Both are limited to single-cluster runs (e.g., the SSH-based deployment methods are problematic if at least one cluster has worker nodes that have private IP addresses). Those services are called the coordinators and are implemented as Web services.