The Typical drawback of a Gigabit Ethernet cluster is the poor performance of inter-process communication over the interconnect. Current implementations of industry-standard communication primitives, APIs, and protocols, usually show high communication latencies and sub-optimal communication throughput.
The Genoa Active Message MAchine (GAMMA) is a low-latency protocol for Gigabit Ethernet clusters running Linux. GAMMA runs on IA-32 processors (Intel Pentium, AMD K6, and superior models) and their 64-bit extensions (AMD Athlon64, AMD Opteron, Intel EM64T). Multi-CPU and multi-core nodes are supported.
The core of GAMMA is a custom Linux network device driver, which operates the Network Interface Card (NIC). The GAMMA driver allows low latency, high throughput inter-process communications, using a mechanism derived from Active Messages. Both point-to-point and broadcast communications are provided. Broadcast communication exploits the Ethernet broadcast natively.
The GAMMA driver is unable to manage standard IP traffic. Therefore, all IP services of the cluster must be supported by an additional LAN (an additional NIC on each cluster node, plus an additional hub/switch).
The communication mechanisms implemented in the GAMMA driver are made available to application writers through the GAMMA user library. The GAMMA library provides support to application launch, process grouping, point-to-point/broadcast communications based on the Active Ports mechanisms, and some collective routines (barrier synchronization, and broadcast).
GAMMA provides two levels of QoS. The lower one, corresponding to the fastest communications, is a best-effort service. With this service, network congestion and ``hot spots'' may cause the receiver NIC or even the LAN switch to loose packets by overrun. The other QoS level provides flow-controlled communication, ensuring reliability up to hardware faults, at a negligible performance penalty.
GAMMA is compiled as a Linux module for Linux 2.6.24. Therefore, the 2.6.24 kernel must be installed on all cluster nodes before loading the GAMMA module.
A porting of MPICH-1 atop GAMMA is available. See the MPI/GAMMA web page.
You need a cluster with IA-32 processors (Intel Pentium, AMD K6, and superior models) or their 64-bit extensions (AMD Athlon64, AMD Opteron, Intel EM64T). Multi-CPU and multi-core nodes are allowed.
For the main interconnect (namely, the one where GAMMA communications take place), each PC should have a Gigabit Ethernet card chosen in the following list:
You also need a Gigabit Ethernet switch to connect the nodes (or a back-to-back cable, for just two nodes).
An additional LAN, of any kind, is required to allow running IP traffic separate from GAMMA traffic. Without such additional LAN, all IP services (and especially NFS) would be disabled. GAMMA uses IP to spawn the processes of a parallel job, so the additional LAN is really necessary.
GAMMA provides two levels of Quality of Service (QoS). The basic one is a ``best-effort'' transmission, with no flow control, implemented by functions gamma_send() , gamma_send_2p() , gamma_isend(), gamma_isend_2p() . The other one is a ``flow controlled'' transmission, implemented by functions gamma_send_flowctl() , gamma_send_2p_flowctl() , gamma_isend_flowctl(), gamma_isend_2p_flowctl() .
In all cases, however, the GAMMA protocol implements policies for detecting packet losses and is able to retransmit missing packets. GAMMA is not particularly efficient in recovering from packet losses; indeed, we argue that the probability of a packet loss due to a hardware error is negligible, and the possibility of loosing packets due to LAN congestion is better managed by preventing congestion from occurring (for instance, by using smart communication patterns in parallel jobs). In other words, GAMMA does provide packet retransmission because it is of great help indeed, but does not take care of efficiency when retransmission occurs because this is supposed to be (made) rare.
Application-level performance comparisons: OpenFOAM 1.4.
And here, for the common folks, some latency-bandwidth numbers (taken at user level).
Definitions:
Message delay, D(S), is half the round-trip time with a message of size S bytes, as measured by running a simple ping-pong GAMMA microbenchmark.
Latency is the message delay D(Smin) where Smin is the smallest message size allowed by the communication system. GAMMA allows Smin = 0, TCP allows Smin = 1.
End-to-end throughput, T(S), is the transfer rate of the
whole communication path, from the sender to the receiver:
T(S) = S/D(S).
Note: This definition implies that the throughput
is measured using the ping-pong microbenchmark. Other techniques for
throughput measurement are based on the transmission of long streams of
messages from sender to receiver without any data flowing back; such techniques
will measure a different thing, namely the transmission throughput
(see below).
Asymptotic bandwidth, B, is the value of the following ``bandwidth'' function B(S) as the message size S > 0 approaches infinity: B(S) = (S - Smin)/(D(S) - D(Smin)). Since B(S) - T(S) approaches zero as S approaches infinity, B is commonly evaluated by measuring the end-to-end throughput T(S) with S very large.
Transmission throughput is the transfer rate perceived at the sender side, that is, the data rate at which an infinite stream of messages, of fixed size each, can be pushed into the network without causing any data loss. This requires to run a different microbenchmark compared to ping-pong: rather than exchanging a single message back and forth, the sender transmits a long stream of messages to the receiver and measures the time spent for the whole transmission.
Half-power point, is the message size H at which the throughput T(H) reaches half the asymptotic bandwith B. The value of H depend on which notion of throughput one has in mind. We refer here to the end-to-end throughput.
Two testbeds have been used for these measurements:
Here are the performance numbers
(best among three runs of a ping-pong, each providing an average over
50 trials; GAMMA flow control enabled; performance at user level):
Testbed | NIC | Latency | Latency | Asymptotic bandwidth | Half-power point | Half-power point |
(back to back) | (switch incl.) | (back to back) | (switch incl.) | |||
1 | Intel PRO/1000 | 6.1 µs | 10.8 µs | 123.4 MByte/s | 1820 byte | 7600 byte |
(MTU 4120 bytes) | ||||||
2 | Broadcom NetXtreme | 14.6 µs | ?? µs | 119.4 MByte/s | 3705 byte | ?? byte |
(MTU 1500 bytes) |
The requirements to install GAMMA are:
If you plan to install GAMMA, please read the README_GAMMA file before beginning the installation. The GAMMA source distribution itself contains a quite complete manual for installation and troubleshooting. Should you need additional information (e.g. you find the installation instructions unclear or incomplete) contact the maintainer, Giuseppe Ciaccio, ciaccio AT disi.unige.it.
Here is the current release of GAMMA:
GAMMA allows more process instances of the same parallel job to run on the same CPU. Thread safety is granted with point-to-point communications when running on distinct GAMMA ports. However, collective routines (barrier synchronization, and broadcast) are not thread compatible, as they use predetermined GAMMA ports.
The packet retransmission mechanism is not perfect. It will not work if the missing packet originates from a non-blocking send.
Over time, some of the GAMMA users contributed much to improve GAMMA usability and stability. I hereby acknowledge invaluable contribution from the following individuals or companies:
Please send suggestions and comments to:
Giuseppe Ciaccio, ciaccio AT disi.unige.it