Network Of Workstations (NOWs) and clusters of fast, well-equipped Personal Computers (PCs), interconnected by a standard LAN hardware (like Fast Ethernet, Gigabit Ethernet, Myrinet), and running the Linux operating system, are becoming more and more attractive as cheap and efficient platforms for parallel and distributed applications. The main drawback of a standard NOW/cluster architecture is the poor performance of the standard support to inter-process communication over any LAN hardware. Current implementations of industry-standard communication primitives (RPC), APIs (sockets), and protocols (TCP, UDP) usually show high communication latencies and low communication throughput.
We have developed an experimental system for inter-process communication, called the Genoa Active Message MAchine (GAMMA). GAMMA runs on clusters of PCs, based on Intel as well as AMD CPUs (Intel Pentium, AMD K6, and superior models), connected by Fast Ethernet or Gigabit Ethernet. Such a cluster can be used either as a set of autonomous workstations (in a student lab) or as a parallel computer for Single Program Multiple Data (SPMD) as well as MIMD applications.
The core of GAMMA is a custom device driver under Linux, which operates the Network Interface Card (NIC). The GAMMA driver delivers very low latency, high bandwidth communications using Active Ports, a mechanism derived from Active Messages. Both point-to-point and broadcast communications are provided. Broadcast communication exploits the Ethernet broadcast directly.
The GAMMA driver is able to manage standard IP traffic in addition to GAMMA fast communications. This means that using GAMMA on your LAN will not stop the standard UNIX network services.
The communication mechanisms implemented in the GAMMA driver are made available to application writers through the GAMMA user library. The GAMMA library provides support to application launch, process grouping, point-to-point/broadcast communications based on the Active Ports mechanisms, and some collective routines (barrier synchronization, gather, all-gather; more collective routines are being developed).
GAMMA provides two levels of QoS. The lower QoS level, corresponding to the fastest communications, is a best-effort service. With this service, network congestion and ``hot spots'' may cause the receiver NIC or even the LAN switch to loose packets by overrun. The other QoS level provides flow-controlled communication, therefore ensuring reliability up to hardware faults, at a negligible performance penalty.
Installing the GAMMA driver requires only very marginal changes to the original Linux kernel. The Linux kernel extended with the GAMMA driver has to run on each PC in the cluster.
The current version of GAMMA has been tested with a number of NICs, both Fast Ethernet (DEC 21143 chipset, 3COM 3c905) and Gigabit Ethernet (Alteon TIGON-II chipset, Netgear GA621, 3COM 3c996).
A porting of MPI atop GAMMA is now available.
A pool of PCs (Intel Pentium, AMD K6, or superior models), single-CPU as well as dual-CPU.
Each PC should have a NIC chosen in the following list:
The PCs must be networked together by a Gigabit Ethernet switch or a Fast Ethernet switch or hub.
In our own installation, the PCs are also networked by an additional LAN for standard IP traffic and services. Such additional LAN is not strictly required by GAMMA, as the GAMMA network device driver can manage both GAMMA and IP communications on the same LAN; however, separating GAMMA communications from IP will significantly boost performance.
GAMMA provides two levels of Quality of Service (QoS). The basic one is a ``best-effort'' transmission, with no flow control, implemented by functions gamma_send() , gamma_send_2p() , gamma_isend(), gamma_isend_2p() . The other one is a ``flow controlled'' transmission, implemented by functions gamma_send_flowctl() , gamma_send_2p_flowctl() , gamma_isend_flowctl(), gamma_isend_2p_flowctl() .
In all cases, however, the GAMMA protocol implements policies for detecting packet losses and is able to retransmit missing packets. GAMMA is not particularly efficient in recovering from packet losses; indeed, we argue that the probability of a packet loss due to a hardware error is negligible, and the possibility of loosing packets due to LAN congestion is better managed by preventing congestion from occurring (for instance, by using smart communication patterns in parallel jobs). In other words, GAMMA does provide packet retransmission because it is of great help indeed, but does not take care of efficiency when retransmission occurs because this is supposed to be rare.
We report here the measurement of the communication performance achieved at the application level. The following definitions hold:
Message delay, D(S), is half the round-trip time with a message of size S bytes, as measured by running a Ping-Pong GAMMA microbenchmark.
Latency is the message delay D(Smin) where Smin is the smallest message size allowed by the communication system. Smin = 0 with GAMMA, whereas Smin = 1 with TCP. With GAMMA, a zero-byte message is written to the NIC as a GAMMA frame header with no frame body; the NIC transmits it as a GAMMA frame header padded with a ``garbage'' body, to match the 72 byte min. frame size required by the IEEE 802.3 standard.
End-to-end throughput, T(S), is the transfer rate of the
whole communication path, from the sender to the receiver:
T(S) = S/D(S).
Note: This definition implies that the throughput
is measured using the ping-pong microbenchmark. Other techniques for
throughput measurement are based on the transmission of long streams of
messages from sender to receiver without any data flowing back; such techniques
will measure a different thing, that we call transmission throughput
(see below).
Asymptotic bandwidth, B, is the value of the following ``bandwidth'' function B(S) as the message size S > 0 approaches infinity: B(S) = (S - Smin)/(D(S) - D(Smin)). Since B(S) - T(S) approaches zero as S approaches infinity, B is commonly evaluated by measuring the end-to-end throughput T(S) with S very large.
Transmission throughput is the transfer rate perceived at the sender side, that is, the data rate at which an infinite stream of messages of fixed size can be pushed into the network without causing any data loss. This requires to run a different microbenchmark compared to ping-pong: Rather than exchanging a single message back and forth, the sender transmits a long stream of messages to the receiver and measures the time spent for the whole transmission. Clearly, the transmission throughput can be greater than the end-to-end throughput as it not necessarily takes into account the overhead and possible bottlenecks at the receiver side of communication.
We report the performance numbers and graphs of the gamma_send_flowctl() send routine, which provides flow-controlled data transfer. Best-effort routines like gamma_send() offer marginally better performance with less reliability.
In a cluster of two single-CPU PCs, each with
AMD Athlon K7 500 MHz CPU, |
100 MHz memory bus, |
Linux 2.2.13 with the GAMMA device driver |
NIC | Latency | Asymptotic bandwidth | Throughput graphs |
DEC DE-500 BA (DEC 21143 chipset) | 14.2 µs | 12.1 MByte/s |
end-to-end throughput, transmission throughput (unidirectional stream) |
3COM 3c905C | 17.7 µs | 12.1 MByte/s |
end-to-end throughput, transmission throughput (unidirectional stream) |
Intel EtherExpress Pro/100 | 24.5 µs | 12.1 MByte/s |
It is clear from the above numbers that GAMMA exploits about 96% of the nominal link speed of Fast Ethernet (12.5 MByte/s) with long messages.
This performance remarkably improves over the communication performance of Linux TCP/IP sockets. Indeed, GAMMA is currently the most performing messaging system running on low-cost NOW configurations, yields the best price/performance ratio in the whole range of NOW architectures, and rivals many communication layers running on commercial multiprocessors, according to the table below:
Platform |
|
|
|
||||||
Fast Messages | Myrinet (1.28 Gbit/s) | 9.6 |
100.5 |
||||||
GAMMA | 100base-T (DEC 21143 chipset), K7 500 | 14.3 |
12.1 |
||||||
Fast Messages | 1000base-F (Packet Engines GNIC-II?) | 14.7 |
81.5 |
||||||
M-VIA | 100base-T (DEC 21143 chipset), Pentium II 400 | 23 |
11.9 |
||||||
U-Net | 100base-T (DEC 21140 chipset), Pentium | 30.0 |
12.1 |
||||||
U-Net | 140 Mbit/s ATM | 35.5 |
14.8 |
||||||
Linux 2.0.36 TCP/IP sockets | 100base-T (DEC 21143 chipset), Pentium II 350 | 58 |
10.5 |
||||||
Thinking Machines CM-5 CMAML ports | custom (Fat Tree) | 15.0 |
8.5 |
||||||
IBM SP-2 MPL | custom | 44.8 |
34.9 |
||||||
Cray T3D PVMFAST | custom | 30.0 |
25.1 |
We report the performance numbers and pictures of the gamma_send_flowctl() send routine, which provides flow-controlled data transfer. Best-effort routines like gamma_send() offer marginally better performance with less reliability.
In a cluster of two single-CPU PCs, each with
Pentium III CPU @1GHz, |
133 MHz FSB, |
66 MHz 64 bit PCI bus, |
SuperMicro 370DE6 motherboard, ServerSet III HE-SL chipset |
Linux 2.4.16 with the GAMMA device driver |
NIC | Latency | Asymptotic bandwidth | Throughput pictures |
NetGear GA620 (Alteon TIGON-II) | 32 µs |
82 MByte/s (normal frames), 123.6 MByte/s (Jumbo frames, MTU 5140 bytes) |
end-to-end throughput, |
NetGear GA621 | 8 µs |
118.5 MByte/s (normal frames), 122 MByte/s (Jumbo frames, MTU 4116 bytes) |
curves of end-to-end throughput and transmission throughput coming soon |
3COM 3c996 | 12 µs | 102 MByte/s (Jumbo frames) | curves of end-to-end throughput and transmission throughput coming soon |
With the Netgear GA621, GAMMA achieves a 95% exploitation of the nominal link speed of Gigabit Ethernet (125 MByte/s). This is a very good result, given the very small size of the standard MTU (1500 bytes). Increasing the MTU size up to 4116 bytes (``Jumbo Frames'') allows to achieve almost 98% of the nominal link speed.
GAMMA improves quite a lot over the communication performance of Linux TCP/IP sockets on the same interconnect. Indeed, GAMMA is currently the most performing messaging system for Gigabit Ethernet, according to the table below:
Platform |
|
|
|
||||||
BIP | Myrinet (1.28 Gbit/s), Pentium Pro | 4.3 |
125.6 |
||||||
PM | Myrinet (1.28 Gbit/s), Pentium 166 MHz | 7.5 |
113.5 |
||||||
GAMMA | 1000base-F (Negear GA621), Pentium III 1 GHz | 8.5 |
122 |
||||||
Fast Messages | Myrinet (1.28 Gbit/s) | 9.6 |
100.5 |
||||||
Fast Messages | 1000base-F (Packet Engines GNIC-II?) | 14.7 |
81.5 |
||||||
M-VIA | 1000base-F (Packet Engines GNIC-II), Pentium II 400 | 19 |
60 |
||||||
GigaE-PM | 1000base-F (Essential EC-440-SF), Pentium 150 MHz | 24 |
58.3 |
||||||
GAMMA | 1000base-F (Netgear GA620, Alteon TIGON-II), Pentium III 1 GHz | 32 |
123.6 |
||||||
Linux 2.1.131 TCP/IP sockets (quoted from M-VIA FAQs) | 1000base-F (Packet Engines GNIC-II), Pentium II 400 | 59 |
31 |
Note: The above performance table reports performance numbers from messaging systems running on two very different Gigabit technologies, namely Gigabit Ethernet, and Myrinet. Comparing a Gigabit Ethernet-based communication layer to a Myrinet-based one is not necessarily fair: Gigabit Ethernet measurements are usually taken by connecting two machines back-to-back; this is not the case with Myrinet. A fair comparison should take the hardware latency of a Gigabit Ethernet switch into account (3 - 4 usec).
The requirements to install GAMMA are:
If you plan to install GAMMA, please read the README_GAMMA file before starting the installation. The GAMMA source distribution itself contains a quite complete manual for installation and troubleshooting. Should you need additional information (e.g. you find the installation instructions unclear or incomplete), you can contact Giuseppe Ciaccio, ciaccio@disi.unige.it.
The current release of GAMMA for Linux 2.4.21 is dated 13 August 2004.
Supported NICs:
GAMMA allows more process instances of the same parallel job to run on the same CPU. Thread safety is granted with point-to-point communications when running on distinct GAMMA ports. However, collective routines (barrier synchronization, and broadcast) are not thread compatible, as they use predetermined GAMMA ports.
The GAMMA driver for 3COM 3c905C may show a transient failure the first time a GAMMA (or MPI/GAMMA) job is launched after a boot-up of the cluster.
The packet retransmission mechanism is not yet perfect. It will not work if the missing packet originates from a non-blocking send.
Please send suggestions and comments to:
Giuseppe Ciaccio, ciaccio@disi.unige.it