Dipartimento di Informatica e Scienze dell'Informazione

GAMMA Project: Genoa Active Message MAchine

What's New: support for dual-CPU nodes, more process instances and more ports allowed (updated 13 August 2004)

Index

What is GAMMA?
Download and install MPI/GAMMA, a porting of MPICH to the GAMMA messaging system
People involved in the GAMMA project
Hardware requirements
Are GAMMA communications reliable?
Performance
Papers
How to install
Known Bugs
The API of GAMMA

What is GAMMA?

A low latency, high throughput communication system for clusters of PCs
Supports both single and dual CPU processing nodes
Runs on Fast Ethernet and Gigabit Ethernet NICs
SPMD parallel processing with message passing, even multiprogrammed
Compatible with standard network protocols and services
Good programmability thanks to fairly high abstraction level
Reliable thanks to mechanisms for retransmission of missing packets
Implemented as a network device driver for Linux 2.4, and released under GNU GPL

Network Of Workstations (NOWs) and clusters of fast, well-equipped Personal Computers (PCs), interconnected by a standard LAN hardware (like Fast Ethernet, Gigabit Ethernet, Myrinet), and running the Linux operating system, are becoming more and more attractive as cheap and efficient platforms for parallel and distributed applications. The main drawback of a standard NOW/cluster architecture is the poor performance of the standard support to inter-process communication over any LAN hardware. Current implementations of industry-standard communication primitives (RPC), APIs (sockets), and protocols (TCP, UDP) usually show high communication latencies and low communication throughput.

We have developed an experimental system for inter-process communication, called the Genoa Active Message MAchine (GAMMA). GAMMA runs on clusters of PCs, based on Intel as well as AMD CPUs (Intel Pentium, AMD K6, and superior models), connected by Fast Ethernet or Gigabit Ethernet. Such a cluster can be used either as a set of autonomous workstations (in a student lab) or as a parallel computer for Single Program Multiple Data (SPMD) as well as MIMD applications.

The core of GAMMA is a custom device driver under Linux, which operates the Network Interface Card (NIC). The GAMMA driver delivers very low latency, high bandwidth communications using Active Ports, a mechanism derived from Active Messages. Both point-to-point and broadcast communications are provided. Broadcast communication exploits the Ethernet broadcast directly.

The GAMMA driver is able to manage standard IP traffic in addition to GAMMA fast communications. This means that using GAMMA on your LAN will not stop the standard UNIX network services.

The communication mechanisms implemented in the GAMMA driver are made available to application writers through the GAMMA user library. The GAMMA library provides support to application launch, process grouping, point-to-point/broadcast communications based on the Active Ports mechanisms, and some collective routines (barrier synchronization, gather, all-gather; more collective routines are being developed).

GAMMA provides two levels of QoS. The lower QoS level, corresponding to the fastest communications, is a best-effort service. With this service, network congestion and ``hot spots'' may cause the receiver NIC or even the LAN switch to loose packets by overrun. The other QoS level provides flow-controlled communication, therefore ensuring reliability up to hardware faults, at a negligible performance penalty.

Installing the GAMMA driver requires only very marginal changes to the original Linux kernel. The Linux kernel extended with the GAMMA driver has to run on each PC in the cluster.

The current version of GAMMA has been tested with a number of NICs, both Fast Ethernet (DEC 21143 chipset, 3COM 3c905) and Gigabit Ethernet (Alteon TIGON-II chipset, Netgear GA621, 3COM 3c996).

A porting of MPI atop GAMMA is now available.

People involved in the GAMMA project

Giovanni Chiola
Giuseppe Ciaccio
Master students involved so far in the GAMMA project at DISI: Alessio Gregori (first implementation of a flow control mechanism in GAMMA), Luca Landi (author of the GAMMA driver for the TIGON-II Gigabit Ethernet chipset).

External collaborations

Politecnico di Torino:
Marco Maniezzo.
Marco is the author of the GAMMA driver for the Broadcom BCM5700 chipset (AKA ``TIGON 3''), equipping many Gigabit Ethernet adapters including the 3COM 3c996 (the only one actually tested).
Institut fur Informatik, Universitat Potsdam:
Marco Ehlert (Master Student), under supervision of prof. Bettina Schnor.
Marco is the author of the GAMMA driver for the Netgear GA621/GA622 Gigabit Ethernet adapter.
Università di Roma ``La Sapienza'':
Luigi Mancini, Guido A. Guerreschi, Pierluigi Rotondo(former Master Students).
They contributed to the initial development of GAMMA drivers for DEC ``tulip'' and Intel EtherExpress Pro/100 Fast Ethernet NICs.
Consorzio per le Applicazioni di Supercalcolo per Universita' e Ricerca (CASPUR), Universita' di Roma ``La Sapienza'':
Vincenzo Di Martino, Leonardo Valcamonici.
They contributed to the development and testing of the GAMMA driver for the Packet Engines GNIC-II Gigabit Ethernet NIC (no longer supported).
Dipartimento di Informatica, Sistemi e Produzione, Università di Roma ``Tor Vergata'':
Alessio Menale (former Master Student), Salvatore Filippone.
They developed the GAMMA driver for the 3COM 3c905[rev.B, B, C] NICs, in cooperation with the GAMMA team.
Computer Engineering Group (Dipartimento di Ingegneria dell'Informazione, Università di Parma):
PARMA² Project.

Hardware requirements

A pool of PCs (Intel Pentium, AMD K6, or superior models), single-CPU as well as dual-CPU.

Each PC should have a NIC chosen in the following list:

Netgear GA621 (and possibly GA622) Gigabit Ethernet adapters
3COM 3c996 Gigabit Ethernet adapter, and possibly many other NICs equipped with the Broadcom BCM570{0,1,2,3} chipsets
any Gigabit Ethernet adapter equipped with the Alteon TIGON-II chipset (NetGear GA620, Alteon AceNIC, 3COM 3c985B)
3COM 3c905[rev.B, B, C] (Fast Ethernet)
any Fast Ethernet adapter equipped with the DEC DS21143 / Intel DS21145 ``tulip'' chipsets and clones
Intel EtherExpress Pro/100 (Fast Ethernet)

The PCs must be networked together by a Gigabit Ethernet switch or a Fast Ethernet switch or hub.

In our own installation, the PCs are also networked by an additional LAN for standard IP traffic and services. Such additional LAN is not strictly required by GAMMA, as the GAMMA network device driver can manage both GAMMA and IP communications on the same LAN; however, separating GAMMA communications from IP will significantly boost performance.

Are GAMMA communications reliable?

GAMMA provides two levels of Quality of Service (QoS). The basic one is a ``best-effort'' transmission, with no flow control, implemented by functions gamma_send() , gamma_send_2p() , gamma_isend(), gamma_isend_2p() . The other one is a ``flow controlled'' transmission, implemented by functions gamma_send_flowctl() , gamma_send_2p_flowctl() , gamma_isend_flowctl(), gamma_isend_2p_flowctl() .

In all cases, however, the GAMMA protocol implements policies for detecting packet losses and is able to retransmit missing packets. GAMMA is not particularly efficient in recovering from packet losses; indeed, we argue that the probability of a packet loss due to a hardware error is negligible, and the possibility of loosing packets due to LAN congestion is better managed by preventing congestion from occurring (for instance, by using smart communication patterns in parallel jobs). In other words, GAMMA does provide packet retransmission because it is of great help indeed, but does not take care of efficiency when retransmission occurs because this is supposed to be rare.

Performance

We report here the measurement of the communication performance achieved at the application level. The following definitions hold:

Message delay, D(S), is half the round-trip time with a message of size S bytes, as measured by running a Ping-Pong GAMMA microbenchmark.

Latency is the message delay D(Smin) where Smin is the smallest message size allowed by the communication system. Smin = 0 with GAMMA, whereas Smin = 1 with TCP. With GAMMA, a zero-byte message is written to the NIC as a GAMMA frame header with no frame body; the NIC transmits it as a GAMMA frame header padded with a ``garbage'' body, to match the 72 byte min. frame size required by the IEEE 802.3 standard.

End-to-end throughput, T(S), is the transfer rate of the whole communication path, from the sender to the receiver:
T(S) = S/D(S).
Note: This definition implies that the throughput is measured using the ping-pong microbenchmark. Other techniques for throughput measurement are based on the transmission of long streams of messages from sender to receiver without any data flowing back; such techniques will measure a different thing, that we call transmission throughput (see below).

Asymptotic bandwidth, B, is the value of the following ``bandwidth'' function B(S) as the message size S > 0 approaches infinity: B(S) = (S - Smin)/(D(S) - D(Smin)). Since B(S) - T(S) approaches zero as S approaches infinity, B is commonly evaluated by measuring the end-to-end throughput T(S) with S very large.

Transmission throughput is the transfer rate perceived at the sender side, that is, the data rate at which an infinite stream of messages of fixed size can be pushed into the network without causing any data loss. This requires to run a different microbenchmark compared to ping-pong: Rather than exchanging a single message back and forth, the sender transmits a long stream of messages to the receiver and measures the time spent for the whole transmission. Clearly, the transmission throughput can be greater than the end-to-end throughput as it not necessarily takes into account the overhead and possible bottlenecks at the receiver side of communication.

Performance of GAMMA on Fast Ethernet
Performance of GAMMA on Gigabit Ethernet

Fast Ethernet (100 Mbps)

We report the performance numbers and graphs of the gamma_send_flowctl() send routine, which provides flow-controlled data transfer. Best-effort routines like gamma_send() offer marginally better performance with less reliability.

In a cluster of two single-CPU PCs, each with

AMD Athlon K7 500 MHz CPU,

100 MHz memory bus,

Linux 2.2.13 with the GAMMA device driver

connected by a crossover UTP cable, and running a GAMMA ``ping pong'' program, we get the following average communication performance at user level:

NIC	Latency	Asymptotic bandwidth	Throughput graphs
`DEC DE-500 BA (DEC 21143 chipset)`	14.2 µs	12.1 MByte/s	end-to-end throughput, transmission throughput (unidirectional stream)
`3COM 3c905C`	17.7 µs	12.1 MByte/s	end-to-end throughput, transmission throughput (unidirectional stream)
`Intel EtherExpress Pro/100`	24.5 µs	12.1 MByte/s

It is clear from the above numbers that GAMMA exploits about 96% of the nominal link speed of Fast Ethernet (12.5 MByte/s) with long messages.

This performance remarkably improves over the communication performance of Linux TCP/IP sockets. Indeed, GAMMA is currently the most performing messaging system running on low-cost NOW configurations, yields the best price/performance ratio in the whole range of NOW architectures, and rivals many communication layers running on commercial multiprocessors, according to the table below:

Platform

Interconnection

network

Latency

(µs)

Asymptotic bandwidth

(MByte/s)

Fast Messages

Myrinet (1.28 Gbit/s)

9.6

100.5

GAMMA

100base-T (DEC 21143 chipset), K7 500

14.3

12.1

Fast Messages

1000base-F (Packet Engines GNIC-II?)

14.7

81.5

M-VIA

100base-T (DEC 21143 chipset), Pentium II 400

11.9

U-Net

100base-T (DEC 21140 chipset), Pentium

30.0

12.1

U-Net

140 Mbit/s ATM

35.5

14.8

Linux 2.0.36 TCP/IP sockets

100base-T (DEC 21143 chipset), Pentium II 350

10.5

Thinking Machines CM-5 CMAML ports

custom (Fat Tree)

15.0

8.5

IBM SP-2 MPL

custom

44.8

34.9

Cray T3D PVMFAST

custom

30.0

25.1

Gigabit Ethernet (1000 Mbps)

We report the performance numbers and pictures of the gamma_send_flowctl() send routine, which provides flow-controlled data transfer. Best-effort routines like gamma_send() offer marginally better performance with less reliability.

In a cluster of two single-CPU PCs, each with

Pentium III CPU @1GHz,

133 MHz FSB,

66 MHz 64 bit PCI bus,

SuperMicro 370DE6 motherboard, ServerSet III HE-SL chipset

Linux 2.4.16 with the GAMMA device driver

connected back-to-back by an fiber-optic cable, and running a GAMMA ``ping pong'' program, we get the following average communication performance at user level:

NIC	Latency	Asymptotic bandwidth	Throughput pictures
NetGear GA620 (Alteon TIGON-II)	32 µs	82 MByte/s (normal frames), 123.6 MByte/s (Jumbo frames, MTU 5140 bytes)	end-to-end throughput,
NetGear GA621	8 µs	118.5 MByte/s (normal frames), 122 MByte/s (Jumbo frames, MTU 4116 bytes)	curves of end-to-end throughput and transmission throughput coming soon
3COM 3c996	12 µs	102 MByte/s (Jumbo frames)	curves of end-to-end throughput and transmission throughput coming soon

With the Netgear GA621, GAMMA achieves a 95% exploitation of the nominal link speed of Gigabit Ethernet (125 MByte/s). This is a very good result, given the very small size of the standard MTU (1500 bytes). Increasing the MTU size up to 4116 bytes (``Jumbo Frames'') allows to achieve almost 98% of the nominal link speed.

GAMMA improves quite a lot over the communication performance of Linux TCP/IP sockets on the same interconnect. Indeed, GAMMA is currently the most performing messaging system for Gigabit Ethernet, according to the table below:

Platform

Interconnection

network

Latency

(µs)

Asymptotic bandwidth

(MByte/s)

BIP

Myrinet (1.28 Gbit/s), Pentium Pro

4.3

125.6

Myrinet (1.28 Gbit/s), Pentium 166 MHz

7.5

113.5

GAMMA

1000base-F (Negear GA621), Pentium III 1 GHz

8.5

122

Fast Messages

Myrinet (1.28 Gbit/s)

9.6

100.5

Fast Messages

1000base-F (Packet Engines GNIC-II?)

14.7

81.5

M-VIA

1000base-F (Packet Engines GNIC-II), Pentium II 400

GigaE-PM

1000base-F (Essential EC-440-SF), Pentium 150 MHz

58.3

GAMMA

1000base-F (Netgear GA620, Alteon TIGON-II), Pentium III 1 GHz

123.6

Linux 2.1.131 TCP/IP sockets (quoted from M-VIA FAQs)

1000base-F (Packet Engines GNIC-II), Pentium II 400

Note: The above performance table reports performance numbers from messaging systems running on two very different Gigabit technologies, namely Gigabit Ethernet, and Myrinet. Comparing a Gigabit Ethernet-based communication layer to a Myrinet-based one is not necessarily fair: Gigabit Ethernet measurements are usually taken by connecting two machines back-to-back; this is not the case with Myrinet. A fair comparison should take the hardware latency of a Gigabit Ethernet switch into account (3 - 4 usec).

Papers

G. Chiola and G. Ciaccio, Operating System Support for Fast Communications in a Network of Workstations,
Technical Report DISI-TR-96-12, May 1996.
This is a report concerning an early prototype, and is slightly out of date. It is now substituted by DISI-TR-96-22.
G. Chiola and G. Ciaccio, Implementing a Low Cost, Low Latency Parallel Platform,
in proceedings of DAPSYS'96 (Third Austro-Hungarian Workshop on Distributed and Parallel Systems), Miskolc, Hungary, October 1996. A revised version appears in Parallel Computing 22 (1997), pages 1703-1717.
G. Chiola and G. Ciaccio, GAMMA: Architecture, Programming Interface and Preliminary Benchmarking,
Technical Report DISI-TR-96-22, November 1996.
This report concerns the current prototype of GAMMA. Some of its parts are still under development. Probably it is already outdated.
G. Chiola and G. Ciaccio, GAMMA: a Low-cost Network of Workstations Based on Active Messages,
in proceedings of PDP'97 (5th EUROMICRO workshop on Parallel and Distributed Processing), London, UK, January 1997.
G. Chiola and G. Ciaccio, Architectural Issues and Preliminary Benchmarking of a Low-cost Network of Workstations based on Active Messages,
in proceedings of ARCS'97 (14th ITG/GI conference on Architecture of Computer Systems), Rostock, Germany, September 1997.
G. Chiola and G. Ciaccio, A Performance-oriented Operating System Approach to Fast Communications in a Low-cost Network of Workstations,
Technical Report DISI-TR-98-01, January 12, 1998.
G. Chiola and G. Ciaccio, Fast Barrier Synchronization on Shared Fast Ethernet,
in proceedings of CANPC'98 (Workshop on Communication, Architecture, and Applications for Network-based Parallel Computing, in conjunction with HPCA-4), Las Vegas, Nevada, February 1998.
LNCS 1362, Springer.
G. Ciaccio, Optimal Communication Performance on Fast Ethernet with GAMMA,
in proceedings of PC-NOW'98 (International Workshop on Personal Computers based Networks Of Workstations, in conjunction with IPPS/SPDP 1998), Orlando, Florida, March 30/April 3, 1998.
LNCS 1388, Springer.
G. Ciaccio, V. Di Martino, Porting a Molecular Dynamics Application on a Low-cost Cluster of Personal Computers running GAMMA,
in proceedings of PC-NOW'98 (International Workshop on Personal Computers based Networks Of Workstations, in conjunction with IPPS/SPDP 1998), Orlando, Florida, March 30/April 3, 1998.
LNCS 1388, Springer.
G. Ciaccio, V. Di Martino, P. Lanucara, Porting the Flame Front Propagation Problem on GAMMA,
in proceedings of HPCN Europe '98 (International Conference and Exhibition on High-Performance Computing and Networking 1998), Amsterdam, The Netherlands, April 21-23, 1998.
LNCS 1401, Springer.
G. Ciaccio, V. Di Martino, Efficient Molecular Dynamics on a Network of Personal Computers,
in proceedings of VECPAR'98 (3rd International Meeting on Vector and Parallel Processing), Porto, Portugal, June 21-23, 1998.
G. Chiola, G. Ciaccio, A Performance-oriented Operating System Approach to Fast Communications in a Cluster of Personal Computers,
in proceedings of PDPTA'98 (1998 International Conference on Parallel and Distributed Processing Techniques and Applications), Las Vegas, Nevada, July 13-16, 1998.
G. Chiola, G. Ciaccio, Active Ports: A Performance-oriented Operating System Support to Fast LAN Communications,
in proceedings of Euro-Par'98, Southampton, UK, September 1-4, 1998.
LNCS 1470, Springer.
G. Chiola, G. Ciaccio, Porting and Measuring the Linpack Benchmark on GAMMA,
in proceedings of DAPSYS'98, Budapest, Hungary, September 1998.
G. Chiola, G. Ciaccio, L. V. Mancini, P. Rotondo, GAMMA on DEC 2114X with Efficient Flow Control,
in proceedings of PDPTA'99 (1999 International Conference on Parallel and Distributed Processing Techniques and Applications), Las Vegas, Nevada, June 28 - July 1, 1999.
G. Ciaccio, A Communication System for Efficient Parallel Processing on Clusters of Personal Computers,
PhD Thesis DISI-TH-1999-02, June 1999.
G. Chiola, G. Ciaccio, Porting MPICH ADI on GAMMA with Flow Control,
in proceedings of MWPP'99 (1999 Midwest Workshop on Parallel Processing), Kent, Ohio, August 11 - 13, 1999.
G. Ciaccio, G. Chiola, GAMMA and MPI/GAMMA on Gigabit Ethernet,
in proceedings of 7th EuroPVM-MPI, Balatonfured, Hungary, September 10 - 13, 2000.
LNCS 1908, Springer.
G. Ciaccio, G. Chiola, Low-cost Gigabit Ethernet at work,
(poster) in proceedings of CLUSTER 2000,Chemnitz, Germany, November 28 - December 1, 2000.
IEEE.
G. Ciaccio, Messaging on Gigabit Ethernet: Some experiments with GAMMA and other systems,
in proceedings of IPDPS '01, workshop CAC '01, San Francisco, California, April 23 - 27, 2001.
IEEE.
Extended version on Cluster Computing 6, 2003, pages 143-151.
G. Ciaccio, M. Ehlert, B. Schnor, Exploiting Gigabit Ethernet Capacity for Cluster Applications,
in proceedings of LCN 2002, workshop on High-Speed Local Networks (HSLN), Tampa, Florida, November 6 - 8, 2002.
IEEE.

How to install

The requirements to install GAMMA are:

at least two PCs (single or dual CPU) with Intel Pentium, AMD K6, or superior CPU models;
Fast Ethernet or Gigabit Ethernet network adapters supported by GAMMA (one adapter on each PC; additional adapters may be installed but GAMMA will not use them);
a 100Mb/s Fast Ethernet repeater hub or switch, or a Gigabit Ethernet switch (as an alternative, a ``back-to-back'' connection can be arranged for just two machines);
an additional LAN, for standard IP traffic (not strictly required, as the GAMMA network device driver can manage both GAMMA and IP communications on the same LAN; yet, separating GAMMA communications from IP can significantly improve performance);
Linux 2.4.21;
C compiler.
We have been successful with egcs-2.91.66 (egcs-1.1.2 release), gcc 2.95.3, and gcc 3.3.
The glorious gcc 2.7.2 cannot be used any longer (it is not recommended for compiling Linux 2.4 itself).

If you plan to install GAMMA, please read the README_GAMMA file before starting the installation. The GAMMA source distribution itself contains a quite complete manual for installation and troubleshooting. Should you need additional information (e.g. you find the installation instructions unclear or incomplete), you can contact Giuseppe Ciaccio, ciaccio@disi.unige.it.

The current release of GAMMA for Linux 2.4.21 is dated 13 August 2004.

NOTE: This is the last release of GAMMA for Linux kernels of the 2.4 series. Next release will be for Linux 2.6.7.

Supported NICs:

3COM 3c905[rev.B, B, C] (Fast Ethernet)
any Fast Ethernet adapter equipped with the DEC DS21143 / Intel DS21145 ``tulip'' chipsets and clones
Intel EtherExpress Pro/100 (Fast Ethernet)
any Gigabit Ethernet adapter equipped with the Alteon TIGON-II chipset (NetGear GA620, Alteon AceNIC, 3COM 3c985B).
Note: the most performing version of GAMMA for these adapters is this one, dated 27 February 2003, and is for Linux 2.4.17.
Netgear GA621 and GA622 Gigabit Ethernet adapters
3COM 3c996 Gigabit Ethernet adapter, and possibly the many other NICs equipped with the Broadcom BCM570{0,1,2,3} chipsets (AKA ``TIGON 3'')

Known Bugs

GAMMA allows more process instances of the same parallel job to run on the same CPU. Thread safety is granted with point-to-point communications when running on distinct GAMMA ports. However, collective routines (barrier synchronization, and broadcast) are not thread compatible, as they use predetermined GAMMA ports.

The GAMMA driver for 3COM 3c905C may show a transient failure the first time a GAMMA (or MPI/GAMMA) job is launched after a boot-up of the cluster.

The packet retransmission mechanism is not yet perfect. It will not work if the missing packet originates from a non-blocking send.

Please send suggestions and comments to:

Giuseppe Ciaccio, ciaccio@disi.unige.it

Last Updated: 13 August 2004