Next: The Tera/Cray Research Inc. T3E. Up: Recount of (almost) available ... Previous: The Sun E10000 Starfire.

The Tera/Cray Research Inc. SV1.

Machine type            Shared-memory multi-vector processor
Models                  SV1-A, SV1-1, SV1 Supercluster
Operating system        UNICOS (Cray Unix variant)
Connection structure    Crossbar
Compilers               Fortran 90, C, C++, Pascal, Ada
Vendor's information    Web page www.cray.com/products/systems/craysv1/
Year of introduction    1998

System parameters:

Model                       Tera/Cray SV1-A    Tera/Cray SV1-1    SV1 Supercluster
Clock cycle                 3.33 ns            3.33 ns            3.33 ns
Theor. peak performance
  Per proc. (64 bits)       1.2/4.8 Gflop/s    1.2/4.8 Gflop/s    1.2/4.8 Gflop/s
  Maximal
    Single frame            19.2 Gflop/s       38.4 Gflop/s       ---
    Multi frame             ---                ---                1.2 Tflop/s
Memory bandwidth
  Memory-Cache              7.7 GB/s           7.7 GB/s           7.7 GB/s
  Cache-CPU                 9.6 GB/s           9.6 GB/s           9.6 GB/s
  Aggregate                 30.7 GB/s          61.4 GB/s          ---
  Node--node (bi-direct.)   ---                ---                1 GB/s
Main memory                 <= 16 GB           <= 32 GB           <= 1 TB
No. of processors           8--16              8--32              <= 1024

Remarks:

The Tera/Cray SV1 is the successor to both the CMOS-based Cray J90 and the Cray T90, which was based on ECL technology. The SV1 systems are CMOS-based and therefore much cheaper to manufacture than the ECL-based systems. In this respect Cray is following the trend set by Fujitsu and NEC a few years ago with their vector systems VPP5000 and SX-5, respectively.

The single-cabinet configurations come in two sizes, the SV1-A and the SV1-1, which can house 4 and 8 processor boards, respectively. Each processor board contains 4 CPUs, each of which can deliver a peak rate of 4 floating-point operations per cycle, amounting to a theoretical peak performance of 1.2 Gflop/s per CPU. In addition, 4 CPUs can be coupled across CPU boards to form a so-called Multi Streaming Processor (MSP), a processing unit with an effective theoretical peak performance of 4.8 Gflop/s. The configuration into MSPs and/or single CPUs is done in software at start-up time. The vector start-up time for single CPUs is smaller than for MSPs, so for short vectors single CPUs might be preferable, while programs containing long vectors should benefit from the MSPs. The number of possible combinations is large, but at least 8 CPUs must be configured as single 1.2 Gflop/s CPUs. So a full SV1-1 cabinet may be configured as 32 single 1.2 Gflop/s CPUs or as 1--6 MSPs with the remaining processors as single CPUs.
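The peak figures and the configuration rule described above can be checked with a little arithmetic. The following Python sketch is only an illustration: the function and constant names are invented here, and it assumes that an MSP always takes exactly 4 CPUs and that the "at least 8 single CPUs" rule is the only constraint.

```python
CLOCK_NS = 3.33        # clock cycle from the system parameters
FLOPS_PER_CYCLE = 4    # peak floating-point operations per cycle per CPU
CPUS_PER_MSP = 4       # CPUs coupled into one Multi Streaming Processor
MIN_SINGLE_CPUS = 8    # CPUs that must stay configured as single CPUs


def peak_gflops_per_cpu():
    """Operations per nanosecond, which is the same number as Gflop/s."""
    return FLOPS_PER_CYCLE / CLOCK_NS


def peak_gflops_per_msp():
    return CPUS_PER_MSP * peak_gflops_per_cpu()


def configurations(total_cpus):
    """List the (number of MSPs, number of single CPUs) pairs that
    satisfy the rule, for a cabinet holding total_cpus CPUs."""
    max_msp = (total_cpus - MIN_SINGLE_CPUS) // CPUS_PER_MSP
    return [(m, total_cpus - m * CPUS_PER_MSP) for m in range(max_msp + 1)]


if __name__ == "__main__":
    print(f"per CPU: {peak_gflops_per_cpu():.1f} Gflop/s")
    print(f"per MSP: {peak_gflops_per_msp():.1f} Gflop/s")
    for n_msp, n_single in configurations(32):   # full SV1-1 cabinet
        print(f"{n_msp} MSPs + {n_single} single CPUs")
```

For a full 32-CPU cabinet the list runs from 0 MSPs with 32 single CPUs up to 6 MSPs with 8 single CPUs, in line with the 1--6 MSP range quoted in the text.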

Another new feature in the SV1 is a combined scalar and vector cache of 256 KB per CPU. This cache is important because the memory bandwidth of 7.7 GB/s per CPU board, shared by the 4 CPUs on the board, amounts to only 0.8 eight-byte operands per cycle per CPU. The cache can ship 1 operand per cycle to a CPU. The relative memory bandwidth is much smaller than what was offered in the former Cray systems, which makes the cache all the more important.
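The operands-per-cycle figures can be reproduced from the bandwidths in the system parameters. The sketch below is illustrative only; the names are invented here, and it assumes that both the 7.7 GB/s and the 9.6 GB/s figures are per-board values shared by 4 CPUs (the source states this only for the 7.7 GB/s figure).

```python
CLOCK_NS = 3.33          # clock cycle from the system parameters
BYTES_PER_OPERAND = 8    # 64-bit operands
CPUS_PER_BOARD = 4       # assumption: bandwidth is shared by 4 CPUs


def operands_per_cycle_per_cpu(board_bandwidth_gb_s):
    """Convert a per-board bandwidth in GB/s to operands per cycle per CPU."""
    operands_per_ns = board_bandwidth_gb_s / BYTES_PER_OPERAND  # GB/s == bytes/ns
    return operands_per_ns * CLOCK_NS / CPUS_PER_BOARD


if __name__ == "__main__":
    print(f"memory -> cache: {operands_per_cycle_per_cpu(7.7):.2f}")
    print(f"cache  -> CPU  : {operands_per_cycle_per_cpu(9.6):.2f}")
```

Under these assumptions the memory path delivers about 0.8 operands per cycle per CPU and the cache path about 1, matching the figures quoted above.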

As with the NEC SX-5, single cabinets can be combined to form a cluster (Supercluster in Cray's terminology) by means of a so-called GigaRing. The GigaRing, which is also used to couple I/O sub-systems, consists of two counter-rotating rings with a bandwidth of 1 GB/s each. Whereas the systems in a single cabinet are SM-MIMD systems, a multi-cabinet Supercluster is a DM-MIMD system and can be operated in parallel only through a parallel programming model such as MPI or HPF.

Measured Performances: The importance of the cache is well illustrated by the performance of the matrix-matrix multiplication in the EuroBen Benchmark. With a single processor (called Single Stream Processing, SSP, by Cray) and with the cache enabled, a peak speed of 999 Mflop/s is observed at a matrix order of n = 300, decreasing to 666 Mflop/s at n = 1000. Without the cache the speed is 625 Mflop/s at n = 300 and slowly increases to about 650 Mflop/s at n = 1000. In MSP mode the behaviour is similar: with the cache the speed at n = 100 is 2.61 Gflop/s, decreasing to 1.41 Gflop/s at n = 1000. Without the cache the observed speed at n = 100 is 1.0 Gflop/s, rising to 1.4 Gflop/s at n = 1000. This means that for modestly sized problems the cache can boost the performance by a factor of 1.5--2. The efficiency in MSP mode is presently not too high: just over 50% of the theoretical peak in a favourable situation. As the MSP facility is very new, one may expect the efficiency to increase considerably in the near future.
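The cache gain and the efficiency relative to the theoretical peak follow directly from the measurements quoted above. The Python sketch below simply tabulates those numbers; the data layout and function names are invented for illustration, and the peak values are the theoretical 1.2 and 4.8 Gflop/s from the system parameters.

```python
PEAK_MFLOPS = {"SSP": 1200.0, "MSP": 4800.0}   # theoretical peak per mode

# mode -> {matrix order n: (with cache, without cache)}, in Mflop/s,
# copied from the measurements quoted in the text
MEASURED = {
    "SSP": {300: (999.0, 625.0), 1000: (666.0, 650.0)},
    "MSP": {100: (2610.0, 1000.0), 1000: (1410.0, 1400.0)},
}


def cache_speedup(mode, n):
    """Ratio of the speed with cache to the speed without cache."""
    with_cache, without_cache = MEASURED[mode][n]
    return with_cache / without_cache


def efficiency(mode, n):
    """Fraction of the theoretical peak reached with the cache enabled."""
    with_cache, _ = MEASURED[mode][n]
    return with_cache / PEAK_MFLOPS[mode]


if __name__ == "__main__":
    for mode, runs in MEASURED.items():
        for n in runs:
            print(f"{mode} n={n:5d}: cache gain x{cache_speedup(mode, n):.2f}, "
                  f"efficiency {efficiency(mode, n):.0%}")
```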




Aad van der Steen
Mon Mar 6 12:40:15 MET 2000