| Machine type | Shared-memory multi-vector processor |
|---|---|
| Models | SV1-A, SV1-1, SV1 Supercluster |
| Operating system | UNICOS (Cray Unix variant) |
| Connection structure | Crossbar |
| Compilers | Fortran 90, C, C++, Pascal, Ada |
| Vendor's information Web page | www.cray.com/products/systems/craysv1/ |
| Year of introduction | 1998 |

System parameters:

| Model | Tera/Cray SV1-A | Tera/Cray SV1-1 | SV1 Supercluster |
|---|---|---|---|
| Clock cycle | 3.33 ns | 3.33 ns | 3.33 ns |
| Theor. peak performance | | | |
| Per Proc. (64 bits) | 1.2/4.8 Gflop/s | 1.2/4.8 Gflop/s | 1.2/4.8 Gflop/s |
| Maximal | | | |
| Single frame: | 19.2 Gflop/s | 38.4 Gflop/s | --- |
| Multi frame: | --- | --- | 1.2 Tflop/s |
| Memory bandwidth | | | |
| Memory-Cache | 7.7 GB/s | 7.7 GB/s | 7.7 GB/s |
| Cache-CPU | 9.6 GB/s | 9.6 GB/s | 9.6 GB/s |
| Aggregate | 30.7 GB/s | 61.4 GB/s | --- |
| Node--node (bi-direct.) | --- | --- | 1 GB/s |
| Main memory | <= 16 GB | <= 32 GB | <= 1 TB |
| No. of processors | 8--16 | 8--32 | <= 1024 |

Remarks:
The Tera/Cray SV1 is the successor to both the CMOS-based Cray J90 and the ECL-based Cray T90. The SV1 systems are CMOS-based and therefore much cheaper to manufacture than ECL-based systems. In this respect Cray is following the trend set a few years ago by Fujitsu and NEC with their vector systems, the VPP5000 and the SX-5, respectively.
The single-cabinet configurations come in two sizes, the SV1-A and the SV1-1, which can house 4 and 8 processor boards, respectively. Each processor board contains 4 CPUs, each of which can deliver a peak rate of 4 floating-point operations per cycle, amounting to a theoretical peak performance of 1.2 Gflop/s per CPU. In addition, 4 CPUs can be coupled across CPU boards to form a so-called Multi Streaming Processor (MSP), resulting in a processing unit with an effective theoretical peak performance of 4.8 Gflop/s. The configuration into MSPs and/or single CPUs can be done via software at start-up time. The vector start-up time for single CPUs is smaller than for MSPs, so for short vectors single CPUs might be preferable, while for programs containing long vectors the MSPs should be advantageous. The number of possible combinations is large, but at least 8 CPUs must be configured as single 1.2 Gflop/s CPUs. So a full SV1-1 cabinet may be configured as 32 single 1.2 Gflop/s CPUs or as 1--6 MSPs with the remaining processors as single CPUs.
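The arithmetic behind these peak figures and configuration counts can be sketched as follows. This is only an illustration: the function names are hypothetical, and the sketch assumes that any 4 free CPUs may form an MSP, ignoring whatever board-placement constraints the real system may impose.

```python
CLOCK_NS = 3.33        # clock cycle, from the system parameters table
FLOPS_PER_CYCLE = 4    # peak floating-point operations per cycle per CPU
CPUS_PER_BOARD = 4
CPUS_PER_MSP = 4
MIN_SINGLE_CPUS = 8    # at least 8 CPUs must remain single CPUs


def cpu_peak_gflops(clock_ns=CLOCK_NS, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak per CPU: flops per nanosecond equals Gflop/s."""
    return flops_per_cycle / clock_ns


def cabinet_configurations(boards):
    """List (number of MSPs, number of single CPUs) pairs for one cabinet.

    Assumption: any 4 free CPUs can be grouped into an MSP.
    """
    total_cpus = boards * CPUS_PER_BOARD
    max_msps = (total_cpus - MIN_SINGLE_CPUS) // CPUS_PER_MSP
    return [(m, total_cpus - m * CPUS_PER_MSP) for m in range(max_msps + 1)]


if __name__ == "__main__":
    print(f"Peak per CPU: {cpu_peak_gflops():.1f} Gflop/s")
    print(f"Peak per MSP: {CPUS_PER_MSP * cpu_peak_gflops():.1f} Gflop/s")
    for msps, singles in cabinet_configurations(boards=8):  # full SV1-1
        print(f"{msps} MSPs + {singles} single CPUs")
```

For a full SV1-1 (8 boards) the list runs from 0 MSPs with 32 single CPUs to 6 MSPs with 8 single CPUs, matching the range quoted above.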
Another new feature in the SV1 is a combined scalar and vector cache of 256 KB per CPU. This cache is important because the memory bandwidth of 7.7 GB/s per CPU board, shared by the 4 CPUs on the board, amounts to only 0.8 eight-byte operands per cycle per CPU. The cache can ship 1 operand per cycle to a CPU. The bandwidth relative to the processor speed is much smaller than what was offered in former Cray systems, which makes the cache all the more important.
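The operands-per-cycle figures can be checked with a few lines of arithmetic. The sketch below uses a hypothetical helper name and assumes that both bandwidths in the table are per CPU board and are shared equally by the 4 CPUs on that board; the source states this explicitly only for the memory bandwidth.

```python
def operands_per_cycle(bandwidth_gb_s, clock_ns=3.33, cpus=4, operand_bytes=8):
    """Eight-byte operands delivered per clock cycle to each CPU.

    GB/s multiplied by ns gives bytes per cycle for the whole board;
    dividing by the operand size and the CPU count gives the per-CPU rate.
    Assumption: the bandwidth is divided evenly over the CPUs on a board.
    """
    bytes_per_cycle = bandwidth_gb_s * clock_ns
    return bytes_per_cycle / operand_bytes / cpus


if __name__ == "__main__":
    print(f"Memory to cache: {operands_per_cycle(7.7):.1f} operands/cycle/CPU")
    print(f"Cache to CPU:    {operands_per_cycle(9.6):.1f} operands/cycle/CPU")
```

Under these assumptions the memory path yields about 0.8 operands per cycle per CPU and the cache path about 1, in line with the numbers quoted in the text.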
As with the NEC SX-5, single cabinets can be combined to form a cluster (a Supercluster in Cray's terminology) by means of a so-called GigaRing. The GigaRing, which is also used to couple I/O subsystems, consists of two counter-rotating rings with a bandwidth of 1 GB/s each. Whereas the systems in a single cabinet are SM-MIMD systems, a multi-cabinet Supercluster is a DM-MIMD system and can be used in parallel only through a parallel programming model such as MPI or HPF.
Measured Performances: The importance of the cache is well illustrated by the performance of the matrix-matrix multiplication in the EuroBen Benchmark. With a single processor (called Single Stream Processing, SSP, by Cray) and with the cache enabled, a peak speed of 999 Mflop/s is observed at a matrix order of n = 300, decreasing to 666 Mflop/s at n = 1000. Without the cache the speed is 625 Mflop/s at n = 300 and slowly increases to about 650 Mflop/s at n = 1000. In MSP mode the behaviour is similar: with the cache the speed at n = 100 is 2.61 Gflop/s, decreasing to 1.41 Gflop/s at n = 1000; without the cache the observed speed at n = 100 is 1.0 Gflop/s, rising to 1.4 Gflop/s at n = 1000. This means that for modestly sized problems the cache can boost the performance by a factor of 1.5--2. The efficiency in MSP mode is presently not very high: just over 50% of the theoretical peak in a favourable situation. As the MSP facility is very new, one may expect the efficiency to increase considerably in the near future.
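The cache gain and the efficiency relative to peak can be derived directly from the figures quoted above. The following sketch only rearranges those numbers; the dictionary layout and function names are illustrative choices, and the SSP value without cache at n = 1000 is the approximate 650 Mflop/s given in the text.

```python
PEAK_GFLOPS = {"SSP": 1.2, "MSP": 4.8}  # theoretical peak per processing unit

# (mode, matrix order n) -> (Gflop/s with cache, Gflop/s without cache),
# taken from the measurements quoted in the text.
MEASURED = {
    ("SSP", 300): (0.999, 0.625),
    ("SSP", 1000): (0.666, 0.650),
    ("MSP", 100): (2.61, 1.0),
    ("MSP", 1000): (1.41, 1.4),
}


def cache_gain(mode, n):
    """Ratio of the speed with cache to the speed without cache."""
    with_cache, without_cache = MEASURED[(mode, n)]
    return with_cache / without_cache


def efficiency(mode, n):
    """Fraction of theoretical peak reached with the cache enabled."""
    with_cache, _ = MEASURED[(mode, n)]
    return with_cache / PEAK_GFLOPS[mode]


if __name__ == "__main__":
    for mode, n in MEASURED:
        print(f"{mode} n={n}: cache gain {cache_gain(mode, n):.2f}, "
              f"efficiency {efficiency(mode, n):.0%}")
```

The MSP result at n = 100 corresponds to the "just over 50%" efficiency mentioned above, while at n = 1000 the gain from the cache has nearly vanished in both modes.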