«Medium-Range Weather Prediction Austin Woods Medium-Range Weather Prediction The European Approach The story of the European Centre for Medium-Range ...»
The technical work required for the installation was impressive. All the extra power and cooling equipment had to be installed in advance — new condensing units, power cables and motor generators. On 4 December 1985 the boxes containing the new computer were wheeled in the back door of the Computer Hall. Within 48 hours the installation was completed, the machine powered up and testing was begun. The final configuration of the system was ready for testing on 21 December 1985.
In the following years, the system continued to give a stable and on the whole satisfactory service. In early 1988 the CRAY was upgraded to allow implementation of two high-speed data transfer channels connecting it to the archive system, based now on the IBM 3090-150E, which had replaced the IBM 4341 in October 1987, and a direct link to the LCN.
Leaks in the roof of the Computer Hall led to the roof being replaced at some considerable expense by the UK government, the building’s owners, during 1988.
In May 1988, the Council began considering again the Centre’s future computer requirements. It was now planned that in the future the Centre would no longer buy its computers, but would instead buy a “computer service”, which would include the possibility of upgrading the system. A predictable cash flow year-on-year is more manageable than one with the large annual fluctuations that would result from buying large computers every few years. The concept has advantages for the computer manufacturers as well. The intention was to ensure that the Centre would continue to have computer equipment of a standard suitable for its requirements, with financing that the Member States could manage. This “service agreement” concept has continued to work well over the years and was still in use some 17 years later.
The work of a sub-group of the TAC was reported to Council in May 1989. The Council made funds available for preparation of the Computer Hall for installation of the next mainframe.
A CRAY Y-MP 8/864 replaced the last X-MP system in 1990. This system had 8 CPUs with a cycle time of 6 nanoseconds (166 MHz), 512 Megabytes of main memory, 1 Gigabyte of SSD memory and 62 Gigabytes of disk space, with a theoretical peak performance of 2.75 Gigaflops. This was the first supercomputer at the Centre with a Unix operating system. The previous three CRAY systems had used Cray’s proprietary operating system COS. The Y-MP used Cray’s implementation of Unix called UNICOS, based on AT&T System V Unix with Berkeley extensions, and with further enhancements developed by Cray Research. This heralded the gradual introduction of Unix systems at the Centre. In the future all the systems used, from desktop PCs to supercomputers, would run some form of Unix. The operational model was transferred to the Y-MP on 7 November 1990.
The replacement of the two Cyber 855 front-end computers that had been installed in early 1989 with a Cyber 962-11 configuration was also agreed in 1990.
In 1992 a Cray C90/16-256 replaced the Y-MP. This system had 16 CPUs with a cycle time of 4.167 nanoseconds (240 MHz), 2 Gigabytes of main memory, 4 Gigabytes of SSD memory and 120 Gigabytes of disk space.
Each CPU of the C90 produced 4 results per clock cycle, giving a theoretical peak performance of 960 Megaflops per CPU or just over 15 Gigaflops for the whole system. Its installation was not without problems. A chip design problem meant that programs that contained memory-addressing errors could corrupt other, independent programs running on the machine.
This led to a delay in accepting the system, as processors were shipped back a few at a time to the manufacturing plant at Chippewa Falls to be re-engineered. The C90 eventually passed its final acceptance test on 2 January 1993. To compensate for the delay, Cray provided the Centre with a CRAY Y-MP4E system for five months from June.
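The C90 figures quoted above can be checked with a little arithmetic. What follows is only an illustrative sketch, not anything taken from the Centre's or Cray's software, and the function names are invented for the example. It converts a cycle time in nanoseconds to a clock frequency, then multiplies frequency by results per clock cycle and by the number of CPUs to obtain a theoretical peak.

```python
def clock_mhz_from_cycle_ns(cycle_ns):
    """Clock frequency in MHz for a given cycle time in nanoseconds."""
    return 1000.0 / cycle_ns


def peak_mflops(clock_mhz, results_per_cycle, cpus=1):
    """Theoretical peak in Megaflops: frequency x results per cycle x CPUs."""
    return clock_mhz * results_per_cycle * cpus


# Figures quoted in the text for the C90:
# 4.167 ns cycle time (240 MHz), 4 results per clock cycle, 16 CPUs.
c90_clock_mhz = clock_mhz_from_cycle_ns(4.167)             # close to 240 MHz
c90_cpu_mflops = peak_mflops(240, 4)                       # peak for one CPU
c90_system_gflops = peak_mflops(240, 4, cpus=16) / 1000.0  # whole system

print(round(c90_clock_mhz), c90_cpu_mflops, c90_system_gflops)
```

The result agrees with the 960 Megaflops per CPU and "just over 15 Gigaflops" for the whole system given in the text. A theoretical peak of this kind assumes every CPU delivers its maximum number of results on every cycle, which real programs do not achieve.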
Up to this time, all the Cray supercomputers at the Centre, apart from the single processor CRAY-1, were Shared Memory Processor (SMP) systems. Each of the processors in the system could access any part of the memory. In 1994 the Centre entered the new world of distributed memory parallel processing. The Service Agreement with Cray was extended on 7 June 1994, leading to a CRAY-T3D being installed in July/August, as additional equipment to the C90. Final acceptance was passed on 5 October.
This system comprised 128 Alpha microprocessors, each with 128 Mbytes of memory. The processors were connected by a fast interconnect in the form of a 3D-torus. This system was a distributed memory system with each processor “owning” 128 Mbytes of memory. The “PARMACS” message-passing programming paradigm was used to enable processors to access the memory that was attached to the other processors. Substantial changes were made to the forecasting system so that it would operate efficiently on this type of architecture. The T3D itself did not have any disks or network connections — these were provided by a small YMP-2E system connected to it by a 200 Mbytes/sec high-speed channel. The system was well suited to running the operational Ensemble Prediction System.
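The difference between shared and distributed memory can be made concrete with a small sketch. This is not PARMACS and not the Centre's code: it is a toy illustration in Python in which threads stand in for processing elements and queues stand in for the interconnect, and all names are hypothetical. Each simulated processor works only on the data it "owns", and the only way a result reaches another processor is by an explicit message.

```python
import queue
import threading


def distributed_sum(chunks):
    """Sum data spread over simulated processors using explicit messages.

    Each entry of `chunks` is the private data of one simulated processor.
    Processor 0 collects the partial sums that the others send to it.
    """
    n = len(chunks)
    inbox = [queue.Queue() for _ in range(n)]  # one mailbox per processor
    result = {}

    def processor(rank, local_data):
        partial = sum(local_data)              # touches only "owned" memory
        if rank == 0:
            total = partial
            for _ in range(n - 1):
                _sender, value = inbox[0].get()  # blocking receive
                total += value
            result["total"] = total
        else:
            inbox[0].put((rank, partial))      # explicit send to processor 0

    threads = [
        threading.Thread(target=processor, args=(rank, data))
        for rank, data in enumerate(chunks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["total"]


print(distributed_sum([[1, 2], [3, 4], [5]]))
```

On a shared memory machine any processor could simply have read the whole array. On a distributed memory machine the program has to be restructured around such exchanges, which is why substantial changes to the forecasting system were needed.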
On 30 November 1993, the NOS/VE service, which had provided access to the computer system for many years, was terminated. From then on, access was via workstations.
Throughout these years, computer security was becoming more and more important at the Centre as well as in the rest of the world. Trials of access via smart cards began in 1994. These were still used to provide secure access more than ten years later.
On 19 July 1994 an improved version of UNICOS, UNICOS 8, was installed on the CRAY. This was a major improvement over version 7, effectively halving the CPU time used by the operating system. Users’ jobs had 10% more computing time available.
In December 1994, Council approved the cash flow for the period 1996 to 2000 to fund the replacement of the C90. This led to an invitation to manufacturers to tender against a money stream. The responses to the tender were excellent. After considerable debate and deliberation by the tender evaluation board the Director advised Council to accept the offer from Fujitsu Limited.
Massimo Capaldo returned to the Centre after an absence of some 15 years, now as Head of the Operations Department. Immediately he was faced with the challenge of the move from the familiar CRAY systems to the Fujitsu computers. This was a “quantum change, something of a leap in the dark”, but justified by the clearly superior offer from Fujitsu. The Cray team was understandably very disappointed by the decision.
In 1996 a small VPP300/16 system was installed for familiarization and testing, followed by the first of three large Fujitsu VPP systems, the VPP700/46. This initially had 39 Processing Elements (PEs) for computation, another six for I/O and one acting as a “primary-PE” running the batch subsystem and interactive work. This was also a distributed memory system, with each PE having direct access to its own 2 Gigabytes of main memory. But whereas the T3D had scalar processors, each VPP700 PE consisted of a single vector processor, similar to that of the Cray C90, with a theoretical peak performance of 2.2 Gigaflops, giving a total peak performance of around 90 Gigaflops for the “compute nodes”. This Fujitsu system incorporated a very high speed non-blocking crossbar interconnect, which had low latency and very high bandwidth, enabling messages to be passed from any PE to any other PE at speeds of up to 1 Gigabyte per second. On 14 July, it passed all its acceptance tests. The number of processors was increased to 116 in September 1997, to provide a total peak performance for the whole system of over 250 Gigaflops. The VPP ran the operational suite and dissemination from 18 September. The last of the CRAY systems was powered down on 1 October 1996, ending 20 years of contractual relations with Cray Research.
In 1998 a VPP700E with 48 processors was installed. The VPP700E was similar to the VPP700, but with slightly faster processors (2.4 Gigaflops). It was planned to install a VPP5000 system in early 1999, but in a situation reminiscent of the C90 design problem, it was found at a very late stage that there was a design fault in one of the VPP5000 CPU chips, so delivery had to be delayed for several months while this was rectified. At last, in October 1999 the VPP5000, initially with 38 processors, later with 100, was installed. It passed its acceptance tests on 16 February 2000. The VPP5000 Processing Elements were almost a factor of four faster than those of the VPP700 that it replaced, with a theoretical peak performance of 9.6 Gigaflops. The processor had a chip to speed up indirect memory accesses.
Fujitsu dubbed this chip the “LASCAW” chip, after the name of a subroutine in the model code; the chip was designed specifically to improve the performance of this subroutine.
Before the VPP5000 could be fully accepted, the operating system had to be brought into line with that on the other VPPs to make it “Y2K compliant”. At that time other Y2K issues were already being addressed at the Centre. Members of staff were requested to correct year 2000 faults in the software for which they were responsible by October 1998. The first Y2K problem at the Centre actually occurred on the data handling system on 26 September 1997. The CFS data management system used the value 999 to indicate an infinite retention period. Unfortunately 999 days from the first day of the new millennium was 26 September 1997 and on that date CFS started complaining about invalid retention dates!
Capaldo returned to Italy in February 1999 after four years as Head of Operations. He would have liked to stay, and Director Burridge wanted to keep him. However, for administrative reasons, Italy insisted that he return.
He was “proud to have been involved in the huge amount of work during that time: changing from CRAY to Fujitsu, implementation of variational analyses, seasonal prediction, wave forecasting, ECMWF Re-analysis, ensemble prediction and more. We were pioneering lots of new things.” The discussions in WMO concerning commercialization issues led to the Centre’s Operations Department publishing its first Catalogue of Products during his time; his work in Italy before coming to the Centre had prepared him well for dealing with these difficult issues.
Early in 1999 a stand-alone test system was set up to test all the major components of the Centre’s software. Horst Böttger, as Head of the Meteorological Division, had the worrying responsibility of ensuring that, so far as possible, harm to ECMWF operations would be minimised. He, and other Centre staff, contributed to the work of a WMO Working Group on the Y2K problem. At a WMO meeting hosted by the Centre in 1999, it was decided that the Centre would monitor data around the turn of the year, and the provision of information to WMO Members was agreed. The Centre was responsible for informing the nations of the world in real time of problems, or lack of them, with incoming data. It set up an area on its web site, which was able to report the trouble-free arrival, first of Australian and Pacific data, immediately after the hour (and millennium) changed at sequential time zones. An alcohol-free party was organised at the Centre for the night of 31 December, to ensure that relevant staff would be available throughout the night in anticipation of problems. In the event the change to 1 January 2000 passed without major incident, although the date was wrong on some of the plotted charts.
In May 2000, operations were transferred to the VPP5000 system. It was upgraded to its final configuration with 100 processors in July 2000, at which point its sustained performance on the operational model was about 288 Gigaflops, compared to its theoretical peak of 960 Gigaflops.
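The relation between the peak and sustained figures for the VPP5000 is worth spelling out. The short sketch below is illustrative only, with function names chosen for this example. It multiplies the number of Processing Elements by the peak per PE quoted earlier, then expresses the sustained performance as a fraction of that peak.

```python
def system_peak_gflops(pes, peak_per_pe_gflops):
    """Theoretical system peak: number of PEs times the peak of one PE."""
    return pes * peak_per_pe_gflops


def sustained_fraction(sustained_gflops, peak_gflops):
    """Share of the theoretical peak actually achieved by an application."""
    return sustained_gflops / peak_gflops


# Figures from the text: 100 PEs at 9.6 Gigaflops each, 288 Gigaflops sustained.
vpp5000_peak = system_peak_gflops(100, 9.6)
vpp5000_fraction = sustained_fraction(288, vpp5000_peak)

print(f"peak {vpp5000_peak:.0f} Gigaflops, sustained {vpp5000_fraction:.0%} of peak")
```

On these figures the operational model sustained about 30% of the theoretical peak.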
A disaster recovery system was installed in 1999 outside the Computer Hall in a separate building, to hold back-up copies of important data sets.
Planning began in 2000 for the replacement of the Fujitsu. An Invitation to Tender was issued on 23 March 2001. IBM’s offer was judged the best value for money. Early in 2002 a single 32-processor p690 server was delivered as a familiarization and test system, and in the second half of the year Phase 1, comprising two IBM Cluster 1600 systems, was installed and commissioned. Each cluster comprised 30 IBM pSeries p690 servers, each with 32 CPUs with a clock frequency of 1.3 GHz (5.4 Gigaflops peak), logically partitioned into four 8-CPU nodes each with 8 Gigabytes of memory. A “colony” switch, which was an IBM-proprietary interconnect, connected these nodes. Each cluster contained a set of four “nighthawk” nodes connected to the switch to provide the I/O capabilities to the network and to a set of Fibre Channel RAID disk subsystems. There were initial firmware problems with memory and the colony switch adapters. It took a long time to convince IBM of the seriousness of the problem, but it was sorted out just in time to start the acceptance tests. This led Dominique Marbouty, then Head of the Operations Department, to remark “It is frustrating that IBM waits until the last minute to sort out these problems, but it’s amazing what they can do in that last minute!” The first operational forecasts from this system were produced on 4 March 2003.
The Fujitsu VPP systems were decommissioned at the end of March 2003. However, as we saw in Chapter 14, where we discuss Re-Analysis, Fujitsu allowed the VPP700E computer to remain on site for a further month, and ERA-40 was extended to August 2002, hitting the 45-year mark.
The VPP5000 was shipped to Toulouse, where Météo France used it to upgrade its computer system.