The Big Red Cluster

Outages and Hardware Service

2008

Date System(s) Problem
May 1 GPFS At 6:18pm, the GPFS file system became inaccessible due to a hardware failure. Access was restored at 9:36pm. Running jobs were affected by this outage.
Apr 5 Myrinet At 3:39pm, the Myrinet network went down due to a mapper issue. The Myrinet network was returned to service at 11:47am on April 6.
Apr 5 GPFS At 3:39pm, the GPFS file system became inaccessible due to a hardware failure. Access was restored at 10:35am on April 6.
Feb 4 Big Red At 11:00pm, Big Red was returned to service. Cooling towers were repaired, and power maintenance and network upgrades were performed. Regularly scheduled maintenance for 2/5/2008 is canceled.
Feb 3 Big Red At 10:23am on February 3, the Big Red machine room experienced a loss of cooling. Admins shut down the machine at 10:30am.

2007

Date System(s) Problem
Nov 28 GPFS GPFS was inaccessible from 12:34pm until 8:59pm. Running jobs were affected by this event.
Nov 16 GPFS GPFS was inaccessible from 4:59pm until Nov 17 at 2:03am. Running jobs were affected by this event.
Nov 1 GPFS GPFS was inaccessible from ~4pm until 6:20pm. Running jobs were affected by this event.
Oct 25 GPFS GPFS was inaccessible from ~midnight until 12:47am. Running jobs were not affected by this event.
Oct 23 GPFS GPFS was inaccessible from ~11:00pm until Oct 24 at 9:00am. Running jobs were not affected by this event.
Oct 20 GPFS GPFS was inaccessible from 8:47am to 12:25pm. Running jobs were affected by this event.
Oct 19 GPFS GPFS was inaccessible from 5:54pm until Oct 20 at 2:26am. Running jobs were affected by this event.
Oct 16 GPFS GPFS was inaccessible from ~8:00pm until Oct 17 at 12:07am. Running jobs were not affected by this event.
Oct 7 GPFS GPFS was inaccessible from ~4:26pm until 10:16pm. Running jobs were not affected by this event.
Sep 30 Big Red, GPFS Power outage, ~6:25am to 3:49pm EDT. All systems were down during this event.
Sep 25 Big Red, GPFS Power outage, 4:14pm to 10:08pm EDT. All systems were down during this event.
Jul 4 Big Red, GPFS Power outage, ~9:00am to 2:30pm EDT. All systems were down during this event.
Jun 21 GPFS Failure of several blades during benchmarking of expansion hardware resulted in GPFS instability from approximately 3:45pm until 5:50pm. GPFS recovered without a restart, though remounts did occur on several nodes; some jobs were lost.
Apr 27 GPFS Communication failure during test of a firmware update process resulted in GPFS instability from approximately noon until 5pm. GPFS restarted; no data was lost.
Apr 17 Login nodes Network configuration change on campus resulted in a routing error; access to login nodes was unavailable from approximately 2:30pm until 3:00pm.
Apr 10-11 Racks 1-8 NFS server issues resulted in hanging NFS mounts (see Apr 6 outage).
Apr 6 Rack 9 10:45am - 3:00pm; NFS server issues resulted in hanging NFS mounts on the compute blades.
Mar 7-10 GPFS Switch and disk issues resulted in multiple GPFS outages.
Jan 31 GPFS Failed disk controller resulted in a GPFS outage from 16:27 until 22:15 EST. No files appear to have been lost during the rebuild process.

2006

Date System(s) Problem
Nov 30 Rack 4 Outage due to failed /dev/sdb (intermittent SCSI errors) in the image server, s4. This resulted in a loss of access to the NFS exports for all blades in rack 4. Resolved following a power-cycle of s4. Began 6:40am, resolved 10:30am.
Nov 6-7 GPFS Outage due to defective SFP on DDN disk controller, which resulted in SAN switch problems. Began 11:08am, 11/6, resolved 9:41am, 11/7.
Sep 18 GPFS Three storage hosts (storage8u, 11u, and 15u) were affected during last night's storm. Some NSDs are unavailable to GPFS.