SGI Altix AC Downtime Information

26/08/2008 7am: 384P (nodes ac25 to ac30) crashed as a result of a network problem. These nodes had to be rebooted around 1pm due to a related node crash. The problems were not fully resolved until around 7pm.
11/08/2008 12:00pm: Emergency AC reboot. About 80% of AC was rebooted to recover from a hung network problem. Recovered at about 2pm.
04/08/2008 11:00am: Emergency AC reboot. About 80% of AC was rebooted to recover from a hung network problem. Recovery took until 3:20pm because a RAID controller was unresponsive after the reboot.
23/06/2008 8:00am: Planned downtime for minor repair of corrupted /home filesystem. A few long-running jobs were terminated at that time.
29/04/2008 3:54pm: The "spontaneous" reboot of an AC node has led to the failure of 80% of the AC nodes. All jobs on those nodes failed.
17/04/2008 2:13am: A cpu failure caused 16 AC nodes to crash or lose filesystem access. To recover those nodes, the rest of the 1536P (24 nodes in total) had to be rebooted. The system was tested and returned to service at 8:10am.
10/04/2008 9:50am: A cpu failure caused 11 AC nodes to crash or lose filesystem access. To recover those nodes, the rest of the 1536P (24 nodes in total) had to be rebooted. The system was tested and returned to service at 1:30pm.
18/03/2008 1536P system crashed when ac21 reset due to a hardware problem.
14/03/2008 384P system crashed when a router failed.
13/03/2008 AC nodes undergoing a rolling kernel upgrade to resolve some node crashes.
Dec 2007 - Jan 2008 AC was upgraded to the SuSE SLES10/SP1 / SGI ProPack5/SP3 operating system. This happened in fits and starts due to various problems with this OS. Quite a number of issues have also been resolved by this upgrade.
15/10/2007 9am-11:30: AC was down to upgrade the CXFS metadata servers. This should allow the system to be upgraded to SLES10.
07/10/2007 8pm: An interruption to mains power caused many AC compute nodes to crash, requiring a complete reboot of the system. AC returned to service at about 3am 08/10/07.
17/09/2007 9am-11am: All AC compute nodes were drained and shut down for a firmware upgrade and reconfiguration into 64-cpu nodes.
07/08/2007 A complete crash of AC occurred at about 4pm during a hardware replacement. All jobs on the system were lost. The system was rebooted and brought back into service at 11:30pm.
01/05/2007 9am-11am: To progress the long-delayed upgrade to SLES10, we will be having a "quiet period" on AC to upgrade IO system firmware. All jobs will be suspended and logins disabled for about 2hrs.
27/02/2007 At 11pm a severe hail storm caused the power to fail in the machine room. Recovery commenced 8am 28/02/2007. The jobs running at the time of the forced shutdown still appear under PBS, but they are definitely gone and will be cleared from the PBS information once the machine is returned to service. SGI used the opportunity to perform maintenance tasks, and the system underwent testing before being returned to full service at around 4:15pm on 28/02/2007.
07/02/2007-08/02/2007 A number of AC nodes crashed due to a network problem. Eventually 80% of AC (the "1536P") had to be rebooted to correct the problem.
19/01/2007 The whole of AC crashed due to a combination of human error and network and CXFS filesystem vulnerabilities.
11/12/2006 The live upgrade of CXFS did not go as planned. The CXFS filesystem failed during the process, leading to all jobs being killed and the whole cluster being rebooted.
11/12/2006 The delayed upgrade of the CXFS metadata servers will occur on 11/12/06 starting at 8am. All jobs will be suspended and logins will be disabled while the upgrade occurs.
23/10/2006 Complete system reboot due to 'cascading node crashes' followed by a shutdown of the cluster filesystem. All running jobs were lost.
25/08/2006 A PBS glitch on AC wiped the PBS database clean, so PBS lost its information on the currently running jobs. PBS was also temporarily unavailable. The running jobs that PBS no longer controls ARE being finished, and we will notify the job owners when output etc is recovered after job completion. New jobs will not be scheduled on the same cpus as the still-running old jobs.
14/08/2006 ac reboot - There will be a brief reboot of the ac login node at 1pm to improve interactive response. The system will be up again in about 5 minutes. Please SAVE FILES and LOG OFF by 1pm.
19/07/2006 An ANU-wide power outage at 1540 caused most nodes of the ac to fail. The cluster file system eventually failed as well, causing the loss of all jobs.
19/06/2006 1640 - a system problem arising from the installation of the new AC nodes caused the shutdown of the global filesystems (/home, /opt, /short etc). All jobs on all AC nodes failed as a result. It is also possible that some files may have been corrupted as a result of the filesystem shutdown - we are not sure. Our apologies for the loss of work and inconvenience.
12/04/2006 11am. A power supply to an internal system router failed causing many nodes to crash and the consequent loss of many jobs. Apologies. The system ran in a reduced configuration until replacement parts were installed at around 1930 and nodes brought back on-line ~10pm.
30/03/2006 1100-1900 Thursday Mar 30: AC Down

During a memory upgrade to a node, something caused most of the AC nodes to crash. As a result, the majority of jobs running on the system were lost. Apologies for the loss of work. A resulting problem with the global filesystem caused the delay in restarting the system.

06/03/2006 0800 - AC will be down most of Monday 6th March to upgrade firmware in preparation for an OS upgrade and to perform preventative maintenance.
17/11/2005 1600 - Loss of power to the machine room caused all of lc, ac and the mass data store to go offline. The systems are being tested before full return to service.
17/11/2005 1230 - The global filesystems of ac are currently performing some recovery operations and are affecting a number of jobs that have I/O. We will be rebooting the interactive node shortly to help resolve the problem.
29/09/2005 1825 - ac is running jobs again. A network router, believed to have caused two total system crashes, has been replaced. We apologise for the loss of jobs.
28/09/2005 1100 - ac login node rebooted to clear inode caching problem.
27/09/2005 Many nodes of the ac crashed between 1600 and 1730, killing most of the running jobs. PBS will not be scheduling any more jobs until the reason for the crashes is resolved.
19/08/2005 8:30am. The AC will be down to complete the NUMAlink. The login node and 128 cpus will be available later on Friday 19/8. The rest of the system should be back in use by Monday 22nd.
27/06/2005 7am. As part of ongoing machine room power upgrades the AC login node and some AC compute nodes will be powered off for an hour or two at 7am on Monday June 27th. Most nodes and the queues will continue to run during this time.
20/06/2005 11:00 - 17:30 System logins were not working due to a file system problem, and all running jobs were lost.
15/06/2005 System down 8:00am to install upgraded operating system
06/06/2005
  • System made available to users at 13:30.
  • Interactive access disrupted at 14:50, but jobs are still running.
  • Interactive access resumed around 15:30.
