| 03/11/2008 1330 | A filesystem check has been run on all the AC filesystems. A number of files and directories have been moved by the filesystem check to the directory /home/lost+found/. The file names under this directory are the system inode numbers of the original files (this is unavoidable). To find files which belong to you, run the command: Ignore the "Permission denied" errors which you will get from this find command. They refer to directories NOT owned by you, hence not containing any files owned by you. Contact us if you need help recovering files. We hope we have detected all the damage to the filesystems, but there may be problems which we are unaware of. It is fortunate that /short was recovered after this system problem - for a while it appeared that all /short data was lost. This incident highlights the need for all to heed the advice given for /short in the local userguide i.e. that /short is "NOT backed up - users should save to MDS system as necessary." |
| 01/11/2008 | On Sat November 1st at around 6am a hardware RAID controller failed leaving all shared filesystems unavailable. The system was unavailable from then until Monday as we waited for attention from SGI. |
| 21/10/2008 | 7:30am: About 80% of AC crashed. All jobs on ac[1-24] failed. Nodes were rebooted shortly after midday. |
| 26/08/2008 | 7am: 384P (nodes ac25 to ac30) crashed as a result of a network problem. These nodes had to be rebooted around 1pm due to a related node crash. The problems were not fully resolved until around 7pm. |
| 11/08/2008 | 12:00pm: Emergency AC reboot. About 80% of AC was rebooted to recover from a hung network problem. Recovered at about 2pm. |
| 04/08/2008 | 11:00am: Emergency AC reboot. About 80% of AC was rebooted to recover from a hung network problem. Recovery took until 3:20pm because a RAID controller was unresponsive after the reboot. |
| 23/06/2008 | 8:00am: Planned downtime for minor repair of corrupted /home filesystem. A few long-running jobs were terminated at that time. |
| 29/04/2008 | 3:54pm: The "spontaneous" reboot of an AC node has led to the failure of 80% of the AC nodes. All jobs on those nodes failed. |
| 17/04/2008 | 2:13am: A cpu failure caused 16 AC nodes to crash or lose filesystem access. To recover those nodes, the rest of the 1536P (24 nodes in total) had to be rebooted. The system was tested and returned to service at 8:10am. |
| 10/04/2008 | 9:50am: A cpu failure caused 11 AC nodes to crash or lose filesystem access. To recover those nodes, the rest of the 1536P (24 nodes in total) had to be rebooted. The system was tested and returned to service at 1:30pm. |
| 18/03/2008 | 1536P system crashed when ac21 reset due to a hardware problem. |
| 14/03/2008 | 384P system crashed when a router failed. |
| 13/03/2008 | AC nodes undergoing a rolling kernel upgrade to resolve some node crashes. |
| Dec 2007 - Jan 2008 | AC was upgraded to the SuSE SLES10/SP1 / SGI ProPack5/SP3 operating system. This happened in fits and starts due to various problems with this OS. Quite a number of issues have also been resolved by this upgrade. |
| 15/10/2007 | 9am-11:30: AC was down to upgrade the CXFS metadata servers. This should allow the system to be upgraded to SLES10. |
| 07/10/2007 | 8pm: An interruption to mains power caused many AC compute nodes to crash requiring a complete reboot of the system. AC returned to service at about 3am 08/10/07. |
| 17/09/2007 | 9am-11am: All AC compute nodes were drained and shut down for a firmware upgrade and reconfiguration into 64-cpu nodes. |
| 07/08/2007 | A complete crash of AC occurred at about 4pm during a hardware replacement. All jobs on the system were lost. The system was rebooted and brought back into service at 11:30pm. |
| 01/05/2007 | 9am-11am: To progress the long-delayed upgrade to SLES10, we will be having a "quiet period" on AC to upgrade IO system firmware. All jobs will be suspended and logins disabled for about 2hrs. |
| 27/02/2007 | At 11pm a severe hail storm caused the power to fail in the machine room. Recovery commenced 8am 28/02/2007. The jobs running at the time of the forced shutdown still appear under PBS, but they are definitely gone and will be cleared from the PBS information once the machine is returned to service. SGI used the opportunity to perform maintenance tasks, and the system underwent testing before being returned to full service at around 4:15pm on 28/02/2007. |
| 07/02/2007 - 08/02/2007 | A number of AC nodes crashed due to a network problem. Eventually 80% of AC (the "1536P") had to be rebooted to correct the problem. |
| 19/01/2007 | The whole of AC crashed due to a combination of human error and network and CXFS filesystem vulnerabilities. |
| 11/12/2006 | The live upgrade of CXFS did not go as planned. The CXFS filesystem failed during the process leading to all jobs being killed and the whole cluster being rebooted. |
| 11/12/2006 | The delayed upgrade of the CXFS metadata servers will occur on 11/12/06 starting at 8am. All jobs will be suspended and logins will be disabled while the upgrade occurs. |
| 23/10/2006 | Complete system reboot due to 'cascading node crashes' followed by a shutdown of the cluster filesystem. All running jobs were lost. |
| 25/08/2006 | A PBS glitch on AC wiped the PBS database clean, so PBS lost information on the currently running jobs. PBS was also temporarily unavailable. The running jobs which PBS is no longer in control of ARE being finished, and we will notify the job owners when output etc is recovered after job completion. New jobs will not be scheduled on the same cpus as the still running old jobs. |
| 14/08/2006 | ac reboot - There will be a brief reboot of the ac login node at 1pm to improve interactive response. The system will be up again in about 5 minutes. Please SAVE FILES and LOG OFF by 1pm. |
| 19/07/2006 | ANU wide power outage at 1540 caused most nodes of the ac to fail, then eventually the cluster file system failed causing loss of all jobs. |
| 19/06/2006 | 1640 - a system problem arising from the installation of the new AC nodes caused the shutdown of the global filesystems (/home, /opt, /short etc). All jobs on all AC nodes failed as a result. It is also possible that some files may have been corrupted as a result of the filesystem shutdown - we are not sure. Our apologies for the loss of work and inconvenience. |
| 12/04/2006 | 11am. A power supply to an internal system router failed causing many nodes to crash and the consequent loss of many jobs. Apologies. The system ran in a reduced configuration until replacement parts were installed at around 1930 and nodes brought back on-line ~10pm. |
| 30/03/2006 | 1100-1900 Thursday Mar 30: AC down. During a memory upgrade to a node, something caused most of the AC nodes to crash. As a result, the majority of jobs running on the system were lost. Apologies for the loss of work. A resulting problem with the global filesystem caused the delay in restarting the system. |
| 06/03/2006 | 0800 - AC will be down most of Monday 6th March to upgrade firmware in preparation for an OS upgrade and to perform preventative maintenance. |
| 17/11/2005 | 1600 - Loss of power to the machine room caused all of lc, ac and the mass data store to go offline. The systems are being tested before full return to service. |
| 17/11/2005 | 1230 - The global filesystems of ac are currently performing some recovery operations and are affecting a number of jobs that have I/O. We will be rebooting the interactive node shortly to help resolve the problem. |
| 29/09/2005 | 1825 - ac is running jobs again. A network router, believed to have caused two total system crashes, has been replaced. We apologise for the loss of jobs. |
| 28/09/2005 | 1100 - ac login node rebooted to clear inode caching problem. |
| 27/09/2005 | Many nodes of the ac crashed between 1600 and 1730, killing most of the running jobs. PBS will not be scheduling any more jobs until the reason for the crashes is resolved. |
| 19/08/2005 | 8:30am. The AC will be down to complete the NUMAlink. The login node and 128 cpus will be available later on Friday 19/8. The rest of the system should be back in use by Monday 22nd. |
| 27/06/2005 | 7am. As part of ongoing machine room power upgrades the AC login node and some AC compute nodes will be powered off for an hour or two at 7am on Monday June 27th. Most nodes and the queues will continue to run during this time. |
| 20/06/2005 | 11:00 - 17:30 System logins were not working due to a file system problem, and all running jobs were lost. |
| 15/06/2005 | System down 8:00am to install upgraded operating system |
| 06/06/2005 |
|