| 26/08/2008 |
7am: 384P (nodes ac25 to ac30) crashed as a result of
a network problem. These nodes had to be rebooted around 1pm due to a related
node crash. The problems were not fully resolved until around 7pm. |
 |
| 11/08/2008 |
12:00pm: Emergency AC reboot. About 80% of AC was
rebooted to recover from a hung network problem. Recovered at about 2pm. |
 |
| 04/08/2008 |
11:00am: Emergency AC reboot. About 80% of AC was rebooted to recover
from a hung network problem. Recovery took until 3:20pm because a RAID controller was
unresponsive after the reboot. |
 |
| 23/06/2008 |
8:00am: Planned downtime for minor repair of corrupted
/home filesystem. A few long-running jobs were terminated at that time. |
 |
| 29/04/2008 |
3:54pm: The "spontaneous" reboot of an AC node has led to
the failure of 80% of the AC nodes. All jobs on those nodes failed. |
 |
| 17/04/2008 |
2:13am: A cpu failure caused 16 AC nodes to crash or
lose filesystem access. To recover those nodes, the rest of the 1536P
(24 nodes in total) had to be rebooted. The system was tested and
returned to service at 8:10am. |
 |
| 10/04/2008 |
9:50am: A cpu failure caused 11 AC nodes to crash or
lose filesystem access. To recover those nodes, the rest of the 1536P
(24 nodes in total) had to be rebooted. The system was tested and
returned to service at 1:30pm. |
 |
| 18/03/2008 |
1536P system crashed when ac21 reset due to a hardware
problem. |
 |
| 14/03/2008 |
384P system crashed when a router failed. |
 |
| 13/03/2008 |
AC nodes undergoing a rolling kernel upgrade to
resolve some node crashes. |
 |
| Dec 2007 - Jan 2008 |
AC was upgraded to the SuSE SLES10/SP1 / SGI
ProPack5/SP3 operating system. This happened in fits and starts due
to various problems with this OS. Quite a number of issues have
also been resolved by this upgrade. |
 |
| 15/10/2007 |
9am-11:30: AC was down to upgrade the CXFS metadata
servers. This should allow the system to be upgraded to SLES10. |
 |
| 07/10/2007 |
8pm: An interruption to mains power caused many AC
compute nodes to crash, requiring a complete reboot of the system. AC
returned to service at about 3am 08/10/07. |
 |
| 17/09/2007 |
9am-11am: All AC compute nodes were drained and
shut down for a firmware upgrade and reconfiguration into 64-cpu nodes. |
 |
| 07/08/2007 |
A complete crash of AC occurred at about 4pm during a hardware replacement.
All jobs on the system were lost. The system was rebooted and brought back
into service at 11:30pm. |
 |
| 01/05/2007 |
9am-11am: To progress the long-delayed upgrade to SLES10,
we will be having a "quiet period" on AC to upgrade IO system
firmware. All jobs will be suspended and logins disabled for about
2hrs. |
 |
| 27/02/2007 |
At 11pm a severe hail storm caused the power to fail in the machine room.
Recovery commenced 8am 28/02/2007. The jobs running at the time of the
forced shutdown still appear under PBS, but they are definitely gone and
will be cleared from the PBS information once the machine is returned to
service. SGI used the opportunity to perform maintenance tasks, and the
system underwent testing before being returned to full service at around
4:15pm on 28/02/2007.
|
 |
| 07/02/2007 - 08/02/2007 |
A number of AC nodes crashed due to a network
problem. Eventually 80% of AC (the "1536P") had to be rebooted to
correct the problem. |
 |
| 19/01/2007 |
The whole of AC crashed due to a combination of
human error and network and CXFS filesystem vulnerabilities. |
 |
| 11/12/2006 |
The live upgrade of CXFS did not go as planned. The CXFS
filesystem failed during the process leading to all jobs being killed
and the whole cluster being rebooted. |
 |
| 11/12/2006 |
The delayed upgrade of the CXFS metadata servers will occur
on 11/12/06 starting at 8am. All jobs will be suspended and
logins will be disabled while the upgrade occurs. |
 |
| 23/10/2006 |
Complete system reboot due to 'cascading node crashes' followed by a
shutdown of the cluster filesystem. All running jobs were lost. |
 |
| 25/08/2006 |
A PBS glitch on AC wiped the PBS database clean, and as a result
PBS lost information on the currently running jobs. PBS was also
temporarily unavailable. The running jobs which PBS is no longer in
control of ARE being finished, and we will notify the job owners when
output etc is recovered after job completion. New jobs will not be
scheduled on the same cpus as the still running old jobs. |
 |
| 14/08/2006 |
ac reboot - There will be a brief reboot of the ac login node at 1pm
to improve interactive response. The system will be up again in about
5 minutes. Please SAVE FILES and LOG OFF by 1pm. |
 |
| 19/07/2006 |
ANU-wide power outage at 1540 caused most nodes of the ac to fail,
then eventually the cluster file system failed causing loss of all
jobs. |
 |
| 19/06/2006 |
1640 - a system problem arising from the installation of
the new AC nodes caused the shutdown of the global filesystems
(/home, /opt, /short etc). All jobs on all AC nodes
failed as a result. It is also possible that some files may have been
corrupted as a result of the filesystem shutdown - we are not
sure. Our apologies for the loss of work and inconvenience. |
 |
| 12/04/2006 |
11am. A power supply to an internal system router failed, causing many
nodes to crash and the consequent loss of many jobs. Apologies. The
system ran in a reduced configuration until replacement parts were
installed at around 1930 and nodes brought back on-line ~10pm. |
 |
| 30/03/2006 |
1100-1900 Thursday Mar 30: AC Down
During a memory upgrade to a node, something caused most of
the AC nodes to crash. As a result, the majority of jobs
running on the system were lost. Apologies for the loss of
work. A resulting problem with the global filesystem
caused the delay in restarting the system. |
 |
| 06/03/2006 |
0800 - AC will be down most of Monday 6th March to
upgrade firmware in preparation for an OS upgrade and to perform
preventative maintenance.
|
 |
| 17/11/2005 |
1600 - Loss of power to the machine room caused all of lc, ac and
the mass data store to go offline. The systems are being tested
before full return to service.
|
 |
| 17/11/2005 |
1230 - The global filesystems of ac are currently performing some recovery
operations and are affecting a number of jobs that have I/O. We will be
rebooting the interactive node shortly to help resolve the problem.
|
 |
| 29/09/2005 |
1825 - ac is running jobs again.
A network router, believed to have caused two total system crashes,
has been replaced. We apologise for the loss of jobs.
|
 |
| 28/09/2005 |
1100 - ac login node rebooted to clear inode caching problem. |
 |
| 27/09/2005 |
Many nodes of the ac crashed between 1600 and 1730, killing most of the running jobs. PBS will not be scheduling any more jobs until the reason for the crashes is resolved. |
 |
| 19/08/2005 |
8:30am.
The AC will be down to complete the NUMAlink. The login node and 128 cpus will be
available later on Friday 19/8. The rest of the system should be back in
use by Monday 22nd.
|
 |
| 27/06/2005 |
7am.
As part of ongoing machine room power upgrades the AC login node
and some AC compute nodes will be powered off for an hour or two at
7am on Monday June 27th. Most nodes and the queues will continue
to run during this time.
|
 |
| 20/06/2005 |
11:00 - 17:30 System logins were not working due to a file system problem, and all running jobs were lost. |
 |
| 15/06/2005 |
System down 8:00am to install upgraded operating
system |
 |
| 06/06/2005 |
- System made available to users at 13:30.
- Interactive access disrupted at 14:50, but jobs are still running.
- Interactive access resumed around 15:30.
|
 |