Linux Cluster Downtime Information

17/06/2008 1455 - 1724 Emergency downtime. Disk problem.
21/05/2007 lc0 down for 7 hours to migrate back to fixed hardware RAID controller. Upgraded lc0 and lc1 PERC firmware too.
27/03/2007 lc down for 1 hour to investigate disk controller problem.
17/03/2007 lc back up. New motherboard in lc0.
27/02/2007 At 11pm a severe hail storm caused the power to fail in the machine room. One of the head nodes of the lc has been damaged and is not able to be booted. Recovery of the lc system is expected to take a couple of days.
6/03/2007: Hardware has been replaced and filesystems are currently being checked.
01/06/2006 1000 - 1700 The lc login nodes (lc0 and lc1) were unavailable while lc0 was reconfigured to use software RAID on its boot disks. Running jobs were suspended during the downtime and should not have been otherwise affected.
22/05/2006 lc is back in service.
20/05/2006 lc is down due to a hardware failure in one of the login nodes. A service call has been placed but the machine will not be back up until Monday the 22nd of May at the earliest.
19/12/2005 - 20/12/2005 Scheduled downtime from 0800 Monday 19 Dec to 1222 Tuesday 20 Dec took longer than expected. LC has been upgraded to CentOS 4.2 (a Red Hat Enterprise Linux 4 clone). We are aware of a few problems (such as VNC not working). Please let us know what doesn't work for you.
17/11/2005 1600 - Loss of power to the machine room caused all of lc, ac and the mass data store to go offline. The systems are being tested before full return to service.
04/07/2005 10:00AM: lc0 and lc1 (the lc login nodes) will be rebooted. Running jobs should not be affected. Access to lc0 and lc1 will be disrupted for about 15 minutes.
24/06/2005 0600 - 0800 the compute nodes will be powered off to allow machine room power upgrades to take place. All jobs that will not complete by that time are not being started. The LC login nodes will not be affected - you can still access your files during this time.
23/06/2005 1700 - 2300 Job submission failures occurred due to problems with a software update after the sc end of service.
07/05/2005 LC will be down until around 1pm to allow work on machine room power supplies. Queues will be drained in advance and jobs that would not finish before the downtime starts will be held.
25/01/2005 Scheduled lc0 and lc1 upgrade at 12:30PM for approximately 45 minutes. Running jobs should not be affected.
02/12/2004 Starting at 6:30am the machine room went down for a power upgrade.
09/08/2004 11am: lc1 crashed. Its filesystems are being checked now and it should hopefully be back in service soon.
24/02/2004 Crash: One of the front end nodes crashed causing many jobs to be killed, login failures and other related problems. Hopefully they have now been resolved. Apologies for any jobs killed.
10/12/2003 Downtime: Interactive access to the LC will be unavailable from 0950 to 1050 to apply software patches.
07/10/2003 Downtime: the LC will be unavailable from 0800 to relocate network switches and fibre terminations. Only connectivity to the LC will be affected; running jobs will be unaffected.
27/08/2003 Downtime: the LC was unavailable from 0900 - 1500 while it was physically moved. Not all nodes have been returned to service yet due to hardware failures.
04/08/2003 Downtime: the LC was unavailable from 8am for replacement of hardware on a front end node and software installation. System back in service at 2pm.
10/07/2003 1700 - There were problems on the sc which also affected jobs on the lc for about an hour.
10/07/2003 9am: One of the lc login nodes is down, which is affecting logins and queueing system operations.
01/07/2003 Another 50 Dell 350 nodes have been added to the system, bringing the total number of compute nodes to 150.
23/05/2003 Extended downtime is finally over. Apologies for the inconvenience. A major overhaul of security was done on the system along with the planned upgrades.
19/05/2003 Downtime: the LC will be unavailable for nearly the entire day to upgrade the operating system and install more disks in the servers. The queues have been set to drain.
Production service commenced on 9 April, 2003.
