National Computational Infrastructure
NCI National Facility
Newsletter June 2002

Major review of all MAS projects

Demand for resources under the Merit Allocation Scheme (MAS) now far exceeds supply. As a result, the Merit Allocation Committee (MAC) has decided to conduct a major review of all projects later this year.

At that time, the Principal Investigators of all MAS projects will be invited to apply for resources for the period January to December 2003. Before then, the MAC will formulate further guidelines for applications and the proposal forms will be modified accordingly. It will be particularly important to provide evidence of progress and to explain why use of the National Facility is appropriate and critical to your work.

Using your grant

All users are reminded that the provision of an allocation does not guarantee that the time will be fully usable. In particular, larger projects must use their allocations at a steady rate and cannot expect to use a large part of their grant towards the end of each quarter.

As a guide, using every ~2000 hours granted in a quarter requires your jobs to keep, on average, 1 processor running 24 hours a day, 7 days a week throughout the quarter or, equivalently, 24 processors running for 1 hour per day.
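The arithmetic behind this guideline can be sketched as follows. This is an illustrative calculation only: the function name and the 91-day quarter length are assumptions made for the example, not part of any National Facility tool.

```python
# Illustrative only: average number of processors a project must keep busy
# around the clock to use its whole grant at a steady rate.
HOURS_PER_DAY = 24
DAYS_PER_QUARTER = 91  # assumption: real quarters run 90 to 92 days


def average_processors_needed(granted_cpu_hours, days=DAYS_PER_QUARTER):
    """Return the mean number of processors that must run 24 hours a day."""
    return granted_cpu_hours / (days * HOURS_PER_DAY)


if __name__ == "__main__":
    # Hypothetical grant sizes, in cpu hours per quarter.
    for grant in (2000, 10000, 50000):
        needed = average_processors_needed(grant)
        print(f"{grant:>6} hours -> {needed:.1f} processors on average")
```

Under these assumptions a 2000 hour grant works out at a little under one processor running continuously, which matches the rule of thumb above.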

Access to /massdata

Due to continuing difficulties with massdata access, the filesystem /massdata was unmounted on Monday June 24th. As a result, massdata directories can now be accessed only through utility commands, which helps ensure efficient usage. These utilities are mdss, netcp and netmv. See the specific man pages or the userguide web page for details of usage.

Quotas on mass data storage

File space used under /massdata/project is now subject to quotas in the same way as /home and /short. Usage is monitored and is reported as part of the output of the quota -v command on the sc. If a project is found to be over quota, each member of the project group is sent an email warning that, unless usage is back under quota within a 5 day grace period, the project's access to the batch queues on the sc will be denied until the file space used is back under quota.

Since this monitoring occurs relatively infrequently (unlike for /home and /short, which are checked every five minutes), a new utility is available to allow project groups who believe they are back under quota on the Mass Data Store to reactivate their queue access.

The utility is invoked as either of

reactivate_queue_access
reactivate_queue_access -P project

In the first case, the project named in the $PROJECT environment variable is examined for reactivation. The command reports the updated massdata usage for the project as a number of 512 byte blocks (i.e. the output of a du command) and, depending on this new usage information, reactivates queue access for the project.
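Since the usage is reported in 512 byte blocks, it can help to convert the figure before comparing it with a quota. The sketch below is a stand-alone illustration; the function names and the idea of expressing the quota in gigabytes are assumptions, and it does not reproduce how reactivate_queue_access itself makes its decision.

```python
# Illustrative only: convert a usage figure given in 512 byte blocks
# into gigabytes and compare it with a (hypothetical) quota.
BLOCK_SIZE_BYTES = 512  # block size stated in the newsletter
BYTES_PER_GIGABYTE = 1024 ** 3


def blocks_to_gigabytes(blocks):
    """Convert a count of 512 byte blocks to gigabytes."""
    return blocks * BLOCK_SIZE_BYTES / BYTES_PER_GIGABYTE


def is_under_quota(used_blocks, quota_gigabytes):
    """Return True if the usage does not exceed the quota."""
    return blocks_to_gigabytes(used_blocks) <= quota_gigabytes


if __name__ == "__main__":
    used = 41943040  # hypothetical usage figure in blocks
    print(f"{used} blocks = {blocks_to_gigabytes(used):.1f} GB")
```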

Memory upgrade

During the upgrade to the SC on May 7th further memory was added. The new configuration is:
  • 80 4cpu nodes of 4GB
  • 36 4cpu nodes of 8GB
  • 4 4cpu nodes of 16GB

The amount of memory available to single cpu jobs is limited to slightly less than half the total physical memory of a node. Thus jobs requiring 2GB/cpu or more are limited to 40 of the SC nodes (the 36 8GB nodes and the 4 16GB nodes), and those requiring 4GB/cpu or more are limited to the 4 16GB nodes. As usual, users should not request more memory for a job than it needs.
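The node counts quoted above follow directly from the configuration list. The sketch below reproduces them, modelling "slightly less than half" as "strictly less than half of the node's memory"; that threshold, and the function name, are assumptions for illustration, and the exact limit enforced on the SC may differ.

```python
# Illustrative only: count the nodes on which a single cpu job with a given
# memory requirement could run.
# Each entry is (number of nodes, GB of physical memory per node).
NODE_CONFIG = [(80, 4), (36, 8), (4, 16)]


def eligible_nodes(gb_per_cpu, config=NODE_CONFIG):
    """Count nodes whose per-job limit (assumed: under half the node's
    memory) is large enough for a job needing gb_per_cpu gigabytes."""
    return sum(count for count, node_gb in config if gb_per_cpu < node_gb / 2)


if __name__ == "__main__":
    for need in (1, 2, 4):
        print(f"{need} GB/cpu -> {eligible_nodes(need)} nodes")
```

A job that asks for 2GB/cpu when it needs less is therefore excluded from two thirds of the nodes, which is one reason to keep memory requests close to actual use.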

More processing power coming

An additional 7 ES45 nodes (28 processors) will be added to the system in about 8 weeks. This will bring the total peak performance of the system to more than 1 Teraflop.

CT&T web page

As part of the Computing Techniques and Tools Expertise Program of APAC, a web site is provided as a single reference site for new techniques and tools. Feedback on content for this site is welcome.

Job turnaround

Users often ask why their jobs take as long as they do to run, since their experience has changed as the SC has become more heavily used. Typically there are over 1000 cpus worth of jobs on the system queued, running or suspended, and fewer than half that many physical cpus. So the average turnaround delay will be roughly equal to the requested runtime. In general, single cpu jobs will not queue for long but will be suspended to make way for parallel jobs, and parallel jobs may queue for some time until the requisite number of cpus is available. Overall, the priority regime of the queuing system aims to have no jobs delayed significantly more than others (in terms of percentage of the requested time).

Because of the increasing demand, we have had occasion to suggest to users that they ensure parallel jobs are using cpus sufficiently well to justify the number requested. For example, a 4 cpu job with %cpu < 40 might be better run as a 2 cpu job. We will continue to promote efficient use of the system to ensure equal access for all users.
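One rough way to apply this rule of thumb is sketched below. It assumes that %cpu is the average utilisation across all requested cpus, so that a 4 cpu job at 40% is keeping about 1.6 cpus busy; that interpretation and the function name are assumptions, and the result is only a starting point for a proper scalability test.

```python
# Illustrative only: estimate how many cpus a job is effectively using.
import math


def suggested_cpus(requested_cpus, percent_cpu):
    """Return the number of cpus the job appears to keep busy, rounded up.

    Assumes percent_cpu is the average utilisation over all requested cpus.
    """
    busy = requested_cpus * percent_cpu / 100.0
    return max(1, math.ceil(busy))


if __name__ == "__main__":
    # Hypothetical measurements: (cpus requested, observed %cpu).
    for cpus, pct in ((4, 40), (4, 95), (8, 30)):
        print(f"{cpus} cpus at {pct}% -> consider {suggested_cpus(cpus, pct)} cpus")
```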

Multiple processor policy

Another change to the queuing system has been introduced to encourage optimal use of the system. Parallel jobs requesting more than 4 cpus must now request a multiple of 4 cpus. For example, a job that requires 18 cpus must request (and will be charged for) 20 cpus. The change is due mainly to a restriction in the underlying RMS (MPI management) system and the complexity required to overcome it. With this new restriction, the overall efficiency of scheduling on the system has improved.
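The rounding implied by this policy can be sketched as follows. The function name is hypothetical and the code only illustrates the rule as stated here; it is not part of the queuing system.

```python
# Illustrative only: the cpu count a job must request under the
# multiple-of-4 rule for jobs needing more than 4 cpus.
def cpus_to_request(cpus_needed, multiple=4):
    """Round cpus_needed up to a multiple of 4 when it exceeds 4."""
    if cpus_needed < 1:
        raise ValueError("a job needs at least one cpu")
    if cpus_needed <= multiple:
        return cpus_needed  # jobs of up to 4 cpus are unaffected
    # Ceiling division using integer arithmetic only.
    return -(-cpus_needed // multiple) * multiple


if __name__ == "__main__":
    for needed in (3, 4, 5, 18, 20):
        print(f"needs {needed:>2} cpus -> request {cpus_to_request(needed)}")
```

Because the rounded-up figure is what is charged, it is usually worth choosing a decomposition that uses a multiple of 4 cpus in the first place.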

Optimal resource requests for batch jobs

Batch job requests for all resources (memory, number of cpus, walltime and jobfs) should reflect the requirements of the job as closely as possible. Your job will not start until the requested resources are available, so an excessive request may delay your job start. While running, your job will have dedicated access to the resources requested, so an excessive request may unnecessarily delay other users' jobs. In particular, excessive memory and/or jobfs requests may result in your job tying up a node with large memory or disk whilst jobs in genuine need of these resources are left queued.

Interruptions to service

There will be a somewhat reduced service during the first week of July while replacement hardware is installed, which should reduce the failure rate of nodes. The system will be available but at reduced capacity.

A complete shutdown will occur early on July 18th for electrical and fire testing of the machine room. Following this, further SC hardware replacement will be carried out. It is expected that the system will be available again on the afternoon of July 18th but at a slightly reduced capacity. The system should return to full capacity over the next day or two.

Software

As new packages or later versions of known packages are released with parallel options added, it is important to check that jobs request the optimal number of processors for best performance. This is done by running some short scalability tests and checking that the walltime recorded at the end of each job decreases in proportion to the number of CPUs used. Please contact help@nf.apac.edu.au if you need assistance with this. There is also information available for each of the packages on the National Facility software web page.
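A scalability test of this kind can be summarised with speedup and parallel efficiency figures. The sketch below is one possible way to do so; the function name and the sample walltimes are hypothetical, and the walltimes would in practice be taken from the end-of-job records of your own test runs.

```python
# Illustrative only: summarise short scalability tests by comparing the
# walltime at each cpu count with the run on the fewest cpus.
def scaling_report(walltimes):
    """Return (cpus, speedup, efficiency) tuples, sorted by cpu count.

    walltimes maps the number of cpus to the walltime in seconds.
    Efficiency is 1.0 when walltime falls in proportion to cpus used.
    """
    base_cpus = min(walltimes)
    base_time = walltimes[base_cpus]
    report = []
    for cpus in sorted(walltimes):
        speedup = base_time / walltimes[cpus]
        efficiency = speedup * base_cpus / cpus
        report.append((cpus, speedup, efficiency))
    return report


if __name__ == "__main__":
    # Hypothetical walltimes in seconds for the same short test case.
    tests = {1: 1000.0, 2: 520.0, 4: 290.0, 8: 200.0}
    for cpus, speedup, efficiency in scaling_report(tests):
        print(f"{cpus} cpus: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

When the efficiency drops well below 100% as cpus are added, the job is a candidate for running on fewer processors.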

A number of software packages have recently been updated or installed, and a number are currently being updated. These include Amber, ADF & ADF/BAND, Fluent, Molpro, Mopac, and Portland HPF. Please let us know if you are interested in using CPLEX ILOG, Linda Gaussian, and Gaussview. Parallel Q-chem is available to people with a specific interest as part of the beta testing program. If there is sufficient interest, Parallel Q-chem may be made generally available.

Note that package installations and availability are advertised through the Message of the Day, with more details found on the National Facility's software web page.

Training courses

The APAC National Facility staff provide introductory courses on using the SC and a course on programming using MPI. If there is sufficient demand from any city for one or both of these courses, we are happy to deliver them there. Suggestions for courses on other topics that would be useful to users are also welcome.

Email problems, suggestions, questions to