| Major review of all MAS projects |
Demand for resources under the Merit Allocation Scheme (MAS) is now far
in excess of supply. As a result, the Merit Allocation Committee
(MAC) has decided to conduct a major review of all projects later this year.
At that time, the Principal Investigators of all MAS projects will be invited to apply for resources for the period January to December 2003. Before then, the MAC will formulate further guidelines for applications and the proposal forms will be modified accordingly. It will be particularly important to provide evidence of progress and to explain why use of the National Facility is appropriate and critical to your work. |
| Using your grant |
All users are reminded that the provision of an allocation does not
guarantee that the time will be fully usable. In particular, larger
projects must use their allocations at a steady rate and cannot expect
to use a large part of their grant towards the end of each quarter.
Users should note that for every ~2000 hours granted in a quarter, your jobs need on average to keep 1 processor running 24 hours a day, 7 days a week throughout the quarter or, say, 24 processors running for 1 hour per day. |
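The arithmetic behind this rule of thumb can be sketched as follows. This is only an illustration: the 91-day quarter and the function names are assumptions made for the example, not part of any National Facility tool.

```python
HOURS_PER_DAY = 24
DAYS_PER_QUARTER = 91  # assumption: a quarter is roughly 13 weeks


def average_processors_needed(granted_hours, days=DAYS_PER_QUARTER):
    """Average number of processors that must be busy around the clock
    for the whole period in order to consume the grant."""
    return granted_hours / (days * HOURS_PER_DAY)


def daily_cpu_hours_needed(granted_hours, days=DAYS_PER_QUARTER):
    """CPU hours that must be consumed each day, on average."""
    return granted_hours / days


if __name__ == "__main__":
    # Hypothetical grant sizes, chosen only to show the scaling.
    for grant in (2000, 10000, 50000):
        print(grant,
              round(average_processors_needed(grant), 2),
              round(daily_cpu_hours_needed(grant), 1))
```

A grant of about 2000 hours works out to a little under one processor running continuously, which is why a large grant cannot realistically be consumed in the last weeks of a quarter.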
| Access to /massdata | Due to continuing difficulties with massdata access, the filesystem /massdata was unmounted on Monday June 24th. As a result, access to massdata directories will now be solely by utility commands to help ensure efficient usage. These utilities are mdss, netcp and netmv. See the specific man pages or the userguide web page (linked in heading to left) for details of usage. |
| Quotas on mass data storage |
File space used under /massdata/project is now subject to quotas, in the
same way as /home and /short.
Usage is monitored, and
is available as part of the output of the quota -v command on the sc.
If a project is found to be over quota, each member of the project
group is sent an email to inform them and warn them that if they are
not back under quota within a 5 day grace period, their project's
access to the batch queues on the sc will be denied until the file space
used is back under quota.
Since the monitoring occurs relatively infrequently (unlike for /home and /short, which are checked every five minutes), a new utility is available to allow project groups who believe they are back under quota on the Mass Data Store to reactivate their queue access. The utility is invoked as either of
reactivate_queue_access
In the first case the $PROJECT environment variable is used as the project to be examined for reactivation. The command reports the updated massdata usage for the project as a number of 512 byte blocks (i.e. the output of a du command) and, depending on this new usage, will reactivate queue access for the project. |
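Since usage is reported in 512 byte blocks, a small conversion helper may make the figures easier to read. The sketch below only illustrates the unit conversion; it is not how reactivate_queue_access works internally, and the quota value passed to it is a hypothetical input.

```python
BLOCK_SIZE_BYTES = 512  # block size stated for the reported usage


def blocks_to_bytes(blocks):
    """Convert a count of 512 byte blocks to bytes."""
    return blocks * BLOCK_SIZE_BYTES


def blocks_to_gb(blocks):
    """Convert a count of 512 byte blocks to gigabytes (2**30 bytes)."""
    return blocks_to_bytes(blocks) / 1024 ** 3


def is_under_quota(used_blocks, quota_gb):
    """Compare reported usage against a quota given in gigabytes.
    The quota figure is supplied by the caller (hypothetical here)."""
    return blocks_to_gb(used_blocks) <= quota_gb


if __name__ == "__main__":
    reported = 2097152  # hypothetical block count
    print(blocks_to_gb(reported), is_under_quota(reported, 1.0))
```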
| Memory upgrade |
During the upgrade to the SC on May 7th
further memory was added. The new configuration is
The amount of memory available to single cpu jobs is limited to slightly less than half the total physical memory of a node. Thus jobs requiring 2GB/cpu or more are limited to 40 of the SC nodes and those requiring 4GB/cpu or more are limited to 4 of the nodes. As usual, users should not request more memory for a job than it needs. |
| More processing power coming | An additional 7 ES45 nodes (28 processors) will be added to the system in about 8 weeks. This will bring the total peak performance of the system to more than 1 Teraflop. |
| CT&T web page | As part of the Computing Techniques and Tools Expertise Program of APAC a web site is provided as a single reference site for new techniques and tools. Feedback on content for this site is welcome. |
| Job turnaround |
Users often ask why their jobs take as long as they do to run, since
turnaround has changed as the SC
has become more heavily used. Typically there are over 1000 cpus' worth
of jobs on the system queued, running or suspended and there are
fewer than half that many physical cpus. So the average turnaround delay
will be roughly equal to the requested runtime. In general, single
cpu jobs will not queue for long, but will be suspended to make way for parallel jobs, while
parallel jobs may queue for some time until the requisite number of
cpus is available. Overall, the priority regime of the queuing system aims
to have no jobs
delayed significantly more than others (in terms of percentage of the requested
time).
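The estimate above follows from a simple sharing argument, sketched here. It assumes every job receives an equal share of the machine in proportion to its request, which is a simplification made for the example and not a description of the actual scheduler.

```python
def expected_turnaround(requested_runtime_hours, cpus_demanded, cpus_available):
    """Return (elapsed_hours, delay_hours) under an idealised equal-share model.

    If jobs worth `cpus_demanded` cpus compete for `cpus_available` physical
    cpus, each job runs for only a fraction of the elapsed time.
    """
    load = max(1.0, cpus_demanded / cpus_available)
    elapsed = requested_runtime_hours * load
    delay = elapsed - requested_runtime_hours
    return elapsed, delay


if __name__ == "__main__":
    # 1000 cpus' worth of jobs on 500 physical cpus (illustrative figures).
    print(expected_turnaround(10, 1000, 500))
```

With demand at twice the supply, a job takes about twice its requested runtime to complete, so the delay is roughly equal to the requested runtime.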
Because of the increasing demand, we have had occasion to suggest to users that they should ensure that parallel jobs are using cpus sufficiently well to justify the number requested. For example, a 4 cpu job with %cpu < 40 might be better run as a 2 cpu job. We will continue to promote efficient use of the system to ensure equal access for all users. |
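One way to read the 4 cpu example is sketched below. It assumes that %cpu is the average utilisation across all requested cpus; that interpretation and the function name are assumptions made for the illustration.

```python
import math


def suggested_cpus(requested_cpus, percent_cpu):
    """Smallest cpu count that could carry the same work at full utilisation.

    Assumes percent_cpu is the average utilisation (0-100) over all
    requested cpus.
    """
    effective = requested_cpus * percent_cpu / 100.0
    return max(1, math.ceil(effective))


if __name__ == "__main__":
    # A 4 cpu job at 40% utilisation does about 1.6 cpus' worth of work.
    print(suggested_cpus(4, 40))
```

This ignores any change in parallel efficiency at the smaller cpu count, so it is a starting point for a test run rather than a guarantee.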
| Multiple processor policy | Another change to the queuing system has been introduced to encourage optimal use of the system. Any parallel job requesting more than 4 cpus must now ask for a multiple of 4 cpus. For example, jobs that require 18 cpus must request (and will be charged for) 20 cpus. The change was due mainly to a restriction in the underlying RMS (MPI management) system and the complexity required to overcome it. With this new restriction, the overall efficiency of scheduling the system has been improved. |
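The rounding rule can be written as a short helper. This is a sketch of the policy as stated above, and the function name is hypothetical.

```python
def cpus_to_request(needed):
    """Number of cpus to request under the multiple-of-4 policy.

    Requests of 4 cpus or fewer are unchanged; larger requests are
    rounded up to the next multiple of 4.
    """
    if needed < 1:
        raise ValueError("at least one cpu is required")
    if needed <= 4:
        return needed
    return -(-needed // 4) * 4  # ceiling division, then scale back up


if __name__ == "__main__":
    for n in (3, 4, 5, 18, 20):
        print(n, cpus_to_request(n))
```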
| Optimal resource requests for batch jobs | Batch job requests for all resources (memory, number of cpus, walltime and jobfs) should reflect the requirements of the job as closely as possible. Your job will not start until the requested resources are available, so an excessive request may delay your job start. While running, your job will have dedicated access to the resources requested, so an excessive request may unnecessarily delay other users' jobs. In particular, excessive memory and/or jobfs requests may result in your job tying up a node with large memory or disk whilst jobs in genuine need of these resources are left queued. |
| Interruptions to service |
There will be a somewhat reduced service during the first week of
July for the installation of replacement hardware which should reduce
the failure rate of nodes. The system will be available but the
capacity will be reduced.
A complete shutdown will occur early on July 18th for electrical and fire testing of the machine room. Following this, further SC hardware replacement will be carried out. It is expected that the system will be available again on the afternoon of July 18th but at a slightly reduced capacity. The system should return to full capacity over the next day or two. |
| Software |
As new packages or later versions of known packages are released with
parallel options added, it is important to check that jobs are requesting
the optimal number of processors for best performance. This is done by
running some short scalability tests and checking that the
walltime recorded at the end of the job is decreasing in proportion to
the number of CPUs used. Please contact help@nf.apac.edu.au
if you need assistance with this. There is also information available
for each of the packages on the National Facility software web page.
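A scalability test of this kind can be summarised with a few lines of code. The sketch below computes speedup and parallel efficiency from recorded walltimes; the timings in the example are invented purely for illustration.

```python
def scaling_report(walltimes):
    """Summarise a scalability test.

    walltimes maps number of cpus -> walltime (any consistent unit).
    Returns a list of (ncpus, speedup, efficiency) relative to the
    smallest cpu count tested. Efficiency near 1.0 means the walltime
    is decreasing in proportion to the number of cpus.
    """
    base_cpus = min(walltimes)
    base_time = walltimes[base_cpus]
    report = []
    for ncpus in sorted(walltimes):
        speedup = base_time / walltimes[ncpus]
        efficiency = speedup / (ncpus / base_cpus)
        report.append((ncpus, speedup, efficiency))
    return report


if __name__ == "__main__":
    # Invented timings: good scaling to 2 cpus, poor scaling to 4.
    for row in scaling_report({1: 100.0, 2: 50.0, 4: 40.0}):
        print(row)
```

In this invented case the 4 cpu run reaches an efficiency of only about 0.6, which would suggest requesting 2 cpus instead.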
A number of software packages have recently been updated or installed, and a number are currently being updated. These include Amber, ADF & ADF/BAND, Fluent, Molpro, Mopac, and Portland HPF. Please let us know if you are interested in using CPLEX ILOG, Linda Gaussian, or Gaussview. Parallel Q-chem is available to people with a specific interest as part of the beta testing program. If there is sufficient interest, Parallel Q-chem may be made generally available. Note that package installations and availability are advertised through the Message of the Day, with more details found on the National Facility's software web page. |
| Training courses | The APAC National Facility staff provide introductory courses on using the SC and a course on programming using MPI. If there is sufficient demand from any city for one or both of these courses, we are happy to deliver them there. Suggestions for courses on other topics that would be useful to users are also welcome. |