Local Userguide for the SGI Altix AC and Linux Cluster

 

Getting Started

 
Connecting to the SGI Altix AC and Linux Cluster

Once you have obtained an account via one of the mechanisms on the Accounts page, you will be sent an initial email informing you of your login name and project code.

The respective hostnames of the SGI Altix AC and Linux Cluster are

ac.apac.edu.au
lc.apac.edu.au
You can use secure shell (ssh) to connect to the SGI Altix AC and Linux Cluster.

See the Software Page for the details of other Network Access software available.

If you are connecting for the first time, please change your initial password to one of your own choice using the passwd command, which will prompt you as shown below. (Note that % is the command prompt supplied by the interactive "shell", as in all examples in this document; it is not something you type in.)

     % passwd
     Old password:
     New password:
     Re-enter new  password:
Currently you must make this change on the LC rather than the AC; changing your password on the LC also changes it on the AC.
 
Interactive Use and Basic Unix

The operating system on all systems is Unix. A basic guide to Unix operating system commands is available HERE.

When you log in you will come in under the Resource Accounting SHell (referred to as RASH), which is a local shell used to impose interactive limits and account for the time used in each interactive session.

Your account will be set up with an initial environment via a default .login file and an equivalent .profile file, as well as a .rashrc file. The .rashrc file can be edited to change the default project (see Project Accounting) and the command interface shell that RASH starts when you log in. Your initial command interface shell will be tcsh. You can change this to bash by changing the line in .rashrc from

     setenv SHELL /opt/rash/bin/tcsh
to be
     setenv SHELL /opt/rash/bin/bash
instead. (Note that only these two shells are valid on the lc; on the ac, /bin/tcsh or /bin/bash may also be chosen as your login shell. If you try to use a shell that is not registered with RASH for the particular machine, you will default to tcsh.)

Each interactive process you run has imposed on it a time limit and a memory use limit. To see what these limits are enter the command nf_limits. This shows not only the details of the memory limits and time limits for interactive processes, but for batch jobs as well. The limits are not published here as they are liable to change, and it is also possible to vary these limits on an 'as needs' basis by project or user.

If your process exceeds the memory limit, the simple message "Killed" will be returned to you by the Unix operating system. If your process does not exceed the limit, but does exceed the current physical amount of memory available, you might get something a little more informative, like "Not enough space". Beware, such a message will also appear if you have filled your disk quota on your home directory, and your process is trying to open a new file to write to.
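
When a program is launched from a script rather than typed at the prompt, the "Killed" message is easy to miss, and the exit status can be checked instead. The following is a minimal sketch in Bourne shell syntax (usable under bash). It assumes that the limit is enforced with the KILL signal (number 9), which is what the "Killed" message normally indicates, and that the shell reports a child terminated by signal N as status 128 + N, as bash does. The name explain_status is invented for this illustration and is not a command installed on the systems.

```shell
#!/bin/sh
# explain_status is a hypothetical helper (not a system command).
# It turns an exit status into a short description.
explain_status() {
    rc=$1
    if [ "$rc" -eq 0 ]; then
        echo "completed normally"
    elif [ "$rc" -gt 128 ]; then
        # Assumption: death by signal N is reported as 128 + N.
        echo "terminated by signal $((rc - 128))"
    else
        echo "exited with status $rc"
    fi
}

# Stand-in for a program that exceeds the memory limit:
# a child shell that sends itself the KILL signal.
rc=0
sh -c 'kill -9 $$' || rc=$?
explain_status "$rc"
```

In real use the stand-in line would be replaced by the program itself, for example ./a.out.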


   
Project Accounting

All use on the compute systems is accounted against projects. Each project has a single grant of time per three-month quarter, which can be used on either or both of the compute systems. (The grant is NOT per machine; it may be used wherever you choose.)

If your username is connected to more than one project, you are prompted at login for the project to charge each session to. A default project should be available to save typing. Batch job usage will also be charged to the project chosen at login, unless you specify a different project on the qsub command line or within the script file.

To change or set the default project, edit your .rashrc file in your home directory, and change the PROJECT variable as desired. The correct syntax is

    setenv PROJECT x99

For projects allocated time under the Merit Allocation or Partner shares, it is possible to keep submitting jobs to the queues after the project grant is exhausted. The jobs will run at a lower priority.

 
Monitoring Resource Usage

  • nf_limits displays imposed limits and charging rates relevant to the machine it is run on.

  • quotasu -P project -h displays the usage of the project in the current quarter, as well as some recent history of the project if available. Total usage is shown across both machines, but it is also possible to see the usage per queue on each machine.

  • quota -v displays your disk usage and quota in your home directory and the project usages in both the /short/<proj>/ directories and on the Massdata Storage System for the projects which you are connected to. See quota -h for details of other options for the command.

  • nqstat displays status of all running and queued batch jobs.

 
Software Environments

Environment Modules are available on the lc and ac to allow easy customisation of your shell environment to the requirements of whatever software you wish to use. The module command syntax is the same no matter which command shell you are using.
module avail will show you a list of the software environments which can be loaded via a module load package command. module help package should give you a little information about what the module load package will achieve for you. Alternatively module show package will detail the commands in the module file.

 

PBS Batch Use

Most jobs will require greater resources than are available to interactive processes. Larger jobs must be scheduled by the batch job system (which however does allow an interactive mode). The batch system software in use on both machines is a locally modified version of Portable Batch System (PBS)(1), a queueing system similar to NQS. You submit jobs to PBS specifying the number of CPUs, the amount of memory, and the length of time needed (and, possibly, other resources). PBS runs the job when the resources are available, subject to constraints on maximum resource usage.

1. This product includes software developed by NASA Ames Research Center, Lawrence Livermore National Laboratory, and Veridian Information Solutions, Inc. Visit the OpenPBS site for OpenPBS software support, products, and information.
   
Basic commands

The basic PBS commands are the same on both systems.
qstat
Standard queue status command supplied by PBS. See man qstat for details of options. (But see the local nqstat command below.)

nqstat
Local version of qstat. The queue header of nqstat gives the limit on wall clock time and memory for you and your project. The fields in the job lines are fairly straightforward.

qdel jobid
Delete your unwanted jobs from the queues. The jobid is returned by qsub at job submission time, and is also displayed in the nqstat output.

qsub
Submit jobs to the queues. The simplest use of the qsub command is typified by the following example (note that there is a carriage-return after -wd and ./a.out):

   % qsub -P a99 -q normal -l walltime=20:00:00,vmem=300MB -wd
   ./a.out
   ^D     (that is control-D)
or
   % qsub -P a99 -q normal -l walltime=20:00,vmem=300MB -wd jobscript
where jobscript is an ASCII file containing the shell script to run your commands (not the compiled executable, which is a binary file). More conveniently, the qsub options can be placed within the script to avoid typing them for each job:
   #!/bin/csh
   #PBS -P a99 
   #PBS -q normal 
   #PBS -l walltime=20:00:00,vmem=300MB 
   #PBS -wd
   ./a.out
You submit this script for execution by PBS using the command:
   % qsub jobscript
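
When many similar jobs are submitted, the script itself can be produced by a small helper. The following is a sketch only, in Bourne shell syntax. The function name make_jobscript and its argument order are invented for this illustration; a99, normal and the resource values are the placeholder values used in the example above.

```shell
#!/bin/sh
# make_jobscript is a hypothetical helper: it prints a PBS job script to
# standard output. Arguments: project, queue, resource list, command to run.
make_jobscript() {
    cat <<EOF
#!/bin/csh
#PBS -P $1
#PBS -q $2
#PBS -l $3
#PBS -wd
$4
EOF
}

# Print the same script as in the example above.
make_jobscript a99 normal walltime=20:00:00,vmem=300MB ./a.out
```

Redirecting the output to a file (for example > jobscript) gives a script that can then be submitted with qsub jobscript.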

You may need to enter data into the program and may be used to doing this interactively when prompted by the program. There are two ways of doing this in batch jobs. Suppose, for example, that the program requires the numbers 1000 then 50 to be entered when prompted. You can either create a file called, say, input containing these values

   % cat input
   1000
   50
then run the program as
   ./a.out < input
or the data can be included in the batch job script as follows:
   #!/bin/csh
   #PBS -P a99 
   #PBS -q normal 
   #PBS -l walltime=20:00:00,vmem=300MB 
   #PBS -wd
   ./a.out << EOF 
   1000
   50
   EOF

Notice that the PBS directives are all at the start of the script, that there are no blank lines between them, and there are no other non-PBS commands until after all the PBS directives.
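
The two ways of supplying input described above can be tried at the command line before a job is submitted. In the sketch below, a small shell function stands in for ./a.out (an assumption made so that the example runs anywhere); it reads two values from standard input, just as a prompting program would. The function name fake_program, the variable names and the output wording are illustrative only.

```shell
#!/bin/sh
# fake_program stands in for ./a.out: it reads two numbers from standard input.
fake_program() {
    read -r steps
    read -r size
    echo "steps=$steps size=$size"
}

# Method 1: keep the values in a file and redirect it to standard input.
input_file=$(mktemp)
printf '1000\n50\n' > "$input_file"
fake_program < "$input_file"
rm -f "$input_file"

# Method 2: a here-document places the same values inside the script itself.
fake_program << EOF
1000
50
EOF
```

Both calls hand the program the same two lines; the here-document form simply keeps the data and the commands in one file.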

qsub options of note:

-P project
The project to which you want to charge the job's resource usage. The default project is specified by the PROJECT environment variable.

-l walltime=20:00:00
The wall clock time limit for the job. Time is expressed in seconds as an integer, or in the form: [[hours:]minutes:]seconds[.milliseconds]

-l vmem=???MB
The total (virtual) memory limit for the job - can be specified with units of "MB" or "GB" but only integer values can be given. There is a small default value.
Your job will only run if there is sufficient free memory, so making a sensible memory request will allow your jobs to run sooner. A little trial and error may be required to find how much memory your jobs are using - nqstat lists jobs' actual usage.

-l ncpus=?
The number of cpus required for the job to run on. The default is 1.
-lncpus=N - If the number of cpus requested, N, is small (currently 8 or less on AC) the job will run within a single shared memory node. If the number of cpus specified is greater, the job will (probably) be distributed over multiple nodes. Currently on AC, these larger requests are restricted to multiples of 8 cpus.
-lncpus=N:M - This form requests a total of N cpus with (a multiple of) M cpus per node. Typically, this is used to run shared memory jobs where M=N and N is currently limited to 48 on AC.

-l jobfs=???GB
The requested job scratch space. This will reserve disk space, making it unavailable for other jobs, so please do not over estimate your needs. Any files created in the $PBS_JOBFS directory are automatically removed at the end of the job. Ensure that you use integers, and units of gb, or GB.

-l software=???
Specifies licensed software the job requires to run. See the software page for the string to use for specific software. The string should be a colon separated list (no spaces) if more than one software product is used.

If your job uses licensed software and you do not specify this option (or mis-spell the software), you will probably receive an automatically generated email from the license shadowing daemon (see man lsd), and the job may be terminated. You can check the lsd status and find out more by looking at the URL mentioned in man lsd.

-l other=???
Specifies other requirements or attributes of the job. The string should be a colon separated list (no spaces) if more than one attribute is required. Generally supported attributes are:
  • iobound - the job should not share a node with other IO bound jobs
  • mdss - the job requires access to the MDSS (usually via the mdss command). If MDSS is down, the job will not be started.
  • pernodejobfs - the job's jobfs resource request should be treated as a per node request. Normally the jobfs request is for total jobfs summed over all nodes allocated to the job (like vmem). Only relevant to distributed parallel jobs using jobfs.
You may be asked to specify other options at times to support particular needs or circumstances.

-r y
Specifies that your job is restartable: if the job is executing on a node when the node crashes, the job will be requeued. Both the resources used by and the resource limits set for the original job carry over to the requeued job. Hence a restartable job must checkpoint so that it can still complete in the remaining walltime should it suffer a node crash.

The default is that jobs are assumed to not be restartable. Note that regardless of the restartable status of a job, time used by jobs on crashed nodes is charged against the project they are running under, since the onus is on users to ensure minimum waste of resources via a checkpointing mechanism which they must build into any particularly long running codes.

-wd
Start the job in the directory from which it was submitted. Normally jobs are started in the user's home directory.

Look at the qsub and pbs_resources man pages for complete details of all options. Note that -l options may be combined as a comma separated list with no spaces, e.g. -lvmem=500mb,walltime=20:00.
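
Two of the rules in the option list above are easy to get wrong: the walltime format [[hours:]minutes:]seconds[.milliseconds], and the ncpus restriction quoted for the AC. The following sketch, in Bourne shell syntax, checks a request before it is submitted. The function names walltime_to_seconds and ncpus_ok_on_ac are invented for this illustration, and the ncpus check only encodes the limits quoted above (8 or fewer, otherwise a multiple of 8), which may have changed.

```shell
#!/bin/sh
# walltime_to_seconds is a hypothetical helper. It converts a walltime string
# of the form [[hours:]minutes:]seconds[.milliseconds] to whole seconds
# (any fractional part is dropped).
walltime_to_seconds() {
    echo "${1%%.*}" | awk -F: '{
        total = 0
        # Fold the fields left to right: hours, then minutes, then seconds.
        for (i = 1; i <= NF; i++) total = total * 60 + $i
        print total
    }'
}

# ncpus_ok_on_ac is a hypothetical helper encoding the AC rule quoted above:
# requests of up to 8 cpus are accepted, larger ones must be multiples of 8.
ncpus_ok_on_ac() {
    [ "$1" -ge 1 ] && { [ "$1" -le 8 ] || [ $(( $1 % 8 )) -eq 0 ]; }
}

walltime_to_seconds 20:00:00
walltime_to_seconds 20:00
for n in 4 12 16; do
    if ncpus_ok_on_ac "$n"; then
        echo "ncpus=$n accepted"
    else
        echo "ncpus=$n rejected"
    fi
done
```

Note that walltime=20:00:00 is 72000 seconds (20 hours), whereas walltime=20:00 is 1200 seconds (20 minutes).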

qps jobid
show the processes of a running job

qls jobid
list the files in a job's jobfs directory

qcat jobid
show a running job's stdout, stderr or script

qcp jobid
copy a file from a running job's jobfs directory

The man pages for these commands on the system detail the various options you will probably need to use.

 
Interactive PBS Jobs

The qsub -I option will result in an interactive shell being started out on the batch cpu[s] once your job starts. A submission script cannot be used in this mode - you must provide all qsub options on the command line.

Your job is subject to all the same constraints and management as any other job in the same queue. In particular, it will be charged on the basis of walltime, the same as any other batch job, since you will have dedicated access to the cpus reserved for your request. Don't forget to exit your interactive batch session, both to avoid leaving cpus idle on the machine and to avoid being charged for idle time!

Interactive batch jobs are likely to be used for debugging large or parallel programs etc. Since you want interactive response, it may be necessary to use the express queue to run immediately and avoid your session being suspended. However the express queue attracts a higher charging rate so don't leave the session idle.

To use an X display in an interactive batch job, use ssh to login to the AC or LC (do not change the DISPLAY variable ssh provides) and then submit your job with at least the following options:

    % qsub -I -q express -v DISPLAY
 
Common Problems

See the FAQ for the resolution of common problems on the systems.
 

Queues and Scheduling

 
Queue Structure

The systems have a simple queue structure with two main levels of priority; the queue names reflect their priority. There is no longer a separate queue for the lowest priority "bonus jobs" as these are to be submitted to the other queues, and PBS lowers their priority within the queues.
express:
  • high priority queue for testing, debugging or quick turnaround
  • charging rate of 3 SUs per processor-hour (walltime)
  • small limits particularly on time and number of cpus

normal:
  • the default queue designed for all production use
  • charging rate of 1 SU per processor-hour (walltime)
  • allows the largest resource requests

bonus time
Most projects can continue to submit jobs when their account is exhausted - such jobs are called "bonus jobs" but are in fact submitted to either of the express or normal queues.
bonus jobs:
  • queue at a lower priority than other jobs and will generally only run if there are no non-bonus jobs
  • are more suspendable than non-bonus jobs
  • make use of otherwise idle cycles while minimally hindering other jobs
  • may be terminated if they are impeding normal jobs or for system management reasons (usually jobs are just suspended)
In addition, there is the special purpose queue:
copyq:
  • specifically for IO work, in particular, mdss commands for copying data to the mass-data system.
  • where relevant copyq jobs run on the /short (and /fast) server nodes.
  • runs on nodes with external network interface(s) and so can be used for remote data transfers (you may need to configure passwordless ssh).
  • tars, compresses and other manipulation of /short files can be done in copyq.
  • compute jobs will be deleted whenever detected.
Job charging is based on wall clock time used (i.e. the T in the table below is wall clock time used) and number of cpus requested.
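
As a rough illustration of this rule, the sketch below estimates the charge for a job, assuming it is simply the queue's rate multiplied by the number of cpus requested and the wall clock hours used. The function name job_charge is invented for this illustration. Only the express and normal rates listed above are included; no rate is given here for copyq, so it is not handled.

```shell
#!/bin/sh
# job_charge is a hypothetical helper. Arguments: queue, ncpus, walltime in seconds.
# Rates are those listed above: express 3 SUs, normal 1 SU per processor-hour.
job_charge() {
    case $1 in
        express) rate=3 ;;
        normal)  rate=1 ;;
        *) echo "unknown queue: $1" >&2; return 1 ;;
    esac
    # SUs = rate x cpus requested x wall clock hours used
    awk -v r="$rate" -v n="$2" -v s="$3" 'BEGIN { printf "%.2f\n", r * n * s / 3600 }'
}

# An 8 cpu job that uses 2.5 hours (9000 seconds) of wall clock time.
job_charge normal 8 9000
job_charge express 8 9000
```

Under these assumptions the job costs 20 SUs in the normal queue and 60 SUs in the express queue.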
 
Detailed Configuration

The APAC system uses a slightly modified version of the PBS queueing system with per-user/project limits added via RASH.
  • All limits can be (and are intended to be) varied on a per-user or per-project basis - reasonable variation requests will be granted where possible.

  • Resources on the system are strictly allocated with the intent that if a job does not exceed its resource (time, memory, disk) requests, it should not be unduly affected by other jobs on the system. The converse of this is that if a job does try to exceed its resource requests, it will be terminated.

The queue configuration and default limits are subject to change without notice, as we need to respond to the demand on the system and try to deliver the fairest share scheduling while allowing as many jobs to be queued per project as possible. The limits on the queues also vary from machine to machine. The command nf_limits -P project is available on EACH of the systems to allow users to see what limits apply to their username and project combination on the particular machine. If used without -P project, the PROJECT environment variable is assumed.

The nf_limits command returns the limits for maximum number of CPUs queued, maximum number of CPUs per job, and the maximum memory and maximum walltime for each PBS queue. As memory and walltime limits depend on the number of CPUs of the job, it is necessary to use nf_limits -n ncpus to determine the limits of a job requesting ncpus to run.

The maximum number of CPUs queued shown is the number if all jobs are single cpu jobs. If all jobs are parallel jobs using an even number of cpus, they may queue up double that number of CPUs. See the notes which form part of the nf_limits output, and also man nf_limits.

An example of the queues available and an indication of the limits which may apply on the ac is available HERE.

 
Scheduling issues

The scheduling algorithm used on APAC-NF is somewhat complicated but its aims are to:
  • promote large scale parallel use of the Facility
  • allow equal access to resources for all users independent of their "share" or grant
  • provide good turnaround for all users
  • minimize the impact of jobs on one another
Some of the features of the scheduler designed to achieve these aims are:
  • resources are strictly allocated so jobs will not start unless there is sufficient free memory and jobfs (as well as cpus).
  • queued jobs are shuffled so that jobs from different users and projects are "interleaved". This means your first job should appear near the top of the queue even if there are many jobs in the queue as reported by nqstat.
  • running jobs can be suspended to allow express and parallel jobs to run. Long jobs and jobs belonging to users/projects with lots of other running jobs are most "suspendable" but any job can be suspended. The fraction of time a job can be suspended is heavily limited.
From a user's perspective, it is very important that you minimize your requests for resources (i.e. walltime, memory and jobfs). Otherwise your job may be queued or suspended longer than necessary. Of course, make sure you ask for sufficient resources - a little experimentation in the express queue might help.

Further details on the scheduling policy and algorithm are available. Don't hesitate to contact us if you wish to query or have comments or suggestions about the queues and scheduling.

 

File Systems

A number of file systems are available, each with a different purpose - the appropriate file system should be used whenever possible.

As well as the generally available filesystems listed below, there are parallel and other filesystem options available for high performance I/O. Please contact us if you have a need for such filesystems.

The file systems currently generally available, listed in order of most permanent and backed up to most transient and NOT backed up, are:
   
home directories

 
massdata

 
/short

 
/jobfs

 
fast IO

Users who are dealing with large files in large chunks (i.e. > 1 MB reads and writes) have a number of options available to them. Contact us for assistance in choosing the best option and to gain access to the /fast filesystems.

 
/var

 
/tmp

Traditionally the TMPDIR environment variable is set to /tmp. TMPDIR is used by various commands and programs, perhaps without users being aware of it; for example, the intermediate files created during compilation are saved to TMPDIR. As the /tmp area is not very large, for interactive use TMPDIR is set to /short/tmp. Batch jobs that need to write scratch files to $TMPDIR MUST request jobfs space, as TMPDIR is then set to $PBS_JOBFS. If jobfs space is not requested, TMPDIR is set to a meaningless path and an error will be generated if the job attempts to use $TMPDIR.
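
Because TMPDIR may point to a meaningless path when jobfs was not requested, a script can check it before writing scratch files. The following is a minimal sketch in Bourne shell syntax. The function name require_tmpdir, the message wording and the scratch file name are invented for this illustration.

```shell
#!/bin/sh
# require_tmpdir is a hypothetical helper: it succeeds only if TMPDIR names
# an existing, writable directory.
require_tmpdir() {
    if [ -n "${TMPDIR:-}" ] && [ -d "$TMPDIR" ] && [ -w "$TMPDIR" ]; then
        return 0
    fi
    echo "TMPDIR (${TMPDIR:-unset}) is not usable; was jobfs requested?" >&2
    return 1
}

if require_tmpdir; then
    # Illustrative scratch file name, made unique with the process id.
    scratch="$TMPDIR/scratch.$$"
    echo "scratch files will go to $scratch"
fi
```

Failing early in this way gives a clearer message than an error from deep inside a program that tries to write to $TMPDIR.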

 
Summary

Name(1)                  Purpose                                           Availability                   Quota(2)    Timelimit
/home/unigrp/user        Irreproducible data, e.g. source code             Global                         200MB       none
massdata                 Archiving large data files                        External - access using mdss   20GB        none
/short/projectid         Non IO intensive data maintained beyond one job   Global                         20GB        42 days
/jobfs/projectid/jobid   IO intensive data, job lifetime                   Local to node                  50GB+(3)    Duration of job
/var                     PBS spool area                                    no direct access               10MB/file   -
/tmp                     Avoid                                             -                              -           -
  1. Each user belongs to at least two Unix groups:
              unigrp - determined by their host institution, and
              projectid(s) - one for each project they are attached to.
  2. These limits can be increased on a per user or per project basis if necessary.
  3. Users request allocation of /jobfs as part of their job submission. The actual disk quota for a particular job is given by the jobfs request.
 

Remaining Documentation is Specific to the Machines

For compiling and other details specific to each machine, please see contents listing to the left.