
APPENDIX 4 Distributed BatchMin


BatchMin allows multi-conformer minimization, Monte Carlo conformational searching and free-energy perturbation methods to be run in a distributed manner on networked UNIX workstations or multi-processor machines. The user writes a single .com file that is interpreted by a "master" BatchMin process. The master process then:

· Decomposes the search into a number of sub-searches
· Creates .com files for the sub-searches
· Makes contact with the other processors
· Starts and monitors "slave" BatchMin processes on the other processors to perform the sub-searches
· Collates and reports the results from the completed sub-searches.

Capabilities

· MCMM - Multiple-minimum conformational searching.
· MULT - Multi-conformer minimization.
· FEAV, FESA - Free-energy perturbation.
· Up to 100 processes on up to 100 remote processors can be controlled by a single master process.

Requirements

BatchMin must be installed on each target machine. The user must have an account with the same user name on all the machines. Some additional installation procedures may be necessary; these are described below.

Communication between remote hosts uses the "rbm" mechanism also used by MacroModel to initiate BatchMin jobs on remote hosts; this mechanism is described in the MacroModel User Manual. When attempting to use the distributed BatchMin facility for the first time, users should first make certain that simple test jobs can be started on each remote machine using rbm. For this to work, the following must be true:

· The user's accounts on the machines on which sub-processes are to be run must be set up to be equivalent to the account on the machine running the master BatchMin process, using the UNIX rhosts mechanism. The most common way of doing this is by means of .rhosts files in the user's home directories; a sample entry is shown after this list.
· BatchMin .com files must use relative, not absolute, pathnames for the names of the input and output files.
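
For example, if the master process runs as user jsmith on a machine called masterhost (both names are illustrative), the .rhosts file in the user's home directory on each slave machine could contain the single line:

    masterhost jsmith

On most systems the .rhosts file must be owned by the user and must not be writable by anyone else, or it will be ignored.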

Using the Distributed Procedure

The distributed processing facility in BatchMin has been designed to have minimal impact on the existing BatchMin functionality. A conformational search can be "distributed" simply by adding one extra op-code - NPRC - to the command file, and by creating a single additional setup file giving the names of the hosts over which the job is to be distributed. In addition, depending on how UNIX is configured at a given site, the name of the system on which the master process is running may have to appear in the user's .rhosts file on the slave machines. The entire procedure is best understood by means of an example. The .com file shown here distributes a 5000-step MCMM search of cyclodecane over five hosts.
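
A sketch of such a file, assuming a typical MCMM setup, might look like the following. The file names, the SEED value and the TORS/CONV/MINI lines are illustrative only; a real search would contain one TORS line for each rotatable bond, together with the usual comparison op-codes:

    cdecane.dat
    cdecane.out
     SEED    9999      0      0      0     0.0000     0.0000     0.0000     0.0000
     NPRC       5    100     60      1     0.0000     0.0000     0.0000     0.0000
     READ       0      0      0      0     0.0000     0.0000     0.0000     0.0000
     MCMM    5000      0      0      0     0.0000     0.0000     0.0000     0.0000
     TORS       1      2      0      0     0.0000   180.0000     0.0000     0.0000
     TORS       2      3      0      0     0.0000   180.0000     0.0000     0.0000
     CONV       2      0      0      0     0.0000     0.0000     0.0000     0.0000
     MINI       1      0    500      0     0.0000     0.0000     0.0000     0.0000

The NPRC arguments request five slave processes (arg1), sub-searches of 100 MCMM steps each (arg2), monitoring at 60-second intervals (arg3), and the host consistency check (arg4) described below.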

With the exception of the NPRC command, this file is identical to the .com file which would be used to perform the same task on a single host.

In addition to the .com file, a file called dhostfiles.dat is needed. This file contains a list of the hosts which will be used for the slave processes. Since the NPRC command specifies distribution over five hosts, BatchMin will use the first five hosts named in the file. A dhostfiles.dat file is normally created by the user in the directory from which the job is to be run. If BatchMin does not find such a file there, it looks in the directory given by the environment variable BATCH_ROOT; however, in our experience a local file is usually preferable. A machine having N processors is listed in the file N times, allowing the program to create up to N sub-processes on it. A sample dhostfiles.dat is shown here:
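
    host1   bmin.auto
    host2   bmin.auto
    host3   bmin
    host1   bmin
    host4   bmin    guest
    host1   bmin
    host1   bmin

(The host names and the user-id guest are placeholders; the optional second and third tokens on each line are explained below.)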

Here we assume that host1 is a four-processor machine, so it is listed four times in the file. In this run five processes are requested, but only two of them will be assigned to host1 at any one time, because of the sequence in which the hosts are listed in the file.

The second token in each line of the dhostfiles.dat file specifies which executable is to be used on the remote machine. The remote server searches for an executable program bearing this name first in $MMOD_ROOT/run/exec and then, if not found, in $MMOD_ROOT/run/mmdat. (These directory names are actually taken from the inetd.conf entry for the bminrd process.) If the second token is not present then bmin.auto will be used by default. In the sample file shown above, the first two entries would be found in the .../run/mmdat directory and the remaining entries would be found in .../run/exec. As demonstrated by the first two entries, the bmin.auto family of scripts may be specified. As discussed in Chapter 3, Running BatchMin, these select a BatchMin of appropriate size and/or optimization level for the molecule being simulated and the platform in use.

The third token specifies the user-id on the remote machine under whose account the job is to be run. This token, like the second, is optional. If absent, the user-id which launched the master process is assumed. Of course, appropriate rhosts permissions must be in effect to use either the default or a non-default user name.

If fewer hosts are listed in the file than the number requested in the NPRC command, then a warning will be issued in the log file and the job will be distributed over the number of hosts listed. In most cases the name of the host running the master process can also be included in dhostfiles.dat. This will cause a slave BatchMin job to be run on that host alongside the master; the master will be largely dormant, since little CPU time is required by BatchMin to control the sub-searches and collate the results.

The second argument of the NPRC command indicates that the MCMM job is to be broken up into sub-searches of 100 steps each; the 5000-step search of the example is thus decomposed into 50 sub-searches. The value of this parameter is important for achieving maximum efficiency in the distributed searching procedure; see the discussion of efficiency below. For a multi-conformer minimization, rather than a conformational search, this argument represents the number of structures to be minimized by each slave process.

The third argument of the NPRC command indicates that the progress of the sub-searches will be monitored every 60 seconds. This is suitable for most jobs, but could be increased if the job is very large, in order to avoid the overhead associated with monitoring.

Note the use of a SEED op-code in the command file. The value given here is used to generate a unique random-number seed for each sub-search.

Host Consistency Checking

It is critical to the success of the distribution process that each of the hosts used produce the same energy for the same system. There are a number of possible causes of discrepancy, but the most important is that identical force-field and solvent files might not be installed on all the hosts. If arg4 of the NPRC command is set to a non-zero value, then before initiating the search BatchMin will perform an energy calculation on the first structure in the .dat file on each host. If any significant discrepancy between the hosts is detected, the job is terminated and a warning message is printed in the log file. We strongly suggest that this check be turned on before beginning any project with the distributed procedures.

If a force-field or solvent file is present in the directory from which the distributed process is run, it will be copied to the working directory on each remote host and used in each sub-search. This is meant to ensure consistency across all hosts in the event that the user is providing a modified force field or solvent parameterization; recall that local .fld and .slv files override the default files in the BATCH_ROOT directory.

Running The Distributed Job

The normal BatchMin job-control facilities can be applied to the distributed job. For example, creating a file filename.slp will cause the master BatchMin process first to put all the sub-processes to sleep and then to enter the pause state. Deleting the .slp file will wake up all the sub-searches. The .stp file convention works in a similar manner, terminating the sub-processes, performing cleanup, and then exiting the master process.
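
For example, if the distributed job was started from a command file named cdecane.com (following the illustrative example above), it can be controlled from the shell with the standard UNIX commands:

    touch cdecane.slp    # put the sub-processes to sleep; the master pauses
    rm cdecane.slp       # wake the sub-searches; the run resumes
    touch cdecane.stp    # terminate the sub-processes; the master cleans up and exits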

While the distributed job is running, temporary files will be created. These have the form filename_@n.sfx, where .sfx is a filename suffix from the set {com,dat,log,out} and n is an integer which represents one of the subprocesses. These files will be removed by the master process once the job has successfully completed, unless DEBG 940 has been specified.
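
For the illustrative cdecane job above, the working directory would thus temporarily contain files such as:

    cdecane_@1.com   cdecane_@1.dat   cdecane_@1.log   cdecane_@1.out
    cdecane_@2.com   cdecane_@2.dat   cdecane_@2.log   cdecane_@2.out

and so on, one set for each sub-process.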

Efficiency Considerations

In a typical installation, the distributed procedure will be used on a heterogeneous set of hosts; that is, the hosts will have different architectures and different speeds, and some will be more heavily loaded than others when the job is run. It is important that the job be distributed in an efficient manner, one which assigns more work to the faster or less heavily loaded hosts. This is called "load balancing." We accomplish it by breaking the search up into a number of sub-searches which exceeds the number of hosts available, then assigning each new sub-search to the next host that completes its previous task.

Reducing the size of the sub-searches too much, however, can reduce the efficiency of the overall procedure. There are two reasons for this. First, there is a fixed overhead associated with each sub-search, resulting from reading the force field and assigning the parameters. Second, it is usual to employ a "usage-directed" strategy in the MCMM search algorithm. To reduce the need for inter-process communication, each sub-process builds up usage information only from the structures it has itself generated. If each sub-process generates only a very small number of structures, insufficient information accumulates within a sub-process to provide meaningful usage direction.

These factors are illustrated in the following results for a 10000-step search of cycloheptadecane. First, the search distributed over five workstations achieved a speedup of 4.5 over a single processor, representing 90% efficiency in the distribution process. Second, reducing the size of the sub-searches to 200 MCMM steps slightly reduced the efficiency of the search, both in terms of structures searched per minute and in terms of the total number of unique structures found. We recommend that each sub-search constitute at least 5% of the total search. Finally, the comparison with the MicroVAX and Cray computers shows that the distributed procedure on even a modest number of workstations gives a two-order-of-magnitude speed-up over the "standard" single-processor platforms of a few years earlier, and in fact gives performance in the supercomputer range.

Results for a 10000-step MCMM search of cycloheptadecane:

    Configuration                                            Unique structures within    Structures produced
                                                             3.0 kcal/mol of the         per minute
                                                             global minimum

    5 HP-720 workstations; 50 sub-searches of 200 each            228                        37.3
    5 HP-720 workstations; 20 sub-searches of 500 each            236                        38.8
    1 HP-720 workstation; 1 search of 10000 steps                 237                         8.6
    MicroVAX                                                        -                         0.2
    Cray 2                                                          -                        16.7

Installation of Distributed BatchMin

Please refer to the MacroModel User Manual, Appendix 5.




