
An HTML version of this document exists at http://gel.ahabs.wisc.edu/docserver/mpiBLAST-1.4.0
A PDF version of this document exists at http://gel.ahabs.wisc.edu/docserver/mpiBLAST-1.4.0/mpiBLAST_docs.pdf

What is mpiBLAST?

mpiBLAST is a parallelization of NCBI BLAST. mpiBLAST is a pair of programs that replace formatdb and blastall with versions that execute BLAST jobs in parallel on a cluster of computers with MPI installed. There are two primary advantages to using mpiBLAST versus traditional BLAST. First, mpiBLAST splits the database across each node in the cluster. Because each node's segment of the database is smaller it can usually reside in the buffer-cache, yielding a significant speedup due to the elimination of disk I/O. Second, it allows BLAST users to take advantage of efficient, low-cost Beowulf clusters because interprocessor communication demands are low.

Why use mpiBLAST?

Two reasons:

1. Super-linear speedups when searching large databases,
2. Improved per query latency when running individual queries or batches of queries.

Other ways of getting mpiBLAST

Several other groups have packaged mpiBLAST for use and created derivative versions of mpiBLAST. Before building mpiBLAST, you may consider using:

mpiBLAST Requirements

mpiBLAST requires that an MPI implementation is installed. Two free implementations are MPICH and LAM-MPI.

In its most common configurations, mpiBLAST also requires that the computers have some shared storage directory. This can be an NFS mount, Samba share, AFS, or any other type of shared network filesystem. The location of the shared directory must be specified in the '~/.ncbirc' configuration file.

To build mpiBLAST from source you will also need to compile "a matching version":http://mpiblast.lanl.gov/download.html of the NCBI Toolbox, available from NCBI.

Current Version

The current release series of mpiBLAST has version numbers 1.4.x. For a list of new features and changes between each release, please view the "Version History":version_history.stx section of this User Guide.

Known Bugs and Limitations of mpiBLAST 1.4.x

• A maximum of 250 database fragments can be created by 'mpiformatdb' on most systems
• Tiny differences in e-value between NCBI BLAST and mpiBLAST may exist due to numerical instability in result processing. Most results are unaffected.

Known Bugs and Limitations of mpiBLAST 1.3.x

• A maximum of 250 database fragments can be created by 'mpiformatdb' on most systems
• NCBI blastall may report results with the same e-value and bit score in a different order than mpiBLAST does
• mpiBLAST does not report search statistics like the number of hits to the database or the number of extensions
• When requesting XML format output, mpiBLAST may generate warning messages about deleting a locked sequence. This is due to a bug in the NCBI Toolbox and can be safely ignored. Search results do not appear to be affected.
• Searching with more workers than fragments was broken
• Database timestamp checking was broken
• '--removedb' was broken

Known Bugs and Limitations of mpiBLAST 1.0.x, 1.1.x, and 1.2.x

• mpiBLAST 1.2.0 outputs invalid alignment results when the query set contains multiple queries with the same defline and different sequences.
• When writing results from translated searches in XML or tab-delimited text format, mpiBLAST may print warning messages like: [blastall] ERROR: query 1;: BioseqFindFunc: couldn't uncache. When this happens, the query sequence in the alignments may be replaced with X's.
• Errors can occur when using the -m [1-6] output options with tblastn searches
• Translated searches (blastx, tblastn, and tblastx) in the 1.0.x and 1.1.x releases do not include alignments in the results file. This problem was fixed in the 1.2.0 release.
• mpiBLAST runs out of memory when formatting result output for very large query sets, causing a crash.
• mpiBLAST 1.0.x and 1.1.x occasionally crash during result output, especially when XML or tab-delimited text output has been selected (-m 7, -m 8, or -m 9). This problem was fixed in the 1.2.0 release.
• The current release of mpiBLAST does not print the Karlin-Altschul statistics or the database info at the bottom of each query's BLAST results.
• mpiBLAST uses the actual number of nucleotides in the database to calculate the E-value instead of the effective number of nucleotides in the database. In some cases this results in a discrepancy between the E-value reported by mpiBLAST and that reported by NCBI-BLAST. For protein sequence searches the difference in E-value is more pronounced due to higher variability of effective database search lengths.
• BLAST results for a query that have the same bit score may be returned in a different order by mpiBLAST than they would by NCBI-BLAST.

Support

Questions regarding the usage of mpiBLAST should be sent to the "mpiBLAST mailing list":http://sourceforge.net/mail/?group_id=78850 on our sourceforge.net "project page":http://sourceforge.net/projects/mpiblast .

The "mpiBLAST documentation server":http://gel.ahabs.wisc.edu/docserver/mpiBLAST/ supports anonymous reader comments directly in the text. If part of the documentation is unclear, missing, or incorrect, please contribute by making a comment!

Installing mpiBLAST

Installation on Unix

If you are using Linux on Intel or AMD hardware, you may be able to find a pre-compiled RPM of mpiBLAST from "Scalable Informatics":http://downloads.scalableinformatics.com/downloads/mpiblast/ . Otherwise, you will have to download and compile mpiBLAST from source code.

There are five steps to installing mpiBLAST from source.

1. Install MPI (if not already installed)
2. Download and extract mpiBLAST and the NCBI Toolbox
3. Patch the NCBI Toolbox with the mpiBLAST patch and compile it
4. Compile and install mpiBLAST
5. Configure mpiBLAST by editing the ~/.ncbirc file

Each of these steps is described in more detail below.

1. MPI installation

There are two popular free versions of MPI. A successful installation of either is sufficient for mpiBLAST to work. Many OS vendors bundle either MPICH or LAM-MPI with the system, making an installation from source unnecessary.

LAM-MPI -- "http://www.lam-mpi.org/":http://www.lam-mpi.org/

Extract mpiBLAST:

'> tar xzf mpiBLAST-1.4.0.tar.gz'

Download NCBI Toolbox (ncbi.tar.gz) from "ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/old":ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/old . The mpiBLAST 1.4.0 release requires the "October 2004 NCBI Toolbox release":ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/old/20041020 (20041020). We have reports that the "April 2005 NCBI Toolbox release":ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/old/20050428 (20050428) also works, but it erroneously sets the build CPU type to Pentium 4 by default.

Extract the NCBI Toolbox:

'> tar xzf ncbi.tar.gz'

3) Patch and compile the NCBI Toolbox

In order to provide exact e-value statistics, mpiBLAST requires slight modifications to the NCBI Toolbox. We have provided a patch that modifies the NCBI Toolbox accordingly. Once patched, the toolbox must be re-compiled prior to compiling mpiBLAST. The e-value patch is required for mpiBLAST 1.4.x to work properly.

From a directory containing both the 'mpiblast-1.4.0' and 'ncbi' directories, use the following command to patch the NCBI Toolbox:

'> patch -p0 < mpiblast-1.4.0/ncbi_Oct2004_evalue.patch'

Alternatively, the patch can be applied from within the 'ncbi' directory by changing the patchlevel:

'> patch -p1 < ../mpiblast-1.4.0/ncbi_Oct2004_evalue.patch'

Once the patch has applied successfully, the NCBI toolbox can be built with the following command:

'> ncbi/make/makedis.csh'

Usually the Toolbox will build without problems. If not, refer to ncbi/README for more detailed build instructions. It may be necessary to edit one of the makefiles in the ncbi/platform subdirectory.

4) Compile and install mpiBLAST

Enter the mpiBLAST directory

'> cd mpiBLAST-1.4.0'

Configure mpiBLAST

'> ./configure --with-ncbi=/full/path/to/ncbi'

Useful options for 'configure':

'--prefix=/path/to/install/directory' -- Specifies the location where mpiBLAST should be installed

'--with-ncbi=/path/to/ncbi/' -- Specifies the path to the NCBI Toolbox

'--with-mpi=/full/path/to/mpi' -- Specifies a particular MPI installation that mpiBLAST should build and link against

'--enable-mpe-timing' -- Causes mpiblast to use MPE logging to measure the running time of its components

'--with-mpe=/full/path/to/mpe' -- Gives the location of MPE headers and libraries

'--enable-MPI_Alloc_mem' -- Cause mpiblast to use MPI_Alloc_mem() for better performance on some interconnects. Not compatible with mpich 1.2.x.

Debugging symbols and other compiler flags can be added by setting the 'CFLAGS' and 'CXXFLAGS' environment variables

Build mpiBLAST

'> make'

Install mpiBLAST

If the '--prefix' configure option wasn't used then mpiBLAST will be placed in '/usr/local' and root privileges will probably be required to install it.

'> make install'

5) Edit the '~/.ncbirc' configuration file

Before running mpiBLAST, it is necessary to configure the shared and local storage paths that each node will use to access the database. A shared storage path is usually a path to a directory residing on a file server, such as NFS, AFS, samba, PVFS, GPFS, Lustre, etc. The local storage path is typically a subdirectory within the '/tmp' directory, e.g. '/tmp/mpiblast'. As worker nodes search the database, they will copy fragments to the local storage directory. During subsequent searches of the same database, the fragments will already reside in local storage and thus will not need to be copied. Note that diskless nodes can be supported by setting the local storage path to be the same as the shared storage path.
To configure mpiBLAST create a '.ncbirc' file in your home directory that looks like::

[NCBI]
Data=/path/to/shared/storage/data

[BLAST]
BLASTDB=/path/to/shared/storage
BLASTMAT=/path/to/shared/storage/data

[mpiBLAST]
Shared=/path/to/shared/storage
Local=/path/to/local/storage


The 'Data' variable gives the location of the NCBI Data directory containing BLOSUM and PAM scoring matrices, among other things. The scoring matrix files are necessary for any type of protein BLAST search and should be accessible by all cluster nodes. The 'BLASTMAT' variable also specifies the path to the scoring matrices, and will usually be identical to the 'Data' variable. The 'BLASTDB' variable tells standard NCBI blastall (not mpiBLAST) where to find BLAST databases. As previously mentioned, the 'Shared' and 'Local' variables give the shared and local database paths, respectively. By setting 'BLASTDB' to the same path as 'Shared', it is possible for NCBI 'blastall' to share the same databases that mpiBLAST uses. In such a configuration, be sure to format all databases with 'mpiformatdb' rather than 'formatdb'.
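As a rough illustration of the layout described above, the following shell sketch writes a '.ncbirc'-style file from a shared path and a local path. The helper name 'write_ncbirc', the scratch output file, and the assumption that the NCBI data directory lives in a 'data' subdirectory of shared storage (as in the sample file) are illustrative only and are not part of mpiBLAST.

```shell
#!/bin/sh
# Sketch only: write_ncbirc is a hypothetical helper, not an mpiBLAST tool.
# It mirrors the sample file above and assumes the NCBI data directory is
# "<shared>/data".
write_ncbirc() {
    shared=$1      # shared storage path visible to every node
    local_dir=$2   # node-local storage path, e.g. /tmp/mpiblast
    out=$3         # file to write
    cat > "$out" <<EOF
[NCBI]
Data=$shared/data

[BLAST]
BLASTDB=$shared
BLASTMAT=$shared/data

[mpiBLAST]
Shared=$shared
Local=$local_dir
EOF
}

# Write to a scratch file rather than overwriting an existing ~/.ncbirc.
example=${TMPDIR:-/tmp}/ncbirc.example.$$
write_ncbirc /path/to/shared/storage /tmp/mpiblast "$example"
cat "$example"
```

Once the contents look right, the scratch file can be copied to '~/.ncbirc' by hand.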

Installation on Windows

1. Copy mpiblast.exe, mpiformatdb.exe, and mpiblast_cleanup.exe to a shared storage directory
2. Create the mpiblast configuration file (see below)
3. Get the "NCBI data distribution":ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/CURRENT/data.tar.gz and copy its contents to your shared (database) storage directory
4. Make sure "mpich2":http://www-unix.mcs.anl.gov/mpi/mpich2/ is installed; mpiBLAST can be run using the included wmpiexec program
5. If your Windows machines are running Windows XP SP2 or have third party firewall software installed, be sure to configure it to allow the mpich2 daemon smpd access to the internet. By default it listens on tcp ports 8675 and 8676 and possibly others.

Creating the config file

On Windows, mpiBLAST (and NCBI's 'blastall') requires a 'ncbi.ini' configuration file created in the Windows directory, usually 'C:\Windows' or 'C:\WINNT'. The format of the 'ncbi.ini' is similar to the '.ncbirc' under Unix. For example::

[NCBI]
Data=\\path\to\shared\storage\data

[BLAST]
BLASTDB=\\path\to\shared\storage
BLASTMAT=\\path\to\shared\storage\data

[mpiBLAST]
Shared=\\path\to\shared\storage
Local=C:\path_to_local_storage

UNC paths (paths that start with '\\') are acceptable in the mpiblast windows configuration. It is not necessary to map the shared storage directory to a drive letter. Please ensure that the shared storage directory has been configured as a shared folder readable by any machine/user that will execute 'mpiblast' and writable by any machine/user that will execute 'mpiformatdb'.

Using mpiBLAST

In order to perform a search with mpiBLAST, the target BLAST database must first be formatted and segmented using 'mpiformatdb'. Then, 'mpirun' can be used to execute 'mpiblast' in parallel on several cluster nodes.

Formatting a database

Before processing blast queries the sequence database must be formatted with 'mpiformatdb'. The sequence database used in the following examples is the yeast amino acid database available from NCBI: "ftp://ftp.ncbi.nih.gov/blast/db/FASTA/":ftp://ftp.ncbi.nih.gov/blast/db/FASTA/ The command line syntax to format the (uncompressed) yeast.aa database looks like this:

'> mpiformatdb -N 5 -i yeast.aa'

The above command would format the database into 5 fragments, ideally for 5 worker nodes. 'mpiformatdb' accepts the same command line options as NCBI's 'formatdb'. See the README.formatdb file that comes with the NCBI BLAST distribution for more details.

'mpiformatdb' reads the '~/.ncbirc' file and creates the formatted database fragments in the shared storage directory.

Querying the database

'mpiblast' command line syntax is nearly identical to NCBI's 'blastall' program. See the README.bls file included in the BLAST distribution for details. A small test query file called 'ech_10k.fas' is packaged with the mpiBLAST distribution. Running a search using 5 worker nodes would look like:

'> mpirun -np 7 mpiblast -p blastx -d yeast.aa -i ech_10k.fas -o blast_results.txt'

The above command would query the nucleotide (DNA) sequences in 'ech_10k.fas' against the amino acid (protein) sequences in the 'yeast.aa' database, translating the nucleotide sequence to amino acids on-the-fly. Results would be written to the 'blast_results.txt' file in the current working directory. To get the best performance, it is important to start at least one more process than the number of processors in the cluster because one of the mpiBLAST processes is dedicated to scheduling, which is not CPU-intensive. Furthermore, mpiBLAST needs at least 3 processes to perform a search. One process performs file output and another schedules search tasks, while any additional processes actually perform search tasks.
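The relationship between the '-np' value and the number of searching processes can be sketched in shell as follows. The helper 'workers_for_np' is a hypothetical name used for illustration; it only encodes the rule stated above (one writer, one scheduler, the rest workers, and a minimum of 3 processes).

```shell
#!/bin/sh
# Sketch only: workers_for_np is a hypothetical helper, not part of mpiBLAST.
workers_for_np() {
    np=$1
    if [ "$np" -lt 3 ]; then
        echo "mpiblast needs at least 3 processes (got $np)" >&2
        return 1
    fi
    # One process writes output and one schedules; the rest search.
    echo $((np - 2))
}

workers_for_np 7    # the example above: prints 5
```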

Extra options for mpiblast:

'--debug[=filename]' -- Produces verbose debugging output for each node, optionally logs the output to a file

'--scheduler-rank=<number>' -- Set the scheduler process' MPI Rank (default is 1). Because the scheduler uses very little CPU it can be useful to force the scheduler to run on the same physical machine as the writer (rank 0).

'--altschul-reference' -- Print the Altschul et al. 1997 paper reference instead of the mpiBLAST paper reference. With this option mpiblast output is nearly identical to NCBI-BLAST output.

'--removedb' -- Causes each node's local data to be created in a temporary directory that is removed when the job exits.

'--copy-via=[cp|rcp|scp|mpi|none]' -- Sets the method of copying files that each worker will use. Default = "cp"

• cp : use standard file system "cp" command. Additional option is --concurrent.
• rcp : use rsh "rcp" command. Additional option is --concurrent.
• scp : use ssh "scp" command. Additional option is --concurrent.
• mpi : use MPI_Send/MPI_Recv to copy files. Additional option is --mpi-size.
• none : do not copy files, instead use shared storage as local storage

'--concurrent=' -- set the number of concurrent accesses to shared storage. Default = 1

'--mpi-size=' -- in bytes, set the maximum buffer size that MPI will use to send data when transferring files. Default = 65536

'--lock[=on|off]' -- set whether file locking should be used to manage local fragment lists. Defaults to off. When '--concurrent' > 1, defaults to on

'--disable-mpi-db' -- When set, the writer will use the database on shared storage for sequence lookup. Can drastically reduce overhead for some 'blastn' searches.

'--nice=<number>' -- Under unix, sets the nice value for each mpiblast process.

'--resume-run' -- Experimental. When set, mpiblast will read the output file and attempt to continue a previously aborted run where it left off

'--version' -- print the mpiBLAST version
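As a hedged illustration of how the copy-related options above might be combined, the sketch below maps a storage situation to a set of flags. The function name 'copy_flags' and the three storage class labels are hypothetical. The choices follow the advice in the version history of this guide (one concurrent copy for slow NFS, no copying for fast parallel filesystems) and are a starting point, not a tuned recommendation.

```shell
#!/bin/sh
# Sketch only: copy_flags and its storage class names are hypothetical.
copy_flags() {
    case $1 in
        slow-nfs)
            # Limit concurrent access to a slow file server.
            echo "--copy-via=cp --concurrent=1" ;;
        parallel-fs)
            # Use shared storage directly as local storage.
            echo "--copy-via=none" ;;
        no-shared-fs)
            # Distribute files over MPI with the default buffer size.
            echo "--copy-via=mpi --mpi-size=65536" ;;
        *)
            echo "unknown storage class: $1" >&2
            return 1 ;;
    esac
}

echo "mpirun -np 7 mpiblast $(copy_flags slow-nfs) -p blastx -d yeast.aa -i ech_10k.fas -o blast_results.txt"
```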

Removing a database

NOTE: this functionality was broken prior to version 1.4.0 of mpiblast. It now works on both unix and Windows.

The '--removedb' command line option will cause mpiBLAST to do all work in a temporary directory that will get removed from each node's local storage directory upon successful termination. For example:

'> mpirun -np 16 mpiblast -p blastx -d yeast.aa -i ech_10k.fas -o results.txt --removedb'

The above command would perform a 16 node (14 worker) search of the yeast.aa database, writing the output to 'results.txt'. Upon completion, worker nodes would delete the yeast.aa database fragments from their local storage. Databases can also be removed without performing a search in the following manner:

'> mpirun -np 16 mpiblast_cleanup'

The mpiBLAST web interface

Scripts have been included that allow mpiBLAST to interface with the NCBI web front-end available at "ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST-WWWBLAST":ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST-WWWBLAST

The scripts are designed to queue jobs submitted through the web front-end into a PBS system. The interface between the NCBI HTML pages and BLAST is a file called 'blast.cgi'. mpiBLAST includes a replacement 'blast.cgi' that calls mpiBLAST's 'WWWBlastWrap.pl' job submission script. 'blast.cgi' and 'WWWBlastWrap.pl' should be copied into wwwblast's directory. You will have to manually edit 'WWWBlastWrap.pl' to adapt it to your system.

Miscellaneous items

Use the MPICH mpd

It is recommended that you use the mpd job startup daemon that comes with mpich and mpich2 for fastest job startup. mpd must be started on each node. See the MPICH documentation for more details.

Known bugs when using LAM/MPI

The message pattern generated by mpiblast appears to cause memory corruption between LAM and the Linux kernel. A workaround suggested by Jason Gans is to run mpiblast with the '-ssi rpi lamd' flag to mpirun: 'mpirun -np 10 -ssi rpi lamd mpiblast ...'

Formatting huge databases

Large databases like nt can consume several gigabytes of disk space and it is preferable to store them in compressed form. Starting with mpiBLAST 1.4.0 it is possible to pipe FastA formatted sequence data into 'mpiformatdb'. This feature provides the ability to directly format a compressed (gzip/bzip etc.) database using command line syntax like::

zcat nt.gz | mpiformatdb -i stdin -N 100 --skip-reorder -t nt -p F

'mpiformatdb' needs the '--skip-reorder', '-t <title>' and '-p <T|F>' options to format a database piped via standard input.
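The pipeline above can be generalized to other compression formats. The sketch below only prints the command line it would run, so it can be inspected first. The helper names 'decompressor_for' and 'format_pipeline' are hypothetical, and the use of 'gzip -dc' and 'bzip2 -dc' (equivalent to zcat and bzcat) is an assumption about the tools available on the system.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical and print a command line
# instead of executing it.
decompressor_for() {
    case $1 in
        *.gz)  echo "gzip -dc" ;;
        *.bz2) echo "bzip2 -dc" ;;
        *)     echo "cat" ;;       # assume the file is uncompressed
    esac
}

format_pipeline() {
    db=$1        # database file, possibly compressed
    frags=$2     # number of fragments (-N)
    title=$3     # database title (-t)
    protein=$4   # T for protein, F for nucleotide (-p)
    echo "$(decompressor_for "$db") $db | mpiformatdb -i stdin -N $frags --skip-reorder -t $title -p $protein"
}

format_pipeline nt.gz 100 nt F
```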

Platform-specific installation notes

Installation on Mac OS X

mpiBLAST versions prior to 1.3 were not supported on Mac OS X. During the build process you may have to './configure' with the '--disable-dependency-tracking' option.

Compiling mpiBLAST on AIX 5.2 with the IBM VisualAge compiler

Note: this experience with mpiBLAST on AIX was gathered using mpiBLAST 1.3.0. It may not apply to newer releases.

Our (limited) experience building mpiBLAST on AIX revealed C header conflicts in the C++ build system. In order to work around the header conflicts, we created a duplicate copy of the '/usr/vacpp/include' directory, minus all files ending in '.h' except 'ansic_aix.h', 'xlocinfo.h', and 'yvals.h'. When running 'configure', the environment variables 'CFLAGS' and 'CXXFLAGS' should contain the compiler flags '-qnostdinc -I/usr/include -I/usr/local/include -I/path/to/duplicate/vacpp_inc_dir' in addition to any other custom compiler flags. These flags include the standard system headers without including the conflicting VisualAge C++ headers.

Furthermore, automake's dependency tracking appears broken on AIX, so it is necessary to run 'configure' with the '--disable-dependency-tracking' option. Another snag is that autoconf thinks it must define 'LARGEFILES' on these systems, which breaks the build ('open64' is not defined). We hope to fix this behavior in a future release, but for now it can be worked around by editing 'src/config.h' to comment out the line '#define LARGEFILES 1'.
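The 'LARGEFILES' workaround can also be scripted. The sketch below demonstrates the edit on a scratch file containing a made-up 'HAVE_UNISTD_H' line plus the 'LARGEFILES' define; in a real build the target would be 'src/config.h' after running 'configure'. The helper name 'disable_largefiles' is hypothetical.

```shell
#!/bin/sh
# Sketch only: disable_largefiles is a hypothetical helper.
disable_largefiles() {
    # Wrap the define in a C comment. The result is written to a temporary
    # file and moved into place because in-place sed options are not portable.
    sed 's|^#define LARGEFILES 1$|/* #define LARGEFILES 1 */|' "$1" > "$1.new" &&
        mv "$1.new" "$1"
}

# Demonstrate on a scratch file rather than a real src/config.h.
cfg=${TMPDIR:-/tmp}/config.h.$$
printf '%s\n' '#define HAVE_UNISTD_H 1' '#define LARGEFILES 1' > "$cfg"
disable_largefiles "$cfg"
cat "$cfg"
```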

If anybody else has experience compiling applications with the VisualAge C++ compiler, we would appreciate feedback on how to make the build process smoother.

Frequently Asked Questions

How accurate are the E-value statistics?

In mpiBLAST 1.3 they are exact for all supported search types. In versions 1.2.1 and earlier, e-values for 'blastn' were loosely approximated using a linear equation. For 'blastp', 'blastx', 'tblastn', and 'tblastx' they were inaccurate in versions 1.2.1 and earlier. Note that by "exact" we mean exactly the same as those generated by NCBI-BLAST. As of 2004, NCBI is still refining the e-value calculations in their blast implementation.

How does mpiBLAST output differ from NCBI 'blastall' output?

As of mpiBLAST 1.3, the text, XML, and ASN.1 output formats are nearly identical to NCBI 'blastall'. When an individual query has multiple database hits with the same e-value and bit score, mpiBLAST may report these hits in a different order than NCBI's blastall. Further, mpiBLAST does not report some search statistics such as the number of hits to the database.

Does mpiBLAST support PSI-BLAST, PHI-BLAST, RPS-BLAST, etc.?

No.
Although it may be possible to parallelize these search algorithms using database segmentation, our preliminary studies indicate they would not benefit as much as the other blast search types do from such a parallelization scheme.

Does mpiBLAST support Mega-BLAST?

No. We are focusing our efforts on 'blastn', 'blastp', 'blastx', 'tblastn', and 'tblastx'.

I have a cluster with yy processors. How many database fragments should I use?

yy

I have a cluster with yy processors. How many MPI processes should I start with 'mpirun -np'?

Start yy + 2 'mpiblast' processes. This will start one mpiBLAST worker per processor, plus one output and one scheduler process. The minimum value that can be specified to '-np' is 3. Depending on the hardware, it may be more efficient to format the database into yy - 1 fragments and start yy + 1 mpiBLAST processes so that the output process doesn't compete with a worker process for resources.
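The two layouts described in this answer can be written down as a small shell sketch. The helper name 'plan_layout' and its optional second argument are hypothetical; the arithmetic is exactly the yy / yy + 2 and yy - 1 / yy + 1 rules given above.

```shell
#!/bin/sh
# Sketch only: plan_layout is a hypothetical helper.
# Usage: plan_layout <processors> [yes]
# Pass "yes" to leave one processor free for the output process.
plan_layout() {
    procs=$1
    dedicated=${2:-no}
    if [ "$dedicated" = yes ]; then
        frags=$((procs - 1))
        np=$((procs + 1))
    else
        frags=$procs
        np=$((procs + 2))
    fi
    if [ "$frags" -lt 1 ] || [ "$np" -lt 3 ]; then
        echo "too few processors for this layout: $procs" >&2
        return 1
    fi
    echo "fragments=$frags np=$np"
}

plan_layout 8        # prints fragments=8 np=10
plan_layout 8 yes    # prints fragments=7 np=9
```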

Our experience indicates that BLAST jobs tend to be limited by memory bandwidth more so than CPU speed. Since each virtual CPU in a hyperthreaded setup shares the same memory bus, the benefit of running additional mpiBLAST processes is usually negligible.

Can mpiBLAST run without local storage?

Yes, set the local storage path to be identical to the shared storage path.

Can mpiBLAST run without a shared filesystem?

Yes, as of mpiBLAST 1.3.0. The database and query can be stored on a remotely accessible filesystem and copied via rcp or scp. If the database and query reside on the node with rank 0 they can be distributed directly by mpiBLAST. The '--copy-via' option described in the Usage section of this document has more details.

Can mpiBLAST be run on a single processor system for testing purposes?

Yes, simply execute the desired number of MPI processes using the '-np' flag. The minimum is '-np 3'.

I benchmarked mpiBLAST but I don't see super-linear speedup! Why?!

mpiBLAST only yields super-linear speedup when the database being searched is significantly larger than the core memory on an individual node. The super-linear speedup results published in the ClusterWorld 2003 paper describing mpiBLAST are measurements of mpiBLAST v0.9 searching a 1.2GB (compressed) database on a cluster where each node has 640MB of RAM. A single node search results in heavy disk I/O and a long search time.
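A back-of-the-envelope way to think about this is to ask how many fragments are needed before a single fragment fits in one node's memory. The sketch below does that ceiling division. The helper name 'min_fragments' and the 5000 MB database size are hypothetical, and the estimate is optimistic because it treats all of a node's RAM as available for the buffer cache.

```shell
#!/bin/sh
# Sketch only: min_fragments is a hypothetical helper. It ignores memory
# used by the operating system and by mpiblast itself.
min_fragments() {
    db_mb=$1    # size of the formatted database in MB
    ram_mb=$2   # RAM per node in MB
    if [ "$ram_mb" -le 0 ]; then
        echo "ram_mb must be positive" >&2
        return 1
    fi
    # Integer ceiling of db_mb / ram_mb.
    echo $(( (db_mb + ram_mb - 1) / ram_mb ))
}

min_fragments 5000 640    # prints 8
```

If the result is 1, the whole database already fits in a single node's memory and super-linear speedup should not be expected.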

Developer Information

Compiling mpiBLAST from CVS

The latest version of mpiBLAST can be exported directly from the Sourceforge CVS repository using this command:
'cvs -d :pserver:anonymous@cvs.sf.net:/cvsroot/mpiblast export mpiblast'

Alternatively, you may want to use 'checkout' and 'update' instead of 'export' in order to stay up-to-date with the latest changes.

When mpiBLAST is checked out directly from CVS (via a 'cvs checkout' or 'cvs export' command) it's missing several necessary components of the build system. In order to create these, you will need the GNU autotools development system. Specifically, the following series of commands will check out a copy of 'mpiblast' from CVS and prepare it for a build::

> cvs -d :pserver:anonymous@cvs.sf.net:/cvsroot/mpiblast export -Dtomorrow mpiblast
> cd mpiblast
> aclocal
> automake -a
> autoconf

These commands regenerate the build and dependency tracking system. At this point the build can proceed via the standard './configure' ; 'make' ; 'make install' procedure.

On some older systems it may be necessary to change AC_CONFIG_HEADERS in configure.in to AM_CONFIG_HEADER. It also may be necessary to run 'libtoolize' before running 'aclocal'. For the best mpiBLAST build experience, we suggest using automake 1.7 or later and autoconf 2.5 or later.

mpiBLAST API documentation

mpiBLAST uses the "doxygen":http://www.doxygen.org documentation system to provide a standardized format for source code documentation. Searchable doxygen documentation in HTML format is available at "http://gel.ahabs.wisc.edu/mpiblast/apidocs/":http://gel.ahabs.wisc.edu/mpiblast/apidocs/ . The doxygen API docs are based on a nightly snapshot of the CVS HEAD revision.

Bug tracking

We have been logging bugs using the sourceforge mpiBLAST bug tracker at 'http://sourceforge.net/tracker/?group_id=78850&atid=554646'

Contact the mpiBLAST developers!

If you are thinking about implementing a new feature based on the mpiBLAST code, please contact the developers first! We may be able to give you helpful pointers on the quickest way to get the job done, or connect you with somebody else already working on that feature. The best way to reach all developers is through the mpiblast mailing list: mpiblast-users@lists.sourceforge.net.

If you would like to directly integrate your code with the primary mpiBLAST codebase, contact us and we will give you write access to our sourceforge CVS repository.

Version History for mpiBLAST Releases

Changes between 1.3.0 and 1.4.0

General changes:

• mpiBLAST more accurately reproduces NCBI BLAST output formats. Specifically, results are now output in the same order even if they have identical bit scores and e-values. Search statistics such as the number of hits and number of extensions are correctly collected.
• Scalability has improved dramatically. The speedup results from streamlined communication between worker and writer processes. Thanks to Mike Firpo and Adam Moody of LLNL for the suggestion.
• mpiBLAST performs better when the database has not already been distributed. A new fragment copy scheduling algorithm copies each fragment to at least one worker prior to search startup. By default, a single complete copy of the database is distributed. The number of replicate copies of the database that will be distributed can be controlled with the --db-replicate-count argument to mpiblast. Settings larger than 1 enable mpiblast to load balance the search.
• An MPI program called 'mpiblast_cleanup' has been added. When run on a set of nodes using mpirun it cleans up mpiblast data on each node's local storage device
• MPICH 2 is supported on unix, and is required on Windows
• An optional '--with-mpi=</path/to/mpi>' argument to './configure' can specify the path to the MPI installation during compilation

Changes to mpiblast:

• Using more workers than fragments has been fixed
• '--copy-via=mpi' has been fixed
• '--removedb' has been fixed. The new behavior is to create a temporary directory on local storage where all work gets done. The temporary directory is automatically removed upon successful job completion.
• timestamp checking has been fixed. mpiblast now checks the BLAST database timestamp in the .phr file to see whether fragments on local storage match fragments on shared storage. If the fragments on shared storage have a different timestamp, the local fragments are ignored.
• during result communication between workers and the writer, the workers send only the portion of aligned biosequences used in the alignment instead of the entire biosequence. This change allows nucleotide databases with large sequences such as 'nt' and 'human_chromosome' to be searched without the '--disable-mpi-db' flag.

Changes to mpiformatdb:

• By default mpiformatdb writes a temporary file with reordered input sequences. Reordering the sequence file facilitates balanced fragment sizes. Rewriting the input database slows formatting and can be skipped with the '--skip-reorder' option.
• mpiformatdb supports reading databases from stdin using '-i stdin' in conjunction with '--skip-reorder -t <db title>'.

Changes between 1.2.1 and 1.3.0

IMPORTANT! The build process has changed. The NCBI Toolbox must be patched and re-compiled prior to mpiBLAST compilation. See the README for more details.

General changes:

• mpiBLAST requires the October 2004 release of the NCBI Toolbox (version 2.2.10 of blastall)

• Up to 250 database fragments are supported by default. No patching to the NCBI toolbox necessary.
• mpiblast.conf is no longer used! The shared and local storage directories should be specified directly in the .ncbirc configuration file in the following manner::
[mpiBLAST]
Shared=/path/to/shared/storage
Local=/path/to/local/storage

• If the shared and local storage paths can't be found in .ncbirc then the environment variables MPIBLAST_SHARED and MPIBLAST_LOCAL are checked. If neither .ncbirc nor MPIBLAST_LOCAL has the local storage directory, then $TMPDIR/mpiBLAST_local_db is used. If $TMPDIR isn't defined it defaults to /tmp.
• Shared and local directories are checked for existence and proper permissions. If they don't exist they are created with permissions 775.
• replaced setenv() with putenv() to support Solaris
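The lookup order for the local storage directory described in the list above can be mirrored in a short shell sketch. This is an approximation written for illustration, not mpiBLAST's own code: the function name 'resolve_local' is hypothetical, the section parsing is deliberately simple, and an empty 'TMPDIR' is treated the same as an unset one.

```shell
#!/bin/sh
# Sketch only: resolve_local is a hypothetical helper that imitates the
# documented fallback order (.ncbirc, then MPIBLAST_LOCAL, then TMPDIR).
resolve_local() {
    rc=$1
    val=
    if [ -f "$rc" ]; then
        # Print the Local= value from the [mpiBLAST] section, if present.
        val=$(awk -F= '
            /^\[/ { in_sec = ($0 == "[mpiBLAST]"); next }
            in_sec && $1 == "Local" { print substr($0, index($0, "=") + 1); exit }
        ' "$rc")
    fi
    if [ -n "$val" ]; then
        echo "$val"
    elif [ -n "${MPIBLAST_LOCAL:-}" ]; then
        echo "$MPIBLAST_LOCAL"
    else
        echo "${TMPDIR:-/tmp}/mpiBLAST_local_db"
    fi
}

# Demonstrate with a scratch configuration file.
rc=${TMPDIR:-/tmp}/ncbirc.demo.$$
printf '%s\n' '[mpiBLAST]' 'Shared=/path/to/shared/storage' 'Local=/path/to/local/storage' > "$rc"
resolve_local "$rc"
rm -f "$rc"
```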

Changes to mpiblast:

• mpiblast does e-value adjustment using both the effective database and the effective query lengths, leading to more accurate e-value statistics.
• mpiblast has been updated with the latest changes to blastall version 2.2.9
• mpiblast implements query pipelining. This means that queries are searched in order and the workers send results to the master as queries complete. As the master receives results it writes them to disk. Query pipelining eliminates the extreme memory requirements that mpiblast previously had for large query sets. It also permits better parallelization for jobs with few database fragments and many queries.
• The text output formats better reflect the actual NCBI text output
• mpiblast supports database pipelining. This means that instead of all the workers copying their fragments from shared storage at once, users have the option to limit the number concurrently accessing shared storage via the --concurrent option. Slow NFS disks should probably use --concurrent=1 in order to see speedups of a factor of 5 or greater. Faster shared storage should use a higher value or use "--copy-via=none" instead (see below for details).
• Users can set the copy method at runtime through the --copy-via option. Possibilities are "cp" to use the standard copy command, "scp" to use ssh, "rcp" to use rsh, "none" to not copy at all (very useful for fast parallel file systems like GFS, PFS, and PVFS), and "mpi" to use MPI_Send/MPI_Recv.
• If using --copy-via=mpi, user can set the maximum buffer size that MPI will use when copying files through the "--mpi-size=" option.
• The -z option can be used to specify an effective database size
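
The idea behind database pipelining, limiting how many workers read from shared storage at once, can be illustrated with a counting semaphore. This is a thread-based analogy under stated assumptions, not mpiBLAST's implementation (which runs as MPI processes); `copy_fragments` and `copy_fn` are hypothetical names.

```python
import threading
import time


def copy_fragments(num_workers, concurrent, copy_fn):
    """Run copy_fn(worker_id) for every worker, at most `concurrent` at a time."""
    gate = threading.Semaphore(concurrent)

    def worker(worker_id):
        with gate:  # wait for a free slot on shared storage
            copy_fn(worker_id)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

With `concurrent=1` the copies are fully serialized, which corresponds to the --concurrent=1 setting suggested for slow NFS disks.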

Changes to mpiformatdb:

• mpiformatdb now creates exactly the requested number of fragments reliably
• Created fragments are identically sized for better load balance. To support this, the input database is rewritten in the system's temp directory with its sequence entries in a different order.
• The reordering can be skipped with the --skip-reorder option
• The temp directory for reordering defaults to $TMPDIR; if $TMPDIR is not set, /tmp is used.
• mpiformatdb supports GI list creation using formatdb's -F -B -L options
• --decomp is no longer supported
• 3 digit fragment identifiers are ALWAYS used
• mpiformatdb now returns 0 on success instead of the number of fragments created, because it guarantees the requested number of fragments
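
One way to obtain evenly sized fragments by reordering sequences is a greedy assignment: take sequences from longest to shortest and place each into the fragment that currently holds the least data. The notes above do not say which strategy mpiformatdb uses, so the sketch below is an assumption for illustration; the function names and the `name.NNN` naming scheme are hypothetical.

```python
import heapq


def assign_to_fragments(seq_lengths, num_fragments):
    """Greedy balancing: longest sequences first, each into the lightest fragment."""
    # Heap entries are (total length so far, fragment index).
    heap = [(0, frag) for frag in range(num_fragments)]
    heapq.heapify(heap)
    fragments = [[] for _ in range(num_fragments)]
    order = sorted(range(len(seq_lengths)),
                   key=lambda i: seq_lengths[i], reverse=True)
    for seq_index in order:
        total, frag = heapq.heappop(heap)
        fragments[frag].append(seq_index)
        heapq.heappush(heap, (total + seq_lengths[seq_index], frag))
    return fragments


def fragment_name(db_name, frag):
    """Three-digit, zero-padded fragment identifier (naming is illustrative)."""
    return f"{db_name}.{frag:03d}"
```

Because every fragment index stays in the heap, the result always contains exactly the requested number of fragments, even if some end up empty for very small inputs.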

Changes between 1.2.0 and 1.2.1:

Changes to mpiblast:

• Fixed a memory deallocation bug that caused crashes during result output
• mpiblast now prints "No hits found" for all queries without results at the end of a query set. Previously if a query file had no results mpiblast would write an empty results file.

General changes:

• Added a 1000 fragments patch for the November 2003 NCBI toolbox release

Bugs discovered:

• Under Windows, mpiBLAST may print incorrect alignments for queries that have a large number (e.g. > 100) of results. mpiBLAST prints the following error message when this occurs: "Error: Timed out waiting for biosequence from workers". This problem can be avoided by setting a more restrictive E-value cutoff using the -e command line option. This behavior has not been observed under Linux.

Changes between 1.1.1 and 1.2.0:

New Features:

• When writing result alignments, mpiBLAST uses the database as distributed on the worker nodes instead of reading the database from the shared filesystem. This can result in significant speedups, especially when the file server is slow or loaded.
• mpiBLAST prints alignments for translated searches (blastx, tblastn, tblastx)
• Database update functionality. New sequences can be added to an existing mpiBLAST database.
• mpiBLAST has a web interface. A script to interface mpiBLAST to NCBI's wwwblast web service has been included.
• Updated the BLAST functionality to be consistent with the latest NCBI-BLAST release. mpiBLAST now accepts the -m 10 and -m 11 output format options to write output in text or binary ASN.1 format. Previously the -O option was used.

Fixes to mpiBLAST:

• Fixed a crash when writing tabular format output (-m 8 and -m 9)
• Fixed a crash when writing XML format output (-m 7)
• The --removedb option will now remove the database even if a search is not being performed

General changes:

• mpiBLAST 1.2.0 requires the April 2003 or later release of NCBI Toolbox
• getopt1.c is now included in the build for systems lacking getopt_long (AIX)
• MS Visual Studio 7 .NET projects are now included to assist users who would like to compile mpiBLAST under Windows.

Changes between 1.1.0 and 1.1.1:

General Fixes:

• mpiblast now correctly looks for the configuration file in the $INSTALL_PREFIX/etc/ directory if it is not at $HOME/.mpiblastrc
• Fixed text README to reflect that the configuration file only contains two lines
• mpiblast no longer prints warnings when using databases formatted without indices (the -o F formatdb option)
• Numerous compiler-specific compilation error workarounds

Fixes to mpiformatdb:

• mpiformatdb directly outputs the database to the destination instead of trying to copy it
• The gcc 3.x standard c++ library has a large file bug that prevented counting the database size correctly on databases > 2GB. A workaround using C file I/O was contributed by Cesar Delgado.
• mpiformatdb does a better job fragmenting the database into the requested number of fragments, thanks to a patch by Jason Gans
• There is a --decomp option to mpiformatdb that prints APPROXIMATE database sizes based on number of fragments (also by J.D. Gans)

Fixes to mpiblast:

• mpiblast now uses MPI_Abort() when exiting on an error condition
• Several memory leaks were corrected and memory requirements reduced by a patch contributed by Jason Gans

Changes between 1.0.1 and 1.1.0:

• Ported to Windows/mpich-nt
• Rewrote mpiformatdb in C++; it now links directly to the NCBI formatdb code. As a result, it is no longer necessary to install the formatdb or BLAST executables, or to specify their location in the mpiBLAST configuration file
• Fixed a file copy bug when formatting protein databases with mpiformatdb
• mpiformatdb no longer needs to be run from the directory containing the unformatted database.
• The default configuration file semantics have been changed. Under Unix both mpiblast and mpiformatdb default to ~/.mpiblastrc. If ~/.mpiblastrc does not exist or the $HOME environment variable is not set, then $INSTALL_PREFIX/etc/mpiblast.conf is used, where $INSTALL_PREFIX is the path given to ./configure for your mpiBLAST installation. If the configuration file is specified on the command line using --config-file, the defaults are overridden.
• Under Windows the default configuration file location is %USERPROFILE%\.mpiblastrc. If %USERPROFILE%\.mpiblastrc does not exist or %USERPROFILE% is not set, then mpiBLAST tries %windir%\mpiblast.ini. As in Unix, the default config file location can be overridden with the --config-file command line argument.
• Some versions of the NCBI toolkit have a bug that causes mpiBLAST to print warning messages about 'taxdb.bti'. These are harmless. To eliminate the warning message the following line can be deleted from tools/readdb.c in the NCBI development library. Of course you will need to recompile after the modification::
     /tools/readdb.c
     12:29:30.000000000 -0800
     12:29:51.000000000 -0800
     @@ -2497,7 +2497,6 @@
                  return buffer;
              } else {
                  /* we cannot find directory :( */
     -            ErrPostEx(SEV_WARNING, 0, 0, "Could not find %s", filename);
                  MemFree(buffer); MemFree(buffer1);
                  return NULL;
              }
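
The configuration file lookup order described in the notes for this release can be summarized in a short sketch. It is an illustration under stated assumptions, not the mpiBLAST source: `find_config_file` is a hypothetical name, `cli_path` stands in for the value of --config-file, and `install_prefix` for the path given to ./configure.

```python
import os


def find_config_file(env, cli_path=None, windows=False,
                     install_prefix="/usr/local", exists=os.path.exists):
    """Return the configuration file path following the 1.1.0 lookup order."""
    if cli_path:                      # --config-file overrides every default
        return cli_path
    if windows:
        profile = env.get("USERPROFILE")
        if profile:
            candidate = profile + "\\.mpiblastrc"
            if exists(candidate):
                return candidate
        # Fallback; the handling of an unset %windir% is not described above.
        return env.get("windir", "") + "\\mpiblast.ini"
    home = env.get("HOME")
    if home:
        candidate = home + "/.mpiblastrc"
        if exists(candidate):
            return candidate
    return install_prefix + "/etc/mpiblast.conf"
```

The `exists` parameter defaults to `os.path.exists` and is only there so the lookup can be exercised without creating real files.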


Changes between 1.0.0 and 1.0.1:

• Fixed a bug causing a crash when the database was formatted without indexes
• Added support for up to 1000 database fragments (see README for details)
• Added support for cleaning up local storage of database fragments
• Included GNU getopt in the distribution for compilation on systems such as AIX and Solaris that do not have getopt_long().

Changes between 0.9.0 and 1.0.0:

• Dynamic database distribution:

This change has many implications. Database fragments are no longer distributed when formatting the database with mpiformatdb. Instead, database fragments are copied from shared storage to worker nodes as necessary in order to complete each BLAST search. Once copied, a fragment remains on the worker's local storage for use by future searches.

• Use of NCBI library to output merged results:

Previously results were merged with a text file parser. BLAST results are now merged by mpiBLAST and output directly using the NCBI library. As a result, users can now choose to output BLAST results in most formats supported by NCBI BLAST, including XML and ASN.1.

• Corrected E-value statistics:

E-values are now adjusted based on the size of the entire database being searched. NOTE 06/11/2005: The 1.0.0 release of mpiBLAST corrected only blastn (nucleotide) search statistics!

• All nucleotide DB fragment index files are now copied to workers correctly
• mpiformatdb is now standalone and should be run without mpirun. This is a side effect of the dynamic database distribution described above.
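
The copy-on-demand behavior of dynamic database distribution can be sketched as a simple local cache: a worker copies a fragment from shared storage only if it is not already present locally. This is a minimal illustration, not mpiBLAST's code; the function names and the fragment file name `nt.000.nsq` are hypothetical.

```python
import os
import shutil
import tempfile


def ensure_local_fragment(fragment_file, shared_dir, local_dir):
    """Copy a fragment file to local storage unless it is already cached.

    Returns (local_path, copied), where copied is False on a cache hit.
    """
    local_path = os.path.join(local_dir, fragment_file)
    if os.path.exists(local_path):
        return local_path, False  # left over from an earlier search
    os.makedirs(local_dir, exist_ok=True)
    shutil.copy2(os.path.join(shared_dir, fragment_file), local_path)
    return local_path, True


def demo():
    """Request the same fragment twice; only the first request copies it."""
    with tempfile.TemporaryDirectory() as root:
        shared = os.path.join(root, "shared")
        local = os.path.join(root, "local")
        os.makedirs(shared)
        with open(os.path.join(shared, "nt.000.nsq"), "w") as handle:
            handle.write("placeholder fragment data")
        _, first = ensure_local_fragment("nt.000.nsq", shared, local)
        _, second = ensure_local_fragment("nt.000.nsq", shared, local)
        return first, second
```

The second request finds the file already on local storage, which is why later searches against the same database avoid the shared filesystem.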