Documentation for build_gbff_cu.pl
This file illustrates one method by which the build_gbff_cu.pl utility can be used to build a cumulative GenBank flatfile, based on GenBank incremental update (GIU) files provided by NCBI.
A detailed help/usage message can be obtained for this utility like so:
build_gbff_cu.pl -h
Note: Although this document describes the creation of a cumulative update for DNA sequences, the same procedures can be used on incremental update files for protein sequences in the GenPept flatfile format, yielding a cumulative GenPept file.
Strategy:
First, the initial week's worth of GIU files must be obtained and decompressed with gunzip. We will start with the first week of updates that were provided after the close-date for GenBank Release 135.0 (April 2003). The files are:
  83545259 nc0411.flat
 403200472 nc0412.flat
  35783681 nc0413.flat
  99721631 nc0414.flat
  68090065 nc0415.flat
 257087328 nc0416.flat
 403420734 nc0417.flat
-----------------------
1350849170 bytes
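The decompression step can be scripted. The following is a minimal sketch, assuming the downloaded GIU files carry a .gz suffix (for example nc0411.flat.gz) and sit together in one directory; the function name gunzip_gius is hypothetical and not part of build_gbff_cu.pl.

```shell
# Decompress every *.flat.gz file in a directory, leaving *.flat files behind.
# Assumption: compressed GIU files are named like nc0411.flat.gz.
gunzip_gius() {
    dir=$1
    for gz in "$dir"/*.flat.gz; do
        # Skip the unexpanded pattern when no compressed files are present.
        [ -e "$gz" ] || continue
        gunzip "$gz" || return 1
    done
}
```

For the example above, the call would be: gunzip_gius /d1/nc_build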
Create a file (for example, 'giu.list.1') that lists these seven GIU files. If the GIUs are stored in their own directory, make sure that the full pathname is included. Here's what giu.list.1 might contain:
/d1/nc_build/nc0411.flat
/d1/nc_build/nc0412.flat
/d1/nc_build/nc0413.flat
/d1/nc_build/nc0414.flat
/d1/nc_build/nc0415.flat
/d1/nc_build/nc0416.flat
/d1/nc_build/nc0417.flat
Important! The list of GIU files MUST be ordered by date, oldest to most recent. If not, the content of the CU that you build will be incorrect.
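Because the ordering matters, it may help to generate the list rather than type it, and to check an existing list before a build. This is a sketch only. It assumes that all GIU files live in one directory, that their names follow the nc<MMDD>.flat pattern seen above, and that they all fall within a single calendar year, so that name order equals date order. The function names make_giu_list and check_giu_list are hypothetical.

```shell
# Print the full paths of all GIU files in a directory, one per line,
# sorted by file name.
# Assumption: nc<MMDD>.flat names within one calendar year, so that
# sorting by name is the same as sorting by date.
make_giu_list() {
    dir=$1
    for f in "$dir"/nc*.flat; do
        # Skip the unexpanded pattern when the directory has no GIU files.
        [ -e "$f" ] || continue
        printf '%s\n' "$f"
    done | LC_ALL=C sort
}

# Succeed only if an existing list file is already in sorted order.
# Assumption: every entry shares the same directory prefix.
check_giu_list() {
    LC_ALL=C sort -c "$1" 2>/dev/null
}
```

For example: make_giu_list /d1/nc_build > giu.list.1, followed by check_giu_list giu.list.1 before starting the build. A list that spans a year boundary would need to be ordered by hand.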
Create a de novo cumulative GenBank flatfile and an associated accession-number index using these GIU files:
build_gbff_cu.pl -i giu.list.1 -ncu new.gbcu.flat -ni new.gbcu.idx
On a lightly loaded 4-processor Solaris/SunOS 5.8 machine with 333 MHz CPUs, building the CU required about 96 MB of RAM (up to 43 MB for executions of ffidx.pl, and 53 MB for build_gbff_cu.pl) and 35 minutes of wall-clock time.
The resulting CU was 1.22 GB in size:
1224762056 new.gbcu.flat
   6003842 new.gbcu.idx
Once you are convinced that the new CU and its index are valid (i.e., no errors were issued by ffidx.pl or by build_gbff_cu.pl), they can be renamed in anticipation of the next CU build:
mv new.gbcu.flat gbcu.flat
mv new.gbcu.idx gbcu.idx
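The rename step can be guarded so that an obviously failed build does not overwrite a good CU. The sketch below only checks that both new files exist and are non-empty; it is not a substitute for reviewing the messages printed by ffidx.pl and build_gbff_cu.pl. The function name promote_cu is hypothetical.

```shell
# Rename a freshly built CU and its index so that they become the
# "previous" files for the next build. Nothing is renamed if either new
# file is missing or empty.
promote_cu() {
    for f in new.gbcu.flat new.gbcu.idx; do
        if [ ! -s "$f" ]; then
            echo "promote_cu: $f is missing or empty; nothing renamed" >&2
            return 1
        fi
    done
    mv new.gbcu.flat gbcu.flat && mv new.gbcu.idx gbcu.idx
}
```

Run promote_cu from the directory that holds the new files, after the build has been inspected.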
Next, we obtain (nearly) a second week's worth of GIU files and gunzip them. The files are:
194744240 nc0418.flat
139439481 nc0419.flat
 32428330 nc0420.flat
 24184266 nc0421.flat
 47840602 nc0422.flat
 90636882 nc0423.flat
-----------------------
529273801 bytes
Create a file (for example, 'giu.list.2') which contains a list of these six GIU files. Here's what giu.list.2 might contain:
/d1/nc_build/nc0418.flat
/d1/nc_build/nc0419.flat
/d1/nc_build/nc0420.flat
/d1/nc_build/nc0421.flat
/d1/nc_build/nc0422.flat
/d1/nc_build/nc0423.flat
Create a new cumulative GenBank flatfile (and associated index) from the existing CU and these additional GIUs:
build_gbff_cu.pl -i giu.list.2 -ncu new.gbcu.flat -ni new.gbcu.idx -pcu gbcu.flat -pi gbcu.idx
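The two invocations shown in this document differ only in the -pcu and -pi options. As an illustration, a small helper can print whichever command line fits the current directory. This is a sketch with a hypothetical function name, cu_build_command; it uses only the options shown above and prints the command instead of running it.

```shell
# Print the build_gbff_cu.pl command line appropriate for the current
# directory: a de novo build when no previous CU and index exist,
# otherwise an incremental build on top of gbcu.flat and gbcu.idx.
# Only the options shown in this document are used.
cu_build_command() {
    list=$1
    if [ -s gbcu.flat ] && [ -s gbcu.idx ]; then
        echo "build_gbff_cu.pl -i $list -ncu new.gbcu.flat -ni new.gbcu.idx -pcu gbcu.flat -pi gbcu.idx"
    else
        echo "build_gbff_cu.pl -i $list -ncu new.gbcu.flat -ni new.gbcu.idx"
    fi
}
```

The printed line can be inspected and then run by hand.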
On a lightly loaded 4-processor Solaris/SunOS 5.8 machine with 333 MHz CPUs, building the second version of the CU required about 59 MB of RAM (up to 50 MB for executions of ffidx.pl, and 9 MB for build_gbff_cu.pl) and 27 minutes of wall-clock time.
The resulting CU was 1.64 GB in size:
1640714095 new.gbcu.flat
   6653651 new.gbcu.idx
If this second CU build appears to have been successful, the resulting "new" files could again be renamed, in preparation for the next build attempt:
mv new.gbcu.flat gbcu.flat
mv new.gbcu.idx gbcu.idx
The RAM requirements of build_gbff_cu.pl and ffidx.pl depend on the number of sequences being processed. In this example, the second CU build required only about 60% of the RAM required for the first build (59 MB versus 96 MB).
In general, however, as the CU grows and the number and size of the GIUs increase, so too will the RAM usage.
Thus, it may be a good idea to rebuild the CU frequently, perhaps using no more than 2 or 3 GIUs in any single build.
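One way to keep each build small is to split a long, date-ordered GIU list into several short lists and build from each in turn. The sketch below uses the standard split utility; the function name split_giu_list and the giu.batch. prefix are hypothetical.

```shell
# Split one date-ordered list of GIU files into smaller lists of at most
# $2 entries each (default 3), named <prefix>aa, <prefix>ab, and so on.
# The suffix order produced by split preserves the original line order.
split_giu_list() {
    list=$1
    per_build=${2:-3}
    prefix=${3:-giu.batch.}
    split -l "$per_build" "$list" "$prefix"
}
```

Each resulting list would then be passed to build_gbff_cu.pl with -i, in suffix order, using the -pcu and -pi options for every build after the first and renaming the new files between builds as shown earlier.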