Documentation for build_gbff_cu.pl

This file illustrates one way to use the build_gbff_cu.pl utility to build a cumulative GenBank flatfile (a cumulative update, or CU) from the GenBank incremental update (GIU) files provided by NCBI.

A detailed help/usage message can be obtained for this utility like so:

   build_gbff_cu.pl -h

Note: Although this document describes the creation of a cumulative update for DNA sequences, the same procedures can be used on incremental update files for protein sequences in the GenPept flatfile format, yielding a cumulative GenPept file.

Strategy:


First, the initial week's worth of GIU files must be obtained and uncompressed with gunzip. We will start with the first week of updates provided after the close date for GenBank Release 135.0 (April 2003). The files, with their sizes in bytes, are:

   83545259    nc0411.flat
   403200472   nc0412.flat
   35783681    nc0413.flat
   99721631    nc0414.flat
   68090065    nc0415.flat
   257087328   nc0416.flat
   403420734   nc0417.flat
   -----------------------
   1350849170  bytes
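
A size listing like the one above can help when estimating the disk space a build will need. The following is a minimal sketch of how such a listing might be produced; the function name giu_sizes is hypothetical and is not part of build_gbff_cu.pl.

```shell
# Hypothetical helper: print "bytes  name" for each file, then a total,
# in roughly the layout used in this document.
giu_sizes() {
    total=0
    for f in "$@"; do
        # The arithmetic expansion strips any padding that wc may emit.
        n=$(( $(wc -c < "$f") ))
        printf '%-11s %s\n' "$n" "$(basename "$f")"
        total=$((total + n))
    done
    echo '-----------------------'
    printf '%-11s bytes\n' "$total"
}
```

For example:

   giu_sizes /d1/nc_build/nc041[1-7].flat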

Create a file (for example, 'giu.list.1') that lists these seven GIU files. If the GIUs are stored in their own directory, make sure that each entry includes the full pathname. Here is what giu.list.1 might contain:

   /d1/nc_build/nc0411.flat
   /d1/nc_build/nc0412.flat
   /d1/nc_build/nc0413.flat
   /d1/nc_build/nc0414.flat
   /d1/nc_build/nc0415.flat
   /d1/nc_build/nc0416.flat
   /d1/nc_build/nc0417.flat

Important! The list of GIU files MUST be ordered by date, oldest to most recent. Otherwise, the content of the CU that you build will be incorrect.
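
Because the ordering matters, it may be worth generating the list file rather than typing it by hand. The sketch below assumes that the GIU file names sort chronologically as plain strings, which appears to hold for the names in this example but would break across a year boundary. The function names make_giu_list and check_giu_order are hypothetical.

```shell
# Hypothetical helper: write a list of GIU files, sorted by name.
# ASSUMPTION: name order equals date order (true for nc0411 .. nc0417).
make_giu_list() {
    giu_dir=$1      # directory holding the uncompressed GIU files
    list_file=$2    # list file to create, e.g. giu.list.1
    for f in "$giu_dir"/nc*.flat; do
        if [ -f "$f" ]; then printf '%s\n' "$f"; fi
    done | LC_ALL=C sort > "$list_file"   # byte-wise sort, locale independent
}

# Succeeds only if the lines of the list file are already in ascending order.
check_giu_order() {
    LC_ALL=C sort -c "$1"
}
```

For example:

   make_giu_list /d1/nc_build giu.list.1
   check_giu_order giu.list.1 || echo "giu.list.1 is out of order"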


Create a de novo cumulative GenBank flatfile and an associated accession-number index using these GIU files:

   build_gbff_cu.pl -i giu.list.1 -ncu new.gbcu.flat -ni new.gbcu.idx

On a lightly loaded 4-processor Solaris/SunOS 5.8 machine with 333 MHz CPUs, building the CU required about 96 MB of RAM (up to 43 MB for executions of ffidx.pl, and 53 MB for build_gbff_cu.pl) and 35 minutes of wall-clock time.
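
Since a build of this size ran for about half an hour, it can be useful to confirm that every entry in the list file points at a readable file before starting. This is a minimal sketch; check_giu_list is a hypothetical helper, and nothing here is performed by build_gbff_cu.pl itself as far as this document states.

```shell
# Hypothetical pre-flight check: report list entries that are not
# readable regular files. Succeeds only if every entry is usable.
check_giu_list() {
    missing=0
    while IFS= read -r path; do
        if [ -z "$path" ]; then continue; fi      # skip blank lines
        if [ ! -f "$path" ] || [ ! -r "$path" ]; then
            echo "not readable: $path" >&2
            missing=$((missing + 1))
        fi
    done < "$1"
    [ "$missing" -eq 0 ]
}
```

For example:

   check_giu_list giu.list.1 && build_gbff_cu.pl -i giu.list.1 -ncu new.gbcu.flat -ni new.gbcu.idx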

The resulting CU was 1.22 GB in size:

   1224762056  new.gbcu.flat
   6003842     new.gbcu.idx

Once you are convinced that the new CU and its index are valid (i.e., no errors were issued by ffidx.pl or by build_gbff_cu.pl), they can be renamed in anticipation of the next CU build:

   mv new.gbcu.flat gbcu.flat
   mv new.gbcu.idx gbcu.idx
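
The rename step can be guarded so that an obviously failed build does not overwrite a good CU. The sketch below only checks that both output files exist and are non-empty; it does not detect errors reported by ffidx.pl or build_gbff_cu.pl, so their output still needs to be reviewed. The function name promote_cu is hypothetical.

```shell
# Hypothetical helper: rename the new CU and index to the "previous"
# names, but only when both new files exist and are non-empty.
promote_cu() {
    new_cu=$1; new_idx=$2; cu=$3; idx=$4
    if [ -s "$new_cu" ] && [ -s "$new_idx" ]; then
        mv "$new_cu" "$cu" && mv "$new_idx" "$idx"
    else
        echo "promote_cu: missing or empty output, nothing renamed" >&2
        return 1
    fi
}
```

For example:

   promote_cu new.gbcu.flat new.gbcu.idx gbcu.flat gbcu.idx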

Next, we obtain (nearly) a second week's worth of GIU files and uncompress them with gunzip. The files are:

   194744240   nc0418.flat
   139439481   nc0419.flat
   32428330    nc0420.flat
   24184266    nc0421.flat
   47840602    nc0422.flat
   90636882    nc0423.flat
   -----------------------
   529273801   bytes

Create a file (for example, 'giu.list.2') that lists these six GIU files, again ordered from oldest to most recent. Here is what giu.list.2 might contain:

   /d1/nc_build/nc0418.flat
   /d1/nc_build/nc0419.flat
   /d1/nc_build/nc0420.flat
   /d1/nc_build/nc0421.flat
   /d1/nc_build/nc0422.flat
   /d1/nc_build/nc0423.flat

Create a new cumulative GenBank flatfile (and associated index) from the existing CU and these additional GIUs:

   build_gbff_cu.pl -i giu.list.2 -ncu new.gbcu.flat -ni new.gbcu.idx -pcu gbcu.flat -pi gbcu.idx

On the same lightly loaded 4-processor Solaris/SunOS 5.8 machine with 333 MHz CPUs, building the second version of the CU required about 59 MB of RAM (up to 50 MB for executions of ffidx.pl, and 9 MB for build_gbff_cu.pl) and 27 minutes of wall-clock time.

The resulting CU was 1.64 GB in size:

   1640714095  new.gbcu.flat
   6653651     new.gbcu.idx

If this second CU build appears to have been successful, the resulting "new" files can again be renamed in preparation for the next build:

   mv new.gbcu.flat gbcu.flat
   mv new.gbcu.idx gbcu.idx

The RAM requirements of build_gbff_cu.pl and ffidx.pl depend on the number of sequences being processed. In this example, the second CU build required only about 60% of the RAM required for the first build (59 MB versus 96 MB).

In general, however, RAM usage will increase as the CU grows and as the number and size of the GIUs increase.

Thus, it may be a good idea to rebuild the CU frequently, perhaps adding no more than 2 or 3 GIUs per build.
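
One way to keep each build small is to feed a long, date-ordered list to the utility a few GIUs at a time. The sketch below uses only the options shown earlier in this document (-i, -ncu, -ni, -pcu, -pi). Two things are assumptions made for this illustration: that build_gbff_cu.pl exits with a non-zero status on failure, and the BUILD_CU variable, which exists only so that the command name can be overridden. The function name build_in_batches is hypothetical.

```shell
# Hypothetical wrapper: build the CU in small batches from one list file.
# ASSUMPTION: the build command exits non-zero when it fails.
build_in_batches() {
    list_file=$1                              # full, date-ordered GIU list
    batch_size=${2:-3}                        # GIUs per build
    build_cmd=${BUILD_CU:-build_gbff_cu.pl}   # overridable command name
    work=$(mktemp -d) || return 1

    # split preserves line order, and its suffixes (aa, ab, ...) sort in
    # that same order, so the batches stay oldest-first.
    split -l "$batch_size" "$list_file" "$work/batch." || return 1

    for batch in "$work"/batch.*; do
        if [ -s gbcu.flat ] && [ -s gbcu.idx ]; then
            # A previous CU exists: extend it with this batch.
            "$build_cmd" -i "$batch" -ncu new.gbcu.flat -ni new.gbcu.idx \
                -pcu gbcu.flat -pi gbcu.idx || return 1
        else
            # No previous CU: build one from scratch.
            "$build_cmd" -i "$batch" -ncu new.gbcu.flat -ni new.gbcu.idx || return 1
        fi
        # Promote the new files so that the next batch builds on them.
        mv new.gbcu.flat gbcu.flat && mv new.gbcu.idx gbcu.idx || return 1
    done
    rm -rf "$work"
}
```

For example, to process a combined list two GIUs at a time:

   cat giu.list.1 giu.list.2 > giu.list.all
   build_in_batches giu.list.all 2

The wrapper stops at the first failed step and leaves the last good gbcu.flat and gbcu.idx in place. It does not inspect the messages printed by the tools, so those should still be checked after each run.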