DBupdate: mining new marker sequences

Database update (dbupdate mode) is a workflow that lets PhyloSift users customize and expand their local reference marker gene databases with new data. Running dbupdate mode produces an updated marker directory that incorporates new reference sequences mined from an input data repository. To run a database update, specify a directory of input sequences in FASTA format (e.g. newly published genome sequences, assembled contigs, or any other type of sequence data) and the path to your PhyloSift marker directory.

At UC Davis, we currently update the public PhyloSift marker database at least weekly (this is the marker database that is automatically downloaded and updated when users run any type of PhyloSift analysis). We identify newly published genome sequences on EBI’s servers and mine them for new reference marker gene sequences. A full database update usually takes around a day on our central server.

To get started, run the command:

./phylosift dbupdate --repository <data_repository_path> --destination <destination_marker_path>

For example:

./phylosift dbupdate --repository /home/user/PhyloSift/PS_temp/new_genomes --destination /home/user/share/phylosift/markers
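Because a full update can run for a long time, it may help to check both paths before launching it. The following is a minimal sketch of such a pre-flight wrapper, not part of PhyloSift itself. Only the `dbupdate` mode and the `--repository` and `--destination` options come from the commands above; the function name `run_dbupdate` and the `PHYLOSIFT_BIN` environment variable are hypothetical names introduced for illustration.

```shell
# Hypothetical pre-flight wrapper around `phylosift dbupdate`.
# Assumption: PHYLOSIFT_BIN is an illustrative variable (not a PhyloSift
# setting) naming the executable to call; it defaults to ./phylosift.

run_dbupdate() {
    repo="$1"   # directory of input FASTA sequences
    dest="$2"   # PhyloSift marker directory to update

    # Refuse to start if either directory is missing.
    if [ ! -d "$repo" ]; then
        echo "repository directory not found: $repo" >&2
        return 1
    fi
    if [ ! -d "$dest" ]; then
        echo "marker directory not found: $dest" >&2
        return 1
    fi

    # Launch the update with the two options shown in this document.
    "${PHYLOSIFT_BIN:-./phylosift}" dbupdate --repository "$repo" --destination "$dest"
}
```

After sourcing the function from the PhyloSift install directory, `run_dbupdate /home/user/PhyloSift/PS_temp/new_genomes /home/user/share/phylosift/markers` would run the same command as the example above, but only once both directories have been confirmed to exist.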

[Figure: schematic diagram of the DBupdate workflow]

Once running, DBupdate will iterate through the following steps:

  1. DBupdate first runs the PhyloSift client workflow (phylosift) through the search and align modes. New sequences are mined from the genome data and added to the existing PhyloSift marker packages.
  2. Updated trees are inferred for each marker gene. The workflow then splits into two parallel tracks: amino acid trees (protein-coding gene alignments only) and nucleotide trees (translated amino acid alignments).
  3. For amino acid trees, a set of taxa is selected with a maxPD cutoff of 0.02 and a new tree is inferred. New sequences are then added to the PhyloSift marker database at a threshold of 0.25 PD (equivalent to 1 out of 4 positions being different between pairs of sequences); this higher PD threshold enables more aggressive searches of reference databases, since LAST searching is faster against fewer database sequences.
  4. For nucleotide trees, a PD metric is used to split the guide tree into smaller subtrees. Subsets of taxa (containing a minimum of 3 taxa) are selected such that no branch connecting them has a length >0.X for some value of X.
  5. For both amino acid trees and nucleotide subtrees, the NCBI taxonomy IDs are reconciled with the phylogenetic topologies.
  6. Markers are packaged and formatted according to pplacer software requirements and moved into the central PhyloSift marker repository.
  7. The final, updated marker database is provided to PhyloSift users as an automated download: local marker databases are checked each time PhyloSift is run, and any available updates are downloaded automatically.