The Monkey: Packaging new marker genes

The Monkey (build_marker and index modes) is a workflow for building new PhyloSift marker packages from locally housed reference marker gene alignments. It is useful, for example, if you have unpublished, phylogenetically informative marker genes that you want to use for environmental taxonomic assignments. The Monkey outputs a complete marker package for each input gene, consisting of a tree, an HMM profile (or a CM for rRNA data), a taxon map, a list of representative sequences in FASTA format, and the reference gene alignment. The workflow runs in two steps: build_marker mode compiles and writes the reference packages, and index mode then organizes the marker databases for LAST and Bowtie searches in the main PhyloSift client workflow. To run the Monkey you need to specify an input FASTA file of aligned marker gene sequences, a user-defined Phylogenetic Distance (PD) threshold used in tree pruning, and a taxon mapping file containing two tab-separated columns (accession number and NCBI taxon ID).
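Because the taxon mapping file has a strict two-column layout, it can be worth checking it before starting a build. The following is a minimal sketch, not part of PhyloSift: check_taxonmap is a hypothetical helper name, the accessions in the demo are made-up placeholders, and the check assumes that NCBI taxon IDs are plain integers.

```shell
#!/bin/sh
# Hypothetical helper (not a PhyloSift command): verify that every line of a
# taxon map has exactly two tab-separated fields, a non-empty accession and
# a numeric NCBI taxon ID. Returns non-zero if any line is malformed.
check_taxonmap() {
    awk -F '\t' '
        NF != 2 || $1 == "" || $2 !~ /^[0-9]+$/ {
            printf "line %d is malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Demo on a temporary file; the accessions are placeholders for illustration.
tmp_map=$(mktemp)
printf 'ACC0001\t562\nACC0002\t1423\n' > "$tmp_map"
check_taxonmap "$tmp_map" && echo "taxon map looks well-formed"
rm -f "$tmp_map"
```

A file that uses spaces instead of a tab, or that carries a non-numeric second column, makes the function print the offending lines and return a non-zero status.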

Note #1: You will need to install the taxtastic package (developed by Erick Matsen’s group at the Fred Hutchinson Cancer Research Center) and its dependencies before you can run build_marker mode in PhyloSift. taxtastic requires a number of dependencies that we have decided not to bundle with the PhyloSift download, including Python 2.7, SQLAlchemy, xlrd, and docutils. Full instructions for installing taxtastic and its dependencies can be found on the GitHub install page.

Note #2: New marker packages are labeled according to their input file names (the final reference package in the /share/phylosift/markers folder will have the same name as the original file, e.g. NewMarkerGeneAlignment.fasta). Core marker data will be overwritten during a new marker build if the input file shares its name with an existing PhyloSift marker, so give every input file a unique name. In addition, the Monkey can only build a marker package for one gene at a time; we recommend using a wrapper script if you have many different gene alignments to process.
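A wrapper of the kind recommended above might look like the following sketch. It is an illustration under stated assumptions rather than a script shipped with PhyloSift: MARKER_DIR, PD, DRY_RUN, run and build_all are hypothetical names, the default marker directory is taken from the path mentioned in Note #2 and may need adjusting for your installation, and the options are written with double hyphens (--alignment, --tree_pd, --taxonmap). By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of a wrapper: one build_marker call per alignment, then one index call.
# MARKER_DIR, PD and DRY_RUN are hypothetical settings for this example only.
MARKER_DIR=${MARKER_DIR:-/share/phylosift/markers}
PD=${PD:-0.01}
DRY_RUN=${DRY_RUN:-1}   # 1 = print commands instead of executing them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: build_all <taxon map> <alignment> [<alignment> ...]
build_all() {
    taxonmap=$1
    shift
    for aln in "$@"; do
        name=$(basename "$aln")
        # Packages are named after the input file, so skip any alignment whose
        # name would overwrite a marker that already exists.
        if [ -e "$MARKER_DIR/$name" ]; then
            echo "skipping $aln: $MARKER_DIR/$name already exists" >&2
            continue
        fi
        run ./phylosift build_marker --alignment "$aln" --tree_pd "$PD" --taxonmap "$taxonmap"
    done
    # Index once, after all packages have been built.
    run ./phylosift index
}
```

Calling build_all with a taxon map followed by a list of alignment files prints one build_marker command per file plus a final index command; setting DRY_RUN=0 executes them instead.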

To get started with build_marker mode, first ensure you have taxtastic and its dependencies installed on your machine. You can confirm a successful installation by typing the following command (this should bring up the help dialogue for taxit):

./taxit -h
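If you plan to script the build, the same check can be done non-interactively. This is a small sketch; have_tool is a hypothetical helper name, and it only tests whether an executable called taxit is reachable, not whether all of its Python dependencies import correctly.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named tool is on PATH or is an
# executable file in the current directory.
have_tool() {
    command -v "$1" >/dev/null 2>&1 || [ -x "./$1" ]
}

if have_tool taxit; then
    echo "taxit found"
else
    echo "taxit not found: install taxtastic before running build_marker" >&2
fi
```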

Now we can get started building marker packages by running the command:

./phylosift build_marker --alignment <alignment file> --tree_pd <PD pruning threshold for tree> --taxonmap <optional Acc to NCBI taxonID map>

For example:

./phylosift build_marker --alignment test.aln --tree_pd 0.01 --taxonmap

Once marker packages are built, run index mode using the command:

./phylosift index

[Figure: schematic diagram of the Monkey workflow]

Once running, the Monkey will iterate through the following steps:

  1. Marker genes in the FASTA file are counted and passed to hmmbuild (to create profile HMMs for protein-coding genes) or ssu-build (to create covariance models, CMs, for rRNA genes), producing a model of the alignment.
  2. Alignments are cleaned and unique IDs are arbitrarily generated for each marker gene sequence.
  3. A FastTree phylogeny is constructed de novo from the renamed marker gene sequences with unique IDs. Following phylogeny construction, representative sequences are identified according to PD (the tree is collapsed using the sequence similarity criterion defined by the user in the initial command, e.g. 99%).
  4. Tree reconciliation is carried out to reconcile the FastTree marker gene phylogeny with the NCBI taxonomy hierarchy. In PhyloSift this reconciliation is always conducted in one direction: nodes in the marker gene tree are mapped to nodes in the NCBI tree.

    [NOTE: In the future we may carry out this reconciliation step in the opposite manner, i.e. mark up the FastTree phylogeny with taxonomy. However, at the moment we need a single framework for marking up taxonomy in order to calculate taxon abundances in PhyloSift. Because we rely on a diverse collection of marker genes, annotating the reference phylogenies with taxonomy would require a significant effort to reconcile topologies across gene trees, as well as the development of new computational algorithms to ensure such an approach is robust.]

  5. Marker gene alignments are cleaned and wrapped into reference marker packages and placed into the shared directory containing all other PhyloSift markers.
  6. New marker genes must now be indexed using PhyloSift’s index mode. Note that these locally indexed marker packages will not interfere with the automatic update process for PhyloSift core markers.
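To make step 2 more concrete, the sketch below shows one way to give every sequence in a FASTA file an arbitrary unique ID while keeping a lookup table back to the original headers. It illustrates the idea only and is not PhyloSift's actual renaming scheme: rename_fasta, the seq prefix and the lookup file layout are all assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper: rename each FASTA record to a sequential ID
# (seq1, seq2, ...) and write "new_id<TAB>original_header" to a lookup file.
# Usage: rename_fasta <input fasta> <output fasta> <lookup file>
rename_fasta() {
    in=$1
    out=$2
    map=$3
    awk -v map="$map" '
        /^>/ {
            n++
            # Record the original header (without ">") next to the new ID.
            printf "seq%d\t%s\n", n, substr($0, 2) > map
            print ">seq" n
            next
        }
        { print }   # sequence lines pass through unchanged
    ' "$in" > "$out"
}
```

Running rename_fasta on an alignment leaves the aligned sequence lines untouched, so the renamed file can still be used wherever the original alignment was, and the lookup file lets you translate the new IDs back to the original accessions.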