The Kangaroo: Generating simulated datasets

The Kangaroo (simulate mode) is a workflow that lets PhyloSift users generate simulated datasets from genome data they have on hand locally, or from published genome sequences obtained from repositories such as EBI or NCBI. The Kangaroo outputs mock datasets (simulating both 454 and PE-Illumina data) and a new “knockout” marker directory, in which the genomes mined for simulation data are excluded from the reference marker packages. To run the Kangaroo, you need to specify an output directory, the path to the directory housing your genome sequences, the number of genomes you want to simulate, and the number of simulated reads you want to generate.

To start, run the command:

./phylosift simulate --output=<output_path> --genome_dir=<genomes_path> --genome_count=<number_of_genomes_to_choose_from> --read_count=<reads_to_generate>

For example:

./phylosift simulate --output=/home/user/PhyloSift/PS_temp/sim1 --genome_dir=genomes --genome_count=10 --read_count=100000
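
When scripting several simulation runs, it can be convenient to assemble the command programmatically. The following is a minimal sketch of a hypothetical Python wrapper: the function name `build_simulate_command` and its parameters are assumptions made for illustration and are not part of PhyloSift. It uses only the four options shown above.

```python
import shlex
import subprocess


def build_simulate_command(output, genome_dir, genome_count, read_count,
                           phylosift="./phylosift"):
    """Return the argument list for a `phylosift simulate` run.

    Hypothetical helper: only the four options documented above are used.
    """
    if genome_count < 1 or read_count < 1:
        raise ValueError("genome_count and read_count must be positive")
    return [
        phylosift,
        "simulate",
        f"--output={output}",
        f"--genome_dir={genome_dir}",
        f"--genome_count={genome_count}",
        f"--read_count={read_count}",
    ]


cmd = build_simulate_command(
    output="/home/user/PhyloSift/PS_temp/sim1",
    genome_dir="genomes",
    genome_count=10,
    read_count=100000,
)
print(shlex.join(cmd))  # inspect the command line before running it

# Uncomment to launch the run (requires a local PhyloSift installation):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems if the output or genome paths contain spaces.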

[Figure: schematic diagram of the Kangaroo workflow]

Once running, the Kangaroo will iterate through the following steps:

  1. Calculate Phylogenetic Distance (PD) for all sequences in the PhyloSift guide tree constructed from concatenated gene alignments (e.g. the tree located in the PhyloSift marker directory).
  2. Two approaches are used to select taxa: a) a set of taxa that contribute to a defined PD threshold is selected (user input, default = 10 taxa), and b) the same number of taxa is sampled randomly, without replacement, from the list of input taxa.
  3. Several metrics are computed on the relationship between the target taxa and the remaining taxa used in the database. These include: the distance to the nearest neighbor in the database; the length of the branch connecting the target taxon to the nearest sampled lineage; whether two or more of the target taxa connect to the same nearest sampled lineage; the lengths of the branches above and below that connecting node; and the number of sampled nodes within various PD units of the connecting node (0.05, 0.1, 0.15, 0.2, etc.).
  4. The Kangaroo then plugs into the Database Update workflow to knock out the genomes that were used to simulate metagenome data. It also knocks out a swath of taxa related to the chosen taxa (10 by default, or N if specified), up to a specified PD away from those chosen taxa.
  5. The Grinder algorithm randomly generates reads from the selected genome files, producing simulated PE-Illumina and 454 datasets. Each genome sequence is essentially chopped up into a simulated metagenome representing a single species.
  6. A new marker directory is created in which the genomes used for simulation have been removed from the reference marker packages.
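
The random selection in step 2b and some of the metrics in step 3 can be illustrated with a toy example. This is a minimal sketch, not PhyloSift's implementation: the taxon names, the pairwise distance table and all function names are hypothetical, and distances are supplied directly rather than computed from a tree. As a simplification, the threshold counts are measured from the target taxon rather than from the connecting node.

```python
import random


def sample_taxa(taxa, n, seed=None):
    """Step 2b: draw n taxa at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(sorted(taxa), n)


def nearest_neighbor(target, database, dist):
    """Step 3: closest remaining database taxon and its distance."""
    best = min(database, key=lambda t: dist[frozenset((target, t))])
    return best, dist[frozenset((target, best))]


def count_within(target, database, dist, thresholds=(0.05, 0.1, 0.15, 0.2)):
    """Step 3 (simplified): database taxa within each threshold of the target."""
    return {
        th: sum(1 for t in database if dist[frozenset((target, t))] <= th)
        for th in thresholds
    }


# Hypothetical taxa and pairwise distances, invented for illustration only.
taxa = ["A", "B", "C", "D", "E"]
pairs = {
    ("A", "B"): 0.04, ("A", "C"): 0.12, ("A", "D"): 0.30, ("A", "E"): 0.18,
    ("B", "C"): 0.10, ("B", "D"): 0.28, ("B", "E"): 0.16,
    ("C", "D"): 0.22, ("C", "E"): 0.08, ("D", "E"): 0.25,
}
dist = {frozenset(k): v for k, v in pairs.items()}

# Step 2b: a reproducible random draw of two target taxa.
random_targets = sample_taxa(taxa, 2, seed=1)

# Step 3: metrics for a fixed pair of targets, with both removed from the database.
targets = ["A", "B"]
database = [t for t in taxa if t not in targets]
neighbors = {t: nearest_neighbor(t, database, dist) for t in targets}

# True when two or more targets share the same nearest database taxon.
shared = len({nb for nb, _ in neighbors.values()}) < len(targets)

print(neighbors)
print(shared)
print(count_within("A", database, dist))
```

In this toy data both targets attach to the same nearest database taxon, which is the situation the "same nearest sampled lineage" metric in step 3 is meant to flag.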