Dynamic quality trimming for raw FASTQ input data

Exciting news! PhyloSift now supports dynamic quality trimming of raw FASTQ data. When it detects FASTQ input files, the workflow applies Heng Li’s BWA quality-trimming algorithm (when running PhyloSift in all or search mode). Each read is trimmed down to the length x given by the following formula, where l is the original read length, q_i is the quality score of base i, and INT is the quality threshold:

argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l<INT
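
As a rough illustration of the formula above, here is a minimal Python sketch that evaluates it literally. It is not PhyloSift's or BWA's actual code: the function names `trim_length` and `trim_read`, the Phred+33 quality encoding, and the example threshold of 20 are all assumptions made for this example.

```python
def trim_length(quals, threshold):
    """Return how many leading bases to keep, per the argmax formula.

    quals: list of integer quality scores (q_1 .. q_l in the formula).
    threshold: the INT value in the formula (an assumed example parameter).
    """
    l = len(quals)
    # The formula only applies when the last base is below the threshold.
    if l == 0 or quals[-1] >= threshold:
        return l
    best_x, best_sum, running = l, 0, 0
    # Walk in from the 3' end. After processing index x (0-based),
    # running == sum over i = x+1 .. l (1-based) of (threshold - q_i).
    for x in range(l - 1, -1, -1):
        running += threshold - quals[x]
        if running > best_sum:
            best_sum, best_x = running, x
    return best_x


def trim_read(seq, qual_string, threshold):
    """Trim a read and its quality string (assumes Phred+33 encoding)."""
    quals = [ord(c) - 33 for c in qual_string]
    keep = trim_length(quals, threshold)
    return seq[:keep], qual_string[:keep]


# Five high-quality bases ('I' = Q40) followed by three poor ones ('#' = Q2).
print(trim_read("ACGTACGT", "IIIII###", 20))  # ('ACGTA', 'IIIII')
```

The sketch scans the whole read and keeps the cut point with the largest sum. Production trimmers may add refinements such as a minimum read length, which are not modeled here.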

To use this new feature, download the development version of PhyloSift, available at http://edhar.genomecenter.ucdavis.edu/~koadman/phylosift/devel/phylosift_latest.tar.bz2

Advanced user settings – PhyloSift Run Control file

Advanced users may want to alter the default parameters and search values used in the PhyloSift pipeline. We’ve added support for a PhyloSift Run Control input file (the phylosiftrc file, which comes packed in the home directory of the PhyloSift download). This file is intended for advanced users who want to change specific settings for the programs packaged within PhyloSift (HMMer, LAST, etc.), and for system administrators overseeing a single shared copy of PhyloSift and its reference databases. To use the RC file, specify the command line flag --config=<file>

Further details can be found here:

PhyloSift Run Control File – Instructions

Development & code tracking on Ohloh

Ohloh is another cool site we’ve been using to track the development of PhyloSift. You can visit our Ohloh project page at https://www.ohloh.net/p/PhyloSift  (also accessible via the quicklinks sidebar on the right side of this page). Here are some interesting tidbits you can find there (stats accurate as of 8/8/12):

A visual breakdown of the PhyloSift code and its expansion over the last year:

Some interesting statistics:

And Ohloh’s estimate of the overall project cost:

Daily PhyloSift Benchmarks

As we code away here at UC Davis, we’ve been keeping track of PhyloSift’s accuracy and performance using daily benchmarks on simulated data. These benchmarks are carried out using the latest versions of the PhyloSift code available on github (master and development branches).

Click the link below to access graphs showing how the PhyloSift pipeline has changed over the course of its development. Here we provide results for simulated Illumina and 454 data, as well as PhyloSift compute time for each dataset (running on a single core).

Daily PhyloSift benchmarks on Simulated Data

Note that these benchmarking plots are permanently accessible via the “Quick Links” sidebar located on the right-hand side of the PhyloSift homepage.

Support for Multisample Comparisons

Most researchers use barcoded adaptors to pool DNA/RNA from multiple samples and sequence them in a single run (e.g. multiplexing 20 samples within one lane of an Illumina flow cell). We are working hard to increase support for multiplexed data within PhyloSift. Currently, the main client workflow (PhyloSift all mode as executed via the wrapper script) assumes that an input file contains sequence data from a single sample.

We have now prepared a guide to help users conduct multisample comparisons in PhyloSift:

Processing Multiplexed Samples in PhyloSift
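
Because the main workflow expects one sample per input file, pooled reads need to be split by barcode before they are run through PhyloSift. The following is only a minimal sketch of that idea, not the procedure from the guide: it assumes inline barcodes at the start of each read and exact matching, and the `demultiplex` function, the barcode-to-sample map, and the `"unassigned"` label are hypothetical. Many Illumina runs carry barcodes in a separate index read instead, which this sketch does not cover.

```python
from collections import defaultdict


def demultiplex(records, barcodes):
    """Group FASTQ records by an inline 5' barcode.

    records: iterable of (header, sequence, quality) tuples.
    barcodes: dict mapping barcode string -> sample name (hypothetical).
    Returns a dict mapping sample name -> list of barcode-stripped records.
    """
    by_sample = defaultdict(list)
    for header, seq, qual in records:
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                n = len(barcode)
                # Strip the barcode from both the bases and the qualities.
                by_sample[sample].append((header, seq[n:], qual[n:]))
                break
        else:
            # No barcode matched exactly; keep the read untouched.
            by_sample["unassigned"].append((header, seq, qual))
    return dict(by_sample)


reads = [
    ("r1", "ACGTTTGA", "IIIIIIII"),
    ("r2", "TGCAGGCC", "IIIIIIII"),
]
samples = demultiplex(reads, {"ACGT": "sampleA", "TGCA": "sampleB"})
print(sorted(samples))  # ['sampleA', 'sampleB']
```

Each per-sample list could then be written to its own FASTQ file and passed to PhyloSift separately.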

Once raw data are demultiplexed and sequence data from individual samples are run through the PhyloSift workflow, the guppy software package enables rapid taxonomic comparisons across multiple samples. Diversity analyses in guppy include: Principal Components Analysis (akin to PCA using UniFrac), Squash Clustering (akin to UPGMA clustering using UniFrac) and Kantorovich-Rubinstein distance (akin to weighted UniFrac).
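
To give a feel for the Kantorovich-Rubinstein distance, here is a small Python sketch under simplifying assumptions. It is not guppy's implementation: it places each sample's mass on tree nodes rather than along edges, and the `kr_distance` function and the toy tree are hypothetical. The distance is computed as the sum over edges of branch length times the absolute difference in the mass that the two samples place below that edge.

```python
def kr_distance(tree, mass_p, mass_q):
    """Kantorovich-Rubinstein (earth mover's) distance on a rooted tree.

    tree: dict mapping node -> (parent, branch_length); the root has no entry.
    mass_p, mass_q: dicts mapping node -> fraction of that sample's mass
    (each assumed to sum to 1).
    """
    # Net mass difference (P minus Q) accumulated below each edge,
    # where an edge is identified by its child node.
    below = {node: 0.0 for node in tree}
    for node in set(mass_p) | set(mass_q):
        diff = mass_p.get(node, 0.0) - mass_q.get(node, 0.0)
        # Every edge on the path from this node to the root sits above it.
        while node in tree:
            below[node] += diff
            node = tree[node][0]
    return sum(length * abs(below[node])
               for node, (_parent, length) in tree.items())


# Toy tree: root -> X (1.0), X -> A (0.5), X -> B (0.5).
toy_tree = {"X": ("root", 1.0), "A": ("X", 0.5), "B": ("X", 0.5)}
# Half of the mass must move from A to B, a path of length 1.0.
print(kr_distance(toy_tree, {"A": 1.0}, {"A": 0.5, "B": 0.5}))  # 0.5
```

Intuitively, the value is the least total "mass times distance" needed to move one sample's distribution onto the other along the tree.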

New! Workflow Diagrams and Tutorials

We’ve been working hard to improve PhyloSift these past few months, and things are really starting to heat up! You’ll notice some heavy website updates to accompany our expanded and streamlined code. The following links are permanently housed under Tutorials -> Running PhyloSift.

For biologists wanting to run their own environmental metagenome data through PhyloSift, we’re compiling a quickstart guide and step-by-step tutorial using test data:

Running Phylosift – An Overview

Metagenome Analysis Tutorial using PE-Illumina Data

PhyloSift also includes discrete workflows for users to incorporate their own marker genes, mine genome data, and generate simulation datasets. Visual diagrams explaining these PhyloSift features can be found here:

Building PhyloSift marker packages (The Monkey)

Generating simulated datasets (The Kangaroo)

Expanding PhyloSift marker databases with new genome data (UpdateDB)

User Support Forum on Google Groups


As we’ve been ramping up the coding, we’re trying to expand our online presence to best meet the needs of our users. With that in mind, we’ve established a dedicated support forum on Google Groups to interact with users and build an online community:

We envision this as a place to ask questions, request software functionalities, get help if you’re stuck, and generally interact with our development team (Aaron Darling, Guillaume Jospin, and Holly Bik).
If you run into a major bug during your data analysis, please post this as an “Issue” in our GitHub repository: https://github.com/gjospin/PhyloSift/issues
Tasks in GitHub get first priority, and we’ll fix any broken pipes as soon as possible!