# Executing PhyloSift

All PhyloSift scripts are run on the command line, so you’ll need to be familiar with terminal windows and comfortable with executing basic Unix commands. The pipeline will not work on Windows operating systems (sorry!).

PhyloSift currently accepts input data in the following file formats:

• FASTA
• paired FASTA (specify paired data by using the --paired flag)
• interleaved paired FASTA (specify paired data by using the --paired flag)
• FASTQ (same as FASTA but with quality scores; this file type is the standard output from Illumina platforms)
• interleaved FASTQ (same as interleaved FASTA but with quality scores)
• .gz (any of the above file types compressed using gzip)
• .bz (any of the above file types compressed using bzip)

PhyloSift now supports dynamic quality trimming of raw FASTQ data. When it detects FASTQ input files in all or search mode, the workflow applies Heng Li’s BWA quality-trimming algorithm. Reads are trimmed according to the following formula, where l is the original read length, q_i is the quality score of base i, and INT is the quality threshold:

argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l<INT
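As a worked illustration, the formula can be evaluated directly on a list of quality scores. The sketch below is a literal reading of the formula, not the code used by BWA or PhyloSift (which may differ in details such as a minimum read length). The helper name `trim_len` and the threshold of 20 are assumptions made for this example; it returns x, the number of bases kept at the start of the read.

```shell
#!/bin/sh
# trim_len THRESHOLD Q1 Q2 ... Ql   (hypothetical helper, not part of PhyloSift)
# Prints x, the number of leading bases to keep, by evaluating
#   argmax_x { sum_{i=x+1}^{l} (INT - q_i) }  if q_l < INT
# over integer quality scores given as arguments.
trim_len() {
    threshold=$1; shift
    echo "$@" | awk -v T="$threshold" '{
        len = NF; keep = len                  # default: keep the whole read
        if ($len < T) {                       # trim only when q_l < INT
            sum = 0; best = 0
            for (x = len - 1; x >= 0; x--) {
                sum += T - $(x + 1)           # add the term for base i = x + 1
                if (sum > best) { best = sum; keep = x }
            }
        }
        print keep
    }'
}

# Example: the last three bases fall below the threshold of 20.
trim_len 20 30 30 30 30 10 5 2    # prints 4
```

In the example, the running sum peaks once the three low-quality bases at the end are removed, so the first four bases are kept.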

Users can use the --paired flag with either one or two input files. If you specify two files, PhyloSift assumes (according to format guidelines) that each file contains an entry for every molecule (i.e. that both ends of each sequence are included in the input data).

When inputting sequences in interleaved format, a user will specify only 1 file. However, an interleaved file CANNOT contain orphan sequences or unpaired reads, since this will lead to errors in the downstream PhyloSift results.
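Because orphan reads cause problems downstream, it can be worth screening an interleaved file before starting a run. The sketch below is a hypothetical helper, not part of PhyloSift. It assumes FASTA headers beginning with `>`, mates named `<id>/1` and `<id>/2`, and each /1 read immediately followed by its /2 mate; FASTQ input would need a different header test.

```shell
#!/bin/sh
# check_interleaved FILE   (hypothetical helper, not part of PhyloSift)
# Lists FASTA headers that lack an adjacent mate and returns non-zero if any exist.
# Assumes mates are adjacent and named <id>/1 followed by <id>/2.
check_interleaved() {
    awk '
        /^>/ {
            name = substr($1, 2)                      # header ID without ">"
            mate = substr(name, length(name))         # trailing "1" or "2"
            base = substr(name, 1, length(name) - 2)  # ID without the /1 or /2 suffix
            if (mate == "2" && pending != "" && base == pending) {
                pending = ""                          # pair completed
                next
            }
            if (pending != "") { print "orphan: " pending "/1"; bad = 1 }
            if (mate == "1") { pending = base }
            else { print "orphan: " name; bad = 1; pending = "" }
        }
        END {
            if (pending != "") { print "orphan: " pending "/1"; bad = 1 }
            exit bad + 0
        }
    ' "$1"
}

# Usage: check_interleaved reads_interleaved.fasta && echo "all reads are paired"
```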

The following is an example of how reads should be arranged and named in an interleaved FASTQ file (e.g. for read1, the suffixes /1 and /2 denote matching pairs):

>read1/1
>read1/2
>read2/2
>read3/1
>read4/1
>read4/2

Once you have determined your input data, you can start a PhyloSift run in one step. To run the full pipeline workflow (all mode), type the following command, replacing <sequence_file> with the path to your input data:

./phylosift all <sequence_file>

PhyloSift all mode consists of four distinct steps, which users may also want to run separately. For example, if you are only interested in pulling out 16S rRNA sequences from a shotgun metagenome dataset, you could use search mode to screen for candidate sequences of interest:

./phylosift search <sequence_file>

PhyloSift all mode sequentially executes the following steps, each of which can be run independently (scroll down further on this page for the command line flags that can be invoked within each mode):

• search – execute the database searches and write candidate files
• align – hmmalign + filtering of poorly aligned sequences
• place – run pplacer on the results from the align step
• summarize – run the taxonomy assignment steps

To process paired-end data (e.g. Illumina sequences in FASTQ format), specify the --paired flag as follows:

./phylosift all <options> --paired <sequence_file_1> <sequence_file_2>
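A small helper can assemble this command and catch a missing input file before a long run starts. This is a sketch under assumptions: `paired_cmd` is a hypothetical function, and the `--threads=N` value syntax follows the `--option=value` form shown for --marker_url on this page rather than anything confirmed for --threads. The function only prints the command; it does not run PhyloSift.

```shell
#!/bin/sh
# paired_cmd READ1 READ2 [THREADS]   (hypothetical helper, not part of PhyloSift)
# Prints an all-mode command for two paired files after checking that both exist.
paired_cmd() {
    for f in "$1" "$2"; do
        [ -f "$f" ] || { echo "missing input file: $f" >&2; return 1; }
    done
    # THREADS defaults to 1, matching the documented default for --threads.
    echo "./phylosift all --threads=${3:-1} --paired $1 $2"
}

# Usage: paired_cmd reads_1.fastq.gz reads_2.fastq.gz 4
```

Printing the command first makes it easy to review before pasting it into the terminal or a job script.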

phylosift is a wrapper script located in the main directory of the tarball archive. It directs the pipeline workflow based on your mode (in this case, the “all” mode carries out every step in PhyloSift, from database searching and read placement through to taxonomy assignment and output of taxonomy summaries).

When run for the first time, PhyloSift will automatically download the current reference set of marker genes (and the associated NCBI taxonomy for these sequences) from the UC Davis Genome Center server.

The PhyloSift wrapper script joins a number of distinct workflows, but these can be executed separately if desired. Users can also add a number of options for customized workflows:

./phylosift <mode> <options> <sequence_file>

Operations for <mode> can be specified as follows (the command line flags applicable to each mode are listed in the sections further down this page):

align: align homologous sequences identified by ‘search’ (only executes the alignment step for candidate sequences)
all: run all steps for phylogenetic analysis of genomic or metagenomic sequence data
benchmark: measure taxonomic prediction accuracy on a simulated dataset
build_marker: add a new marker to the reference database based on a sequence alignment
dbupdate: update the phylosift database with new genomic data
index: index a phylosift database after changes have been made
name: replace phylosift’s own sequence IDs with the original IDs found in the input file header
place: place aligned reads onto a reference phylogeny (only executes the phylogenetic placement of aligned candidate sequences)
search: search input sequences for homology to the reference gene database (only executes the database searches and writes candidate files)
simulate: simulate sequencing from a metagenomic sample **As of 3/31/14, the simulate command is non-functional, since it is not engineered to work well on systems other than the setup at UC Davis**
summarize: translate a collection of phylogenetic placements into a taxonomic summary
test_lineage: conduct a statistical test (a Bayes factor) for the presence of a particular lineage in a sample
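When wrapping PhyloSift in your own scripts, checking the mode name up front avoids a failed launch caused by a typo. The function below is a hypothetical helper, not part of PhyloSift, that accepts only the mode names listed above.

```shell
#!/bin/sh
# is_mode NAME   (hypothetical helper, not part of PhyloSift)
# Returns 0 if NAME is one of the modes listed on this page, 1 otherwise.
is_mode() {
    case "$1" in
        align|all|benchmark|build_marker|dbupdate|index|name|place|search|simulate|summarize|test_lineage)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Usage: is_mode "$mode" || echo "unknown PhyloSift mode: $mode" >&2
```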

### align mode – command line options

Example usage:

./phylosift align [-f] [long options…] <sequence file> [read 2 sequence file]

Available long options:

--help   returns the script usage screen on the command line
--debug   Print debugging messages (make PhyloSift very verbose when it is running)
--config   Provides a custom configuration file to Phylosift (phylosiftrc file)
--paired   Indicates data are read pairs. This can be provided either as a single file with read pairs interleaved, or two files, one for each read.
--custom   Reads a custom marker list from a file otherwise use all the markers from the markers directory
--isolate   Use this mode if you are running data from an isolate genome
--updated   Use the set of updated markers instead of stock markers
--marker_url="URL"   Phylosift will use markers available from the url provided
--unique   Permit only a single hit between a marker and query sequence, discard any ambiguous hits
--stdin   Read sequence input on standard input
--chunk_size   Run so many sequences per chunk
--threads   Runs parallel portions using the specified number of processes (DEFAULT : 1)
--output   Specifies an output directory other than PS_temp
-f or --force   Overwrites a previous Phylosift run with the same file name
--start_chunk   Start processing on a particular chunk
--chunks   Only run a set number of chunks
--continue   Enables the pipeline to continue to subsequent steps when not using the ‘all’ mode. (Note that --continue cannot be used if --chunks > 1; if you specify multiple chunks, an error message will appear and PhyloSift will not run.)
--besthit   When there are multiple hits to the same read, keeps only the best hit to that read
--extended   Uses the extended set of markers

### all mode – command line options

Example usage:

./phylosift all [-f] [long options…] <sequence file> [read 2 sequence file]

Available long options:

--help   returns the script usage screen on the command line
--debug   Print debugging messages (make PhyloSift very verbose when it is running)
--config   Provides a custom configuration file to Phylosift (phylosiftrc file)
--coverage   Provides a contig/scaffold coverage file for estimating relative abundance
--paired   Indicates data are read pairs. This can be provided either as a single file with read pairs interleaved, or two files, one for each read.
--custom   Reads a custom marker list from a file otherwise use all the markers from the markers directory
--isolate   Use this mode if you are running data from an isolate genome
--updated   Use the set of updated markers instead of stock markers
--marker_url="URL"   Phylosift will use markers available from the url provided
--bayes   Compute posterior probabilities during phylogenetic placement. Required for Bayesian hypothesis testing
--unique   Permit only a single hit between a marker and query sequence, discard any ambiguous hits
--stdin   Read sequence input on standard input
--chunk_size   Run so many sequences per chunk
--threads   Runs parallel portions using the specified number of processes (DEFAULT : 1)
--output   Specifies an output directory other than PS_temp
-f or --force   Overwrites a previous Phylosift run with the same file name
--start_chunk   Start processing on a particular chunk
--chunks   Only run a set number of chunks
--simple   Creates a simple taxonomic summary of the output; no Krona output
--continue   Enables the pipeline to continue to subsequent steps when not using the ‘all’ mode
--keep_search   Keeps the blastDir files (Default: Delete the blastDir files after every chunk)
--besthit   When there are multiple hits to the same read, keeps only the best hit to that read
--extended   Uses the extended set of markers
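To choose values for --chunk_size, --start_chunk and --chunks, it helps to know roughly how many chunks an input file will produce. The sketch below is a hypothetical helper, not part of PhyloSift. It assumes a FASTA input (one `>` header per sequence) and that chunks are consecutive blocks of --chunk_size sequences; how PhyloSift counts paired reads or numbers its chunks is not stated on this page.

```shell
#!/bin/sh
# count_chunks FASTA_FILE CHUNK_SIZE   (hypothetical helper, not part of PhyloSift)
# Prints the number of chunks, i.e. ceil(number of sequences / CHUNK_SIZE).
count_chunks() {
    seqs=$(grep -c '^>' "$1")            # one ">" header per FASTA sequence
    echo $(( (seqs + $2 - 1) / $2 ))     # integer ceiling division
}

# Usage: count_chunks contigs.fasta 200000
```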

### benchmark mode – command line options

Example usage:

./phylosift benchmark [long options…]

Available long options:

--help   returns the script usage screen on the command line
--debug   Print debugging messages (make PhyloSift very verbose when it is running)
--config   Provides a custom configuration file to Phylosift (phylosiftrc file)
--output   Path to write benchmark results
--summary-file   Taxonomic summary input file
--curve_path   Path to write precision-recall curve data

### build_marker mode – command line options

Example usage:

./phylosift build_marker [-f] [long options…]

Available long options:

--help   returns the script usage screen on the command line
--debug   Print debugging messages (make PhyloSift very verbose when it is running)
--config   Provides a custom configuration file to Phylosift (phylosiftrc file)
-f or --force   Overwrites a previous Phylosift run with the same file name
--alignment   A multiple sequence alignment of the gene family
--update-only   Generate an updated marker only, no new HMM
--unaligned   Unaligned and unmasked sequences corresponding to the gene family
--reps_pd   Specify the minimum divergence between representative sequences
--tree_pd   Specify the minimum phylogenetic diversity in the reference tree
--taxonmap   A file containing a mapping of sequence names to taxon IDs
--destination   Store the new marker package at the given location instead of the default marker repository

### dbupdate mode – command line options

Example usage:

./phylosift dbupdate [long options…]

Available long options:

--help   returns the script usage screen on the command line
--debug   Print debugging messages (make PhyloSift very verbose when it is running)
--config   Provides a custom configuration file to Phylosift (phylosiftrc file)
--repository   Path to repository for local copies of NCBI and EBI genome databases
--destination   Path to destination for updated markers
--knockouts   File containing a list of taxon IDs to exclude from the database
--local-storage   Path to local storage for results of scanning genomes for markers
--base-markers   Path to base markers to add to updated set
--skip-scan   Skip scanning any new genomes for homologs to the marker database
--keep-paralogs   Don’t discard paralogs when building trees and fasta files

### index mode – command line options

Example usage:

./phylosift index [long options…]

Available long options:

--help   returns the script usage screen on the command line
--debug   Print debugging messages (make PhyloSift very verbose when it is running)
--config   Provides a custom configuration file to Phylosift (phylosiftrc file)

### name mode – command line options

This mode renames candidate sequences when the PhyloSift workflow is run outside of all mode. When you manually run the search–>align–>place modes, your output files will contain unique numerical IDs assigned during the PhyloSift runs. Name mode replaces these with the original IDs listed in the input file, and should be run before summarize mode. Carrying out this step ensures that the final summary output files are in human-readable format.
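The manual sequence described above can be sketched as a short script. It rests on assumptions: that every mode takes the same input file (as the usage lines on this page suggest), and that no extra flags such as -f are needed between steps, which this page does not state. `manual_run` is a hypothetical helper that prints the commands by default and only executes them when RUN=1 is set.

```shell
#!/bin/sh
# Path to the wrapper script; override by setting PHYLOSIFT in the environment.
PHYLOSIFT=${PHYLOSIFT:-./phylosift}

# manual_run SEQUENCE_FILE   (hypothetical helper, not part of PhyloSift)
# Prints the step-by-step commands in order; with RUN=1 it also executes them
# and stops at the first step that fails.
manual_run() {
    for mode in search align place name summarize; do
        echo "$PHYLOSIFT $mode $1"
        if [ "${RUN:-0}" = "1" ]; then
            "$PHYLOSIFT" "$mode" "$1" || return 1
        fi
    done
}

# Usage: manual_run reads.fasta          (dry run, prints five commands)
#        RUN=1 manual_run reads.fasta    (executes each step)
```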

Example usage:

./phylosift name [-f] [long options…] <sequence file> [read 2 sequence file]

Available long options:

--help   returns the script usage screen on the command line
--debug   Print debugging messages (make PhyloSift very verbose when it is running)
--config   Provides a custom configuration file to Phylosift (phylosiftrc file)
--paired   Indicates data are read pairs. This can be provided either as a single file with read pairs interleaved, or two files, one for each read.
--custom   Reads a custom marker list from a file otherwise use all the markers from the markers directory
--isolate   Use this mode if you are running data from an isolate genome
--updated   Use the set of updated markers instead of stock markers
--marker_url="URL"   Phylosift will use markers available from the url provided
--chunk_size   Run so many sequences per chunk
--threads   Runs parallel portions using the specified number of processes (DEFAULT : 1)
--output   Specifies an output directory other than PS_temp
-f or --force   Overwrites a previous Phylosift run with the same file name
--start_chunk   Start processing on a particular chunk
--chunks   Only run a set number of chunks
--continue   Enables the pipeline to continue to subsequent steps when not using the ‘all’ mode
--besthit   When there are multiple hits to the same read, keeps only the best hit to that read
--extended   Uses the extended set of markers

### place mode – command line options

Example usage:

./phylosift place [-f] [long options…] <sequence file> [read 2 sequence file]

Available long options:

--help   returns the script usage screen on the command line
--debug   Print debugging messages (make PhyloSift very verbose when it is running)
--config   Provides a custom configuration file to Phylosift (phylosiftrc file)
--coverage   Provides a contig/scaffold coverage file for estimating relative abundance
--paired   Indicates data are read pairs. This can be provided either as a single file with read pairs interleaved, or two files, one for each read.
--custom   Reads a custom marker list from a file otherwise use all the markers from the markers directory
--updated   Use the set of updated markers instead of stock markers
--marker_url="URL"   Phylosift will use markers available from the url provided
--bayes   Compute posterior probabilities during phylogenetic placement. Required for Bayesian hypothesis testing
--chunk_size   Run so many sequences per chunk
--threads   Runs parallel portions using the specified number of processes (DEFAULT : 1)
--output   Specifies an output directory other than PS_temp
-f or --force   Overwrites a previous Phylosift run with the same file name
--start_chunk   Start processing on a particular chunk
--chunks   Only run a set number of chunks
--continue   Enables the pipeline to continue to subsequent steps when not using the ‘all’ mode. (Note that --continue cannot be used if --chunks > 1; if you specify multiple chunks, an error message will appear and PhyloSift will not run.)
--extended   Uses the extended set of markers

### search mode – command line options

Example usage:

./phylosift search [-f] [long options…] <sequence file> [read 2 sequence file]

Available long options:

--help   returns the script usage screen on the command line
--debug   Print debugging messages (make PhyloSift very verbose when it is running)
--config   Provides a custom configuration file to Phylosift (phylosiftrc file)
--paired   Indicates data are read pairs. This can be provided either as a single file with read pairs interleaved, or two files, one for each read.
--custom   Reads a custom marker list from a file otherwise use all the markers from the markers directory
--isolate   Use this mode if you are running data from an isolate genome
--updated   Use the set of updated markers instead of stock markers
--marker_url="URL"   Phylosift will use markers available from the url provided
--unique   Permit only a single hit between a marker and query sequence, discard any ambiguous hits
--stdin   Read sequence input on standard input
--chunk_size   Run so many sequences per chunk
--threads   Runs parallel portions using the specified number of processes (DEFAULT : 1)
--output   Specifies an output directory other than PS_temp
-f or --force   Overwrites a previous Phylosift run with the same file name
--start_chunk   Start processing on a particular chunk
--chunks   Only run a set number of chunks
--continue   Enables the pipeline to continue to subsequent steps when not using the ‘all’ mode. (Note that --continue cannot be used if --chunks > 1; if you specify multiple chunks, an error message will appear and PhyloSift will not run.)
--besthit   When there are multiple hits to the same read, keeps only the best hit to that read
--extended   Uses the extended set of markers

### simulate mode – command line options

**As of 3/31/14, the simulate command is non-functional, since it is not engineered to work well on systems other than the setup at UC Davis**

Example usage:

./phylosift simulate [long options…]

Available long options:

--help   returns the script usage screen on the command line
--debug   Print debugging messages (make PhyloSift very verbose when it is running)
--config   Provides a custom configuration file to Phylosift (phylosiftrc file)
--genome_count   The number of genomes to include in the simulated community
--genome_dir   Path to a genome repository created by phylosift dbupdate

### summarize mode – command line options

Example usage:

./phylosift summarize [-f] [long options…] <sequence file> [read 2 sequence file]

Available long options:

--help   returns the script usage screen on the command line
--debug   Print debugging messages (make PhyloSift very verbose when it is running)
--config   Provides a custom configuration file to Phylosift (phylosiftrc file)
--paired   Indicates data are read pairs. This can be provided either as a single file with read pairs interleaved, or two files, one for each read.
--custom   Reads a custom marker list from a file otherwise use all the markers from the markers directory
--updated   Use the set of updated markers instead of stock markers
--marker_url="URL"   Phylosift will use markers available from the url provided
--chunk_size   Run so many sequences per chunk
--threads   Runs parallel portions using the specified number of processes (DEFAULT : 1)
--output   Specifies an output directory other than PS_temp
-f or --force   Overwrites a previous Phylosift run with the same file name
--start_chunk   Start processing on a particular chunk
--chunks   Only run a set number of chunks
--simple   Creates a simple taxonomic summary of the output; no Krona output
--continue   Enables the pipeline to continue to subsequent steps when not using the ‘all’ mode. (Note that --continue cannot be used if --chunks > 1; if you specify multiple chunks, an error message will appear and PhyloSift will not run.)
--extended   Uses the extended set of markers

### test_lineage mode – command line options

Example usage:

./phylosift test_lineage [long options…]

Available long options:

--help   returns the script usage screen on the command line
--debug   Print debugging messages (make PhyloSift very verbose when it is running)