All results are generated within the main PhyloSift directory in the folder:
In output files, user input sequences that are found to match PhyloSift markers are referred to as candidate sequences. The output folder contains a number of subdirectories and summary files, listed below.
Typically, output files follow a standard naming convention:
Here, markername is the name of the PhyloSift reference marker gene, filetype describes the data contained in the file (aligned/unaligned, nucleotide/amino acid), chunk is the batch number of sequences being processed from the input data (e.g. chunk 1 refers to the first 1 million sequences pulled from the input file), and process is the job number (e.g. one of the multiple jobs distributed when the parallel option is enabled in PhyloSift). So, for example:
A file named DNGNGWU00008.codon.updated.1.4.fasta would contain the nucleotide (codon) candidate sequences matching the DNGNGWU00008 marker (derived from chunk #1 in process #4).
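The naming pattern can also be applied programmatically, for instance when sorting output files by marker or chunk. The Python sketch below assumes the layout markername.filetype.chunk.process.fasta, which is inferred from the example above. The function name parse_output_name and the regular expression are illustrative and are not part of PhyloSift.

```python
import re

# Assumed layout: <markername>.<filetype>.<chunk>.<process>.fasta
# (inferred from the documented example, not taken from PhyloSift's code).
NAME_RE = re.compile(
    r"^(?P<marker>[^.]+)\.(?P<filetype>.+)\.(?P<chunk>\d+)\.(?P<process>\d+)\.fasta$"
)


def parse_output_name(filename):
    """Split an output file name into its parts, or return None if it does not fit."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    return {
        "marker": match.group("marker"),
        "filetype": match.group("filetype"),
        "chunk": int(match.group("chunk")),
        "process": int(match.group("process")),
    }


info = parse_output_name("DNGNGWU00008.codon.updated.1.4.fasta")
print(info)
```

Names that do not carry both a chunk and a process number (for example, summary files) return None rather than a guess.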
alignDir : Folder containing all files related to the alignment and masking steps. For all masked alignment outputs, candidate sequences must align to the respective profile HMM model at a minimum of 20 positions or else the candidate sequences will be discarded.
For protein coding marker genes, the following files are produced:
*.newCandidate.aa – unaligned candidate sequences from the blastDir that successfully pass the HMMER alignment step. These sequences might be redundant from the blastDir but most likely they will not be fully identical since some sequences will get discarded after being compared to profile HMMs for each marker gene.
*.updated.1.fasta – amino acid alignment of candidate sequences. Candidate sequences from *.newCandidate.aa are aligned to the reference sequences from the PhyloSift marker packages. Sections of a sequence that don’t align to the profile HMM are masked out.
*.codon.updated.1.fasta – nucleotide version of the *.updated.1.fasta output (masked alignment). Nucleotide alignments may improve placement accuracy during the tree placement workflow in pplacer (e.g. by narrowing down placements for candidate sequences within well-sampled clades where many reference sequences are available).
*.codon.updated.sub1.1.fasta – Following phylogenetic placement on an amino-acid based tree, this file contains only candidate sequences that fall within a fixed distance (in amino acid substitutions per site). The codon alignments for this subset of candidate sequences are written into the *.codon.updated.sub1.1.fasta file, and subsequently placed onto the codon tree. The rationale for this approach is that sequences that are heavily diverged at the amino acid level will be so far diverged in their nucleotide sequence that they become unplaceable on the nucleotide trees. They would likely end up as long branch attraction artifacts and could give misleading results.
*.unmasked – raw alignment of candidate sequences from *.newCandidate.aa, without any alignment masking. This output is not currently utilized by PhyloSift, but is provided as an output for the benefit of users.
For 18S/16S rRNA sequences, PhyloSift carries out separate workflows for long (>600 bp) and short (<600 bp) rRNA sequences. Filenames are labelled according to each respective workflow. For both long/short rRNA workflows, two files per taxonomic group are produced:
*.fasta – alignment of candidate rRNA sequences produced by the HMMER software workflow (conducted via profile HMMs derived from taxon-specific covariance models for rRNA secondary structure – our rRNA reference marker packages are built using Infernal). Sections of a sequence that don’t align to the profile HMM are masked out.
*.unmasked – raw output of aligned candidate rRNA sequences, with no alignment masking applied.
NOTE: Files of unaligned rRNA sequences can be found in the blastDir (see outputs below). If running PhyloSift all mode, you must specify the --keep_search command line flag in order to retain the temporary blastDir files and get access to the unaligned rRNA file outputs.
blastDir : Folder containing all files related to the search step. If running PhyloSift all mode, temporary search results are not written, and you will only see one output file:
marker_summary.txt – summary listing the number of candidate sequence hits for each reference marker gene
If running PhyloSift search mode, or PhyloSift all mode using the --keep_search command line flag, all temporary files will be retained. Search files in the blastDir directory are named with the following convention:
<marker_name>.<search_type>.candidate.<aa or ffn>.<chunk_ID>.<process_ID>
For example, a file named DNGNGWU00003.lastal.candidate.aa.1.1 contains the mined candidate sequences (amino acids) matching the DNGNGWU00003 PhyloSift marker when run through the LAST software workflow. The chunk IDs (batch of input sequences) and process IDs (job number, e.g. when running PhyloSift in parallel) are internally assigned within PhyloSift to distinguish the different streams of the LAST software workflow and their associated output files.
For each protein coding marker gene, you’ll see two different file versions. These are produced so that PhyloSift can quickly look up a protein’s nucleotide sequence without having to re-search the original input sequence file:
*.lastal.candidate.aa – candidate sequences (unaligned amino acids) matching PhyloSift protein-coding markers, mined via the LAST software workflow
*.lastal.candidate.ffn – candidate sequences (unaligned nucleotides) matching PhyloSift protein-coding markers, mined via the LAST software workflow
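Because each protein-coding marker has both an amino acid and a nucleotide candidate file, it can be handy to pair them up before downstream processing. The sketch below groups file names by marker, chunk and process, assuming the dot-separated convention shown in the DNGNGWU00003 example. The helper pair_candidate_files is hypothetical and is not a PhyloSift function; rRNA files, which carry an extra name component, are deliberately skipped.

```python
from collections import defaultdict


def pair_candidate_files(filenames):
    """Group .aa and .ffn candidate files by (marker, chunk, process)."""
    pairs = defaultdict(dict)
    for name in filenames:
        parts = name.split(".")
        # Expected parts: marker, search_type, "candidate", aa|ffn, chunk, process
        if len(parts) != 6 or parts[2] != "candidate":
            continue
        if parts[3] not in ("aa", "ffn"):
            continue
        if not (parts[4].isdigit() and parts[5].isdigit()):
            continue
        key = (parts[0], int(parts[4]), int(parts[5]))
        pairs[key][parts[3]] = name
    return dict(pairs)


example = [
    "DNGNGWU00003.lastal.candidate.aa.1.1",
    "DNGNGWU00003.lastal.candidate.ffn.1.1",
    "marker_summary.txt",
]
for key, files in pair_candidate_files(example).items():
    print(key, files)
```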
For 18S/16S rRNA data, a number of files are produced:
*.lastal.rna.candidate.aa – candidate ribosomal sequences mined via the LAST software workflow (longer input sequences >600 nt) after being searched against PhyloSift rRNA markers
*.rna.short.candidate.aa – candidate ribosomal sequences mined via the LAST software workflow (shorter input sequences <600 nt) after being searched against the PhyloSift rRNA markers
treeDir : Folder containing all files related to the tree placements of the candidate sequences by pplacer.
*.jplace – Placement files for the new candidate sequences. Each marker gene will be represented by a separate placement file. (NOTE: To visualize candidate sequences within a phylogeny, *.jplace files must currently be converted to tree files via the guppy program, which comes packaged with the PhyloSift download – commands for doing this can be found on the multisample comparison page.)
For each marker, you will see two different files representing two types of tree placement:
*.codon.updated.sub1.1.jplace – placement file for nucleotide (codon) alignments
*.updated.1.jplace – placement file for amino acid (protein) alignments
Summary files (written to the main output directory)
NOTE: Numbers before the file extensions represent results from specific “chunks” of input sequence data run through PhyloSift (for computational efficiency in large datasets, we split sequences into more manageable subsets and process them separately). For example, sequence_taxa.1.txt and sequence_taxa.2.txt report placement results from chunks 1 and 2, respectively, while sequence_taxa.txt combines the information from all chunks into one cumulative file.
taxasummary.txt – Sum of probability mass over taxa present in the sample, in tab-delimited text format. This file (or the taxa_90pct_HPD.txt filtered output) should be used for abundance estimates, since we merge information from paired-end reads so that one taxonomic rank is listed per sequenced DNA molecule.
Column 1 — NCBI Taxon ID
Column 2 — Taxonomic rank (genus, species, phylum, etc)
Column 3 — Taxon Name
Column 4 — Read/sequence probability sum placed at this taxon. The values in this column can be normalized to sum to 1; the result is a rank-abundance distribution. For example, a value of 277 in this column could arise from 277 sequence placements each having a probability of 1 (the highest confidence value possible), or from 554 placements each having a probability of 0.5 at this specific location in the tree.
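The normalization of the probability-sum column can be sketched in a few lines of Python. This is a minimal illustration, not PhyloSift code: it assumes the four tab-delimited columns described above, and it normalizes within a single taxonomic rank chosen by the caller (our choice, so that taxa reported at different ranks are not mixed). The rows in the example are toy values invented for the demonstration.

```python
import csv


def rank_abundance(lines, rank):
    """Return [(taxon_name, fraction)] for one rank, sorted by decreasing fraction."""
    rows = []
    for fields in csv.reader(lines, delimiter="\t"):
        if len(fields) < 4 or fields[1] != rank:
            continue
        try:
            mass = float(fields[3])
        except ValueError:
            continue  # skip header lines or malformed rows
        rows.append((fields[2], mass))
    total = sum(mass for _, mass in rows)
    if total == 0:
        return []
    normalized = [(name, mass / total) for name, mass in rows]
    return sorted(normalized, key=lambda row: row[1], reverse=True)


# Toy rows (invented values): taxon ID, rank, name, probability sum
toy = [
    "561\tgenus\tEscherichia\t3",
    "562\tspecies\tEscherichia coli\t3",
    "1280\tspecies\tStaphylococcus aureus\t1",
]
for name, fraction in rank_abundance(toy, "species"):
    print(name, fraction)
```

To run this on a real file, pass an open file handle instead of the toy list, e.g. rank_abundance(open("taxasummary.txt"), "genus").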
taxa_90pct_HPD.txt – filtered version of taxasummary.txt, containing only candidate sequences where the placement probability is >0.9 (90% confidence). In this file we have removed taxa in the bottom 10% of the probability distribution, e.g. placements on the guide tree which are likely to represent isolated, erroneous taxon assignments.
sequence_taxa.txt – For each read, this file lists the nodes (and associated probability distributions) at which the read is placed during the pplacer run; note that each unique sequence read will typically have a number of different taxon assignments with varying probabilities. Currently we list all of this information for the benefit of the user; an alternative strategy would be to only accept the placement with the highest probability score (akin to taking the top BLAST hit from pairwise sequence comparisons). Forward and reverse sequences are listed separately for Illumina-PE data (denoted with the identifiers /1 and /2). The columns in this file are as follows:
Column 1 — Input sequence name
Column 2 — Coordinates on the input reads that hit the reference sequences. For example, 1.338 means that your input sequence aligns from position 1 to position 338. If the coordinates are 338.1 it would mean that the hit happens on the reverse strand.
Column 3 — NCBI Taxon ID
Column 4 — Taxonomic rank (genus, species, phylum, etc)
Column 5 — Taxon Name
Column 6 — Probability reported for this placement
Column 7 — Reference marker gene matching input sequence, used to infer placement on the guide tree
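The "highest probability only" strategy mentioned above, together with the strand convention for the coordinates, can be sketched as follows. This is an illustration under stated assumptions: the file is taken to be tab-delimited with seven fields in the order described (sequence name, coordinates, taxon ID, rank, taxon name, probability, marker), and the function names parse_coords and best_placements are ours. The example rows are invented.

```python
def parse_coords(coords):
    """Decode a 'start.end' string into (low, high, strand)."""
    start, end = (int(x) for x in coords.split("."))
    strand = "+" if start <= end else "-"
    return min(start, end), max(start, end), strand


def best_placements(lines):
    """Keep, for each read, only the placement with the highest probability."""
    best = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue
        read, coords, taxon_id, rank, name, prob, marker = fields[:7]
        try:
            prob = float(prob)
        except ValueError:
            continue  # skip header lines or malformed rows
        if read not in best or prob > best[read]["probability"]:
            low, high, strand = parse_coords(coords)
            best[read] = {
                "taxon_id": taxon_id,
                "rank": rank,
                "taxon_name": name,
                "probability": prob,
                "marker": marker,
                "span": (low, high),
                "strand": strand,
            }
    return best


# Toy rows (invented values) for two reads
toy = [
    "read1/1\t1.338\t562\tspecies\tEscherichia coli\t0.6\tDNGNGWU00003",
    "read1/1\t1.338\t561\tgenus\tEscherichia\t0.4\tDNGNGWU00003",
    "read1/2\t338.1\t561\tgenus\tEscherichia\t1\tDNGNGWU00003",
]
for read, hit in best_placements(toy).items():
    print(read, hit["taxon_name"], hit["probability"], hit["strand"])
```

Keeping only the top placement discards the uncertainty that the full listing conveys, so it is best treated as a convenience view rather than a replacement for the complete file.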
sequence_taxa_summary.txt – This file expands the information listed in sequence_taxa.txt by walking up the tree until the probability distribution reaches 1 for each placed sequence. So, in addition to the raw placement data for a unique sequence ID, this summary file gives an indication of the minimum higher taxon level which can be accepted with 100% confidence.
Note that the raw outputs from tree placement are listed in sequence_taxa.txt and sequence_taxa_summary.txt; these list the placement information for unique input sequence IDs (e.g. FASTQ headers). These files should NOT be used to infer abundance information, since both reads from paired-end data will be listed separately in these sequence_taxa outputs.
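The idea of walking up the tree until the accumulated probability reaches 1 can be illustrated with a toy taxonomy. The sketch below is not PhyloSift's implementation: the parent map, the function name lowest_confident_taxon, and the floating-point tolerance are all assumptions made for the example. Each placement's probability is added to its taxon and to every ancestor, and the deepest taxon holding the full mass is reported.

```python
def lowest_confident_taxon(placements, parent, threshold=1.0, tol=1e-9):
    """placements: {taxon: probability}; parent: {taxon: parent taxon or None}."""
    # Add each placement's probability to the taxon itself and all its ancestors.
    mass = {}
    for taxon, prob in placements.items():
        node = taxon
        while node is not None:
            mass[node] = mass.get(node, 0.0) + prob
            node = parent.get(node)

    def depth(node):
        # Number of ancestors above this node in the toy taxonomy.
        steps = 0
        while parent.get(node) is not None:
            node = parent[node]
            steps += 1
        return steps

    candidates = [t for t, m in mass.items() if m >= threshold - tol]
    if not candidates:
        return None
    # All candidates are shared ancestors, so the deepest one is the most specific.
    return max(candidates, key=depth)


# Toy taxonomy and placements (invented for illustration)
parent = {
    "Escherichia coli": "Escherichia",
    "Escherichia fergusonii": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Enterobacteriaceae": None,
}
placements = {"Escherichia coli": 0.6, "Escherichia fergusonii": 0.4}
print(lowest_confident_taxon(placements, parent))
```

In this toy case neither species holds all of the probability on its own, so the genus is the most specific level that can be accepted with full confidence.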
filename_allmarkers.html – Krona plot summarizing the community taxonomy reported from all markers in treeDir (e.g. markers with .jplace files)
filename.html – Krona plot summarizing the community taxonomy, using only the “elite marker” set (markers named with the DNGNGWU prefix in the treeDir)
Other visual representations of the data in the output directory are as follows:
filename.xml – fat tree visualization of reads placed to the “elite” DNGNGWU marker genes (file generated using the concat.jplace file in the treeDir, converted to an XML tree file using the guppy program). More information about guppy tree visualizations can be found on the PhyloSift intro tutorial page.
marker_summary.txt – Reports the marker name and the total number of candidate sequences matching each marker after the placement step (this file is different from the marker summaries contained in the blastDir and alignDir, which summarize the marker hits after the LAST search and hmmalign steps, respectively)
run_info.txt – lists the PhyloSift commands run, the md5 file checksums, and step completion status (start/end time and duration for each chunk at each step – search, align, place, summarize)