Output
All output is stored under a results directory within the main workdir.
Results are stored per sample according to the sample ids you provided in the
sample sheet.
For each sample, results for each tool are stored in directories named after
the tool. An example looks like this:
$ tree -L 2 results/A
results/A
├── all_predictions.tsv
├── lca.tsv
├── crispropendb
│ ├── predictions.tsv
│ └── raw.txt
├── htp
│ ├── predictions.tsv
│ └── raw.txt
├── phist
│ ├── common_kmers.csv
│ ├── predictions.tsv
│ └── raw.txt
├── rafah
│ ├── A_CDS.faa
│ ├── A_CDS.fna
│ ├── A_CDS.gff
│ ├── A_CDSxMMSeqs_Clusters
│ ├── A_Genomes.fasta
│ ├── A_Genome_to_Domain_Score_Min_Score_50-Max_evalue_1e-05.tsv
│ ├── A_Ranger_Model_3_Predictions.tsv
│ ├── A_Seq_Info.tsv
│ └── predictions.tsv
├── tmp
│ ├── filtered.fa.gz
│ ├── genomes
│ └── reflist.txt
├── vhmnet
│ ├── feature_values
│ ├── predictions
│ ├── predictions.tsv
│ └── tmp
├── vhulk
│ ├── predictions.tsv
│ └── results
└── wish
  ├── llikelihood.matrix
  ├── prediction.list
  └── predictions.tsv
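The layout above can be walked programmatically. The following is a minimal sketch, assuming only what the listing shows: one directory per sample, one sub-directory per tool, and a `predictions.tsv` inside each tool directory. The function name `find_predictions` is hypothetical and not part of the workflow.

```python
from pathlib import Path


def find_predictions(results_dir):
    """Map sample id -> {tool name -> path to that tool's predictions.tsv}."""
    found = {}
    # results/<sample_id>/<tool>/predictions.tsv, as in the listing above.
    # Directories without a predictions.tsv (such as tmp) are skipped.
    for path in sorted(Path(results_dir).glob("*/*/predictions.tsv")):
        tool_dir = path.parent
        sample_id = tool_dir.parent.name
        found.setdefault(sample_id, {})[tool_dir.name] = path
    return found


if __name__ == "__main__":
    # "results" is assumed to be the results directory inside the workdir.
    for sample_id, tools in find_predictions("results").items():
        print(sample_id, sorted(tools))
```

The per-sample `all_predictions.tsv` sits one level higher than the tool directories, so the glob pattern does not pick it up.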
Per sample
all_predictions.tsv
Contains the best prediction per contig (rows) for each tool, along with the
single score (confidence, p-value or similar) that the tool uses to evaluate
that prediction.
An example for three genomes:
contig_id	vhulk_pred	vhulk_score	rafah_pred	rafah_score	vhmnet_pred	vhmnet_score	wish_pred	wish_score	htp_proba	crispropendb_pred	crispropendb_score	phist_pred	phist_score
NC_005964.2	None	4.068828105926514	Mycoplasma	0.461	Mycoplasma fermentans	0.9953	Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma fermentans;Mycoplasma fermentans MF-I2	-1.20857	0.8464285626352002	None	0.0	Mycoplasmopsis fermentans M64	0.0
NC_015271.1	Escherichia_coli	1.0301523208618164	Salmonella	0.495	Muricauda pacifica	0.9968	Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Raoultella;Raoultella sp. NCTC 9187;Raoultella sp. NCTC 9187	-1.38692	0.995161392517451	None	0.0	Flammeovirga aprica JL-4	0.0
NC_023719.1	Bacillus	0.001257509808056	Bacillus	0.55	Clostridium sp. LS	1.0	Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium beijerinckii;Clostridium beijerinckii	-1.29454	0.9999957241187084	None	0.0	Lysinibacillus fusiformis	0.0
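To compare what the tools predicted for each contig, the table can be read with Python's standard `csv` module. This is a minimal sketch, assuming the file is tab-separated with the header row shown above and that a missing prediction is written as the literal string `None`, as in the example rows. The helper names are hypothetical.

```python
import csv
import io


def read_predictions(handle):
    """Return one dict per contig, keyed by the column names in the header."""
    return list(csv.DictReader(handle, delimiter="\t"))


def tool_predictions(row):
    """Collect the '<tool>_pred' columns of a row, skipping 'None' entries."""
    return {
        column[: -len("_pred")]: value
        for column, value in row.items()
        if column.endswith("_pred") and value != "None"
    }


# A few columns of the example above, inlined so the sketch runs on its own.
example = io.StringIO(
    "contig_id\tvhulk_pred\tvhulk_score\trafah_pred\trafah_score\n"
    "NC_005964.2\tNone\t4.068828105926514\tMycoplasma\t0.461\n"
    "NC_023719.1\tBacillus\t0.001257509808056\tBacillus\t0.55\n"
)
for row in read_predictions(example):
    print(row["contig_id"], tool_predictions(row))
```

For a real run, `open("results/A/all_predictions.tsv", newline="")` would take the place of the inlined `example`. Note that `htp_proba` has no matching `_pred` column, so this helper does not report it.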
lca.tsv
Last common ancestor (LCA) of the predictions, based on taxonomy. The lca column holds the NCBI taxonomy id of that ancestor.
An example for the genomes above:
contig_id	name	rank	lca
NC_005964.2	Mycoplasmataceae	family	2092
NC_015271.1	Bacteria	superkingdom	2
NC_023719.1	Firmicutes	phylum	1239
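A common follow-up is to keep only the contigs whose tools agree at a reasonably specific rank. The sketch below assumes a tab-separated file with the header shown above. The `RANKS` list is an assumption that covers only the major ranks, and rows with any other rank are dropped.

```python
import csv
import io

# Ranks from most to least specific. Assumption: only these major ranks matter.
RANKS = ["species", "genus", "family", "order", "class", "phylum", "superkingdom"]


def resolved_at(rows, rank):
    """Keep rows whose LCA rank is `rank` or more specific."""
    cutoff = RANKS.index(rank)
    return [
        row
        for row in rows
        if row["rank"] in RANKS and RANKS.index(row["rank"]) <= cutoff
    ]


# The example table above, inlined so the sketch runs on its own.
example = io.StringIO(
    "contig_id\tname\trank\tlca\n"
    "NC_005964.2\tMycoplasmataceae\tfamily\t2092\n"
    "NC_015271.1\tBacteria\tsuperkingdom\t2\n"
    "NC_023719.1\tFirmicutes\tphylum\t1239\n"
)
rows = list(csv.DictReader(example, delimiter="\t"))
for row in resolved_at(rows, "family"):
    print(row["contig_id"], row["name"], row["lca"])
```

With the three example genomes, only NC_005964.2 passes a `family` cutoff, because the other two agree only at phylum or superkingdom level.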
tmp (dir)
Directory genomes: Contains one fasta file per input genome.
File reflist.txt: An intermediate file that holds the paths to all produced genome fastas (used to ensure smooth execution).
File filtered.fa.gz: Fasta file containing the sequences longer than 5000 bp.
Per tool
crispropendb
File raw.txt: The raw output of crispropendb per contig.
File predictions.tsv: Three-column tsv with contig id, predicted host and assignment criterion.
htp
File raw.txt: The raw output of htp per contig.
File predictions.tsv: Two-column tsv with contig id and probability of the contig being a phage.
phist
File common_kmers.csv: Stores numbers of common k-mers between phages (in columns) and hosts (in rows).
File raw.txt: The raw output of phist per contig.
File predictions.tsv: Three-column tsv with contig id, predicted host and adjusted p-value.
rafah
Files prefixed with <sample_id>_ are rafah’s raw output.
File predictions.tsv: A selection of the 1st (Contig), 6th (Predicted_Host) and 7th (Predicted_Host_Score) columns from file <sample_id>_Seq_Info.tsv.
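The column selection described above can be reproduced by hand, for instance to check a `predictions.tsv` against the raw table. This is a minimal sketch and not necessarily how the workflow does it. `select_columns` is a hypothetical helper, and in the inlined example only the three named columns and the sample values come from this page; `col2` to `col5` are placeholders.

```python
import csv
import io


def select_columns(handle, positions, delimiter="\t"):
    """Yield the given 1-based column positions from each delimited row."""
    for row in csv.reader(handle, delimiter=delimiter):
        yield [row[position - 1] for position in positions]


# Stand-in for <sample_id>_Seq_Info.tsv; col2..col5 are placeholder columns.
example = io.StringIO(
    "Contig\tcol2\tcol3\tcol4\tcol5\tPredicted_Host\tPredicted_Host_Score\n"
    "NC_005964.2\t-\t-\t-\t-\tMycoplasma\t0.461\n"
)
for contig, host, score in select_columns(example, (1, 6, 7)):
    print(contig, host, score, sep="\t")
```

Selecting by position keeps the sketch independent of the exact header spelling, at the cost of breaking if the raw table ever changes its column order.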
vhulk
File results.csv: Copy of results/sample/tmp/genomes/results/results.csv.
File predictions.tsv: A selection of the 1st (BIN/genome), 10th (final_prediction) and 11th (entropy) columns from file results.csv.
vhmnet
Directories feature_values and predictions are the raw output.
Directory tmp is a temporary dir written by VirHostMatcher-Net while it runs.
File predictions.tsv contains contig, host taxonomy and scores.
wish
Files llikelihood.matrix and prediction.list are the raw output.
File predictions.tsv has contig, host taxonomy and log-likelihood scores.
Logs
Log files capturing stdout and stderr during execution of each rule can be
found in workdir/logs/<sample_id>/*.log files.
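When a run fails, scanning these logs for suspicious lines is often the quickest first check. The sketch below assumes only the `logs/<sample_id>/*.log` layout described above. Searching for the word "error" is a rough heuristic, not something the workflow guarantees, and `logs_mentioning` is a hypothetical helper.

```python
from pathlib import Path


def logs_mentioning(logs_dir, needle="error"):
    """Return (sample id, log file name, line) for lines containing `needle`."""
    hits = []
    for log in sorted(Path(logs_dir).glob("*/*.log")):
        # errors="replace" keeps the scan going if a tool wrote odd bytes.
        with log.open(errors="replace") as handle:
            for line in handle:
                if needle.lower() in line.lower():
                    hits.append((log.parent.name, log.name, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # "workdir/logs" stands for the logs directory of your own workdir.
    for sample_id, log_name, line in logs_mentioning("workdir/logs"):
        print(f"{sample_id}\t{log_name}\t{line}")
```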
Report
After successful execution of the workflow, a (basic) html report with summary statistics can be produced with:
(phap)$ snakemake --use-singularity \
--singularity-args "-B path/to/data_dir:/data" --report phap.html
This will produce a phap.html file, making use of the information in the
report directory.
The report directory contains the two main aggregated tables from
the per sample results directory rendered as html documents.
These are accessible under the Results category of the main phap.html.