Output

All output is stored under a results directory within the main workdir. Results are stored per sample according to the sample ids you provided in the sample sheet. For each sample, results for each tool are stored in directories named after the tool. An example looks like this:

$ tree -L 2 results/A
results/A
├── all_predictions.tsv
├── lca.tsv
├── crispropendb
│   ├── predictions.tsv
│   └── raw.txt
├── htp
│   ├── predictions.tsv
│   └── raw.txt
├── phist
│   ├── common_kmers.csv
│   ├── predictions.tsv
│   └── raw.txt
├── rafah
│   ├── A_CDS.faa
│   ├── A_CDS.fna
│   ├── A_CDS.gff
│   ├── A_CDSxMMSeqs_Clusters
│   ├── A_Genomes.fasta
│   ├── A_Genome_to_Domain_Score_Min_Score_50-Max_evalue_1e-05.tsv
│   ├── A_Ranger_Model_3_Predictions.tsv
│   ├── A_Seq_Info.tsv
│   └── predictions.tsv
├── tmp
│   ├── filtered.fa.gz
│   ├── genomes
│   └── reflist.txt
├── vhmnet
│   ├── feature_values
│   ├── predictions
│   ├── predictions.tsv
│   └── tmp
├── vhulk
│   ├── predictions.tsv
│   └── results
└── wish
    ├── llikelihood.matrix
    ├── prediction.list
    └── predictions.tsv
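
Since every tool directory in the listing above carries a predictions.tsv, the per-tool files for one sample can be collected programmatically. The following is a minimal Python sketch, not part of the workflow itself; the function name find_tool_predictions is hypothetical, and the path results/A simply reuses the sample id from the listing.

```python
from pathlib import Path


def find_tool_predictions(sample_dir):
    """Map tool name -> path of its predictions.tsv inside one sample directory."""
    sample_dir = Path(sample_dir)
    found = {}
    for tool_dir in sorted(p for p in sample_dir.iterdir() if p.is_dir()):
        predictions = tool_dir / "predictions.tsv"
        # Directories without a predictions.tsv (e.g. tmp) are skipped.
        if predictions.is_file():
            found[tool_dir.name] = predictions
    return found


if __name__ == "__main__":
    sample = Path("results/A")  # sample id "A", as in the listing above
    if sample.is_dir():
        for tool, path in find_tool_predictions(sample).items():
            print(tool, path, sep="\t")
```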

Per sample


all_predictions.tsv

Contains the best prediction per contig (rows) for each tool, along with the single value that tool uses to score its prediction (for example a confidence score or a p-value).

An example for three genomes:

contig_id	vhulk_pred	vhulk_score	rafah_pred	rafah_score	vhmnet_pred	vhmnet_score	wish_pred	wish_score	htp_proba	crispropendb_pred	crispropendb_score	phist_pred	phist_score
NC_005964.2	None	4.068828105926514	Mycoplasma	0.461	Mycoplasma fermentans	0.9953	Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma fermentans;Mycoplasma fermentans MF-I2	-1.20857	0.8464285626352002	None	0.0	Mycoplasmopsis fermentans M64	0.0
NC_015271.1	Escherichia_coli	1.0301523208618164	Salmonella	0.495	Muricauda pacifica	0.9968	Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Raoultella;Raoultella sp. NCTC 9187;Raoultella sp. NCTC 9187	-1.38692	0.995161392517451	None	0.0	Flammeovirga aprica JL-4	0.0
NC_023719.1	Bacillus	0.001257509808056	Bacillus	0.55	Clostridium sp. LS	1.0	Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae;Clostridium;Clostridium beijerinckii;Clostridium beijerinckii	-1.29454	0.9999957241187084	None	0.0	Lysinibacillus fusiformis	0.0
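
A table like this one can be read with Python's standard csv module. The sketch below assumes the column naming seen in the example header (<tool>_pred paired with <tool>_score); htp only contributes htp_proba and is therefore left out of the pairing. The function name load_all_predictions is hypothetical, and the inline data is a trimmed copy of one example row.

```python
import csv
import io


def load_all_predictions(handle):
    """Return {contig_id: {tool: (prediction, score)}} from a tab-separated handle."""
    table = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        per_tool = {}
        for column, value in row.items():
            # Pair every <tool>_pred column with its <tool>_score column.
            if column.endswith("_pred"):
                tool = column[: -len("_pred")]
                per_tool[tool] = (value, float(row[f"{tool}_score"]))
        table[row["contig_id"]] = per_tool
    return table


# Trimmed copy of one row from the example table (fewer columns).
EXAMPLE = (
    "contig_id\tvhulk_pred\tvhulk_score\trafah_pred\trafah_score\thtp_proba\n"
    "NC_023719.1\tBacillus\t0.001257509808056\tBacillus\t0.55\t0.9999957241187084\n"
)

table = load_all_predictions(io.StringIO(EXAMPLE))
print(table["NC_023719.1"]["rafah"])  # ('Bacillus', 0.55)
```

For a real run, pass an open file handle instead, e.g. open("results/A/all_predictions.tsv", newline="").
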

lca.tsv

The last common ancestor (LCA) of the tools' predictions for each contig, based on taxonomy. The lca column holds the taxonomy id of that ancestor.

An example for the genomes above:

contig_id	name	rank	lca
NC_005964.2	Mycoplasmataceae	family	2092
NC_015271.1	Bacteria	superkingdom	2
NC_023719.1	Firmicutes	phylum	1239
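
The rank column shows how far the tools agree: the broader the rank, the less consistent the predictions. As a rough illustration, the Python sketch below keeps only contigs whose LCA is at family level or more specific. The rank ordering in RANKS is an assumption (other rank labels may occur and are treated as unresolved here), and contigs_resolved_to is a hypothetical helper name.

```python
import csv
import io

# Major taxonomic ranks from broadest to most specific (assumed ordering;
# any rank label not listed here is treated as unresolved).
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def contigs_resolved_to(handle, min_rank="family"):
    """Return (contig_id, name, rank) for rows whose rank is min_rank or finer."""
    threshold = RANKS.index(min_rank)
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["rank"] in RANKS and RANKS.index(row["rank"]) >= threshold:
            kept.append((row["contig_id"], row["name"], row["rank"]))
    return kept


# The example table from above.
EXAMPLE = (
    "contig_id\tname\trank\tlca\n"
    "NC_005964.2\tMycoplasmataceae\tfamily\t2092\n"
    "NC_015271.1\tBacteria\tsuperkingdom\t2\n"
    "NC_023719.1\tFirmicutes\tphylum\t1239\n"
)

print(contigs_resolved_to(io.StringIO(EXAMPLE)))
# [('NC_005964.2', 'Mycoplasmataceae', 'family')]
```
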

tmp (dir)
  • Directory genomes: Contains one fasta file per input genome

  • File reflist.txt: An intermediate file listing the paths to all produced genome fastas, used to ensure smooth execution.

  • File filtered.fa.gz: A gzipped fasta file containing the sequences longer than 5000 bp.
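
To make the length filter concrete, here is a minimal Python sketch that writes a gzipped fasta holding only the records longer than a threshold. It is an illustration under stated assumptions, not the workflow's own implementation: the input is assumed to be an uncompressed fasta, and the names read_fasta and filter_by_length are hypothetical.

```python
import gzip


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open text-mode fasta handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(in_path, out_path, min_length=5000):
    """Write records longer than min_length bp to a gzipped fasta; return how many were kept."""
    kept = 0
    with open(in_path) as src, gzip.open(out_path, "wt") as dst:
        for header, seq in read_fasta(src):
            if len(seq) > min_length:  # strictly longer, matching "> 5000 bp"
                dst.write(f">{header}\n{seq}\n")
                kept += 1
    return kept
```

A call such as filter_by_length("contigs.fa", "filtered.fa.gz") would then return the number of sequences written; contigs.fa is a placeholder file name.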

Per tool

crispropendb
  • File raw.txt: The raw output of crispropendb per contig

  • File predictions.tsv: Three-column tsv with contig id, predicted host and assignment criterion.

htp
  • File raw.txt: The raw output of htp per contig

  • File predictions.tsv: Two-column tsv with contig id and the probability of the virus being a phage.

phist
  • File common_kmers.csv: Stores the number of common k-mers between phages (in columns) and hosts (in rows).

  • File raw.txt: The raw output of phist per contig

  • File predictions.tsv: Three-column separated tsv with contig id, predicted host and adjusted p-value.

rafah
  • Files prefixed with <sample_id>_ are rafah's raw output

  • File predictions.tsv: A selection of the 1st (Contig), 6th (Predicted_Host) and 7th (Predicted_Host_Score) columns from file <sample_id>_Seq_Info.tsv
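
The column selection described above can be sketched in a few lines of Python. This is an illustration only: select_columns is a hypothetical helper, it copies every row including any header line, and the placeholder columns c2 to c5 in the example stand in for the real columns of <sample_id>_Seq_Info.tsv, which are not listed here.

```python
import csv
import io


def select_columns(src, dst, columns=(1, 6, 7), delimiter="\t"):
    """Copy the given 1-based columns of each row from src to dst as tsv."""
    reader = csv.reader(src, delimiter=delimiter)
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow([row[i - 1] for i in columns])


# Hypothetical input: only the 1st, 6th and 7th column names come from the text.
src = io.StringIO(
    "Contig\tc2\tc3\tc4\tc5\tPredicted_Host\tPredicted_Host_Score\n"
    "contig_1\ta\tb\tc\td\tBacillus\t0.55\n"
)
dst = io.StringIO()
select_columns(src, dst)
print(dst.getvalue(), end="")
```

The same helper could be pointed at a comma-separated file with other column numbers by passing columns=... and delimiter=",".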

vhulk
  • File results.csv: A copy of results/<sample_id>/tmp/genomes/results/results.csv

  • File predictions.tsv: A selection of the 1st (BIN/genome), 10th (final_prediction) and 11th (entropy) columns from file results.csv.

vhmnet
  • Directories feature_values and predictions are the raw output

  • Directory tmp is a temporary directory that VirHostMatcher-Net writes during its run.

  • File predictions.tsv contains contig, host taxonomy and scores.

wish
  • Files llikelihood.matrix and prediction.list are the raw output

  • File predictions.tsv has contig, host taxonomy and log-likelihood scores.

Logs

Log files capturing stdout and stderr during the execution of each rule are written to workdir/logs/<sample_id>/*.log.
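
When a rule fails, the end of its log is usually the first place to look. The Python sketch below collects the last few lines of every log in one sample's log directory; last_lines is a hypothetical helper, and the directory passed to it should follow the workdir/logs/<sample_id> layout described above.

```python
from pathlib import Path


def last_lines(log_dir, n=5):
    """Return {log file name: its last n lines} for every *.log file in log_dir."""
    tails = {}
    for log in sorted(Path(log_dir).glob("*.log")):
        lines = log.read_text(errors="replace").splitlines()
        tails[log.name] = lines[-n:]
    return tails


if __name__ == "__main__":
    log_dir = Path("workdir/logs/A")  # placeholder workdir, sample id "A"
    if log_dir.is_dir():
        for name, tail in last_lines(log_dir).items():
            print(f"== {name}")
            print("\n".join(tail))
```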

Report

After successful execution of the workflow, a (basic) html report with summary statistics can be produced with:

(phap)$ snakemake --use-singularity \
--singularity-args "-B path/to/data_dir:/data" --report phap.html

This will produce a phap.html file, making use of the information in the report directory.

The report directory contains the two main aggregated tables from the per sample results directory rendered as html documents. These are accessible under the Results category of the main phap.html.