Commands details¶
nanodisco <subtask> [options], <subtask> include:
- preprocess: Extract reads (.fasta) from base called fast5 files and map reads on reference (meta)genome
- chunk_info: Display chunks information regarding supplied reference (meta)genome
- difference: Compute nanopore signal difference between a native and a whole genome amplified (WGA) dataset
- merge: Combine nanopore signal difference for all processed chunks in directory
- motif: De novo discovery of methylation motifs
- refine: Generate refine plot for candidate methylation motifs
- characterize: Predict the methylation type and fine-map the modification within de novo discovered methylation motifs file
- coverage: Compute average coverage for each contig in a reference genome (uses bedtools genomecov)
- profile: Compute the methylation profile matrix for a metagenome sample (methylation feature at common or expected methylation motifs)
- select_feature: Select informative feature from a methylation profile matrix
- filter_profile: Compute the methylation profile matrix for selected features for a metagenome sample
- binning: Perform methylation binning, cluster metagenomic contigs according to methylation feature similarities using t-SNE
- plot_binning: Plot results of methylation binning
- score: Attribute methylation scores to each motif occurrence
- version: Print version
- help: Print help
preprocess¶
Extract reads (.fasta) from base called fast5 files and map reads on a reference (meta)genome.
Usage:
nanodisco preprocess -p <nb_threads> -f <path_fast5> -s <sample_name> -o <path_output> -r <path_reference_genome>
-p : Number of threads to use.
-f : Path to the directory containing .fast5 files (nanopore sequencing dataset). Note that fast5 files are searched recursively within the directory.
-s : Name to the sample processed (e.g. Ecoli_native).
-o : Path to output directory (e.g. ./analysis/Ecoli).
-r : Path to a reference genome (i.e. fasta). Necessary index will be generated at runtime.
--basecall_version : (Optional) Specify basecalling version if multiple ones available (<basecaller:version>, e.g. Guppy:3.2.4).
-h : Print help.
Output:
- Fasta file (
<path_output>/<sample_name>.fasta) - Bam file (
<path_output>/<sample_name>.sorted.bam) - Bam index (
<path_output>/<sample_name>.sorted.bam.bai) - (Optional) Create reference index
chunk_info¶
Display chunks information regarding the supplied reference (meta)genome.
Usage:
nanodisco chunk_info -r <path_reference_genome> [-t <target_region> -s <chunk_size>]
-r : Path to a reference genome (i.e. fasta).
-t : Specify a genomic region whose chunks need to be processed (e.g. chr1:2500-85000).
-s : Size of chunk in bp to use when dividing the reference genome (default is 5000).
-h : Print help.
Output:
- Default is the number of chunks in the reference genome
- With
-t, index of the first and last chunk to process for the targeted region
difference¶
Compute the current difference between a native and a WGA dataset.
Usage:
nanodisco difference -nj <nb_jobs> -nc <nb_chunks> -p <nb_threads> -i <path_input> -o <path_output> -w <name_WGA> -n <name_native> -r <path_genome> [-f <first_chunk> -l <last_chunk> + advanced parameters]
-nj : Number of jobs to run in parallel (affect CPU and memory usage).
-nc : Number of chunks to process in a row (affect memory usage).
-p : Number of threads to use.
-i : Path to input data folder containing .fasta, .bam, and .bam.bai generated with nanodisco preprocess.
-o : Path to output directory for current difference file and logs.
-w : Specify the name of WGA sample (same than for nanodisco preprocess -s <sample_name>).
-n : Specify the name of native sample (same than for nanodisco preprocess -s <sample_name>).
-r : Path to a reference genome (i.e. fasta).
-h : Print help.
Advanced parameters. We recommend leaving them set to default values:
-f : First chunk to process. -l needs to be set. All chunks between -f and -l will be processed. All genome processed if not provided.
-l : Last chunk to process. -f needs to be set. All chunks between -f and -l will be processed. All genome processed if not provided.
-x : Execution type between seq or batch. Default is batch and seq is for development only.
-a : IQR factor for outliers removal (0 to skip; smaller is harsher). Default is 1.5.
-z : Type of additional signal normalization (0 is none, 1 is lm, and 2 is rlm). Default is 2.
-b : Correct for strand bias (ori is no and revc is yes). Default is revc.
-e : Minimum number of events per position. Default is 5.
-j : Type of filtering for mapping. Default is noAddSupp.
-k : Minimum mapped read length. Default is 0 (no filtering).
--basecall_version : (Optional) Specify basecalling version if multiple ones available, for tracking only (<basecaller:version>, e.g. Guppy:3.2.4).
Output:
- Current difference files (
<path_output>/chunk.*.difference.rds), one per chunk:
columns:
contig name of contig
position genomic position
dir genomic strand, fwd or rev
strand read strand, used when 2D nanopore reads
N_wga number of current values at this position and strand in WGA dataset
N_nat number of current values at this position and strand in native dataset
mean_diff current difference in pA
t_test_pval p-values from t-test
u_test_pval p-values from Mann-Whitney u-test
merge¶
Combine nanopore signal difference for all processed chunks in directory.
Usage:
nanodisco merge -d <path_difference> -o <path_output> -b <base_name>
-d : Path to current differences directory (*.rds produced from nanodisco difference).
-o : Path to output directory. Default is current directory.
-b : Base name for outputting results (e.g. Ecoli_K12). Default is 'results'.
-h : Print help.
Output:
- Current difference file (
<path_output>/<base_name>_difference.RDS; same format asnanodisco differenceoutput)
motif¶
De novo discovery of methylation motifs from current differences file.
Usage:
nanodisco motif -p <nb_threads> -b <base_name> -d <path_difference> -o <path_output> -r <path_genome> [+ advanced parameters]
-p : Number of threads to use.
-b : Base name for outputting results (e.g. Ecoli_K12). Default is 'results'.
-d : Path to current differences file (*.RDS produced from nanodisco difference).
-o : Path to output directory. Default is current directory.
-r : Path to a reference genome (i.e. fasta).
-h : Print help.
Advanced parameters. We recommend leaving them set to default values:
-c : (Optional) Comma separated list of contigs (e.g. contig_1,contig_3).
-m : (Optional) Comma separated list of motifs (<motif_1,motif_2>, e.g. GATC,CCWGG).
--contigs_file : (Optional) Path to file with list of contigs (one per line).
-a : Disable manual motif discovery procedure (not recommended). Default is FALSE.
-t : Smoothed peaks p-values threshold for sequence selection (if double: peaks > <threshold> or if NA: top <nb_peaks> only). Default is NA.
--nb_peaks : Number of sequence with p-value peaks to keep for each round. Default is 2000.
--stat_type : Select which type of p-value sources used. Default is u_test_pval.
--smooth_func : Function to use for p-values smoothing. Default is sumlog.
--smooth_win_size : Window size used for smoothing p-values. Default is 5.
--peak_win_size : Window size used for p-values peaks detection. Default is 2.
Output:
- Comma separated list of de novo discovered methylation motifs.
- Intermediate meme files (
<path_output>/motif_detection/) - Refinement plots for each motif without
-aoption.
refine¶
Generate refine plot for candidate methylation motifs.
Usage:
nanodisco refine -p <nb_threads> -b <base_name> -d <path_difference> -o <path_output> -m <motif_1,motif_2> -M <motif_3,motif_4|all> -r <path_genome>
-p : Number of threads to use.
-b : Base name for outputting results (e.g. Ecoli_K12). Default is 'results'.
-d : Path to current differences file (*.RDS produced from nanodisco difference).
-o : Path to output directory. Default is current directory.
-m : Comma separated list of discovered motifs (e.g. GATC,CCWGG).
-M : Comma separated list of candidate motifs or 'all' to analyze '-m' motifs individually (e.g. GATC,CCWGG).
-r : Path to a reference genome (i.e. fasta).
-h : Print help.
Output:
- Refinement plots for the candidate motif(s) with
-M <motif_3,motif_4>option, or each motif with-M alloption.
characterize¶
Predict the methylation type and fine map the modification within de novo discovered methylation motifs file.
Usage:
nanodisco characterize -p <nb_threads> -b <base_name> -d <path_difference> -o <path_output> -m <motif1,motif2,...> -t <models> -r <path_genome>
-p : Number of threads to use.
-b : Base name for outputting results (e.g. Ecoli_K12). Default is 'results'.
-d : Path to current differences file (*.RDS produced from nanodisco difference).
-o : Path to output directory. Default is current directory.
-m : Comma separated list of motifs following IUPAC nucleotide code (e.g. GATC,CCWGG).
-t : Comma separated list of model to apply (nn: neural network, rf: random forest, or knn: k-nearest neighbor; e.g. nn,rf)
-r : Path to a reference genome (i.e. fasta).
-c : (Optional) Comma separated list of contigs (e.g. contig_1,contig_3).
--contigs_file : (Optional) Path to file with list of contigs (one per line).
-h : Print help.
Output:
- Identified methylation type and methylated position summarized in a heatmap (
Motifs_classification_<base_name>_<model_name>_model.pdf) as presented in the preprint Figure 4d. - Best predictions compiled in a text file (
Motifs_classification_<base_name>_<model_name>_model.tsv) - Figure representing the data used to define the motif signature center as presented in the preprint Supplementary Figure 5a.
coverage¶
Compute average coverage for each contig in a reference genome (uses bedtools genomecov).
Usage:
nanodisco coverage -b <path_mapping> -r <path_metagenome> -o <path_output>
-b : Path of mapping data (.sorted.bam)
-r : Path to a reference metagenome (i.e. fasta).
-o : Path to output directory (.sorted.bam suffix replaced by .cov).
Output:
- Genomic coverage for each contig (
<path_output>/<bam_file_name>.cov)
profile¶
Compute the methylation profile matrix for a metagenome sample (methylation feature at common or expected methylation motifs).
Usage:
nanodisco profile -p <nb_threads> -r <path_fasta> -d <path_difference> -w <path_WGA_cov> -n <path_NAT_cov> -b <base_name> -o <path_output> (-a || -m <motif1,motif2,...> || --motifs_file <path_motif>) [+ advanced parameters]
-p : Number of threads to use.
-r : Path to reference metagenome (.fasta).
-d : Path to current differences file (*.RDS produced from nanodisco difference).
-w : Path to WGA sample coverage (*.cov).
-n : Path to native sample coverage (*.cov).
-b : Base name for outputting results (e.g. Ecoli_K12). Default is 'results'.
-o : Path to output directory. Default is current directory.
-a : Compute methylation profile from predefined common motifs followed by filtering (automated binning; all|4mer|5mer|6mer|noBi). -a & -m & --motifs_file are exclusive.
-m : Comma separated list of motifs following IUPAC nucleotide code (e.g. GATC,CCWGG). -a & -m & --motifs_file are exclusive.
--motifs_file : Path to file with list of motifs (one per line) following IUPAC nucleotide code. -a & -m & --motifs_file are exclusive.
Advanced parameters. We recommend leaving them set to default values:
-c : Minimum coverage/number of current values needed at given position for methylation feature computation. Default is 10.
--min_contig_len : Minimum length to consider a contig for feature selection. Default is 100000.
Output:
- Methylation profile matrix (
<path_output>/methylation_profile_<base_name>.RDS):
columns:
contig name of contig
motif motif sequence (e.g. CCWGG)
distance_motif relative distance to first base of motif occurrence (0-based)
signal_ratio for development only. Expected strength of signal if motif known.
dist_score methylation feature value at relative distance (absolute average current difference across all motif occurrences)
nb_occurrence number of motif occurrence in the contig
attribute:
contig_coverage (data.frame):
chr name of contig
contig_length contig length
avg_cov.dataset_A average contig coverage in -w dataset (WGA)
avg_cov.dataset_B average contig coverage in -n dataset (native)
diff coverage difference (A - B)
ratio coverage difference (A + 0.001)/(B + 0.001)
- With
-a, additional attributemin_contig_lenfor minimum length to consider a contig for feature selection.
select_feature¶
Select informative feature from a methylation profile matrix.
Usage:
nanodisco select_feature -p <nb_threads> -r <path_fasta> -s <path_profile> -b <base_name> -o <path_output> [+ advanced parameters]
-p : Number of threads to use.
-r : Path to reference metagenome (.fasta).
-s : Path to methylation profile file (*.RDS produced from nanodisco profile).
-b : Base name for outputting results (e.g. Ecoli_K12). Default is 'results'.
-o : Path to output directory. Default is current directory.
Advanced parameters. We recommend leaving them set to default values:
--fsel_min_contig_len : Minimum length to consider a contig for feature selection. Default is 100000.
--fsel_min_cov : Minimum average coverage to consider a contig for feature selection. Default is 10.
--fsel_min_motif_occ : Minimum number of motif occurrences in a contig for feature selection. Default is 20.
--fsel_min_signal : Absolute threshold for considering a feature informative. Default is 1.5.
Output:
- Selected methylation features (
<path_output>/selected_features_<base_name>.RDS):
columns:
feature_name feature identification (<motif>_<relative_distance>)
motif motif sequence (e.g. CCWGG)
contigs_origin list of contigs with significant feature (e.g. contig1|contig2)
attribute:
contig_coverage (data.frame), same format as in nanodisco profile
filter_profile¶
Compute the methylation profile matrix for selected features for a metagenome sample.
Usage:
nanodisco filter_profile -p <nb_threads> -r <path_fasta> -d <path_difference> -f <path_feature> -b <base_name> -o <path_output> [+ advanced parameters]
-p : Number of threads to use.
-r : Path to reference metagenome (.fasta).
-d : Path to current differences file (*.RDS produced from nanodisco difference).
-f : Path to selected features file (*.RDS produced from nanodisco selected_feature).
-b : Base name for outputting results (e.g. Ecoli_K12). Default is 'results'.
-o : Path to output directory. Default is current directory.
Advanced parameters. We recommend leaving them set to default values:
-c : Minimum coverage/number of current values needed at given position for methylation feature computation. Default is 10.
Output:
- Filtered methylation profile matrix (
<path_output>/methylation_profile_<base_name>.RDS):
columns:
contig name of contig
motif motif sequence (e.g. CCWGG)
distance_motif relative distance to first base of motif occurrence (0-based)
signal_ratio for development only. Expected strength of signal if motif known.
dist_score methylation feature value at relative distance (absolute average current difference across all motif occurrences)
nb_occurrence number of motif occurrence in the contig
attribute:
contig_coverage (data.frame), same format as in nanodisco profile
binning¶
Perform methylation binning, cluster metagenomic contigs according to methylation feature similarities using t-SNE.
Usage:
nanodisco binning -r <path_fasta> -s <path_profile> -b <base_name> -o <path_output> [+ advanced parameters]
-r : Path to reference metagenome (.fasta).
-s : Path to methylation profile file (*.RDS produced from nanodisco profile).
-b : Base name for outputting results (e.g. Ecoli_K12). Default is 'results'.
-o : Path to output directory. Default is current directory.
Advanced parameters. We recommend leaving them set to default values:
--min_motif_occ : Minimum number of motif occurrence to conserve entry in the methylation profile matrix. Default is 5.
--min_contig_len : Minimum contig length to conserve entry in the methylation profile matrix. Default is 25000.
--contig_weight_unit : Weight unit (bp) used for additional exaggeration in binning. Default is 50000.
--max_relative_weight : Maximum relative weight a contig can have, weighting ceiling. Default is 0.05.
--tsne_perplexity : t-SNE perplexity parameter. Default is 30.
--tsne_max_iter : t-SNE maximum iteration parameter. Default is 2500.
--tsne_seed : Seed set before t-SNE processing using set.seed function. Default is 101.
--rdm_seed : Seed used for random number generation in missing value filling using set.seed function. Default is 42.
Output:
- Methylation binning results from t-SNE dimensionality reduction (
<path_output>/methylation_binning_<base_name>.RDS):
columns:
tSNE_1 x coordinate from t-SNE dimensionality reduction
tSNE_2 y coordinate from t-SNE dimensionality reduction
contig name of contig
contig_length contig length
id contig identifier (e.g. species), is NA by default
plot_binning¶
Plot results of methylation binning.
Usage:
nanodisco plot_binning -r <path_fasta> -u <path_methylation_binning> -b <base_name> -o <path_output> [+ advanced parameters]
-r : Path to reference metagenome (.fasta).
-u : Path to methylation binning file (*.RDS produced from nanodisco binning).
-b : Base name for outputting results (e.g. Ecoli_K12). Default is 'results'.
-o : Path to output directory. Default is current directory.
Advanced parameters. We recommend leaving them set to default values:
-a : Path to contig annotation. We expect two columns .txt or .RDS file with contig_name<tab>custom_name.
-c : Comma separated list of MGE contigs (e.g. contig_1,contig_3).
--list_MGE_contig : Comma separated list of MGE contigs (e.g. contig_1,contig_3).
--MGEs_file : Path to file with list of MGE contigs (one per line).
--xlim : Optional x-axis zooming (e.g. -5:10).
--ylim : Optional y-axis zooming (e.g. -10:9).
--min_contig_len : Minimum length for plotting contigs. Default is 25000 bp.
--split_fasta : Split reference metagenome into binned fasta ('yes' from annotation, 'default'|'<int,int>' from dbscan cluster analysis. Default is 'no'.
Output:
- Methylation binning figure (
Contigs_methylation_tsne_<base_name>.pdf) similar to Figure 5a-b in the preprint
score¶
Attribute methylation scores to each motif occurrence.
Usage:
nanodisco score -p <nb_threads> -b <base_name> -d <path_difference> -o <path_output> -m <motif1,motif2,...> -r <path_fasta>
-p : Number of threads to use. Default is 1.
-b : Base name for outputting results (e.g. Ecoli_K12). Default is 'results'.
-d : Path to current differences file (*.RDS produced from nanodisco difference).
-o : Path to output directory. Default is current directory.
-m : Comma separated list of motifs following IUPAC nucleotide code (e.g. GATC,CCWGG).
-r : Path to a reference genome (i.e. fasta).
-c : (Optional) Comma separated list of contigs (e.g. contig_1,contig_3).
--contigs_file : (Optional) Path to file with list of contigs (one per line).
-h : Print help.
Output:
- Methylation score for each occurrence of the supplied motif(s) in a text file (
Motifs_occurrences_scores_<base_name>.tsv).
columns:
contig name of contig
pos_motif genomic position of motif start
motif motif sequence (e.g. CCWGG)
pos_signal genomic position of motif start
dir genomic strand, fwd or rev
strand read strand, used when 2D nanopore reads
cov_wga average coverage of the motif in the wga dataset
cov_nat average coverage of the motif in the nat dataset
score maximum of the averaged current differences ([-2, +3])