Tutorial¶

The aim of this tutorial is to show how to build a pipeline to analyse metagenomic samples. Moreover, the SNPs calling part was made to show how diversity estimates can be calculated from metagenomic data, hence it should be changed to be more strict.

We’re going to use Peru Margin Subseafloor Biosphere as an example, which can be download from the ENA website.

This tutorial is expected to run on a UNIX (Linux/MacOSX/Solaris), with the bash shell running, because of some of the loops (not tested with other shells).

Note

We assume that all scripts/commands are run in the same directory.

Warning

It is advised to run the tutorial on a cluster/server: the memory requirements for the programs used are quite high (external to the library).

Initial setup¶

We will assume that the pipeline and it’s relative packages are already installed on the system where the tutorial is run, either through a system-wide install or a virtual environment (advised). The details are in the Installation section of the documentation.

Also for the rest of the tutorial we assume that the following software are installed and accessible system-wide:

Getting Sequence Data¶

The data is stored on the EBI ftp as well, and can be downloaded with the following command (on Linux):

$wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001326/SRR001326.fastq.gz$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001325/SRR001325.fastq.gz
$wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001323/SRR001323.fastq.gz$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001322/SRR001322.fastq.gz


on MacOSX you can replace wget with curl -O.

And then uncompress with:

$gunzip *.fastq.gz  Taxonomy Data¶ We only need the taxonomy for an optional part of the gene prediction for the analysis. It can be downloaded using the command: $ download-taxonomy.sh


The data will be saved in the file taxonomy.pickle to which we’ll refer from now on. More information can be found in Download Taxonomy

Metagenome Assembly¶

We’re going to use velvet to assemble the metagenomics sample, using the following commands in bash:

$velveth velvet_work 31 -fmtAuto *.fastq$ velvetg velvet_work -min_contig_lgth 50


The contigs are in the velvet_work/contigs.fa file. We want to take out some of the information in each sequence header, to make it easier to identify them. We decided to keep only NODE_#, where # is a unique number in the file (e.g. from >NODE_27_length_157_cov_703.121033 we keep only >NODE_27). We used this command in bash:

$cat velvet_work/contigs.fa | sed -E 's/(>NODE_[0-9]+)_.+/\1/g' > assembly.fa  Alternatively, fasta-utils uid can be used to avoid problems with spaces in the headers: $ fasta-utils uid cat velvet_work/contigs.fa assembly.fa


Gene Prediction¶

Gene prediction can be done with any software that supports the tab format that BLAST outputs. Besides BLAST, RAPSearch can be used as well.

Before that a suitable DB must be downloaded. In this tutorial we’ll use the SwissProt portion of Uniprot <http://www.uniprot.org> that can be downloaded using the following commands:

$wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz$ gunzip uniprot_sprot.fasta.gz


Using BLAST¶

BLAST needs the DB to be indexed using the following command:

$makeblastdb -dbtype prot -in uniprot_sprot.fasta  After which BLAST can be run: $ blastx -query assembly.fasta -db uniprot_sprot.fasta -out \
assembly.uniprot.tab -outfmt 6


Using RAPSearch¶

RAPSearch is faster than BLAST, while giving similar results. As with BLAST, there is a command to be executed before it can predict genes:

$prerapsearch -d uniprot_sprot.fasta -n uniprot_sprot  After this command is complete its execution, RAPSearch can be started: $ rapsearch -q assembly.fa -d uniprot_sprot -o assembly.uniprot.tab


RAPSearch will produce two files, assembly.uniprot.tab.m8 and assembly.uniprot.tab.aln. assembly.uniprot.tab.m8 is the file in the correct format, so we can rename it and remove the other one:

$rm assembly.uniprot.tab.aln$ mv assembly.uniprot.tab.m8 assembly.uniprot.tab


Create the GFF¶

After BLAST or RAPSearch are finished, we can convert all predictions to a GFF file:

$blast2gff uniprot -b 40 -db UNIPROT-SP -dbq 10 assembly.uniprot.tab \ assembly.uniprot.gff  And then, because the number of annotations is high, we filter them to reduce the number of overlapping annotations: $ filter-gff overlap assembly.uniprot.gff assembly.uniprot-filt.gff


This will result in a smaller file. Both script supports piping, so they can be used together, for example to save a compressed file:

$blast2gff uniprot -b 40 -db UNIPROT-SP -dbq 10 assembly.uniprot.tab | \ filter-gff overlap - assembly.uniprot-filt.gff  Finally, rename the filtered GFF file: $ mv assembly.uniprot-filt.gff assembly.uniprot.gff


Warning

filter-gff may require a lot of memory, so it’s recommended to read its documentation for strategies on lowering the memory requirements for big datasets (a small script to sort a GFF is included sort-gff.sh)

Taxonomic Refinement¶

This section is optional, as taxonomic identifiers are assigned using Uniprot, but it can result in better identification. It requires the the nt database from NCBI to be found on the system, in the ncbi-db directory.

if you don’t have the nt database installed, it can be downloaded (> 80GB uncompressed, about 30 compressed) with this command (you’ll need to install ncftpget):

$mkdir ncbi-db$ cd ncbi-db
$ncftpget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt*.gz$ tar xfvz *.tar.gz
$cd ..  To do it, first the nucleotide sequences must be extracted and then use blastn against the nt database: $ get-gff-info sequence -f assembly.fa assembly.uniprot.gff \
assembly.uniprot.frag.fasta
$blastn -query assembly.uniprot.frag.fasta -db ncbi-db/nt -out \ assembly.uniprot.frag.tab -outfmt 6  After BLAST completes, we need to download supporting file to associate the results with the taxonomic information: $ download-ncbi-taxa.sh
$gunzip -c ncbi-nucl-taxa.gz | taxon-utils to_hdf -n nt  We now need to run the taxon-utils (taxon-utils - Taxonomy Utilities) script to find the LCA for each annotation. BLAST will output too many matches, so we want to also filter this file first, with filter-gff. First we convert into GFF the BLAST tab file, then use filter-gff to pick only the 95% quantile of hit length out of all hits and finally filter to get the 95% of identities. Finally run taxon-utils to get the LCA table: $ blast2gff blastdb -i 3 -r assembly.uniprot.frag.tab | \
filter-gff sequence -t -a length -f quantile -l 0.95 -c gt | \
filter-gff sequence -t -a identity -f quantile -l 0.95 -c gt | \
taxon-utils lca -b 40 -t taxonomy.pickle -s -p - lca.tab


What we do is convert the BLAST results into a GFF file, removing the version information from the accession. Then filter the GFF keeping only the annotation which are in the top 5% of indentity scores, but also use only annotations that have a bitscore of 40 and write the result as a 2 columns table.

We can now run the script to add the taxonomic information to the GFF file, with:

$add-gff-info addtaxa -v -t lca.tab -a seq_id -db NCBI-NT \ assembly.uniprot.gff assembly.uniprot-taxa.gff  after it completes, it is safe to rename the output GFF: $ mv assembly.uniprot-taxa.gff assembly.uniprot.gff


Complete GFF¶

To add the remaining information, mapping to KO and others, including the taxonomic information, a script is provided that downloads this information into a GFF file:

$add-gff-info uniprot --buffer 500 -t -e -ec -ko \ assembly.uniprot.gff assembly.uniprot-final.gff  After which we can rename the GFF file: $ mv assembly.uniprot-final.gff assembly.uniprot.gff


Alignment¶

The alignment of all reads to the assembly we’ll be made with bowtie2. The first step is to build the index for the reference (out assembly) with the following command:

bowtie2-build assembly.fa assembly.fa  and subsequently start the alignment, using bowtie2 and piping the output SAM file to samtools to convert it into BAM files with this command: for file in *.fastq; do BASENAME=basenamefile .fastq
bowtie2 -N 1 -x assembly.fa -U $file \ --very-sensitive-local \ --rg-id$BASENAME --rg PL:454 --rg PU:454 \
--rg SM:$BASENAME | samtools view -Sb - >$BASENAME.bam;
done


We’ll have BAM files which we need to sort and index:

for file in *.bam; do
samtools sort -o basename $file .bam-sort.bam$file;
mv basename $file .bam-sort.bam$file
samtools index $file; done  Coverage and SNP Info¶ The coverage information is added to the GFF and needs to be added for later SNP analysis, including information about the expected number of synonymous and non-synonymous changes. The following lines can do it, using one of the scripts included with the library: $ export SAMPLES=$(for file in *.bam; do echo -a basename$file .bam,$file ;done)$ add-gff-info coverage $SAMPLES assembly.uniprot.gff | add-gff-info \ exp_syn -r assembly.fa > assembly.uniprot-update.gff$ mv assembly.uniprot-update.gff assembly.uniprot.gff
$unset SAMPLES  The first line prepares part of the command line for the script and stores it into an environment variable, while the last command unsets the variable, as it’s not needed anymore. The second command adds the expected number of synonymous and non-synonymous changes for each annotation. A faster way to add the coverage to a GFF file is to use the cov_samtools command instead: $ for x in *.bam; do samtools depth -aa $x > basename$x .bam.depth; done
$add-gff-info cov_samtools$(for file in *.depth; do echo -s basename $file .depth -d$file ;done) assembly.uniprot.gff assembly.uniprot-update.gff
$mv assembly.uniprot-update.gff assembly.uniprot.gff  This requires the creation of depth files from samtools, which can be fairly big. The script will accept files compressed with gzip, bzip2 (and xz if the module is available), but will be slower. For this tutorial, each uncompressed depth file is aboud 110MB. The coverage command memory footprint is tied to the GFF file (kept in memory). The cov_samtools reads the depth information one line at a time and keeps a numpy array for each sequence in memory (and each sample), while the GFF is streamed. SNP Calling¶ bcftools¶ For calling SNPs, we can use bcftools (v 1.8 was tested) bcftools mpileup -f assembly.fa -Ou *.bam | bcftools call -m -v -O v --ploidy 1 -A -o assembly.vcf  Data Preparation¶ Diversity Analysis¶ To use diversity estimates (pN/pS) for the data, we need to first first is aggregate all SNP information from the vcf file into data structures that can be read and analysed by the library. This can be done using the included script snp_parser, with this lines of bash: $ export SAMPLES=$(for file in *.bam; do echo -m basename$file .bam;done)
$snp_parser -s -v -g assembly.uniprot.gff -p assembly.vcf -a assembly.fa$SAMPLES
$unset SAMPLES  Note The -s options must be added if the VCF file was created with bcftools Count Data¶ To evaluate the abundance of taxa and functional categories in the data we need to produce one file for each sample using htseq-count, from the HTSeq library. for file in *.bam; do htseq-count -f bam -r pos -s no -t CDS -i uid -a 8 \ -m intersection-nonempty$file assembly.uniprot.gff \
> basename $file .bam-counts.txt done  And to add the counts to the GFF file: $ add-gff-info counts for x in *.bam; do echo -s $(basename$x .bam); done \
for x in *-counts.txt; do echo -c $x; done assembly.uniprot.gff tmp.gff$ mv tmp.gff assembly.uniprot.gff


Alternatively featureCounts from the subread package can be used:

$featureCounts -a assembly.uniprot.gff -g uid -O -t CDS -o counts-featureCounts.txt *.bam  And adding it to the GFF is similar: $ add-gff-info counts for x in *.bam; do echo -s $(basename$x .bam); done -c counts-featureCounts.txt -e assembly.uniprot.gff tmp.gff
$mv tmp.gff assembly.uniprot.gff  Note however that there will be one file only made by featureCounts and that is allowed when using add-gff-info counts when the -e option is passed. Additional Downloads¶ The following files needs to be downloaded to analyse the functional categories in the following script: $ wget http://eggnog.embl.de/version_3.0/data/downloads/COG.members.txt.gz
$wget http://eggnog.embl.de/version_3.0/data/downloads/NOG.members.txt.gz$ wget http://eggnog.embl.de/version_3.0/data/downloads/COG.funccat.txt.gz
$wget http://eggnog.embl.de/version_3.0/data/downloads/NOG.funccat.txt.gz  and this for Enzyme Classification: $ wget ftp://ftp.expasy.org/databases/enzyme/enzclass.txt


Full Bash Script¶

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 #!/bin/bash #download data #50 meters wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001326/SRR001326.fastq.gz #1 meter wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001325/SRR001325.fastq.gz #32 meters wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001323/SRR001323.fastq.gz #16 meters wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001322/SRR001322.fastq.gz #uncompress data gunzip -v *.fastq.gz #assembly - preparatory phase velveth velvet_work 31 -fmtAuto *.fastq #assembly velvetg velvet_work -min_contig_lgth 50 #change sequence names cat velvet_work/contigs.fa | sed -E 's/(>NODE_[0-9]+)_.+/\1/g' > assembly.fa #alternative #fasta-utils uid cat velvet_work/contigs.fa assembly.fa #remove velvet working directory rm -R velvet_work #To use the LCA option and other analysis we need a taxonomy file download-taxonomy.sh #Uniprot SwissProt DB wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz #Uncompress it gunzip uniprot_sprot.fasta.gz ######## #Gene prediction ###BLAST #index Uniprot makeblastdb -dbtype prot -in uniprot_sprot.fasta #Run blastx blastx -query assembly.fa -db uniprot_sprot.fasta -out \ assembly.uniprot.tab -outfmt 6 ###RAPSearch #Index # prerapsearch -d uniprot_sprot.fasta -n uniprot_sprot #Run # rapsearch -q assembly.fa -d uniprot_sprot -o assembly.uniprot.tab #rename .m8 file to assembly.uniprot.tab and delete .aln #rm assembly.uniprot.tab.aln #mv assembly.uniprot.tab.m8 assembly.uniprot.tab ######## #Converts gene prediction into GFF annotations blast2gff uniprot -b 40 -db UNIPROT-SP -dbq 10 assembly.uniprot.tab \ assembly.uniprot.gff filter-gff overlap assembly.uniprot.gff assembly.uniprot-filt.gff #rename the new filtered file mv assembly.uniprot-filt.gff assembly.uniprot.gff ######## #Taxonomic refinement - requires NCBI nt DB installed and indexed export NCBINT_DIR=ncbi-db if [ -d "$NCBINT_DIR" ]; then echo "Taxonomic refinement"; #Extract annotations sequences get-gff-info sequence -f assembly.fa assembly.uniprot.gff \ assembly.uniprot.frag.fasta #Use blastn to match against NCBI NT blastn -query assembly.uniprot.frag.fasta -db ncbi-db/nt -out \ assembly.uniprot.frag.tab -outfmt 6 #Download necessary data download-ncbi-taxa.sh gunzip -c ncbi-nucl-taxa.gz | taxon-utils to_hdf -n nt # Get the LCA for the sequences blast2gff blastdb -i 3 -r assembly.uniprot.frag.tab | \ filter-gff sequence -t -a length -f quantile -l 0.95 -c gt | \ filter-gff sequence -t -a identity -f quantile -l 0.95 -c gt | \ add-gff-info addtaxa -f taxa-table.hf5:nt | \ taxon-utils lca -b 40 -t taxonomy.pickle -s -p - lca.tab # Add the LCA info to the GFF add-gff-info addtaxa -v -t lca.tab -a seq_id -db NCBI-NT \ assembly.uniprot.gff assembly.uniprot-taxa.gff #rename the file to continue the script mv assembly.uniprot-taxa.gff assembly.uniprot.gff fi unset NCBINT_DIR ######## #Finalise information from Gene Prediction #Adds remaining taxonomy, EC numbers, KO and eggNOG mappings add-gff-info uniprot --buffer 500 -t -e -ec -ko \ assembly.uniprot.gff assembly.uniprot-final.gff #Rename the GFF mv assembly.uniprot-final.gff assembly.uniprot.gff ######## #Alignments bowtie2-build assembly.fa assembly.fa for file in *.fastq; do BASENAME=basename$file .fastq; bowtie2 -N 1 -x assembly.fasta -U $file \ --very-sensitive-local \ --rg-id$BASENAME --rg PL:454 --rg PU:454 \ --rg SM:$BASENAME | samtools view -Sb - >$BASENAME.bam; done #sort and index BAM files with samtools for file in *.bam; do samtools sort $file basename$file .bam-sort; #removes the unsorted file, it's not needed mv basename $file .bam-sort.bam$file samtools index $file; done ######## #Add coverage and expected changes to GFF file #export SAMPLES=$(for file in *.bam; do echo -a basename $file .bam,$file ;done) #Coverage info #add-gff-info coverage $SAMPLES assembly.uniprot.gff | add-gff-info \ # exp_syn -r assembly.fa > assembly.uniprot-update.gff #rename to continue the script #mv assembly.uniprot-update.gff assembly.uniprot.gff #unset SAMPLES #Faster for x in *.bam; do samtools depth -aa$x > basename $x .bam.depth; done add-gff-info cov_samtools$(for file in *.depth; do echo -s basename file .depth -d $file ;done) assembly.uniprot.gff assembly.uniprot-update.gff mv assembly.uniprot-update.gff assembly.uniprot.gff ######## #SNP calling using samtools bcftools mpileup -f assembly.fa -Ou *.bam | bcftools call -v -O v --ploidy 1 -A -o assembly.vcf #Index fasta with Picard tools - GATK requires it #java -jar picard-tools/picard.jar CreateSequenceDictionary \ # R=assembly.fa O=assembly.dict #merge vcf #export SAMPLES=$(for file in *.bam.vcf; do echo -V:basename $file .bam.vcf$file ;done) # java -Xmx10g -jar GATK/GenomeAnalysisTK.jar \ # -R assembly.fa -T CombineVariants -o assembly.vcf \ # -genotypeMergeOptions UNIQUIFY \ # $SAMPLES # unset SAMPLES ######## #snp_parser export SAMPLES=$(for file in *.bam; do echo -m basename $file .bam;done) snp_parser -s -v -g assembly.uniprot.gff -p assembly.vcf -a assembly.fa$SAMPLES unset SAMPLES ######## #Count reads # Using HTSeq for file in *.bam; do htseq-count -f bam -r pos -s no -t CDS -i uid -a 8 \ -m intersection-nonempty $file assembly.uniprot.gff \ > basename$file .bam-counts.txt done # Adding counts to GFF add-gff-info counts for x in *.bam; do echo -s $(basename$x .bam); done \ for x in *-counts.txt; do echo -c $x; done assembly.uniprot.gff tmp.gff mv tmp.gff assembly.uniprot.gff # Using featureCounts #featureCounts -a assembly.uniprot.gff -g uid -O -t CDS -o counts-featureCounts.txt *.bam #add-gff-info counts for x in *.bam; do echo -s$(basename \$x .bam); done -c counts-featureCounts.txt -e assembly.uniprot.gff tmp.gff #mv tmp.gff assembly.uniprot.gff ######## #eggNOG mappings wget http://eggnog.embl.de/version_3.0/data/downloads/COG.members.txt.gz wget http://eggnog.embl.de/version_3.0/data/downloads/NOG.members.txt.gz wget http://eggnog.embl.de/version_3.0/data/downloads/COG.funccat.txt.gz wget http://eggnog.embl.de/version_3.0/data/downloads/NOG.funccat.txt.gz ######## #EC names wget ftp://ftp.expasy.org/databases/enzyme/enzclass.txt 

Footnotes

 [1] Picard Tools needs to be found in the directory picard-tools in the same directory as this tutorial.
 [2] GATK directory is expected to be called GATK and inside the tutorial directory. It also needs java v1.7.x in newer versions.