Changed in version 0.3.4: updates

Tutorial

The aim of this tutorial is to show how to build a pipeline to analyse metagenomic samples. Note that the SNP calling part is intended only to show how diversity estimates can be calculated from metagenomic data; for production use its parameters should be made more strict.

We’re going to use the Peru Margin Subseafloor Biosphere study as an example, which can be downloaded from the ENA website.

This tutorial is expected to run on a UNIX system (Linux/MacOSX/Solaris) with the bash shell, because some of the loops are bash-specific (not tested with other shells).

Note

We assume that all scripts/commands are run in the same directory.

Warning

It is advised to run the tutorial on a cluster/server: the memory requirements of the external programs used are quite high.

Initial setup

We will assume that the pipeline and its related packages are already installed on the system where the tutorial is run, either through a system-wide install or a virtual environment (advised). The details are in the Installation section of the documentation.

For the rest of the tutorial we also assume that the following software is installed and accessible system-wide:

velvet
BLAST (or RAPSearch)
bowtie2
samtools and bcftools
HTSeq (or featureCounts from the subread package)

Getting Sequence Data

The data is also stored on the EBI FTP, and can be downloaded with the following commands (on Linux):

$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001326/SRR001326.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001325/SRR001325.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001323/SRR001323.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001322/SRR001322.fastq.gz

On MacOSX you can replace wget with curl -O.
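
The four downloads differ only by run accession, so they can also be written as a short bash loop (a sketch using the same accessions as above):

for run in SRR001322 SRR001323 SRR001325 SRR001326; do
    wget "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/${run}/${run}.fastq.gz";
done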

And then uncompress with:

$ gunzip *.fastq.gz

Taxonomy Data

The taxonomy is needed only for an optional part of the gene prediction in this analysis. It can be downloaded using the command:

$ download-taxonomy.sh

The data will be saved in the file taxonomy.pickle, to which we’ll refer from now on. More information can be found in Download Taxonomy.

Metagenome Assembly

We’re going to use velvet to assemble the metagenomic samples, using the following commands in bash:

$ velveth velvet_work 31 -fmtAuto  *.fastq
$ velvetg velvet_work -min_contig_lgth 50

The contigs are in the velvet_work/contigs.fa file. We want to take out some of the information in each sequence header, to make it easier to identify them. We decided to keep only NODE_#, where # is a unique number in the file (e.g. from >NODE_27_length_157_cov_703.121033 we keep only >NODE_27). We used this command in bash:

$ cat velvet_work/contigs.fa | sed -E 's/(>NODE_[0-9]+)_.+/\1/g' > assembly.fa
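
Applied to the example header above, the substitution keeps only the node ID:

$ echo '>NODE_27_length_157_cov_703.121033' | sed -E 's/(>NODE_[0-9]+)_.+/\1/g'
>NODE_27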

Alternatively, fasta-utils uid can be used to avoid problems with spaces in the headers:

$ fasta-utils uid velvet_work/contigs.fa assembly.fa
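
In either case, a quick sanity check that the renamed headers are unique can save trouble later (the second command prints duplicated headers, so no output is good news):

$ grep -c '^>' assembly.fa
$ grep '^>' assembly.fa | sort | uniq -d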

Gene Prediction

Gene prediction can be done with any software that outputs the tab-separated format produced by BLAST. Besides BLAST itself, RAPSearch can be used as well.

Before that, a suitable DB must be downloaded. In this tutorial we’ll use the SwissProt portion of Uniprot (http://www.uniprot.org), which can be downloaded with the following commands:

$ wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
$ gunzip uniprot_sprot.fasta.gz

Using BLAST

BLAST needs the DB to be indexed using the following command:

$ makeblastdb -dbtype prot -in uniprot_sprot.fasta

After which BLAST can be run:

$ blastx -query assembly.fa -db uniprot_sprot.fasta -out \
    assembly.uniprot.tab -outfmt 6

Using RAPSearch

RAPSearch is faster than BLAST, while giving similar results. As with BLAST, the DB must be indexed before genes can be predicted:

$ prerapsearch -d uniprot_sprot.fasta -n uniprot_sprot

After this command completes, RAPSearch can be started:

$ rapsearch -q assembly.fa -d uniprot_sprot -o assembly.uniprot.tab

RAPSearch will produce two files, assembly.uniprot.tab.m8 and assembly.uniprot.tab.aln. assembly.uniprot.tab.m8 is the file in the correct format, so we can rename it and remove the other one:

$ rm assembly.uniprot.tab.aln
$ mv assembly.uniprot.tab.m8 assembly.uniprot.tab

Create the GFF

After BLAST or RAPSearch finishes, we can convert all predictions to a GFF file:

$ blast2gff uniprot -b 40 -db UNIPROT-SP -dbq 10 assembly.uniprot.tab \
    assembly.uniprot.gff

And then, because the number of annotations is high, we filter them to reduce the number of overlapping annotations:

$ filter-gff overlap assembly.uniprot.gff assembly.uniprot-filt.gff

This will result in a smaller file. Both scripts support piping, so they can be used together, avoiding the intermediate file:

$ blast2gff uniprot -b 40 -db UNIPROT-SP -dbq 10 assembly.uniprot.tab | \
    filter-gff overlap - assembly.uniprot-filt.gff

Finally, rename the filtered GFF file:

$ mv assembly.uniprot-filt.gff assembly.uniprot.gff

Warning

filter-gff may require a lot of memory, so it’s recommended to read its documentation for strategies to lower the memory requirements for big datasets (a small script to sort a GFF, sort-gff.sh, is included).
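
One common strategy is to work on a GFF sorted by sequence ID; if the bundled sort-gff.sh is not at hand, a plain GNU sort on the first column is a minimal sketch of the same idea:

$ sort -s -k 1,1 assembly.uniprot.gff > assembly.uniprot-sorted.gff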

Taxonomic Refinement

This section is optional, as taxonomic identifiers are already assigned using Uniprot, but it can result in better identification. It requires the nt database from NCBI to be present on the system, in the ncbi-db directory.

If you don’t have the nt database installed, it can be downloaded (> 80GB uncompressed, about 30GB compressed) with these commands (you’ll need to install ncftpget):

$ mkdir ncbi-db
$ cd ncbi-db
$ ncftpget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt*.gz
$ for file in *.tar.gz; do tar xfvz $file; done
$ cd ..

To do this, the nucleotide sequences of the annotations are first extracted and then matched against the nt database with blastn:

$ get-gff-info sequence -f assembly.fa assembly.uniprot.gff \
    assembly.uniprot.frag.fasta
$ blastn -query assembly.uniprot.frag.fasta -db ncbi-db/nt -out \
    assembly.uniprot.frag.tab -outfmt 6

After BLAST completes, we need to download a supporting file to associate the results with taxonomic information; the second command stores the accession-to-taxon mapping in an HDF5 file (taxa-table.hf5, used below):

$ download-ncbi-taxa.sh
$ gunzip -c ncbi-nucl-taxa.gz | taxon-utils to_hdf -n nt

We now need to run taxon-utils (taxon-utils - Taxonomy Utilities) to find the LCA for each annotation. BLAST outputs too many matches, so we first filter the results with filter-gff: we convert the BLAST tab file into GFF, keep only the hits above the 95% quantile of hit length, then those above the 95% quantile of identity. Finally we run taxon-utils to get the LCA table:

$ blast2gff blastdb -i 3 -r assembly.uniprot.frag.tab | \
    filter-gff sequence -t -a length -f quantile -l 0.95 -c gt | \
    filter-gff sequence -t -a identity -f quantile -l 0.95 -c gt | \
    add-gff-info addtaxa -f taxa-table.hf5:nt | \
    taxon-utils lca -b 40 -t taxonomy.pickle -s -p - lca.tab

What we do is convert the BLAST results into a GFF file, removing the version information from the accessions. We then filter the GFF, keeping only the annotations in the top 5% of hit lengths and identity scores, add the taxonomic information, use only annotations with a bit score of at least 40, and write the result as a two-column table.

We can now run the script to add the taxonomic information to the GFF file, with:

$ add-gff-info addtaxa -v -t lca.tab -a seq_id -db NCBI-NT \
    assembly.uniprot.gff assembly.uniprot-taxa.gff

After it completes, it is safe to rename the output GFF:

$ mv assembly.uniprot-taxa.gff assembly.uniprot.gff

Complete GFF

To add the remaining information (taxonomy, EC numbers, KO and eggNOG mappings), a script is provided that downloads it and writes an updated GFF file:

$ add-gff-info uniprot --buffer 500 -t -e -ec -ko \
    assembly.uniprot.gff assembly.uniprot-final.gff

After which we can rename the GFF file:

$ mv assembly.uniprot-final.gff assembly.uniprot.gff

Alignment

The alignment of all reads to the assembly will be made with bowtie2. The first step is to build the index for the reference (our assembly) with the following command:

$ bowtie2-build assembly.fa assembly.fa

and subsequently start the alignments, piping the SAM output of bowtie2 to samtools to convert it into BAM files, with this command:

for file in *.fastq; do
    BASENAME=`basename $file .fastq`
    bowtie2 -N 1 -x assembly.fa -U $file \
    --very-sensitive-local \
    --rg-id $BASENAME --rg PL:454 --rg PU:454 \
    --rg SM:$BASENAME | samtools view -Sb - > $BASENAME.bam;
done

The resulting BAM files need to be sorted and indexed:

for file in *.bam; do
    samtools sort -o `basename $file .bam`-sort.bam $file;
    mv `basename $file .bam`-sort.bam $file
    samtools index $file;
done
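
Optionally, samtools flagstat gives a quick per-sample summary of how many reads aligned:

for file in *.bam; do
    echo $file;
    samtools flagstat $file;
done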

Coverage and SNP Info

Coverage information needs to be added to the GFF file for the later SNP analysis, along with the expected number of synonymous and non-synonymous changes for each annotation. The following lines do it, using scripts included with the library:

$ export SAMPLES=$(for file in *.bam; do echo -a `basename $file .bam`,$file ;done)
$ add-gff-info coverage $SAMPLES assembly.uniprot.gff | add-gff-info \
    exp_syn -r assembly.fa > assembly.uniprot-update.gff

$ mv assembly.uniprot-update.gff assembly.uniprot.gff
$ unset SAMPLES

The first line prepares part of the command line for the script and stores it in an environment variable; the second command adds the coverage and the expected number of synonymous and non-synonymous changes for each annotation; the last command unsets the variable, as it’s no longer needed.
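
For this dataset, since the BAM files are named after the run accessions, SAMPLES expands to one -a sample,file pair per BAM file (shown here only for illustration):

$ echo $SAMPLES
-a SRR001322,SRR001322.bam -a SRR001323,SRR001323.bam -a SRR001325,SRR001325.bam -a SRR001326,SRR001326.bam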

A faster way to add the coverage to a GFF file is to use the cov_samtools command instead:

$ for x in *.bam; do samtools depth -aa $x > `basename $x .bam`.depth; done
$ add-gff-info cov_samtools $(for file in *.depth; do echo -s `basename $file .depth` -d $file ;done) assembly.uniprot.gff assembly.uniprot-update.gff
$ mv assembly.uniprot-update.gff assembly.uniprot.gff

This requires the creation of depth files with samtools, which can be fairly big. The script will accept files compressed with gzip or bzip2 (and xz if the module is available), but will be slower. For this tutorial, each uncompressed depth file is about 110MB.

The memory footprint of the coverage command is tied to the size of the GFF file (kept in memory). cov_samtools instead reads the depth information one line at a time, keeping one numpy array per sequence (and per sample) in memory, while the GFF is streamed.
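
Once the GFF is updated, the depth files can be compressed (the script also reads gzipped files, as noted above) or simply removed to reclaim space:

$ gzip *.depth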

SNP Calling

bcftools

For calling SNPs, we can use bcftools (v1.8 was tested):

$ bcftools mpileup -f assembly.fa -Ou *.bam | bcftools call -m -v -O v --ploidy 1 -A -o assembly.vcf
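
A quick, optional check of how many variants were called can be done with bcftools stats:

$ bcftools stats assembly.vcf | grep 'number of SNPs:'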

Data Preparation

Diversity Analysis

To use diversity estimates (pN/pS) on the data, the first step is to aggregate all SNP information from the VCF file into data structures that can be read and analysed by the library. This can be done using the included script snp_parser, with these lines of bash:

$ export SAMPLES=$(for file in *.bam; do echo -m `basename $file .bam`;done)
$ snp_parser -s -v -g assembly.uniprot.gff -p assembly.vcf -a assembly.fa $SAMPLES
$ unset SAMPLES

Note

The -s option must be added if the VCF file was created with bcftools.

Count Data

To evaluate the abundance of taxa and functional categories in the data we need to produce one file for each sample using htseq-count, from the HTSeq library.

for file in *.bam; do
    htseq-count -f bam -r pos -s no -t CDS -i uid -a 8 \
    -m intersection-nonempty $file assembly.uniprot.gff \
    > `basename $file .bam`-counts.txt
done
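
htseq-count appends summary counters (lines starting with __, such as __no_feature) at the end of each output file; a quick look at them shows how many reads could not be assigned:

$ tail -5 SRR001322-counts.txt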

And to add the counts to the GFF file:

$ add-gff-info counts `for x in *.bam; do echo -s $(basename $x .bam); done` \
    `for x in *-counts.txt; do echo -c $x; done` assembly.uniprot.gff tmp.gff
$ mv tmp.gff assembly.uniprot.gff

Alternatively featureCounts from the subread package can be used:

$ featureCounts -a assembly.uniprot.gff -g uid -O  -t CDS -o counts-featureCounts.txt *.bam

And adding it to the GFF is similar:

$ add-gff-info counts `for x in *.bam; do echo -s $(basename $x .bam); done` -c counts-featureCounts.txt -e assembly.uniprot.gff tmp.gff
$ mv tmp.gff assembly.uniprot.gff

Note however that featureCounts produces a single file for all samples; this is supported by add-gff-info counts when the -e option is passed.

Additional Downloads

The following files need to be downloaded for the analysis of functional categories:

$ wget http://eggnog.embl.de/version_3.0/data/downloads/COG.members.txt.gz
$ wget http://eggnog.embl.de/version_3.0/data/downloads/NOG.members.txt.gz
$ wget http://eggnog.embl.de/version_3.0/data/downloads/COG.funccat.txt.gz
$ wget http://eggnog.embl.de/version_3.0/data/downloads/NOG.funccat.txt.gz

and this for Enzyme Classification:

$ wget ftp://ftp.expasy.org/databases/enzyme/enzclass.txt

Full Bash Script

#!/bin/bash

#download data
#50 meters
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001326/SRR001326.fastq.gz
#1 meter
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001325/SRR001325.fastq.gz
#32 meters
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001323/SRR001323.fastq.gz
#16 meters
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001322/SRR001322.fastq.gz
#uncompress data
gunzip -v *.fastq.gz

#assembly - preparatory phase
velveth velvet_work 31 -fmtAuto *.fastq
#assembly
velvetg velvet_work -min_contig_lgth 50
#change sequence names
cat velvet_work/contigs.fa | sed -E 's/(>NODE_[0-9]+)_.+/\1/g' > assembly.fa
#alternative
#fasta-utils uid velvet_work/contigs.fa assembly.fa
#remove velvet working directory
rm -R velvet_work

#To use the LCA option and other analysis we need a taxonomy file
download-taxonomy.sh

#Uniprot SwissProt DB
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
#Uncompress it
gunzip uniprot_sprot.fasta.gz

########
#Gene prediction

###BLAST
#index Uniprot
makeblastdb -dbtype prot -in uniprot_sprot.fasta
#Run blastx
blastx -query assembly.fa -db uniprot_sprot.fasta -out \
	assembly.uniprot.tab -outfmt 6

###RAPSearch
#Index
# prerapsearch -d uniprot_sprot.fasta -n uniprot_sprot
#Run
# rapsearch -q assembly.fa -d uniprot_sprot -o assembly.uniprot.tab
#rename .m8 file to assembly.uniprot.tab and delete .aln
#rm assembly.uniprot.tab.aln
#mv assembly.uniprot.tab.m8 assembly.uniprot.tab

########
#Converts gene prediction into GFF annotations
blast2gff uniprot -b 40 -db UNIPROT-SP -dbq 10 assembly.uniprot.tab \
	assembly.uniprot.gff
filter-gff overlap assembly.uniprot.gff assembly.uniprot-filt.gff
#rename the new filtered file
mv assembly.uniprot-filt.gff assembly.uniprot.gff

########
#Taxonomic refinement - requires NCBI nt DB installed and indexed
export NCBINT_DIR=ncbi-db
if [ -d "$NCBINT_DIR" ]; then
	echo "Taxonomic refinement";
	#Extract annotations sequences
	get-gff-info sequence -f assembly.fa assembly.uniprot.gff \
        assembly.uniprot.frag.fasta
	#Use blastn to match against NCBI NT
	blastn -query assembly.uniprot.frag.fasta -db ncbi-db/nt -out \
		assembly.uniprot.frag.tab -outfmt 6

	#Download necessary data
	download-ncbi-taxa.sh
	gunzip -c ncbi-nucl-taxa.gz | taxon-utils to_hdf -n nt

	# Get the LCA for the sequences
	blast2gff blastdb -i 3 -r assembly.uniprot.frag.tab | \
        filter-gff sequence -t -a length -f quantile -l 0.95 -c gt | \
        filter-gff sequence -t -a identity -f quantile -l 0.95 -c gt | \
        add-gff-info addtaxa -f taxa-table.hf5:nt | \
        taxon-utils lca -b 40 -t taxonomy.pickle -s -p - lca.tab

	# Add the LCA info to the GFF
	add-gff-info addtaxa -v -t lca.tab -a seq_id -db NCBI-NT \
        assembly.uniprot.gff assembly.uniprot-taxa.gff

	#rename the file to continue the script
	mv assembly.uniprot-taxa.gff assembly.uniprot.gff
fi
unset NCBINT_DIR

########
#Finalise information from Gene Prediction
#Adds remaining taxonomy, EC numbers, KO and eggNOG mappings
add-gff-info uniprot --buffer 500 -t -e -ec -ko \
	assembly.uniprot.gff assembly.uniprot-final.gff
#Rename the GFF
mv assembly.uniprot-final.gff assembly.uniprot.gff

########
#Alignments
bowtie2-build assembly.fa assembly.fa
for file in *.fastq; do
	BASENAME=`basename $file .fastq`;
	bowtie2 -N 1 -x assembly.fa -U $file \
	--very-sensitive-local \
	--rg-id $BASENAME --rg PL:454 --rg PU:454 \
	--rg SM:$BASENAME | samtools view -Sb - > $BASENAME.bam;
done
#sort and index BAM files with samtools
for file in *.bam; do
	samtools sort -o `basename $file .bam`-sort.bam $file;
	#removes the unsorted file, it's not needed
	mv `basename $file .bam`-sort.bam $file
	samtools index $file;
done

########
#Add coverage and expected changes to GFF file
#export SAMPLES=$(for file in *.bam; do echo -a `basename $file .bam`,$file ;done)
#Coverage info
#add-gff-info coverage $SAMPLES assembly.uniprot.gff | add-gff-info \
#	exp_syn -r assembly.fa > assembly.uniprot-update.gff
#rename to continue the script
#mv assembly.uniprot-update.gff assembly.uniprot.gff
#unset SAMPLES
#Faster
for x in *.bam; do samtools depth -aa $x > `basename $x .bam`.depth; done
add-gff-info cov_samtools $(for file in *.depth; do echo -s `basename $file .depth` -d $file ;done) assembly.uniprot.gff assembly.uniprot-update.gff
mv assembly.uniprot-update.gff assembly.uniprot.gff

########
#SNP calling using bcftools
bcftools mpileup -f assembly.fa -Ou *.bam | bcftools call -m -v -O v --ploidy 1 -A -o assembly.vcf


#Index fasta with Picard tools - GATK requires it
#java -jar picard-tools/picard.jar CreateSequenceDictionary \
#	R=assembly.fa O=assembly.dict

#merge vcf
#export SAMPLES=$(for file in *.bam.vcf; do echo -V:`basename $file .bam.vcf` $file ;done)
# java -Xmx10g -jar GATK/GenomeAnalysisTK.jar \
# 	  -R assembly.fa -T CombineVariants -o assembly.vcf \
# 	  -genotypeMergeOptions UNIQUIFY \
# 	  $SAMPLES
# unset SAMPLES

########
#snp_parser
export SAMPLES=$(for file in *.bam; do echo -m `basename $file .bam`;done)
snp_parser -s -v -g assembly.uniprot.gff -p assembly.vcf -a assembly.fa $SAMPLES
unset SAMPLES

########
#Count reads
# Using HTSeq
for file in *.bam; do
	htseq-count -f bam -r pos -s no -t CDS -i uid -a 8 \
	-m intersection-nonempty $file assembly.uniprot.gff \
	> `basename $file .bam`-counts.txt
done

# Adding counts to GFF
add-gff-info counts `for x in *.bam; do echo -s $(basename $x .bam); done` \
	`for x in *-counts.txt; do echo -c $x; done` assembly.uniprot.gff tmp.gff

mv tmp.gff assembly.uniprot.gff
# Using featureCounts
#featureCounts -a assembly.uniprot.gff -g uid -O  -t CDS -o counts-featureCounts.txt *.bam
#add-gff-info counts `for x in *.bam; do echo -s $(basename $x .bam); done` -c counts-featureCounts.txt -e assembly.uniprot.gff tmp.gff
#mv tmp.gff assembly.uniprot.gff

########
#eggNOG mappings
wget http://eggnog.embl.de/version_3.0/data/downloads/COG.members.txt.gz
wget http://eggnog.embl.de/version_3.0/data/downloads/NOG.members.txt.gz
wget http://eggnog.embl.de/version_3.0/data/downloads/COG.funccat.txt.gz
wget http://eggnog.embl.de/version_3.0/data/downloads/NOG.funccat.txt.gz

########
#EC names
wget ftp://ftp.expasy.org/databases/enzyme/enzclass.txt

Footnotes

[1] Picard Tools needs to be in the picard-tools directory, inside the same directory as this tutorial.
[2] The GATK directory is expected to be called GATK and to be inside the tutorial directory. Newer versions also require Java v1.7.x.