Using Kraken2 & BBDuk to check for contamination in short-read datasets

This tutorial will guide you through:

Strategies for cleaning and QCing your data

How to install Kraken2, build a database, and use it to “decontaminate” your short reads

How to use BBDuk to clean your datasets for more unusual use cases, such as removing mitochondrial or plastid genomes.

Before you get started…

Kraken2 [1] is a popular program for the taxonomic classification of high-throughput sequencing reads. The program assigns each read in a fastq file to a taxonomic group based on the presence of k-mers in a reference database.

Although Kraken2 is used primarily in metagenomic workflows, it can also be used to check for contamination in short-read datasets of a single organism. Additionally, since it allows you to add your own fasta files to a standard or custom-built database, you can use this classification approach to answer many other interesting questions about sequence containment. I encourage you to get creative!

IMPORTANT: If you found this article, then you are probably hoping to apply Kraken2 to your short-read dataset. However, if you’re flirting with the idea of using it on long-read data or genome assembles, then be warned: this method ONLY works for short reads. If you attempt it on anything else, it will result in almost all of your data being “classified” as a contaminant and thrown out.

[1] clean and QC your data

Before using Kraken2, check the quality of your raw reads with a tool like FastQC or an equivalent.

Trim your reads to remove any contaminating adaptors and/or select reads based on their quality score. Trimmomatic is a great tool for this, but other options exist, such as BBDuk (see below).

After trimming or filtering, re-check your reads with your preferred QC tool to ensure the desired changes were made before continuing.

[2] installing Kraken2

Mamba is a package manager for Python that allows you to easily install and manage software packages, including Kraken2. It is a Python-based CLI that is substantially faster than Conda. Keep in mind, however, that the version of Kraken2 found on the bioconda channel may not reflect the most current version of the program.

To download and install Kraken2 using Mamba, follow these steps:

1. Install Mamba: If you do not already have Mamba installed, you can install it via Mambaforge.

2. Create a virtual environment: To keep the installation of Kraken2 isolated from your system-wide Python packages, it is recommended to create a virtual environment using Mamba. To create a virtual environment for Kraken2, run the following command in your terminal:

				
					mamba create -n kraken2

3. Activate the virtual environment: To activate the virtual environment, run the following command in your terminal:

				
					mamba activate kraken2

4. Install Kraken2: Once the virtual environment is activated, you can install Kraken2 using Mamba by running the following command:

				
					mamba install kraken2

[3] building a Kraken2 database

The standard Kraken2 database will include the reference genomes and taxonomic information for RefSeq bacterial, archaeal, and viral domains plus the human genome and a handful of other important vectors. You can add fungi, plants, and lots of other categories to this database or build a custom one.

If you plan on just using the standard database, then the following command will be sufficient to build and optimize it. However, this is still a memory-intensive task and if you have access to an HPC cluster, I highly recommend using it.

				
					kraken2-build --standard --db $DBNAME

If you want to add more RefSeq categories to your standard database, then utilize the `–download-library` option as shown below for fungi.

				
					kraken2-build --download-library fungi --db $DBNAME

If you want to incorporate other genomes into the standard or a custom database, utilize the `–add-to-library` option.

				
					kraken2-build --add-to-library other_genomes.fa --db $DBNAME

After you’ve added your desired sequences to the database, you’ll need to build it (or, re-build it if you’ve added on to a pre-built one) with the following command:

				
					 kraken2-build --build --db $DBNAME

Keep in mind that Kraken2 may not be able to perform as well in instances that fall outside of its common use cases, such as mitochondrial and/or plastid genomes. See the end of this article for an explanation of how to do this kind of classification outside of Kraken2.

[4] running Kraken2

To classify reads, Kraken2 is run on the command line or in a submission script using the following syntax:

				
					kraken2 --db $DBNAME reads.fastq

If you are using paired-end reads, you can use the following line:

				
					kraken2 --db $DBNAME --unclassified-out unclassified_raw_reads#.fq --classified-out classified_raw_reads#.fq --paired --report --use-names R1.fq R2.fq

where the options `–unclassified-out` and `–classified-out` send the sequences to their own separate files for further analysis. These options can be used without paired end reads too, and it is highly recommended.

`–use-names` will print scientific names instead of just taxids and `–report` will print a report with aggregate counts/clade to a tab-delimited file, where each row is a taxon. This file is different than the file generated via `–output` option, where each row is a classified or unclassified sequence. The name of the results file can be changed using the `–output` switch in Kraken2. Otherwise, the results will be written to standard output.

In the context of a decontamination approach to this workflow, the “classified” sequences are the contaminants and the “unclassified” sequences are your genome of interest. So, in an ideal world, you want those classified files to be as small as possible, but if they’re not, then at least you know.

Unusual Use Cases:

Kraken2 provides the ease of seamless integration with RefSeq, but BBDuk – a part of the BBTools package offered by the JGI – is also very powerful for a lot of the tasks mentioned above. I highly encourage you to read the BBDuk guide to see just HOW much you’re able to do with this tool.

For this tutorial, we’ll be using it to remove the mitochondrial genome from our raw reads to run the program dnaPipeTE, which requires all non-nuclear and contaminant DNA to be removed beforehand to avoid false hits.

I use BBDuk for this because this form of “decontamination” is not covered in Kraken2’s common use cases and I can be more sure that these sequences are actually removed from my dataset.

The easiest way to do this is to take the unclassified reads generated by Kraken2 and remove the mitochondrial reads by making this file the “ref=” and matching my unclassified reads to it.

				
					bbduk.sh ref=path/to/mito_seq.fasta threads=24 k=31 in=unclassified_raw_reads_1.fq in2=unclassified_raw_reads_2.fq out=good.R1.fq out2=good.R2.fq outm=bad.R1.fq outm2=bad.R2.fq -Xmx900g

‘outm=’ specifies the files where “matched” sequences are stored and all the others are sent to ‘out=’. Because anything that matches with the ref is also, most likely, a mitochondrial sequence, these are the ones I label “bad” and disregard. I move forward with the “good” R1 and R2 files.

Important: as a consequence of this process, you now may have some singletons in your paired-end dataset. You’ll want to remove these and ensure everything is sufficiently “paired” before moving forward with this dataset. A lot of tools exist to facilitate this, such as fastq-pair.

References:

[1] Wood, D. E., Lu, J., & Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. _Genome biology_, _20_, 1-13.