BAM (Binary Alignment/Map) is the standard format for storing aligned sequencing reads, typically generated from next-generation sequencing (NGS) experiments. A BAM file doesn't directly contain contigs (continuous stretches of assembled DNA sequence), but the reads it holds are the input for creating contigs through genome assembly. This guide explains how to obtain contigs from your BAM file and which tools are involved at each step.
Understanding the Process: From Reads to Contigs
The journey from a BAM file to contigs involves several key stages:
- BAM File: This contains aligned sequencing reads, showing where each read maps onto a reference genome (if available) or representing short sequence fragments.
- Read Extraction: First, you'll need to extract the raw reads from your aligned BAM file. These reads are the building blocks for contig assembly.
- De Novo Assembly (or Mapping-based Assembly): This crucial step uses algorithms to piece together overlapping reads into longer, continuous sequences, the contigs. There are two main approaches:
  - De novo assembly: This approach doesn't rely on a reference genome. It's used when assembling a completely new genome or when a reference genome is unavailable or highly divergent. Tools like SPAdes, Unicycler, and Flye are commonly employed.
  - Mapping-based assembly: This approach utilizes a reference genome to guide the assembly process. Reads are mapped to the reference, and gaps or discrepancies are filled based on the alignment information. This is often faster and produces more accurate results if a closely related reference is available.
- Contigs Output: The final output of the assembly process is a FASTA file containing the assembled contigs. Each contig represents a continuous stretch of assembled sequence.
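The stages above can be sketched as a single shell function for the de novo route. This is a minimal sketch rather than a definitive pipeline: the function name `bam_to_contigs`, the output file names, and the thread count are illustrative, and it assumes the BAM holds paired-end short reads and that `samtools` and `spades.py` are on the PATH.

```shell
# bam_to_contigs is a hypothetical wrapper; names and paths are illustrative.
# Assumes paired-end short reads and that samtools and spades.py are installed.
bam_to_contigs() {
    bam=${1:-}
    outdir=${2:-}

    # Fail early on missing arguments, input, or tools.
    if [ -z "$bam" ] || [ -z "$outdir" ]; then
        echo "usage: bam_to_contigs input.bam output_directory" >&2
        return 1
    fi
    if [ ! -f "$bam" ]; then
        echo "error: BAM file not found: $bam" >&2
        return 1
    fi
    for tool in samtools spades.py; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "error: $tool not found on PATH" >&2
            return 1
        fi
    done

    mkdir -p "$outdir" || return 1

    # Read extraction: group mates by read name, then write R1/R2 FASTQ files.
    # Reads whose mate is missing go to singletons.fastq.
    samtools collate -u -O "$bam" |
        samtools fastq -1 "$outdir/reads_1.fastq" -2 "$outdir/reads_2.fastq" \
            -0 /dev/null -s "$outdir/singletons.fastq" -n || return 1

    # De novo assembly: SPAdes writes contigs.fasta into its output directory.
    spades.py -1 "$outdir/reads_1.fastq" -2 "$outdir/reads_2.fastq" \
        -t 8 -o "$outdir/spades" || return 1

    # Contigs output: report where the FASTA file ended up.
    echo "$outdir/spades/contigs.fasta"
}
```

Calling `bam_to_contigs input.bam results` would print the path of the contig FASTA on success. Collating by read name before `samtools fastq` matters because coordinate-sorted BAM files do not keep mates next to each other.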
Tools and Workflow for Contig Generation
The exact process depends on whether you're performing de novo or mapping-based assembly. Here's a breakdown using common tools:
1. Extracting Reads from BAM (SAMtools)
Regardless of the assembly method, you'll likely start by extracting reads from the BAM file using samtools. This command-line tool is essential for manipulating BAM and SAM (Sequence Alignment/Map) files.
samtools view -h input.bam > input.sam  # Convert BAM to SAM (optional, only needed for inspection)
samtools fastq -f 4 input.bam > unmapped_reads.fastq  # Extract unmapped reads in FASTQ format
samtools fastq -F 4 input.bam > mapped_reads.fastq  # Extract mapped reads in FASTQ format
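This extracts reads in FASTQ format, suitable as input for assemblers. Adjust the flags (-f and -F) depending on whether you want mapped or unmapped reads: -f keeps only reads that have all of the given flag bits set, and -F discards reads that have any of them set. Bit 0x4 (decimal 4) marks an unmapped read, whereas 0x40 marks the first read of a pair.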
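To make the flag values less opaque, here is a small sketch that decodes a decimal SAM FLAG into the bit names defined by the SAM specification. The function `describe_flag` and its label spellings are hypothetical helpers written for this guide; samtools ships its own `samtools flags` subcommand for the same purpose.

```shell
# describe_flag is a hypothetical helper, not part of samtools.
# It prints the names of the SAM FLAG bits that are set in a decimal value.
describe_flag() {
    flag=$1
    out=""
    i=0
    # Names are listed in bit order: 0x1, 0x2, 0x4, ... 0x800.
    for name in paired proper_pair unmapped mate_unmapped reverse mate_reverse \
                first_in_pair second_in_pair secondary qc_fail duplicate supplementary; do
        if [ $(( ($flag >> $i) & 1 )) -eq 1 ]; then
            out="$out $name"
        fi
        i=$(( $i + 1 ))
    done
    # Strip the leading space before printing.
    echo "${out# }"
}

describe_flag 77
describe_flag 99
```

A read with flag 77 is paired, unmapped, has an unmapped mate, and is first in its pair, so it would pass `-f 4` and be dropped by `-F 4`.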
2. De Novo Assembly (SPAdes Example)
SPAdes is a popular de novo assembler. The basic workflow is:
spades.py -k 21,33,55,77 --careful -t 8 -o output_directory -s input.fastq
- -k: Specifies k-mer sizes to use (experiment with different values).
- --careful: Enables a more computationally intensive but often more accurate assembly.
- -t: Sets the number of threads to use.
- -o: Specifies the output directory.
- input.fastq: Your input FASTQ file (from samtools).
The assembled contigs are written in FASTA format to contigs.fasta inside output_directory.
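Once the assembly finishes, a quick summary of the contig FASTA helps judge whether it looks reasonable. The sketch below uses only `awk` and `sort` to report the number of contigs, total length, longest contig, and N50 (the contig length at which the running total, counted from the longest contig down, reaches half of the assembly). The function name `contig_stats` and the output layout are illustrative choices, not part of SPAdes.

```shell
# contig_stats is a hypothetical helper; it takes the path to a FASTA file.
contig_stats() {
    # Stage 1: print one length per contig (sequences may span several lines).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    # Stage 2: sort lengths from longest to shortest.
    sort -rn |
    # Stage 3: accumulate totals and find the N50.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             if (NR == 0) { print "contigs=0"; exit }
             run = 0
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run * 2 >= total) { n50 = lens[i]; break }
             }
             printf "contigs=%d total=%d longest=%d N50=%d\n", NR, total, lens[1], n50
         }'
}
```

For example, `contig_stats output_directory/contigs.fasta` prints a single summary line. A very large number of short contigs usually suggests low coverage or unsuitable k-mer sizes.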
3. Mapping-based Assembly (Example with BWA and Minimap2)
Mapping-based assembly requires a reference genome. Here's a simplified outline using BWA (Burrows-Wheeler Aligner) or Minimap2 as the aligner:
- Alignment: Align reads to the reference genome using BWA mem or Minimap2.
- Variant Calling (optional): Tools like GATK can identify variations between the reads and the reference.
- Assembly (optional): For filling gaps or resolving discrepancies, tools like Pilon can be used.
This approach is more complex and involves multiple steps. Consult the documentation of your chosen tools for detailed instructions.
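As a rough illustration of the alignment and polishing steps, the sketch below chains `bwa`, `samtools`, and Pilon. It is a sketch under stated assumptions, not a recommended pipeline: the function name `map_and_polish`, the file names, and the thread count are made up for this example, the reads are assumed to be paired-end FASTQ files, and Pilon is assumed to be installed as a `pilon` wrapper script (otherwise it is typically run as a Java jar). The optional variant-calling step is left out.

```shell
# map_and_polish is a hypothetical wrapper; names and paths are illustrative.
# Assumes paired-end reads and that bwa, samtools, and pilon are on the PATH.
map_and_polish() {
    ref=${1:-}
    r1=${2:-}
    r2=${3:-}
    outdir=${4:-}

    if [ -z "$ref" ] || [ -z "$r1" ] || [ -z "$r2" ] || [ -z "$outdir" ]; then
        echo "usage: map_and_polish reference.fasta reads_1.fastq reads_2.fastq output_directory" >&2
        return 1
    fi
    for f in "$ref" "$r1" "$r2"; do
        if [ ! -f "$f" ]; then
            echo "error: file not found: $f" >&2
            return 1
        fi
    done
    for tool in bwa samtools pilon; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "error: $tool not found on PATH" >&2
            return 1
        fi
    done

    mkdir -p "$outdir" || return 1

    # Alignment: index the reference (index files are written next to it),
    # then align the reads and sort the alignments by coordinate.
    bwa index "$ref" || return 1
    bwa mem -t 8 "$ref" "$r1" "$r2" |
        samtools sort -o "$outdir/aligned.sorted.bam" - || return 1
    samtools index "$outdir/aligned.sorted.bam" || return 1

    # Polishing: Pilon corrects the reference sequence using the alignments
    # and writes <prefix>.fasta into the output directory.
    pilon --genome "$ref" --frags "$outdir/aligned.sorted.bam" \
        --output polished --outdir "$outdir" || return 1

    echo "$outdir/polished.fasta"
}
```

The result is a reference-guided consensus rather than an independent assembly, so regions missing from the reference will not appear in it.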
Choosing the Right Approach
The best approach (de novo or mapping-based) depends on your specific needs and resources:
- De novo assembly: Use when you have no reference genome or the reference is highly divergent. It's more computationally intensive.
- Mapping-based assembly: Use when you have a closely related reference genome. It's generally faster and can produce more accurate results, especially for smaller genomes.
This guide provides a starting point for obtaining contigs from BAM files. Remember to consult the documentation for each tool used and adapt the workflows to suit your specific data and experimental design. The specific commands and parameters might need adjustments based on your dataset size and computational resources. Always check the output files to ensure the assembly has completed successfully.