Table of contents
SAM format specification
The Sequence Alignment/Map (SAM) format is a generic nucleotide alignment format that describes the alignment of query sequences or sequencing reads to a reference sequence or assembly. Importantly:
- It is flexible enough to store all the alignment information generated by various alignment programs;
- It is simple enough to be easily generated by alignment programs or converted from existing alignment formats;
- It is compact in file size;
- It allows most of the operations on the alignment to work on a stream without loading the whole alignment into memory;
- It allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.
SAM is a tab-delimited text format. SAM is a bit slow to parse; so there is a binary equivalent to SAM, called BAM.
SAM allows optional fields to be stored. In SAM, each alignment must contain a fixed number of mandatory fields that describe the key information about the alignment (such as coordinate detailed alignment and sequences) and may contain a variable number of optional fields which are less important or aligner specific.
Using SAM to store various types of alignments
SAM is able to store clipped alignments, spliced alignments, multi-part alignments, padded alignments and alignments in colour space. The extended CIGAR string is the key to describing these types of alignments.
In Smith-Waterman alignment, a sequence may not be aligned from the first residue to the last one. Subsequences at the ends may be clipped off. We introduce operation 'S' to describe clipped alignment. Suppose the clipped alignment is:
REF: AGCTAGCATCGTGTCGCCCGTCTAGCATACGCATGATCGACTGTCAGCTAGTCAGACTAGTCGATCGATGTG READ: gggGTGTAACC-GACTAGgggg
where on the read sequence, bases in uppercase are matches and bases in lowercase are clipped off. The CIGAR for this alignment is: 3S8M1D6M4S (which I interpret as 3 soft, 8 match, 1 deletion, 6 match and 4 soft).
In cDNA-to-genome alignment, we may want to distinguish introns from deletions in exons. We introduce openation 'N' to represent long skip on the reference sequence. Suppose the spliced alignment is:
REF: AGCTAGCATCGTGTCGCCCGTCTAGCATACGCATGATCGACTGTCAGCTAGTCAGACTAGTCGATCGATGTG READ: GTGTAACCC................................TCAGAATA
where '...' on the read sequence indicates intron. The CIGAR for this alignment is : 9M32N8M.
One query sequence may be aligned to multiple places on the reference genome, either with or without overlaps. In SAM, we keep multiple hits as multiple alignment records. To avoid presenting the full query sequence multiple times for non-overlapping hits, we introduce operation 'H' to describe hard clipped alignment. Hard clipping (H) is similar to soft clipping (S). They are different in that hard clipped subsequence is not present in the alignment record. The example alignment in "clipped alignment" can also be represented with CIGAR: 3H8M1D6M4H, but in this case, the sequence stored in SAM is "GTGTAACCGACTAG", instead of "GGGGTGTAACCGACTAGGGGG" if soft clipping is in use.
Most sequence aligners only give the sequences inserted to the reference genome, but do not present how these inserted sequences are aligned against each other. Alignment with inserted sequences fully aligned is called padded alignment. Padded alignment is always produced by de novo assemblers and is important for an alignment viewer to display the alignment properly. To store padded alignment, we introduce operation 'P' which can be considered as a silent deletion from padded reference sequence. In the following example, GA on READ1 and A on READ2 are inserted to the reference. With unpadded CIGAR, we would not be able to distinguish the following padded multi-alignments:
REF: CACGATCA**GACCGATACGTCCGA READ1: CGATCAGAGACCGATA READ2: ATCA*AGACCGATAC READ3: GATCA**GACCG The padded CIGAR are different: READ1: 6M2I8M READ2: 4M1P1I9M READ3: 5M2P5M
Note that it is hard to convert unpadded CIGAR to padded one. Fully resolving the alignment between inserted sequences would essentially require a de novo assembler. However, it is easy vice versa. By simply removing all P operations we get the CIGAR without padding.
Alignments in colour space Colour alignments are stored as normal nucleotide alignments with additional tags describing the raw colour sequences, qualities and colour-specific properties.
Working on an example
Downloaded from the 1000 genome project ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA06984/alignment/NA06984.chrom16.ILLUMINA.bwa.CEU.low_coverage.20100517.bam
samtools view NA06984.chrom16.ILLUMINA.bwa.CEU.low_coverage.20100517.bam | head -1
SRR035022.2621862 163 16 59999 37 22S54M = 60102 179 CCAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCGACCCTCACCCTCACCC >AAA=>?AA>@@B@B?AABAB?AABAB?AAC@B?@AB@A?A>A@A?AAAAB??ABAB?79A?AAB;B?@?@<=8:8 XT:A:M XN:i:2 SM:i:37 AM:i:37 XM:i:0 XO:i:0 XG:i:0 RG:Z:SRR035022 NM:i:2 MD:Z:0N0N52 OQ:Z:CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCBCCCCCCBBCC@CCCCCCCCCCACCCCC;CCCBBC?CCCACCACA@
The alignment section consists of multiple TAB-delimited lines with each line describing an alignment. Each line is:
<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> [<TAG>:<VTYPE>:<VALUE> [...]]
1. QNAME and FLAG are required for all alignments. If the mapping position of the query is not available, RNAME and CIGAR are set as Ã¢â‚¬Å“*Ã¢â‚¬Â, and POS and MAPQ as 0. If the query is unpaired or pairing information is not available, MRNM equals Ã¢â‚¬Å“*Ã¢â‚¬Â, and MPOS and ISIZE equal 0. SEQ and QUAL can both be absent, represented as a star Ã¢â‚¬Å“*Ã¢â‚¬Â. If QUAL is not a star, it must be of the same length as SEQ.
2. The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in different alignment records, representing multiple or split hits. The maximum string length is 254.
3. If SQ is present in the header, RNAME and MRNM must appear in an SQ header record.
4. Field MAPQ considers pairing in calculation if the read is paired. Providing MAPQ is recommended. If such a calculation is difficult, 255 should be applied, indicating the mapping quality is not available.
5. If the two reads in a pair are mapped to the same reference, ISIZE equals the difference between the coordinate of the 5ÃŠÂ¼-end of the mate and of the 5ÃŠÂ¼-end of the current read; otherwise ISIZE equals 0 (by the Ã¢â‚¬Å“5ÃŠÂ¼-endÃ¢â‚¬Â we mean the 5ÃŠÂ¼-end of the original read, so for Illumina short-insert paired end reads this calculates the difference in mapping coordinates of the outer edges of the original sequenced fragment). ISIZE is negative if the mate is mapped to a smaller coordinate than the current read.
6. Color alignments are stored as normal nucleotide alignments with additional tags describing the raw color sequences, qualities and color-specific properties (see also Note 5 in section 2.2.4).
7. All mapped reads are represented on the forward genomic strand. The bases are reverse complemented from the unmapped read sequence and the quality scores and cigar strings are recorded consistently with the bases. This applies to information in the mate tags (R2, Q2, S2, etc.) and any other tags that are strand sensitive. The strand bits in the flag simply indicates whether this reverse complement transform was applied from the original read sequence to obtain the bases listed in the SAM file.
So in our example:
The QNAME is the query name. For the FLAG of 163 we transform this into a binary string: 10100011. So accordingly to the flag table:
|1||the read is paired in sequencing, no matter whether it is mapped in a pair|
|1||the read is mapped in a proper pair|
|0||mate is not unmapped|
|1||mate strand is negative|
|0||the read is not the first read in a pair|
|1||the read is the second read in a pair|
The RNAME is chromosome 16; the POS is the 59_999; the MAQ is 37.
And now the CIGAR format:
A CIGAR string is comprised of a series of operation lengths plus the operations. The conventional CIGAR format allows for three types of operations: M for match or mismatch, I for insertion and D for deletion. The extended CIGAR format futher allows four more operations, as is shown in the following table, to describe clipping, padding and splicing:
|M||Alignment match (can be a sequence match or mismatch|
|I||Insertion to the reference|
|D||Deletion from the reference|
|N||Skipped region from the reference|
|S||Soft clip on the read (clipped sequence present in <seq>)|
|H||Hard clip on the read (clipped sequence NOT present in <seq>)|
|P||Padding (silent deletion from the padded reference sequence)|
Optional fields are in the format: <TAG>:<VTYPE>:<VALUE>. Each tag is encoded in two alphanumeric characters and appears only once for an alignment. The <VTYPE> follows Perl's rule (see perldoc -f pack). Valid types in SAM are:
|i||Signed 32-bin interger|
|f||Single-precision float number|
|H||Hex string (high nybble first)|
There are a bunch of predefined tags, please see the SAM manual for more information. For the tags used in this example:
Any tags that start with X? are reserved fields for end users: XT:A:M, XN:i:2, XM:i:0, XO:i:0, XG:i:0
SM:i:37 - Mapping quality if the read is mapped as a single read rather than as a read pair
AM:i:37 - Smaller single-end mapping quality of the two reads in a pair
RG:Z:SRR035022 - Read group. Value matches the header RG-ID tag if @RG is present in the header
NM:i:2 - Number of nucleotide differences (i.e. edit distance to the reference sequence)
MD:Z:0N0N52 - String for mismatching positions in the format of [0-9]+(([ACGTN]|\^[ACGTN]+)[0-9]+)*
OQ:Z:CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCBCCCCCCBBCC@CCCCCCCCCCACCCCC;CCCBBC?CCCACCACA@ - Original base quality. Encoded in the same wasy as QUAL
Parsing SAM files using Perl