Loading...
 

SAM

SAM format specification


The Sequence Alignment/Map (SAM) format is a generic nucleotide alignment format that describes the alignment of query sequences or sequencing reads to a reference sequence or assembly. Importantly:

  • It is flexible enough to store all the alignment information generated by various alignment programs;
  • It is simple enough to be easily generated by alignment programs or converted from existing alignment formats;
  • It is compact in file size;
  • It allows most of the operations on the alignment to work on a stream without loading the whole alignment into memory;
  • It allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.


SAM is a tab-delimited text format. SAM is a bit slow to parse; so there is a binary equivalent to SAM, called BAM.

SAM allows optional fields to be stored. In SAM, each alignment must contain a fixed number of mandatory fields that describe the key information about the alignment (such as coordinate detailed alignment and sequences) and may contain a variable number of optional fields which are less important or aligner specific.

Using SAM to store various types of alignments


SAM is able to store clipped alignments, spliced alignments, multi-part alignments, padded alignments and alignments in colour space. The extended CIGAR string is the key to describing these types of alignments.

Clipped alignment


In Smith-Waterman alignment, a sequence may not be aligned from the first residue to the last one. Subsequences at the ends may be clipped off. We introduce operation 'S' to describe clipped alignment. Suppose the clipped alignment is:

clipped_alignment
REF: AGCTAGCATCGTGTCGCCCGTCTAGCATACGCATGATCGACTGTCAGCTAGTCAGACTAGTCGATCGATGTG
READ:          gggGTGTAACC-GACTAGgggg


where on the read sequence, bases in uppercase are matches and bases in lowercase are clipped off. The CIGAR for this alignment is: 3S8M1D6M4S (which I interpret as 3 soft, 8 match, 1 deletion, 6 match and 4 soft).

Spliced alignment


In cDNA-to-genome alignment, we may want to distinguish introns from deletions in exons. We introduce openation 'N' to represent long skip on the reference sequence. Suppose the spliced alignment is:

spliced_alignment
REF: AGCTAGCATCGTGTCGCCCGTCTAGCATACGCATGATCGACTGTCAGCTAGTCAGACTAGTCGATCGATGTG
READ:          GTGTAACCC................................TCAGAATA


where '...' on the read sequence indicates intron. The CIGAR for this alignment is : 9M32N8M.

Multi-part alignment


One query sequence may be aligned to multiple places on the reference genome, either with or without overlaps. In SAM, we keep multiple hits as multiple alignment records. To avoid presenting the full query sequence multiple times for non-overlapping hits, we introduce operation 'H' to describe hard clipped alignment. Hard clipping (H) is similar to soft clipping (S). They are different in that hard clipped subsequence is not present in the alignment record. The example alignment in "clipped alignment" can also be represented with CIGAR: 3H8M1D6M4H, but in this case, the sequence stored in SAM is "GTGTAACCGACTAG", instead of "GGGGTGTAACCGACTAGGGGG" if soft clipping is in use.

Padded alignment


Most sequence aligners only give the sequences inserted to the reference genome, but do not present how these inserted sequences are aligned against each other. Alignment with inserted sequences fully aligned is called padded alignment. Padded alignment is always produced by de novo assemblers and is important for an alignment viewer to display the alignment properly. To store padded alignment, we introduce operation 'P' which can be considered as a silent deletion from padded reference sequence. In the following example, GA on READ1 and A on READ2 are inserted to the reference. With unpadded CIGAR, we would not be able to distinguish the following padded multi-alignments:

padded_alignment
REF:  CACGATCA**GACCGATACGTCCGA
READ1:  CGATCAGAGACCGATA
READ2:    ATCA*AGACCGATAC
READ3:   GATCA**GACCG

The padded CIGAR are different:
READ1: 6M2I8M
READ2: 4M1P1I9M
READ3: 5M2P5M


Note that it is hard to convert unpadded CIGAR to padded one. Fully resolving the alignment between inserted sequences would essentially require a de novo assembler. However, it is easy vice versa. By simply removing all P operations we get the CIGAR without padding.

Alignments in colour space Colour alignments are stored as normal nucleotide alignments with additional tags describing the raw colour sequences, qualities and colour-specific properties.

Working on an example


Downloaded from the 1000 genome project ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA06984/alignment/NA06984.chrom16.ILLUMINA.bwa.CEU.low_coverage.20100517.bam

samtools view NA06984.chrom16.ILLUMINA.bwa.CEU.low_coverage.20100517.bam | head -1

SRR035022.2621862 163 16 59999 37 22S54M = 60102 179 CCAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCGACCCTCACCCTCACCC >AAA=>?AA>@@B@B?AABAB?AABAB?AAC@B?@AB@A?A>A@A?AAAAB??ABAB?79A?AAB;B?@?@<=8:8 XT:A:M XN:i:2 SM:i:37 AM:i:37 XM:i:0 XO:i:0 XG:i:0 RG:Z:SRR035022 NM:i:2 MD:Z:0N0N52 OQ:Z:CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCBCCCCCCBBCC@CCCCCCCCCCACCCCC;CCCBBC?CCCACCACA@

The alignment section consists of multiple TAB-delimited lines with each line describing an alignment. Each line is:

sequence_string.sam
<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> [<TAG>:<VTYPE>:<VALUE> [...]]


Image

1. QNAME and FLAG are required for all alignments. If the mapping position of the query is not available, RNAME and CIGAR are set as “*”, and POS and MAPQ as 0. If the query is unpaired or pairing information is not available, MRNM equals “*”, and MPOS and ISIZE equal 0. SEQ and QUAL can both be absent, represented as a star “*”. If QUAL is not a star, it must be of the same length as SEQ.
2. The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in different alignment records, representing multiple or split hits. The maximum string length is 254.
3. If SQ is present in the header, RNAME and MRNM must appear in an SQ header record.
4. Field MAPQ considers pairing in calculation if the read is paired. Providing MAPQ is recommended. If such a calculation is difficult, 255 should be applied, indicating the mapping quality is not available.
5. If the two reads in a pair are mapped to the same reference, ISIZE equals the difference between the coordinate of the 5ʼ-end of the mate and of the 5ʼ-end of the current read; otherwise ISIZE equals 0 (by the “5ʼ-end” we mean the 5ʼ-end of the original read, so for Illumina short-insert paired end reads this calculates the difference in mapping coordinates of the outer edges of the original sequenced fragment). ISIZE is negative if the mate is mapped to a smaller coordinate than the current read.
6. Color alignments are stored as normal nucleotide alignments with additional tags describing the raw color sequences, qualities and color-specific properties (see also Note 5 in section 2.2.4).
7. All mapped reads are represented on the forward genomic strand. The bases are reverse complemented from the unmapped read sequence and the quality scores and cigar strings are recorded consistently with the bases. This applies to information in the mate tags (R2, Q2, S2, etc.) and any other tags that are strand sensitive. The strand bits in the flag simply indicates whether this reverse complement transform was applied from the original read sequence to obtain the bases listed in the SAM file.

So in our example:

QNAMESRR035022.2621862
FLAG163
RNAME16
POS59999
MAQ37
CIGAR22S54M
MRNM=
MPOS60102
ISIZE179
SEQCCAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCGACCCTCACCCTCACCC
QUAL>AAA=>?AA>@@B@B?AABAB?AABAB?AAC@B?@AB@A?A>A@A?AAAAB??ABAB?79A?AAB;B?@?@<=8:8
TAGXT:A:M
TAGXN:i:2
TAGSM:i:37
TAGAM:i:37
TAGXM:i:0
TAGXO:i:0
TAGXG:i:0
TAGRG:Z:SRR035022
TAGNM:i:2
TAGMD:Z:0N0N52
TAGOQ:Z:CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCBCCCCCCBBCC@CCCCCCCCCCACCCCC;CCCBBC?CCCACCACA


The QNAME is the query name. For the FLAG of 163 we transform this into a binary string: 10100011. So accordingly to the flag table:

Image

1the read is paired in sequencing, no matter whether it is mapped in a pair
1the read is mapped in a proper pair
0not unmapped
0mate is not unmapped
0forward strand
1mate strand is negative
0the read is not the first read in a pair
1the read is the second read in a pair


The RNAME is chromosome 16; the POS is the 59_999; the MAQ is 37.

And now the CIGAR format:

A CIGAR string is comprised of a series of operation lengths plus the operations. The conventional CIGAR format allows for three types of operations: M for match or mismatch, I for insertion and D for deletion. The extended CIGAR format futher allows four more operations, as is shown in the following table, to describe clipping, padding and splicing:

opDescription
MAlignment match (can be a sequence match or mismatch
IInsertion to the reference
DDeletion from the reference
NSkipped region from the reference
SSoft clip on the read (clipped sequence present in <seq>)
HHard clip on the read (clipped sequence NOT present in <seq>)
PPadding (silent deletion from the padded reference sequence)


Optional fields are in the format: <TAG>:<VTYPE>:<VALUE>. Each tag is encoded in two alphanumeric characters and appears only once for an alignment. The <VTYPE> follows Perl's rule (see perldoc -f pack). Valid types in SAM are:

TypeDescription
APrintable character
iSigned 32-bin interger
fSingle-precision float number
ZPrintable string
HHex string (high nybble first)


There are a bunch of predefined tags, please see the SAM manual for more information. For the tags used in this example:

Any tags that start with X? are reserved fields for end users: XT:A:M, XN:i:2, XM:i:0, XO:i:0, XG:i:0

SM:i:37 - Mapping quality if the read is mapped as a single read rather than as a read pair
AM:i:37 - Smaller single-end mapping quality of the two reads in a pair
RG:Z:SRR035022 - Read group. Value matches the header RG-ID tag if @RG is present in the header
NM:i:2 - Number of nucleotide differences (i.e. edit distance to the reference sequence)
MD:Z:0N0N52 - String for mismatching positions in the format of [0-9]+(([ACGTN]|\^[ACGTN]+)[0-9]+)*
OQ:Z:CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCBCCCCCCBBCC@CCCCCCCCCCACCCCC;CCCBBC?CCCACCACA@ - Original base quality. Encoded in the same wasy as QUAL

Image

Parsing SAM files using Perl


http://search.cpan.org/~lds/Bio-SamTools/lib/Bio/DB/Sam.pm

SAM Format Specification


http://samtools.sourceforge.net/SAM1.pdf