The UCSC Genome Browser provides multiple alignments of 46 vertebrate species and conveniently provides them for download. The multiple alignments show regions of sequence conservation among vertebrates. For more information see http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=cons46way. The multiple alignments are stored as Multiple Alignment Files and there are Perl and Python packages that parse them. The MAF format is quite straight forward and here I convert the MAF files into a BED file for use with intersectBed from the BEDTools suite.
multiz46way
#!/bin/bash wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr1.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr2.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr3.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr4.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr5.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr6.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr7.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr8.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr9.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr10.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr11.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr12.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr13.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr14.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr15.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr16.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr17.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr18.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr19.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr20.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr21.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr22.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chrM.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chrX.maf.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chrY.maf.gz
Convert the MAF files into BED:
#!/usr/bin/perl use strict; use warnings; =head ##maf version=1 scoring=autoMZ.v1 a score=1189.000000 s hg19.chr22 16050000 8 + 51304566 GATCTGAT s panTro2.chrUn 7680399 8 + 58616431 gatctgat q panTro2.chrUn 99999999 i panTro2.chrUn N 0 C 0 s choHof1.scaffold_39500 681 6 + 15110 --TCAGGG q choHof1.scaffold_39500 --999999 i choHof1.scaffold_39500 N 0 N 0 s dasNov2.scaffold_31225 17898 6 - 18719 --TCAGGG q dasNov2.scaffold_31225 --999999 i dasNov2.scaffold_31225 N 0 N 0 =cut my $usage = "Usage: $0 <infile.maf>\n"; my $infile = shift or die $usage; if ($infile =~ /\.gz$/){ open(IN,'-|',"zcat $infile") || die "Could not open $infile: $!\n"; } else { open(IN,'<',$infile) || die "Could not open $infile: $!\n"; } my @alignment = (); my $score = ''; while(<IN>){ chomp; next if /^#/; if (/^a/){ if (/score=(-*\d+)\./){ $score = $1; } else { die "Could not parse score on line $.: $_\n"; } } elsif (/^s/){ push(@alignment,$_); } elsif (/^$/){ #first should always be hg19 my $homologous_region = scalar(@alignment); my ($junk,$src,$start,$size,$strand,$srcSize,$text) = split(/\s+/,$alignment[0]); my ($genome,$chr) = split(/\./,$src); if ($strand eq '+'){ my $id = join("_",$chr,$start,$start+$size,$homologous_region); print join("\t",$chr,$start,$start+$size,$id,$score,$strand),"\n"; } elsif ($strand eq '-'){ my $id = join("_",$chr,$start-$size,$start,$homologous_region); print join("\t",$chr,$start-$size,$start,$id,$score,$strand),"\n"; } else { die "Unrecognised strand: $strand on line $alignment[0]\n"; } #for (my $i=0; $i<scalar(@alignment); ++$i){ #} @alignment = (); $score = ''; } } close(IN); exit(0); __END__
BASH scripting to generate all BED files:
for file in `ls *.gz`; do base=`basename $file .maf.gz`; echo $base; maf.pl $file > ${base}.bed; done
By sorting on the score column of chr1.bed, we can find the most conserved region on chromosome 1 of hg19 (in the red rectangle), which happens to be the 3' tail of the LRRC8D gene:
chr1 90400115 90401012 chr1_90400115_90401012_45 61991922 +
phyloP46way and phastCons46way
There are also two methods that show measurements of sequence conservation among the 46 vertebrate genomes. Quoting the UCSC Genome Browser:
PhastCons (which has been used in previous Conservation tracks) is a hidden Markov model-based method that estimates the probability that each nucleotide belongs to a conserved element, based on the multiple alignment. It considers not just each individual alignment column, but also its flanking columns. By contrast, phyloP separately measures conservation at individual columns, ignoring the effects of their neighbours. As a consequence, the phyloP plots have a less smooth appearance than the phastCons plots, with more "texture" at individual sites. The two methods have different strengths and weaknesses. PhastCons is sensitive to "runs" of conserved sites, and is therefore effective for picking out conserved elements. PhyloP, on the other hand, is more appropriate for evaluating signatures of selection at particular nucleotides or classes of nucleotides (e.g., third codon positions, or first positions of miRNA target sites).
If I turn on the PhastCons and phyloP scores on the UCSC Genome Browser for the region I highlighted above:
Concordant agreement (concordance according to my eye) between the two methods showing a long stretch of sequence conservation in the tail of the LRRC8D gene.
These scores are available as fixedStep wiggle plots.
Download the phyloP46way files:
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr1.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr2.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr3.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr4.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr5.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr6.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr7.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr8.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr9.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr10.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr11.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr12.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr13.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr14.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr15.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr16.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr17.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr18.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr19.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr20.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr21.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr22.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chrM.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chrX.phyloP46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chrY.phyloP46way.wigFix.gz
Download the phastCons46way files:
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr1.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr2.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr3.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr4.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr5.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr6.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr7.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr8.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr9.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr10.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr11.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr12.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr13.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr14.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr15.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr16.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr17.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr18.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr19.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr20.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr21.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr22.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chrM.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chrX.phastCons46way.wigFix.gz wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chrY.phastCons46way.wigFix.gz
To create a more readable format of these files (but using up more space), download the wigToBigWig and the bigWigToBedGraph files from the UCSC Genome Browser's list of executables and convert them to the bedGraph format.
#convert to bigWig for file in `ls *.gz`; do base=`basename $file .wigFix.gz`; echo $file; wigToBigWig $file ~/davetang/chrom_size/hg19_chrom_info.txt ${base}.bw; done #convert to bedGraph for file in `ls *.bw`; do base=`basename $file .bw`; echo $file; bigWigToBedGraph $file $base.bedGraph; done rm *.bw *.wigFix.gz
Finding conserved intergenic regions
To find sequence conservation within intergenic regions, firstly calculate the intergenic regions. Let's look at chr22:
#the gencode_v17_intergenic.bed file was created following the link above cat gencode_v17_intergenic.bed | grep chr22 > gencode_v17_intergenic_chr22.bed cat gencode_v17_intergenic_chr22.bed | wc 746 2238 17897 #create 75 bp non-overlapping windows bedtools makewindows -b gencode_v17_intergenic_chr22.bed -w 75 > gencode_v17_intergenic_chr22_75_window.bed cat gencode_v17_intergenic_chr22_75_window.bed | wc 381936 1145808 8870165 #intersect the windows to the bedGraph file we created above bedtools intersect -a gencode_v17_intergenic_chr22_75_window.bed -b ~/ucsc/phastCons46way/vertebrate/chr22.phastCons46way.bedGraph -wo > gencode_v17_intergenic_chr22_75_window_phastcons46way.tsv head gencode_v17_intergenic_chr22_75_window_phastcons46way.tsv chr22 16050000 16050075 chr22 16050000 16050001 0.093 1 chr22 16050000 16050075 chr22 16050001 16050002 0.086 1 chr22 16050000 16050075 chr22 16050002 16050003 0.079 1 chr22 16050000 16050075 chr22 16050003 16050004 0.068 1 chr22 16050000 16050075 chr22 16050004 16050005 0.05 1 chr22 16050000 16050075 chr22 16050005 16050006 0.045 1 chr22 16050000 16050075 chr22 16050006 16050007 0.035 1 chr22 16050000 16050075 chr22 16050007 16050008 0.032 1 chr22 16050000 16050075 chr22 16050008 16050009 0.034 1 chr22 16050000 16050075 chr22 16050009 16050011 0.035 2 #sum the phastCons scores per window cat gencode_v17_intergenic_chr22_75_window_phastcons46way.tsv | groupBy -g 1,2,3 -c 7 -o sum | head chr22 16050000 16050075 4.61099999999999976552 chr22 16050075 16050150 10.1810000000000009379 chr22 16050150 16050225 6.27699999999999658229 chr22 16050225 16050300 5.73000000000000042633 chr22 16050300 16050375 0.675000000000000155431 chr22 16050375 16050450 0.744000000000000327738 chr22 16050450 16050525 0.287000000000000088374 chr22 16050525 16050600 0.620000000000000106581 chr22 16050600 16050675 0.186000000000000054179 chr22 16050675 16050750 0.543000000000000038192 #find the most conserved block cat gencode_v17_intergenic_chr22_75_window_phastcons46way.tsv | groupBy -g 1,2,3 -c 7 -o sum | sort -k4rn | head chr22 22557650 22557725 68.5529999999999830607 chr22 48699552 48699627 67.7090000000000031832 chr22 20361407 20361482 67.5860000000000127329 chr22 48648171 48648246 67.550999999999973511 chr22 18705182 18705257 67.5360000000000013642 chr22 49308598 49308673 67.470000000000013074 chr22 22979374 22979449 67.4459999999999837428 chr22 48648246 48648321 67.354000000000013415 chr22 21012303 21012378 67.2660000000000053433 chr22 49308673 49308748 67.1859999999999928377
Visualisation of the most conserved window (chr22:22557650-22557725) on the UCSC Genome Browser.
This work is licensed under a Creative Commons
Attribution 4.0 International License.