Finding sequence conservation

I have written about sequence conservation in vertebrates previously but without much elaboration, hence I'm writing another post on this topic. An assumption of sequence conservation is that regions that show conservation, are under purifying selection, i.e. alleles that decrease the fitness of an organism are removed, and therefore probably do something important. Protein-coding regions are typically well conserved among the genomes of different species, so it's widely accepted that they are useful. Sequences need to be aligned together in order to infer sequence conservation and conveniently, a multiple sequence alignment (MSA) of 46 vertebrate genomes is provided at the UCSC Genome Browser site.

This whole genome MSA was produced using a tool called MULTIZ, which produces MSAs by processing pairwise BLASTZ alignments; a technical note on how this is done is provided on the UCSC Genome Browser Wiki Site. Just last week I bought this BLAST book in the hope of understanding a bit more about BLAST (and then BLASTZ, and finally MULTIZ).

The MSA are provided as Multiple Alignment Format (MAF) files, which are used to store multiple alignments between entire genomes. Let's download the MAF file for chromosome 22 and take a look:

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr22.maf.gz

gunzip -c chr22.maf.gz | head -13
##maf version=1 scoring=autoMZ.v1
a score=1189.000000
s hg19.chr22             16050000 8 + 51304566 GATCTGAT
s panTro2.chrUn           7680399 8 + 58616431 gatctgat
q panTro2.chrUn                                99999999
i panTro2.chrUn          N 0 C 0
s choHof1.scaffold_39500      681 6 +    15110 --TCAGGG
q choHof1.scaffold_39500                       --999999
i choHof1.scaffold_39500 N 0 N 0
s dasNov2.scaffold_31225    17898 6 -    18719 --TCAGGG
q dasNov2.scaffold_31225                       --999999
i dasNov2.scaffold_31225 N 0 N 0

The above paragraph shows one alignment. There are lines that begin with an "a", "s", "q", and "i"; the "a" line denotes the start of a multiple alignment and an empty line denotes the end. The "s" line contains the sequence information, which has 6 fields: the source, start, size, strand, srcSize, and text. The "q" line indicates the quality of each aligned base ranging from 0-9. The "i" line contains information about the aligned species DNA; the five fields of this line indicate the source, leftStatus, leftCount, rightStatus, and rightCount. Check out the MAF specification for more information.

Extracting an alignment from a specific region

To find sequence conversation for a specific region from the MAF files, there's a tool called maf_parse, which is part of the PHAST package, which allows specific regions to be extract (among other things). Let's download and install PHAST (since I use a MacBook Air at work now, I've included the package link for OS X):

#for Ubuntu
wget http://compgen.bscb.cornell.edu/phast/downloads/phast_install.tgz
tar -xzf phast_install.tgz
sudo sh install.sh

#for OS X
wget http://compgen.bscb.cornell.edu/phast/downloads/phast.v1_1.mpkg.zip
unzip phast.v1_1.mpkg.zip
#then using Finder, double-click on phast.mpkg to install

Say we were interested in examining the sequence conservation of the region upstream of transcription start site (TSS) of the SOX10 gene; this region should contain regulatory sequences, which we would like to check for conservation. We would first create a bed file of this region and then use maf_parse with the -g parameter with the MAF file:

#this is 300 bp upstream of the SOX10 TSS
cat sox10_upstream.bed 
chr22	38380547	38380846	sox10_up	0	-

#maf_parse needs to work with an uncompressed MAF file
gunzip chr22.maf.gz

#run maf_parse
#USAGE: maf_parse [OPTIONS] <infile>
maf_parse -g sox10_upstream.bed chr22.maf > sox10_upstream.maf

#add the MAF header
head -1 chr22.maf > header
#I don't have sponge on OS X, so I'm doing this in two steps
cat header sox10_upstream.maf > temp
mv temp sox10_upstream.maf

While the usage of the maf_parse tool specifies that different output formats are available, such as FASTA, if you input anything other than MAF, you will get: "Sorry, only MAF format output has been implemented right now." (which can be seen when you look at the source code). Galaxy provides a tool to convert the MAF file into FASTA; below is a snippet of the conversion to FASTA:

>hg19.chr22(+):38380547-38380601|sequence_index=0|block_index=0|species=hg19|hg19_0_0
ACTGACTGAGCGACTGAGCCTGAGCC----GCTGC-GGAGGGCT-G---gggaaggaaggagg
>calJac1.Contig2893(+):346348-346396|sequence_index=0|block_index=0|species=calJac1|calJac1_0_0
ACTGACTGAGCGA------CTGAGCC----GCTGC-AGAGGGCT-G---gggaaggaaggagg
>papHam1.scaffold4116(+):29385-29433|sequence_index=0|block_index=0|species=papHam1|papHam1_0_0
ACTGACTGAGCGA------CTGAGCC----GCTGC-AGAGGGCT-G---gggaaggaaggagg
>rheMac2.chr10(+):81918124-81918172|sequence_index=0|block_index=0|species=rheMac2|rheMac2_0_0
ACTGACTGAGCGA------CTGAGCC----GCTGC-AGAGGGCT-G---gggaaggaaggagg

This is how the MSA of our region of interest looks on the UCSC Genome Browser (it can be turned on in one of the tracks):

sox10_upstreamWe observe stretches of aligned sequence, i.e. conserved bases, between the human genome with other vertebrate genomes, with the most conservation seen amongst the primates.

phyloP46way and phastCons46way

I've mentioned the phyloP46way and phastCons46way tracks provided by the UCSC Genome Browser site in the previous post. The phyloP track was generated using the phyloP program, which computes conservation or acceleration p-values based on an alignment and a model of neutral evolution. The phastCons track was generated using the phastCons program, which identifies conserved elements or produce conservation scores, given a multiple alignment and a phylo-HMM. The scores provided by the two programs are provided as wiggle files. Let's download the two files for chromosome 22:

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/vertebrate/chr22.phyloP46way.wigFix.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons46way/vertebrate/chr22.phastCons46way.wigFix.gz

#check out the phyloP file
gunzip -c chr22.phyloP46way.wigFix.gz | head -5
fixedStep chrom=chr22 start=16050001 step=1
0.064
0.058
0.466
0.534

#check out the phastCons file
gunzip -c chr22.phastCons46way.wigFix.gz | head -5
fixedStep chrom=chr22 start=16050001 step=1
0.093
0.086
0.079
0.068

In a nutshell, the phyloP method measures conservation at individual columns of the MSA, ignoring the effects of their neighbours, while phastCons considers the flanking columns in a multiple sequence alignment. Below I wrote a Perl script that can extract the scores from the wiggle files, given a specific region.

#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <infile.wig> <chr:start-end>\n";
my $infile = shift or die $usage;
my $region = shift or die $usage;

#get rid of commas from the input region
$region =~ s/,//g;

#declare variables
my $get_chr = '';
my $get_start = 0;
my $get_end = 0;

#check region format
if ($region =~ /^(\w+):(\d+)-(\d+)$/){
   $get_chr = $1;
   $get_start = $2;
   $get_end = $3;
} else {
   die "Could not recognise $region\n";
}

#open according to whether file in gzipped or not
if ($infile =~ /\.gz$/){
   open(IN,'-|',"gunzip -c $infile") || die "Could not open $infile: $!\n";
} else {
   open(IN,'<',$infile) || die "Could not open $infile: $!\n";
}

#declare more variables
my $current_pos = 0;
my ($junk, $chr, $start, $step) = ('','','','');
my $switch = 0;

#loop through file
while(<IN>){
   chomp;
   #parse fixed step line
   #fixedStep chrom=chr22 start=16050001 step=1
   if (/^fixedStep/){
      ($junk, $chr, $start, $step) = split(/\s+/);
      $start =~ s/start=//;
      $step =~ s/step=//;
      $current_pos=$start;
   } else {
      #if current position is equal to our input start
      #turn on the switch
      $current_pos+=$step;
      if ($current_pos == $get_start){
         $switch = 1;
      }
      #turn off the switch once we have reached the input end
      if ($current_pos == $get_end){
         print "$current_pos\t$_\n";
         $switch = 0;
      }
      #print the current line if the switch is on
      if ($switch == 1){
         print "$current_pos\t$_\n";
      }
   }
}
close(IN);

exit(0);

Let's run the above script to extract the scores from the phyloP and phastCons files for our region of interest:

#sanity checks
wig_region.pl chr22.phyloP46way.wigFix.gz chr22:38380547-38380846 | wc -l
     300
wig_region.pl chr22.phastCons46way.wigFix.gz chr22:38380547-38380846 | wc -l
     300

wig_region.pl chr22.phyloP46way.wigFix.gz chr22:38380547-38380846 > chr22_38380547_38380846_phylop.data
wig_region.pl chr22.phastCons46way.wigFix.gz chr22:38380547-38380846 > chr22_38380547_38380846_phastcons.data

Let's plot this in R:

phastcons <- read.table("chr22_38380547_38380846_phastcons.data", header=F)
phylop <- read.table("chr22_38380547_38380846_phylop.data", header=F)

par(mfrow=c(2,1))
plot(phastcons$V2,
     type='l',
     main="phastCons")
plot(phylop$V2,
     type='l',
     main="phyloP")
abline(h=0)

conservationAs mentioned above, the phyloP scores are much more jagged compared to the phastCons scores. They reflect the MSA above, where there is higher sequence conservation directly upstream of the SOX10 gene (remember that the SOX10 gene is on the negative strand) than further away.

Let's check out the sequence conservation for a set of regions that are upstream TSSs. First, we can obtain the TSSs by using biomaRt:

#if you're unfamiliar with biomaRt check out this post
#http://davetang.org/muse/2012/04/27/learning-to-use-biomart/
source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")

library("biomaRt")
ensembl <- useMart("ensembl")
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
chr22_start <- getBM(attributes=c('chromosome_name', 'start_position', 'strand'), filters = 'chromosome_name', values = '22', mart = ensembl)

set.seed(31)
my_sample <- chr22_start[sample(1:nrow(chr22_start), size=50),]
write.table(my_sample, 'random_50_tss.tsv', quote = F, sep = "\t", row.names = F)

Now we need to run the wig_region.pl script for each of these regions. I've written another script that takes the output from R and runs the wig_region.pl script for each region upstream a TSS, but only for regions that are on the positive strand.

#!/usr/bin/perl

use strict;
use warnings;

my $usage = "Usage: $0 <tss.tsv>\n";
my $infile = shift or die $usage;

open(IN,'<',$infile) || die "Could not open $infile: $!\n";

while(<IN>){
   chomp;
   #chromosome_name	start_position	strand
   #22	27706612	1
   next if /^chro/;
   my ($chr, $start, $strand) = split(/\t/);
   my $end = $start;
   next unless $strand == 1;
   $start = $start - 299;
   my $region = 'chr' . $chr . ':' . $start . '-' . $end;

   #wig_region.pl chr22.phyloP46way.wigFix.gz chr22:38380547-38380846 > chr22_38380547_38380846_phylop.data
   #wig_region.pl chr22.phastCons46way.wigFix.gz chr22:38380547-38380846 > chr22_38380547_38380846_phastcons.data

   my $phylop    = "wig_region.pl chr22.phyloP46way.wigFix.gz $region > chr${chr}_${start}_${end}_phylop.data";
   warn("$phylop\n");
   system($phylop);

   my $phastcons = "wig_region.pl chr22.phastCons46way.wigFix.gz $region > chr${chr}_${start}_${end}_phastcons.data";
   warn("$phastcons\n");
   system($phastcons);
}
close(IN);

exit(0);

Now to run the script and to join the results:

get_regions.pl random_50_tss.tsv 

paste *phastcons.data > random_50_tss_phastcons.data
paste *phylop.data > random_50_tss_phylop.data

We will use R again to make some plots; I will use the boxplot.matrix() function to plot the distribution of scores at each position:

phylop_50 <- read.table("random_50_tss_phylop.data", header=F)
phastcons_50 <- read.table("random_50_tss_phastcons.data", header=F)

phylop_50 <- phylop_50[, seq(from = 2, to = 54, by = 2)]
phastcons_50 <- phastcons_50[, seq(from = 2, to = 54, by = 2)]

phylop_50 <- as.matrix(t(phylop_50))
phastcons_50 <- as.matrix(t(phastcons_50))

boxplot.matrix(phastcons_50, pch='.', las=2)
boxplot.matrix(phylop_50, pch='.', las=2)

phylop_50Particular positions seem more highly conserved, but it's hard to tell.

phastcons_50The distribution of phastCons scores is much higher for the region closer to the TSS. Perhaps these more conserved regions are the TATA box and the BRE or CpG islands.




Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
.
10 comments Add yours
  1. Awesome! I was looking for something like that for a long time to analyse genomic regions of bacteria. Thanks a lot!

  2. Has anyone gotten the issue:

    ERROR at line 1 (gff_read_set): minimum of 5 columns are required.

      1. Wow, first off: Thank you for responding to a 9 year old post 🙂

        My command is: maf_parse -g fakeBed.bed chr1.maf

        fakeBed.bed looks like this:
        chr1 2019360 2030741 GABRD, which I just created by typing echo chr1 2019360 2030741 GABRD > fakeBed.bed.

        1. No worries. The error message in this case is informative; you need 5 columns in your BED file. However if you add another column using your echo command, you will still get the same error. This is because BED files are delimited using tabs. Therefore you need to do this:

          echo -e "chr1\t2019360\t2030741\tGABRD\t42" > fakeBed.bed
          maf_parse -g fakeBed.bed chr1.maf
          

          The example above works for me and hopefully for you too.

          One more tip: when testing try to use chr22 because it’s the smallest chromosome, instead of chr1, which is the largest.

          Have fun 🙂

  3. Ahh, thank you! It works, but doesn’t do quite what I need it to (definitely my fault for not reading carefully).

    I’m trying to stitch blocks together so that I can represent an ORF, exon, isoform, etc. I know that galaxy has a tool that does that, but it only works for hg19. I’ve been attempting to write my own for the last 2 weeks, but my first attempt is ~ 80% correct and I’ve lost hours trying to understand what’s wrong. My second attempt has hope, but was hoping this tool could help me output data in the meantime.

    By any chance, have you had to do this, and can recommend a tool for this? If I’m not mistaken in your analysis above, you didn’t actually require the blocks to be stitched together as the .fa file shown spanned a segment of your .bed range.

    1. If you don’t mind sharing, can you email me@davetang.org about what you’re trying to do and what outcome you want in a bit more detail? Easier to communicate over email instead of the comments section.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.