How often are CpG islands upstream of transcriptional start sites?

CpG islands are genomic regions that have a high frequency of cytosine nucleotides that occurs next to a guanine nucleotide.

Download a CpG island BED file for hg19 from the UCSC Table Browser. For the Group, select "Regulation", the track "CpG Islands" and output format = BED. According to the table schema, there are 28,691 defined CpG islands.

CpG Islands

#how many CpG islands are in hg19_cpg_island.bed.gz
zcat hg19_cpg_island.bed.gz | wc -l
#how many CpG islands on assembled chromosomes
zcat hg19_cpg_island.bed.gz | perl -nle '@a = split; print unless $a[0] =~ /chrUn|random|hap/;' | wc -l

#Distribution of CpG islands on hg19 (excluding random, chrUn and hap chromsomes)
zcat hg19_cpg_island.bed.gz | perl -nle '@a = split; print unless $a[0] =~ /chrUn|random|hap/;' | cut -f1 | sort | uniq -c | sort -k1rn
   2541 chr19
   2462 chr1
   1688 chr2
   1634 chr17
   1578 chr7
   1491 chr16
   1367 chr11
   1253 chr6
   1229 chr5
   1226 chr9
   1221 chr12
   1163 chr3
   1143 chr10
   1037 chr8
   1031 chr4
    896 chrX
    801 chr20
    792 chr15
    788 chr14
    719 chr22
    605 chr13
    507 chr18
    365 chr21
    181 chrY


Now that we have the CpG islands, let's download some reference gene models. Provided here are some sequences 1,000, 2,000 and 5,000 bases upstream of annotated transcription starts of RefSeq genes with annotated 5' UTRs. As these are fasta files, we need to transform them into BED files.

zcat upstream1000.fa.gz | grep "^>" | cut -f2 -d ' ' | sed 's/[:-]/\t/g' > upstream1000.bed
cat upstream1000.bed | wc -l
zcat upstream2000.fa.gz | grep "^>" | cut -f2 -d ' ' | sed 's/[:-]/\t/g' > upstream2000.bed
zcat upstream5000.fa.gz | grep "^>" | cut -f2 -d ' ' | sed 's/[:-]/\t/g' > upstream5000.bed

Now how many CpG islands are located at 1,000, 2,000 and 5,000 bp upstream of RefSeq genes?

#any amount of overlap between the CpG island and upstream region
intersectBed -a hg19_cpg_island.bed.gz -b upstream1000.bed -u | wc -l
intersectBed -a hg19_cpg_island.bed.gz -b upstream2000.bed -u | wc -l
intersectBed -a hg19_cpg_island.bed.gz -b upstream5000.bed -u | wc -l

Are CpG islands located within RefSeq genes? To answer this question we will need to download specific parts of the RefSeq gene model. Again using the UCSC Table Browser, select the group "Genes and Gene Prediction Tracks", select the track "RefSeq Genes" and output format as BED. When you click get output you will be presented another page, which allows you to save 5' UTR Exons, 3' UTR Exons, Exons and Introns. Save these four separate tracks.

#5' UTR
intersectBed -a hg19_cpg_island.bed.gz -b hg19_refseq_5_utr.bed.gz -u | wc -l
#Do the 5' UTR overlap with the 1kp upstream BED files?
intersectBed -a hg19_refseq_5_utr.bed.gz -b upstream1000.bed -u | wc -l
intersectBed -a hg19_cpg_island.bed.gz -b hg19_refseq_exon.bed.gz -u | wc -l
#Does the exon track include 5' UTR regions
zcat hg19_refseq_5_utr.bed.gz | wc -l
intersectBed -a hg19_refseq_5_utr.bed.gz -b hg19_refseq_exon.bed.gz -u | wc -l
intersectBed -a hg19_refseq_intron.bed.gz -b hg19_cpg_island.bed.gz -u | wc -l
#3' UTR
intersectBed -a hg19_refseq_3_utr.bed.gz -b hg19_cpg_island.bed.gz -u | wc -l


CpG islands were defined in the paper by M. Gardiner-Garden and M. Frommer. Using the UCSC Table Browser, these 28,691 CpG islands can be classified by where they are located in the genome with respect to known gene models, such as by RefSeq. 13,174 of the total CpG islands overlap 5' UTR regions (~46%) and 4,393 overlap 3' UTR regions (~15%).

These numbers give only a rough estimate, however, since 3' UTRs of one gene may overlap an intron of another gene and so on. Regardless, we can see that a subset of CpG islands are not nearby or overlap RefSeq gene models, which is interesting.

Print Friendly, PDF & Email

Creative Commons License
This work is licensed under a Creative Commons
Attribution 4.0 International License
One comment Add yours

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.