CpG islands are genomic regions that have a high frequency of cytosine nucleotides that occurs next to a guanine nucleotide.
Download a CpG island BED file for hg19 from the UCSC Table Browser. For the Group, select "Regulation", the track "CpG Islands" and output format = BED. According to the table schema, there are 28,691 defined CpG islands.
CpG Islands
#how many CpG islands are in hg19_cpg_island.bed.gz
zcat hg19_cpg_island.bed.gz | wc -l
28691
#how many CpG islands on assembled chromosomes
zcat hg19_cpg_island.bed.gz | perl -nle '@a = split; print unless $a[0] =~ /chrUn|random|hap/;' | wc -l
27718
#Distribution of CpG islands on hg19 (excluding random, chrUn and hap chromsomes)
zcat hg19_cpg_island.bed.gz | perl -nle '@a = split; print unless $a[0] =~ /chrUn|random|hap/;' | cut -f1 | sort | uniq -c | sort -k1rn
2541 chr19
2462 chr1
1688 chr2
1634 chr17
1578 chr7
1491 chr16
1367 chr11
1253 chr6
1229 chr5
1226 chr9
1221 chr12
1163 chr3
1143 chr10
1037 chr8
1031 chr4
896 chrX
801 chr20
792 chr15
788 chr14
719 chr22
605 chr13
507 chr18
365 chr21
181 chrY
RefSeq
Now that we have the CpG islands, let's download some reference gene models. Provided here are some sequences 1,000, 2,000 and 5,000 bases upstream of annotated transcription starts of RefSeq genes with annotated 5' UTRs. As these are fasta files, we need to transform them into BED files.
zcat upstream1000.fa.gz | grep "^>" | cut -f2 -d ' ' | sed 's/[:-]/\t/g' > upstream1000.bed cat upstream1000.bed | wc -l 35056 zcat upstream2000.fa.gz | grep "^>" | cut -f2 -d ' ' | sed 's/[:-]/\t/g' > upstream2000.bed zcat upstream5000.fa.gz | grep "^>" | cut -f2 -d ' ' | sed 's/[:-]/\t/g' > upstream5000.bed
Now how many CpG islands are located at 1,000, 2,000 and 5,000 bp upstream of RefSeq genes?
#any amount of overlap between the CpG island and upstream region intersectBed -a hg19_cpg_island.bed.gz -b upstream1000.bed -u | wc -l 11792 intersectBed -a hg19_cpg_island.bed.gz -b upstream2000.bed -u | wc -l 12117 intersectBed -a hg19_cpg_island.bed.gz -b upstream5000.bed -u | wc -l 13054
Are CpG islands located within RefSeq genes? To answer this question we will need to download specific parts of the RefSeq gene model. Again using the UCSC Table Browser, select the group "Genes and Gene Prediction Tracks", select the track "RefSeq Genes" and output format as BED. When you click get output you will be presented another page, which allows you to save 5' UTR Exons, 3' UTR Exons, Exons and Introns. Save these four separate tracks.
#5' UTR intersectBed -a hg19_cpg_island.bed.gz -b hg19_refseq_5_utr.bed.gz -u | wc -l 13174 #Do the 5' UTR overlap with the 1kp upstream BED files? intersectBed -a hg19_refseq_5_utr.bed.gz -b upstream1000.bed -u | wc -l 8606 #Exon intersectBed -a hg19_cpg_island.bed.gz -b hg19_refseq_exon.bed.gz -u | wc -l 18757 #Does the exon track include 5' UTR regions zcat hg19_refseq_5_utr.bed.gz | wc -l 77524 intersectBed -a hg19_refseq_5_utr.bed.gz -b hg19_refseq_exon.bed.gz -u | wc -l 77524 #Introns intersectBed -a hg19_refseq_intron.bed.gz -b hg19_cpg_island.bed.gz -u | wc -l 38729 #3' UTR intersectBed -a hg19_refseq_3_utr.bed.gz -b hg19_cpg_island.bed.gz -u | wc -l 4393
Conclusions
CpG islands were defined in the paper by M. Gardiner-Garden and M. Frommer. Using the UCSC Table Browser, these 28,691 CpG islands can be classified by where they are located in the genome with respect to known gene models, such as by RefSeq. 13,174 of the total CpG islands overlap 5' UTR regions (~46%) and 4,393 overlap 3' UTR regions (~15%).
These numbers give only a rough estimate, however, since 3' UTRs of one gene may overlap an intron of another gene and so on. Regardless, we can see that a subset of CpG islands are not nearby or overlap RefSeq gene models, which is interesting.

This work is licensed under a Creative Commons
Attribution 4.0 International License.