CpG islands are genomic regions that have a high frequency of cytosine nucleotides that occurs next to a guanine nucleotide.
Download a CpG island BED file for hg19 from the UCSC Table Browser. For the Group, select "Regulation", the track "CpG Islands" and output format = BED. According to the table schema, there are 28,691 defined CpG islands.
#how many CpG islands are in hg19_cpg_island.bed.gz zcat hg19_cpg_island.bed.gz | wc -l 28691 #how many CpG islands on assembled chromosomes zcat hg19_cpg_island.bed.gz | perl -nle '@a = split; print unless $a =~ /chrUn|random|hap/;' | wc -l 27718 #Distribution of CpG islands on hg19 (excluding random, chrUn and hap chromsomes) zcat hg19_cpg_island.bed.gz | perl -nle '@a = split; print unless $a =~ /chrUn|random|hap/;' | cut -f1 | sort | uniq -c | sort -k1rn 2541 chr19 2462 chr1 1688 chr2 1634 chr17 1578 chr7 1491 chr16 1367 chr11 1253 chr6 1229 chr5 1226 chr9 1221 chr12 1163 chr3 1143 chr10 1037 chr8 1031 chr4 896 chrX 801 chr20 792 chr15 788 chr14 719 chr22 605 chr13 507 chr18 365 chr21 181 chrY
Now that we have the CpG islands, let's download some reference gene models. Provided here are some sequences 1,000, 2,000 and 5,000 bases upstream of annotated transcription starts of RefSeq genes with annotated 5' UTRs. As these are fasta files, we need to transform them into BED files.
zcat upstream1000.fa.gz | grep "^>" | cut -f2 -d ' ' | sed 's/[:-]/\t/g' > upstream1000.bed cat upstream1000.bed | wc -l 35056 zcat upstream2000.fa.gz | grep "^>" | cut -f2 -d ' ' | sed 's/[:-]/\t/g' > upstream2000.bed zcat upstream5000.fa.gz | grep "^>" | cut -f2 -d ' ' | sed 's/[:-]/\t/g' > upstream5000.bed
Now how many CpG islands are located at 1,000, 2,000 and 5,000 bp upstream of RefSeq genes?
#any amount of overlap between the CpG island and upstream region intersectBed -a hg19_cpg_island.bed.gz -b upstream1000.bed -u | wc -l 11792 intersectBed -a hg19_cpg_island.bed.gz -b upstream2000.bed -u | wc -l 12117 intersectBed -a hg19_cpg_island.bed.gz -b upstream5000.bed -u | wc -l 13054
Are CpG islands located within RefSeq genes? To answer this question we will need to download specific parts of the RefSeq gene model. Again using the UCSC Table Browser, select the group "Genes and Gene Prediction Tracks", select the track "RefSeq Genes" and output format as BED. When you click get output you will be presented another page, which allows you to save 5' UTR Exons, 3' UTR Exons, Exons and Introns. Save these four separate tracks.
#5' UTR intersectBed -a hg19_cpg_island.bed.gz -b hg19_refseq_5_utr.bed.gz -u | wc -l 13174 #Do the 5' UTR overlap with the 1kp upstream BED files? intersectBed -a hg19_refseq_5_utr.bed.gz -b upstream1000.bed -u | wc -l 8606 #Exon intersectBed -a hg19_cpg_island.bed.gz -b hg19_refseq_exon.bed.gz -u | wc -l 18757 #Does the exon track include 5' UTR regions zcat hg19_refseq_5_utr.bed.gz | wc -l 77524 intersectBed -a hg19_refseq_5_utr.bed.gz -b hg19_refseq_exon.bed.gz -u | wc -l 77524 #Introns intersectBed -a hg19_refseq_intron.bed.gz -b hg19_cpg_island.bed.gz -u | wc -l 38729 #3' UTR intersectBed -a hg19_refseq_3_utr.bed.gz -b hg19_cpg_island.bed.gz -u | wc -l 4393
CpG islands were defined in the paper by M. Gardiner-Garden and M. Frommer. Using the UCSC Table Browser, these 28,691 CpG islands can be classified by where they are located in the genome with respect to known gene models, such as by RefSeq. 13,174 of the total CpG islands overlap 5' UTR regions (~46%) and 4,393 overlap 3' UTR regions (~15%).
These numbers give only a rough estimate, however, since 3' UTRs of one gene may overlap an intron of another gene and so on. Regardless, we can see that a subset of CpG islands are not nearby or overlap RefSeq gene models, which is interesting.
This work is licensed under a Creative Commons
Attribution 4.0 International License.