Journal club

From Dave's wiki

Variants

Systematic comparison of variant calling pipelines using gold standard personal exome variants

http://www.ncbi.nlm.nih.gov/pubmed/26639839

Guidelines for investigating causality of sequence variants in human disease

http://www.ncbi.nlm.nih.gov/pubmed/24759409

  • As the volume of patient sequencing data increases, it is critical that candidate variants are subjected to rigorous evaluation to prevent further mis-annotation of the pathogenicity of variants in public databases
  • Our recommendations centre on five key areas: study design; gene-level implication; variant-level implication; publication and databases; and implications for clinical diagnosis

Box 1. Terms used to describe sequence variants:

Lack of clarity in the terms used to describe sequence variants is a major source of confusion in human genetics. We have adopted the following definitions for terms used throughout this manuscript.

  • Pathogenic: contributes mechanistically to disease, but is not necessarily fully penetrant, i.e. may not be sufficient in isolation to cause disease
  • Implicated: possesses evidence consistent with a pathogenic role, with a defined level of confidence
  • Associated: significantly enriched in disease cases compared to matched controls
  • Damaging: alters the normal levels or biochemical function of a gene or gene product
  • Deleterious: reduces the reproductive fitness of carriers, and would thus be targeted by purifying natural selection

To implicate a variant as pathogenic requires that the DNA sequence affected by that variant has a role in the disease process.

An appropriate framework for detecting pathogenic variants will evaluate all of the variation in a gene compared to a well-calibrated null model specific for the hypothesis being considered (for example, de novo, dominant, recessive).

Gene burden: the affected gene shows statistical excess of rare (or de novo) probably damaging variants segregating in cases compared to control cohorts or null models.

Researchers should at the very least evaluate and report the level of background variation in an implicated gene in population cohorts, taking advantage of public resources such as the Exome Variant Server (http://evs.gs.washington.edu/EVS/) when implicating a new gene in pathogenesis.

We urge that, whenever possible, investigators assess the results of genetic, informatic and functional analyses within a quantitative statistical framework, such as determining the probability of the observed distribution of genetic variants in cases and controls under the null hypothesis, and the a priori power to detect variants of a specified frequency and effect size.

Rare A2ML1 variants confer susceptibility to otitis media

http://www.ncbi.nlm.nih.gov/pubmed/26121085

A duplication variant within the middle ear-specific gene A2ML1 co-segregates with otitis media in an indigenous Filipino pedigree (LOD score = 7.5 at reduced penetrance) and lies within a founder haplotype that is also shared by 3 otitis-prone European-American and Hispanic-American children but is absent in non-otitis-prone children and >62,000 next-generation sequences. We identified seven additional A2ML1 variants in six otitis-prone children. Collectively, our studies support a role for A2ML1 in the pathophysiology of otitis media.

Otitis media (OM) causes considerable morbidity worldwide and hearing loss at any age. Despite efforts to reduce its incidence, OM remains an important public health problem within the US, with OM being the most frequent cause of paediatric consults and antibiotic prescription, incurring an annual cost of >$5 billion. In developing countries, including the Philippines, the prevalence of chronic suppurative OM is 2-6%. Strong evidence exists for genetic susceptibility to OM, but only a few associated loci have been mapped, including genome-wide association at rs10497394 on chromosome 2q31.1. Here we report rare variants in A2ML1 predisposing to non-syndromic OM within different study populations. A2ML1 encodes a middle ear-specific protease inhibitor that is 41% identical and 59% similar to alpha2-macroglobulin (encoded by A2M), a known marker for vascular permeability of the middle ear mucosa during infection.

An intermarried, indigenous Filipino community had close to 50% prevalence of OM. A pedigree was constructed that connected 134 indigenous individuals. We obtained DNA samples from 51 indigenous individuals, of whom 38 had current or previous OM. We performed exome sequencing using samples from two second cousins with chronic suppurative OM. Six exome variants were heterozygous in both indigenous Filipino individuals, passed GATK filters, occurred at conserved residues, were predicted to be damaging and were not in dbSNP or ExAC. We performed Sanger sequencing on all 6 variants using DNA from 51 indigenous pedigree members. Only the A2ML1 duplication c.2478_2485dupGGCTAAAT (p.Ser829Trpfs*9) possibly cosegregated with OM. Assuming 95% penetrance and a 5% phenocopy rate, we obtained a statistically significant maximum two-point logarithm of odds (LOD) score of 7.5 at recombination fraction = 0 for the A2ML1 variant.
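The HGVS name of the duplication can be unpacked mechanically. The sketch below is only an illustration: the regular expression and the function name `parse_simple_dup` are hypothetical and handle nothing beyond the simple `dup` form seen here, not the full HGVS grammar. It splits the name into its parts and checks whether the duplicated length is a multiple of three, which is what the "fs" in p.Ser829Trpfs*9 reflects.

```python
import re

# Meaning of the one-letter HGVS prefixes.
HGVS_PREFIXES = {
    "g": "genomic sequence",
    "c": "coding DNA sequence",
    "p": "protein",
    "m": "mitochondrial DNA",
}

# Hypothetical pattern: handles only "<prefix>.<start>_<end>dup<bases>".
DUP_PATTERN = re.compile(r"^([gcpm])\.(\d+)_(\d+)dup([ACGT]+)$")


def parse_simple_dup(variant):
    """Parse a simple duplication and report whether it shifts the reading frame."""
    match = DUP_PATTERN.match(variant)
    if match is None:
        raise ValueError(f"not a simple duplication: {variant}")
    prefix, start, end, bases = match.groups()
    length = int(end) - int(start) + 1
    if length != len(bases):
        raise ValueError("coordinates and duplicated bases disagree")
    return {
        "reference": HGVS_PREFIXES[prefix],
        "start": int(start),
        "end": int(end),
        "length": length,
        # A coding insertion whose length is not a multiple of 3 shifts the frame.
        "frameshift": prefix == "c" and length % 3 != 0,
    }


result = parse_simple_dup("c.2478_2485dupGGCTAAAT")
print(result)
```

For the A2ML1 variant the duplicated stretch is 8 bases long, so the check flags a frameshift.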

"g." for genomic sequence "c." for coding DNA sequence "p." for protein (http://www.hgvs.org/mutnomen/recs-prot.html) "m." for mitochondria

The A2ML1 duplication is predicted to truncate the protein to <60% of its original size, initiate nonsense-mediated decay and result in loss of thiol-ester and receptor-binding domains, which are expected to be essential for protease trapping and lysosomal clearance, respectively. We did not find the duplication in 61,109 exomes from multiple ancestry groups in the ExAC database; 1,385 exome sequences from the University of Washington Center for Mendelian Genomics (UWCMG) and 100 genomes from the Singapore Sequencing Malay Project (SSMP), which includes Southeast Asians of Chinese, Indian, and Malayan descent.

We obtained DNA samples from 123 otitis-prone and 118 non-otitis-prone children who were followed from birth at the University of Texas Medical Branch (UTMB). Among the UTMB children, 84 (68.3%) otitis-prone and 79 (66.9%) non-otitis-prone children self-identified as European American or Hispanic American. Sanger sequencing of all A2ML1 coding exons showed that the same A2ML1 duplication was present in 3 of 123 otitis-prone children. Two otitis-prone children, one European American and the other Hispanic American, were homozygous for the duplication, whereas a third otitis-prone, European-American child was heterozygous. We verified the reported ancestry for these three otitis-prone carriers by PCA. All three children with the duplication had early-onset severe OM requiring tympanostomy tube insertion by 6 months of age. Additionally, the duplication was absent in the 118 non-otitis-prone children, 2,756 UWCMG chromosomes of European-American or Hispanic-American descent, and 67,630 European, non-Finnish and 11,606 Latino alleles from the ExAC database. Comparing the frequency of this duplication only in individuals of European-American or Hispanic-American descent, we found that the duplication had genome-wide significant association with OM (two-sided Fisher's exact test, P = 3.34 x 10^-14). Moreover, the two exome-sequenced indigenous individuals and three otitis-prone children shared a haplotype that included the duplication and three common variants. The A-dup-A-T haplotype included 5.2 kb and is estimated to be ~1,800 years old (95% confidence interval = 145-3,462 years). This short founder haplotype was most likely introduced to the Americas and the Philippines by colonial Spaniards, according to population history.
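To make the association test concrete, here is a minimal sketch of a two-sided Fisher's exact test on allele counts, written with the standard library only. The allele counts are an assumption reconstructed from the numbers quoted above (how the authors actually pooled cases and controls is not stated in these notes): 84 otitis-prone children of European-American or Hispanic-American descent give 168 alleles, 5 of which carry the duplication (two homozygotes and one heterozygote), and the controls pool the 79 non-otitis-prone children with the UWCMG and ExAC alleles.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x carrier alleles falling in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(col1, row1)
    # Sum every table that is at least as unlikely as the observed one.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Assumed allele counts, reconstructed from the text rather than taken from the paper.
# Cases: 84 children = 168 alleles, 5 carrying the duplication.
case_dup, case_total = 5, 2 * 84
# Controls: 79 children (158 alleles) + 2,756 UWCMG + 67,630 + 11,606 ExAC alleles.
control_dup, control_total = 0, 2 * 79 + 2756 + 67630 + 11606

p_value = fisher_exact_two_sided(
    case_dup, case_total - case_dup, control_dup, control_total - control_dup
)
print(f"P = {p_value:.3g}")
```

Under these assumed counts the result is on the order of 10^-14, in line with the reported P value; a different pooling of controls would shift it.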

Seven additional variants (three stop-gain and four missense) were each identified as heterozygous in an otitis-prone child but not in non-otitis-prone children. With the exception of the A2ML1 duplication, all additional variants identified in the UTMB cohort each occurred in a single proband. All seven single-nucleotide variants identified in otitis-prone children from UTMB occurred at highly conserved nucleotides, were predicted to be damaging, had CADD score > 15, and were absent in UWCMG exomes and SSMP.

First Genome-Wide Association Study in an Australian Aboriginal Population Provides Insights into Genetic Risk Factors for Body Mass Index and Type 2 Diabetes

http://www.ncbi.nlm.nih.gov/pubmed/25760438

Performed a genome-wide association study using 1,075,436 quality-controlled single nucleotide polymorphisms (SNPs) genotyped (Illumina 2.5M Duo Beadchip) in 402 individuals in extended pedigrees from a Western Australian Aboriginal community.

Subjects for the discovery GWAS were recruited from an Australian Aboriginal community of Martu ancestry at the edge of the Western Desert in Western Australia.

See https://en.wikipedia.org/wiki/Martu_people

An individual was classified as having T2D if the subject was: (1) diagnosed with T2D by a qualified physician; (2) on a prescribed drug treatment regimen for T2D; and (3) returned biochemical test results of a fasting plasma glucose level of at least 7 mmol/l in SI units based on criteria laid down by the World Health Organization (WHO) consultation group report.

DNA was prepared from saliva samples collected into Oragene tubes (DNA Genotek, Ontario, Canada) from 405 consenting family members who were available at the time of visits by the study team during the two-year collection period of the study.

A subset of 70,420 genotyped SNPs with pairwise linkage disequilibrium (LD; r^2) < 0.3 and MAF > 0.01 was used in principal component analysis to examine population substructure across the 402 family members.
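The SNP selection step can be sketched as a MAF filter followed by greedy LD pruning. This is only an illustration on invented toy genotypes, assuming the r^2 value of 0.3 is an upper bound; the function names are hypothetical, the squared genotype correlation stands in for haplotype-based r^2, and a real analysis would use a dedicated tool with windowed pruning rather than an all-pairs loop.

```python
def maf(genotypes):
    """Minor allele frequency from genotypes coded as 0, 1 or 2 alternate alleles."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov * cov / (var_x * var_y)


def prune(snps, max_r2=0.3, min_maf=0.01):
    """Greedily keep SNPs that pass the MAF filter and are in low LD with all kept SNPs."""
    kept = []
    for name, genotypes in snps.items():
        if maf(genotypes) <= min_maf:
            continue
        if all(r_squared(genotypes, snps[k]) < max_r2 for k in kept):
            kept.append(name)
    return kept


# Toy genotypes for six individuals (invented for illustration).
toy = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 2, 0, 1, 2],  # identical to snp1, so r^2 = 1
    "snp3": [1, 1, 1, 0, 2, 1],  # weakly correlated with snp1 (r^2 = 0.125)
    "snp4": [0, 0, 0, 0, 0, 0],  # monomorphic, fails the MAF filter
}
print(prune(toy))
```

Here `snp2` is dropped for being in complete LD with `snp1`, and `snp4` is dropped for having no minor allele.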

Cpipe: a shared variant detection pipeline designed for diagnostic settings

Preprint available at http://biorxiv.org/content/early/2015/06/03/020388

Cpipe is a pipeline designed specifically for clinical genetic disease diagnostics and is available at http://cpipeline.org.

This is an important point made in the background section:

Thus far most clinical sequencing analysis pipelines have been driven by individual laboratories, who have either developed their own bioinformatics capability for processing data, relied on commercial products, or have partnered with research institutions to acquire the expertise needed.

The aim of Cpipe is to:

Create a common framework for applying the tools, that can be readily adapted for a diverse range of diagnostic settings and clinical indications.

They stress the importance of distinguishing between pipelines developed for research purposes and those developed for clinical use, and describe various examples of each. However, the existing pipelines lack many features, such as being freely available or having an extensive logging system.

Currently Cpipe:

Is being actively used by three separate institutions for clinical sequencing, and is undergoing accreditation for diagnostic use.

Cpipe uses Bpipe, which provides various features such as:

Automatic tracking of command history, logging of input and output files, clean-up of partially created files from failed commands, dependency tracking, automatic removal of intermediate results, generation of graphical reports, tracking of performance statistics, and notifications by email and instant messaging in response to failures.

Cpipe uses the concept of an "analysis profile", which is predefined to optimise settings for a particular subgroup of patients, such as those with a common clinical diagnosis. The parameters defined in the analysis profile can include:

the list of genes to be included or excluded in the analysis; minimum quality and coverage thresholds for variants that are reported; the width of the window beyond exonic boundaries that should be used to identify potential splice site variants; and any other customisable settings that could be applicable to different patients.

The core bioinformatic analysis implemented by Cpipe is based on the approach developed and recommended by the Broad Institute; the workflow includes:

Alignment using BWA mem, duplicate removal using Picard MarkDuplicates, Indel realignment using the GATK IndelRealigner, base quality score recalibration using the GATK BaseRecalibrator, and variant calling using the GATK HaplotypeCaller.

However, there are three modifications in Cpipe:

  1. They used Annovar for the annotation of variants
  2. They called variants individually instead of using joint calling
  3. No variant quality score recalibration was performed

GATK and Annovar may require a license for clinical use; to circumvent this, Cpipe uses an older version of GATK and provides two alternative variant annotation tools (VEP and SnpEff).

A common diagnostic strategy for rare diseases is to filter out variants that are observed at a frequency in the population that is inconsistent with the prevalence of the disease. Cpipe therefore maintains an internal database of all variants observed in all samples that are processed by that specific instance of Cpipe.

To narrow the candidate set to variants that may be clinically important, the Variant Priority Index was developed, which combines a range of factors to place variants into four distinct tiers. The tiers are ordered according to measures of rarity, conservation and truncating effect on the transcript protein.

  1. Tier one corresponds to “rare” in-frame indels or missense variants, with a frequency of less than 0.01 in EVS, 1000G, and ExAC.
  2. Tier two corresponds to “very rare or novel” variants, whose frequency in these population databases is less than 0.0005.
  3. Tier three corresponds to variants that are “very rare or novel” and also “highly conserved” (Condel>0.07).
  4. Tier four is reserved for the highest priority variants, including frameshift, truncating, and splice site variants.

Furthermore, variants that do not meet the criteria for at least tier one are hidden in the result set.
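The tiering rules can be sketched as a small classifier. This is a hypothetical reading of the notes above, not Cpipe's actual implementation: the effect labels, the `Variant` fields and the function name are invented for illustration, the thresholds are the ones quoted in these notes, and it is an assumption that tier four variants must also pass the tier one rarity filter.

```python
from dataclasses import dataclass

RARE = 0.01          # tier one frequency threshold from the notes
VERY_RARE = 0.0005   # tier two frequency threshold from the notes
CONDEL_MIN = 0.07    # conservation threshold as quoted in the notes

# Hypothetical effect labels; a real annotator uses its own vocabulary.
TRUNCATING = {"frameshift", "stop_gain", "splice_site"}
MODERATE = {"missense", "inframe_indel"}


@dataclass
class Variant:
    effect: str            # e.g. "missense"
    max_pop_freq: float    # highest frequency across EVS, 1000G and ExAC (0.0 if novel)
    condel: float = 0.0    # Condel score, if available


def priority_tier(v):
    """Return the tier (1-4), or 0 if the variant would be hidden."""
    if v.max_pop_freq >= RARE:
        return 0  # too common for any tier
    if v.effect in TRUNCATING:
        return 4  # assumption: truncating variants still need to be rare
    if v.effect not in MODERATE:
        return 0
    if v.max_pop_freq < VERY_RARE:
        return 3 if v.condel > CONDEL_MIN else 2
    return 1


examples = [
    Variant("missense", 0.005),
    Variant("missense", 0.0001),
    Variant("missense", 0.0, condel=0.9),
    Variant("frameshift", 0.0),
    Variant("missense", 0.05),
]
print([priority_tier(v) for v in examples])
```

Returning 0 for hidden variants keeps the result a plain integer, so a report can sort by tier and drop everything below one.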

The second strategy for reducing the number of variants is the prioritisation of genes into categories based on a priori likelihoods for being causal to the specific patient. The Gene Prioritisation Index (GPI) starts with all genes in the analysis profile target region (GPI 1), then narrows to genes that are commonly known to be causal for the disease or patient group (GPI 2), and finally narrows again to a set of custom genes that may be prioritised by the patient's clinician based on individual considerations (GPI 3 and GPI 4).

Two important statistics described in the paper are:

The diagnostic rate for a broad range of Mendelian adult and childhood conditions compares favourably to well established clinical genomics projects that claim diagnostic rates of 25%-35%.

and

In the operational setting where the Melbourne Genomics Health Alliance has processed 168 samples, we observe that 89% of all non-synonymous coding variants are removed by filtering on allele frequency in the 1000 genomes project and the Exome Sequencing Project.

A framework for variation discovery and genotyping using next-generation DNA sequencing data

http://www.ncbi.nlm.nih.gov/pubmed/21478889

We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs.

... elucidate the full spectrum of human genetic diversity and the complete genetic architecture of human disease.

Separating true variation from machine artifacts as a result of the high rate and context-specific nature of sequencing errors is the outstanding challenge in NGS analysis.

Here we present a single framework and the associated tools capable of discovering high-quality variation and genotyping individual samples using diverse sequencing machines and experimental designs.

... the introduction of improved calibration of base quality scores, local realignment to accommodate indels, the simultaneous evaluation of multiple samples from a population, and finally, an assessment of the likelihood that an identified variable site is a true biological DNA variant greatly improves the sensitivity and specificity of variant discovery from NGS data.

Inferring disease

Systematic Localization of Common Disease-Associated Variation in Regulatory DNA

Phenolyzer: phenotype-based prioritization of candidate genes for human diseases

The tool is located at: http://phenolyzer.usc.edu/ and the manual is located at: http://phenolyzer.usc.edu/download/Phenolyzer_manual.pdf

Prioritising candidate variants or genes from sequencing data poses substantial challenges. Several computational tools, including ANNOVAR, snpEff, VEP, Jannovar and VAT, address this problem mainly by employing a number of variant filtering steps, such as keeping non-synonymous and splice variants and keeping variants with high conservation scores and low alternative allele frequencies.

Phenomizer assesses an input query consisting of standard Human Phenotype Ontology (HPO) terms and generates a diagnosis with P values for each of the ~7,000 rare diseases in the HPO database.

Here we introduce a computational tool called Phenolyzer to prioritise human disease genes based on disease or phenotype information provided by users as free text. Phenolyzer includes multiple components: (i) a tool to map user-supplied phenotypes to related diseases, (ii) a resource that integrates existing knowledge on known disease genes, (iii) an algorithm to predict previously unknown disease genes, (iv) a machine learning model that integrates multiple features to score and prioritise all candidate genes and (v) a network visualisation tool to examine gene-gene and gene-disease relationships.

Transcriptomics

Post-transcriptional processing generates a diversity of 5'-modified long and short RNAs

http://www.ncbi.nlm.nih.gov/pubmed/19169241

Deep sequencing of small RNAs (<200 nucleotides) from human HeLa and HepG2 cells revealed a remarkable breadth of species. These arose both from within annotated genes and from unannotated intergenic regions. Overall, small RNAs tended to align with CAGE (cap-analysis of gene expression) tags, which mark the 5' ends of capped, long RNA transcripts. Many small RNAs, including the previously described promoter-associated small RNAs, appeared to possess cap structures.

Here we show that processing of mature mRNAs through an as yet unknown mechanism may generate complex populations of both long and short RNAs whose apparently capped 5' ends coincide.

Nearly 80 million short sequence reads (30–35 bases) were generated, representing RNAs <200 nt in both cell lines. Nearly 30 million of these could be matched perfectly to the hg18 release of the human genome, with 9.5 million reads mapping to unique sites. Sequences derived from mitochondria, chromosome Y, repeats, annotated small RNAs, predicted RNA genes, and known and predicted small nucleolar RNAs were excluded from further analysis.

Notably, nearly half of all reads could be assigned to the sense strand of annotated exons, with a strong representation of first exons.

Thus, the CAGE tags that we observe probably represent cleaved products of mature mRNAs that somehow acquire a 5' modification analogous to a cap structure that renders them sensitive to the CAGE tagging method.

We synthesised a collection of 30–35-nt, single-stranded RNAs that share their 5' ends with three PASRs from the sense genomic strand and two from antisense strand upstream of the annotated TSS. These were transfected individually into HeLa cells, and their effect on the abundance of MYC mRNA was measured. In each case, transfection of the PASR mimetic reduced the expression of c-MYC mRNA.

These studies have raised two possibilities for the origin of PASRs. First, they may be produced as capped, independent transcription products from promoters that also generate long RNAs. Second, they may be generated as post-transcriptional processing products of longer RNAs that initiate at annotated TSSs.

The existence of a large class of CAGE tags that are both adjacent to and cross splice junctions provides a prima facie case for the conclusion that long RNAs are metabolized into short RNAs that bear cap-like structures at their 5' ends.

A key question remains as to whether the group of small RNAs that arise from internal exons represents transition products from mature mRNAs into recyclable ribonucleotides. Several lines of evidence argue against these representing simple degradation intermediates. First, there is a strong correlation between the precise 5' ends of CAGE tags, derived from long RNAs, and small RNAs identified in our study.

Landscape of transcription in human cells

http://www.ncbi.nlm.nih.gov/pubmed/22955620

The Encyclopedia of DNA Elements (ENCODE) project has sought to catalogue the repertoire of RNAs produced by human cells as part of the intended goal of identifying and characterizing the functional elements present in the human genome sequence.

Here we report identification and characterization of annotated and novel RNAs that are enriched in either of the two major cellular subcompartments (nucleus and cytosol) for all 15 cell lines studied, and in three additional subnuclear compartments in one cell line.

Approximately 6% of all annotated coding and non-coding transcripts overlap with small RNAs and are probably precursors to these small RNAs. The subcellular localisation of both annotated and unannotated short RNAs is highly specific.

Beyond the GENCODE annotated elements, we observed a substantial number of novel elements represented by reproducible RNA-Seq contigs. These novel elements covered 78% of the intronic nucleotides and 34% of the intergenic sequences.

The distribution of gene expression is very similar across cell lines, with protein-coding genes, as a class, having on average higher expression levels than long non-coding RNAs (lncRNAs). Assuming that 1–4 r.p.k.m. approximates to 1 copy per cell, we find that almost one-quarter of expressed protein-coding genes and 80% of the detected lncRNAs are present in our samples in 1 or fewer copies per cell.
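As a worked example of the r.p.k.m. unit, the sketch below computes reads per kilobase per million mapped reads and converts it to a rough copy number. The input numbers are invented, and the conversion factor of 2 r.p.k.m. per copy is an arbitrary pick from the 1–4 range quoted above, not a value from the paper.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def approx_copies_per_cell(rpkm_value, rpkm_per_copy=2.0):
    """Rough copy number, assuming rpkm_per_copy r.p.k.m. equals one copy per cell."""
    return rpkm_value / rpkm_per_copy


# Invented example: 400 reads on a 2 kb transcript in a library of 50 million mapped reads.
value = rpkm(400, 2000, 50_000_000)
print(value, approx_copies_per_cell(value))
```

With these inputs the transcript sits at 4 r.p.k.m., which is the upper edge of the range the authors equate with about one copy per cell.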

The analysis of the expression of alternative isoforms resulted in several observations. First, isoform expression does not seem to follow a minimalistic strategy. Genes tend to express many isoforms simultaneously, and as the number of annotated isoforms per gene grows, so does the number of expressed isoforms.

Currently, a total of 7,053 small RNAs are annotated by GENCODE, 85% of which correspond to four major classes: small nuclear RNAs, small nucleolar RNAs, micro RNAs and transfer RNAs. Overall we find 28% of all annotated small RNAs to be expressed in at least one cell line.

Non-coding RNAs

Site-specific DICER and DROSHA RNA products control the DNA-damage response

http://www.ncbi.nlm.nih.gov/pubmed/22722852

List of acronyms:

  • Oncogene-induced senescence (OIS)
  • Senescence-associated heterochromatic foci (SAHF)

Notes:

  • α-Amanitin is an inhibitor of RNA polymerase II and III
  • Pleiotropy occurs when one gene influences two or more seemingly unrelated phenotypic traits, an example being phenylketonuria, which is a human disease that affects multiple systems but is caused by one gene defect.
  • ATM/ATR-like kinases preferentially phosphorylate their substrates on serine or threonine residues that precede glutamine residues, so-called SQ/TQ (or S/TQ) motifs. Interestingly, a large number of ATM/ATR substrates contain regions with a remarkably high local density of SQ/TQ motifs that have been termed SQ/TQ cluster domains or, in short, "SCDs". As a consequence, SCDs are now widely considered to represent a third signature domain characteristic of DNA-damage-response proteins, in addition to the BRCT and FHA domains that function as protein–protein interaction modules. See http://www.ncbi.nlm.nih.gov/pubmed/15770685
  • Mediator of DNA damage checkpoint protein 1 is a 2080 amino acid long protein that in humans is encoded by the MDC1 gene. MDC1 protein is a regulator of the Intra-S phase and the G2/M cell cycle checkpoints and recruits repair proteins to the site of DNA damage.

The main finding of the paper is the proposition that Dicer and Drosha have a role in the DNA damage response (DDR), specifically DDR foci formation and checkpoint activation.

A role for small RNAs in DNA double-strand break repair

http://www.ncbi.nlm.nih.gov/pubmed/22445173

A direct role for small non-coding RNAs in DNA damage response

http://www.ncbi.nlm.nih.gov/pubmed/24156824

Long noncoding RNAs: functional surprises from the RNA world

http://www.ncbi.nlm.nih.gov/pubmed/19571179

Although the vast majority of long noncoding RNAs have yet to be characterised thoroughly, many of these transcripts are unlikely to represent transcriptional "noise" as a significant number have been shown to exhibit cell type-specific expression, localisation to sub-cellular compartments, and association with human diseases.

In some cases, it appears that simply the act of noncoding RNA transcription is sufficient to positively or negatively affect the expression of nearby genes.

Supporting the biological relevance of these transcripts, multiple studies have shown that significant numbers of long ncRNAs are regulated during development, exhibit cell type-specific expression, localise to specific sub-cellular compartments, and are associated with human diseases. In addition, evidence for evolutionary selection within some long ncRNAs has been found.

It now appears that many protein-coding mRNAs and long ncRNAs may be post-transcriptionally processed to yield many small RNAs that, curiously, have a 5' cap structure. Numerous small RNAs identified using next-generation sequencing technology were found to significantly overlap CAGE tags, which are thought to mark the 5' ends of capped, long RNA transcripts. Although many CAGE tags do mark transcription start sites, significant numbers were found in exonic regions and, in some cases, to even cross splice junctions, meaning they must have arisen from at least partially processed mRNAs. Therefore, it has been proposed that mature long transcripts (both protein-coding mRNAs and long ncRNAs) can be processed post-transcriptionally to yield small RNAs, which are then modified by the addition of a cap structure.

Pervasive Transcription of the Human Genome Produces Thousands of Previously Unidentified Long Intergenic Noncoding RNAs

http://www.ncbi.nlm.nih.gov/pubmed/23818866

Summary

  • The majority of the human genome is made up of intergenic sequence, i.e. the regions between genes.
  • Tiling array studies initially showed that the majority of the genome is transcribed, which became known as pervasive transcription
  • However, the scope and nature of pervasive transcription has not been fully characterised; importantly it is not known whether these transcribed regions are functional or not
  • The authors studied 127 RNA-Seq libraries, which represented various human tissue types and found the same level of transcription as reported by ENCODE
  • By performing de novo transcriptome assembly of the 127 RNA-Seq libraries, they found a large number of long intergenic noncoding RNA (lincRNAs) that have not been previously reported
  • By characterising the expression pattern, sequence conservation, and whether or not these lincRNAs were associated with trait-associated SNPs, they argue that these lincRNAs are not a product of transcriptional noise

MicroRNAs silence the noisy genome

http://www.ncbi.nlm.nih.gov/pubmed/25838367

  • It has been observed that although cells within an organ are genetically identical, the concentration of many of their proteins is variable and fluctuates between cells.
  • This variability comes from two sources:
    • Intrinsic noise, which results from the stochastic nature of the biochemistry operating within cells
    • Extrinsic noise, which manifests global differences between cells, such as the number of protein production facilities

From the Wikipedia article http://en.wikipedia.org/wiki/Cellular_noise

  • Intrinsic noise refers to variation in identically-regulated quantities within a single cell: for example, the intra-cell variation in expression levels of two identically-controlled genes.
  • Extrinsic noise refers to variation in identically-regulated quantities between different cells: for example, the cell-to-cell variation in expression of a given gene.
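The two definitions map onto the widely used dual-reporter decomposition, in which two identically regulated reporters are measured in each cell. The sketch below is not taken from the article being summarised; it applies that standard estimator to invented per-cell values, and the function names are hypothetical. Differences between the two reporters within a cell contribute to intrinsic noise, while their covariance across cells contributes to extrinsic noise.

```python
def mean(values):
    return sum(values) / len(values)


def noise_decomposition(reporter_a, reporter_b):
    """Return (intrinsic, extrinsic, total) squared noise from paired per-cell measurements."""
    mean_a, mean_b = mean(reporter_a), mean(reporter_b)
    norm = mean_a * mean_b
    pairs = list(zip(reporter_a, reporter_b))
    # Intrinsic: how much the two reporters disagree within the same cell.
    intrinsic = mean([(a - b) ** 2 for a, b in pairs]) / (2 * norm)
    # Extrinsic: how much the two reporters rise and fall together across cells.
    extrinsic = (mean([a * b for a, b in pairs]) - norm) / norm
    total = (mean([a * a + b * b for a, b in pairs]) - 2 * norm) / (2 * norm)
    return intrinsic, extrinsic, total


# Invented fluorescence values for two reporters in five cells.
reporter_a = [100, 120, 80, 150, 90]
reporter_b = [110, 115, 85, 140, 95]

intrinsic, extrinsic, total = noise_decomposition(reporter_a, reporter_b)
print(intrinsic, extrinsic, total)
```

By construction the intrinsic and extrinsic terms add up to the total, which is a convenient sanity check on any implementation.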

Now back to the article

  • As this type of variability can become detrimental, the question is whether organisms have evolved a means to control noise
  • This perspective paper describes a study that uses a synthetic gene approach to establish a complex role for miRNAs in controlling cellular protein content
  • While miRNAs are well-known to post-transcriptionally control mRNA levels, the quantitative effect of miRNAs on their targets has not been studied in detail
  • It has been suggested that miRNAs provide noise filtration functions, limiting the variability of protein expression across a population of cells
  • To study this possible function of miRNAs, Schmiedel et al. used a reporter gene that contains synthetically connected miRNA target sequences
    • The fluorescence reporter system allows the measure of gene expression
    • They synthesised various reporter genes with different miRNA target sequences (of varying binding strengths) and measured their expression in cultured mammalian cells
  • On examination of single-cell fluorescence data, it was revealed that reporters with and without miRNA binding sites were differentially expressed
    • A lowly expressed reporter gene with a miRNA binding site was expressed with less noise
    • However, a highly expressed reporter gene with a miRNA binding site had elevated noise

Single cell transcriptomics

Single Mammalian Cells Compensate for Differences in Cellular Volume and DNA Copy Number through Independent Global Transcriptional Mechanisms

http://www.ncbi.nlm.nih.gov/pubmed/25866248

Stochastic mRNA Synthesis in Mammalian Cells

http://www.ncbi.nlm.nih.gov/pubmed/17048983

  • Explored cell-to-cell variation in gene expression in mammalian cells by accurately counting single molecules of mRNA through the use of fluorescence in situ hybridization (FISH).

Localization and abundance analysis of human lncRNAs at single-cell and single-molecule resolution

For an introduction to RNA FISH https://sites.google.com/site/singlemoleculernafish/introduction-to-rna-fish

http://www.ncbi.nlm.nih.gov/pubmed/25630241

  • Single-molecule RNA FISH to systematically quantify and categorise the subcellular localisation patterns of a representative set of 61 lncRNAs in three different cell types.
  • Sequencing has revealed thousands of lncRNAs; however, the vast majority remain uncharacterised.
  • The subcellular localisation of RNA can provide fundamental insights into their function
    • This is particularly true for lncRNAs, which must localise to their particular site of action
    • For example, finding a lncRNA primarily in the nucleus near its site of transcription suggests that it regulates transcription of a proximal gene
  • On average, the expression of most lncRNAs tends to be lower than that of mRNAs
    • One hypothesis states that a small number of cells in the population may express high numbers of lncRNA, thereby allowing for an increased number of sites of action in those cells
  • RNA fluorescence in situ hybridisation (RNA FISH) is an approach that can address the quantity of lncRNA
    • For example, RNA FISH demonstrated that XIST accumulates on the inactive X-chromosome
    • MALAT1, NEAT1, and Gomafu are localised to nuclear bodies
    • However, these RNAs are highly abundant in the cell and most lncRNAs are considerably less abundant
  • To detect low level lncRNAs, multiple short, fluorescently labelled oligonucleotide probes can be used together to amplify the fluorescent signal
  • This study is on the systematic profiling of lncRNAs using single molecule RNA FISH
  • This is a technical challenge because lncRNAs are in low abundance and contain repeats
  • Used single-molecule RNA-FISH to systematically quantify and categorise the subcellular localisation patterns of a representative set of 61 long non-coding RNAs (lncRNAs) in three different cell types
  • lncRNAs are typically made up of repetitive regions and they developed a validation pipeline to select probes that had little to no off target effects
  • Oligonucleotide sets were designed using software available through Stellaris Probe Designer
  • They only included in the actual screen lncRNAs for which there were at least 10 designed oligonucleotides
    • Specifically, 10–48 complementary DNA oligonucleotides, each 20 bases long and labeled with a single fluorophore at its 3’ end, were used
  • The lncRNAs in their set are significantly expressed in at least one of human foreskin fibroblasts (hFFs), human lung fibroblasts (hLFs) or HeLa cells
  • Five classes of subcellular localisation patterns
    • I) 1–2 large foci in the nucleus (9 pairs)
    • II) large nuclear foci and single molecules scattered through the nucleus (11 pairs)
    • III) predominantly nuclear, without foci (18 pairs)
    • IV) cytoplasmic and nuclear (28 pairs)
    • V) predominantly cytoplasmic (4 pairs).
  • Characterised the cell-to-cell variability of lncRNAs in single cells
  • Observed that in most cases the eight lncRNAs with divergent mRNA partners were not colocalised

Enhancers

Linc-ing Long noncoding RNAs and enhancer function

http://www.ncbi.nlm.nih.gov/pubmed/20951339

DNA hypomethylation within specific transposable element families associates with tissue-specific enhancer landscape

http://www.ncbi.nlm.nih.gov/pubmed/23708189

  • The methylation profile of 29 human samples, representing 11 cell types, was profiled using two complementary genome-wide DNA methylation assays: Methylated DNA ImmunoPrecipitation sequencing (MeDIP-seq) and Methylation-sensitive Restriction Enzyme sequencing (MRE-seq)
  • The samples included embryonic stem cells (ESC H1), fetal brain tissue, primary neural progenitor cells, primary adult breast epithelial cells, unfractionated peripheral blood mononuclear cells and adult immune cells
  • They examined 1,395 specific families of human repeats, of which 928 were transposable element (TE) families; these numbers are based on the RepeatMasker on hg19 results available from the UCSC Genome Browser
  • They found a high correlation between the CpG content in each TE and the total MeDIP-seq signal, which can be seen in supplementary figures 6-9
  • Using the methylation signal from TEs they were able to cluster the samples, suggesting that TE methylation patterns are tissue specific (check out figure 1)
  • Using ANOVA, they identified 95 TE families that were hypomethylated (out of 928)
    • 14 in brain samples, 55 in breast samples, 13 in blood samples, and 13 in embryonic stem cells
    • 69/95 belonged to endogenous retroviruses (ERVs) or long terminal repeats (LTRs)
    • 12/95 were DNA transposons
    • Not mentioned in the main paper but the remaining 14/95 belonged to non-LTR retrotransposons: 3 LINEs and 11 SINEs
  • Examining the genomic location of hypomethylated TEs, they found that they are located near genes functionally related to the tissue in which the TEs were hypomethylated
    • They used the Genomic Regions Enrichment of Annotations Tool (GREAT) for this analysis
  • Next they generated ChIP-seq data using the same tissues for these histone modifications:
    • H3K4me1 (enhancer mark), H3K4me3 (promoter mark), H3K27me3 (repressive mark), H3K36me3 (elongation mark) and H3K9me3 (repressive mark)
  • They found that sequences within hypomethylated TEs had strong tissue specific signal for H3K4me1
  • They selected two genes, ERAP1 and GFRA1, and examined whether the TEs near these genes were acting as enhancers
  • An LTR77 element 2 kb upstream of ERAP1 (figure 3a) was confirmed by locus-specific bisulfite sequencing to be hypomethylated in blood (figure 3b), and expression levels of ERAP1 were much higher in blood samples. RNA polII and NF-κB peaks were also observed, perhaps suggesting eRNAs
  • They examined the TFBS and histone modifications (from ENCODE) in two cell lines, GM12878 (lymphoblastoid) and SK-N-SH (neuroblastoma), with respect to the LTR77 and LFSINE repeats
  • Examining figure 4, there are 11 columns. The first two columns show H3K4me1 and p300 signal within a 10 kb window (centred on the repeat) for 148 LTR77 and 429 LFSINE repeats. There is enriched signal near the repeat for LTR77 but not in LFSINE.
  • In the third column, p300 signal is enriched for LFSINE in the SK-N-SH cell line. There is no H3K4me1 data from ENCODE on the SK-N-SH cell line, hence it is missing in the figure.
  • The next three columns show RAD21 TFBSs for the two cell lines and the predicted motifs for this TF. They don't explain why they used RAD21, and looking up this transcription factor showed that it is involved in DNA damage repair.
  • The last six columns show the same thing but for the YY1 and NF-κB TFs
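
The ANOVA step mentioned in the notes above is not described in any detail, so the following is only a rough illustration of the idea: a one-way ANOVA F statistic for a single TE family, comparing its methylation scores across tissue groups. The grouping by tissue, the function name `one_way_anova_f`, and the toy scores are all assumptions; the test set-up and significance cutoff actually used in the paper are not given in these notes.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups.

    groups: list of lists of numbers, e.g. methylation scores of one
    TE family in each tissue type (hypothetical input layout).
    """
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the samples around their own group mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Toy scores (made up): one tissue group is clearly lower than the others
blood = [1.0, 2.0, 3.0]
brain = [5.0, 6.0, 7.0]
breast = [5.5, 6.5, 7.5]
f_stat = one_way_anova_f([blood, brain, breast])
```

A large F statistic says that the tissue groups differ more than expected from the within-group spread; deciding which tissue is the hypomethylated one would need a follow-up comparison.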

In conclusion, the authors have identified tissue-specific hypomethylation of a subset of TEs using two complementary methylation assays. This challenges the general view that TEs are largely methylated. By examining the histone modifications on the same samples, they found that hypomethylated TEs were enriched with H3K4me1 signal, which indicates that these TEs could be acting as enhancers. They performed reporter gene assays to test 36 TE-derived candidates and 26 showed enhancer activity. Examining two specific repeats, LTR77 and LFSINE, they used ENCODE data for the two cell lines GM12878 and SK-N-SH to show enhancer marks (p300 and H3K4me1) and specific TF binding.

Supporting website: http://epigenome.wustl.edu/TE_Methylation/index.php

Massively parallel in vivo enhancer assay reveals that highly local features determine the cis-regulatory function of ChIP-seq peaks

http://www.ncbi.nlm.nih.gov/pubmed/23818646

  • How do we distinguish between functional and nonfunctional sites that are both bound by TFs?
  • Using a massively parallel reporter gene assay, the authors compared the cis-regulatory activity of Cone-rod homeobox (Crx)-bound DNA, which were previously identified using ChIP-Seq in murine photoreceptors, to the activity of unbound genomic regions with equivalent numbers of Crx motif occurrences

RNA transcribed from a distal enhancer is required for activating the chromatin at the promoter of the gonadotropin α-subunit gene

http://www.ncbi.nlm.nih.gov/pubmed/25810254

  • Studies the effects of knocking down an eRNA produced at a distal enhancer of the gonadotropin hormone alpha-subunit gene, chorionic gonadotropin alpha (Cga).
  • Knockdown led to a drop in Cga mRNA levels, a loss of the interaction between the enhancer and the promoter, an increase in total histone H3 occupancy, and a loss of histone H3K4me3 at the promoter
  • Proposes that the Cga eRNA mediates the physical interaction between these genomic regions and determines the chromatin structure of the proximal promoter to allow gene expression

A unique chromatin signature uncovers early developmental enhancers in humans

http://www.ncbi.nlm.nih.gov/pubmed/21160473

Here we show that in human embryonic stem cells (hESCs), unique chromatin signatures identify two distinct classes of genomic elements, both of which are marked by the presence of chromatin regulators p300 and BRG1, monomethylation of histone H3 at lysine 4 (H3K4me1), and low nucleosomal density. In addition, elements of the first class are distinguished by the acetylation of histone H3 at lysine 27 (H3K27ac), overlap with previously characterized hESC enhancers, and are located proximally to genes expressed in hESCs and the epiblast. In contrast, elements of the second class, which we term 'poised enhancers', are distinguished by the absence of H3K27ac, enrichment of histone H3 lysine 27 trimethylation (H3K27me3), and are linked to genes inactive in hESCs and instead are involved in orchestrating early steps in embryogenesis, such as gastrulation, mesoderm formation and neurulation.

Transcription of ERVs

Dynamic Transcription of Distinct Classes of Endogenous Retroviral Elements Marks Specific Populations of Early Human Embryonic Cells

http://www.ncbi.nlm.nih.gov/pubmed/25658370

Report that repetitive elements originating from endogenous retroviruses (ERVs) are systematically transcribed during human early embryogenesis in a stage-specific manner. Conversion of human embryonic stem cells (hESCs) to an epiblast-like state activates blastocyst-specific ERV elements, indicating that their activity dynamically reacts to changes in regulatory networks. For their analyses, they used only uniquely mappable reads from 100 bp RNA-Seq data.

  • Full-length ERVs encode an array of proteins (gag, pol, and env) flanked by two long terminal repeats (LTRs) that contain a regulatory sequence for controlling ERV expression.
  • Comprehensively analysed expression of ERV elements in the human genome using publicly available single-cell RNA sequencing (RNA-seq) data from oocytes (n = 6) and pronuclei (n = 3), zygote (n = 5), two-cell (n = 9), four-cell (n = 16), eight-cell (n = 31), morula (n = 19), and blastocyst stage embryos (n = 30) as well as hESCs (n = 32; in total, 153 100 bp-long, single- and paired-end RNA-seq samples, almost 4 billion mapped reads)
  • Found that oocytes and zygote to four-cell embryos show the highest percentage of ERV transcripts
    • From the onset of the eight-cell stage, the fraction of reads that map to ERV elements is gradually reduced.
  • The expression pattern of ERV elements is not an artifact of overlapping protein-coding genes.
  • They noticed the same distinct expression pattern of ERV elements even when they averaged the expression of all elements that belonged to the same family
  • Strikingly, only the epiblast cells express LTR7-HERVH, the HERVH class that contributes to pluripotency of hESCs in cultures
    • Enrichment of H3K4me3 supports the role of LTR7 as active promoter
    • As reported previously, they found that NANOG binds to the LTR7 region of HERVH.
  • To test whether the transcription of ERV elements is stage specific, they performed a principal component analysis (PCA) on the ERV expression estimates.
  • They cloned an LTR7 region that was shown to be bound by NANOG into the promoter region of a luciferase reporter; the LTR7 sequence resulted in a significant increase in reporter activity and depletion of NANOG reduced the reporter activity by more than 50%
  • They suggest that the DNA sequence of the LTR elements might indeed be the primary determinant of stage-specific expression of ERV elements.
  • The LTR elements of HERVH, HERVK, and HERVK14 (LTR7, LTR7Y, LTR7B, LTR5_Hs, and LTR14B) show a low amount of splicing, indicating that they act as promoter and transcription start or end site for the HERVs.
  • In conclusion, the dynamic expression pattern of ERV families may result from a combination of activating and silencing mechanisms that integrate ERV elements into the regulatory networks of early human embryos.

Detecting endogenous retrovirus-driven tissue-specific gene transcription

http://www.ncbi.nlm.nih.gov/pubmed/25767249

Main point from the abstract

  • Using correlation of expression patterns across 18 tissue types, we reveal the tissue-specific uncoupling of gene expression due to 62 different LTR classes

Points from the introduction

  • They "present a straightforward approach to screen for tissue-specific signatures of transposable elements (TEs) using transcriptomic data."
  • TEs have various effects on the regulation of adjacent genes; these effects include single TE recruitment into cis-regulation into a single lineage, as well as striking examples of multiple independent co-options of different transposable elements across species
  • TEs are restricted phylogenetically, due to the fact that they invaded a specific lineage, and therefore have potential for driving discrete lineage differences
  • They focused on LTRs of ERVs, which are enriched in TFBSs
  • They detected multiple associations between LTR elements and tissues, driven by the expression of genes co-localised with LTRs
  • Lastly, they focused on the expression of placenta-specific LTRs and found that the increase in LTR transcription in placenta relative to other tissues is largely due to a small number of repeats rather than genome-wide effects

Points from the Methods

  • Focused on gene subsets with a particular LTR within 10kb upstream of the TSS, in the same orientation of the linked gene transcripts, and considered whether the expression of these LTR-associated genes is potentially affected by the presence of the particular LTR element.
  • They used the Illumina Human Body Map 2.0, which consists of 73-83 million 50 bp paired-end reads from 16 normal non-placental human tissues, which were mapped to hg19
  • Two additional RNA-Seq transcriptomes of human reproductive tissues, and two human placental samples were also used
  • They chose a cutoff of 1 FPKM as a threshold for determining the presence of a gene transcript
  • 221 ubiquitously highly expressed housekeeping genes (defined as those expressed in all 18 tissues at FPKM values > 50) were removed
  • 18 tissues would result in 153 comparisons and not 136 between tissues (or I missed something?)
  • They tested 62 common LTR elements present in the human genome
  • They mentioned that they used hg19 as the reference
  • They used a transformation known as "arcsine square root transformation" to normalise the RNA-Seq data
  • Measured the co-expression of genes across tissues using the Pearson product-moment correlation of gene expression levels
  • They compared the correlation between tissues based on the LTR-associated genes (LTR+) and the correlation between tissues based on LTR-absent genes (LTR-)
    • The result can be expressed as a ratio between the two similarity measures, LTR+/LTR-, for every pairwise tissue comparison.
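
The comparison described in the Methods notes above can be sketched in a few lines. This is only an illustration under assumptions: the function names (`n_pairs`, `asin_sqrt`, `pearson`, `ltr_ratio`) and the toy FPKM values are made up, and rescaling each gene set to proportions before the arcsine square root transform is my assumption, since the notes do not say how the values were scaled. The first function also confirms the count queried above: 18 tissues give 153 pairwise comparisons, whereas 136 is what 17 tissues would give.

```python
import math


def n_pairs(n_tissues):
    """Number of unordered tissue pairs (n choose 2)."""
    return math.comb(n_tissues, 2)


def asin_sqrt(values):
    """Arcsine square root transform.

    Values are first rescaled to proportions of their total so that the
    input to asin() lies in [0, 1] (an assumption, not from the paper).
    """
    total = sum(values)
    return [math.asin(math.sqrt(v / total)) for v in values]


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ltr_ratio(tissue_a, tissue_b, ltr_plus, ltr_minus):
    """LTR+/LTR- ratio of between-tissue correlations.

    tissue_a, tissue_b: dicts mapping gene name -> FPKM (hypothetical layout)
    ltr_plus, ltr_minus: lists of gene names with / without the LTR upstream
    """
    def corr(genes):
        a = asin_sqrt([tissue_a[g] for g in genes])
        b = asin_sqrt([tissue_b[g] for g in genes])
        return pearson(a, b)

    return corr(ltr_plus) / corr(ltr_minus)


# Toy example (made-up FPKM values for two tissues)
liver = {"g1": 5.0, "g2": 20.0, "g3": 1.5, "g4": 8.0, "g5": 2.0, "g6": 30.0}
placenta = {"g1": 40.0, "g2": 3.0, "g3": 9.0, "g4": 7.5, "g5": 2.5, "g6": 28.0}
ratio = ltr_ratio(liver, placenta, ["g1", "g2", "g3"], ["g4", "g5", "g6"])
```

A ratio well below 1 for a tissue pair would mean the LTR-associated genes are less similar between the two tissues than the background genes, which is the "uncoupling" the abstract refers to.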

At this point, I couldn't fully comprehend the entire methodology, which is probably my fault entirely, so I stopped reading.

Epigenetics

Epigenomic annotation of genetic variants using the Roadmap Epigenome Browser

The browser takes advantage of the over 10,000 epigenomic data sets it currently hosts, including 346 ‘complete epigenomes’, defined as tissues and cell types for which we have collected a complete set of DNA methylation, histone modification, open chromatin and other genomic data sets.

Investigators can specify any number of single nucleotide polymorphism (SNP)-associated regions and any type of epigenomic data, for which the browser automatically creates virtual data hubs through a shared hierarchical metadata annotation, retrieves the data and performs real-time clustering analysis.

We illustrate the epigenomic annotation of two noncoding SNPs, identified from genome-wide association studies of people with multiple sclerosis, by clustering the histone H3K4me1 profile of SNP-harbouring regions and RNA-Seq signal of their closest genes across multiple primary tissues and cells.

What do you mean, "epigenetic"?

http://www.ncbi.nlm.nih.gov/pubmed/25855649

  • Conrad Waddington, who first defined the field of epigenetics in 1942, worked as an embryologist and developmental biologist.
  • At the time, there were two prevailing views on development:
    • Preformation, which asserted that all adult characters were present in the embryo and needed simply to grow or unfold
    • Epigenesis, which posited that new tissues were created from successive interactions between the constituents of the embryo
  • Waddington believed that both preformation and epigenesis could be complementary, with preformation representing the static nature of the gene and epigenesis representing the dynamic nature of gene expression
  • It is through the combination of these concepts that he coined the term epigenetics, which he referred to as, "the branch of biology that studies the causal interactions between genes and their products which bring the phenotype into being".
  • Today, Waddington's views on epigenetics are most closely associated with phenotype plasticity, which is the ability of a gene to produce multiple phenotypes, but he also coined the term canalisation to refer to the inherent stability of certain phenotypes (particularly developmental traits) across different genotypes and environments.
  • In 1958, 16 years after Waddington first coined the term, David Nanney published a paper in which he used the term epigenetics to distinguish between different types of cellular control systems
    • He proposed that genetic components were responsible for maintaining and perpetuating a library of genes, expressed and unexpressed, through a template replicating mechanism.
  • Throughout the 1980s and 1990s, the definition of epigenetics moved farther away from developmental processes and became more generalised
    • For example, one definition from 1982 describes epigenetics as "pertaining to the interaction of genetic factors and the developmental processes through which the genotype is expressed in the phenotype".
    • It made the term more available and applicable to other fields by emphasising the importance of genetic and nongenetic factors in controlling gene expression, while downplaying (although not ignoring) the connection to development.
  • Concurrently, research being performed in the 1970s and 1980s on the relationship between DNA methylation, cellular differentiation, and gene expression became more closely associated with epigenetics
  • This prompted the redefinition of epigenetics in a way that was more specific and squarely focused on the inheritance of expression states (while Nanney discussed epigenetic inheritance, his definition of epigenetics did not include a specific component on heritability).
  • Holliday offered two definitions of epigenetics, both of which were admittedly insufficient when taken separately but comprehensive in covering all currently acknowledged epigenetic processes when taken together.
    • The first definition posed that epigenetics was "the study of the changes in gene expression, which occur in organisms with differentiated cells, and the mitotic inheritance of given patterns of gene expression."
    • The second stated that epigenetics was "nuclear inheritance, which is not based on differences in DNA sequence."

Redistribution of H3K27me3 upon DNA hypomethylation results in de-repression of Polycomb target genes

http://www.ncbi.nlm.nih.gov/pubmed/23531360

  • DNA methylation and the Polycomb Repression System are epigenetic mechanisms that play important roles in maintaining transcriptional repression
  • The Polycomb Repressor Complex 2 (PRC2) modifies chromatin structure by depositing tri-methylation of lysine 27 on histone H3 (H3K27me3) via its catalytic Ezh2/Ezh1 subunit
  • Recent evidence suggests that DNA methylation can attenuate/weaken the binding of Polycomb protein components to chromatin and thus plays a role in determining their genomic targeting
  • They try to prove that methylation has a negative effect on the formation of the PRC2 complex

The model used for the study is a Dnmt1-/- mutant

  • To induce DNA hypomethylation, they used mouse embryonic fibroblasts (MEFs) carrying a Dnmt1 mutation (Dnmt1 encodes the major maintenance DNA methyltransferase)
    • Despite the Dnmt1-/- genotype these cells are still viable in culture

Promoter or enhancer

Promoter or enhancer, what's the difference? Deconstruction of established distinctions and presentation of a unifying model

http://www.ncbi.nlm.nih.gov/pubmed/25450156

Nuclear stability and transcriptional directionality separate functionally distinct RNA species

http://www.ncbi.nlm.nih.gov/pubmed/25387874

Main summary:

  • Performed CAGE and GRO-seq on HeLa cells
  • Compared two conditions:
    • Control versus siRNA knockdown of hRRP40 (a core component of the exosome)
  • Intersected CAGE with DNase I hypersensitive sites (DHSs) from ENCODE
    • These are the "transcribed DHSs", which total 19,224 sites
  • This study is on the characterisation of these transcribed DHSs

Overlap of CAGE to DHSs

  • Most CAGE tags are within 300 bp of DHSs (~93%)
  • Few DHSs overlap CAGE tags (since there are many DHSs)
  • Examined the bi-directionality and exosome sensitivity of transcribed DHSs
  • Directionality ranges from 0 to 1, where 0 is 100% minus strand expression and 1 is 100% plus strand expression; 0.5 indicates perfectly balanced bidirectional output
  • Sensitivity calculated based on strand-specific expression and the difference between control and exosome depletion
  • If the expression is higher after exosome depletion, then the sensitivity is greater than 0.
  • A sensitivity of 0.75 or higher was used to define highly unstable (exosome-sensitive) RNAs
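
The two scores described above can be sketched as follows. Directionality follows directly from the description (the plus-strand share of the total expression). The sensitivity formula is an assumption on my part: the notes only say that it is based on the difference between control and exosome-depleted expression and is greater than 0 when expression rises after depletion, so here it is taken as the fraction of the depleted-sample signal that is absent from the control, floored at 0. Function names and the example counts are made up.

```python
def directionality(plus, minus):
    """Plus-strand share of expression: 0 = all minus, 1 = all plus,
    0.5 = perfectly balanced bidirectional output."""
    total = plus + minus
    if total == 0:
        raise ValueError("no expression on either strand")
    return plus / total


def exosome_sensitivity(control, depleted):
    """Assumed definition: fraction of the signal in exosome-depleted
    cells that is not present in control cells, floored at 0."""
    if depleted == 0:
        return 0.0
    return max(0.0, (depleted - control) / depleted)


# Made-up CAGE tag counts for one strand of one transcribed DHS
d = directionality(plus=30, minus=10)                 # 0.75, mostly plus strand
s = exosome_sensitivity(control=10, depleted=40)      # 0.75
unchanged = exosome_sensitivity(control=20, depleted=10)  # 0.0, no increase
```

Under this assumed definition, a value at or above the 0.75 cutoff means that at least three quarters of the signal is seen only after exosome depletion.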

Bidirectional transcription

  • 66% (12,763 of 19,224) of transcribed DHSs in hRRP40 depleted cells showed evidence of bidirectional transcription compared to ~35% (6,724 of 19,224) in control cells
    • Thus it seems that bidirectional transcription is a general feature of transcribed DHSs, but many of the transcripts are degraded post-transcriptionally
  • Unidirectional defined as having the majority of CAGE tags derived from one strand
  • There is bidirectional transcription, even in unidirectional cases, but they are exosome sensitive (i.e. the antisense is degraded by the exosome)
  • Bidirectional transcriptional products are mainly exosome sensitive
  • Major strand defined as the strand with the majority of CAGE tags and the minor strand is the other strand
  • Transcripts from the major strand of protein-coding loci are not exosome sensitive, i.e. they are stable, and are much more highly expressed than lncRNAs or unannotated loci
  • Minor strand transcription is mostly exosome sensitive

k-medoids clustering reveals five classes of transcribed DHSs

  • The five classes defined by clustering based upon exosome sensitivity, expression levels, and directionality
  1. Bidirectional stable
  2. Unidirectional stable (-ve strand)
  3. Unidirectional stable (+ve strand)
  4. Intermediate unstable
  5. Weak unstable
  • Transcribed DHSs that are not annotated are unstable, i.e. degraded by the exosome, are located in repressed regions, and are not associated with TSSs
  • Enhancer regions are weakly transcribed and unstable
  • Transcribed DHSs associated with protein-coding regions are stable

Closing remarks

  • The absence of a functional exosome has previously revealed a new class of ncRNAs called Promoter Upstream Transcripts (PROMPTs)
  • In this study, exosome sensitive sites are closely characterised revealing widespread bidirectional transcription, the stability of protein-coding loci, and instability of enhancer and unannotated regions
  • Furthermore, only a few annotated lncRNAs are resistant to exosome-mediated decay
  • Perhaps by virtue of cytoplasmic transport and translation mechanisms, protein-coding transcripts require a higher stability

Analysis of nascent RNA identifies a unified architecture of initiation regions at mammalian promoters and enhancers

http://www.ncbi.nlm.nih.gov/pubmed/25383968

  • Here we examine the architecture of transcription initiation through comprehensive mapping of transcription start sites (TSSs) in human lymphoblastoid B cell (GM12878) and chronic myelogenous leukemic (K562) ENCODE Tier 1 cell lines.
  • Using a nuclear run-on protocol called GRO-cap, which captures TSSs for both stable and unstable transcripts, we conduct detailed comparisons of thousands of promoters and enhancers in human cells.
  • These analyses identify a common architecture of initiation, including tightly spaced (110 bp apart) divergent initiation, similar frequencies of core promoter sequence elements, highly positioned flanking nucleosomes and two modes of transcription factor binding.
  • Post-initiation transcript stability provides a more fundamental distinction between promoters and enhancers than patterns of histone modification and association of transcription factors or co-activators. These results support a unified model of transcription initiation at promoters and enhancers.

Evolution

The frailty of adaptive hypotheses for the origins of organismal complexity

http://www.ncbi.nlm.nih.gov/pubmed/17494740

  • Although biologists have always been concerned with complex phenotypes, the matter has recently become the subject of heightened speculation, as a broad array of academics, from nearly every branch of science other than evolutionary biology itself, claim to be in possession of novel insights into the evolution of complexity.
  • First, evolution is a population-genetic process governed by four fundamental forces:
    • Natural selection, for which an elaborate theory in terms of genotype frequencies now exists
    • Mutation is the ultimate source of variation on which natural selection acts
    • Recombination assorts variation within and among chromosomes
    • Genetic drift ensures that gene frequencies will deviate a bit from generation to generation independent of other forces

Others

Native Elongating Transcript Sequencing Reveals Human Transcriptional Activity at Nucleotide Resolution

http://www.ncbi.nlm.nih.gov/pubmed/25910208

Gene expression is circular: factors for mRNA degradation also foster mRNA synthesis

http://www.ncbi.nlm.nih.gov/pubmed/23706738

Gene regulation for higher cells: a theory

http://www.ncbi.nlm.nih.gov/pubmed/5789433

http://embryo.asu.edu/pages/gene-regulation-higher-cells-theory-1969-roy-j-britten-and-eric-h-davidson

Different "classes" of genes defined in this paper:

  • Gene: a region of the genome with a narrowly definable or elementary function. It need not contain information for specifying the primary structure of a protein
  • Producer gene: a region of the genome transcribed to yield a template RNA molecule or other species of RNA molecules, except those engaged directly in genomic regulation
  • Receptor gene: a DNA sequence linked to a producer gene, which causes transcription of the producer gene to occur when a sequence-specific complex is formed between the receptor sequence and an RNA molecule called an activator RNA
  • Activator RNA: the RNA molecules which form a sequence-specific complex with receptor genes linked to producer genes
  • Integrator gene: a gene whose function is the synthesis of an activator RNA
  • Sensor gene: a sequence serving as a binding site for agents which induce the occurrence of specific patterns of activity in the genome
  • Battery of genes: the set of producer genes which is activated when a particular sensor gene activates its set of integrator genes

These (as I understand it) are now known respectively as:

  • Gene = gene
  • Producer gene = transcript
  • Receptor gene = promoter
  • Activator RNA = transcriptional machinery
  • Integrator gene = genes for the transcriptional machinery
  • Sensor gene = enhancers

Functional transcriptomics in the post-ENCODE era

http://www.ncbi.nlm.nih.gov/pubmed/24172201

Summary

  • Perspective from GENCODE's point of view on annotating the human transcriptome
  • A philosophical discussion on "What is a gene?", "What are the criteria for functionality?", and on "Pervasive transcription"
  • What GENCODE has annotated at the time they wrote the paper
  • Approaches for better annotation of "genes" and their functionality

What is a gene?

  • The site of a heritable trait
  • The genomic region from where the mRNA that encodes a protein is transcribed
  • Gerstein et al., proposed in 2007 that "A gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products"
  • A gene is no longer a unit of functionality, rather it is a collective term for a group of products that not only encodes proteins
  • The problem is that we are attempting to retrofit biological complexity into an existing vocabulary
  • This is in line with FANTOM's definition of a "transcriptional unit"

The evolving dogma of gene transcription

  • The biological definition of the gene evolved from "the site of a hereditable trait" to "the genomic region from where the mRNA that encodes a protein is transcribed," which forms the central dogma of molecular biology
  • An updated model includes alternative splicing (AS), AS with a retained intron, read-through transcription, and antisense transcripts

Gene annotations

  • Gene annotation has been based primarily on cDNAs, mRNAs, and ESTs; however, nowadays this is mainly dominated by RNA-Seq methodologies
  • GENCODE (the gene annotation group of ENCODE) represents a merge between manually annotated HAVANA and computationally derived Ensembl models
  • RefSeq also combines manual and automated processes but focuses more on full-length cDNAs
  • UCSC genes combine RefSeq models with additional models from other data sources, such as computational models based on GenBank ESTs
  • This paper uses GENCODE version 16 to discuss transcriptional complexity
  • GENCODE version 16 contains:
    • 20,387 protein coding transcripts, 5,835 lincRNAs, 4,545 antisense lncRNAs, 657 sense intronic lncRNAs, 9,173 small noncoding RNAs, 2,837 unprocessed pseudogenes, 9,911 processed pseudogenes, 158 unitary pseudogenes
    • Unprocessed pseudogenes result from the genomic duplication of protein-coding genes; pseudogenisation may come from the fact that the duplication is partial, or by subsequent mutation
    • Processed pseudogenes are formed by the retroinsertion of mRNAs into the genome sequence, and these loci are thus typically intronless
    • Unitary pseudogenes are protein-coding genes that are pseudogenised in the human lineage, as judged by a comparison with an intact coding ortholog in another species

Defining function

  • Consider the example whereby a transcript with premature termination codons (PTCs) is degraded by the nonsense-mediated decay (NMD) pathway; is this transcript functional?
  • Certain genes utilise NMD for gene regulation and can switch from producing a coding sequence (CDS) transcript to producing an NMD-targeted transcript in order to reduce protein output
  • In terms of gene regulation, this transcript has a function
  • In the context of annotation, the GENCODE team believes that it's appropriate to define a functional transcript as one that makes a contribution to phenotypic complexity, regardless of the mechanism by which this occurs

Defining function via gene expression

  • It is regarded that if a transcript is abundant in the cell, it may be functional
  • This is based on the presumption that specific "errors" in transcription are rare
  • However, transcripts with very low expression levels may be functional and some transcripts may have lost their function but not their transcription potential
  • Another potential indication of functionality is via restricted expression, i.e., where a transcript displays tissue or developmental specificity

Defining biologically non-functional transcripts

  • These are non-functional transcripts that are created by biological mechanisms and not by technical artifacts
  • While the spliceosome is believed to be highly accurate, it does not operate with complete fidelity
  • GENCODE contains 25,466 models classed as retained introns, which may have resulted from the failure of the spliceosome to initiate or complete the splicing of an intron
  • 62% of lncRNA transcripts in GENCODE overlap transposable elements based on unpublished data
  • Certain families such as Alus contain DNA motifs that resemble splicing signals and may form new exons

  • However, one view is that transposition is a form of evolutionary innovation
  • The 3' untranslated regions (UTRs) of protein-coding transcripts in GENCODE are filled with transposable elements (TEs) (unpublished data)
  • Thus, if we assume that transcript creation (via TE insertion or de novo mutations) is the first step towards the generation of new functionality in the transcriptome, then we should anticipate the existence of transcripts that are in the process of being selected for functionality
  • Thus there is no binary classification of functionality, i.e., functional versus non-functional

Just how many transcripts are there?

  • Firstly, transcriptomes can differ significantly between the cells of distinct tissues and developmental stages
  • Splicing abnormalities are commonly observed in cancer cells and immortalised cell lines, which are the main source of ENCODE data
  • Many protocols select for polyadenylated RNAs, mainly to avoid rRNAs; however, there are large amounts of non-polyadenylated RNAs
  • It has been reported that numerous GENCODE lncRNAs may actually represent 3' UTRs of protein coding genes
  • ENCODE found that 62.1% of the genome (combined across 15 cell lines) is covered by processed transcripts extrapolated from sequencing reads, with 34% of the bases lying in intergenic regions
  • Before we quantify the number of transcripts, we need accurate annotations, for instance by combining signal from different technologies such as CAGE and polyA-seq
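
As a rough illustration of combining evidence from two technologies, the sketch below labels each transcript model by whether a CAGE site lies near its 5' end and a polyA site lies near its 3' end. Everything here is an assumption made for illustration: the function names, the 50 bp window, and the simplification to a single chromosome and strand.

```python
from bisect import bisect_left


def has_site_near(sorted_sites, position, window):
    """Return True if any site in `sorted_sites` is within `window` bp of `position`."""
    i = bisect_left(sorted_sites, position - window)
    return i < len(sorted_sites) and sorted_sites[i] <= position + window


def end_support(transcripts, cage_sites, polya_sites, window=50):
    """Label transcript models by independent support for each end.

    transcripts: dict of name -> (five_prime_end, three_prime_end)
    cage_sites:  positions of CAGE-defined transcription start sites
    polya_sites: positions of polyA-seq-defined polyadenylation sites
    All coordinates are assumed to be on the same chromosome and strand.
    """
    cage = sorted(cage_sites)
    polya = sorted(polya_sites)
    labels = {}
    for name, (five_prime, three_prime) in transcripts.items():
        has_5 = has_site_near(cage, five_prime, window)
        has_3 = has_site_near(polya, three_prime, window)
        if has_5 and has_3:
            labels[name] = "both ends"
        elif has_5:
            labels[name] = "5' end only"
        elif has_3:
            labels[name] = "3' end only"
        else:
            labels[name] = "unsupported"
    return labels


# Hypothetical transcript models and sites.
models = {"tx1": (1000, 5000), "tx2": (8000, 9500), "tx3": (20000, 21000)}
print(end_support(models, cage_sites=[1010, 8200], polya_sites=[4990, 9480]))
```

Models supported at both ends are the best candidates for complete transcript structures; the others may be truncated or may need additional evidence.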

Using comparative genomics

  • Comparative genomics is based on the idea that conservation indicates functionality
  • Of the 20,000 protein-coding genes known in human and mouse, at least 80% can be defined as orthologs
  • An obvious limitation is that conservation cannot judge loci that are not conserved and thus cannot be used to judge species-specific loci
  • The usefulness of comparative annotation depends entirely upon the availability of high-quality genome sequences and large pools of transcriptomic data
  • Few vertebrate species are comparable with human and mouse

Functional annotation of lncRNAs

  • Our knowledge of lncRNAs evolved from observations of pervasive transcription across eukaryotic genomes, which was originally suggested using genomic tiling arrays
  • Functional lncRNAs such as HOTAIR may contain a bulk of sequence that does not contribute to their actual function and so does not experience constraint
  • At the present time, a role in the regulation of gene expression looks set to become a central paradigm of lncRNA functionality
  • One approach to identifying functional lncRNAs is by capturing transcripts that interact with chromatin-modification complexes
  • It may not be appropriate to regard lncRNAs as a single homogenous class of transcripts
  • Conclusion: it is difficult to speculate on the proportion of the 22,444 lncRNA transcripts annotated in GENCODE that have genuine functionality

Final words

  • GENCODE aims to generate a comprehensive set of complete transcript structures, onto which biological information can be layered as it becomes available
  • No constraint based on what is deemed functional or not
  • Incorporate different sources of evidence for accurate transcript annotation
  • The true confirmation of transcript functionality, and a detailed understanding of the nature of this functionality, can only be gained in the laboratory
  • Waiting for short-read technologies to be replaced by technologies able to sequence entire transcripts (PacBio, Nanopore sequencing, etc.)
  • Finally, no one knows what proportion of the transcriptome is functional; therefore, the appropriate scientific position to take is to be open-minded!

Base preferences in non-templated nucleotide incorporation by MMLV-derived reverse transcriptases

http://www.ncbi.nlm.nih.gov/pubmed/24392002

  • Reverse transcriptases derived from the Moloney Murine Leukemia Virus (MMLV) have an intrinsic terminal transferase activity that adds non-templated nucleotides, primarily cytosines, to the 3' end of the cDNA. As this mechanism is relatively efficient and occurs in a single reaction, it has recently found use in several protocols for single-cell RNA sequencing.
  • This paper investigates the base preference of the terminal transferase activity, using fully degenerate oligos to determine which bases are added at the template-switching site
  • Used the degenerate riboN library and studied the base composition of the template-independent incorporation, by analysing the bases immediately upstream of the template-switching site in the sequencing reads. Given the assumption that the bases at the corresponding positions of the TSO participate in Watson-Crick pairing with the bases added to the cDNA in a nontemplate manner by the RT, reading the sequencing reads at these positions reveals the nature of the RT-added nucleotides.
  • Found a strong preference for cytosine addition when using SuperScript II
  • However, the preference for cytosine decreases with increasing distance from the end of the transcript
  • Suggested the use of a ribo NGG motif at the end of the template-switching oligonucleotide to capture more cDNA molecules. However, they noted that the increased TSO complexity caused by the degenerate position in the ribo base region might increase the number of artifacts due to mispriming events. They noted that in STRT, barcodes ending in guanosine were more efficient.
  • They also:
  1. tested different lengths of the template-switching oligonucleotide (TSO)
  2. tested TSO concentrations
  3. tested SuperScript II against SuperScript III (and also against using trehalose)
  4. tested different amounts of SSII
  5. tested the number of residues added by SSII
  6. examined the library complexity using their Unique Molecular Identifiers (UMIs)

Repetitive DNA and next-generation sequencing: computational challenges and solutions

http://www.ncbi.nlm.nih.gov/pubmed/22124482

  • This review discusses the computational problems associated with repeats with respect to mapping, de novo assembly and expression profiling and the strategies used to solve them. From a computational perspective, repeats create ambiguities in alignment and in genome assembly because the majority of repeats are longer than the read length of high throughput sequencers.
  • The percentage of short reads (25 bp or longer) that map to a unique location on the human genome is typically reported to be 70–80%, although this number varies depending on the read length, the availability of paired-end reads and the sensitivity of the software used for alignment
  • Assigning reads to the location of their best alignment is the simplest way to resolve repeats, although it is not always correct
  • Essentially, an algorithm has three choices for dealing with reads that multimap:
  1. The first is to discard all multi-reads
  2. The second option is the best match approach, in which the alignment with the fewest mismatches is reported. If there are multiple, equally good best match alignments, then an aligner will either choose one at random or report all of them.
  3. The third choice is to report all alignments up to a maximum number, d, regardless of the total number of alignments found. A variant on this strategy is to ignore multi-reads that align to >d locations.
  • For gene families and genes containing repeat elements, multi-reads can introduce errors in estimates of gene expression
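
The three choices for handling multi-reads, plus the variant of the third, can be sketched as a single function. This is a simplified illustration and not the behaviour of any particular aligner: the function name, strategy labels and the `(location, mismatches)` representation are assumptions.

```python
import random


def resolve_multiread(alignments, strategy, d=10, ties="random", rng=None):
    """Decide which alignments to report for one read.

    alignments: list of (location, mismatches) tuples
    strategy:
      "discard"       - report only reads with exactly one alignment
      "best"          - report the alignment(s) with the fewest mismatches;
                        ties are broken at random or all are reported
      "all_up_to_d"   - report at most d alignments, however many exist
      "ignore_over_d" - report nothing if the read aligns to more than d places
    """
    if not alignments:
        return []
    if strategy == "discard":
        return list(alignments) if len(alignments) == 1 else []
    if strategy == "best":
        fewest = min(mismatches for _, mismatches in alignments)
        best = [a for a in alignments if a[1] == fewest]
        if len(best) == 1 or ties == "all":
            return best
        chooser = rng if rng is not None else random
        return [chooser.choice(best)]
    if strategy == "all_up_to_d":
        return list(alignments[:d])
    if strategy == "ignore_over_d":
        return list(alignments) if len(alignments) <= d else []
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical read with three candidate alignments, two equally good.
candidates = [("chr1:100", 2), ("chr5:900", 0), ("chr9:40", 0)]
for name in ("discard", "best", "all_up_to_d", "ignore_over_d"):
    print(name, resolve_multiread(candidates, name, d=2, ties="all"))
```

The choice matters for expression estimates: discarding multi-reads undercounts genes that share sequence with paralogs or repeats, while a random best match spreads reads between copies regardless of which one is actually expressed.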