Gene Regulation Proposal

= Executive Summary/Introduction =

Regulation of gene expression is one of the primary means of a cell’s control over its structure and function, and representation of gene expression regulation is necessary to capture many known biological pathways and gene regulation networks.

The mechanistic details of gene regulation, however, are usually not important for pathway representations. For example, it is often sufficient to say that a particular transcription factor activates or inhibits a gene or set of genes to convey the role of the transcription factor in the pathway.

We propose to add a 'geneRegulation' subclass of control and a new physicalEntity class called 'gene'. A geneRegulation instance controls a gene, which is linked to its products through a GENE-PRODUCT property. Binding sites of transcription factors and other regulators can also be stored.

= Introduction =

Goals/Motivation
The primary goal of this proposal is to cover pathway-related data present in gene regulation databases such as TRANSFAC, EcoCyc, RegulonDB and others listed under transcription factors/gene regulatory networks at http://www.pathguide.org/.

= Biological Questions / Use Cases =


 * 1) Gene regulation abstraction: The Davidson/Bolouri sea urchin gene regulation network represents transcription factors, the promoter regions they bind, and whether each transcription factor activates or represses a given gene at a specific developmental stage. A single gene regulation event is often supported by a large number of experiments designed to prove the regulation, potentially including transcription factor binding site determination. (http://www.sciencemag.org/cgi/content/full/295/5560/1669)
 * 2) Physical binding of a transcription factor to a segment of DNA. The interaction does not imply gene regulation, because a transcription factor binding site in a promoter region is not sufficient to prove that the factor is involved in gene regulation. However, binding is a prerequisite of transcriptional control.  Often, the physical interaction is measured by chromatin immunoprecipitation or DNA footprinting experiments.
 * 3) Physical binding of microRNA to the 3-prime untranslated region (3'UTR) of an mRNA. The interaction is required for translational repression of the mRNA, though binding does not prove repression. Similar to transcription factor binding of DNA, the binding site must be stored and experimental support may be available.
 * 4) Gene regulatory network inference from transcriptional profiles. Direct and indirect gene regulator-target relationships are computationally inferred from genomic information, such as transcript profiles and genome sequence.
 * 5) Gene regulation (transcriptional and translational) by small molecules, such as in the iron response element.

= Requirements =

User Requirements
Describe:
 * 1) transcription factor binding to DNA (including the binding site) and also microRNA binding to mRNA (including the binding site).
 * 2) gene expression activation/inhibition relationships (involving regulators such as transcription factors and microRNAs)
 * 3) transcription factors acting as complexes, potentially binding to composite regulatory elements
 * 4) multiple transcription factors regulating the same gene, not necessarily at the same time
 * 5) experimental evidence for binding or gene regulation
 * 6) gene regulation by small molecules (not just transcription factors and microRNAs)

At least BioCyc, Cancer Cell Map, INOH, Reactome, Transpath and aMAZE have the above requirements. This is neither a complete list of their requirements nor a complete list of the DX track stakeholders who share them.

Practical Software Requirements

 * 1) None specific; similar to the basic requirements of the rest of BioPAX.

= Proposed Implementation =


 * 1) Add a new subclass of PhysicalEntity called "gene" (sibling to protein, etc.) defined as follows:
   * Class definition: "A physical entity that encodes information that can be inherited through replication i.e. DNA or RNA. This is a generalization of the prokaryotic and eukaryotic notion of a gene."
   * New property: GENE-PRODUCT, range = {protein or RNA}, property definition: "The end-product of a gene, e.g. a protein or microRNA. A gene can have multiple gene products."
   * Existing property: ORGANISM, range = bioSource, property definition: "An organism, e.g. 'Homo sapiens'. This is the organism that the entity is found in. Pathways may not have an organism associated with them, for instance, reference pathways from KEGG. Sequence-based entities (DNA, protein, RNA) may contain an xref to a sequence database that contains organism information, in which case the information should be consistent with the value for ORGANISM."
     * Restriction: Maximum cardinality of ORGANISM = 1
   * New property: BOUND-CONTROL-SEQUENCE, range = {DNA or RNA}, property definition: "The DNA or RNA sequence that is bound to a controlling physical entity, such as a transcription factor, in a geneRegulation interaction. This is meant to capture the binding of other physical entities to a region on or near this gene. For instance, a transcription factor could bind to the promoter region. This property would store the promoter region DNA sequence (not necessarily including the gene itself), which would be in the transcription factor bound state (using the IN-COMPLEX property in the DNA class)."

attachment:geneClass.png

attachment:geneClassDef.png
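
The gene class described above can be illustrated with a minimal Python sketch. This is not the OWL definition and is not taken from the attached files. The Python class and attribute names are hypothetical stand-ins for the BioPAX classes and properties, ORGANISM is simplified to a plain string instead of a bioSource instance, and allowing several BOUND-CONTROL-SEQUENCE values is an assumption, since the proposal states no cardinality for it. The lacZ values are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PhysicalEntity:
    """Simplified stand-in for the BioPAX physicalEntity class."""
    name: str


@dataclass
class Protein(PhysicalEntity):
    pass


@dataclass
class Rna(PhysicalEntity):
    pass


@dataclass
class Dna(PhysicalEntity):
    pass


@dataclass
class Gene(PhysicalEntity):
    # GENE-PRODUCT: range = {protein or RNA}; a gene can have multiple products.
    gene_product: List[PhysicalEntity] = field(default_factory=list)
    # ORGANISM: maximum cardinality 1, so a single optional value
    # (assumption: a string instead of a bioSource instance).
    organism: Optional[str] = None
    # BOUND-CONTROL-SEQUENCE: range = {DNA or RNA}
    # (assumption: more than one value is allowed).
    bound_control_sequence: List[PhysicalEntity] = field(default_factory=list)

    def __post_init__(self):
        # Enforce the property ranges stated in the proposal.
        for product in self.gene_product:
            if not isinstance(product, (Protein, Rna)):
                raise TypeError("GENE-PRODUCT must be a protein or RNA")
        for sequence in self.bound_control_sequence:
            if not isinstance(sequence, (Dna, Rna)):
                raise TypeError("BOUND-CONTROL-SEQUENCE must be DNA or RNA")


# Illustrative instance: a gene with one protein product and one bound promoter.
lacZ = Gene(
    name="lacZ",
    gene_product=[Protein("beta-galactosidase")],
    organism="Escherichia coli",
    bound_control_sequence=[Dna("lac promoter region")],
)
```

In OWL the same constraints would be expressed as property ranges and a cardinality restriction on the class. The runtime checks here only mimic that behaviour.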


 * 2) Add a new subclass of Control called "geneRegulation" (sibling to catalysis and modulation) defined as follows:
   * Class definition: "A control interaction in which a physical entity increases (activates) or decreases (inhibits) the expression of a gene. This can be direct, e.g. transcription factor activates DNA coded gene, or indirect, e.g. kinase activates gene. If the interaction is direct, a binding site must be defined, even if it is unknown. If the interaction is indirect, a binding site must not be defined."
   * Restriction: All CONTROLLED instances must be of class gene
   * Restriction: Minimum cardinality of CONTROLLED = 1

attachment:geneReg.png

attachment:geneRegClass.png

attachment:geneRegCons.png
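
The restrictions on geneRegulation can be sketched in the same spirit. The following Python is only an illustration under stated assumptions: the `direct` flag and the `binding_site` field are hypothetical simplifications (the proposal stores binding sites through bound states from the state proposal, not as a property of geneRegulation), and the control type strings and the "UNKNOWN" marker are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entity:
    """Simplified stand-in for a BioPAX physical entity."""
    name: str


@dataclass
class Gene(Entity):
    pass


@dataclass
class GeneRegulation:
    controller: List[Entity]
    controlled: List[Entity]
    control_type: Optional[str] = None   # placeholder strings; optional
    direct: bool = True                   # hypothetical flag
    binding_site: Optional[str] = None    # hypothetical; "UNKNOWN" if not determined

    def __post_init__(self):
        # Restriction: minimum cardinality of CONTROLLED = 1.
        if len(self.controlled) < 1:
            raise ValueError("CONTROLLED needs at least one gene")
        # Restriction: all CONTROLLED instances must be of class gene.
        if not all(isinstance(item, Gene) for item in self.controlled):
            raise TypeError("CONTROLLED may only contain genes")
        # Class definition: direct regulation defines a binding site,
        # even if unknown; indirect regulation must not define one.
        if self.direct and self.binding_site is None:
            raise ValueError("direct regulation requires a binding site")
        if not self.direct and self.binding_site is not None:
            raise ValueError("indirect regulation must not define a binding site")


# Direct case: a transcription factor with an undetermined binding site.
activation = GeneRegulation(
    controller=[Entity("transcription factor")],
    controlled=[Gene("target gene")],
    control_type="ACTIVATION",
    binding_site="UNKNOWN",
)

# Indirect case: e.g. "kinase activates gene", with no binding site.
inferred = GeneRegulation(
    controller=[Entity("kinase")],
    controlled=[Gene("target gene")],
    control_type="ACTIVATION",
    direct=False,
)
```

Whether direct interactions should really be forced to carry a binding site is one of the open issues of this proposal, so the last two checks reflect the class definition as currently worded and are not a settled rule.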


 * 3) Use the complex topology mechanism defined in the molecular state proposal (StatesProposal) to store DNA or RNA binding site information for transcription factors, microRNAs and other gene expression regulators. In short, binding site information is stored with the physical entity in a state, i.e. the bound site becomes part of the definition of the state of the physical entity. For example, the binding of a transcription factor to a promoter region would be defined using a transcription factor in the promoter bound state and the promoter in the transcription factor bound state (implemented with a symmetric property in OWL).
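
The effect of a symmetric bound-state property can be shown with a small Python sketch. The class name `BoundEntity` and the attribute `bound_to` are hypothetical and do not come from the state proposal. The only point illustrated is that asserting the binding on one partner makes it hold for the other as well.

```python
class BoundEntity:
    """Hypothetical physical entity that records what it is bound to."""

    def __init__(self, name):
        self.name = name
        self.bound_to = set()

    def bind(self, other):
        # Symmetric relation: recording the binding on one side
        # also records it on the other side.
        self.bound_to.add(other)
        other.bound_to.add(self)


transcription_factor = BoundEntity("transcription factor")
promoter = BoundEntity("promoter region")

# Stated once, from the transcription factor's side only.
transcription_factor.bind(promoter)

# The promoter is now in the transcription factor bound state as well.
promoter_is_bound = transcription_factor in promoter.bound_to
```

In OWL a reasoner would derive the reverse statement from the symmetric property declaration. Here the `bind` method does that work explicitly.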

Putting 1 and 2 together:

attachment:geneRegAndGeneClassDiag.png

= Worked Examples =


 * 1) The proposed OWL implementation (built on BioPAX Level 2 v1.0)
   * attachment:geneRegulation-biopax.owl (Note: binding information is now described in the state proposal, so this proposal depends on the state proposal. This OWL file will be updated shortly with one built using the state proposal OWL file.)
 * 2) TODO: The Lac operon
   * attachment:geneRegulation-lac-biopax.owl
 * 3) TODO: An example from EcoCyc of a sigma factor regulating transcription
   * attachment:geneRegulation-ecocyc-eg1-biopax.owl
 * 4) TODO: An example from EcoCyc of a small molecule regulating a transcription factor, forming a complex which in turn regulates transcription
   * attachment:geneRegulation-ecocyc-eg2-biopax.owl
 * 5) TODO: An example from TRANSFAC
   * attachment:geneRegulation-transfac-eg-biopax.owl
 * 6) TODO: An example from INOH
   * attachment:geneRegulation-inoh-eg-biopax.owl
 * 7) TODO: An example of an indirect gene regulation event resulting from the ARACNE gene regulatory network inference algorithm run on 300 human B cell microarray experiments
   * attachment:geneRegulation-aracne-eg-biopax.owl

= Open Issues =


 * Controlled vocabulary (CV) for the DNA class, e.g. promoter.
   * Defining a DNA sequence ontology requires a considerable amount of work. But having a CV property for DNA sequence (e.g. promoter) will help in annotating DNA sequence related pathway data.
 * Do we need a grouping mechanism for DNA sequences (regulation region + gene, or allele)?
   * This would introduce a more flexible way of specifying a transcription unit or a meaningful set of DNA sequences related to gene expression (transcription/translation). The current proposal has to specify this by saying that a gene has a BOUND-CONTROL-SEQUENCE (a DNA/RNA instance) which is IN-COMPLEX (see state proposal for details).
 * How can we differentiate direct gene regulation from indirect gene regulation?
   * a) Should direct interactions require the definition of a binding site, even if it is unknown? This would help differentiate direct from indirect cases, though it would incur the cost of forcing users to define many unknown binding sites. However, if the interaction is known to be direct, usually the binding site is known (because that is often the proof that the interaction is direct). The latter statement needs to be verified by checking actual databases. There can be cases where binding site information is omitted for simplicity even if it is known.
   * b) Another possible way to differentiate indirect from direct cases is to link the gene instance with a DNA sequence group (see the grouping mechanism issue above) in direct gene regulation cases. This restriction may be weaker than the binding site restriction.
 * Have a good definition for gene.
   * The definition of a gene can be very controversial. For example, the first ISMB BioOntology SIG in 1998 started by introducing two different definitions of "Gene" from two different backbone databases.

= Expected growth and plan for growth =


 * 1) The geneRegulation class could be subclassed into specific types of regulation, such as transcriptionalRegulation and translationalRegulation.  Also, conversion could be subclassed into corresponding transcription and translation conversion processes.  This was not done here, because it will not add to information coverage for the specific use cases defined above. These more detailed events can be inferred based on biological "central dogma" rules.
 * 2) Gene expression is regulated in many ways.  Although this proposal is oriented mostly toward regulation of transcription initiation and regulation by small RNAs, many other mechanisms are known, including attenuation, mRNA degradation, etc.  These others are left for future work.
 * 3) Gene regulatory elements, like DNA or RNA sequences e.g. promoter, operon, enhancer, transcription termination site, etc., could be specifically modeled and attached to the gene class. This may require subclassing gene into eukaryotic and prokaryotic forms, since these are quite different from each other e.g. eukaryotic genes are more complex because of splicing.
 * 4) The processes involved in gene expression, though complicated, are well conserved across species and are thought to occur with only subtle variations for the complement of genes within an organism, and thus could be thought of as a generic process (an orthogroup).  It would be desirable to develop a mechanism in BioPAX to represent common complicated (multi-step) generic processes (i.e. that operate on a variety of substrates with a variety of products, e.g. transcription, translation, and protein degradation via the proteasome).  If such a "template" mechanism becomes available, it may be preferable to use it to represent gene regulation over the ‘short-hand’ representation presented in this proposal. The current proposal is simple and is similar to common biological gene regulation representational styles, so it may still be preferred even if generic processes were available.

= Relation to external ontologies and plan for use =


 * 1) The Sequence Ontology (SO) lists different types of genes. These types could be used to annotate genes in BioPAX.  SO is intended to be used in concert with a heavily curated genome annotation project or automated genome annotation effort.  It is likely that SO annotation is out of BioPAX scope, since the SO information is easy to find in existing sequence databases that are linked to using database identifiers in BioPAX (much like most information in UniProt is not included in a BioPAX protein instance).

= OWL considerations =


 * 1) The same considerations apply as for the rest of BioPAX Level 1 and 2.

= Backward Compatibility =


 * 1) This proposal only adds classes and properties, and is therefore backward compatible with Level 2.

= Notes =


 * The stoichiometric coefficient for a gene is always 1.
 * The binding of a transcription factor to a piece of DNA should be represented as a DNA-bound transcription factor (and by definition a transcription factor-bound DNA).
 * Gene regulation involving polycistronic mRNA (mRNA that encodes multiple proteins e.g. from operons) should contain all of the co-transcribed genes in the CONTROLLED property. Polycistronic mRNA is rare in eukaryotes but common in prokaryotes.
 * In cases where multiple regulators (e.g. transcription factors) operate on the same gene but exert their effects independently of each other, a separate geneRegulation instance should be created for each.
 * In cases where multiple regulators (e.g. transcription factors) work in conjunction (i.e. all are required in order for the control effect to occur) but they are not known to act as a complex, the regulators should be listed together in the CONTROLLER property of the geneRegulation instance (note: this does not include the normal, non-specific components of the gene expression machinery). In these cases, multiple binding locations of the various CONTROLLERs can be listed as bound states of the controllers. E.g. multiple kinases that are required together to activate a transcription factor via phosphorylation, which in turn activates gene expression.
 * For transcription factors that act as a complex, a complex instance should be created and used as the CONTROLLER. This rule covers transcription factors binding to composite elements. E.g. two transcription factors form a complex, which then goes on to regulate transcription.
 * The GENE-PRODUCT property of the gene class provides an explicit link between a regulated gene and its products (RNA, protein, or both). It is therefore recommended that the GENE-PRODUCT property always be filled if a gene is to be used in a geneRegulation instance, so that a pathway graph may be traversed through the gene to its product protein or RNA, rather than terminating at the gene. Only the end product of the expression should be listed for this purpose. For instance, the product may go on to be a CONTROLLER in other gene regulation interactions.
 * The geneRegulation class should be used to represent all types of gene regulation events, not just those involving protein transcription factors. For example, if a microRNA promotes degradation of messenger RNA prior to translation, an inhibitory geneRegulation instance should be created between the microRNA and the gene (the physical interaction between the microRNA and the messenger RNA can be listed as the bound state of the microRNA to capture binding information). The bacterial case in which small molecules regulate a transcription factor by binding to it, with the resulting complex activating transcription, should be represented using geneRegulation as well.
 * CONTROL-TYPE on geneRegulation is optional in case we also want to include pure binding studies (see No. 2 under Biological Questions / Use Cases). Also, there are cases where the effect depends on the presence of other proteins. For example, an activating transcription factor can be 'transformed into a repressor' by another protein. This could also be interpreted differently: either the transcription factor is inhibited by the other protein (although there is a difference in emphasis), or the complex (and therefore a different transcription factor) is the repressor.
 * An example of an indirect interaction is "kinase activates gene expression". We don't know what the exact intermediaries are, but we assume that at least one transcription factor is involved somewhere downstream from the kinase. These types of indirect interactions are often published, especially as a result of computational methods that infer gene regulatory networks.
 * A gene class is proposed, instead of just using the existing DNA and RNA classes, since a DNA or RNA sequence is not always a gene (thus doesn't have gene products), and having a specific gene class will make it easier to define gene-specific regulatory DNA elements, like promoter regions, if required in the future.
 * If the gene regulation process itself needs to be described, e.g. interactions involving all of the components of the transcriptional machinery, the large complexes should not be defined as CONTROLLERs in a gene regulation instance. A gene regulation instance implicitly states that non-specific gene regulation machinery is present, so defining it again would be redundant and would lead to multiple ways of defining the same information, making data integration more difficult.
 * We are defining gene regulation as a shorthand, so it assumes that the 'normal' transcription machinery, i.e. machinery that is not specific to a gene or operon, is present. In eukaryotes, at least, there are 3 RNA polymerase complexes for different gene classes, e.g. mRNA coding vs. ribosomal RNA coding. These would be considered normal given the class of gene, though this information would be implicit.
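
The note recommending that GENE-PRODUCT always be filled can be illustrated with a short traversal sketch in Python. The data layout (a list of controller/gene pairs and a dictionary of gene products) and all entity names are hypothetical. The sketch only shows why a pathway graph stops at the gene when the product link is missing.

```python
from collections import deque


def downstream(start, regulations, gene_products):
    """Breadth-first walk from a regulator through genes to their products.

    regulations:   list of (controller, controlled gene) pairs, one per
                   geneRegulation instance.
    gene_products: dict mapping a gene to its GENE-PRODUCT values.
    Returns the reached nodes in the order they were discovered.
    """
    reached = []
    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # Genes controlled by this node, then products of this node if it is a gene.
        neighbours = [gene for controller, gene in regulations if controller == node]
        neighbours += gene_products.get(node, [])
        for neighbour in neighbours:
            if neighbour not in visited:
                visited.add(neighbour)
                reached.append(neighbour)
                queue.append(neighbour)
    return reached


# Two independent geneRegulation instances forming a small cascade.
regulations = [("TF_A", "gene_B"), ("protein_B", "gene_C")]
gene_products = {"gene_B": ["protein_B"], "gene_C": ["protein_C"]}

with_products = downstream("TF_A", regulations, gene_products)
without_products = downstream("TF_A", regulations, {})
```

With the product links present, the walk continues from gene_B through protein_B to gene_C and protein_C. With an empty `gene_products` mapping, it ends at gene_B, which is the situation the note is meant to avoid.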

= References =

Recommended reading:
 * A Genomic Regulatory Network for Development - Science 2002, Davidson et al.
 * Genes and Signals - 2001, Ptashne and Gann
 * MicroRNA targets in Drosophila - Genome Biology 2003, Enright, John, Gaul, Tuschl, Sander, Marks
 * Reverse engineering of regulatory networks in human B cells - Nature Genetics 2005, Basso, Margolin, Stolovitzky, Klein, Dalla-Favera, Califano


= Discussion of proposal =

Discuss this proposal on the /Discussion page.