StatesAndGenericsProposal

= (!) DRAFT - DRAFT - DRAFT (!) =

= Executive Summary =

Current BioPAX physical entities are groupings of multiple molecular species. For example, p53 can have many different phosphorylation states. Similarly, a change in the cellular location of a molecule can also be considered a change in its information context. Finally, formation of a molecular complex induces molecular changes in the members of the complex, so they should also be treated as different states. Although some of these changes can currently be represented in BioPAX, they are duplicated for each reaction. One also often needs more complex state definitions such as “not phosphorylated at serine 18”, which again map to multiple molecular species, often referred to as “generics”. This document proposes a refactoring of BioPAX to support states of physical entities, and generic entities, in a more formal manner.

We identify three types of state variables: chemical modification (e.g. phosphorylation), localization (e.g. cytosolic) and complex member (e.g. member of the mdm2-p53 complex). Each combination of the variables, called a variable set, defines a separate state. Interactions involving identical physical entities in a specific state (all with the same variable set) reference the same state instance.

So far we have only considered concrete states, in the sense that each state represents, to the best of our knowledge, a single molecular species. States that cover more than one molecular species are called generic physical entities. A detailed domain analysis of generics can be found at [GenericParticipants]. We identified three major cases of generics. The first case is created by homology groupings, such as the WNT family of proteins. The second case is created by generalizations over chemical modifications, such as all p53s phosphorylated at a certain residue, regardless of other modifications. The third case is created by generalization over a chemical structure pattern, such as alcohols or glycogen.

This proposal identifies the primary requirements as being able to


 * 1) Specify and retrieve equivalent participants.
 * 2) Specify and retrieve equivalent variables of an entity.
 * 3) Cover don't care statements for chemical modifications.
 * 4) Cover incomplete statements for chemical modifications.
 * 5) Cover negative statements for chemical modifications. (!)
 * 6) Specify the binding sites  and topology of the complex.
 * 7) Address complex members individually as states.
 * 8) Accommodate homology groupings.
 * 9) Accommodate homologous chemical modifications.
 * 10) Remove redundant representation of entity-common and state-common data.

As an implementation we propose to:
 * 1) Move LOCATION property from physicalEntityParticipant to PhysicalEntity.
 * 2) Remove sequence feature property from physical entity participant.
 * 3) Introduce a subclass of utility class called reference data, and move all entity-common information from physical entity to this class. Reference data is further subclassed to DNA Data, RNA Data etc.
 * 4) Define all known sequence features in the reference entity. They are later referred to from the physical entities. Although this list can always be computed on the fly, it is here to serve several purposes:
   * a) It defines the state space.
   * b) It allows defining restrictions on variables themselves, which otherwise would not be possible without subclassing states.
   * c) It effectively differentiates unknown sequence features from unknown-condition sequence features.
 * 5) Subclass sequence feature to dna, rna and protein sequence features. This is useful for restricting features to proper subtypes.
 * 6) Define three new properties, MODIFIED-AT, NOT-MODIFIED-AT and REGARDLESS-OF, for specifying sequence features in the applicable physical entities (dna, rna and protein). These are a subset of the corresponding reference entity's possible variable set.
 * 7) Define a new property, IN-COMPLEX, in the physical entity to specify that the state is in a complex. A physicalEntity instance that participates in a complex is considered a member of that complex only (it can't be reused). Note that this also allows defining subcomplexes.
 * 8) Define a new subclass of utility class called bindingPartner, which keeps track of a sequence feature describing the bound site, and another boundSite object to describe the target. The latter property is symmetric and functional.
 * 9) Define a new property with domain physical entity and range bindingPartner to define the actual topology.
 * 10) Remove sequenceEntityParticipant, as it is no longer used.
 * 11) Define a subclass of utilityClass, called referenceDataGroup, for representing groupings over entities. ReferenceDataGroup can be used for both enumerated and undefined groupings.
 * 12) Define a subclass of utilityClass, called sequenceFeatureGroup, for representing groupings over groups of modifications.
 * 13) The following Non-OWL restrictions are also defined:
   * a) A physicalEntity can be owned by at most one complex. All interactions of this physicalEntity are considered in the context of this complex.
   * b) All variables in a state must be selected from the variable set of the owner reference entity.
   * c) A variable should be addressed by only one of the MODIFIED-AT, NOT-MODIFIED-AT and REGARDLESS-OF properties per physical entity.
   * d) A sequence feature group must group members of the same type, and must be owned by a proper physical entity. This can be turned into an OWL constraint by subclassing sequenceFeatureGroup. Should we do it?

= Introduction =

Goals/Motivation
The primary goal of this proposal is to cover state representations present in existing databases such as BioCyc, Reactome, aMAZE, INOH and PATIKA. This proposal also tries to identify and cover some generalizations over states.

= Domain Analysis =

The most common class of state changes involves chemical modifications of macromolecules, such as post-translational group additions or DNA methylation. TRANSPATH embeds such modifications at the end of modified molecule names, which may show the type and location of the modification. For instance, SRF(h){pS103} is the name of human SRF protein phosphorylated at serine 103. Kohn interaction maps define each chemical modification with a node attached to the entity. Other systems explicitly store the type of modification and the sequence site for each modification. The type of modification is represented with a controlled vocabulary term. For post-translational group additions to proteins, PSI-MI offers a relatively complete controlled vocabulary. Modifications to RNA and DNA molecules are, however, not currently captured, mostly due to lack of data. There is existing work on possible RNA modifications [\cite PMID: 7518580 ], and the RNA modifications database [cite http://medlib.med.utah.edu/RNAmods/] is one possible source that can be incorporated into these data models in the future. More complex events such as disulfide bonds, cleavage or splicing require defining sequence fragments. BioPAX allows each modification to be associated with either a sequence site or a fragment.

As a side note, chemical modifications are (loosely speaking) metabolic mechanisms. However, chemical modifications differ from synthesis and degradation in that the substrate and product are very similar in overall structure, are much larger than the modification group, and preserve most of their properties. In most pathway representation systems, covalent changes to small molecules are treated as producing a new molecule rather than as chemical modifications. Figure 1 shows some post-translational modifications (PTMs) on the p53 protein.
Note that any combination of these 'potential' PTMs could occur, and each combination might differ in its information context and behaviour, and thus should be modeled as a separate instance (Yang2005).

Anchor(fig:p53PTMS) http://www.gene-regulation.com/pub/databases/transpath/6.0/doc/doc/maps/p53_sites.gif

Figure 1: Known phosphorylation sites on p53 and their upstream effectors, and some known biological consequences of these PTMs. Three orthogonal biological functions (e.g. improved acetylation) of p53 are controlled by different PTMs, and some PTMs are responsible for more than one aspect. Image courtesy of BioBase.

Changing the location of a molecule is another way to control its downstream interactions, and is modeled by almost all pathway representation systems. GO provides an extensive list of cellular compartments. Some transmembrane proteins and complexes can span multiple compartments. Recently, Demir et al. and Karp et al. proposed ontologies to define relations between compartments, such as inclusion or sub-part-of. Such relations are helpful in validating models, as one can define additional constraints on the location of the reaction participants. Additionally, such relations are useful for visualization and automated layout of the pathway elements. [cite dogrusoz et al. ] PATIKA allows defining different sets of compartments for different types of cells. FieldML[cite] allows spatial definitions of compartments and gradients, which can be used to represent such containment relations quantitatively. Open issues in representing subcellular location include representing dynamic compartments, such as the Golgi apparatus, and other spatial features that are not compartments, such as the axon hillock or chromatin.

Some functional changes occur due to long-lived, non-covalent bonds between molecules. These range from binary protein binding to large assemblies such as the proteasome complex. A complex is minimally represented by enumerating its members. BioPAX and PATIKA also keep track of complex stoichiometry. In the scientific literature a member of a complex is often addressed individually. In some cases the subcellular localizations of the members can also differ. Several signaling databases allow addressing these points[cite PATIKA, Reactome, INOH and Transpath]. TRANSPATH has a unique way of representing the roles of complex members in reactions. Reactions are partitioned into “decompositions”, formed of the acting portion of the actors for that particular reaction. PATIKA and INOH allow representing the binding relations between complex members. INOH additionally annotates the binding site. BIND goes a step further and provides facilities for defining the specific binding surface of the interaction.

Subcomplexes, i.e. complexes within complexes, appear in several ontologies[cite Reactome, Kohn, BioPAX]. As opposed to complexes, they are not experimental observables and their exact semantics are rarely specified. Depending on the context, they can represent functional groupings, such as catalytic or regulatory subunit or abstractions based on experimental and/or historical reasons. Representing assembly order with subcomplexes is not a recommended practice as there can be multiple ways to create a complex. Instead, assembly order should be captured using reactions.

Kohn maps allow defining chained binary binding relations, which can potentially be used to capture subcomplexes and binding relations. However, these chained binary relations cannot represent overlapping subcomplexes or cyclic binding topologies.

Complex formation adds an additional level of combinatorial complexity. Proteins that already have multiple states can combine to produce combinatorially many similar complexes. Hlavacek et al. [cite 14708119] provide examples of such complexes. It is possible to generate such complexes automatically [cite 15217809] for purposes of analysis. A missing capability in all pathway representation systems is the ability to define an arbitrary set of these complexes and associate it with a function.

Major mechanisms in the evolution of pathways were gene duplication and exon shuffling [ cite]. As a result, pathways have many elements that can be generalized to a canonical representation. The most common generalization method is based on so-called “EC numbers”, a classification of enzymatic function developed by the Enzyme Nomenclature Committee[cite]. EC numbers allow associating a group of reactions with a group of enzymes that are capable of catalyzing these reactions, which proves to be a very useful abstraction. Signaling pathway databases further expand this concept to group any functionally similar pathway elements. For example, the WNT protein family consists of 17 different WNTs [cite 11806834] that interact with frizzled proteins using a similar mechanism. Enumeration-based abstractions exist in Reactome and PATIKA to capture such groupings. INOH goes a step further and explicitly groups sub-pathways that share a similar interaction mechanism. Panther and Phosphosite also keep track of homologous phosphorylation sites across homologous proteins.

There are many examples where a term such as p53 ambiguously refers to some or all of these states. Also, sequence, domains, structure (partially), etc. are shared by all p53 states. This shared information is stored in UniProt, where all these molecules (states) are represented with a single record. Finally, some experiments only observe the shared information and are blind to the state. For example, the yeast two-hybrid (Y2H) experimental method only identifies protein sequence, so you can only infer that *some* states of p53 interact with some states of MDM2. p53 and MDM2 in this case are reference entities (and thus map to several states). In the case of WNT, things are even more complicated, as we are no longer talking about a single molecule.

Current biological representation systems are inconsistent in the way they represent states. At one extreme, states are totally omitted and only reference physical entities are used, often supported with free text to explain the actual mechanism involving states. At the other extreme, e.g. in simulation models, physical entity representation is totally dismissed and each molecule/state is treated as a separate physical entity.

In BioPAX level 2, physicalEntity classes are the reference physical entities discussed above. Some state related data, such as known PTMs or cellular location, are represented in the physicalEntityParticipant.

The terminology used by different groups is surprisingly diverse and conflicting. This table attempts to provide a cross-reference. For the rest of the document we will be using BioPAX terms:

= Biological Questions / Use Cases =

Covered Cases
[todo : add cases for complexes]

Biological Question 1: What is the reason for an observed increase in cytosolic ser20 phosphorylated p53? Such questions occur frequently as biologists try to come up with new hypotheses related to the observation. A model can facilitate the hypothesis formulation process by representing the existing knowledge unambiguously. With current BioPAX, answering this query first requires identifying PEPs that satisfy the given criteria.

Biological Question 2: How are signals from NfKB transmitted to p53? Graph queries, such as shortest path or their modified forms can be helpful for this case. However traversing over reference physical entities for such a question is not meaningful, as for example not every signal received by B-catenin from NfKB is transmitted to p53. We need to traverse the graph where nodes are states (in BioPAX Level 2, this would require us to identify identical PEPs, which is inconvenient).
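The difference between traversing reference entities and traversing states can be illustrated with a small sketch. The graphs below are toy data with placeholder names (`A`, `B`, `C` and the state labels are invented for illustration and do not describe real biology); the search itself is a plain breadth-first shortest path.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Breadth-first search over a directed graph given as {node: [successors]}."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for successor in edges.get(node, []):
            if successor not in previous:
                previous[successor] = node
                queue.append(successor)
    return None  # goal is not reachable from start


# Entity-level graph: B receives a signal from A and sends one to C.
entity_edges = {"A": ["B"], "B": ["C"]}

# State-level graph: A acts on state B@x, but only state B@y acts on C.
state_edges = {"A:active": ["B:@x"], "B:@y": ["C:active"]}

entity_path = shortest_path(entity_edges, "A", "C")
state_path = shortest_path(state_edges, "A:active", "C:active")
```

In this toy example `entity_path` is `["A", "B", "C"]`, suggesting a route that does not exist once states are distinguished: `state_path` is `None`.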

Biological Question 3: What happens if p53 is mutated so that the Ser20 phosphorylation site is rendered non-functional? If a state includes this phosphorylation site, it must be considered non-functional. To know this, state variables must be defined and we must be able to track identical variables across states of the same physical entity (identity between variables of different physical entities does not exist, except for location). This question is an example of "from sequence to pathway" questions, where one needs to associate a sequence feature with a mechanism.

Use Case 1-Flux Analysis: Flux analysis is useful for inferring temporal aspects of cellular processes by considering an optimality criterion. Similar to Biological Question 2, flux analysis requires knowledge of identical participants/states, as it considers them members of the same pool of molecular species.

Modeling Case-1 - Addressing Complex Members : ''CRP binds to DNA -- but only if it is complexed with at least one molecule of cAMP. When it is bound to DNA, it functions as an activator and helps RNA polymerase bind to an adjacent promoter.'' Here the important point is that the literature refers to the individual member of the complex, e.g. a bound state of it. Although one might claim that the extra level of detail is illusionary, or group thinking, it is very common in the literature, and might be worth covering.

Modeling Case-2 - State of a Complex, without knowing the subunit states: An example of a complex phosphorylated was noted by Keith Robison via Alan Ruttenberg. In the paper, addition of a phosphatase is used to inactivate the cyclosome with the authors concluding "These results demonstrate that activity of cyclin-ubiquitin ligase is controlled by reversible phosphorylation of the cyclosome complex". In the discussion they note "Further purification of the cyclosome complex is required to define its subunit composition, to identify the subunits phosphorylated, and to obtain quantitative information on the extent of phosphorylations relevant to the regulation of its activity."

Modeling Case-3 - Generic physical entities (by Homology): ''Control of Rb/E2F gene expression. E2F family members in complex with DP (dimerization partner) proteins, bind consensus E2F sites in promoters of genes and promotes gene expression in co-operation with the basal transcription initiation machinery (basal factors). Rb binding blocks activation by inhibiting the interaction between the E2F activation domain and basal transcription initiation factors.'' (Halaban, 2005) In this case E2F and DP are not single proteins; each refers to a group of proteins that are evolutionarily similar or similar in sequence, and whose interaction/reaction graphs have large portions that are the same (isomorphic). However, in other parts they diverge; in fact, different E2Fs might have opposing effects (Frolov et al, 2001). Thus both the generic and the concrete examples need to be modeled. The GenericParticipants document lists several other aspects where generalization of states occurs.

Modeling Case-4 - Incomplete variable specification: Sometimes we only know the existence of a specific variable but not whether other variables are present or not. Sometimes, regardless of other variables the interaction will occur, and we would like to express this. (TODO: put an example here).
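A minimal sketch of how an incomplete specification could be matched against concrete states is given below. The function names and the site labels (`S15`, `S20`) are hypothetical. The sketch also assumes that any site not named in the pattern is left unconstrained when matching; it therefore does not distinguish "don't care" from "don't know", which the proposal treats as separate statements.

```python
def matches(modified_sites, modified_at=(), not_modified_at=()):
    """True if a concrete state's modified sites satisfy the pattern.

    Sites in modified_at must be modified, sites in not_modified_at must be
    unmodified, and every other site is unconstrained (an assumption).
    """
    modified_sites = set(modified_sites)
    required_present = set(modified_at) <= modified_sites
    required_absent = modified_sites.isdisjoint(not_modified_at)
    return required_present and required_absent


def select(states, **pattern):
    """Names of all concrete states matched by the pattern, sorted."""
    return sorted(name for name, sites in states.items()
                  if matches(sites, **pattern))


# Four concrete states of one protein with two hypothetical sites.
states = {
    "unmodified": set(),
    "pS15": {"S15"},
    "pS20": {"S20"},
    "pS15_pS20": {"S15", "S20"},
}

at_s15 = select(states, modified_at=["S15"])
not_at_s20 = select(states, not_modified_at=["S20"])
both = select(states, modified_at=["S15"], not_modified_at=["S20"])
```

Each incomplete pattern maps to several concrete states (`at_s15` and `not_at_s20` each select two of the four), and only the fully specified pattern `both` selects a single one.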

(Currently) Uncovered Examples
Biological Question 4: What happens when p53 is activated? This is a very common question, yet also ambiguous, since there are several (possibly combinatorially many) different molecules that can be considered active, in different interaction contexts. This is an example of a logical variable, other examples being open/closed and wild type/mutated. All of these logical variables map to one or more physical variables.

Rationale for not covering:

Logical variables are manifestations of physical variables in a certain context. We may not know the exact physical variable, but can assess the logical state. Considering that example, leaving out logical variables would decrease our coverage. However the state subgroup decided that modeling contexts would be very difficult, and such data was often not stored by databases. The subgroup unanimously agreed to leave logical variables out for this proposal.

Modeling Case -5: Polymerization

= Requirements (For the covered cases) =


User Requirements
We need to be able to:

1. Specify and retrieve equivalent participants. This is a primary requirement for data exchange and integration, as identifying equivalent states without a formal definition of a state is not straightforward and might not always be feasible. Note that the identity can not be absolutely defined as it depends on the level of granularity of the model.

1. Specify and retrieve equivalent variables. This can be minimally done by listing all possible modification types and associating a controlled vocabulary term with each one. Obviously a sub-ontology is also a possibility. We consider the following variables:
 * a) Cellular Location: Basically cellular compartments, but this can also be extended to special points in the cell like the axon hillock, or even chromatin. Location can also have an attachment aspect: a cytoplasmic protein can be attached to a membrane.
 * b) Chemical Modification: This is by far the most diverse and common variable. PTMs, RNA splicing, DNA methylation all fall into this class.
 * c) Complex Member: Any molecule in a complex needs to be treated as a state, as well as the complex itself. We want to address complexes and their members individually.

1. Cover don't care, don’t know and negative statements for chemical modifications. Most chemical modification information in the literature is incomplete. Most data is collected from experiments where the scientists checked for only one site but not for combinations of sites, so the most common statement for chemical modifications is don’t know. There are cases where a certain modification did not affect, or disrupted, a function, forming don’t care and negative statements. These are relatively rare. FigureX lists 15 known different phosphorylation sites on p53 and its controllers. A form of generalization is often required for representing different modified states of entities. Multiple modification sites on a single entity may lead to a combinatorial explosion of states. Sometimes these modification sites can be additive [\cite 9931323]. However, in most cases they are responsible for different functions and can be combinatorial in nature [\cite 11734846]. Considering only the combinations of these phosphorylation sites, one can define approximately 30,000 different species of p53.

1. Address complex members individually as states. In the scientific literature a member of a complex is often addressed individually. In some cases the subcellular localizations of the members can also differ. Representing these requires the ability to address each complex member individually.

1. Specify the binding sites and topology of the complex. Related to the previous requirement, this allows encoding further detail on the structure and members of the complex.

1. Accommodate homology groupings and homologous chemical modifications. This is a requirement to capture abstractions in the literature as well as a certain type of incompleteness.
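The estimate of approximately 30,000 p53 species in the chemical modification requirement above follows from simple counting, sketched here under the assumption that each of the 15 phosphorylation sites is independently either modified or unmodified.

```python
from math import comb

n_sites = 15  # number of phosphorylation sites quoted in the text

# Each site is either modified or not, so there are 2**15 combinations.
n_states = 2 ** n_sites

# Cross-check: count the subsets of sites by how many sites are modified.
by_size = sum(comb(n_sites, k) for k in range(n_sites + 1))

print(n_states)  # 32768
```

Both counts agree at 32,768, which is the "approximately 30,000" quoted above. Adding location or complex membership as further variables would multiply this number again.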

Data Model Requirements
1. Track different molecules with the same path of synthesis. This is already provided by BioPAX, however any improvement should not break this requirement.

1. Remove redundant representation of entity-common and state-common data. Duplicated data is considered bad for a knowledge representation system, as the modifications are harder to make, and it is easier to introduce inconsistencies.

1. Specify which variables are possible, given a type of molecule. It has two aspects:
 * a) Only a small subset of actual state variable space is possible for physical entities, for example only specific sites on proteins can receive specific chemical modifications. Although technically it is possible to specify all possible state variables at the physical entity level, and then let states be defined only in this variable space, practically this data does not exist for all proteins. We can still adopt an "add on the go" approach though.
 * b) Some variables can only be applied to certain types of entities. For example phosphorylation is only valid for proteins.

Tool Requirements
= Proposed Implementation =

Here is the latest proposed implementation in OWL


 * attachment:StatesAndGenericsProposal.owl
 * Move LOCATION property from physicalEntityParticipant to PhysicalEntity.
 * Remove sequence feature property from physical entity participant.
 * Introduce a subclass of utility class called reference data, and move all entity-common information from physical entity to this class. Reference data is further subclassed to DNA Data, RNA Data etc.
 * Define all known sequence features in the reference entity. They are later referred to from the physical entities. Although this list can always be computed on the fly, it is here to serve several purposes:
   * a) It defines the state space.
   * b) It allows defining restrictions on variables themselves, which otherwise would not be possible without subclassing states.
   * c) It effectively differentiates unknown sequence features from unknown-condition sequence features.
 * Subclass sequence feature to dna, rna and protein sequence features. This is useful for restricting features to proper subtypes.
 * Define three new properties, MODIFIED-AT, NOT-MODIFIED-AT and REGARDLESS-OF, for specifying sequence features in the applicable physical entities (dna, rna and protein). These are a subset of the corresponding reference entity's possible variable set.
 * Define a new property, IN-COMPLEX, in the physical entity to specify that the state is in a complex. A physicalEntity instance that participates in a complex is considered a member of that complex only (it can't be reused). Note that this also allows defining subcomplexes.
 * Define a new subclass of utility class called bindingPartner, which keeps track of a sequence feature describing the bound site, and another boundSite object to describe the target. The latter property is symmetric and functional.
 * Define a new property with domain physical entity and range bindingPartner to define the actual topology.
 * Remove sequenceEntityParticipant, as it is no longer used.
 * Define a subclass of utilityClass, called referenceDataGroup, for representing groupings over entities. ReferenceDataGroup can be used for both enumerated and undefined groupings.
 * Define a subclass of utilityClass, called sequenceFeatureGroup, for representing groupings over groups of modifications.
 * The following Non-OWL restrictions are also defined:
   * a) A physicalEntity can be owned by at most one complex. All interactions of this physicalEntity are considered in the context of this complex.
   * b) All variables in a state must be selected from the variable set of the owner reference entity.
   * c) A variable should be addressed by only one of the MODIFIED-AT, NOT-MODIFIED-AT and REGARDLESS-OF properties per physical entity.
   * d) A sequence feature group must group members of the same type, and must be owned by a proper physical entity. This can be turned into an OWL constraint by subclassing sequenceFeatureGroup. Should we do it?
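Because restrictions b) and c) cannot be expressed in OWL, a tool would have to check them separately. The following is one possible sketch of such a check; `check_state`, its arguments and the site labels are hypothetical and are not taken from the attached OWL file.

```python
def check_state(reference_features, modified_at, not_modified_at, regardless_of):
    """Return a list of violation messages; an empty list means the state is valid."""
    reference_features = set(reference_features)
    properties = {
        "MODIFIED-AT": set(modified_at),
        "NOT-MODIFIED-AT": set(not_modified_at),
        "REGARDLESS-OF": set(regardless_of),
    }
    problems = []

    # Restriction b): every variable must come from the reference entity's set.
    for name, features in properties.items():
        for feature in sorted(features - reference_features):
            problems.append(f"{feature} in {name} is not defined on the reference entity")

    # Restriction c): a variable may be addressed by only one of the three properties.
    names = list(properties)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            for feature in sorted(properties[first] & properties[second]):
                problems.append(f"{feature} appears in both {first} and {second}")

    return problems


known_sites = {"S15", "S20"}  # hypothetical sequence features of a reference entity

valid = check_state(known_sites, ["S15"], ["S20"], [])
invalid = check_state(known_sites, ["S15"], ["S15"], ["T99"])
```

The first call returns no problems. The second reports two: `T99` is not a feature of the reference entity, and `S15` is addressed by both MODIFIED-AT and NOT-MODIFIED-AT.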

= Worked Examples =


 * 1) Description of worked example 1.
   * attachment:workedExample1
 * 2) INOH feedback and examples
   * attachment:INOHFeadbackAndExamples.doc

= Test Cases =

= Open Issues =

1. How to address controlled vocabularies. Which controlled vocabularies to use for various state proposals? E.g. to specify protein post-translational modifications.

= Notes =

Every molecule in the cell is unique: it occupies a unique space and has a unique information context. We tend to think of molecular species as pools of actual molecules that are sufficiently similar for purposes of pathway modeling. By grouping similar instances into pools, we are basically reducing them to a manageable size. One rule of thumb for deciding whether something constitutes a pool is to consider whether it has an observable concentration, but the actual groupings are almost always defined by our desired and observable level of detail. An important consequence is that in reality every physical entity is generic (i.e. it is a set of molecules when considered at a more detailed level of granularity), but we may not want to treat it as such for practical purposes.

An important aspect when considering generic physical entities (see GenericParticipants) is that they represent pools of molecules that can possibly be overlapping. Think about the example where a protein can be phosphorylated at two sites, x and y. Just considering these modifications, there are 4 pools of molecules:
 * 1) only @x,
 * 2) only @y,
 * 3) none,
 * 4) x and y.

Now consider the generic definition of @x, which maps to the union of pools 1 and 4. Similarly, a generic definition of @y maps to pools 2 and 4. These states, which seem to have nothing in common, actually overlap (they are not independent). Thus an increase in the concentration of one can sometimes mean an increase in the concentration of the other. Although these definitions are good for capturing the incompleteness of the domain, and have sufficiently clear semantics for several use cases including querying, they are totally incompatible with the biochemical/chemical paradigm and the methods assuming conformance to it, such as flux analysis.
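The overlap can be made concrete with a small sketch using the pool numbering from the list above. The concentration values are made up purely for illustration.

```python
# Pools as numbered above: 1 = only @x, 2 = only @y, 3 = none, 4 = x and y.
generic_x = {1, 4}  # generic "@x": modified at x, regardless of y
generic_y = {2, 4}  # generic "@y": modified at y, regardless of x

overlap = generic_x & generic_y  # pool 4 belongs to both generics

# Made-up concentrations per concrete pool (arbitrary units).
concentration = {1: 10.0, 2: 5.0, 3: 20.0, 4: 2.0}


def generic_concentration(pool_ids, conc):
    """Concentration of a generic is the sum over the pools it covers."""
    return sum(conc[p] for p in pool_ids)


x_before = generic_concentration(generic_x, concentration)
y_before = generic_concentration(generic_y, concentration)

# Increase only the doubly modified pool.
concentration[4] += 3.0

x_after = generic_concentration(generic_x, concentration)
y_after = generic_concentration(generic_y, concentration)
```

Raising pool 4 alone raises both generic concentrations by the same amount, so the two generics cannot be treated as independent pools, for example in a flux balance.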

It is thus important to model physical entities as concrete (not generic) whenever possible. The distinction between generic and concrete in the ontology should be based on current experimental knowledge: anything we know to be formed of multiple instances, but would nevertheless like to group for complexity management or because of incomplete information, should be modeled as a generic. The rest should be modeled as a concrete physicalEntity.


 * 1) The initial subgroup, formed by Aviv Regev, Vincent Schachter, Mike Cary and Gary Bader, already produced a document, which is available at * and was used as a reference for this document. The initial group also proposed an ontology for modeling cellular states, but it is now obsolete.
 * 2) This proposal makes use of controlled vocabularies for modeling state variables. How we are going to import different controlled vocabularies is an open technical/implementation issue. As for the currently available sources:
   * a) Protein Modifications -> The most extensive ontology is from PSI.
   * b) RNA Modifications -> There is a useful database at http://medstat.med.utah.edu/RNAmods/, and there is a GO subtree for RNA Modifications.
   * c) DNA Modifications -> The best available source is the GO subtree for DNA Modifications. Note, however, that these terms are oriented towards the process rather than the modification itself, and thus need manual cleaning before use.

= Backwards Compatibility =
 * 1) We have removed sequenceEntityParticipants, which were previously used for modeling interaction-specific sequence modifications. This data must be modeled with states in this proposal.

/Discussion:Discuss this proposal.