GenericsProposal-Generics and everything else are classes

= Executive Summary =

''This is a proposal under construction. Although contributions have been made by the people noted above, those people should not be considered to have endorsed this proposal in any way. Alan Ruttenberg takes responsibility for its contents.''

Virtually all facts in biology are known only to a certain level of precision, although sometimes there appears to be enough precision that a term might sound "concrete". Where we cannot fully specify what a term refers to, we call the term generic. Use cases are listed in GenericParticipants and include homology, polymerization, state, and molecular classes. This proposal, when complete, will attempt to address as many of those use cases as possible.

This proposal goes further than just proposing the representation of generics. In fact we assert that in talking about proteins in living cells we never have complete information about the state of proteins and other macromolecules. Biological experiments virtually never measure effects of single molecules. In order to accurately represent this we propose that even what are typically called "concrete" proteins are in fact generalizations. Translated into our representation language, this corresponds to denoting such generalizations as classes, where we say that the class represents a set of physical molecules, which we would call instances.

This proposal focuses solely on getting the meaning as close as possible to what is desired. It does not address issues of user adoption, efficiency, or conciseness. It may be extended to deal with these issues at a later time.

= Introduction =

Goals/Motivation
There are several motivations for this proposal:
 * Desire to represent partially specified information about classes of objects
 * Desire to capture what we actually can know from typical biological experiments involving stateful proteins.
 * Uncertainty about what an instance vs class means
 * Desire to have a unified States/Generics proposal
 * Desire to use OWL in a manner which is consistent with its specification, and which best models the kinds of information known as pathway knowledge.

= Biological Questions / Use Cases =

This proposal, when completed, is intended to cover the use cases defined in GenericParticipants, as well as those covered in the states proposal. These include homology, polymerization, state that is partially, totally, or sufficiently specified, and molecular classes, as well as interactions that describe biological relationships between these entities.

In its current state, the proposal focuses on representing three hypothetical facts involving P53, possibly phosphorylated, activating or inhibiting apoptosis.


 * 1) Represent that P53 (state unspecified) inhibits apoptosis
 * 2) Represent that P53 (state unspecified) activates apoptosis (contradicting fact)
 * 3) Represent that P53 (state partially specified) activates apoptosis

= Requirements =


 * Only classes should be used to represent the facts, in order to accurately reflect the state of what we know from biological experiments, as described in the executive summary above. Put another way, there is no reference to physical instances of molecules, since biological experiments rarely, if ever, manipulate single, completely characterized molecules.

To make my assumptions clear, I list what I assume about how we come to know a typical (but hypothetical) fact that one encounters in pathway descriptions.

When we say P53 inhibits Apoptosis, for the purpose of the example, we assume the following.
 * There is some assay for apoptosis.
 * Presence of P53 is assayed by antibody assay.
 * Antibody assays sample only partial state.
 * Two experiments were done. In the first, the system is manipulated so that P53 (or a particular state thereof) is present, and its presence is measured. In the second experiment the researcher endeavored to keep all conditions but the P53 status unchanged. In both cases the endpoint (apoptosis) was measured. It was verified that in the first experiment apoptosis occurred, and in the second it didn't.
 * We assume that what we have learned from the experiment is that one or more forms of P53, with at least the partially specified state, are responsible for the effect, given the condition. Since we won't in general record the condition in BioPAX, we can say that the statement is qualified by exists(condition)( statement ). Since this is true of all our facts, I'll omit it from the formal presentation.

User Requirements

 * 1) The user represents information to the level of detail they have. For protein state we have, so far, amino acid modifications, specified by position if available, and by status (is, isn't) if available.

Practical Software Requirements

 * 1) Some queries require an OWL reasoner.

= Proposed Implementation =


 * 1) All terms referred to in pathway data are represented as classes that are subclassed from appropriate classes, and restrictions added so as to narrow the set of things referred to as much as possible.
 * 2) There is therefore a new class for each mention of a protein, each interaction, each pathway etc.
 * 3) ModifiedSequences are those which have at least one SequenceModification.
 * 4) SequenceModification has two properties: a position (integer) and a status (boolean). See /ModifiedIrrelevant for an extension that handles the cases where the status is known not to matter.
 * 5) AminoAcidModifications are a subclass of SequenceModifications. In addition to the inherited position and status, they also have a modificationType, which is an AminoAcidModificationType. The subclasses of AminoAcidModificationType are currently Phosphorylation and SerinePhosphorylation.
 * 6) ModifiedProtein is a subclass of Protein and ModifiedSequence where the modifications are AminoAcidModifications.
 * 7) As a placeholder for interactions we have Process, with subclasses Control and Pathway.
 * 8) Control is subclassed into inhibition and activation. Note that the process hierarchy will reflect the kind of distinctions currently in BioPAX. I've not specified more here so as not to complicate the presentation.
 * 9) To represent a specific protein referred to in a finding we will create a subclass of Protein or ModifiedProtein as appropriate.
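
The hierarchy listed above can be mirrored with ordinary Python classes as a quick sanity check of the subclass relations. This is only an illustrative sketch under several assumptions: Python subclassing merely approximates OWL subclassing, restrictions such as "at least one SequenceModification" are not enforced, the classes `P53` and `P53SerinePhosphorylatedAt200` are hypothetical finding-specific subclasses in the sense of item 9, and nesting SerinePhosphorylation under Phosphorylation is an assumption suggested by the query example in the OWL considerations section.

```python
# Illustrative mirror of the proposed class hierarchy (not an OWL model).

class SequenceModification:
    """Carries a position (integer) and a status (boolean), when known."""
    position = None  # None means "not specified"
    status = None

class AminoAcidModificationType: ...
class Phosphorylation(AminoAcidModificationType): ...
class SerinePhosphorylation(Phosphorylation): ...  # assumed nesting

class AminoAcidModification(SequenceModification):
    """Inherits position and status; adds a modification type."""
    modification_type = AminoAcidModificationType

class ModifiedSequence: ...
class Protein: ...
class ModifiedProtein(Protein, ModifiedSequence): ...

class Process: ...
class Control(Process): ...
class Pathway(Process): ...
class Inhibition(Control): ...
class Activation(Control): ...

# Item 9: each protein mentioned in a finding becomes its own subclass.
class P53(Protein): ...                                  # hypothetical name
class P53SerinePhosphorylatedAt200(P53, ModifiedProtein): ...  # hypothetical name
```

With these definitions, `issubclass(P53SerinePhosphorylatedAt200, Protein)` holds, which is the kind of relation a reasoner would be expected to confirm on the real ontology.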



= Worked Examples =

1. P53 controlling Apoptosis

SampleOntology.owl


 * P53 Inhibits Apoptosis:




 * Things serine phosphorylated at position 200.




 * P53 Serine Phosphorylated at 200




 * P53 Serine Phosphorylated at 200 activates apoptosis



See ModifiedIrrelevant
 * P53 Serine Phosphorylated at 200 with phosphorylation state at 100 irrelevant activates apoptosis
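
The OWL for these examples lives in SampleOntology.owl and is not reproduced on this page. As a rough guide to the shape such definitions might take, the sketch below builds Manchester-style class expressions as strings. It is not the content of that file. The property names `hasController`, `hasModification`, and `hasModificationType` are taken from the query examples later in this proposal, while `hasControlled`, `position`, `status`, and all class names such as `P53SerPhos200` are assumptions made for illustration. The ModifiedIrrelevant example is not covered.

```python
def some(prop, filler):
    """Existential restriction in Manchester-style syntax."""
    return f"{prop} some ({filler})"

def value(prop, literal):
    """Restriction of a property to a fixed literal value."""
    return f"{prop} value {literal}"

def both(*parts):
    """Intersection of class expressions."""
    return " and ".join(parts)

# Things serine phosphorylated at position 200.
ser_phos_at_200 = some("hasModification", both(
    some("hasModificationType", "SerinePhosphorylation"),
    value("position", 200),
    value("status", "true"),
))

# Hypothetical class names mapped to their necessary conditions.
definitions = {
    "P53InhibitsApoptosis": both(
        "Inhibition",
        some("hasController", "P53"),
        some("hasControlled", "Apoptosis"),  # assumed property name
    ),
    "P53SerPhos200": both("P53", ser_phos_at_200),
    "P53SerPhos200ActivatesApoptosis": both(
        "Activation",
        some("hasController", "P53SerPhos200"),
        some("hasControlled", "Apoptosis"),  # assumed property name
    ),
}

for name, expr in definitions.items():
    print(f"{name} SubClassOf {expr}")
```

Each finding gets its own class, and the controller of the third control class is itself a narrowed subclass of P53, in line with items 1, 2, and 9 of the proposed implementation.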

= Open Issues =



= Expected growth and plan for growth =


 * 1) There will be more subclasses of AminoAcidModificationType.
 * 2) There will be other protein modification types (e.g. conformational state).
 * 3) The decided upon pathway ontology will be placed below Pathway.
 * 4) BioPAX level 1 constructs will be ported to new representation.
 * 5) Elaborate entities for genetic loci, genes, gene products, RNA, polypeptides, and the rest of the objects referred to in molecular biology.
 * 6) Refer to a protein by a locus ID by making it a subclass of Protein and a subclass of the gene products of the gene at that locus.

= Relation to external ontologies and plan for use =


 * 1) PSI-MI and RESID, to define subclasses of AminoAcidModificationType.
 * 2) Pathway ontology terms.

= OWL considerations =


 * 1) A query can be constructed by defining the class of objects you are looking for and then asking the reasoner for its subclasses.
 * 2) Example 1: Controlled by P53. The class has a necessary and sufficient condition: hasController some P53. Classification finds all three control reactions.
 * 3) Example 2: Controlled by a phosphorylated protein: hasController some (hasModification some (hasModificationType some Phosphorylation)). Classification finds the single control reaction with Serine Phosphorylated P53 as the controller.
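
The two queries above can be imitated with a small structural subsumption check. This is a sketch only, not a substitute for an OWL reasoner: it handles named classes and `some` restrictions, ignores positions and statuses, and assumes an acyclic set of told axioms. The class names in `AXIOMS` (for example `SerPhosP53` and `P53InhibitsApoptosis`) are hypothetical stand-ins for the three control reactions from the use cases.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Some:
    """Existential restriction: prop some filler."""
    prop: str
    filler: object  # a class name (str) or a nested Some

# Told axioms: class name -> superclass expressions (hypothetical names).
AXIOMS = {
    "Phosphorylation": ["AminoAcidModificationType"],
    "SerinePhosphorylation": ["Phosphorylation"],
    "Control": ["Process"],
    "Inhibition": ["Control"],
    "Activation": ["Control"],
    "P53": ["Protein"],
    "SerPhosP53": ["P53", Some("hasModification",
                               Some("hasModificationType", "SerinePhosphorylation"))],
    "P53InhibitsApoptosis": ["Inhibition", Some("hasController", "P53")],
    "P53ActivatesApoptosis": ["Activation", Some("hasController", "P53")],
    "SerPhosP53ActivatesApoptosis": ["Activation", Some("hasController", "SerPhosP53")],
}

def entails(sub, sup):
    """True if `sub` is subsumed by `sup` under the told axioms."""
    if sub == sup:
        return True
    if isinstance(sub, Some):
        # A restriction is subsumed by one on the same property
        # whose filler is more general.
        return (isinstance(sup, Some) and sub.prop == sup.prop
                and entails(sub.filler, sup.filler))
    # `sub` is a named class: try each told superclass expression.
    return any(entails(parent, sup) for parent in AXIOMS.get(sub, []))

def subclasses_of(query):
    """Named classes subsumed by the query class expression."""
    return [name for name in AXIOMS if entails(name, query)]

controlled_by_p53 = Some("hasController", "P53")
controlled_by_phosphorylated = Some(
    "hasController",
    Some("hasModification", Some("hasModificationType", "Phosphorylation")))

print(subclasses_of(controlled_by_p53))
print(subclasses_of(controlled_by_phosphorylated))
```

Under these axioms the first query returns all three control classes and the second returns only the one whose controller is the serine-phosphorylated P53, matching the behaviour described in examples 1 and 2.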



= Backward Compatibility =

This proposal is not backward compatible. It may be possible to automatically convert a large portion of BioPAX level 1.

= Notes =


 * 1) See the several-month discussion on the BioPAX discuss list.
 * 2) Ken Fukuda is concerned with the case called homology on the generic participant page. Specifically, he is concerned that in curation from the scientific literature generic names are used, and that we need to connect those generic names to "concrete" proteins, where concrete proteins in his sense are those that have a sequence identifier such as one in Swiss-Prot. He gives the example of SMAD being used in the literature, where one needs to know that it refers to SMAD1, SMAD2, or SMAD3, and cites Reactome as a similar use case. We would model this case by defining SMAD to be the union of SMAD1, SMAD2, and SMAD3.
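
The union modelling of SMAD in note 2 can be sketched as follows. This is a minimal illustration, assuming a plain lookup table stands in for the OWL union axiom. The names `UNION_OF`, `is_subclass`, and `concrete_candidates` are hypothetical helpers, not part of BioPAX.

```python
# Hypothetical told axiom: SMAD is defined as the union of its members.
UNION_OF = {"SMAD": {"SMAD1", "SMAD2", "SMAD3"}}

def is_subclass(sub, sup):
    """True if `sub` equals `sup` or is a member of the union defining `sup`."""
    return sub == sup or sub in UNION_OF.get(sup, set())

def concrete_candidates(name):
    """Concrete proteins that a generic literature name may refer to."""
    return sorted(UNION_OF.get(name, {name}))

print(concrete_candidates("SMAD"))
print(is_subclass("SMAD2", "SMAD"))
```

A fact curated against the generic name SMAD then applies to the union class, and each of SMAD1, SMAD2, and SMAD3 is a subclass of it, so a query for SMAD also retrieves facts stated about the concrete proteins.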

-

/Discussion:Discuss this proposal.