StatesAndGenericsProposal/Discussion

Nicolas:

"We identify three types of state variables, chemical modification (e.g. phosphorylation), localization (e.g. cytosolic) and complex member (e.g. member of mdm2-p53 complex). Each combination of the variables, called a variable set, defines a separate state. Interactions involving identical physical entities in a specific state (all with the same variable set) reference the same state instance."

One very important type of state variable is missing in my opinion: the conformational "state", defined by the relative position of atoms. For instance, a channel can be open or closed, an enzyme active or inhibited (I am not talking about inhibition by another molecule, but about inhibition due to a different conformation: see the conformations of CaMKII). Another example is the protonation of proteins, which affects their function. For instance, some enzymes are active in the lysosomes but not in the cytosol (that is just an example; cellular location is not an answer). A CRUCIAL difference is that we need a mechanism to represent NON-BINARY variables, e.g. active, inactive and desensitized, or a channel with several conductance states.

"Current biological representation system are inconsistent in the way they represent states. In one extreme states are totally omitted, and only reference physical entities are used, often supported with free text to explain the actual mechanism involving states. In the other extreme, e.g. simulation models, physical entity representation is totally dismissed and each molecule/state is treated as a separate physical entity."

"e.g. MOST simulation models ..."

Cf. StochSim (http://www.pdn.cam.ac.uk/groups/comp-cell/StochSim.html), which uses variable sets and generics.

In the table of correspondence, you can add that in SBML Level 2 Version 2, the "Reference Data" is the class "SpeciesType".

"Define three new properties, MODIFIED-AT, NOT-MODIFIED-AT, REGARDLESS-OF for specifying sequence features in the applicable physical entities ( dna, rna and protein) .These are a subset of the corresponding reference entity's possible variable set."

That implies that the only state variables are those which modify the sequence. As I said previously, one can imagine other cases, such as conformation and protonation. As a consequence, although I think the idea is nice, I would not use this terminology, and I think it should be attached to the pool rather than to the sequence.

More generally, the existence of a non-binary state variable allows the usage of wildcards:

state-variable name=Thr34-phosphorylation state={on|off|?}

state-variable name=Thr34-glycosylation state={on|off|?}

where '?' means "any" or "does not matter"
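
As a rough illustration of the wildcard notation above, here is a minimal Python sketch. The names `StateVariable`, `ANY` and `matches` are hypothetical and chosen only for this example; they are not part of the proposal or of any BioPAX API.

```python
from dataclasses import dataclass

ANY = "?"  # wildcard: "any" or "does not matter"


@dataclass(frozen=True)
class StateVariable:
    """A named state variable with a finite set of allowed values (hypothetical)."""
    name: str
    allowed: frozenset
    value: str = ANY

    def __post_init__(self):
        # Reject values outside the declared set, except the wildcard.
        if self.value != ANY and self.value not in self.allowed:
            raise ValueError(f"{self.value!r} is not a valid state for {self.name}")

    def matches(self, observed: str) -> bool:
        """True if a concrete observed value is compatible with this variable."""
        if observed not in self.allowed:
            raise ValueError(f"{observed!r} is not a valid state for {self.name}")
        return self.value == ANY or self.value == observed


ON_OFF = frozenset({"on", "off"})

phos = StateVariable("Thr34-phosphorylation", ON_OFF, "on")
glyc = StateVariable("Thr34-glycosylation", ON_OFF, ANY)

print(phos.matches("off"))  # False: phosphorylation is required to be "on"
print(glyc.matches("off"))  # True: the wildcard accepts any allowed value
```

The same class accepts a larger `allowed` set, so a non-binary variable needs no separate mechanism in this sketch.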

Alan:

There is a basic problem with the notion of "does not matter": it is often ambiguous, context dependent, or wrong.

''What does not matter in one interaction can matter in another. In this sense, "does not matter", if anything, is a property of the molecule in the context of a pathway, a reaction, or a model (all of these, in different cases).''

Nicolas:

Absolutely! And this is why it has to be attached to the so-called pool. But to state that something does not matter already contains more information than not saying anything.

Therefore, if I have a physicalEntity X with two state variables:

catalytic-state: {on|off}

Thr306P: {on|off|?}

The control of an interaction by X[catalytic-state{on}]

and

the control of an interaction by X[catalytic-state{on};Thr306P{?}]

are semantically different. In the former case, I do not know whether the phosphorylation of Thr306 affects the control, whereas in the latter case I know that the phosphorylation of Thr306 does NOT affect the control.
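
The difference between the two descriptions can be sketched in a few lines of Python. This is only an illustration under assumed names: the function `effect_of` and the dictionary encoding of a controller's state are hypothetical, not something defined by the proposal.

```python
ANY = "?"  # explicit statement that a variable does not affect the interaction


def effect_of(controller_state: dict, variable: str) -> str:
    """Classify what a controller description says about one state variable.

    Returns one of:
      "required:<value>"  the variable must have that value
      "irrelevant"        explicitly stated not to matter ('?')
      "unknown"           not mentioned, so nothing is asserted
    """
    if variable not in controller_state:
        return "unknown"
    value = controller_state[variable]
    return "irrelevant" if value == ANY else f"required:{value}"


former = {"catalytic-state": "on"}                  # X[catalytic-state{on}]
latter = {"catalytic-state": "on", "Thr306P": ANY}  # X[catalytic-state{on};Thr306P{?}]

print(effect_of(former, "Thr306P"))          # unknown
print(effect_of(latter, "Thr306P"))          # irrelevant
print(effect_of(latter, "catalytic-state"))  # required:on
```

Treating an unmentioned variable as "unknown" rather than "irrelevant" is what keeps the two descriptions distinct.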

Alan: In the case that you present, where one is making a strong statement (presumably experiments with and without Thr306P have been done), I agree. And if this is what the state proposal suggests is the meaning of '?', then it should be documented (the defining experimental evidence, in particular).

But the proposal makes stronger statements, like having to enumerate the possible state variables in the reference data, information which is often incomplete. How are we to take this list of possible states? As some of them? If so, this is easily derivable by a query on the database, represents redundant information, and is therefore prone to error accumulation without careful maintenance. As all of them? Then this is problematic from a data integration point of view. What if the referenceData of two databases are talking about the same thing, but one knows more phosphorylation sites than the other? Of course, neither may know the complete set. And so on.
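
The two readings of the list can be contrasted with a small Python sketch. Everything here is assumed for illustration: the function names are hypothetical and the site identifiers are placeholders, not real data from any database.

```python
def merge_known_sites(*site_sets):
    """'Some of them' reading: the union is a lower bound on the known sites."""
    merged = set()
    for sites in site_sets:
        merged |= set(sites)
    return merged


def disagreements_if_complete(a, b):
    """'All of them' reading: sites listed by only one source become contradictions."""
    return set(a) ^ set(b)


# Placeholder site lists for the same protein in two hypothetical databases.
db_a = {"site-1", "site-2"}
db_b = {"site-1", "site-2", "site-3"}

print(sorted(merge_known_sites(db_a, db_b)))          # ['site-1', 'site-2', 'site-3']
print(sorted(disagreements_if_complete(db_a, db_b)))  # ['site-3']
```

Under the first reading the sources merge without conflict, but the merged list still cannot be taken as complete. Under the second reading the two sources disagree about `site-3`.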

I would also point out that your distinction makes use of the so-called open-world assumption (that if you don't state something, you don't know it), something that the DX proposals reject. That OWL operates under the open-world assumption is one of the great benefits of its use for representing biological knowledge. So by choosing to disregard it, the DX proposals lose important expressivity, and generate semantically incorrect OWL.

Again, insofar as the DX proposal addresses short-term needs of a subset of the BioPAX community, it is up to them to make these choices. But once this is out of the way, we need to be working on solutions to these issues.

''And to know that something really does not matter, you need to do the experiment with and without the molecule in the state that you claim doesn't matter. Many times this is not done. When it is, it is often worth recording the two experiments. More often, only a particular set of variables is recorded and the model is constructed from these. And in many cases these models don't work, because some variable that was not known was not recorded, and it mattered.''

''Then there is the blurring of what exactly we are talking about. Are we talking about partial descriptions of things in the world, or are we talking about computational models of pathways, which, depending on the methodology, can make different assumptions about what does and doesn't matter, and about the level of representation, for example continuous versus discrete?''

''If we are talking about things in the world, then we need to realize that state is really a shortcut for representing an approximation of something more complicated, and we should provide provisions for linking the more elaborate knowledge to the shortcut as we acquire it, or if some particular database chooses to represent an area in more detail (compare, e.g., "phosphorylation" with the descriptions you find in the RESID database).''

''If we are talking about some specific model, then we should probably say so, so that when we attempt to integrate knowledge from different sources we don't have the wrong sorts of information mixing together because of differing assumptions of the models.''

''None of this may be a problem for the short-term needs of DX (hard to see why it wouldn't be, but hey, I don't write their requirements), but resolving these issues is important as we move forward, particularly for those of us who need to be able to meaningfully combine information from multiple sources.''

OR

state-variable name=Thr34 state={naked|phosphorylated|glycosylated|?}

state-variable name=open state={on|off|?}

state-variable name=desensitized state={on|off|?}

where '?' means "any" or "does not matter"

OR

state-variable name=channel-state state={open|closed|desensitized|?}
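
To compare the two channel encodings above, the following Python sketch enumerates the concrete states each one admits (ignoring the '?' wildcard). It is only an illustration: the variable names come from the examples in this thread, and whether a channel that is both open and desensitized is meaningful is a modelling assumption, not something the proposal settles.

```python
from itertools import product

# Encoding 1: two independent binary variables.
binary_vars = {"open": ("on", "off"), "desensitized": ("on", "off")}
binary_states = [
    dict(zip(binary_vars, combo)) for combo in product(*binary_vars.values())
]

# Encoding 2: one multi-valued variable.
channel_states = ("open", "closed", "desensitized")

print(len(binary_states))   # 4
print(len(channel_states))  # 3

# The extra combination under encoding 1 has both flags set to "on".
both_on = [
    s for s in binary_states if s["open"] == "on" and s["desensitized"] == "on"
]
print(both_on)  # [{'open': 'on', 'desensitized': 'on'}]
```

With independent binary variables, any constraint between them has to be stated separately. A single multi-valued variable makes the values mutually exclusive by construction.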

SUMMARY:

1) We should extend the states to modifications that do not imply chemical modification (yes, I know a protonation can be seen as a chemical modification; you see what I mean).

2) There may be room for non-binary state variables.

Alan:

The second major issue with the proposal, from my point of view, is the notion of reference data. The issue with it is that it effectively says: there is something the same about this physicalEntity and the reference data, but it is up to you to figure out what. In order to support generics, physical entities can point to reference data that resolves to groups of physical entities, each of which may have conflicting information, such as sequence. As far as I can tell, there is no theory or mechanism for deciding which of these properties is valid or specific to the physicalEntity in your reaction, and what the status of the rest is.

This makes more sense in SBML which is explicitly about making computational models of a certain kind. In BioPAX, which is intended to describe information at a different level, and for a wider variety of uses, in particular integration (as opposed to simple accumulation), it is a problem.

Once again, perhaps not an issue for DX, but of serious concern to others.