KeggConversionSummary

This is a summary of the discussions on BioPAX-discuss related to the KEGG BioPAX conversion.

The current exporter was implemented by the North Bears team led by Chris Lemer. Development is currently discontinued.

The current conversion is available at: ftp://ftp.genome.jp/pub/db/community/biopax

Last status report from Chris Lemer:
Currently, all KEGG compounds except the RNAs are converted into BioPAX smallMolecule.

Glycans with a unique KEGG compound id are converted into BioPAX smallMolecule.

All reactions involved in a metabolic pathway are converted into BioPAX biochemicalReaction, except those involved in a "multiple enzymes/multiple reactions" relation, e.g. reaction 1.2.1.8 of the eco00260 pathway. These reactions are excluded from the conversion because there is no way to resolve the ambiguity of which specific enzyme catalyzes which specific reaction. The controller of a BioPAX catalysis is a gene product represented as a protein, which is an approximation, since an enzyme is often a complex of polypeptides in a given state.

KEGG pathways that involve any excluded entity are left out of the conversion entirely. Thus all converted pathways are complete.

Elements that are not currently converted:

- generic compounds (compound database)
- tRNA (compound database)
- glycans matching several compounds (glycan database)
- oligosaccharides and polysaccharides (glycan)
- reaction steps involving both multiple enzymes and multiple reactions

This results in a pathway conversion rate of approximately 80%. This figure is an average over all organisms; individual organisms may have higher or lower rates.
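The all-or-nothing exclusion rule described above can be sketched as follows. This is only an illustration, not the exporter's actual code: the category labels, function names, and input shape (a mapping from pathway id to the kinds of elements it contains) are all assumptions made for the example.

```python
# Hypothetical labels for the element categories listed above as not converted.
UNCONVERTED_KINDS = {
    "generic_compound",
    "trna",
    "ambiguous_glycan",
    "oligo_or_polysaccharide",
    "multi_enzyme_multi_reaction_step",
}


def is_convertible(element_kinds):
    """A pathway is converted only if none of its elements is of an unconverted kind."""
    return UNCONVERTED_KINDS.isdisjoint(element_kinds)


def conversion_rate(pathways):
    """Fraction of convertible pathways; `pathways` maps a pathway id to its element kinds."""
    if not pathways:
        return 0.0
    converted = sum(1 for kinds in pathways.values() if is_convertible(kinds))
    return converted / len(pathways)
```

Because a single excluded element rules out the whole pathway, the rate for a given organism depends on how its excluded elements are spread across pathways, which is consistent with the rate varying between organisms.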

Generic Compounds (by AR[Alan Ruttenberg])
KEGG generic compounds are partially ambiguous and currently cannot be expressed in BioPAX.

For example:

Acetoacetyl-[acp] C4H5O2SR http://www.genome.jp/dbget-bin/www_bget?compound+C05744

Here the R can be anything, and this is a problem. (Strictly, the R stands for some particular organic radical, not any radical; the ambiguity is that the entry does not say which one.)

Susumu Goto commented on their plans to handle the issue:

In terms of the generic compounds, we will define them and build a hierarchical classification according to that, which will be an extension of the current compound classification available from our BRITE page.

In addition to those that include an R group and the classes listed by Gary, we have compounds whose chiralities are not specified. For example, the COMPOUND database has glucose, D-glucose, alpha-D-glucose, and beta-D-glucose. Although how they are used is context dependent, we can classify them in a hierarchy, such that glucose is a parent of D-glucose, D-glucose is a parent of alpha-D-glucose and beta-D-glucose, etc.

We will start to add those parent-child relationships to the COMPOUND database soon, but it will be a time-consuming process.
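The parent-child relationships in the glucose example can be pictured with a small sketch. It assumes a simple child-to-parent mapping keyed by compound name; the data structure and function names are invented for illustration and do not reflect how KEGG stores or plans to store the hierarchy.

```python
# Parent links taken from the glucose example above.
# Plain names stand in for KEGG COMPOUND entries (an assumption for readability).
PARENT = {
    "D-glucose": "glucose",
    "alpha-D-glucose": "D-glucose",
    "beta-D-glucose": "D-glucose",
}


def ancestors(compound):
    """Return the increasingly generic parents of a compound, nearest first."""
    chain = []
    while compound in PARENT:
        compound = PARENT[compound]
        chain.append(compound)
    return chain


def is_specialization_of(compound, generic):
    """True if `generic` appears somewhere above `compound` in the hierarchy."""
    return generic in ancestors(compound)
```

With such links in place, a reaction annotated with the generic "glucose" could in principle be matched against any of its more specific forms by walking up the chain.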

Later AR posted two lists about KEGG generic compounds:

a. is a list of compounds which have either the string "R" or the string ")n" in the chemical formula for the entry.

b. is a list of compounds which have no formula listed. Most of these are generics or markers saying the compound was transferred to the drug database.

Gary Bader checked lists a and b and found:

a. I did find examples where an "R" is part of an element symbol, e.g. Ru or Rb, and these should not be included in the list.

C13932: H42N14O2Ru3.6Cl

C13515: HORb
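One way to avoid the Ru/Rb false positives is to split the formula into element-like tokens before looking for a standalone "R". The sketch below is not the script AR used to build list a; the regular expression and function name are assumptions, and it treats any capital letter plus an optional lowercase letter as one symbol.

```python
import re

# One element-like token: an uppercase letter optionally followed by a lowercase letter.
SYMBOL = re.compile(r"[A-Z][a-z]?")


def is_generic_formula(formula):
    """Return True if the formula has a standalone R group or a ")n" repeat marker."""
    if ")n" in formula:
        return True
    # "Ru" and "Rb" are consumed as two-letter tokens, so they never yield a bare "R".
    return "R" in SYMBOL.findall(formula)


print(is_generic_formula("C4H5O2SR"))         # C05744, the R-group example above
print(is_generic_formula("H42N14O2Ru3.6Cl"))  # C13932, contains ruthenium
print(is_generic_formula("HORb"))             # C13515, contains rubidium
```

The first call reports a generic formula, while the two entries Gary flagged are no longer matched.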

b. I scanned the list and found many different types of things. There are many generics, but seemingly an equal number of larger "small molecules" that are not generic, but don't have a formula listed, probably because it is hard to write down. So, general classes:
 * 1) generic "small molecules" e.g. Reduced flavodoxin
 * 2) generic proteins (many of these) e.g. Histone-L-lysine
 * 3) generic other stuff e.g. Bacterial cell wall, e- (electron)
 * 4) things where the name is the formula e.g. C05590: HI
 * 5) Small molecules with no formula (not many of these) e.g. Pinene (though there could be alpha or beta forms, this is not out of bounds of other small molecules with a formula)
 * 6) Actual proteins and RNA e.g. C02754: Yeast cytochrome c (this may be a generic, depending on how many cytochromes there are in yeast)

Generic Reactions
Dhivya Arasappan asked, regarding the conversion status report:

When you mentioned that you have excluded pathways in which there are multiple enzymes for single reactions, I began to wonder why this is? Is this because of some technical issues with conversion or is it because the biopax data model itself doesn't allow us to store such multiple enzyme reactions?

Emek Demir replied:

I believe the problem is representing homology reactions, i.e. reactions that are similar in nature and can be catalyzed/controlled by a class of enzymes. However, the exact pairing is either full (n*m), not known, or not specified. Full and not specified have their cures, but not known has problems.

Chris Lemer commented:

Deciding to generate the N*M combinations is a matter of choice, but I think that this decision is up to KEGG.

Susumu Goto replied:

In the KEGG pathway maps, we usually omit cofactors such as NAD and NADP. For example, there is a case where we put two oxidoreductase reactions into one enzyme box; one uses NAD and the other uses NADP as a cofactor. In many cases, we cannot distinguish them from the protein sequences, so I think the best way is to generate the N*M combinations.

Another case is protein complexes. We are planning to add a group tag for complexes, which will solve the problem. In fact, group tags are already defined for some of the complexes in the signaling pathways in KEGG. But for now, I think the current best practice is also to generate all the combinations.
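Generating the N*M combinations amounts to taking the Cartesian product of the enzymes and the reactions that share one enzyme box. The sketch below only illustrates that idea; the function name and the gene and reaction identifiers in the usage note are hypothetical placeholders, not real KEGG entries.

```python
from itertools import product


def expand_catalyses(enzymes, reactions):
    """Return every (enzyme, reaction) pair for one ambiguous enzyme box."""
    return list(product(enzymes, reactions))
```

For a box holding two gene products and two reactions, `expand_catalyses(["geneA", "geneB"], ["rxn1", "rxn2"])` yields four candidate catalysis pairs. Some of these pairs may not occur biologically, which is the approximation accepted when the exact pairing is unknown.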