Background and motivation
Algorithms for analyzing, designing and mining large scale probabilistic networks have huge
applications in a number of fields like Bioinformatics, Sociology, Commerce and others. In the context
of Bioinformatics an important trend for research recently is analysis of protein interaction networks.
These are networks in which every node represents a protein and every edge represents interaction between
two proteins. The methods for extracting such information empirically usually generate big amount of errors.
That is why these networks are usually modelled as probabilistic networks. In other words the edges have weights
specifying the probability of interaction between the two connected proteins.
One trend in the research in Protein Interaction Networks aims at maximizing the connectivity to a complete
set of interaction for one specie.
Another research area in Protein Interaction Networks aims at finding patterns within the networks showing
conserved interaction configurations for groups of proteins in the network. These configurations imply
participation in a complex biological process and could be also used for prediction of unknown interactions.
The presented project aims at dicovering statistically significant conserved motifs into already assembled networks. The nodes (proteins) will be annotated with molecular function, biological process and cellular compartment according to an existing protein ontology. The found labeled motifs will be assigned a p-value designating their statistical significance.
Refferences
[1] S. Asthana, O. D. King, F. D. Gibbons, and F. P. Roth. Predicting protein complex membership using probabilistic network reliability. Genome Research, 14:1170-1175, May 2004. --read
[2] G. D. Bader and C. W. V. Hogue. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4(2), 2003.
[3] T. Can, O. Camoglu, and A. K. Singh. Analysis of protein interaction networks using random Proceedings of the 5th ACM SIGKDD Workshop on Data Mining in Bioinformatics, Aug. 2005. read
[4] H. Hu, X. Yan, Y. Huang, J. Han, and X. J. Zhou. Mining coherent dense subgraphs networks for functional discovery. Bioinformatics, 21(Suppl. 1):i213-i221, 2005. read
[5] J. Scott, T. Ideker, R. M. Karp, and R. Sharan. Efficient algorithms for detecting signaling pathways in protein interaction networks. In Proceedings of RECOMB, 2005. read
[6] R. Jansen, H. Yu, D. Greenbaum, Y. Kluger, N. J. Krogan, S. Chung, A. Emili, M. Snyder, J. F. Greenblatt, and M. Gerstein. A bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 302:449–453, October 2003. read
[7] Rui Jiang, Zhidong Tu, Ting Chen, and Fengzhu Sun "Network motif identification in stochastic networks" June 12, 2006, 10.1073/pnas.0507841103 read
[8]Marin, J.M., Mengersen, K. and Robert, C.P. "Bayesian modelling and inference on mixtures of distributions". Handbook of Statistics 25, D. Dey and C.R. Rao revewed
Methods
- Annotate the nodes using GO labels.
In this part of the project the two datasets (see datasets) will be combined and ontology information will be assigned to each node in the network. - Mine the conserved labeled trees.
The labeled network will be mined for signifficant patterns (trees or other topological entities.) The labels will be used as a refference for the proximity of the proteins, i.e. as the "strength" of the found topological configuration.
We will use an EM technique looking at labeled binary trees, initialized by frequent paths derived by the algorithm outlined in [5]. With this initialization we will expect a quicker convergence of the EM technique. - Compute the p-value.
There are two approaches for calculating the statistical significance of the found conserved motifs: statistics and simulation. We will use simulation to estimate the p-value. After we have trained our EM parameters M*, and having the maximum likelihood L* we will generate N random networks according to the background label distributions B and will run the algorithm on them obtaining Ln* likelihoods. The p-value then will be defined by Np/N where Np is the number of Ln* that are less than L*.
Dataset
(1) Database of Interacting Proteins(DIP)
This dataset is available at http://dip.doe-mbi.ucla.edu/The DIP database catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions.
(2) the Gene Ontology (GO)
This dataset is available at http://www.geneontology.org/The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.
Experiments and validation
We are going to validate the correctness of our algorithm by first mining for conserved motifs that are already known in the literature. While performing tests we will target maximum quality of the found motifs achieved in manageable amount of time. We will also validate our system using different networks from differnt spieces.
A sample protein network (without labels)