Research Background essay

ResearchBackground

Withthe completion of many genome-sequencing projects, the focus in thepost genomic era has turned to proteomics. Proteomics is the largescale study of proteins, focusing on their structures and functions.This is because proteins are vital parts of living organisms as theyare the main components of the physiological metabolic pathways ofcells. One important task in proteomics is detection of proteincomplexes based on the PPI data generated by various experimentaltechnologies, for example, yeast-two-hybrid and affinitypurification.

Understandingmechanisms of proteins in cells is made possible by an essential taskof identifying protein complexes. Computational approaches have beendeveloped for identification of protein complexes in protein-proteininteraction, PPI, networks. Protein complexes are a group of two ormore associated polypeptide chains that may have different functions.PPIs are specific lasting physical events or electrostatic forces.

Proteincomplexes which consist of molecular aggregation of proteinsassembled by multiple protein interactions are of fundamental unitsof macro-molecular organizations and are essential in integration ofindividual gene products to perform useful cellular functions.Confirmation arises from the fact that the complex ‘RNA polymeraseII’ transcribes genetic information into messages for ribosomes toproduce proteins. Another example is complex protease core particleinvolved in the degradation of proteins this is an essential processwithin the cell.

Manygenetic diseases with similar causes usually tend to lie close to oneanother in a network of protein-protein interaction. This kind ofinteraction relatedness can be exploited by measuring theevolutionary relationships between genes in order to find the novelprotein complexes which ultimately lead to finding disease genes.

Unfortunately,the mechanism for most of the biological activities are stillunknown, and hence, accurately predicting the protein complexes fromthe available PPI data has considerable motivation because it allowsus to infer principles of biological processes.

Inthe post-genomic era, identification of protein complexes from PPInetworks is a challenging task. In cells, proteins perform variousfunctions such as dynamic signaling, cellular machines, rigidstructures and post-translational modification systems. PPI proteinscan be represented as graphs. Protein complexes correspond to densesubgraphs in a PPI network. This is because proteins in the samecomplex highly interact with one another.

Methodsfor protein complexes prediction are usually based on experimentaland computational notions. The most popular method experimentally isthe TAP, the Tandem Affinity Purification, which has massspectroscopy. However, TAP is not perfect due tp its limitations. Onelimitation is the probability for the low affinity protein complexesto be excluded because of the washing and purification operations inTAP-MS. Tag proteins are needed by this experimental approach toinfer the protein complex. Gavin et al indicated that only limitedknown yeast protein complex subunits can be extracted by the TAP-MS.Schonbach, showed that in order to validate experimental resultsusing subcellular localization information, preparation ofsubcellular fractioned lysates is a must. Computational approachesare becoming promising alternatives to complement experimental onesas they save on time. Preparation procedure usually consumes a lot oftime.

Generally,protein interaction data can be effectively modeled as a graph, alsocalled a network, by regarding each protein as a vertex and eachknown interaction between two proteins as an edge. Although there areplenty of related results in graph theory and many graph algorithmshave been developed, it is still non-trivial to design an efficientalgorithm to mine protein complexes from PPI networks. One reason isthat there has not been an exact definition for a protein complex.To overcome this difficulty, Tong et al. Assumed that a proteincomplex corresponds to a dense subgraph because proteins in the samecomplex interact frequently among themselves, and a similardiscussion was made in. Most of the current methods are based on theobservational aim of finding dense subgraphs inside the PPInetworks. All of the algorithms for mining protein complexes differmainly in the way that a dense subgraph is defined and in theprocedure that is developed.

Studyingprotein complexes is important in biological processes because itenables revealing of the structure-functionality relationships in aprotein complex. Accurate prediction of protein complexes from theincreasing amount of protein-protein interaction, PPI, data requiresmuch attention.A comprehensive comparison between MCL, DPClus, Coach, DECAFF and thecore-attachment algorithm was made by comparing the predicted proteincomplexes with the benchmarked complexes. Experimental resultsindicated that core-attached algorithm outperforms the MCL, DPClusand DECAFF and has comparative performance with the Coach algorithmin terms of the accuracy of prediction. The detected complexes withcore-attachment structures match well with the benchmark data,demonstrating that the algorithm can provide more insightfulperspectives. Robustness analysis further shows that the algorithm isvery robust.

Availabilityof PPI networks is narrow, for example, the average interactions perprotein. In the PPI networks, it is difficult to extract many proteincomplexes due to noises in the sparse networks.Designing an efficient algorithm that get rid of the noise isimportant and challenge task when predicting protein complexes.Unfortunately, previous algorithms did not pay sufficient attentionto the problem since they only filtered the noise by deleting thenodes with degree1based on the fact that the interactions between the proteins havelower reliability than the topological reliability measures

ResearchStatus and Advances

Abig percentage of the current algorithms that concern the detectionof protein complexes focus on discovering dense subgraphs based onthe observation that dense subgraphs in a biological network maycorrespond to protein complexes. In the core-attachment concept, anovel-core attachment algorithm is developed by detecting the coresand attachments. First, the core of a protein-complex ischaracterized as a maximal clique in a virtual network that has thesame size as the original PPI network which is constructed by theeigenvectors and eigenvalues of the adjacency matrix.

Althoughit is difficult to develop an efficient algorithm, there has beenmuch research devoted to circumventing this difficulty. For example,in the Markov Cluster Algorithm, MCL, the concept of random walks isintroduced, where a walker can start at an arbitrary protein andvisits a neighborhood vertex with a predefined probability.Intuitively, if the walker goes into a dense region, it would be hardto get out of the region. The main problem here with MCL is that thedetected clusters may not necessarily have an absolute high densitybecause the walker can be trapped in a large subgraph with a lowdensity. In contrast to MCL, Molecular Complex Detection, MCODE,relied on the topological structure of the network, where it isassumed that a protein belongs to some complex if it has a subset ofneighbors with a high degree and there are many interactions amongthese proteins. The CFinder defined a dense subgraph with thek-cliques. Each dense subgraph is identified by merging a set ofadjacent k-cliques, where two k-cliques are adjacent if and only ifthey share k – 1 node. The biggest drawback in CFinder is that it isdifficult to assign a reasonable value to the parameter k becausesetting k = 3 results in too much noise while setting k = 4 omitsmany complexes. A similar problem also exists with MCODE. Othernon-topological properties, such as functional information and thedata of the protein binding interface are also incorporated intoalgorithms with the immediate purpose of improving the accuracy ofprediction. In addition, there are some other algorithms for thedetection of protein complexes that rely solely on the TAP date,which can be summarized with two points. First, a reliable PPInetwork is constructed by applying specific scoring strategies basedon the purification records and selected protein interactions withhigh scores. Then, some existing algorithms are employed to detectthe dense clusters in the newly constructed networks.

Theexisting algorithms for extracting dense subgraphs fail to take intoaccount the inherent organization in protein complexes. Recently,Gavin et al. revealed that a complex consists of a core component andSattachments, enlightening a new direction to predict proteincomplexes by discovering the core and attachments. A core componentis the ’heart’ of a protein complex and has relatively moreinteractions among proteins, while each attachment protein binds to acore to form a biologically meaningful complex.

Althoughthere is no clear-cut definition for the core-attachment structure,this method has led to substantial interests among researchers. Forexample, this method has been used in to extract cores, named by amediator, which are locally significant proteins that mediate thefunction of modules. In making the most use of the core-attachmentconcept, Leung et al. presented the first work to predict complexeswith the core-attachment-based algorithm based on a probabilitymodel. The second method draws a substantially different line, inwhich the protein complex cores are extracted from each vertex’sneighborhood graph. Both approaches outperform the existingnon-core-attachment-based computational methods dramatically,demonstrating the significance of the core-attachment structure.These results are significant motivation for the present work.

Mainwork or Main Contributions

ThesisStructure