We introduce a relation extraction method to identify the sentences in biomedical text that indicate an interaction among the protein names mentioned.
Our approach is based on the analysis of the paths between two protein names in the dependency parse trees of the sentences.
Given two dependency trees, we define two separate similarity functions (kernels) based on cosine similarity and edit distance among the paths between the protein names.
Using these similarity functions, we investigate the performances of two classes of learning algorithms, Support Vector Machines and k-nearest-neighbor, and the semi-supervised counterparts of these algorithms, transductive SVMs and harmonic functions, respectively.
Significant improvement over the previous results in the literature is reported as well as a new benchmark dataset is introduced.
Semi-supervised algorithms perform better than their supervised version by a wide margin especially when the amount of labeled data is limited.
1 Introduction
Protein-protein interactions play an important role in vital biological processes such as metabolic and signaling pathways, cell cycle control, and DNA replication and transcription (Phizicky and Fields, 1995).
A number of (mostly manually curated) databases such as MINT (Zanzoni et al., 2002), BIND (Bader et al., 2003), and SwissProt (Bairoch
and Apweiler, 2000) have been created to store protein interaction information in structured and standard formats.
However, the amount of biomedical literature regarding protein interactions is increasing rapidly and it is difficult for interaction database curators to detect and curate protein interaction information manually.
Thus, most of the protein interaction information remains hidden in the text of the papers in the biomedical literature.
Therefore, the development of information extraction and text mining techniques for automatic extraction of protein interaction information from free texts has become an important research area.
In this paper, we introduce an information extraction approach to identify sentences in text that indicate an interaction relation between two proteins.
Our method is different than most of the previous studies (see Section 2) on this problem in two aspects: First, we generate the dependency parses of the sentences that we analyze, making use of the dependency relationships among the words.
This enables us to make more syntax-aware inferences about the roles of the proteins in a sentence compared to the classical pattern-matching information extraction methods.
Second, we investigate semi-supervised machine learning methods on top of the dependency features we generate.
Although there have been a number of learning-based studies in this domain, our methods are the first semi-supervised efforts to our knowledge.
The high cost of labeling free text for this problem makes semi-supervised methods particularly valuable.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 228-237, Prague, June 2007.
©2007 Association for Computational Linguistics
and harmonic functions (Zhu et al., 2003).
We also compare these two methods with their supervised counterparts, namely SVMs and k-nearest neighbor algorithm.
Because of the nature of these algorithms, we propose two similarity functions (kernels in SVM terminology) among the instances of the learning problem.
The instances in this problem are natural language sentences with protein names in them, and the similarity functions are defined on the positions of the protein names in the corresponding parse trees.
Our motivating assumption is that the path between two protein names in a dependency tree is a good description of the semantic relation between them in the corresponding sentence.
We consider two similarity functions; one based on the cosine similarity and the other based on the edit distance among such paths.
2 Related Work
There have been many approaches to extract protein interactions from free text.
One of them is based on matching pre-specified patterns and rules (Blaschke et al., 1999; Ono et al., 2001).
However, complex cases that are not covered by the pre-defined patterns and rules cannot be extracted by these methods.
Huang et al. (2004) proposed a method where patterns are discovered automatically from a set of sentences by dynamic programming.
Bunescu et al. (2005) have studied the performance of rule learning algorithms.
They propose two methods for protein interaction extraction.
One is based on the rule learning method Rapier and the other on longest common subsequences.
They show that these methods outperform hand-written rules.
Another class ofapproaches is using more syntax-aware natural language processing (NLP) techniques.
Both full and partial (shallow) parsing strategies have been applied in the literature.
In partial parsing the sentence structure is decomposed partially and local dependencies between certain phrasal components are extracted.
An example of the application of this method is relational parsing for the inhibition relation (Pustejovsky et al., 2002).
In full parsing, however, the full sentence structure is taken into account.
Temkin and Gilder (2003) used a full parser with a lexical analyzer and a context free grammar (CFG) to extract protein-protein
interaction from text.
Another study that uses full-sentence parsing to extract human protein interactions is (Daraselia et al., 2004).
Alternatively, Yakushiji et al. (2005) propose a system based on head-driven phrase structure grammar (HPSG).
In their system protein interaction expressions are presented as predicate argument structure patterns from the HPSG parser.
These parsing approaches consider only syntactic properties of the sentences and do not take into account semantic properties.
Thus, although they are complicated and require many resources, their performance is not satisfactory.
Machine learning techniques for extracting protein interaction information have gained interest in the recent years.
The PreBIND system uses SVM to identify the existence of protein interactions in abstracts and uses this type of information to enhance manual expert reviewing for the BIND database (Donaldson et al., 2003).
Words and word bigrams are used as binary features.
This system is also tested with the Naive Bayes classifier, but SVM is reported to perform better.
Mitsumori et al. (2006) also use SVM to extract protein-protein interactions.
They use bag-of-words features, specifically the words around the protein names.
These systems do not use any syntactic or semantic information.
Sugiyama et al. (2003) extract features from the sentences based on the verbs and nouns in the sentences such as the verbal forms, and the part of speech tags of the 20 words surrounding the verb (10 before and 10 after it).
Further features are used to indicate whether a noun is found, as well as the part of speech tags for the 20 words surrounding the noun, and whether the noun contains numerical characters, non-alpha characters, or uppercase letters.
They construct k-nearest neighbor, decision tree, neural network, and SVM classifiers by using these features.
They report that the SVM classifier performs the best.
They use part-of-speech information, but do not consider any dependency or semantic information.
The paper is organized as follows.
In Section 3 we describe our method of extracting features from the dependency parse trees of the sentences and defining the similarity between two sentences.
In Section 4 we discuss our supervised and semi-supervised methods.
In Section 5 we describe the data sets and evaluation metrics that we used, and present our re-
sults.
We conclude in Section 6.
3 Sentence Similarity Based on Dependency Parsing
In order to apply the semi-supervised harmonic functions and its supervised counterpart kNN, and the kernel based TSVM and SVM methods, we need to define a similarity measure between two sentences.
For this purpose, we use the dependency parse trees of the sentences.
Unlike a syntactic parse (which describes the syntactic constituent structure of a sentence), the dependency parse of a sentence captures the semantic predicate-argument relationships among its words.
The idea of using dependency parse trees for relation extraction in general was studied by Bunescu and Mooney (2005a).
To extract the relationship between two entities, they design a kernel function that uses the shortest path in the dependency tree between them.
The motivation is based on the observation that the shortest path between the entities usually captures the necessary information to identify their relationship.
They show that their approach outperforms the dependency tree kernel of Culotta and Sorensen (2004), which is based on the subtree that contains the two entities.
We adapt the idea of Bunescu and Mooney (2005a) to the task of identifying protein-protein interaction sentences.
We define the similarity between two sentences based on the paths between two proteins in the dependency parse trees of the sentences.
In this study we assume that the protein names have already been annotated and focus instead on the task ofextracting protein-protein interaction sentences for a given protein pair.
We parse the sentences with the Stanford Parser1 (de Marneffe et al., 2006).
From the dependency parse trees of each sentence we extract the shortest path between a protein pair.
For example, Figure 1 shows the dependency tree we got for the sentence 'The results demonstrated that KaiC interacts rhythmically with KaiA, KaiB, and SasA.
" This example sentence illustrates that the dependency path between a protein pair captures the relevant information regarding the relationship between the proteins better compared to using the words in the unparsed sentence.
Consider the pro-
'http://nlp.stanford.edu/software/lex-parser.shtml
tein pair KaiC and SasA.
The words in the sentence between these proteins are interacts, rhythmically, with, KaiA, KaiB, and and.
Among these words rhythmically, KaiA, and and KaiB are not directly related to the interaction relationship between KaiC and SasA.
On the other hand, the words in the dependency path between this protein pair give sufficient information to identify their relationship.
In this sentence we have four proteins (KaiC, KaiA, KaiB, and SasA).
So there are six pairs of proteins for which a sentence may or may not be describing an interaction.
The following are the paths between the six protein pairs.
In this example there is a single path between each protein pair.
However, there may be more than one paths between a protein pair, if one or both appear multiple times in the sentence.
In such cases, we select the shortest paths between the protein pairs.
demonstrated
results interacts
The that KaiC rhytmically SasA
conjand/ ^^conjand KaiA KaiB
Figure 1: The dependency tree of the sentence ' The
results demonstrated that KaiC interacts rhythmically with KaiA, KaiB, and SasA."
If a sentence contains n different proteins, there are different pairs of proteins.
We use machine learning approaches to classify each sentence as an interaction sentence or not for a protein pair.
A sentence may be an interaction sentence for one protein
pair, while not for another protein pair.
For instance, our example sentence is a positive interaction sentence for the KaiC and SasA protein pair.
However, it is a negative interaction sentence for the KaiA and SasA protein pair, i.e., it does not describe an interaction between this pair of proteins.
Thus, before parsing a sentence, we make multiple copies of it, one for each protein pair.
To reduce data sparseness, we rename the proteins in the pair as PROTX1 and PROTX2, and all the other proteins in the sentence as PROTX0.
So, for our example sentence we have the following instances in the training set:
The first three instances are positive as they describe
an interaction between PROTX1 and PROTX2.
The
last three are negative, as they do not describe an
interaction between PROTX1 and PROTX2.
We define the similarity between two instances based on cosine similarity and edit distance based similarity between the paths in the instances.
3.1 Cosine Similarity
Suppose pi and pj are the paths between PROTX1 and PROTX2 in instance xi and instance Xj, respectively.
We represent pi and pj as vectors of term frequencies in the vector-space model.
The cosine similarity measure is the cosine of the angle between these two vectors and is calculated as follows:
llPillllPjll
that is, it is the dot product of pi and pj divided by the lengths of pi and pj. The cosine similarity measure takes values in the range [0,1].
If all the terms in pi and pj are common, then it takes the maximum value of 1.
If none of the terms are common, then it takes the minimum value of 0.
3.2 Similarity Based on Edit Distance
A shortcoming of cosine similarity is that it only takes into account the common terms, but does not consider their order in the path.
For this reason, we also use a similarity measure based on edit distance (also called Levenshtein distance).
Edit distance between two strings is the minimum number of operations that have to be performed to transform the first string to the second.
In the original character-based edit distance there are three types of operations.
These are insertion, deletion, or substitution of a single character.
We modify the character-based edit distance into a word-based one, where the operations are defined as insertion, deletion, or substitution of a single word.
The edit distance between path 1 and path 2 of our example sentence is 2.
We insert PROTX0 and conj-Cind to path 1 to convert it to path 2.
We normalize edit distance by dividing it by the length (number of words) of the longer path, so that it takes values in the range [0,1].
We convert the distance measure into a similarity measure as follows.
Bunescu and Mooney (2005a) propose a similar method for relation extraction in general.
However, their similarity measure is based on the number of the overlapping words between two paths.
When two paths have different lengths, they assume the similarity between them is zero.
On the other hand, our edit distance based measure can also account for deletions and insertions of words.
4 Semi-Supervised Machine Learning Approaches
4.1 kNN and Harmonic Functions
When a similarity measure is defined among the instances of a learning problem, a simple and natural choice is to use a nearest neighbor based approach that classifies each instance by looking at the labels of the instances that are most similar to it.
Perhaps the simplest and most popular similarity-based
learning algorithm is the k-nearest neighbor classification method (kNN).
Let U be the set of unlabeled instances, and L be the set of labeled instances in a learning problem.
Given an instance x £ U, let x) be the set of top k instances in L that are most similar to x with respect to some similarity measure.
The kNN equation for a binary classification problem can be written as:
where y(z) £ {0,1} is the label of the instance z.2 Note that y(x) can take any real value in the [0,1] interval.
The final classification decision is made by setting a threshold in this interval (e.g. 0.5) and classifying the instances above the threshold as positive and others as negative.
For our problem, each instance is a dependency path between the proteins in the pair and the similarity function can be one of the functions we have defined in Section 3.
Equation 3 can be seen as averaging the labels (0 or 1) of the nearest neighbors of each unlabeled instance.
This suggests a generalized semi-supervised version of the same algorithm by incorporating un-labeled instances as neighbors as well:
Unlike Equation 3, the unlabeled instances are also considered in Equation 4 when finding the nearest neighbors.
We can visualize this as an undirected graph, where each data instance (labeled or unla-beled) is a node that is connected to its k nearest neighbor nodes.
The value of y( ) is set to 0 or 1 for labeled nodes depending on their class.
For each unlabeled node x, y(x) is equal to the average of the y( ) values of its neighbors.
Such a function that satisfies the average property on all unlabeled nodes is called a harmonic function and is known to exist and have a unique solution (Doyle and Snell, 1984).
Harmonic functions were first introduced as a semi-supervised learning method by Zhu et al. (2003).
There are interesting alternative interpretations of
2Equation 3 is the weighted (or soft) version of the kNN algorithm.
In the classical voting scheme, x is classified in the category that the majority of its neighbors belong to.
a harmonic function on a graph.
One of them can be explained in terms of random walks on a graph.
Consider a random walk on a graph where at each time point we move from the current node to one of its neighbors.
The next node is chosen among the neighbors of the current node with probability proportional to the weight (similarity) of the edge that connects the two nodes.
Assuming we start the random walk from the node x, y(x) in Equation 4 is then equal to the probability that this random walk will hit a node labeled 1 before it hits a node labeled
Support vector machines (SVM) is a supervised machine learning approach designed for solving two-class pattern recognition problems.
The aim is to find the decision surface that separates the positive and negative labeled training examples of a class with maximum margin (Burges, 1998).
Transductive support vector machines (TSVM) are an extension of SVM, where unlabeled data is used in addition to labeled data.
The aim now is to assign labels to the unlabeled data and find a decision surface that separates the positive and negative instances of the original labeled data and the (now labeled) unlabeled data with maximum margin.
Intuitively, the unlabeled data pushes the decision boundary away from the dense regions.
However, unlike SVM, the optimization problem now is NP-hard (Zhu, 2005).
Pointers to studies for approximation algorithms can be found in (Zhu,
2005).
In Section 3 we defined the similarity between two instances based on the cosine similarity and the edit distance based similarity between the paths in the instances.
Here, we use these path similarity measures as kernels for SVM and TSVM and modify the SVMlight package (Joachims, 1999) by plugging in our two kernel functions.
A well-defined kernel function should be symmetric positive definite.
While cosine kernel is well-defined, Cortes et al. (2004) proved that edit kernel is not always positive definite.
However, it is possible to make the kernel matrix positive definite by adjusting the 7 parameter, which is a positive real number.
Li and Jiang (2005) applied the edit kernel to predict initiation sites in eucaryotic mRNAs and
obtained improved results compared to polynomial kernel.
One of the problems in the field of protein-protein interaction extraction is that different studies generally use different data sets and evaluation metrics.
Thus, it is difficult to compare their results.
Bunescu et al. (2005) manually developed the AIMED corpus3 for protein-protein interaction and protein name recognition.
They tagged 199 Medline abstracts, obtained from the Database of Interacting Proteins (DIP) (Xenarios et al., 2001) and known to contain protein interactions.
This corpus is becoming a standard, as it has been used in the recent studies by (Bunescu et al., 2005; Bunescu and Mooney, 2005b; Bunescu and Mooney, 2006; Mitsumori et al., 2006; Yakushiji et al., 2005).
In our study we used the AIMED corpus and the CB (Christine Brun) corpus that is provided as a resource by BioCreAtIvE II (Critical Assessment for Information Extraction in Biology) challenge eval-uation4.
We pre-processed the CB corpus by first annotating the protein names in the corpus automatically and then, refining the annotation manually.
As discussed in Section 3, we pre-processed both of the data sets as follows.
We replicated each sentence for each different protein pair.
For n different proteins in a sentence, new sentences are created, as there are that many different pairs of proteins.
In each newly created sentence we marked the protein pair considered for interaction as PROTX1 and PROTX2, and all the remaining proteins in the sentence as PROTX0.
If a sentence describes an interaction between PROTX1 and PROTX2, it is labeled
as positive, otherwise it is labeled as negative.
The summary of the data sets after pre-processing is displayed in Table 15.
Since previous studies that use AIMED corpus perform 10-fold cross-validation.
We also performed 10-fold cross-validation in both data sets and report the average results over the runs.
3ftp://ftp.cs.utexas.edu/pub/mooney/bio-data/ 4http://bi ocreative.sourceforge.net/bi ocreative_2.html 5The pre-processed data sets are available at http://belobog.si.umich.edu/clair/biocreative
Data Set
Sentences
+ Sentences
- Sentences
AIMED CB
5.2 Evaluation Metrics
We use precision, recall, and F-score as our metrics to evaluate the performances of the methods.
Precision (n) and recall (p) are defined as follows:
Here, TP (True Positives) is the number of sentences classified correctly as positive; FP (False Positives) is the number of negative sentences that are classified as positive incorrectly by the classifier; and FN (False Negatives) is the number of positive sentences that are classified as negative incorrectly by the classifier.
F-score is the harmonic mean of recall and precision.
5.3 Results and Discussion
We evaluate and compare the performances of the semi-supervised machine learning approaches (TSVM and harmonic functions) with their supervised counterparts (SVM and kNN) for the task of protein-protein interaction extraction.
As discussed in Section 3, we use cosine similarity and edit distance based similarity as similarity functions in harmonic functions and kNN, and as kernel functions in TSVM and SVM.
Our instances consist of the shortest paths between the protein pairs in the dependency parse trees of the sentences.
In our experiments, we tuned the 7 parameter of the edit distance based path similarity function to 4.5 with cross-validation.
The results in Table 2 and Table 3 are obtained with 10-fold cross-validation.
We report the average results over the runs.
Table 2 shows the results obtained for the AIMED data set.
Edit distance based path similarity function performs considerably better than the cosine similarity function with harmonic functions and kNN
and usually slightly better with SVM and TSVM.
with TSVM with edit kernel.
While SVM with edit
kernel achieves the highest precision of 77.52%, it performs slightly worse than SVM with cosine kernel in terms of F-score measure.
TSVM performs slightly better than SVM, both of which perform better than harmonic functions. kNN is the worst performing algorithm for this data set.
In Table 2, we also show the results obtained previously in the literature by using the same data set.
Yakushiji et al. (2005) use an HPSG parser to produce predicate argument structures.
They utilize these structures to automatically construct protein interaction extraction rules.
Mitsumori et al. (2006) use SVM with the unparsed text around the protein names as features to extract protein interaction sentences.
Here, we show their best result obtained by using the three words to the left and to the right of the proteins.
The most closely related study to ours is that by Bunescu and Mooney (2005a).
They define a kernel function based on the shortest path between two entities of a relationship in the dependency parse tree of a sentence (the SPK method).
They apply this method to the domain of protein-protein interaction extraction in (Bunescu and Mooney, 2006).
Here, they also test the methods ELCS (Extraction Using Longest Common Subsequences) (Bunescu et al., 2005) and SSK (Subsequence Kernel) (Bunescu and Mooney, 2005b).
We cannot compare our results to theirs directly, because they report their results as a precision-recall graph.
However, the best F-score in their graph seems to be around 0.50 and definitely lower than the best F-scores we have achieved 0.59).
Bunescu and Mooney (2006) also use SVM as their learning method in their SPK approach.
They define their similarity based on the number of overlapping words between two paths and assign a similarity of zero if the two paths have different lengths.
Our improved performance with SVM and the shortest path dependency features may be due to the edit-distance based kernel, which takes into account not only the overlapping words, but also word order and accounts for deletions and insertions of words.
Our results show that, SVM, TSVM, and harmonic functions achieve better F-score and recall performances than the previous studies by Yakushiji et al. (2005), Mitsumori et al. (2006), and the SSK and ELCS approaches of Bunescu and Mooney (2006).
SVM and TSVM also achieve higher precision scores.
Since,
Mitsumori et al. (2006) also use SVM in their study, our improved results with SVM confirms our motivation of using dependency paths as features.
Table 3 shows the results we got with the CB data set.
The F-score performance with the edit distance based similarity function is always better than that of cosine similarity function for this data set.
The difference in performances is considerable for harmonic functions and kNN.
Our best F-score
cosine similarity function is used, kNN performs better than harmonic functions.
However, when edit distance based similarity is used, harmonic functions achieve better performance.
SVM and TSVM perform better than harmonic functions.
But, the gap in performance is low when edit distance based similarity is used with harmonic functions.
Table 2: Experimental Results - AIMED Data Set
Precision
SVM-edit
TSVM-edit
TSVM-cos
Harmonic-edit
Harmonic-cos
kNN-edit
Table S: Experimental Results - CB Data Set
Semi-supervised approaches are usually more effective when there is less labeled data than unlabeled data, which is usually the case in real applications.
To see the effect of semi-supervised approaches we perform experiments by varying the amount of la-
kNN Harmonic
0.2 .
Figure 2: The F-score on the AIMED dataset with varying sizes of training data
Harmonic
SVM TSVM
Number of Labeled Sentences
Figure 3: The F-score on the CB dataset with varying sizes of training data
beled training sentences in the range [10, 3000].
For each labeled training set size, sentences are selected randomly among all the sentences, and the remaining sentences are used as the unlabeled test set.
The results that we report are the averages over 10 such random runs for each labeled training set size.
We report the results for the algorithms when edit distance based similarity is used, as it mostly performs better than cosine similarity.
Figure 2 shows the results obtained over the AIMED data set.
Semi-supervised approaches TSVM and harmonic functions perform considerably better than their supervised counterparts SVM and kNN when we have small number of labeled training data.
It is interesting to note that, although SVM is one of the best performing algorithms with more training data, it is the worst performing algorithm with small amount of labeled training sentences.
Its performance starts to increase when number of training data is larger than 200.
Eventually, its performance gets close to that of the other algorithms.
Harmonic functions is the best performing algorithm when we have less than 200 labeled training data.
TSVM achieves better performance when there are more than 500 labeled training sentences.
Figure 3 shows the results obtained over the CB data set.
When we have less than 500 labeled sen-
tences, harmonic functions and TSVM perform significantly better than kNN, while SVM is the worst performing algorithm.
When we have more than 500 labeled training sentences, kNN is the worst performing algorithm, while the performance of SVM increases and gets similar to that of TSVM and slightly better than that of harmonic functions.
6 Conclusion
We introduced a relation extraction approach based on dependency parsing and machine learning to identify protein interaction sentences in biomedical text.
Unlike syntactic parsing, dependency parsing captures the semantic predicate argument relationships between the entities in addition to the syntactic relationships.
We extracted the shortest paths between protein pairs in the dependency parse trees of the sentences and defined similarity functions (kernels in SVM terminology) for these paths based on cosine similarity and edit distance.
Supervised machine learning approaches have been applied to this domain.
However, they rely only on labeled training data, which is difficult to gather.
To our knowledge, this is the first effort in this domain to apply semi-supervised algorithms, which make use of both labeled and unlabeled data.
We evaluated and compared the performances of two semi-supervised ma-
chine learning approaches (harmonic functions and TSVM), with their supervised counterparts (kNN and SVM).
We showed that, edit distance based similarity function performs better than cosine similarity function since it takes into account not only common words, but also word order.
Our 10-fold cross validation results showed that, TSVM performs slightly better than SVM, both of which perform better than harmonic functions.
The worst performing algorithm is kNN.
We compared our results with previous results published with the AIMED data set.
We achieved the best F-score performance with TSVM with the edit distance kernel (59.96%) which is significantly higher than the previously reported results for the same data set.
In most real-world applications there are much more unlabeled data than labeled data.
Semi-supervised approaches are usually more effective in these cases, because they make use of both the labeled and unlabeled instances when making decisions.
To test this hypothesis for the application ofextracting protein interaction sentences from text, we performed experiments by varying the number of labeled training sentences.
Our results show that, semi-supervised algorithms perform considerably better than their supervised counterparts, when there are small number oflabeled training sentences.
An interesting result is that, in such cases SVM performs significantly worse than the other algorithms.
Harmonic functions achieve the best performance when there are only a few labeled training sentences.
As number of labeled training sentences increases the performance gap between supervised and semi-supervised algorithms decreases.
Acknowledgments
This work was supported in part by grants R01-LM008106 and U54-DA021519 from the US National Institutes of Health.
