This paper presents a method for categorizing named entities in Wikipedia.
In Wikipedia, an anchor text is glossed by the HTML text of the article it links to.
We formalize named entity categorization as the task of categorizing anchor texts using the linked HTML texts that gloss the named entities.
Using this representation, we introduce a graph structure in which anchor texts are regarded as nodes.
In order to incorporate HTML structure on the graph, three types of cliques are defined based on the HTML tree structure.
We propose a method with Conditional Random Fields (CRFs) to categorize the nodes on the graph.
Since the defined graph may include cycles, the exact inference of CRFs is computationally expensive.
We introduce an approximate inference method using Tree-based Reparameterization (TRP) to reduce computational cost.
In experiments, our proposed model obtained significant improvements compared to baseline models that use Support Vector Machines.
1 Introduction
Named and Numeric Entities (NEs) refer to proper nouns (e.g. PERSON, LOCATION and ORGANIZATION), time expressions, date expressions and so on.
Since a large number of NEs exist in the world, unknown expressions appear frequently in texts, and they become a hindrance to real-world text analysis.
One effective way to cope with the problem is to add a large number of NEs to gazetteers.
In recent years, NE extraction has been performed with machine learning based methods.
However, such methods cannot cover all of the NEs in texts.
Therefore, it is necessary to extract NEs from existing resources and use them to identify more NEs.
There are many useful resources on the Web.
We focus on Wikipedia1 as the resource for acquiring NEs.
Wikipedia is a free multilingual online encyclopedia and a rapidly growing resource.
In Wikipedia, a large number of NEs are described in titles of articles with useful information such as HTML tree structures and categories.
Each article links to other related articles.
Given these characteristics, Wikipedia could be an appropriate resource for extracting NEs.
Since a specific entity or concept is glossed in a Wikipedia article, we can regard the NE extraction problem as a document classification problem of the Wikipedia article.
In traditional approaches for document classification, in many cases, documents are classified independently.
However, the Wikipedia articles are hypertexts and they have a rich structure that is useful for categorization.
For example, hyper-linked mentions (we call them anchor texts) which are enumerated in a list tend to refer to articles that describe other NEs belonging to the same class.
It is expected that improved NE categorization is accomplished by capturing such dependencies.
We structure anchor texts and dependencies between them into a graph, and train graph-based CRFs to obtain probabilistic models to estimate categories for NEs in Wikipedia.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 649-657, Prague, June 2007. ©2007 Association for Computational Linguistics

So far, several statistical models that can capture dependencies between examples have been proposed.
There are two types of classification methods that can capture dependencies: iterative classification methods (Neville and Jensen, 2000; Lu and Getoor, 2003b) and collective classification methods (Getoor et al., 2001; Taskar et al., 2002).
In this paper, we use Conditional Random Fields (CRFs) (Lafferty et al., 2001) for NE categorization in Wikipedia.
The rest of the paper is structured as follows.
Section 2 describes the general framework of CRFs.
Section 3 describes graph-based CRFs for NE categorization in Wikipedia.
In section 4, we show the experimental results.
Section 5 describes related work.
We conclude in section 6.
2 Conditional Random Fields
Conditional Random Fields (CRFs) (Lafferty et al., 2001) are undirected graphical models that give a conditional probability distribution p(y|x) in the form of an exponential model.
CRFs are formalized as follows.
Let G = {V, E} be an undirected graph over random variables y and x, where V is a set of vertices and E is a set of edges in the graph G. When a set of cliques C = {{y_c, x_c}} is given, CRFs define the conditional probability of a state assignment given an observation set:

p(y \mid x) = \frac{1}{Z(x)} \prod_{c \in C} \Phi(x_c, y_c)

where \Phi(x_c, y_c) is a potential function defined over cliques, and Z(x) = \sum_{y} \prod_{c \in C} \Phi(x_c, y_c) is the partition function.
The potentials are factorized according to the set of features \{f_k\}:

\Phi(x_c, y_c) = \exp\Big( \sum_{k} \lambda_k f_k(x_c, y_c) \Big)

where F = \{f_1, \ldots, f_K\} are feature functions on the cliques, and \Lambda = \{\lambda_1, \ldots, \lambda_K \in \mathbb{R}\} are the model parameters.
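As a concrete toy illustration of these definitions, the sketch below computes p(y|x) by brute-force enumeration on a three-node chain with a single hand-picked "labels agree" feature. The weight, graph, and feature are illustrative assumptions, not the paper's model.

```python
import itertools
import math

# Toy CRF: three binary labels connected by two edge cliques, one
# "labels agree" feature per clique with a hand-picked weight w.
w = 1.5
edges = [(0, 1), (1, 2)]

def potential(ya, yb):
    # Phi(x_c, y_c) = exp(w * f), with f = 1 iff the two labels agree
    return math.exp(w) if ya == yb else 1.0

def score(y):
    # product of clique potentials for a full labeling y
    s = 1.0
    for a, b in edges:
        s *= potential(y[a], y[b])
    return s

labelings = list(itertools.product([0, 1], repeat=3))
Z = sum(score(y) for y in labelings)   # partition function Z(x)

def p(y):
    # conditional probability p(y|x) = (1/Z) * product of potentials
    return score(y) / Z
```

As expected, the probabilities form a proper distribution, and labelings whose neighbors agree receive higher probability than those that disagree.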
The parameters \Lambda are estimated by iterative scaling or quasi-Newton methods from labeled data.
Lafferty et al. (2001) applied linear-chain CRFs to the part-of-speech tagging problem.
McCallum et al. (2003) and Sutton et al. (2004) proposed Dynamic Conditional Random Fields (DCRFs), a generalization of linear-chain CRFs that have complex graph structures (including cycles).
Since the DCRF model structure contains cycles, it is necessary to use approximate inference methods to calculate marginal probabilities.
Tree-based Reparameterization (TRP) (Wainwright et al., 2003), a schedule for loopy belief propagation, is used for approximate inference in these papers.
3 Graph-based CRFs for NE Categorization in Wikipedia
In this section we describe how to apply CRFs for NE categorization in Wikipedia.
Each Wikipedia article describes a specific entity or concept by a heading word, a definition, and one or more categories.
One possible approach is to classify each NE described in an article into an appropriate category by exploiting the definition of the article.
This process can be done one by one without considering the relationship with other articles.
On the other hand, articles in Wikipedia are semi-structured texts.
In particular, lists (<UL> or <OL>) and tables (<TABLE>) have an important characteristic: the occurrence of elements in them exhibits some sort of dependency.
Figure 2 shows an example of an HTML segment and the corresponding tree structure.
The first anchor texts in each list tag (<LI>) tend to be in the same NE category.
Such characteristics are useful features for the categorization task.
In this paper we focus on lists which appear frequently in Wikipedia.
Furthermore, there are anchor texts in articles.
An anchor text is a mention of a glossed entity or concept, described with a link to another page.
With this in mind, our NE categorization problem can be regarded as an NE category labeling problem for anchor texts in articles.
Exploiting dependencies of anchor texts that are induced by the HTML structure is expected to improve categorization performance.
We use CRFs for categorization, in which anchor texts correspond to the random variables V in G and dependencies between anchor texts are treated as the edges E in G. In the next section, we describe the concrete way to construct graphs.

Figure 1: The definitions of sibling, cousin and relative cliques, where E_S, E_C, E_R correspond to sets which consist of anchor text pairs that have sibling, cousin and relative relations respectively.
3.1 Constructing a graph from an HTML tree
An HTML document is an ordered tree.
We define a graph G = (V_G, E_G) on an HTML tree T_HTML = (V_T, E_T): the vertices V_G are anchor texts in the HTML text; the edges E_G are limited to cliques of Sibling, Cousin, and Relative, which we will describe later in this section.
These cliques are intended to encode NE label dependencies between anchor texts, where the two NEs tend to be in the same or related classes, or one NE label affects the other.
Let us consider dependent anchor text pairs in Figure 2.
First, "Dillard & Clark" and "country rock" have a sibling relation over the tree structure, appearing in the same element of the list.
The latter element in this relation tends to be an attribute or a concept of the former.
Second, "Dillard & Clark" and "Carpenters" have a cousin relation over the tree structure, and they tend to have a common attribute such as "Artist".
The elements in this relation tend to belong to the same class.
Third, "Carpenters" and "Karen Carpenter" have a relation in which "Karen Carpenter" is a sibling's grandchild of "Carpenters" over the tree structure.
The latter element in this relation tends to be a constituent part of the former.
We can say that the model can capture dependencies by dealing with anchor texts that depend on each other as cliques.
Based on the observations above, we treat pairs of anchor texts which satisfy the conditions in Figure 1 as cliques.
Figure 2: Correspondence between tree structure and defined cliques (an HTML list fragment containing the anchor texts "Dillard & Clark", "country rock", "Carpenters" and "Karen Carpenter").
Now, we define the three sorts of edges given an HTML tree.
Consider an HTML tree T_HTML = (V_T, E_T), where V_T and E_T are the nodes and edges of the tree.
Let d(v_i, v_j) be the number of edges between v_i and v_j, where v_i, v_j ∈ V_T; pa(v_i, k) be the k-th generation ancestor of v_i; ch(v_i, k) be v_i's k-th child; and ca(v_i, v_j) be the common ancestor of v_i and v_j ∈ V_T. Precise definitions of the cliques, namely Sibling, Cousin, and Relative, are given in Figure 1.
The set of cliques used in our graph-based CRFs consists of the edges defined in Figure 1 and the vertices, i.e. C = E_S ∪ E_C ∪ E_R ∪ V. Note that the edges are restricted to pairs of nearest vertices to keep the graph simple.
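The construction above can be sketched as follows on the list fragment of Figure 2. The Node class and the exact distance tests are simplifying assumptions standing in for the paper's precise Figure 1 definitions (d, pa, ch, ca).

```python
from dataclasses import dataclass, field

# A sketch of extracting Sibling/Cousin/Relative anchor-text pairs from
# an HTML-like tree; distance tests are illustrative simplifications.
@dataclass(eq=False)
class Node:
    tag: str
    text: str = ""
    parent: "Node" = None
    children: list = field(default_factory=list)

def add(parent, child):
    child.parent = parent
    parent.children.append(child)
    return child

def anchors(n):
    # collect <a> nodes in document order
    found = [n] if n.tag == "a" else []
    for c in n.children:
        found.extend(anchors(c))
    return found

def ancestors(n):
    path = []
    while n.parent is not None:
        n = n.parent
        path.append(n)
    return path

def relation(u, v):
    # classify a pair by edge distances to the nearest common ancestor
    au, av = ancestors(u), ancestors(v)
    i = next(k for k, a in enumerate(au) if a in av)
    du, dv = i + 1, av.index(au[i]) + 1
    if du == dv == 1:
        return "sibling"    # same list element
    if du == dv == 2:
        return "cousin"     # e.g. anchors of adjacent <LI> elements
    if abs(du - dv) == 2:
        return "relative"   # e.g. a sibling's grandchild
    return None

# the <UL> fragment from Figure 2
root = Node("ul")
li1 = add(root, Node("li"))
a1 = add(li1, Node("a", "Dillard & Clark"))
a2 = add(li1, Node("a", "country rock"))
li2 = add(root, Node("li"))
a3 = add(li2, Node("a", "Carpenters"))
li3 = add(add(li2, Node("ul")), Node("li"))
a4 = add(li3, Node("a", "Karen Carpenter"))
```

On this tree, "Dillard & Clark"/"country rock" come out as siblings, "Dillard & Clark"/"Carpenters" as cousins, and "Carpenters"/"Karen Carpenter" as relatives, matching the three cases discussed above.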
We introduce potential functions for the cliques to define the conditional probability distribution of the CRFs.
The conditional distribution over the label set y given the observation set x is defined as:

p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{(i,j) \in E} \sum_{k} \lambda_k f_k(y_i, y_j) + \sum_{i \in V} \sum_{k'} \mu_{k'} f_{k'}(y_i, x) \Big)

f_k(y_i, y_j) captures co-occurrences between labels, where k ∈ {(y_i, y_j) | Y × Y} corresponds to a particular element of the Cartesian product of the label set Y. f_{k'}(y_i, x) captures co-occurrences between label y_i ∈ Y and observation features, where k' corresponds to a particular element of the label set and observed features.
3.3 Tree-based Reparameterization
Since the proposed model may include loops, it is necessary to introduce an approximation to calculate marginal probabilities.
For this, we use Tree-based Reparameterization (TRP) (Wainwright et al., 2003) for approximate inference.
TRP enumerates a set of spanning trees from the graph.
Then, inference is performed by applying an exact inference algorithm such as Belief Propagation to each of the spanning trees, and updates of marginal probabilities are continued until they converge.
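TRP is a spanning-tree update schedule for loopy belief propagation. For a compact illustration, the sketch below runs plain (flooding-schedule) loopy belief propagation on a 3-node cycle rather than the tree-based schedule; the pairwise potentials and the unary bias on node 0 are made-up values.

```python
# Loopy belief propagation on a 3-node cycle (flooding schedule);
# TRP would instead update along spanning trees of the same graph.
labels = [0, 1]
edges = [(0, 1), (1, 2), (0, 2)]
unary = {0: [3.0, 1.0], 1: [1.0, 1.0], 2: [1.0, 1.0]}  # node 0 prefers label 0

def pot(a, b):
    # attractive pairwise potential: agreeing labels score higher
    return 2.0 if a == b else 1.0

# directed messages m[(i, j)][label of j], initialized uniformly
msgs = {(i, j): [1.0, 1.0] for a, b in edges for (i, j) in [(a, b), (b, a)]}

def incoming(i, exclude, li):
    # product of messages into node i about label li, excluding one sender
    prod = 1.0
    for (k, t) in msgs:
        if t == i and k != exclude:
            prod *= msgs[(k, i)][li]
    return prod

for _ in range(100):  # iterate sum-product updates until (near) convergence
    new = {}
    for (i, j) in msgs:
        out = [sum(unary[i][li] * pot(li, lj) * incoming(i, j, li)
                   for li in labels) for lj in labels]
        z = sum(out)
        new[(i, j)] = [v / z for v in out]
    msgs = new

def belief(i):
    # approximate marginal p(y_i): unary potential times all incoming messages
    b = [unary[i][li] * incoming(i, None, li) for li in labels]
    z = sum(b)
    return [v / z for v in b]
```

The resulting beliefs are normalized, node 0's bias toward label 0 survives, and the attractive potentials propagate that bias around the cycle to node 2.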
Our dataset is a random selection of 2300 articles from the Japanese version of Wikipedia as of October 2005.
All anchor texts appearing under HTML <LI> tags are hand-annotated with NE class labels.
We use the Extended Named Entity Hierarchy (Sekine et al., 2002) as the NE class labeling guideline, but reduce the number of classes to 13 from the original 200+ by ignoring fine-grained categories and nearby categories in order to avoid data sparseness.
We eliminate examples that consist of fewer than two nodes in the SCR model.
There are 16136 anchor texts with 14285 NEs.
The numbers of Sibling, Cousin and Relative edges in the dataset are |E_S| = 4925, |E_C| = 13134 and |E_R| = 746 respectively.
The log-likelihood function can be defined as follows:

L(\Lambda) = \sum_{i} \log p(y^{(i)} \mid x^{(i)}) - \sum_{k} \frac{\lambda_k^2}{2\sigma^2} - \sum_{k'} \frac{\mu_{k'}^2}{2\sigma^2}

where the last two terms are due to the Gaussian prior (Chen and Rosenfeld, 1999) used to reduce overfitting.
Quasi-Newton methods, such as L-BFGS (Liu and Nocedal, 1989) can be used for maximizing the function.
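To make the estimation step concrete, the sketch below maximizes a penalized log-likelihood of the same shape for a toy two-feature logistic model. The paper uses L-BFGS; plain gradient ascent is used here only to keep the illustration dependency-free, and the data and variance are made up.

```python
import math

# Maximize L(w) = sum_i log p(y_i|x_i; w) - sum_k w_k^2 / (2 sigma^2)
# for a toy binary logistic model (stand-in for the CRF likelihood).
data = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.8], 0)]
sigma2 = 10.0  # Gaussian prior variance

def p1(w, x):
    # p(y=1 | x; w) under the logistic model
    z = sum(wk * xk for wk, xk in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def loglik(w):
    # penalized conditional log-likelihood
    ll = sum(math.log(p1(w, x) if y == 1 else 1.0 - p1(w, x)) for x, y in data)
    return ll - sum(wk * wk for wk in w) / (2.0 * sigma2)

w = [0.0, 0.0]
for _ in range(500):
    grad = [-wk / sigma2 for wk in w]          # gradient of the prior term
    for x, y in data:
        err = y - p1(w, x)                     # gradient of the data term
        for k in range(2):
            grad[k] += err * x[k]
    w = [wk + 0.1 * gk for wk, gk in zip(w, grad)]
```

After training, the penalized log-likelihood is higher than at the zero initialization and the toy examples are separated correctly.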
The aims of the experiments are two-fold.
First, we investigate the effect of each type of clique.
Several graphs are composed from the three sorts of edges.
We also compare the graph-based models with a node-wise method, i.e. a MaxEnt method not using any edge dependencies.
Second, we compare the proposed CRF method with a baseline method using Support Vector Machines (SVMs) (Vapnik, 1998).
The experimental settings of CRFs and SVMs are as follows.
CRFs In order to investigate which type of clique boosts classification performance, we perform experiments on several CRF models that are constructed from combinations of the defined cliques.
Resulting models of CRFs evaluated in these experiments are SCR, SC, SR, CR, S, C, R and I (independent).

Table 1: The dataset details constructed from each model (# of loopy examples, # of linear chain or tree examples, # of one node examples, # of total examples, average # of nodes per example).
Figure 3 shows representative graphs of the eight models.
When the graph is disconnected by removing edges, classification is performed on each connected subgraph.
We call each connected subgraph an example.
We name the examples according to the graph structure: "loopy examples" are subgraphs including at least one cycle; "linear chain or tree examples" are subgraphs including no cycle but at least one edge; "one node examples" are subgraphs without edges.
Table 1 shows the distribution of the examples of each model.
Since the SCR, SC, SR and CR models have loopy examples, TRP approximate inference is necessary.
To perform training and testing with CRFs, we use GRMM (Sutton, 2006) with TRP.
We set the Gaussian prior variance for the weights to σ² = 10 in all models.
Figure 3: An example of graphs constructed by combinations of the defined cliques. S, C and R in the model names mean that the corresponding model has Sibling, Cousin and Relative cliques respectively. In each model, classification is performed on each connected subgraph.
SVMs We introduce two models by SVMs (model I and model P).
In model I, each anchor text is classified independently.
In model P, we order the anchor texts in a linear-chain sequence.
Then we perform history-based classification along the sequence, in which the (j−1)-th classification result is used in the j-th classification.
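The history-based scheme of model P can be sketched as follows. The rule-based classifier and the feature names are hypothetical stand-ins for the SVM, used only to show how the previous prediction feeds the next one.

```python
# History-based classification along a linear-chain sequence:
# the (j-1)-th predicted label is an extra input at step j.
def classify(features, prev_label):
    # hypothetical rule: an anchor in the same list as a PERSON
    # tends to be a PERSON too
    if "is_name" in features:
        return "PERSON"
    if prev_label == "PERSON" and "same_list" in features:
        return "PERSON"
    return "OTHER"

sequence = [{"is_name"}, {"same_list"}, {"same_list"}, set()]
labels, prev = [], "NONE"
for feats in sequence:
    prev = classify(feats, prev)   # prediction becomes the next history
    labels.append(prev)
```

Here the PERSON decision at the first anchor propagates through the two "same_list" anchors and stops at the unrelated final one.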
To perform training and testing with SVMs, we use TinySVM 2 with a linear kernel, and the one-versus-rest method is used for multi-class classification.
We set the cost of constraint violation to C = 1.
Features for CRFs and SVMs The features used in the classification with CRFs and SVMs are shown in Table 2.
Japanese morphological analyzer MeCab 3 is used to obtain morphemes.
We evaluate the models by 5 fold cross-validation.
Since the number of examples differs in each model, the datasets are divided by the examples, namely connected subgraphs, in the SCR model, so that the sizes of the five sub-datasets are roughly equal.
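One way to realize such a split is a greedy balancing of connected subgraphs, sketched below with made-up example sizes; each subgraph is assigned whole to a fold so no example is split across folds.

```python
# Assign examples (connected subgraphs) whole to folds, greedily
# balancing total node counts so the folds come out roughly equal.
def split_folds(example_sizes, n_folds=5):
    folds = [[] for _ in range(n_folds)]
    totals = [0] * n_folds
    # place the biggest examples first into the currently lightest fold
    for idx, size in sorted(enumerate(example_sizes), key=lambda t: -t[1]):
        f = totals.index(min(totals))
        folds[f].append(idx)
        totals[f] += size
    return folds, totals

sizes = [9, 7, 6, 5, 4, 4, 3, 2, 2, 1, 1, 1]   # hypothetical subgraph sizes
folds, totals = split_folds(sizes)
```

Every example lands in exactly one fold, and the fold sizes differ by at most the largest example size.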
We evaluate per-class and total extraction performance by F1-value.
4.4 Results and discussion
Table 3 shows the classification accuracy of each model.
The second column "N" stands for the number of nodes in the gold data.
The second last row "ALL" stands for the F1-value over all NE classes.
3http://mecab.sourceforge.net/
Table 2: Features used in experiments. A check mark means that the corresponding features are used in classification. The V, S, C and R in the CRFs columns correspond to the node, sibling edges, cousin edges and relative edges respectively. Observation features: definition (bag-of-words), heading of articles, heading of articles (morphemes), categories of articles, categories of articles (morphemes), anchor texts, anchor texts (morphemes), parent tags of anchor texts, text included in the last header of anchor texts, text included in the last header of anchor texts (morphemes). Label features: between-label feature, previous label.
Table 3: Comparison of F1-values of CRFs and SVMs per NE class (rows include TIMEX/NUMEX, FACILITY, LOCATION, NATURAL_OBJECTS, ORGANIZATION, VOCATION, NAME_OTHER, ALL and ALL (no articles)).
The last row "ALL (no articles)" stands for the F1-value over all NE classes for entities which have no gloss texts in Wikipedia.
Relational vs. Independent Among the models constructed by combinations of the defined cliques, the best F1-value is achieved by the CR model.
We performed a McNemar paired test on labeling disagreements between the CR model of CRFs and the I model of CRFs.
The difference was significant (p < 0.01).
These results show that considering dependencies works positively, obtaining better accuracy than classifying independently.
The Cousin cliques provide the highest accuracy improvement among the three defined cliques.
The reason may be that the Cousin cliques appear frequently in comparison with the other cliques, and also possess strong dependencies among anchor texts.
As for PERSON, better accuracy is achieved in SC and SCR models.
In fact, the PERSON-PERSON pairs frequently appear in Sibling cliques (435 out of 4925) and in Cousin cliques (2557 out of 13125) in the dataset.
Also, as for PRODUCT and LOCATION, better accuracy is achieved in the models that contain Cousin cliques (C, CR, SC and SCR model).
1072 PRODUCT-PRODUCT pairs and 738 LOCATION-LOCATION pairs appear in Cousin cliques.
The "ALL (no article)" row in Table 3 shows the F1-value for nodes which have no gloss texts.
The F1-value difference between the CR and I models of CRFs in the "ALL (no article)" row is larger than the difference in the "ALL" row.
This means that the dependency information helps to extract NEs without gloss texts in Wikipedia.
We attempted a different parameter tying, in which the SCR potential functions are tied with a particular observation feature.
This parameter tying was introduced by Ghamrawi and McCallum (2005).
However, we did not obtain any accuracy improvement.
The best model of CRFs (the CR model) outperforms the best model of SVMs (the P model).
We performed McNemar paired test on labeling disagreements between CR model of CRFs and P model of SVMs.
The difference was significant (p< 0.01).
In the classes having larger numbers of examples, models of CRFs achieve better F1-values than models of SVMs.
However, in several classes having smaller numbers of examples, such as EVENT and UNIT, models of SVMs achieve significantly better F1-values than models of CRFs.

Figure 4: Precision-Recall curve obtained by varying the threshold t of marginal probability from 1.0 to 0.0.
Filtering NE Candidates using Marginal Probability The precision-recall curve obtained by thresholding the marginal probability of the MAP estimation in the CR models is shown in Figure 4.
The curve reaches a peak at 0.57 in recall, and the precision value at that point is 0.97.
These precision and recall values mean that 57% of all NEs can be classified with approximately 97% accuracy at a particular threshold of marginal probability.
This result suggests that the extracted NE candidates can be filtered at lower cost by exploiting the marginal probability.
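Such filtering can be sketched as follows; the (probability, correctness) pairs are made-up toy values, not the paper's data.

```python
# Filter NE candidates by thresholding the marginal probability, as in
# Figure 4: keep predictions with confidence >= t, then score the kept set.
def precision_recall_at(preds, t):
    kept = [ok for prob, ok in preds if prob >= t]
    total_true = sum(ok for _, ok in preds)
    if not kept:
        return 1.0, 0.0
    precision = sum(kept) / len(kept)
    recall = sum(kept) / total_true
    return precision, recall

preds = [(0.99, True), (0.95, True), (0.90, True), (0.70, False),
         (0.60, True), (0.40, False), (0.30, True)]
p_hi, r_hi = precision_recall_at(preds, 0.8)   # strict threshold
p_lo, r_lo = precision_recall_at(preds, 0.0)   # keep everything
```

Raising the threshold trades recall for precision: the strict threshold keeps only the three confident, correct predictions.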
Training Time The total training times of all CRFs and SVMs models are shown in Table 4.
The training time tends to increase when models have complicated graph structures.
For instance, the SCR model has a complex graph structure compared to model I; therefore, SCR's training time is three times longer than model I's. Training the models with SVMs is faster than training them with CRFs.
The difference comes from implementation issues: C++ vs. Java, differences in the feature extraction modules, and so on.
Therefore, comparing the two is not an important issue in this experiment.

Table 4: Training time (minutes) of the CRF and SVM models.
5 Related Work
Wikipedia has become a popular resource for NLP.
Bunescu and Pasca (2006) used Wikipedia for detecting and disambiguating NEs in open-domain texts.
Strube and Ponzetto (2006) explored the use of Wikipedia for measuring semantic relatedness between two concepts, and for coreference resolution.
Several CRFs have been explored for information extraction from the web.
Tang et al. (2006) proposed Tree-structured Conditional Random Fields (TCRFs), which capture the hierarchical structure of web documents.
Zhu et al. (2006) proposed Hierarchical Conditional Random Fields (HCRFs) for product information extraction from Web documents.
TCRFs and HCRFs are similar to our approach described in section 3 in that the model structure is induced by the page structure.
However, their model structures are different from ours.
There are statistical models that capture dependencies between examples.
There are two types of classification approaches: iterative (Lu and Getoor, 2003b; Lu and Getoor, 2003a) and collective (Getoor et al., 2001; Taskar et al., 2002).
Lu and Getoor (2003a; 2003b) proposed a link-based classification method based on logistic regression.
This model iterates local classification until label assignments converge.
The results vary depending on the ordering strategy of the local classification.
In contrast to iterative classification methods, collective classification methods directly estimate most likely assignments.
Getoor et al. (2001) proposed Probabilistic Relational Models (PRMs), which are built upon Bayesian Networks.
Since Bayesian Networks are directed graphical models, PRMs cannot directly model the cases where the instantiated graph contains cycles.
Taskar et al. (2002) proposed Relational Markov Networks (RMNs).
RMNs are a special case of Conditional Markov Networks (or Conditional Random Fields) in which the graph structure and parameter tying are determined by an SQL-like form.
As for using the marginal probability as a confidence measure, as in Figure 4, Peng et al. (2004) applied linear-chain CRFs to Chinese word segmentation.
The marginal probability is calculated by the constrained forward-backward algorithm (Culotta and McCallum, 2004), and confident segments are added to the dictionary in order to improve segmentation accuracy.
6 Conclusion
In this paper, we proposed a method for categorizing NEs in Wikipedia.
We defined three types of cliques that constitute dependent anchor texts when constructing the CRF graph structure, and introduced potential functions over them for classification.
The experimental results show the effectiveness of capturing dependencies: the proposed CRF model achieves significant improvements compared to baseline methods with SVMs.
The results also show that the dependency information from the HTML tree helps to categorize entities without gloss texts in Wikipedia.
The marginal probability of the MAP assignment can be used as a confidence measure for entity categorization.
We can control the precision by filtering on this confidence measure, as in the PR curve in Figure 4.
The measure can also be used as a confidence estimator in active learning with CRFs (Kim et al., 2006), where the examples with the most uncertainty are selected for presentation to human annotators.
In future research, we plan to explore NE categorization with a more fine-grained label set.
For NLP applications such as QA, an NE dictionary with fine-grained labels will be a useful resource.
However, classification with statistical methods generally becomes difficult when the label set is large, because of insufficient positive examples.
It is an issue to be resolved in the future.
