In this paper, we address the problem of extracting data records and their attributes from unstructured biomedical full text.
There has been little effort reported on this in the research community.
We argue that semantics is important for record extraction or finer-grained language processing tasks.
We derive a data record template including semantic language models from unstructured text and represent them with a discourse level Conditional Random Fields (CRF) model.
We evaluate the approach from the perspective of Information Extraction and achieve significant improvements on system performance compared with other baseline systems.
1 Introduction
The discovery and extraction of specific types of information, and its (re)structuring and storage into databases, are critical tasks for data mining, knowledge acquisition, and information integration from large corpora or heterogeneous resources (e.g., Muslea et al., 2001; Arasu and GarciaMolina, 2003).
For example, webpages of products on Amazon may contain a list of data records such as books, watches, and electronics.
Automatic extraction of individual records will facilitate the access and management of data resources.
Most current approaches address this problem for structured or semi-structured text, for instance, from XML format files or lists and/or tabular data records on webpages (e.g., Liu et al., 2003; Zhu et al., 2006).
The techniques applied rely strongly on the analysis of document structure derived from
the webpage's html tags (e.g., the DOM tree model).
Regarding unstructured text, most Information Extraction (IE) work has focused on named entities (people, organizations, places, etc.).
Such IE treats each extracted element as a separate record.
Much less work has focused on the case where several related pieces of information have to be extracted to jointly comprise a single data record.
In this work, it is usually assumed that there is only one record for each document (e.g., Kristjannson et al., 2004).
Almost no work tries to extract multiple data records from a single document.
Multiple data records can be scattered across the narrative in free text.
The problem becomes much harder as there are no explicit boundaries between data records and no heavily indicative format features (like html tags) to utilize.
With the exponential increase of unstructured text resources (e.g., digitalized publications, papers and/or technical reports), knowledge needs have made it a necessity to explore this problem.
For example, biomedical papers contain numerous experiments and findings.
But the large volume and rate of publication have made it infeasible to read through the articles and manually identify data records and attributes.
We present a study to extract data records and attributes from the biomedical research literature.
This is part of an effort to develop a Knowledge Base Management System to benefit neuroscience research.
Specifically we are interested in knowledge of various aspects (attributes) of Tract-tracing Experiments (TTE) (data records) in neuroscience.
The goal of TTE experiments is to chart the interconnectivity of the brain by injecting tracer chemicals into a region of the brain and identifying corresponding labeled regions where the tracer is
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 837-846, Prague, June 200?.
©2007 Association for Computational Linguistics
Figure 1.
An example of data records and attributes in a research article.
taken up and transported to (Burns et al., 2007).
To extract data records from the research literature, we need to solve two sub-problems: discovering individual attributes of records and grouping them into one or more individual records, each record representing one TTE experiment.
Each attribute may contain a list of words or phrases and each record may contain a list of attributes.
Listing each sentence from top to bottom, we call the first problem the Horizontal Problem (HP) and the second the Vertical Problem (VP).
Figure 1 provides an example of a TTE research article with colored fragments representing attributes and dashed frames representing data records.
For instance, the third dashed frame represents one experiment record having three attributes with corresponding biological interpretations: "no labeled cells", "the DCN", and "the contralateral AVCN".
We view the HP and VP problems as two sequential labeling problems and describe our approach using two-level Conditional Random Fields (CRF) (Lafferty et al., 2001) models to extract data records and their attributes.
The HP problem (finding individual attribute values) is solved using a sentence-level CRF labeling model that integrates a rich set of linguistic features.
For the VP problem, we apply a discourse-level CRF model to identify individual experiments (data records).
This model utilizes deep
semantic knowledge from the HP results (attribute labels within sentences) together with semantic language models and achieves significant improvements over baseline systems.
This paper mainly focuses on the VP problem, since linguistic features for the HP problem is the general IE topic of much past research (e.g., Peng and McCallum, 2004).
We apply various feature combinations to learn the most suitable and indicative linguistic features.
The remainder of this paper is organized as follows: in the next section we discuss related work.
Following that, we present the approach to extract data records in Section 3.
We give extensive experimental evaluations in Section 4 and conclude in Section 5.
2 Related Work
As mentioned, data record extraction has been extensively studied for structured and semi-structured resources (e.g., Muslea et al., 2001; Arasu and Garcia-Molina, 2003; Liu et al., 2003; Zhu et al., 2006).
Most of those approaches rely on the analysis of document structure (reflected in, for example, html tags), from which record templates are derived.
However, this approach does not apply to unstructured text.
The reason lies in the difficulty of representing a data record template in free text without formatting tags and integrating it
into a learning system.
We show how to address this problem by deriving data record templates through language analysis and representing them with a discourse level CRF model.
Given the problem of identifying one or more records in free text, it is natural to turn toward text segmentation.
The Natural Language Processing (NLP) community has come up with various solutions towards topic-based text segmentation (e.g., Hearst, 1994; Choi, 2000; Malioutov and Barzilay, 2006).
Most unsupervised text segmentation approaches work under optimization criteria to maximize the intra-segment similarity and minimize the inter-segment similarity based on word distribution statistics.
However, this approach cannot be applied directly to data record extraction.
A careful study of our corpus shows that data records share many words and phrases and are not distinguishable based on word similairties.
In other words, different experiments (records) always belong to the same topic and there is no way to segment them using standard topic segmentation techniques (even if one views the problem as a finer-level segmentation than traditional text segmentation).
In addition, most text segmentation approaches require a prespecified number of segments, which in our domain cannot be provided.
(Wick et al., 2006) report extracting database records by learning record field compatibility.
However, in our case, the field compatibility is hard to distinguish even by a human expert.
Cluster-based or pairwise field similarity measures do not apply to our corpora without complex knowledge reasoning.
Most of Wick et al.'s data (faculty and student's homepages) contains one record.
In addition, as explained below, we have found that surface word statistics alone are not sufficient to derive data record templates for extraction.
Some (limited) form of semantic understanding of text is necessary.
We therefore first perform some sentence level extraction (following the HP problem) and then integrate semantic labels and semantic language model features into a discourse level CRF model to represent the template for extracting data records in the future.
Recently an increasing number of research efforts on text mining and IE have used CRF models
provides a compact way to integrate different types of features when sequential labeling is important.
Recent work includes improved model variants (e.g., Jiao et al., 2006; Okanohara et al., 2006) and applications such as web data extraction (Pinto et al., 2003), scientific citation extraction (Peng and McCallum, 2004), and word alignment (Blunsom and Cohn, 2006).
But none of them have used CRFs for discourse level data record extraction.
We use a CRF model to represent a data record template and integrate various knowledge as CRF features.
Instead of traditional work on the sentence level, our focus here is on the discourse level.
As this has not been carefully explored, we experiment with various selected features.
For the biomedical domain, our work will facilitate biomedical research by supporting the construction of Knowledge Base Management Systems (e.g., Stephan et al., 2001; Hahn et al., 2002; Burns and Cheng, 2006).
Unlike the well-studied problem of relation extraction from biomedical text, our work focuses on grouping extracted attributes across sentences into meaningful data records.
TTE experiment is only one of many experimental types in biology.
Our work can be generalized to many different types of data records to facilitate biology research.
In the next section, we present our approach to extracting data records.
3 Extracting Data Records
Inspired by the idea of Noun Phrase (NP) chunking in a single sentence, we view the data records extraction problem as discourse chunking from a sequence of sentences using a sequential labeling CRF model.
3.1 Sequential Labeling Model: CRF
The CRF model addresses the problem of labeling sequential tokens while relaxing the strong independence assumptions of Hidden Markov Models (HMMs) and avoiding the presence of label bias from having few successor states.
For each current state, we obtain the conditional probability of its output states given previously assigned values of input states.
For most language processing tasks, this model is simply a linear-chain Markov Random Fields model.
In typical labeling processes using CRFs each token is viewed as a labeling unit.
For our problem, we process each input document D = s 2,..., sn) as a sequence of individual sen-
tences, with a corresponding labeling sequence of labels, L = l2,...,ln), so that each sentence corresponds to only one label.
In our problem, each data record corresponds to a distinct TTE experiment.
Similar to NP chunking, we define three labels for sentences, "B_REC" (beginning of record), "I_REC" (inside record), and "O" (other).
The default label "O" indicates that this sentence is beyond our concern.
The CRF model is trained to maximize the probability of P(L | D) , that is, given an input document D, we find the most probable labeling sequence L. The decision rule for this procedure is:
A CRF model of the two sequences is characterized by a set of feature functions fk and their corresponding weights Xt .
As in Markov fields, the conditional probability P(L | D) can be computed using Equation 2.
where fk (lt-1, lt, D, t) is a feature function, representing either the state transition feature fk (lt-1, lt, D) or the feature of output state fk (lt, D) given the input sequence.
All these feature functions are user-defined boolean functions.
CRF works under the framework of supervised learning, which requires a pre-labeled training set to learn and optimize system parameters to maximize the probability or its log format.
Equipped with this model, we investigate how to apply it and prepare features accordingly.
3.2 Feature Preparation
The CRF model provides a compact, unified framework to integrate features.
However, unlike sentence-level processing, where features are very intuitive and circumscribed, it is not obvious what features are most indicative for our problem.
We therefore explore three categories of features for discourse level chunking.
Most text segmentation approaches compute surface word similarity scores in given corpora without semantic analysis.
However, in our case, data records have very similar characteristics and
share most of the words.
They are not distinguishable just from an analysis of surface word statistics.
We have to understand the semantics before we can make decisions about data record extraction.
In our case, we care about the four types of attributes of each data record (one TTE experiment).
Table 1 gives the definitions of the four attributes for each data record.
injectionLocation
the named brain region where the injection was made.
tracerChemical
the tracer chemical used.
labelingLocation
the region/location where the labeling was found.
labelingDescription
a description of labeling, including label density or label type.
Table 1.
Attributes of data records (a TTE experiment).
To obtain this semantic attributes information of individual sentences (the HP problem), we first apply another sentence-level CRF model to label each sentence.
We consider five categories of features based on language analysis.
Table 2 shows the features for each category.
Lexicon Knowledge
TOPOGRAPHY
Is word topographic?
BRAINREGION
Is word a region name?
Is word a tracer chemical?
Is word a density term?
LABELINGTYPE
Does word denote a labeling type?
Surface Word
Current word
Context Window
CONT-INJ
If current word is within a window of injection context
Window Words
Prev-word
Previous word
Next-word
Next word
Dependency Features
Root-form
Root form of the word if different
Gov-verb
The governing verb
The sentence subject
The sentence object
Table 2.
The features for labeling words.
a. Lexicon knowledge.
We used names of brain structures taken from brain atlases (Swanson, 2004), standard terms to denote neuro-anatomical topographical relationships (e.g., "rostral"), the name or abbreviation of the tracer chemical used (e.g., "PHAL"), and commonsense descriptions for descriptions of
the labeling (e.g., "dense", "light").
b. Surface and window word.
The current word and the words around are important indicators of the most probable label.
c. Context window.
The TTE is a description of the inject-label-findings process.
Whenever a word having a root form of "injection" or "deposit" appears, we generate a context window and all the words falling into this window are assigned a feature of "CONT-
INJ".
d. Dependency features.
We apply a dependency parser MiniPar (Lin, 1998) to parse each sentence, and then derive four types of features from the parsing result.
These features are (a) root form of every word, (b) the subject within the sentence, (c) the object within the sentence, and (d) the governing verbs.
The labeling system assigns a label for every token in each sentence.
We achieved the best performance with an F-score of 0.79 (based on a precision of 0.80 and a recall of 0.78).
This is not the focus of this paper.
Please refer to our previous work (Burns et al., 2007) for details.
Figure 2.
An example of semantic attribute labels.
With the sentence-level understanding of each sentence, we obtain the semantic attribute labels for the data records.
Figure 2 gives an example sentence with semantic attribute labels.
Here <tracerChemical>, <labelingLocation>, and <la-belingDescription> are recognized by the system, and the attribute names will be used as features for this sentence.
Since text narratives might adhere to logical ways of expressing facts, language models for each sentence will also provide good features to extract data records.
However, in biomedical research articles many of the technical words/phrases used in the narrative are repeated across experiments, making the surface word language model of little use in deriving generalized data record templates.
Considering this, we replace in each sentence the labeled fragments with their attribute labels and then derive semantic language models from that format.
By 'semantic language model' we therefore mean a combination of semantic labels and surface words.
For example, in the sentence shown in Figure 2, we have the semantic language model trigrams location-of-<tracerChemical>, sites-in-<injectionLocation>, and <labelingDescription>-followed-the.
In addition, we also query WordNet for the root form of each word to generalize the semantic language models.
This for example produces the semantic language model trigrams site-in-<injectionLocation> and <labelingDescription>-follow-the.
We believe the collected semantic language models represent an inherent structure of unstructured data records.
By integrating them as features with a CRF model, we expect to represent data record templates and use the learned model to extract new data records.
However, it is not clear what semantic language models are most indicative and useful.
A bag-of-words (language models) approach may bring much noise in.
We show below a comparison of regular language models and semantic language models in evaluations.
The previous two categories of features come from the discovery of semantic components of sentences and their narrative form word analysis.
When interviewing the neuroscience expert annotator, we learned that some layout and word level heuristics may also help to delineate individual data records.
Table 3 gives the two types of heuristic features.
When a sentence contains heuristic words, it will be assigned to a word heuristic feature.
If the sentence is at the boundary of a paragraph, it will be assigned a layout heuristic feature, namely the first or the last sentence in the paragraph.
Description
EXPBWORD
INJECT CASE
EXPERIMENT
APPLICATION
PLACEMENT
INTRODUCTION
Heuristic words for beginning of an experiment description
POS_IN_PARA
FIRST_IN_PARA LAST_IN_PARA
Position of the sentence in the paragraph
Table 3.
The heuristic features.
4 Empirical Evaluation
To evaluate the effectiveness and performance of our technique, we conducted extensive experiments to measure the data record extraction approach.
4.1 Experimental Setup
We used the machine learning package MALLET (McCallum, 2002) to conduct the CRF model training and labeling.
We have obtained the digital publications of 9474 Journal of Comparative Neurology (JCN)1 articles from 1982 to 2005.
We have converted the PDF format into plain text, maintaining paragraph breaks (some errors still occur though).
A simple heuristic based approach identifies semantic sections of the paper (e.g, Introduction, Results, Discussion).
As most experimental descriptions appear in the Results section, we only process the Results section.
A neuroscience expert manually annotated the data records in the Results section of 58 research articles.
The total number of sentences in the Results section of the 58 files is 6630 (averaging 114.3 sentences per article).
__Training Set Testing Set
Table 4.
Experiment configuration.
We randomly divided this material into training and testing sets under a 2:1 ratio, giving 39 documents in the training set and 19 in the testing set.
Table 4 gives the numbers of documents and data records in the training and the testing set.
4.2 Evaluation Metrics
To evaluate data record extraction, we notice it is not fair to strictly evaluate the boundaries of data records because this does not penalize the near-miss and false positive of data records in a reasonable way; sentences near a boundary that contain no relevant record information can be included or omitted without affecting the results.
Hence the standard Pk (Beeferman et al., 1997) and WinDiff (Pevzner and Hearst, 2002) measures for text segmentation are not so suitable for our task.
As we are concerned with the usefulness of knowledge in extracted data records, we instead evaluate from the perspective of IE.
We measure system performance on the quality of the extracted data records.
For each extracted data record, it will be aligned to one of the data records in the gold standard using the "dominance rule" (if the data record can be aligned to multiple records in the gold standard, it will be aligned to the one with highest overlap).
Then we evaluate the precision, recall, and F1 scores of extracted units of the data record.
The units are the attributes in data records.
# of correct units
precision -
# of the extracted units by the system # of correct units
precision + recall
These measures provide an indication of the completeness and correctness of each extracted record (experiment).
We also measure the number of distinct records extracted, compared with the gold standard as appearing in the document.
4.3 Experiment Results
To fully compare the effectiveness of our semantic analysis functionality, we evaluated system performance for all the following systems:
TextTiling (TT): To compare with text segmentation techniques, we use TextTiling (Hearst, 1994) with default parameters as the first baseline system.
Random Guess (RG): In order to demonstrate the data balance of all the possible labels in the testing set, we also use another baseline system with random decisions for each sentence.
Domain Heuristics (DH): In a regular TTE experiment, only one tracer chemical will typically be used.
Given this heuristic, we assume each data record contains one tracer chemical.
In this system, we first locate sentences with identified trace chemicals, and then we greedily expand backward and forward until another new tracer chemical appears or no other attribute is included.
Surface Text (ST): To measure the effectiveness of the semantic analysis (attribute labels and semantic language models), the ST system utilizes only standard surface word language models and heuristic features.
Semantic Analysis (SEM): The SEM system uses all the semantic features available (including identified attributes and semantic language models) and two heuristic features.
Table 5 shows the final performance of these different systems.
The second column provides the numbers of extracted data records.
In this task, a larger number does not necessarily mean a better system, as a system might produce too many false positives.
The remaining three columns represent the precision, recall, and F1 scores, averaged over all data records.
With our approach, the system performance is significantly improved compared with other systems.
System TT fails in this task as it only outputs the full document as one single record.
U of Records
Prec.
Rec.
Table 5.
System performance.
To investigate how plain text language models and semantic language models affect system performance, we also experimented with all the language models.
Table 6 shows comparisons of three types of language models.
Systems with semantic analysis always work better than those with only surface text analysis.
Without semantic analysis, unigram features work better than bigram and tri-gram features.
This matches our intuition: without generalizing to semantic language models, higher order language models will be relatively sparse and contain much noise.
However, when taking into account the semantic features, we found that bigram and trigram semantic language model fea-
tures outperformed unigrams.
They are especially important in boosting the recall scores as they capture more generalized information when derived.
Table 6.
Language model comparisons.
As an example, Table 7 gives a list of high quality bigram semantic language models ranked by their information gains based on the training data.
through_<labelingLocation>
<labelingDescription>_be
of_<tracerChemical>
<labelingLocation>_(
<tracerChemical>_be
<tracerChemical>_injection
beinject
into_<injectionLocation>
becenter
<labelingDescription>_from
injectwith
<tracerChemical>_in
injectionof
in_<labelingLocation>
inexperiment
Table 7.
An example list of top-ranked bigrams.
The main difficulty for data record extraction from unstructured text lies in deriving and representing a template for future extraction.
We actually take advantage of CRF and represent the template with a CRF model.
Each data record is measured with precision, recall, and F1 scores.
Figure 3 depicts the distribution of extracted data records according to these measures in the best system.
Distribution
Performance
Figure 3.
Data records performance distribution.
The results are encouraging, especially given the complexity and flexibility of data record descriptions in the unstructured text.
In Figure 3, Axis X
represents the value interval for precision, recall, and F1, and Axis Y represents the number of extracted records with their corresponding values.
For example, 57 records have recall scores falling into [0.9, 1.0].
Figure 4 gives an example alignment between system result and the gold standard.
Each record is represented by a range of sentences.
The numbers following each record in the system result are individual data record's precision and recall scores.
System Gold
...
Figure 4.
An example of record extraction in one doc.
This is a real example from the testing set.
For records R1, R3, and R6, the system can extract the exact sentences contained.
For record R2 and R5, although they do not exactly match at the sentence level, the extracted record contains the entire required set of attributes as in the gold standard.
4.4 Error Analysis and Discussion
When we investigated the errors, we found that sometimes the extracted data records combined two or more smaller gold standard records, or vice versa.
As shown in Figure 4, extracted records R4 and R7 are both combinations of records in the gold standard.
This is partially due to the granularity definition problem.
Authors may mention several approaches/symptoms to one type of experiment for a single purpose.
In this case, it is almost infeasible to have annotators strictly agree on granularity and thus to teach the system to acquire this knowledge.
For example, in the gold standard, the annotator annotated three successive sentences as three separate records but the system output
those as only one data record.
In this extreme case, it is too hard to expect the system to perform well.
In our approach, the semantic attribute labels and semantic language models require the result of the initial sentence-level labeling, which has an F-score of 0.79.
The error may propagate into the data record extraction procedure and lower overall system performance.
In our current experiments, we also assume all the attributes within one segment belong to one record.
However, the situation of embedded data records will make this problem harder.
For example, authors sometimes compare the current experiment with other approaches in referenced papers.
In this case, those attributes should be excluded from the records.
We need to invent rules or constraints to filter them out.
When such reference occurs at experiment boundaries, it brings higher risk for correct results.
It is a very hard problem to extract from unstructured text neat structured records.
The annotators sometimes employ background knowledge or reasoning when performing manual extraction; such knowledge cannot today be easily modeled and integrated into learning systems.
In our study, we also compared some feature selection approaches.
Similar to (Yang and Pedersen, 1997), we tried Feature Instance Frequency, Mutual Information, Information Gain, and CHI-square test.
But we eventually found that the system including all the features worked best, and with all the other configurations unchanged, feature instance frequency worked at almost the same level as other complex measures such as mutual information and information gain.
5 Conclusion and Future Work
In this paper, we explored the problem of extracting data records from unstructured text.
The lack of structure makes it difficult to derive meaningful objects and their values without resorting to deeper language analysis techniques.
We derived indicative linguistic features to represent data record templates in free text, using a two-pass approach in which the second pass used the IE labels derived from the first to compose attributes into coherent data records.
We evaluated the results from an IE perspective and reported potential problems of error generation.
For the future, we plan to explore additional feature types and feature selection strategies to determine what is "good" for unstructured record templates to improve our results.
More effort will also be put into the sentence-level analysis to reduce error propagations.
In addition, ontology based knowledge inference strategies might be useful to validate attributes in single record and in turn help data record extraction.
The last thing under our direction is to explore new models if applicable.
We hope this thought-provoking problem will attract more attention from the community.
In the future, we plan to make our corpus available to the community.
The solution to this problem will highly affect the access of knowledge in large scale unstructured text corpora.
Acknowledgements
The work was supported in part by an ISI seed funding, and in part by a grant from the National Library of Medicine (RO1 LM07061).
The authors want to thank Feng Pan for his helpful suggestions with the manuscript.
We would also like to thank the anonymous reviewers for their valuable comments.
