This paper presents a syntax-driven approach to question answering, specifically the answer-sentence selection problem for short-answer questions.
Rather than using syntactic features to augment existing statistical classifiers (as in previous work), we build on the idea that questions and their (correct) answers relate to each other via loose but predictable syntactic transformations.
We propose a probabilistic quasi-synchronous grammar, inspired by one proposed for machine translation (D. Smith and Eisner, 2006), and parameterized by mixtures of a robust non-lexical syntax/alignment model with a(n optional) lexical-semantics-driven log-linear model.
Our model learns soft alignments as a hidden variable in discriminative training.
In experiments on the TREC dataset, our model significantly outperforms strong state-of-the-art baselines.
1 Introduction and Motivation
Open-domain question answering (QA) is a widely-studied and fast-growing research problem.
State-of-the-art QA systems are extremely complex.
They usually take the form of a pipeline architecture, chaining together modules that perform tasks such as answer type analysis (identifying whether the correct answer will be a person, location, date, etc.), document retrieval, answer candidate extraction, and answer reranking.
This architecture is so predominant that each task listed above has evolved
into its own sub-field and is often studied and evaluated independently (Shima et al., 2006).
At a high level, the QA task boils down to only two essential steps (Echihabi and Marcu, 2003).
The first step, retrieval, narrows down the search space from a corpus of millions of documents to a focused set of maybe a few hundred using an IR engine, where efficiency and recall are the main focus.
The second step, selection, assesses each candidate answer string proposed by the first step, and finds the one that is most likely to be an answer to the given question.
The granularity of the target answer string varies depending on the type of the question.
For example, answers to factoid questions (e.g., Who, When, Where) are usually single words or short phrases, while definitional questions and other more complex question types (e.g., How, Why) look for sentences or short passages.
In this work, we fix the granularity of an answer to a single sentence.
Earlier work on answer selection relies only on the surface-level text information.
Two approaches are most common: surface pattern matching, and similarity measures on the question and answer, represented as bags of words.
In the former, patterns for a certain answer type are either crafted manually (Soubbotin and Soubbotin, 2001) or acquired from training examples automatically (Ittycheriah et al., 2001; Ravichandran et al., 2003; Licuanan and Weischedel, 2003).
In the latter, measures like cosine-similarity are applied to (usually) bag-of-words representations of the question and answer.
Although many of these systems have achieved very good results in TREC-style evaluations, shallow methods using the bag-of-words representation clearly have their limitations.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 22-32, Prague, June 2007.
©2007 Association for Computational Linguistics
Examples of cases where the bag-of-words approach fails abound in the QA literature; here we borrow an example used by Echihabi and Marcu (2003).
The question is "Who is the leader of France?", and the sentence "Henri Hadjenberg, who is the leader of France's Jewish community, endorsed ..."
(note tokenization), which is not the correct answer, matches all keywords in the question in exactly the same order.
(The correct answer is found in "Bush later met with French President Jacques Chirac.")
This example illustrates two types of variation that need to be recognized in order to connect this question-answer pair.
The first variation is the change of the word "leader" to its semantically related term "president".
The second variation is the syntactic shift from "leader of France" to "French president."
It is also important to recognize that "France" in the first sentence is modifying "community", and therefore "Henri Hadjenberg" is the "leader of ... community" rather than the "leader of France."
These syntactic and semantic variations occur in almost every question-answer pair, and typically they cannot be easily captured using shallow representations.
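The failure can be reproduced in a few lines. The sketch below scores the two sentences against the question with cosine similarity over raw bag-of-words counts; the regular-expression tokenizer and the function names are illustrative assumptions, not part of any system discussed here.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count alphabetic tokens (a deliberately crude tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


question = bag_of_words("Who is the leader of France?")
wrong = bag_of_words(
    "Henri Hadjenberg, who is the leader of France's Jewish community, endorsed ..."
)
right = bag_of_words("Bush later met with French President Jacques Chirac.")

wrong_score = cosine(question, wrong)  # every question word also occurs here
right_score = cosine(question, right)  # no surface word in common with the question
```

Under these assumptions the incorrect sentence scores about 0.71 and the correct one scores 0, because "French" and "President" share no surface form with "France" and "leader".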
It is also worth noting that such syntactic and semantic variations are not unique to QA; they can be found in many other closely related NLP tasks, motivating extensive community efforts in syntactic and semantic processing.
Indeed, in this work, we imagine a generative story for QA in which the question is generated from the answer sentence through a series of syntactic and semantic transformations.
The same story has been told for machine translation (Yamada and Knight, 2001, inter alia), in which a target language sentence (the desired output) has undergone semantic transformation (word to word translation) and syntactic transformation (syntax divergence across languages) to generate the source language sentence (noisy-channel model).
Similar stories can also be found in paraphrasing (Quirk et al., 2004; Wu, 2005) and textual entailment (Harabagiu and
Hickl, 2006; Wu, 2005).
Our story makes use of a weighted formalism known as quasi-synchronous grammar (hereafter, QG), originally developed by D. Smith and Eisner (2006) for machine translation.
Unlike most synchronous formalisms, QG does not posit a strict isomorphism between the two trees, and it provides
an elegant description for the set of local configurations.
In Section 2 we situate our contribution in the context of earlier work, and we give a brief discussion of quasi-synchronous grammars in Section 3.
Our version of QG, called the Jeopardy model, and our parameter estimation method are described in Section 4.
Experimental results comparing our approach to two state-of-the-art baselines are presented in Section 5.
We discuss portability to cross-lingual QA and other applied semantic processing tasks in Section 6.
2 Related Work
To model the syntactic transformation process, researchers in these fields—especially in machine translation—have developed powerful grammatical formalisms and statistical models for representing and learning these tree-to-tree relations (Wu and Wong, 1998; Eisner, 2003; Gildea, 2003; Melamed,
Galley et al., 2006; Smith and Eisner, 2006, inter alia).
We can also observe a trend in recent work in textual entailment that more emphasis is put on explicit learning of the syntactic graph mapping between the entailed and entailed-by sentences (MacCartney et al., 2006).
However, relatively few such attempts have been made in the QA community.
As pointed out by Katz and Lin (2003), most early experiments in QA that tried to bring in syntactic or semantic features showed little or no improvement, and it was often the case that performance actually degraded (Litkowski, 1999; Attardi et al., 2001).
More recent attempts have tried to augment the bag-of-words representation—which, after all, is simply a real-valued feature vector—with syntactic features.
The usual similarity measures can then be used on the new feature representation.
For example, Punyakanok et al. (2004) used approximate tree matching and tree-edit-distance to compute a similarity score between the question and answer parse trees.
Similarly, Shen et al. (2005) experimented with dependency tree kernels to compute similarity between parse trees.
Cui et al. (2005) measured sentence similarity based on similarity measures between dependency paths among aligned words.
They used heuristic functions similar to mutual information to
assign scores to matched pairs of dependency links.
Shen and Klakow (2006) extend the idea further through the use of log-linear models to learn a scoring function for relation pairs.
Echihabi and Marcu (2003) presented a noisy-channel approach in which they adapted the IBM model 4 from statistical machine translation (Brown et al., 1990; Brown et al., 1993) and applied it to QA.
Similarly, Murdock and Croft (2005) adopted a simple translation model from IBM model 1 (Brown et al., 1990; Brown et al., 1993) and applied it to QA.
Porting the translation model to QA is not straightforward; it involves parse-tree pruning heuristics (the first two deterministic steps in Echihabi and Marcu, 2003) and also replacing the lexical translation table with a monolingual "dictionary" which simply encodes the identity relation.
This brings us to the question that drives this work: is there a statistical translation-like model that is natural and accurate for question answering?
We propose Smith and Eisner's (2006) quasi-synchronous grammar (Section 3) as a general solution and the Jeopardy model (Section 4) as a specific instance.
3 Quasi-Synchronous Grammar
For a formal description of QG, we recommend Smith and Eisner (2006).
We briefly review the central idea here.
QG arose out of the empirical observation that translated sentences often have some isomorphic syntactic structure, but usually not in its entirety, and the strictness of the isomorphism may vary across words or syntactic rules.
The idea is that, rather than a synchronous structure over the source and target sentences, a tree over the target sentence is modeled by a source-sentence-specific grammar that is inspired by the source sentence's tree.1 This is implemented by a "sense" (really just a subset of nodes in the source tree) attached to each grammar node in the target tree.
The senses define an alignment between the trees.
Because it only loosely links the two sentences' syntactic structures, QG is particularly well-suited for QA insofar as QA is like "free" translation.
1 Smith and Eisner also show how QG formalisms generalize synchronous grammar formalisms.
A concrete example that is easy to understand is a binary quasi-synchronous context-free grammar (denoted QCFG).
Let $V_S$ be the set of constituent tokens in the source tree.
QCFG rules would take the augmented form
$$\langle X, S_1 \rangle \rightarrow \langle Y, S_2 \rangle \, \langle Z, S_3 \rangle$$
$$\langle X, S_1 \rangle \rightarrow w$$
where X, Y, and Z are ordinary CFG nonterminals, each $S_i \in 2^{V_S}$ (subsets of nodes in the source tree to which the nonterminals align), and w is a target-language word.
QG can be made more or less "liberal" by constraining the cardinality of the $S_i$ (we force all $|S_i| = 1$), and by constraining the relationships among the $S_i$ mentioned in a single rule.
These are called permissible "configurations."
An example of a strict configuration is that a target parent-child pair must align (respectively) to a source parent-child pair.
Configurations are shown in Table 1.
Here, following Smith and Eisner (2006), we use a weighted, quasi-synchronous dependency grammar.
Apart from the obvious difference in application task, there are a few important differences with their model.
First, we are not interested in the alignments per se; we will sum them out as a hidden variable when scoring a question-answer pair.
Second, our probability model includes an optional mixture component that permits arbitrary features; we experiment with a small set of WordNet lexical-semantics features (see Section 4.4).
Third, we apply a more discriminative training method (conditional maximum likelihood estimation, Section 4.5).
4 The Jeopardy Model
Our model, informally speaking, aims to follow the process a player of the television game show Jeopardy! might follow.
The player knows the answer (or at least thinks he knows the answer) and must quickly turn it into a question.2 The question-answer pairs used on Jeopardy! are not precisely what we have in mind for the real task (the questions are not specific enough), but the syntactic transformation inspires our model.
In this section we formally define this probability model and present the necessary algorithms for parameter estimation.
2 A round of Jeopardy! involves a somewhat involved and specific "answer" presented to the competitors, and the first competitor to hit a buzzer proposes the "question" that leads to the answer.
For example, an answer might be, This Eastern European capital is famous for defenestrations.
In Jeopardy! the players must respond with a question: What is Prague?
4.1 Probabilistic Model
The Jeopardy model is a QG designed for QA.
Let $q = \langle q_1, \ldots, q_n \rangle$ be a question sentence (each $q_i$ is a word), and let $a = \langle a_1, \ldots, a_m \rangle$ be a candidate answer sentence.
(We will use w to denote an abstract sequence that could be a question or an answer.)
In practice, these sequences may include other information, such as POS, but for clarity we assume just words in the exposition.
Let A be the set of candidate answers under consideration.
Our aim is to choose:
$$\hat{a} = \operatorname*{argmax}_{a \in A} \; p(a \mid q)$$
At a high level, we make three adjustments.
The first is to apply Bayes' rule, $p(a \mid q) \propto p(q \mid a) \cdot p(a)$.
Because A is known and is assumed to be generated by an external extraction system, we could use that extraction system to assign scores (and hence, probabilities p(a)) to the candidate answers.
Other scores could also be used, such as reputability of the document the answer came from, grammaticality, etc. Here, aiming for simplicity, we do not aim to use such information.
Hence we treat p(a) as uniform over A.3
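Spelled out, the first adjustment reduces answer selection to ranking candidates by the likelihood of the question. The short derivation below uses only Bayes' rule and the uniform-prior assumption just stated; $\hat{a}$ denotes the selected answer.

```latex
\begin{align*}
\hat{a} &= \operatorname*{argmax}_{a \in A} p(a \mid q)
         = \operatorname*{argmax}_{a \in A} \frac{p(q \mid a)\, p(a)}{p(q)} \\
        &= \operatorname*{argmax}_{a \in A} p(q \mid a)\, p(a)
        && \text{($p(q)$ does not depend on $a$)} \\
        &= \operatorname*{argmax}_{a \in A} p(q \mid a)
        && \text{($p(a) = 1/|A|$ is constant over $A$)}
\end{align*}
```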
The second adjustment adds a labeled, directed dependency tree to the question and the answer.
The tree is produced by a state-of-the-art dependency parser (McDonald et al., 2005) trained on the Wall Street Journal Penn Treebank (Marcus et al., 1993).
A dependency tree on a sequence $w = \langle w_1, \ldots, w_k \rangle$ is a mapping of indices of words to indices of their syntactic parents and a label for the syntactic relation, $\tau : \{1, \ldots, k\} \rightarrow \{0, \ldots, k\} \times L$.
Each word $w_i$ has a single parent, denoted $w_{\tau(i).par}$.
Cycles are not permitted. $w_0$ is taken to be the invisible "wall" symbol at the left edge of the sentence; it has a single child ($|\{i : \tau(i) = 0\}| = 1$).
The label for $w_i$ is denoted $\tau(i).lab$.
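To make the definition concrete, the sketch below stores a tree as a dictionary from word index to a (parent index, label) pair and checks the constraints just listed. The dictionary encoding, the function name `is_valid_tree`, and the attachments and labels in the examples are assumptions made for illustration.

```python
def is_valid_tree(tree, k):
    """Check that `tree` is a well-formed dependency tree over words 1..k.

    `tree[i]` is a (parent_index, label) pair; index 0 is the wall symbol.
    """
    if set(tree) != set(range(1, k + 1)):
        return False  # every word needs exactly one entry, hence one parent
    if any(not 0 <= parent <= k for parent, _ in tree.values()):
        return False  # a parent must be the wall or another word
    if sum(1 for parent, _ in tree.values() if parent == 0) != 1:
        return False  # the wall has a single child
    for start in tree:
        seen = set()
        node = start
        while node != 0:  # walk upward until the wall is reached
            if node in seen:
                return False  # revisiting a node means there is a cycle
            seen.add(node)
            node = tree[node][0]
    return True


# Toy four-word tree; attachments and labels are chosen only as an example.
example = {1: (2, "subj"), 2: (0, "root"), 3: (4, "det"), 4: (2, "pred")}
# Words 1 and 2 point at each other, which violates the no-cycle constraint.
cyclic = {1: (2, "subj"), 2: (1, "obj"), 3: (0, "root")}
```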
3 The main motivation for modeling p(q | a) is that it is easier to model deletion of information (such as the part of the sentence that answers the question) than insertion.
Our QG does not model the real-world knowledge required to fill in an answer; its job is to know what answers are likely to look like, syntactically.
The third adjustment involves a hidden variable X, the alignment between question and answer words.
In our model, each question-word maps to exactly one answer-word.
Let $x : \{1, \ldots, n\} \rightarrow \{1, \ldots, m\}$ be a mapping from indices of words in q to indices of words in a. (It is for computational reasons that we assume $|x(i)| = 1$; in general x could range over subsets of $\{1, \ldots, m\}$.)
Because we define the correspondence in this direction, note that it is possible for multiple question words to map to the same answer word.
Why do we treat the alignment X as a hidden variable?
In prior work, the alignment is assumed to be known given the sentences, but we aim to discover it from data.
Our guide in this learning is the structure inherent in the QG: the configurations between parent-child pairs in the question and their corresponding, aligned words in the answer.
The hidden variable treatment lets us avoid commitment to any one x mapping, making the method more robust to noisy parses (after all, the parser is not 100% accurate) and any wrong assumptions imposed by the model (that $|x(i)| = 1$, for example, or that syntactic transformations can explain the connection between q and a at all).
Our model, then, defines
$$p(q, \tau_q \mid a, \tau_a) = \sum_{x} p(q, \tau_q, x \mid a, \tau_a)$$
where $\tau_q$ and $\tau_a$ are the question tree and answer tree, respectively.
The stochastic process defined by our model factors cleanly into recursive steps that derive the question from the top down.
The QG defines a grammar for this derivation; the grammar depends on the specific answer.
Let $\tau_w^i$ refer to the subtree of $\tau_w$ rooted at $w_i$.
The model is defined by:
4 If parsing performance is a concern, we might also treat the question and/or answer parse trees as hidden variables, though that makes training and testing more computationally expensive.
Note the recursion in the last line.
While the above may be daunting, in practice it boils down only to defining the conditional distribution $p_{kid}$, since the number of left and right children of each node need not be modeled (the trees are assumed known). $p_{\#kids}$ is included above for completeness, but in the model applied here we do not condition it on $q_i$ and therefore do not need to estimate it (since the trees are fixed).
$p_{kid}$ defines a distribution over syntactic children of $q_i$ and their labels, given (1) the word $q_i$, (2) the parent of $q_i$, (3) the dependency relation between $q_i$ and its parent, (4) the answer-word $q_i$ is aligned to, (5) the answer-word the child being predicted is aligned to, and (6) the remainder of the answer tree.
4.2 Dynamic Programming
Note that k ranges over indices of answer-words to be aligned to $q_j$. The recursive case is
Solving these equations bottom-up can be done in $O(nm^2)$ time and $O(nm)$ space; in practice this is very efficient.
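A minimal sketch of this computation follows, under several assumptions: the question tree is given as a dictionary `children` from a node index to its child indices; `p_kid(i, l, k)` is a hypothetical callable returning the probability of generating question word i aligned to answer word k when its parent is aligned to answer word l; the number-of-children factor is dropped, as discussed above; and alignments range over 1..m, with 0 reserved for the wall. Memoized recursion stands in for an explicit bottom-up table, and the brute-force function is included only to check the result.

```python
import math
from functools import lru_cache
from itertools import product


def alignment_sum(children, root, m, p_kid):
    """Sum over all alignments of the product of p_kid factors, in O(n m^2) time."""

    @lru_cache(maxsize=None)
    def inside(i, l):
        # Score of the subtree rooted at question word i, whose parent is aligned to l.
        total = 0.0
        for k in range(1, m + 1):  # candidate answer word for question word i
            term = p_kid(i, l, k)
            for j in children.get(i, ()):
                term *= inside(j, k)
            total += term
        return total

    return inside(root, 0)  # the root's parent is the wall, index 0


def brute_force(children, root, n, m, p_kid):
    """Same quantity by enumerating all m**n alignments (for checking only)."""
    parent = {j: i for i, kids in children.items() for j in kids}
    parent[root] = 0
    total = 0.0
    for x in product(range(1, m + 1), repeat=n):
        align = dict(enumerate(x, start=1))
        align[0] = 0
        total += math.prod(
            p_kid(i, align[parent[i]], align[i]) for i in range(1, n + 1)
        )
    return total


def toy_p_kid(i, l, k):
    """Arbitrary positive scores standing in for a trained model."""
    return 1.0 / (1 + i + abs(l - k))


# Toy question tree over 4 words: word 2 is the root; 1 and 4 attach to 2; 3 attaches to 4.
toy_children = {2: [1, 4], 4: [3]}
dp_value = alignment_sum(toy_children, root=2, m=3, p_kid=toy_p_kid)
check_value = brute_force(toy_children, root=2, n=4, m=3, p_kid=toy_p_kid)
```

The memo table has one entry per (question word, parent alignment) pair, and each entry loops over m candidate alignments, which is where the $O(nm)$ space and $O(nm^2)$ time come from.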
In our experiments, computing the value of a question-answer pair took two seconds on average.5 We turn next to the details of $p_{kid}$, the core of the model.
Our base model factors $p_{kid}$ into three conditional multinomial distributions.
where $q_i.pos$ is question-word i's POS label and $q_i.ne$ is its named-entity label. config maps question-word i, its parent, and their alignees to a QG configuration as described in Table 1; note that some configurations are extended with additional tree information.
The base model does not directly predict the specific words in the question, only their parts-of-speech, named-entity labels, and dependency relation labels.
This model is very similar to Smith and Eisner (2006).
Because we are interested in augmenting the QG with additional lexical-semantic knowledge, we also estimate pkid by mixing the base model with a model that exploits WordNet (Miller et al., 1990) lexical-semantic relations.
The mixture is given by:
4.4 Lexical-Semantics Log-Linear Model
The lexical-semantics model $p_{kid}^{ls}$ is defined by predicting a (nonempty) subset of the thirteen classes for the question-side word given the identity of its aligned answer-side word.
These classes include WordNet relations: identical-word, synonym, antonym (also extended and indirect antonym), hypernym, hyponym, derived form, morphological variation (e.g., plural form), verb group, entailment, entailed-by, see-also, and causal relation.
In addition, to capture the special importance of Wh-words in questions, we add a special semantic relation called "q-word" between any word and any Wh-word.
This is done through a log-linear model with one feature per relation.
5 Experiments were run on a 64-bit machine with 2 x 2.2GHz dual-core CPUs and 4GB of memory.
Multiple relations may fire, motivating the log-linear model, which permits "overlapping" features, and, therefore, prediction of any of the possible $2^{13} - 1$ nonempty subsets.
It is important to note that this model assigns zero probability to alignment of an answer-word with any question-word that is not directly related to it through any relation.
Such words may be linked in the mixture model, however, via $p_{kid}^{base}$.
(It is worth pointing out that log-linear models provide great flexibility in defining new features.
It is straightforward to extend the feature set to include more domain-specific knowledge or other kinds of morphological, syntactic, or semantic information.
Indeed, we explored some additional syntactic features, fleshing out the configurations in Table 1 in more detail, but did not see any interesting improvements.)
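As a concrete illustration, the sketch below implements one plausible reading of this section: a set of fired relations receives probability proportional to the exponential of the summed weights of its members, normalized over all nonempty subsets, and the result is mixed with a base probability. The exact parameterization is an assumption, as are the function names, the toy weights, and the convention that the mixture coefficient multiplies the base model.

```python
import math
from itertools import combinations


def subset_probability(fired, weights):
    """Probability of the relation set `fired` under a log-linear model
    normalized over all nonempty subsets of the relations in `weights`."""
    if not fired:
        return 0.0  # unrelated word pairs get zero probability from this component
    score = math.exp(sum(weights[r] for r in fired))
    # Closed form for the sum of exp-scores over all nonempty subsets.
    normalizer = math.prod(1.0 + math.exp(w) for w in weights.values()) - 1.0
    return score / normalizer


def mixture(p_base, p_ls, alpha):
    """Mix the base and lexical-semantics probabilities with coefficient alpha."""
    return alpha * p_base + (1.0 - alpha) * p_ls


# Toy weights for three of the thirteen relations (the values are made up).
toy_weights = {"identical-word": 1.0, "synonym": 0.5, "hypernym": -0.2}
all_subsets = [
    set(c)
    for size in range(1, len(toy_weights) + 1)
    for c in combinations(toy_weights, size)
]
total_mass = sum(subset_probability(s, toy_weights) for s in all_subsets)
```

With a nonzero weight on the base model, a word pair with no WordNet relation still receives positive probability from the mixture, even though the lexical-semantics component alone assigns it zero.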
parent-child: Question parent-child pair align respectively to answer parent-child pair.
child-parent: Question parent-child pair align respectively to answer child-parent pair.
grandparent-child: Question parent-child pair align respectively to answer grandparent-child pair. Augmented with the q.-side dependency label.
same node: Question parent-child pair align to the same answer-word.
siblings: Question parent-child pair align to siblings in the answer. Augmented with the tree-distance between the a.-side siblings.
c-command: The parent of one answer-side word is an ancestor of the other answer-side word.
other: A catch-all for all other types of configurations, which are permitted.
Table 1: Syntactic alignment configurations are partitioned into these sets for prediction under the Jeopardy model.
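The configurations in Table 1 can be computed directly from the answer tree. The sketch below assumes the answer tree is a list `parent` indexed by word position (0 is the wall, whose parent is `None`), that `l` and `k` are the answer words aligned to the question parent and child, and that overlapping cases are resolved in the order tested. That priority order, and the omission of the label and tree-distance augmentations, are simplifications made for this example.

```python
def ancestors(parent, i):
    """Return the proper ancestors of node i, nearest first."""
    chain = []
    while parent[i] is not None:
        i = parent[i]
        chain.append(i)
    return chain


def config(parent, l, k):
    """Classify how answer words l (aligned to the question parent) and
    k (aligned to the question child) relate in the answer tree."""
    if l == k:
        return "same node"
    if parent[k] == l:
        return "parent-child"
    if parent[l] == k:
        return "child-parent"
    if parent[k] is not None and parent[parent[k]] == l:
        return "grandparent-child"
    if parent[k] is not None and parent[k] == parent[l]:
        return "siblings"
    if parent[l] in ancestors(parent, k) or parent[k] in ancestors(parent, l):
        return "c-command"
    return "other"


# Toy answer tree: 1 hangs off the wall; 2 and 3 under 1; 4 under 2; 5 under 3;
# 6 under 4; 7 under 5.
toy_parent = [None, 0, 1, 1, 2, 3, 4, 5]
```

For instance, `config(toy_parent, 2, 3)` falls in the siblings class, while `config(toy_parent, 6, 7)` matches none of the named classes and falls through to the catch-all.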
4.5 Parameter Estimation
The parameters to be estimated for the Jeopardy model boil down to the conditional multinomial distributions in $p_{kid}^{base}$,6 the log-linear weights inside of $p_{kid}^{ls}$, and the mixture coefficient $\alpha$.7 Standard applications of log-linear models apply conditional maximum likelihood estimation, which for our case involves using an empirical distribution $\tilde{p}$ over question-answer pairs (and their trees) to optimize as follows:
Note the hidden variable x being summed out; that makes the optimization problem non-convex.
This sort of problem can be solved in principle by conditional variants of the Expectation-Maximization algorithm (Baum et al., 1970; Dempster et al., 1977; Meng and Rubin, 1993; Jebara and Pentland, 1999).
We use a quasi-Newton method known as L-BFGS (Liu and Nocedal, 1989) that makes use of the gradient of the above function (straightforward to compute, but omitted for space).
6 It is to preserve that robustness property that the models are mixed, and not combined some other way.
7 In our experiments, all log-linear weights are initialized to be 1; all multinomial distributions are initialized as uniform distributions; $\alpha$ is initialized to be 0.1.
5 Experiments
To evaluate our model, we conducted experiments using the Text REtrieval Conference (TREC) 8-13 QA dataset.8
8 We thank the organizers and NIST for making the dataset publicly available.
5.1 Experimental Setup
The TREC dataset contains questions and answer patterns, as well as a pool of documents returned by participating teams.
Our task is the same as that of Punyakanok et al. (2004) and Cui et al. (2005), where we search for single-sentence answers to factoid questions.
We follow a similar setup to Shen and Klakow (2006) by automatically selecting answer candidate sentences and then comparing against a human-judged gold standard.
We used the questions in TREC 8-12 for training and set aside TREC 13 questions for development (84 questions) and testing (100 questions).
To generate the candidate answer set for development and testing, we automatically selected sentences from each question's document pool that contain one or more non-stopwords from the question.
For generating the training candidate set, in addition to the sentences that contain non-stopwords from the question, we also added sentences that contain the correct answer pattern.
Manual judgement was produced for the entire TREC 13 set, and also for the first 100 questions from the training set TREC 8-12.9 On average, each question in the development set has 3.1 positive and 17.1 negative answers.
There are 3.6 positive and 20.0 negative answers per question in the test set.
We tokenized sentences using the standard treebank tokenization script, and then we performed part-of-speech tagging using the MXPOST tagger (Ratnaparkhi, 1996).
The resulting POS-tagged sentences were then parsed using MSTParser (McDonald et al., 2005), trained on the entire Penn Treebank to produce labeled dependency parse trees (we used a coarse dependency label set that includes twelve label types).
We used BBN Identifinder (Bikel et al., 1999) for named-entity tagging.
As answers in our task are considered to be single sentences, our evaluation differs slightly from TREC, where an answer string (a word or phrase like 1977 or George Bush) has to be accompanied by a supporting document ID.
As discussed by Punyakanok et al. (2004), the single-sentence assumption does not simplify the task, since the hardest part of answer finding is to locate the correct sentence.
From an end-user's point of view, presenting the sentence that contains the answer is often more informative and evidential.
Furthermore, although the judgement data in our case are more labor-intensive to obtain, we believe our evaluation method is a better indicator than the TREC evaluation for the quality of an answer selection algorithm.
9 More human-judged data are desirable, though we will address training from noisy, automatically judged data in Section 5.4.
It is important to note that human judgement of answer sentence correctness was carried out prior to any experiments, and therefore is unbiased.
The total number of questions in TREC 13 is 230.
We exclude from the TREC 13 set questions that either have no correct answer candidates (27 questions), or no incorrect answer candidates (19 questions).
Any algorithm will get the same performance on these questions, and they therefore obscure the evaluation results.
Six such questions were also excluded from the 100 manually-judged training questions, resulting in 94 questions for training.
For computational reasons (the cost of parsing), we also eliminated answer candidate sentences that are longer than 40 words from the training and evaluation set.
After these data preparation steps, we have 348 positive Q-A pairs for training, 1,415 Q-A pairs in the development set, and 1,703 Q-A pairs in the test set.
To illustrate the point, consider the example question, "When did James Dean die?"
The correct answer as it appears in the sentence "In 1955, actor James Dean was killed in a two-car collision near Cholame, Calif." is 1955.
But from the same document, there is another sentence which also contains 1955: "In 1955, the studio asked him to become a technical adviser on Elia Kazan's 'East of Eden,' starring James Dean."
If a system missed the first sentence but happened to have extracted 1955 from the second one, the TREC evaluation grants it a "correct and well-supported" point, since the document ID matches the correct document ID—even though the latter answer does not entail the true answer.
Our evaluation does not suffer from this problem.
We report two standard evaluation measures commonly used in IR and QA research: mean average precision (MAP) and mean reciprocal rank (MRR).
All results are produced using the standard trec_eval program.
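For readers unfamiliar with the two measures, the sketch below computes them from per-question ranked lists of binary correctness judgements. It is an illustration rather than a reimplementation of the evaluation program: the function names are assumptions, and average precision is normalized by the number of correct answers that appear in the ranked list, which matches the usual definition only when every correct candidate is ranked.

```python
def average_precision(ranked):
    """Mean of precision@rank taken at each rank holding a correct answer."""
    hits = 0
    precision_sum = 0.0
    for rank, is_correct in enumerate(ranked, start=1):
        if is_correct:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def reciprocal_rank(ranked):
    """1 / rank of the first correct answer, or 0 if there is none."""
    for rank, is_correct in enumerate(ranked, start=1):
        if is_correct:
            return 1.0 / rank
    return 0.0


def mean_over_questions(metric, rankings):
    """Average a per-question metric over all questions."""
    return sum(metric(r) for r in rankings) / len(rankings)


# Two toy questions; True marks a correct candidate sentence at that rank.
rankings = [[False, True, False, True], [True, False, True]]
map_score = mean_over_questions(average_precision, rankings)
mrr_score = mean_over_questions(reciprocal_rank, rankings)
```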
We implemented two state-of-the-art answer-finding algorithms (Cui et al., 2005; Punyakanok et al., 2004) as strong baselines for comparison.
Cui et al. (2005) is the answer-finding algorithm behind one of the best performing systems in TREC evaluations.
It uses a mutual information-inspired score computed over dependency trees and a single alignment between them.
We found the method to be brittle, often not finding a score for a testing instance because alignment was not possible.
We extended the original algorithm, allowing fuzzy word alignments through WordNet expansion; both results are reported.
The second baseline is the approximate tree-matching work by Punyakanok et al. (2004).
Their algorithm measures the similarity between Tq and Ta by computing tree edit distance.
Our replication is close to the algorithm they describe, with one subtle difference.
Punyakanok et al. used answer-typing in computing edit distance; this is not available in our dataset (and our method does not explicitly carry out answer-typing).
Their heuristics for reformulating questions into statements were not replicated.
We did, however, apply WordNet type-checking and approximate, penalized lexical matching.
Both results are reported.
[Table 2 body not recoverable. Row labels: TreeMatch, Cui et al., Jeopardy (base only), Jeopardy. Surviving header labels: development set, training dataset.]
Table 2: Results on development and test sets.
TreeMatch is our implementation of Punyakanok et al. (2004); +WN modifies their edit distance function using WordNet.
We also report our implementation of Cui et al. (2005), along with our WordNet expansion (+WN).
The Jeopardy base model and mixture with the lexical-semantics log-linear model perform best; both are trained using conditional maximum likelihood estimation.
The top part of the table shows performance using 100 manually-annotated question examples (questions 1-100 in TREC 8-12), and the bottom part adds noisily, automatically annotated questions 101-2,393.
Evaluation results on the development and test sets of our model in comparison with the baseline algorithms are shown in Table 2.
Both our model and the model in Cui et al. (2005) are trained on the manually-judged training set (questions 1-100 from TREC 8-12).
The approximate tree matching algorithm in Punyakanok et al. (2004) uses fixed edit distance functions and therefore does not require training.
From the table we can see that our model significantly outperforms the two baseline algorithms, even when they are given the benefit of WordNet, on both the development and test sets, and on both MRR and MAP.
5.4 Experiments with Noisy Training Data
Although manual annotation of the remaining 2,293 training sentences' answers in TREC 8-12 was too labor-intensive, we did experiment with a simple, noisy automatic labeling technique.
Any answer that had at least three non-stop word types seen in the question and contained the answer pattern defined in the dataset was labeled as "correct" and used in training.
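The labeling rule can be sketched as follows. The stopword list, the tokenizer, and the example answer pattern are placeholders chosen for illustration; which stopword list was actually used is not specified here, and the answer pattern is treated as a regular expression.

```python
import re

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "is", "was", "did",
             "who", "when", "where"}


def word_types(text):
    """Set of lowercase alphanumeric word types in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def noisy_label(question, sentence, answer_pattern, min_overlap=3):
    """Label a sentence "correct" if it shares at least `min_overlap`
    non-stopword types with the question and matches the answer pattern."""
    shared = (word_types(question) - STOPWORDS) & word_types(sentence)
    has_pattern = re.search(answer_pattern, sentence) is not None
    return len(shared) >= min_overlap and has_pattern


question = "When did James Dean die?"
sentence = ("In 1955, actor James Dean was killed in a two-car collision "
            "near Cholame, Calif.")
strict = noisy_label(question, sentence, r"1955")                 # threshold of three
relaxed = noisy_label(question, sentence, r"1955", min_overlap=2)  # threshold of two
```

With this toy stopword list, the James Dean sentence from Section 5.1 shares only "james" and "dean" with the question, so it is labeled correct under a threshold of two but not under the threshold of three.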
The bottom part of Table 2 shows the results.
Adding the noisy data hurts all methods, but
Unlike most previous work, our model does not try to find a single correspondence between words in the question and words in the answer, during training or during testing.
An alternative method might choose the best (most probable) alignment, rather than the sum of all alignment scores.
This involves a slight change to Equation 3, replacing the summation with a maximization.
The change could be made during training, during testing, or both.
Table 3 shows that summing is preferable, especially during training.
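The two options differ only in how the scores of alternative alignments for a word are combined. The sketch below parameterizes a small recursive scorer by that combining function; `p_kid`, the tree encoding, and the toy scores are hypothetical and stand in for the trained model, and memoization is omitted for brevity.

```python
def tree_score(children, root, m, p_kid, combine):
    """Score a question tree, combining alternative alignments with `combine`
    (sum to marginalize over alignments, max to keep only the best one)."""

    def subtree(i, l):
        options = []
        for k in range(1, m + 1):  # candidate answer word for question word i
            value = p_kid(i, l, k)
            for j in children.get(i, ()):
                value *= subtree(j, k)
            options.append(value)
        return combine(options)

    return subtree(root, 0)  # the root's parent is the wall, index 0


def toy_p_kid(i, l, k):
    """Arbitrary positive scores standing in for a trained model."""
    return 1.0 / (1 + i + abs(l - k))


# Toy question tree over 4 words: word 2 is the root; 1 and 4 attach to 2; 3 attaches to 4.
toy_children = {2: [1, 4], 4: [3]}
summed = tree_score(toy_children, 2, 3, toy_p_kid, sum)
best = tree_score(toy_children, 2, 3, toy_p_kid, max)
```

Because all factors are nonnegative, pushing `max` inside the product is valid, and the best-alignment score can never exceed the summed score.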
6 Discussion
The key experimental result of this work is that loose syntactic transformations are an effective way to carry out statistical question answering.
One unique advantage of our model is the mixture of a factored, multinomial-based base model and a potentially very rich log-linear model.
The base model gives our model robustness, and the log-linear model allows us to throw in task- or domain-specific features.
[Table 3 body not recoverable. Surviving labels: test set, training, decoding.]
Table 3: Experimental results comparing summing over alignments (Σ) with maximizing (max) over alignments on the test set.
Boldface marks the best score in a column and any scores in that column not significantly worse under a two-tailed paired t-test (p < 0.03).
Using a mixture gives the advantage of smoothing (in the base model) without having to normalize the log-linear model by summing over large sets.
This powerful combination leads us to believe that our model can be easily ported to other semantic processing tasks where modeling syntactic and semantic transformations is the key, such as textual entailment, paraphrasing, and cross-lingual QA.
The traditional approach to cross-lingual QA is that translation is either a pre-processing or postprocessing step done independently from the main QA task.
Notice that the QG formalism that we have employed in this work was originally proposed for machine translation.
We might envision transformations that are performed together to form questions from answers (or vice versa) and to translate: a Jeopardy! game in which bilingual players must ask a question in a different language than that in which the answer is posed.
7 Conclusion
We described a statistical syntax-based model that softly aligns a question sentence with a candidate answer sentence and returns a score.
Discriminative training and a relatively straightforward, barely-engineered feature set were used in the implementation.
Our scoring model was found to greatly outperform two state-of-the-art baselines on an answer selection task using the TREC dataset.
Acknowledgments
The authors acknowledge helpful input from three anonymous reviewers, Kevin Gimpel, and David Smith.
This work is supported in part by ARDA/DTO Advanced Question Answering for Intelligence (AQUAINT) program award number NBCHC040164.
