\documentclass[12pt]{article}
\usepackage{amsfonts,amsmath,amssymb,amsthm}
\usepackage{xspace}
\usepackage{aliascnt}
\usepackage{geometry}
\usepackage{tikz}
\usepackage{lipsum}
\usepackage[colorlinks=true, urlcolor=blue, linkcolor=black]{hyperref}
\usepackage{cleveref}
\geometry{a4paper}

\newcommand\ourbenchmark{\textsc{QAMParI}}


\title{\large \textbf{Response Letter to: "\emph{\ourbenchmark{}: A Benchmark for Open-domain Questions with Many Answers}"}}
\date{}

\begin{document}
    \maketitle
    December 14th, 2022 \\
    Dear reviewers and AC, thank you all for your reviews of our manuscript entitled \textbf{"\emph{\ourbenchmark{}: A Benchmark for Open-domain Questions with Many Answers}" (Paper ID: Paper286)} dated the 15th of October. We have carefully addressed your concerns in the attached revised version of the manuscript. The revisions are detailed below: for convenience, each comment is followed by a description of the associated changes to the manuscript. \\ \\
    Please note that references to the manuscript lines, sections and tables concern the revised paper. \\ \\
    Our main additions to the manuscript are as follows:
    \begin{itemize}
        \item We have finetuned DPR on \ourbenchmark{} and added those results
        \item We added a multi-hop question decomposition to our experimental setup
        \item We expanded our experimental setup to include an Oracle setup with a perfect retriever and the \emph{Passage Independent Generator} trained on NQ only in a zero-shot setup on \ourbenchmark{}
        \item We added a discussion about the naturalness of \ourbenchmark{} questions
    \end{itemize}
    \newpage
    \large \textbf{Area Chair: k69m} \\ \\
    \normalsize
    \textbf{Comment AC.1} \emph{The question distribution could be artificial, i.e., irrelevant to the general distribution of queries in ODQA. For example, as the authors commented, only 1.63\% of questions in MSMARCO have more than 1 answer. The dataset though can be useful to measure this specific distribution.} \\
    \textbf{Answer to AC.1} We added a discussion of the question distribution of ODQA datasets vs.\ \ourbenchmark{} in the introduction (4th paragraph [JB: ADD LINES]). The fact that only 1.63\% of questions in MSMARCO have more than one answer is possibly because users (at the time the questions were asked) knew they were unlikely to get answers to multi-answer questions. It is hard to say what fraction of questions would be multi-answer if users believed any question were answerable, and we argue that many important and interesting multi-answer questions do in fact arise naturally. \\ \\
    \textbf{Comment AC.2} \emph{The reviewers suggest improving the paper with more benchmarks, including DPR trained on QAMPARI, multi-hop QA, etc} \\
    \textbf{Answer to AC.2} We followed the AC and the reviewers' suggestions and we added the following baselines:
    \begin{itemize}
        \item DPR pretrained on NQ and finetuned on \ourbenchmark{} (discussed in Section 4.1 paragraph titled \textbf{Reader}, results in Section 4.3, Table 3, Table 4)
        \item Question decomposition for multi-hop QA (discussed in Section 4.1 last paragraph, results in Section 4.3, results in Table 3)
        \item \emph{Passage Independent Generator} (PIG) trained on NQ only and tested on \ourbenchmark{} (discussed in Section 4.3, results in Table 3)
        \item An \emph{Oracle} setup in which we assumed a perfect retriever and gave all the gold contexts to our readers (discussed in Section 4.3 paragraph titled \emph{Oracle results}, results in Appendix E Table 9)
    \end{itemize}
    \newpage
    \large \textbf{Reviewer 1: vusV} \\ \\
    \normalsize
    \textbf{Comment R1.1} \emph{Since majority of the dataset is automatically generated using templates, it has some concerning similarity with some existing semantic parsing datasets (e.g., WikiTableQuestions, ComplexWebQuestions, SPIDER). The paper lacks references to these datasets, and discussion of the differences with them. Is QAMPARI yet another semantic parsing dataset (filtered according to number of answers) that is framed as an ODQA benchmark?} \\

    \textbf{Answer to R1.1} As we said in our previous author response, we argue that the domain over which questions are asked is what makes an ODQA dataset, not the way the questions were generated. \\

    In WikiTableQuestions and SPIDER, answers are from a single table or a small database. ComplexWebQuestions does not contain lists of answers at all and does not map its answers to relevant Wikipedia paragraphs. \\

    QAMPARI is the only benchmark where questions require many answers and the domain is all of Wikipedia's text (without tables), and we developed a process for verifying that answers do in fact exist on Wikipedia. Thus, we believe this dataset introduces a new ODQA task. \\ \\

    \textbf{Comment R1.2} \emph{In the experiments, you use off-the-shelf DPR retriever trained on NQ. It is not surprising that it performs poorly on QAMPARI, given the differences between the two datasets. Have you try finetuning DPR with QAMPARI? Without finetuning on QAMPARI, it is not a best-effort implementation of ODQA system on QAMPARI, and hence cannot support your claim well.} \\

    \textbf{Answer to R1.2} As suggested, we finetuned DPR on \ourbenchmark{} and added these results to our revised paper (discussed in Section 4.1 paragraph titled \textbf{Reader}, results in Section 4.3, Table 3, Table 4). While finetuned DPR significantly outperforms the off-the-shelf DPR-NQ, its results are still lower than those of BM25. \\ \\

    \textbf{Comment R1.3} \emph{Despite questions with many ($>$5) answers are interesting research-wise, the problem setting seems a bit artificial. How likely will such question occur in real user queries? When encounter such questions, is it a good idea for the system to produce a long list of answer, or should it ask for clarification?} \\

    \textbf{Answer to R1.3} We added a discussion of the likelihood and value of \ourbenchmark{}-style questions in the introduction (4th paragraph). Regarding the last question, whether it is better to produce a long list of answers or to ask for clarification: some questions (such as the examples provided in the introduction) do not call for a clarification but rather for a list of answers. \\ \\

    \textbf{Comment R1.4} \emph{Regarding the T5 generator training: how is the list of answers verbalised as output sequence? Does the order of the answers matter?} \\

    \textbf{Answer to R1.4} We added these details in Appendix C. \\ \\

    \textbf{Comment R1.5} \emph{Missing templates for complex questions: could you provide the templates for intersection and composition questions?} \\

    \textbf{Answer to R1.5} In the previous author response, we mentioned that the template for complex questions requires some adaptations, but it generally goes as follows.
    Composition: \emph{What is the $\langle$comp\_property$\rangle$ of $\langle$subtype$\rangle$ who/which $\langle$base\_property$\rangle$?} All the templates are in the provided code base. \\ \\

    \large \textbf{Reviewer 2: LPHn} \\ \\
    \normalsize
    \textbf{Comment R2.1} \emph{No benchmarking of multi-hop QA systems, a setting that is closer to QAMPARI, since it involves reasoning over multiple (relevant) paragraphs. Furthermore, decompositions were used to come up with QAMPARI questions, which would make it interesting to see how multi-hop QA models would perform. I believe this is a critical missing piece in the evaluation of this benchmark} \\

    \textbf{Answer to R2.1} As suggested, we tested a multi-hop QA system on \ourbenchmark{} with question decomposition (discussed in Section 4.1, results in Section 4.3, Table 3 and Table 9). Overall, these systems' performance is lower than that of the basic ODQA system, which is not surprising since these systems perform better on \ourbenchmark{}'s complex questions than on its simple ones (discussed in Section 4.3 paragraph titled \emph{Question type analysis}). \\ \\

    \textbf{Comment R2.2} \emph{Missing citations: MuSiQue, TeaBReaC, select then answer model family} \\

    \textbf{Answer to R2.2} As suggested, we added the citations in the relevant locations (Intro, Section 4.1, Section 5). \\ \\

    \large \textbf{Reviewer 3: c7Ui} \\ \\
    \normalsize
    \textbf{Comment R3.1} \emph{Although the paper discusses about various types of relation in the questions, the type of questions in the dataset is limited to what/who type questions. More question types similar to Figure 2 in HOTPOTQA would add value to the dataset.} \\

    \textbf{Answer to R3.1} As suggested, we added statistics regarding the phrasing of our questions in Appendix D. \\ \\

    \textbf{Comment R3.2} \emph{Although the paper demonstrates that the dataset is challenging for current SOTA models, the use-cases in the real-world application is missing in the paper and would further strengthen the paper.} \\
    \textbf{Answer to R3.2} We added a discussion of real-world use cases in Section 1, paragraph 4. \\ \\

    \textbf{Comment R3.3} \emph{(Section 2.3) Since the list of answers from wikipedia are extracted from the tables with title ‘List of X’, Comparison with QA methods (such as Question answering using Web Lists, Katti et al 2021) that can answer the question using web lists would be helpful.} \\

    \textbf{Answer to R3.3} As we mentioned in our previous response, the core focus of \ourbenchmark{} is answering multi-answer questions where the answers are in text paragraphs. We believe this setting mimics many real-world scenarios such as open IE over text. While TableQA systems over structured data (tables, HTML) might help solve \ourbenchmark{}, its goal is to evaluate models that reason over unstructured natural language text. \\ \\

    \textbf{Comment R3.4} \emph{The reason to paraphrase only 3000 out of 61911 questions in the training set is not clear} \\

    \textbf{Answer to R3.4} As we mentioned in our previous author response, the reason is that paraphrasing all 61,911 training questions would be too costly. \\ \\

    \textbf{Comment R3.5} \emph{Table 3: zero-shot evaluation with the models trained on NQ and evaluated on proposed dataset would be helpful to further demonstrate the impact of the dataset.} \\

    \textbf{Answer R3.5} As suggested, we added results of the \emph{Passage Independent Generator} (PIG) trained on NQ and evaluated on \ourbenchmark{} (results in Table 3, discussed in section 4.3). We did not add the same results for FiD trained on NQ since this model would only output one answer given the contexts (which is not insightful for \ourbenchmark{}). \\ \\

    \large \textbf{Reviewer 4: Rk2Y} \\ \\
    \normalsize

    \textbf{Comment R4.1} \emph{The novelty is limited, the characteristics of QAMPARI has been included in many existing datasets, such as MSMARCO and WDRASS especially WDRASS, which not only contain answers from multiple paragraphs within same document but also contain the answers from different documents, In addition, it also consider about the non-factoid questions.} \\

    \textbf{Answer to R4.1} As we mentioned in our previous author response, only 1.63\% of MSMARCO questions have more than one answer. In Section 1, we added a discussion of the reasons for the rarity of \ourbenchmark{}-style questions in real user queries. \\
    Regarding WDRASS, its data is still not publicly available and we therefore do not have much to say about it. \\ \\

    \textbf{Comment R4.2} \emph{The domain of this dataset is limited, the questions and answers of QAMPARI are only from wikipedia, which can result in bias and be hard to generalize} \\

    \textbf{Answer to R4.2} As we said in our previous response, we agree that extending ODQA benchmarks to domains other than Wikipedia is important. However, this is orthogonal to our work, which focuses on developing and evaluating ODQA models on questions with multiple answers. Furthermore, Wikipedia is the standard corpus in ODQA research, as evidenced by past work (“Natural Questions: A Benchmark for Question Answering Research” (Kwiatkowski et al., 2019); “TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension” (Joshi et al., 2017); “HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering” (Yang et al., 2018)). \\ \\

    \textbf{Comment R4.3} \emph{The experiments have done by this paper are unconvincing. For example, when the authors evaluate the passage retrieval performance, they didn't train the model on their train data. So it can't demonstrate that their dataset QAMPARI is n valuable resource for training a better passage retriever, which is an important part in ODQA research} \\

    \textbf{Answer to R4.3} As suggested, we finetuned DPR on \ourbenchmark{} and added these results to our revised paper (discussed in Section 4.1 paragraph titled \textbf{Reader}, results in Section 4.3, Table 3, Table 4). While finetuned DPR significantly outperforms the off-the-shelf DPR-NQ, its results are still lower than those of BM25. Therefore, \ourbenchmark{} can be a valuable resource for training better passage retrievers. \\ \\

    \large \textbf{Reviewer 5:} \\ \\
    \normalsize

    \textbf{Comment R5.1} \emph{My main concern is the omission of TREC QA tracks in the early 2000s that had factoid questions with list answers. Because of their relevance to the present work, TREC QA tracks are worth mentioning in the paper with an explanation of their differences with QAMParI.} \\

    \textbf{Answer R5.1} As suggested, we added these references in the related work section (Section 5). \\ \\

    \textbf{Comment R5.2} \emph{QA datasets typically account for various forms of correct answers. While QAMParI provides a list of answers, each answer may appear in various forms —e.g., Robert vs. Bob. By skimming over the data, it seems QAMParI does not provide alternatives for each answer. This can be a prevalent problem in evaluation.} \\

    \textbf{Answer to R5.2} As we mentioned in our previous response, all aliases of a given gold entity provided by Wikidata are used as additional correct answers. \\ \\

    \textbf{Comment R5.3} \emph{Also, for a single-answer QA setup, F1 is often reported along with exact-match accuracy as a remedy to the previous issue I pointed out. For a list of answers as in QAMParI, it should not be difficult to combine F1 for alternatives of each answer. } \\

    \textbf{Answer to R5.3} As we mentioned in our previous response, we believe in testing our models with strict metrics, and exact match is stricter than F1 over strings. To avoid cluttering the paper with many metrics, we did not include it in our revised version, but we would be glad to provide these results to the reviewer if they deem it necessary.

\end{document}