StrategyQA Dataset

Ceumar Pee

Jul 24, 2024, 11:44:09 AM

to subtlathera

The StrategyQA dataset was created through a crowdsourcing pipeline for eliciting creative and diverse yes/no questions that require implicit reasoning steps. To solve questions in StrategyQA, the reasoning steps should be inferred using a strategy. To guide and evaluate the question answering process, each example in StrategyQA was annotated with a decomposition into reasoning steps for answering it, and Wikipedia paragraphs that provide evidence for the answer to each step.

DOWNLOAD ===> https://shurll.com/2zL0dT

The file strategyqa_train_filtered.json does not include annotations of facts, decomposition, and evidence, and the public test examples in strategyqa_test.json include only the fields qid and question.
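One way to confirm which fields a given file actually carries is to load it and list the keys. The sketch below assumes each file is a JSON array of example objects, which the post does not state; `load_examples` and `field_names` are hypothetical helper names.

```python
import json


def load_examples(path):
    """Load a StrategyQA JSON file (assumed to be a JSON array of objects)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def field_names(examples):
    """Return the set of field names that appear in any example."""
    names = set()
    for example in examples:
        names.update(example.keys())
    return names
```

If the description above holds, `field_names(load_examples("strategyqa_test.json"))` should contain only `qid` and `question`.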

Each line in the corpus file corpus-enwiki-20200511-cirrussearch-parasv2.jsonl.gz contains a paragraph with similar metadata fields to strategyqa_train_paragraphs.json. There are several additional metadata fields for indexing the paragraphs in ElasticSearch. The script for creating an ElasticSearch index will be provided soon.
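Since the corpus is a gzipped file with one paragraph per line, it can be streamed rather than loaded into memory at once. This is a minimal sketch assuming each line is a standalone JSON object; the helper names are hypothetical and the field names inside each paragraph are not assumed here.

```python
import gzip
import json


def parse_jsonl(lines):
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_paragraphs(path):
    """Stream paragraphs from a gzipped JSON Lines file, one at a time."""
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        yield from parse_jsonl(f)
```

For example, `next(iter_paragraphs("corpus-enwiki-20200511-cirrussearch-parasv2.jsonl.gz"))` would return the first paragraph as a dict, which is a quick way to inspect the available metadata fields.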

We experiment and provide results on five difficult question-answering datasets: StrategyQA, QuaRel, OpenBookQA, NumerSense and QASC. The table below gives examples from each dataset. We train all our models in an I-RO format, wherein the input to the LM is the question, and the output is the joint generation of the rationale and the predicted answer.
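To make the I-RO format concrete, here is a rough sketch of how one training pair could be built. The function name and the separator string between rationale and answer are assumptions for illustration; the text above only says that the input is the question and the output is the rationale together with the predicted answer.

```python
def to_i_ro(question, rationale, answer, sep=" So the answer is "):
    """Build an (input, output) pair for I-RO training.

    The input is the question alone; the output is the rationale followed
    by the answer. The separator is a hypothetical choice.
    """
    source = question.strip()
    target = f"{rationale.strip()}{sep}{answer.strip()}"
    return source, target
```

With this sketch, `to_i_ro("Can cows fly?", "Cows have no wings.", "no")` gives the question as the input and a single output string in which the rationale precedes the answer.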

To determine whether the generated rationales are of good quality, we focus on three properties that any rationale should have, regardless of the task it is meant for.

First, a rationale should be plausible. We call a rationale plausible if it makes sense on its own, whether in terms of common, logical, or factual sense, depending on the dataset at hand. For example, a rationale that states 'Cows can fly' is not plausible.

Next, a rationale should be diverse, meaning that it is clean and not repetitive.
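As an illustration of what "not repetitive" could mean in practice, the sketch below computes the fraction of distinct n-grams in a rationale. This is a common stand-in for repetitiveness and is an assumption here, not necessarily the diversity measure used in the work described; tokenization by whitespace is also a simplification.

```python
def distinct_ngram_ratio(text, n=2):
    """Fraction of n-grams in the text that are unique (1.0 = no repetition)."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        # Too short to contain any n-gram, so nothing can repeat.
        return 1.0
    return len(set(ngrams)) / len(ngrams)
```

A rationale that loops over the same phrase scores low, while one with no repeated bigrams scores 1.0.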

Lastly, a rationale should be consistent with the gold label for the input. Consistency ensures that a rationale does not include irrelevant information and that it supports the gold answer. We focus on consistency with respect to the gold label because misleading rationales are unhelpful both as LM justifications and for human use.

All of these properties are agnostic of the actual prediction made by the LM. Since our self-rationalization setup generates a rationale first, followed by the prediction, we aim to generate good-quality rationales, which should ideally improve the answer generated by the LM. Therefore, we focus on improving self-rationalization along these three properties, as well as on task accuracy. In other words, alongside the rationale properties above, we also treat task correctness as a necessary property, one that better rationales should improve as a byproduct.

The table below shows representative examples of rationales generated by training with MaRio in comparison with the supervised fine-tuned baseline SFT. We also release the full set of rationale comparisons for all datasets in this drive folder.

We first present human preference studies comparing rationales generated by MaRio and the supervised fine-tuned baseline SFT for all five datasets. For each instance, we ask three distinct annotators from a pool of qualified annotators to compare the two rationales, given a question and its correct answer, across three settings: plausibility and consistency, which are defined in the same manner as the rewards, and an overall preference rating. Preference indicates which rationale the annotators would find acceptable for the given question. In the figure below, we plot the percentage of instances where a majority of annotators prefer only MaRio's rationales, only SFT's rationales, both, or none. Human annotators prefer only MaRio's rationales for 83.15%, 75.3%, 71.49%, 67.44% and 66.6% of instances for StrategyQA, QuaRel, OpenBookQA, NumerSense and QASC, respectively. Human annotators also find MaRio's rationales to be considerably more plausible and consistent than SFT's. We do not perform human studies for diversity and accuracy, since they are automatic and straightforward metrics. We use Amazon MTurk for all our human studies.
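The aggregation described above (three annotators per instance, then the share of instances where a majority picks each option) can be sketched as follows. The label strings and the "no majority" bucket for three-way splits are assumptions for illustration; the text does not say how such splits were handled.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by more than half of the annotators, if any."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else "no majority"


def preference_percentages(all_votes):
    """Percentage of instances per majority label, over all instances."""
    total = len(all_votes)
    if total == 0:
        return {}
    counts = Counter(majority_label(votes) for votes in all_votes)
    return {label: 100.0 * count / total for label, count in counts.items()}
```

Each element of `all_votes` is the list of three annotator choices for one instance, for example `["mario", "mario", "sft"]`.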

This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract #2022-22072200006. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

Today, we are joined by Mor Geva, a postdoctoral researcher, now at Google and previously at the Allen Institute for AI (AI2). Her research focuses on debugging the inner workings of black-box NLP models, to increase their transparency, control their operation, and improve their reasoning abilities. Mor is a previous guest on the show. The last time, she spoke about annotator bias in language models and how it affects the robustness of NLP models. Today, she follows up on that study, investigating where bias starts.

She started by discussing a pattern she observed in datasets produced by crowdsourced workers, which was largely due to the instructions given to annotators by the researcher. She then detailed ways researchers can frame questions and instructions to avoid propagating bias when hiring crowdsourced workers.

Mor spoke about the StrategyQA dataset, a question-answering benchmark for testing the ability of models to perform implicit reasoning. She discussed how the data was gathered and the steps taken to ensure the data was diversified in terms of topics and reasoning types.

StrategyQA is one of the challenging tasks in BIG-bench, a collaborative benchmark for measuring the capabilities of large language models. The construction of BIG-bench was led by Google and involved contributions from over 400 researchers in the NLP community. She highlighted possible reasons why the top-ranking models on the leaderboard performed well.

Mor then discussed the place of benchmarks in advancing language models, with particular attention to BIG-bench. In closing, she gave her take on whether the current trajectory of language models will lead to AGI and highlighted some limitations of large language models. You can follow Mor on Twitter @megamor2 or on her webpage.

The rise of large language models has ushered in a new era of conversational AI. Yet, when deployed in complex enterprise settings, even the most capable models can stumble. These models struggle with specialized terminology, procedural nuances, and industry-specific data.

Moveworks' machine learning thrives on our dedication to curating specialized enterprise data. This high-quality training data enables MoveLM to achieve a nuanced understanding of how employees ask for help at work.

Over six years, Moveworks has aggregated over 500 million support tickets, facilitated approximately 14 million bot conversations, ingested over 400,000 knowledge base articles, and collected over 35,000 forms. This comprehensive enterprise dataset powers MoveLM's abilities to handle real-world business scenarios.

Our first critical advantage is our ability to tap into an extensive repository of proprietary enterprise data that has been fully anonymized prior to model training. This repository provides a robust foundation for our model. However, it required extensive data masking, privacy engineering, and careful adaptation into instructional input/output formats tailored for MoveLM's training.

Through an expert annotation team, we extracted high-quality annotated datasets for the tasks most commonly used to understand user queries in enterprise settings. These included traditional tasks like entity typing, intent classification, question answering, and slot filling. They also included new tasks made possible by the reasoning capabilities of LLMs, such as function calling. Converting these datasets created over 60,000 real-world examples to teach nuanced business concepts.

By converting these diverse datasets into a consistent format showing instructions and corresponding actions, we created hundreds of thousands of real-world examples to teach MoveLM nuanced enterprise concepts that generic models struggle with.

This conversion process helped to tap into our vast amount of data and mold it into a usable form for MoveLM. This was a critical step towards creating a model that understands complex instructions and, crucially, how to execute them.
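As a rough picture of the conversion described above, the sketch below maps records from two differently shaped task datasets into one instruction/input/output layout. Every field name, task name, and instruction string here is hypothetical; the actual MoveLM schema is not described in this post.

```python
def intent_to_instruction(record):
    """Convert a hypothetical intent-classification record."""
    return {
        "instruction": "Classify the intent of the employee request.",
        "input": record["text"],
        "output": record["intent"],
    }


def qa_to_instruction(record):
    """Convert a hypothetical question-answering record."""
    return {
        "instruction": "Answer the question using the given context.",
        "input": f"Context: {record['context']}\nQuestion: {record['question']}",
        "output": record["answer"],
    }


# Hypothetical registry: one converter per source task.
CONVERTERS = {"intent": intent_to_instruction, "qa": qa_to_instruction}


def convert(task, records):
    """Convert all records of one task into the shared instruction format."""
    converter = CONVERTERS[task]
    return [converter(record) for record in records]
```

Keeping one small converter per task means a new task only needs a new function and a registry entry, while every downstream training example shares the same three keys.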

Seeking to expand our dataset, we leveraged MoveLM itself through responsible self-instruction techniques. By exposing MoveLM only to anonymized excerpts from tens of millions of employee service, HR, IT, and customer chat logs within Moveworks' secure environment, we could generate new and diverse training examples.