GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.
For the main configuration, each instance contains a string for the grade-school level math question and a string for the corresponding answer with multiple steps of reasoning and calculator annotations (explained here).
For the socratic configuration, each instance contains a string for a grade-school level math question, a string for the corresponding answer with multiple steps of reasoning, calculator annotations (explained here), and Socratic sub-questions.
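To make the instance format concrete, the sketch below represents a main-configuration instance as a question/answer pair and parses its two notable features: calculator annotations written as `<<expr=result>>` and the final answer following the `####` marker. The problem shown is a well-known example from the training split; the parsing helpers are our own illustration, not part of any official dataset tooling.

```python
import re

# A representative "main" configuration instance. Calculator annotations
# appear as <<expr=result>>; the final answer follows the "####" marker.
instance = {
    "question": (
        "Natalia sold clips to 48 of her friends in April, and then she sold "
        "half as many clips in May. How many clips did Natalia sell altogether "
        "in April and May?"
    ),
    "answer": (
        "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
        "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
        "#### 72"
    ),
}

def strip_calculator_annotations(answer: str) -> str:
    """Remove <<...>> calculator annotations, leaving plain reasoning text."""
    return re.sub(r"<<[^>]*>>", "", answer)

def final_answer(answer: str) -> str:
    """Extract the final answer that follows the '####' marker."""
    return answer.split("####")[-1].strip()

print(final_answer(instance["answer"]))  # -> 72
```

The socratic configuration adds guiding sub-questions to the solution text, but the calculator-annotation and `####` conventions are the same.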
We initially collected a starting set of a thousand problems and natural language solutions by hiring freelance contractors on Upwork (upwork.com). We then worked with Surge AI (surgehq.ai), an NLP data labeling platform, to scale up our data collection. After collecting the full dataset, we asked workers to re-solve all problems, with no workers re-solving problems they had originally written. We checked whether their final answers agreed with the original solutions, and any problems that produced disagreements were either repaired or discarded. We then performed another round of agreement checks on a smaller subset of problems, finding that 1.7% of problems still produced disagreements among contractors. We estimate this to be the fraction of problems that contain breaking errors or ambiguities; it is possible that a larger percentage of problems contain subtle errors.
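The agreement check described above amounts to comparing normalized final answers between each original solution and an independent re-solve. The sketch below is our own reconstruction of that bookkeeping, not the authors' released code; it reuses the dataset's `####` final-answer convention.

```python
def extract_final(solution: str) -> str:
    """Pull the final answer after '####' and normalize it for comparison."""
    return solution.split("####")[-1].strip().replace(",", "")

def agreement_rate(originals, resolves):
    """Fraction of problems whose re-solved final answer matches the original."""
    matches = sum(
        extract_final(a) == extract_final(b) for a, b in zip(originals, resolves)
    )
    return matches / len(originals)

# Toy data: one agreeing pair and one disagreement that would be
# flagged for repair or removal.
originals = ["... #### 72", "... #### 15"]
resolves  = ["... #### 72", "... #### 14"]
print(agreement_rate(originals, resolves))  # -> 0.5
```

Problems whose pairs disagree would be routed back to contractors for repair or dropped from the dataset.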
Researchers have been using GSM8K to develop methods for improving the performance of large language models on multi-step mathematical reasoning tasks. One such method involves training verifiers to judge the correctness of model completions, which has been shown to significantly improve performance on the GSM8K dataset.
The GSM8K dataset, a collaborative effort between OpenAI and Surge AI, comprises 8,500 high-quality math word problems, crafted by experts to reflect linguistic diversity and grade school math concepts. Designed for step-by-step problem-solving, the dataset serves as both a benchmark for large language models like GPT-3 and a tool for advancing AI problem-solving techniques.
The problem creation and curation process combined expert knowledge of elementary math, attention to linguistic variety, and stringent quality checks to ensure clarity and solvability through basic arithmetic.
GSM8K is a dataset of 8,500 high-quality, linguistically diverse grade school math word problems created by human problem writers. The dataset is segmented into 7,500 training problems and 1,000 test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations. The dataset is designed to train language models like GPT-3 to solve natural language math problems and measure their performance.
The current state-of-the-art on GSM8K is GPT-4 Code Interpreter (CSV, K=5). Researchers have found that even the largest transformer models struggle to achieve high test performance on GSM8K, despite the conceptual simplicity of the problem distribution. To increase performance, some researchers propose training verifiers to judge the correctness of model completions. At test time, they generate many candidate solutions and select the one ranked highest by the verifier, demonstrating that verification significantly improves performance on GSM8K.
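The verification approach reduces to best-of-N selection: sample many candidate solutions, score each with a learned verifier, and return the top-ranked one. In the sketch below the verifier is a stand-in stub (a real verifier is a model trained to predict solution correctness); the selection logic itself is the technique described above.

```python
from typing import Callable

def best_of_n(candidates: list[str], verifier: Callable[[str], float]) -> str:
    """Return the candidate solution the verifier scores highest."""
    return max(candidates, key=verifier)

# Stand-in verifier for illustration only: score by number of reasoning
# lines. A trained verifier would instead estimate solution correctness.
def toy_verifier(solution: str) -> float:
    return float(len(solution.split("\n")))

candidates = [
    "48 + 24 = 72\n#### 72",
    "48 / 2 = 24\n48 + 24 = 72\n#### 72",
]
print(best_of_n(candidates, toy_verifier))
```

At test time the same pattern is run with many more samples per problem, which is what drives the reported performance gains.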
Common methods for approaching GSM8K, a dataset of 8.5K high-quality, linguistically diverse grade school math word problems, involve using large language models (LLMs) with various prompting techniques to solve multi-step mathematical reasoning problems. These include chain-of-thought prompting, self-consistency over multiple sampled solutions, and verifier-based reranking of candidate answers.
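Chain-of-thought prompting is the most widely used of these techniques: the model is shown worked examples with explicit reasoning before the target question, so that it continues in the same step-by-step style. A minimal sketch of assembling such a prompt (the example problems here are our own illustrations, not drawn from the dataset):

```python
def build_cot_prompt(examples, question):
    """Assemble a few-shot chain-of-thought prompt from worked examples."""
    parts = []
    for q, solution in examples:
        parts.append(f"Q: {q}\nA: {solution}")
    # The trailing "A:" invites the model to produce its own reasoning chain.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

examples = [
    ("Tom has 3 bags with 4 apples each. How many apples does he have?",
     "Each bag has 4 apples and there are 3 bags, so 3 * 4 = 12. "
     "The answer is 12."),
]
prompt = build_cot_prompt(
    examples,
    "Sara reads 5 pages a day for 6 days. How many pages does she read?",
)
print(prompt)
```

The model's completion is then parsed for its final answer, and techniques like self-consistency simply sample this completion many times and take a majority vote.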
GSM8K presents challenges for LLMs in terms of accuracy, verification, scaling, and comparison with other datasets. Researchers continue to explore ways to improve LLM performance on GSM8K and similar datasets.
Some future directions for GSM8K research include improving the reliability of large language models (LLMs) and enhancing their ability to solve complex mathematical problems. OpenAI's Q* project is an example of such research, which aims to bring groundbreaking progress in artificial general intelligence (AGI) by enhancing mathematical reasoning ability in conventional LLMs.
As for Q*, the project is still in development, and its future success is uncertain. However, researchers at OpenAI are optimistic about Q*'s potential to advance AI capabilities, particularly in mathematical reasoning. The Q* project aims to solve certain mathematical problems and has the potential to bring significant progress in AGI research.
This guide details the steps for going from a pre-trained, unoptimized Llama2 7B model to a 50% sparse Llama2 7B model that has been fine-tuned on the GSM8K dataset and fully recovers, and even exceeds, the dense baseline accuracy.
Note: Some of these hyper-parameters may need further tuning to enhance the overall accuracy of the fine-tuned model. The values mentioned above were obtained through a quick hyper-parameter search. Parameters worth considering for tuning, as they can have a significant impact, include: `learning_rate`, `max_grad_norm`, `warmup_steps`, and `max_seq_length`.
The `example_fsdp_config.yaml` used above contains the following setup for FSDP. Set `num_processes` to the number of GPUs available; for our setup we used 4 NVIDIA A100 GPUs, so we set `num_processes` to 4.
Evaluating the dense fine-tuned model on the gsm8k 0-shot task results in a baseline accuracy of 37.52%. We'll treat this accuracy as our baseline for calculating recovery for the one-shot sparse and sparse fine-tuned models we'll obtain later. Detailed results are provided below:
Evaluating the one-shot 50% sparse model on the gsm8k 0-shot task results in an accuracy of 33.81%, which translates to a 90.11% recovery relative to our [dense baseline](#Dense fine-tuned model accuracy). In the next step we'll see how to improve this model's recovery using sparse fine-tuning. Detailed results for the one-shot 50% sparse model are provided below:
The one-shot sparse model generated previously can undergo further sparse fine-tuning to enhance its overall accuracy. This process involves distilling information from the previously obtained dense fine-tuned model, which serves as the teacher model, to the one-shot sparse model, acting as the student. This can be achieved using the following command and recipe.
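Distillation here means training the sparse student to match the dense teacher's output distribution. Below is a framework-free sketch of the soft-target loss (a temperature-scaled KL divergence); in the actual recipe this runs inside the training loop on model logits, which we stub with plain lists for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Stub logits standing in for one vocabulary position of each model.
teacher = [2.0, 1.0, 0.1]
student = [1.5, 1.2, 0.3]
print(distillation_loss(teacher, student))
```

The loss is zero when the student exactly matches the teacher and grows as their distributions diverge, which is what pushes the sparse student back toward dense-model behavior.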
Evaluating the fine-tuned 50% sparse model on the gsm8k 0-shot task results in an accuracy of 38.59%, a clear improvement over the [oneshot accuracy](#Oneshot 50% sparse model accuracy). The sparse fine-tuning step not only improved on the one-shot accuracy but even surpassed the dense baseline model. Detailed results for the fine-tuned 50% sparse model are provided below:
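Recovery as used throughout this guide is simply the sparse model's accuracy expressed as a percentage of the dense baseline, and the figures above check out:

```python
def recovery(sparse_acc: float, dense_acc: float) -> float:
    """Sparse accuracy as a percentage of the dense baseline accuracy."""
    return round(sparse_acc / dense_acc * 100, 2)

print(recovery(33.81, 37.52))  # one-shot 50% sparse   -> 90.11
print(recovery(38.59, 37.52))  # sparse fine-tuned     -> 102.85
```

A recovery above 100% means the sparse fine-tuned model outperforms the dense baseline it was derived from.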
Back in 2019, a group of computer scientists performed a now-famous experiment with far-reaching consequences for artificial intelligence research. At the time, machine vision algorithms were becoming capable of recognizing a wide range of objects with some recording spectacular results in the standard tests used to assess their abilities.
But there was a problem with the method behind all these tests. Almost all the algorithms were trained on a database of labelled images, known as ImageNet. The database contained millions of images which had been carefully described in human-written text to help the machines learn. This effort was crucial for the development of machine vision and ImageNet became a kind of industry standard.
In this way, the computer scientists used a subset of the images to train algorithms to identify a strawberry, a table, a human face and so on, using labelled images in the dataset. They then used a different subset of images to test the algorithms. Over time, computer scientists claimed that their algorithms were becoming increasingly good at recognizing objects in the real world.
But privately, researchers began to wonder whether this was really true. Because the ImageNet database was becoming so famous, an alternative explanation was that its images, or ones very like them, were leaking into the real world. So AI systems trained on them were just recognizing images they had already seen.
Their experiment became a famous example of the pitfalls of relying on single databases for testing machines. Without careful management of this database, AI systems can seem to be good at a task in general but are really only repeating what they have already learnt.
Over the years, AI systems have become increasingly better at answering the questions in GSM8K. That has led to various claims that AI systems are becoming better at the kind of reasoning needed to solve these problems.
But there is another possibility. This is that GSM8K has become so well known that the test questions have begun to leak into the wild. As a result, AI systems may come across them during their broader benchmark training. So rather than answering them by reasoning, they could just be repeating the answer they saw during their training.
Following the lead of the Berkeley researchers, the Scale AI team decided to test this idea by developing their own mathematics test of 1,250 questions. They call it GSM1k and have carefully ensured that it closely resembles the GSM8K test but has never been published.