Flan-T5 Large

Sherlene Holloman

Aug 5, 2024, 2:38:17 AM

to credunnesting

If you already know T5, FLAN-T5 is just better at everything. For the same number of parameters, these models have been fine-tuned on more than 1000 additional tasks, which also cover more languages. As mentioned in the first few lines of the abstract:

Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.

The primary use is research on language models, including: research on zero-shot NLP tasks and in-context few-shot learning NLP tasks, such as reasoning and question answering; advancing fairness and safety research; and understanding the limitations of current large language models.

Language models, including Flan-T5, can potentially be used for language generation in a harmful way, according to Rae et al. (2021). Flan-T5 should not be used directly in any application, without a prior assessment of safety and fairness concerns specific to the application.

Flan-T5 is fine-tuned on a large corpus of text data that was not filtered for explicit content or assessed for existing biases. As a result the model itself is potentially vulnerable to generating equivalently inappropriate content or replicating inherent biases in the underlying data.

This trick of loading the model outside of _map_fn is awesome! It should save some memory. In pytorch-xla the model and the dataset are loaded in every process (8 in the case of 8 TPU cores), so it ends up taking a lot of memory. Lazy loading the dataset should also reduce RAM usage.

Specifically, we use a mean span length of 3 and corrupt 15% of the original sequence. We found that this objective produced marginally better performance (Table 7) while being slightly more computationally efficient due to shorter target sequence lengths.

Q: Do we have access to T5 1.1 checkpoints?

A: No, because they are not obvious wins. See "Should I use t5v1.1, t5narrow and TalkingHeads?" (Issue #266, google-research/text-to-text-transfer-transformer on GitHub).

More on T5 pre-training objective

Each corrupted span is replaced by a unique sentinel token. The output sequence then consists of the dropped-out spans, delimited by the sentinel tokens used to replace them in the input plus a final sentinel token.
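The sentinel mechanics described above can be sketched in a few lines of Python. This is only an illustration, not the T5 preprocessing code: the function name `corrupt_spans` is hypothetical, the spans are passed in by hand instead of being sampled (the 15% corruption rate and mean span length of 3 are not implemented here), and the `<extra_id_N>` sentinel naming follows the convention of the Hugging Face T5 tokenizer.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, length) span with a sentinel and build the target.

    Hypothetical helper for illustration; spans are chosen by the caller
    rather than sampled at random.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, length) in enumerate(sorted(spans)):
        if start < cursor:
            raise ValueError("spans must not overlap")
        sentinel = f"<extra_id_{sentinel_id}>"
        # Input keeps the text before the span, then one sentinel for the span.
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        # Target repeats the sentinel, followed by the dropped-out tokens.
        targets.append(sentinel)
        targets.extend(tokens[start:start + length])
        cursor = start + length
    inputs.extend(tokens[cursor:])
    # A final sentinel closes the target sequence.
    targets.append(f"<extra_id_{len(spans)}>")
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = corrupt_spans(tokens, [(2, 2), (8, 1)])
print(" ".join(corrupted))  # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(" ".join(target))     # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Because only the dropped-out spans appear in the target, the target is much shorter than the input, which is where the efficiency gain mentioned earlier comes from.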

I would like to bring into this discussion that when I run the Mesh TensorFlow version of T5 from the research repo ( -research/text-to-text-transfer-transformer) on TPU on my data set, 16-bit training is rock solid (I assume because of the wider range of bf16). On the same data set I essentially can never get fp16 working on anything larger than t5-small with HuggingFace (with Adafactor, with and without LR warmup, native/apex (1/2/3), etc.).
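To make the range argument concrete, here is a small sketch that derives the largest finite value of each format from its bit layout (fp16: 5 exponent bits and 10 mantissa bits; bf16: 8 exponent bits and 7 mantissa bits). The helper `max_finite` and the sample activation value are made up for illustration, and overflow is only one plausible cause of the instability described above.

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary floating-point format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent is reserved for inf/NaN, so the largest usable
    # unbiased exponent equals the bias.
    return (2 - 2 ** -mantissa_bits) * 2.0 ** bias


FP16_MAX = max_finite(exponent_bits=5, mantissa_bits=10)
BF16_MAX = max_finite(exponent_bits=8, mantissa_bits=7)

print(FP16_MAX)  # 65504.0
print(BF16_MAX)  # roughly 3.39e38

# Hypothetical activation magnitude, chosen only to illustrate the gap.
activation = 1.0e5
print(activation > FP16_MAX)  # True: would overflow to inf in fp16
print(activation > BF16_MAX)  # False: still representable in bf16
```

bf16 buys this range by giving up mantissa bits, so it is less precise than fp16 but far less likely to overflow.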

I have tested the exact following code on t5-small and t5-base and they work fine. However, when using t5-large and/or flan-t5-xl, the model produces nan outputs. This is solely a result of using half precision (ignore the multiple GPUs, strategy etc, I have tested with every other variation):

Though if you need to run a large model on your MBP, one of the LLaMa-based models would probably be easier to work with. For example, Alpaca-Lora and GPT4All are supposed to be able to run performantly on a laptop. These are also instruction-tuned models, so they should do tasks like what you mentioned in this thread.

I'm trying to use Google's flan-t5-large to create embeddings for a simple semantic search engine. However, the cosine similarity between the generated embeddings and my query is way off. Is there something I'm doing wrong?

But that isn't an issue, because FLAN is intended for other use cases. It was trained on different datasets, each with an instruction prompt suited to its task, to allow zero-shot prompting (i.e. performing tasks the model hasn't been trained on). That means you could perform your similarity task by formulating a proper prompt without any training. For example:

Depending on your use case you might face issues when the number of options increases or when you want to work with the sentence embeddings. If this is the case, you should have a look at sentence-transformers. These are transformers that were trained to produce meaningful sentence embeddings and can therefore be used to calculate the cosine similarity of two sentences.
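For reference, the cosine similarity itself is simple to compute once you have two embedding vectors. The sketch below uses plain Python lists and made-up vectors; in practice the vectors would come from an embedding model such as one from sentence-transformers, which is not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for sentence embeddings.
query = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]
orthogonal = [3.0, 0.0, -1.0]

print(cosine_similarity(query, same_direction))  # close to 1.0
print(cosine_similarity(query, orthogonal))      # 0.0
```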

FLAN-T5 is a family of large language models trained at Google, finetuned on a collection of datasets phrased as instructions. It has strong zero-shot, few-shot, and chain of thought abilities. Because of these abilities, FLAN-T5 is useful for a wide array of natural language tasks. This model is FLAN-T5-Large, the 780M parameter version of FLAN-T5. To learn more about FLAN-T5, read the FLAN paper here.

FLAN-T5 is capable of various natural language tasks. Some of these include question answering, classification, summarization and translation, among others. Here are some examples of this, summarized here and linked for information about the parameters used.

In this blog, we are going to show you how to apply Low-Rank Adaptation of Large Language Models (LoRA) to fine-tune FLAN-T5 XXL (11 billion parameters) on a single GPU. We are going to leverage Hugging Face Transformers, Accelerate, and PEFT.
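As a rough intuition for why LoRA makes this feasible: instead of updating a full weight matrix, it freezes the matrix and trains two small low-rank matrices whose product is added to it. The sketch below only counts parameters for a single matrix. The 4096 x 4096 shape and the rank of 16 are assumptions picked for illustration, not values taken from the blog.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for full fine-tuning vs. a LoRA update of one matrix."""
    full = d_out * d_in           # every entry of W is trainable
    lora = rank * (d_out + d_in)  # B is (d_out x rank), A is (rank x d_in)
    return full, lora


# Hypothetical layer shape and rank, for illustration only.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=16)
print(full)         # 16777216
print(lora)         # 131072
print(lora / full)  # 0.0078125, i.e. under 1% of the full matrix
```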

PEFT, or Parameter Efficient Fine-tuning, is a new open-source library from Hugging Face to enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. PEFT currently includes techniques for:

In our example, we use the PyTorch Deep Learning AMI with already set up CUDA drivers and PyTorch installed. We still have to install the Hugging Face Libraries, including transformers and datasets. Running the following cell will install all the required packages.

To train our model, we need to convert our inputs (text) to token IDs. This is done by a Hugging Face Transformers tokenizer. If you are not sure what this means, check out chapter 6 of the Hugging Face Course.

Before we can start training, we need to preprocess our data. Abstractive summarization is a text-generation task: our model takes a text as input and generates a summary as output. We want to know how long our inputs and outputs are so that we can batch our data efficiently.
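One way to pick a maximum length is to look at a high percentile of the token counts rather than the absolute maximum, so that a few very long samples do not force heavy padding on every batch. The sketch below assumes this approach. The helper name, the 85th percentile and the token counts are all made up for illustration, and real counts would come from the tokenizer.

```python
import math


def percentile_length(lengths, pct):
    """Nearest-rank percentile of a list of token counts (pct in 1..100)."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up token counts for ten samples; one outlier is very long.
token_counts = [12, 48, 51, 60, 75, 90, 110, 130, 240, 900]

max_source_length = percentile_length(token_counts, 85)
print(max_source_length)  # 240: the 900-token outlier would be truncated
```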

The first step of our training is to load the model. We are going to use philschmid/flan-t5-xxl-sharded-fp16, which is a sharded version of google/flan-t5-xxl. The sharding will help us not run out of memory when loading the model.

Nice! Our model works! Now let's take a closer look and evaluate it against the test set of the processed samsum dataset. To do so, we need some utilities to generate the summaries and group them together. The most commonly used metric for evaluating summarization is rouge_score, short for Recall-Oriented Understudy for Gisting Evaluation. This metric does not behave like standard accuracy: it compares a generated summary against a set of reference summaries.
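To give a feel for what the metric measures, here is a simplified ROUGE-1 sketch that scores unigram overlap between one generated summary and one reference. It is not the rouge_score implementation: it splits on whitespace, applies no stemming, and handles a single reference only. The function name `rouge1` and the example sentences are made up.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Simplified ROUGE-1: clipped unigram overlap between two strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    precision = overlap / max(1, sum(cand.values()))
    recall = overlap / max(1, sum(ref.values()))
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
print(scores)  # 5 of 6 unigrams match, so all three scores are about 0.833
```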

Hello, could someone provide more information regarding the maximum input and output size of the Flan-T5 models? While reading the paper, I noticed it was trained on 1024 input length and 256 output length, but I also saw conflicting information. Can someone please clarify? Thank you.

As token limitations in the input/output are inherent limitations of the LLM models, I would assume this depends on the model you choose. As you can see in Huggingface, there are several versions of Flan-T5, which may lead to different input/output token limitations.

Does it make sense?

Hello @carloshvp, thank you for your answer. I was doing some research on Flan-T5-large, and I am not sure about its specific input and output lengths. If you know how to obtain these lengths, or if you are already familiar with the input/output size of Flan-T5-large, feel free to share the numbers with me.

As you know, FLAN-T5 is an instruction fine-tuned variant of T5. Looking at the paper behind T5 (here), it looks like they used a maximum sequence length of 512 tokens, which means anything beyond that will probably give bad results. There is, however, some literature about extending the context size, but that is another topic.

I am not sure, however, that d_model is the parameter to look at. In previous versions of the Hugging Face documentation, there was a very useful and clear parameter called n_positions, which is exactly what you are searching for (and it is also 512).
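If you want to stay within that 512-token budget, one simple option is to split a long list of token IDs into consecutive windows before passing them to the model. The sketch below is a made-up helper, not part of any library, and it treats 512 as the limit based on the discussion above. Note that splitting this way loses context across window boundaries.

```python
def chunk_token_ids(token_ids, max_length=512):
    """Split a list of token IDs into consecutive windows of at most max_length."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [
        token_ids[start:start + max_length]
        for start in range(0, len(token_ids), max_length)
    ]


# Stand-in for a tokenized document of 1200 tokens.
chunks = chunk_token_ids(list(range(1200)))
print([len(chunk) for chunk in chunks])  # [512, 512, 176]
```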

Sometimes some artificial intelligence models go unnoticed despite their worth. This is the case with FLAN-T5, a model developed by Google and with a name as appetizing as its NLP power. The California company created a new example of the democratization of artificial intelligence and we explain why. FLAN-T5, a yummy model superior to GPT-3.

Firstly, we have Google T5 (Text-to-Text Transfer Transformer). T5 is a transformer-based architecture that uses a text-to-text approach and is the epitome of encoder-decoder excellence in the world of natural language processing (NLP).

GPT-3 is a very popular model, but testing and using it properly requires a computing budget that can seldom be found in a regular home; that kind of computing power is not easy to get. FLAN-T5, however, does not need large machines, because its smaller models/checkpoints are built with ordinary users in mind. It detects sarcasm, is very intuitive, and is able to reinterpret questions. Given 5 examples in the prompt (5-shot), the 3-billion-parameter FLAN-T5 XL outperforms GPT-3. In fact, it does not need many examples, and it is also very good at zero-shot tasks.
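For readers unfamiliar with the term, "5-shot" simply means that five solved examples are placed in the prompt before the actual question. A minimal sketch of building such a prompt is shown below. The `Q:`/`A:` layout, the function name and the examples are assumptions for illustration, not a format prescribed for FLAN-T5.

```python
def build_few_shot_prompt(examples, query):
    """Join solved (question, answer) pairs and a final open question."""
    blocks = [f"Q: {question}\nA: {answer}" for question, answer in examples]
    blocks.append(f"Q: {query}\nA:")  # left open for the model to complete
    return "\n\n".join(blocks)


# Five made-up demonstrations, so this is a 5-shot prompt.
examples = [
    ("What is 2 + 2?", "4"),
    ("What is 3 + 5?", "8"),
    ("What is 7 + 1?", "8"),
    ("What is 6 + 4?", "10"),
    ("What is 9 + 2?", "11"),
]
prompt = build_few_shot_prompt(examples, "What is 5 + 3?")
print(prompt)
```

Passing an empty list of examples yields a zero-shot prompt containing only the final question.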