A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (GemmaConfig) and inputs.
A transformers.modeling_flax_outputs.FlaxBaseModelOutput or a tuple of jnp.ndarray (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (GemmaConfig) and inputs.
A transformers.modeling_flax_outputs.FlaxMaskedLMOutput or a tuple of jnp.ndarray (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (GemmaConfig) and inputs.
Gemma is a set of lightweight, generative artificial intelligence (AI) open models. Gemma models are available to run in your applications and on your hardware, mobile devices, or hosted services. You can also customize these models using tuning techniques so that they excel at performing tasks that matter to you and your users. Gemma models are based on Gemini models and are intended for the AI development community to extend and take further.
Fine-tuning can help improve a model's performance on specific tasks. Because models in the Gemma model family are open weight, you can tune any of them using the AI framework of your choice and the Vertex AI SDK. You can open a notebook example to fine-tune the Gemma model using a link available on the Gemma model card in Model Garden.
Vertex AI offers a managed platform for rapidly building and scaling machine learning projects without needing in-house MLOps expertise. You can use Vertex AI as the downstream application that serves the Gemma models. For example, you might port weights from the Keras implementation of Gemma. Next, you can use Vertex AI to serve that version of Gemma to get predictions. We recommend using Vertex AI if you want end-to-end MLOps capabilities, value-added ML features, and a serverless experience for streamlined development.
Google Kubernetes Engine (GKE) is the Google Cloud solution for managed Kubernetes that provides scalability, security, resilience, and cost effectiveness. We recommend this option if you have existing Kubernetes investments, your organization has in-house MLOps expertise, or if you need granular control over complex AI/ML workloads with unique security, data pipeline, and resource management requirements. To learn more, see the following tutorials in the GKE documentation:
Gemma models are available in several sizes so you can build generative AI solutions based on your available computing resources, the capabilities you need, and where you want to run them. Each model is available in a tuned and an untuned version:
Pretrained - This version of the model wasn't trained on any specific tasks or instructions beyond the Gemma core data training set. We don't recommend using this model without performing some tuning.
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
A causal language model (LM) predicts the next token based on previous tokens. This task setup can be used to train the model unsupervised on plain text input, or to autoregressively generate plain text similar to the data used for training. This task can be used for pre-training or fine-tuning a Gemma model, simply by calling fit().
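The next-token objective can be illustrated with a small sketch in plain Python (the function name is illustrative, not part of KerasNLP): each training pair uses the token sequence as input and the same sequence shifted left by one position as the label.

```python
def make_causal_lm_pair(token_ids):
    """Build an (input, label) pair for next-token prediction:
    the label at position i is the token at position i + 1."""
    return token_ids[:-1], token_ids[1:]

tokens = [2, 15, 7, 9, 3]          # e.g. [BOS, "the", "cat", "sat", EOS]
inputs, labels = make_causal_lm_pair(tokens)
print(inputs)   # [2, 15, 7, 9]
print(labels)   # [15, 7, 9, 3]
```

Training on such pairs is what fit() does under the hood for a causal LM: every position in the sequence supervises the prediction of the token that follows it.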
This model has a generate() method, which generates text based on a prompt. The generation strategy used is controlled by an additional sampler argument on compile(). You can recompile the model with different keras_nlp.samplers objects to control the generation. By default, "greedy" sampling will be used.
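The difference between sampling strategies can be sketched independently of Keras. The toy functions below illustrate the concepts only; they are not the keras_nlp.samplers implementation.

```python
import math
import random

def greedy_sample(logits):
    """Greedy decoding: always pick the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def temperature_sample(logits, temperature=1.0, rng=random.Random(0)):
    """Random sampling from the softmax distribution; lower
    temperature concentrates probability on high-scoring tokens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for stability
    probs = [math.exp(l - m) for l in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [0.1, 2.5, 0.3]
print(greedy_sample(logits))  # 1
```

Greedy sampling is deterministic, which is why it is a safe default; swapping in a random sampler trades determinism for more varied output.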
This model can optionally be configured with a preprocessor layer, in which case it will automatically apply preprocessing to string inputs during fit(), predict(), evaluate(), and generate(). This is done by default when creating the model with from_preset().
This constructor can be called in one of two ways: either from a task-specific base class like keras_nlp.models.CausalLM.from_preset(), or from a model class like keras_nlp.models.BertClassifier.from_preset(). If calling from the base class, the subclass of the returned object will be inferred from the config in the preset directory.
If a preprocessor is attached to the model, inputs will be preprocessed inside the generate() function and should match the structure expected by the preprocessor layer (usually raw strings). If a preprocessor is not attached, inputs should match the structure expected by the backbone. See the example usage above for a demonstration of each.
Examples of lightweight models include MobileNet, a computer vision model designed for mobile and embedded vision applications; EfficientDet, an object detection model; and EfficientNet, a CNN that uses compound scaling to enable better performance. All of these are lightweight models from Google.
Gemma is a family of lightweight, open machine learning models developed by Google. These models are designed to be accessible and efficient, making AI development available to a broader range of users. Released on February 21, 2024, Gemma is built from the same research and technology used to create the Gemini models. In addition to being lightweight and open, Gemma is text-based, and it excels at tasks like text summarization, question answering, and reasoning.
Based on the number of trainable parameters, Gemma models come in two main variations: 2B and 7B. Gemma also offers instruction-tuned models, such as Gemma 2B-FT and 7B-FT, which are designed for further customization with personal datasets. Gemma can be applied across various industries, in any application that operates on text.
After setting the environment variables, the next step is to install the dependencies. The dependency used here for Gemma is KerasNLP, a collection of natural language processing (NLP) models implemented in Keras and runnable on JAX, PyTorch, and TensorFlow.
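A minimal setup sketch follows. The preset name and install step below follow the usual KerasNLP conventions and are assumptions to adapt to your environment, not part of the original text.

```python
# First: pip install keras-nlp (plus a backend such as JAX).
import os

# The backend must be chosen *before* keras / keras_nlp are imported.
os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow", "torch"

# Assuming keras-nlp is installed and Kaggle credentials are configured,
# a Gemma model is typically loaded from a preset along these lines:
# import keras_nlp
# gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
# print(gemma_lm.generate("What is the meaning of life?", max_length=64))
```

Setting KERAS_BACKEND is what lets the same KerasNLP code run unchanged on JAX, PyTorch, or TensorFlow.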
Fine-tuning is the process of taking a pre-trained model and adjusting it further with additional training on a more specific dataset. This technique leverages the general capabilities of the model and allows the model to excel at specific tasks rather than remaining a general-purpose tool. One technique for achieving this fine-tuning is LoRA (Low-Rank Adaptation).
LoRA is a technique designed to enhance the capabilities of pre-trained transformer models. It was developed to optimize transformer networks efficiently by focusing on a significantly smaller set of trainable parameters. These parameters act like a lightweight "adapter" that sits on top of the pre-trained LLM.
By fine-tuning this adapter, LoRA modifies the model's behavior for the new task without needing to make extensive changes to the underlying structure. This translates to faster training times, reduced memory usage, and the ability to run LLMs on less powerful hardware.
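The parameter savings are easy to see in a toy calculation (pure Python; the sizes are illustrative, not Gemma's actual dimensions). LoRA adapts a frozen weight matrix W of shape d×d as W + B·A, where B is d×r and A is r×d, with rank r much smaller than d, so only B and A are trained.

```python
d, r = 4096, 8                     # hidden size and LoRA rank (illustrative)

full_params = d * d                # training W directly
lora_params = d * r + r * d        # training only B (d x r) and A (r x d)

print(full_params)                 # 16777216
print(lora_params)                 # 65536
print(full_params // lora_params)  # 256  -> 256x fewer trainable parameters

# (In KerasNLP, LoRA is typically switched on for a loaded model with
# something like gemma_lm.backbone.enable_lora(rank=4) -- check the
# current KerasNLP documentation for the exact API.)
```

This reduction in trainable parameters is where the faster training, lower memory use, and modest hardware requirements come from.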
We explored Gemma's innovative and efficient capabilities. It is text-focused and can perform a range of tasks on text. Furthermore, Gemma's support for fine-tuning using LoRA opens up possibilities for customization and adaptation to specific tasks and datasets. This feature enables users to enhance the model's performance further and tailor it to their unique requirements.
Generative AI products are relatively new, and the behaviors of an application can vary more than in earlier forms of software. This makes it important to probe the machine learning models being used, examine examples of the model's behavior, and investigate surprises.
In this codelab, you'll learn how to use LIT to get more out of Google's Gemma model. This codelab demonstrates how to use sequence salience, an interpretability technique, to analyze different prompt engineering approaches.
Text-to-text generative models, such as Gemma, take an input sequence in the form of tokenized text and generate new tokens that are typical follow-ons or completions to that input. This generation happens one token at a time, appending (in a loop) each newly generated token to the input plus any previous generations until the model reaches a stopping condition. Examples include when the model generates an end-of-sequence (EOS) token or reaches the predefined maximum length.
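The loop described above can be sketched with a stand-in model. Here next_token is a placeholder for a real model call, and the EOS id and maximum length are arbitrary choices for illustration.

```python
EOS = 0
MAX_LENGTH = 10

def next_token(tokens):
    """Stand-in for a real model call: emits an EOS token
    once the sequence reaches five tokens."""
    return EOS if len(tokens) >= 5 else len(tokens) + 1

def generate(prompt_tokens):
    """Autoregressive decoding: append one token at a time until
    an EOS token is produced or the maximum length is reached."""
    tokens = list(prompt_tokens)
    while len(tokens) < MAX_LENGTH:   # stopping condition: max length
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == EOS:                # stopping condition: EOS token
            break
    return tokens

print(generate([1, 2]))  # [1, 2, 3, 4, 5, 0]
```

Each iteration feeds the prompt plus everything generated so far back into the model, which is why generation cost grows with sequence length.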
Salience methods are a class of explainable AI (XAI) techniques that can tell you which parts of an input are important to the model for different parts of its output. LIT supports salience methods for a variety of classification tasks, which explain the impact of a sequence of input tokens on the predicted label. Sequence salience generalizes these methods to text-to-text generative models and explains the impact of the preceding tokens on the generated tokens.
You'll use the Grad L2 Norm method here for sequence salience, which analyzes the gradients of the model and provides a magnitude of the influence that each preceding token has on the output. This method is simple and efficient, and has been shown to perform well in classification and other settings. The larger the salience score, the higher the influence. This method is used within LIT because it's well-understood and utilized widely across the interpretability research community.
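The idea behind Grad L2 Norm can be sketched numerically. The toy linear "model" and finite-difference gradients below are illustrative assumptions, not LIT's implementation: each input token embedding is scored by the L2 norm of the gradient of the model's scalar output with respect to that embedding.

```python
import math

def model_score(emb, token_weights, u):
    # Toy scalar "model output": each token embedding contributes its
    # projection onto u, scaled by a per-token weight.
    return sum(
        w * sum(e_j * u_j for e_j, u_j in zip(e, u))
        for w, e in zip(token_weights, emb)
    )

def grad_l2_salience(emb, token_weights, u, eps=1e-5):
    """Per-token salience: L2 norm of d(score)/d(embedding_i),
    estimated here by central finite differences."""
    scores = []
    for i in range(len(emb)):
        sq = 0.0
        for j in range(len(emb[i])):
            plus = [row[:] for row in emb]
            minus = [row[:] for row in emb]
            plus[i][j] += eps
            minus[i][j] -= eps
            g = (model_score(plus, token_weights, u)
                 - model_score(minus, token_weights, u)) / (2 * eps)
            sq += g * g
        scores.append(math.sqrt(sq))
    return scores

emb = [[1.0] * 4 for _ in range(3)]   # 3 tokens, 4-dim embeddings
u = [1.0] * 4
token_weights = [0.1, 2.0, 0.5]
sal = grad_l2_salience(emb, token_weights, u)
print(max(range(3), key=lambda i: sal[i]))  # 1 -- most influential token
```

In a real model the gradient is computed by backpropagation rather than finite differences, but the interpretation is the same: a larger norm means the token had more influence on the output.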
It is best to follow along with this codelab in a new Colab notebook. We recommend using an accelerator runtime, since you will be loading a model into memory, though be aware that accelerator options vary over time and are subject to limitations. Colab offers paid subscriptions if you would like access to more powerful accelerators. Alternatively, you could use a local runtime if your machine has an appropriate GPU.
The following code initializes the LIT wrappers to support salience on the Gemma model. The LIT framework refers to these as models, but in this case they are just different endpoints for the same underlying gemma_model you loaded above. This enables LIT to compute generations, tokenization, and salience on demand.