vipllava

`mindnlp.transformers.models.vipllava.configuration_vipllava` ¶

VipLlava model configuration

`mindnlp.transformers.models.vipllava.configuration_vipllava.VipLlavaConfig` ¶

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [VipLlavaForConditionalGeneration]. It is used to instantiate an VipLlava model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the VipLlava-9B.

e.g. ybelkada/vip-llava-7b-hf

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER	DESCRIPTION
`vision_config`	Custom vision config or dict TYPE: `VipLlavaVisionConfig`, optional DEFAULT: `None`
`text_config`	The config object of the text backbone. Can be any of `LlamaConfig` or `MistralConfig`. TYPE: `Union[AutoConfig, dict]`, optional DEFAULT: `None`
`ignore_index`	The ignore index for the loss function. TYPE: `int`, optional, defaults to -100 DEFAULT: `-100`
`image_token_index`	The image token index to encode the image prompt. TYPE: `int`, optional, defaults to 32000 DEFAULT: `32000`
`projector_hidden_act`	The activation function used by the multimodal projector. TYPE: `str`, optional, defaults to `"gelu"` DEFAULT: `'gelu'`
`projector_layernorm_eps`	The layer norm epsilon of the projector layernorm TYPE: `float`, optional, defaults to 1e-05 DEFAULT: `1e-05`
`vision_feature_layers`	The list of layers to select the vision features from. TYPE: `List[int]`, optional, defaults to `[-2, -5, -8, -11, 6]` DEFAULT: `[-2, -5, -8, -11, 6]`
`image_seq_length`	Sequence length of one image embedding. TYPE: `int`, optional, defaults to 576 DEFAULT: `576`

>>> from transformers import VipLlavaForConditionalGeneration, VipLlavaConfig, CLIPVisionConfig, LlamaConfig

>>> # Initializing a CLIP-vision config
>>> vision_config = CLIPVisionConfig()

>>> # Initializing a Llama config
>>> text_config = LlamaConfig()

>>> # Initializing a VipLlava vipllava-7b style configuration
>>> configuration = VipLlavaConfig(vision_config, text_config)

>>> # Initializing a model from the vipllava-7b style configuration
>>> model = VipLlavaForConditionalGeneration(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

Source code in mindnlp\transformers\models\vipllava\configuration_vipllava.py

class VipLlavaConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`VipLlavaForConditionalGeneration`]. It is used to instantiate an
    VipLlava model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the VipLlava-9B.

    e.g. [ybelkada/vip-llava-7b-hf](https://huggingface.co/ybelkada/vip-llava-7b-hf)

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vision_config (`VipLlavaVisionConfig`,  *optional*):
            Custom vision config or dict
        text_config (`Union[AutoConfig, dict]`, *optional*):
            The config object of the text backbone. Can be any of `LlamaConfig` or `MistralConfig`.
        ignore_index (`int`, *optional*, defaults to -100):
            The ignore index for the loss function.
        image_token_index (`int`, *optional*, defaults to 32000):
            The image token index to encode the image prompt.
        projector_hidden_act (`str`, *optional*, defaults to `"gelu"`):
            The activation function used by the multimodal projector.
        projector_layernorm_eps (`float`, *optional*, defaults to 1e-05):
            The layer norm epsilon of the projector layernorm
        vision_feature_layers (`List[int]`, *optional*, defaults to `[-2, -5, -8, -11, 6]`):
            The list of layers to select the vision features from.
        image_seq_length (`int`, *optional*, defaults to 576):
            Sequence length of one image embedding.

    Example:

    ```python
    >>> from transformers import VipLlavaForConditionalGeneration, VipLlavaConfig, CLIPVisionConfig, LlamaConfig

    >>> # Initializing a CLIP-vision config
    >>> vision_config = CLIPVisionConfig()

    >>> # Initializing a Llama config
    >>> text_config = LlamaConfig()

    >>> # Initializing a VipLlava vipllava-7b style configuration
    >>> configuration = VipLlavaConfig(vision_config, text_config)

    >>> # Initializing a model from the vipllava-7b style configuration
    >>> model = VipLlavaForConditionalGeneration(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "vipllava"
    is_composition = False

    def __init__(
        self,
        vision_config=None,
        text_config=None,
        ignore_index=-100,
        image_token_index=32000,
        projector_hidden_act="gelu",
        projector_layernorm_eps=1e-5,
        vision_feature_layers=[-2, -5, -8, -11, 6],
        image_seq_length=576,
        **kwargs,
    ):
        self.ignore_index = ignore_index
        self.image_token_index = image_token_index
        self.projector_hidden_act = projector_hidden_act
        self.projector_layernorm_eps = projector_layernorm_eps
        self.vision_feature_layers = vision_feature_layers
        self.image_seq_length = image_seq_length
        self.vision_config = vision_config

        if isinstance(self.vision_config, dict):
            vision_config["model_type"] = (
                vision_config["model_type"] if "model_type" in vision_config else "clip_vision_model"
            )
            self.vision_config = CONFIG_MAPPING[vision_config["model_type"]](**vision_config)
        elif vision_config is None:
            self.vision_config = CONFIG_MAPPING["clip_vision_model"](
                intermediate_size=4096,
                hidden_size=1024,
                patch_size=14,
                image_size=336,
                num_hidden_layers=24,
                num_attention_heads=16,
                vocab_size=32000,
                projection_dim=768,
            )

        if isinstance(text_config, dict):
            text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "llama"
            text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
        elif text_config is None:
            text_config = CONFIG_MAPPING["llama"]()

        self.text_config = text_config

        super().__init__(**kwargs)

`mindnlp.transformers.models.vipllava.modeling_vipllava` ¶

MindSpore VipLlava model.

`mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaCausalLMOutputWithPast` `dataclass` ¶

Bases: ModelOutput

Base class for VipLlava causal language model (or autoregressive) outputs.

PARAMETER	DESCRIPTION
`loss`	Language modeling loss (for next-token prediction). TYPE: `mindspore.Tensor` of shape `(1,)`, optional, returned when `labels` is provided DEFAULT: `None`
`logits`	Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length, config.vocab_size)` DEFAULT: `None`
`past_key_values`	Tuple of `tuple(mindspore.Tensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`) Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding. TYPE: `tuple(tuple(mindspore.Tensor))`, optional, returned when `use_cache=True` is passed or when `config.use_cache=True` DEFAULT: `None`
`hidden_states`	Tuple of `mindspore.Tensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. TYPE: `tuple(mindspore.Tensor)`, optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True` DEFAULT: `None`
`attentions`	Tuple of `mindspore.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. TYPE: `tuple(mindspore.Tensor)`, optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True` DEFAULT: `None`
`image_hidden_states`	Tuple of `mindspore.Tensor` (one for the output of the image embeddings, `(batch_size, num_images, sequence_length, hidden_size)`. image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver TYPE: `tuple(mindspore.Tensor)`, optional DEFAULT: `None`

Source code in mindnlp\transformers\models\vipllava\modeling_vipllava.py

@dataclass
# Copied from transformers.models.idefics.modeling_idefics.IdeficsCausalLMOutputWithPast with Idefics->VipLlava
class VipLlavaCausalLMOutputWithPast(ModelOutput):
    """
    Base class for VipLlava causal language model (or autoregressive) outputs.

    Args:
        loss (`mindspore.Tensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
            Language modeling loss (for next-token prediction).
        logits (`mindspore.Tensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
        past_key_values (`tuple(tuple(mindspore.Tensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
            Tuple of `tuple(mindspore.Tensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
            `(batch_size, num_heads, sequence_length, embed_size_per_head)`)

            Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
            `past_key_values` input) to speed up sequential decoding.
        hidden_states (`tuple(mindspore.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
            Tuple of `mindspore.Tensor` (one for the output of the embeddings, if the model has an embedding layer, +
            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

            Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
        attentions (`tuple(mindspore.Tensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
            Tuple of `mindspore.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
            sequence_length)`.

            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
            heads.
        image_hidden_states (`tuple(mindspore.Tensor)`, *optional*):
            Tuple of `mindspore.Tensor` (one for the output of the image embeddings, `(batch_size, num_images,
            sequence_length, hidden_size)`.

            image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver
    """

    loss: Optional[mindspore.Tensor] = None
    logits: mindspore.Tensor = None
    past_key_values: Optional[List[mindspore.Tensor]] = None
    hidden_states: Optional[Tuple[mindspore.Tensor]] = None
    attentions: Optional[Tuple[mindspore.Tensor]] = None
    image_hidden_states: Optional[Tuple[mindspore.Tensor]] = None

`mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaForConditionalGeneration` ¶

Bases: VipLlavaPreTrainedModel

Source code in mindnlp\transformers\models\vipllava\modeling_vipllava.py

class VipLlavaForConditionalGeneration(VipLlavaPreTrainedModel):
    def __init__(self, config: VipLlavaConfig):
        super().__init__(config)
        self.vision_tower = AutoModel.from_config(config.vision_config)

        self.multi_modal_projector = VipLlavaMultiModalProjector(config)
        self.vocab_size = config.text_config.vocab_size
        self.language_model = AutoModelForCausalLM.from_config(
            config.text_config, attn_implementation=config._attn_implementation
        )
        self.pad_token_id = self.config.pad_token_id if self.config.pad_token_id is not None else -1
        self.post_init()

    def get_input_embeddings(self):
        return self.language_model.get_input_embeddings()

    def set_input_embeddings(self, value):
        self.language_model.set_input_embeddings(value)

    def get_output_embeddings(self):
        return self.language_model.get_output_embeddings()

    def set_output_embeddings(self, new_embeddings):
        self.language_model.set_output_embeddings(new_embeddings)

    def set_decoder(self, decoder):
        self.language_model.set_decoder(decoder)

    def get_decoder(self):
        return self.language_model.get_decoder()

    def tie_weights(self):
        return self.language_model.tie_weights()

    def resize_token_embeddings(self, new_num_tokens: Optional[int] = None, pad_to_multiple_of=None) -> nn.Embedding:
        model_embeds = self.language_model.resize_token_embeddings(new_num_tokens, pad_to_multiple_of)
        # update vocab size
        self.config.text_config.vocab_size = model_embeds.num_embeddings
        self.vocab_size = model_embeds.num_embeddings
        return model_embeds

    def _merge_input_ids_with_image_features(self, image_features, inputs_embeds, input_ids, attention_mask, labels):
        num_images, num_image_patches, embed_dim = image_features.shape
        batch_size, sequence_length = input_ids.shape
        left_padding = not ops.sum(input_ids[:, -1] == mindspore.tensor(self.pad_token_id))
        # 1. Create a mask to know where special image tokens are
        special_image_token_mask = input_ids == self.config.image_token_index
        num_special_image_tokens = ops.sum(special_image_token_mask, dim=-1)
        # Compute the maximum embed dimension
        with no_grad():
            max_embed_dim = (num_special_image_tokens.max().item() * (num_image_patches - 1)) + sequence_length
        batch_indices, non_image_indices = ops.nonzero(input_ids != self.config.image_token_index, as_tuple=True)

        # 2. Compute the positions where text should be written
        # Calculate new positions for text tokens in merged image-text sequence.
        # `special_image_token_mask` identifies image tokens. Each image token will be replaced by `nb_text_tokens_per_images - 1` text tokens.
        # `ops.cumsum` computes how each image token shifts subsequent text token positions.
        # - 1 to adjust for zero-based indexing, as `cumsum` inherently increases indices by one.
        new_token_positions = ops.cumsum((special_image_token_mask * (num_image_patches - 1) + 1), -1) - 1
        nb_image_pad = max_embed_dim - 1 - new_token_positions[:, -1]
        if left_padding:
            new_token_positions += nb_image_pad[:, None]  # offset for left padding
        text_to_overwrite = new_token_positions[batch_indices, non_image_indices]

        # 3. Create the full embedding, already padded to the maximum position
        final_embedding = ops.zeros(
            batch_size, max_embed_dim, embed_dim, dtype=inputs_embeds.dtype
        )
        final_attention_mask = ops.zeros(
            batch_size, max_embed_dim, dtype=attention_mask.dtype
        )
        if labels is not None:
            final_labels = ops.full(
                (batch_size, max_embed_dim), self.config.ignore_index, dtype=input_ids.dtype
            )

        # 4. Fill the embeddings based on the mask. If we have ["hey" "<image>", "how", "are"]
        # we need to index copy on [0, 577, 578, 579] for the text and [1:576] for the image features
        final_embedding[batch_indices, text_to_overwrite] = inputs_embeds[batch_indices, non_image_indices]
        final_attention_mask[batch_indices, text_to_overwrite] = attention_mask[batch_indices, non_image_indices]
        if labels is not None:
            final_labels[batch_indices, text_to_overwrite] = labels[batch_indices, non_image_indices]

        # 5. Fill the embeddings corresponding to the images. Anything that is not `text_positions` needs filling (#29835)
        image_to_overwrite = ops.full(
            (batch_size, max_embed_dim), True, dtype=mindspore.bool_
        )
        image_to_overwrite[batch_indices, text_to_overwrite] = False
        image_to_overwrite = image_to_overwrite.int()
        image_to_overwrite &= image_to_overwrite.cumsum(-1) - 1 >= nb_image_pad[:, None]

        # if image_to_overwrite.sum() != image_features.shape[:-1].numel():
        #     raise ValueError(
        #         f"The input provided to the model are wrong. The number of image tokens is {ops.sum(special_image_token_mask)} while"
        #         f" the number of image given to the model is {num_images}. This prevents correct indexing and breaks batch generation."
        #     )

        final_embedding[image_to_overwrite.bool()] = image_features.reshape(-1, embed_dim)
        final_attention_mask |= image_to_overwrite
        position_ids = (final_attention_mask.cumsum(-1) - 1).masked_fill((final_attention_mask == 0), 1.)

        # 6. Mask out the embedding at padding positions, as we later use the past_key_value value to determine the non-attended tokens.
        batch_indices, pad_indices = ops.nonzero(input_ids == self.pad_token_id, as_tuple=True)
        indices_to_mask = new_token_positions[batch_indices, pad_indices]

        if 0 not in batch_indices.shape:
            final_embedding[batch_indices, indices_to_mask] = 0

        if labels is None:
            final_labels = None

        return final_embedding, final_attention_mask, final_labels, position_ids

    # Ignore copy
    def forward(
        self,
        input_ids: mindspore.Tensor = None,
        pixel_values: mindspore.Tensor = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        past_key_values: Optional[List[mindspore.Tensor]] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        vision_feature_layers: Optional[List[int]] = None,
        labels: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        cache_position: Optional[mindspore.Tensor] = None,
    ) -> Union[Tuple, VipLlavaCausalLMOutputWithPast]:
        r"""
        Args:
            labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
                config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

        Returns:

        Example:

        ```python
        >>> import torch
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, VipLlavaForConditionalGeneration

        >>> model = VipLlavaForConditionalGeneration.from_pretrained("llava-hf/vip-llava-7b-hf", device_map="auto", torch_dtype=mindspore.float16)
        >>> processor = AutoProcessor.from_pretrained("llava-hf/vip-llava-7b-hf")

        >>> prompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\n{}###Assistant:"
        >>> question = "Can you please describe this image?"
        >>> prompt = prompt.format(question)
        >>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/compel-neg.png"
        >>> image = Image.open(requests.get(url, stream=True).raw)

        >>> inputs = processor(text=text, images=image, return_tensors="ms").to(0, mindspore.float16)

        >>> # Generate
        >>> generate_ids = model.generate(**inputs, max_new_tokens=20)
        >>> processor.decode(generate_ids[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
        The image features a brown and white cat sitting on a green surface, with a red ball in its
        ```"""

        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        vision_feature_layers = (
            vision_feature_layers if vision_feature_layers is not None else self.config.vision_feature_layers
        )

        if (input_ids is None) ^ (inputs_embeds is not None):
            raise ValueError(
                "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
            )

        if pixel_values is not None and inputs_embeds is not None:
            raise ValueError(
                "You cannot specify both pixel_values and inputs_embeds at the same time, and must specify either one"
            )

        legacy_processing = False
        if inputs_embeds is None:
            inputs_embeds = self.get_input_embeddings()(input_ids)

            # if the number of image tokens is more than image embeddings seq length, then prob we expanded it in processing
            # not very reliable, but we don't expect one to actually pass 500+ images for one prompt
            # In case we're in decoding stage, legacy behavior is checked by presence of pixel values even if use_cache=True
            with no_grad():
                legacy_processing = (
                    ops.max((input_ids == self.config.image_token_index).sum(1)).item() < self.config.image_seq_length
                ) or (input_ids.shape[-1] == 1 and pixel_values is not None)

        if pixel_values is not None:
            image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)

            # For VIP-llava, the image features are computed this way
            # We select the features from index 1: for the layers -2, -5, -8, -11 and 6
            image_features = [image_outputs.hidden_states[index][:, 1:] for index in vision_feature_layers]
            image_features = ops.cat(image_features, dim=-1)
            image_features = self.multi_modal_projector(image_features)

            if legacy_processing:
                logger.warning_once(
                    "Expanding inputs for image tokens in VipLLaVa should be done in processing. "
                    "Please add `patch_size` and `vision_feature_select_strategy` to the model's image processing config. "
                    "Using processors without these attributes in the config is deprecated and will throw an error in v4.47."
                )
                # prefill stage vs decoding stage (legacy behavior copied)
                if input_ids.shape[1] != 1:
                    inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
                        image_features, inputs_embeds, input_ids, attention_mask, labels
                    )
                else:
                    # Retrieve the first layer to inspect the logits and mask out the hidden states
                    # that are set to 0
                    first_layer_past_key_value = past_key_values[0][0][:, :, :, 0]

                    # Sum all dimensions of head_dim (-1) to avoid random errors such as: https://github.com/huggingface/transformers/pull/28032#issuecomment-1863691941
                    batch_index, non_attended_tokens = ops.nonzero(first_layer_past_key_value.float().sum(-2) == 0, as_tuple=True)

                    target_length = input_ids.shape[1]
                    past_length = first_layer_past_key_value.shape[-1]

                    extended_attention_mask = ops.ones(
                        (attention_mask.shape[0], past_length),
                        dtype=attention_mask.dtype,
                    )

                    # Filter out only the tokens that can be un-attended, this can happen
                    # in the case one uses Llava + Fused modules where the cache on the
                    # first iteration is already big enough, or if one passes custom cache
                    valid_indices = non_attended_tokens < extended_attention_mask.shape[-1]
                    new_batch_index = batch_index[valid_indices]
                    new_non_attended_tokens = non_attended_tokens[valid_indices]

                    # Zero-out the places where we don't need to attend
                    extended_attention_mask[new_batch_index, new_non_attended_tokens] = 0

                    attention_mask = ops.cat((extended_attention_mask, attention_mask[:, -target_length:]), dim=1)
                    position_ids = ops.sum(attention_mask, dim=1).unsqueeze(-1) - 1

            # TODO: @raushan retain only the new behavior after v4.47
            else:
                special_image_mask = (
                    (input_ids == self.config.image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
                )
                image_features = image_features.to(inputs_embeds.dtype)
                inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)

        outputs = self.language_model(
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
            cache_position=cache_position,
        )

        logits = outputs[0]

        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n
            if attention_mask is not None:
                shift_attention_mask = attention_mask[..., 1:]
                shift_logits = logits[..., :-1, :][shift_attention_mask != 0]
                shift_labels = labels[..., 1:][shift_attention_mask != 0]
            else:
                shift_logits = logits[..., :-1, :]
                shift_labels = labels[..., 1:]
            # Flatten the tokens
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(
                shift_logits.view(-1, shift_logits.shape[-1]), shift_labels.view(-1)
            )

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return VipLlavaCausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

    def prepare_inputs_for_generation(
        self,
        input_ids,
        past_key_values=None,
        inputs_embeds=None,
        pixel_values=None,
        attention_mask=None,
        cache_position=None,
        **kwargs,
    ):
        # Trigger the new behavior if we have more than image embeddings seq length tokens for images
        legacy_processing = (
            input_ids is not None
            and (input_ids == self.config.image_token_index).sum(1).max() < self.config.image_seq_length
        )

        model_inputs = self.language_model.prepare_inputs_for_generation(
            input_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask,
            cache_position=cache_position,
            **kwargs,
        )

        if legacy_processing:
            model_inputs["pixel_values"] = pixel_values
        elif cache_position[0] == 0:
            # If we're in cached decoding stage, pixel values should be None because input ids do not contain special image token anymore
            # Otherwise we need pixel values to be passed to model
            model_inputs["pixel_values"] = pixel_values

        return model_inputs

`mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaForConditionalGeneration.forward(input_ids=None, pixel_values=None, attention_mask=None, position_ids=None, past_key_values=None, inputs_embeds=None, vision_feature_layers=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None, cache_position=None)` ¶

PARAMETER	DESCRIPTION
`labels`	Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`. TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length)`, optional DEFAULT: `None`

Example:

>>> import torch
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, VipLlavaForConditionalGeneration

>>> model = VipLlavaForConditionalGeneration.from_pretrained("llava-hf/vip-llava-7b-hf", device_map="auto", torch_dtype=mindspore.float16)
>>> processor = AutoProcessor.from_pretrained("llava-hf/vip-llava-7b-hf")

>>> prompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\n{}###Assistant:"
>>> question = "Can you please describe this image?"
>>> prompt = prompt.format(question)
>>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/compel-neg.png"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(text=text, images=image, return_tensors="ms").to(0, mindspore.float16)

>>> # Generate
>>> generate_ids = model.generate(**inputs, max_new_tokens=20)
>>> processor.decode(generate_ids[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
The image features a brown and white cat sitting on a green surface, with a red ball in its

Source code in mindnlp\transformers\models\vipllava\modeling_vipllava.py

def forward(
    self,
    input_ids: mindspore.Tensor = None,
    pixel_values: mindspore.Tensor = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    past_key_values: Optional[List[mindspore.Tensor]] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    vision_feature_layers: Optional[List[int]] = None,
    labels: Optional[mindspore.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
    cache_position: Optional[mindspore.Tensor] = None,
) -> Union[Tuple, VipLlavaCausalLMOutputWithPast]:
    r"""
    Args:
        labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
            config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
            (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

    Returns:

    Example:

    ```python
    >>> import torch
    >>> from PIL import Image
    >>> import requests
    >>> from transformers import AutoProcessor, VipLlavaForConditionalGeneration

    >>> model = VipLlavaForConditionalGeneration.from_pretrained("llava-hf/vip-llava-7b-hf", device_map="auto", torch_dtype=mindspore.float16)
    >>> processor = AutoProcessor.from_pretrained("llava-hf/vip-llava-7b-hf")

    >>> prompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\n{}###Assistant:"
    >>> question = "Can you please describe this image?"
    >>> prompt = prompt.format(question)
    >>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/compel-neg.png"
    >>> image = Image.open(requests.get(url, stream=True).raw)

    >>> inputs = processor(text=text, images=image, return_tensors="ms").to(0, mindspore.float16)

    >>> # Generate
    >>> generate_ids = model.generate(**inputs, max_new_tokens=20)
    >>> processor.decode(generate_ids[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
    The image features a brown and white cat sitting on a green surface, with a red ball in its
    ```"""

    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    vision_feature_layers = (
        vision_feature_layers if vision_feature_layers is not None else self.config.vision_feature_layers
    )

    if (input_ids is None) ^ (inputs_embeds is not None):
        raise ValueError(
            "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
        )

    if pixel_values is not None and inputs_embeds is not None:
        raise ValueError(
            "You cannot specify both pixel_values and inputs_embeds at the same time, and must specify either one"
        )

    legacy_processing = False
    if inputs_embeds is None:
        inputs_embeds = self.get_input_embeddings()(input_ids)

        # if the number of image tokens is more than image embeddings seq length, then prob we expanded it in processing
        # not very reliable, but we don't expect one to actually pass 500+ images for one prompt
        # In case we're in decoding stage, legacy behavior is checked by presence of pixel values even if use_cache=True
        with no_grad():
            legacy_processing = (
                ops.max((input_ids == self.config.image_token_index).sum(1)).item() < self.config.image_seq_length
            ) or (input_ids.shape[-1] == 1 and pixel_values is not None)

    if pixel_values is not None:
        image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)

        # For VIP-llava, the image features are computed this way
        # We select the features from index 1: for the layers -2, -5, -8, -11 and 6
        image_features = [image_outputs.hidden_states[index][:, 1:] for index in vision_feature_layers]
        image_features = ops.cat(image_features, dim=-1)
        image_features = self.multi_modal_projector(image_features)

        if legacy_processing:
            logger.warning_once(
                "Expanding inputs for image tokens in VipLLaVa should be done in processing. "
                "Please add `patch_size` and `vision_feature_select_strategy` to the model's image processing config. "
                "Using processors without these attributes in the config is deprecated and will throw an error in v4.47."
            )
            # prefill stage vs decoding stage (legacy behavior copied)
            if input_ids.shape[1] != 1:
                inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
                    image_features, inputs_embeds, input_ids, attention_mask, labels
                )
            else:
                # Retrieve the first layer to inspect the logits and mask out the hidden states
                # that are set to 0
                first_layer_past_key_value = past_key_values[0][0][:, :, :, 0]

                # Sum all dimensions of head_dim (-1) to avoid random errors such as: https://github.com/huggingface/transformers/pull/28032#issuecomment-1863691941
                batch_index, non_attended_tokens = ops.nonzero(first_layer_past_key_value.float().sum(-2) == 0, as_tuple=True)

                target_length = input_ids.shape[1]
                past_length = first_layer_past_key_value.shape[-1]

                extended_attention_mask = ops.ones(
                    (attention_mask.shape[0], past_length),
                    dtype=attention_mask.dtype,
                )

                # Filter out only the tokens that can be un-attended, this can happen
                # in the case one uses Llava + Fused modules where the cache on the
                # first iteration is already big enough, or if one passes custom cache
                valid_indices = non_attended_tokens < extended_attention_mask.shape[-1]
                new_batch_index = batch_index[valid_indices]
                new_non_attended_tokens = non_attended_tokens[valid_indices]

                # Zero-out the places where we don't need to attend
                extended_attention_mask[new_batch_index, new_non_attended_tokens] = 0

                attention_mask = ops.cat((extended_attention_mask, attention_mask[:, -target_length:]), dim=1)
                position_ids = ops.sum(attention_mask, dim=1).unsqueeze(-1) - 1

        # TODO: @raushan retain only the new behavior after v4.47
        else:
            special_image_mask = (
                (input_ids == self.config.image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
            )
            image_features = image_features.to(inputs_embeds.dtype)
            inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)

    outputs = self.language_model(
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
        cache_position=cache_position,
    )

    logits = outputs[0]

    loss = None
    if labels is not None:
        # Shift so that tokens < n predict n
        if attention_mask is not None:
            shift_attention_mask = attention_mask[..., 1:]
            shift_logits = logits[..., :-1, :][shift_attention_mask != 0]
            shift_labels = labels[..., 1:][shift_attention_mask != 0]
        else:
            shift_logits = logits[..., :-1, :]
            shift_labels = labels[..., 1:]
        # Flatten the tokens
        loss_fct = nn.CrossEntropyLoss()
        loss = loss_fct(
            shift_logits.view(-1, shift_logits.shape[-1]), shift_labels.view(-1)
        )

    if not return_dict:
        output = (logits,) + outputs[1:]
        return (loss,) + output if loss is not None else output

    return VipLlavaCausalLMOutputWithPast(
        loss=loss,
        logits=logits,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )

vipllava

mindnlp.transformers.models.vipllava.configuration_vipllava ¶

mindnlp.transformers.models.vipllava.configuration_vipllava.VipLlavaConfig ¶

mindnlp.transformers.models.vipllava.modeling_vipllava ¶

mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaCausalLMOutputWithPast dataclass ¶

mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaForConditionalGeneration ¶

`mindnlp.transformers.models.vipllava.configuration_vipllava` ¶

`mindnlp.transformers.models.vipllava.configuration_vipllava.VipLlavaConfig` ¶

`mindnlp.transformers.models.vipllava.modeling_vipllava` ¶

`mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaCausalLMOutputWithPast` `dataclass` ¶

`mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaForConditionalGeneration` ¶