
vision_text_dual_encoder

mindnlp.transformers.models.vision_text_dual_encoder.configuration_vision_text_dual_encoder

VisionTextDualEncoder model configuration

mindnlp.transformers.models.vision_text_dual_encoder.configuration_vision_text_dual_encoder.VisionTextDualEncoderConfig

Bases: PretrainedConfig

[VisionTextDualEncoderConfig] is the configuration class to store the configuration of a [VisionTextDualEncoderModel]. It is used to instantiate a [VisionTextDualEncoderModel] according to the specified arguments, defining the text model and vision model configs.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
projection_dim

Dimensionality of the text and vision projection layers.

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

logit_scale_init_value

The initial value of the logit_scale parameter. The default follows the original CLIP implementation.

TYPE: `float`, *optional*, defaults to 2.6592 DEFAULT: 2.6592

kwargs

Dictionary of keyword arguments.

TYPE: *optional* DEFAULT: {}
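The default of 2.6592 is not arbitrary: it matches ln(1/0.07) truncated to four decimal places, i.e. the log of the inverse of the initial softmax temperature (0.07) used by CLIP. A quick sanity check in plain Python:

```python
import math

# CLIP initializes the logit scale to log(1 / temperature) with temperature = 0.07;
# the documented default 2.6592 is this value truncated to four decimal places.
logit_scale_init = math.log(1 / 0.07)
print(f"{logit_scale_init:.6f}")  # 2.659260
```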

Example
>>> from transformers import ViTConfig, BertConfig, VisionTextDualEncoderConfig, VisionTextDualEncoderModel
...
>>> # Initializing a BERT and ViT configuration
>>> config_vision = ViTConfig()
>>> config_text = BertConfig()
...
>>> config = VisionTextDualEncoderConfig.from_vision_text_configs(config_vision, config_text, projection_dim=512)
...
>>> # Initializing a BERT and ViT model (with random weights)
>>> model = VisionTextDualEncoderModel(config=config)
...
>>> # Accessing the model configuration
>>> config_vision = model.config.vision_config
>>> config_text = model.config.text_config
...
>>> # Saving the model, including its configuration
>>> model.save_pretrained("vit-bert")
...
>>> # loading model and config from pretrained folder
>>> vision_text_config = VisionTextDualEncoderConfig.from_pretrained("vit-bert")
>>> model = VisionTextDualEncoderModel.from_pretrained("vit-bert", config=vision_text_config)
Source code in mindnlp\transformers\models\vision_text_dual_encoder\configuration_vision_text_dual_encoder.py
class VisionTextDualEncoderConfig(PretrainedConfig):
    r"""
    [`VisionTextDualEncoderConfig`] is the configuration class to store the configuration of a
    [`VisionTextDualEncoderModel`]. It is used to instantiate a [`VisionTextDualEncoderModel`] according to the
    specified arguments, defining the text model and vision model configs.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        projection_dim (`int`, *optional*, defaults to 512):
            Dimensionality of the text and vision projection layers.
        logit_scale_init_value (`float`, *optional*, defaults to 2.6592):
            The initial value of the *logit_scale* parameter. The default follows the original CLIP implementation.
        kwargs (*optional*):
            Dictionary of keyword arguments.

    Example:
        ```python
        >>> from transformers import ViTConfig, BertConfig, VisionTextDualEncoderConfig, VisionTextDualEncoderModel
        ...
        >>> # Initializing a BERT and ViT configuration
        >>> config_vision = ViTConfig()
        >>> config_text = BertConfig()
        ...
        >>> config = VisionTextDualEncoderConfig.from_vision_text_configs(config_vision, config_text, projection_dim=512)
        ...
        >>> # Initializing a BERT and ViT model (with random weights)
        >>> model = VisionTextDualEncoderModel(config=config)
        ...
        >>> # Accessing the model configuration
        >>> config_vision = model.config.vision_config
        >>> config_text = model.config.text_config
        ...
        >>> # Saving the model, including its configuration
        >>> model.save_pretrained("vit-bert")
        ...
        >>> # loading model and config from pretrained folder
        >>> vision_text_config = VisionTextDualEncoderConfig.from_pretrained("vit-bert")
        >>> model = VisionTextDualEncoderModel.from_pretrained("vit-bert", config=vision_text_config)
        ```
    """

    model_type = "vision-text-dual-encoder"
    is_composition = True

    def __init__(self, projection_dim=512, logit_scale_init_value=2.6592, **kwargs):
        super().__init__(**kwargs)

        if "vision_config" not in kwargs:
            raise ValueError("`vision_config` can not be `None`.")

        if "text_config" not in kwargs:
            raise ValueError("`text_config` can not be `None`.")

        vision_config = kwargs.pop("vision_config")
        text_config = kwargs.pop("text_config")

        vision_model_type = vision_config.pop("model_type")
        text_model_type = text_config.pop("model_type")

        vision_config_class = VISION_MODEL_CONFIGS.get(vision_model_type)
        if vision_config_class is not None:
            self.vision_config = vision_config_class(**vision_config)
        else:
            self.vision_config = AutoConfig.for_model(vision_model_type, **vision_config)
            if hasattr(self.vision_config, "vision_config"):
                self.vision_config = self.vision_config.vision_config

        self.text_config = AutoConfig.for_model(text_model_type, **text_config)

        self.projection_dim = projection_dim
        self.logit_scale_init_value = logit_scale_init_value

    @classmethod
    def from_vision_text_configs(cls, vision_config: PretrainedConfig, text_config: PretrainedConfig, **kwargs):
        r"""
        Instantiate a [`VisionTextDualEncoderConfig`] (or a derived class) from text model configuration and vision
        model configuration.

        Returns:
            [`VisionTextDualEncoderConfig`]: An instance of a configuration object
        """

        return cls(vision_config=vision_config.to_dict(), text_config=text_config.to_dict(), **kwargs)

mindnlp.transformers.models.vision_text_dual_encoder.configuration_vision_text_dual_encoder.VisionTextDualEncoderConfig.from_vision_text_configs(vision_config, text_config, **kwargs) classmethod

Instantiate a [VisionTextDualEncoderConfig] (or a derived class) from text model configuration and vision model configuration.

RETURNS DESCRIPTION

[VisionTextDualEncoderConfig]: An instance of a configuration object
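The mechanics of this classmethod are simple: each sub-config is serialized with `to_dict()` and forwarded as a keyword argument to the constructor. A minimal sketch of the pattern, using hypothetical `ToyConfig`/`ToyDualConfig` classes rather than the real ones:

```python
# Hypothetical stand-ins for PretrainedConfig subclasses, illustrating only
# the composition pattern behind from_vision_text_configs.
class ToyConfig:
    def __init__(self, model_type, hidden_size=768):
        self.model_type = model_type
        self.hidden_size = hidden_size

    def to_dict(self):
        return {"model_type": self.model_type, "hidden_size": self.hidden_size}


class ToyDualConfig:
    def __init__(self, vision_config, text_config, projection_dim=512):
        self.vision_config = vision_config
        self.text_config = text_config
        self.projection_dim = projection_dim

    @classmethod
    def from_vision_text_configs(cls, vision_config, text_config, **kwargs):
        # Serialize both sub-configs and forward everything to __init__,
        # mirroring the real classmethod.
        return cls(vision_config=vision_config.to_dict(),
                   text_config=text_config.to_dict(), **kwargs)


dual = ToyDualConfig.from_vision_text_configs(ToyConfig("vit"), ToyConfig("bert"))
print(dual.vision_config["model_type"])  # vit
```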

Source code in mindnlp\transformers\models\vision_text_dual_encoder\configuration_vision_text_dual_encoder.py
@classmethod
def from_vision_text_configs(cls, vision_config: PretrainedConfig, text_config: PretrainedConfig, **kwargs):
    r"""
    Instantiate a [`VisionTextDualEncoderConfig`] (or a derived class) from text model configuration and vision
    model configuration.

    Returns:
        [`VisionTextDualEncoderConfig`]: An instance of a configuration object
    """

    return cls(vision_config=vision_config.to_dict(), text_config=text_config.to_dict(), **kwargs)

mindnlp.transformers.models.vision_text_dual_encoder.modeling_vision_text_dual_encoder

MindSpore VisionTextDualEncoder model.

mindnlp.transformers.models.vision_text_dual_encoder.modeling_vision_text_dual_encoder.VisionTextDualEncoderModel

Bases: PreTrainedModel

Source code in mindnlp\transformers\models\vision_text_dual_encoder\modeling_vision_text_dual_encoder.py
class VisionTextDualEncoderModel(PreTrainedModel):
    config_class = VisionTextDualEncoderConfig
    base_model_prefix = "vision_text_dual_encoder"

    def __init__(
        self,
        config: Optional[VisionTextDualEncoderConfig] = None,
        vision_model: Optional[PreTrainedModel] = None,
        text_model: Optional[PreTrainedModel] = None,
    ):
        if config is None and (vision_model is None or text_model is None):
            raise ValueError("Either a configuration or a vision and a text model has to be provided")

        if config is None:
            config = VisionTextDualEncoderConfig.from_vision_text_configs(vision_model.config, text_model.config)
        else:
            if not isinstance(config, self.config_class):
                raise ValueError(f"config: {config} has to be of type {self.config_class}")

        # initialize with config
        super().__init__(config)

        if vision_model is None:
            if isinstance(config.vision_config, CLIPVisionConfig):
                vision_model = CLIPVisionModel(config.vision_config)
            else:
                vision_model = AutoModel.from_config(
                    config.vision_config
                )

        if text_model is None:
            text_model = AutoModel.from_config(config.text_config)

        self.vision_model = vision_model
        self.text_model = text_model

        # make sure that the individual model's config refers to the shared config
        # so that the updates to the config will be synced
        self.vision_model.config = self.config.vision_config
        self.text_model.config = self.config.text_config

        self.vision_embed_dim = config.vision_config.hidden_size
        self.text_embed_dim = config.text_config.hidden_size
        self.projection_dim = config.projection_dim

        self.visual_projection = nn.Linear(self.vision_embed_dim, self.projection_dim, bias=False)
        self.text_projection = nn.Linear(self.text_embed_dim, self.projection_dim, bias=False)
        self.logit_scale = Parameter(ms.tensor(self.config.logit_scale_init_value))

    def get_text_features(
        self,
        input_ids=None,
        attention_mask=None,
        position_ids=None,
        token_type_ids=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        r"""
        Returns:
            text_features (`ms.Tensor` of shape `(batch_size, output_dim)`): The text embeddings obtained by
                applying the projection layer to the pooled output of [`CLIPTextModel`].

        Example:
            ```python
            >>> from transformers import VisionTextDualEncoderModel, AutoTokenizer
            ...
            >>> model = VisionTextDualEncoderModel.from_pretrained("clip-italian/clip-italian")
            >>> tokenizer = AutoTokenizer.from_pretrained("clip-italian/clip-italian")
            ...
            >>> inputs = tokenizer(["una foto di un gatto", "una foto di un cane"], padding=True, return_tensors="ms")
            >>> text_features = model.get_text_features(**inputs)
            ```
        """
        text_outputs = self.text_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            token_type_ids=token_type_ids,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        pooled_output = text_outputs[1]
        text_features = self.text_projection(pooled_output)

        return text_features

    def get_image_features(
        self,
        pixel_values=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        r"""

        Returns:
            image_features (`ms.Tensor` of shape `(batch_size, output_dim)`): The image embeddings obtained by
                applying the projection layer to the pooled output of [`CLIPVisionModel`].

        Example:
            ```python
            >>> from PIL import Image
            >>> import requests
            >>> from transformers import VisionTextDualEncoderModel, AutoImageProcessor
            ...
            >>> model = VisionTextDualEncoderModel.from_pretrained("clip-italian/clip-italian")
            >>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
            ...
            >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            ...
            >>> inputs = image_processor(images=image, return_tensors="ms")
            ...
            >>> image_features = model.get_image_features(**inputs)
            ```
        """
        vision_outputs = self.vision_model(
            pixel_values=pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        pooled_output = vision_outputs[1]  # pooled_output
        image_features = self.visual_projection(pooled_output)

        return image_features

    def forward(
        self,
        input_ids: Optional[ms.Tensor] = None,
        pixel_values: Optional[ms.Tensor] = None,
        attention_mask: Optional[ms.Tensor] = None,
        position_ids: Optional[ms.Tensor] = None,
        return_loss: Optional[bool] = None,
        token_type_ids: Optional[ms.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[ms.Tensor], CLIPOutput]:
        r"""
        Returns:
            Union[Tuple[ms.Tensor], CLIPOutput]

        Example:
            ```python
            >>> from PIL import Image
            >>> import requests
            >>> from transformers import (
            ...     VisionTextDualEncoderModel,
            ...     VisionTextDualEncoderProcessor,
            ...     AutoImageProcessor,
            ...     AutoTokenizer,
            ... )
            ...
            >>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
            >>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
            >>> processor = VisionTextDualEncoderProcessor(image_processor, tokenizer)
            >>> model = VisionTextDualEncoderModel.from_vision_text_pretrained(
            ...     "google/vit-base-patch16-224", "google-bert/bert-base-uncased"
            ... )
            ...
            >>> # contrastive training
            >>> urls = [
            ...     "http://images.cocodataset.org/val2017/000000039769.jpg",
            ...     "https://farm3.staticflickr.com/2674/5850229113_4fe05d5265_z.jpg",
            ... ]
            >>> images = [Image.open(requests.get(url, stream=True).raw) for url in urls]
            >>> inputs = processor(
            ...     text=["a photo of a cat", "a photo of a dog"], images=images, return_tensors="ms", padding=True
            ... )
            >>> outputs = model(
            ...     input_ids=inputs.input_ids,
            ...     attention_mask=inputs.attention_mask,
            ...     pixel_values=inputs.pixel_values,
            ...     return_loss=True,
            ... )
            >>> loss, logits_per_image = outputs.loss, outputs.logits_per_image  # this is the image-text similarity score
            ...
            >>> # save and load from pretrained
            >>> model.save_pretrained("vit-bert")
            >>> model = VisionTextDualEncoderModel.from_pretrained("vit-bert")
            ...
            >>> # inference
            >>> outputs = model(**inputs)
            >>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
            >>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
            ```
        """
        return_dict = return_dict if return_dict is not None else self.config.return_dict

        vision_outputs = self.vision_model(
            pixel_values=pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        text_outputs = self.text_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        image_embeds = vision_outputs[1]  # pooler_output
        image_embeds = self.visual_projection(image_embeds)

        text_embeds = text_outputs[1]  # pooler_output
        text_embeds = self.text_projection(text_embeds)

        # normalized features
        image_embeds = image_embeds / ops.norm(image_embeds, p=2, dim=-1, keepdim=True)
        text_embeds = text_embeds / ops.norm(text_embeds, p=2, dim=-1, keepdim=True)

        # cosine similarity as logits
        logit_scale = self.logit_scale.exp()
        logits_per_text = ops.matmul(text_embeds, image_embeds.t()) * logit_scale
        logits_per_image = logits_per_text.T

        loss = None
        if return_loss:
            loss = clip_loss(logits_per_text)

        if not return_dict:
            output = (logits_per_image, logits_per_text, text_embeds, image_embeds, text_outputs, vision_outputs)
            return ((loss,) + output) if loss is not None else output

        return CLIPOutput(
            loss=loss,
            logits_per_image=logits_per_image,
            logits_per_text=logits_per_text,
            text_embeds=text_embeds,
            image_embeds=image_embeds,
            text_model_output=text_outputs,
            vision_model_output=vision_outputs,
        )

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        # At the moment fast initialization is not supported
        # for composite models
        kwargs["_fast_init"] = False
        return super().from_pretrained(*args, **kwargs)

    @classmethod
    def from_vision_text_pretrained(
        cls,
        *model_args,
        vision_model_name_or_path: str = None,
        text_model_name_or_path: str = None,
        **kwargs,
    ) -> PreTrainedModel:
        """
        Params:
            vision_model_name_or_path (`str`, *optional*, defaults to `None`):
                Information necessary to initiate the vision model. Can be either:

                - A string, the *model id* of a pretrained model hosted inside a model repo on huggingface.co.
                - A path to a *directory* containing model weights saved using
                  [`~PreTrainedModel.save_pretrained`], e.g., `./my_model_directory/`.
                - A path or url to a *PyTorch checkpoint folder* (e.g., `./pt_model`). In this case, `from_pt`
                  should be set to `True` and a configuration object should be provided as the `config` argument. This
                  loading path is slower than converting the PyTorch checkpoint into a Flax model using the provided
                  conversion scripts and loading the Flax model afterwards.

            text_model_name_or_path (`str`, *optional*):
                Information necessary to initiate the text model. Can be either:

                - A string, the *model id* of a pretrained model hosted inside a model repo on huggingface.co.
                - A path to a *directory* containing model weights saved using
                  [`~PreTrainedModel.save_pretrained`], e.g., `./my_model_directory/`.
                - A path or url to a *PyTorch checkpoint folder* (e.g., `./pt_model`). In this case, `from_pt`
                  should be set to `True` and a configuration object should be provided as the `config` argument. This
                  loading path is slower than converting the PyTorch checkpoint into a Flax model using the provided
                  conversion scripts and loading the Flax model afterwards.

            model_args (remaining positional arguments, *optional*):
                All remaining positional arguments will be passed to the underlying model's `__init__` method.

            kwargs (remaining dictionary of keyword arguments, *optional*):
                Can be used to update the configuration object (after it being loaded) and initiate the model (e.g.,
                `output_attentions=True`).

                - To update the text configuration, use the prefix *text_* for each configuration parameter.
                - To update the vision configuration, use the prefix *vision_* for each configuration parameter.
                - To update the parent model configuration, do not use a prefix for each configuration parameter.

                Behaves differently depending on whether a `config` is provided or automatically loaded.

        Example:
            ```python
            >>> from transformers import VisionTextDualEncoderModel
            ...
            >>> # initialize a model from pretrained ViT and BERT models. Note that the projection layers will be randomly initialized.
            >>> model = VisionTextDualEncoderModel.from_vision_text_pretrained(
            ...     "google/vit-base-patch16-224", "google-bert/bert-base-uncased"
            ... )
            >>> # saving model after fine-tuning
            >>> model.save_pretrained("./vit-bert")
            >>> # load fine-tuned model
            >>> model = VisionTextDualEncoderModel.from_pretrained("./vit-bert")
            ```
        """
        kwargs_vision = {
            argument[len("vision_") :]: value for argument, value in kwargs.items() if argument.startswith("vision_")
        }

        kwargs_text = {
            argument[len("text_") :]: value for argument, value in kwargs.items() if argument.startswith("text_")
        }

        # remove vision, text kwargs from kwargs
        for key in kwargs_vision.keys():
            del kwargs["vision_" + key]
        for key in kwargs_text.keys():
            del kwargs["text_" + key]

        # Load and initialize the vision and text model
        vision_model = kwargs_vision.pop("model", None)
        if vision_model is None:
            if vision_model_name_or_path is None:
                raise ValueError(
                    "If `vision_model` is not defined as an argument, a `vision_model_name_or_path` has to be defined"
                )

            if "config" not in kwargs_vision:
                vision_config = AutoConfig.from_pretrained(vision_model_name_or_path)

            if vision_config.model_type == "clip":
                kwargs_vision["config"] = vision_config.vision_config
                vision_model = CLIPVisionModel.from_pretrained(vision_model_name_or_path, *model_args, **kwargs_vision)
                # TODO: Should we use the pre-trained projection as well ?
            else:
                kwargs_vision["config"] = vision_config
                vision_model = AutoModel.from_pretrained(vision_model_name_or_path, *model_args, **kwargs_vision)

        text_model = kwargs_text.pop("model", None)
        if text_model is None:
            if text_model_name_or_path is None:
                raise ValueError(
                    "If `text_model` is not defined as an argument, a `text_model_name_or_path` has to be defined"
                )

            if "config" not in kwargs_text:
                text_config = AutoConfig.from_pretrained(text_model_name_or_path)
                kwargs_text["config"] = text_config

            text_model = AutoModel.from_pretrained(text_model_name_or_path, *model_args, **kwargs_text)

        # instantiate config with corresponding kwargs
        config = VisionTextDualEncoderConfig.from_vision_text_configs(vision_model.config, text_model.config, **kwargs)

        # init model
        model = cls(config=config, vision_model=vision_model, text_model=text_model)

        # the projection layers are always newly initialized when loading the model
        # using pre-trained vision and text model.
        logger.warning(
            "The projection layer and logit scale weights `['visual_projection.weight', 'text_projection.weight',"
            " 'logit_scale']` are newly initialized. You should probably TRAIN this model on a down-stream task to be"
            " able to use it for predictions and inference."
        )

        return model
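The `vision_`/`text_` prefix convention that `from_vision_text_pretrained` uses to route keyword arguments to each sub-model can be illustrated standalone (the kwarg values below are made up for illustration):

```python
# Split kwargs by prefix, as from_vision_text_pretrained does:
# `vision_*` goes to the vision model, `text_*` to the text model,
# and anything unprefixed stays for the parent config.
kwargs = {"vision_output_attentions": True, "text_config": "some-config", "projection_dim": 256}

kwargs_vision = {k[len("vision_"):]: v for k, v in kwargs.items() if k.startswith("vision_")}
kwargs_text = {k[len("text_"):]: v for k, v in kwargs.items() if k.startswith("text_")}

# Remove the routed entries from the original dict.
for k in kwargs_vision:
    del kwargs["vision_" + k]
for k in kwargs_text:
    del kwargs["text_" + k]

print(kwargs_vision)  # {'output_attentions': True}
print(kwargs_text)    # {'config': 'some-config'}
print(kwargs)         # {'projection_dim': 256}
```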

mindnlp.transformers.models.vision_text_dual_encoder.modeling_vision_text_dual_encoder.VisionTextDualEncoderModel.forward(input_ids=None, pixel_values=None, attention_mask=None, position_ids=None, return_loss=None, token_type_ids=None, output_attentions=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
Union[Tuple[Tensor], CLIPOutput]

Union[Tuple[ms.Tensor], CLIPOutput]

Example
>>> from PIL import Image
>>> import requests
>>> from transformers import (
...     VisionTextDualEncoderModel,
...     VisionTextDualEncoderProcessor,
...     AutoImageProcessor,
...     AutoTokenizer,
... )
...
>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> processor = VisionTextDualEncoderProcessor(image_processor, tokenizer)
>>> model = VisionTextDualEncoderModel.from_vision_text_pretrained(
...     "google/vit-base-patch16-224", "google-bert/bert-base-uncased"
... )
...
>>> # contrastive training
>>> urls = [
...     "http://images.cocodataset.org/val2017/000000039769.jpg",
...     "https://farm3.staticflickr.com/2674/5850229113_4fe05d5265_z.jpg",
... ]
>>> images = [Image.open(requests.get(url, stream=True).raw) for url in urls]
>>> inputs = processor(
...     text=["a photo of a cat", "a photo of a dog"], images=images, return_tensors="ms", padding=True
... )
>>> outputs = model(
...     input_ids=inputs.input_ids,
...     attention_mask=inputs.attention_mask,
...     pixel_values=inputs.pixel_values,
...     return_loss=True,
... )
>>> loss, logits_per_image = outputs.loss, outputs.logits_per_image  # this is the image-text similarity score
...
>>> # save and load from pretrained
>>> model.save_pretrained("vit-bert")
>>> model = VisionTextDualEncoderModel.from_pretrained("vit-bert")
...
>>> # inference
>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
Source code in mindnlp\transformers\models\vision_text_dual_encoder\modeling_vision_text_dual_encoder.py
def forward(
    self,
    input_ids: Optional[ms.Tensor] = None,
    pixel_values: Optional[ms.Tensor] = None,
    attention_mask: Optional[ms.Tensor] = None,
    position_ids: Optional[ms.Tensor] = None,
    return_loss: Optional[bool] = None,
    token_type_ids: Optional[ms.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple[ms.Tensor], CLIPOutput]:
    r"""
    Returns:
        Union[Tuple[ms.Tensor], CLIPOutput]

    Example:
        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import (
        ...     VisionTextDualEncoderModel,
        ...     VisionTextDualEncoderProcessor,
        ...     AutoImageProcessor,
        ...     AutoTokenizer,
        ... )
        ...
        >>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
        >>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
        >>> processor = VisionTextDualEncoderProcessor(image_processor, tokenizer)
        >>> model = VisionTextDualEncoderModel.from_vision_text_pretrained(
        ...     "google/vit-base-patch16-224", "google-bert/bert-base-uncased"
        ... )
        ...
        >>> # contrastive training
        >>> urls = [
        ...     "http://images.cocodataset.org/val2017/000000039769.jpg",
        ...     "https://farm3.staticflickr.com/2674/5850229113_4fe05d5265_z.jpg",
        ... ]
        >>> images = [Image.open(requests.get(url, stream=True).raw) for url in urls]
        >>> inputs = processor(
        ...     text=["a photo of a cat", "a photo of a dog"], images=images, return_tensors="ms", padding=True
        ... )
        >>> outputs = model(
        ...     input_ids=inputs.input_ids,
        ...     attention_mask=inputs.attention_mask,
        ...     pixel_values=inputs.pixel_values,
        ...     return_loss=True,
        ... )
        >>> loss, logits_per_image = outputs.loss, outputs.logits_per_image  # this is the image-text similarity score
        ...
        >>> # save and load from pretrained
        >>> model.save_pretrained("vit-bert")
        >>> model = VisionTextDualEncoderModel.from_pretrained("vit-bert")
        ...
        >>> # inference
        >>> outputs = model(**inputs)
        >>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
        >>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
        ```
    """
    return_dict = return_dict if return_dict is not None else self.config.return_dict

    vision_outputs = self.vision_model(
        pixel_values=pixel_values,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    text_outputs = self.text_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        position_ids=position_ids,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    image_embeds = vision_outputs[1]  # pooler_output
    image_embeds = self.visual_projection(image_embeds)

    text_embeds = text_outputs[1]  # pooler_output
    text_embeds = self.text_projection(text_embeds)

    # normalized features
    image_embeds = image_embeds / ops.norm(image_embeds, p=2, dim=-1, keepdim=True)
    text_embeds = text_embeds / ops.norm(text_embeds, p=2, dim=-1, keepdim=True)

    # cosine similarity as logits
    logit_scale = self.logit_scale.exp()
    logits_per_text = ops.matmul(text_embeds, image_embeds.t()) * logit_scale
    logits_per_image = logits_per_text.T

    loss = None
    if return_loss:
        loss = clip_loss(logits_per_text)

    if not return_dict:
        output = (logits_per_image, logits_per_text, text_embeds, image_embeds, text_outputs, vision_outputs)
        return ((loss,) + output) if loss is not None else output

    return CLIPOutput(
        loss=loss,
        logits_per_image=logits_per_image,
        logits_per_text=logits_per_text,
        text_embeds=text_embeds,
        image_embeds=image_embeds,
        text_model_output=text_outputs,
        vision_model_output=vision_outputs,
    )
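
The `clip_loss` call above computes CLIP's symmetric contrastive objective: cross-entropy over the similarity matrix in both directions, with the matching text-image pairs on the diagonal as targets. A minimal NumPy sketch of that objective (a simplified stand-in, not the library's MindSpore implementation):

```python
import numpy as np

def clip_loss_sketch(logits_per_text: np.ndarray) -> float:
    """Symmetric cross-entropy over an (n, n) text-image similarity matrix.

    The target for row i is column i: the i-th text should match the i-th image.
    """
    n = logits_per_text.shape[0]

    def cross_entropy(logits):
        # numerically stable log-softmax over the last axis
        shifted = logits - logits.max(axis=-1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
        # pick the diagonal (correct-pair) log-probabilities
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # average the text-to-image and image-to-text directions
    return (cross_entropy(logits_per_text) + cross_entropy(logits_per_text.T)) / 2
```

With perfectly separated logits the loss approaches zero; with uniform logits over a batch of size `n` it equals `ln(n)`.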

mindnlp.transformers.models.vision_text_dual_encoder.modeling_vision_text_dual_encoder.VisionTextDualEncoderModel.from_vision_text_pretrained(*model_args, vision_model_name_or_path=None, text_model_name_or_path=None, **kwargs) classmethod

PARAMETER DESCRIPTION
vision_model_name_or_path

Information necessary to initiate the vision model. Can be either:

  • A string, the model id of a pretrained model hosted inside a model repo on huggingface.co.
  • A path to a directory containing model weights saved using [~PreTrainedModel.save_pretrained], e.g., ./my_model_directory/.
  • A path or URL to a PyTorch checkpoint folder (e.g., ./pt_model). In this case, from_pt should be set to True and a configuration object should be provided as the config argument. This loading path is slower than converting the PyTorch checkpoint into a Flax model with the provided conversion scripts and then loading the Flax model.

TYPE: `str`, *optional*, defaults to `None` DEFAULT: None

text_model_name_or_path

Information necessary to initiate the text model. Can be either:

  • A string, the model id of a pretrained model hosted inside a model repo on huggingface.co.
  • A path to a directory containing model weights saved using [~PreTrainedModel.save_pretrained], e.g., ./my_model_directory/.
  • A path or URL to a PyTorch checkpoint folder (e.g., ./pt_model). In this case, from_pt should be set to True and a configuration object should be provided as the config argument. This loading path is slower than converting the PyTorch checkpoint into a Flax model with the provided conversion scripts and then loading the Flax model.

TYPE: `str`, *optional* DEFAULT: None

model_args

All remaining positional arguments will be passed to the underlying model's __init__ method.

TYPE: remaining positional arguments, *optional* DEFAULT: ()

kwargs

Can be used to update the configuration object (after it has been loaded) and to initiate the model (e.g., output_attentions=True).

  • To update the text configuration, use the prefix text_ for each configuration parameter.
  • To update the vision configuration, use the prefix vision_ for each configuration parameter.
  • To update the parent model configuration, do not use a prefix for each configuration parameter.

Behaves differently depending on whether a config is provided or automatically loaded.

TYPE: remaining dictionary of keyword arguments, *optional* DEFAULT: {}

Example
>>> from transformers import VisionTextDualEncoderModel
...
>>> # initialize a model from pretrained ViT and BERT models. Note that the projection layers will be randomly initialized.
>>> model = VisionTextDualEncoderModel.from_vision_text_pretrained(
...     "google/vit-base-patch16-224", "google-bert/bert-base-uncased"
... )
>>> # saving model after fine-tuning
>>> model.save_pretrained("./vit-bert")
>>> # load fine-tuned model
>>> model = VisionTextDualEncoderModel.from_pretrained("./vit-bert")
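
The `vision_`/`text_` prefix routing described above can be sketched as a plain dictionary split (a simplified stand-in for what the classmethod does internally, not the library code):

```python
def split_prefixed_kwargs(kwargs: dict):
    """Route kwargs to the vision config, the text config, or the parent config."""
    vision = {k[len("vision_"):]: v for k, v in kwargs.items() if k.startswith("vision_")}
    text = {k[len("text_"):]: v for k, v in kwargs.items() if k.startswith("text_")}
    # anything without a prefix updates the parent VisionTextDualEncoderConfig
    parent = {k: v for k, v in kwargs.items() if not k.startswith(("vision_", "text_"))}
    return vision, text, parent
```

For example, `{"vision_hidden_size": 768, "text_num_hidden_layers": 6, "projection_dim": 512}` routes `hidden_size` to the vision config, `num_hidden_layers` to the text config, and `projection_dim` to the parent config.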
Source code in mindnlp\transformers\models\vision_text_dual_encoder\modeling_vision_text_dual_encoder.py
@classmethod
def from_vision_text_pretrained(
    cls,
    *model_args,
    vision_model_name_or_path: str = None,
    text_model_name_or_path: str = None,
    **kwargs,
) -> PreTrainedModel:
    """
    Params:
        vision_model_name_or_path (`str`, *optional*, defaults to `None`):
            Information necessary to initiate the vision model. Can be either:

            - A string, the *model id* of a pretrained model hosted inside a model repo on huggingface.co.
            - A path to a *directory* containing model weights saved using
              [`~PreTrainedModel.save_pretrained`], e.g., `./my_model_directory/`.
            - A path or url to a *PyTorch checkpoint folder* (e.g, `./pt_model`). In this case, `from_pt`
              should be set to `True` and a configuration object should be provided as `config` argument. This
              loading path is slower than converting the PyTorch checkpoint in a Flax model using the provided
              conversion scripts and loading the Flax model afterwards.

        text_model_name_or_path (`str`, *optional*):
            Information necessary to initiate the text model. Can be either:

            - A string, the *model id* of a pretrained model hosted inside a model repo on huggingface.co.
            - A path to a *directory* containing model weights saved using
              [`~PreTrainedModel.save_pretrained`], e.g., `./my_model_directory/`.
            - A path or url to a *PyTorch checkpoint folder* (e.g, `./pt_model`). In this case, `from_pt`
              should be set to `True` and a configuration object should be provided as `config` argument. This
              loading path is slower than converting the PyTorch checkpoint in a Flax model using the provided
              conversion scripts and loading the Flax model afterwards.

        model_args (remaining positional arguments, *optional*):
            All remaining positional arguments will be passed to the underlying model's `__init__` method.

        kwargs (remaining dictionary of keyword arguments, *optional*):
            Can be used to update the configuration object (after it being loaded) and initiate the model (e.g.,
            `output_attentions=True`).

            - To update the text configuration, use the prefix *text_* for each configuration parameter.
            - To update the vision configuration, use the prefix *vision_* for each configuration parameter.
            - To update the parent model configuration, do not use a prefix for each configuration parameter.

            Behaves differently depending on whether a `config` is provided or automatically loaded.

    Example:
        ```python
        >>> from transformers import VisionTextDualEncoderModel
        ...
        >>> # initialize a model from pretrained ViT and BERT models. Note that the projection layers will be randomly initialized.
        >>> model = VisionTextDualEncoderModel.from_vision_text_pretrained(
        ...     "google/vit-base-patch16-224", "google-bert/bert-base-uncased"
        ... )
        >>> # saving model after fine-tuning
        >>> model.save_pretrained("./vit-bert")
        >>> # load fine-tuned model
        >>> model = VisionTextDualEncoderModel.from_pretrained("./vit-bert")
        ```
    """
    kwargs_vision = {
        argument[len("vision_") :]: value for argument, value in kwargs.items() if argument.startswith("vision_")
    }

    kwargs_text = {
        argument[len("text_") :]: value for argument, value in kwargs.items() if argument.startswith("text_")
    }

    # remove vision, text kwargs from kwargs
    for key in kwargs_vision.keys():
        del kwargs["vision_" + key]
    for key in kwargs_text.keys():
        del kwargs["text_" + key]

    # Load and initialize the vision and text model
    vision_model = kwargs_vision.pop("model", None)
    if vision_model is None:
        if vision_model_name_or_path is None:
            raise ValueError(
                "If `vision_model` is not defined as an argument, a `vision_model_name_or_path` has to be defined"
            )

        if "config" not in kwargs_vision:
            vision_config = AutoConfig.from_pretrained(vision_model_name_or_path)

        if vision_config.model_type == "clip":
            kwargs_vision["config"] = vision_config.vision_config
            vision_model = CLIPVisionModel.from_pretrained(vision_model_name_or_path, *model_args, **kwargs_vision)
            # TODO: Should we use the pre-trained projection as well ?
        else:
            kwargs_vision["config"] = vision_config
            vision_model = AutoModel.from_pretrained(vision_model_name_or_path, *model_args, **kwargs_vision)

    text_model = kwargs_text.pop("model", None)
    if text_model is None:
        if text_model_name_or_path is None:
            raise ValueError(
                "If `text_model` is not defined as an argument, a `text_model_name_or_path` has to be defined"
            )

        if "config" not in kwargs_text:
            text_config = AutoConfig.from_pretrained(text_model_name_or_path)
            kwargs_text["config"] = text_config

        text_model = AutoModel.from_pretrained(text_model_name_or_path, *model_args, **kwargs_text)

    # instantiate config with corresponding kwargs
    config = VisionTextDualEncoderConfig.from_vision_text_configs(vision_model.config, text_model.config, **kwargs)

    # init model
    model = cls(config=config, vision_model=vision_model, text_model=text_model)

    # the projection layers are always newly initialized when loading the model
    # using pre-trained vision and text model.
    logger.warning(
        "The projection layer and logit scale weights `['visual_projection.weight', 'text_projection.weight',"
        " 'logit_scale']` are newly initialized. You should probably TRAIN this model on a down-stream task to be"
        " able to use it for predictions and inference."
    )

    return model

mindnlp.transformers.models.vision_text_dual_encoder.modeling_vision_text_dual_encoder.VisionTextDualEncoderModel.get_image_features(pixel_values=None, output_attentions=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
image_features

The image embeddings obtained by applying the projection layer to the pooled output of [CLIPVisionModel].

TYPE: `torch.FloatTensor` of shape `(batch_size, output_dim)`

Example
>>> from PIL import Image
>>> import requests
>>> from transformers import VisionTextDualEncoderModel, AutoImageProcessor
...
>>> model = VisionTextDualEncoderModel.from_pretrained("clip-italian/clip-italian")
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> inputs = image_processor(images=image, return_tensors="ms")
...
>>> image_features = model.get_image_features(**inputs)
Source code in mindnlp\transformers\models\vision_text_dual_encoder\modeling_vision_text_dual_encoder.py
def get_image_features(
    self,
    pixel_values=None,
    output_attentions=None,
    output_hidden_states=None,
    return_dict=None,
):
    r"""

    Returns:
        image_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The image embeddings obtained by
            applying the projection layer to the pooled output of [`CLIPVisionModel`].

    Example:
        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import VisionTextDualEncoderModel, AutoImageProcessor
        ...
        >>> model = VisionTextDualEncoderModel.from_pretrained("clip-italian/clip-italian")
        >>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
        ...
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        ...
        >>> inputs = image_processor(images=image, return_tensors="ms")
        ...
        >>> image_features = model.get_image_features(**inputs)
        ```
    """
    vision_outputs = self.vision_model(
        pixel_values=pixel_values,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    pooled_output = vision_outputs[1]  # pooled_output
    image_features = self.visual_projection(pooled_output)

    return image_features

mindnlp.transformers.models.vision_text_dual_encoder.modeling_vision_text_dual_encoder.VisionTextDualEncoderModel.get_text_features(input_ids=None, attention_mask=None, position_ids=None, token_type_ids=None, output_attentions=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
text_features

The text embeddings obtained by applying the projection layer to the pooled output of [CLIPTextModel].

TYPE: `torch.FloatTensor` of shape `(batch_size, output_dim)`

Example
>>> from transformers import VisionTextDualEncoderModel, AutoTokenizer
...
>>> model = VisionTextDualEncoderModel.from_pretrained("clip-italian/clip-italian")
>>> tokenizer = AutoTokenizer.from_pretrained("clip-italian/clip-italian")
...
>>> inputs = tokenizer(["una foto di un gatto", "una foto di un cane"], padding=True, return_tensors="ms")
>>> text_features = model.get_text_features(**inputs)
Source code in mindnlp\transformers\models\vision_text_dual_encoder\modeling_vision_text_dual_encoder.py
def get_text_features(
    self,
    input_ids=None,
    attention_mask=None,
    position_ids=None,
    token_type_ids=None,
    output_attentions=None,
    output_hidden_states=None,
    return_dict=None,
):
    r"""
    Returns:
        text_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The text embeddings obtained by
            applying the projection layer to the pooled output of [`CLIPTextModel`].

    Example:
        ```python
        >>> from transformers import VisionTextDualEncoderModel, AutoTokenizer
        ...
        >>> model = VisionTextDualEncoderModel.from_pretrained("clip-italian/clip-italian")
        >>> tokenizer = AutoTokenizer.from_pretrained("clip-italian/clip-italian")
        ...
        >>> inputs = tokenizer(["una foto di un gatto", "una foto di un cane"], padding=True, return_tensors="ms")
        >>> text_features = model.get_text_features(**inputs)
        ```
    """
    text_outputs = self.text_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        token_type_ids=token_type_ids,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    pooled_output = text_outputs[1]
    text_features = self.text_projection(pooled_output)

    return text_features

mindnlp.transformers.models.vision_text_dual_encoder.processing_vision_text_dual_encoder

Processor class for VisionTextDualEncoder

mindnlp.transformers.models.vision_text_dual_encoder.processing_vision_text_dual_encoder.VisionTextDualEncoderProcessor

Bases: ProcessorMixin

Constructs a VisionTextDualEncoder processor which wraps an image processor and a tokenizer into a single processor.

[VisionTextDualEncoderProcessor] offers all the functionalities of [AutoImageProcessor] and [AutoTokenizer]. See the [~VisionTextDualEncoderProcessor.__call__] and [~VisionTextDualEncoderProcessor.decode] for more information.

PARAMETER DESCRIPTION
image_processor

The image processor is a required input.

TYPE: [`AutoImageProcessor`], *optional* DEFAULT: None

tokenizer

The tokenizer is a required input.

TYPE: [`PreTrainedTokenizer`], *optional* DEFAULT: None
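
The processor's `__call__` dispatch (text-only, images-only, or both) can be sketched with hypothetical stand-in callables — `toy_tokenizer` and `toy_image_processor` below are illustrative placeholders, not library APIs:

```python
def toy_tokenizer(text):
    # stand-in: one fixed id sequence per input string
    return {"input_ids": [[101, 102] for _ in text]}

def toy_image_processor(images):
    # stand-in: one dummy pixel array per input image
    return {"pixel_values": [[0.0] for _ in images]}

def process(text=None, images=None):
    """Mirror of the processor's branching over text and images inputs."""
    if text is None and images is None:
        raise ValueError("You have to specify either text or images.")
    encoding = {}
    if text is not None:
        encoding.update(toy_tokenizer(text))
    if images is not None:
        encoding.update(toy_image_processor(images))
    return encoding
```

Passing both inputs yields a single encoding carrying `input_ids` and `pixel_values`, which is what the model's forward pass expects.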

Source code in mindnlp\transformers\models\vision_text_dual_encoder\processing_vision_text_dual_encoder.py
class VisionTextDualEncoderProcessor(ProcessorMixin):
    r"""
    Constructs a VisionTextDualEncoder processor which wraps an image processor and a tokenizer into a single
    processor.

    [`VisionTextDualEncoderProcessor`] offers all the functionalities of [`AutoImageProcessor`] and [`AutoTokenizer`].
    See the [`~VisionTextDualEncoderProcessor.__call__`] and [`~VisionTextDualEncoderProcessor.decode`] for more
    information.

    Args:
        image_processor ([`AutoImageProcessor`], *optional*):
            The image processor is a required input.
        tokenizer ([`PreTrainedTokenizer`], *optional*):
            The tokenizer is a required input.
    """

    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "AutoImageProcessor"
    tokenizer_class = "AutoTokenizer"

    def __init__(self, image_processor=None, tokenizer=None, **kwargs):
        feature_extractor = None
        if "feature_extractor" in kwargs:
            warnings.warn(
                "The `feature_extractor` argument is deprecated and will be removed in v5, use `image_processor`"
                " instead.",
                FutureWarning,
            )
            feature_extractor = kwargs.pop("feature_extractor")

        image_processor = image_processor if image_processor is not None else feature_extractor
        if image_processor is None:
            raise ValueError("You have to specify an image_processor.")
        if tokenizer is None:
            raise ValueError("You have to specify a tokenizer.")

        super().__init__(image_processor, tokenizer)
        self.current_processor = self.image_processor

    def __call__(self, text=None, images=None, return_tensors=None, **kwargs):
        """
        Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the `text`
        and `kwargs` arguments to VisionTextDualEncoderTokenizer's [`~PreTrainedTokenizer.__call__`] if `text` is not
        `None` to encode the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
        AutoImageProcessor's [`~AutoImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
        of the above two methods for more information.

        Args:
            text (`str`, `List[str]`, `List[List[str]]`):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`,
                `List[torch.Tensor]`):
                The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
                tensor. Both channels-first and channels-last formats are supported.

            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors of a particular framework. Acceptable values are:

                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return NumPy `np.ndarray` objects.
                - `'jax'`: Return JAX `jnp.ndarray` objects.

        Returns:
            [`BatchEncoding`]:
                A [`BatchEncoding`] with the following fields:

                - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
                - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
                  `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
                  `None`).
                - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
        """

        if text is None and images is None:
            raise ValueError("You have to specify either text or images. Both cannot be none.")

        if text is not None:
            encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs)

        if images is not None:
            image_features = self.image_processor(images, return_tensors=return_tensors, **kwargs)

        if text is not None and images is not None:
            encoding["pixel_values"] = image_features.pixel_values
            return encoding
        elif text is not None:
            return encoding
        else:
            return BatchEncoding(data={**image_features}, tensor_type=return_tensors)

    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to VisionTextDualEncoderTokenizer's
        [`~PreTrainedTokenizer.batch_decode`]. Please refer to the docstring of this method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to VisionTextDualEncoderTokenizer's [`~PreTrainedTokenizer.decode`].
        Please refer to the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)

    @property
    def model_input_names(self):
        tokenizer_input_names = self.tokenizer.model_input_names
        image_processor_input_names = self.image_processor.model_input_names
        return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))

    @property
    def feature_extractor_class(self):
        warnings.warn(
            "`feature_extractor_class` is deprecated. Use `image_processor_class` instead.",
            FutureWarning,
        )
        return self.image_processor_class

    @property
    def feature_extractor(self):
        warnings.warn(
            "`feature_extractor` is deprecated. Use `image_processor` instead.",
            FutureWarning,
        )
        return self.image_processor
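
The `model_input_names` property above merges the tokenizer's and the image processor's input names while preserving order and dropping duplicates; `dict.fromkeys` is the idiomatic order-preserving dedup shown in the source:

```python
def merged_input_names(tokenizer_names, image_processor_names):
    # dict.fromkeys keeps first-seen order and discards later duplicates
    return list(dict.fromkeys(tokenizer_names + image_processor_names))
```

Merging, say, `["input_ids", "attention_mask"]` with `["pixel_values", "attention_mask"]` yields each name once, in first-seen order.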

mindnlp.transformers.models.vision_text_dual_encoder.processing_vision_text_dual_encoder.VisionTextDualEncoderProcessor.__call__(text=None, images=None, return_tensors=None, **kwargs)

Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the text and kwargs arguments to VisionTextDualEncoderTokenizer's [~PreTrainedTokenizer.__call__] if text is not None to encode the text. To prepare the image(s), this method forwards the images and kwargs arguments to AutoImageProcessor's [~AutoImageProcessor.__call__] if images is not None. Please refer to the docstring of the above two methods for more information.

PARAMETER DESCRIPTION
text

The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

TYPE: `str`, `List[str]`, `List[List[str]]` DEFAULT: None

return_tensors

If set, will return tensors of a particular framework. Acceptable values are:

  • 'tf': Return TensorFlow tf.constant objects.
  • 'pt': Return PyTorch torch.Tensor objects.
  • 'np': Return NumPy np.ndarray objects.
  • 'jax': Return JAX jnp.ndarray objects.

TYPE: `str` or [`~utils.TensorType`], *optional* DEFAULT: None

RETURNS DESCRIPTION

[BatchEncoding]: A [BatchEncoding] with the following fields:

  • input_ids -- List of token ids to be fed to a model. Returned when text is not None.
  • attention_mask -- List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if "attention_mask" is in self.model_input_names and if text is not None).
  • pixel_values -- Pixel values to be fed to a model. Returned when images is not None.
Source code in mindnlp\transformers\models\vision_text_dual_encoder\processing_vision_text_dual_encoder.py
def __call__(self, text=None, images=None, return_tensors=None, **kwargs):
    """
    Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the `text`
    and `kwargs` arguments to VisionTextDualEncoderTokenizer's [`~PreTrainedTokenizer.__call__`] if `text` is not
    `None` to encode the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
    AutoImageProcessor's [`~AutoImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
    of the above two methods for more information.

    Args:
        text (`str`, `List[str]`, `List[List[str]]`):
            The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
            (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
            `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
        images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`,
            `List[torch.Tensor]`):
            The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
            tensor. Both channels-first and channels-last formats are supported.

        return_tensors (`str` or [`~utils.TensorType`], *optional*):
            If set, will return tensors of a particular framework. Acceptable values are:

            - `'tf'`: Return TensorFlow `tf.constant` objects.
            - `'pt'`: Return PyTorch `torch.Tensor` objects.
            - `'np'`: Return NumPy `np.ndarray` objects.
            - `'jax'`: Return JAX `jnp.ndarray` objects.

    Returns:
        [`BatchEncoding`]:
            A [`BatchEncoding`] with the following fields:

            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
              `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
              `None`).
            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
    """

    if text is None and images is None:
        raise ValueError("You have to specify either text or images. Both cannot be none.")

    if text is not None:
        encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs)

    if images is not None:
        image_features = self.image_processor(images, return_tensors=return_tensors, **kwargs)

    if text is not None and images is not None:
        encoding["pixel_values"] = image_features.pixel_values
        return encoding
    elif text is not None:
        return encoding
    else:
        return BatchEncoding(data={**image_features}, tensor_type=return_tensors)

mindnlp.transformers.models.vision_text_dual_encoder.processing_vision_text_dual_encoder.VisionTextDualEncoderProcessor.batch_decode(*args, **kwargs)

This method forwards all its arguments to VisionTextDualEncoderTokenizer's [~PreTrainedTokenizer.batch_decode]. Please refer to the docstring of this method for more information.

Source code in mindnlp\transformers\models\vision_text_dual_encoder\processing_vision_text_dual_encoder.py
def batch_decode(self, *args, **kwargs):
    """
    This method forwards all its arguments to VisionTextDualEncoderTokenizer's
    [`~PreTrainedTokenizer.batch_decode`]. Please refer to the docstring of this method for more information.
    """
    return self.tokenizer.batch_decode(*args, **kwargs)

mindnlp.transformers.models.vision_text_dual_encoder.processing_vision_text_dual_encoder.VisionTextDualEncoderProcessor.decode(*args, **kwargs)

This method forwards all its arguments to VisionTextDualEncoderTokenizer's [~PreTrainedTokenizer.decode]. Please refer to the docstring of this method for more information.

Source code in mindnlp\transformers\models\vision_text_dual_encoder\processing_vision_text_dual_encoder.py
def decode(self, *args, **kwargs):
    """
    This method forwards all its arguments to VisionTextDualEncoderTokenizer's [`~PreTrainedTokenizer.decode`].
    Please refer to the docstring of this method for more information.
    """
    return self.tokenizer.decode(*args, **kwargs)