clip

mindnlp.transformers.models.clip.configuration_clip.CLIPConfig

Bases: PretrainedConfig

[CLIPConfig] is the configuration class to store the configuration of a [CLIPModel]. It is used to instantiate a CLIP model according to the specified arguments, defining the text model and vision model configs. Instantiating a configuration with the defaults will yield a similar configuration to that of the CLIP openai/clip-vit-base-patch32 architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
text_config

Dictionary of configuration options used to initialize [CLIPTextConfig].

TYPE: `dict`, *optional* DEFAULT: None

vision_config

Dictionary of configuration options used to initialize [CLIPVisionConfig].

TYPE: `dict`, *optional* DEFAULT: None

projection_dim

Dimensionality of text and vision projection layers.

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

logit_scale_init_value

The initial value of the logit_scale parameter. The default matches the original CLIP implementation.

TYPE: `float`, *optional*, defaults to 2.6592 DEFAULT: 2.6592

kwargs

Dictionary of keyword arguments.

TYPE: *optional* DEFAULT: {}

Example
>>> from transformers import CLIPConfig, CLIPModel
...
>>> # Initializing a CLIPConfig with openai/clip-vit-base-patch32 style configuration
>>> configuration = CLIPConfig()
...
>>> # Initializing a CLIPModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
>>> model = CLIPModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
...
>>> # We can also initialize a CLIPConfig from a CLIPTextConfig and a CLIPVisionConfig
>>> from transformers import CLIPTextConfig, CLIPVisionConfig
...
>>> # Initializing a CLIPText and CLIPVision configuration
>>> config_text = CLIPTextConfig()
>>> config_vision = CLIPVisionConfig()
...
>>> config = CLIPConfig.from_text_vision_configs(config_text, config_vision)
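The default `logit_scale_init_value` of 2.6592 is not arbitrary: it appears to be ln(1/0.07), the log of the inverse temperature used in the original CLIP paper. A standard-library sketch of how the scale enters the output logits (`cosine_sim` is a made-up value for illustration):

```python
import math

# The default comes from the original CLIP paper, which initializes the
# learned temperature to 0.07; logit_scale stores its logarithm.
logit_scale_init_value = math.log(1 / 0.07)   # ~2.6592

# At the output head, image-text cosine similarities are multiplied by
# exp(logit_scale) before the softmax over the batch:
scale = math.exp(logit_scale_init_value)      # ~14.29
cosine_sim = 0.31                             # hypothetical similarity score
logit = scale * cosine_sim
```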
Source code in mindnlp\transformers\models\clip\configuration_clip.py
class CLIPConfig(PretrainedConfig):
    r"""
    [`CLIPConfig`] is the configuration class to store the configuration of a [`CLIPModel`]. It is used to instantiate
    a CLIP model according to the specified arguments, defining the text model and vision model configs. Instantiating
    a configuration with the defaults will yield a similar configuration to that of the CLIP
    [openai/clip-vit-base-patch32](https://hf-mirror.com/openai/clip-vit-base-patch32) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        text_config (`dict`, *optional*):
            Dictionary of configuration options used to initialize [`CLIPTextConfig`].
        vision_config (`dict`, *optional*):
            Dictionary of configuration options used to initialize [`CLIPVisionConfig`].
        projection_dim (`int`, *optional*, defaults to 512):
            Dimensionality of text and vision projection layers.
        logit_scale_init_value (`float`, *optional*, defaults to 2.6592):
            The initial value of the *logit_scale* parameter. The default matches the original CLIP implementation.
        kwargs (*optional*):
            Dictionary of keyword arguments.

    Example:
        ```python
        >>> from transformers import CLIPConfig, CLIPModel
        ...
        >>> # Initializing a CLIPConfig with openai/clip-vit-base-patch32 style configuration
        >>> configuration = CLIPConfig()
        ...
        >>> # Initializing a CLIPModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
        >>> model = CLIPModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ...
        >>> # We can also initialize a CLIPConfig from a CLIPTextConfig and a CLIPVisionConfig
        >>> from transformers import CLIPTextConfig, CLIPVisionConfig
        ...
        >>> # Initializing a CLIPText and CLIPVision configuration
        >>> config_text = CLIPTextConfig()
        >>> config_vision = CLIPVisionConfig()
        ...
        >>> config = CLIPConfig.from_text_vision_configs(config_text, config_vision)
        ```
    """
    model_type = "clip"

    def __init__(
        self, text_config=None, vision_config=None, projection_dim=512, logit_scale_init_value=2.6592, **kwargs
    ):
        """
        Initializes a new instance of CLIPConfig.

        Args:
            self: The instance of the class.
            text_config (dict): The configuration for text inputs. If provided, overrides default values. Default is None.
            vision_config (dict): The configuration for vision inputs. If provided, overrides default values. Default is None.
            projection_dim (int): The dimension of the projection. Default is 512.
            logit_scale_init_value (float): The initial value for logit scaling. Default is 2.6592.

        Returns:
            None

        Raises:
            TypeError: If text_config or vision_config are not of type dict.
            ValueError: If projection_dim or logit_scale_init_value are not of type int or float respectively.
            KeyError: If 'transformers_version' key is present in text_config or vision_config.
            AttributeError: If 'id2label' key is not present in vision_config.
        """
        # If `_config_dict` kwargs exist, we use them for backward compatibility.
        # We pop out these 2 attributes before calling `super().__init__` to avoid them being saved (which causes a lot
        # of confusion!).
        text_config_dict = kwargs.pop("text_config_dict", None)
        vision_config_dict = kwargs.pop("vision_config_dict", None)

        super().__init__(**kwargs)

        # Instead of simply assigning `[text|vision]_config_dict` to `[text|vision]_config`, we use the values in
        # `[text|vision]_config_dict` to update the values in `[text|vision]_config`. The values should be the same in most
        # cases, but we don't want to break anything regarding `_config_dict` that existed before commit `8827e1b2`.
        if text_config_dict is not None:
            if text_config is None:
                text_config = {}

            # This is the complete result when using `text_config_dict`.
            _text_config_dict = CLIPTextConfig(**text_config_dict).to_dict()

            # Give a warning if the values exist in both `_text_config_dict` and `text_config` but differ.
            for key, value in _text_config_dict.items():
                if key in text_config and value != text_config[key] and key not in ["transformers_version"]:
                    # If specified in `text_config_dict`
                    if key in text_config_dict:
                        message = (
                            f"`{key}` is found in both `text_config_dict` and `text_config` but with different values. "
                            f'The value `text_config_dict["{key}"]` will be used instead.'
                        )
                    # If inferred from default argument values (just to be super careful)
                    else:
                        message = (
                            f"`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The "
                            f'value `text_config["{key}"]` will be overridden.'
                        )
                    logger.info(message)

            # Update all values in `text_config` with the ones in `_text_config_dict`.
            text_config.update(_text_config_dict)

        if vision_config_dict is not None:
            if vision_config is None:
                vision_config = {}

            # This is the complete result when using `vision_config_dict`.
            _vision_config_dict = CLIPVisionConfig(**vision_config_dict).to_dict()
            # convert keys to string instead of integer
            if "id2label" in _vision_config_dict:
                _vision_config_dict["id2label"] = {
                    str(key): value for key, value in _vision_config_dict["id2label"].items()
                }

            # Give a warning if the values exist in both `_vision_config_dict` and `vision_config` but differ.
            for key, value in _vision_config_dict.items():
                if key in vision_config and value != vision_config[key] and key not in ["transformers_version"]:
                    # If specified in `vision_config_dict`
                    if key in vision_config_dict:
                        message = (
                            f"`{key}` is found in both `vision_config_dict` and `vision_config` but with different "
                            f'values. The value `vision_config_dict["{key}"]` will be used instead.'
                        )
                    # If inferred from default argument values (just to be super careful)
                    else:
                        message = (
                            f"`vision_config_dict` is provided which will be used to initialize `CLIPVisionConfig`. "
                            f'The value `vision_config["{key}"]` will be overridden.'
                        )
                    logger.info(message)

            # Update all values in `vision_config` with the ones in `_vision_config_dict`.
            vision_config.update(_vision_config_dict)

        if text_config is None:
            text_config = {}
            logger.info("`text_config` is `None`. Initializing the `CLIPTextConfig` with default values.")

        if vision_config is None:
            vision_config = {}
            logger.info("`vision_config` is `None`. Initializing the `CLIPVisionConfig` with default values.")

        self.text_config = CLIPTextConfig(**text_config)
        self.vision_config = CLIPVisionConfig(**vision_config)

        self.projection_dim = projection_dim
        self.logit_scale_init_value = logit_scale_init_value
        self.initializer_factor = 1.0

    @classmethod
    def from_text_vision_configs(cls, text_config: CLIPTextConfig, vision_config: CLIPVisionConfig, **kwargs):
        r"""
        Instantiate a [`CLIPConfig`] (or a derived class) from a CLIP text model configuration and a CLIP vision model
        configuration.

        Returns:
            [`CLIPConfig`]: An instance of a configuration object
        """
        return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs)

mindnlp.transformers.models.clip.configuration_clip.CLIPConfig.__init__(text_config=None, vision_config=None, projection_dim=512, logit_scale_init_value=2.6592, **kwargs)

Initializes a new instance of CLIPConfig.

PARAMETER DESCRIPTION
self

The instance of the class.

text_config

The configuration for text inputs. If provided, overrides default values. Default is None.

TYPE: dict DEFAULT: None

vision_config

The configuration for vision inputs. If provided, overrides default values. Default is None.

TYPE: dict DEFAULT: None

projection_dim

The dimension of the projection. Default is 512.

TYPE: int DEFAULT: 512

logit_scale_init_value

The initial value for logit scaling. Default is 2.6592.

TYPE: float DEFAULT: 2.6592

RETURNS DESCRIPTION

None

RAISES DESCRIPTION
TypeError

If text_config or vision_config are not of type dict.

ValueError

If projection_dim or logit_scale_init_value are not of type int or float respectively.

KeyError

If 'transformers_version' key is present in text_config or vision_config.

AttributeError

If 'id2label' key is not present in vision_config.
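The backward-compatibility handling of `text_config_dict` / `vision_config_dict` follows a simple precedence rule: values from the `*_config_dict` argument overwrite those in the corresponding `*_config`. A plain-dict sketch of that rule (`merge_configs` is a hypothetical helper, not part of mindnlp):

```python
def merge_configs(config, config_dict):
    """Mimic CLIPConfig.__init__: keys in `config_dict` override `config`."""
    merged = dict(config or {})
    for key, value in (config_dict or {}).items():
        if key in merged and merged[key] != value and key != "transformers_version":
            # CLIPConfig logs an informational message here before overriding.
            print(f"`{key}` set in both; the `config_dict` value wins.")
        merged[key] = value
    return merged

merged = merge_configs(
    {"hidden_size": 512, "num_hidden_layers": 12},  # text_config
    {"hidden_size": 256},                           # text_config_dict
)
```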

Source code in mindnlp\transformers\models\clip\configuration_clip.py
def __init__(
    self, text_config=None, vision_config=None, projection_dim=512, logit_scale_init_value=2.6592, **kwargs
):
    """
    Initializes a new instance of CLIPConfig.

    Args:
        self: The instance of the class.
        text_config (dict): The configuration for text inputs. If provided, overrides default values. Default is None.
        vision_config (dict): The configuration for vision inputs. If provided, overrides default values. Default is None.
        projection_dim (int): The dimension of the projection. Default is 512.
        logit_scale_init_value (float): The initial value for logit scaling. Default is 2.6592.

    Returns:
        None

    Raises:
        TypeError: If text_config or vision_config are not of type dict.
        ValueError: If projection_dim or logit_scale_init_value are not of type int or float respectively.
        KeyError: If 'transformers_version' key is present in text_config or vision_config.
        AttributeError: If 'id2label' key is not present in vision_config.
    """
    # If `_config_dict` kwargs exist, we use them for backward compatibility.
    # We pop out these 2 attributes before calling `super().__init__` to avoid them being saved (which causes a lot
    # of confusion!).
    text_config_dict = kwargs.pop("text_config_dict", None)
    vision_config_dict = kwargs.pop("vision_config_dict", None)

    super().__init__(**kwargs)

    # Instead of simply assigning `[text|vision]_config_dict` to `[text|vision]_config`, we use the values in
    # `[text|vision]_config_dict` to update the values in `[text|vision]_config`. The values should be the same in most
    # cases, but we don't want to break anything regarding `_config_dict` that existed before commit `8827e1b2`.
    if text_config_dict is not None:
        if text_config is None:
            text_config = {}

        # This is the complete result when using `text_config_dict`.
        _text_config_dict = CLIPTextConfig(**text_config_dict).to_dict()

        # Give a warning if the values exist in both `_text_config_dict` and `text_config` but differ.
        for key, value in _text_config_dict.items():
            if key in text_config and value != text_config[key] and key not in ["transformers_version"]:
                # If specified in `text_config_dict`
                if key in text_config_dict:
                    message = (
                        f"`{key}` is found in both `text_config_dict` and `text_config` but with different values. "
                        f'The value `text_config_dict["{key}"]` will be used instead.'
                    )
                # If inferred from default argument values (just to be super careful)
                else:
                    message = (
                        f"`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The "
                        f'value `text_config["{key}"]` will be overridden.'
                    )
                logger.info(message)

        # Update all values in `text_config` with the ones in `_text_config_dict`.
        text_config.update(_text_config_dict)

    if vision_config_dict is not None:
        if vision_config is None:
            vision_config = {}

        # This is the complete result when using `vision_config_dict`.
        _vision_config_dict = CLIPVisionConfig(**vision_config_dict).to_dict()
        # convert keys to string instead of integer
        if "id2label" in _vision_config_dict:
            _vision_config_dict["id2label"] = {
                str(key): value for key, value in _vision_config_dict["id2label"].items()
            }

        # Give a warning if the values exist in both `_vision_config_dict` and `vision_config` but differ.
        for key, value in _vision_config_dict.items():
            if key in vision_config and value != vision_config[key] and key not in ["transformers_version"]:
                # If specified in `vision_config_dict`
                if key in vision_config_dict:
                    message = (
                        f"`{key}` is found in both `vision_config_dict` and `vision_config` but with different "
                        f'values. The value `vision_config_dict["{key}"]` will be used instead.'
                    )
                # If inferred from default argument values (just to be super careful)
                else:
                    message = (
                        f"`vision_config_dict` is provided which will be used to initialize `CLIPVisionConfig`. "
                        f'The value `vision_config["{key}"]` will be overridden.'
                    )
                logger.info(message)

        # Update all values in `vision_config` with the ones in `_vision_config_dict`.
        vision_config.update(_vision_config_dict)

    if text_config is None:
        text_config = {}
        logger.info("`text_config` is `None`. Initializing the `CLIPTextConfig` with default values.")

    if vision_config is None:
        vision_config = {}
        logger.info("`vision_config` is `None`. Initializing the `CLIPVisionConfig` with default values.")

    self.text_config = CLIPTextConfig(**text_config)
    self.vision_config = CLIPVisionConfig(**vision_config)

    self.projection_dim = projection_dim
    self.logit_scale_init_value = logit_scale_init_value
    self.initializer_factor = 1.0

mindnlp.transformers.models.clip.configuration_clip.CLIPConfig.from_text_vision_configs(text_config, vision_config, **kwargs) classmethod

Instantiate a [CLIPConfig] (or a derived class) from a CLIP text model configuration and a CLIP vision model configuration.

RETURNS DESCRIPTION

[CLIPConfig]: An instance of a configuration object
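This classmethod is a thin convenience wrapper: it serializes each sub-config with `to_dict()` and forwards the resulting dicts to the constructor. A self-contained sketch of that mechanism with stand-in objects (`DummyConfig` and the free function are hypothetical, used here only to avoid importing mindnlp):

```python
class DummyConfig:
    """Stand-in for CLIPTextConfig / CLIPVisionConfig."""
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def to_dict(self):
        return dict(self.__dict__)

def from_text_vision_configs(text_config, vision_config, **kwargs):
    # Mirrors CLIPConfig.from_text_vision_configs: pass the serialized
    # sub-configs as the `text_config` / `vision_config` dict arguments.
    return {"text_config": text_config.to_dict(),
            "vision_config": vision_config.to_dict(),
            **kwargs}

combined = from_text_vision_configs(
    DummyConfig(hidden_size=512),   # text side
    DummyConfig(hidden_size=768),   # vision side
    projection_dim=512,             # extra kwargs are forwarded unchanged
)
```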

Source code in mindnlp\transformers\models\clip\configuration_clip.py
@classmethod
def from_text_vision_configs(cls, text_config: CLIPTextConfig, vision_config: CLIPVisionConfig, **kwargs):
    r"""
    Instantiate a [`CLIPConfig`] (or a derived class) from a CLIP text model configuration and a CLIP vision model
    configuration.

    Returns:
        [`CLIPConfig`]: An instance of a configuration object
    """
    return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs)

mindnlp.transformers.models.clip.configuration_clip.CLIPTextConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [CLIPTextModel]. It is used to instantiate a CLIP text encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the text encoder of the CLIP openai/clip-vit-base-patch32 architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
vocab_size

Vocabulary size of the CLIP text model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling [CLIPModel].

TYPE: `int`, *optional*, defaults to 49408 DEFAULT: 49408

hidden_size

Dimensionality of the encoder layers and the pooler layer.

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

intermediate_size

Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 2048 DEFAULT: 2048

projection_dim

Dimensionality of text and vision projection layers.

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

num_hidden_layers

Number of hidden layers in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

num_attention_heads

Number of attention heads for each attention layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 8 DEFAULT: 8

max_position_embeddings

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

TYPE: `int`, *optional*, defaults to 77 DEFAULT: 77

hidden_act

The non-linear activation function (function or string) in the encoder and pooler. If a string, "gelu", "relu", "selu", "gelu_new" and "quick_gelu" are supported.

TYPE: `str` or `function`, *optional*, defaults to `"quick_gelu"` DEFAULT: 'quick_gelu'

layer_norm_eps

The epsilon used by the layer normalization layers.

TYPE: `float`, *optional*, defaults to 1e-05 DEFAULT: 1e-05

attention_dropout

The dropout ratio for the attention probabilities.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

initializer_factor

A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).

TYPE: `float`, *optional*, defaults to 1.0 DEFAULT: 1.0

pad_token_id

Padding token id.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

bos_token_id

Beginning of stream token id.

TYPE: `int`, *optional*, defaults to 49406 DEFAULT: 49406

eos_token_id

End of stream token id.

TYPE: `int`, *optional*, defaults to 49407 DEFAULT: 49407

Example
>>> from transformers import CLIPTextConfig, CLIPTextModel
...
>>> # Initializing a CLIPTextConfig with openai/clip-vit-base-patch32 style configuration
>>> configuration = CLIPTextConfig()
...
>>> # Initializing a CLIPTextModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
>>> model = CLIPTextModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
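One implicit constraint worth keeping in mind: as in most Transformer encoders, `hidden_size` must be divisible by `num_attention_heads` so that each head receives an integer-sized slice of the embedding. A quick check with the defaults above (standard library only):

```python
hidden_size = 512          # CLIPTextConfig default
num_attention_heads = 8    # CLIPTextConfig default

# Each attention head operates on hidden_size / num_attention_heads dims;
# a non-integer split would be rejected by the attention layer.
assert hidden_size % num_attention_heads == 0, "hidden_size must split evenly"
head_dim = hidden_size // num_attention_heads  # 64 dimensions per head
```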
Source code in mindnlp\transformers\models\clip\configuration_clip.py
class CLIPTextConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`CLIPTextModel`]. It is used to instantiate a CLIP
    text encoder according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the text encoder of the CLIP
    [openai/clip-vit-base-patch32](https://hf-mirror.com/openai/clip-vit-base-patch32) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 49408):
            Vocabulary size of the CLIP text model. Defines the number of different tokens that can be represented by
            the `inputs_ids` passed when calling [`CLIPModel`].
        hidden_size (`int`, *optional*, defaults to 512):
            Dimensionality of the encoder layers and the pooler layer.
        intermediate_size (`int`, *optional*, defaults to 2048):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
        projection_dim (`int`, *optional*, defaults to 512):
            Dimensionality of text and vision projection layers.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 8):
            Number of attention heads for each attention layer in the Transformer encoder.
        max_position_embeddings (`int`, *optional*, defaults to 77):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported.
        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon used by the layer normalization layers.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        initializer_factor (`float`, *optional*, defaults to 1.0):
            A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
            testing).
        pad_token_id (`int`, *optional*, defaults to 1):
            Padding token id.
        bos_token_id (`int`, *optional*, defaults to 49406):
            Beginning of stream token id.
        eos_token_id (`int`, *optional*, defaults to 49407):
            End of stream token id.

    Example:
        ```python
        >>> from transformers import CLIPTextConfig, CLIPTextModel
        ...
        >>> # Initializing a CLIPTextConfig with openai/clip-vit-base-patch32 style configuration
        >>> configuration = CLIPTextConfig()
        ...
        >>> # Initializing a CLIPTextModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
        >>> model = CLIPTextModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "clip_text_model"

    def __init__(
        self,
        vocab_size=49408,
        hidden_size=512,
        intermediate_size=2048,
        projection_dim=512,
        num_hidden_layers=12,
        num_attention_heads=8,
        max_position_embeddings=77,
        hidden_act="quick_gelu",
        layer_norm_eps=1e-5,
        attention_dropout=0.0,
        initializer_range=0.02,
        initializer_factor=1.0,
        # This differs from `CLIPTokenizer`'s default and from openai/clip
        # See https://github.com/huggingface/transformers/pull/24773#issuecomment-1632287538
        pad_token_id=1,
        bos_token_id=49406,
        eos_token_id=49407,
        **kwargs,
    ):
        """
        Initialize CLIPTextConfig.

        Args:
            vocab_size (int, optional): The size of the vocabulary. Default is 49408.
            hidden_size (int, optional): The size of the hidden layers. Default is 512.
            intermediate_size (int, optional): The size of the intermediate layers. Default is 2048.
            projection_dim (int, optional): The projection dimension. Default is 512.
            num_hidden_layers (int, optional): The number of hidden layers. Default is 12.
            num_attention_heads (int, optional): The number of attention heads. Default is 8.
            max_position_embeddings (int, optional): The maximum position embeddings. Default is 77.
            hidden_act (str, optional): The type of activation function for the hidden layers. Default is 'quick_gelu'.
            layer_norm_eps (float, optional): Epsilon value for layer normalization. Default is 1e-05.
            attention_dropout (float, optional): The dropout rate for attention layers. Default is 0.0.
            initializer_range (float, optional): The range for parameter initializers. Default is 0.02.
            initializer_factor (float, optional): The factor for parameter initializers. Default is 1.0.
            pad_token_id (int, optional): The ID of the padding token. Default is 1.
            bos_token_id (int, optional): The ID of the beginning of sequence token. Default is 49406.
            eos_token_id (int, optional): The ID of the end of sequence token. Default is 49407.
            **kwargs: Additional keyword arguments.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)

        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.projection_dim = projection_dim
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.max_position_embeddings = max_position_embeddings
        self.layer_norm_eps = layer_norm_eps
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.initializer_factor = initializer_factor
        self.attention_dropout = attention_dropout

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
        """
        Creates a CLIPTextConfig instance from a pretrained model.

        Args:
            cls (type): The class object.
            pretrained_model_name_or_path (Union[str, os.PathLike]): The name or path of the pretrained model.

        Returns:
            PretrainedConfig: A CLIPTextConfig instance initialized with the configuration specified by the pretrained model.

        Raises:
            TypeError: If the input parameters are not of the expected types.
            ValueError: If the configuration dictionary does not contain the required information.
            Warning: If the model type being used for instantiation does not match the class's model type, which may lead to errors.
        """
        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

        # get the text config dict if we are loading from CLIPConfig
        if config_dict.get("model_type") == "clip":
            config_dict = config_dict["text_config"]

        if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
            logger.warning(
                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
                f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
            )

        return cls.from_dict(config_dict, **kwargs)
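The interesting branch above is the `"clip"` case: when the checkpoint's config describes the full dual-encoder model, the method descends into its `text_config` sub-dictionary before building a `CLIPTextConfig`. A plain-dict sketch of that selection logic (`extract_text_config` is a hypothetical helper, not part of mindnlp):

```python
def extract_text_config(config_dict):
    # Mirror CLIPTextConfig.from_pretrained: a full CLIP config carries
    # the text encoder's settings under the "text_config" key.
    if config_dict.get("model_type") == "clip":
        config_dict = config_dict["text_config"]
    return config_dict

full_clip = {"model_type": "clip",
             "text_config": {"model_type": "clip_text_model", "hidden_size": 512}}
text_only = {"model_type": "clip_text_model", "hidden_size": 256}
```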

mindnlp.transformers.models.clip.configuration_clip.CLIPTextConfig.__init__(vocab_size=49408, hidden_size=512, intermediate_size=2048, projection_dim=512, num_hidden_layers=12, num_attention_heads=8, max_position_embeddings=77, hidden_act='quick_gelu', layer_norm_eps=1e-05, attention_dropout=0.0, initializer_range=0.02, initializer_factor=1.0, pad_token_id=1, bos_token_id=49406, eos_token_id=49407, **kwargs)

Initialize CLIPTextConfig.

PARAMETER DESCRIPTION
vocab_size

The size of the vocabulary. Default is 49408.

TYPE: int DEFAULT: 49408

hidden_size

The size of the hidden layers. Default is 512.

TYPE: int DEFAULT: 512

intermediate_size

The size of the intermediate layers. Default is 2048.

TYPE: int DEFAULT: 2048

projection_dim

The projection dimension. Default is 512.

TYPE: int DEFAULT: 512

num_hidden_layers

The number of hidden layers. Default is 12.

TYPE: int DEFAULT: 12

num_attention_heads

The number of attention heads. Default is 8.

TYPE: int DEFAULT: 8

max_position_embeddings

The maximum position embeddings. Default is 77.

TYPE: int DEFAULT: 77

hidden_act

The type of activation function for the hidden layers. Default is 'quick_gelu'.

TYPE: str DEFAULT: 'quick_gelu'

layer_norm_eps

Epsilon value for layer normalization. Default is 1e-05.

TYPE: float DEFAULT: 1e-05

attention_dropout

The dropout rate for attention layers. Default is 0.0.

TYPE: float DEFAULT: 0.0

initializer_range

The range for parameter initializers. Default is 0.02.

TYPE: float DEFAULT: 0.02

initializer_factor

The factor for parameter initializers. Default is 1.0.

TYPE: float DEFAULT: 1.0

pad_token_id

The ID of the padding token. Default is 1.

TYPE: int DEFAULT: 1

bos_token_id

The ID of the beginning of sequence token. Default is 49406.

TYPE: int DEFAULT: 49406

eos_token_id

The ID of the end of sequence token. Default is 49407.

TYPE: int DEFAULT: 49407

**kwargs

Additional keyword arguments.

DEFAULT: {}

RETURNS DESCRIPTION

None.

Source code in mindnlp\transformers\models\clip\configuration_clip.py
def __init__(
    self,
    vocab_size=49408,
    hidden_size=512,
    intermediate_size=2048,
    projection_dim=512,
    num_hidden_layers=12,
    num_attention_heads=8,
    max_position_embeddings=77,
    hidden_act="quick_gelu",
    layer_norm_eps=1e-5,
    attention_dropout=0.0,
    initializer_range=0.02,
    initializer_factor=1.0,
    # This differs from `CLIPTokenizer`'s default and from openai/clip
    # See https://github.com/huggingface/transformers/pull/24773#issuecomment-1632287538
    pad_token_id=1,
    bos_token_id=49406,
    eos_token_id=49407,
    **kwargs,
):
    """
    Initialize CLIPTextConfig.

    Args:
        vocab_size (int, optional): The size of the vocabulary. Default is 49408.
        hidden_size (int, optional): The size of the hidden layers. Default is 512.
        intermediate_size (int, optional): The size of the intermediate layers. Default is 2048.
        projection_dim (int, optional): The projection dimension. Default is 512.
        num_hidden_layers (int, optional): The number of hidden layers. Default is 12.
        num_attention_heads (int, optional): The number of attention heads. Default is 8.
        max_position_embeddings (int, optional): The maximum position embeddings. Default is 77.
        hidden_act (str, optional): The type of activation function for the hidden layers. Default is 'quick_gelu'.
        layer_norm_eps (float, optional): Epsilon value for layer normalization. Default is 1e-05.
        attention_dropout (float, optional): The dropout rate for attention layers. Default is 0.0.
        initializer_range (float, optional): The range for parameter initializers. Default is 0.02.
        initializer_factor (float, optional): The factor for parameter initializers. Default is 1.0.
        pad_token_id (int, optional): The ID of the padding token. Default is 1.
        bos_token_id (int, optional): The ID of the beginning of sequence token. Default is 49406.
        eos_token_id (int, optional): The ID of the end of sequence token. Default is 49407.
        **kwargs: Additional keyword arguments.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)

    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.intermediate_size = intermediate_size
    self.projection_dim = projection_dim
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.max_position_embeddings = max_position_embeddings
    self.layer_norm_eps = layer_norm_eps
    self.hidden_act = hidden_act
    self.initializer_range = initializer_range
    self.initializer_factor = initializer_factor
    self.attention_dropout = attention_dropout
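The defaults above imply a few derived quantities worth sanity-checking. The sketch below mirrors them in a plain dict (illustrative only, not the mindnlp API) so the arithmetic can be verified without importing the library:

```python
# CLIPTextConfig defaults, copied from the __init__ signature above into a
# plain dict; the dict itself is a hypothetical stand-in, not the real API.
text_defaults = {
    "vocab_size": 49408,
    "hidden_size": 512,
    "intermediate_size": 2048,
    "num_hidden_layers": 12,
    "num_attention_heads": 8,
    "max_position_embeddings": 77,
}

# Each attention head operates on hidden_size / num_attention_heads channels.
head_dim = text_defaults["hidden_size"] // text_defaults["num_attention_heads"]
print(head_dim)  # 64

# The feed-forward block expands the hidden size by intermediate_size / hidden_size.
mlp_ratio = text_defaults["intermediate_size"] // text_defaults["hidden_size"]
print(mlp_ratio)  # 4
```

So with the defaults, each of the 8 heads attends over 64-dimensional projections, and the MLP uses the common 4x expansion.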

mindnlp.transformers.models.clip.configuration_clip.CLIPTextConfig.from_pretrained(pretrained_model_name_or_path, **kwargs) classmethod

Creates a CLIPTextConfig instance from a pretrained model.

PARAMETER DESCRIPTION
cls

The class object.

TYPE: type

pretrained_model_name_or_path

The name or path of the pretrained model.

TYPE: Union[str, PathLike]

RETURNS DESCRIPTION
PretrainedConfig

A CLIPTextConfig instance initialized with the configuration specified by the pretrained model.

TYPE: PretrainedConfig

RAISES DESCRIPTION
TypeError

If the input parameters are not of the expected types.

ValueError

If the configuration dictionary does not contain the required information.

Warning

A warning is logged (not raised) if the model type being used for instantiation does not match the class's model type, which may lead to errors.

Source code in mindnlp\transformers\models\clip\configuration_clip.py
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
    """
    Creates a CLIPTextConfig instance from a pretrained model.

    Args:
        cls (type): The class object.
        pretrained_model_name_or_path (Union[str, os.PathLike]): The name or path of the pretrained model.

    Returns:
        PretrainedConfig: A CLIPTextConfig instance initialized with the configuration specified by the pretrained model.

    Raises:
        TypeError: If the input parameters are not of the expected types.
        ValueError: If the configuration dictionary does not contain the required information.

    Note:
        A warning is logged (not raised) if the model type being used for instantiation does
        not match the class's model type, which may lead to errors.
    """
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

    # get the text config dict if we are loading from CLIPConfig
    if config_dict.get("model_type") == "clip":
        config_dict = config_dict["text_config"]

    if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
        logger.warning(
            f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
            f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
        )

    return cls.from_dict(config_dict, **kwargs)
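The key step in this method is the narrowing from a composite CLIP config to its text sub-config. A minimal re-implementation of just that step, using a hypothetical composite dict in place of what `get_config_dict` would actually fetch, shows the behavior:

```python
# Sketch of the narrowing logic in CLIPTextConfig.from_pretrained: when the
# fetched config dict describes a full CLIP model ("model_type": "clip"),
# only its "text_config" sub-dict is kept.
def narrow_text_config(config_dict: dict) -> dict:
    if config_dict.get("model_type") == "clip":
        config_dict = config_dict["text_config"]
    return config_dict

# Hypothetical composite dict standing in for a real checkpoint's config.
composite = {
    "model_type": "clip",
    "projection_dim": 512,
    "text_config": {"model_type": "clip_text_model", "hidden_size": 512},
    "vision_config": {"model_type": "clip_vision_model", "hidden_size": 768},
}

print(narrow_text_config(composite))
# {'model_type': 'clip_text_model', 'hidden_size': 512}
```

A dict that is already a text config (or any other model type) passes through unchanged; the mismatch warning in the real method then fires only if its `model_type` differs from `cls.model_type`.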

mindnlp.transformers.models.clip.configuration_clip.CLIPVisionConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [CLIPVisionModel]. It is used to instantiate a CLIP vision encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the vision encoder of the CLIP openai/clip-vit-base-patch32 architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
hidden_size

Dimensionality of the encoder layers and the pooler layer.

TYPE: `int`, *optional*, defaults to 768 DEFAULT: 768

intermediate_size

Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 3072 DEFAULT: 3072

projection_dim

Dimensionality of text and vision projection layers.

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

num_hidden_layers

Number of hidden layers in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

num_attention_heads

Number of attention heads for each attention layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

num_channels

The number of input channels.

TYPE: `int`, *optional*, defaults to 3 DEFAULT: 3

image_size

The size (resolution) of each image.

TYPE: `int`, *optional*, defaults to 224 DEFAULT: 224

patch_size

The size (resolution) of each patch.

TYPE: `int`, *optional*, defaults to 32 DEFAULT: 32

hidden_act

The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu", "gelu_new" and "quick_gelu" are supported.

TYPE: `str` or `function`, *optional*, defaults to `"quick_gelu"` DEFAULT: 'quick_gelu'

layer_norm_eps

The epsilon used by the layer normalization layers.

TYPE: `float`, *optional*, defaults to 1e-05 DEFAULT: 1e-05

attention_dropout

The dropout ratio for the attention probabilities.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

initializer_factor

A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).

TYPE: `float`, *optional*, defaults to 1.0 DEFAULT: 1.0

Example
>>> from transformers import CLIPVisionConfig, CLIPVisionModel
...
>>> # Initializing a CLIPVisionConfig with openai/clip-vit-base-patch32 style configuration
>>> configuration = CLIPVisionConfig()
...
>>> # Initializing a CLIPVisionModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
>>> model = CLIPVisionModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
Source code in mindnlp\transformers\models\clip\configuration_clip.py
class CLIPVisionConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`CLIPVisionModel`]. It is used to instantiate a
    CLIP vision encoder according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the vision encoder of the CLIP
    [openai/clip-vit-base-patch32](https://hf-mirror.com/openai/clip-vit-base-patch32) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        intermediate_size (`int`, *optional*, defaults to 3072):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
        projection_dim (`int`, *optional*, defaults to 512):
            Dimensionality of text and vision projection layers.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        num_channels (`int`, *optional*, defaults to 3):
            The number of input channels.
        image_size (`int`, *optional*, defaults to 224):
            The size (resolution) of each image.
        patch_size (`int`, *optional*, defaults to 32):
            The size (resolution) of each patch.
        hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"selu"`, `"gelu_new"` and `"quick_gelu"` are supported.
        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon used by the layer normalization layers.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        initializer_factor (`float`, *optional*, defaults to 1.0):
            A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
            testing).

    Example:
        ```python
        >>> from transformers import CLIPVisionConfig, CLIPVisionModel
        ...
        >>> # Initializing a CLIPVisionConfig with openai/clip-vit-base-patch32 style configuration
        >>> configuration = CLIPVisionConfig()
        ...
        >>> # Initializing a CLIPVisionModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
        >>> model = CLIPVisionModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "clip_vision_model"

    def __init__(
        self,
        hidden_size=768,
        intermediate_size=3072,
        projection_dim=512,
        num_hidden_layers=12,
        num_attention_heads=12,
        num_channels=3,
        image_size=224,
        patch_size=32,
        hidden_act="quick_gelu",
        layer_norm_eps=1e-5,
        attention_dropout=0.0,
        initializer_range=0.02,
        initializer_factor=1.0,
        **kwargs,
    ):
        """
        Initialize a CLIPVisionConfig object with the provided configuration parameters.

        Args:
            hidden_size (int): The size of the hidden layers in the network.
            intermediate_size (int): The size of the intermediate hidden layers in the network.
            projection_dim (int): The dimension of the projected embeddings.
            num_hidden_layers (int): The number of hidden layers in the network.
            num_attention_heads (int): The number of attention heads in the network.
            num_channels (int): The number of channels in the input image.
            image_size (int): The size of the input image.
            patch_size (int): The size of the image patch used in the network.
            hidden_act (str): The activation function used in the hidden layers.
            layer_norm_eps (float): The epsilon value for layer normalization.
            attention_dropout (float): The dropout rate for attention layers.
            initializer_range (float): The range for parameter initialization.
            initializer_factor (float): The factor for parameter initialization.

        Returns:
            None.

        Raises:
            ValueError: If any of the input parameters are invalid or out of range.
        """
        super().__init__(**kwargs)

        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.projection_dim = projection_dim
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.num_channels = num_channels
        self.patch_size = patch_size
        self.image_size = image_size
        self.initializer_range = initializer_range
        self.initializer_factor = initializer_factor
        self.attention_dropout = attention_dropout
        self.layer_norm_eps = layer_norm_eps
        self.hidden_act = hidden_act

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
        """
        Load a pretrained configuration from a given model name or path.

        Args:
            cls (class): The class object.
            pretrained_model_name_or_path (Union[str, os.PathLike]): The name or path of the pretrained model.
                It can be either a string representing the name of the model or a path-like object pointing to the model location.

        Returns:
            PretrainedConfig: The loaded pretrained configuration.

        Raises:
            None.

        This method is a class method that allows loading a pretrained configuration. It takes in the class object 'cls'
        and the name or path of the pretrained model 'pretrained_model_name_or_path' as parameters. The method returns an instance
        of type 'PretrainedConfig', which represents the loaded pretrained configuration.

        The 'pretrained_model_name_or_path' parameter can be either a string representing the name of the pretrained model
        or a path-like object pointing to the location of the model. It is used to identify and locate the pretrained model
        that needs to be loaded.

        Note: If the loaded configuration belongs to the 'clip' model type, the 'config_dict' will be updated to use the
        'vision_config' sub-dictionary. Additionally, if the 'model_type' attribute is present in the 'cls' class and
        the loaded configuration's 'model_type' is different from 'cls.model_type', a warning will be logged indicating
        that instantiating a model of different types may lead to errors.

        Example:
            ```python
            >>> config = CLIPVisionConfig.from_pretrained("clip_model")
            ...
            ```
            In the above example, the 'from_pretrained' method is called on the 'CLIPVisionConfig' class to load the pretrained
            configuration of the 'clip_model'. The resulting configuration is stored in the 'config' variable.
        """
        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

        # get the vision config dict if we are loading from CLIPConfig
        if config_dict.get("model_type") == "clip":
            config_dict = config_dict["vision_config"]

        if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
            logger.warning(
                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
                f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
            )

        return cls.from_dict(config_dict, **kwargs)

mindnlp.transformers.models.clip.configuration_clip.CLIPVisionConfig.__init__(hidden_size=768, intermediate_size=3072, projection_dim=512, num_hidden_layers=12, num_attention_heads=12, num_channels=3, image_size=224, patch_size=32, hidden_act='quick_gelu', layer_norm_eps=1e-05, attention_dropout=0.0, initializer_range=0.02, initializer_factor=1.0, **kwargs)

Initialize a CLIPVisionConfig object with the provided configuration parameters.

PARAMETER DESCRIPTION
hidden_size

The size of the hidden layers in the network.

TYPE: int DEFAULT: 768

intermediate_size

The size of the intermediate hidden layers in the network.

TYPE: int DEFAULT: 3072

projection_dim

The dimension of the projected embeddings.

TYPE: int DEFAULT: 512

num_hidden_layers

The number of hidden layers in the network.

TYPE: int DEFAULT: 12

num_attention_heads

The number of attention heads in the network.

TYPE: int DEFAULT: 12

num_channels

The number of channels in the input image.

TYPE: int DEFAULT: 3

image_size

The size of the input image.

TYPE: int DEFAULT: 224

patch_size

The size of the image patch used in the network.

TYPE: int DEFAULT: 32

hidden_act

The activation function used in the hidden layers.

TYPE: str DEFAULT: 'quick_gelu'

layer_norm_eps

The epsilon value for layer normalization.

TYPE: float DEFAULT: 1e-05

attention_dropout

The dropout rate for attention layers.

TYPE: float DEFAULT: 0.0

initializer_range

The range for parameter initialization.

TYPE: float DEFAULT: 0.02

initializer_factor

The factor for parameter initialization.

TYPE: float DEFAULT: 1.0

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If any of the input parameters are invalid or out of range.

Source code in mindnlp\transformers\models\clip\configuration_clip.py
def __init__(
    self,
    hidden_size=768,
    intermediate_size=3072,
    projection_dim=512,
    num_hidden_layers=12,
    num_attention_heads=12,
    num_channels=3,
    image_size=224,
    patch_size=32,
    hidden_act="quick_gelu",
    layer_norm_eps=1e-5,
    attention_dropout=0.0,
    initializer_range=0.02,
    initializer_factor=1.0,
    **kwargs,
):
    """
    Initialize a CLIPVisionConfig object with the provided configuration parameters.

    Args:
        hidden_size (int): The size of the hidden layers in the network.
        intermediate_size (int): The size of the intermediate hidden layers in the network.
        projection_dim (int): The dimension of the projected embeddings.
        num_hidden_layers (int): The number of hidden layers in the network.
        num_attention_heads (int): The number of attention heads in the network.
        num_channels (int): The number of channels in the input image.
        image_size (int): The size of the input image.
        patch_size (int): The size of the image patch used in the network.
        hidden_act (str): The activation function used in the hidden layers.
        layer_norm_eps (float): The epsilon value for layer normalization.
        attention_dropout (float): The dropout rate for attention layers.
        initializer_range (float): The range for parameter initialization.
        initializer_factor (float): The factor for parameter initialization.

    Returns:
        None.

    Raises:
        ValueError: If any of the input parameters are invalid or out of range.
    """
    super().__init__(**kwargs)

    self.hidden_size = hidden_size
    self.intermediate_size = intermediate_size
    self.projection_dim = projection_dim
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.num_channels = num_channels
    self.patch_size = patch_size
    self.image_size = image_size
    self.initializer_range = initializer_range
    self.initializer_factor = initializer_factor
    self.attention_dropout = attention_dropout
    self.layer_norm_eps = layer_norm_eps
    self.hidden_act = hidden_act
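With the defaults above (`image_size=224`, `patch_size=32`), the vision encoder splits each image into non-overlapping patches, as in the standard CLIP ViT. The resulting sequence length is plain arithmetic and needs no mindnlp import:

```python
# Patch grid implied by the CLIPVisionConfig defaults above.
image_size, patch_size = 224, 32

patches_per_side = image_size // patch_size   # 224 / 32 = 7 patches per side
num_patches = patches_per_side ** 2           # 7 * 7 = 49 image patches
seq_len = num_patches + 1                     # +1 for the class embedding token

print(patches_per_side, num_patches, seq_len)  # 7 49 50
```

Changing `patch_size` (e.g. to 16, as in the patch16 checkpoints) quadruples `num_patches` for the same image size, which is why patch size is the main lever on the vision encoder's sequence length.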

mindnlp.transformers.models.clip.configuration_clip.CLIPVisionConfig.from_pretrained(pretrained_model_name_or_path, **kwargs) classmethod

Load a pretrained configuration from a given model name or path.

PARAMETER DESCRIPTION
cls

The class object.

TYPE: class

pretrained_model_name_or_path

The name or path of the pretrained model. It can be either a string representing the name of the model or a path-like object pointing to the model location.

TYPE: Union[str, PathLike]

RETURNS DESCRIPTION
PretrainedConfig

The loaded pretrained configuration.

TYPE: PretrainedConfig

This method is a class method that allows loading a pretrained configuration. It takes in the class object 'cls' and the name or path of the pretrained model 'pretrained_model_name_or_path' as parameters. The method returns an instance of type 'PretrainedConfig', which represents the loaded pretrained configuration.

The 'pretrained_model_name_or_path' parameter can be either a string representing the name of the pretrained model or a path-like object pointing to the location of the model. It is used to identify and locate the pretrained model that needs to be loaded.

Note: If the loaded configuration belongs to the 'clip' model type, the 'config_dict' will be updated to use the 'vision_config' sub-dictionary. Additionally, if the 'model_type' attribute is present in the 'cls' class and the loaded configuration's 'model_type' is different from 'cls.model_type', a warning will be logged indicating that instantiating a model of different types may lead to errors.

Example

>>> config = CLIPVisionConfig.from_pretrained("clip_model")
...
In the above example, the 'from_pretrained' method is called on the 'CLIPVisionConfig' class to load the pretrained configuration of the 'clip_model'. The resulting configuration is stored in the 'config' variable.

Source code in mindnlp\transformers\models\clip\configuration_clip.py
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
    """
    Load a pretrained configuration from a given model name or path.

    Args:
        cls (class): The class object.
        pretrained_model_name_or_path (Union[str, os.PathLike]): The name or path of the pretrained model.
            It can be either a string representing the name of the model or a path-like object pointing to the model location.

    Returns:
        PretrainedConfig: The loaded pretrained configuration.

    Raises:
        None.

    This method is a class method that allows loading a pretrained configuration. It takes in the class object 'cls'
    and the name or path of the pretrained model 'pretrained_model_name_or_path' as parameters. The method returns an instance
    of type 'PretrainedConfig', which represents the loaded pretrained configuration.

    The 'pretrained_model_name_or_path' parameter can be either a string representing the name of the pretrained model
    or a path-like object pointing to the location of the model. It is used to identify and locate the pretrained model
    that needs to be loaded.

    Note: If the loaded configuration belongs to the 'clip' model type, the 'config_dict' will be updated to use the
    'vision_config' sub-dictionary. Additionally, if the 'model_type' attribute is present in the 'cls' class and
    the loaded configuration's 'model_type' is different from 'cls.model_type', a warning will be logged indicating
    that instantiating a model of different types may lead to errors.

    Example:
        ```python
        >>> config = CLIPVisionConfig.from_pretrained("clip_model")
        ...
        ```
        In the above example, the 'from_pretrained' method is called on the 'CLIPVisionConfig' class to load the pretrained
        configuration of the 'clip_model'. The resulting configuration is stored in the 'config' variable.
    """
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

    # get the vision config dict if we are loading from CLIPConfig
    if config_dict.get("model_type") == "clip":
        config_dict = config_dict["vision_config"]

    if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
        logger.warning(
            f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
            f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
        )

    return cls.from_dict(config_dict, **kwargs)

mindnlp.transformers.models.clip.image_processing_clip.CLIPImageProcessor

Bases: BaseImageProcessor

Constructs a CLIP image processor.

PARAMETER DESCRIPTION
do_resize

Whether to resize the image's (height, width) dimensions to the specified size. Can be overridden by do_resize in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

size

Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio. Can be overridden by size in the preprocess method.

TYPE: `Dict[str, int]`, *optional*, defaults to `{"shortest_edge": 224}` DEFAULT: None

resample

Resampling filter to use if resizing the image. Can be overridden by resample in the preprocess method.

TYPE: `PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC` DEFAULT: BICUBIC

do_center_crop

Whether to center crop the image to the specified crop_size. Can be overridden by do_center_crop in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

crop_size

Size of the output image after applying center_crop. Can be overridden by crop_size in the preprocess method.

TYPE: `Dict[str, int]` *optional*, defaults to 224 DEFAULT: None

do_rescale

Whether to rescale the image by the specified scale rescale_factor. Can be overridden by do_rescale in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

rescale_factor

Scale factor to use if rescaling the image. Can be overridden by rescale_factor in the preprocess method.

TYPE: `int` or `float`, *optional*, defaults to `1/255` DEFAULT: 1 / 255

do_normalize

Whether to normalize the image. Can be overridden by do_normalize in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

image_mean

Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.

TYPE: `float` or `List[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]` DEFAULT: None

image_std

Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.

TYPE: `float` or `List[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]` DEFAULT: None

do_convert_rgb

Whether to convert the image to RGB.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True
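The parameters above describe a fixed pipeline: resize the shortest edge, center crop, rescale to [0, 1], then normalize per channel. The sketch below walks through those stages in NumPy; the real processor resizes with PIL bicubic interpolation, so the nearest-neighbour stand-in here is only to keep the example dependency-light, not a faithful reproduction:

```python
import numpy as np

# Channel means / stds from the CLIP defaults documented above.
MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
STD = np.array([0.26862954, 0.26130258, 0.27577711])

def preprocess(image: np.ndarray, size: int = 224) -> np.ndarray:
    """image: uint8 array of shape (H, W, 3); returns float array (3, size, size)."""
    h, w, _ = image.shape
    # 1. Resize so the shortest edge equals `size`, preserving aspect ratio
    #    (nearest-neighbour here; CLIPImageProcessor uses bicubic).
    scale = size / min(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    rows = (np.arange(new_h) * h / new_h).astype(int)
    cols = (np.arange(new_w) * w / new_w).astype(int)
    resized = image[rows][:, cols]
    # 2. Center crop to (size, size).
    top, left = (new_h - size) // 2, (new_w - size) // 2
    cropped = resized[top:top + size, left:left + size]
    # 3. Rescale by 1/255, then normalize per channel.
    pixels = cropped.astype(np.float32) / 255.0
    normalized = (pixels - MEAN) / STD
    # 4. Move channels first, matching the (C, H, W) layout models expect.
    return normalized.transpose(2, 0, 1)

out = preprocess(np.zeros((480, 640, 3), dtype=np.uint8))
print(out.shape)  # (3, 224, 224)
```

Each stage corresponds to one `do_*` flag above; setting any flag to `False` in `preprocess` simply skips that stage.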

Source code in mindnlp\transformers\models\clip\image_processing_clip.py
class CLIPImageProcessor(BaseImageProcessor):
    r"""
    Constructs a CLIP image processor.

    Args:
        do_resize (`bool`, *optional*, defaults to `True`):
            Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by
            `do_resize` in the `preprocess` method.
        size (`Dict[str, int]` *optional*, defaults to `{"shortest_edge": 224}`):
            Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with
            the longest edge resized to keep the input aspect ratio. Can be overridden by `size` in the `preprocess`
            method.
        resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
            Resampling filter to use if resizing the image. Can be overridden by `resample` in the `preprocess` method.
        do_center_crop (`bool`, *optional*, defaults to `True`):
            Whether to center crop the image to the specified `crop_size`. Can be overridden by `do_center_crop` in the
            `preprocess` method.
        crop_size (`Dict[str, int]` *optional*, defaults to 224):
            Size of the output image after applying `center_crop`. Can be overridden by `crop_size` in the `preprocess`
            method.
        do_rescale (`bool`, *optional*, defaults to `True`):
            Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by `do_rescale` in
            the `preprocess` method.
        rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
            Scale factor to use if rescaling the image. Can be overridden by `rescale_factor` in the `preprocess`
            method.
        do_normalize (`bool`, *optional*, defaults to `True`):
            Whether to normalize the image. Can be overridden by `do_normalize` in the `preprocess` method.
        image_mean (`float` or `List[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`):
            Mean to use if normalizing the image. This is a float or list of floats the length of the number of
            channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
        image_std (`float` or `List[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`):
            Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
            number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
        do_convert_rgb (`bool`, *optional*, defaults to `True`):
            Whether to convert the image to RGB.
    """
    model_input_names = ["pixel_values"]

    def __init__(
        self,
        do_resize: bool = True,
        size: Dict[str, int] = None,
        resample: PILImageResampling = PILImageResampling.BICUBIC,
        do_center_crop: bool = True,
        crop_size: Dict[str, int] = None,
        do_rescale: bool = True,
        rescale_factor: Union[int, float] = 1 / 255,
        do_normalize: bool = True,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_convert_rgb: bool = True,
        **kwargs,
    ) -> None:
        """
        Initializes a CLIPImageProcessor object.

        Args:
            self: The CLIPImageProcessor object itself.
            do_resize (bool): A flag indicating whether to resize the image. Defaults to True.
            size (Dict[str, int]): A dictionary containing the size of the image. Defaults to None.
            resample (PILImageResampling): The resampling method for resizing the image. Defaults to PILImageResampling.BICUBIC.
            do_center_crop (bool): A flag indicating whether to perform center cropping. Defaults to True.
            crop_size (Dict[str, int]): A dictionary containing the size for cropping. Defaults to None.
            do_rescale (bool): A flag indicating whether to rescale the image. Defaults to True.
            rescale_factor (Union[int, float]): The factor by which to rescale the image. Defaults to 1 / 255.
            do_normalize (bool): A flag indicating whether to normalize the image. Defaults to True.
            image_mean (Optional[Union[float, List[float]]]): The mean value for image normalization. Defaults to None.
            image_std (Optional[Union[float, List[float]]]): The standard deviation for image normalization. Defaults to None.
            do_convert_rgb (bool): A flag indicating whether to convert the image to RGB format. Defaults to True.

        Returns:
            None.

        Raises:
            None specified.
        """
        super().__init__(**kwargs)
        size = size if size is not None else {"shortest_edge": 224}
        size = get_size_dict(size, default_to_square=False)
        crop_size = crop_size if crop_size is not None else {"height": 224, "width": 224}
        crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size")

        self.do_resize = do_resize
        self.size = size
        self.resample = resample
        self.do_center_crop = do_center_crop
        self.crop_size = crop_size
        self.do_rescale = do_rescale
        self.rescale_factor = rescale_factor
        self.do_normalize = do_normalize
        self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
        self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
        self.do_convert_rgb = do_convert_rgb
        self._valid_processor_keys = [
            "images",
            "do_resize",
            "size",
            "resample",
            "do_center_crop",
            "crop_size",
            "do_rescale",
            "rescale_factor",
            "do_normalize",
            "image_mean",
            "image_std",
            "do_convert_rgb",
            "return_tensors",
            "data_format",
            "input_data_format",
        ]

        # for backwards compatibility of KOSMOS-2
        if "use_square_size" in kwargs:
            self.size = {"height": size["shortest_edge"], "width": size["shortest_edge"]}
            delattr(self, "use_square_size")

    def resize(
        self,
        image: np.ndarray,
        size: Dict[str, int],
        resample: PILImageResampling = PILImageResampling.BICUBIC,
        data_format: Optional[Union[str, ChannelDimension]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        **kwargs,
    ) -> np.ndarray:
        """
        Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge
        resized to keep the input aspect ratio.

        Args:
            image (`np.ndarray`):
                Image to resize.
            size (`Dict[str, int]`):
                Size of the output image.
            resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
                Resampling filter to use when resizing the image.
            data_format (`str` or `ChannelDimension`, *optional*):
                The channel dimension format of the image. If not provided, it will be the same as the input image.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format of the input image. If not provided, it will be inferred.
        """
        default_to_square = True
        if "shortest_edge" in size:
            size = size["shortest_edge"]
            default_to_square = False
        elif "height" in size and "width" in size:
            size = (size["height"], size["width"])
        else:
            raise ValueError("Size must contain either 'shortest_edge' or 'height' and 'width'.")

        output_size = get_resize_output_image_size(
            image,
            size=size,
            default_to_square=default_to_square,
            input_data_format=input_data_format,
        )
        return resize(
            image,
            size=output_size,
            resample=resample,
            data_format=data_format,
            input_data_format=input_data_format,
            **kwargs,
        )

    def preprocess(
        self,
        images: ImageInput,
        do_resize: bool = None,
        size: Dict[str, int] = None,
        resample: PILImageResampling = None,
        do_center_crop: bool = None,
        crop_size: int = None,
        do_rescale: bool = None,
        rescale_factor: float = None,
        do_normalize: bool = None,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_convert_rgb: bool = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        **kwargs,
    ) -> BatchFeature:
        """
        Preprocess an image or batch of images.

        Args:
            images (`ImageInput`):
                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
            do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                Whether to resize the image.
            size (`Dict[str, int]`, *optional*, defaults to `self.size`):
                Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with
                the longest edge resized to keep the input aspect ratio.
            resample (`int`, *optional*, defaults to `self.resample`):
                Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
                has an effect if `do_resize` is set to `True`.
            do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`):
                Whether to center crop the image.
            crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`):
                Size of the center crop. Only has an effect if `do_center_crop` is set to `True`.
            do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
                Whether to rescale the image.
            rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
                Rescale factor to rescale the image by if `do_rescale` is set to `True`.
            do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
                Whether to normalize the image.
            image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
                Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
            image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
                Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
                `True`.
            do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
                Whether to convert the image to RGB.
            return_tensors (`str` or `TensorType`, *optional*):
                The type of tensors to return. Can be one of:

                - Unset: Return a list of `np.ndarray`.
                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
            data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
                The channel dimension format for the output image. Can be one of:

                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - Unset: Use the channel dimension format of the input image.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the input image. If unset, the channel dimension format is inferred
                from the input image. Can be one of:
                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
        """
        do_resize = do_resize if do_resize is not None else self.do_resize
        size = size if size is not None else self.size
        size = get_size_dict(size, param_name="size", default_to_square=False)
        resample = resample if resample is not None else self.resample
        do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop
        crop_size = crop_size if crop_size is not None else self.crop_size
        crop_size = get_size_dict(crop_size, param_name="crop_size", default_to_square=True)
        do_rescale = do_rescale if do_rescale is not None else self.do_rescale
        rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
        do_normalize = do_normalize if do_normalize is not None else self.do_normalize
        image_mean = image_mean if image_mean is not None else self.image_mean
        image_std = image_std if image_std is not None else self.image_std
        do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
        validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self._valid_processor_keys)
        images = make_list_of_images(images)
        if not valid_images(images):
            raise ValueError(
                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
                "torch.Tensor, tf.Tensor or jax.ndarray."
            )
        validate_preprocess_arguments(
            do_rescale=do_rescale,
            rescale_factor=rescale_factor,
            do_normalize=do_normalize,
            image_mean=image_mean,
            image_std=image_std,
            do_center_crop=do_center_crop,
            crop_size=crop_size,
            do_resize=do_resize,
            size=size,
            resample=resample,
        )

        if do_convert_rgb:
            images = [convert_to_rgb(image) for image in images]

        # All transformations expect numpy arrays.
        images = [to_numpy_array(image) for image in images]

        if is_scaled_image(images[0]) and do_rescale:
            logger.warning_once(
                "It looks like you are trying to rescale already rescaled images. If the input"
                " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
            )

        if input_data_format is None:
            # We assume that all images have the same channel dimension format.
            input_data_format = infer_channel_dimension_format(images[0])

        if do_resize:
            images = [
                self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format)
                for image in images
            ]

        if do_center_crop:
            images = [
                self.center_crop(image=image, size=crop_size, input_data_format=input_data_format) for image in images
            ]

        if do_rescale:
            images = [
                self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
                for image in images
            ]

        if do_normalize:
            images = [
                self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
                for image in images
            ]

        images = [
            to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
        ]

        data = {"pixel_values": images}
        return BatchFeature(data=data, tensor_type=return_tensors)
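The `resize` method above keeps the aspect ratio when `size` contains `"shortest_edge"`: the shorter side is scaled to that value and the longer side follows proportionally. A self-contained sketch of that output-size computation (the function name is ours; the real logic lives in `get_resize_output_image_size`):

```python
def shortest_edge_output_size(height: int, width: int, shortest_edge: int = 224) -> tuple:
    # Scale so the shorter side equals `shortest_edge`, preserving aspect ratio.
    short, long = (height, width) if height <= width else (width, height)
    new_long = int(round(long * shortest_edge / short))
    return (shortest_edge, new_long) if height <= width else (new_long, shortest_edge)

print(shortest_edge_output_size(480, 640))  # (224, 299)
print(shortest_edge_output_size(640, 480))  # (299, 224)
```

After this resize, the default 224x224 center crop trims the longer side back to a square.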

mindnlp.transformers.models.clip.image_processing_clip.CLIPImageProcessor.__init__(do_resize=True, size=None, resample=PILImageResampling.BICUBIC, do_center_crop=True, crop_size=None, do_rescale=True, rescale_factor=1 / 255, do_normalize=True, image_mean=None, image_std=None, do_convert_rgb=True, **kwargs)

Initializes a CLIPImageProcessor object.

PARAMETER DESCRIPTION
self

The CLIPImageProcessor object itself.

do_resize

A flag indicating whether to resize the image. Defaults to True.

TYPE: bool DEFAULT: True

size

A dictionary containing the size of the image. Defaults to None.

TYPE: Dict[str, int] DEFAULT: None

resample

The resampling method for resizing the image. Defaults to PILImageResampling.BICUBIC.

TYPE: PILImageResampling DEFAULT: BICUBIC

do_center_crop

A flag indicating whether to perform center cropping. Defaults to True.

TYPE: bool DEFAULT: True

crop_size

A dictionary containing the size for cropping. Defaults to None.

TYPE: Dict[str, int] DEFAULT: None

do_rescale

A flag indicating whether to rescale the image. Defaults to True.

TYPE: bool DEFAULT: True

rescale_factor

The factor by which to rescale the image. Defaults to 1 / 255.

TYPE: Union[int, float] DEFAULT: 1 / 255

do_normalize

A flag indicating whether to normalize the image. Defaults to True.

TYPE: bool DEFAULT: True

image_mean

The mean value for image normalization. Defaults to None.

TYPE: Optional[Union[float, List[float]]] DEFAULT: None

image_std

The standard deviation for image normalization. Defaults to None.

TYPE: Optional[Union[float, List[float]]] DEFAULT: None

do_convert_rgb

A flag indicating whether to convert the image to RGB format. Defaults to True.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
None

None.
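A small sketch of the constructor's size-defaulting behavior documented above: when `size` or `crop_size` is `None`, they fall back to a 224 shortest edge and a 224x224 square, respectively (the helper name `resolve_sizes` is ours for illustration).

```python
def resolve_sizes(size=None, crop_size=None):
    # Mirrors the fallback logic in CLIPImageProcessor.__init__.
    size = size if size is not None else {"shortest_edge": 224}
    crop_size = crop_size if crop_size is not None else {"height": 224, "width": 224}
    return size, crop_size

print(resolve_sizes())
# ({'shortest_edge': 224}, {'height': 224, 'width': 224})
print(resolve_sizes(size={"shortest_edge": 288}))
# ({'shortest_edge': 288}, {'height': 224, 'width': 224})
```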

Source code in mindnlp\transformers\models\clip\image_processing_clip.py
def __init__(
    self,
    do_resize: bool = True,
    size: Dict[str, int] = None,
    resample: PILImageResampling = PILImageResampling.BICUBIC,
    do_center_crop: bool = True,
    crop_size: Dict[str, int] = None,
    do_rescale: bool = True,
    rescale_factor: Union[int, float] = 1 / 255,
    do_normalize: bool = True,
    image_mean: Optional[Union[float, List[float]]] = None,
    image_std: Optional[Union[float, List[float]]] = None,
    do_convert_rgb: bool = True,
    **kwargs,
) -> None:
    """
    Initializes a CLIPImageProcessor object.

    Args:
        self: The CLIPImageProcessor object itself.
        do_resize (bool): A flag indicating whether to resize the image. Defaults to True.
        size (Dict[str, int]): A dictionary containing the size of the image. Defaults to None.
        resample (PILImageResampling): The resampling method for resizing the image. Defaults to PILImageResampling.BICUBIC.
        do_center_crop (bool): A flag indicating whether to perform center cropping. Defaults to True.
        crop_size (Dict[str, int]): A dictionary containing the size for cropping. Defaults to None.
        do_rescale (bool): A flag indicating whether to rescale the image. Defaults to True.
        rescale_factor (Union[int, float]): The factor by which to rescale the image. Defaults to 1 / 255.
        do_normalize (bool): A flag indicating whether to normalize the image. Defaults to True.
        image_mean (Optional[Union[float, List[float]]]): The mean value for image normalization. Defaults to None.
        image_std (Optional[Union[float, List[float]]]): The standard deviation for image normalization. Defaults to None.
        do_convert_rgb (bool): A flag indicating whether to convert the image to RGB format. Defaults to True.

    Returns:
        None.

    Raises:
        None specified.
    """
    super().__init__(**kwargs)
    size = size if size is not None else {"shortest_edge": 224}
    size = get_size_dict(size, default_to_square=False)
    crop_size = crop_size if crop_size is not None else {"height": 224, "width": 224}
    crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size")

    self.do_resize = do_resize
    self.size = size
    self.resample = resample
    self.do_center_crop = do_center_crop
    self.crop_size = crop_size
    self.do_rescale = do_rescale
    self.rescale_factor = rescale_factor
    self.do_normalize = do_normalize
    self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
    self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
    self.do_convert_rgb = do_convert_rgb
    self._valid_processor_keys = [
        "images",
        "do_resize",
        "size",
        "resample",
        "do_center_crop",
        "crop_size",
        "do_rescale",
        "rescale_factor",
        "do_normalize",
        "image_mean",
        "image_std",
        "do_convert_rgb",
        "return_tensors",
        "data_format",
        "input_data_format",
    ]

    # for backwards compatibility of KOSMOS-2
    if "use_square_size" in kwargs:
        self.size = {"height": size["shortest_edge"], "width": size["shortest_edge"]}
        delattr(self, "use_square_size")

mindnlp.transformers.models.clip.image_processing_clip.CLIPImageProcessor.preprocess(images, do_resize=None, size=None, resample=None, do_center_crop=None, crop_size=None, do_rescale=None, rescale_factor=None, do_normalize=None, image_mean=None, image_std=None, do_convert_rgb=None, return_tensors=None, data_format=ChannelDimension.FIRST, input_data_format=None, **kwargs)

Preprocess an image or batch of images.

PARAMETER DESCRIPTION
images

Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.

TYPE: `ImageInput`

do_resize

Whether to resize the image.

TYPE: `bool`, *optional*, defaults to `self.do_resize` DEFAULT: None

size

Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio.

TYPE: `Dict[str, int]`, *optional*, defaults to `self.size` DEFAULT: None

resample

Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.

TYPE: `int`, *optional*, defaults to `self.resample` DEFAULT: None

do_center_crop

Whether to center crop the image.

TYPE: `bool`, *optional*, defaults to `self.do_center_crop` DEFAULT: None

crop_size

Size of the center crop. Only has an effect if do_center_crop is set to True.

TYPE: `Dict[str, int]`, *optional*, defaults to `self.crop_size` DEFAULT: None

do_rescale

Whether to rescale the image.

TYPE: `bool`, *optional*, defaults to `self.do_rescale` DEFAULT: None

rescale_factor

Rescale factor to rescale the image by if do_rescale is set to True.

TYPE: `float`, *optional*, defaults to `self.rescale_factor` DEFAULT: None

do_normalize

Whether to normalize the image.

TYPE: `bool`, *optional*, defaults to `self.do_normalize` DEFAULT: None

image_mean

Image mean to use for normalization. Only has an effect if do_normalize is set to True.

TYPE: `float` or `List[float]`, *optional*, defaults to `self.image_mean` DEFAULT: None

image_std

Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.

TYPE: `float` or `List[float]`, *optional*, defaults to `self.image_std` DEFAULT: None

do_convert_rgb

Whether to convert the image to RGB.

TYPE: `bool`, *optional*, defaults to `self.do_convert_rgb` DEFAULT: None

return_tensors

The type of tensors to return. Can be one of:

  • Unset: Return a list of np.ndarray.
  • TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
  • TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
  • TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
  • TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.

TYPE: `str` or `TensorType`, *optional* DEFAULT: None

data_format

The channel dimension format for the output image. Can be one of:

  • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  • Unset: Use the channel dimension format of the input image.

TYPE: `ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST` DEFAULT: FIRST

input_data_format

The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:

  • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  • "none" or ChannelDimension.NONE: image in (height, width) format.

TYPE: `ChannelDimension` or `str`, *optional* DEFAULT: None
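`preprocess` warns when it detects images that look already rescaled (float pixel values in [0, 1]) while `do_rescale` is still `True`. A hedged sketch of that check, assuming a simple dtype-and-range heuristic (the real `is_scaled_image` helper may differ in detail):

```python
import numpy as np

def is_scaled_image(image: np.ndarray) -> bool:
    # uint8 inputs are in [0, 255], so they are not yet rescaled.
    if image.dtype == np.uint8:
        return False
    # Float inputs entirely within [0, 1] look already rescaled.
    return float(image.min()) >= 0 and float(image.max()) <= 1

print(is_scaled_image(np.full((2, 2, 3), 0.5)))        # True
print(is_scaled_image(np.zeros((2, 2, 3), np.uint8)))  # False
```

When this check fires, pass `do_rescale=False` to avoid dividing by 255 twice.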

Source code in mindnlp\transformers\models\clip\image_processing_clip.py
def preprocess(
    self,
    images: ImageInput,
    do_resize: bool = None,
    size: Dict[str, int] = None,
    resample: PILImageResampling = None,
    do_center_crop: bool = None,
    crop_size: int = None,
    do_rescale: bool = None,
    rescale_factor: float = None,
    do_normalize: bool = None,
    image_mean: Optional[Union[float, List[float]]] = None,
    image_std: Optional[Union[float, List[float]]] = None,
    do_convert_rgb: bool = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
    data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
    **kwargs,
) -> BatchFeature:
    """
    Preprocess an image or batch of images.

    Args:
        images (`ImageInput`):
            Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
            passing in images with pixel values between 0 and 1, set `do_rescale=False`.
        do_resize (`bool`, *optional*, defaults to `self.do_resize`):
            Whether to resize the image.
        size (`Dict[str, int]`, *optional*, defaults to `self.size`):
            Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with
            the longest edge resized to keep the input aspect ratio.
        resample (`int`, *optional*, defaults to `self.resample`):
            Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
            has an effect if `do_resize` is set to `True`.
        do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`):
            Whether to center crop the image.
        crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`):
            Size of the center crop. Only has an effect if `do_center_crop` is set to `True`.
        do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
            Whether to rescale the image.
        rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
            Rescale factor to rescale the image by if `do_rescale` is set to `True`.
        do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
            Whether to normalize the image.
        image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
            Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
        image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
            Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
            `True`.
        do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
            Whether to convert the image to RGB.
        return_tensors (`str` or `TensorType`, *optional*):
            The type of tensors to return. Can be one of:

            - Unset: Return a list of `np.ndarray`.
            - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
            - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
            - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
            - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
        data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
            The channel dimension format for the output image. Can be one of:

            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - Unset: Use the channel dimension format of the input image.
        input_data_format (`ChannelDimension` or `str`, *optional*):
            The channel dimension format for the input image. If unset, the channel dimension format is inferred
            from the input image. Can be one of:
            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
    """
    do_resize = do_resize if do_resize is not None else self.do_resize
    size = size if size is not None else self.size
    size = get_size_dict(size, param_name="size", default_to_square=False)
    resample = resample if resample is not None else self.resample
    do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop
    crop_size = crop_size if crop_size is not None else self.crop_size
    crop_size = get_size_dict(crop_size, param_name="crop_size", default_to_square=True)
    do_rescale = do_rescale if do_rescale is not None else self.do_rescale
    rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
    do_normalize = do_normalize if do_normalize is not None else self.do_normalize
    image_mean = image_mean if image_mean is not None else self.image_mean
    image_std = image_std if image_std is not None else self.image_std
    do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
    validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self._valid_processor_keys)
    images = make_list_of_images(images)
    if not valid_images(images):
        raise ValueError(
            "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
            "torch.Tensor, tf.Tensor or jax.ndarray."
        )
    validate_preprocess_arguments(
        do_rescale=do_rescale,
        rescale_factor=rescale_factor,
        do_normalize=do_normalize,
        image_mean=image_mean,
        image_std=image_std,
        do_center_crop=do_center_crop,
        crop_size=crop_size,
        do_resize=do_resize,
        size=size,
        resample=resample,
    )

    if do_convert_rgb:
        images = [convert_to_rgb(image) for image in images]

    # All transformations expect numpy arrays.
    images = [to_numpy_array(image) for image in images]

    if is_scaled_image(images[0]) and do_rescale:
        logger.warning_once(
            "It looks like you are trying to rescale already rescaled images. If the input"
            " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
        )

    if input_data_format is None:
        # We assume that all images have the same channel dimension format.
        input_data_format = infer_channel_dimension_format(images[0])

    if do_resize:
        images = [
            self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format)
            for image in images
        ]

    if do_center_crop:
        images = [
            self.center_crop(image=image, size=crop_size, input_data_format=input_data_format) for image in images
        ]

    if do_rescale:
        images = [
            self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
            for image in images
        ]

    if do_normalize:
        images = [
            self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
            for image in images
        ]

    images = [
        to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
    ]

    data = {"pixel_values": images}
    return BatchFeature(data=data, tensor_type=return_tensors)
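
The transform order applied above (convert-to-RGB, resize, center-crop, rescale, normalize, channel reordering) can be sketched standalone for the last two steps. This is a minimal numpy sketch, assuming the standard OpenAI CLIP mean/std constants and a 224x224 input; the function name is hypothetical:

```python
import numpy as np

# Assumed CLIP normalization constants (per channel, RGB).
OPENAI_CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
OPENAI_CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711])

def preprocess_sketch(image: np.ndarray) -> np.ndarray:
    """image: (H, W, 3) uint8 array, already resized and cropped to 224x224."""
    pixels = image.astype(np.float32) * (1 / 255)            # rescale to [0, 1]
    pixels = (pixels - OPENAI_CLIP_MEAN) / OPENAI_CLIP_STD   # normalize per channel
    return pixels.transpose(2, 0, 1)                         # channels_last -> channels_first

out = preprocess_sketch(np.zeros((224, 224, 3), dtype=np.uint8))
print(out.shape)  # (3, 224, 224)
```

The real method performs the same arithmetic per image via `self.rescale` and `self.normalize`, honoring `input_data_format` instead of assuming channels-last.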

mindnlp.transformers.models.clip.image_processing_clip.CLIPImageProcessor.resize(image, size, resample=PILImageResampling.BICUBIC, data_format=None, input_data_format=None, **kwargs)

Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio.

PARAMETER DESCRIPTION
image

Image to resize.

TYPE: `np.ndarray`

size

Size of the output image.

TYPE: `Dict[str, int]`

resample

Resampling filter to use when resizing the image.

TYPE: `PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC` DEFAULT: BICUBIC

data_format

The channel dimension format of the image. If not provided, it will be the same as the input image.

TYPE: `str` or `ChannelDimension`, *optional* DEFAULT: None

input_data_format

The channel dimension format of the input image. If not provided, it will be inferred.

TYPE: `ChannelDimension` or `str`, *optional* DEFAULT: None

Source code in mindnlp\transformers\models\clip\image_processing_clip.py
def resize(
    self,
    image: np.ndarray,
    size: Dict[str, int],
    resample: PILImageResampling = PILImageResampling.BICUBIC,
    data_format: Optional[Union[str, ChannelDimension]] = None,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
    **kwargs,
) -> np.ndarray:
    """
    Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge
    resized to keep the input aspect ratio.

    Args:
        image (`np.ndarray`):
            Image to resize.
        size (`Dict[str, int]`):
            Size of the output image.
        resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
            Resampling filter to use when resizing the image.
        data_format (`str` or `ChannelDimension`, *optional*):
            The channel dimension format of the image. If not provided, it will be the same as the input image.
        input_data_format (`ChannelDimension` or `str`, *optional*):
            The channel dimension format of the input image. If not provided, it will be inferred.
    """
    default_to_square = True
    if "shortest_edge" in size:
        size = size["shortest_edge"]
        default_to_square = False
    elif "height" in size and "width" in size:
        size = (size["height"], size["width"])
    else:
        raise ValueError("Size must contain either 'shortest_edge' or 'height' and 'width'.")

    output_size = get_resize_output_image_size(
        image,
        size=size,
        default_to_square=default_to_square,
        input_data_format=input_data_format,
    )
    return resize(
        image,
        size=output_size,
        resample=resample,
        data_format=data_format,
        input_data_format=input_data_format,
        **kwargs,
    )
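
When `size` contains `"shortest_edge"`, the method delegates the output-size computation to `get_resize_output_image_size`. A minimal sketch of that shortest-edge computation follows; the function name and the use of `int` truncation for rounding are assumptions:

```python
def shortest_edge_output_size(height: int, width: int, shortest_edge: int) -> tuple:
    # Scale so the shorter side equals `shortest_edge`, preserving the
    # aspect ratio (the default_to_square=False path sketched above).
    short, long = (height, width) if height <= width else (width, height)
    new_long = int(long * shortest_edge / short)
    return (shortest_edge, new_long) if height <= width else (new_long, shortest_edge)

print(shortest_edge_output_size(480, 640, 224))  # (224, 298)
```

The `"height"`/`"width"` branch bypasses this and resizes to the exact requested dimensions.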

mindnlp.transformers.models.clip.modeling_clip.CLIPModel

Bases: CLIPPreTrainedModel

Source code in mindnlp\transformers\models\clip\modeling_clip.py
class CLIPModel(CLIPPreTrainedModel):
    config_class = CLIPConfig
    _no_split_modules = ["CLIPTextEmbeddings", "CLIPEncoderLayer", "CLIPVisionEmbeddings"]

    def __init__(self, config: CLIPConfig):
        super().__init__(config)

        if not isinstance(config.text_config, CLIPTextConfig):
            raise TypeError(
                "config.text_config is expected to be of type CLIPTextConfig but is of type"
                f" {type(config.text_config)}."
            )

        if not isinstance(config.vision_config, CLIPVisionConfig):
            raise TypeError(
                "config.vision_config is expected to be of type CLIPVisionConfig but is of type"
                f" {type(config.vision_config)}."
            )

        text_config = config.text_config
        vision_config = config.vision_config

        self.projection_dim = config.projection_dim
        self.text_embed_dim = text_config.hidden_size
        self.vision_embed_dim = vision_config.hidden_size

        text_model = CLIPTextModel._from_config(text_config, attn_implementation=config._attn_implementation)
        self.text_model = text_model.text_model

        vision_model = CLIPVisionModel._from_config(vision_config, attn_implementation=config._attn_implementation)
        self.vision_model = vision_model.vision_model

        self.visual_projection = nn.Linear(self.vision_embed_dim, self.projection_dim, bias=False)
        self.text_projection = nn.Linear(self.text_embed_dim, self.projection_dim, bias=False)
        self.logit_scale = nn.Parameter(mindspore.tensor(self.config.logit_scale_init_value))

        # Initialize weights and apply final processing
        self.post_init()

    def get_text_features(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> mindspore.Tensor:
        r"""
        Returns:
            text_features (`mindspore.Tensor` of shape `(batch_size, output_dim)`): The text embeddings obtained by
            applying the projection layer to the pooled output of [`CLIPTextModel`].

        Examples:

        ```python
        >>> from transformers import AutoTokenizer, CLIPModel

        >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

        >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="ms")
        >>> text_features = model.get_text_features(**inputs)
        ```"""
        # Use CLIP model's config for some fields (if specified) instead of those of vision & text components.
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        text_outputs = self.text_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        pooled_output = text_outputs[1]
        text_features = self.text_projection(pooled_output)

        return text_features

    def get_image_features(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> mindspore.Tensor:
        r"""
        Returns:
            image_features (`mindspore.Tensor` of shape `(batch_size, output_dim)`): The image embeddings obtained by
            applying the projection layer to the pooled output of [`CLIPVisionModel`].

        Examples:

        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, CLIPModel

        >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)

        >>> inputs = processor(images=image, return_tensors="ms")

        >>> image_features = model.get_image_features(**inputs)
        ```"""
        # Use CLIP model's config for some fields (if specified) instead of those of vision & text components.
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        vision_outputs = self.vision_model(
            pixel_values=pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        pooled_output = vision_outputs[1]  # pooled_output
        image_features = self.visual_projection(pooled_output)

        return image_features

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        pixel_values: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        return_loss: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CLIPOutput]:
        r"""
        Returns:

        Examples:

        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, CLIPModel

        >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)

        >>> inputs = processor(
        ...     text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="ms", padding=True
        ... )

        >>> outputs = model(**inputs)
        >>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
        >>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
        ```"""
        # Use CLIP model's config for some fields (if specified) instead of those of vision & text components.
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        vision_outputs = self.vision_model(
            pixel_values=pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        text_outputs = self.text_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        image_embeds = vision_outputs[1]
        image_embeds = self.visual_projection(image_embeds)

        text_embeds = text_outputs[1]
        text_embeds = self.text_projection(text_embeds)

        # normalized features
        image_embeds = image_embeds / ops.norm(image_embeds, p=2, dim=-1, keepdim=True)
        text_embeds = text_embeds / ops.norm(text_embeds, p=2, dim=-1, keepdim=True)

        # cosine similarity as logits
        logit_scale = self.logit_scale.exp()
        logits_per_text = ops.matmul(text_embeds, image_embeds.t()) * logit_scale
        logits_per_image = logits_per_text.t()

        loss = None
        if return_loss:
            loss = clip_loss(logits_per_text)

        if not return_dict:
            output = (logits_per_image, logits_per_text, text_embeds, image_embeds, text_outputs, vision_outputs)
            return ((loss,) + output) if loss is not None else output

        return CLIPOutput(
            loss=loss,
            logits_per_image=logits_per_image,
            logits_per_text=logits_per_text,
            text_embeds=text_embeds,
            image_embeds=image_embeds,
            text_model_output=text_outputs,
            vision_model_output=vision_outputs,
        )
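
The `clip_loss` function invoked in `forward` is referenced above but not shown in this listing. A numpy sketch of the symmetric contrastive loss it is assumed to compute (the helper names here are hypothetical):

```python
import numpy as np

def clip_loss_sketch(logits_per_text: np.ndarray) -> float:
    """Symmetric cross-entropy over the text-image similarity matrix.

    The target for row i is column i: each caption matches the image at the
    same batch index. The framework's `clip_loss` is assumed to compute the
    same quantity with mindspore ops.
    """
    def contrastive(logits: np.ndarray) -> float:
        shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
        idx = np.arange(len(logits))
        return float(-log_probs[idx, idx].mean())

    # Average the caption->image and image->caption directions.
    return (contrastive(logits_per_text) + contrastive(logits_per_text.T)) / 2.0
```

With perfectly aligned embeddings (large diagonal logits) the loss approaches zero; with uninformative logits it approaches `log(batch_size)`.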

mindnlp.transformers.models.clip.modeling_clip.CLIPModel.forward(input_ids=None, pixel_values=None, attention_mask=None, position_ids=None, return_loss=None, output_attentions=None, output_hidden_states=None, return_dict=None)

Examples:

>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, CLIPModel

>>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
>>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(
...     text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="ms", padding=True
... )

>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
Source code in mindnlp\transformers\models\clip\modeling_clip.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    pixel_values: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    return_loss: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, CLIPOutput]:
    r"""
    Returns:

    Examples:

    ```python
    >>> from PIL import Image
    >>> import requests
    >>> from transformers import AutoProcessor, CLIPModel

    >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

    >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    >>> image = Image.open(requests.get(url, stream=True).raw)

    >>> inputs = processor(
    ...     text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="ms", padding=True
    ... )

    >>> outputs = model(**inputs)
    >>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
    >>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
    ```"""
    # Use CLIP model's config for some fields (if specified) instead of those of vision & text components.
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    vision_outputs = self.vision_model(
        pixel_values=pixel_values,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    text_outputs = self.text_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    image_embeds = vision_outputs[1]
    image_embeds = self.visual_projection(image_embeds)

    text_embeds = text_outputs[1]
    text_embeds = self.text_projection(text_embeds)

    # normalized features
    image_embeds = image_embeds / ops.norm(image_embeds, p=2, dim=-1, keepdim=True)
    text_embeds = text_embeds / ops.norm(text_embeds, p=2, dim=-1, keepdim=True)

    # cosine similarity as logits
    logit_scale = self.logit_scale.exp()
    logits_per_text = ops.matmul(text_embeds, image_embeds.t()) * logit_scale
    logits_per_image = logits_per_text.t()

    loss = None
    if return_loss:
        loss = clip_loss(logits_per_text)

    if not return_dict:
        output = (logits_per_image, logits_per_text, text_embeds, image_embeds, text_outputs, vision_outputs)
        return ((loss,) + output) if loss is not None else output

    return CLIPOutput(
        loss=loss,
        logits_per_image=logits_per_image,
        logits_per_text=logits_per_text,
        text_embeds=text_embeds,
        image_embeds=image_embeds,
        text_model_output=text_outputs,
        vision_model_output=vision_outputs,
    )
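
A note on `logit_scale.exp()` in the forward pass above: the parameter stores a log-temperature, so the default `logit_scale_init_value` of 2.6592 corresponds to a multiplicative scale of about 1/0.07, the temperature from the original CLIP implementation:

```python
import math

# The stored parameter is log(1 / temperature); exponentiating it recovers
# the scale applied to the cosine similarities before the softmax.
scale = math.exp(2.6592)
print(round(scale, 2))  # ~14.29, i.e. 1 / 0.07
```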

mindnlp.transformers.models.clip.modeling_clip.CLIPModel.get_image_features(pixel_values=None, output_attentions=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
image_features

The image embeddings obtained by applying the projection layer to the pooled output of [CLIPVisionModel].

TYPE: `mindspore.Tensor` of shape `(batch_size, output_dim)`

>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, CLIPModel

>>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
>>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(images=image, return_tensors="ms")

>>> image_features = model.get_image_features(**inputs)
Source code in mindnlp\transformers\models\clip\modeling_clip.py
def get_image_features(
    self,
    pixel_values: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> mindspore.Tensor:
    r"""
    Returns:
        image_features (`mindspore.Tensor` of shape `(batch_size, output_dim)`): The image embeddings obtained by
        applying the projection layer to the pooled output of [`CLIPVisionModel`].

    Examples:

    ```python
    >>> from PIL import Image
    >>> import requests
    >>> from transformers import AutoProcessor, CLIPModel

    >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

    >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    >>> image = Image.open(requests.get(url, stream=True).raw)

    >>> inputs = processor(images=image, return_tensors="ms")

    >>> image_features = model.get_image_features(**inputs)
    ```"""
    # Use CLIP model's config for some fields (if specified) instead of those of vision & text components.
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    vision_outputs = self.vision_model(
        pixel_values=pixel_values,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    pooled_output = vision_outputs[1]  # pooled_output
    image_features = self.visual_projection(pooled_output)

    return image_features

mindnlp.transformers.models.clip.modeling_clip.CLIPModel.get_text_features(input_ids=None, attention_mask=None, position_ids=None, output_attentions=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
text_features

The text embeddings obtained by applying the projection layer to the pooled output of [CLIPTextModel].

TYPE: `mindspore.Tensor` of shape `(batch_size, output_dim)`

>>> from transformers import AutoTokenizer, CLIPModel

>>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
>>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

>>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="ms")
>>> text_features = model.get_text_features(**inputs)
Source code in mindnlp\transformers\models\clip\modeling_clip.py
def get_text_features(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> mindspore.Tensor:
    r"""
    Returns:
        text_features (`mindspore.Tensor` of shape `(batch_size, output_dim)`): The text embeddings obtained by
        applying the projection layer to the pooled output of [`CLIPTextModel`].

    Examples:

    ```python
    >>> from transformers import AutoTokenizer, CLIPModel

    >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

    >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="ms")
    >>> text_features = model.get_text_features(**inputs)
    ```"""
    # Use CLIP model's config for some fields (if specified) instead of those of vision & text components.
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    text_outputs = self.text_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    pooled_output = text_outputs[1]
    text_features = self.text_projection(pooled_output)

    return text_features

mindnlp.transformers.models.clip.modeling_clip.CLIPPreTrainedModel

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in mindnlp\transformers\models\clip\modeling_clip.py
class CLIPPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """

    config_class = CLIPConfig
    base_model_prefix = "clip"
    supports_gradient_checkpointing = True
    _supports_sdpa = True
    _supports_flash_attn_2 = True

    def _init_weights(self, module):
        """Initialize the weights"""
        factor = self.config.initializer_factor
        if isinstance(module, CLIPTextEmbeddings):
            nn.init.normal_(module.token_embedding.weight, mean=0.0, std=factor * 0.02)
            nn.init.normal_(module.position_embedding.weight, mean=0.0, std=factor * 0.02)
        elif isinstance(module, CLIPVisionEmbeddings):
            factor = self.config.initializer_factor
            nn.init.normal_(module.class_embedding, mean=0.0, std=module.embed_dim**-0.5 * factor)
            nn.init.normal_(module.patch_embedding.weight, std=module.config.initializer_range * factor)
            nn.init.normal_(module.position_embedding.weight, std=module.config.initializer_range * factor)
        elif isinstance(module, CLIPAttention):
            factor = self.config.initializer_factor
            in_proj_std = (module.embed_dim**-0.5) * ((2 * module.config.num_hidden_layers) ** -0.5) * factor
            out_proj_std = (module.embed_dim**-0.5) * factor
            nn.init.normal_(module.q_proj.weight, std=in_proj_std)
            nn.init.normal_(module.k_proj.weight, std=in_proj_std)
            nn.init.normal_(module.v_proj.weight, std=in_proj_std)
            nn.init.normal_(module.out_proj.weight, std=out_proj_std)
        elif isinstance(module, CLIPMLP):
            factor = self.config.initializer_factor
            in_proj_std = (module.config.hidden_size**-0.5) * ((2 * module.config.num_hidden_layers) ** -0.5) * factor
            fc_std = (2 * module.config.hidden_size) ** -0.5 * factor
            nn.init.normal_(module.fc1.weight, std=fc_std)
            nn.init.normal_(module.fc2.weight, std=in_proj_std)
        elif isinstance(module, CLIPModel):
            nn.init.normal_(
                module.text_projection.weight,
                std=module.text_embed_dim**-0.5 * self.config.initializer_factor,
            )
            nn.init.normal_(
                module.visual_projection.weight,
                std=module.vision_embed_dim**-0.5 * self.config.initializer_factor,
            )
        elif isinstance(module, CLIPVisionModelWithProjection):
            nn.init.normal_(
                module.visual_projection.weight,
                std=self.config.hidden_size**-0.5 * self.config.initializer_factor,
            )
        elif isinstance(module, CLIPTextModelWithProjection):
            nn.init.normal_(
                module.text_projection.weight,
                std=self.config.hidden_size**-0.5 * self.config.initializer_factor,
            )
        elif isinstance(module, CLIPForImageClassification):
            nn.init.normal_(
                module.classifier.weight,
                std=self.config.vision_config.hidden_size**-0.5 * self.config.initializer_factor,
            )

        if isinstance(module, nn.LayerNorm):
            nn.init.zeros_(module.bias)
            nn.init.ones_(module.weight)
        if isinstance(module, nn.Linear) and module.bias is not None:
            nn.init.zeros_(module.bias)

mindnlp.transformers.models.clip.modeling_clip.CLIPTextModel

Bases: CLIPPreTrainedModel

Source code in mindnlp\transformers\models\clip\modeling_clip.py
class CLIPTextModel(CLIPPreTrainedModel):
    config_class = CLIPTextConfig

    _no_split_modules = ["CLIPTextEmbeddings", "CLIPEncoderLayer"]

    def __init__(self, config: CLIPTextConfig):
        super().__init__(config)
        self.text_model = CLIPTextTransformer(config)
        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self) -> nn.Module:
        return self.text_model.embeddings.token_embedding

    def set_input_embeddings(self, value):
        self.text_model.embeddings.token_embedding = value

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithPooling]:
        r"""
        Returns:

        Examples:

        ```python
        >>> from transformers import AutoTokenizer, CLIPTextModel

        >>> model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
        >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

        >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="ms")

        >>> outputs = model(**inputs)
        >>> last_hidden_state = outputs.last_hidden_state
        >>> pooled_output = outputs.pooler_output  # pooled (EOS token) states
        ```"""
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        return self.text_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

mindnlp.transformers.models.clip.modeling_clip.CLIPTextModel.forward(input_ids=None, attention_mask=None, position_ids=None, output_attentions=None, output_hidden_states=None, return_dict=None)

Examples:

```python
>>> from transformers import AutoTokenizer, CLIPTextModel

>>> model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
>>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

>>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="ms")

>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> pooled_output = outputs.pooler_output  # pooled (EOS token) states
```
Source code in mindnlp\transformers\models\clip\modeling_clip.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPooling]:
    r"""
    Returns:

    Examples:

    ```python
    >>> from transformers import AutoTokenizer, CLIPTextModel

    >>> model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
    >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

    >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="ms")

    >>> outputs = model(**inputs)
    >>> last_hidden_state = outputs.last_hidden_state
    >>> pooled_output = outputs.pooler_output  # pooled (EOS token) states
    ```"""
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    return self.text_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
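
The `pooler_output` returned above is the hidden state at each sequence's EOS token. A minimal plain-Python sketch of that gather (the helper name `pool_eos_state` and the toy values are illustrative, not part of the mindnlp API):

```python
def pool_eos_state(last_hidden_state, input_ids, eos_token_id):
    """Gather the hidden state at each sequence's first EOS token."""
    pooled = []
    for row_ids, row_states in zip(input_ids, last_hidden_state):
        # index of the first EOS token in this row
        pooled.append(row_states[row_ids.index(eos_token_id)])
    return pooled

# toy batch: 2 sequences of length 3, hidden size 2; token id 2 plays EOS
hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]],
]
ids = [[5, 7, 2], [5, 2, 0]]
pooled = pool_eos_state(hidden, ids, eos_token_id=2)  # [[0.5, 0.6], [1.2, 1.3]]
```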

mindnlp.transformers.models.clip.modeling_clip.CLIPTextModelWithProjection

Bases: CLIPPreTrainedModel

Source code in mindnlp\transformers\models\clip\modeling_clip.py
class CLIPTextModelWithProjection(CLIPPreTrainedModel):
    config_class = CLIPTextConfig

    _no_split_modules = ["CLIPTextEmbeddings", "CLIPEncoderLayer"]

    def __init__(self, config: CLIPTextConfig):
        super().__init__(config)

        text_model = CLIPTextModel._from_config(config, attn_implementation=config._attn_implementation)
        self.text_model = text_model.text_model

        self.text_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self) -> nn.Module:
        return self.text_model.embeddings.token_embedding

    def set_input_embeddings(self, value):
        self.text_model.embeddings.token_embedding = value

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CLIPTextModelOutput]:
        r"""
        Returns:

        Examples:

        ```python
        >>> from transformers import AutoTokenizer, CLIPTextModelWithProjection

        >>> model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
        >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

        >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="ms")

        >>> outputs = model(**inputs)
        >>> text_embeds = outputs.text_embeds
        ```"""
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        text_outputs = self.text_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        pooled_output = text_outputs[1]

        text_embeds = self.text_projection(pooled_output)

        if not return_dict:
            outputs = (text_embeds, text_outputs[0]) + text_outputs[2:]
            return tuple(output for output in outputs if output is not None)

        return CLIPTextModelOutput(
            text_embeds=text_embeds,
            last_hidden_state=text_outputs.last_hidden_state,
            hidden_states=text_outputs.hidden_states,
            attentions=text_outputs.attentions,
        )

mindnlp.transformers.models.clip.modeling_clip.CLIPTextModelWithProjection.forward(input_ids=None, attention_mask=None, position_ids=None, output_attentions=None, output_hidden_states=None, return_dict=None)

Examples:

```python
>>> from transformers import AutoTokenizer, CLIPTextModelWithProjection

>>> model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
>>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

>>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="ms")

>>> outputs = model(**inputs)
>>> text_embeds = outputs.text_embeds
```
Source code in mindnlp\transformers\models\clip\modeling_clip.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, CLIPTextModelOutput]:
    r"""
    Returns:

    Examples:

    ```python
    >>> from transformers import AutoTokenizer, CLIPTextModelWithProjection

    >>> model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
    >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")

    >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="ms")

    >>> outputs = model(**inputs)
    >>> text_embeds = outputs.text_embeds
    ```"""
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    text_outputs = self.text_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    pooled_output = text_outputs[1]

    text_embeds = self.text_projection(pooled_output)

    if not return_dict:
        outputs = (text_embeds, text_outputs[0]) + text_outputs[2:]
        return tuple(output for output in outputs if output is not None)

    return CLIPTextModelOutput(
        text_embeds=text_embeds,
        last_hidden_state=text_outputs.last_hidden_state,
        hidden_states=text_outputs.hidden_states,
        attentions=text_outputs.attentions,
    )
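
The bias-free `text_projection` above is just a matrix product mapping the pooled state from `hidden_size` to `projection_dim`; downstream, CLIP typically L2-normalizes the result before computing similarities. A small sketch under those assumptions (toy weights, plain Python, not the mindspore implementation):

```python
import math

def project(pooled, weight):
    # bias-free linear layer: out[j] = sum_k pooled[k] * weight[j][k]
    return [sum(p * w for p, w in zip(pooled, row)) for row in weight]

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

pooled = [1.0, 2.0]                             # hidden_size = 2
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # projection_dim = 3
text_embed = l2_normalize(project(pooled, weight))
```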

mindnlp.transformers.models.clip.modeling_clip.CLIPVisionModel

Bases: CLIPPreTrainedModel

Source code in mindnlp\transformers\models\clip\modeling_clip.py
class CLIPVisionModel(CLIPPreTrainedModel):
    config_class = CLIPVisionConfig
    main_input_name = "pixel_values"
    _no_split_modules = ["CLIPEncoderLayer"]

    def __init__(self, config: CLIPVisionConfig):
        super().__init__(config)
        self.vision_model = CLIPVisionTransformer(config)
        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self) -> nn.Module:
        return self.vision_model.embeddings.patch_embedding

    def forward(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithPooling]:
        r"""
        Returns:

        Examples:

        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, CLIPVisionModel

        >>> model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)

        >>> inputs = processor(images=image, return_tensors="ms")

        >>> outputs = model(**inputs)
        >>> last_hidden_state = outputs.last_hidden_state
        >>> pooled_output = outputs.pooler_output  # pooled CLS states
        ```"""
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        return self.vision_model(
            pixel_values=pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

mindnlp.transformers.models.clip.modeling_clip.CLIPVisionModel.forward(pixel_values=None, output_attentions=None, output_hidden_states=None, return_dict=None)

Examples:

```python
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, CLIPVisionModel

>>> model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
>>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(images=image, return_tensors="ms")

>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> pooled_output = outputs.pooler_output  # pooled CLS states
```
Source code in mindnlp\transformers\models\clip\modeling_clip.py
def forward(
    self,
    pixel_values: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPooling]:
    r"""
    Returns:

    Examples:

    ```python
    >>> from PIL import Image
    >>> import requests
    >>> from transformers import AutoProcessor, CLIPVisionModel

    >>> model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
    >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

    >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    >>> image = Image.open(requests.get(url, stream=True).raw)

    >>> inputs = processor(images=image, return_tensors="ms")

    >>> outputs = model(**inputs)
    >>> last_hidden_state = outputs.last_hidden_state
    >>> pooled_output = outputs.pooler_output  # pooled CLS states
    ```"""
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    return self.vision_model(
        pixel_values=pixel_values,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
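
For the vision tower, `pooler_output` is derived from the leading [CLS] token of `last_hidden_state` (the actual model also applies a final LayerNorm, omitted in this simplified sketch; the helper name is illustrative):

```python
def pool_cls_state(last_hidden_state):
    # CLIP's vision pooler keeps the hidden state of the leading [CLS] token
    return [row[0] for row in last_hidden_state]

hidden = [  # batch of 2, sequence = [CLS] + 2 patch tokens, hidden size 2
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 0.0], [1.0, 1.0]],
]
pooled = pool_cls_state(hidden)  # [[1.0, 2.0], [7.0, 8.0]]
```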

mindnlp.transformers.models.clip.modeling_clip.CLIPVisionModelWithProjection

Bases: CLIPPreTrainedModel

Source code in mindnlp\transformers\models\clip\modeling_clip.py
class CLIPVisionModelWithProjection(CLIPPreTrainedModel):
    config_class = CLIPVisionConfig
    main_input_name = "pixel_values"

    def __init__(self, config: CLIPVisionConfig):
        super().__init__(config)

        vision_model = CLIPVisionModel._from_config(config, attn_implementation=config._attn_implementation)
        self.vision_model = vision_model.vision_model

        self.visual_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self) -> nn.Module:
        return self.vision_model.embeddings.patch_embedding

    def forward(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CLIPVisionModelOutput]:
        r"""
        Returns:

        Examples:

        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, CLIPVisionModelWithProjection

        >>> model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
        >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)

        >>> inputs = processor(images=image, return_tensors="ms")

        >>> outputs = model(**inputs)
        >>> image_embeds = outputs.image_embeds
        ```"""
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        vision_outputs = self.vision_model(
            pixel_values=pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        pooled_output = vision_outputs[1]  # pooled_output

        image_embeds = self.visual_projection(pooled_output)

        if not return_dict:
            outputs = (image_embeds, vision_outputs[0]) + vision_outputs[2:]
            return tuple(output for output in outputs if output is not None)

        return CLIPVisionModelOutput(
            image_embeds=image_embeds,
            last_hidden_state=vision_outputs.last_hidden_state,
            hidden_states=vision_outputs.hidden_states,
            attentions=vision_outputs.attentions,
        )

mindnlp.transformers.models.clip.modeling_clip.CLIPVisionModelWithProjection.forward(pixel_values=None, output_attentions=None, output_hidden_states=None, return_dict=None)

Examples:

```python
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, CLIPVisionModelWithProjection

>>> model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
>>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(images=image, return_tensors="ms")

>>> outputs = model(**inputs)
>>> image_embeds = outputs.image_embeds
```
Source code in mindnlp\transformers\models\clip\modeling_clip.py
def forward(
    self,
    pixel_values: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, CLIPVisionModelOutput]:
    r"""
    Returns:

    Examples:

    ```python
    >>> from PIL import Image
    >>> import requests
    >>> from transformers import AutoProcessor, CLIPVisionModelWithProjection

    >>> model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
    >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

    >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    >>> image = Image.open(requests.get(url, stream=True).raw)

    >>> inputs = processor(images=image, return_tensors="ms")

    >>> outputs = model(**inputs)
    >>> image_embeds = outputs.image_embeds
    ```"""
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    vision_outputs = self.vision_model(
        pixel_values=pixel_values,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    pooled_output = vision_outputs[1]  # pooled_output

    image_embeds = self.visual_projection(pooled_output)

    if not return_dict:
        outputs = (image_embeds, vision_outputs[0]) + vision_outputs[2:]
        return tuple(output for output in outputs if output is not None)

    return CLIPVisionModelOutput(
        image_embeds=image_embeds,
        last_hidden_state=vision_outputs.last_hidden_state,
        hidden_states=vision_outputs.hidden_states,
        attentions=vision_outputs.attentions,
    )
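
Once both towers produce projected embeddings, CLIP scores image-text pairs by scaled cosine similarity; the temperature is `exp(logit_scale)`, and `logit_scale_init_value = 2.6592` corresponds to the original 1/0.07. A hedged sketch with toy vectors (plain Python, not the mindspore implementation):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# exp(logit_scale_init_value) recovers CLIP's initial temperature of 1/0.07
logit_scale = math.exp(2.6592)
image_embed = [1.0, 0.0]
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
logits_per_image = [logit_scale * cosine(image_embed, t) for t in text_embeds]
```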

mindnlp.transformers.models.clip.modeling_clip.CLIPForImageClassification

Bases: CLIPPreTrainedModel

Source code in mindnlp\transformers\models\clip\modeling_clip.py
class CLIPForImageClassification(CLIPPreTrainedModel):
    main_input_name = "pixel_values"

    def __init__(self, config: CLIPConfig) -> None:
        super().__init__(config)

        self.num_labels = config.num_labels
        vision_model = CLIPVisionModel._from_config(
            config.vision_config, attn_implementation=config._attn_implementation
        )
        self.vision_model = vision_model.vision_model

        # Classifier head
        self.classifier = (
            nn.Linear(config.vision_config.hidden_size, config.num_labels) if config.num_labels > 0 else nn.Identity()
        )

        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        labels: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[tuple, ImageClassifierOutput]:
        r"""
        labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss); if
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.vision_model(
            pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        sequence_output = outputs[0]

        # average pool the patch tokens
        sequence_output = ops.mean(sequence_output[:, 1:, :], dim=1)
        # apply classifier
        logits = self.classifier(sequence_output)

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(logits, labels)

        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return ImageClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

mindnlp.transformers.models.clip.modeling_clip.CLIPForImageClassification.forward(pixel_values=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None)

labels (mindspore.Tensor of shape (batch_size,), optional): Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss); if config.num_labels > 1 a classification loss is computed (Cross-Entropy).

Source code in mindnlp\transformers\models\clip\modeling_clip.py
def forward(
    self,
    pixel_values: Optional[mindspore.Tensor] = None,
    labels: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[tuple, ImageClassifierOutput]:
    r"""
    labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
        Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
        config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss); if
        `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
    """
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    outputs = self.vision_model(
        pixel_values,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    sequence_output = outputs[0]

    # average pool the patch tokens
    sequence_output = ops.mean(sequence_output[:, 1:, :], dim=1)
    # apply classifier
    logits = self.classifier(sequence_output)

    loss = None
    if labels is not None:
        if self.config.problem_type is None:
            if self.num_labels == 1:
                self.config.problem_type = "regression"
            elif self.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                self.config.problem_type = "single_label_classification"
            else:
                self.config.problem_type = "multi_label_classification"

        if self.config.problem_type == "regression":
            loss_fct = MSELoss()
            if self.num_labels == 1:
                loss = loss_fct(logits.squeeze(), labels.squeeze())
            else:
                loss = loss_fct(logits, labels)
        elif self.config.problem_type == "single_label_classification":
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        elif self.config.problem_type == "multi_label_classification":
            loss_fct = BCEWithLogitsLoss()
            loss = loss_fct(logits, labels)

    if not return_dict:
        output = (logits,) + outputs[2:]
        return ((loss,) + output) if loss is not None else output

    return ImageClassifierOutput(
        loss=loss,
        logits=logits,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )
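
`CLIPForImageClassification.forward` drops the [CLS] token, mean-pools the patch tokens, and then picks a loss according to `config.problem_type`. Both steps can be sketched in plain Python (the label dtype is passed as a string here purely for illustration):

```python
def mean_pool_patches(sequence_output):
    # drop the leading CLS token, then average the remaining patch tokens
    patches = sequence_output[1:]
    dim = len(patches[0])
    return [sum(tok[d] for tok in patches) / len(patches) for d in range(dim)]

def infer_problem_type(num_labels, labels_dtype):
    # mirrors the branching in CLIPForImageClassification.forward
    if num_labels == 1:
        return "regression"
    if num_labels > 1 and labels_dtype in ("int64", "int32"):
        return "single_label_classification"
    return "multi_label_classification"

# toy sequence: CLS + two patch tokens, hidden size 2
pooled = mean_pool_patches([[9.0, 9.0], [1.0, 2.0], [3.0, 4.0]])  # [2.0, 3.0]
```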

mindnlp.transformers.models.clip.processing_clip.CLIPProcessor

Bases: ProcessorMixin

Constructs a CLIP processor which wraps a CLIP image processor and a CLIP tokenizer into a single processor.

[CLIPProcessor] offers all the functionalities of [CLIPImageProcessor] and [CLIPTokenizerFast]. See the [~CLIPProcessor.__call__] and [~CLIPProcessor.decode] for more information.

PARAMETER DESCRIPTION
image_processor

The image processor is a required input.

TYPE: [`CLIPImageProcessor`], *optional* DEFAULT: None

tokenizer

The tokenizer is a required input.

TYPE: [`CLIPTokenizerFast`], *optional* DEFAULT: None
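
The processor's `__call__` routes `text` to the tokenizer, `images` to the image processor, and merges the two outputs when both are given. Its branching can be sketched with stand-in encoders (the placeholder outputs below are illustrative, not real tokenizer or image-processor results):

```python
def clip_process(text=None, images=None):
    # mirrors the control flow of CLIPProcessor.__call__
    if text is None and images is None:
        raise ValueError("You have to specify either text or images. Both cannot be none.")
    encoding = {}
    if text is not None:
        encoding["input_ids"] = [[101, 102]]   # placeholder tokenizer output
    if images is not None:
        pixel_values = [[0.5]]                 # placeholder image-processor output
        if text is not None:
            encoding["pixel_values"] = pixel_values
        else:
            return {"pixel_values": pixel_values}
    return encoding
```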

Source code in mindnlp\transformers\models\clip\processing_clip.py
class CLIPProcessor(ProcessorMixin):
    r"""
    Constructs a CLIP processor which wraps a CLIP image processor and a CLIP tokenizer into a single processor.

    [`CLIPProcessor`] offers all the functionalities of [`CLIPImageProcessor`] and [`CLIPTokenizerFast`]. See the
    [`~CLIPProcessor.__call__`] and [`~CLIPProcessor.decode`] for more information.

    Args:
        image_processor ([`CLIPImageProcessor`], *optional*):
            The image processor is a required input.
        tokenizer ([`CLIPTokenizerFast`], *optional*):
            The tokenizer is a required input.
    """
    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "CLIPImageProcessor"
    tokenizer_class = ("CLIPTokenizer", "CLIPTokenizerFast")

    def __init__(self, image_processor=None, tokenizer=None, **kwargs):
        """
        Initialize a CLIPProcessor object.

        Args:
            self (object): The instance of the class.
            image_processor (object, optional): An image processor object used for processing images. 
                If not provided, it can be passed as part of the kwargs parameter.
            tokenizer (object): A tokenizer object used for tokenizing text inputs.

        Returns:
            None.

        Raises:
            ValueError: If either `image_processor` or `tokenizer` is not specified.
            FutureWarning: If the deprecated argument `feature_extractor` is used,
                a warning is issued recommending to use `image_processor` instead.
        """
        feature_extractor = None
        if "feature_extractor" in kwargs:
            warnings.warn(
                "The `feature_extractor` argument is deprecated and will be removed in v5, use `image_processor`"
                " instead.",
                FutureWarning,
            )
            feature_extractor = kwargs.pop("feature_extractor")

        image_processor = image_processor if image_processor is not None else feature_extractor
        if image_processor is None:
            raise ValueError("You need to specify an `image_processor`.")
        if tokenizer is None:
            raise ValueError("You need to specify a `tokenizer`.")

        super().__init__(image_processor, tokenizer)

    def __call__(self, text=None, images=None, return_tensors=None, **kwargs):
        """
        Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the `text`
        and `kwargs` arguments to CLIPTokenizerFast's [`~CLIPTokenizerFast.__call__`] if `text` is not `None` to encode
        the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
        CLIPImageProcessor's [`~CLIPImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
        of the above two methods for more information.

        Args:
            text (`str`, `List[str]`, `List[List[str]]`):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
                The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
                tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
                number of channels, H and W are image height and width.

            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors of a particular framework. Acceptable values are:

                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return NumPy `np.ndarray` objects.
                - `'jax'`: Return JAX `jnp.ndarray` objects.

        Returns:
            [`BatchEncoding`]: A [`BatchEncoding`] with the following fields:

                - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
                - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
                `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
                `None`).
                - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
        """
        if text is None and images is None:
            raise ValueError("You have to specify either text or images. Both cannot be none.")

        if text is not None:
            encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs)

        if images is not None:
            image_features = self.image_processor(images, return_tensors=return_tensors, **kwargs)

        if text is not None and images is not None:
            encoding["pixel_values"] = image_features.pixel_values
            return encoding
        elif text is not None:
            return encoding
        else:
            return BatchEncoding(data={**image_features}, tensor_type=return_tensors)

    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to CLIPTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
        refer to the docstring of this method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to CLIPTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
        the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)

    @property
    def model_input_names(self):
        """
        This method, 'model_input_names', is a property of the 'CLIPProcessor' class.
        It returns a list of unique model input names derived from the tokenizer and image processor model input names.

        Args:
            self: An instance of the 'CLIPProcessor' class.

        Returns:
            The method returns a list of unique model input names derived from the tokenizer and image processor model input names.

        Raises:
            No exceptions are explicitly raised by this method.
        """
        tokenizer_input_names = self.tokenizer.model_input_names
        image_processor_input_names = self.image_processor.model_input_names
        return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))

    @property
    def feature_extractor_class(self):
        """
        This method returns the image processor class used for extracting features in the CLIPProcessor class.

        Args:
            self: An instance of the CLIPProcessor class.

        Returns:
            The image processor class (`self.image_processor_class`).

        Raises:
            FutureWarning: If the method is called, a FutureWarning will be raised to inform the user that
                `feature_extractor_class` is deprecated. It is recommended to use
                `image_processor_class` instead.

        Note:
            The returned image processor class is responsible for extracting features from images in the CLIPProcessor.

        Example:
            ```python
            >>> clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
            >>> clip_processor.feature_extractor_class
            'CLIPImageProcessor'
            ```
        """
        warnings.warn(
            "`feature_extractor_class` is deprecated. Use `image_processor_class` instead.",
            FutureWarning,
        )
        return self.image_processor_class

    @property
    def feature_extractor(self):
        """
        This method is deprecated. Use `image_processor` instead.

        Args:
            self: An instance of the CLIPProcessor class.

        Returns:
            The image processor instance (`self.image_processor`).

        Raises:
            FutureWarning: This method raises a FutureWarning to alert users that it is deprecated.
        """
        warnings.warn(
            "`feature_extractor` is deprecated. Use `image_processor` instead.",
            FutureWarning,
        )
        return self.image_processor

mindnlp.transformers.models.clip.processing_clip.CLIPProcessor.feature_extractor property

This method is deprecated. Use image_processor instead.

PARAMETER DESCRIPTION
self

An instance of the CLIPProcessor class.

RETURNS DESCRIPTION

The image processor instance (self.image_processor).

RAISES DESCRIPTION
FutureWarning

This method raises a FutureWarning to alert users that it is deprecated.

mindnlp.transformers.models.clip.processing_clip.CLIPProcessor.feature_extractor_class property

This method returns the image processor class used for extracting features in the CLIPProcessor class.

PARAMETER DESCRIPTION
self

An instance of the CLIPProcessor class.

RETURNS DESCRIPTION

The image processor class (self.image_processor_class).

RAISES DESCRIPTION
FutureWarning

If the method is called, a FutureWarning will be raised to inform the user that feature_extractor_class is deprecated. It is recommended to use image_processor_class instead.

Note

The returned image processor class is responsible for extracting features from images in the CLIPProcessor.

Example
>>> clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
>>> clip_processor.feature_extractor_class
'CLIPImageProcessor'

mindnlp.transformers.models.clip.processing_clip.CLIPProcessor.model_input_names property

This method, 'model_input_names', is a property of the 'CLIPProcessor' class. It returns a list of unique model input names derived from the tokenizer and image processor model input names.

PARAMETER DESCRIPTION
self

An instance of the 'CLIPProcessor' class.

RETURNS DESCRIPTION

The method returns a list of unique model input names derived from the tokenizer and image processor model input names.
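The property merges the two name lists while preserving first-seen order. The deduplication idiom can be seen in isolation (the name lists below are typical CLIP inputs, shown purely for illustration):

```python
# Order-preserving union of input names, as used by model_input_names.
# The example name lists are illustrative, not read from a real tokenizer.
tokenizer_input_names = ["input_ids", "attention_mask"]
image_processor_input_names = ["pixel_values", "attention_mask"]

# dict.fromkeys keeps first-seen order and drops duplicates
# (dict preserves insertion order since Python 3.7).
merged = list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
print(merged)  # ['input_ids', 'attention_mask', 'pixel_values']
```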

mindnlp.transformers.models.clip.processing_clip.CLIPProcessor.__call__(text=None, images=None, return_tensors=None, **kwargs)

Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the text and kwargs arguments to CLIPTokenizerFast's [~CLIPTokenizerFast.__call__] if text is not None to encode the text. To prepare the image(s), this method forwards the images and kwargs arguments to CLIPImageProcessor's [~CLIPImageProcessor.__call__] if images is not None. Please refer to the docstring of the above two methods for more information.

PARAMETER DESCRIPTION
text

The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

TYPE: `str`, `List[str]`, `List[List[str]]` DEFAULT: None

images

The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a number of channels, H and W are image height and width.

TYPE: `PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]` DEFAULT: None

return_tensors

If set, will return tensors of a particular framework. Acceptable values are:

  • 'tf': Return TensorFlow tf.constant objects.
  • 'pt': Return PyTorch torch.Tensor objects.
  • 'np': Return NumPy np.ndarray objects.
  • 'jax': Return JAX jnp.ndarray objects.

TYPE: `str` or [`~utils.TensorType`], *optional* DEFAULT: None

RETURNS DESCRIPTION

[BatchEncoding]: A [BatchEncoding] with the following fields:

  • input_ids -- List of token ids to be fed to a model. Returned when text is not None.
  • attention_mask -- List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if "attention_mask" is in self.model_input_names and if text is not None).
  • pixel_values -- Pixel values to be fed to a model. Returned when images is not None.
Source code in mindnlp\transformers\models\clip\processing_clip.py
def __call__(self, text=None, images=None, return_tensors=None, **kwargs):
    """
    Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the `text`
    and `kwargs` arguments to CLIPTokenizerFast's [`~CLIPTokenizerFast.__call__`] if `text` is not `None` to encode
    the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
    CLIPImageProcessor's [`~CLIPImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
    of the above two methods for more information.

    Args:
        text (`str`, `List[str]`, `List[List[str]]`):
            The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
            (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
            `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
        images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
            The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
            tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
            number of channels, H and W are image height and width.

        return_tensors (`str` or [`~utils.TensorType`], *optional*):
            If set, will return tensors of a particular framework. Acceptable values are:

            - `'tf'`: Return TensorFlow `tf.constant` objects.
            - `'pt'`: Return PyTorch `torch.Tensor` objects.
            - `'np'`: Return NumPy `np.ndarray` objects.
            - `'jax'`: Return JAX `jnp.ndarray` objects.

    Returns:
        [`BatchEncoding`]: A [`BatchEncoding`] with the following fields:

            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
            `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
            `None`).
            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
    """
    if text is None and images is None:
        raise ValueError("You have to specify either text or images. Both cannot be none.")

    if text is not None:
        encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs)

    if images is not None:
        image_features = self.image_processor(images, return_tensors=return_tensors, **kwargs)

    if text is not None and images is not None:
        encoding["pixel_values"] = image_features.pixel_values
        return encoding
    elif text is not None:
        return encoding
    else:
        return BatchEncoding(data={**image_features}, tensor_type=return_tensors)
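The branching above can be exercised with stand-in components. The stub tokenizer and image processor below are illustrative placeholders, not the real CLIP classes; they only show how the three cases (text only, images only, both) produce the returned fields:

```python
# Illustrative sketch of how __call__ combines text and image features.
# The stubs stand in for CLIPTokenizerFast / CLIPImageProcessor.

def stub_tokenizer(text, **kwargs):
    # Pretend tokenization: one id per whitespace-separated token.
    return {"input_ids": list(range(len(text.split())))}

def stub_image_processor(images, **kwargs):
    # Pretend pixel extraction: one flat value list per image.
    return {"pixel_values": [[0.0] * 4 for _ in images]}

def call(text=None, images=None):
    if text is None and images is None:
        raise ValueError("You have to specify either text or images.")
    if text is not None:
        encoding = stub_tokenizer(text)
    if images is not None:
        image_features = stub_image_processor(images)
    if text is not None and images is not None:
        # Both modalities: pixel values are grafted onto the text encoding.
        encoding["pixel_values"] = image_features["pixel_values"]
        return encoding
    return encoding if text is not None else image_features

out = call(text="a photo of a cat", images=["img"])
print(sorted(out))  # ['input_ids', 'pixel_values']
```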

mindnlp.transformers.models.clip.processing_clip.CLIPProcessor.__init__(image_processor=None, tokenizer=None, **kwargs)

Initialize a CLIPProcessor object.

PARAMETER DESCRIPTION
self

The instance of the class.

TYPE: object

image_processor

An image processor object used for processing images. If not provided, it can be passed as part of the kwargs parameter.

TYPE: object DEFAULT: None

tokenizer

A tokenizer object used for tokenizing text inputs.

TYPE: object DEFAULT: None

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If either image_processor or tokenizer is not specified.

FutureWarning

If the deprecated argument feature_extractor is used, a warning is issued recommending to use image_processor instead.

Source code in mindnlp\transformers\models\clip\processing_clip.py
def __init__(self, image_processor=None, tokenizer=None, **kwargs):
    """
    Initialize a CLIPProcessor object.

    Args:
        self (object): The instance of the class.
        image_processor (object, optional): An image processor object used for processing images. 
            If not provided, it can be passed as part of the kwargs parameter.
        tokenizer (object): A tokenizer object used for tokenizing text inputs.

    Returns:
        None.

    Raises:
        ValueError: If either `image_processor` or `tokenizer` is not specified.
        FutureWarning: If the deprecated argument `feature_extractor` is used,
            a warning is issued recommending to use `image_processor` instead.
    """
    feature_extractor = None
    if "feature_extractor" in kwargs:
        warnings.warn(
            "The `feature_extractor` argument is deprecated and will be removed in v5, use `image_processor`"
            " instead.",
            FutureWarning,
        )
        feature_extractor = kwargs.pop("feature_extractor")

    image_processor = image_processor if image_processor is not None else feature_extractor
    if image_processor is None:
        raise ValueError("You need to specify an `image_processor`.")
    if tokenizer is None:
        raise ValueError("You need to specify a `tokenizer`.")

    super().__init__(image_processor, tokenizer)
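The precedence between the deprecated `feature_extractor` kwarg and `image_processor` can be isolated in a small sketch (the helper name `resolve_image_processor` is hypothetical, introduced only to illustrate the logic above):

```python
import warnings

def resolve_image_processor(image_processor=None, **kwargs):
    # Mirrors __init__'s precedence: the deprecated `feature_extractor`
    # kwarg only fills in when `image_processor` is absent.
    feature_extractor = kwargs.pop("feature_extractor", None)
    if feature_extractor is not None:
        warnings.warn(
            "`feature_extractor` is deprecated, use `image_processor` instead.",
            FutureWarning,
        )
    if image_processor is None:
        image_processor = feature_extractor
    if image_processor is None:
        raise ValueError("You need to specify an `image_processor`.")
    return image_processor

print(resolve_image_processor(image_processor="proc"))  # proc
```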

mindnlp.transformers.models.clip.processing_clip.CLIPProcessor.batch_decode(*args, **kwargs)

This method forwards all its arguments to CLIPTokenizerFast's [~PreTrainedTokenizer.batch_decode]. Please refer to the docstring of this method for more information.

Source code in mindnlp\transformers\models\clip\processing_clip.py
def batch_decode(self, *args, **kwargs):
    """
    This method forwards all its arguments to CLIPTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
    refer to the docstring of this method for more information.
    """
    return self.tokenizer.batch_decode(*args, **kwargs)

mindnlp.transformers.models.clip.processing_clip.CLIPProcessor.decode(*args, **kwargs)

This method forwards all its arguments to CLIPTokenizerFast's [~PreTrainedTokenizer.decode]. Please refer to the docstring of this method for more information.

Source code in mindnlp\transformers\models\clip\processing_clip.py
def decode(self, *args, **kwargs):
    """
    This method forwards all its arguments to CLIPTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
    the docstring of this method for more information.
    """
    return self.tokenizer.decode(*args, **kwargs)

mindnlp.transformers.models.clip.tokenization_clip.CLIPTokenizer

Bases: PreTrainedTokenizer

Construct a CLIP tokenizer. Based on byte-level Byte-Pair-Encoding.

This tokenizer inherits from [PreTrainedTokenizer] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

PARAMETER DESCRIPTION
vocab_file

Path to the vocabulary file.

TYPE: `str`

merges_file

Path to the merges file.

TYPE: `str`

errors

Paradigm to follow when decoding bytes to UTF-8. See bytes.decode for more information.

TYPE: `str`, *optional*, defaults to `"replace"` DEFAULT: 'replace'

unk_token

The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

TYPE: `str`, *optional*, defaults to `"<|endoftext|>"` DEFAULT: '<|endoftext|>'

bos_token

The beginning of sequence token.

TYPE: `str`, *optional*, defaults to `"<|startoftext|>"` DEFAULT: '<|startoftext|>'

eos_token

The end of sequence token.

TYPE: `str`, *optional*, defaults to `"<|endoftext|>"` DEFAULT: '<|endoftext|>'

pad_token

The token used for padding, for example when batching sequences of different lengths.

TYPE: `str`, *optional*, defaults to `"<|endoftext|>"` DEFAULT: '<|endoftext|>'

Source code in mindnlp\transformers\models\clip\tokenization_clip.py
class CLIPTokenizer(PreTrainedTokenizer):
    """
    Construct a CLIP tokenizer. Based on byte-level Byte-Pair-Encoding.

    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
        merges_file (`str`):
            Path to the merges file.
        errors (`str`, *optional*, defaults to `"replace"`):
            Paradigm to follow when decoding bytes to UTF-8. See
            [bytes.decode](https://docs.python.org/3/library/stdtypes.html#bytes.decode) for more information.
        unk_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        bos_token (`str`, *optional*, defaults to `"<|startoftext|>"`):
            The beginning of sequence token.
        eos_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
            The end of sequence token.
        pad_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
            The token used for padding, for example when batching sequences of different lengths.
    """
    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        vocab_file,
        merges_file,
        errors="replace",
        unk_token="<|endoftext|>",
        bos_token="<|startoftext|>",
        eos_token="<|endoftext|>",
        pad_token="<|endoftext|>",  # hack to enable padding
        **kwargs,
    ):
        """
        Initializes a CLIPTokenizer object.

        Args:
            self (object): The instance of the CLIPTokenizer class.
            vocab_file (str): The path to the vocabulary file containing token encodings.
            merges_file (str): The path to the file containing BPE merges for tokenization.
            errors (str, optional): The error handling strategy for text decoding. Defaults to 'replace'.
            unk_token (str, optional): The token to represent unknown words. Defaults to '<|endoftext|>'.
            bos_token (str, optional): The beginning of sequence token. Defaults to '<|startoftext|>'.
            eos_token (str, optional): The end of sequence token. Defaults to '<|endoftext|>'.
            pad_token (str, optional): The padding token. Defaults to '<|endoftext|>'.

        Returns:
            None.

        Raises:
            None. A missing `ftfy` package does not raise; a custom `BasicTokenizer` is used as a fallback.
        """
        bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
        eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
        unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
        try:
            import ftfy

            self.fix_text = ftfy.fix_text
        except ImportError:
            logger.info("ftfy or spacy is not installed, using custom BasicTokenizer instead of ftfy.")
            self.nlp = BasicTokenizer(strip_accents=False, do_split_on_punc=False)
            self.fix_text = None

        with open(vocab_file, encoding="utf-8") as vocab_handle:
            self.encoder = json.load(vocab_handle)
        self.decoder = {v: k for k, v in self.encoder.items()}
        self.errors = errors  # how to handle errors in decoding
        self.byte_encoder = bytes_to_unicode()
        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
        with open(merges_file, encoding="utf-8") as merges_handle:
            bpe_merges = merges_handle.read().strip().split("\n")[1 : 49152 - 256 - 2 + 1]
        bpe_merges = [tuple(merge.split()) for merge in bpe_merges]
        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
        self.cache = {"<|startoftext|>": "<|startoftext|>", "<|endoftext|>": "<|endoftext|>"}

        self.pat = re.compile(
            r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""",
            re.IGNORECASE,
        )

        super().__init__(
            errors=errors,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
            pad_token=pad_token,
            **kwargs,
        )

    @property
    def vocab_size(self):
        """
        Method to return the vocabulary size of the CLIPTokenizer instance.

        Args:
            self (CLIPTokenizer): The instance of the CLIPTokenizer class.
                This parameter refers to the current instance of the CLIPTokenizer for which the vocabulary size is to
                be calculated.

        Returns:
            int: The number of unique tokens in the vocabulary.
                The method returns an integer value representing the size of the vocabulary as the count of unique
                tokens stored in the encoder.

        Raises:
            None.
        """
        return len(self.encoder)

    def get_vocab(self):
        """
        Method to retrieve the vocabulary of the CLIPTokenizer instance.

        Args:
            self (CLIPTokenizer): The instance of the CLIPTokenizer class.
                Represents the current instance of the CLIPTokenizer.

        Returns:
            dict: A dictionary containing the combined vocabulary of the encoder and added_tokens_encoder.
                The vocabulary includes both the original encoder tokens and any additional tokens added to the tokenizer.

        Raises:
            None
        """
        return dict(self.encoder, **self.added_tokens_encoder)

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
        adding special tokens. A CLIP sequence has the following format:

        - single sequence: `<|startoftext|> X <|endoftext|>`

        Pairs of sequences are not the expected use case, but they will be handled without a separator.

        Args:
            token_ids_0 (`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
        """
        bos_token = [self.bos_token_id]
        eos_token = [self.eos_token_id]

        if token_ids_1 is None:
            return bos_token + token_ids_0 + eos_token
        return bos_token + token_ids_0 + eos_token + eos_token + token_ids_1 + eos_token

    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer `prepare_for_model` method.

        Args:
            token_ids_0 (`List[int]`):
                List of IDs.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
                Whether or not the token list is already formatted with special tokens for the model.

        Returns:
            `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
        """
        if already_has_special_tokens:
            return super().get_special_tokens_mask(
                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
            )

        if token_ids_1 is None:
            return [1] + ([0] * len(token_ids_0)) + [1]
        return [1] + ([0] * len(token_ids_0)) + [1] + [1] + ([0] * len(token_ids_1)) + [1]

    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Create a mask from the two sequences passed. CLIP does not make use of token type ids, therefore a list of
        zeros is returned.

        Args:
            token_ids_0 (`List[int]`):
                List of IDs.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of zeros.
        """
        bos_token = [self.bos_token_id]
        eos_token = [self.eos_token_id]

        if token_ids_1 is None:
            return len(bos_token + token_ids_0 + eos_token) * [0]
        return len(bos_token + token_ids_0 + eos_token + eos_token + token_ids_1 + eos_token) * [0]

    def bpe(self, token):
        """
        This method 'bpe' is defined in the class 'CLIPTokenizer'. It processes a given token using Byte Pair Encoding (BPE).

        Args:
            self: This parameter represents the instance of the class 'CLIPTokenizer'.
                It is used to access the attributes and methods of the class.
            token (str): The input token to be processed using Byte Pair Encoding (BPE).
                It should be a string representing a single token.

        Returns:
            str: The processed token after applying Byte Pair Encoding (BPE) algorithm.
                The token is modified based on the algorithm rules.

        Raises:
            None.
        """
        if token in self.cache:
            return self.cache[token]
        word = tuple(token[:-1]) + (token[-1] + "</w>",)
        pairs = get_pairs(word)

        if not pairs:
            return token + "</w>"

        while True:
            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
            if bigram not in self.bpe_ranks:
                break
            first, second = bigram
            new_word = []
            i = 0
            while i < len(word):
                try:
                    j = word.index(first, i)
                except ValueError:
                    new_word.extend(word[i:])
                    break
                else:
                    new_word.extend(word[i:j])
                    i = j

                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
                    new_word.append(first + second)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word = tuple(new_word)
            word = new_word
            if len(word) == 1:
                break
            pairs = get_pairs(word)
        word = " ".join(word)
        self.cache[token] = word
        return word

    def _tokenize(self, text):
        """Tokenize a string."""
        bpe_tokens = []
        if self.fix_text is None:
            text = " ".join(self.nlp.tokenize(text))
        else:
            text = whitespace_clean(self.fix_text(text)).lower()

        for token in re.findall(self.pat, text):
            token = "".join(
                self.byte_encoder[b] for b in token.encode("utf-8")
            )  # Maps all our bytes to unicode strings, avoiding control tokens of the BPE (spaces in our case)
            bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(" "))
        return bpe_tokens

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.encoder.get(token, self.encoder.get(self.unk_token))

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        return self.decoder.get(index)

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        text = "".join(tokens)
        byte_array = bytearray([self.byte_decoder[c] for c in text])
        text = byte_array.decode("utf-8", errors=self.errors).replace("</w>", " ").strip()
        return text

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        """
        Save the vocabulary to the specified directory with an optional filename prefix.

        Args:
            self (CLIPTokenizer): The instance of the CLIPTokenizer class.
            save_directory (str): The directory where the vocabulary files will be saved.
            filename_prefix (Optional[str], optional): An optional prefix to be added to the filename. Defaults to None.

        Returns:
            Tuple[str]: A tuple containing the paths to the saved vocabulary file and merge file.

        Raises:
            OSError: If the specified save_directory is not a valid directory.
            IOError: If there is an issue with writing the vocabulary or merge files.
            Exception: If any other unexpected error occurs during the saving process.
        """
        if not os.path.isdir(save_directory):
            logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
            return
        vocab_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
        )
        merge_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["merges_file"]
        )

        with open(vocab_file, "w", encoding="utf-8") as f:
            f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n")

        index = 0
        with open(merge_file, "w", encoding="utf-8") as writer:
            writer.write("#version: 0.2\n")
            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):
                if index != token_index:
                    logger.warning(
                        "Saving vocabulary to {}: BPE merge indices are not consecutive."
                        " Please check that the tokenizer is not corrupted!".format(merge_file)
                    )
                    index = token_index
                writer.write(" ".join(bpe_tokens) + "\n")
                index += 1

        return vocab_file, merge_file
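The method above serializes `self.bpe_ranks` back into a merges file. A standalone sketch of the format it writes (a `#version` header, then one space-separated merge pair per line in rank order), using a toy merge table that is illustrative rather than the real CLIP merges:

```python
import io

# Toy merge table; the real ranks are loaded from merges.txt.
bpe_ranks = {("l", "o"): 0, ("lo", "w</w>"): 1}

buf = io.StringIO()
buf.write("#version: 0.2\n")
for pair, _rank in sorted(bpe_ranks.items(), key=lambda kv: kv[1]):
    buf.write(" ".join(pair) + "\n")

print(buf.getvalue())
# → #version: 0.2
#   l o
#   lo w</w>
```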

mindnlp.transformers.models.clip.tokenization_clip.CLIPTokenizer.vocab_size property

Method to return the vocabulary size of the CLIPTokenizer instance.

PARAMETER DESCRIPTION
self

The instance of the CLIPTokenizer class. This parameter refers to the current instance of the CLIPTokenizer for which the vocabulary size is to be calculated.

TYPE: CLIPTokenizer

RETURNS DESCRIPTION
int

The number of unique tokens in the vocabulary. The method returns an integer value representing the size of the vocabulary as the count of unique tokens stored in the encoder.

mindnlp.transformers.models.clip.tokenization_clip.CLIPTokenizer.__init__(vocab_file, merges_file, errors='replace', unk_token='<|endoftext|>', bos_token='<|startoftext|>', eos_token='<|endoftext|>', pad_token='<|endoftext|>', **kwargs)

Initializes a CLIPTokenizer object.

PARAMETER DESCRIPTION
self

The instance of the CLIPTokenizer class.

TYPE: object

vocab_file

The path to the vocabulary file containing token encodings.

TYPE: str

merges_file

The path to the file containing BPE merges for tokenization.

TYPE: str

errors

The error handling strategy for text decoding. Defaults to 'replace'.

TYPE: str DEFAULT: 'replace'

unk_token

The token to represent unknown words. Defaults to '<|endoftext|>'.

TYPE: str DEFAULT: '<|endoftext|>'

bos_token

The beginning of sequence token. Defaults to '<|startoftext|>'.

TYPE: str DEFAULT: '<|startoftext|>'

eos_token

The end of sequence token. Defaults to '<|endoftext|>'.

TYPE: str DEFAULT: '<|endoftext|>'

pad_token

The padding token. Defaults to '<|endoftext|>'.

TYPE: str DEFAULT: '<|endoftext|>'

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ImportError

If the 'ftfy' package is not installed.

Source code in mindnlp\transformers\models\clip\tokenization_clip.py
def __init__(
    self,
    vocab_file,
    merges_file,
    errors="replace",
    unk_token="<|endoftext|>",
    bos_token="<|startoftext|>",
    eos_token="<|endoftext|>",
    pad_token="<|endoftext|>",  # hack to enable padding
    **kwargs,
):
    """
    Initializes a CLIPTokenizer object.

    Args:
        self (object): The instance of the CLIPTokenizer class.
        vocab_file (str): The path to the vocabulary file containing token encodings.
        merges_file (str): The path to the file containing BPE merges for tokenization.
        errors (str, optional): The error handling strategy for text decoding. Defaults to 'replace'.
        unk_token (str, optional): The token to represent unknown words. Defaults to '<|endoftext|>'.
        bos_token (str, optional): The beginning of sequence token. Defaults to '<|startoftext|>'.
        eos_token (str, optional): The end of sequence token. Defaults to '<|endoftext|>'.
        pad_token (str, optional): The padding token. Defaults to '<|endoftext|>'.

    Returns:
        None.

    Raises:
        ImportError: If the 'ftfy' package is not installed.
    """
    bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
    eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
    unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
    try:
        import ftfy

        self.fix_text = ftfy.fix_text
    except ImportError:
        logger.info("ftfy or spacy is not installed using custom BasicTokenizer instead of ftfy.")
        self.nlp = BasicTokenizer(strip_accents=False, do_split_on_punc=False)
        self.fix_text = None

    with open(vocab_file, encoding="utf-8") as vocab_handle:
        self.encoder = json.load(vocab_handle)
    self.decoder = {v: k for k, v in self.encoder.items()}
    self.errors = errors  # how to handle errors in decoding
    self.byte_encoder = bytes_to_unicode()
    self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
    with open(merges_file, encoding="utf-8") as merges_handle:
        bpe_merges = merges_handle.read().strip().split("\n")[1 : 49152 - 256 - 2 + 1]
    bpe_merges = [tuple(merge.split()) for merge in bpe_merges]
    self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
    self.cache = {"<|startoftext|>": "<|startoftext|>", "<|endoftext|>": "<|endoftext|>"}

    self.pat = re.compile(
        r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""",
        re.IGNORECASE,
    )

    super().__init__(
        errors=errors,
        unk_token=unk_token,
        bos_token=bos_token,
        eos_token=eos_token,
        pad_token=pad_token,
        **kwargs,
    )
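The `byte_encoder` built in `__init__` comes from `bytes_to_unicode`, the GPT-2-style map from all 256 byte values to printable unicode characters that byte-level BPE relies on. A self-contained sketch of that mapping and a round-trip through it:

```python
def bytes_to_unicode():
    """Map every byte value to a printable unicode character (GPT-2/CLIP scheme)."""
    # Bytes that are already printable keep their own character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:          # control/whitespace bytes get shifted
            bs.append(b)
            cs.append(256 + n)   # into unused code points above 255
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}
assert len(byte_encoder) == 256

# Round-trip arbitrary UTF-8 bytes through the printable alphabet.
raw = "héllo".encode("utf-8")
printable = "".join(byte_encoder[b] for b in raw)
assert bytes(byte_decoder[c] for c in printable).decode("utf-8") == "héllo"
```

Because every byte has a printable stand-in, tokens never contain raw control bytes, and `convert_tokens_to_string` can invert the mapping with `byte_decoder`.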

mindnlp.transformers.models.clip.tokenization_clip.CLIPTokenizer.bpe(token)

This method 'bpe' is defined in the class 'CLIPTokenizer'. It processes a given token using Byte Pair Encoding (BPE).

PARAMETER DESCRIPTION
self

This parameter represents the instance of the class 'CLIPTokenizer'. It is used to access the attributes and methods of the class.

token

The input token to be processed using Byte Pair Encoding (BPE). It should be a string representing a single token.

TYPE: str

RETURNS DESCRIPTION
str

The processed token after applying Byte Pair Encoding (BPE) algorithm. The token is modified based on the algorithm rules.

Source code in mindnlp\transformers\models\clip\tokenization_clip.py
def bpe(self, token):
    """
    This method 'bpe' is defined in the class 'CLIPTokenizer'. It processes a given token using Byte Pair Encoding (BPE).

    Args:
        self: This parameter represents the instance of the class 'CLIPTokenizer'.
            It is used to access the attributes and methods of the class.
        token (str): The input token to be processed using Byte Pair Encoding (BPE).
            It should be a string representing a single token.

    Returns:
        str: The processed token after applying Byte Pair Encoding (BPE) algorithm.
            The token is modified based on the algorithm rules.

    Raises:
        None.
    """
    if token in self.cache:
        return self.cache[token]
    word = tuple(token[:-1]) + (token[-1] + "</w>",)
    pairs = get_pairs(word)

    if not pairs:
        return token + "</w>"

    while True:
        bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
        if bigram not in self.bpe_ranks:
            break
        first, second = bigram
        new_word = []
        i = 0
        while i < len(word):
            try:
                j = word.index(first, i)
            except ValueError:
                new_word.extend(word[i:])
                break
            else:
                new_word.extend(word[i:j])
                i = j

            if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
                new_word.append(first + second)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_word = tuple(new_word)
        word = new_word
        if len(word) == 1:
            break
        pairs = get_pairs(word)
    word = " ".join(word)
    self.cache[token] = word
    return word
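The loop above is the standard BPE merge procedure. A self-contained sketch of the same algorithm (caching omitted; the merge ranks below are a toy table, not the real CLIP merges):

```python
def get_pairs(word):
    """Return the set of adjacent symbol pairs in a word tuple."""
    return {(word[i], word[i + 1]) for i in range(len(word) - 1)}

def bpe(token, bpe_ranks):
    """Repeatedly merge the lowest-ranked adjacent pair, as CLIP's bpe() does."""
    word = tuple(token[:-1]) + (token[-1] + "</w>",)  # mark end of word
    pairs = get_pairs(word)
    if not pairs:
        return token + "</w>"
    while True:
        bigram = min(pairs, key=lambda pair: bpe_ranks.get(pair, float("inf")))
        if bigram not in bpe_ranks:
            break  # no applicable merge left
        first, second = bigram
        new_word, i = [], 0
        while i < len(word):
            try:
                j = word.index(first, i)
            except ValueError:
                new_word.extend(word[i:])
                break
            new_word.extend(word[i:j])
            i = j
            if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
                new_word.append(first + second)  # apply the merge
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        word = tuple(new_word)
        if len(word) == 1:
            break
        pairs = get_pairs(word)
    return " ".join(word)

# Toy ranks: merge ("l", "o") first, then ("lo", "w</w>").
ranks = {("l", "o"): 0, ("lo", "w</w>"): 1}
print(bpe("low", ranks))  # → low</w>
```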

mindnlp.transformers.models.clip.tokenization_clip.CLIPTokenizer.build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A CLIP sequence has the following format:

  • single sequence: <|startoftext|> X <|endoftext|>

Pairs of sequences are not the expected use case, but they will be handled without a separator.

PARAMETER DESCRIPTION
token_ids_0

List of IDs to which the special tokens will be added.

TYPE: `List[int]`

token_ids_1

Optional second list of IDs for sequence pairs.

TYPE: `List[int]`, *optional* DEFAULT: None

RETURNS DESCRIPTION
List[int]

List[int]: List of input IDs with the appropriate special tokens.

Source code in mindnlp\transformers\models\clip\tokenization_clip.py
def build_inputs_with_special_tokens(
    self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """
    Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
    adding special tokens. A CLIP sequence has the following format:

    - single sequence: `<|startoftext|> X <|endoftext|>`

    Pairs of sequences are not the expected use case, but they will be handled without a separator.

    Args:
        token_ids_0 (`List[int]`):
            List of IDs to which the special tokens will be added.
        token_ids_1 (`List[int]`, *optional*):
            Optional second list of IDs for sequence pairs.

    Returns:
        `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
    """
    bos_token = [self.bos_token_id]
    eos_token = [self.eos_token_id]

    if token_ids_1 is None:
        return bos_token + token_ids_0 + eos_token
    return bos_token + token_ids_0 + eos_token + eos_token + token_ids_1 + eos_token
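As a worked example of the framing above, here is a standalone sketch using 49406/49407, the ids `<|startoftext|>`/`<|endoftext|>` take in the standard openai/clip-vit-base-patch32 vocab (the content ids 320 and 1125 are purely illustrative):

```python
BOS_ID, EOS_ID = 49406, 49407  # <|startoftext|>, <|endoftext|>

def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    bos, eos = [BOS_ID], [EOS_ID]
    if token_ids_1 is None:
        return bos + token_ids_0 + eos
    # Sequence pairs are joined by a doubled EOS, not a dedicated separator.
    return bos + token_ids_0 + eos + eos + token_ids_1 + eos

print(build_inputs_with_special_tokens([320, 1125]))
# → [49406, 320, 1125, 49407]
```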

mindnlp.transformers.models.clip.tokenization_clip.CLIPTokenizer.convert_tokens_to_string(tokens)

Converts a sequence of tokens (string) to a single string.

Source code in mindnlp\transformers\models\clip\tokenization_clip.py
def convert_tokens_to_string(self, tokens):
    """Converts a sequence of tokens (string) in a single string."""
    text = "".join(tokens)
    byte_array = bytearray([self.byte_decoder[c] for c in text])
    text = byte_array.decode("utf-8", errors=self.errors).replace("</w>", " ").strip()
    return text

mindnlp.transformers.models.clip.tokenization_clip.CLIPTokenizer.create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)

Create a mask from the two sequences passed. CLIP does not make use of token type ids, therefore a list of zeros is returned.

PARAMETER DESCRIPTION
token_ids_0

List of IDs.

TYPE: `List[int]`

token_ids_1

Optional second list of IDs for sequence pairs.

TYPE: `List[int]`, *optional* DEFAULT: None

RETURNS DESCRIPTION
List[int]

List[int]: List of zeros.

Source code in mindnlp\transformers\models\clip\tokenization_clip.py
def create_token_type_ids_from_sequences(
    self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """
    Create a mask from the two sequences passed. CLIP does not make use of token type ids, therefore a list of
    zeros is returned.

    Args:
        token_ids_0 (`List[int]`):
            List of IDs.
        token_ids_1 (`List[int]`, *optional*):
            Optional second list of IDs for sequence pairs.

    Returns:
        `List[int]`: List of zeros.
    """
    bos_token = [self.bos_token_id]
    eos_token = [self.eos_token_id]

    if token_ids_1 is None:
        return len(bos_token + token_ids_0 + eos_token) * [0]
    return len(bos_token + token_ids_0 + eos_token + eos_token + token_ids_1 + eos_token) * [0]
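Since CLIP has no segment embeddings, the returned mask is simply zeros whose length matches the framed sequence. A minimal sketch with illustrative ids:

```python
def create_token_type_ids(token_ids_0, token_ids_1=None):
    if token_ids_1 is None:
        return [0] * (len(token_ids_0) + 2)                 # BOS + ids + EOS
    return [0] * (len(token_ids_0) + len(token_ids_1) + 4)  # doubled EOS join

assert create_token_type_ids([320, 1125]) == [0, 0, 0, 0]
assert create_token_type_ids([320], [1125]) == [0, 0, 0, 0, 0, 0]
```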

mindnlp.transformers.models.clip.tokenization_clip.CLIPTokenizer.get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.

PARAMETER DESCRIPTION
token_ids_0

List of IDs.

TYPE: `List[int]`

token_ids_1

Optional second list of IDs for sequence pairs.

TYPE: `List[int]`, *optional* DEFAULT: None

already_has_special_tokens

Whether or not the token list is already formatted with special tokens for the model.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

RETURNS DESCRIPTION
List[int]

List[int]: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Source code in mindnlp\transformers\models\clip\tokenization_clip.py
def get_special_tokens_mask(
    self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
    """
    Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
    special tokens using the tokenizer `prepare_for_model` method.

    Args:
        token_ids_0 (`List[int]`):
            List of IDs.
        token_ids_1 (`List[int]`, *optional*):
            Optional second list of IDs for sequence pairs.
        already_has_special_tokens (`bool`, *optional*, defaults to `False`):
            Whether or not the token list is already formatted with special tokens for the model.

    Returns:
        `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
    """
    if already_has_special_tokens:
        return super().get_special_tokens_mask(
            token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
        )

    if token_ids_1 is None:
        return [1] + ([0] * len(token_ids_0)) + [1]
    return [1] + ([0] * len(token_ids_0)) + [1] + [1] + ([0] * len(token_ids_1)) + [1]
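A worked example of the mask produced on the no-special-tokens path (ids illustrative): BOS and EOS positions get 1, content positions get 0.

```python
def get_special_tokens_mask(token_ids_0, token_ids_1=None):
    if token_ids_1 is None:
        return [1] + [0] * len(token_ids_0) + [1]
    # The doubled EOS between the pair shows up as two adjacent 1s.
    return [1] + [0] * len(token_ids_0) + [1, 1] + [0] * len(token_ids_1) + [1]

assert get_special_tokens_mask([320, 1125]) == [1, 0, 0, 1]
assert get_special_tokens_mask([320], [1125]) == [1, 0, 1, 1, 0, 1]
```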

mindnlp.transformers.models.clip.tokenization_clip.CLIPTokenizer.get_vocab()

Method to retrieve the vocabulary of the CLIPTokenizer instance.

PARAMETER DESCRIPTION
self

The instance of the CLIPTokenizer class. Represents the current instance of the CLIPTokenizer.

TYPE: CLIPTokenizer

RETURNS DESCRIPTION
dict

A dictionary containing the combined vocabulary of the encoder and added_tokens_encoder. The vocabulary includes both the original encoder tokens and any additional tokens added to the tokenizer.

Source code in mindnlp\transformers\models\clip\tokenization_clip.py
def get_vocab(self):
    """
    Method to retrieve the vocabulary of the CLIPTokenizer instance.

    Args:
        self (CLIPTokenizer): The instance of the CLIPTokenizer class.
            Represents the current instance of the CLIPTokenizer.

    Returns:
        dict: A dictionary containing the combined vocabulary of the encoder and added_tokens_encoder.
            The vocabulary includes both the original encoder tokens and any additional tokens added to the tokenizer.

    Raises:
        None
    """
    return dict(self.encoder, **self.added_tokens_encoder)

mindnlp.transformers.models.clip.tokenization_clip.CLIPTokenizer.save_vocabulary(save_directory, filename_prefix=None)

Save the vocabulary to the specified directory with an optional filename prefix.

PARAMETER DESCRIPTION
self

The instance of the CLIPTokenizer class.

TYPE: CLIPTokenizer

save_directory

The directory where the vocabulary files will be saved.

TYPE: str

filename_prefix

An optional prefix to be added to the filename. Defaults to None.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Tuple[str]

Tuple[str]: A tuple containing the paths to the saved vocabulary file and merge file.

RAISES DESCRIPTION
OSError

If the specified save_directory is not a valid directory.

IOError

If there is an issue with writing the vocabulary or merge files.

Exception

If any other unexpected error occurs during the saving process.

Source code in mindnlp\transformers\models\clip\tokenization_clip.py
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
    """
    Save the vocabulary to the specified directory with an optional filename prefix.

    Args:
        self (CLIPTokenizer): The instance of the CLIPTokenizer class.
        save_directory (str): The directory where the vocabulary files will be saved.
        filename_prefix (Optional[str], optional): An optional prefix to be added to the filename. Defaults to None.

    Returns:
        Tuple[str]: A tuple containing the paths to the saved vocabulary file and merge file.

    Raises:
        OSError: If the specified save_directory is not a valid directory.
        IOError: If there is an issue with writing the vocabulary or merge files.
        Exception: If any other unexpected error occurs during the saving process.
    """
    if not os.path.isdir(save_directory):
        logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
        return
    vocab_file = os.path.join(
        save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
    )
    merge_file = os.path.join(
        save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["merges_file"]
    )

    with open(vocab_file, "w", encoding="utf-8") as f:
        f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n")

    index = 0
    with open(merge_file, "w", encoding="utf-8") as writer:
        writer.write("#version: 0.2\n")
        for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):
            if index != token_index:
                logger.warning(
                    "Saving vocabulary to {}: BPE merge indices are not consecutive."
                    " Please check that the tokenizer is not corrupted!".format(merge_file)
                )
                index = token_index
            writer.write(" ".join(bpe_tokens) + "\n")
            index += 1

    return vocab_file, merge_file

mindnlp.transformers.models.clip.tokenization_clip_fast.CLIPTokenizerFast

Bases: PreTrainedTokenizerFast

Construct a "fast" CLIP tokenizer (backed by HuggingFace's tokenizers library). Based on byte-level Byte-Pair-Encoding.

This tokenizer inherits from [PreTrainedTokenizerFast] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

PARAMETER DESCRIPTION
vocab_file

Path to the vocabulary file.

TYPE: `str`, *optional* DEFAULT: None

merges_file

Path to the merges file.

TYPE: `str`, *optional* DEFAULT: None

tokenizer_file

The path to a tokenizer file to use instead of the vocab file.

TYPE: `str`, *optional* DEFAULT: None

unk_token

The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

TYPE: `str`, *optional*, defaults to `"<|endoftext|>"` DEFAULT: '<|endoftext|>'

bos_token

The beginning of sequence token.

TYPE: `str`, *optional*, defaults to `"<|startoftext|>"` DEFAULT: '<|startoftext|>'

eos_token

The end of sequence token.

TYPE: `str`, *optional*, defaults to `"<|endoftext|>"` DEFAULT: '<|endoftext|>'

pad_token

The token used for padding, for example when batching sequences of different lengths.

TYPE: `str`, *optional*, defaults to `"<|endoftext|>"` DEFAULT: '<|endoftext|>'

Source code in mindnlp\transformers\models\clip\tokenization_clip_fast.py
class CLIPTokenizerFast(PreTrainedTokenizerFast):
    """
    Construct a "fast" CLIP tokenizer (backed by HuggingFace's *tokenizers* library). Based on byte-level
    Byte-Pair-Encoding.

    This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
    refer to this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`, *optional*):
            Path to the vocabulary file.
        merges_file (`str`, *optional*):
            Path to the merges file.
        tokenizer_file (`str`, *optional*):
            The path to a tokenizer file to use instead of the vocab file.
        unk_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        bos_token (`str`, *optional*, defaults to `"<|startoftext|>"`):
            The beginning of sequence token.
        eos_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
            The end of sequence token.
        pad_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
            The token used for padding, for example when batching sequences of different lengths.
    """
    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    model_input_names = ["input_ids", "attention_mask"]
    slow_tokenizer_class = CLIPTokenizer

    def __init__(
        self,
        vocab_file=None,
        merges_file=None,
        tokenizer_file=None,
        unk_token="<|endoftext|>",
        bos_token="<|startoftext|>",
        eos_token="<|endoftext|>",
        pad_token="<|endoftext|>",  # hack to enable padding
        **kwargs,
    ):
        """
        Initialize the CLIPTokenizerFast class.

        Args:
            self (object): The instance of the CLIPTokenizerFast class.
            vocab_file (str, optional): Path to the vocabulary file. Default is None.
            merges_file (str, optional): Path to the merges file. Default is None.
            tokenizer_file (str, optional): Path to the tokenizer file. Default is None.
            unk_token (str, optional): The unknown token. Default is '<|endoftext|>'.
            bos_token (str, optional): The beginning of sequence token. Default is '<|startoftext|>'.
            eos_token (str, optional): The end of sequence token. Default is '<|endoftext|>'.
            pad_token (str, optional): The padding token. Default is '<|endoftext|>'.

        Returns:
            None.

        Raises:
            ValueError: Raised if the backend tokenizer pre_tokenizer does not match the expected format.
                The CLIP tokenizer in this version has been heavily modified from transformers version 4.17.0. To
                resolve this issue, convert the existing tokenizer to be compatible with this version using
                `CLIPTokenizerFast.from_pretrained("path_to_local_folder_or_hub_repo", from_slow=True)`.
                If using an older tokenizer version, revert to a version prior to 4.17.0 of transformers.
        """
        super().__init__(
            vocab_file,
            merges_file,
            tokenizer_file=tokenizer_file,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
            pad_token=pad_token,
            **kwargs,
        )

        if not isinstance(self.backend_tokenizer.pre_tokenizer, pre_tokenizers.Sequence):
            raise ValueError(
                "The `backend_tokenizer` provided does not match the expected format. The CLIP tokenizer has been"
                " heavily modified from transformers version 4.17.0. You need to convert the tokenizer you are using"
                " to be compatible with this version.The easiest way to do so is"
                ' `CLIPTokenizerFast.from_pretrained("path_to_local_folder_or_hub_repo, from_slow=True)`. If you want'
                " to use your existing tokenizer, you will have to revert to a version prior to 4.17.0 of"
                " transformers."
            )

        self._wrap_decode_method_backend_tokenizer()

    # Very ugly hack to enable padding to have a correct decoding see https://github.com/huggingface/tokenizers/issues/872
    def _wrap_decode_method_backend_tokenizer(self):
        """
        This method '_wrap_decode_method_backend_tokenizer' is a private method within the 'CLIPTokenizerFast' class.
        It wraps the 'decode' method of the backend tokenizer by modifying its behavior.

        Args:
            self (CLIPTokenizerFast): The instance of the CLIPTokenizerFast class itself.
                It is used to access the backend_tokenizer attribute and modify the decode method.

        Returns:
            None: This method does not return any value explicitly,
                but it modifies the behavior of the 'decode' method of the backend tokenizer.

        Raises:
            None: However, potential exceptions that could be raised during the execution of the modified 'decode'
                method of the backend tokenizer should be handled within that method.
        """
        orig_decode_method = self.backend_tokenizer.decode

        def new_decode_method(*args, **kwargs):
            text = orig_decode_method(*args, **kwargs)
            text = text.replace(self.backend_tokenizer.model.end_of_word_suffix, " ").strip()
            return text

        self.backend_tokenizer.decode = new_decode_method

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
        adding special tokens. A CLIP sequence has the following format:

        - single sequence: `<|startoftext|> X <|endoftext|>`

        Pairs of sequences are not the expected use case, but they will be handled without a separator.

        Args:
            token_ids_0 (`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
        """
        bos_token = [self.bos_token_id]
        eos_token = [self.eos_token_id]

        if token_ids_1 is None:
            return bos_token + token_ids_0 + eos_token
        return bos_token + token_ids_0 + eos_token + eos_token + token_ids_1 + eos_token

    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Create a mask from the two sequences passed. CLIP does not make use of token type ids, therefore a list of
        zeros is returned.

        Args:
            token_ids_0 (`List[int]`):
                List of IDs.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of zeros.
        """
        bos_token = [self.bos_token_id]
        eos_token = [self.eos_token_id]

        if token_ids_1 is None:
            return len(bos_token + token_ids_0 + eos_token) * [0]
        return len(bos_token + token_ids_0 + eos_token + eos_token + token_ids_1 + eos_token) * [0]

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        """
        Save the vocabulary generated by the CLIPTokenizerFast model to the specified directory.

        Args:
            self (CLIPTokenizerFast): The instance of the CLIPTokenizerFast class.
            save_directory (str): The directory where the vocabulary files will be saved.
            filename_prefix (Optional[str], optional): An optional prefix to be included in the saved filenames.
                Default is None.

        Returns:
            Tuple[str]: A tuple containing the filenames of the saved vocabulary files.

        Raises:
            This method does not raise any exceptions.
        """
        files = self._tokenizer.model.save(save_directory, name=filename_prefix)
        return tuple(files)
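The decode wrapper installed by `_wrap_decode_method_backend_tokenizer` above strips the BPE end-of-word suffix so padded batches decode cleanly. A standalone sketch with a mocked backend `decode` (the suffix value mirrors `backend_tokenizer.model.end_of_word_suffix`):

```python
END_OF_WORD_SUFFIX = "</w>"

def backend_decode(ids):
    """Stand-in for the raw backend decode, which leaves </w> markers in place."""
    return "a</w>photo</w>of</w>a</w>cat</w>"

def wrapped_decode(ids):
    # Same transformation as the installed wrapper: strip suffixes, trim ends.
    text = backend_decode(ids)
    return text.replace(END_OF_WORD_SUFFIX, " ").strip()

print(wrapped_decode([0]))  # → a photo of a cat
```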

mindnlp.transformers.models.clip.tokenization_clip_fast.CLIPTokenizerFast.__init__(vocab_file=None, merges_file=None, tokenizer_file=None, unk_token='<|endoftext|>', bos_token='<|startoftext|>', eos_token='<|endoftext|>', pad_token='<|endoftext|>', **kwargs)

Initialize the CLIPTokenizerFast class.

PARAMETER DESCRIPTION
self

The instance of the CLIPTokenizerFast class.

TYPE: object

vocab_file

Path to the vocabulary file. Default is None.

TYPE: str DEFAULT: None

merges_file

Path to the merges file. Default is None.

TYPE: str DEFAULT: None

tokenizer_file

Path to the tokenizer file. Default is None.

TYPE: str DEFAULT: None

unk_token

The unknown token. Default is '<|endoftext|>'.

TYPE: str DEFAULT: '<|endoftext|>'

bos_token

The beginning of sequence token. Default is '<|startoftext|>'.

TYPE: str DEFAULT: '<|startoftext|>'

eos_token

The end of sequence token. Default is '<|endoftext|>'.

TYPE: str DEFAULT: '<|endoftext|>'

pad_token

The padding token. Default is '<|endoftext|>'.

TYPE: str DEFAULT: '<|endoftext|>'

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

Raised if the backend tokenizer pre_tokenizer does not match the expected format. The CLIP tokenizer in this version has been heavily modified from transformers version 4.17.0. To resolve this issue, convert the existing tokenizer to be compatible with this version using CLIPTokenizerFast.from_pretrained("path_to_local_folder_or_hub_repo", from_slow=True). If using an older tokenizer version, revert to a version prior to 4.17.0 of transformers.

Source code in mindnlp\transformers\models\clip\tokenization_clip_fast.py
def __init__(
    self,
    vocab_file=None,
    merges_file=None,
    tokenizer_file=None,
    unk_token="<|endoftext|>",
    bos_token="<|startoftext|>",
    eos_token="<|endoftext|>",
    pad_token="<|endoftext|>",  # hack to enable padding
    **kwargs,
):
    """
    Initialize the CLIPTokenizerFast class.

    Args:
        self (object): The instance of the CLIPTokenizerFast class.
        vocab_file (str, optional): Path to the vocabulary file. Default is None.
        merges_file (str, optional): Path to the merges file. Default is None.
        tokenizer_file (str, optional): Path to the tokenizer file. Default is None.
        unk_token (str, optional): The unknown token. Default is '<|endoftext|>'.
        bos_token (str, optional): The beginning of sequence token. Default is '<|startoftext|>'.
        eos_token (str, optional): The end of sequence token. Default is '<|endoftext|>'.
        pad_token (str, optional): The padding token. Default is '<|endoftext|>'.

    Returns:
        None.

    Raises:
        ValueError: Raised if the backend tokenizer pre_tokenizer does not match the expected format.
            The CLIP tokenizer in this version has been heavily modified from transformers version 4.17.0. To
            resolve this issue, convert the existing tokenizer to be compatible with this version using
            `CLIPTokenizerFast.from_pretrained("path_to_local_folder_or_hub_repo", from_slow=True)`.
            If using an older tokenizer version, revert to a version prior to 4.17.0 of transformers.
    """
    super().__init__(
        vocab_file,
        merges_file,
        tokenizer_file=tokenizer_file,
        unk_token=unk_token,
        bos_token=bos_token,
        eos_token=eos_token,
        pad_token=pad_token,
        **kwargs,
    )

    if not isinstance(self.backend_tokenizer.pre_tokenizer, pre_tokenizers.Sequence):
        raise ValueError(
            "The `backend_tokenizer` provided does not match the expected format. The CLIP tokenizer has been"
            " heavily modified from transformers version 4.17.0. You need to convert the tokenizer you are using"
            " to be compatible with this version. The easiest way to do so is"
            ' `CLIPTokenizerFast.from_pretrained("path_to_local_folder_or_hub_repo", from_slow=True)`. If you want'
            " to use your existing tokenizer, you will have to revert to a version prior to 4.17.0 of"
            " transformers."
        )

    self._wrap_decode_method_backend_tokenizer()

mindnlp.transformers.models.clip.tokenization_clip_fast.CLIPTokenizerFast.build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A CLIP sequence has the following format:

  • single sequence: <|startoftext|> X <|endoftext|>

Pairs of sequences are not the expected use case, but they will be handled without a separator.

PARAMETER DESCRIPTION
token_ids_0

List of IDs to which the special tokens will be added.

TYPE: `List[int]`

token_ids_1

Optional second list of IDs for sequence pairs.

TYPE: `List[int]`, *optional* DEFAULT: None

RETURNS DESCRIPTION
List[int]

List[int]: List of input IDs with the appropriate special tokens.

Source code in mindnlp\transformers\models\clip\tokenization_clip_fast.py
def build_inputs_with_special_tokens(
    self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """
    Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating
    and adding special tokens. A CLIP sequence has the following format:

    - single sequence: `<|startoftext|> X <|endoftext|>`

    Pairs of sequences are not the expected use case, but they will be handled without a separator.

    Args:
        token_ids_0 (`List[int]`):
            List of IDs to which the special tokens will be added.
        token_ids_1 (`List[int]`, *optional*):
            Optional second list of IDs for sequence pairs.

    Returns:
        `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
    """
    bos_token = [self.bos_token_id]
    eos_token = [self.eos_token_id]

    if token_ids_1 is None:
        return bos_token + token_ids_0 + eos_token
    return bos_token + token_ids_0 + eos_token + eos_token + token_ids_1 + eos_token
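The wrapping logic above can be sketched in plain Python. The sketch below mirrors the method's behavior using hypothetical token IDs; the values 49406 and 49407 are assumed to be the `<|startoftext|>` and `<|endoftext|>` IDs of the openai/clip-vit-base-patch32 vocabulary, and the input IDs are made up for illustration:

```python
# Assumed special-token IDs for openai/clip-vit-base-patch32.
BOS_ID, EOS_ID = 49406, 49407

def build_inputs(token_ids_0, token_ids_1=None):
    """Mirror of build_inputs_with_special_tokens for a standalone sketch."""
    bos, eos = [BOS_ID], [EOS_ID]
    if token_ids_1 is None:
        return bos + token_ids_0 + eos
    # Pair case: two back-to-back eos tokens act as the separator.
    return bos + token_ids_0 + eos + eos + token_ids_1 + eos

single = build_inputs([1, 2, 3])        # [49406, 1, 2, 3, 49407]
pair = build_inputs([1], [2])           # [49406, 1, 49407, 49407, 2, 49407]
```

Note that the pair case simply concatenates the two wrapped sequences, which is why no dedicated separator token is needed.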

mindnlp.transformers.models.clip.tokenization_clip_fast.CLIPTokenizerFast.create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)

Create a mask from the two sequences passed. CLIP does not make use of token type ids, therefore a list of zeros is returned.

PARAMETER DESCRIPTION
token_ids_0

List of IDs.

TYPE: `List[int]`

token_ids_1

Optional second list of IDs for sequence pairs.

TYPE: `List[int]`, *optional* DEFAULT: None

RETURNS DESCRIPTION
List[int]

List[int]: List of zeros.

Source code in mindnlp\transformers\models\clip\tokenization_clip_fast.py

def create_token_type_ids_from_sequences(
    self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """
    Create a mask from the two sequences passed. CLIP does not make use of token type ids, therefore a list of
    zeros is returned.

    Args:
        token_ids_0 (`List[int]`):
            List of IDs.
        token_ids_1 (`List[int]`, *optional*):
            Optional second list of IDs for sequence pairs.

    Returns:
        `List[int]`: List of zeros.
    """
    bos_token = [self.bos_token_id]
    eos_token = [self.eos_token_id]

    if token_ids_1 is None:
        return len(bos_token + token_ids_0 + eos_token) * [0]
    return len(bos_token + token_ids_0 + eos_token + eos_token + token_ids_1 + eos_token) * [0]
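Because CLIP ignores token type IDs, the mask is simply a run of zeros matching the length of the built input. A minimal standalone sketch of that length calculation:

```python
def token_type_ids(token_ids_0, token_ids_1=None):
    """Sketch of create_token_type_ids_from_sequences: all zeros."""
    length = 1 + len(token_ids_0) + 1           # bos + ids + eos
    if token_ids_1 is not None:
        length += 1 + len(token_ids_1) + 1      # extra eos separator + ids + eos
    return [0] * length

token_type_ids([1, 2, 3])       # [0, 0, 0, 0, 0]
token_type_ids([1], [2, 3])     # seven zeros: bos + 1 id + 2 eos + 2 ids + eos
```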

mindnlp.transformers.models.clip.tokenization_clip_fast.CLIPTokenizerFast.save_vocabulary(save_directory, filename_prefix=None)

Save the vocabulary generated by the CLIPTokenizerFast model to the specified directory.

PARAMETER DESCRIPTION
self

The instance of the CLIPTokenizerFast class.

TYPE: CLIPTokenizerFast

save_directory

The directory where the vocabulary files will be saved.

TYPE: str

filename_prefix

An optional prefix to be included in the saved filenames. Default is None.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Tuple[str]

Tuple[str]: A tuple containing the filenames of the saved vocabulary files.

Source code in mindnlp\transformers\models\clip\tokenization_clip_fast.py
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
    """
    Save the vocabulary generated by the CLIPTokenizerFast model to the specified directory.

    Args:
        self (CLIPTokenizerFast): The instance of the CLIPTokenizerFast class.
        save_directory (str): The directory where the vocabulary files will be saved.
        filename_prefix (Optional[str], optional): An optional prefix to be included in the saved filenames.
            Default is None.

    Returns:
        Tuple[str]: A tuple containing the filenames of the saved vocabulary files.

    Raises:
        This method does not raise any exceptions.
    """
    files = self._tokenizer.model.save(save_directory, name=filename_prefix)
    return tuple(files)
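Since `save_vocabulary` delegates to the backend BPE model's `save`, the returned tuple depends on the tokenizers library. As a rough sketch (an assumption, not taken from the source), a BPE-backed fast tokenizer typically writes `vocab.json` and `merges.txt`, with `filename_prefix` joined by a dash:

```python
def expected_vocab_filenames(filename_prefix=None):
    """Hypothetical illustration of the filenames save_vocabulary returns
    for a BPE-backed tokenizer (assumption: vocab.json + merges.txt,
    prefix joined with a dash, following the tokenizers library convention)."""
    prefix = f"{filename_prefix}-" if filename_prefix else ""
    return (prefix + "vocab.json", prefix + "merges.txt")
```

For example, calling `save_vocabulary("out", filename_prefix="clip")` would then be expected to produce `clip-vocab.json` and `clip-merges.txt` under `out`.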