timesformer

`mindnlp.transformers.models.timesformer.configuration_timesformer` ¶

TimeSformer model configuration

`mindnlp.transformers.models.timesformer.configuration_timesformer.TimesformerConfig` ¶

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [TimesformerModel]. It is used to instantiate a TimeSformer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the TimeSformer facebook/timesformer-base-finetuned-k600 architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER	DESCRIPTION
`image_size`	The size (resolution) of each image. TYPE: `int`, optional, defaults to 224 DEFAULT: `224`
`patch_size`	The size (resolution) of each patch. TYPE: `int`, optional, defaults to 16 DEFAULT: `16`
`num_channels`	The number of input channels. TYPE: `int`, optional, defaults to 3 DEFAULT: `3`
`num_frames`	The number of frames in each video. TYPE: `int`, optional, defaults to 8 DEFAULT: `8`
`hidden_size`	Dimensionality of the encoder layers and the pooler layer. TYPE: `int`, optional, defaults to 768 DEFAULT: `768`
`num_hidden_layers`	Number of hidden layers in the Transformer encoder. TYPE: `int`, optional, defaults to 12 DEFAULT: `12`
`num_attention_heads`	Number of attention heads for each attention layer in the Transformer encoder. TYPE: `int`, optional, defaults to 12 DEFAULT: `12`
`intermediate_size`	Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. TYPE: `int`, optional, defaults to 3072 DEFAULT: `3072`
`hidden_act`	The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` are supported. TYPE: `str` or `function`, optional, defaults to `"gelu"` DEFAULT: `'gelu'`
`hidden_dropout_prob`	The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. TYPE: `float`, optional, defaults to 0.0 DEFAULT: `0.0`
`attention_probs_dropout_prob`	The dropout ratio for the attention probabilities. TYPE: `float`, optional, defaults to 0.0 DEFAULT: `0.0`
`initializer_range`	The standard deviation of the truncated_normal_initializer for initializing all weight matrices. TYPE: `float`, optional, defaults to 0.02 DEFAULT: `0.02`
`layer_norm_eps`	The epsilon used by the layer normalization layers. TYPE: `float`, optional, defaults to 1e-06 DEFAULT: `1e-06`
`qkv_bias`	Whether to add a bias to the queries, keys and values. TYPE: `bool`, optional, defaults to `True` DEFAULT: `True`
`attention_type`	The attention type to use. Must be one of `"divided_space_time"`, `"space_only"`, `"joint_space_time"`. TYPE: `str`, optional, defaults to `"divided_space_time"` DEFAULT: `'divided_space_time'`
`drop_path_rate`	The dropout ratio for stochastic depth. TYPE: `float`, optional, defaults to 0 DEFAULT: `0`

>>> from transformers import TimesformerConfig, TimesformerModel

>>> # Initializing a TimeSformer timesformer-base style configuration
>>> configuration = TimesformerConfig()

>>> # Initializing a model from the configuration
>>> model = TimesformerModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

Source code in mindnlp\transformers\models\timesformer\configuration_timesformer.py

class TimesformerConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`TimesformerModel`]. It is used to instantiate a
    TimeSformer model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the TimeSformer
    [facebook/timesformer-base-finetuned-k600](https://huggingface.co/facebook/timesformer-base-finetuned-k600)
    architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        image_size (`int`, *optional*, defaults to 224):
            The size (resolution) of each image.
        patch_size (`int`, *optional*, defaults to 16):
            The size (resolution) of each patch.
        num_channels (`int`, *optional*, defaults to 3):
            The number of input channels.
        num_frames (`int`, *optional*, defaults to 8):
            The number of frames in each video.
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (`int`, *optional*, defaults to 3072):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
        hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"selu"` and `"gelu_new"` are supported.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-06):
            The epsilon used by the layer normalization layers.
        qkv_bias (`bool`, *optional*, defaults to `True`):
            Whether to add a bias to the queries, keys and values.
        attention_type (`str`, *optional*, defaults to `"divided_space_time"`):
            The attention type to use. Must be one of `"divided_space_time"`, `"space_only"`, `"joint_space_time"`.
        drop_path_rate (`float`, *optional*, defaults to 0):
            The dropout ratio for stochastic depth.

    Example:

    ```python
    >>> from transformers import TimesformerConfig, TimesformerModel

    >>> # Initializing a TimeSformer timesformer-base style configuration
    >>> configuration = TimesformerConfig()

    >>> # Initializing a model from the configuration
    >>> model = TimesformerModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "timesformer"

    def __init__(
        self,
        image_size=224,
        patch_size=16,
        num_channels=3,
        num_frames=8,
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
        intermediate_size=3072,
        hidden_act="gelu",
        hidden_dropout_prob=0.0,
        attention_probs_dropout_prob=0.0,
        initializer_range=0.02,
        layer_norm_eps=1e-6,
        qkv_bias=True,
        attention_type="divided_space_time",
        drop_path_rate=0,
        **kwargs,
    ):
        super().__init__(**kwargs)

        self.image_size = image_size
        self.patch_size = patch_size
        self.num_channels = num_channels
        self.num_frames = num_frames

        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_act = hidden_act
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.qkv_bias = qkv_bias

        self.attention_type = attention_type
        self.drop_path_rate = drop_path_rate

`mindnlp.transformers.models.timesformer.modeling_timesformer` ¶

MindSpore TimeSformer model.

`mindnlp.transformers.models.timesformer.modeling_timesformer.TimeSformerDropPath` ¶

Bases: Module

Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).

Source code in mindnlp\transformers\models\timesformer\modeling_timesformer.py

class TimeSformerDropPath(nn.Module):
    """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks)."""

    def __init__(self, drop_prob: Optional[float] = None) -> None:
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor:
        return drop_path(hidden_states, self.drop_prob, self.training)

    def extra_repr(self) -> str:
        return "p={}".format(self.drop_prob)

`mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerEmbeddings` ¶

Bases: Module

Construct the patch and position embeddings.

Source code in mindnlp\transformers\models\timesformer\modeling_timesformer.py

class TimesformerEmbeddings(nn.Module):
    """
    Construct the patch and position embeddings.
    """

    def __init__(self, config):
        super().__init__()

        embed_dim = config.hidden_size
        num_frames = config.num_frames
        drop_rate = config.hidden_dropout_prob
        attention_type = config.attention_type

        self.attention_type = attention_type
        self.patch_embeddings = TimesformerPatchEmbeddings(config)
        self.num_patches = self.patch_embeddings.num_patches

        # Positional Embeddings
        self.cls_token = nn.Parameter(ops.zeros(1, 1, embed_dim))
        self.position_embeddings = nn.Parameter(ops.zeros(1, self.num_patches + 1, embed_dim))
        self.pos_drop = nn.Dropout(p=drop_rate)
        if attention_type != "space_only":
            self.time_embeddings = nn.Parameter(ops.zeros(1, num_frames, embed_dim))
            self.time_drop = nn.Dropout(p=drop_rate)

    def forward(self, pixel_values):
        batch_size = pixel_values.shape[0]

        # create patch embeddings
        embeddings, num_frames, patch_width = self.patch_embeddings(pixel_values)

        cls_tokens = self.cls_token.broadcast_to((embeddings.shape[0], -1, -1))
        embeddings = ops.cat((cls_tokens, embeddings), dim=1)

        # resizing the positional embeddings in case they don't match the input at inference
        if embeddings.shape[1] != self.position_embeddings.shape[1]:
            position_embeddings = self.position_embeddings
            cls_pos_embed = position_embeddings[0, 0, :].unsqueeze(0).unsqueeze(1)
            other_pos_embed = ops.transpose(position_embeddings[0, 1:, :].unsqueeze(0), 1, 2)
            patch_num = int(other_pos_embed.shape[2] ** 0.5)
            patch_height = embeddings.shape[1] // patch_width
            other_pos_embed = other_pos_embed.reshape(1, embeddings.shape[2], patch_num, patch_num)
            new_pos_embed = nn.functional.interpolate(
                other_pos_embed, size=(patch_height, patch_width), mode="nearest"
            )
            new_pos_embed = ops.flatten(new_pos_embed, 2)
            new_pos_embed = ops.transpose(new_pos_embed, 1, 2)
            new_pos_embed = ops.cat((cls_pos_embed, new_pos_embed), 1)
            embeddings = embeddings + new_pos_embed
        else:
            embeddings = embeddings + self.position_embeddings
        embeddings = self.pos_drop(embeddings)

        # Time Embeddings
        if self.attention_type != "space_only":
            cls_tokens = embeddings[:batch_size, 0, :].unsqueeze(1)
            embeddings = embeddings[:, 1:]
            _, patch_height, patch_width = embeddings.shape
            embeddings = (
                embeddings.reshape(batch_size, num_frames, patch_height, patch_width)
                .permute(0, 2, 1, 3)
                .reshape(batch_size * patch_height, num_frames, patch_width)
            )
            # Resizing time embeddings in case they don't match
            if num_frames != self.time_embeddings.shape[1]:
                time_embeddings = ops.transpose(self.time_embeddings, 1, 2)
                new_time_embeddings = nn.functional.interpolate(time_embeddings, size=(num_frames), mode="nearest")
                new_time_embeddings = ops.transpose(new_time_embeddings, 1, 2)
                embeddings = embeddings + new_time_embeddings
            else:
                embeddings = embeddings + self.time_embeddings
            embeddings = self.time_drop(embeddings)
            embeddings = embeddings.view(batch_size, patch_height, num_frames, patch_width).reshape(
                batch_size, patch_height * num_frames, patch_width
            )
            embeddings = ops.cat((cls_tokens, embeddings), dim=1)

        return embeddings

`mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerForVideoClassification` ¶

Bases: TimesformerPreTrainedModel

Source code in mindnlp\transformers\models\timesformer\modeling_timesformer.py

class TimesformerForVideoClassification(TimesformerPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

        self.num_labels = config.num_labels
        self.timesformer = TimesformerModel(config)

        # Classifier head
        self.classifier = nn.Linear(config.hidden_size, config.num_labels) if config.num_labels > 0 else nn.Identity()

        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        labels: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, ImageClassifierOutput]:
        r"""
        labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).

        Returns:

        Examples:

        ```python
        >>> import av
        >>> import torch
        >>> import numpy as np

        >>> from transformers import AutoImageProcessor, TimesformerForVideoClassification
        >>> from huggingface_hub import hf_hub_download

        >>> np.random.seed(0)


        >>> def read_video_pyav(container, indices):
        ...     '''
        ...     Decode the video with PyAV decoder.
        ...     Args:
        ...         container (`av.container.input.InputContainer`): PyAV container.
        ...         indices (`List[int]`): List of frame indices to decode.
        ...     Returns:
        ...         result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
        ...     '''
        ...     frames = []
        ...     container.seek(0)
        ...     start_index = indices[0]
        ...     end_index = indices[-1]
        ...     for i, frame in enumerate(container.decode(video=0)):
        ...         if i > end_index:
        ...             break
        ...         if i >= start_index and i in indices:
        ...             frames.append(frame)
        ...     return np.stack([x.to_ndarray(format="rgb24") for x in frames])


        >>> def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
        ...     '''
        ...     Sample a given number of frame indices from the video.
        ...     Args:
        ...         clip_len (`int`): Total number of frames to sample.
        ...         frame_sample_rate (`int`): Sample every n-th frame.
        ...         seg_len (`int`): Maximum allowed index of sample's last frame.
        ...     Returns:
        ...         indices (`List[int]`): List of sampled frame indices
        ...     '''
        ...     converted_len = int(clip_len * frame_sample_rate)
        ...     end_idx = np.random.randint(converted_len, seg_len)
        ...     start_idx = end_idx - converted_len
        ...     indices = np.linspace(start_idx, end_idx, num=clip_len)
        ...     indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
        ...     return indices


        >>> # video clip consists of 300 frames (10 seconds at 30 FPS)
        >>> file_path = hf_hub_download(
        ...     repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
        ... )
        >>> container = av.open(file_path)

        >>> # sample 8 frames
        >>> indices = sample_frame_indices(clip_len=8, frame_sample_rate=1, seg_len=container.streams.video[0].frames)
        >>> video = read_video_pyav(container, indices)

        >>> image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
        >>> model = TimesformerForVideoClassification.from_pretrained("facebook/timesformer-base-finetuned-k400")

        >>> inputs = image_processor(list(video), return_tensors="ms")

        >>> with no_grad():
        ...     outputs = model(**inputs)
        ...     logits = outputs.logits

        >>> # model predicts one of the 400 Kinetics-400 classes
        >>> predicted_label = logits.argmax(-1).item()
        >>> print(model.config.id2label[predicted_label])
        eating spaghetti
        ```"""
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.timesformer(
            pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        sequence_output = outputs[0][:, 0]

        logits = self.classifier(sequence_output)

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(logits, labels)

        if not return_dict:
            output = (logits,) + outputs[1:]
            return ((loss,) + output) if loss is not None else output

        return ImageClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

`mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerForVideoClassification.forward(pixel_values=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None)` ¶

labels (mindspore.Tensor of shape (batch_size,), optional): Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), If config.num_labels > 1 a classification loss is computed (Cross-Entropy).

Returns:

Examples:

>>> import av
>>> import torch
>>> import numpy as np

>>> from transformers import AutoImageProcessor, TimesformerForVideoClassification
>>> from huggingface_hub import hf_hub_download

>>> np.random.seed(0)


>>> def read_video_pyav(container, indices):
...     '''
...     Decode the video with PyAV decoder.
...     Args:
...         container (`av.container.input.InputContainer`): PyAV container.
...         indices (`List[int]`): List of frame indices to decode.
...     Returns:
...         result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
...     '''
...     frames = []
...     container.seek(0)
...     start_index = indices[0]
...     end_index = indices[-1]
...     for i, frame in enumerate(container.decode(video=0)):
...         if i > end_index:
...             break
...         if i >= start_index and i in indices:
...             frames.append(frame)
...     return np.stack([x.to_ndarray(format="rgb24") for x in frames])


>>> def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
...     '''
...     Sample a given number of frame indices from the video.
...     Args:
...         clip_len (`int`): Total number of frames to sample.
...         frame_sample_rate (`int`): Sample every n-th frame.
...         seg_len (`int`): Maximum allowed index of sample's last frame.
...     Returns:
...         indices (`List[int]`): List of sampled frame indices
...     '''
...     converted_len = int(clip_len * frame_sample_rate)
...     end_idx = np.random.randint(converted_len, seg_len)
...     start_idx = end_idx - converted_len
...     indices = np.linspace(start_idx, end_idx, num=clip_len)
...     indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
...     return indices


>>> # video clip consists of 300 frames (10 seconds at 30 FPS)
>>> file_path = hf_hub_download(
...     repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
... )
>>> container = av.open(file_path)

>>> # sample 8 frames
>>> indices = sample_frame_indices(clip_len=8, frame_sample_rate=1, seg_len=container.streams.video[0].frames)
>>> video = read_video_pyav(container, indices)

>>> image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
>>> model = TimesformerForVideoClassification.from_pretrained("facebook/timesformer-base-finetuned-k400")

>>> inputs = image_processor(list(video), return_tensors="ms")

>>> with no_grad():
...     outputs = model(**inputs)
...     logits = outputs.logits

>>> # model predicts one of the 400 Kinetics-400 classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
eating spaghetti

Source code in mindnlp\transformers\models\timesformer\modeling_timesformer.py

def forward(
    self,
    pixel_values: Optional[mindspore.Tensor] = None,
    labels: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, ImageClassifierOutput]:
    r"""
    labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
        Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
        config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
        `config.num_labels > 1` a classification loss is computed (Cross-Entropy).

    Returns:

    Examples:

    ```python
    >>> import av
    >>> import torch
    >>> import numpy as np

    >>> from transformers import AutoImageProcessor, TimesformerForVideoClassification
    >>> from huggingface_hub import hf_hub_download

    >>> np.random.seed(0)


    >>> def read_video_pyav(container, indices):
    ...     '''
    ...     Decode the video with PyAV decoder.
    ...     Args:
    ...         container (`av.container.input.InputContainer`): PyAV container.
    ...         indices (`List[int]`): List of frame indices to decode.
    ...     Returns:
    ...         result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    ...     '''
    ...     frames = []
    ...     container.seek(0)
    ...     start_index = indices[0]
    ...     end_index = indices[-1]
    ...     for i, frame in enumerate(container.decode(video=0)):
    ...         if i > end_index:
    ...             break
    ...         if i >= start_index and i in indices:
    ...             frames.append(frame)
    ...     return np.stack([x.to_ndarray(format="rgb24") for x in frames])


    >>> def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    ...     '''
    ...     Sample a given number of frame indices from the video.
    ...     Args:
    ...         clip_len (`int`): Total number of frames to sample.
    ...         frame_sample_rate (`int`): Sample every n-th frame.
    ...         seg_len (`int`): Maximum allowed index of sample's last frame.
    ...     Returns:
    ...         indices (`List[int]`): List of sampled frame indices
    ...     '''
    ...     converted_len = int(clip_len * frame_sample_rate)
    ...     end_idx = np.random.randint(converted_len, seg_len)
    ...     start_idx = end_idx - converted_len
    ...     indices = np.linspace(start_idx, end_idx, num=clip_len)
    ...     indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    ...     return indices


    >>> # video clip consists of 300 frames (10 seconds at 30 FPS)
    >>> file_path = hf_hub_download(
    ...     repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
    ... )
    >>> container = av.open(file_path)

    >>> # sample 8 frames
    >>> indices = sample_frame_indices(clip_len=8, frame_sample_rate=1, seg_len=container.streams.video[0].frames)
    >>> video = read_video_pyav(container, indices)

    >>> image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
    >>> model = TimesformerForVideoClassification.from_pretrained("facebook/timesformer-base-finetuned-k400")

    >>> inputs = image_processor(list(video), return_tensors="ms")

    >>> with no_grad():
    ...     outputs = model(**inputs)
    ...     logits = outputs.logits

    >>> # model predicts one of the 400 Kinetics-400 classes
    >>> predicted_label = logits.argmax(-1).item()
    >>> print(model.config.id2label[predicted_label])
    eating spaghetti
    ```"""
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    outputs = self.timesformer(
        pixel_values,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    sequence_output = outputs[0][:, 0]

    logits = self.classifier(sequence_output)

    loss = None
    if labels is not None:
        if self.config.problem_type is None:
            if self.num_labels == 1:
                self.config.problem_type = "regression"
            elif self.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                self.config.problem_type = "single_label_classification"
            else:
                self.config.problem_type = "multi_label_classification"

        if self.config.problem_type == "regression":
            loss_fct = MSELoss()
            if self.num_labels == 1:
                loss = loss_fct(logits.squeeze(), labels.squeeze())
            else:
                loss = loss_fct(logits, labels)
        elif self.config.problem_type == "single_label_classification":
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        elif self.config.problem_type == "multi_label_classification":
            loss_fct = BCEWithLogitsLoss()
            loss = loss_fct(logits, labels)

    if not return_dict:
        output = (logits,) + outputs[1:]
        return ((loss,) + output) if loss is not None else output

    return ImageClassifierOutput(
        loss=loss,
        logits=logits,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )

`mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerModel` ¶

Bases: TimesformerPreTrainedModel

Source code in mindnlp\transformers\models\timesformer\modeling_timesformer.py

class TimesformerModel(TimesformerPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.config = config

        self.embeddings = TimesformerEmbeddings(config)
        self.encoder = TimesformerEncoder(config)

        self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.embeddings.patch_embeddings

    def _prune_heads(self, heads_to_prune):
        """
        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
        class PreTrainedModel
        """
        for layer, heads in heads_to_prune.items():
            self.encoder.layer[layer].attention.prune_heads(heads)

    def forward(
        self,
        pixel_values: mindspore.Tensor,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[mindspore.Tensor], BaseModelOutput]:
        r"""
        Returns:

        Examples:

        ```python
        >>> import av
        >>> import numpy as np

        >>> from transformers import AutoImageProcessor, TimesformerModel
        >>> from huggingface_hub import hf_hub_download

        >>> np.random.seed(0)


        >>> def read_video_pyav(container, indices):
        ...     '''
        ...     Decode the video with PyAV decoder.
        ...     Args:
        ...         container (`av.container.input.InputContainer`): PyAV container.
        ...         indices (`List[int]`): List of frame indices to decode.
        ...     Returns:
        ...         result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
        ...     '''
        ...     frames = []
        ...     container.seek(0)
        ...     start_index = indices[0]
        ...     end_index = indices[-1]
        ...     for i, frame in enumerate(container.decode(video=0)):
        ...         if i > end_index:
        ...             break
        ...         if i >= start_index and i in indices:
        ...             frames.append(frame)
        ...     return np.stack([x.to_ndarray(format="rgb24") for x in frames])


        >>> def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
        ...     '''
        ...     Sample a given number of frame indices from the video.
        ...     Args:
        ...         clip_len (`int`): Total number of frames to sample.
        ...         frame_sample_rate (`int`): Sample every n-th frame.
        ...         seg_len (`int`): Maximum allowed index of sample's last frame.
        ...     Returns:
        ...         indices (`List[int]`): List of sampled frame indices
        ...     '''
        ...     converted_len = int(clip_len * frame_sample_rate)
        ...     end_idx = np.random.randint(converted_len, seg_len)
        ...     start_idx = end_idx - converted_len
        ...     indices = np.linspace(start_idx, end_idx, num=clip_len)
        ...     indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
        ...     return indices


        >>> # video clip consists of 300 frames (10 seconds at 30 FPS)
        >>> file_path = hf_hub_download(
        ...     repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
        ... )
        >>> container = av.open(file_path)

        >>> # sample 8 frames
        >>> indices = sample_frame_indices(clip_len=8, frame_sample_rate=4, seg_len=container.streams.video[0].frames)
        >>> video = read_video_pyav(container, indices)

        >>> image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
        >>> model = TimesformerModel.from_pretrained("facebook/timesformer-base-finetuned-k400")

        >>> # prepare video for the model
        >>> inputs = image_processor(list(video), return_tensors="ms")

        >>> # forward pass
        >>> outputs = model(**inputs)
        >>> last_hidden_states = outputs.last_hidden_state
        >>> list(last_hidden_states.shape)
        [1, 1569, 768]
        ```"""
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        embedding_output = self.embeddings(pixel_values)

        encoder_outputs = self.encoder(
            embedding_output,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = encoder_outputs[0]
        if self.layernorm is not None:
            sequence_output = self.layernorm(sequence_output)

        if not return_dict:
            return (sequence_output,) + encoder_outputs[1:]

        return BaseModelOutput(
            last_hidden_state=sequence_output,
            hidden_states=encoder_outputs.hidden_states,
            attentions=encoder_outputs.attentions,
        )

`mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerModel.forward(pixel_values, output_attentions=None, output_hidden_states=None, return_dict=None)` ¶

Examples:

>>> import av
>>> import numpy as np

>>> from transformers import AutoImageProcessor, TimesformerModel
>>> from huggingface_hub import hf_hub_download

>>> np.random.seed(0)


>>> def read_video_pyav(container, indices):
...     '''
...     Decode the video with PyAV decoder.
...     Args:
...         container (`av.container.input.InputContainer`): PyAV container.
...         indices (`List[int]`): List of frame indices to decode.
...     Returns:
...         result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
...     '''
...     frames = []
...     container.seek(0)
...     start_index = indices[0]
...     end_index = indices[-1]
...     for i, frame in enumerate(container.decode(video=0)):
...         if i > end_index:
...             break
...         if i >= start_index and i in indices:
...             frames.append(frame)
...     return np.stack([x.to_ndarray(format="rgb24") for x in frames])


>>> def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
...     '''
...     Sample a given number of frame indices from the video.
...     Args:
...         clip_len (`int`): Total number of frames to sample.
...         frame_sample_rate (`int`): Sample every n-th frame.
...         seg_len (`int`): Maximum allowed index of sample's last frame.
...     Returns:
...         indices (`List[int]`): List of sampled frame indices
...     '''
...     converted_len = int(clip_len * frame_sample_rate)
...     end_idx = np.random.randint(converted_len, seg_len)
...     start_idx = end_idx - converted_len
...     indices = np.linspace(start_idx, end_idx, num=clip_len)
...     indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
...     return indices


>>> # video clip consists of 300 frames (10 seconds at 30 FPS)
>>> file_path = hf_hub_download(
...     repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
... )
>>> container = av.open(file_path)

>>> # sample 8 frames
>>> indices = sample_frame_indices(clip_len=8, frame_sample_rate=4, seg_len=container.streams.video[0].frames)
>>> video = read_video_pyav(container, indices)

>>> image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
>>> model = TimesformerModel.from_pretrained("facebook/timesformer-base-finetuned-k400")

>>> # prepare video for the model
>>> inputs = image_processor(list(video), return_tensors="ms")

>>> # forward pass
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 1569, 768]

Source code in mindnlp\transformers\models\timesformer\modeling_timesformer.py

def forward(
    self,
    pixel_values: mindspore.Tensor,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple[mindspore.Tensor], BaseModelOutput]:
    r"""
    Returns:

    Examples:

    ```python
    >>> import av
    >>> import numpy as np

    >>> from transformers import AutoImageProcessor, TimesformerModel
    >>> from huggingface_hub import hf_hub_download

    >>> np.random.seed(0)


    >>> def read_video_pyav(container, indices):
    ...     '''
    ...     Decode the video with PyAV decoder.
    ...     Args:
    ...         container (`av.container.input.InputContainer`): PyAV container.
    ...         indices (`List[int]`): List of frame indices to decode.
    ...     Returns:
    ...         result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    ...     '''
    ...     frames = []
    ...     container.seek(0)
    ...     start_index = indices[0]
    ...     end_index = indices[-1]
    ...     for i, frame in enumerate(container.decode(video=0)):
    ...         if i > end_index:
    ...             break
    ...         if i >= start_index and i in indices:
    ...             frames.append(frame)
    ...     return np.stack([x.to_ndarray(format="rgb24") for x in frames])


    >>> def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
    ...     '''
    ...     Sample a given number of frame indices from the video.
    ...     Args:
    ...         clip_len (`int`): Total number of frames to sample.
    ...         frame_sample_rate (`int`): Sample every n-th frame.
    ...         seg_len (`int`): Maximum allowed index of sample's last frame.
    ...     Returns:
    ...         indices (`List[int]`): List of sampled frame indices
    ...     '''
    ...     converted_len = int(clip_len * frame_sample_rate)
    ...     end_idx = np.random.randint(converted_len, seg_len)
    ...     start_idx = end_idx - converted_len
    ...     indices = np.linspace(start_idx, end_idx, num=clip_len)
    ...     indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
    ...     return indices


    >>> # video clip consists of 300 frames (10 seconds at 30 FPS)
    >>> file_path = hf_hub_download(
    ...     repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
    ... )
    >>> container = av.open(file_path)

    >>> # sample 8 frames
    >>> indices = sample_frame_indices(clip_len=8, frame_sample_rate=4, seg_len=container.streams.video[0].frames)
    >>> video = read_video_pyav(container, indices)

    >>> image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
    >>> model = TimesformerModel.from_pretrained("facebook/timesformer-base-finetuned-k400")

    >>> # prepare video for the model
    >>> inputs = image_processor(list(video), return_tensors="ms")

    >>> # forward pass
    >>> outputs = model(**inputs)
    >>> last_hidden_states = outputs.last_hidden_state
    >>> list(last_hidden_states.shape)
    [1, 1569, 768]
    ```"""
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    embedding_output = self.embeddings(pixel_values)

    encoder_outputs = self.encoder(
        embedding_output,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    sequence_output = encoder_outputs[0]
    if self.layernorm is not None:
        sequence_output = self.layernorm(sequence_output)

    if not return_dict:
        return (sequence_output,) + encoder_outputs[1:]

    return BaseModelOutput(
        last_hidden_state=sequence_output,
        hidden_states=encoder_outputs.hidden_states,
        attentions=encoder_outputs.attentions,
    )

`mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerPatchEmbeddings` ¶

Bases: Module

Image to Patch Embedding

Source code in mindnlp\transformers\models\timesformer\modeling_timesformer.py

class TimesformerPatchEmbeddings(nn.Module):
    """Image to Patch Embedding"""

    def __init__(self, config):
        super().__init__()

        image_size = config.image_size
        patch_size = config.patch_size

        image_size = image_size if isinstance(image_size, collections.abc.Iterable) else (image_size, image_size)
        patch_size = patch_size if isinstance(patch_size, collections.abc.Iterable) else (patch_size, patch_size)

        num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0])
        self.image_size = image_size
        self.patch_size = patch_size
        self.num_patches = num_patches

        self.projection = nn.Conv2d(config.num_channels, config.hidden_size, kernel_size=patch_size, stride=patch_size)

    def forward(self, pixel_values):
        batch_size, num_frames, num_channels, height, width = pixel_values.shape
        pixel_values = pixel_values.reshape(batch_size * num_frames, num_channels, height, width)

        embeddings = self.projection(pixel_values)
        patch_width = embeddings.shape[-1]
        embeddings = ops.transpose(ops.flatten(embeddings, 2), 1, 2)
        return embeddings, num_frames, patch_width

`mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerPreTrainedModel` ¶

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in mindnlp\transformers\models\timesformer\modeling_timesformer.py

class TimesformerPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """

    config_class = TimesformerConfig
    base_model_prefix = "timesformer"
    main_input_name = "pixel_values"
    supports_gradient_checkpointing = True
    _no_split_modules = ["TimesformerLayer"]

    def _init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            nn.init.trunc_normal_(module.weight, std=self.config.initializer_range)
            if module.bias is not None:
                nn.init.constant_(module.bias, 0)
        elif isinstance(module, nn.LayerNorm):
            nn.init.constant_(module.bias, 0)
            nn.init.constant_(module.weight, 1.0)
        elif isinstance(module, TimesformerEmbeddings):
            nn.init.trunc_normal_(module.cls_token, std=self.config.initializer_range)
            nn.init.trunc_normal_(module.position_embeddings, std=self.config.initializer_range)
            module.patch_embeddings.apply(self._init_weights)

`mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerSelfOutput` ¶

Bases: Module

The residual connection is defined in TimesformerLayer instead of here (as is the case with other models), due to the layernorm applied before each block.

Source code in mindnlp\transformers\models\timesformer\modeling_timesformer.py

class TimesformerSelfOutput(nn.Module):
    """
    The residual connection is defined in TimesformerLayer instead of here (as is the case with other models), due to
    the layernorm applied before each block.
    """

    def __init__(self, config: TimesformerConfig) -> None:
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor:
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)

        return hidden_states

`mindnlp.transformers.models.timesformer.modeling_timesformer.drop_path(input, drop_prob=0.0, training=False)` ¶

Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).

Comment by Ross Wightman: This is the same as the DropConnect impl I created for EfficientNet, etc networks, however, the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper... See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for changing the layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use 'survival rate' as the argument.

Source code in mindnlp\transformers\models\timesformer\modeling_timesformer.py

def drop_path(input: mindspore.Tensor, drop_prob: float = 0.0, training: bool = False) -> mindspore.Tensor:
    """
    Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).

    Comment by Ross Wightman: This is the same as the DropConnect impl I created for EfficientNet, etc networks,
    however, the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
    See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for changing the
    layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use 'survival rate' as the
    argument.
    """
    if drop_prob == 0.0 or not training:
        return input
    keep_prob = 1 - drop_prob
    shape = (input.shape[0],) + (1,) * (input.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
    random_tensor = keep_prob + ops.rand(shape, dtype=input.dtype)
    random_tensor.floor_()  # binarize
    output = input.div(keep_prob) * random_tensor
    return output

timesformer

mindnlp.transformers.models.timesformer.configuration_timesformer ¶

mindnlp.transformers.models.timesformer.configuration_timesformer.TimesformerConfig ¶

mindnlp.transformers.models.timesformer.modeling_timesformer ¶

mindnlp.transformers.models.timesformer.modeling_timesformer.TimeSformerDropPath ¶

mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerEmbeddings ¶

mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerForVideoClassification ¶

mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerForVideoClassification.forward(pixel_values=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None) ¶

mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerModel ¶

mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerModel.forward(pixel_values, output_attentions=None, output_hidden_states=None, return_dict=None) ¶

mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerPatchEmbeddings ¶

mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerPreTrainedModel ¶

mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerSelfOutput ¶

mindnlp.transformers.models.timesformer.modeling_timesformer.drop_path(input, drop_prob=0.0, training=False) ¶

`mindnlp.transformers.models.timesformer.configuration_timesformer` ¶

`mindnlp.transformers.models.timesformer.configuration_timesformer.TimesformerConfig` ¶

`mindnlp.transformers.models.timesformer.modeling_timesformer` ¶

`mindnlp.transformers.models.timesformer.modeling_timesformer.TimeSformerDropPath` ¶

`mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerEmbeddings` ¶

`mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerForVideoClassification` ¶

`mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerForVideoClassification.forward(pixel_values=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None)` ¶

`mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerModel` ¶

`mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerModel.forward(pixel_values, output_attentions=None, output_hidden_states=None, return_dict=None)` ¶

`mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerPatchEmbeddings` ¶

`mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerPreTrainedModel` ¶

`mindnlp.transformers.models.timesformer.modeling_timesformer.TimesformerSelfOutput` ¶

`mindnlp.transformers.models.timesformer.modeling_timesformer.drop_path(input, drop_prob=0.0, training=False)` ¶