reformer

mindnlp.transformers.models.reformer.configuration_reformer

Reformer model configuration

mindnlp.transformers.models.reformer.configuration_reformer.ReformerConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [ReformerModel]. It is used to instantiate a Reformer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Reformer google/reformer-crime-and-punishment architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
attention_head_size

Dimensionality of the projected key, query and value vectors

TYPE: `int`, *optional*, defaults to 64 DEFAULT: 64

attn_layers

List of attention layer types in ascending order. It can be chosen between a LSHSelfAttention layer ("lsh") and a LocalSelfAttention layer ("local").

For more information on LSHSelfAttention layer, see LSH Self Attention. For more information on LocalSelfAttention layer, see Local Self Attention.

TYPE: `List[str]`, *optional*, defaults to `["local", "lsh", "local", "lsh", "local", "lsh"]` DEFAULT: ['local', 'lsh', 'local', 'lsh', 'local', 'lsh']

axial_pos_embds

Whether or not to use axial position embeddings. For more information on how axial position embeddings work, see Axial Position Encodings.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

axial_norm_std

The standard deviation of the normal_initializer for initializing the weight matrices of the axial positional encodings.

TYPE: `float`, *optional*, defaults to 1.0 DEFAULT: 1.0

axial_pos_shape

The position dims of the axial position encodings. During training, the product of the position dims has to be equal to the sequence length.

For more information on how axial position embeddings work, see Axial Position Encodings.

TYPE: `List[int]`, *optional*, defaults to `[64, 64]` DEFAULT: [64, 64]

axial_pos_embds_dim

The embedding dims of the axial position encodings. The sum of the embedding dims has to be equal to the hidden size.

For more information on how axial position embeddings work, see Axial Position Encodings.

TYPE: `List[int]`, *optional*, defaults to `[64, 192]` DEFAULT: [64, 192]

chunk_size_lm_head

The chunk size of the final language model feed forward head layer. A chunk size of 0 means that the feed forward layer is not chunked. A chunk size of n means that the feed forward layer processes n < sequence_length embeddings at a time.

For more information on feed forward chunking, see How does Feed Forward Chunking work?.

TYPE: `int`, *optional*, defaults to 0 DEFAULT: 0

eos_token_id

The token id for the end-of-sentence token.

TYPE: `int`, *optional*, defaults to 2 DEFAULT: 2

feed_forward_size

Dimensionality of the feed_forward layer in the residual attention block.

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

hash_seed

Seed that can be used to make locality sensitive hashing in LSHSelfAttention deterministic. This should only be set for testing purposes. For evaluation and training purposes, hash_seed should be left as None to ensure fully random rotations in the locality sensitive hashing scheme.

TYPE: `int`, *optional* DEFAULT: None

hidden_act

The non-linear activation function (function or string) in the feed forward layer in the residual attention block. If string, "gelu", "relu", "silu" and "gelu_new" are supported.

TYPE: `str` or `Callable`, *optional*, defaults to `"relu"` DEFAULT: 'relu'

hidden_dropout_prob

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

TYPE: `float`, *optional*, defaults to 0.05 DEFAULT: 0.05

hidden_size

Dimensionality of the output hidden states of the residual attention blocks.

TYPE: `int`, *optional*, defaults to 256 DEFAULT: 256

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

is_decoder

Whether or not to use a causal mask in addition to the attention_mask passed to [ReformerModel]. When using the Reformer for causal language modeling, this argument should be set to True.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

layer_norm_eps

The epsilon used by the layer normalization layers.

TYPE: `float`, *optional*, defaults to 1e-12 DEFAULT: 1e-12

local_attn_chunk_length

Length of chunk which attends to itself in LocalSelfAttention. Chunking reduces memory complexity from sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk length (chunked self attention).

TYPE: `int`, *optional*, defaults to 64 DEFAULT: 64

local_num_chunks_before

Number of previous neighbouring chunks to attend to in LocalSelfAttention layer in addition to itself.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

local_num_chunks_after

Number of following neighbouring chunks to attend to in LocalSelfAttention layer in addition to itself.

TYPE: `int`, *optional*, defaults to 0 DEFAULT: 0

local_attention_probs_dropout_prob

The dropout ratio for the attention probabilities in LocalSelfAttention.

TYPE: `float`, *optional*, defaults to 0.05 DEFAULT: 0.05

lsh_attn_chunk_length

Length of chunk which attends to itself in LSHSelfAttention. Chunking reduces memory complexity from sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk length (chunked self attention).

TYPE: `int`, *optional*, defaults to 64 DEFAULT: 64

lsh_num_chunks_before

Number of previous neighbouring chunks to attend to in LSHSelfAttention layer in addition to itself.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

lsh_num_chunks_after

Number of following neighbouring chunks to attend to in LSHSelfAttention layer in addition to itself.

TYPE: `int`, *optional*, defaults to 0 DEFAULT: 0

lsh_attention_probs_dropout_prob

The dropout ratio for the attention probabilities in LSHSelfAttention.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

max_position_embeddings

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

TYPE: `int`, *optional*, defaults to 4096 DEFAULT: 4096

num_attention_heads

Number of attention heads for each attention layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

num_buckets

Number of buckets that the query key vectors can be "hashed into" using the locality sensitive hashing scheme. Each query key vector is hashed into a hash in 1, ..., num_buckets. The number of buckets can also be factorized into a list for improved memory complexity. In this case, each query key vector is hashed into a hash in 1-1, 1-2, ..., num_buckets[0]-1, ..., num_buckets[0]-num_buckets[1] if num_buckets is factorized into two factors. The number of buckets (or the product of the factors) should approximately equal sequence length / lsh_chunk_length. If num_buckets is not set, a good value is calculated on the fly.

TYPE: `int` or `List[int]`, *optional* DEFAULT: None

num_hashes

Number of hashing rounds (e.g., number of random rotations) in the locality sensitive hashing scheme. The higher num_hashes, the more accurate the LSHSelfAttention becomes, but also the more memory and time intensive the hashing becomes.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

pad_token_id

The token id for the padding token.

TYPE: `int`, *optional*, defaults to 0 DEFAULT: 0

vocab_size

Vocabulary size of the Reformer model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling [ReformerModel].

TYPE: `int`, *optional*, defaults to 320 DEFAULT: 320

tie_word_embeddings

Whether to tie input and output embeddings.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

use_cache

Whether or not the model should return the last key/values attentions (not used by all models).

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

classifier_dropout

The dropout ratio for the classification head.

TYPE: `float`, *optional* DEFAULT: None

Example
>>> from transformers import ReformerConfig, ReformerModel
...
>>> # Initializing a Reformer configuration
>>> configuration = ReformerConfig()
...
>>> # Initializing a Reformer model (with random weights)
>>> model = ReformerModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
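
The parameter descriptions above impose a few consistency constraints: the product of axial_pos_shape must equal the training sequence length, the entries of axial_pos_embds_dim must sum to hidden_size, and num_buckets (or the product of its factors) should roughly equal sequence length / lsh_attn_chunk_length. The sketch below builds a custom configuration that satisfies these constraints for a 4096-token training length; it reuses the import from the example above, and the chosen values are illustrative rather than recommended settings.

>>> from transformers import ReformerConfig
...
>>> # 4096 = 64 * 64 axial positions, 256 = 64 + 192 axial embedding dims,
>>> # num_buckets ~ 4096 / lsh_attn_chunk_length
>>> custom_config = ReformerConfig(
...     hidden_size=256,
...     axial_pos_shape=[64, 64],
...     axial_pos_embds_dim=[64, 192],
...     lsh_attn_chunk_length=64,
...     num_buckets=64,
...     attn_layers=["local", "lsh", "local", "lsh"],
... )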
Source code in mindnlp\transformers\models\reformer\configuration_reformer.py
class ReformerConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`ReformerModel`]. It is used to instantiate a
    Reformer model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the ReFormer
    [google/reformer-crime-and-punishment](https://hf-mirror.com/google/reformer-crime-and-punishment) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        attention_head_size (`int`, *optional*, defaults to 64):
            Dimensionality of the projected key, query and value vectors
        attn_layers (`List[str]`, *optional*, defaults to `["local", "lsh", "local", "lsh", "local", "lsh"]`):
            List of attention layer types in ascending order. It can be chosen between a LSHSelfAttention layer
            (`"lsh"`) and a LocalSelfAttention layer (`"local"`).

            For more information on LSHSelfAttention layer, see [LSH Self Attention](reformer#lsh-self-attention). For
            more information on LocalSelfAttention layer, see [Local Self Attention](reformer#local-self-attention).
        axial_pos_embds (`bool`, *optional*, defaults to `True`):
            Whether or not to use axial position embeddings. For more information on how axial position embeddings
            work, see [Axial Position Encodings](reformer#axial-positional-encodings).
        axial_norm_std (`float`, *optional*, defaults to 1.0):
            The standard deviation of the normal_initializer for initializing the weight matrices of the axial
            positional encodings.
        axial_pos_shape (`List[int]`, *optional*, defaults to `[64, 64]`):
            The position dims of the axial position encodings. During training, the product of the position dims has to
            be equal to the sequence length.

            For more information on how axial position embeddings work, see [Axial Position
            Encodings](reformer#axial-positional-encodings).
        axial_pos_embds_dim (`List[int]`, *optional*, defaults to `[64, 192]`):
            The embedding dims of the axial position encodings. The sum of the embedding dims has to be equal to the
            hidden size.

            For more information on how axial position embeddings work, see [Axial Position
            Encodings](reformer#axial-positional-encodings).
        chunk_size_lm_head (`int`, *optional*, defaults to 0):
            The chunk size of the final language model feed forward head layer. A chunk size of 0 means that the feed
            forward layer is not chunked. A chunk size of n means that the feed forward layer processes n <
            sequence_length embeddings at a time.

            For more information on feed forward chunking, see [How does Feed Forward Chunking
            work?](../glossary#feed-forward-chunking).
        eos_token_id (`int`, *optional*, defaults to 2):
            The token id for the end-of-sentence token.
        feed_forward_size (`int`, *optional*, defaults to 512):
            Dimensionality of the feed_forward layer in the residual attention block.
        hash_seed (`int`, *optional*):
            Seed that can be used to make locality sensitive hashing in `LSHSelfAttention` deterministic. This should
            only be set for testing purposes. For evaluation and training purposes, `hash_seed` should be left as
            `None` to ensure fully random rotations in the locality sensitive hashing scheme.
        hidden_act (`str` or `Callable`, *optional*, defaults to `"relu"`):
            The non-linear activation function (function or string) in the feed forward layer in the residual attention
            block. If string, `"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.05):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        hidden_size (`int`, *optional*, defaults to 256):
            Dimensionality of the output hidden states of the residual attention blocks.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        is_decoder (`bool`, *optional*, defaults to `False`):
            Whether or not to use a causal mask in addition to the `attention_mask` passed to [`ReformerModel`]. When
            using the Reformer for causal language modeling, this argument should be set to `True`.
        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        local_attn_chunk_length (`int`, *optional*, defaults to 64):
            Length of chunk which attends to itself in `LocalSelfAttention`. Chunking reduces memory complexity from
            sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk
            length (chunked self attention).
        local_num_chunks_before (`int`, *optional*, defaults to 1):
            Number of previous neighbouring chunks to attend to in `LocalSelfAttention` layer in addition to itself.
        local_num_chunks_after (`int`, *optional*, defaults to 0):
            Number of following neighbouring chunks to attend to in `LocalSelfAttention` layer in addition to itself.
        local_attention_probs_dropout_prob (`float`, *optional*, defaults to 0.05):
            The dropout ratio for the attention probabilities in `LocalSelfAttention`.
        lsh_attn_chunk_length (`int`, *optional*, defaults to 64):
            Length of chunk which attends to itself in `LSHSelfAttention`. Chunking reduces memory complexity from
            sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk
            length (chunked self attention).
        lsh_num_chunks_before (`int`, *optional*, defaults to 1):
            Number of previous neighbouring chunks to attend to in `LSHSelfAttention` layer in addition to itself.
        lsh_num_chunks_after (`int`, *optional*, defaults to 0):
            Number of following neighbouring chunks to attend to in `LSHSelfAttention` layer in addition to itself.
        lsh_attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities in `LSHSelfAttention`.
        max_position_embeddings (`int`, *optional*, defaults to 4096):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        num_buckets (`int` or `List[int]`, *optional*):
            Number of buckets, the key query vectors can be "hashed into" using the locality sensitive hashing scheme.
            Each query key vector is hashed into a hash in `1, ..., num_buckets`. The number of buckets can also be
            factorized into a list for improved memory complexity. In this case, each query key vector is hashed into a
            hash in `1-1, 1-2, ..., num_buckets[0]-1, ..., num_buckets[0]-num_buckets[1]` if `num_buckets` is
            factorized into two factors. The number of buckets (or the product of the factors) should approximately
            equal sequence length / lsh_chunk_length. If `num_buckets` is not set, a good value is calculated on the fly.
        num_hashes (`int`, *optional*, defaults to 1):
            Number of hashing rounds (e.g., number of random rotations) in the locality sensitive hashing scheme. The higher
            `num_hashes`, the more accurate the `LSHSelfAttention` becomes, but also the more memory and time intensive
            the hashing becomes.
        pad_token_id (`int`, *optional*, defaults to 0):
            The token id for the padding token.
        vocab_size (`int`, *optional*, defaults to 320):
            Vocabulary size of the Reformer model. Defines the number of different tokens that can be represented by
            the `inputs_ids` passed when calling [`ReformerModel`].
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether to tie input and output embeddings.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models).
        classifier_dropout (`float`, *optional*):
            The dropout ratio for the classification head.

    Example:
        ```python
        >>> from transformers import ReformerConfig, ReformerModel
        ...
        >>> # Initializing a Reformer configuration
        >>> configuration = ReformerConfig()
        ...
        >>> # Initializing a Reformer model (with random weights)
        >>> model = ReformerModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
"""
    model_type = "reformer"
    keys_to_ignore_at_inference = ["past_buckets_states"]
    attribute_map = {}

    def __init__(
        self,
        attention_head_size=64,
        attn_layers=["local", "lsh", "local", "lsh", "local", "lsh"],
        axial_norm_std=1.0,
        axial_pos_embds=True,
        axial_pos_shape=[64, 64],
        axial_pos_embds_dim=[64, 192],
        chunk_size_lm_head=0,
        eos_token_id=2,
        feed_forward_size=512,
        hash_seed=None,
        hidden_act="relu",
        hidden_dropout_prob=0.05,
        hidden_size=256,
        initializer_range=0.02,
        is_decoder=False,
        layer_norm_eps=1e-12,
        local_num_chunks_before=1,
        local_num_chunks_after=0,
        local_attention_probs_dropout_prob=0.05,
        local_attn_chunk_length=64,
        lsh_attn_chunk_length=64,
        lsh_attention_probs_dropout_prob=0.0,
        lsh_num_chunks_before=1,
        lsh_num_chunks_after=0,
        max_position_embeddings=4096,
        num_attention_heads=12,
        num_buckets=None,
        num_hashes=1,
        pad_token_id=0,
        vocab_size=320,
        tie_word_embeddings=False,
        use_cache=True,
        classifier_dropout=None,
        **kwargs,
    ):
        """
        Initializes a new instance of the ReformerConfig class.

        Args:
            attention_head_size (int): The size of each attention head.
            attn_layers (list): The list of attention layer types to be used.
            axial_norm_std (float): Standard deviation for axial positional embeddings normalization.
            axial_pos_embds (bool): Whether to use axial positional embeddings.
            axial_pos_shape (list): The shape of axial positional embeddings.
            axial_pos_embds_dim (list): The dimensions of axial positional embeddings.
            chunk_size_lm_head (int): Size of chunk for the language model head.
            eos_token_id (int): The token ID for the end-of-sequence token.
            feed_forward_size (int): The size of the feed-forward network.
            hash_seed (None or int): The seed for hashing functions.
            hidden_act (str): The activation function for hidden layers.
            hidden_dropout_prob (float): The dropout probability for hidden layers.
            hidden_size (int): The size of the hidden layers.
            initializer_range (float): The range for weight initialization.
            is_decoder (bool): Whether the model is used as a decoder.
            layer_norm_eps (float): Epsilon value for layer normalization.
            local_num_chunks_before (int): Number of local attention chunks before.
            local_num_chunks_after (int): Number of local attention chunks after.
            local_attention_probs_dropout_prob (float): Dropout probability for local attention.
            local_attn_chunk_length (int): Length of chunks for local attention.
            lsh_attn_chunk_length (int): Length of chunks for LSH attention.
            lsh_attention_probs_dropout_prob (float): Dropout probability for LSH attention.
            lsh_num_chunks_before (int): Number of LSH attention chunks before.
            lsh_num_chunks_after (int): Number of LSH attention chunks after.
            max_position_embeddings (int): The maximum number of position embeddings.
            num_attention_heads (int): The number of attention heads.
            num_buckets (None or tuple): The number of buckets for hashing.
            num_hashes (int): The number of hashes for LSH attention.
            pad_token_id (int): The token ID for padding.
            vocab_size (int): The size of the vocabulary.
            tie_word_embeddings (bool): Whether to tie word embeddings.
            use_cache (bool): Whether to cache intermediate values.
            classifier_dropout (None or float): Dropout probability for classifier layers.

        Returns:
            None.

        Raises:
            None.
        """
        self.hash_seed = hash_seed
        self.vocab_size = vocab_size
        self.attention_head_size = attention_head_size
        self.hidden_size = hidden_size
        self.num_attention_heads = num_attention_heads
        self.num_hashes = num_hashes
        self.num_hidden_layers = len(attn_layers)
        self.num_buckets = tuple(num_buckets) if isinstance(num_buckets, list) else num_buckets
        self.lsh_attn_chunk_length = lsh_attn_chunk_length
        self.local_attn_chunk_length = local_attn_chunk_length
        self.lsh_num_chunks_after = lsh_num_chunks_after
        self.lsh_num_chunks_before = lsh_num_chunks_before
        self.local_num_chunks_after = local_num_chunks_after
        self.local_num_chunks_before = local_num_chunks_before
        self.hidden_act = hidden_act
        self.feed_forward_size = feed_forward_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.lsh_attention_probs_dropout_prob = lsh_attention_probs_dropout_prob
        self.local_attention_probs_dropout_prob = local_attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.axial_pos_embds = axial_pos_embds
        self.axial_pos_shape = tuple(axial_pos_shape)
        self.axial_pos_embds_dim = tuple(axial_pos_embds_dim)
        self.axial_norm_std = axial_norm_std
        self.chunk_size_lm_head = chunk_size_lm_head
        self.attn_layers = attn_layers
        self.use_cache = use_cache
        self.classifier_dropout = classifier_dropout
        super().__init__(
            pad_token_id=pad_token_id,
            eos_token_id=eos_token_id,
            is_decoder=is_decoder,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )
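
A quick illustration of the constructor logic above (using the same import as the class example; the values are arbitrary): num_hidden_layers is derived from the length of attn_layers, and list-valued num_buckets, axial_pos_shape and axial_pos_embds_dim are stored as tuples.

>>> from transformers import ReformerConfig
>>> cfg = ReformerConfig(attn_layers=["local", "lsh"], num_buckets=[32, 64])
>>> cfg.num_hidden_layers
2
>>> cfg.num_buckets
(32, 64)
>>> cfg.axial_pos_shape
(64, 64)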

mindnlp.transformers.models.reformer.configuration_reformer.ReformerConfig.__init__(attention_head_size=64, attn_layers=['local', 'lsh', 'local', 'lsh', 'local', 'lsh'], axial_norm_std=1.0, axial_pos_embds=True, axial_pos_shape=[64, 64], axial_pos_embds_dim=[64, 192], chunk_size_lm_head=0, eos_token_id=2, feed_forward_size=512, hash_seed=None, hidden_act='relu', hidden_dropout_prob=0.05, hidden_size=256, initializer_range=0.02, is_decoder=False, layer_norm_eps=1e-12, local_num_chunks_before=1, local_num_chunks_after=0, local_attention_probs_dropout_prob=0.05, local_attn_chunk_length=64, lsh_attn_chunk_length=64, lsh_attention_probs_dropout_prob=0.0, lsh_num_chunks_before=1, lsh_num_chunks_after=0, max_position_embeddings=4096, num_attention_heads=12, num_buckets=None, num_hashes=1, pad_token_id=0, vocab_size=320, tie_word_embeddings=False, use_cache=True, classifier_dropout=None, **kwargs)

Initializes a new instance of the ReformerConfig class.

PARAMETER DESCRIPTION
attention_head_size

The size of each attention head.

TYPE: int DEFAULT: 64

attn_layers

The list of attention layer types to be used.

TYPE: list DEFAULT: ['local', 'lsh', 'local', 'lsh', 'local', 'lsh']

axial_norm_std

Standard deviation for axial positional embeddings normalization.

TYPE: float DEFAULT: 1.0

axial_pos_embds

Whether to use axial positional embeddings.

TYPE: bool DEFAULT: True

axial_pos_shape

The shape of axial positional embeddings.

TYPE: list DEFAULT: [64, 64]

axial_pos_embds_dim

The dimensions of axial positional embeddings.

TYPE: list DEFAULT: [64, 192]

chunk_size_lm_head

Size of chunk for the language model head.

TYPE: int DEFAULT: 0

eos_token_id

The token ID for the end-of-sequence token.

TYPE: int DEFAULT: 2

feed_forward_size

The size of the feed-forward network.

TYPE: int DEFAULT: 512

hash_seed

The seed for hashing functions.

TYPE: None or int DEFAULT: None

hidden_act

The activation function for hidden layers.

TYPE: str DEFAULT: 'relu'

hidden_dropout_prob

The dropout probability for hidden layers.

TYPE: float DEFAULT: 0.05

hidden_size

The size of the hidden layers.

TYPE: int DEFAULT: 256

initializer_range

The range for weight initialization.

TYPE: float DEFAULT: 0.02

is_decoder

Whether the model is used as a decoder.

TYPE: bool DEFAULT: False

layer_norm_eps

Epsilon value for layer normalization.

TYPE: float DEFAULT: 1e-12

local_num_chunks_before

Number of local attention chunks before.

TYPE: int DEFAULT: 1

local_num_chunks_after

Number of local attention chunks after.

TYPE: int DEFAULT: 0

local_attention_probs_dropout_prob

Dropout probability for local attention.

TYPE: float DEFAULT: 0.05

local_attn_chunk_length

Length of chunks for local attention.

TYPE: int DEFAULT: 64

lsh_attn_chunk_length

Length of chunks for LSH attention.

TYPE: int DEFAULT: 64

lsh_attention_probs_dropout_prob

Dropout probability for LSH attention.

TYPE: float DEFAULT: 0.0

lsh_num_chunks_before

Number of LSH attention chunks before.

TYPE: int DEFAULT: 1

lsh_num_chunks_after

Number of LSH attention chunks after.

TYPE: int DEFAULT: 0

max_position_embeddings

The maximum number of position embeddings.

TYPE: int DEFAULT: 4096

num_attention_heads

The number of attention heads.

TYPE: int DEFAULT: 12

num_buckets

The number of buckets for hashing.

TYPE: None or tuple DEFAULT: None

num_hashes

The number of hashes for LSH attention.

TYPE: int DEFAULT: 1

pad_token_id

The token ID for padding.

TYPE: int DEFAULT: 0

vocab_size

The size of the vocabulary.

TYPE: int DEFAULT: 320

tie_word_embeddings

Whether to tie word embeddings.

TYPE: bool DEFAULT: False

use_cache

Whether to cache intermediate values.

TYPE: bool DEFAULT: True

classifier_dropout

Dropout probability for classifier layers.

TYPE: None or float DEFAULT: None

RETURNS DESCRIPTION

None.

Source code in mindnlp\transformers\models\reformer\configuration_reformer.py
def __init__(
    self,
    attention_head_size=64,
    attn_layers=["local", "lsh", "local", "lsh", "local", "lsh"],
    axial_norm_std=1.0,
    axial_pos_embds=True,
    axial_pos_shape=[64, 64],
    axial_pos_embds_dim=[64, 192],
    chunk_size_lm_head=0,
    eos_token_id=2,
    feed_forward_size=512,
    hash_seed=None,
    hidden_act="relu",
    hidden_dropout_prob=0.05,
    hidden_size=256,
    initializer_range=0.02,
    is_decoder=False,
    layer_norm_eps=1e-12,
    local_num_chunks_before=1,
    local_num_chunks_after=0,
    local_attention_probs_dropout_prob=0.05,
    local_attn_chunk_length=64,
    lsh_attn_chunk_length=64,
    lsh_attention_probs_dropout_prob=0.0,
    lsh_num_chunks_before=1,
    lsh_num_chunks_after=0,
    max_position_embeddings=4096,
    num_attention_heads=12,
    num_buckets=None,
    num_hashes=1,
    pad_token_id=0,
    vocab_size=320,
    tie_word_embeddings=False,
    use_cache=True,
    classifier_dropout=None,
    **kwargs,
):
    """
    Initializes a new instance of the ReformerConfig class.

    Args:
        attention_head_size (int): The size of each attention head.
        attn_layers (list): The list of attention layer types to be used.
        axial_norm_std (float): Standard deviation for axial positional embeddings normalization.
        axial_pos_embds (bool): Whether to use axial positional embeddings.
        axial_pos_shape (list): The shape of axial positional embeddings.
        axial_pos_embds_dim (list): The dimensions of axial positional embeddings.
        chunk_size_lm_head (int): Size of chunk for the language model head.
        eos_token_id (int): The token ID for the end-of-sequence token.
        feed_forward_size (int): The size of the feed-forward network.
        hash_seed (None or int): The seed for hashing functions.
        hidden_act (str): The activation function for hidden layers.
        hidden_dropout_prob (float): The dropout probability for hidden layers.
        hidden_size (int): The size of the hidden layers.
        initializer_range (float): The range for weight initialization.
        is_decoder (bool): Whether the model is used as a decoder.
        layer_norm_eps (float): Epsilon value for layer normalization.
        local_num_chunks_before (int): Number of local attention chunks before.
        local_num_chunks_after (int): Number of local attention chunks after.
        local_attention_probs_dropout_prob (float): Dropout probability for local attention.
        local_attn_chunk_length (int): Length of chunks for local attention.
        lsh_attn_chunk_length (int): Length of chunks for LSH attention.
        lsh_attention_probs_dropout_prob (float): Dropout probability for LSH attention.
        lsh_num_chunks_before (int): Number of LSH attention chunks before.
        lsh_num_chunks_after (int): Number of LSH attention chunks after.
        max_position_embeddings (int): The maximum number of position embeddings.
        num_attention_heads (int): The number of attention heads.
        num_buckets (None or tuple): The number of buckets for hashing.
        num_hashes (int): The number of hashes for LSH attention.
        pad_token_id (int): The token ID for padding.
        vocab_size (int): The size of the vocabulary.
        tie_word_embeddings (bool): Whether to tie word embeddings.
        use_cache (bool): Whether to cache intermediate values.
        classifier_dropout (None or float): Dropout probability for classifier layers.

    Returns:
        None.

    Raises:
        None.
    """
    self.hash_seed = hash_seed
    self.vocab_size = vocab_size
    self.attention_head_size = attention_head_size
    self.hidden_size = hidden_size
    self.num_attention_heads = num_attention_heads
    self.num_hashes = num_hashes
    self.num_hidden_layers = len(attn_layers)
    self.num_buckets = tuple(num_buckets) if isinstance(num_buckets, list) else num_buckets
    self.lsh_attn_chunk_length = lsh_attn_chunk_length
    self.local_attn_chunk_length = local_attn_chunk_length
    self.lsh_num_chunks_after = lsh_num_chunks_after
    self.lsh_num_chunks_before = lsh_num_chunks_before
    self.local_num_chunks_after = local_num_chunks_after
    self.local_num_chunks_before = local_num_chunks_before
    self.hidden_act = hidden_act
    self.feed_forward_size = feed_forward_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.lsh_attention_probs_dropout_prob = lsh_attention_probs_dropout_prob
    self.local_attention_probs_dropout_prob = local_attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.initializer_range = initializer_range
    self.layer_norm_eps = layer_norm_eps
    self.axial_pos_embds = axial_pos_embds
    self.axial_pos_shape = tuple(axial_pos_shape)
    self.axial_pos_embds_dim = tuple(axial_pos_embds_dim)
    self.axial_norm_std = axial_norm_std
    self.chunk_size_lm_head = chunk_size_lm_head
    self.attn_layers = attn_layers
    self.use_cache = use_cache
    self.classifier_dropout = classifier_dropout
    super().__init__(
        pad_token_id=pad_token_id,
        eos_token_id=eos_token_id,
        is_decoder=is_decoder,
        tie_word_embeddings=tie_word_embeddings,
        **kwargs,
    )

mindnlp.transformers.models.reformer.modeling_reformer

MindSpore REFORMER model.

mindnlp.transformers.models.reformer.modeling_reformer.AxialPositionEmbeddings

Bases: Module

Constructs axial position embeddings. Useful for very long input sequences to save memory and time.

Source code in mindnlp\transformers\models\reformer\modeling_reformer.py
class AxialPositionEmbeddings(nn.Module):
    """
    Constructs axial position embeddings. Useful for very long input sequences to save memory and time.
    """

    def __init__(self, config):
        super().__init__()
        self.axial_pos_shape = config.axial_pos_shape
        self.axial_pos_embds_dim = config.axial_pos_embds_dim
        self.dropout = config.hidden_dropout_prob

        self.least_common_mult_chunk_length = _get_least_common_mult_chunk_len(config)
        self.weights = nn.ParameterList()

        if sum(self.axial_pos_embds_dim) != config.hidden_size:
            raise ValueError(
                f"Make sure that config.axial_pos_embds factors: {self.axial_pos_embds_dim} sum to "
                f"config.hidden_size: {config.hidden_size}"
            )

        # create weights
        for axis, axial_pos_embd_dim in enumerate(self.axial_pos_embds_dim):
            # create expanded shapes
            ax_shape = [1] * len(self.axial_pos_shape)
            ax_shape[axis] = self.axial_pos_shape[axis]
            ax_shape = tuple(ax_shape) + (axial_pos_embd_dim,)

            # create tensor and init
            self.weights.append(nn.Parameter(ops.ones(ax_shape, dtype=mindspore.float32)))

    def forward(self, position_ids):
        # broadcast weights to correct shape
        batch_size = position_ids.shape[0]
        sequence_length = position_ids.shape[1]

        broadcasted_weights = [
            weight.broadcast_to((batch_size,) + self.axial_pos_shape + weight.shape[-1:]) for weight in self.weights
        ]

        if self.training is True:
            if reduce(mul, self.axial_pos_shape) != sequence_length:
                raise ValueError(
                    f"If training, make sure that config.axial_pos_shape factors: {self.axial_pos_shape} multiply to "
                    f"sequence length. Got prod({self.axial_pos_shape}) != sequence_length: {sequence_length}. "
                    f"You might want to consider padding your sequence length to {reduce(mul, self.axial_pos_shape)} "
                    "or changing config.axial_pos_shape."
                )

            if self.dropout > 0:
                weights = ops.cat(broadcasted_weights, dim=-1)
                # permute weights so that 2D correctly drops dims 1 and 2
                transposed_weights = weights.swapaxes(2, 1)
                # drop entire matrix of last two dims (prev dims 1 and 2)
                dropped_transposed_weights = nn.functional.dropout2d(
                    transposed_weights, p=self.dropout, training=self.training
                )
                dropped_weights = dropped_transposed_weights.swapaxes(2, 1)

                position_encodings = ops.reshape(dropped_weights, (batch_size, sequence_length, -1))

            else:
                position_encodings = ops.cat(
                    [ops.reshape(weight, (batch_size, sequence_length, -1)) for weight in broadcasted_weights],
                    dim=-1,
                )

        else:
            if reduce(mul, self.axial_pos_shape) < sequence_length:
                raise ValueError(
                    f"Make sure that config.axial_pos_shape factors: {self.axial_pos_shape} multiply at least to "
                    f"max(sequence_length, least_common_mult_chunk_length): max({sequence_length}, "
                    f"{self.least_common_mult_chunk_length})."
                )

            # compute how many columns are needed
            max_position_id = position_ids.max().item()
            required_pos_encodings_columns = -(-(max_position_id + 1) // self.axial_pos_shape[1])

            # cut to columns that are needed
            position_encodings = ops.cat(
                [weight[:, :required_pos_encodings_columns] for weight in broadcasted_weights], dim=-1
            )
            position_encodings = ops.reshape(position_encodings, (batch_size, -1, position_encodings.shape[-1]))

            # select correct position encodings
            position_encodings = ops.cat(
                [
                    ops.index_select(position_encodings[i], 0, position_ids[i]).unsqueeze(0)
                    for i in range(batch_size)
                ],
                dim=0,
            )

        return position_encodings
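
To see why the factorization saves memory, compare the parameter count of a full learned position-embedding table with the axial variant for the default configuration (sequence length 4096, hidden size 256, axial_pos_shape=[64, 64], axial_pos_embds_dim=[64, 192]). This is a back-of-the-envelope sketch, independent of the implementation above:

>>> full_table = 4096 * 256            # one hidden_size vector per position
>>> axial_table = 64 * 64 + 64 * 192   # one row factor plus one column factor
>>> full_table, axial_table
(1048576, 16384)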

mindnlp.transformers.models.reformer.modeling_reformer.EfficientAttentionMixin

A few utilities for nn.Modules in Reformer, to be used as a mixin.

Source code in mindnlp\transformers\models\reformer\modeling_reformer.py
class EfficientAttentionMixin:
    """
    A few utilities for nn.Modules in Reformer, to be used as a mixin.
    """

    def _look_adjacent(self, vectors, num_chunks_before, num_chunks_after):
        """
        Used to implement attention between consecutive chunks.

        Args:
            vectors: array of shape [batch_size, num_attention_heads, n_chunks, chunk_len, ...]
            num_chunks_before: chunks before current chunk to include in attention
            num_chunks_after: chunks after current chunk to include in attention

        Returns:
            tensor of shape [num_chunks, N * chunk_length, ...], where N = (1 + num_chunks_before + num_chunks_after).
        """
        if num_chunks_before == 0 and num_chunks_after == 0:
            return vectors

        slices = []
        for i in range(-num_chunks_before, num_chunks_after + 1):
            if i == 0:
                slices.append(vectors)
            else:
                slices.append(ops.cat([vectors[:, :, i:, ...], vectors[:, :, :i, ...]], dim=2))
        return ops.cat(slices, dim=3)

    def _split_hidden_size_dim(self, x, num_attn_heads, attn_head_size):
        """
        splits hidden_size dim into attn_head_size and num_attn_heads
        """
        new_x_shape = x.shape[:-1] + (num_attn_heads, attn_head_size)
        x = x.view(*new_x_shape)
        return x.swapaxes(2, 1)

    def _merge_hidden_size_dims(self, x, num_attn_heads, attn_head_size):
        """
        merges attn_head_size dim and num_attn_heads dim into hidden_size
        """
        x = x.permute(0, 2, 1, 3)
        return ops.reshape(x, (x.shape[0], -1, num_attn_heads * attn_head_size))

    def _split_seq_length_dim_to(self, vectors, dim_factor_1, dim_factor_2, num_attn_heads, attn_head_size=None):
        """
        splits sequence length dim of vectors into `dim_factor_1` and `dim_factor_2` dims
        """
        batch_size = vectors.shape[0]
        split_dim_shape = (batch_size, num_attn_heads, dim_factor_1, dim_factor_2)

        if len(vectors.shape) == 4:
            return ops.reshape(vectors, split_dim_shape + (attn_head_size,))
        elif len(vectors.shape) == 3:
            return ops.reshape(vectors, split_dim_shape)
        else:
            raise ValueError(f"Input vector rank should be one of [3, 4], but is: {len(vectors.shape)}")
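
The roll-and-concatenate trick in _look_adjacent can be illustrated with a small NumPy sketch (illustrative only, not library code): rolling the chunk axis and concatenating along the chunk-length axis gives every chunk access to its own positions plus those of the requested neighbouring chunks.

>>> import numpy as np
>>> def look_adjacent(vectors, num_chunks_before, num_chunks_after):
...     # vectors: [batch, heads, n_chunks, chunk_len]
...     slices = []
...     for i in range(-num_chunks_before, num_chunks_after + 1):
...         if i == 0:
...             slices.append(vectors)
...         else:
...             # roll along the chunk axis so chunk k also sees chunk k + i
...             slices.append(np.concatenate([vectors[:, :, i:], vectors[:, :, :i]], axis=2))
...     return np.concatenate(slices, axis=3)
...
>>> x = np.arange(24).reshape(1, 1, 4, 6)
>>> look_adjacent(x, num_chunks_before=1, num_chunks_after=0).shape
(1, 1, 4, 12)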

mindnlp.transformers.models.reformer.modeling_reformer.LSHSelfAttention

Bases: Module, EfficientAttentionMixin
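
This layer implements the angular LSH scheme from the Reformer paper: each query/key vector is projected with random rotations, and the argmax over the concatenation of the rotated vector and its negation assigns it to one of num_buckets buckets, so similar vectors tend to share a bucket (see _hash_vectors in the source below). A minimal single-head, single-hash NumPy sketch of the idea (illustrative only, not library code):

>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> head_dim, num_buckets, seq_len = 8, 4, 16
>>> vectors = rng.normal(size=(seq_len, head_dim))             # query/key vectors
>>> rotations = rng.normal(size=(head_dim, num_buckets // 2))  # random rotations
>>> rotated = vectors @ rotations
>>> buckets = np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)
>>> buckets.shape                                              # one bucket id per position
(16,)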

Source code in mindnlp\transformers\models\reformer\modeling_reformer.py
class LSHSelfAttention(nn.Module, EfficientAttentionMixin):
    def __init__(self, config):
        super().__init__()
        self.config = config

        self.chunk_length = config.lsh_attn_chunk_length
        self.num_hashes = config.num_hashes
        self.num_buckets = config.num_buckets
        self.num_chunks_before = config.lsh_num_chunks_before
        self.num_chunks_after = config.lsh_num_chunks_after
        self.hash_seed = config.hash_seed
        self.is_decoder = config.is_decoder
        self.max_position_embeddings = config.max_position_embeddings

        self.dropout = config.lsh_attention_probs_dropout_prob

        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = config.attention_head_size
        self.all_head_size = self.num_attention_heads * self.attention_head_size
        self.hidden_size = config.hidden_size

        # projection matrices
        self.query_key = nn.Linear(self.hidden_size, self.all_head_size, bias=False)
        self.value = nn.Linear(self.hidden_size, self.all_head_size, bias=False)

        # save mask value here. Need fp32 and fp16 mask values
        self.register_buffer("self_mask_value_float16", mindspore.tensor(-1e3), persistent=False)
        self.register_buffer("self_mask_value_float32", mindspore.tensor(-1e5), persistent=False)
        self.register_buffer("mask_value_float16", mindspore.tensor(-1e4), persistent=False)
        self.register_buffer("mask_value_float32", mindspore.tensor(-1e9), persistent=False)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        num_hashes=None,
        buckets=None,
        past_buckets_states=None,
        use_cache=False,
        output_attentions=False,
        **kwargs,
    ):
        sequence_length = hidden_states.shape[1]
        batch_size = hidden_states.shape[0]

        # num hashes can optionally be overwritten by user
        num_hashes = num_hashes if num_hashes is not None else self.num_hashes

        do_cached_attention = use_cache and past_buckets_states[1] is not None

        # check if cache shall be used and that hidden states are already cached
        if do_cached_attention:
            assert sequence_length == 1, (
                "At the moment, auto-regressive language generation is only possible one word at a time. Make sure"
                f" that input sequence length {sequence_length} equals 1, when `past_buckets_states` is passed."
            )
            past_buckets = past_buckets_states[0]
            past_states = past_buckets_states[1]

            # get query vector
            query_vectors = self.query_key(hidden_states)
            query_vectors = self._split_hidden_size_dim(
                query_vectors, self.num_attention_heads, self.attention_head_size
            )

            if past_buckets is not None:
                key_value_hidden_states, sorted_bucket_idx, buckets = self._get_relevant_hid_states_and_buckets(
                    query_vectors=query_vectors,
                    attention_mask=attention_mask,
                    num_hashes=num_hashes,
                    hidden_states=hidden_states,
                    past_states=past_states,
                    past_buckets=past_buckets,
                )

                query_key_vectors = self._query_per_attn_head(key_value_hidden_states)
                value_vectors = self._value_per_attn_head(key_value_hidden_states)

                # split key & value vectors by num hashes to apply
                # self attention on each separately
                query_key_vectors = self._split_seq_length_dim_to(
                    query_key_vectors,
                    num_hashes,
                    -1,
                    self.num_attention_heads,
                    self.attention_head_size,
                )
                value_vectors = self._split_seq_length_dim_to(
                    value_vectors,
                    num_hashes,
                    -1,
                    self.num_attention_heads,
                    self.attention_head_size,
                )
                # repeat query vectors across hash dimension
                query_vectors = query_vectors.unsqueeze(2).tile((1, 1, num_hashes, 1, 1))
            else:
                key_value_hidden_states = ops.cat([past_states, hidden_states], dim=1)

                query_key_vectors = self.query_key(key_value_hidden_states)
                value_vectors = self.value(key_value_hidden_states)

        else:
            # project hidden_states to query_key and value
            query_vectors = None
            query_key_vectors = self.query_key(hidden_states)
            value_vectors = self.value(hidden_states)

        # if query key is not already split
        if not do_cached_attention or past_buckets is None:
            query_key_vectors = self._split_hidden_size_dim(
                query_key_vectors, self.num_attention_heads, self.attention_head_size
            )
            value_vectors = self._split_hidden_size_dim(
                value_vectors, self.num_attention_heads, self.attention_head_size
            )

        # cache buckets for next incremental decoding
        if do_cached_attention and past_buckets is None and key_value_hidden_states.shape[1] >= self.chunk_length:
            buckets = self._hash_vectors(query_key_vectors, num_hashes, attention_mask)

        # free memory
        del hidden_states

        assert (
            query_key_vectors.shape[-1] == self.attention_head_size
        ), f"last dim of query_key_vectors is {query_key_vectors.shape[-1]} but should be {self.attention_head_size}."
        assert (
            value_vectors.shape[-1] == self.attention_head_size
        ), f"last dim of value_vectors is {value_vectors.shape[-1]} but should be {self.attention_head_size}."

        do_standard_self_attention = (sequence_length <= self.chunk_length) or (
            use_cache and past_buckets_states[1] is not None
        )
        # LSH attention only makes sense if chunked attention should be performed
        if not do_standard_self_attention:
            # set `num_buckets` on the fly, recommended way to do it
            if self.num_buckets is None:
                self._set_num_buckets(sequence_length)

            # use cached buckets for backprop only
            if buckets is None:
                # hash query key vectors into buckets
                buckets = self._hash_vectors(query_key_vectors, num_hashes, attention_mask)
            else:
                # make sure buckets has correct shape for LSH attention
                buckets = buckets.view(batch_size, self.num_attention_heads, num_hashes * sequence_length)

            assert (
                int(buckets.shape[-1]) == num_hashes * sequence_length
            ), f"last dim of buckets is {buckets.shape[-1]}, but should be {num_hashes * sequence_length}"

            sorted_bucket_idx, undo_sorted_bucket_idx = self._get_sorted_bucket_idx_and_undo_sorted_bucket_idx(
                sequence_length, buckets, num_hashes
            )

            # make sure bucket idx is not longer then sequence length
            sorted_bucket_idx_per_hash = sorted_bucket_idx % sequence_length

            # cluster query key value vectors according to hashed buckets
            query_key_vectors = self._gather_by_expansion(query_key_vectors, sorted_bucket_idx_per_hash, num_hashes)
            value_vectors = self._gather_by_expansion(value_vectors, sorted_bucket_idx_per_hash, num_hashes)
            query_key_vectors = self._split_seq_length_dim_to(
                query_key_vectors,
                -1,
                self.chunk_length,
                self.num_attention_heads,
                self.attention_head_size,
            )
            value_vectors = self._split_seq_length_dim_to(
                value_vectors,
                -1,
                self.chunk_length,
                self.num_attention_heads,
                self.attention_head_size,
            )

            if self.chunk_length is None:
                assert self.num_chunks_before == 0 and self.num_chunks_after == 0, (
                    "If `config.chunk_length` is `None`, make sure `config.num_chunks_after` and"
                    " `config.num_chunks_before` are set to 0."
                )
        elif do_cached_attention and past_buckets is not None:
            # use max sequence length
            sorted_bucket_idx_per_hash = sorted_bucket_idx
        else:
            # get sequence length indices
            sorted_bucket_idx_per_hash = ops.arange(sequence_length).tile(
                (batch_size, self.num_attention_heads, 1)
            )

        # scale key vectors
        sqrt_num = np.sqrt(self.attention_head_size)
        key_vectors = self._len_and_dim_norm(query_key_vectors, sqrt_num)

        # set query_vectors to query key vectors if LSH self attention
        query_vectors = query_vectors if query_vectors is not None else query_key_vectors

        # free memory
        del query_key_vectors

        # get attention probs
        out_vectors, logits, attention_probs = self._attend(
            query_vectors=query_vectors,
            key_vectors=key_vectors,
            value_vectors=value_vectors,
            sorted_bucket_idx_per_hash=sorted_bucket_idx_per_hash,
            attention_mask=attention_mask,
            head_mask=head_mask,
            do_standard_self_attention=do_standard_self_attention,
            do_cached_attention=do_cached_attention,
        )

        # free memory
        del key_vectors, value_vectors

        # re-order out_vectors and logits
        if not do_standard_self_attention:
            # sort clusters back to correct ordering
            out_vectors, logits = ReverseSort()(out_vectors, logits, sorted_bucket_idx, undo_sorted_bucket_idx)

        if not do_standard_self_attention or (do_cached_attention and past_buckets is not None):
            # sum up all hash rounds
            if num_hashes > 1:
                out_vectors = self._split_seq_length_dim_to(
                    out_vectors,
                    num_hashes,
                    sequence_length,
                    self.num_attention_heads,
                    self.attention_head_size,
                )
                logits = self._split_seq_length_dim_to(
                    logits,
                    num_hashes,
                    sequence_length,
                    self.num_attention_heads,
                    self.attention_head_size,
                ).unsqueeze(-1)

                probs_vectors = ops.exp(logits - ops.logsumexp(logits, dim=2, keepdim=True))
                out_vectors = ops.sum(out_vectors * probs_vectors, dim=2)
                # free memory
                del probs_vectors

            # free memory
            del logits

        assert out_vectors.shape == (
            batch_size,
            self.num_attention_heads,
            sequence_length,
            self.attention_head_size,
        ), (
            "out_vectors have be of shape `[batch_size, config.num_attention_heads, sequence_length,"
            " config.attention_head_size]`."
        )

        out_vectors = self._merge_hidden_size_dims(out_vectors, self.num_attention_heads, self.attention_head_size)

        if output_attentions is False:
            attention_probs = ()

        if buckets is not None:
            buckets = buckets.view(batch_size, self.num_attention_heads, num_hashes, -1)

        return LSHSelfAttentionOutput(hidden_states=out_vectors, attention_probs=attention_probs, buckets=buckets)

    def _query_per_attn_head(self, hidden_states):
        per_head_query_key = self.query_key.weight.reshape(
            self.num_attention_heads, self.attention_head_size, self.hidden_size
        ).swapaxes(-2, -1)
        # only relevant for inference and no bias => we can use einsum here
        query_key_vectors = ops.einsum("balh,ahr->balr", hidden_states, per_head_query_key)
        return query_key_vectors

    def _value_per_attn_head(self, hidden_states):
        per_head_value = self.value.weight.reshape(
            self.num_attention_heads, self.attention_head_size, self.hidden_size
        ).swapaxes(-2, -1)
        # only relevant for inference and no bias => we can use einsum here
        value_vectors = ops.einsum("balh,ahr->balr", hidden_states, per_head_value)
        return value_vectors

    def _hash_vectors(self, vectors, num_hashes, attention_mask, increase_num_buckets=False):
        batch_size = vectors.shape[0]

        # See https://arxiv.org/pdf/1509.02897.pdf
        # We sample a different random rotation for each round of hashing to
        # decrease the probability of hash misses.
        if isinstance(self.num_buckets, int):
            assert (
                self.num_buckets % 2 == 0
            ), f"There should be an even number of buckets, but `self.num_buckets`: {self.num_buckets}"
            rotation_size = self.num_buckets
            num_buckets = self.num_buckets
        else:
            # Factorize the hash if self.num_buckets is a list or tuple
            rotation_size, num_buckets = 0, 1
            for bucket_factor in self.num_buckets:
                assert (
                    bucket_factor % 2 == 0
                ), f"The number of buckets should be even, but `num_bucket`: {bucket_factor}"
                rotation_size = rotation_size + bucket_factor
                num_buckets = num_buckets * bucket_factor

        if self.hash_seed is not None:
            # for determinism
            mindspore.set_seed(self.hash_seed)
            mindspore.manual_seed(self.hash_seed)

        rotations_shape = (self.num_attention_heads, vectors.shape[-1], num_hashes, rotation_size // 2)
        # create random rotations of shape (num_attention_heads, head_dim, num_hashes, rotation_size // 2)
        random_rotations = ops.randn(rotations_shape, dtype=vectors.dtype)
        # Output dim: Batch_Size x Num_Attn_Heads x Num_Hashes x Seq_Len x Num_Buckets/2
        rotated_vectors = ops.einsum("bmtd,mdhr->bmhtr", vectors, random_rotations)

        if isinstance(self.num_buckets, int) or len(self.num_buckets) == 1:
            rotated_vectors = ops.cat([rotated_vectors, -rotated_vectors], dim=-1)
            buckets = ops.argmax(rotated_vectors, dim=-1)
        else:
            # Get the buckets for them and combine.
            buckets, cur_sum, cur_product = None, 0, 1
            for bucket_factor in self.num_buckets:
                rotated_vectors_factor = rotated_vectors[..., cur_sum : cur_sum + (bucket_factor // 2)]
                cur_sum = cur_sum + bucket_factor // 2
                rotated_vectors_factor = ops.cat([rotated_vectors_factor, -rotated_vectors_factor], dim=-1)
                if buckets is None:
                    buckets = ops.argmax(rotated_vectors_factor, dim=-1)
                else:
                    buckets = buckets + (cur_product * ops.argmax(rotated_vectors_factor, dim=-1))

                cur_product = cur_product * bucket_factor

        if attention_mask is not None and (attention_mask.sum().item() < batch_size * attention_mask.shape[-1]):
            # add an extra bucket for padding tokens only
            num_buckets = num_buckets + 1
            # assign padding tokens extra bucket
            buckets_mask = attention_mask.to(mindspore.bool_)[:, None, None, :].broadcast_to(buckets.shape)
            buckets = ops.where(
                buckets_mask, buckets, mindspore.tensor(num_buckets - 1, dtype=mindspore.int64)
            )
        elif increase_num_buckets:
            num_buckets = num_buckets + 1

        # buckets is now (Batch_size x Num_Attn_Heads x Num_Hashes x Seq_Len).
        # Next we add offsets so that bucket numbers from different hashing rounds don't overlap.
        offsets = ops.arange(num_hashes)
        offsets = (offsets * num_buckets).view((1, 1, -1, 1))

        # expand to batch size and num attention heads
        offsets = offsets.broadcast_to((batch_size, self.num_attention_heads) + offsets.shape[-2:])
        offset_buckets = ops.flatten((buckets + offsets), start_dim=2, end_dim=3)

        return offset_buckets

    def _get_sorted_bucket_idx_and_undo_sorted_bucket_idx(self, sequence_length, buckets, num_hashes):
        # no gradients are needed
        with no_grad():
            # hash-based sort
            sorted_bucket_idx = _stable_argsort(buckets, dim=-1)

            # create simple indices to scatter to, to have undo sort
            indices = (
                ops.arange(sorted_bucket_idx.shape[-1], dtype=sorted_bucket_idx.dtype)
                .view(1, 1, -1)
                .broadcast_to(sorted_bucket_idx.shape)
            )

            # get undo sort
            undo_sorted_bucket_idx = ops.zeros_like(sorted_bucket_idx)
            undo_sorted_bucket_idx = ops.scatter(undo_sorted_bucket_idx, -1, sorted_bucket_idx, indices)

        return sorted_bucket_idx, undo_sorted_bucket_idx

    def _set_num_buckets(self, sequence_length):
        # `num_buckets` should be set to 2 * sequence_length // chunk_length as recommended in paper
        num_buckets_pow_2 = (2 * (sequence_length // self.chunk_length)).bit_length() - 1
        # make sure buckets are power of 2
        num_buckets = 2**num_buckets_pow_2

        # factorize `num_buckets` if `num_buckets` becomes too large
        num_buckets_limit = 2 * max(
            int((self.max_position_embeddings // self.chunk_length) ** (0.5)),
            self.chunk_length,
        )
        if num_buckets > num_buckets_limit:
            num_buckets = [2 ** (num_buckets_pow_2 // 2), 2 ** (num_buckets_pow_2 - num_buckets_pow_2 // 2)]

        logger.warning(f"config.num_buckets is not set. Setting config.num_buckets to {num_buckets}...")

        # set num buckets in config to be properly saved
        self.config.num_buckets = num_buckets
        self.num_buckets = num_buckets

    def _attend(
        self,
        query_vectors,
        key_vectors,
        value_vectors,
        sorted_bucket_idx_per_hash,
        attention_mask,
        head_mask,
        do_standard_self_attention,
        do_cached_attention,
    ):
        # look at previous and following chunks if chunked attention
        if not do_standard_self_attention:
            key_vectors = self._look_adjacent(key_vectors, self.num_chunks_before, self.num_chunks_after)
            value_vectors = self._look_adjacent(value_vectors, self.num_chunks_before, self.num_chunks_after)

        # get logits and dots
        # (BS, NumAttn, NumHash x NumChunk, Chunk_L x Hidden),(BS, NumAttn, NumHash x NumChunk, Chunk_L * (Num_bef + Num_aft + 1) x Hidden) -> (BS, NumAttn, NumHash x NumChunk, Chunk_L, Chunk_L * (1 + Num_bef + Num_aft))
        query_key_dots = ops.matmul(query_vectors, key_vectors.swapaxes(-1, -2))

        # free memory
        del query_vectors, key_vectors

        # if chunked attention split bucket idxs to query and key
        if not do_standard_self_attention:
            query_bucket_idx = self._split_seq_length_dim_to(
                sorted_bucket_idx_per_hash, -1, self.chunk_length, self.num_attention_heads
            )
            key_value_bucket_idx = self._look_adjacent(query_bucket_idx, self.num_chunks_before, self.num_chunks_after)
        elif do_cached_attention and query_key_dots.ndim > 4:
            key_value_bucket_idx = sorted_bucket_idx_per_hash
            query_bucket_idx = (
                ops.ones(key_value_bucket_idx.shape[:-1] + (1,), dtype=key_value_bucket_idx.dtype) * key_value_bucket_idx.max()
            )
        elif do_cached_attention and query_key_dots.ndim <= 4:
            query_bucket_idx = (query_key_dots.shape[-1] - 1) * ops.ones_like(query_key_dots)[:, :, :, -1]
            key_value_bucket_idx = ops.arange(
                query_key_dots.shape[-1], dtype=mindspore.int64
            )[None, None, :].broadcast_to(query_bucket_idx.shape[:2] + (-1,))
        else:
            query_bucket_idx = key_value_bucket_idx = sorted_bucket_idx_per_hash

        # get correct mask values depending on precision
        if query_key_dots.dtype == mindspore.float16:
            self_mask_value = self.self_mask_value_float16.half()
            mask_value = self.mask_value_float16.half()
        else:
            self_mask_value = self.self_mask_value_float32
            mask_value = self.mask_value_float32

        if not do_cached_attention:
            mask = self._compute_attn_mask(
                query_bucket_idx,
                key_value_bucket_idx,
                attention_mask,
                query_key_dots.shape,
                do_standard_self_attention,
            )

            if mask is not None:
                query_key_dots = ops.where(mask, query_key_dots, mask_value)

            # free memory
            del mask

        # Self mask is ALWAYS applied.
        # From the reformer paper (https://arxiv.org/pdf/2001.04451.pdf):
        # " While attention to the future is not allowed, typical implementations of the
        # Transformer do allow a position to attend to itself.
        # Such behavior is undesirable in a shared-QK formulation because the dot-product
        # of a query vector with itself will almost always be greater than the dot product of a
        # query vector with a vector at another position. We therefore modify the masking
        # to forbid a token from attending to itself, except in situations
        # where a token has no other valid attention targets (e.g. the first token in a sequence) "

        self_mask = ops.ne(query_bucket_idx.unsqueeze(-1), key_value_bucket_idx.unsqueeze(-2))

        # apply self_mask
        query_key_dots = ops.where(self_mask, query_key_dots, self_mask_value)

        # free memory
        del self_mask

        logits = ops.logsumexp(query_key_dots, dim=-1, keepdim=True)
        # dots shape is `[batch_size, num_attn_heads, num_hashes * seq_len // chunk_length, chunk_length, chunk_length * (1 + num_chunks_before + num_chunks_after)]`
        attention_probs = ops.exp(query_key_dots - logits)

        # free memory
        del query_key_dots

        # dropout
        attention_probs = nn.functional.dropout(attention_probs, p=self.dropout, training=self.training)

        # Mask heads if we want to
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        # attend values
        out_vectors = ops.matmul(attention_probs, value_vectors)

        # free memory
        del value_vectors

        # merge chunk length
        if out_vectors.ndim > 4:
            logits = ops.flatten(logits, start_dim=2, end_dim=3).squeeze(-1)
            out_vectors = ops.flatten(out_vectors, start_dim=2, end_dim=3)

        return out_vectors, logits, attention_probs

    def _compute_attn_mask(
        self, query_indices, key_indices, attention_mask, query_key_dot_shape, do_standard_self_attention
    ):
        # attention mask for LSH
        if attention_mask is not None:
            # if chunked attention, the attention mask has to correspond to LSH order
            attention_mask = attention_mask.to(mindspore.bool_)[:, None, :]
            if not do_standard_self_attention:
                # expand attn_mask to fit with key_value_bucket_idx shape
                attention_mask = attention_mask[:, None, :]
                attention_mask = attention_mask.broadcast_to(query_indices.shape[:-1] + (-1,))
                # extract attention mask from LSH sorted key_indices
                attention_mask = ops.gather(attention_mask, -1, key_indices)

            attention_mask = attention_mask.unsqueeze(-2).broadcast_to(query_key_dot_shape)

        # Causal mask
        if self.is_decoder is True:
            causal_mask = ops.ge(query_indices.unsqueeze(-1), key_indices.unsqueeze(-2))

            # add attention mask if not None
            if attention_mask is not None:
                attention_mask = causal_mask * attention_mask
            else:
                attention_mask = causal_mask

        return attention_mask

    def _get_relevant_hid_states_and_buckets(
        self, query_vectors, attention_mask, num_hashes, hidden_states, past_states, past_buckets
    ):
        # concat hidden states
        hidden_states = ops.cat([past_states, hidden_states], dim=1)

        # batch_size hidden
        batch_size = hidden_states.shape[0]
        sequence_length = hidden_states.shape[1]

        # check if cached buckets include pad bucket
        max_bucket = self.num_buckets if isinstance(self.num_buckets, int) else reduce(mul, self.num_buckets)

        # if pad bucket was cached => need to increase num buckets for caching
        increase_num_buckets = past_buckets.max() > num_hashes * max_bucket - 1

        # retrieve query buckets
        query_buckets = self._hash_vectors(
            query_vectors, num_hashes, attention_mask, increase_num_buckets=increase_num_buckets
        )

        # concat buckets
        concat_buckets = ops.cat([past_buckets, query_buckets.unsqueeze(-1)], dim=-1)

        # hash-based sort
        bucket_idx = _stable_argsort(concat_buckets, dim=-1)

        # bucket_idx has shape: BatchSize x NumAttnHeads x NumHashes x SequenceLength
        assert bucket_idx.shape == (
            batch_size,
            self.num_attention_heads,
            num_hashes,
            sequence_length,
        ), (
            f"bucket_idx should have shape {(batch_size, self.num_attention_heads, num_hashes, sequence_length)}, but"
            f" has shape {bucket_idx.shape}."
        )

        # find indices of new bucket indices
        relevant_bucket_idx = (bucket_idx == (bucket_idx.shape[-1] - 1)).nonzero()

        # expand relevant bucket indices to its chunks
        relevant_bucket_idx_chunk = self._expand_to_indices_in_relevant_chunk(relevant_bucket_idx, sequence_length)
        relevant_bucket_idx_chunk = bucket_idx[tuple(relevant_bucket_idx_chunk.swapaxes(0, 1))]

        # adapt bucket_idx for batch and hidden states for index select
        offset = ops.arange(relevant_bucket_idx_chunk.shape[-1], dtype=mindspore.int64)
        bucket_idx_batch_offset = sequence_length * (
            batch_size * ops.div(offset, relevant_bucket_idx_chunk.shape[-1], rounding_mode="floor")
        )

        # add batch offset
        relevant_bucket_idx_chunk_all_batch = relevant_bucket_idx_chunk + bucket_idx_batch_offset
        hidden_states = hidden_states.reshape((-1, self.hidden_size))

        # select all relevant hidden states
        relevant_hidden_states = hidden_states.index_select(0, relevant_bucket_idx_chunk_all_batch)

        # reshape hidden states and bucket_idx to correct output
        relevant_hidden_states = relevant_hidden_states.reshape(
            batch_size, self.num_attention_heads, -1, self.hidden_size
        )
        relevant_bucket_idx_chunk = relevant_bucket_idx_chunk.reshape(
            batch_size, self.num_attention_heads, num_hashes, -1
        )

        assert (
            relevant_hidden_states.shape[2]
            == (self.num_chunks_before + self.num_chunks_after + 1) * self.chunk_length * num_hashes
        ), (
            "There should be"
            f" {(self.num_chunks_before + self.num_chunks_after + 1) * self.chunk_length * num_hashes} `hidden_states`,"
            f" there are {relevant_hidden_states.shape[2]} `hidden_states`."
        )

        assert (
            relevant_bucket_idx_chunk.shape[-1]
            == (self.num_chunks_before + self.num_chunks_after + 1) * self.chunk_length
        ), (
            "There should be"
            f" {(self.num_chunks_before + self.num_chunks_after + 1) * self.chunk_length} `hidden_states`, there are"
            f" {relevant_bucket_idx_chunk.shape[-1]} `bucket_idx`."
        )

        return relevant_hidden_states, relevant_bucket_idx_chunk, query_buckets

    def _expand_to_indices_in_relevant_chunk(self, indices, sequence_length):
        # get relevant indices of where chunk starts and its size
        start_indices_chunk = ((indices[:, -1] // self.chunk_length) - self.num_chunks_before) * self.chunk_length
        total_chunk_size = self.chunk_length * (1 + self.num_chunks_before + self.num_chunks_after)

        # expand start indices and add correct chunk offset via arange
        expanded_start_indices = start_indices_chunk.unsqueeze(-1).broadcast_to((indices.shape[0], total_chunk_size))
        chunk_sequence_indices = expanded_start_indices + ops.arange(
            total_chunk_size, dtype=mindspore.int64
        ).unsqueeze(0).broadcast_to((indices.shape[0], total_chunk_size))

        # make sure that circular logic holds via % seq len
        chunk_sequence_indices = chunk_sequence_indices.flatten() % sequence_length

        # expand indices and set indices correctly
        indices = ops.flatten(indices.unsqueeze(1).broadcast_to((indices.shape[0], total_chunk_size, -1)), 0, 1).copy()
        indices[:, -1] = chunk_sequence_indices

        return indices

    def _len_and_dim_norm(self, vectors, sqrt_num):
        """
        length and attention head size dim normalization
        """
        vectors = self._len_norm(vectors)
        vectors = vectors / sqrt_num
        return vectors

    def _len_norm(self, x, epsilon=1e-6):
        """
        length normalization
        """
        variance = ops.mean(x**2, -1, keepdim=True)
        norm_x = x * ops.rsqrt(variance + epsilon)
        return norm_x

    def _gather_by_expansion(self, vectors, idxs, num_hashes):
        """
        expand dims of idxs and vectors for all hashes and gather
        """
        expanded_idxs = idxs.unsqueeze(-1).broadcast_to((-1, -1, -1, self.attention_head_size))
        vectors = vectors.tile((1, 1, num_hashes, 1))
        return ops.gather(vectors, 2, expanded_idxs)
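
The hashing logic in `_hash_vectors` above — random rotations, an argmax over each projection concatenated with its negation, and per-round offsets so that bucket ids from different hash rounds never collide — can be reproduced in isolation. Below is a simplified, single-head NumPy sketch for illustration only; the names, shapes and seeding are assumptions and it is not part of the mindnlp API.

```python
import numpy as np

def lsh_buckets(vectors, num_buckets, num_hashes, seed=0):
    """Angular LSH bucket assignment for one attention head (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # one random rotation per hash round: (head_dim, num_hashes, num_buckets // 2)
    rotations = rng.standard_normal((vectors.shape[-1], num_hashes, num_buckets // 2))
    # project the vectors: (seq_len, num_hashes, num_buckets // 2)
    rotated = np.einsum("td,dhr->thr", vectors, rotations)
    # concatenate projections with their negations; the argmax is the bucket id
    rotated = np.concatenate([rotated, -rotated], axis=-1)
    buckets = np.argmax(rotated, axis=-1)                 # (seq_len, num_hashes)
    # offset each hash round so bucket ids from different rounds never overlap
    buckets = buckets + np.arange(num_hashes) * num_buckets
    return buckets.T.reshape(-1)                          # (num_hashes * seq_len,)

toy_vectors = np.random.default_rng(1).standard_normal((8, 4))
print(lsh_buckets(toy_vectors, num_buckets=4, num_hashes=2))
```

Tokens that land in the same bucket end up next to each other after the stable sort (see `_get_sorted_bucket_idx_and_undo_sorted_bucket_idx`), and attention is then restricted to chunks of neighbouring sorted positions.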

mindnlp.transformers.models.reformer.modeling_reformer.PositionEmbeddings

Bases: Module

Constructs conventional position embeddings of shape [max_pos_embeddings, hidden_size].

Source code in mindnlp\transformers\models\reformer\modeling_reformer.py
class PositionEmbeddings(nn.Module):
    """Constructs conventional position embeddings of shape `[max_pos_embeddings, hidden_size]`."""

    def __init__(self, config):
        super().__init__()
        self.dropout = config.hidden_dropout_prob
        self.embedding = nn.Embedding(config.max_position_embeddings, config.hidden_size)

    def forward(self, position_ids):
        position_embeddings = self.embedding(position_ids)
        position_embeddings = nn.functional.dropout(position_embeddings, p=self.dropout, training=self.training)
        return position_embeddings
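
A minimal smoke test, assuming a tiny stand-in config; the `_Cfg` class below is purely illustrative and not a real `ReformerConfig`:

```python
import mindspore
from mindnlp.transformers.models.reformer.modeling_reformer import PositionEmbeddings

class _Cfg:  # illustrative stand-in for ReformerConfig
    max_position_embeddings = 16
    hidden_size = 8
    hidden_dropout_prob = 0.1

emb = PositionEmbeddings(_Cfg())
position_ids = mindspore.Tensor([[0, 1, 2, 3]], mindspore.int64)
print(emb(position_ids).shape)  # expected (1, 4, 8): one embedding per position id
```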

mindnlp.transformers.models.reformer.modeling_reformer.ReformerClassificationHead

Bases: Module

Head for sentence-level classification tasks.

Source code in mindnlp\transformers\models\reformer\modeling_reformer.py
class ReformerClassificationHead(nn.Module):
    """Head for sentence-level classification tasks."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(2 * config.hidden_size, config.hidden_size)
        classifier_dropout = (
            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, hidden_states, **kwargs):
        hidden_states = hidden_states[:, 0, :]  # take <s> token (equiv. to [CLS])
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.dense(hidden_states)
        hidden_states = ops.tanh(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.out_proj(hidden_states)
        return hidden_states
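
The head expects `2 * hidden_size` features per token (the concatenated output of the reversible residual layers) and classifies from the first token. A minimal sketch with an illustrative stand-in config, not a real checkpoint:

```python
import numpy as np
import mindspore
from mindnlp.transformers.models.reformer.modeling_reformer import ReformerClassificationHead

class _Cfg:  # illustrative stand-in for ReformerConfig
    hidden_size = 8
    num_labels = 3
    classifier_dropout = None  # falls back to hidden_dropout_prob
    hidden_dropout_prob = 0.1

head = ReformerClassificationHead(_Cfg())
# (batch, seq, 2 * hidden_size) as produced by ReformerModel
hidden_states = mindspore.Tensor(np.random.randn(2, 5, 2 * _Cfg.hidden_size).astype(np.float32))
print(head(hidden_states).shape)  # expected (2, 3): one logit per label
```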

mindnlp.transformers.models.reformer.modeling_reformer.ReformerEmbeddings

Bases: Module

Construct the embeddings from word, position and token_type embeddings.

Source code in mindnlp\transformers\models\reformer\modeling_reformer.py
class ReformerEmbeddings(nn.Module):
    """Construct the embeddings from word, position and token_type embeddings."""

    def __init__(self, config):
        super().__init__()
        self.max_position_embeddings = config.max_position_embeddings
        self.dropout = config.hidden_dropout_prob

        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
        self.position_embeddings = (
            AxialPositionEmbeddings(config) if config.axial_pos_embds else PositionEmbeddings(config)
        )

    def forward(self, input_ids=None, position_ids=None, inputs_embeds=None, start_idx_pos_encodings=0):
        if input_ids is not None:
            input_shape = input_ids.shape
        else:
            input_shape = inputs_embeds.shape[:-1]

        seq_length = input_shape[1]
        if position_ids is None:
            position_ids = ops.arange(
                start_idx_pos_encodings, start_idx_pos_encodings + seq_length, dtype=mindspore.int64
            )
            position_ids = position_ids.unsqueeze(0).broadcast_to(input_shape)

        if inputs_embeds is None:
            inputs_embeds = self.word_embeddings(input_ids)

        if position_ids.shape[-1] > self.max_position_embeddings:
            raise ValueError(
                f"Sequence Length: {position_ids.shape[-1]} has to be less or equal than "
                f"config.max_position_embeddings {self.max_position_embeddings}."
            )

        # dropout
        embeddings = nn.functional.dropout(inputs_embeds, p=self.dropout, training=self.training)

        # add positional embeddings
        position_embeddings = self.position_embeddings(position_ids)
        embeddings = embeddings + position_embeddings
        return embeddings
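
A minimal sketch with an illustrative stand-in config; axial position embeddings are disabled here so the module falls back to the conventional `PositionEmbeddings`:

```python
import mindspore
from mindnlp.transformers.models.reformer.modeling_reformer import ReformerEmbeddings

class _Cfg:  # illustrative stand-in for ReformerConfig
    vocab_size = 100
    hidden_size = 8
    max_position_embeddings = 16
    hidden_dropout_prob = 0.1
    axial_pos_embds = False  # use conventional PositionEmbeddings

emb = ReformerEmbeddings(_Cfg())
input_ids = mindspore.Tensor([[1, 2, 3, 4]], mindspore.int64)
print(emb(input_ids).shape)  # expected (1, 4, 8): word + position embeddings
```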

mindnlp.transformers.models.reformer.modeling_reformer.ReformerForMaskedLM

Bases: ReformerPreTrainedModel

Source code in mindnlp\transformers\models\reformer\modeling_reformer.py
class ReformerForMaskedLM(ReformerPreTrainedModel):
    _tied_weights_keys = ["lm_head.decoder.weight", "lm_head.decoder.bias"]

    def __init__(self, config):
        super().__init__(config)
        assert not config.is_decoder, (
            "If you want to use `ReformerForMaskedLM` make sure `config.is_decoder=False` for bi-directional"
            " self-attention."
        )
        self.reformer = ReformerModel(config)
        self.lm_head = ReformerOnlyLMHead(config)

        # Initialize weights and apply final processing
        self.post_init()

    def get_output_embeddings(self):
        return self.lm_head.decoder

    def set_output_embeddings(self, new_embeddings):
        self.lm_head.decoder = new_embeddings
        self.lm_head.bias = new_embeddings.bias

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        num_hashes: Optional[int] = None,
        labels: Optional[mindspore.Tensor] = None,
        output_hidden_states: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, MaskedLMOutput]:
        r"""
        labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
                config.vocab_size]` (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked);
                the loss is only computed for the tokens with labels.

        Returns:

        <Tip warning={true}>

        This example uses a fake checkpoint since we don't have any available pretrained model for the masked language
        modeling task with the Reformer architecture.

        </Tip>

        Example:

        ```python
        >>> import mindspore
        >>> from mindnlp.transformers import AutoTokenizer, ReformerForMaskedLM

        >>> tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-reformer")
        >>> model = ReformerForMaskedLM.from_pretrained("hf-internal-testing/tiny-random-reformer")

        >>> # add mask_token
        >>> tokenizer.add_special_tokens({"mask_token": "[MASK]"})  # doctest: +IGNORE_RESULT
        >>> inputs = tokenizer("The capital of France is [MASK].", return_tensors="ms")

        >>> # resize model's embedding matrix
        >>> model.resize_token_embeddings(new_num_tokens=model.config.vocab_size + 1)  # doctest: +IGNORE_RESULT

        >>> with no_grad():
        ...     logits = model(**inputs).logits

        >>> # retrieve index of [MASK]
        >>> mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]

        >>> predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
        >>> predicted_token = tokenizer.decode(predicted_token_id)
        ```

        ```python
        >>> labels = tokenizer("The capital of France is Paris.", return_tensors="ms")["input_ids"]
        >>> # mask labels of non-[MASK] tokens
        >>> labels = ops.where(
        ...     inputs.input_ids == tokenizer.mask_token_id, labels[:, : inputs["input_ids"].shape[-1]], -100
        ... )

        >>> outputs = model(**inputs, labels=labels)
        >>> loss = round(outputs.loss.item(), 2)
        ```
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        reformer_outputs = self.reformer(
            input_ids,
            position_ids=position_ids,
            attention_mask=attention_mask,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            num_hashes=num_hashes,
            use_cache=False,  # no causal mask
            output_hidden_states=output_hidden_states,
            output_attentions=output_attentions,
            return_dict=return_dict,
        )

        sequence_output = reformer_outputs[0]
        logits = self.lm_head(sequence_output)

        masked_lm_loss = None
        if labels is not None:
            loss_fct = CrossEntropyLoss()  # -100 index = padding token
            masked_lm_loss = loss_fct(logits.view(-1, self.config.vocab_size), labels.view(-1))

        if not return_dict:
            output = (logits,) + reformer_outputs[1:]
            return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output

        return MaskedLMOutput(
            loss=masked_lm_loss,
            logits=logits,
            hidden_states=reformer_outputs.hidden_states,
            attentions=reformer_outputs.attentions,
        )

mindnlp.transformers.models.reformer.modeling_reformer.ReformerForMaskedLM.forward(input_ids=None, position_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, num_hashes=None, labels=None, output_hidden_states=None, output_attentions=None, return_dict=None)

labels (mindspore.Tensor of shape (batch_size, sequence_length), optional): Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels.

Returns:

This example uses a fake checkpoint since we don't have any available pretrained model for the masked language modeling task with the Reformer architecture.

Example:

>>> import mindspore
>>> from mindnlp.transformers import AutoTokenizer, ReformerForMaskedLM

>>> tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-reformer")
>>> model = ReformerForMaskedLM.from_pretrained("hf-internal-testing/tiny-random-reformer")

>>> # add mask_token
>>> tokenizer.add_special_tokens({"mask_token": "[MASK]"})  # doctest: +IGNORE_RESULT
>>> inputs = tokenizer("The capital of France is [MASK].", return_tensors="ms")

>>> # resize model's embedding matrix
>>> model.resize_token_embeddings(new_num_tokens=model.config.vocab_size + 1)  # doctest: +IGNORE_RESULT

>>> with no_grad():
...     logits = model(**inputs).logits

>>> # retrieve index of [MASK]
>>> mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]

>>> predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
>>> predicted_token = tokenizer.decode(predicted_token_id)
>>> labels = tokenizer("The capital of France is Paris.", return_tensors="ms")["input_ids"]
>>> # mask labels of non-[MASK] tokens
>>> labels = ops.where(
...     inputs.input_ids == tokenizer.mask_token_id, labels[:, : inputs["input_ids"].shape[-1]], -100
... )

>>> outputs = model(**inputs, labels=labels)
>>> loss = round(outputs.loss.item(), 2)
Source code in mindnlp\transformers\models\reformer\modeling_reformer.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    num_hashes: Optional[int] = None,
    labels: Optional[mindspore.Tensor] = None,
    output_hidden_states: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, MaskedLMOutput]:
    r"""
    labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
            config.vocab_size]` (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked);
            the loss is only computed for the tokens with labels.

    Returns:

    <Tip warning={true}>

    This example uses a fake checkpoint since we don't have any available pretrained model for the masked language
    modeling task with the Reformer architecture.

    </Tip>

    Example:

    ```python
    >>> import mindspore
    >>> from mindnlp.transformers import AutoTokenizer, ReformerForMaskedLM

    >>> tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-reformer")
    >>> model = ReformerForMaskedLM.from_pretrained("hf-internal-testing/tiny-random-reformer")

    >>> # add mask_token
    >>> tokenizer.add_special_tokens({"mask_token": "[MASK]"})  # doctest: +IGNORE_RESULT
    >>> inputs = tokenizer("The capital of France is [MASK].", return_tensors="ms")

    >>> # resize model's embedding matrix
    >>> model.resize_token_embeddings(new_num_tokens=model.config.vocab_size + 1)  # doctest: +IGNORE_RESULT

    >>> with no_grad():
    ...     logits = model(**inputs).logits

    >>> # retrieve index of [MASK]
    >>> mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]

    >>> predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
    >>> predicted_token = tokenizer.decode(predicted_token_id)
    ```

    ```python
    >>> labels = tokenizer("The capital of France is Paris.", return_tensors="ms")["input_ids"]
    >>> # mask labels of non-[MASK] tokens
    >>> labels = ops.where(
    ...     inputs.input_ids == tokenizer.mask_token_id, labels[:, : inputs["input_ids"].shape[-1]], -100
    ... )

    >>> outputs = model(**inputs, labels=labels)
    >>> loss = round(outputs.loss.item(), 2)
    ```
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    reformer_outputs = self.reformer(
        input_ids,
        position_ids=position_ids,
        attention_mask=attention_mask,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        num_hashes=num_hashes,
        use_cache=False,  # no causal mask
        output_hidden_states=output_hidden_states,
        output_attentions=output_attentions,
        return_dict=return_dict,
    )

    sequence_output = reformer_outputs[0]
    logits = self.lm_head(sequence_output)

    masked_lm_loss = None
    if labels is not None:
        loss_fct = CrossEntropyLoss()  # -100 index = padding token
        masked_lm_loss = loss_fct(logits.view(-1, self.config.vocab_size), labels.view(-1))

    if not return_dict:
        output = (logits,) + reformer_outputs[1:]
        return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output

    return MaskedLMOutput(
        loss=masked_lm_loss,
        logits=logits,
        hidden_states=reformer_outputs.hidden_states,
        attentions=reformer_outputs.attentions,
    )

mindnlp.transformers.models.reformer.modeling_reformer.ReformerForQuestionAnswering

Bases: ReformerPreTrainedModel

Source code in mindnlp\transformers\models\reformer\modeling_reformer.py
class ReformerForQuestionAnswering(ReformerPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.reformer = ReformerModel(config)
        # 2 * config.hidden_size because we use reversible residual layers
        self.qa_outputs = nn.Linear(2 * config.hidden_size, config.num_labels)

        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        num_hashes: Optional[int] = None,
        start_positions: Optional[mindspore.Tensor] = None,
        end_positions: Optional[mindspore.Tensor] = None,
        output_hidden_states: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, QuestionAnsweringModelOutput]:
        r"""
        start_positions (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
            Labels for position (index) of the start of the labelled span for computing the token classification loss.
            Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
            are not taken into account for computing the loss.
        end_positions (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
            Labels for position (index) of the end of the labelled span for computing the token classification loss.
            Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
            are not taken into account for computing the loss.
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        reformer_outputs = self.reformer(
            input_ids,
            position_ids=position_ids,
            attention_mask=attention_mask,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            num_hashes=num_hashes,
            use_cache=False,  # no causal mask
            output_hidden_states=output_hidden_states,
            output_attentions=output_attentions,
            return_dict=return_dict,
        )

        sequence_output = reformer_outputs[0]

        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = ops.split(logits, 1, dim=-1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)

        total_loss = None
        if start_positions is not None and end_positions is not None:
            # If we are on multi-GPU, split add a dimension
            if len(start_positions.shape) > 1:
                start_positions = start_positions.squeeze(-1)
            if len(end_positions.shape) > 1:
                end_positions = end_positions.squeeze(-1)
            # sometimes the start/end positions are outside our model inputs, we ignore these terms
            ignored_index = start_logits.shape[1]
            start_positions = start_positions.clamp(0, ignored_index)
            end_positions = end_positions.clamp(0, ignored_index)

            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2

        if not return_dict:
            output = (start_logits, end_logits) + reformer_outputs[1:]
            return ((total_loss,) + output) if total_loss is not None else output

        return QuestionAnsweringModelOutput(
            loss=total_loss,
            start_logits=start_logits,
            end_logits=end_logits,
            hidden_states=reformer_outputs.hidden_states,
            attentions=reformer_outputs.attentions,
        )

mindnlp.transformers.models.reformer.modeling_reformer.ReformerForQuestionAnswering.forward(input_ids=None, position_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, num_hashes=None, start_positions=None, end_positions=None, output_hidden_states=None, output_attentions=None, return_dict=None)

start_positions (mindspore.Tensor of shape (batch_size,), optional): Labels for position (index) of the start of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Positions outside of the sequence are not taken into account for computing the loss.

end_positions (mindspore.Tensor of shape (batch_size,), optional): Labels for position (index) of the end of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Positions outside of the sequence are not taken into account for computing the loss.
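
The source docstring does not include a usage example for this head, so the sketch below mirrors the masked-LM example above. The tiny hf-internal-testing/tiny-random-reformer checkpoint is reused purely as a stand-in (it is not a trained question-answering model), so the decoded span is meaningless.

```python
>>> from mindnlp.transformers import AutoTokenizer, ReformerForQuestionAnswering

>>> tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-reformer")
>>> model = ReformerForQuestionAnswering.from_pretrained("hf-internal-testing/tiny-random-reformer")

>>> inputs = tokenizer("Who wrote Crime and Punishment? Dostoevsky did.", return_tensors="ms")
>>> outputs = model(**inputs)

>>> # pick the most likely start/end positions and decode the span between them
>>> start_index = outputs.start_logits.argmax(axis=-1).item()
>>> end_index = outputs.end_logits.argmax(axis=-1).item()
>>> answer = tokenizer.decode(inputs.input_ids[0, start_index : end_index + 1])
```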

Source code in mindnlp\transformers\models\reformer\modeling_reformer.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    num_hashes: Optional[int] = None,
    start_positions: Optional[mindspore.Tensor] = None,
    end_positions: Optional[mindspore.Tensor] = None,
    output_hidden_states: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, QuestionAnsweringModelOutput]:
    r"""
    start_positions (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
        Labels for position (index) of the start of the labelled span for computing the token classification loss.
        Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
        are not taken into account for computing the loss.
    end_positions (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
        Labels for position (index) of the end of the labelled span for computing the token classification loss.
        Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
        are not taken into account for computing the loss.
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    reformer_outputs = self.reformer(
        input_ids,
        position_ids=position_ids,
        attention_mask=attention_mask,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        num_hashes=num_hashes,
        use_cache=False,  # no causal mask
        output_hidden_states=output_hidden_states,
        output_attentions=output_attentions,
        return_dict=return_dict,
    )

    sequence_output = reformer_outputs[0]

    logits = self.qa_outputs(sequence_output)
    start_logits, end_logits = ops.split(logits, 1, dim=-1)
    start_logits = start_logits.squeeze(-1)
    end_logits = end_logits.squeeze(-1)

    total_loss = None
    if start_positions is not None and end_positions is not None:
        # If we are on multi-GPU, split add a dimension
        if len(start_positions.shape) > 1:
            start_positions = start_positions.squeeze(-1)
        if len(end_positions.shape) > 1:
            end_positions = end_positions.squeeze(-1)
        # sometimes the start/end positions are outside our model inputs, we ignore these terms
        ignored_index = start_logits.shape[1]
        start_positions = start_positions.clamp(0, ignored_index)
        end_positions = end_positions.clamp(0, ignored_index)

        loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
        start_loss = loss_fct(start_logits, start_positions)
        end_loss = loss_fct(end_logits, end_positions)
        total_loss = (start_loss + end_loss) / 2

    if not return_dict:
        output = (start_logits, end_logits) + reformer_outputs[1:]
        return ((total_loss,) + output) if total_loss is not None else output

    return QuestionAnsweringModelOutput(
        loss=total_loss,
        start_logits=start_logits,
        end_logits=end_logits,
        hidden_states=reformer_outputs.hidden_states,
        attentions=reformer_outputs.attentions,
    )

mindnlp.transformers.models.reformer.modeling_reformer.ReformerForSequenceClassification

Bases: ReformerPreTrainedModel

Source code in mindnlp\transformers\models\reformer\modeling_reformer.py
class ReformerForSequenceClassification(ReformerPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.config = config

        self.reformer = ReformerModel(config)
        self.classifier = ReformerClassificationHead(config)
        if config.is_decoder is True:
            logger.warning("You might want to disable causal masking for sequence classification")

        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        num_hashes: Optional[int] = None,
        labels: Optional[mindspore.Tensor] = None,
        output_hidden_states: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, SequenceClassifierOutput]:
        r"""
        labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss); if
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).

        Returns:

        Example of single-label classification:

        ```python
        >>> import mindspore
        >>> from mindnlp.transformers import AutoTokenizer, ReformerForSequenceClassification

        >>> tokenizer = AutoTokenizer.from_pretrained("google/reformer-crime-and-punishment")
        >>> model = ReformerForSequenceClassification.from_pretrained("google/reformer-crime-and-punishment")

        >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="ms")

        >>> with no_grad():
        ...     logits = model(**inputs).logits

        >>> predicted_class_id = logits.argmax().item()
        >>> label = model.config.id2label[predicted_class_id]
        ```

        ```python
        >>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
        >>> num_labels = len(model.config.id2label)
        >>> model = ReformerForSequenceClassification.from_pretrained(
        ...     "google/reformer-crime-and-punishment", num_labels=num_labels
        ... )

        >>> labels = mindspore.tensor(1)
        >>> loss = model(**inputs, labels=labels).loss
        ```
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.reformer(
            input_ids,
            position_ids=position_ids,
            attention_mask=attention_mask,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            num_hashes=num_hashes,
            output_hidden_states=output_hidden_states,
            output_attentions=output_attentions,
            return_dict=return_dict,
        )

        sequence_output = outputs[0]
        logits = self.classifier(sequence_output)

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(logits, labels)

        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

mindnlp.transformers.models.reformer.modeling_reformer.ReformerForSequenceClassification.forward(input_ids=None, position_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, num_hashes=None, labels=None, output_hidden_states=None, output_attentions=None, return_dict=None)

labels (mindspore.Tensor of shape (batch_size,), optional): Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss); if config.num_labels > 1 a classification loss is computed (Cross-Entropy).

Returns:

Example of single-label classification:

>>> import mindspore
>>> from mindnlp.transformers import AutoTokenizer, ReformerForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained("google/reformer-crime-and-punishment")
>>> model = ReformerForSequenceClassification.from_pretrained("google/reformer-crime-and-punishment")

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="ms")

>>> with no_grad():
...     logits = model(**inputs).logits

>>> predicted_class_id = logits.argmax().item()
>>> label = model.config.id2label[predicted_class_id]
>>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
>>> num_labels = len(model.config.id2label)
>>> model = ReformerForSequenceClassification.from_pretrained(
...     "google/reformer-crime-and-punishment", num_labels=num_labels
... )

>>> labels = mindspore.tensor(1)
>>> loss = model(**inputs, labels=labels).loss
Source code in mindnlp\transformers\models\reformer\modeling_reformer.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    num_hashes: Optional[int] = None,
    labels: Optional[mindspore.Tensor] = None,
    output_hidden_states: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, SequenceClassifierOutput]:
    r"""
    labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
        Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
        config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss); if
        `config.num_labels > 1` a classification loss is computed (Cross-Entropy).

    Returns:

    Example of single-label classification:

    ```python
    >>> import mindspore
    >>> from mindnlp.transformers import AutoTokenizer, ReformerForSequenceClassification

    >>> tokenizer = AutoTokenizer.from_pretrained("google/reformer-crime-and-punishment")
    >>> model = ReformerForSequenceClassification.from_pretrained("google/reformer-crime-and-punishment")

    >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="ms")

    >>> with no_grad():
    ...     logits = model(**inputs).logits

    >>> predicted_class_id = logits.argmax().item()
    >>> label = model.config.id2label[predicted_class_id]
    ```

    ```python
    >>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
    >>> num_labels = len(model.config.id2label)
    >>> model = ReformerForSequenceClassification.from_pretrained(
    ...     "google/reformer-crime-and-punishment", num_labels=num_labels
    ... )

    >>> labels = mindspore.tensor(1)
    >>> loss = model(**inputs, labels=labels).loss
    ```
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    outputs = self.reformer(
        input_ids,
        position_ids=position_ids,
        attention_mask=attention_mask,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        num_hashes=num_hashes,
        output_hidden_states=output_hidden_states,
        output_attentions=output_attentions,
        return_dict=return_dict,
    )

    sequence_output = outputs[0]
    logits = self.classifier(sequence_output)

    loss = None
    if labels is not None:
        if self.config.problem_type is None:
            if self.num_labels == 1:
                self.config.problem_type = "regression"
            elif self.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                self.config.problem_type = "single_label_classification"
            else:
                self.config.problem_type = "multi_label_classification"

        if self.config.problem_type == "regression":
            loss_fct = MSELoss()
            if self.num_labels == 1:
                loss = loss_fct(logits.squeeze(), labels.squeeze())
            else:
                loss = loss_fct(logits, labels)
        elif self.config.problem_type == "single_label_classification":
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        elif self.config.problem_type == "multi_label_classification":
            loss_fct = BCEWithLogitsLoss()
            loss = loss_fct(logits, labels)

    if not return_dict:
        output = (logits,) + outputs[2:]
        return ((loss,) + output) if loss is not None else output

    return SequenceClassifierOutput(
        loss=loss,
        logits=logits,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )

mindnlp.transformers.models.reformer.modeling_reformer.ReformerLayer

Bases: Module
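
Couples a `ReformerAttention` block and a `ChunkReformerFeedForward` block through the reversible residual update of RevNet, so intermediate activations can be reconstructed instead of stored during backpropagation. A schematic pure-Python sketch of the update and its inversion (illustrative only, not the mindnlp implementation):

```python
def reversible_forward(x1, x2, attn, feed_forward):
    # Y_1 = X_1 + Attention(X_2); Y_2 = X_2 + FeedForward(Y_1)
    y1 = x1 + attn(x2)
    y2 = x2 + feed_forward(y1)
    return y1, y2

def reversible_invert(y1, y2, attn, feed_forward):
    # recover the layer inputs from its outputs, as done in backward_pass below
    x2 = y2 - feed_forward(y1)
    x1 = y1 - attn(x2)
    return x1, x2
```

Because the attention and feed-forward functions are re-run during the backward pass, the layer re-seeds dropout with `attention_seed` and `feed_forward_seed` so that both passes see identical dropout masks.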

Source code in mindnlp\transformers\models\reformer\modeling_reformer.py
class ReformerLayer(nn.Module):
    def __init__(self, config, layer_id=0):
        super().__init__()
        self.attention = ReformerAttention(config, layer_id)
        # dropout requires to have the same
        # seed for forward and backward pass
        self.attention_seed = None
        self.feed_forward_seed = None

        self.feed_forward = ChunkReformerFeedForward(config)

    def _init_attention_seed(self):
        """
        This function sets a new seed for the attention layer to make dropout deterministic for both forward calls: 1
        normal forward call and 1 forward call in backward to recalculate activations.
        """

        # randomize seeds
        # CPU
        self.attention_seed = int(mindspore.seed() % sys.maxsize)

        mindspore.set_seed(self.attention_seed)
        mindspore.manual_seed(self.attention_seed)

    def _init_feed_forward_seed(self):
        """
        This function sets a new seed for the feed forward layer to make dropout deterministic for both forward calls:
        1 normal forward call and 1 forward call in backward to recalculate activations.
        """
        # randomize seeds

        # CPU
        self.feed_forward_seed = int(mindspore.seed() % sys.maxsize)

        mindspore.set_seed(self.feed_forward_seed)
        mindspore.manual_seed(self.feed_forward_seed)

    def forward(
        self,
        prev_attn_output,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        num_hashes=None,
        past_buckets_states=None,
        use_cache=False,
        orig_sequence_length=None,
        output_attentions=False,
    ):
        with no_grad():
            # every forward pass we sample a different seed
            # for dropout and save for forward fn in backward pass
            # to have correct dropout
            if self.training:
                self._init_attention_seed()

            attn_outputs = self.attention(
                hidden_states=hidden_states,
                head_mask=head_mask,
                attention_mask=attention_mask,
                num_hashes=num_hashes,
                past_buckets_states=past_buckets_states,
                use_cache=use_cache,
                orig_sequence_length=orig_sequence_length,
                output_attentions=output_attentions,
            )
            attn_output = attn_outputs.hidden_states

            # Implementation of RevNet (see Fig. 6 in https://towardsdatascience.com/illustrating-the-reformer-393575ac6ba0)
            # Y_1 = X_1 + f(X_2)
            attn_output = prev_attn_output + attn_output

            # free memory
            del prev_attn_output

            # every forward pass we sample a different seed
            # for dropout and save seed for forward fn in backward
            # to have correct dropout
            if self.training:
                self._init_feed_forward_seed()
            # Y_2 = X_2 + g(Y_1)
            hidden_states = hidden_states + self.feed_forward(attn_output)

        return ReformerOutput(
            attn_output=attn_output,
            hidden_states=hidden_states,
            attention_probs=attn_outputs.attention_probs,
            buckets=attn_outputs.buckets,
        )

    def backward_pass(
        self,
        next_attn_output,
        hidden_states,
        grad_attn_output,
        grad_hidden_states,
        attention_mask=None,
        head_mask=None,
        buckets=None,
    ):
        # Implements the backward pass for reversible ResNets.
        # A good blog post on how this works can be found here:
        # Implementation of RevNet (see Fig. 6 in https://towardsdatascience.com/illustrating-the-reformer-393575ac6ba0)
        # This code is heavily inspired by https://github.com/lucidrains/reformer-pytorch/blob/master/reformer_pytorch/reversible.py

        assert self.training, (
            "If you want to train `ReformerModel` and its variations, make sure to use `model.train()` to put the"
            " model into training mode."
        )

        with enable_grad():
            next_attn_output.requires_grad = True

            # set seed to have correct dropout
            mindspore.set_seed(self.feed_forward_seed)
            mindspore.manual_seed(self.feed_forward_seed)
            # g(Y_1)
            res_hidden_states = self.feed_forward(next_attn_output)
            res_hidden_states.backward(grad_hidden_states, retain_graph=True)

        with no_grad():
            # X_2 = Y_2 - g(Y_1)
            hidden_states = hidden_states - res_hidden_states
            del res_hidden_states

            grad_attn_output = grad_attn_output + next_attn_output.grad
            next_attn_output.grad = None

        with enable_grad():
            hidden_states.requires_grad = True

            # set seed to have correct dropout
            mindspore.set_seed(self.attention_seed)
            mindspore.manual_seed(self.attention_seed)
            # f(X_2)
            # use cached buckets for backprop if buckets are not None for LSHSelfAttention
            output = self.attention(
                hidden_states=hidden_states,
                head_mask=head_mask,
                attention_mask=attention_mask,
                buckets=buckets,
            ).hidden_states
            output.backward(grad_attn_output, retain_graph=True)

        with no_grad():
            # X_1 = Y_1 - f(X_2)
            attn_output = next_attn_output - output
            del output, next_attn_output

            grad_hidden_states = grad_hidden_states + hidden_states.grad
            hidden_states.grad = None

        return ReformerBackwardOutput(
            attn_output=attn_output,
            hidden_states=hidden_states,
            grad_attn_output=grad_attn_output,
            grad_hidden_states=grad_hidden_states,
        )
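
`ReformerLayer` implements the reversible residual scheme from RevNets: `forward` computes Y_1 = X_1 + f(X_2) and Y_2 = X_2 + g(Y_1), and `backward_pass` recovers X_2 = Y_2 - g(Y_1) and X_1 = Y_1 - f(X_2), so intermediate activations do not have to be stored for backpropagation (only the dropout seeds are kept so that f and g can be replayed exactly). A minimal NumPy sketch of that invertibility, with toy functions standing in for the attention and feed-forward blocks:

```python
# Minimal NumPy sketch (not MindSpore) of the reversible residual identity used above.
# f and g are toy stand-ins for the attention and feed-forward blocks.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.tanh(x)
g = lambda x: 0.5 * x

x1, x2 = rng.normal(size=(2, 4))   # the two residual streams

# forward: Y_1 = X_1 + f(X_2), Y_2 = X_2 + g(Y_1)
y1 = x1 + f(x2)
y2 = x2 + g(y1)

# backward reconstruction: X_2 = Y_2 - g(Y_1), X_1 = Y_1 - f(X_2)
x2_rec = y2 - g(y1)
x1_rec = y1 - f(x2_rec)

assert np.allclose(x1, x1_rec) and np.allclose(x2, x2_rec)
```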

mindnlp.transformers.models.reformer.modeling_reformer.ReformerModel

Bases: ReformerPreTrainedModel

Source code in mindnlp\transformers\models\reformer\modeling_reformer.py
class ReformerModel(ReformerPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.config = config
        assert (
            self.config.num_hidden_layers > 0
        ), "`config.attn_layers` is empty. Select at least one attn layer form ['lsh', 'local']"

        self.embeddings = ReformerEmbeddings(config)
        self.encoder = ReformerEncoder(config)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.embeddings.word_embeddings

    def set_input_embeddings(self, value):
        self.embeddings.word_embeddings = value

    def _prune_heads(self, heads_to_prune):
        """
        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
        class PreTrainedModel
        """
        for layer, heads in heads_to_prune.items():
            self.encoder.layer[layer].attention.prune_heads(heads)

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        num_hashes: Optional[int] = None,
        past_buckets_states: Optional[List[Tuple[mindspore.Tensor]]] = None,
        use_cache: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, ReformerModelOutput]:
        use_cache = use_cache if use_cache is not None else self.config.use_cache
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is not None:
            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
            input_shape = input_ids.shape  # noqa: F841
        elif inputs_embeds is not None:
            input_shape = inputs_embeds.shape[:-1]  # noqa: F841
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        assert (
            len(input_shape) == 2
        ), f"`input_ids` have be of shape `[batch_size, sequence_length]`, but got shape: {input_shape}"

        if past_buckets_states is not None:
            assert not self.training, "`past_buckets_states` can only be used for inference, not for training."

        # prepare head mask
        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers, is_attention_chunked=True)

        # original sequence length for padding
        orig_sequence_length = input_shape[-1]

        # if needs padding
        least_common_mult_chunk_length = _get_least_common_mult_chunk_len(self.config)
        min_chunk_length = _get_min_chunk_len(self.config)

        must_pad_to_match_chunk_length = (
            input_shape[-1] % least_common_mult_chunk_length != 0
            and input_shape[-1] > min_chunk_length
            and past_buckets_states is None
        )

        if must_pad_to_match_chunk_length:
            padding_length = least_common_mult_chunk_length - input_shape[-1] % least_common_mult_chunk_length

            if self.training is True:
                raise ValueError(
                    f"If training, sequence length {input_shape[-1]} has to be a multiple of least common multiple "
                    f"chunk_length {least_common_mult_chunk_length}. Please consider padding the input to a length "
                    f"of {input_shape[-1] + padding_length}."
                )

            # pad input
            input_ids, inputs_embeds, attention_mask, position_ids, input_shape = self._pad_to_mult_of_chunk_length(
                input_ids,
                inputs_embeds=inputs_embeds,
                attention_mask=attention_mask,
                position_ids=position_ids,
                input_shape=input_shape,
                padding_length=padding_length,
                padded_seq_length=least_common_mult_chunk_length,
            )

        # start index for position encoding depends on incremental decoding
        if past_buckets_states is not None:
            start_idx_pos_encodings = past_buckets_states[0][1].shape[1]
        else:
            start_idx_pos_encodings = 0

        embedding_output = self.embeddings(
            input_ids=input_ids,
            position_ids=position_ids,
            inputs_embeds=inputs_embeds,
            start_idx_pos_encodings=start_idx_pos_encodings,
        )

        encoder_outputs = self.encoder(
            hidden_states=embedding_output,
            head_mask=head_mask,
            attention_mask=attention_mask,
            num_hashes=num_hashes,
            past_buckets_states=past_buckets_states,
            use_cache=use_cache,
            orig_sequence_length=orig_sequence_length,
            output_hidden_states=output_hidden_states,
            output_attentions=output_attentions,
        )
        sequence_output = encoder_outputs.hidden_states

        # if padding was applied
        if must_pad_to_match_chunk_length:
            sequence_output = sequence_output[:, :orig_sequence_length]

        past_buckets_states = encoder_outputs.past_buckets_states if use_cache else None
        hidden_states = encoder_outputs.all_hidden_states if output_hidden_states else None
        attentions = encoder_outputs.all_attentions if output_attentions else None

        if not return_dict:
            return tuple(v for v in [sequence_output, past_buckets_states, hidden_states, attentions] if v is not None)
        return ReformerModelOutput(
            last_hidden_state=sequence_output,
            past_buckets_states=past_buckets_states,
            hidden_states=hidden_states,
            attentions=attentions,
        )

    def _pad_to_mult_of_chunk_length(
        self,
        input_ids,
        inputs_embeds=None,
        attention_mask=None,
        position_ids=None,
        input_shape=None,
        padding_length=None,
        padded_seq_length=None,
    ):
        logger.warning_once(
            f"Input ids are automatically padded from {input_shape[-1]} to {input_shape[-1] + padding_length} to be a "
            f"multiple of `config.chunk_length`: {padded_seq_length}"
        )

        padded_input_ids = ops.full(
            (input_shape[0], padding_length),
            self.config.pad_token_id,
            dtype=mindspore.int64,
        )

        # Extend `attention_mask`
        if attention_mask is not None:
            pad_attention_mask = ops.zeros(input_shape[0], padding_length, dtype=attention_mask.dtype)

            attention_mask = ops.cat([attention_mask, pad_attention_mask], dim=-1)
        else:
            attention_mask = ops.cat(
                [
                    ops.ones(input_shape, dtype=mindspore.bool_),
                    ops.zeros((input_shape[0], padding_length), dtype=mindspore.bool_),
                ],
                dim=-1,
            )

        # Extend `input_ids` with padding to match least common multiple chunk_length
        if input_ids is not None:
            input_ids = ops.cat([input_ids, padded_input_ids], dim=-1)
            input_shape = input_ids.shape

            # Pad position ids if given
            if position_ids is not None:
                padded_position_ids = ops.arange(input_shape[-1], padded_seq_length, dtype=mindspore.int64)
                padded_position_ids = position_ids.unsqueeze(0).broadcast_to((input_shape[0], padding_length))
                position_ids = ops.cat([position_ids, padded_position_ids], dim=-1)

        # Extend `inputs_embeds` with padding to match least common multiple chunk_length
        if inputs_embeds is not None:
            padded_inputs_embeds = self.embeddings(padded_input_ids, position_ids)
            inputs_embeds = ops.cat([inputs_embeds, padded_inputs_embeds], dim=-2)
            input_shape = inputs_embeds.shape
        return input_ids, inputs_embeds, attention_mask, position_ids, input_shape
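
For LSH and local attention the sequence length has to be a multiple of the least common multiple of all chunk lengths, so `forward` pads short remainders at inference time (and raises during training, where silent padding would corrupt the loss). A toy illustration of the padding arithmetic (the chunk length value below is assumed, not read from a real config):

```python
# Toy illustration of the padding rule in ReformerModel.forward; numbers are made up.
least_common_mult_chunk_length = 64   # assumed lcm of the local/LSH chunk lengths
sequence_length = 130

remainder = sequence_length % least_common_mult_chunk_length
padding_length = (least_common_mult_chunk_length - remainder) if remainder else 0

print(padding_length)                     # 62
print(sequence_length + padding_length)   # 192, a multiple of 64
```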

mindnlp.transformers.models.reformer.modeling_reformer.ReformerModelOutput dataclass

Bases: ModelOutput

Output type of [ReformerModel].

PARAMETER DESCRIPTION
last_hidden_state

Sequence of hidden-states at the last layer of the model.

num_predict corresponds to target_mapping.shape[1]. If target_mapping is None, then num_predict corresponds to sequence_length.

TYPE: `mindspore.Tensor` of shape `(batch_size, num_predict, hidden_size)`

past_buckets_states

List of Tuple(mindspore.Tensor, mindspore.Tensor) of length config.n_layers, with the first element being the previous buckets of shape (batch_size, num_heads, num_hashes, sequence_length) and the second being the previous hidden_states of shape (batch_size, sequence_length, hidden_size).

Contains precomputed buckets and hidden-states that can be used (see past_buckets_states input) to speed up sequential decoding.

TYPE: `List[Tuple(mindspore.Tensor, mindspore.Tensor)]`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True` DEFAULT: None

hidden_states

Tuple of mindspore.Tensor (one for the output of the embeddings and one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the initial embedding outputs.

TYPE: `tuple(mindspore.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True` DEFAULT: None

attentions

Tuple of mindspore.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

TYPE: `tuple(mindspore.Tensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True` DEFAULT: None

Source code in mindnlp\transformers\models\reformer\modeling_reformer.py
@dataclass
class ReformerModelOutput(ModelOutput):
    """
    Output type of [`ReformerModel`].

    Args:
        last_hidden_state (`mindspore.Tensor` of shape `(batch_size, num_predict, hidden_size)`):
            Sequence of hidden-states at the last layer of the model.

            `num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict`
            corresponds to `sequence_length`.
        past_buckets_states (`List[Tuple(mindspore.Tensor, mindspore.Tensor)]`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
            List of `Tuple(mindspore.Tensor, mindspore.Tensor` of length `config.n_layers`, with the first element
            being the previous *buckets* of shape `(batch_size, num_heads, num_hashes, sequence_length)`) and the
            second being the previous *hidden_states* of shape `(batch_size, sequence_length, hidden_size)`).

            Contains precomputed buckets and hidden-states that can be used (see `past_buckets_states` input) to speed
            up sequential decoding.
        hidden_states (`tuple(mindspore.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
            Tuple of `mindspore.Tensor` (one for the output of the embeddings and one for the output of each layer) of
            shape `(batch_size, sequence_length, hidden_size)`.

            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        attentions (`tuple(mindspore.Tensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
            Tuple of `mindspore.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
            sequence_length)`.

            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
            heads.
    """

    last_hidden_state: mindspore.Tensor
    past_buckets_states: Optional[List[Tuple[mindspore.Tensor, mindspore.Tensor]]] = None
    hidden_states: Optional[Tuple[mindspore.Tensor]] = None
    attentions: Optional[Tuple[mindspore.Tensor]] = None
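
With `return_dict=True` the fields above are available as attributes; with `return_dict=False` the same values come back as a plain tuple with the `None` entries dropped. A hedged end-to-end sketch (the checkpoint id is an assumption; any Reformer checkpoint works the same way):

```python
# Hedged usage sketch; the checkpoint id is an assumption, not part of this module.
import mindspore
from mindnlp.transformers import ReformerModel, ReformerTokenizer

tokenizer = ReformerTokenizer.from_pretrained("google/reformer-crime-and-punishment")
model = ReformerModel.from_pretrained("google/reformer-crime-and-punishment")

input_ids = mindspore.tensor([tokenizer.encode("A few months later")])
outputs = model(input_ids, output_hidden_states=True, return_dict=True)

print(outputs.last_hidden_state.shape)   # (batch_size, sequence_length, hidden_size)
print(len(outputs.hidden_states))        # embeddings output + one entry per layer
```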

mindnlp.transformers.models.reformer.modeling_reformer.ReformerModelWithLMHead

Bases: ReformerPreTrainedModel

Source code in mindnlp\transformers\models\reformer\modeling_reformer.py
class ReformerModelWithLMHead(ReformerPreTrainedModel):
    _tied_weights_keys = ["lm_head.decoder.weight", "lm_head.decoder.bias"]

    def __init__(self, config):
        super().__init__(config)
        assert config.is_decoder, "If you want to use `ReformerModelWithLMHead` make sure that `is_decoder=True`."
        assert "local" not in self.config.attn_layers or config.local_num_chunks_after == 0, (
            "If causal mask is enabled, make sure that `config.local_num_chunks_after` is set to 0 and not"
            f" {config.local_num_chunks_after}."
        )
        assert "lsh" not in self.config.attn_layers or config.lsh_num_chunks_after == 0, (
            "If causal mask is enabled, make sure that `config.lsh_num_chunks_after` is set to 1 and not"
            f" {config.lsh_num_chunks_after}."
        )

        self.reformer = ReformerModel(config)
        self.lm_head = ReformerOnlyLMHead(config)

        # Initialize weights and apply final processing
        self.post_init()

    def get_output_embeddings(self):
        return self.lm_head.decoder

    def set_output_embeddings(self, new_embeddings):
        self.lm_head.decoder = new_embeddings
        self.lm_head.bias = new_embeddings.bias

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        num_hashes: Optional[int] = None,
        past_buckets_states: Optional[List[Tuple[mindspore.Tensor]]] = None,
        use_cache: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        labels: Optional[mindspore.Tensor] = None,
    ) -> Union[Tuple, CausalLMOutput]:
        r"""
        labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the language modeling loss (next-token prediction). Indices should be in `[-100, 0,
                ..., config.vocab_size - 1]`. All labels set to `-100` are ignored (masked), the loss is only computed
                for labels in `[0, ..., config.vocab_size - 1]`
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        reformer_outputs = self.reformer(
            input_ids,
            position_ids=position_ids,
            attention_mask=attention_mask,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            num_hashes=num_hashes,
            past_buckets_states=past_buckets_states,
            use_cache=use_cache,
            output_hidden_states=output_hidden_states,
            output_attentions=output_attentions,
            return_dict=return_dict,
        )

        sequence_output = reformer_outputs[0]
        logits = self.lm_head(sequence_output)

        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n
            shift_logits = logits[..., :-1, :]
            shift_labels = labels[..., 1:]
            # Flatten the tokens
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(shift_logits.view(-1, self.config.vocab_size), shift_labels.view(-1))

        if not return_dict:
            output = (logits,) + reformer_outputs[1:]
            return ((loss,) + output) if loss is not None else output

        return ReformerModelWithLMHeadOutput(
            loss=loss,
            logits=logits,
            past_buckets_states=reformer_outputs.past_buckets_states,
            hidden_states=reformer_outputs.hidden_states,
            attentions=reformer_outputs.attentions,
        )

    def prepare_inputs_for_generation(
        self, input_ids, past_key_values=None, use_cache=None, num_hashes=None, **kwargs
    ):
        # only last token for inputs_ids if past is defined in kwargs
        if past_key_values is not None:
            input_ids = input_ids[:, -1:]

        inputs_dict = {
            "input_ids": input_ids,
            "past_buckets_states": past_key_values,
            "use_cache": use_cache,
            "num_hashes": num_hashes,
        }

        return inputs_dict

    def _reorder_cache(self, past_key_values, beam_idx):
        reord_past_buckets_states = []
        for layer_past in past_key_values:
            # buckets
            if layer_past[0] is not None:
                reord_buckets = layer_past[0].index_select(0, beam_idx)
            else:
                reord_buckets = None

            # hidden states
            reord_hidden_states = layer_past[1].index_select(0, beam_idx)
            reord_past_buckets_states.append((reord_buckets, reord_hidden_states))
        return reord_past_buckets_states
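
Because `is_decoder=True` and the attention layers are causal, the model can be used for autoregressive generation: `prepare_inputs_for_generation` feeds only the last token once `past_buckets_states` is available, and `_reorder_cache` keeps that cache aligned with beam indices. A hedged generation sketch, assuming the mindnlp generation API mirrors the 🤗 Transformers one (checkpoint id, prompt and generation kwargs are assumptions):

```python
# Hedged generation sketch; checkpoint id, prompt and generation kwargs are assumptions.
import mindspore
from mindnlp.transformers import ReformerModelWithLMHead, ReformerTokenizer

tokenizer = ReformerTokenizer.from_pretrained("google/reformer-crime-and-punishment")
model = ReformerModelWithLMHead.from_pretrained("google/reformer-crime-and-punishment")

input_ids = mindspore.tensor([tokenizer.encode("A few months later")])
generated = model.generate(input_ids, max_length=32, do_sample=False)
print(tokenizer.decode(generated[0].tolist()))
```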

mindnlp.transformers.models.reformer.modeling_reformer.ReformerModelWithLMHead.forward(input_ids=None, position_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, num_hashes=None, past_buckets_states=None, use_cache=None, output_hidden_states=None, output_attentions=None, return_dict=None, labels=None)

labels (mindspore.Tensor of shape (batch_size, sequence_length), optional): Labels for computing the language modeling loss (next-token prediction). Indices should be in [-100, 0, ..., config.vocab_size - 1]. All labels set to -100 are ignored (masked); the loss is only computed for labels in [0, ..., config.vocab_size - 1].

Source code in mindnlp\transformers\models\reformer\modeling_reformer.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    num_hashes: Optional[int] = None,
    past_buckets_states: Optional[List[Tuple[mindspore.Tensor]]] = None,
    use_cache: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    return_dict: Optional[bool] = None,
    labels: Optional[mindspore.Tensor] = None,
) -> Union[Tuple, CausalLMOutput]:
    r"""
    labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the language modeling loss (next-token prediction). Indices should be in `[-100, 0,
            ..., config.vocab_size - 1]`. All labels set to `-100` are ignored (masked), the loss is only computed
            for labels in `[0, ..., config.vocab_size - 1]`
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    reformer_outputs = self.reformer(
        input_ids,
        position_ids=position_ids,
        attention_mask=attention_mask,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        num_hashes=num_hashes,
        past_buckets_states=past_buckets_states,
        use_cache=use_cache,
        output_hidden_states=output_hidden_states,
        output_attentions=output_attentions,
        return_dict=return_dict,
    )

    sequence_output = reformer_outputs[0]
    logits = self.lm_head(sequence_output)

    loss = None
    if labels is not None:
        # Shift so that tokens < n predict n
        shift_logits = logits[..., :-1, :]
        shift_labels = labels[..., 1:]
        # Flatten the tokens
        loss_fct = CrossEntropyLoss()
        loss = loss_fct(shift_logits.view(-1, self.config.vocab_size), shift_labels.view(-1))

    if not return_dict:
        output = (logits,) + reformer_outputs[1:]
        return ((loss,) + output) if loss is not None else output

    return ReformerModelWithLMHeadOutput(
        loss=loss,
        logits=logits,
        past_buckets_states=reformer_outputs.past_buckets_states,
        hidden_states=reformer_outputs.hidden_states,
        attentions=reformer_outputs.attentions,
    )
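
The shift in the loss aligns each position with the next token: row i of the logits is trained to predict token i + 1, so the last logit row and the first label are dropped before both are flattened for the cross-entropy loss. A minimal NumPy sketch of the resulting shapes (values are random and purely illustrative):

```python
# Minimal NumPy sketch of the label shift used above; only the shapes matter here.
import numpy as np

batch, seq_len, vocab = 2, 5, 11
logits = np.random.randn(batch, seq_len, vocab)
labels = np.random.randint(0, vocab, size=(batch, seq_len))

shift_logits = logits[..., :-1, :]            # (batch, seq_len - 1, vocab)
shift_labels = labels[..., 1:]                # (batch, seq_len - 1)

print(shift_logits.reshape(-1, vocab).shape)  # (8, 11) -> rows handed to CrossEntropyLoss
print(shift_labels.reshape(-1).shape)         # (8,)    -> matching target ids
```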

mindnlp.transformers.models.reformer.modeling_reformer.ReformerModelWithLMHeadOutput dataclass

Bases: ModelOutput

Output type of [ReformerModelWithLMHead].

PARAMETER DESCRIPTION
logits

Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

num_predict corresponds to target_mapping.shape[1]. If target_mapping is None, then num_predict corresponds to sequence_length.

TYPE: `mindspore.Tensor` of shape `(batch_size, num_predict, config.vocab_size)` DEFAULT: None

past_buckets_states

List of Tuple(mindspore.Tensor, mindspore.Tensor) of length config.n_layers, with the first element being the previous buckets of shape (batch_size, num_heads, num_hashes, sequence_length) and the second being the previous hidden_states of shape (batch_size, sequence_length, hidden_size).

Contains precomputed buckets and hidden-states that can be used (see past_buckets_states input) to speed up sequential decoding.

TYPE: `List[Tuple(mindspore.Tensor, mindspore.Tensor)]`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True` DEFAULT: None

hidden_states

Tuple of mindspore.Tensor (one for the output of the embeddings and one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the initial embedding outputs.

TYPE: `tuple(mindspore.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True` DEFAULT: None

attentions

Tuple of mindspore.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

TYPE: `tuple(mindspore.Tensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True` DEFAULT: None

Source code in mindnlp\transformers\models\reformer\modeling_reformer.py
@dataclass
class ReformerModelWithLMHeadOutput(ModelOutput):
    """
    Output type of [`ReformerModelWithLMHead`].

    Args:
        loss (`mindspore.Tensor` of shape *(1,)*, *optional*, returned when `labels` is provided):
            Language modeling loss (for next-token prediction).
        logits (`mindspore.Tensor` of shape `(batch_size, num_predict, config.vocab_size)`):
            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

            `num_predict` corresponds to `target_mapping.shape[1]`. If `target_mapping` is `None`, then `num_predict`
            corresponds to `sequence_length`.
        past_buckets_states (`List[Tuple(mindspore.Tensor, mindspore.Tensor)]`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
            List of `Tuple(mindspore.Tensor, mindspore.Tensor` of length `config.n_layers`, with the first element
            being the previous *buckets* of shape `(batch_size, num_heads, num_hashes, sequence_length)`) and the
            second being the previous *hidden_states* of shape `(batch_size, sequence_length, hidden_size)`).

            Contains precomputed buckets and hidden-states that can be used (see `past_buckets_states` input) to speed
            up sequential decoding.
        hidden_states (`tuple(mindspore.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
            Tuple of `mindspore.Tensor` (one for the output of the embeddings and one for the output of each layer)
            of shape `(batch_size, sequence_length, hidden_size)`.

            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        attentions (`tuple(mindspore.Tensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
            Tuple of `mindspore.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
            sequence_length)`.

            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
            heads.
    """

    loss: Optional[mindspore.Tensor] = None
    logits: mindspore.Tensor = None
    past_buckets_states: Optional[List[Tuple[mindspore.Tensor, mindspore.Tensor]]] = None
    hidden_states: Optional[Tuple[mindspore.Tensor]] = None
    attentions: Optional[Tuple[mindspore.Tensor]] = None

mindnlp.transformers.models.reformer.modeling_reformer.ReformerPreTrainedModel

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in mindnlp\transformers\models\reformer\modeling_reformer.py
class ReformerPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """

    config_class = ReformerConfig
    base_model_prefix = "reformer"

    @property
    def dummy_inputs(self):
        input_ids = mindspore.tensor(DUMMY_INPUTS)
        input_mask = mindspore.tensor(DUMMY_MASK)
        dummy_inputs = {
            "input_ids": input_ids,
            "attention_mask": input_mask,
        }
        return dummy_inputs

    def _init_weights(self, module):
        """Initialize the weights"""
        if isinstance(module, AxialPositionEmbeddings):
            for weight in module.weights:
                nn.init.normal_(weight, std=self.config.axial_norm_std)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
            if module.padding_idx is not None:
                module.weight[module.padding_idx] = 0
        elif isinstance(module, nn.Linear):
            # Slightly different from the TF version which uses truncated_normal for initialization
            # cf https://github.com/pytorch/pytorch/pull/5617
            nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.LayerNorm):
            nn.init.zeros_(module.bias)
            nn.init.ones_(module.weight)

mindnlp.transformers.models.reformer.modeling_reformer.ReverseSort

Bases: Cell

After chunked attention has been applied to the sorted clusters, the original ordering has to be restored. Since a customized backward function is used for Reformer, the gradients of the output vectors also have to be explicitly sorted back here.

Source code in mindnlp\transformers\models\reformer\modeling_reformer.py
class ReverseSort(Cell):
    """
    After chunked attention is applied which sorted clusters, original ordering has to be restored. Since customized
    backward function is used for Reformer, the gradients of the output vectors have to be explicitly sorted here.
    """

    def construct(self, out_vectors, logits, sorted_bucket_idx, undo_sorted_bucket_idx):
        # save sorted_bucket_idx for backprop
        with no_grad():
            # undo sort to have correct order for next layer
            expanded_undo_sort_indices = undo_sorted_bucket_idx.unsqueeze(-1).broadcast_to(out_vectors.shape)
            out_vectors = ops.gather(out_vectors, 2, expanded_undo_sort_indices)
            logits = ops.gather(logits, 2, undo_sorted_bucket_idx)
        return out_vectors, logits

    def bprop(self, out_vectors, logits, sorted_bucket_idx, undo_sorted_bucket_idx, y, gy):
        # get parameters saved in ctx
        grad_out_vectors, grad_logits = gy

        expanded_sort_indices = sorted_bucket_idx.unsqueeze(-1).broadcast_to(grad_out_vectors.shape)
        # reverse sort of forward
        grad_out_vectors = ops.gather(grad_out_vectors, 2, expanded_sort_indices)
        grad_logits = ops.gather(grad_logits, 2, sorted_bucket_idx)

        # return grad and `None` fillers for last 2 forward args
        return grad_out_vectors, grad_logits, None, None
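
The undo step works because `undo_sorted_bucket_idx` is the inverse permutation of `sorted_bucket_idx`: gathering the chunk-sorted vectors with it restores the original token order, and the backward pass applies the forward permutation to the incoming gradients. A minimal NumPy sketch of the idea with toy indices (not the real bucket logic):

```python
# Minimal NumPy sketch of sorting and undoing the sort with an inverse permutation.
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=6)

sorted_idx = np.argsort(values)      # order the chunked attention works on
undo_idx = np.argsort(sorted_idx)    # inverse permutation

sorted_values = values[sorted_idx]   # "bucket-sorted" order
restored = sorted_values[undo_idx]   # gather with the undo indices

assert np.allclose(restored, values)
```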

mindnlp.transformers.models.reformer.tokenization_reformer

Tokenization class for model Reformer.

mindnlp.transformers.models.reformer.tokenization_reformer.ReformerTokenizer

Bases: PreTrainedTokenizer

Construct a Reformer tokenizer. Based on SentencePiece.

This tokenizer inherits from [PreTrainedTokenizer] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

PARAMETER DESCRIPTION
vocab_file

SentencePiece file (generally has a .spm extension) that contains the vocabulary necessary to instantiate a tokenizer.

TYPE: `str`

eos_token

The end of sequence token.

When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.

TYPE: `str`, *optional*, defaults to `"</s>"` DEFAULT: '</s>'

unk_token

The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

TYPE: `str`, *optional*, defaults to `"<unk>"` DEFAULT: '<unk>'

additional_special_tokens

Additional special tokens used by the tokenizer.

TYPE: `List[str]`, *optional*, defaults to `[]` DEFAULT: []

sp_model_kwargs

Will be passed to the SentencePieceProcessor.__init__() method. The Python wrapper for SentencePiece can be used, among other things, to set:

  • enable_sampling: Enable subword regularization.
  • nbest_size: Sampling parameters for unigram. Invalid for BPE-Dropout.

    • nbest_size = {0,1}: No sampling is performed.
    • nbest_size > 1: samples from the nbest_size results.
    • nbest_size < 0: assumes that nbest_size is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.
    • alpha: Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout.

TYPE: `dict`, *optional* DEFAULT: None

Source code in mindnlp\transformers\models\reformer\tokenization_reformer.py
class ReformerTokenizer(PreTrainedTokenizer):
    """
    Construct a Reformer tokenizer. Based on [SentencePiece](https://github.com/google/sentencepiece) .

    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`):
            [SentencePiece](https://github.com/google/sentencepiece) file (generally has a *.spm* extension) that
            contains the vocabulary necessary to instantiate a tokenizer.
        eos_token (`str`, *optional*, defaults to `"</s>"`):
            The end of sequence token.

            <Tip>

            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
            The token used is the `sep_token`.

            </Tip>

        unk_token (`str`, *optional*, defaults to `"<unk>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        additional_special_tokens (`List[str]`, *optional*, defaults to `[]`):
            Additional special tokens used by the tokenizer.
        sp_model_kwargs (`dict`, *optional*):
            Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
            SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things,
            to set:

            - `enable_sampling`: Enable subword regularization.
            - `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.

                - `nbest_size = {0,1}`: No sampling is performed.
                - `nbest_size > 1`: samples from the nbest_size results.
                - `nbest_size < 0`: assuming that nbest_size is infinite and samples from the all hypothesis (lattice)
                using forward-filtering-and-backward-sampling algorithm.
                - `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
                BPE-dropout.
    """
    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        vocab_file,
        eos_token="</s>",
        unk_token="<unk>",
        additional_special_tokens=[],
        sp_model_kwargs: Optional[Dict[str, Any]] = None,
        **kwargs,
    ) -> None:
        """
        Initializes a new instance of the ReformerTokenizer class.

        Args:
            self: The instance of the ReformerTokenizer class.
            vocab_file (str): Path to the vocabulary file.
            eos_token (str, optional): The end-of-sentence token. Defaults to '</s>'.
            unk_token (str, optional): The unknown token. Defaults to '<unk>'.
            additional_special_tokens (List[str], optional):
                Additional special tokens to be added to the vocabulary. Defaults to an empty list.
            sp_model_kwargs (Optional[Dict[str, Any]], optional):
                Additional arguments to be passed to the SentencePieceProcessor constructor. Defaults to None.

        Returns:
            None

        Raises:
            None
        """
        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs

        self.vocab_file = vocab_file
        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.Load(vocab_file)

        super().__init__(
            eos_token=eos_token,
            unk_token=unk_token,
            additional_special_tokens=additional_special_tokens,
            sp_model_kwargs=self.sp_model_kwargs,
            **kwargs,
        )

    @property
    def vocab_size(self):
        """
        Returns the size of the vocabulary used by the ReformerTokenizer.

        Args:
            self: The instance of the ReformerTokenizer class.

        Returns:
            int: The size of the vocabulary used by the ReformerTokenizer.

        Raises:
            None.
        """
        return self.sp_model.get_piece_size()

    def get_vocab(self) -> Dict[str, int]:
        """
        Get the vocabulary of the ReformerTokenizer.

        Args:
            self: An instance of the ReformerTokenizer class.

        Returns:
            A dictionary of type Dict[str, int] mapping tokens to their corresponding IDs. The IDs are integers.

        Raises:
            None.

        """
        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
        vocab.update(self.added_tokens_encoder)
        return vocab

    def __getstate__(self):
        """
        Method '__getstate__' in the class 'ReformerTokenizer'.

        Args:
            self (object): The instance of the ReformerTokenizer class.
                Represents the current instance of the ReformerTokenizer class.
                No restrictions.

        Returns:
            None:
                This method returns a dictionary containing the state of the ReformerTokenizer instance with the
                'sp_model' key set to None.

        Raises:
            None.
        """
        state = self.__dict__.copy()
        state["sp_model"] = None
        return state

    def __setstate__(self, d):
        """
        __setstate__ method in the class ReformerTokenizer.

        Args:
            self (object): The instance of the ReformerTokenizer class.
            d (dict): A dictionary containing the state information to be set.

        Returns:
            None.

        Raises:
            None.
        """
        self.__dict__ = d

        # for backward compatibility
        if not hasattr(self, "sp_model_kwargs"):
            self.sp_model_kwargs = {}

        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.Load(self.vocab_file)

    def _tokenize(self, text: str) -> List[str]:
        """Take as input a string and return a list of strings (tokens) for words/sub-words"""
        return self.sp_model.encode(text, out_type=str)

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.sp_model.piece_to_id(token)

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        if index < self.sp_model.get_piece_size():
            token = self.sp_model.IdToPiece(index)
        return token

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        current_sub_tokens = []
        out_string = ""
        for token in tokens:
            # make sure that special tokens are not decoded using sentencepiece model
            if token in self.all_special_tokens:
                out_string += self.sp_model.decode(current_sub_tokens) + token
                current_sub_tokens = []
            else:
                current_sub_tokens.append(token)
        out_string += self.sp_model.decode(current_sub_tokens)
        return out_string.strip()

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        """
        Save the vocabulary to a specified directory.

        Args:
            self (ReformerTokenizer): The instance of the ReformerTokenizer class.
            save_directory (str): The directory where the vocabulary will be saved.
            filename_prefix (Optional[str], optional): An optional prefix for the filename. Defaults to None.

        Returns:
            Tuple[str]: A tuple containing the path to the saved vocabulary file.

        Raises:
            OSError: If the save_directory is not a valid directory.
        """
        if not os.path.isdir(save_directory):
            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
            return
        out_vocab_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
        )

        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
            copyfile(self.vocab_file, out_vocab_file)
        elif not os.path.isfile(self.vocab_file):
            with open(out_vocab_file, "wb") as fi:
                content_spiece_model = self.sp_model.serialized_model_proto()
                fi.write(content_spiece_model)

        return (out_vocab_file,)
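
A hedged usage sketch of the tokenizer (the checkpoint id is an assumption; any Reformer SentencePiece vocabulary works the same way):

```python
# Hedged usage sketch; the checkpoint id is an assumption.
from mindnlp.transformers import ReformerTokenizer

tokenizer = ReformerTokenizer.from_pretrained("google/reformer-crime-and-punishment")

ids = tokenizer("The quick brown fox")["input_ids"]
print(ids)                                    # token ids produced by the SentencePiece model
print(tokenizer.convert_ids_to_tokens(ids))   # the corresponding sub-word pieces
print(tokenizer.decode(ids))                  # back to (approximately) the original string
```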

mindnlp.transformers.models.reformer.tokenization_reformer.ReformerTokenizer.vocab_size property

Returns the size of the vocabulary used by the ReformerTokenizer.

PARAMETER DESCRIPTION
self

The instance of the ReformerTokenizer class.

RETURNS DESCRIPTION
int

The size of the vocabulary used by the ReformerTokenizer.

mindnlp.transformers.models.reformer.tokenization_reformer.ReformerTokenizer.__getstate__()

Method '__getstate__' in the class 'ReformerTokenizer'.

PARAMETER DESCRIPTION
self

The instance of the ReformerTokenizer class. Represents the current instance of the ReformerTokenizer class. No restrictions.

TYPE: object

RETURNS DESCRIPTION
dict

A dictionary containing the state of the ReformerTokenizer instance with the 'sp_model' key set to None.

Source code in mindnlp\transformers\models\reformer\tokenization_reformer.py
def __getstate__(self):
    """
    Method '__getstate__' in the class 'ReformerTokenizer'.

    Args:
        self (object): The instance of the ReformerTokenizer class.
            Represents the current instance of the ReformerTokenizer class.
            No restrictions.

    Returns:
        None:
            This method returns a dictionary containing the state of the ReformerTokenizer instance with the
            'sp_model' key set to None.

    Raises:
        None.
    """
    state = self.__dict__.copy()
    state["sp_model"] = None
    return state

mindnlp.transformers.models.reformer.tokenization_reformer.ReformerTokenizer.__init__(vocab_file, eos_token='</s>', unk_token='<unk>', additional_special_tokens=[], sp_model_kwargs=None, **kwargs)

Initializes a new instance of the ReformerTokenizer class.

PARAMETER DESCRIPTION
self

The instance of the ReformerTokenizer class.

vocab_file

Path to the vocabulary file.

TYPE: str

eos_token

The end-of-sentence token. Defaults to '</s>'.

TYPE: str DEFAULT: '</s>'

unk_token

The unknown token. Defaults to '<unk>'.

TYPE: str DEFAULT: '<unk>'

additional_special_tokens

Additional special tokens to be added to the vocabulary. Defaults to an empty list.

TYPE: List[str] DEFAULT: []

sp_model_kwargs

Additional arguments to be passed to the SentencePieceProcessor constructor. Defaults to None.

TYPE: Optional[Dict[str, Any]] DEFAULT: None

RETURNS DESCRIPTION
None

None

Source code in mindnlp\transformers\models\reformer\tokenization_reformer.py
def __init__(
    self,
    vocab_file,
    eos_token="</s>",
    unk_token="<unk>",
    additional_special_tokens=[],
    sp_model_kwargs: Optional[Dict[str, Any]] = None,
    **kwargs,
) -> None:
    """
    Initializes a new instance of the ReformerTokenizer class.

    Args:
        self: The instance of the ReformerTokenizer class.
        vocab_file (str): Path to the vocabulary file.
        eos_token (str, optional): The end-of-sentence token. Defaults to '</s>'.
        unk_token (str, optional): The unknown token. Defaults to '<unk>'.
        additional_special_tokens (List[str], optional):
            Additional special tokens to be added to the vocabulary. Defaults to an empty list.
        sp_model_kwargs (Optional[Dict[str, Any]], optional):
            Additional arguments to be passed to the SentencePieceProcessor constructor. Defaults to None.

    Returns:
        None

    Raises:
        None
    """
    self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs

    self.vocab_file = vocab_file
    self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
    self.sp_model.Load(vocab_file)

    super().__init__(
        eos_token=eos_token,
        unk_token=unk_token,
        additional_special_tokens=additional_special_tokens,
        sp_model_kwargs=self.sp_model_kwargs,
        **kwargs,
    )

mindnlp.transformers.models.reformer.tokenization_reformer.ReformerTokenizer.__setstate__(d)

__setstate__ method in the class ReformerTokenizer.

PARAMETER DESCRIPTION
self

The instance of the ReformerTokenizer class.

TYPE: object

d

A dictionary containing the state information to be set.

TYPE: dict

RETURNS DESCRIPTION

None.

Source code in mindnlp\transformers\models\reformer\tokenization_reformer.py
def __setstate__(self, d):
    """
    __setstate__ method in the class ReformerTokenizer.

    Args:
        self (object): The instance of the ReformerTokenizer class.
        d (dict): A dictionary containing the state information to be set.

    Returns:
        None.

    Raises:
        None.
    """
    self.__dict__ = d

    # for backward compatibility
    if not hasattr(self, "sp_model_kwargs"):
        self.sp_model_kwargs = {}

    self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
    self.sp_model.Load(self.vocab_file)

mindnlp.transformers.models.reformer.tokenization_reformer.ReformerTokenizer.convert_tokens_to_string(tokens)

Converts a sequence of tokens (string) in a single string.

Source code in mindnlp\transformers\models\reformer\tokenization_reformer.py
def convert_tokens_to_string(self, tokens):
    """Converts a sequence of tokens (string) in a single string."""
    current_sub_tokens = []
    out_string = ""
    for token in tokens:
        # make sure that special tokens are not decoded using sentencepiece model
        if token in self.all_special_tokens:
            out_string += self.sp_model.decode(current_sub_tokens) + token
            current_sub_tokens = []
        else:
            current_sub_tokens.append(token)
    out_string += self.sp_model.decode(current_sub_tokens)
    return out_string.strip()

mindnlp.transformers.models.reformer.tokenization_reformer.ReformerTokenizer.get_vocab()

Get the vocabulary of the ReformerTokenizer.

PARAMETER DESCRIPTION
self

An instance of the ReformerTokenizer class.

RETURNS DESCRIPTION
Dict[str, int]

A dictionary of type Dict[str, int] mapping tokens to their corresponding IDs. The IDs are integers.

Source code in mindnlp\transformers\models\reformer\tokenization_reformer.py
def get_vocab(self) -> Dict[str, int]:
    """
    Get the vocabulary of the ReformerTokenizer.

    Args:
        self: An instance of the ReformerTokenizer class.

    Returns:
        A dictionary of type Dict[str, int] mapping tokens to their corresponding IDs. The IDs are integers.

    Raises:
        None.

    """
    vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
    vocab.update(self.added_tokens_encoder)
    return vocab

mindnlp.transformers.models.reformer.tokenization_reformer.ReformerTokenizer.save_vocabulary(save_directory, filename_prefix=None)

Save the vocabulary to a specified directory.

PARAMETER DESCRIPTION
self

The instance of the ReformerTokenizer class.

TYPE: ReformerTokenizer

save_directory

The directory where the vocabulary will be saved.

TYPE: str

filename_prefix

An optional prefix for the filename. Defaults to None.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Tuple[str]

Tuple[str]: A tuple containing the path to the saved vocabulary file.

RAISES DESCRIPTION
OSError

If the save_directory is not a valid directory.

Source code in mindnlp\transformers\models\reformer\tokenization_reformer.py
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
    """
    Save the vocabulary to a specified directory.

    Args:
        self (ReformerTokenizer): The instance of the ReformerTokenizer class.
        save_directory (str): The directory where the vocabulary will be saved.
        filename_prefix (Optional[str], optional): An optional prefix for the filename. Defaults to None.

    Returns:
        Tuple[str]: A tuple containing the path to the saved vocabulary file.

    Raises:
        OSError: If the save_directory is not a valid directory.
    """
    if not os.path.isdir(save_directory):
        logger.error(f"Vocabulary path ({save_directory}) should be a directory")
        return
    out_vocab_file = os.path.join(
        save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
    )

    if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
        copyfile(self.vocab_file, out_vocab_file)
    elif not os.path.isfile(self.vocab_file):
        with open(out_vocab_file, "wb") as fi:
            content_spiece_model = self.sp_model.serialized_model_proto()
            fi.write(content_spiece_model)

    return (out_vocab_file,)
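
In practice the method either copies the original .spm file into the target directory or, if that file is no longer on disk, re-serializes the loaded SentencePiece model there. A minimal usage sketch (directory name and prefix are hypothetical):

import os

os.makedirs("reformer_vocab", exist_ok=True)
(vocab_path,) = tokenizer.save_vocabulary("reformer_vocab", filename_prefix="reformer")
print(vocab_path)  # e.g. "reformer_vocab/reformer-spiece.model", per VOCAB_FILES_NAMES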

mindnlp.transformers.models.reformer.tokenization_reformer_fast

Tokenization class for model Reformer.

mindnlp.transformers.models.reformer.tokenization_reformer_fast.ReformerTokenizerFast

Bases: PreTrainedTokenizerFast

Construct a "fast" Reformer tokenizer (backed by HuggingFace's tokenizers library). Based on Unigram.

This tokenizer inherits from [PreTrainedTokenizerFast] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

PARAMETER DESCRIPTION
vocab_file

SentencePiece file (generally has a .spm extension) that contains the vocabulary necessary to instantiate a tokenizer.

TYPE: `str` DEFAULT: None

eos_token

The end of sequence token.

When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.

TYPE: `str`, *optional*, defaults to `"</s>"` DEFAULT: '</s>'

unk_token

The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

TYPE: `str`, *optional*, defaults to `"<unk>"` DEFAULT: '<unk>'

pad_token

The token used for padding, for example when batching sequences of different lengths.

TYPE: `str`, *optional*, defaults to `"<pad>"`

additional_special_tokens

Additional special tokens used by the tokenizer.

TYPE: `List[str]`, *optional* DEFAULT: []

Source code in mindnlp\transformers\models\reformer\tokenization_reformer_fast.py
class ReformerTokenizerFast(PreTrainedTokenizerFast):
    """
    Construct a "fast" Reformer tokenizer (backed by HuggingFace's *tokenizers* library). Based on
    [Unigram](https://hf-mirror.com/docs/tokenizers/python/latest/components.html?highlight=unigram#models).

    This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
    refer to this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`):
            [SentencePiece](https://github.com/google/sentencepiece) file (generally has a *.spm* extension) that
            contains the vocabulary necessary to instantiate a tokenizer.
        eos_token (`str`, *optional*, defaults to `"</s>"`):
            The end of sequence token.

            <Tip>

            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
            The token used is the `sep_token`.

            </Tip>

        unk_token (`str`, *optional*, defaults to `"<unk>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        pad_token (`str`, *optional*, defaults to `"<pad>"`):
            The token used for padding, for example when batching sequences of different lengths.
        additional_special_tokens (`List[str]`, *optional*):
            Additional special tokens used by the tokenizer.
    """
    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    model_input_names = ["input_ids", "attention_mask"]
    slow_tokenizer_class = ReformerTokenizer

    def __init__(
        self,
        vocab_file=None,
        tokenizer_file=None,
        eos_token="</s>",
        unk_token="<unk>",
        additional_special_tokens=[],
        **kwargs,
    ):
        """
        __init__

        Initializes the ReformerTokenizerFast class.

        Args:
            self: The instance of the class.
            vocab_file (str): The path to the vocabulary file. If not provided, the tokenizer will use a
                default vocabulary.
            tokenizer_file (str): The path to the tokenizer file. If not provided, the tokenizer will use a
                default tokenizer.
            eos_token (str): The end-of-sequence token. Defaults to '</s>'.
            unk_token (str): The unknown token. Defaults to '<unk>'.
            additional_special_tokens (list): A list of additional special tokens to be added to the vocabulary.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(
            vocab_file,
            tokenizer_file=tokenizer_file,
            eos_token=eos_token,
            unk_token=unk_token,
            additional_special_tokens=additional_special_tokens,
            **kwargs,
        )

        self.vocab_file = vocab_file

    @property
    def can_save_slow_tokenizer(self) -> bool:
        """
        Method to check if the slow tokenizer can be saved.

        Args:
            self (ReformerTokenizerFast): An instance of the ReformerTokenizerFast class.
                This parameter refers to the current instance of the ReformerTokenizerFast class.

        Returns:
            bool: A boolean value indicating whether the slow tokenizer can be saved.
                Returns True if the vocab_file exists, otherwise returns False.

        Raises:
            None.
        """
        return os.path.isfile(self.vocab_file) if self.vocab_file else False

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        """Save the vocabulary for a ReformerTokenizerFast instance.

        Args:
            self (ReformerTokenizerFast): The instance of the ReformerTokenizerFast class.
            save_directory (str): The directory where the vocabulary will be saved.
            filename_prefix (Optional[str]): An optional prefix for the filename. Defaults to None.

        Returns:
            Tuple[str]: A tuple containing the path to the saved vocabulary file.

        Raises:
            ValueError: If the fast tokenizer does not have the necessary information to save the vocabulary
                for a slow tokenizer.
            OSError: If the specified save_directory is not a valid directory.
        """
        if not self.can_save_slow_tokenizer:
            raise ValueError(
                "Your fast tokenizer does not have the necessary information to save the vocabulary for a slow "
                "tokenizer."
            )

        if not os.path.isdir(save_directory):
            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
            return
        out_vocab_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
        )

        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
            copyfile(self.vocab_file, out_vocab_file)

        return (out_vocab_file,)
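
A hedged usage sketch for the fast tokenizer, assuming the import path below and loading the checkpoint referenced elsewhere on this page (google/reformer-crime-and-punishment):

from mindnlp.transformers import ReformerTokenizerFast  # assumed import path

fast_tokenizer = ReformerTokenizerFast.from_pretrained("google/reformer-crime-and-punishment")
enc = fast_tokenizer("The quick brown fox jumps over the lazy dog.")
print(enc["input_ids"])       # token ids produced by the Unigram model
print(enc["attention_mask"])  # matches model_input_names above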

mindnlp.transformers.models.reformer.tokenization_reformer_fast.ReformerTokenizerFast.can_save_slow_tokenizer: bool property

Method to check if the slow tokenizer can be saved.

PARAMETER DESCRIPTION
self

An instance of the ReformerTokenizerFast class. This parameter refers to the current instance of the ReformerTokenizerFast class.

TYPE: ReformerTokenizerFast

RETURNS DESCRIPTION
bool

A boolean value indicating whether the slow tokenizer can be saved. Returns True if the vocab_file exists, otherwise returns False.

TYPE: bool

mindnlp.transformers.models.reformer.tokenization_reformer_fast.ReformerTokenizerFast.__init__(vocab_file=None, tokenizer_file=None, eos_token='</s>', unk_token='<unk>', additional_special_tokens=[], **kwargs)

__init__

Initializes the ReformerTokenizerFast class.

PARAMETER DESCRIPTION
self

The instance of the class.

vocab_file

The path to the vocabulary file. If not provided, the tokenizer will use a default vocabulary.

TYPE: str DEFAULT: None

tokenizer_file

The path to the tokenizer file. If not provided, the tokenizer will use a default tokenizer.

TYPE: str DEFAULT: None

eos_token

The end-of-sequence token. Defaults to '</s>'.

TYPE: str DEFAULT: '</s>'

unk_token

The unknown token. Defaults to '<unk>'.

TYPE: str DEFAULT: '<unk>'

additional_special_tokens

A list of additional special tokens to be added to the vocabulary.

TYPE: list DEFAULT: []

RETURNS DESCRIPTION

None.

Source code in mindnlp\transformers\models\reformer\tokenization_reformer_fast.py
def __init__(
    self,
    vocab_file=None,
    tokenizer_file=None,
    eos_token="</s>",
    unk_token="<unk>",
    additional_special_tokens=[],
    **kwargs,
):
    """
    __init__

    Initializes the ReformerTokenizerFast class.

    Args:
        self: The instance of the class.
        vocab_file (str): The path to the vocabulary file. If not provided, the tokenizer will use a
            default vocabulary.
        tokenizer_file (str): The path to the tokenizer file. If not provided, the tokenizer will use a
            default tokenizer.
        eos_token (str): The end-of-sequence token. Defaults to '</s>'.
        unk_token (str): The unknown token. Defaults to '<unk>'.
        additional_special_tokens (list): A list of additional special tokens to be added to the vocabulary.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(
        vocab_file,
        tokenizer_file=tokenizer_file,
        eos_token=eos_token,
        unk_token=unk_token,
        additional_special_tokens=additional_special_tokens,
        **kwargs,
    )

    self.vocab_file = vocab_file
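
Construction from local files is also possible: when tokenizer_file is provided it backs the fast tokenizer, while vocab_file is stored so the slow (SentencePiece) vocabulary can still be exported via save_vocabulary. The file paths below are hypothetical:

fast_tokenizer = ReformerTokenizerFast(
    vocab_file="spiece.model",        # SentencePiece model, kept for slow-vocab export
    tokenizer_file="tokenizer.json",  # serialized tokenizers-library file
    eos_token="</s>",
    unk_token="<unk>",
)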

mindnlp.transformers.models.reformer.tokenization_reformer_fast.ReformerTokenizerFast.save_vocabulary(save_directory, filename_prefix=None)

Save the vocabulary for a ReformerTokenizerFast instance.

PARAMETER DESCRIPTION
self

The instance of the ReformerTokenizerFast class.

TYPE: ReformerTokenizerFast

save_directory

The directory where the vocabulary will be saved.

TYPE: str

filename_prefix

An optional prefix for the filename. Defaults to None.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Tuple[str]

Tuple[str]: A tuple containing the path to the saved vocabulary file.

RAISES DESCRIPTION
ValueError

If the fast tokenizer does not have the necessary information to save the vocabulary for a slow tokenizer.

OSError

If the specified save_directory is not a valid directory.

Source code in mindnlp\transformers\models\reformer\tokenization_reformer_fast.py
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
    """Save the vocabulary for a ReformerTokenizerFast instance.

    Args:
        self (ReformerTokenizerFast): The instance of the ReformerTokenizerFast class.
        save_directory (str): The directory where the vocabulary will be saved.
        filename_prefix (Optional[str]): An optional prefix for the filename. Defaults to None.

    Returns:
        Tuple[str]: A tuple containing the path to the saved vocabulary file.

    Raises:
        ValueError: If the fast tokenizer does not have the necessary information to save the vocabulary
            for a slow tokenizer.
        OSError: If the specified save_directory is not a valid directory.
    """
    if not self.can_save_slow_tokenizer:
        raise ValueError(
            "Your fast tokenizer does not have the necessary information to save the vocabulary for a slow "
            "tokenizer."
        )

    if not os.path.isdir(save_directory):
        logger.error(f"Vocabulary path ({save_directory}) should be a directory")
        return
    out_vocab_file = os.path.join(
        save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
    )

    if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
        copyfile(self.vocab_file, out_vocab_file)

    return (out_vocab_file,)
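
Because the fast tokenizer only stores a reference to the original SentencePiece file, exporting the slow vocabulary works only while that file is still available, which is what the can_save_slow_tokenizer guard checks. A hedged sketch (directory name hypothetical):

import os

os.makedirs("reformer_vocab", exist_ok=True)
if fast_tokenizer.can_save_slow_tokenizer:
    (spm_path,) = fast_tokenizer.save_vocabulary("reformer_vocab")
else:
    # Without the original .spm file, only the fast tokenizer files can be written.
    fast_tokenizer.save_pretrained("reformer_vocab")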