bigbird_pegasus

`mindnlp.transformers.models.bigbird_pegasus.configuration_bigbird_pegasus.BigBirdPegasusConfig` ¶

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [BigBirdPegasusModel]. It is used to instantiate an BigBirdPegasus model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the BigBirdPegasus google/bigbird-pegasus-large-arxiv architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER	DESCRIPTION
`vocab_size`	Vocabulary size of the BigBirdPegasus model. Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling [`BigBirdPegasusModel`]. TYPE: `int`, optional, defaults to 96103 DEFAULT: `96103`
`d_model`	Dimension of the layers and the pooler layer. TYPE: `int`, optional, defaults to 1024 DEFAULT: `1024`
`encoder_layers`	Number of encoder layers. TYPE: `int`, optional, defaults to 16 DEFAULT: `16`
`decoder_layers`	Number of decoder layers. TYPE: `int`, optional, defaults to 16 DEFAULT: `16`
`encoder_attention_heads`	Number of attention heads for each attention layer in the Transformer encoder. TYPE: `int`, optional, defaults to 16 DEFAULT: `16`
`decoder_attention_heads`	Number of attention heads for each attention layer in the Transformer decoder. TYPE: `int`, optional, defaults to 16 DEFAULT: `16`
`decoder_ffn_dim`	Dimension of the "intermediate" (often named feed-forward) layer in decoder. TYPE: `int`, optional, defaults to 4096 DEFAULT: `4096`
`encoder_ffn_dim`	Dimension of the "intermediate" (often named feed-forward) layer in decoder. TYPE: `int`, optional, defaults to 4096 DEFAULT: `4096`
`activation_function`	The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported. TYPE: `str` or `function`, optional, defaults to `"gelu_new"` DEFAULT: `'gelu_new'`
`dropout`	The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. TYPE: `float`, optional, defaults to 0.1 DEFAULT: `0.1`
`attention_dropout`	The dropout ratio for the attention probabilities. TYPE: `float`, optional, defaults to 0.0 DEFAULT: `0.0`
`activation_dropout`	The dropout ratio for activations inside the fully connected layer. TYPE: `float`, optional, defaults to 0.0 DEFAULT: `0.0`
`classifier_dropout`	The dropout ratio for classifier. TYPE: `float`, optional, defaults to 0.0 DEFAULT: `0.0`
`max_position_embeddings`	The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 1024 or 2048 or 4096). TYPE: `int`, optional, defaults to 4096 DEFAULT: `4096`
`init_std`	The standard deviation of the truncated_normal_initializer for initializing all weight matrices. TYPE: `float`, optional, defaults to 0.02 DEFAULT: `0.02`
`encoder_layerdrop`	The LayerDrop probability for the encoder. See the LayerDrop paper for more details. TYPE: `float`, optional, defaults to 0.0 DEFAULT: `0.0`
`decoder_layerdrop`	The LayerDrop probability for the decoder. See the LayerDrop paper for more details. TYPE: `float`, optional, defaults to 0.0 DEFAULT: `0.0`
`use_cache`	Whether or not the model should return the last key/values attentions (not used by all models). TYPE: `bool`, optional, defaults to `True` DEFAULT: `True`

Example

>>> from transformers import BigBirdPegasusConfig, BigBirdPegasusModel
...
>>> # Initializing a BigBirdPegasus bigbird-pegasus-base style configuration
>>> configuration = BigBirdPegasusConfig()
...
>>> # Initializing a model (with random weights) from the bigbird-pegasus-base style configuration
>>> model = BigBirdPegasusModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config

Source code in mindnlp\transformers\models\bigbird_pegasus\configuration_bigbird_pegasus.py

class BigBirdPegasusConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`BigBirdPegasusModel`]. It is used to instantiate
    an BigBirdPegasus model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the BigBirdPegasus
    [google/bigbird-pegasus-large-arxiv](https://hf-mirror.com/google/bigbird-pegasus-large-arxiv) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 96103):
            Vocabulary size of the BigBirdPegasus model. Defines the number of different tokens that can be represented
            by the `inputs_ids` passed when calling [`BigBirdPegasusModel`].
        d_model (`int`, *optional*, defaults to 1024):
            Dimension of the layers and the pooler layer.
        encoder_layers (`int`, *optional*, defaults to 16):
            Number of encoder layers.
        decoder_layers (`int`, *optional*, defaults to 16):
            Number of decoder layers.
        encoder_attention_heads (`int`, *optional*, defaults to 16):
            Number of attention heads for each attention layer in the Transformer encoder.
        decoder_attention_heads (`int`, *optional*, defaults to 16):
            Number of attention heads for each attention layer in the Transformer decoder.
        decoder_ffn_dim (`int`, *optional*, defaults to 4096):
            Dimension of the "intermediate" (often named feed-forward) layer in decoder.
        encoder_ffn_dim (`int`, *optional*, defaults to 4096):
            Dimension of the "intermediate" (often named feed-forward) layer in decoder.
        activation_function (`str` or `function`, *optional*, defaults to `"gelu_new"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        dropout (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        activation_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for activations inside the fully connected layer.
        classifier_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for classifier.
        max_position_embeddings (`int`, *optional*, defaults to 4096):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 1024 or 2048 or 4096).
        init_std (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        encoder_layerdrop (`float`, *optional*, defaults to 0.0):
            The LayerDrop probability for the encoder. See the [LayerDrop paper](see https://arxiv.org/abs/1909.11556)
            for more details.
        decoder_layerdrop (`float`, *optional*, defaults to 0.0):
            The LayerDrop probability for the decoder. See the [LayerDrop paper](see https://arxiv.org/abs/1909.11556)
            for more details.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models).
        attention_type (`str`, *optional*, defaults to `"block_sparse"`)
            Whether to use block sparse attention (with n complexity) as introduced in paper or original attention
            layer (with n^2 complexity) in encoder. Possible values are `"original_full"` and `"block_sparse"`.
        use_bias (`bool`, *optional*, defaults to `False`)
            Whether to use bias in query, key, value.
        block_size (`int`, *optional*, defaults to 64)
            Size of each block. Useful only when `attention_type == "block_sparse"`.
        num_random_blocks (`int`, *optional*, defaults to 3)
            Each query is going to attend these many number of random blocks. Useful only when `attention_type ==
            "block_sparse"`.
        scale_embeddings (`bool`, *optional*, defaults to `True`)
            Whether to rescale embeddings with (hidden_size ** 0.5).

    Example:
        ```python
        >>> from transformers import BigBirdPegasusConfig, BigBirdPegasusModel
        ...
        >>> # Initializing a BigBirdPegasus bigbird-pegasus-base style configuration
        >>> configuration = BigBirdPegasusConfig()
        ...
        >>> # Initializing a model (with random weights) from the bigbird-pegasus-base style configuration
        >>> model = BigBirdPegasusModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "bigbird_pegasus"
    keys_to_ignore_at_inference = ["past_key_values"]
    attribute_map = {
        "num_attention_heads": "encoder_attention_heads",
        "hidden_size": "d_model",
        "attention_probs_dropout_prob": "attention_dropout",
    }

    def __init__(
        self,
        vocab_size=96103,
        max_position_embeddings=4096,
        encoder_layers=16,
        encoder_ffn_dim=4096,
        encoder_attention_heads=16,
        decoder_layers=16,
        decoder_ffn_dim=4096,
        decoder_attention_heads=16,
        encoder_layerdrop=0.0,
        decoder_layerdrop=0.0,
        use_cache=True,
        is_encoder_decoder=True,
        activation_function="gelu_new",
        d_model=1024,
        dropout=0.1,
        attention_dropout=0.0,
        activation_dropout=0.0,
        init_std=0.02,
        decoder_start_token_id=2,
        classifier_dropout=0.0,
        scale_embedding=True,
        pad_token_id=0,
        bos_token_id=2,
        eos_token_id=1,
        attention_type="block_sparse",  # only for encoder
        block_size=64,
        num_random_blocks=3,
        use_bias=False,
        **kwargs,
    ):
        """
        Initializes a new instance of the BigBirdPegasusConfig class.

        Args:
            self: The instance of the class.
            vocab_size (int, optional): The size of the vocabulary. Defaults to 96103.
            max_position_embeddings (int, optional): The maximum number of positional embeddings. Defaults to 4096.
            encoder_layers (int, optional): The number of encoder layers. Defaults to 16.
            encoder_ffn_dim (int, optional): The dimension of the encoder feed-forward network. Defaults to 4096.
            encoder_attention_heads (int, optional): The number of attention heads in the encoder. Defaults to 16.
            decoder_layers (int, optional): The number of decoder layers. Defaults to 16.
            decoder_ffn_dim (int, optional): The dimension of the decoder feed-forward network. Defaults to 4096.
            decoder_attention_heads (int, optional): The number of attention heads in the decoder. Defaults to 16.
            encoder_layerdrop (float, optional): The probability of dropping an encoder layer. Defaults to 0.0.
            decoder_layerdrop (float, optional): The probability of dropping a decoder layer. Defaults to 0.0.
            use_cache (bool, optional): Whether to use cache. Defaults to True.
            is_encoder_decoder (bool, optional): Whether the model is an encoder-decoder. Defaults to True.
            activation_function (str, optional): The activation function to be used. Defaults to 'gelu_new'.
            d_model (int, optional): The model dimension. Defaults to 1024.
            dropout (float, optional): The dropout probability. Defaults to 0.1.
            attention_dropout (float, optional): The dropout probability for attention layers. Defaults to 0.0.
            activation_dropout (float, optional): The dropout probability for activation layers. Defaults to 0.0.
            init_std (float, optional): The standard deviation for weight initialization. Defaults to 0.02.
            decoder_start_token_id (int, optional): The start token id for the decoder. Defaults to 2.
            classifier_dropout (float, optional): The dropout probability for the classifier. Defaults to 0.0.
            scale_embedding (bool, optional): Whether to scale the embeddings. Defaults to True.
            pad_token_id (int, optional): The id for padding tokens. Defaults to 0.
            bos_token_id (int, optional): The id for the beginning of sequence token. Defaults to 2.
            eos_token_id (int, optional): The id for the end of sequence token. Defaults to 1.
            attention_type (str, optional): The type of attention mechanism. Defaults to 'block_sparse'.
            block_size (int, optional): The size of blocks for block_sparse attention. Defaults to 64.
            num_random_blocks (int, optional): The number of random blocks for block_sparse attention. Defaults to 3.
            use_bias (bool, optional): Whether to use bias. Defaults to False.

        Returns:
            None.

        Raises:
            None.
        """
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.d_model = d_model
        self.encoder_ffn_dim = encoder_ffn_dim
        self.encoder_layers = encoder_layers
        self.encoder_attention_heads = encoder_attention_heads
        self.decoder_ffn_dim = decoder_ffn_dim
        self.decoder_layers = decoder_layers
        self.decoder_attention_heads = decoder_attention_heads
        self.dropout = dropout
        self.attention_dropout = attention_dropout
        self.activation_dropout = activation_dropout
        self.activation_function = activation_function
        self.init_std = init_std
        self.encoder_layerdrop = encoder_layerdrop
        self.decoder_layerdrop = decoder_layerdrop
        self.classifier_dropout = classifier_dropout
        self.use_cache = use_cache
        self.num_hidden_layers = encoder_layers
        self.scale_embedding = scale_embedding  # scale factor will be sqrt(d_model) if True

        # extra config
        self.attention_type = attention_type
        self.block_size = block_size
        self.num_random_blocks = num_random_blocks
        self.use_bias = use_bias

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            is_encoder_decoder=is_encoder_decoder,
            decoder_start_token_id=decoder_start_token_id,
            **kwargs,
        )

mindnlp.transformers.models.bigbird_pegasus.configuration_bigbird_pegasus.BigBirdPegasusConfig.init(vocab_size=96103, max_position_embeddings=4096, encoder_layers=16, encoder_ffn_dim=4096, encoder_attention_heads=16, decoder_layers=16, decoder_ffn_dim=4096, decoder_attention_heads=16, encoder_layerdrop=0.0, decoder_layerdrop=0.0, use_cache=True, is_encoder_decoder=True, activation_function='gelu_new', d_model=1024, dropout=0.1, attention_dropout=0.0, activation_dropout=0.0, init_std=0.02, decoder_start_token_id=2, classifier_dropout=0.0, scale_embedding=True, pad_token_id=0, bos_token_id=2, eos_token_id=1, attention_type='block_sparse', block_size=64, num_random_blocks=3, use_bias=False, **kwargs) ¶

Initializes a new instance of the BigBirdPegasusConfig class.

PARAMETER	DESCRIPTION
`self`	The instance of the class.
`vocab_size`	The size of the vocabulary. Defaults to 96103. TYPE: `int` DEFAULT: `96103`
`max_position_embeddings`	The maximum number of positional embeddings. Defaults to 4096. TYPE: `int` DEFAULT: `4096`
`encoder_layers`	The number of encoder layers. Defaults to 16. TYPE: `int` DEFAULT: `16`
`encoder_ffn_dim`	The dimension of the encoder feed-forward network. Defaults to 4096. TYPE: `int` DEFAULT: `4096`
`encoder_attention_heads`	The number of attention heads in the encoder. Defaults to 16. TYPE: `int` DEFAULT: `16`
`decoder_layers`	The number of decoder layers. Defaults to 16. TYPE: `int` DEFAULT: `16`
`decoder_ffn_dim`	The dimension of the decoder feed-forward network. Defaults to 4096. TYPE: `int` DEFAULT: `4096`
`decoder_attention_heads`	The number of attention heads in the decoder. Defaults to 16. TYPE: `int` DEFAULT: `16`
`encoder_layerdrop`	The probability of dropping an encoder layer. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`
`decoder_layerdrop`	The probability of dropping a decoder layer. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`
`use_cache`	Whether to use cache. Defaults to True. TYPE: `bool` DEFAULT: `True`
`is_encoder_decoder`	Whether the model is an encoder-decoder. Defaults to True. TYPE: `bool` DEFAULT: `True`
`activation_function`	The activation function to be used. Defaults to 'gelu_new'. TYPE: `str` DEFAULT: `'gelu_new'`
`d_model`	The model dimension. Defaults to 1024. TYPE: `int` DEFAULT: `1024`
`dropout`	The dropout probability. Defaults to 0.1. TYPE: `float` DEFAULT: `0.1`
`attention_dropout`	The dropout probability for attention layers. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`
`activation_dropout`	The dropout probability for activation layers. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`
`init_std`	The standard deviation for weight initialization. Defaults to 0.02. TYPE: `float` DEFAULT: `0.02`
`decoder_start_token_id`	The start token id for the decoder. Defaults to 2. TYPE: `int` DEFAULT: `2`
`classifier_dropout`	The dropout probability for the classifier. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`
`scale_embedding`	Whether to scale the embeddings. Defaults to True. TYPE: `bool` DEFAULT: `True`
`pad_token_id`	The id for padding tokens. Defaults to 0. TYPE: `int` DEFAULT: `0`
`bos_token_id`	The id for the beginning of sequence token. Defaults to 2. TYPE: `int` DEFAULT: `2`
`eos_token_id`	The id for the end of sequence token. Defaults to 1. TYPE: `int` DEFAULT: `1`
`attention_type`	The type of attention mechanism. Defaults to 'block_sparse'. TYPE: `str` DEFAULT: `'block_sparse'`
`block_size`	The size of blocks for block_sparse attention. Defaults to 64. TYPE: `int` DEFAULT: `64`
`num_random_blocks`	The number of random blocks for block_sparse attention. Defaults to 3. TYPE: `int` DEFAULT: `3`
`use_bias`	Whether to use bias. Defaults to False. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
	None.

Source code in mindnlp\transformers\models\bigbird_pegasus\configuration_bigbird_pegasus.py

def __init__(
    self,
    vocab_size=96103,
    max_position_embeddings=4096,
    encoder_layers=16,
    encoder_ffn_dim=4096,
    encoder_attention_heads=16,
    decoder_layers=16,
    decoder_ffn_dim=4096,
    decoder_attention_heads=16,
    encoder_layerdrop=0.0,
    decoder_layerdrop=0.0,
    use_cache=True,
    is_encoder_decoder=True,
    activation_function="gelu_new",
    d_model=1024,
    dropout=0.1,
    attention_dropout=0.0,
    activation_dropout=0.0,
    init_std=0.02,
    decoder_start_token_id=2,
    classifier_dropout=0.0,
    scale_embedding=True,
    pad_token_id=0,
    bos_token_id=2,
    eos_token_id=1,
    attention_type="block_sparse",  # only for encoder
    block_size=64,
    num_random_blocks=3,
    use_bias=False,
    **kwargs,
):
    """
    Initializes a new instance of the BigBirdPegasusConfig class.

    Args:
        self: The instance of the class.
        vocab_size (int, optional): The size of the vocabulary. Defaults to 96103.
        max_position_embeddings (int, optional): The maximum number of positional embeddings. Defaults to 4096.
        encoder_layers (int, optional): The number of encoder layers. Defaults to 16.
        encoder_ffn_dim (int, optional): The dimension of the encoder feed-forward network. Defaults to 4096.
        encoder_attention_heads (int, optional): The number of attention heads in the encoder. Defaults to 16.
        decoder_layers (int, optional): The number of decoder layers. Defaults to 16.
        decoder_ffn_dim (int, optional): The dimension of the decoder feed-forward network. Defaults to 4096.
        decoder_attention_heads (int, optional): The number of attention heads in the decoder. Defaults to 16.
        encoder_layerdrop (float, optional): The probability of dropping an encoder layer. Defaults to 0.0.
        decoder_layerdrop (float, optional): The probability of dropping a decoder layer. Defaults to 0.0.
        use_cache (bool, optional): Whether to use cache. Defaults to True.
        is_encoder_decoder (bool, optional): Whether the model is an encoder-decoder. Defaults to True.
        activation_function (str, optional): The activation function to be used. Defaults to 'gelu_new'.
        d_model (int, optional): The model dimension. Defaults to 1024.
        dropout (float, optional): The dropout probability. Defaults to 0.1.
        attention_dropout (float, optional): The dropout probability for attention layers. Defaults to 0.0.
        activation_dropout (float, optional): The dropout probability for activation layers. Defaults to 0.0.
        init_std (float, optional): The standard deviation for weight initialization. Defaults to 0.02.
        decoder_start_token_id (int, optional): The start token id for the decoder. Defaults to 2.
        classifier_dropout (float, optional): The dropout probability for the classifier. Defaults to 0.0.
        scale_embedding (bool, optional): Whether to scale the embeddings. Defaults to True.
        pad_token_id (int, optional): The id for padding tokens. Defaults to 0.
        bos_token_id (int, optional): The id for the beginning of sequence token. Defaults to 2.
        eos_token_id (int, optional): The id for the end of sequence token. Defaults to 1.
        attention_type (str, optional): The type of attention mechanism. Defaults to 'block_sparse'.
        block_size (int, optional): The size of blocks for block_sparse attention. Defaults to 64.
        num_random_blocks (int, optional): The number of random blocks for block_sparse attention. Defaults to 3.
        use_bias (bool, optional): Whether to use bias. Defaults to False.

    Returns:
        None.

    Raises:
        None.
    """
    self.vocab_size = vocab_size
    self.max_position_embeddings = max_position_embeddings
    self.d_model = d_model
    self.encoder_ffn_dim = encoder_ffn_dim
    self.encoder_layers = encoder_layers
    self.encoder_attention_heads = encoder_attention_heads
    self.decoder_ffn_dim = decoder_ffn_dim
    self.decoder_layers = decoder_layers
    self.decoder_attention_heads = decoder_attention_heads
    self.dropout = dropout
    self.attention_dropout = attention_dropout
    self.activation_dropout = activation_dropout
    self.activation_function = activation_function
    self.init_std = init_std
    self.encoder_layerdrop = encoder_layerdrop
    self.decoder_layerdrop = decoder_layerdrop
    self.classifier_dropout = classifier_dropout
    self.use_cache = use_cache
    self.num_hidden_layers = encoder_layers
    self.scale_embedding = scale_embedding  # scale factor will be sqrt(d_model) if True

    # extra config
    self.attention_type = attention_type
    self.block_size = block_size
    self.num_random_blocks = num_random_blocks
    self.use_bias = use_bias

    super().__init__(
        pad_token_id=pad_token_id,
        bos_token_id=bos_token_id,
        eos_token_id=eos_token_id,
        is_encoder_decoder=is_encoder_decoder,
        decoder_start_token_id=decoder_start_token_id,
        **kwargs,
    )