qwen2

`mindnlp.transformers.models.qwen2.configuration_qwen2`

Qwen2 model configuration

`mindnlp.transformers.models.qwen2.configuration_qwen2.Qwen2Config`

Bases: PretrainedConfig

This is the configuration class that stores the configuration of a [`Qwen2Model`]. It is used to instantiate a Qwen2 model according to the specified arguments, which define the model architecture. Instantiating a configuration with the defaults yields a configuration similar to that of Qwen2-7B-beta (Qwen/Qwen2-7B-beta).

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER	DESCRIPTION
`vocab_size`	Vocabulary size of the Qwen2 model. Defines the number of different tokens that can be represented by the `input_ids` passed when calling [`Qwen2Model`]. TYPE: `int`, optional DEFAULT: `151936`
`hidden_size`	Dimension of the hidden representations. TYPE: `int`, optional DEFAULT: `4096`
`intermediate_size`	Dimension of the MLP representations. TYPE: `int`, optional DEFAULT: `22016`
`num_hidden_layers`	Number of hidden layers in the Transformer decoder. TYPE: `int`, optional DEFAULT: `32`
`num_attention_heads`	Number of attention heads for each attention layer in the Transformer decoder. TYPE: `int`, optional DEFAULT: `32`
`num_key_value_heads`	The number of key/value heads used to implement Grouped Query Attention (GQA). If `num_key_value_heads=num_attention_heads`, the model uses Multi Head Attention (MHA); if `num_key_value_heads=1`, it uses Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be constructed by mean-pooling all the original heads within that group. For more details, see the GQA paper (https://arxiv.org/pdf/2305.13245.pdf). If `None` is passed, `num_attention_heads` is used. TYPE: `int`, optional DEFAULT: `32`
`hidden_act`	The non-linear activation function (function or string) in the decoder. TYPE: `str` or `function`, optional DEFAULT: `'silu'`
`max_position_embeddings`	The maximum sequence length that this model might ever be used with. TYPE: `int`, optional DEFAULT: `32768`
`initializer_range`	The standard deviation of the truncated_normal_initializer for initializing all weight matrices. TYPE: `float`, optional DEFAULT: `0.02`
`rms_norm_eps`	The epsilon used by the RMS normalization layers. TYPE: `float`, optional DEFAULT: `1e-06`
`use_cache`	Whether the model should return the last key/value attention states (not used by all models). Only relevant if `config.is_decoder=True`. TYPE: `bool`, optional DEFAULT: `True`
`tie_word_embeddings`	Whether the model's input and output word embeddings should be tied. TYPE: `bool`, optional DEFAULT: `False`
`rope_theta`	The base period of the RoPE embeddings. TYPE: `float`, optional DEFAULT: `10000.0`
`use_sliding_window`	Whether to use sliding window attention. TYPE: `bool`, optional DEFAULT: `False`
`sliding_window`	Sliding window attention (SWA) window size. TYPE: `int`, optional DEFAULT: `4096`
`max_window_layers`	The number of layers that use SWA (Sliding Window Attention). The bottom layers use SWA while the top layers use full attention. TYPE: `int`, optional DEFAULT: `28`
`attention_dropout`	The dropout ratio for the attention probabilities. TYPE: `float`, optional DEFAULT: `0.0`
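
The relationship between `num_attention_heads` and `num_key_value_heads` described above can be sketched in plain Python. The helper name `attention_variant` is hypothetical and not part of mindnlp; the divisibility check is an assumption about how query heads are shared across key/value heads, not something this page states.

```python
def attention_variant(num_attention_heads, num_key_value_heads=None):
    """Classify the attention layout implied by the two head counts.

    Hypothetical helper for illustration only; it is not a mindnlp API.
    """
    # Mirror the backward-compatibility fallback in Qwen2Config.__init__:
    # a missing value falls back to the number of attention heads.
    if num_key_value_heads is None:
        num_key_value_heads = num_attention_heads

    # Assumption: query heads are split evenly across key/value heads.
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError(
            "num_attention_heads must be divisible by num_key_value_heads"
        )

    if num_key_value_heads == num_attention_heads:
        return "MHA"  # one key/value head per query head
    if num_key_value_heads == 1:
        return "MQA"  # a single key/value head shared by all query heads
    return "GQA"      # several query heads share each key/value head


if __name__ == "__main__":
    for kv_heads in (32, 8, 1, None):
        print(kv_heads, attention_variant(32, kv_heads))
```

With the documented defaults (32 attention heads and 32 key/value heads) this sketch reports MHA.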

Example

>>> from mindnlp.transformers import Qwen2Model, Qwen2Config
...
>>> # Initializing a Qwen2 style configuration
>>> configuration = Qwen2Config()
...
>>> # Initializing a model from the Qwen2-7B style configuration
>>> model = Qwen2Model(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
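
The mean-pooling conversion mentioned under `num_key_value_heads` can be illustrated with a small, dependency-free sketch. It is a simplification under stated assumptions: each head is represented as a flat list of floats rather than a real projection weight, consecutive heads are assumed to form a group, and `mean_pool_kv_heads` is a hypothetical name, not a mindnlp function.

```python
def mean_pool_kv_heads(heads, num_key_value_heads):
    """Average groups of per-head vectors down to num_key_value_heads heads.

    heads: list of equal-length lists of floats, one entry per original head.
    Hypothetical helper; assumes consecutive heads belong to the same group.
    """
    num_heads = len(heads)
    if num_key_value_heads <= 0 or num_heads % num_key_value_heads != 0:
        raise ValueError("number of heads must be divisible by num_key_value_heads")

    group_size = num_heads // num_key_value_heads
    pooled = []
    for g in range(num_key_value_heads):
        # Slice out the original heads that belong to this group.
        group = heads[g * group_size:(g + 1) * group_size]
        # Element-wise mean across the heads in the group.
        pooled.append([sum(column) / group_size for column in zip(*group)])
    return pooled


if __name__ == "__main__":
    original = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(mean_pool_kv_heads(original, 2))  # two grouped heads (GQA)
    print(mean_pool_kv_heads(original, 1))  # one shared head (MQA)
```

A real checkpoint conversion would apply the same averaging to the key and value projection weights of each layer.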

Source code in mindnlp\transformers\models\qwen2\configuration_qwen2.py

class Qwen2Config(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`Qwen2Model`]. It is used to instantiate a
    Qwen2 model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of
    Qwen2-7B-beta [Qwen/Qwen2-7B-beta](https://hf-mirror.com/Qwen/Qwen2-7B-beta).

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 151936):
            Vocabulary size of the Qwen2 model. Defines the number of different tokens that can be represented by the
            `input_ids` passed when calling [`Qwen2Model`]
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 22016):
            Dimension of the MLP representations.
        num_hidden_layers (`int`, *optional*, defaults to 32):
            Number of hidden layers in the Transformer decoder.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads for each attention layer in the Transformer decoder.
        num_key_value_heads (`int`, *optional*, defaults to 32):
            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
            `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
            by mean-pooling all the original heads within that group. For more details, check out [this
            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to `32`.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to 32768):
            The maximum sequence length that this model might ever be used with.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        rms_norm_eps (`float`, *optional*, defaults to 1e-06):
            The epsilon used by the rms normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether the model's input and output word embeddings should be tied.
        rope_theta (`float`, *optional*, defaults to 10000.0):
            The base period of the RoPE embeddings.
        use_sliding_window (`bool`, *optional*, defaults to `False`):
            Whether to use sliding window attention.
        sliding_window (`int`, *optional*, defaults to 4096):
            Sliding window attention (SWA) window size. If not specified, will default to `4096`.
        max_window_layers (`int`, *optional*, defaults to 28):
            The number of layers that use SWA (Sliding Window Attention). The bottom layers use SWA while the top
            use full attention.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.

    Example:
        ```python
        >>> from mindnlp.transformers import Qwen2Model, Qwen2Config
        ...
        >>> # Initializing a Qwen2 style configuration
        >>> configuration = Qwen2Config()
        ...
        >>> # Initializing a model from the Qwen2-7B style configuration
        >>> model = Qwen2Model(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "qwen2"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=151936,
        hidden_size=4096,
        intermediate_size=22016,
        num_hidden_layers=32,
        num_attention_heads=32,
        num_key_value_heads=32,
        hidden_act="silu",
        max_position_embeddings=32768,
        initializer_range=0.02,
        rms_norm_eps=1e-6,
        use_cache=True,
        tie_word_embeddings=False,
        rope_theta=10000.0,
        use_sliding_window=False,
        sliding_window=4096,
        max_window_layers=28,
        attention_dropout=0.0,
        **kwargs,
    ):
        """
        __init__

        Initializes a Qwen2Config object.

        Args:
            self: The instance of the class.
            vocab_size (int): The size of the vocabulary. Default is 151936.
            hidden_size (int): The size of the hidden layers. Default is 4096.
            intermediate_size (int): The size of the intermediate layer. Default is 22016.
            num_hidden_layers (int): The number of hidden layers. Default is 32.
            num_attention_heads (int): The number of attention heads. Default is 32.
            num_key_value_heads (int): The number of key-value attention heads. Default is 32.
            hidden_act (str): The activation function for the hidden layers. Default is 'silu'.
            max_position_embeddings (int): The maximum position embeddings. Default is 32768.
            initializer_range (float): The range for random weight initialization. Default is 0.02.
            rms_norm_eps (float): The epsilon value for RMS normalization. Default is 1e-06.
            use_cache (bool): Indicates whether to use caching. Default is True.
            tie_word_embeddings (bool): Indicates whether to tie word embeddings. Default is False.
            rope_theta (float): The theta value for rope. Default is 10000.0.
            use_sliding_window (bool): Indicates whether to use sliding window. Default is False.
            sliding_window (int): The size of the sliding window. Default is 4096.
            max_window_layers (int): The maximum number of window layers. Default is 28.
            attention_dropout (float): The dropout rate for attention. Default is 0.0.

        Returns:
            None.

        Raises:
            None.
        """
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.use_sliding_window = use_sliding_window
        self.sliding_window = sliding_window
        self.max_window_layers = max_window_layers

        # for backward compatibility
        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads

        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.attention_dropout = attention_dropout

        super().__init__(
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )

`mindnlp.transformers.models.qwen2.configuration_qwen2.Qwen2Config.__init__(vocab_size=151936, hidden_size=4096, intermediate_size=22016, num_hidden_layers=32, num_attention_heads=32, num_key_value_heads=32, hidden_act='silu', max_position_embeddings=32768, initializer_range=0.02, rms_norm_eps=1e-06, use_cache=True, tie_word_embeddings=False, rope_theta=10000.0, use_sliding_window=False, sliding_window=4096, max_window_layers=28, attention_dropout=0.0, **kwargs)`

`__init__`

Initializes a Qwen2Config object.

PARAMETER	DESCRIPTION
`self`	The instance of the class.
`vocab_size`	The size of the vocabulary. Default is 151936. TYPE: `int` DEFAULT: `151936`
`hidden_size`	The size of the hidden layers. Default is 4096. TYPE: `int` DEFAULT: `4096`
`intermediate_size`	The size of the intermediate layer. Default is 22016. TYPE: `int` DEFAULT: `22016`
`num_hidden_layers`	The number of hidden layers. Default is 32. TYPE: `int` DEFAULT: `32`
`num_attention_heads`	The number of attention heads. Default is 32. TYPE: `int` DEFAULT: `32`
`num_key_value_heads`	The number of key-value attention heads. Default is 32. TYPE: `int` DEFAULT: `32`
`hidden_act`	The activation function for the hidden layers. Default is 'silu'. TYPE: `str` DEFAULT: `'silu'`
`max_position_embeddings`	The maximum position embeddings. Default is 32768. TYPE: `int` DEFAULT: `32768`
`initializer_range`	The range for random weight initialization. Default is 0.02. TYPE: `float` DEFAULT: `0.02`
`rms_norm_eps`	The epsilon value for RMS normalization. Default is 1e-06. TYPE: `float` DEFAULT: `1e-06`
`use_cache`	Indicates whether to use caching. Default is True. TYPE: `bool` DEFAULT: `True`
`tie_word_embeddings`	Indicates whether to tie word embeddings. Default is False. TYPE: `bool` DEFAULT: `False`
`rope_theta`	The theta value for rope. Default is 10000.0. TYPE: `float` DEFAULT: `10000.0`
`use_sliding_window`	Indicates whether to use sliding window. Default is False. TYPE: `bool` DEFAULT: `False`
`sliding_window`	The size of the sliding window. Default is 4096. TYPE: `int` DEFAULT: `4096`
`max_window_layers`	The maximum number of window layers. Default is 28. TYPE: `int` DEFAULT: `28`
`attention_dropout`	The dropout rate for attention. Default is 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
	None.

`mindnlp.transformers.models.qwen2.modeling_qwen2`

MindSpore Qwen2 model.

`mindnlp.transformers.models.qwen2.modeling_qwen2.Qwen2Attention`

Bases: Module

Multi-headed attention from the "Attention Is All You Need" paper, modified to use sliding window attention as in Longformer and "Generating Long Sequences with Sparse Transformers".
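
As a rough illustration of what sliding window attention restricts, the sketch below builds a causal mask in which each query position may attend only to itself and a fixed number of preceding positions. It is a generic pure-Python example, not the `Qwen2Attention` implementation; the function name is hypothetical, and the convention that the window counts the current token is an assumption that may differ from the library's exact boundary.

```python
def sliding_window_causal_mask(seq_len, window):
    """Return mask[i][j] == True when query i may attend to key j.

    Hypothetical helper. Assumed convention: position i sees keys j with
    j <= i (causal) and i - j < window (the window includes the token itself).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # Print the mask as rows of 1s and 0s for a short sequence.
    for row in sliding_window_causal_mask(seq_len=6, window=3):
        print("".join("1" if allowed else "0" for allowed in row))
```

When `window` is at least `seq_len`, the mask reduces to an ordinary causal mask, which corresponds to full attention.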