biogpt

`mindnlp.transformers.models.biogpt.configuration_biogpt.BioGptConfig` ¶

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [BioGptModel]. It is used to instantiate an BioGPT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the BioGPT microsoft/biogpt architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER	DESCRIPTION
`vocab_size`	Vocabulary size of the BioGPT model. Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling [`BioGptModel`]. TYPE: `int`, optional, defaults to 42384 DEFAULT: `42384`
`hidden_size`	Dimension of the encoder layers and the pooler layer. TYPE: `int`, optional, defaults to 1024 DEFAULT: `1024`
`num_hidden_layers`	Number of hidden layers in the Transformer encoder. TYPE: `int`, optional, defaults to 24 DEFAULT: `24`
`num_attention_heads`	Number of attention heads for each attention layer in the Transformer encoder. TYPE: `int`, optional, defaults to 16 DEFAULT: `16`
`intermediate_size`	Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. TYPE: `int`, optional, defaults to 4096 DEFAULT: `4096`
`hidden_act`	The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` are supported. TYPE: `str` or `function`, optional, defaults to `"gelu"` DEFAULT: `'gelu'`
`hidden_dropout_prob`	The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. TYPE: `float`, optional, defaults to 0.1 DEFAULT: `0.1`
`attention_probs_dropout_prob`	The dropout ratio for the attention probabilities. TYPE: `float`, optional, defaults to 0.1 DEFAULT: `0.1`
`max_position_embeddings`	The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). TYPE: `int`, optional, defaults to 1024 DEFAULT: `1024`
`initializer_range`	The standard deviation of the truncated_normal_initializer for initializing all weight matrices. TYPE: `float`, optional, defaults to 0.02 DEFAULT: `0.02`
`layer_norm_eps`	The epsilon used by the layer normalization layers. TYPE: `float`, optional, defaults to 1e-12 DEFAULT: `1e-12`
`scale_embedding`	Scale embeddings by diving by sqrt(d_model). TYPE: `bool`, optional, defaults to `True` DEFAULT: `True`
`use_cache`	Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if `config.is_decoder=True`. TYPE: `bool`, optional, defaults to `True` DEFAULT: `True`
`layerdrop`	Please refer to the paper about LayerDrop: https://arxiv.org/abs/1909.11556 for further details TYPE: `float`, optional, defaults to 0.0 DEFAULT: `0.0`
`activation_dropout`	The dropout ratio for activations inside the fully connected layer. TYPE: `float`, optional, defaults to 0.0 DEFAULT: `0.0`
`pad_token_id`	Padding token id. TYPE: `int`, optional, defaults to 1 DEFAULT: `1`
`bos_token_id`	Beginning of stream token id. TYPE: `int`, optional, defaults to 0 DEFAULT: `0`
`eos_token_id`	End of stream token id. TYPE: `int`, optional, defaults to 2 DEFAULT: `2`

Example

>>> from transformers import BioGptModel, BioGptConfig
...
>>> # Initializing a BioGPT microsoft/biogpt style configuration
>>> configuration = BioGptConfig()
...
>>> # Initializing a model from the microsoft/biogpt style configuration
>>> model = BioGptModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config

Source code in mindnlp\transformers\models\biogpt\configuration_biogpt.py

class BioGptConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`BioGptModel`]. It is used to instantiate an
    BioGPT model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the BioGPT
    [microsoft/biogpt](https://hf-mirror.com/microsoft/biogpt) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 42384):
            Vocabulary size of the BioGPT model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`BioGptModel`].
        hidden_size (`int`, *optional*, defaults to 1024):
            Dimension of the encoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 24):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 16):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (`int`, *optional*, defaults to 4096):
            Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
        hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"selu"` and `"gelu_new"` are supported.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (`int`, *optional*, defaults to 1024):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        scale_embedding (`bool`, *optional*, defaults to `True`):
            Scale embeddings by diving by sqrt(d_model).
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        layerdrop (`float`, *optional*, defaults to 0.0):
            Please refer to the paper about LayerDrop: https://arxiv.org/abs/1909.11556 for further details
        activation_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for activations inside the fully connected layer.
        pad_token_id (`int`, *optional*, defaults to 1):
            Padding token id.
        bos_token_id (`int`, *optional*, defaults to 0):
            Beginning of stream token id.
        eos_token_id (`int`, *optional*, defaults to 2):
            End of stream token id.

    Example:
        ```python
        >>> from transformers import BioGptModel, BioGptConfig
        ...
        >>> # Initializing a BioGPT microsoft/biogpt style configuration
        >>> configuration = BioGptConfig()
        ...
        >>> # Initializing a model from the microsoft/biogpt style configuration
        >>> model = BioGptModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "biogpt"

    def __init__(
        self,
        vocab_size=42384,
        hidden_size=1024,
        num_hidden_layers=24,
        num_attention_heads=16,
        intermediate_size=4096,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=1024,
        initializer_range=0.02,
        layer_norm_eps=1e-12,
        scale_embedding=True,
        use_cache=True,
        layerdrop=0.0,
        activation_dropout=0.0,
        pad_token_id=1,
        bos_token_id=0,
        eos_token_id=2,
        **kwargs,
    ):
        """
        Initializes a new instance of the BioGptConfig class.

        Args:
            self: The instance of the class.
            vocab_size (int): The size of the vocabulary. Defaults to 42384.
            hidden_size (int): The size of the hidden layers. Defaults to 1024.
            num_hidden_layers (int): The number of hidden layers. Defaults to 24.
            num_attention_heads (int): The number of attention heads. Defaults to 16.
            intermediate_size (int): The size of the intermediate layers. Defaults to 4096.
            hidden_act (str): The activation function for the hidden layers. Defaults to 'gelu'.
            hidden_dropout_prob (float): The dropout probability for the hidden layers. Defaults to 0.1.
            attention_probs_dropout_prob (float): The dropout probability for the attention probabilities. Defaults to 0.1.
            max_position_embeddings (int): The maximum number of position embeddings. Defaults to 1024.
            initializer_range (float): The range for the initializer. Defaults to 0.02.
            layer_norm_eps (float): The epsilon value for layer normalization. Defaults to 1e-12.
            scale_embedding (bool): Whether to scale the embedding. Defaults to True.
            use_cache (bool): Whether to use caching. Defaults to True.
            layerdrop (float): The probability of dropping a layer. Defaults to 0.0.
            activation_dropout (float): The dropout probability for the activation. Defaults to 0.0.
            pad_token_id (int): The id of the padding token. Defaults to 1.
            bos_token_id (int): The id of the beginning-of-sentence token. Defaults to 0.
            eos_token_id (int): The id of the end-of-sentence token. Defaults to 2.

        Returns:
            None.

        Raises:
            None.
        """
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_act = hidden_act
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.scale_embedding = scale_embedding
        self.use_cache = use_cache
        self.layerdrop = layerdrop
        self.activation_dropout = activation_dropout
        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)

`mindnlp.transformers.models.biogpt.configuration_biogpt.BioGptConfig.init(vocab_size=42384, hidden_size=1024, num_hidden_layers=24, num_attention_heads=16, intermediate_size=4096, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=1024, initializer_range=0.02, layer_norm_eps=1e-12, scale_embedding=True, use_cache=True, layerdrop=0.0, activation_dropout=0.0, pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs)` ¶

Initializes a new instance of the BioGptConfig class.

PARAMETER	DESCRIPTION
`self`	The instance of the class.
`vocab_size`	The size of the vocabulary. Defaults to 42384. TYPE: `int` DEFAULT: `42384`
`hidden_size`	The size of the hidden layers. Defaults to 1024. TYPE: `int` DEFAULT: `1024`
`num_hidden_layers`	The number of hidden layers. Defaults to 24. TYPE: `int` DEFAULT: `24`
`num_attention_heads`	The number of attention heads. Defaults to 16. TYPE: `int` DEFAULT: `16`
`intermediate_size`	The size of the intermediate layers. Defaults to 4096. TYPE: `int` DEFAULT: `4096`
`hidden_act`	The activation function for the hidden layers. Defaults to 'gelu'. TYPE: `str` DEFAULT: `'gelu'`
`hidden_dropout_prob`	The dropout probability for the hidden layers. Defaults to 0.1. TYPE: `float` DEFAULT: `0.1`
`attention_probs_dropout_prob`	The dropout probability for the attention probabilities. Defaults to 0.1. TYPE: `float` DEFAULT: `0.1`
`max_position_embeddings`	The maximum number of position embeddings. Defaults to 1024. TYPE: `int` DEFAULT: `1024`
`initializer_range`	The range for the initializer. Defaults to 0.02. TYPE: `float` DEFAULT: `0.02`
`layer_norm_eps`	The epsilon value for layer normalization. Defaults to 1e-12. TYPE: `float` DEFAULT: `1e-12`
`scale_embedding`	Whether to scale the embedding. Defaults to True. TYPE: `bool` DEFAULT: `True`
`use_cache`	Whether to use caching. Defaults to True. TYPE: `bool` DEFAULT: `True`
`layerdrop`	The probability of dropping a layer. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`
`activation_dropout`	The dropout probability for the activation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`
`pad_token_id`	The id of the padding token. Defaults to 1. TYPE: `int` DEFAULT: `1`
`bos_token_id`	The id of the beginning-of-sentence token. Defaults to 0. TYPE: `int` DEFAULT: `0`
`eos_token_id`	The id of the end-of-sentence token. Defaults to 2. TYPE: `int` DEFAULT: `2`

RETURNS	DESCRIPTION
	None.

Source code in mindnlp\transformers\models\biogpt\configuration_biogpt.py

def __init__(
    self,
    vocab_size=42384,
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=1024,
    initializer_range=0.02,
    layer_norm_eps=1e-12,
    scale_embedding=True,
    use_cache=True,
    layerdrop=0.0,
    activation_dropout=0.0,
    pad_token_id=1,
    bos_token_id=0,
    eos_token_id=2,
    **kwargs,
):
    """
    Initializes a new instance of the BioGptConfig class.

    Args:
        self: The instance of the class.
        vocab_size (int): The size of the vocabulary. Defaults to 42384.
        hidden_size (int): The size of the hidden layers. Defaults to 1024.
        num_hidden_layers (int): The number of hidden layers. Defaults to 24.
        num_attention_heads (int): The number of attention heads. Defaults to 16.
        intermediate_size (int): The size of the intermediate layers. Defaults to 4096.
        hidden_act (str): The activation function for the hidden layers. Defaults to 'gelu'.
        hidden_dropout_prob (float): The dropout probability for the hidden layers. Defaults to 0.1.
        attention_probs_dropout_prob (float): The dropout probability for the attention probabilities. Defaults to 0.1.
        max_position_embeddings (int): The maximum number of position embeddings. Defaults to 1024.
        initializer_range (float): The range for the initializer. Defaults to 0.02.
        layer_norm_eps (float): The epsilon value for layer normalization. Defaults to 1e-12.
        scale_embedding (bool): Whether to scale the embedding. Defaults to True.
        use_cache (bool): Whether to use caching. Defaults to True.
        layerdrop (float): The probability of dropping a layer. Defaults to 0.0.
        activation_dropout (float): The dropout probability for the activation. Defaults to 0.0.
        pad_token_id (int): The id of the padding token. Defaults to 1.
        bos_token_id (int): The id of the beginning-of-sentence token. Defaults to 0.
        eos_token_id (int): The id of the end-of-sentence token. Defaults to 2.

    Returns:
        None.

    Raises:
        None.
    """
    self.vocab_size = vocab_size
    self.max_position_embeddings = max_position_embeddings
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.intermediate_size = intermediate_size
    self.hidden_act = hidden_act
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.initializer_range = initializer_range
    self.layer_norm_eps = layer_norm_eps
    self.scale_embedding = scale_embedding
    self.use_cache = use_cache
    self.layerdrop = layerdrop
    self.activation_dropout = activation_dropout
    super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)

`mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptForCausalLM` ¶

Bases: BioGptPreTrainedModel

Source code in mindnlp\transformers\models\biogpt\modeling_biogpt.py

class BioGptForCausalLM(BioGptPreTrainedModel):
    _tied_weights_keys = ["output_projection.weight"]

    def __init__(self, config):
        super().__init__(config)

        self.biogpt = BioGptModel(config)
        self.output_projection = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_output_embeddings(self):
        return self.output_projection

    def set_output_embeddings(self, new_embeddings):
        self.output_projection = new_embeddings

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        past_key_values: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
        labels: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CausalLMOutputWithCrossAttentions]:
        r"""
        labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
            `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100`
            are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size]`
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.biogpt(
            input_ids,
            attention_mask=attention_mask,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            past_key_values=past_key_values,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        sequence_output = outputs[0]
        prediction_scores = self.output_projection(sequence_output)

        lm_loss = None
        if labels is not None:
            # we are doing next-token prediction; shift prediction scores and input ids by one
            shifted_prediction_scores = prediction_scores[:, :-1, :]
            labels = labels[:, 1:]
            loss_fct = CrossEntropyLoss()
            lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))

        if not return_dict:
            output = (prediction_scores,) + outputs[1:]
            return ((lm_loss,) + output) if lm_loss is not None else output

        return CausalLMOutputWithCrossAttentions(
            loss=lm_loss,
            logits=prediction_scores,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
            cross_attentions=outputs.cross_attentions,
        )

    def prepare_inputs_for_generation(
        self, input_ids, attention_mask, inputs_embeds=None, past_key_values=None, **kwargs
    ):
        # only last tokens for inputs_ids if past is defined in kwargs
        if past_key_values is not None:
            past_length = past_key_values[0][0].shape[2]

            # Some generation methods already pass only the last input ID
            if input_ids.shape[1] > past_length:
                remove_prefix_length = past_length
            else:
                # Default to old behavior: keep only final ID
                remove_prefix_length = input_ids.shape[1] - 1

            input_ids = input_ids[:, remove_prefix_length:]

        if inputs_embeds is not None and past_key_values is None:
            model_inputs = {"inputs_embeds": inputs_embeds}
        else:
            model_inputs = {"input_ids": input_ids}

        model_inputs.update(
            {
                "attention_mask": attention_mask,
                "past_key_values": past_key_values,
                "use_cache": kwargs.get("use_cache"),
            }
        )

        return model_inputs

    @staticmethod
    def _reorder_cache(past_key_values, beam_idx):
        reordered_past = ()
        for layer_past in past_key_values:
            reordered_past += (
                tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),
            )
        return reordered_past

`mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptForCausalLM.forward(input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, past_key_values=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)` ¶

labels (mindspore.Tensor of shape (batch_size, sequence_length), optional): Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can set labels = input_ids Indices are selected in [-100, 0, ..., config.vocab_size] All labels set to -100 are ignored (masked), the loss is only computed for labels in [0, ..., config.vocab_size]

Source code in mindnlp\transformers\models\biogpt\modeling_biogpt.py

def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    past_key_values: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
    labels: Optional[mindspore.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, CausalLMOutputWithCrossAttentions]:
    r"""
    labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
        Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
        `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100`
        are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size]`
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    outputs = self.biogpt(
        input_ids,
        attention_mask=attention_mask,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        past_key_values=past_key_values,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    sequence_output = outputs[0]
    prediction_scores = self.output_projection(sequence_output)

    lm_loss = None
    if labels is not None:
        # we are doing next-token prediction; shift prediction scores and input ids by one
        shifted_prediction_scores = prediction_scores[:, :-1, :]
        labels = labels[:, 1:]
        loss_fct = CrossEntropyLoss()
        lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))

    if not return_dict:
        output = (prediction_scores,) + outputs[1:]
        return ((lm_loss,) + output) if lm_loss is not None else output

    return CausalLMOutputWithCrossAttentions(
        loss=lm_loss,
        logits=prediction_scores,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
        cross_attentions=outputs.cross_attentions,
    )

`mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptForTokenClassification` ¶

Bases: BioGptPreTrainedModel

Source code in mindnlp\transformers\models\biogpt\modeling_biogpt.py

class BioGptForTokenClassification(BioGptPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.biogpt = BioGptModel(config)
        if hasattr(config, "classifier_dropout") and config.classifier_dropout is not None:
            classifier_dropout = config.classifier_dropout
        else:
            classifier_dropout = config.hidden_dropout_prob
        self.dropout = nn.Dropout(classifier_dropout)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

        self.post_init()

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        token_type_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        past_key_values: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        labels: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, TokenClassifierOutput]:
        r"""
        labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        transformer_outputs = self.biogpt(
            input_ids,
            past_key_values=past_key_values,
            attention_mask=attention_mask,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        hidden_states = transformer_outputs[0]
        hidden_states = self.dropout(hidden_states)
        logits = self.classifier(hidden_states)

        loss = None
        if labels is not None:
            loss_fct = CrossEntropyLoss()
            # Only keep active parts of the loss
            if attention_mask is not None:
                active_loss = attention_mask.view(-1) == 1
                active_logits = logits.view(-1, self.num_labels)
                active_labels = ops.where(
                    active_loss, labels.view(-1), mindspore.tensor(loss_fct.ignore_index).type_as(labels)
                )
                loss = loss_fct(active_logits, active_labels)
            else:
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        if not return_dict:
            output = (logits,) + transformer_outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return TokenClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )

`mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptForTokenClassification.forward(input_ids=None, token_type_ids=None, attention_mask=None, head_mask=None, past_key_values=None, inputs_embeds=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)` ¶

labels (mindspore.Tensor of shape (batch_size,), optional): Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), If config.num_labels > 1 a classification loss is computed (Cross-Entropy).

Source code in mindnlp\transformers\models\biogpt\modeling_biogpt.py

def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    token_type_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    past_key_values: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    labels: Optional[mindspore.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, TokenClassifierOutput]:
    r"""
    labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
        Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
        config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
        `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    transformer_outputs = self.biogpt(
        input_ids,
        past_key_values=past_key_values,
        attention_mask=attention_mask,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    hidden_states = transformer_outputs[0]
    hidden_states = self.dropout(hidden_states)
    logits = self.classifier(hidden_states)

    loss = None
    if labels is not None:
        loss_fct = CrossEntropyLoss()
        # Only keep active parts of the loss
        if attention_mask is not None:
            active_loss = attention_mask.view(-1) == 1
            active_logits = logits.view(-1, self.num_labels)
            active_labels = ops.where(
                active_loss, labels.view(-1), mindspore.tensor(loss_fct.ignore_index).type_as(labels)
            )
            loss = loss_fct(active_logits, active_labels)
        else:
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

    if not return_dict:
        output = (logits,) + transformer_outputs[2:]
        return ((loss,) + output) if loss is not None else output

    return TokenClassifierOutput(
        loss=loss,
        logits=logits,
        hidden_states=transformer_outputs.hidden_states,
        attentions=transformer_outputs.attentions,
    )

`mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptForSequenceClassification` ¶

Bases: BioGptPreTrainedModel

Source code in mindnlp\transformers\models\biogpt\modeling_biogpt.py

class BioGptForSequenceClassification(BioGptPreTrainedModel):
    def __init__(self, config: BioGptConfig):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.biogpt = BioGptModel(config)
        self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        past_key_values: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        labels: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
        r"""
        labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        transformer_outputs = self.biogpt(
            input_ids,
            past_key_values=past_key_values,
            attention_mask=attention_mask,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        hidden_states = transformer_outputs[0]
        logits = self.score(hidden_states)

        if input_ids is not None:
            batch_size, sequence_length = input_ids.shape[:2]
        else:
            batch_size, sequence_length = inputs_embeds.shape[:2]

        if self.config.pad_token_id is None:
            sequence_length = -1
        else:
            if input_ids is not None:
                sequence_length = (ops.ne(input_ids, self.config.pad_token_id).sum(-1) - 1)
            else:
                sequence_length = -1
                logger.warning_once(
                    f"{self.__class__.__name__} will not detect padding tokens in `inputs_embeds`. Results may be "
                    "unexpected if using padding tokens in conjunction with `inputs_embeds.`"
                )

        pooled_logits = logits[ops.arange(batch_size), sequence_length]

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(pooled_logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(pooled_logits, labels)
        if not return_dict:
            output = (pooled_logits,) + transformer_outputs[1:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutputWithPast(
            loss=loss,
            logits=pooled_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )

    def get_input_embeddings(self):
        return self.biogpt.embed_tokens

    def set_input_embeddings(self, value):
        self.biogpt.embed_tokens = value

`mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptForSequenceClassification.forward(input_ids=None, attention_mask=None, head_mask=None, past_key_values=None, inputs_embeds=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)` ¶

labels (mindspore.Tensor of shape (batch_size,), optional): Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), If config.num_labels > 1 a classification loss is computed (Cross-Entropy).

Source code in mindnlp\transformers\models\biogpt\modeling_biogpt.py

def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    past_key_values: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    labels: Optional[mindspore.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, SequenceClassifierOutputWithPast]:
    r"""
    labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
        Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
        config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
        `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    transformer_outputs = self.biogpt(
        input_ids,
        past_key_values=past_key_values,
        attention_mask=attention_mask,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    hidden_states = transformer_outputs[0]
    logits = self.score(hidden_states)

    if input_ids is not None:
        batch_size, sequence_length = input_ids.shape[:2]
    else:
        batch_size, sequence_length = inputs_embeds.shape[:2]

    if self.config.pad_token_id is None:
        sequence_length = -1
    else:
        if input_ids is not None:
            sequence_length = (ops.ne(input_ids, self.config.pad_token_id).sum(-1) - 1)
        else:
            sequence_length = -1
            logger.warning_once(
                f"{self.__class__.__name__} will not detect padding tokens in `inputs_embeds`. Results may be "
                "unexpected if using padding tokens in conjunction with `inputs_embeds.`"
            )

    pooled_logits = logits[ops.arange(batch_size), sequence_length]

    loss = None
    if labels is not None:
        if self.config.problem_type is None:
            if self.num_labels == 1:
                self.config.problem_type = "regression"
            elif self.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                self.config.problem_type = "single_label_classification"
            else:
                self.config.problem_type = "multi_label_classification"

        if self.config.problem_type == "regression":
            loss_fct = MSELoss()
            if self.num_labels == 1:
                loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
            else:
                loss = loss_fct(pooled_logits, labels)
        elif self.config.problem_type == "single_label_classification":
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
        elif self.config.problem_type == "multi_label_classification":
            loss_fct = BCEWithLogitsLoss()
            loss = loss_fct(pooled_logits, labels)
    if not return_dict:
        output = (pooled_logits,) + transformer_outputs[1:]
        return ((loss,) + output) if loss is not None else output

    return SequenceClassifierOutputWithPast(
        loss=loss,
        logits=pooled_logits,
        past_key_values=transformer_outputs.past_key_values,
        hidden_states=transformer_outputs.hidden_states,
        attentions=transformer_outputs.attentions,
    )

`mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptModel` ¶

Bases: BioGptPreTrainedModel

Source code in mindnlp\transformers\models\biogpt\modeling_biogpt.py

class BioGptModel(BioGptPreTrainedModel):
    def __init__(self, config: BioGptConfig):
        super().__init__(config)
        self.config = config
        self.layerdrop = config.layerdrop
        self.dropout = config.hidden_dropout_prob
        self.embed_dim = config.hidden_size
        self.padding_idx = config.pad_token_id
        embed_scale = math.sqrt(config.hidden_size) if config.scale_embedding else 1.0

        self.embed_tokens = BioGptScaledWordEmbedding(
            config.vocab_size, self.embed_dim, self.padding_idx, embed_scale=embed_scale
        )
        self.embed_positions = BioGptLearnedPositionalEmbedding(config.max_position_embeddings, self.embed_dim)

        self.layers = nn.ModuleList([BioGptDecoderLayer(config) for _ in range(config.num_hidden_layers)])
        self.layer_norm = nn.LayerNorm(self.embed_dim)

        self.gradient_checkpointing = False
        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.embed_tokens

    def set_input_embeddings(self, value):
        self.embed_tokens = value

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        past_key_values: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithPastAndCrossAttentions]:
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        use_cache = use_cache if use_cache is not None else self.config.use_cache
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # retrieve input_ids and inputs_embeds
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is not None:
            input = input_ids
            input_shape = input.shape
        elif inputs_embeds is not None:
            input_shape = inputs_embeds.shape[:-1]
            input = inputs_embeds[:, :, -1]
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        # past_key_values_length
        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0

        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input)

        if attention_mask is None:
            attention_mask = ops.ones(
                (inputs_embeds.shape[0], inputs_embeds.shape[1] + past_key_values_length),
                dtype=mindspore.bool_,
            )
        elif attention_mask.shape[1] != past_key_values_length + input_shape[1]:
            raise ValueError(
                f"The provided attention mask has length {attention_mask.shape[1]}, but its length should be "
                f"{past_key_values_length + input_shape[1]} (sum of the lengths of current and past inputs)"
            )

        # embed positions
        positions = self.embed_positions(attention_mask, past_key_values_length)

        attention_mask = _prepare_4d_causal_attention_mask(
            attention_mask, input_shape, inputs_embeds, past_key_values_length
        )

        hidden_states = inputs_embeds + positions

        hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training)

        if self.gradient_checkpointing and self.training:
            if use_cache:
                logger.warning_once(
                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
                )
                use_cache = False

        all_hidden_states = () if output_hidden_states else None
        all_self_attns = () if output_attentions else None
        all_cross_attentions = None
        next_decoder_cache = () if use_cache else None

        for idx, decoder_layer in enumerate(self.layers):
            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
            if output_hidden_states:
                all_hidden_states += (hidden_states,)
            if self.training:
                dropout_probability = ops.rand([])
                if dropout_probability < self.layerdrop:
                    continue

            past_key_value = past_key_values[idx] if past_key_values is not None else None

            if self.gradient_checkpointing and self.training:
                layer_outputs = self._gradient_checkpointing_func(
                    decoder_layer.__call__,
                    hidden_states,
                    attention_mask,
                    head_mask[idx] if head_mask is not None else None,
                    None,
                    output_attentions,
                    use_cache,
                )
            else:
                layer_outputs = decoder_layer(
                    hidden_states,
                    attention_mask=attention_mask,
                    layer_head_mask=(head_mask[idx] if head_mask is not None else None),
                    past_key_value=past_key_value,
                    output_attentions=output_attentions,
                    use_cache=use_cache,
                )

            hidden_states = layer_outputs[0]

            if use_cache:
                next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)

            if output_attentions:
                all_self_attns += (layer_outputs[1],)

        # add hidden states from the last decoder layer
        if output_hidden_states:
            all_hidden_states += (hidden_states,)

        hidden_states = self.layer_norm(hidden_states)

        next_cache = next_decoder_cache if use_cache else None

        if not return_dict:
            return tuple(
                v
                for v in [hidden_states, next_cache, all_hidden_states, all_self_attns, all_cross_attentions]
                if v is not None
            )
        return BaseModelOutputWithPastAndCrossAttentions(
            last_hidden_state=hidden_states,
            past_key_values=next_cache,
            hidden_states=all_hidden_states,
            attentions=all_self_attns,
            cross_attentions=all_cross_attentions,
        )

`mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptPreTrainedModel` ¶

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in mindnlp\transformers\models\biogpt\modeling_biogpt.py

class BioGptPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """

    config_class = BioGptConfig
    base_model_prefix = "biogpt"
    supports_gradient_checkpointing = True

    def _init_weights(self, module):
        """Initialize the weights"""
        if isinstance(module, nn.Linear):
            # Slightly different from the TF version which uses truncated_normal for initialization
            # cf https://github.com/pytorch/pytorch/pull/5617
            nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
            if module.padding_idx is not None:
                module.weight[module.padding_idx] = 0
        elif isinstance(module, nn.LayerNorm):
            nn.init.zeros_(module.bias)
            nn.init.ones_(module.weight)

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer` ¶

Bases: PreTrainedTokenizer

Construct an FAIRSEQ Transformer tokenizer. Moses tokenization followed by Byte-Pair Encoding.

This tokenizer inherits from [PreTrainedTokenizer] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

PARAMETER	DESCRIPTION
`vocab_file`	Path to the vocabulary file. TYPE: `str`
`merges_file`	Merges file. TYPE: `str`
`unk_token`	The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. TYPE: `str`, optional, defaults to `"<unk>"` DEFAULT: `'<unk>'`
`bos_token`	The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the `cls_token`. TYPE: `str`, optional, defaults to `"<s>"` DEFAULT: `'<s>'`
`eos_token`	The end of sequence token. When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the `sep_token`. TYPE: `str`, optional, defaults to `"</s>"` DEFAULT: `'</s>'`
`sep_token`	The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens. TYPE: `str`, optional, defaults to `"</s>"` DEFAULT: `'</s>'`
`pad_token`	The token used for padding, for example when batching sequences of different lengths. TYPE: `str`, optional, defaults to `"<pad>"` DEFAULT: `'<pad>'`

Source code in mindnlp\transformers\models\biogpt\tokenization_biogpt.py

class BioGptTokenizer(PreTrainedTokenizer):
    """
    Construct an FAIRSEQ Transformer tokenizer. Moses tokenization followed by Byte-Pair Encoding.

    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
        merges_file (`str`):
            Merges file.
        unk_token (`str`, *optional*, defaults to `"<unk>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        bos_token (`str`, *optional*, defaults to `"<s>"`):
            The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.

            <Tip>

            When building a sequence using special tokens, this is not the token that is used for the beginning of
            sequence. The token used is the `cls_token`.

            </Tip>

        eos_token (`str`, *optional*, defaults to `"</s>"`):
            The end of sequence token.

            <Tip>

            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
            The token used is the `sep_token`.

            </Tip>

        sep_token (`str`, *optional*, defaults to `"</s>"`):
            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
            sequence classification or for a text and a question for question answering. It is also used as the last
            token of a sequence built with special tokens.
        pad_token (`str`, *optional*, defaults to `"<pad>"`):
            The token used for padding, for example when batching sequences of different lengths.
    """
    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        vocab_file,
        merges_file,
        unk_token="<unk>",
        bos_token="<s>",
        eos_token="</s>",
        sep_token="</s>",
        pad_token="<pad>",
        **kwargs,
    ):
        """
        Initializes a new instance of the BioGptTokenizer class.

        Args:
            self: The instance of the class.
            vocab_file (str): The path to the vocabulary file.
            merges_file (str): The path to the merges file.
            unk_token (str, optional): The token to represent unknown words. Defaults to '<unk>'.
            bos_token (str, optional): The token to represent the beginning of a sentence. Defaults to '<s>'.
            eos_token (str, optional): The token to represent the end of a sentence. Defaults to '</s>'.
            sep_token (str, optional): The token to represent sentence separation. Defaults to '</s>'.
            pad_token (str, optional): The token to represent padding. Defaults to '<pad>'.
            **kwargs: Additional keyword arguments.

        Returns:
            None

        Raises:
            ImportError: If sacremoses library is not installed.
            IOError: If the vocabulary or merges file cannot be read.
        """
        try:
            import sacremoses
        except ImportError as e:
            raise ImportError(
                "You need to install sacremoses to use BioGptTokenizer. "
                "See https://pypi.org/project/sacremoses/ for installation."
            ) from e

        self.lang = "en"
        self.sm = sacremoses
        # cache of sm.MosesTokenizer instance
        self.cache_moses_tokenizer = {}
        self.cache_moses_detokenizer = {}

        """ Initialisation"""
        with open(vocab_file, encoding="utf-8") as vocab_handle:
            self.encoder = json.load(vocab_handle)
        self.decoder = {v: k for k, v in self.encoder.items()}
        with open(merges_file, encoding="utf-8") as merges_handle:
            merges = merges_handle.read().split("\n")[:-1]
        merges = [tuple(merge.split()[:2]) for merge in merges]
        self.bpe_ranks = dict(zip(merges, range(len(merges))))
        self.cache = {}

        super().__init__(
            bos_token=bos_token,
            eos_token=eos_token,
            sep_token=sep_token,
            unk_token=unk_token,
            pad_token=pad_token,
            **kwargs,
        )

    @property
    def vocab_size(self):
        """Returns vocab size"""
        return len(self.encoder)

    def get_vocab(self):
        """
        Method to retrieve the vocabulary dictionary consisting of tokens and their corresponding encodings.

        Args:
            self (BioGptTokenizer): The instance of the BioGptTokenizer class.
                It represents the tokenizer object.

        Returns:
            None: The method returns a vocabulary dictionary that contains tokens and their respective encodings.

        Raises:
            None.
        """
        return dict(self.encoder, **self.added_tokens_encoder)

    def moses_tokenize(self, text, lang):
        """
        Perform Moses tokenization on the given text.

        Args:
            self (BioGptTokenizer): An instance of the BioGptTokenizer class.
            text (str): The text to be tokenized.
            lang (str): The language code for tokenization.

        Returns:
            None

        Raises:
            KeyError: If the language code is not found in the cache_moses_tokenizer dictionary.
            ValueError: If the language code is invalid or unsupported.
            Exception: If any other error occurs during tokenization.

        This method utilizes the MosesTokenizer from the nltk.translate.moses package to tokenize the input text.
        It first checks if the MosesTokenizer for the specified language is already cached.
        If not, it creates a new MosesTokenizer instance for the language and adds it to the cache.
        The tokenization is then performed using the cached MosesTokenizer object.

        The 'aggressive_dash_splits', 'return_str', and 'escape' parameters are passed to the tokenize method of
        the MosesTokenizer.
        'aggressive_dash_splits' determines whether to perform aggressive dash splitting,
        'return_str' specifies whether to return a string or a list of tokens,
        and 'escape' determines whether to escape XML/HTML characters in the text before tokenization.

        Note:
            This method assumes that the BioGptTokenizer instance has been properly initialized with the necessary
            resources for tokenization.
        """
        if lang not in self.cache_moses_tokenizer:
            moses_tokenizer = self.sm.MosesTokenizer(lang=lang)
            self.cache_moses_tokenizer[lang] = moses_tokenizer
        return self.cache_moses_tokenizer[lang].tokenize(
            text, aggressive_dash_splits=True, return_str=False, escape=True
        )

    def moses_detokenize(self, tokens, lang):
        """
        Performs Moses detokenization on a list of tokens for a specified language.

        Args:
            self (BioGptTokenizer): An instance of the BioGptTokenizer class.
            tokens (list): A list of tokens to be detokenized.
            lang (str): The language of the tokens. Must be a supported language.

        Returns:
            None: The method modifies the cache_moses_detokenizer attribute of the BioGptTokenizer instance.

        Raises:
            KeyError: If the specified language is not supported.
            TypeError: If the tokens parameter is not a list.

        Note:
            This method utilizes a cache to store MosesDetokenizer objects for each language,
            ensuring efficient detokenization by reusing previously created instances.
        """
        if lang not in self.cache_moses_detokenizer:
            moses_detokenizer = self.sm.MosesDetokenizer(lang=lang)
            self.cache_moses_detokenizer[lang] = moses_detokenizer
        return self.cache_moses_detokenizer[lang].detokenize(tokens)

    def bpe(self, token):
        """
        Performs Byte Pair Encoding (BPE) on a given token.

        Args:
            self: An instance of the BioGptTokenizer class.
            token (str): The token to be encoded using BPE.

        Returns:
            str: The BPE-encoded representation of the token.

        Raises:
            None.

        Description:
            This method takes a token and applies Byte Pair Encoding (BPE) to it. BPE is a subword tokenization
            technique that breaks down a token into a sequence of subword units.
            The BPE algorithm iteratively  merges the most frequent pairs of subword units to create a vocabulary
            of subword units.

            The token parameter is the input token to be encoded using BPE. The token is expected to be a string.

            The method returns the BPE-encoded representation of the token as a string.
            The encoded representation is obtained by iteratively merging the most frequent pairs of subword units
            until no more merges can be made.
            The resulting subword units are then joined together to form the encoded token.

            Note that the method may use a cache to store previously encoded tokens for efficiency.

        Example:
            ```python
            >>> tokenizer = BioGptTokenizer()
            >>> encoded_token = tokenizer.bpe('sequence')
            >>> print(encoded_token)
            >>> # Output: 'seq uence'</w>'
            ```
        """
        word = tuple(token[:-1]) + (token[-1] + "</w>",)
        if token in self.cache:
            return self.cache[token]
        pairs = get_pairs(word)

        if not pairs:
            return token + "</w>"

        while True:
            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
            if bigram not in self.bpe_ranks:
                break
            first, second = bigram
            new_word = []
            i = 0
            while i < len(word):
                try:
                    j = word.index(first, i)
                except ValueError:
                    new_word.extend(word[i:])
                    break
                else:
                    new_word.extend(word[i:j])
                    i = j

                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
                    new_word.append(first + second)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word = tuple(new_word)
            word = new_word
            if len(word) == 1:
                break
            pairs = get_pairs(word)
        word = " ".join(word)
        if word == "\n  </w>":
            word = "\n</w>"
        self.cache[token] = word
        return word

    def _tokenize(self, text, bypass_tokenizer=False):
        """Returns a tokenized string."""
        if bypass_tokenizer:
            text = text.split()
        else:
            text = self.moses_tokenize(text, self.lang)

        split_tokens = []
        for token in text:
            if token:
                split_tokens.extend(list(self.bpe(token).split(" ")))

        return split_tokens

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.encoder.get(token, self.encoder.get(self.unk_token))

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        return self.decoder.get(index, self.unk_token)

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        # remove BPE
        tokens = [t.replace(" ", "").replace("</w>", " ") for t in tokens]
        tokens = "".join(tokens).split()
        # detokenize
        text = self.moses_detokenize(tokens, self.lang)
        return text

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
        adding special tokens. A BioGPT sequence has the following format:

        - single sequence: `</s> X `
        - pair of sequences: `</s> A </s> B `

        Args:
            token_ids_0 (`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
        """
        if token_ids_1 is None:
            return [self.sep_token_id] + token_ids_0
        sep = [self.sep_token_id]
        return sep + token_ids_0 + sep + token_ids_1

    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer `prepare_for_model` method.

        Args:
            token_ids_0 (`List[int]`):
                List of IDs.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
                Whether or not the token list is already formatted with special tokens for the model.

        Returns:
            `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
        """
        if already_has_special_tokens:
            return super().get_special_tokens_mask(
                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
            )
        # no bos used in fairseq
        if token_ids_1 is not None:
            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1))
        return [1] + ([0] * len(token_ids_0))

    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Create a mask from the two sequences passed to be used in a sequence-pair classification task. A FAIRSEQ
        Transformer sequence pair mask has the following format:

        ```
        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |
        ```

        If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).

        Args:
            token_ids_0 (`List[int]`):
                List of IDs.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
        """
        sep = [self.sep_token_id]

        # no bos used in fairseq
        if token_ids_1 is None:
            return len(token_ids_0 + sep) * [0]
        return len(token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        """
        Save the vocabulary to the specified directory with the given filename prefix.

        Args:
            self: Instance of the BioGptTokenizer class.
            save_directory (str): The directory path where the vocabulary files will be saved.
                It should already exist, and the method will raise an error if the directory does not exist.
            filename_prefix (Optional[str]): An optional prefix to be added to the filenames of the vocabulary files.
                If provided, the filenames will be prefixed with this value. Default is None.

        Returns:
            Tuple[str]: A tuple containing the paths to the saved vocabulary file and merge file.

        Raises:
            OSError: If the specified save_directory is not a valid directory.
            IOError: If there is an issue writing the vocabulary files to the disk.
        """
        if not os.path.isdir(save_directory):
            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
            return
        vocab_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
        )
        merge_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["merges_file"]
        )

        with open(vocab_file, "w", encoding="utf-8") as f:
            f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n")

        index = 0
        with open(merge_file, "w", encoding="utf-8") as writer:
            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):
                if index != token_index:
                    logger.warning(
                        f"Saving vocabulary to {merge_file}: BPE merge indices are not consecutive."
                        " Please check that the tokenizer is not corrupted!"
                    )
                    index = token_index
                writer.write(" ".join(bpe_tokens) + "\n")
                index += 1

        return vocab_file, merge_file

    def __getstate__(self):
        """
        The '__getstate__' method in the 'BioGptTokenizer' class is used to retrieve the state of the object for pickling.

        Args:
            self: An instance of the 'BioGptTokenizer' class.

        Returns:
            None: This method does not explicitly return a value, but modifies the state of the object.

        Raises:
            None.
        """
        state = self.__dict__.copy()
        state["sm"] = None
        return state

    def __setstate__(self, d):
        """
        Sets the state of the BioGptTokenizer object.

        Args:
            self (BioGptTokenizer): The instance of the BioGptTokenizer class.
            d (dict): The dictionary containing the state information to be set. 

        Returns:
            None.

        Raises:
            ImportError: If the sacremoses module is not installed, an ImportError is raised. 
                The error message specifies that sacremoses needs to be installed and provides a link to the installation page.
        """
        self.__dict__ = d

        try:
            import sacremoses
        except ImportError as e:
            raise ImportError(
                "You need to install sacremoses to use XLMTokenizer. "
                "See https://pypi.org/project/sacremoses/ for installation."
            ) from e

        self.sm = sacremoses

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.cache_moses_detokenizer = {}` `instance-attribute` ¶

Initialisation

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.vocab_size` `property` ¶

Returns vocab size

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.getstate()` ¶

The 'getstate' method in the 'BioGptTokenizer' class is used to retrieve the state of the object for pickling.

PARAMETER	DESCRIPTION
`self`	An instance of the 'BioGptTokenizer' class.

RETURNS	DESCRIPTION
`None`	This method does not explicitly return a value, but modifies the state of the object.

Source code in mindnlp\transformers\models\biogpt\tokenization_biogpt.py

def __getstate__(self):
    """
    The '__getstate__' method in the 'BioGptTokenizer' class is used to retrieve the state of the object for pickling.

    Args:
        self: An instance of the 'BioGptTokenizer' class.

    Returns:
        None: This method does not explicitly return a value, but modifies the state of the object.

    Raises:
        None.
    """
    state = self.__dict__.copy()
    state["sm"] = None
    return state

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.init(vocab_file, merges_file, unk_token='<unk>', bos_token='<s>', eos_token='</s>', sep_token='</s>', pad_token='<pad>', **kwargs)` ¶

Initializes a new instance of the BioGptTokenizer class.

PARAMETER	DESCRIPTION
`self`	The instance of the class.
`vocab_file`	The path to the vocabulary file. TYPE: `str`
`merges_file`	The path to the merges file. TYPE: `str`
`unk_token`	The token to represent unknown words. Defaults to ''. TYPE: `str` DEFAULT: `'<unk>'`
`bos_token`	The token to represent the beginning of a sentence. Defaults to ''. TYPE: `str` DEFAULT: `'<s>'`
`eos_token`	The token to represent the end of a sentence. Defaults to ''. TYPE: `str` DEFAULT: `'</s>'`
`sep_token`	The token to represent sentence separation. Defaults to ''. TYPE: `str` DEFAULT: `'</s>'`
`pad_token`	The token to represent padding. Defaults to ''. TYPE: `str` DEFAULT: `'<pad>'`
`**kwargs`	Additional keyword arguments. DEFAULT: `{}`

RETURNS	DESCRIPTION
	None

RAISES	DESCRIPTION
`ImportError`	If sacremoses library is not installed.
`IOError`	If the vocabulary or merges file cannot be read.

Source code in mindnlp\transformers\models\biogpt\tokenization_biogpt.py

def __init__(
    self,
    vocab_file,
    merges_file,
    unk_token="<unk>",
    bos_token="<s>",
    eos_token="</s>",
    sep_token="</s>",
    pad_token="<pad>",
    **kwargs,
):
    """
    Initializes a new instance of the BioGptTokenizer class.

    Args:
        self: The instance of the class.
        vocab_file (str): The path to the vocabulary file.
        merges_file (str): The path to the merges file.
        unk_token (str, optional): The token to represent unknown words. Defaults to '<unk>'.
        bos_token (str, optional): The token to represent the beginning of a sentence. Defaults to '<s>'.
        eos_token (str, optional): The token to represent the end of a sentence. Defaults to '</s>'.
        sep_token (str, optional): The token to represent sentence separation. Defaults to '</s>'.
        pad_token (str, optional): The token to represent padding. Defaults to '<pad>'.
        **kwargs: Additional keyword arguments.

    Returns:
        None

    Raises:
        ImportError: If sacremoses library is not installed.
        IOError: If the vocabulary or merges file cannot be read.
    """
    try:
        import sacremoses
    except ImportError as e:
        raise ImportError(
            "You need to install sacremoses to use BioGptTokenizer. "
            "See https://pypi.org/project/sacremoses/ for installation."
        ) from e

    self.lang = "en"
    self.sm = sacremoses
    # cache of sm.MosesTokenizer instance
    self.cache_moses_tokenizer = {}
    self.cache_moses_detokenizer = {}

    """ Initialisation"""
    with open(vocab_file, encoding="utf-8") as vocab_handle:
        self.encoder = json.load(vocab_handle)
    self.decoder = {v: k for k, v in self.encoder.items()}
    with open(merges_file, encoding="utf-8") as merges_handle:
        merges = merges_handle.read().split("\n")[:-1]
    merges = [tuple(merge.split()[:2]) for merge in merges]
    self.bpe_ranks = dict(zip(merges, range(len(merges))))
    self.cache = {}

    super().__init__(
        bos_token=bos_token,
        eos_token=eos_token,
        sep_token=sep_token,
        unk_token=unk_token,
        pad_token=pad_token,
        **kwargs,
    )

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.setstate(d)` ¶

Sets the state of the BioGptTokenizer object.

PARAMETER	DESCRIPTION
`self`	The instance of the BioGptTokenizer class. TYPE: `BioGptTokenizer`
`d`	The dictionary containing the state information to be set. TYPE: `dict`

RETURNS	DESCRIPTION
	None.

RAISES	DESCRIPTION
`ImportError`	If the sacremoses module is not installed, an ImportError is raised. The error message specifies that sacremoses needs to be installed and provides a link to the installation page.

Source code in mindnlp\transformers\models\biogpt\tokenization_biogpt.py

def __setstate__(self, d):
    """
    Sets the state of the BioGptTokenizer object.

    Args:
        self (BioGptTokenizer): The instance of the BioGptTokenizer class.
        d (dict): The dictionary containing the state information to be set. 

    Returns:
        None.

    Raises:
        ImportError: If the sacremoses module is not installed, an ImportError is raised. 
            The error message specifies that sacremoses needs to be installed and provides a link to the installation page.
    """
    self.__dict__ = d

    try:
        import sacremoses
    except ImportError as e:
        raise ImportError(
            "You need to install sacremoses to use XLMTokenizer. "
            "See https://pypi.org/project/sacremoses/ for installation."
        ) from e

    self.sm = sacremoses

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.bpe(token)` ¶

Performs Byte Pair Encoding (BPE) on a given token.

PARAMETER	DESCRIPTION
`self`	An instance of the BioGptTokenizer class.
`token`	The token to be encoded using BPE. TYPE: `str`

RETURNS	DESCRIPTION
`str`	The BPE-encoded representation of the token.

Description

This method takes a token and applies Byte Pair Encoding (BPE) to it. BPE is a subword tokenization technique that breaks down a token into a sequence of subword units. The BPE algorithm iteratively merges the most frequent pairs of subword units to create a vocabulary of subword units.

The token parameter is the input token to be encoded using BPE. The token is expected to be a string.

The method returns the BPE-encoded representation of the token as a string. The encoded representation is obtained by iteratively merging the most frequent pairs of subword units until no more merges can be made. The resulting subword units are then joined together to form the encoded token.

Note that the method may use a cache to store previously encoded tokens for efficiency.

Example

>>> tokenizer = BioGptTokenizer()
>>> encoded_token = tokenizer.bpe('sequence')
>>> print(encoded_token)
>>> # Output: 'seq uence'</w>'

Source code in mindnlp\transformers\models\biogpt\tokenization_biogpt.py

def bpe(self, token):
    """
    Performs Byte Pair Encoding (BPE) on a given token.

    Args:
        self: An instance of the BioGptTokenizer class.
        token (str): The token to be encoded using BPE.

    Returns:
        str: The BPE-encoded representation of the token.

    Raises:
        None.

    Description:
        This method takes a token and applies Byte Pair Encoding (BPE) to it. BPE is a subword tokenization
        technique that breaks down a token into a sequence of subword units.
        The BPE algorithm iteratively  merges the most frequent pairs of subword units to create a vocabulary
        of subword units.

        The token parameter is the input token to be encoded using BPE. The token is expected to be a string.

        The method returns the BPE-encoded representation of the token as a string.
        The encoded representation is obtained by iteratively merging the most frequent pairs of subword units
        until no more merges can be made.
        The resulting subword units are then joined together to form the encoded token.

        Note that the method may use a cache to store previously encoded tokens for efficiency.

    Example:
        ```python
        >>> tokenizer = BioGptTokenizer()
        >>> encoded_token = tokenizer.bpe('sequence')
        >>> print(encoded_token)
        >>> # Output: 'seq uence'</w>'
        ```
    """
    word = tuple(token[:-1]) + (token[-1] + "</w>",)
    if token in self.cache:
        return self.cache[token]
    pairs = get_pairs(word)

    if not pairs:
        return token + "</w>"

    while True:
        bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
        if bigram not in self.bpe_ranks:
            break
        first, second = bigram
        new_word = []
        i = 0
        while i < len(word):
            try:
                j = word.index(first, i)
            except ValueError:
                new_word.extend(word[i:])
                break
            else:
                new_word.extend(word[i:j])
                i = j

            if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
                new_word.append(first + second)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_word = tuple(new_word)
        word = new_word
        if len(word) == 1:
            break
        pairs = get_pairs(word)
    word = " ".join(word)
    if word == "\n  </w>":
        word = "\n</w>"
    self.cache[token] = word
    return word

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)` ¶

Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. A BioGPT sequence has the following format:

single sequence: </s> X
pair of sequences: </s> A </s> B

PARAMETER	DESCRIPTION
`token_ids_0`	List of IDs to which the special tokens will be added. TYPE: `List[int]`
`token_ids_1`	Optional second list of IDs for sequence pairs. TYPE: `List[int]`, optional DEFAULT: `None`

RETURNS	DESCRIPTION
`List[int]`	`List[int]`: List of input IDs with the appropriate special tokens.

Source code in mindnlp\transformers\models\biogpt\tokenization_biogpt.py

def build_inputs_with_special_tokens(
    self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """
    Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
    adding special tokens. A BioGPT sequence has the following format:

    - single sequence: `</s> X `
    - pair of sequences: `</s> A </s> B `

    Args:
        token_ids_0 (`List[int]`):
            List of IDs to which the special tokens will be added.
        token_ids_1 (`List[int]`, *optional*):
            Optional second list of IDs for sequence pairs.

    Returns:
        `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
    """
    if token_ids_1 is None:
        return [self.sep_token_id] + token_ids_0
    sep = [self.sep_token_id]
    return sep + token_ids_0 + sep + token_ids_1

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.convert_tokens_to_string(tokens)` ¶

Converts a sequence of tokens (string) in a single string.

Source code in mindnlp\transformers\models\biogpt\tokenization_biogpt.py

def convert_tokens_to_string(self, tokens):
    """Converts a sequence of tokens (string) in a single string."""
    # remove BPE
    tokens = [t.replace(" ", "").replace("</w>", " ") for t in tokens]
    tokens = "".join(tokens).split()
    # detokenize
    text = self.moses_detokenize(tokens, self.lang)
    return text

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)` ¶

Create a mask from the two sequences passed to be used in a sequence-pair classification task. A FAIRSEQ Transformer sequence pair mask has the following format:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |

If token_ids_1 is None, this method only returns the first portion of the mask (0s).

PARAMETER	DESCRIPTION
`token_ids_0`	List of IDs. TYPE: `List[int]`
`token_ids_1`	Optional second list of IDs for sequence pairs. TYPE: `List[int]`, optional DEFAULT: `None`

RETURNS	DESCRIPTION
`List[int]`	`List[int]`: List of token type IDs according to the given sequence(s).

Source code in mindnlp\transformers\models\biogpt\tokenization_biogpt.py

def create_token_type_ids_from_sequences(
    self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """
    Create a mask from the two sequences passed to be used in a sequence-pair classification task. A FAIRSEQ
    Transformer sequence pair mask has the following format:

    ```
    0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
    | first sequence    | second sequence |
    ```

    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).

    Args:
        token_ids_0 (`List[int]`):
            List of IDs.
        token_ids_1 (`List[int]`, *optional*):
            Optional second list of IDs for sequence pairs.

    Returns:
        `List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
    """
    sep = [self.sep_token_id]

    # no bos used in fairseq
    if token_ids_1 is None:
        return len(token_ids_0 + sep) * [0]
    return len(token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)` ¶

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.

PARAMETER	DESCRIPTION
`token_ids_0`	List of IDs. TYPE: `List[int]`
`token_ids_1`	Optional second list of IDs for sequence pairs. TYPE: `List[int]`, optional DEFAULT: `None`
`already_has_special_tokens`	Whether or not the token list is already formatted with special tokens for the model. TYPE: `bool`, optional, defaults to `False` DEFAULT: `False`

RETURNS	DESCRIPTION
`List[int]`	`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Source code in mindnlp\transformers\models\biogpt\tokenization_biogpt.py

def get_special_tokens_mask(
    self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
    """
    Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
    special tokens using the tokenizer `prepare_for_model` method.

    Args:
        token_ids_0 (`List[int]`):
            List of IDs.
        token_ids_1 (`List[int]`, *optional*):
            Optional second list of IDs for sequence pairs.
        already_has_special_tokens (`bool`, *optional*, defaults to `False`):
            Whether or not the token list is already formatted with special tokens for the model.

    Returns:
        `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
    """
    if already_has_special_tokens:
        return super().get_special_tokens_mask(
            token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
        )
    # no bos used in fairseq
    if token_ids_1 is not None:
        return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1))
    return [1] + ([0] * len(token_ids_0))

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.get_vocab()` ¶

Method to retrieve the vocabulary dictionary consisting of tokens and their corresponding encodings.

PARAMETER	DESCRIPTION
`self`	The instance of the BioGptTokenizer class. It represents the tokenizer object. TYPE: `BioGptTokenizer`

RETURNS	DESCRIPTION
`None`	The method returns a vocabulary dictionary that contains tokens and their respective encodings.

Source code in mindnlp\transformers\models\biogpt\tokenization_biogpt.py

def get_vocab(self):
    """
    Method to retrieve the vocabulary dictionary consisting of tokens and their corresponding encodings.

    Args:
        self (BioGptTokenizer): The instance of the BioGptTokenizer class.
            It represents the tokenizer object.

    Returns:
        None: The method returns a vocabulary dictionary that contains tokens and their respective encodings.

    Raises:
        None.
    """
    return dict(self.encoder, **self.added_tokens_encoder)

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.moses_detokenize(tokens, lang)` ¶

Performs Moses detokenization on a list of tokens for a specified language.

PARAMETER	DESCRIPTION
`self`	An instance of the BioGptTokenizer class. TYPE: `BioGptTokenizer`
`tokens`	A list of tokens to be detokenized. TYPE: `list`
`lang`	The language of the tokens. Must be a supported language. TYPE: `str`

RETURNS	DESCRIPTION
`None`	The method modifies the cache_moses_detokenizer attribute of the BioGptTokenizer instance.

RAISES	DESCRIPTION
`KeyError`	If the specified language is not supported.
`TypeError`	If the tokens parameter is not a list.

Note

This method utilizes a cache to store MosesDetokenizer objects for each language, ensuring efficient detokenization by reusing previously created instances.

Source code in mindnlp\transformers\models\biogpt\tokenization_biogpt.py

def moses_detokenize(self, tokens, lang):
    """
    Performs Moses detokenization on a list of tokens for a specified language.

    Args:
        self (BioGptTokenizer): An instance of the BioGptTokenizer class.
        tokens (list): A list of tokens to be detokenized.
        lang (str): The language of the tokens. Must be a supported language.

    Returns:
        None: The method modifies the cache_moses_detokenizer attribute of the BioGptTokenizer instance.

    Raises:
        KeyError: If the specified language is not supported.
        TypeError: If the tokens parameter is not a list.

    Note:
        This method utilizes a cache to store MosesDetokenizer objects for each language,
        ensuring efficient detokenization by reusing previously created instances.
    """
    if lang not in self.cache_moses_detokenizer:
        moses_detokenizer = self.sm.MosesDetokenizer(lang=lang)
        self.cache_moses_detokenizer[lang] = moses_detokenizer
    return self.cache_moses_detokenizer[lang].detokenize(tokens)

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.moses_tokenize(text, lang)` ¶

Perform Moses tokenization on the given text.

PARAMETER	DESCRIPTION
`self`	An instance of the BioGptTokenizer class. TYPE: `BioGptTokenizer`
`text`	The text to be tokenized. TYPE: `str`
`lang`	The language code for tokenization. TYPE: `str`

RETURNS	DESCRIPTION
	None

RAISES	DESCRIPTION
`KeyError`	If the language code is not found in the cache_moses_tokenizer dictionary.
`ValueError`	If the language code is invalid or unsupported.
`Exception`	If any other error occurs during tokenization.

This method utilizes the MosesTokenizer from the nltk.translate.moses package to tokenize the input text. It first checks if the MosesTokenizer for the specified language is already cached. If not, it creates a new MosesTokenizer instance for the language and adds it to the cache. The tokenization is then performed using the cached MosesTokenizer object.

The 'aggressive_dash_splits', 'return_str', and 'escape' parameters are passed to the tokenize method of the MosesTokenizer. 'aggressive_dash_splits' determines whether to perform aggressive dash splitting, 'return_str' specifies whether to return a string or a list of tokens, and 'escape' determines whether to escape XML/HTML characters in the text before tokenization.

Note

This method assumes that the BioGptTokenizer instance has been properly initialized with the necessary resources for tokenization.

Source code in mindnlp\transformers\models\biogpt\tokenization_biogpt.py

def moses_tokenize(self, text, lang):
    """
    Perform Moses tokenization on the given text.

    Args:
        self (BioGptTokenizer): An instance of the BioGptTokenizer class.
        text (str): The text to be tokenized.
        lang (str): The language code for tokenization.

    Returns:
        None

    Raises:
        KeyError: If the language code is not found in the cache_moses_tokenizer dictionary.
        ValueError: If the language code is invalid or unsupported.
        Exception: If any other error occurs during tokenization.

    This method utilizes the MosesTokenizer from the nltk.translate.moses package to tokenize the input text.
    It first checks if the MosesTokenizer for the specified language is already cached.
    If not, it creates a new MosesTokenizer instance for the language and adds it to the cache.
    The tokenization is then performed using the cached MosesTokenizer object.

    The 'aggressive_dash_splits', 'return_str', and 'escape' parameters are passed to the tokenize method of
    the MosesTokenizer.
    'aggressive_dash_splits' determines whether to perform aggressive dash splitting,
    'return_str' specifies whether to return a string or a list of tokens,
    and 'escape' determines whether to escape XML/HTML characters in the text before tokenization.

    Note:
        This method assumes that the BioGptTokenizer instance has been properly initialized with the necessary
        resources for tokenization.
    """
    if lang not in self.cache_moses_tokenizer:
        moses_tokenizer = self.sm.MosesTokenizer(lang=lang)
        self.cache_moses_tokenizer[lang] = moses_tokenizer
    return self.cache_moses_tokenizer[lang].tokenize(
        text, aggressive_dash_splits=True, return_str=False, escape=True
    )

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.save_vocabulary(save_directory, filename_prefix=None)` ¶

Save the vocabulary to the specified directory with the given filename prefix.

PARAMETER	DESCRIPTION
`self`	Instance of the BioGptTokenizer class.
`save_directory`	The directory path where the vocabulary files will be saved. It should already exist, and the method will raise an error if the directory does not exist. TYPE: `str`
`filename_prefix`	An optional prefix to be added to the filenames of the vocabulary files. If provided, the filenames will be prefixed with this value. Default is None. TYPE: `Optional[str]` DEFAULT: `None`

RETURNS	DESCRIPTION
`Tuple[str]`	Tuple[str]: A tuple containing the paths to the saved vocabulary file and merge file.

RAISES	DESCRIPTION
`OSError`	If the specified save_directory is not a valid directory.
`IOError`	If there is an issue writing the vocabulary files to the disk.

Source code in mindnlp\transformers\models\biogpt\tokenization_biogpt.py

def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
    """
    Save the vocabulary to the specified directory with the given filename prefix.

    Args:
        self: Instance of the BioGptTokenizer class.
        save_directory (str): The directory path where the vocabulary files will be saved.
            It should already exist, and the method will raise an error if the directory does not exist.
        filename_prefix (Optional[str]): An optional prefix to be added to the filenames of the vocabulary files.
            If provided, the filenames will be prefixed with this value. Default is None.

    Returns:
        Tuple[str]: A tuple containing the paths to the saved vocabulary file and merge file.

    Raises:
        OSError: If the specified save_directory is not a valid directory.
        IOError: If there is an issue writing the vocabulary files to the disk.
    """
    if not os.path.isdir(save_directory):
        logger.error(f"Vocabulary path ({save_directory}) should be a directory")
        return
    vocab_file = os.path.join(
        save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
    )
    merge_file = os.path.join(
        save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["merges_file"]
    )

    with open(vocab_file, "w", encoding="utf-8") as f:
        f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n")

    index = 0
    with open(merge_file, "w", encoding="utf-8") as writer:
        for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):
            if index != token_index:
                logger.warning(
                    f"Saving vocabulary to {merge_file}: BPE merge indices are not consecutive."
                    " Please check that the tokenizer is not corrupted!"
                )
                index = token_index
            writer.write(" ".join(bpe_tokens) + "\n")
            index += 1

    return vocab_file, merge_file

biogpt

mindnlp.transformers.models.biogpt.configuration_biogpt.BioGptConfig ¶

mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptForCausalLM ¶

mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptForCausalLM.forward(input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, past_key_values=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None) ¶

mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptForTokenClassification ¶

mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptForSequenceClassification ¶

mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptForSequenceClassification.forward(input_ids=None, attention_mask=None, head_mask=None, past_key_values=None, inputs_embeds=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None) ¶

mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptModel ¶

mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptPreTrainedModel ¶

mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer ¶

mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.cache_moses_detokenizer = {} instance-attribute ¶

mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.vocab_size property ¶

mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.__getstate__() ¶

mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.__init__(vocab_file, merges_file, unk_token='<unk>', bos_token='<s>', eos_token='</s>', sep_token='</s>', pad_token='<pad>', **kwargs) ¶

mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.__setstate__(d) ¶

mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.bpe(token) ¶

mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.build_inputs_with_special_tokens(token_ids_0, token_ids_1=None) ¶

mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.convert_tokens_to_string(tokens) ¶

mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None) ¶

mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False) ¶

mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.get_vocab() ¶

mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.moses_detokenize(tokens, lang) ¶

mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.moses_tokenize(text, lang) ¶

mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.save_vocabulary(save_directory, filename_prefix=None) ¶

`mindnlp.transformers.models.biogpt.configuration_biogpt.BioGptConfig` ¶

`mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptForCausalLM` ¶

`mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptForCausalLM.forward(input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, past_key_values=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)` ¶

`mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptForTokenClassification` ¶

`mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptForSequenceClassification` ¶

`mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptForSequenceClassification.forward(input_ids=None, attention_mask=None, head_mask=None, past_key_values=None, inputs_embeds=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)` ¶

`mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptModel` ¶

`mindnlp.transformers.models.biogpt.modeling_biogpt.BioGptPreTrainedModel` ¶

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer` ¶

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.cache_moses_detokenizer = {}` `instance-attribute` ¶

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.vocab_size` `property` ¶

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.getstate()` ¶

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.init(vocab_file, merges_file, unk_token='<unk>', bos_token='<s>', eos_token='</s>', sep_token='</s>', pad_token='<pad>', **kwargs)` ¶

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.setstate(d)` ¶

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.bpe(token)` ¶

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)` ¶

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.convert_tokens_to_string(tokens)` ¶

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)` ¶

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)` ¶

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.get_vocab()` ¶

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.moses_detokenize(tokens, lang)` ¶

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.moses_tokenize(text, lang)` ¶

`mindnlp.transformers.models.biogpt.tokenization_biogpt.BioGptTokenizer.save_vocabulary(save_directory, filename_prefix=None)` ¶