cpmant

`mindnlp.transformers.models.cpmant.configuration_cpmant`

CPMAnt model configuration

`mindnlp.transformers.models.cpmant.configuration_cpmant.CpmAntConfig`

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [CpmAntModel]. It is used to instantiate a CPMAnt model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults yields a configuration similar to that of the CPMAnt openbmb/cpm-ant-10b architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER	DESCRIPTION
`vocab_size`	Vocabulary size of the CPMAnt model. Defines the number of different tokens that can be represented by the `input` passed when calling [`CpmAntModel`]. TYPE: `int`, optional, defaults to 30720 DEFAULT: `30720`
`hidden_size`	Dimension of the encoder layers. TYPE: `int`, optional, defaults to 4096 DEFAULT: `4096`
`num_attention_heads`	Number of attention heads in the Transformer encoder. TYPE: `int`, optional, defaults to 32 DEFAULT: `32`
`dim_head`	Dimension of attention heads for each attention layer in the Transformer encoder. TYPE: `int`, optional, defaults to 128 DEFAULT: `128`
`dim_ff`	Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. TYPE: `int`, optional, defaults to 10240 DEFAULT: `10240`
`num_hidden_layers`	Number of layers of the Transformer encoder. TYPE: `int`, optional, defaults to 48 DEFAULT: `48`
`dropout_p`	The dropout probability for all fully connected layers in the embeddings and encoder. TYPE: `float`, optional, defaults to 0.0 DEFAULT: `0.0`
`position_bias_num_buckets`	The number of position_bias buckets. TYPE: `int`, optional, defaults to 512 DEFAULT: `512`
`position_bias_max_distance`	The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). TYPE: `int`, optional, defaults to 2048 DEFAULT: `2048`
`eps`	The epsilon used by the layer normalization layers. TYPE: `float`, optional, defaults to 1e-06 DEFAULT: `1e-06`
`init_std`	Initialize parameters with std = init_std. TYPE: `float`, optional, defaults to 1.0 DEFAULT: `1.0`
`prompt_types`	The number of prompt types. TYPE: `int`, optional, defaults to 32 DEFAULT: `32`
`prompt_length`	The length of the prompt. TYPE: `int`, optional, defaults to 32 DEFAULT: `32`
`segment_types`	The number of segment types. TYPE: `int`, optional, defaults to 32 DEFAULT: `32`
`use_cache`	Whether to use cache. TYPE: `bool`, optional, defaults to `True` DEFAULT: `True`

Example

>>> from mindnlp.transformers import CpmAntModel, CpmAntConfig
...
>>> # Initializing a CPMAnt cpm-ant-10b style configuration
>>> configuration = CpmAntConfig()
...
>>> # Initializing a model from the cpm-ant-10b style configuration
>>> model = CpmAntModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
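
The documented defaults fit together arithmetically: 32 attention heads of dimension 128 give a combined attention width of 4096, which equals `hidden_size`. The sketch below checks this with a plain dataclass. `CpmAntDefaults` is a hypothetical stand-in written for illustration, not the mindnlp class, and it copies only a few of the defaults listed in the parameter table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpmAntDefaults:
    """Hypothetical stand-in holding a few documented CpmAntConfig defaults."""

    hidden_size: int = 4096
    num_attention_heads: int = 32
    dim_head: int = 128
    dim_ff: int = 10240
    num_hidden_layers: int = 48

    @property
    def attention_width(self) -> int:
        # Combined width of all attention heads in one layer.
        return self.num_attention_heads * self.dim_head


cfg = CpmAntDefaults()
print(cfg.attention_width)                     # 4096
print(cfg.attention_width == cfg.hidden_size)  # True
print(cfg.dim_ff / cfg.hidden_size)            # 2.5
```

When overriding `num_attention_heads` or `dim_head`, it may be worth running a check like this against `hidden_size` before building a model.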

Source code in mindnlp\transformers\models\cpmant\configuration_cpmant.py

class CpmAntConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`CpmAntModel`]. It is used to instantiate a
    CPMAnt model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the CPMAnt
    [openbmb/cpm-ant-10b](https://hf-mirror.com/openbmb/cpm-ant-10b) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 30720):
            Vocabulary size of the CPMAnt model. Defines the number of different tokens that can be represented by the
            `input` passed when calling [`CpmAntModel`].
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimension of the encoder layers.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads in the Transformer encoder.
        dim_head (`int`, *optional*, defaults to 128):
            Dimension of attention heads for each attention layer in the Transformer encoder.
        dim_ff (`int`, *optional*, defaults to 10240):
            Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
        num_hidden_layers (`int`, *optional*, defaults to 48):
            Number of layers of the Transformer encoder.
        dropout_p (`float`, *optional*, defaults to 0.0):
            The dropout probability for all fully connected layers in the embeddings and encoder.
        position_bias_num_buckets (`int`, *optional*, defaults to 512):
            The number of position_bias buckets.
        position_bias_max_distance (`int`, *optional*, defaults to 2048):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        eps (`float`, *optional*, defaults to 1e-06):
            The epsilon used by the layer normalization layers.
        init_std (`float`, *optional*, defaults to 1.0):
            Initialize parameters with std = init_std.
        prompt_types (`int`, *optional*, defaults to 32):
            The number of prompt types.
        prompt_length (`int`, *optional*, defaults to 32):
            The length of prompt.
        segment_types (`int`, *optional*, defaults to 32):
            The number of segment types.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether to use cache.

    Example:
        ```python
        >>> from mindnlp.transformers import CpmAntModel, CpmAntConfig
        ...
        >>> # Initializing a CPMAnt cpm-ant-10b style configuration
        >>> configuration = CpmAntConfig()
        ...
        >>> # Initializing a model from the cpm-ant-10b style configuration
        >>> model = CpmAntModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "cpmant"

    def __init__(
        self,
        vocab_size: int = 30720,
        hidden_size: int = 4096,
        num_attention_heads: int = 32,
        dim_head: int = 128,
        dim_ff: int = 10240,
        num_hidden_layers: int = 48,
        dropout_p: float = 0.0,
        position_bias_num_buckets: int = 512,
        position_bias_max_distance: int = 2048,
        eps: float = 1e-6,
        init_std: float = 1.0,
        prompt_types: int = 32,
        prompt_length: int = 32,
        segment_types: int = 32,
        use_cache: bool = True,
        **kwargs,
    ):
        """
        Initializes an instance of the CpmAntConfig class.

        Args:
            self (CpmAntConfig): The instance of the CpmAntConfig class.
            vocab_size (int): The size of the vocabulary. Defaults to 30720.
            hidden_size (int): The size of the hidden state. Defaults to 4096.
            num_attention_heads (int): The number of attention heads. Defaults to 32.
            dim_head (int): The dimension of each attention head. Defaults to 128.
            dim_ff (int): The dimension of the feed-forward layer. Defaults to 10240.
            num_hidden_layers (int): The number of hidden layers. Defaults to 48.
            dropout_p (float): The dropout rate. Defaults to 0.0.
            position_bias_num_buckets (int): The number of buckets for position bias. Defaults to 512.
            position_bias_max_distance (int): The maximum distance for position bias. Defaults to 2048.
            eps (float): The epsilon value for numerical stability. Defaults to 1e-06.
            init_std (float): The standard deviation for weight initialization. Defaults to 1.0.
            prompt_types (int): The number of prompt types. Defaults to 32.
            prompt_length (int): The length of the prompt. Defaults to 32.
            segment_types (int): The number of segment types. Defaults to 32.
            use_cache (bool): Whether to use cache. Defaults to True.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(**kwargs)
        self.prompt_types = prompt_types
        self.prompt_length = prompt_length
        self.segment_types = segment_types
        self.hidden_size = hidden_size
        self.num_attention_heads = num_attention_heads
        self.dim_head = dim_head
        self.dim_ff = dim_ff
        self.num_hidden_layers = num_hidden_layers
        self.position_bias_num_buckets = position_bias_num_buckets
        self.position_bias_max_distance = position_bias_max_distance
        self.dropout_p = dropout_p
        self.eps = eps
        self.use_cache = use_cache
        self.vocab_size = vocab_size
        self.init_std = init_std

`mindnlp.transformers.models.cpmant.configuration_cpmant.CpmAntConfig.__init__(vocab_size=30720, hidden_size=4096, num_attention_heads=32, dim_head=128, dim_ff=10240, num_hidden_layers=48, dropout_p=0.0, position_bias_num_buckets=512, position_bias_max_distance=2048, eps=1e-06, init_std=1.0, prompt_types=32, prompt_length=32, segment_types=32, use_cache=True, **kwargs)`

Initializes an instance of the CpmAntConfig class.

PARAMETER	DESCRIPTION
`self`	The instance of the CpmAntConfig class. TYPE: `CpmAntConfig`
`vocab_size`	The size of the vocabulary. Defaults to 30720. TYPE: `int` DEFAULT: `30720`
`hidden_size`	The size of the hidden state. Defaults to 4096. TYPE: `int` DEFAULT: `4096`
`num_attention_heads`	The number of attention heads. Defaults to 32. TYPE: `int` DEFAULT: `32`
`dim_head`	The dimension of each attention head. Defaults to 128. TYPE: `int` DEFAULT: `128`
`dim_ff`	The dimension of the feed-forward layer. Defaults to 10240. TYPE: `int` DEFAULT: `10240`
`num_hidden_layers`	The number of hidden layers. Defaults to 48. TYPE: `int` DEFAULT: `48`
`dropout_p`	The dropout rate. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`
`position_bias_num_buckets`	The number of buckets for position bias. Defaults to 512. TYPE: `int` DEFAULT: `512`
`position_bias_max_distance`	The maximum distance for position bias. Defaults to 2048. TYPE: `int` DEFAULT: `2048`
`eps`	The epsilon value for numerical stability. Defaults to 1e-06. TYPE: `float` DEFAULT: `1e-06`
`init_std`	The standard deviation for weight initialization. Defaults to 1.0. TYPE: `float` DEFAULT: `1.0`
`prompt_types`	The number of prompt types. Defaults to 32. TYPE: `int` DEFAULT: `32`
`prompt_length`	The length of the prompt. Defaults to 32. TYPE: `int` DEFAULT: `32`
`segment_types`	The number of segment types. Defaults to 32. TYPE: `int` DEFAULT: `32`
`use_cache`	Whether to use cache. Defaults to True. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
	None.

Source code in mindnlp\transformers\models\cpmant\configuration_cpmant.py

def __init__(
    self,
    vocab_size: int = 30720,
    hidden_size: int = 4096,
    num_attention_heads: int = 32,
    dim_head: int = 128,
    dim_ff: int = 10240,
    num_hidden_layers: int = 48,
    dropout_p: float = 0.0,
    position_bias_num_buckets: int = 512,
    position_bias_max_distance: int = 2048,
    eps: float = 1e-6,
    init_std: float = 1.0,
    prompt_types: int = 32,
    prompt_length: int = 32,
    segment_types: int = 32,
    use_cache: bool = True,
    **kwargs,
):
    """
    Initializes an instance of the CpmAntConfig class.

    Args:
        self (CpmAntConfig): The instance of the CpmAntConfig class.
        vocab_size (int): The size of the vocabulary. Defaults to 30720.
        hidden_size (int): The size of the hidden state. Defaults to 4096.
        num_attention_heads (int): The number of attention heads. Defaults to 32.
        dim_head (int): The dimension of each attention head. Defaults to 128.
        dim_ff (int): The dimension of the feed-forward layer. Defaults to 10240.
        num_hidden_layers (int): The number of hidden layers. Defaults to 48.
        dropout_p (float): The dropout rate. Defaults to 0.0.
        position_bias_num_buckets (int): The number of buckets for position bias. Defaults to 512.
        position_bias_max_distance (int): The maximum distance for position bias. Defaults to 2048.
        eps (float): The epsilon value for numerical stability. Defaults to 1e-06.
        init_std (float): The standard deviation for weight initialization. Defaults to 1.0.
        prompt_types (int): The number of prompt types. Defaults to 32.
        prompt_length (int): The length of the prompt. Defaults to 32.
        segment_types (int): The number of segment types. Defaults to 32.
        use_cache (bool): Whether to use cache. Defaults to True.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(**kwargs)
    self.prompt_types = prompt_types
    self.prompt_length = prompt_length
    self.segment_types = segment_types
    self.hidden_size = hidden_size
    self.num_attention_heads = num_attention_heads
    self.dim_head = dim_head
    self.dim_ff = dim_ff
    self.num_hidden_layers = num_hidden_layers
    self.position_bias_num_buckets = position_bias_num_buckets
    self.position_bias_max_distance = position_bias_max_distance
    self.dropout_p = dropout_p
    self.eps = eps
    self.use_cache = use_cache
    self.vocab_size = vocab_size
    self.init_std = init_std

`mindnlp.transformers.models.cpmant.tokenization_cpmant`

Tokenization classes for CPMAnt.

`mindnlp.transformers.models.cpmant.tokenization_cpmant.CpmAntTokenizer`

Bases: PreTrainedTokenizer

Construct a CPMAnt tokenizer. Input text is segmented with jieba, and each segment is split into sub-tokens by a WordPiece tokenizer.

PARAMETER	DESCRIPTION
`vocab_file`	Path to the vocabulary file. TYPE: `str`
`bod_token`	The beginning of document token. TYPE: `str`, optional, defaults to `"<d>"` DEFAULT: `'<d>'`
`eod_token`	The end of document token. TYPE: `str`, optional, defaults to `"</d>"` DEFAULT: `'</d>'`
`bos_token`	The beginning of sequence token. TYPE: `str`, optional, defaults to `"<s>"` DEFAULT: `'<s>'`
`eos_token`	The end of sequence token. TYPE: `str`, optional, defaults to `"</s>"` DEFAULT: `'</s>'`
`pad_token`	The token used for padding. TYPE: `str`, optional, defaults to `"<pad>"` DEFAULT: `'<pad>'`
`unk_token`	The unknown token. TYPE: `str`, optional, defaults to `"<unk>"` DEFAULT: `'<unk>'`
`line_token`	The line token. TYPE: `str`, optional, defaults to `"</n>"` DEFAULT: `'</n>'`
`space_token`	The space token. TYPE: `str`, optional, defaults to `"</_>"` DEFAULT: `'</_>'`
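
The class source shows `_tokenize` passing each jieba segment to a `WordpieceTokenizer`. The sketch below illustrates a generic greedy longest-match split of one word against a toy vocabulary. It assumes the sub-word step works this way, which the source on this page does not confirm. `greedy_subword_split` and the toy vocabulary are hypothetical, and jieba segmentation is left out so the block needs no third-party package.

```python
from typing import List, Set


def greedy_subword_split(word: str, vocab: Set[str], unk_token: str = "<unk>") -> List[str]:
    """Split one word into the longest vocabulary entries, left to right."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No entry starts here: emit the unknown token and skip one character.
            pieces.append(unk_token)
            start += 1
        else:
            pieces.append(match)
            start = end
    return pieces


toy_vocab = {"token", "izer", "s", "un"}  # invented for illustration
print(greedy_subword_split("tokenizers", toy_vocab))  # ['token', 'izer', 's']
print(greedy_subword_split("tokenX", toy_vocab))      # ['token', '<unk>']
```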

Source code in mindnlp\transformers\models\cpmant\tokenization_cpmant.py

class CpmAntTokenizer(PreTrainedTokenizer):
    """
    Construct a CPMAnt tokenizer. Input text is segmented with jieba, and each segment is split into sub-tokens by a
    WordPiece tokenizer.

    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
        bod_token (`str`, *optional*, defaults to `"<d>"`):
            The beginning of document token.
        eod_token (`str`, *optional*, defaults to `"</d>"`):
            The end of document token.
        bos_token (`str`, *optional*, defaults to `"<s>"`):
            The beginning of sequence token.
        eos_token (`str`, *optional*, defaults to `"</s>"`):
            The end of sequence token.
        pad_token (`str`, *optional*, defaults to `"<pad>"`):
            The token used for padding.
        unk_token (`str`, *optional*, defaults to `"<unk>"`):
            The unknown token.
        line_token (`str`, *optional*, defaults to `"</n>"`):
            The line token.
        space_token (`str`, *optional*, defaults to `"</_>"`):
            The space token.
    """
    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    model_input_names = ["input_ids", "attention_mask"]
    add_prefix_space = False

    def __init__(
        self,
        vocab_file,
        bod_token="<d>",
        eod_token="</d>",
        bos_token="<s>",
        eos_token="</s>",
        pad_token="<pad>",
        unk_token="<unk>",
        line_token="</n>",
        space_token="</_>",
        padding_side="left",
        **kwargs,
    ):
        """
        Initialize a CpmAntTokenizer object with the provided parameters.

        Args:
            vocab_file (str): The path to the vocabulary file to load.
            bod_token (str, optional): Beginning of document token (default is '<d>').
            eod_token (str, optional): End of document token (default is '</d>').
            bos_token (str, optional): Beginning of sentence token (default is '<s>').
            eos_token (str, optional): End of sentence token (default is '</s>').
            pad_token (str, optional): Padding token (default is '<pad>').
            unk_token (str, optional): Token for unknown words (default is '<unk>').
            line_token (str, optional): Line break token (default is '</n>').
            space_token (str, optional): Space token (default is '</_>').
            padding_side (str, optional): Side for padding (default is 'left').

        Returns:
            None.

        Raises:
            MissingBackendError: If required backend 'jieba' is not available.
            FileNotFoundError: If the specified 'vocab_file' does not exist.
            KeyError: If 'space_token' or 'line_token' are missing in the loaded vocabulary.
            Exception: Any other unforeseen error that may occur during initialization.
        """
        requires_backends(self, ["jieba"])
        self.bod_token = bod_token
        self.eod_token = eod_token
        self.encoder = load_vocab(vocab_file)
        self.encoder[" "] = self.encoder[space_token]
        self.encoder["\n"] = self.encoder[line_token]

        del self.encoder[space_token]
        del self.encoder[line_token]

        self.encoder = collections.OrderedDict(sorted(self.encoder.items(), key=lambda x: x[1]))
        self.decoder = {v: k for k, v in self.encoder.items()}

        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.encoder, unk_token=unk_token)

        super().__init__(
            bod_token=bod_token,
            eod_token=eod_token,
            bos_token=bos_token,
            eos_token=eos_token,
            pad_token=pad_token,
            unk_token=unk_token,
            line_token=line_token,
            space_token=space_token,
            padding_side=padding_side,
            **kwargs,
        )

    @property
    def bod_token_id(self):
        """
        Return the token ID of the beginning-of-document token (`bod_token`).

        Args:
            self (CpmAntTokenizer): The instance of the CpmAntTokenizer class.

        Returns:
            int: The ID that the encoder assigns to `bod_token`.

        Raises:
            None.
        """
        return self.encoder[self.bod_token]

    @property
    def eod_token_id(self):
        """
        Return the token ID of the end-of-document token (`eod_token`).

        Args:
            self (CpmAntTokenizer): The instance of the CpmAntTokenizer class.

        Returns:
            int: The ID that the encoder assigns to `eod_token`.

        Raises:
            None.
        """
        return self.encoder[self.eod_token]

    @property
    def newline_id(self):
        r"""
        Return the token ID that the encoder assigns to the newline character.

        Args:
            self (CpmAntTokenizer): The instance of the CpmAntTokenizer class.

        Returns:
            int: The ID of the newline character `'\n'`.

        Raises:
            KeyError: If the newline character `'\n'` is not found in the encoder dictionary, a KeyError is raised.
        """
        return self.encoder["\n"]

    @property
    def vocab_size(self) -> int:
        """
        Returns the size of the vocabulary used by the CpmAntTokenizer instance.

        Args:
            self: The CpmAntTokenizer instance itself.

        Returns:
            int: The number of unique tokens in the vocabulary.

        Raises:
            None.
        """
        return len(self.encoder)

    def get_vocab(self):
        """
        Retrieves the vocabulary of the CpmAntTokenizer instance.

        Args:
            self (CpmAntTokenizer): The instance of CpmAntTokenizer.

        Returns:
            dict: The vocabulary of the tokenizer, which is a dictionary mapping tokens to their corresponding IDs.

        Raises:
            None.

        Example:
            ```python
            >>> tokenizer = CpmAntTokenizer(vocab_file="path/to/vocab.txt")  # illustrative path
            >>> vocab = tokenizer.get_vocab()
            >>> vocab
            {'<pad>': 0, '<unk>': 1, '<s>': 2, '</s>': 3, ...}
            ```
        """
        return dict(self.encoder, **self.added_tokens_encoder)

    def _tokenize(self, text):
        """Tokenize a string."""
        output_tokens = []
        for x in jieba.cut(text, cut_all=False):
            output_tokens.extend(self.wordpiece_tokenizer.tokenize(x))
        return output_tokens

    def _decode(self, token_ids, **kwargs):
        """Decode ids into a string."""
        token_ids = [i for i in token_ids if i >= 0]
        token_ids = [
            x for x in token_ids if x not in (self.pad_token_id, self.eos_token_id, self.bos_token_id)
        ]
        return super()._decode(token_ids, **kwargs)

    def check(self, token):
        """
        Check if a token is present in the encoder of the CpmAntTokenizer.

        Args:
            self (CpmAntTokenizer): An instance of the CpmAntTokenizer class.
            token (Any): The token to be checked.

        Returns:
            bool: True if the token is in the encoder vocabulary, False otherwise.

        Raises:
            None.
        """
        return token in self.encoder

    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        """
        Converts a list of tokens into a string representation.

        Args:
            self (CpmAntTokenizer): An instance of the CpmAntTokenizer class.
            tokens (List[str]): A list of tokens to be converted into a string representation.

        Returns:
            str: A string representation of the tokens.

        Raises:
            None.

        Note:
            - The tokens should be provided as a list of strings.
            - The method will join the tokens together using an empty string as a separator.

        Example:
            ```python
            >>> tokenizer = CpmAntTokenizer(vocab_file="path/to/vocab.txt")  # illustrative path
            >>> tokens = ['Hello', 'world', '!']
            >>> tokenizer.convert_tokens_to_string(tokens)
            'Helloworld!'
            ```
        """
        return "".join(tokens)

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.encoder.get(token, self.encoder.get(self.unk_token))

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        return self.decoder.get(index, self.unk_token)

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        """
        Save the vocabulary to a file with the specified directory and filename prefix.

        Args:
            self: Instance of the CpmAntTokenizer class.
            save_directory (str): The directory where the vocabulary file will be saved.
            filename_prefix (Optional[str]): A string to be prefixed to the filename. Defaults to None.

        Returns:
            Tuple[str]: A tuple containing the path to the saved vocabulary file.

        Raises:
            None.
        """
        if os.path.isdir(save_directory):
            vocab_file = os.path.join(
                save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
            )
        else:
            vocab_file = (filename_prefix + "-" if filename_prefix else "") + save_directory
        index = 0
        if " " in self.encoder:
            self.encoder["</_>"] = self.encoder[" "]
            del self.encoder[" "]
        if "\n" in self.encoder:
            self.encoder["</n>"] = self.encoder["\n"]
            del self.encoder["\n"]
        self.encoder = collections.OrderedDict(sorted(self.encoder.items(), key=lambda x: x[1]))
        with open(vocab_file, "w", encoding="utf-8") as writer:
            for token, token_index in self.encoder.items():
                if index != token_index:
                    logger.warning(
                        f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive."
                        " Please check that the vocabulary is not corrupted!"
                    )
                    index = token_index
                writer.write(token + "\n")
                index += 1
        return (vocab_file,)

    def build_inputs_with_special_tokens(self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
        adding special tokens. A CPMAnt sequence has the following format:

        - single sequence: `[BOS] Sequence`.
        - pair of sequences: `[BOS] Sequence A [BOS] Sequence B`.

        Args:
            token_ids_0 (`List[int]`): The first tokenized sequence to which special tokens will be added.
            token_ids_1 (`List[int]`, *optional*): The optional second tokenized sequence to which special tokens
                will be added.

        Returns:
            `List[int]`: The model input with special tokens.
        """
        if token_ids_1 is None:
            return [self.bos_token_id] + token_ids_0
        return [self.bos_token_id] + token_ids_0 + [self.bos_token_id] + token_ids_1

    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer `prepare_for_model` method.

        Args:
            token_ids_0 (`List[int]`): List of IDs.
            token_ids_1 (`List[int]`, *optional*): Optional second list of IDs for sequence pairs.
            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
                Whether or not the token list is already formatted with special tokens for the model.

        Returns:
            `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
        """
        if already_has_special_tokens:
            return super().get_special_tokens_mask(
                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
            )

        if token_ids_1 is not None:
            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1))
        return [1] + ([0] * len(token_ids_0))

`mindnlp.transformers.models.cpmant.tokenization_cpmant.CpmAntTokenizer.bod_token_id` `property`

Returns the token ID of the beginning-of-document token (`bod_token`).

PARAMETER	DESCRIPTION
`self`	The instance of the CpmAntTokenizer class. TYPE: `CpmAntTokenizer`

RETURNS	DESCRIPTION
`int`	The ID that the encoder assigns to `bod_token`.

`mindnlp.transformers.models.cpmant.tokenization_cpmant.CpmAntTokenizer.eod_token_id` `property`

Returns the token ID of the end-of-document token (`eod_token`).

PARAMETER	DESCRIPTION
`self`	The instance of the CpmAntTokenizer class. TYPE: `CpmAntTokenizer`

RETURNS	DESCRIPTION
`int`	The ID that the encoder assigns to `eod_token`.

`mindnlp.transformers.models.cpmant.tokenization_cpmant.CpmAntTokenizer.newline_id` `property`

Returns the token ID that the encoder assigns to the newline character.

PARAMETER	DESCRIPTION
`self`	The instance of the CpmAntTokenizer class. TYPE: `CpmAntTokenizer`

RETURNS	DESCRIPTION
`int`	The ID of the newline character `'\n'`.

RAISES	DESCRIPTION
`KeyError`	If the newline character `'\n'` is not found in the encoder dictionary, a KeyError is raised.

`mindnlp.transformers.models.cpmant.tokenization_cpmant.CpmAntTokenizer.vocab_size: int` `property`

Returns the size of the vocabulary used by the CpmAntTokenizer instance.

PARAMETER	DESCRIPTION
`self`	The CpmAntTokenizer instance itself.

RETURNS	DESCRIPTION
`int`	The number of unique tokens in the vocabulary. TYPE: `int`

`mindnlp.transformers.models.cpmant.tokenization_cpmant.CpmAntTokenizer.__init__(vocab_file, bod_token='<d>', eod_token='</d>', bos_token='<s>', eos_token='</s>', pad_token='<pad>', unk_token='<unk>', line_token='</n>', space_token='</_>', padding_side='left', **kwargs)`

Initialize a CpmAntTokenizer object with the provided parameters.

PARAMETER	DESCRIPTION
`vocab_file`	The path to the vocabulary file to load. TYPE: `str`
`bod_token`	Beginning of document token (default is '<d>'). TYPE: `str` DEFAULT: `'<d>'`
`eod_token`	End of document token (default is '</d>'). TYPE: `str` DEFAULT: `'</d>'`
`bos_token`	Beginning of sentence token (default is '<s>'). TYPE: `str` DEFAULT: `'<s>'`
`eos_token`	End of sentence token (default is '</s>'). TYPE: `str` DEFAULT: `'</s>'`
`pad_token`	Padding token (default is '<pad>'). TYPE: `str` DEFAULT: `'<pad>'`
`unk_token`	Token for unknown words (default is '<unk>'). TYPE: `str` DEFAULT: `'<unk>'`
`line_token`	Line break token (default is '</n>'). TYPE: `str` DEFAULT: `'</n>'`
`space_token`	Space token (default is '</_>'). TYPE: `str` DEFAULT: `'</_>'`
`padding_side`	Side for padding (default is 'left'). TYPE: `str` DEFAULT: `'left'`

RETURNS	DESCRIPTION
	None.

RAISES	DESCRIPTION
`MissingBackendError`	If required backend 'jieba' is not available.
`FileNotFoundError`	If the specified 'vocab_file' does not exist.
`KeyError`	If 'space_token' or 'line_token' are missing in the loaded vocabulary.
`Exception`	Any other unforeseen error that may occur during initialization.
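
The constructor source re-keys the space and line placeholder tokens to the literal characters `" "` and `"\n"`, sorts the vocabulary by ID, and builds the reverse mapping; a missing placeholder is what leads to the `KeyError` listed above. A minimal sketch with a toy vocabulary follows. The token IDs are invented for illustration, and `remap_vocab` is a hypothetical helper, not part of mindnlp.

```python
import collections
from typing import Dict, Tuple


def remap_vocab(
    vocab: Dict[str, int], space_token: str = "</_>", line_token: str = "</n>"
) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Return (encoder, decoder) with the placeholder tokens re-keyed."""
    encoder = dict(vocab)
    # pop() raises KeyError when a placeholder is missing from the vocabulary.
    encoder[" "] = encoder.pop(space_token)
    encoder["\n"] = encoder.pop(line_token)
    # Keep entries ordered by token ID, then invert for decoding.
    encoder = collections.OrderedDict(sorted(encoder.items(), key=lambda kv: kv[1]))
    decoder = {idx: tok for tok, idx in encoder.items()}
    return encoder, decoder


toy = {"<pad>": 0, "<unk>": 1, "</n>": 2, "</_>": 3, "hello": 4}  # invented IDs
encoder, decoder = remap_vocab(toy)
print(encoder[" "], encoder["\n"])  # 3 2
print("</_>" in encoder)            # False
print(decoder[3] == " ")            # True
```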

Source code in mindnlp\transformers\models\cpmant\tokenization_cpmant.py

def __init__(
    self,
    vocab_file,
    bod_token="<d>",
    eod_token="</d>",
    bos_token="<s>",
    eos_token="</s>",
    pad_token="<pad>",
    unk_token="<unk>",
    line_token="</n>",
    space_token="</_>",
    padding_side="left",
    **kwargs,
):
    """
    Initialize a CpmAntTokenizer object with the provided parameters.

    Args:
        vocab_file (str): The path to the vocabulary file to load.
        bod_token (str, optional): Beginning of document token (default is '<d>').
        eod_token (str, optional): End of document token (default is '</d>').
        bos_token (str, optional): Beginning of sentence token (default is '<s>').
        eos_token (str, optional): End of sentence token (default is '</s>').
        pad_token (str, optional): Padding token (default is '<pad>').
        unk_token (str, optional): Token for unknown words (default is '<unk>').
        line_token (str, optional): Line break token (default is '</n>').
        space_token (str, optional): Space token (default is '</_>').
        padding_side (str, optional): Side for padding (default is 'left').

    Returns:
        None.

    Raises:
        MissingBackendError: If required backend 'jieba' is not available.
        FileNotFoundError: If the specified 'vocab_file' does not exist.
        KeyError: If 'space_token' or 'line_token' are missing in the loaded vocabulary.
        Exception: Any other unforeseen error that may occur during initialization.
    """
    requires_backends(self, ["jieba"])
    self.bod_token = bod_token
    self.eod_token = eod_token
    self.encoder = load_vocab(vocab_file)
    self.encoder[" "] = self.encoder[space_token]
    self.encoder["\n"] = self.encoder[line_token]

    del self.encoder[space_token]
    del self.encoder[line_token]

    self.encoder = collections.OrderedDict(sorted(self.encoder.items(), key=lambda x: x[1]))
    self.decoder = {v: k for k, v in self.encoder.items()}

    self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.encoder, unk_token=unk_token)

    super().__init__(
        bod_token=bod_token,
        eod_token=eod_token,
        bos_token=bos_token,
        eos_token=eos_token,
        pad_token=pad_token,
        unk_token=unk_token,
        line_token=line_token,
        space_token=space_token,
        padding_side=padding_side,
        **kwargs,
    )
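
The whitespace handling in the constructor above can be sketched without the library. This is a minimal illustration, assuming a toy five-entry vocabulary; `remap_whitespace_tokens` and the sample ids are hypothetical names and values, not part of mindnlp.

```python
import collections

# Hypothetical toy vocabulary; a real CPMAnt vocabulary file is far larger.
encoder = collections.OrderedDict(
    [("<pad>", 0), ("<unk>", 1), ("</_>", 2), ("</n>", 3), ("a", 4)]
)


def remap_whitespace_tokens(encoder, space_token="</_>", line_token="</n>"):
    # Replace the placeholder keys with a literal space and newline,
    # keeping the ids they had in the vocabulary file.
    remapped = dict(encoder)
    remapped[" "] = remapped.pop(space_token)
    remapped["\n"] = remapped.pop(line_token)
    # Re-sort by id so iteration order matches the order of the file.
    return collections.OrderedDict(sorted(remapped.items(), key=lambda kv: kv[1]))


remapped = remap_whitespace_tokens(encoder)
decoder = {token_id: token for token, token_id in remapped.items()}
```

After the remap, a literal space resolves to the id that `</_>` had in the file, and the placeholder strings are no longer keys. A vocabulary that lacks either placeholder raises `KeyError`, matching the RAISES table above.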

`mindnlp.transformers.models.cpmant.tokenization_cpmant.CpmAntTokenizer.build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)` ¶

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A CPMAnt sequence has the following format:

single sequence: [BOS] Sequence.

pair of sequences: [BOS] Sequence A [BOS] Sequence B.

PARAMETER	DESCRIPTION
`token_ids_0`	The first tokenized sequence to which special tokens will be added. TYPE: `List[int]`
`token_ids_1`	The optional second tokenized sequence to which special tokens will be added. TYPE: `List[int]` DEFAULT: `None`

RETURNS	DESCRIPTION
`List[int]`	`List[int]`: The model input with special tokens.

Source code in mindnlp\transformers\models\cpmant\tokenization_cpmant.py

def build_inputs_with_special_tokens(self, token_ids_0: List[int], token_ids_1: List[int] = None) -> List[int]:
    """
    Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
    adding special tokens. A CPMAnt sequence has the following format:

    - single sequence: `[BOS] Sequence`.
    - pair of sequences: `[BOS] Sequence A [BOS] Sequence B`.

    Args:
        token_ids_0 (`List[int]`): The first tokenized sequence to which special tokens will be added.
        token_ids_1 (`List[int]`): The optional second tokenized sequence to which special tokens will be added.

    Returns:
        `List[int]`: The model input with special tokens.
    """
    if token_ids_1 is None:
        return [self.bos_token_id] + token_ids_0
    return [self.bos_token_id] + token_ids_0 + [self.bos_token_id] + token_ids_1
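
A standalone sketch of the same concatenation, assuming a made-up BOS id of 6 (the real id depends on the loaded vocabulary). `build_inputs` is a hypothetical free function that mirrors the method body above.

```python
from typing import List, Optional

BOS_ID = 6  # hypothetical id, chosen only for illustration


def build_inputs(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    bos_token_id: int = BOS_ID,
) -> List[int]:
    # Single sequence: [BOS] A
    if token_ids_1 is None:
        return [bos_token_id] + token_ids_0
    # Pair of sequences: [BOS] A [BOS] B (each segment gets its own BOS)
    return [bos_token_id] + token_ids_0 + [bos_token_id] + token_ids_1


single = build_inputs([10, 11])        # [6, 10, 11]
pair = build_inputs([10], [20, 21])    # [6, 10, 6, 20, 21]
```

Note that no end-of-sequence token is appended in either case.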

`mindnlp.transformers.models.cpmant.tokenization_cpmant.CpmAntTokenizer.check(token)` ¶

Check if a token is present in the encoder of the CpmAntTokenizer.

PARAMETER	DESCRIPTION
`self`	An instance of the CpmAntTokenizer class. TYPE: `CpmAntTokenizer`
`token`	The token to be checked. TYPE: `Any`

RETURNS	DESCRIPTION
`bool`	`True` if the token is in the encoder vocabulary, `False` otherwise.

Source code in mindnlp\transformers\models\cpmant\tokenization_cpmant.py

def check(self, token):
    """
    Check if a token is present in the encoder of the CpmAntTokenizer.

    Args:
        self (CpmAntTokenizer): An instance of the CpmAntTokenizer class.
        token (Any): The token to be checked.

    Returns:
        bool: True if the token is in the encoder vocabulary, False otherwise.

    Raises:
        None.
    """
    return token in self.encoder

`mindnlp.transformers.models.cpmant.tokenization_cpmant.CpmAntTokenizer.convert_tokens_to_string(tokens)` ¶

Converts a list of tokens into a string representation.

PARAMETER	DESCRIPTION
`self`	An instance of the CpmAntTokenizer class. TYPE: `CpmAntTokenizer`
`tokens`	A list of tokens to be converted into a string representation. TYPE: `List[str]`

RETURNS	DESCRIPTION
`str`	A string representation of the tokens. TYPE: `str`

Note

The tokens should be provided as a list of strings.
The method will join the tokens together using an empty string as a separator.

Example

>>> tokenizer = CpmAntTokenizer(vocab_file="vocab.txt")  # vocab_file is required; this path is illustrative
>>> tokens = ['Hello', 'world', '!']
>>> tokenizer.convert_tokens_to_string(tokens)
'Helloworld!'

Source code in mindnlp\transformers\models\cpmant\tokenization_cpmant.py

def convert_tokens_to_string(self, tokens: List[str]) -> str:
    """
    Converts a list of tokens into a string representation.

    Args:
        self (CpmAntTokenizer): An instance of the CpmAntTokenizer class.
        tokens (List[str]): A list of tokens to be converted into a string representation.

    Returns:
        str: A string representation of the tokens.

    Raises:
        None.

    Note:
        - The tokens should be provided as a list of strings.
        - The method will join the tokens together using an empty string as a separator.

    Example:
        ```python
        >>> tokenizer = CpmAntTokenizer(vocab_file="vocab.txt")  # vocab_file is required; this path is illustrative
        >>> tokens = ['Hello', 'world', '!']
        >>> tokenizer.convert_tokens_to_string(tokens)
        'Helloworld!'
        ```
    """
    return "".join(tokens)

`mindnlp.transformers.models.cpmant.tokenization_cpmant.CpmAntTokenizer.get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)` ¶

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.

PARAMETER	DESCRIPTION
`token_ids_0`	List of IDs. TYPE: `List[int]`
`token_ids_1`	Optional second list of IDs for sequence pairs. TYPE: `List[int]`, optional DEFAULT: `None`
`already_has_special_tokens`	Whether or not the token list is already formatted with special tokens for the model. TYPE: `bool`, optional, defaults to `False` DEFAULT: `False`

RETURNS	DESCRIPTION
`List[int]`	`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Source code in mindnlp\transformers\models\cpmant\tokenization_cpmant.py

def get_special_tokens_mask(
    self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
    """
    Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
    special tokens using the tokenizer `prepare_for_model` method.

    Args:
        token_ids_0 (`List[int]`): List of IDs.
        token_ids_1 (`List[int]`, *optional*): Optional second list of IDs for sequence pairs.
        already_has_special_tokens (`bool`, *optional*, defaults to `False`):
            Whether or not the token list is already formatted with special tokens for the model.

    Returns:
        `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
    """
    if already_has_special_tokens:
        return super().get_special_tokens_mask(
            token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
        )

    if token_ids_1 is not None:
        return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1))
    return [1] + ([0] * len(token_ids_0))
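
The mask lines up position by position with the `[BOS] A [BOS] B` layout produced by `build_inputs_with_special_tokens`. A minimal sketch of the branch taken when `already_has_special_tokens` is `False`; `special_tokens_mask` is a hypothetical free function, not the library method.

```python
from typing import List, Optional


def special_tokens_mask(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    # One leading 1 for the BOS token, then a 0 for every ordinary token.
    mask = [1] + [0] * len(token_ids_0)
    if token_ids_1 is not None:
        # The second segment gets its own BOS, so it contributes another 1.
        mask += [1] + [0] * len(token_ids_1)
    return mask


single_mask = special_tokens_mask([10, 11])      # [1, 0, 0]
pair_mask = special_tokens_mask([10], [20, 21])  # [1, 0, 1, 0, 0]
```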

`mindnlp.transformers.models.cpmant.tokenization_cpmant.CpmAntTokenizer.get_vocab()` ¶

Retrieves the vocabulary of the CpmAntTokenizer instance.

PARAMETER	DESCRIPTION
`self`	The instance of CpmAntTokenizer. TYPE: `CpmAntTokenizer`

RETURNS	DESCRIPTION
`dict`	The vocabulary of the tokenizer, which is a dictionary mapping tokens to their corresponding IDs.

Example

>>> tokenizer = CpmAntTokenizer(vocab_file="vocab.txt")  # vocab_file is required; this path is illustrative
>>> vocab = tokenizer.get_vocab()
>>> vocab
{'<pad>': 0, '<unk>': 1, '<s>': 2, '</s>': 3, ...}

Source code in mindnlp\transformers\models\cpmant\tokenization_cpmant.py

def get_vocab(self):
    """
    Retrieves the vocabulary of the CpmAntTokenizer instance.

    Args:
        self (CpmAntTokenizer): The instance of CpmAntTokenizer.

    Returns:
        dict: The vocabulary of the tokenizer, which is a dictionary mapping tokens to their corresponding IDs.

    Raises:
        None.

    Example:
        ```python
        >>> tokenizer = CpmAntTokenizer(vocab_file="vocab.txt")  # vocab_file is required; this path is illustrative
        >>> vocab = tokenizer.get_vocab()
        >>> vocab
        {'<pad>': 0, '<unk>': 1, '<s>': 2, '</s>': 3, ...}
        ```
    """
    return dict(self.encoder, **self.added_tokens_encoder)

`mindnlp.transformers.models.cpmant.tokenization_cpmant.CpmAntTokenizer.save_vocabulary(save_directory, filename_prefix=None)` ¶

Save the vocabulary to a file with the specified directory and filename prefix.

PARAMETER	DESCRIPTION
`self`	Instance of the CpmAntTokenizer class.
`save_directory`	The directory where the vocabulary file will be saved. TYPE: `str`
`filename_prefix`	A string to be prefixed to the filename. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`

RETURNS	DESCRIPTION
`Tuple[str]`	Tuple[str]: A tuple containing the path to the saved vocabulary file.

Source code in mindnlp\transformers\models\cpmant\tokenization_cpmant.py

def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
    """
    Save the vocabulary to a file with the specified directory and filename prefix.

    Args:
        self: Instance of the CpmAntTokenizer class.
        save_directory (str): The directory where the vocabulary file will be saved.
        filename_prefix (Optional[str]): A string to be prefixed to the filename. Defaults to None.

    Returns:
        Tuple[str]: A tuple containing the path to the saved vocabulary file.

    Raises:
        None.
    """
    if os.path.isdir(save_directory):
        vocab_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
        )
    else:
        vocab_file = (filename_prefix + "-" if filename_prefix else "") + save_directory
    index = 0
    if " " in self.encoder:
        self.encoder["</_>"] = self.encoder[" "]
        del self.encoder[" "]
    if "\n" in self.encoder:
        self.encoder["</n>"] = self.encoder["\n"]
        del self.encoder["\n"]
    self.encoder = collections.OrderedDict(sorted(self.encoder.items(), key=lambda x: x[1]))
    with open(vocab_file, "w", encoding="utf-8") as writer:
        for token, token_index in self.encoder.items():
            if index != token_index:
                logger.warning(
                    f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive."
                    " Please check that the vocabulary is not corrupted!"
                )
                index = token_index
            writer.write(token + "\n")
            index += 1
    return (vocab_file,)
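
The consecutive-index check in the write loop above can be isolated as follows. This is a sketch under the assumption that the vocabulary is a plain token-to-id dict; `find_index_gaps` is a hypothetical helper that collects the positions where the method would log its warning.

```python
from typing import Dict, List, Tuple


def find_index_gaps(encoder: Dict[str, int]) -> List[Tuple[str, int, int]]:
    # Walk the tokens in id order and record (token, expected_id, actual_id)
    # wherever the id is not the next consecutive integer.
    gaps = []
    expected = 0
    for token, token_index in sorted(encoder.items(), key=lambda kv: kv[1]):
        if expected != token_index:
            gaps.append((token, expected, token_index))
            expected = token_index  # resynchronise, as the method does
        expected += 1
    return gaps


clean = find_index_gaps({"a": 0, "b": 1, "c": 2})   # []
broken = find_index_gaps({"a": 0, "b": 1, "c": 3})  # [('c', 2, 3)]
```

Because the file stores one token per line and ids are implied by line number, a gap means the ids will shift when the file is loaded again.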

`mindnlp.transformers.models.cpmant.tokenization_cpmant.WordpieceTokenizer` ¶

The WordpieceTokenizer class represents a tokenizer that tokenizes input text into subword tokens using the WordPiece algorithm.

ATTRIBUTE	DESCRIPTION
`vocab`	A dictionary containing the vocabulary of subword tokens. TYPE: `dict`
`unk_token`	The token to be used for out-of-vocabulary or unknown words. TYPE: `str`
`max_input_chars_per_word`	The maximum number of input characters per word for tokenization. TYPE: `int`

METHOD	DESCRIPTION
`tokenize`	Tokenizes the input token into subword tokens using the WordPiece algorithm and the specified vocabulary.

Example

>>> vocab = {'hello': 0, 'world': 1}
>>> tokenizer = WordpieceTokenizer(vocab, '<unk>', 200)
>>> tokenized_text = tokenizer.tokenize('helloworld')

Source code in mindnlp\transformers\models\cpmant\tokenization_cpmant.py

class WordpieceTokenizer:

    """
    The WordpieceTokenizer class represents a tokenizer that tokenizes input text into subword tokens using the WordPiece algorithm.

    Attributes:
        vocab (dict): A dictionary containing the vocabulary of subword tokens.
        unk_token (str): The token to be used for out-of-vocabulary or unknown words.
        max_input_chars_per_word (int): The maximum number of input characters per word for tokenization.

    Methods:
        tokenize(token):
            Tokenizes the input token into subword tokens using the WordPiece algorithm and the specified vocabulary.

    Example:
        ```python
        >>> vocab = {'hello': 0, 'world': 1}
        >>> tokenizer = WordpieceTokenizer(vocab, '<unk>', 200)
        >>> tokenized_text = tokenizer.tokenize('helloworld')
        ```
    """
    def __init__(self, vocab, unk_token="<unk>", max_input_chars_per_word=200):
        """
        Initializes a new instance of the WordpieceTokenizer class.

        Args:
            self (WordpieceTokenizer): The current instance of the WordpieceTokenizer class.
            vocab (dict): The vocabulary of tokens. Only membership is checked, so a token-to-id dict or a list of strings both work.
            unk_token (str, optional): The token to use for unknown words. Defaults to '<unk>'.
            max_input_chars_per_word (int, optional): The maximum number of characters allowed per word. Defaults to 200.

        Returns:
            None

        Raises:
            None.

        This method initializes the WordpieceTokenizer object with the provided vocabulary, unknown token, and maximum input characters per word.
        The vocabulary is the set of tokens the tokenizer can emit; only membership is checked.
        The unk_token parameter allows customization of the token used to represent unknown words. If not provided, it defaults to '<unk>'.
        The max_input_chars_per_word parameter limits the number of characters allowed per word.
        If a word exceeds this limit, `tokenize` returns a single unknown token instead of splitting the word.

        Example:
            ```python
            >>> tokenizer = WordpieceTokenizer(vocab=['hello', 'world'], unk_token='<unk>', max_input_chars_per_word=200)
            ```
        """
        self.vocab = vocab
        self.unk_token = unk_token
        self.max_input_chars_per_word = max_input_chars_per_word

    def tokenize(self, token):
        """
        This method tokenizes a given input token into sub-tokens based on the vocabulary of the WordpieceTokenizer class.

        Args:
            self (WordpieceTokenizer): The instance of the WordpieceTokenizer class.
                It is used to access the vocabulary and maximum input characters per word.
            token (str): The input token to be tokenized.
                It represents the word to be broken down into sub-tokens.
                Must be a string.

        Returns:
            list: A list of sub-tokens generated from the input token based on the vocabulary.
                If the length of the input token exceeds the maximum allowed characters per word,
                it returns a list containing the unknown token (unk_token).
                Otherwise, it returns a list of sub-tokens that are part of the vocabulary or the unknown token.

        Raises:
            None
        """
        chars = list(token)
        if len(chars) > self.max_input_chars_per_word:
            return [self.unk_token]

        start = 0
        sub_tokens = []
        while start < len(chars):
            end = len(chars)
            cur_substr = None
            while start < end:
                substr = "".join(chars[start:end])
                if substr in self.vocab:
                    cur_substr = substr
                    break
                end -= 1
            if cur_substr is None:
                sub_tokens.append(self.unk_token)
                start += 1
            else:
                sub_tokens.append(cur_substr)
                start = end

        return sub_tokens

`mindnlp.transformers.models.cpmant.tokenization_cpmant.WordpieceTokenizer.__init__(vocab, unk_token='<unk>', max_input_chars_per_word=200)` ¶

Initializes a new instance of the WordpieceTokenizer class.

PARAMETER	DESCRIPTION
`self`	The current instance of the WordpieceTokenizer class. TYPE: `WordpieceTokenizer`
`vocab`	The vocabulary of tokens. Only membership is checked, so a token-to-id dict (as passed by `CpmAntTokenizer`) or a list of strings both work. TYPE: `dict`
`unk_token`	The token to use for unknown words. Defaults to '<unk>'. TYPE: `str` DEFAULT: `'<unk>'`
`max_input_chars_per_word`	The maximum number of characters allowed per word. Defaults to 200. TYPE: `int` DEFAULT: `200`

RETURNS	DESCRIPTION
	None

This method initializes the WordpieceTokenizer object with the provided vocabulary, unknown token, and maximum input characters per word. The vocabulary is the set of tokens the tokenizer can emit; only membership is checked, so a dict keyed by token or a list of strings both work. The unk_token parameter allows customization of the token used to represent unknown words. If not provided, it defaults to '<unk>'. The max_input_chars_per_word parameter limits the number of characters allowed per word. If a word exceeds this limit, `tokenize` returns a single unknown token instead of splitting the word.

Example

>>> tokenizer = WordpieceTokenizer(vocab=['hello', 'world'], unk_token='<unk>', max_input_chars_per_word=200)

Source code in mindnlp\transformers\models\cpmant\tokenization_cpmant.py

def __init__(self, vocab, unk_token="<unk>", max_input_chars_per_word=200):
    """
    Initializes a new instance of the WordpieceTokenizer class.

    Args:
        self (WordpieceTokenizer): The current instance of the WordpieceTokenizer class.
        vocab (dict): The vocabulary of tokens. Only membership is checked, so a token-to-id dict or a list of strings both work.
        unk_token (str, optional): The token to use for unknown words. Defaults to '<unk>'.
        max_input_chars_per_word (int, optional): The maximum number of characters allowed per word. Defaults to 200.

    Returns:
        None

    Raises:
        None.

    This method initializes the WordpieceTokenizer object with the provided vocabulary, unknown token, and maximum input characters per word.
    The vocabulary is the set of tokens the tokenizer can emit; only membership is checked.
    The unk_token parameter allows customization of the token used to represent unknown words. If not provided, it defaults to '<unk>'.
    The max_input_chars_per_word parameter limits the number of characters allowed per word.
    If a word exceeds this limit, `tokenize` returns a single unknown token instead of splitting the word.

    Example:
        ```python
        >>> tokenizer = WordpieceTokenizer(vocab=['hello', 'world'], unk_token='<unk>', max_input_chars_per_word=200)
        ```
    """
    self.vocab = vocab
    self.unk_token = unk_token
    self.max_input_chars_per_word = max_input_chars_per_word

`mindnlp.transformers.models.cpmant.tokenization_cpmant.WordpieceTokenizer.tokenize(token)` ¶

This method tokenizes a given input token into sub-tokens based on the vocabulary of the WordpieceTokenizer class.

PARAMETER	DESCRIPTION
`self`	The instance of the WordpieceTokenizer class. It is used to access the vocabulary and maximum input characters per word. TYPE: `WordpieceTokenizer`
`token`	The input token to be tokenized. It represents the word to be broken down into sub-tokens. Must be a string. TYPE: `str`

RETURNS	DESCRIPTION
`list`	A list of sub-tokens generated from the input token based on the vocabulary. If the length of the input token exceeds the maximum allowed characters per word, it returns a list containing the unknown token (unk_token). Otherwise, it returns a list of sub-tokens that are part of the vocabulary or the unknown token.

Source code in mindnlp\transformers\models\cpmant\tokenization_cpmant.py

def tokenize(self, token):
    """
    This method tokenizes a given input token into sub-tokens based on the vocabulary of the WordpieceTokenizer class.

    Args:
        self (WordpieceTokenizer): The instance of the WordpieceTokenizer class.
            It is used to access the vocabulary and maximum input characters per word.
        token (str): The input token to be tokenized.
            It represents the word to be broken down into sub-tokens.
            Must be a string.

    Returns:
        list: A list of sub-tokens generated from the input token based on the vocabulary.
            If the length of the input token exceeds the maximum allowed characters per word,
            it returns a list containing the unknown token (unk_token).
            Otherwise, it returns a list of sub-tokens that are part of the vocabulary or the unknown token.

    Raises:
        None
    """
    chars = list(token)
    if len(chars) > self.max_input_chars_per_word:
        return [self.unk_token]

    start = 0
    sub_tokens = []
    while start < len(chars):
        end = len(chars)
        cur_substr = None
        while start < end:
            substr = "".join(chars[start:end])
            if substr in self.vocab:
                cur_substr = substr
                break
            end -= 1
        if cur_substr is None:
            sub_tokens.append(self.unk_token)
            start += 1
        else:
            sub_tokens.append(cur_substr)
            start = end

    return sub_tokens
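
The loop above is a greedy longest-match-first search: at each position it tries the longest remaining substring and shortens it until a vocabulary hit is found. The sketch below reproduces that logic as a free function so it can be run without mindnlp; `greedy_longest_match` and the three-entry vocabulary are illustrative assumptions.

```python
from typing import Container, List


def greedy_longest_match(
    token: str,
    vocab: Container[str],
    unk_token: str = "<unk>",
    max_input_chars_per_word: int = 200,
) -> List[str]:
    chars = list(token)
    # Over-long input is not split at all; it collapses to one unknown token.
    if len(chars) > max_input_chars_per_word:
        return [unk_token]

    start = 0
    sub_tokens = []
    while start < len(chars):
        end = len(chars)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = "".join(chars[start:end])
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No prefix matched: emit unk for this one character and move on.
            sub_tokens.append(unk_token)
            start += 1
        else:
            sub_tokens.append(match)
            start = end
    return sub_tokens


vocab = {"hello": 0, "world": 1, "he": 2}  # toy vocabulary
print(greedy_longest_match("helloworld", vocab))   # ['hello', 'world']
print(greedy_longest_match("helloXworld", vocab))  # ['hello', '<unk>', 'world']
print(greedy_longest_match("a" * 201, vocab))      # ['<unk>']
```

Unlike BERT-style WordPiece, this loop adds no continuation prefix to non-initial pieces, and an unmatched character consumes exactly one position rather than invalidating the whole word.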

`mindnlp.transformers.models.cpmant.tokenization_cpmant.load_vocab(vocab_file)` ¶

Loads a vocabulary file into a dictionary.

Source code in mindnlp\transformers\models\cpmant\tokenization_cpmant.py

def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary."""
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding="utf-8") as reader:
        tokens = reader.readlines()
    for index, token in enumerate(tokens):
        token = token.rstrip("\n")
        vocab[token] = index
    return vocab
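
A quick way to see the file format `load_vocab` expects is to write a small file and read it back. The sketch below copies the function body so it runs without mindnlp; the file name `vocab.txt` and its five tokens are assumptions made for the example.

```python
import collections
import os
import tempfile


def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary (one token per line)."""
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding="utf-8") as reader:
        tokens = reader.readlines()
    for index, token in enumerate(tokens):
        # The zero-based line number becomes the token id.
        vocab[token.rstrip("\n")] = index
    return vocab


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocab.txt")  # illustrative file name
    with open(path, "w", encoding="utf-8") as writer:
        writer.write("<pad>\n<unk>\n</_>\n</n>\nhello\n")
    vocab = load_vocab(path)

print(vocab["hello"])  # 4
```

Only the trailing newline is stripped, so any other whitespace on a line stays part of the token.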