sam

`mindnlp.transformers.models.sam.configuration_sam`

SAM model configuration

`mindnlp.transformers.models.sam.configuration_sam.SamConfig`

Bases: PretrainedConfig

[SamConfig] is the configuration class to store the configuration of a [SamModel]. It is used to instantiate a SAM model according to the specified arguments, defining the vision model, prompt-encoder model and mask decoder configs. Instantiating a configuration with the defaults will yield a similar configuration to that of the SAM-ViT-H facebook/sam-vit-huge architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER	DESCRIPTION
`vision_config`	Dictionary of configuration options used to initialize [`SamVisionConfig`]. TYPE: Union[`dict`, `SamVisionConfig`], optional DEFAULT: `None`
`prompt_encoder_config`	Dictionary of configuration options used to initialize [`SamPromptEncoderConfig`]. TYPE: Union[`dict`, `SamPromptEncoderConfig`], optional DEFAULT: `None`
`mask_decoder_config`	Dictionary of configuration options used to initialize [`SamMaskDecoderConfig`]. TYPE: Union[`dict`, `SamMaskDecoderConfig`], optional DEFAULT: `None`
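`initializer_range`	The range for weight initialization. TYPE: `float`, optional DEFAULT: `0.02`
`kwargs`	Additional keyword arguments, passed through to [`PretrainedConfig`]. TYPE: `optional` DEFAULT: `{}`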

Example

>>> from mindnlp.transformers import (
...     SamConfig,
...     SamVisionConfig,
...     SamPromptEncoderConfig,
...     SamMaskDecoderConfig,
...     SamModel,
... )
...
>>> # Initializing a SamConfig with `"facebook/sam-vit-huge"` style configuration
>>> configuration = SamConfig()
...
>>> # Initializing a SamModel (with random weights) from the `"facebook/sam-vit-huge"` style configuration
>>> model = SamModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
...
>>> # We can also initialize a SamConfig from a SamVisionConfig, SamPromptEncoderConfig, and SamMaskDecoderConfig
...
>>> # Initializing SAM vision, prompt encoder and mask decoder configurations
>>> vision_config = SamVisionConfig()
>>> prompt_encoder_config = SamPromptEncoderConfig()
>>> mask_decoder_config = SamMaskDecoderConfig()

>>> config = SamConfig(vision_config, prompt_encoder_config, mask_decoder_config)

Source code in mindnlp\transformers\models\sam\configuration_sam.py

class SamConfig(PretrainedConfig):
    r"""
    [`SamConfig`] is the configuration class to store the configuration of a [`SamModel`]. It is used to instantiate a
    SAM model according to the specified arguments, defining the vision model, prompt-encoder model and mask decoder
    configs. Instantiating a configuration with the defaults will yield a similar configuration to that of the
    SAM-ViT-H [facebook/sam-vit-huge](https://huggingface.co/facebook/sam-vit-huge) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vision_config (Union[`dict`, `SamVisionConfig`], *optional*):
            Dictionary of configuration options used to initialize [`SamVisionConfig`].
        prompt_encoder_config (Union[`dict`, `SamPromptEncoderConfig`], *optional*):
            Dictionary of configuration options used to initialize [`SamPromptEncoderConfig`].
        mask_decoder_config (Union[`dict`, `SamMaskDecoderConfig`], *optional*):
            Dictionary of configuration options used to initialize [`SamMaskDecoderConfig`].

        kwargs (*optional*):
            Dictionary of keyword arguments.

    Example:
        ```python
        >>> from mindnlp.transformers import (
        ...     SamConfig,
        ...     SamVisionConfig,
        ...     SamPromptEncoderConfig,
        ...     SamMaskDecoderConfig,
        ...     SamModel,
        ... )
        ...
        >>> # Initializing a SamConfig with `"facebook/sam-vit-huge"` style configuration
        >>> configuration = SamConfig()
        ...
        >>> # Initializing a SamModel (with random weights) from the `"facebook/sam-vit-huge"` style configuration
        >>> model = SamModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ...
        >>> # We can also initialize a SamConfig from a SamVisionConfig, SamPromptEncoderConfig, and SamMaskDecoderConfig
        ...
        >>> # Initializing SAM vision, prompt encoder and mask decoder configurations
        >>> vision_config = SamVisionConfig()
        >>> prompt_encoder_config = SamPromptEncoderConfig()
        >>> mask_decoder_config = SamMaskDecoderConfig()

        >>> config = SamConfig(vision_config, prompt_encoder_config, mask_decoder_config)
        ```
    """
    model_type = "sam"

    def __init__(
        self,
        vision_config=None,
        prompt_encoder_config=None,
        mask_decoder_config=None,
        initializer_range=0.02,
        **kwargs,
    ):
        """
        Initializes a new instance of the SamConfig class.

        Args:
            self: The current instance of the SamConfig class.
            vision_config (SamVisionConfig, dict or None): The configuration for vision. If provided,
                it should be a SamVisionConfig instance or a dict of its options. Defaults to None.
            prompt_encoder_config (SamPromptEncoderConfig, dict or None): The configuration for prompt encoder.
                If provided, it should be a SamPromptEncoderConfig instance or a dict of its options.
                Defaults to None.
            mask_decoder_config (SamMaskDecoderConfig, dict or None): The configuration for mask decoder.
                If provided, it should be a SamMaskDecoderConfig instance or a dict of its options.
                Defaults to None.
            initializer_range (float): The range for weight initialization. Defaults to 0.02.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(**kwargs)
        vision_config = vision_config if vision_config is not None else {}
        prompt_encoder_config = prompt_encoder_config if prompt_encoder_config is not None else {}
        mask_decoder_config = mask_decoder_config if mask_decoder_config is not None else {}

        if isinstance(vision_config, SamVisionConfig):
            vision_config = vision_config.to_dict()
        if isinstance(prompt_encoder_config, SamPromptEncoderConfig):
            prompt_encoder_config = prompt_encoder_config.to_dict()
        if isinstance(mask_decoder_config, SamMaskDecoderConfig):
            mask_decoder_config = mask_decoder_config.to_dict()

        self.vision_config = SamVisionConfig(**vision_config)
        self.prompt_encoder_config = SamPromptEncoderConfig(**prompt_encoder_config)
        self.mask_decoder_config = SamMaskDecoderConfig(**mask_decoder_config)
        self.initializer_range = initializer_range

`mindnlp.transformers.models.sam.configuration_sam.SamConfig.__init__(vision_config=None, prompt_encoder_config=None, mask_decoder_config=None, initializer_range=0.02, **kwargs)`

Initializes a new instance of the SamConfig class.

PARAMETER	DESCRIPTION
`self`	The current instance of the SamConfig class.
`vision_config`	The vision configuration, given as a `SamVisionConfig` instance or a dict of its options. If `None`, the `SamVisionConfig` defaults are used. TYPE: `SamVisionConfig`, `dict` or `None` DEFAULT: `None`
`prompt_encoder_config`	The prompt encoder configuration, given as a `SamPromptEncoderConfig` instance or a dict of its options. If `None`, the `SamPromptEncoderConfig` defaults are used. TYPE: `SamPromptEncoderConfig`, `dict` or `None` DEFAULT: `None`
`mask_decoder_config`	The mask decoder configuration, given as a `SamMaskDecoderConfig` instance or a dict of its options. If `None`, the `SamMaskDecoderConfig` defaults are used. TYPE: `SamMaskDecoderConfig`, `dict` or `None` DEFAULT: `None`
`initializer_range`	The range for weight initialization. TYPE: `float` DEFAULT: `0.02`

RETURNS	DESCRIPTION
	None.

Source code in mindnlp\transformers\models\sam\configuration_sam.py

def __init__(
    self,
    vision_config=None,
    prompt_encoder_config=None,
    mask_decoder_config=None,
    initializer_range=0.02,
    **kwargs,
):
    """
    Initializes a new instance of the SamConfig class.

    Args:
        self: The current instance of the SamConfig class.
        vision_config (SamVisionConfig, dict or None): The configuration for vision. If provided,
            it should be a SamVisionConfig instance or a dict of its options. Defaults to None.
        prompt_encoder_config (SamPromptEncoderConfig, dict or None): The configuration for prompt encoder.
            If provided, it should be a SamPromptEncoderConfig instance or a dict of its options.
            Defaults to None.
        mask_decoder_config (SamMaskDecoderConfig, dict or None): The configuration for mask decoder.
            If provided, it should be a SamMaskDecoderConfig instance or a dict of its options.
            Defaults to None.
        initializer_range (float): The range for weight initialization. Defaults to 0.02.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(**kwargs)
    vision_config = vision_config if vision_config is not None else {}
    prompt_encoder_config = prompt_encoder_config if prompt_encoder_config is not None else {}
    mask_decoder_config = mask_decoder_config if mask_decoder_config is not None else {}

    if isinstance(vision_config, SamVisionConfig):
        vision_config = vision_config.to_dict()
    if isinstance(prompt_encoder_config, SamPromptEncoderConfig):
        prompt_encoder_config = prompt_encoder_config.to_dict()
    if isinstance(mask_decoder_config, SamMaskDecoderConfig):
        mask_decoder_config = mask_decoder_config.to_dict()

    self.vision_config = SamVisionConfig(**vision_config)
    self.prompt_encoder_config = SamPromptEncoderConfig(**prompt_encoder_config)
    self.mask_decoder_config = SamMaskDecoderConfig(**mask_decoder_config)
    self.initializer_range = initializer_range
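
The constructor above accepts each sub-configuration as `None`, a dict, or a config instance, and always ends up building a fresh config object. The following is a minimal sketch of that normalization pattern only. `SubConfig` and `normalize` are hypothetical stand-ins written for illustration; they are not part of mindnlp, and the real `to_dict` of a `PretrainedConfig` returns many more keys.

```python
from typing import Any, Dict, Optional, Union


class SubConfig:
    """Hypothetical stand-in for a sub-config such as SamVisionConfig."""

    def __init__(self, hidden_size: int = 768, **kwargs: Any) -> None:
        self.hidden_size = hidden_size

    def to_dict(self) -> Dict[str, Any]:
        # The real method serializes every attribute; one is enough here.
        return {"hidden_size": self.hidden_size}


def normalize(config: Optional[Union[Dict[str, Any], SubConfig]]) -> SubConfig:
    """Mirror the three steps in SamConfig.__init__: None -> {}, instance -> dict, dict -> object."""
    if config is None:
        config = {}
    if isinstance(config, SubConfig):
        config = config.to_dict()
    return SubConfig(**config)


from_none = normalize(None)                     # falls back to the defaults
from_dict = normalize({"hidden_size": 1280})    # dict of overrides
original = SubConfig(hidden_size=1024)
from_instance = normalize(original)             # instance is round-tripped through a dict

print(from_none.hidden_size, from_dict.hidden_size, from_instance.hidden_size)
print(from_instance is original)
```

One consequence of this pattern is that the object stored on the parent config is a copy: mutating the instance that was passed in afterwards would not change the stored sub-configuration.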

`mindnlp.transformers.models.sam.configuration_sam.SamMaskDecoderConfig`

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [SamMaskDecoder]. It is used to instantiate a SAM mask decoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the SAM-ViT-H facebook/sam-vit-huge architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER	DESCRIPTION
`hidden_size`	Dimensionality of the hidden states. TYPE: `int`, optional, defaults to 256 DEFAULT: `256`
`hidden_act`	The non-linear activation function used inside the `SamMaskDecoder` module. TYPE: `str`, optional, defaults to `"relu"` DEFAULT: `'relu'`
`mlp_dim`	Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. TYPE: `int`, optional, defaults to 2048 DEFAULT: `2048`
`num_hidden_layers`	Number of hidden layers in the Transformer encoder. TYPE: `int`, optional, defaults to 2 DEFAULT: `2`
`num_attention_heads`	Number of attention heads for each attention layer in the Transformer encoder. TYPE: `int`, optional, defaults to 8 DEFAULT: `8`
`attention_downsample_rate`	The downsampling rate of the attention layer. TYPE: `int`, optional, defaults to 2 DEFAULT: `2`
`num_multimask_outputs`	The number of outputs from the `SamMaskDecoder` module. In the Segment Anything paper, this is set to 3. TYPE: `int`, optional, defaults to 3 DEFAULT: `3`
`iou_head_depth`	The number of layers in the IoU head module. TYPE: `int`, optional, defaults to 3 DEFAULT: `3`
`iou_head_hidden_dim`	The dimensionality of the hidden states in the IoU head module. TYPE: `int`, optional, defaults to 256 DEFAULT: `256`
`layer_norm_eps`	The epsilon used by the layer normalization layers. TYPE: `float`, optional, defaults to 1e-06 DEFAULT: `1e-06`

Source code in mindnlp\transformers\models\sam\configuration_sam.py

class SamMaskDecoderConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`SamMaskDecoder`]. It is used to instantiate a SAM
    mask decoder according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the SAM-vit-h
    [facebook/sam-vit-huge](https://huggingface.co/facebook/sam-vit-huge) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        hidden_size (`int`, *optional*, defaults to 256):
            Dimensionality of the hidden states.
        hidden_act (`str`, *optional*, defaults to `"relu"`):
            The non-linear activation function used inside the `SamMaskDecoder` module.
        mlp_dim (`int`, *optional*, defaults to 2048):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
        num_hidden_layers (`int`, *optional*, defaults to 2):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 8):
            Number of attention heads for each attention layer in the Transformer encoder.
        attention_downsample_rate (`int`, *optional*, defaults to 2):
            The downsampling rate of the attention layer.
        num_multimask_outputs (`int`, *optional*, defaults to 3):
            The number of outputs from the `SamMaskDecoder` module. In the Segment Anything paper, this is set to 3.
        iou_head_depth (`int`, *optional*, defaults to 3):
            The number of layers in the IoU head module.
        iou_head_hidden_dim (`int`, *optional*, defaults to 256):
            The dimensionality of the hidden states in the IoU head module.
        layer_norm_eps (`float`, *optional*, defaults to 1e-06):
            The epsilon used by the layer normalization layers.

    """
    def __init__(
        self,
        hidden_size=256,
        hidden_act="relu",
        mlp_dim=2048,
        num_hidden_layers=2,
        num_attention_heads=8,
        attention_downsample_rate=2,
        num_multimask_outputs=3,
        iou_head_depth=3,
        iou_head_hidden_dim=256,
        layer_norm_eps=1e-6,
        **kwargs,
    ):
        """
        Initializes a new instance of the SamMaskDecoderConfig class.

        Args:
            self: The object itself.
            hidden_size (int, optional): The size of the hidden layer. Default is 256.
            hidden_act (str, optional): The activation function to be used in the hidden layer. Default is 'relu'.
            mlp_dim (int, optional): The dimension of the Multi-Layer Perceptron (MLP). Default is 2048.
            num_hidden_layers (int, optional): The number of hidden layers. Default is 2.
            num_attention_heads (int, optional): The number of attention heads. Default is 8.
            attention_downsample_rate (int, optional): The downsample rate for attention. Default is 2.
            num_multimask_outputs (int, optional): The number of outputs for multimask. Default is 3.
            iou_head_depth (int, optional): The depth of the Intersection over Union (IoU) head. Default is 3.
            iou_head_hidden_dim (int, optional): The hidden dimension of the IoU head. Default is 256.
            layer_norm_eps (float, optional): The epsilon value for layer normalization. Default is 1e-06.

        Returns:
            None

        Raises:
            None
        """
        super().__init__(**kwargs)
        self.hidden_size = hidden_size
        self.hidden_act = hidden_act
        self.mlp_dim = mlp_dim
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.attention_downsample_rate = attention_downsample_rate
        self.num_multimask_outputs = num_multimask_outputs
        self.iou_head_depth = iou_head_depth
        self.iou_head_hidden_dim = iou_head_hidden_dim
        self.layer_norm_eps = layer_norm_eps

`mindnlp.transformers.models.sam.configuration_sam.SamMaskDecoderConfig.__init__(hidden_size=256, hidden_act='relu', mlp_dim=2048, num_hidden_layers=2, num_attention_heads=8, attention_downsample_rate=2, num_multimask_outputs=3, iou_head_depth=3, iou_head_hidden_dim=256, layer_norm_eps=1e-06, **kwargs)`

Initializes a new instance of the SamMaskDecoderConfig class.

PARAMETER	DESCRIPTION
`self`	The object itself.
`hidden_size`	The size of the hidden layer. Default is 256. TYPE: `int` DEFAULT: `256`
`hidden_act`	The activation function to be used in the hidden layer. Default is 'relu'. TYPE: `str` DEFAULT: `'relu'`
`mlp_dim`	The dimension of the Multi-Layer Perceptron (MLP). Default is 2048. TYPE: `int` DEFAULT: `2048`
`num_hidden_layers`	The number of hidden layers. Default is 2. TYPE: `int` DEFAULT: `2`
`num_attention_heads`	The number of attention heads. Default is 8. TYPE: `int` DEFAULT: `8`
`attention_downsample_rate`	The downsample rate for attention. Default is 2. TYPE: `int` DEFAULT: `2`
`num_multimask_outputs`	The number of outputs for multimask. Default is 3. TYPE: `int` DEFAULT: `3`
`iou_head_depth`	The depth of the Intersection over Union (IoU) head. Default is 3. TYPE: `int` DEFAULT: `3`
`iou_head_hidden_dim`	The hidden dimension of the IoU head. Default is 256. TYPE: `int` DEFAULT: `256`
`layer_norm_eps`	The epsilon value for layer normalization. Default is 1e-06. TYPE: `float` DEFAULT: `1e-06`

RETURNS	DESCRIPTION
	None

Source code in mindnlp\transformers\models\sam\configuration_sam.py

def __init__(
    self,
    hidden_size=256,
    hidden_act="relu",
    mlp_dim=2048,
    num_hidden_layers=2,
    num_attention_heads=8,
    attention_downsample_rate=2,
    num_multimask_outputs=3,
    iou_head_depth=3,
    iou_head_hidden_dim=256,
    layer_norm_eps=1e-6,
    **kwargs,
):
    """
    Initializes a new instance of the SamMaskDecoderConfig class.

    Args:
        self: The object itself.
        hidden_size (int, optional): The size of the hidden layer. Default is 256.
        hidden_act (str, optional): The activation function to be used in the hidden layer. Default is 'relu'.
        mlp_dim (int, optional): The dimension of the Multi-Layer Perceptron (MLP). Default is 2048.
        num_hidden_layers (int, optional): The number of hidden layers. Default is 2.
        num_attention_heads (int, optional): The number of attention heads. Default is 8.
        attention_downsample_rate (int, optional): The downsample rate for attention. Default is 2.
        num_multimask_outputs (int, optional): The number of outputs for multimask. Default is 3.
        iou_head_depth (int, optional): The depth of the Intersection over Union (IoU) head. Default is 3.
        iou_head_hidden_dim (int, optional): The hidden dimension of the IoU head. Default is 256.
        layer_norm_eps (float, optional): The epsilon value for layer normalization. Default is 1e-06.

    Returns:
        None

    Raises:
        None
    """
    super().__init__(**kwargs)
    self.hidden_size = hidden_size
    self.hidden_act = hidden_act
    self.mlp_dim = mlp_dim
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.attention_downsample_rate = attention_downsample_rate
    self.num_multimask_outputs = num_multimask_outputs
    self.iou_head_depth = iou_head_depth
    self.iou_head_hidden_dim = iou_head_hidden_dim
    self.layer_norm_eps = layer_norm_eps
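
This page does not show how `hidden_size`, `num_attention_heads` and `attention_downsample_rate` interact. The sketch below rests on an assumption that is not confirmed here: that a downsampled attention layer projects to `hidden_size // attention_downsample_rate` channels and splits them evenly across the heads. `downsampled_head_dim` is a hypothetical helper, not a mindnlp function, and is meant only as a sanity check to run before changing these values together.

```python
def downsampled_head_dim(
    hidden_size: int = 256,
    num_attention_heads: int = 8,
    attention_downsample_rate: int = 2,
) -> int:
    """Per-head width under the assumed rule: (hidden_size // rate) split across heads."""
    internal_dim = hidden_size // attention_downsample_rate
    if internal_dim % num_attention_heads != 0:
        raise ValueError(
            f"num_attention_heads={num_attention_heads} does not divide "
            f"the internal dimension {internal_dim}"
        )
    return internal_dim // num_attention_heads


# With the SamMaskDecoderConfig defaults: 256 // 2 = 128 channels over 8 heads.
print(downsampled_head_dim())

# Without downsampling, each head is twice as wide.
print(downsampled_head_dim(attention_downsample_rate=1))
```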

`mindnlp.transformers.models.sam.configuration_sam.SamPromptEncoderConfig`

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [SamPromptEncoder]. The [SamPromptEncoder] module is used to encode the input 2D points and bounding boxes. Instantiating a configuration with the defaults will yield a similar configuration to that of the SAM-ViT-H facebook/sam-vit-huge architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER	DESCRIPTION
`hidden_size`	Dimensionality of the hidden states. TYPE: `int`, optional, defaults to 256 DEFAULT: `256`
`image_size`	The expected output resolution of the image. TYPE: `int`, optional, defaults to 1024 DEFAULT: `1024`
`patch_size`	The size (resolution) of each patch. TYPE: `int`, optional, defaults to 16 DEFAULT: `16`
`mask_input_channels`	The number of channels to be fed to the `MaskDecoder` module. TYPE: `int`, optional, defaults to 16 DEFAULT: `16`
`num_point_embeddings`	The number of point embeddings to be used. TYPE: `int`, optional, defaults to 4 DEFAULT: `4`
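`hidden_act`	The non-linear activation function in the encoder and pooler. TYPE: `str`, optional, defaults to `"gelu"` DEFAULT: `'gelu'`
`layer_norm_eps`	The epsilon used by the layer normalization layers. TYPE: `float`, optional, defaults to 1e-06 DEFAULT: `1e-06`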

Source code in mindnlp\transformers\models\sam\configuration_sam.py

class SamPromptEncoderConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`SamPromptEncoder`]. The [`SamPromptEncoder`]
    module is used to encode the input 2D points and bounding boxes. Instantiating a configuration with the defaults
    will yield a similar configuration to that of the SAM-vit-h
    [facebook/sam-vit-huge](https://huggingface.co/facebook/sam-vit-huge) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        hidden_size (`int`, *optional*, defaults to 256):
            Dimensionality of the hidden states.
        image_size (`int`, *optional*, defaults to 1024):
            The expected output resolution of the image.
        patch_size (`int`, *optional*, defaults to 16):
            The size (resolution) of each patch.
        mask_input_channels (`int`, *optional*, defaults to 16):
            The number of channels to be fed to the `MaskDecoder` module.
        num_point_embeddings (`int`, *optional*, defaults to 4):
            The number of point embeddings to be used.
        hidden_act (`str`, *optional*, defaults to `"gelu"`):
            The non-linear activation function in the encoder and pooler.
    """
    def __init__(
        self,
        hidden_size=256,
        image_size=1024,
        patch_size=16,
        mask_input_channels=16,
        num_point_embeddings=4,
        hidden_act="gelu",
        layer_norm_eps=1e-6,
        **kwargs,
    ):
        """
        Initializes an instance of the SamPromptEncoderConfig class.

        Args:
            self (SamPromptEncoderConfig): The instance of the class itself.
            hidden_size (int, optional): The size of the hidden state. Defaults to 256.
            image_size (int, optional): The size of the input image. Defaults to 1024.
            patch_size (int, optional): The size of each image patch. Defaults to 16.
            mask_input_channels (int, optional): The number of input channels for masking. Defaults to 16.
            num_point_embeddings (int, optional): The number of point embeddings. Defaults to 4.
            hidden_act (str, optional): The activation function for the hidden layers. Defaults to 'gelu'.
            layer_norm_eps (float, optional): The epsilon value for layer normalization. Defaults to 1e-06.

        Returns:
            None

        Raises:
            None
        """
        super().__init__(**kwargs)
        self.hidden_size = hidden_size
        self.image_size = image_size
        self.patch_size = patch_size
        self.image_embedding_size = image_size // patch_size
        self.mask_input_channels = mask_input_channels
        self.num_point_embeddings = num_point_embeddings
        self.hidden_act = hidden_act
        self.layer_norm_eps = layer_norm_eps

`mindnlp.transformers.models.sam.configuration_sam.SamPromptEncoderConfig.__init__(hidden_size=256, image_size=1024, patch_size=16, mask_input_channels=16, num_point_embeddings=4, hidden_act='gelu', layer_norm_eps=1e-06, **kwargs)`

Initializes an instance of the SamPromptEncoderConfig class.

PARAMETER	DESCRIPTION
`self`	The instance of the class itself. TYPE: `SamPromptEncoderConfig`
`hidden_size`	The size of the hidden state. Defaults to 256. TYPE: `int` DEFAULT: `256`
`image_size`	The size of the input image. Defaults to 1024. TYPE: `int` DEFAULT: `1024`
`patch_size`	The size of each image patch. Defaults to 16. TYPE: `int` DEFAULT: `16`
`mask_input_channels`	The number of input channels for masking. Defaults to 16. TYPE: `int` DEFAULT: `16`
`num_point_embeddings`	The number of point embeddings. Defaults to 4. TYPE: `int` DEFAULT: `4`
`hidden_act`	The activation function for the hidden layers. Defaults to 'gelu'. TYPE: `str` DEFAULT: `'gelu'`
`layer_norm_eps`	The epsilon value for layer normalization. Defaults to 1e-06. TYPE: `float` DEFAULT: `1e-06`

RETURNS	DESCRIPTION
	None

Source code in mindnlp\transformers\models\sam\configuration_sam.py

def __init__(
    self,
    hidden_size=256,
    image_size=1024,
    patch_size=16,
    mask_input_channels=16,
    num_point_embeddings=4,
    hidden_act="gelu",
    layer_norm_eps=1e-6,
    **kwargs,
):
    """
    Initializes an instance of the SamPromptEncoderConfig class.

    Args:
        self (SamPromptEncoderConfig): The instance of the class itself.
        hidden_size (int, optional): The size of the hidden state. Defaults to 256.
        image_size (int, optional): The size of the input image. Defaults to 1024.
        patch_size (int, optional): The size of each image patch. Defaults to 16.
        mask_input_channels (int, optional): The number of input channels for masking. Defaults to 16.
        num_point_embeddings (int, optional): The number of point embeddings. Defaults to 4.
        hidden_act (str, optional): The activation function for the hidden layers. Defaults to 'gelu'.
        layer_norm_eps (float, optional): The epsilon value for layer normalization. Defaults to 1e-06.

    Returns:
        None

    Raises:
        None
    """
    super().__init__(**kwargs)
    self.hidden_size = hidden_size
    self.image_size = image_size
    self.patch_size = patch_size
    self.image_embedding_size = image_size // patch_size
    self.mask_input_channels = mask_input_channels
    self.num_point_embeddings = num_point_embeddings
    self.hidden_act = hidden_act
    self.layer_norm_eps = layer_norm_eps
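
Besides the documented parameters, the constructor above stores a derived attribute, `image_embedding_size = image_size // patch_size`. As a small worked example of that one line (the helper name `image_embedding_size` below is chosen for illustration and is not a mindnlp function):

```python
def image_embedding_size(image_size: int = 1024, patch_size: int = 16) -> int:
    """Side length of the patch grid, computed as in SamPromptEncoderConfig.__init__."""
    return image_size // patch_size


# Defaults: a 1024-pixel side split into 16-pixel patches.
print(image_embedding_size())  # 64

# Floor division silently drops a partial patch when the sizes do not divide evenly.
print(image_embedding_size(image_size=1000, patch_size=16))  # 62
```

Because the value is computed once in `__init__`, changing `image_size` or `patch_size` on an existing config object would leave `image_embedding_size` stale; passing the new values to the constructor avoids that.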

`mindnlp.transformers.models.sam.configuration_sam.SamVisionConfig`

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [SamVisionModel]. It is used to instantiate a SAM vision encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the SAM-ViT-H facebook/sam-vit-huge architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER	DESCRIPTION
`hidden_size`	Dimensionality of the encoder layers and the pooler layer. TYPE: `int`, optional, defaults to 768 DEFAULT: `768`
`output_channels`	Dimensionality of the output channels in the Patch Encoder. TYPE: `int`, optional, defaults to 256 DEFAULT: `256`
`num_hidden_layers`	Number of hidden layers in the Transformer encoder. TYPE: `int`, optional, defaults to 12 DEFAULT: `12`
`num_attention_heads`	Number of attention heads for each attention layer in the Transformer encoder. TYPE: `int`, optional, defaults to 12 DEFAULT: `12`
`num_channels`	Number of channels in the input image. TYPE: `int`, optional, defaults to 3 DEFAULT: `3`
`image_size`	Expected resolution. Target size of the resized input image. TYPE: `int`, optional, defaults to 1024 DEFAULT: `1024`
`patch_size`	Size of the patches to be extracted from the input image. TYPE: `int`, optional, defaults to 16 DEFAULT: `16`
`hidden_act`	The non-linear activation function (function or string). TYPE: `str`, optional, defaults to `"gelu"` DEFAULT: `'gelu'`
`layer_norm_eps`	The epsilon used by the layer normalization layers. TYPE: `float`, optional, defaults to 1e-06 DEFAULT: `1e-06`
`attention_dropout`	The dropout ratio for the attention probabilities. TYPE: `float`, optional, defaults to 0.0 DEFAULT: `0.0`
`initializer_range`	The standard deviation of the truncated_normal_initializer for initializing all weight matrices. TYPE: `float`, optional, defaults to 1e-10 DEFAULT: `1e-10`
`qkv_bias`	Whether to add a bias to query, key, value projections. TYPE: `bool`, optional, defaults to `True` DEFAULT: `True`
`mlp_ratio`	Ratio of mlp hidden dim to embedding dim. TYPE: `float`, optional, defaults to 4.0 DEFAULT: `4.0`
`use_abs_pos`	Whether to use absolute position embedding. TYPE: `bool`, optional, defaults to `True` DEFAULT: `True`
`use_rel_pos`	Whether to use relative position embedding. TYPE: `bool`, optional, defaults to `True` DEFAULT: `True`
`window_size`	Window size for relative position. TYPE: `int`, optional, defaults to 14 DEFAULT: `14`
`global_attn_indexes`	The indexes of the global attention layers. TYPE: `List[int]`, optional, defaults to `[2, 5, 8, 11]` DEFAULT: `[2, 5, 8, 11]`
`num_pos_feats`	The dimensionality of the position embedding. TYPE: `int`, optional, defaults to 128 DEFAULT: `128`
`mlp_dim`	The dimensionality of the MLP layer in the Transformer encoder. If `None`, defaults to `mlp_ratio * hidden_size`. TYPE: `int`, optional DEFAULT: `None`
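
The `mlp_dim` fallback described in the table can be illustrated with a short sketch. `resolve_mlp_dim` is a hypothetical helper, not part of mindnlp, and the cast to `int` is an assumption: the table only states that the default is `mlp_ratio * hidden_size`, which would be a float for the default `mlp_ratio` of 4.0.

```python
from typing import Optional


def resolve_mlp_dim(
    hidden_size: int = 768,
    mlp_ratio: float = 4.0,
    mlp_dim: Optional[int] = None,
) -> int:
    """Return mlp_dim if given, otherwise derive it from mlp_ratio * hidden_size."""
    if mlp_dim is None:
        # Assumption: the product is truncated to an integer layer width.
        return int(hidden_size * mlp_ratio)
    return mlp_dim


# Defaults from the table: 4.0 * 768.
print(resolve_mlp_dim())  # 3072

# An explicit mlp_dim takes precedence over mlp_ratio.
print(resolve_mlp_dim(mlp_dim=2048))  # 2048
```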

Source code in mindnlp\transformers\models\sam\configuration_sam.py

class SamVisionConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`SamVisionModel`]. It is used to instantiate a SAM
    vision encoder according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the SAM ViT-h
    [facebook/sam-vit-huge](https://huggingface.co/facebook/sam-vit-huge) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        output_channels (`int`, *optional*, defaults to 256):
            Dimensionality of the output channels in the Patch Encoder.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        num_channels (`int`, *optional*, defaults to 3):
            Number of channels in the input image.
        image_size (`int`, *optional*, defaults to 1024):
            Expected resolution. Target size of the resized input image.
        patch_size (`int`, *optional*, defaults to 16):
            Size of the patches to be extracted from the input image.
        hidden_act (`str`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string).
        layer_norm_eps (`float`, *optional*, defaults to 1e-06):
            The epsilon used by the layer normalization layers.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        initializer_range (`float`, *optional*, defaults to 1e-10):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        qkv_bias (`bool`, *optional*, defaults to `True`):
            Whether to add a bias to query, key, value projections.
        mlp_ratio (`float`, *optional*, defaults to 4.0):
            Ratio of mlp hidden dim to embedding dim.
        use_abs_pos (`bool`, *optional*, defaults to `True`):
            Whether to use absolute position embedding.
        use_rel_pos (`bool`, *optional*, defaults to `True`):
            Whether to use relative position embedding.
        window_size (`int`, *optional*, defaults to 14):
            Window size for relative position.
        global_attn_indexes (`List[int]`, *optional*, defaults to `[2, 5, 8, 11]`):
            The indexes of the global attention layers.
        num_pos_feats (`int`, *optional*, defaults to 128):
            The dimensionality of the position embedding.
        mlp_dim (`int`, *optional*):
            The dimensionality of the MLP layer in the Transformer encoder. If `None`, defaults to `mlp_ratio *
            hidden_size`.
    """
    def __init__(
        self,
        hidden_size=768,
        output_channels=256,
        num_hidden_layers=12,
        num_attention_heads=12,
        num_channels=3,
        image_size=1024,
        patch_size=16,
        hidden_act="gelu",
        layer_norm_eps=1e-06,
        attention_dropout=0.0,
        initializer_range=1e-10,
        qkv_bias=True,
        mlp_ratio=4.0,
        use_abs_pos=True,
        use_rel_pos=True,
        window_size=14,
        global_attn_indexes=[2, 5, 8, 11],
        num_pos_feats=128,
        mlp_dim=None,
        **kwargs,
    ):
        """
        Initializes an instance of the SamVisionConfig class.

        Args:
            self: The object instance.
            hidden_size (int, optional): The size of the hidden state. Defaults to 768.
            output_channels (int, optional): The number of output channels. Defaults to 256.
            num_hidden_layers (int, optional): The number of hidden layers. Defaults to 12.
            num_attention_heads (int, optional): The number of attention heads. Defaults to 12.
            num_channels (int, optional): The number of input channels. Defaults to 3.
            image_size (int, optional): The size of the input image. Defaults to 1024.
            patch_size (int, optional): The size of each patch in the image. Defaults to 16.
            hidden_act (str, optional): The activation function for the hidden layers. Defaults to 'gelu'.
            layer_norm_eps (float, optional): The epsilon value for layer normalization. Defaults to 1e-06.
            attention_dropout (float, optional): The dropout rate for the attention mechanism. Defaults to 0.0.
            initializer_range (float, optional): The range for parameter initialization. Defaults to 1e-10.
            qkv_bias (bool, optional): Whether to include bias in the query, key, and value projections. Defaults to True.
            mlp_ratio (float, optional): The ratio of the feed-forward (MLP) hidden size to the embedding size. Defaults to 4.0.
            use_abs_pos (bool, optional): Whether to use absolute position embeddings. Defaults to True.
            use_rel_pos (bool, optional): Whether to use relative position embeddings. Defaults to True.
            window_size (int, optional): The size of the attention window. Defaults to 14.
            global_attn_indexes (list[int], optional): The list of indexes for global attention. Defaults to [2, 5, 8, 11].
            num_pos_feats (int, optional): The number of positional features. Defaults to 128.
            mlp_dim (int, optional): The size of the hidden layer in the feed-forward network. If not provided,
                it is calculated as int(hidden_size * mlp_ratio).

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(**kwargs)

        self.hidden_size = hidden_size
        self.output_channels = output_channels
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.num_channels = num_channels
        self.image_size = image_size
        self.patch_size = patch_size
        self.hidden_act = hidden_act
        self.layer_norm_eps = layer_norm_eps
        self.attention_dropout = attention_dropout
        self.initializer_range = initializer_range
        self.qkv_bias = qkv_bias
        self.mlp_ratio = mlp_ratio
        self.use_abs_pos = use_abs_pos
        self.use_rel_pos = use_rel_pos
        self.window_size = window_size
        self.global_attn_indexes = global_attn_indexes
        self.num_pos_feats = num_pos_feats
        self.mlp_dim = int(hidden_size * mlp_ratio) if mlp_dim is None else mlp_dim

`mindnlp.transformers.models.sam.configuration_sam.SamVisionConfig.__init__(hidden_size=768, output_channels=256, num_hidden_layers=12, num_attention_heads=12, num_channels=3, image_size=1024, patch_size=16, hidden_act='gelu', layer_norm_eps=1e-06, attention_dropout=0.0, initializer_range=1e-10, qkv_bias=True, mlp_ratio=4.0, use_abs_pos=True, use_rel_pos=True, window_size=14, global_attn_indexes=[2, 5, 8, 11], num_pos_feats=128, mlp_dim=None, **kwargs)` ¶

Initializes an instance of the SamVisionConfig class.

PARAMETER	DESCRIPTION
`self`	The object instance.
`hidden_size`	The size of the hidden state. Defaults to 768. TYPE: `int` DEFAULT: `768`
`output_channels`	The number of output channels. Defaults to 256. TYPE: `int` DEFAULT: `256`
`num_hidden_layers`	The number of hidden layers. Defaults to 12. TYPE: `int` DEFAULT: `12`
`num_attention_heads`	The number of attention heads. Defaults to 12. TYPE: `int` DEFAULT: `12`
`num_channels`	The number of input channels. Defaults to 3. TYPE: `int` DEFAULT: `3`
`image_size`	The size of the input image. Defaults to 1024. TYPE: `int` DEFAULT: `1024`
`patch_size`	The size of each patch in the image. Defaults to 16. TYPE: `int` DEFAULT: `16`
`hidden_act`	The activation function for the hidden layers. Defaults to 'gelu'. TYPE: `str` DEFAULT: `'gelu'`
`layer_norm_eps`	The epsilon value for layer normalization. Defaults to 1e-06. TYPE: `float` DEFAULT: `1e-06`
`attention_dropout`	The dropout rate for the attention mechanism. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`
`initializer_range`	The range for parameter initialization. Defaults to 1e-10. TYPE: `float` DEFAULT: `1e-10`
`qkv_bias`	Whether to include bias in the query, key, and value projections. Defaults to True. TYPE: `bool` DEFAULT: `True`
`mlp_ratio`	The ratio of the feed-forward (MLP) hidden size to the embedding size. Defaults to 4.0. TYPE: `float` DEFAULT: `4.0`
`use_abs_pos`	Whether to use absolute position embeddings. Defaults to True. TYPE: `bool` DEFAULT: `True`
`use_rel_pos`	Whether to use relative position embeddings. Defaults to True. TYPE: `bool` DEFAULT: `True`
`window_size`	The size of the attention window. Defaults to 14. TYPE: `int` DEFAULT: `14`
`global_attn_indexes`	The list of indexes for global attention. Defaults to [2, 5, 8, 11]. TYPE: `list[int]` DEFAULT: `[2, 5, 8, 11]`
`num_pos_feats`	The number of positional features. Defaults to 128. TYPE: `int` DEFAULT: `128`
`mlp_dim`	The size of the hidden layer in the feed-forward network. If not provided, it is calculated as int(hidden_size * mlp_ratio). TYPE: `int` DEFAULT: `None`

RETURNS	DESCRIPTION
	None.
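
As a quick check of how the defaults interact, the sketch below recomputes the `mlp_dim` fallback from the constructor (`int(hidden_size * mlp_ratio)`) together with two derived quantities. The helper name `derived_vision_dims` and its returned keys are hypothetical and not part of mindnlp; the per-head width and the patch-grid size are ordinary ViT arithmetic, assumed here rather than read from the configuration class.

```python
def derived_vision_dims(hidden_size=768, mlp_ratio=4.0, mlp_dim=None,
                        num_attention_heads=12, image_size=1024, patch_size=16):
    """Hypothetical helper: recompute values implied by SamVisionConfig arguments."""
    return {
        # Same fallback as in SamVisionConfig.__init__
        "mlp_dim": int(hidden_size * mlp_ratio) if mlp_dim is None else mlp_dim,
        # Assumed standard ViT arithmetic, not stored on the config
        "head_dim": hidden_size // num_attention_heads,
        "patches_per_side": image_size // patch_size,
    }


dims = derived_vision_dims()
print(dims)
```

Passing an explicit `mlp_dim` bypasses the ratio entirely, so `mlp_ratio` only matters when `mlp_dim` is left as `None`.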

Source code in mindnlp\transformers\models\sam\configuration_sam.py

def __init__(
    self,
    hidden_size=768,
    output_channels=256,
    num_hidden_layers=12,
    num_attention_heads=12,
    num_channels=3,
    image_size=1024,
    patch_size=16,
    hidden_act="gelu",
    layer_norm_eps=1e-06,
    attention_dropout=0.0,
    initializer_range=1e-10,
    qkv_bias=True,
    mlp_ratio=4.0,
    use_abs_pos=True,
    use_rel_pos=True,
    window_size=14,
    global_attn_indexes=[2, 5, 8, 11],
    num_pos_feats=128,
    mlp_dim=None,
    **kwargs,
):
    """
    Initializes an instance of the SamVisionConfig class.

    Args:
        self: The object instance.
        hidden_size (int, optional): The size of the hidden state. Defaults to 768.
        output_channels (int, optional): The number of output channels. Defaults to 256.
        num_hidden_layers (int, optional): The number of hidden layers. Defaults to 12.
        num_attention_heads (int, optional): The number of attention heads. Defaults to 12.
        num_channels (int, optional): The number of input channels. Defaults to 3.
        image_size (int, optional): The size of the input image. Defaults to 1024.
        patch_size (int, optional): The size of each patch in the image. Defaults to 16.
        hidden_act (str, optional): The activation function for the hidden layers. Defaults to 'gelu'.
        layer_norm_eps (float, optional): The epsilon value for layer normalization. Defaults to 1e-06.
        attention_dropout (float, optional): The dropout rate for the attention mechanism. Defaults to 0.0.
        initializer_range (float, optional): The range for parameter initialization. Defaults to 1e-10.
        qkv_bias (bool, optional): Whether to include bias in the query, key, and value projections. Defaults to True.
        mlp_ratio (float, optional): The ratio of the feed-forward (MLP) hidden size to the embedding size. Defaults to 4.0.
        use_abs_pos (bool, optional): Whether to use absolute position embeddings. Defaults to True.
        use_rel_pos (bool, optional): Whether to use relative position embeddings. Defaults to True.
        window_size (int, optional): The size of the attention window. Defaults to 14.
        global_attn_indexes (list[int], optional): The list of indexes for global attention. Defaults to [2, 5, 8, 11].
        num_pos_feats (int, optional): The number of positional features. Defaults to 128.
        mlp_dim (int, optional): The size of the hidden layer in the feed-forward network. If not provided,
            it is calculated as int(hidden_size * mlp_ratio).

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(**kwargs)

    self.hidden_size = hidden_size
    self.output_channels = output_channels
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.num_channels = num_channels
    self.image_size = image_size
    self.patch_size = patch_size
    self.hidden_act = hidden_act
    self.layer_norm_eps = layer_norm_eps
    self.attention_dropout = attention_dropout
    self.initializer_range = initializer_range
    self.qkv_bias = qkv_bias
    self.mlp_ratio = mlp_ratio
    self.use_abs_pos = use_abs_pos
    self.use_rel_pos = use_rel_pos
    self.window_size = window_size
    self.global_attn_indexes = global_attn_indexes
    self.num_pos_feats = num_pos_feats
    self.mlp_dim = int(hidden_size * mlp_ratio) if mlp_dim is None else mlp_dim
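
To make `window_size` and `global_attn_indexes` more concrete, here is a minimal sketch of how the two settings might combine per layer. It assumes the vision encoder follows the upstream SAM convention, in which layers listed in `global_attn_indexes` attend over the whole feature map and every other layer uses windowed attention. That behaviour lives in the model code, not in the configuration shown here, and the helper name `attention_window_per_layer` is hypothetical.

```python
def attention_window_per_layer(num_hidden_layers=12, window_size=14,
                               global_attn_indexes=(2, 5, 8, 11)):
    """Hypothetical helper: window size used by each layer, with 0 meaning global attention."""
    # Assumption: layers named in global_attn_indexes skip windowing entirely.
    return [0 if i in global_attn_indexes else window_size
            for i in range(num_hidden_layers)]


windows = attention_window_per_layer()
print(windows)
```

With the default values, every third layer would be global and the remaining eight would use 14x14 windows.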

`mindnlp.transformers.models.sam.image_processing_sam` ¶

Image processor class for SAM.

`mindnlp.transformers.models.sam.image_processing_sam.SamImageProcessor` ¶

Bases: BaseImageProcessor

Constructs a SAM image processor.

PARAMETER	DESCRIPTION
`do_resize`	Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by the `do_resize` parameter in the `preprocess` method. TYPE: `bool`, optional, defaults to `True` DEFAULT: `True`
`size`	Size of the output image after resizing. Resizes the longest edge of the image to match `size["longest_edge"]` while maintaining the aspect ratio. Can be overridden by the `size` parameter in the `preprocess` method. TYPE: `dict`, optional, defaults to `{"longest_edge": 1024}` DEFAULT: `None`
`mask_size`	Size of the output segmentation map after resizing. Resizes the longest edge of the segmentation map to match `mask_size["longest_edge"]` while maintaining the aspect ratio. Can be overridden by the `mask_size` parameter in the `preprocess` method. TYPE: `dict`, optional, defaults to `{"longest_edge": 256}` DEFAULT: `None`
`resample`	Resampling filter to use if resizing the image. Can be overridden by the `resample` parameter in the `preprocess` method. TYPE: `PILImageResampling`, optional, defaults to `Resampling.BILINEAR` DEFAULT: `BILINEAR`
`do_rescale`	Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the `do_rescale` parameter in the `preprocess` method. TYPE: `bool`, optional, defaults to `True` DEFAULT: `True`
`rescale_factor`	Scale factor to use if rescaling the image. Only has an effect if `do_rescale` is set to `True`. Can be overridden by the `rescale_factor` parameter in the `preprocess` method. TYPE: `int` or `float`, optional, defaults to `1/255` DEFAULT: `1 / 255`
`do_normalize`	Whether to normalize the image. Can be overridden by the `do_normalize` parameter in the `preprocess` method. TYPE: `bool`, optional, defaults to `True` DEFAULT: `True`
`image_mean`	Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method. TYPE: `float` or `List[float]`, optional, defaults to `IMAGENET_DEFAULT_MEAN` DEFAULT: `None`
`image_std`	Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method. TYPE: `float` or `List[float]`, optional, defaults to `IMAGENET_DEFAULT_STD` DEFAULT: `None`
`do_pad`	Whether to pad the image to the specified `pad_size`. Can be overridden by the `do_pad` parameter in the `preprocess` method. TYPE: `bool`, optional, defaults to `True` DEFAULT: `True`
`pad_size`	Size of the output image after padding. Can be overridden by the `pad_size` parameter in the `preprocess` method. TYPE: `dict`, optional, defaults to `{"height": 1024, "width": 1024}` DEFAULT: `None`
`mask_pad_size`	Size of the output segmentation map after padding. Can be overridden by the `mask_pad_size` parameter in the `preprocess` method. TYPE: `dict`, optional, defaults to `{"height": 256, "width": 256}` DEFAULT: `None`
`do_convert_rgb`	Whether to convert the image to RGB. TYPE: `bool`, optional, defaults to `True` DEFAULT: `True`
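
The interaction between `size` and `pad_size` is easiest to see with numbers. The sketch below mirrors the arithmetic of `_get_preprocess_shape` and `pad_image` from the source listing without importing mindnlp: the longest edge is scaled to `size["longest_edge"]`, then zeros fill the bottom and right up to `pad_size`. The function names `preprocess_shape` and `padding_amounts` are hypothetical stand-ins, and the 480x640 input is only an illustrative value.

```python
def preprocess_shape(old_shape, longest_edge=1024):
    """Scale (height, width) so the longest edge equals longest_edge, rounding to nearest."""
    old_h, old_w = old_shape
    scale = longest_edge / max(old_h, old_w)
    return int(old_h * scale + 0.5), int(old_w * scale + 0.5)


def padding_amounts(resized_shape, pad_size=None):
    """Hypothetical helper: rows added at the bottom and columns added at the right."""
    pad_size = pad_size or {"height": 1024, "width": 1024}
    height, width = resized_shape
    return pad_size["height"] - height, pad_size["width"] - width


resized = preprocess_shape((480, 640))  # illustrative 480x640 input
pads = padding_amounts(resized)
print(resized, pads)
```

Because the aspect ratio is preserved, only the shorter side needs padding when the defaults are used, which is why the model always receives a square 1024x1024 input.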

Source code in mindnlp\transformers\models\sam\image_processing_sam.py

class SamImageProcessor(BaseImageProcessor):
    r"""
    Constructs a SAM image processor.

    Args:
        do_resize (`bool`, *optional*, defaults to `True`):
            Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by the
            `do_resize` parameter in the `preprocess` method.
        size (`dict`, *optional*, defaults to `{"longest_edge": 1024}`):
            Size of the output image after resizing. Resizes the longest edge of the image to match
            `size["longest_edge"]` while maintaining the aspect ratio. Can be overridden by the `size` parameter in the
            `preprocess` method.
        mask_size (`dict`, *optional*, defaults to `{"longest_edge": 256}`):
            Size of the output segmentation map after resizing. Resizes the longest edge of the segmentation map to
            match `mask_size["longest_edge"]` while maintaining the aspect ratio. Can be overridden by the
            `mask_size` parameter in the `preprocess` method.
        resample (`PILImageResampling`, *optional*, defaults to `Resampling.BILINEAR`):
            Resampling filter to use if resizing the image. Can be overridden by the `resample` parameter in the
            `preprocess` method.
        do_rescale (`bool`, *optional*, defaults to `True`):
            Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the
            `do_rescale` parameter in the `preprocess` method.
        rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
            Scale factor to use if rescaling the image. Only has an effect if `do_rescale` is set to `True`. Can be
            overridden by the `rescale_factor` parameter in the `preprocess` method.
        do_normalize (`bool`, *optional*, defaults to `True`):
            Whether to normalize the image. Can be overridden by the `do_normalize` parameter in the `preprocess`
            method.
        image_mean (`float` or `List[float]`, *optional*, defaults to `IMAGENET_DEFAULT_MEAN`):
            Mean to use if normalizing the image. This is a float or list of floats the length of the number of
            channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
        image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_DEFAULT_STD`):
            Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
            number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
        do_pad (`bool`, *optional*, defaults to `True`):
            Whether to pad the image to the specified `pad_size`. Can be overridden by the `do_pad` parameter in the
            `preprocess` method.
        pad_size (`dict`, *optional*, defaults to `{"height": 1024, "width": 1024}`):
            Size of the output image after padding. Can be overridden by the `pad_size` parameter in the `preprocess`
            method.
        mask_pad_size (`dict`, *optional*, defaults to `{"height": 256, "width": 256}`):
            Size of the output segmentation map after padding. Can be overridden by the `mask_pad_size` parameter in
            the `preprocess` method.
        do_convert_rgb (`bool`, *optional*, defaults to `True`):
            Whether to convert the image to RGB.
    """
    model_input_names = ["pixel_values"]

    def __init__(
        self,
        do_resize: bool = True,
        size: Dict[str, int] = None,
        mask_size: Dict[str, int] = None,
        resample: PILImageResampling = PILImageResampling.BILINEAR,
        do_rescale: bool = True,
        rescale_factor: Union[int, float] = 1 / 255,
        do_normalize: bool = True,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_pad: bool = True,
        pad_size: int = None,
        mask_pad_size: int = None,
        do_convert_rgb: bool = True,
        **kwargs,
    ) -> None:
        """
        Initializes an instance of the SamImageProcessor class.

        Args:
            self: The instance of the class.
            do_resize (bool): Determines whether resizing of images should be performed. Defaults to True.
            size (Dict[str, int]): The desired size of the images. Defaults to {'longest_edge': 1024}.
                The size can be specified as a dictionary with keys 'longest_edge' or 'height' and 'width'.
                If not provided as a dictionary, it is converted to a dictionary with the 'longest_edge' key.
            mask_size (Dict[str, int]): The desired size of the segmentation masks. Defaults to {'longest_edge': 256}.
                The size can be specified as a dictionary with keys 'longest_edge' or 'height' and 'width'.
                If not provided as a dictionary, it is converted to a dictionary with the 'longest_edge' key.
            resample (PILImageResampling): The resampling method to use during image resizing.
                Defaults to PILImageResampling.BILINEAR.
            do_rescale (bool): Determines whether rescaling of pixel values should be performed. Defaults to True.
            rescale_factor (Union[int, float]): The factor to multiply pixel values by during rescaling.
                Defaults to 1 / 255.
            do_normalize (bool): Determines whether normalization of pixel values should be performed.
                Defaults to True.
            image_mean (Optional[Union[float, List[float]]]): The mean values to subtract from pixel values
                during normalization. Defaults to None, which uses the IMAGENET_DEFAULT_MEAN.
            image_std (Optional[Union[float, List[float]]]): The standard deviation values to divide pixel values
                by during normalization. Defaults to None, which uses the IMAGENET_DEFAULT_STD.
            do_pad (bool): Determines whether padding of images should be performed. Defaults to True.
            pad_size (int): The desired size of the padded images. Defaults to None,
                which uses {'height': 1024, 'width': 1024}. The size can be specified as a single integer, representing
                both height and width.
            mask_pad_size (int): The desired size of the padded segmentation masks. Defaults to None,
                which uses {'height': 256, 'width': 256}. The size can be specified as a single integer,
                representing both height and width.
            do_convert_rgb (bool): Determines whether conversion to RGB color space should be performed. Defaults to True.
            **kwargs: Additional keyword arguments to be passed to the parent class constructor.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(**kwargs)
        size = size if size is not None else {"longest_edge": 1024}
        size = get_size_dict(max_size=size, default_to_square=False) if not isinstance(size, dict) else size

        pad_size = pad_size if pad_size is not None else {"height": 1024, "width": 1024}
        pad_size = get_size_dict(pad_size, default_to_square=True)

        mask_size = mask_size if mask_size is not None else {"longest_edge": 256}
        mask_size = (
            get_size_dict(max_size=mask_size, default_to_square=False)
            if not isinstance(mask_size, dict)
            else mask_size
        )

        mask_pad_size = mask_pad_size if mask_pad_size is not None else {"height": 256, "width": 256}
        mask_pad_size = get_size_dict(mask_pad_size, default_to_square=True)

        self.do_resize = do_resize
        self.size = size
        self.mask_size = mask_size
        self.resample = resample
        self.do_rescale = do_rescale
        self.rescale_factor = rescale_factor
        self.do_normalize = do_normalize
        self.image_mean = image_mean if image_mean is not None else IMAGENET_DEFAULT_MEAN
        self.image_std = image_std if image_std is not None else IMAGENET_DEFAULT_STD
        self.do_pad = do_pad
        self.pad_size = pad_size
        self.mask_pad_size = mask_pad_size
        self.do_convert_rgb = do_convert_rgb
        self._valid_processor_keys = [
            "images",
            "segmentation_maps",
            "do_resize",
            "size",
            "mask_size",
            "resample",
            "do_rescale",
            "rescale_factor",
            "do_normalize",
            "image_mean",
            "image_std",
            "do_pad",
            "pad_size",
            "mask_pad_size",
            "do_convert_rgb",
            "return_tensors",
            "data_format",
            "input_data_format",
        ]

    def pad_image(
        self,
        image: np.ndarray,
        pad_size: Dict[str, int],
        data_format: Optional[Union[str, ChannelDimension]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        **kwargs,
    ) -> np.ndarray:
        """
        Pad an image to `(pad_size["height"], pad_size["width"])` with zeros to the right and bottom.

        Args:
            image (`np.ndarray`):
                Image to pad.
            pad_size (`Dict[str, int]`):
                Size of the output image after padding.
            data_format (`str` or `ChannelDimension`, *optional*):
                The data format of the image. Can be either "channels_first" or "channels_last". If `None`, the
                `data_format` of the `image` will be used.
            input_data_format (`str` or `ChannelDimension`, *optional*):
                The channel dimension format of the input image. If not provided, it will be inferred.
        """
        output_height, output_width = pad_size["height"], pad_size["width"]
        input_height, input_width = get_image_size(image, channel_dim=input_data_format)

        pad_width = output_width - input_width
        pad_height = output_height - input_height

        padded_image = pad(
            image,
            ((0, pad_height), (0, pad_width)),
            data_format=data_format,
            input_data_format=input_data_format,
            **kwargs,
        )
        return padded_image

    def _get_preprocess_shape(self, old_shape: Tuple[int, int], longest_edge: int):
        """
        Compute the output size given input size and target long side length.
        """
        oldh, oldw = old_shape
        scale = longest_edge * 1.0 / max(oldh, oldw)
        newh, neww = oldh * scale, oldw * scale
        newh = int(newh + 0.5)
        neww = int(neww + 0.5)
        return (newh, neww)

    def resize(
        self,
        image: np.ndarray,
        size: Dict[str, int],
        resample: PILImageResampling = PILImageResampling.BICUBIC,
        data_format: Optional[Union[str, ChannelDimension]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        **kwargs,
    ) -> np.ndarray:
        """
        Resize an image so that its longest edge matches `size["longest_edge"]`, preserving the aspect ratio.

        Args:
            image (`np.ndarray`):
                Image to resize.
            size (`Dict[str, int]`):
                Dictionary in the format `{"longest_edge": int}` specifying the size of the output image. The longest
                edge of the image will be resized to the specified size, while the other edge will be resized to
                maintain the aspect ratio.
            resample:
                `PILImageResampling` filter to use when resizing the image e.g. `PILImageResampling.BILINEAR`.
            data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the output image. If unset, the channel dimension format of the input
                image is used. Can be one of:

                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the input image. If unset, the channel dimension format is inferred
                from the input image. Can be one of:

                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.

        Returns:
            `np.ndarray`: The resized image.
        """
        size = get_size_dict(size)
        if "longest_edge" not in size:
            raise ValueError(f"The `size` dictionary must contain the key `longest_edge`. Got {size.keys()}")
        input_size = get_image_size(image, channel_dim=input_data_format)
        output_height, output_width = self._get_preprocess_shape(input_size, size["longest_edge"])
        return resize(
            image,
            size=(output_height, output_width),
            resample=resample,
            data_format=data_format,
            input_data_format=input_data_format,
            **kwargs,
        )

    def _preprocess(
        self,
        image: ImageInput,
        do_resize: bool,
        do_rescale: bool,
        do_normalize: bool,
        size: Optional[Dict[str, int]] = None,
        resample: PILImageResampling = None,
        rescale_factor: Optional[float] = None,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_pad: Optional[bool] = None,
        pad_size: Optional[Dict[str, int]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ):
        '''
        This method preprocesses the input image according to the specified operations such as resizing, rescaling,
        normalization, and padding.

        Args:
            self: The instance of the SamImageProcessor class.
            image (ImageInput): The input image to be preprocessed.
            do_resize (bool): A flag indicating whether to perform resizing on the input image.
            do_rescale (bool): A flag indicating whether to perform rescaling on the input image.
            do_normalize (bool): A flag indicating whether to perform normalization on the input image.
            size (Optional[Dict[str, int]]): The target size for resizing the image in the format
                {'longest_edge': int}. Default is None.
            resample (PILImageResampling): The resampling filter to be used during image resizing. Default is None.
            rescale_factor (Optional[float]): The factor by which the image should be rescaled. Default is None.
            image_mean (Optional[Union[float, List[float]]]): The mean value to be used for image normalization.
                It can be a single float value or a list of floats with one entry per image channel.
                Default is None.
            image_std (Optional[Union[float, List[float]]]):
                The standard deviation value to be used for image normalization.
                It can be a single float value or a list of floats with one entry per image channel.
                Default is None.
            do_pad (Optional[bool]): A flag indicating whether to perform padding on the input image. Default is None.
            pad_size (Optional[Dict[str, int]]): The size of the image after padding, in the format
                {'height': int, 'width': int}. Default is None.
            input_data_format (Optional[Union[str, ChannelDimension]]): The data format of the input image,
                e.g., 'channels_first' or 'channels_last'. Default is None.

        Returns:
            Tuple[np.ndarray, Tuple[int, int]]: The preprocessed image and its (height, width) after resizing,
                measured before any padding is applied.

        Raises:
            ValueError: If `do_resize` is set and `size` does not contain the `longest_edge` key.
        '''
        if do_resize:
            image = self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format)
        reshaped_input_size = get_image_size(image, channel_dim=input_data_format)

        if do_rescale:
            image = self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)

        if do_normalize:
            image = self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)

        if do_pad:
            image = self.pad_image(image=image, pad_size=pad_size, input_data_format=input_data_format)

        return image, reshaped_input_size

    def _preprocess_image(
        self,
        image: ImageInput,
        do_resize: Optional[bool] = None,
        size: Dict[str, int] = None,
        resample: PILImageResampling = None,
        do_rescale: bool = None,
        rescale_factor: Optional[float] = None,
        do_normalize: Optional[bool] = None,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_pad: Optional[bool] = None,
        pad_size: Optional[Dict[str, int]] = None,
        do_convert_rgb: Optional[bool] = None,
        data_format: Optional[Union[str, ChannelDimension]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ) -> Tuple[np.ndarray, Tuple[int, int], Tuple[int, int]]:
        """
        This method preprocesses the input image with various transformations and returns the processed image,
        original size, and reshaped input size.

        Args:
            self: The instance of the SamImageProcessor class.
            image (ImageInput): The input image to be preprocessed.
            do_resize (Optional[bool]): A flag indicating whether to resize the image. Defaults to None.
            size (Optional[Dict[str, int]]): A dictionary containing the target width and height for resizing the image.
                Defaults to None.
            resample (PILImageResampling): The resampling filter to be used during image resizing.
            do_rescale (Optional[bool]): A flag indicating whether to rescale the image. Defaults to None.
            rescale_factor (Optional[float]): The factor by which to rescale the image. Defaults to None.
            do_normalize (Optional[bool]): A flag indicating whether to normalize the image. Defaults to None.
            image_mean (Optional[Union[float, List[float]]]): The mean values to be used for image normalization.
                Defaults to None.
            image_std (Optional[Union[float, List[float]]]): The standard deviation values to be used for
                image normalization. Defaults to None.
            do_pad (Optional[bool]): A flag indicating whether to pad the image. Defaults to None.
            pad_size (Optional[Dict[str, int]]): A dictionary containing the padding width and height.
                Defaults to None.
            do_convert_rgb (Optional[bool]): A flag indicating whether to convert the image to RGB format.
                Defaults to None.
            data_format (Optional[Union[str, ChannelDimension]]): The desired data format for the processed image.
            input_data_format (Optional[Union[str, ChannelDimension]]): The input data format of the image.

        Returns:
            Tuple[np.ndarray, Tuple[int, int], Tuple[int, int]]: A tuple containing the processed image as a numpy array,
                the original size of the input image, and the reshaped input size after preprocessing.

        Raises:
            None
        """
        image = to_numpy_array(image)

        # PIL RGBA images are converted to RGB
        if do_convert_rgb:
            image = convert_to_rgb(image)

        # All transformations expect numpy arrays.
        image = to_numpy_array(image)

        if is_scaled_image(image) and do_rescale:
            logger.warning_once(
                "It looks like you are trying to rescale already rescaled images. If the input"
                " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
            )

        if input_data_format is None:
            input_data_format = infer_channel_dimension_format(image)

        original_size = get_image_size(image, channel_dim=input_data_format)

        image, reshaped_input_size = self._preprocess(
            image=image,
            do_resize=do_resize,
            size=size,
            resample=resample,
            do_rescale=do_rescale,
            rescale_factor=rescale_factor,
            do_normalize=do_normalize,
            image_mean=image_mean,
            image_std=image_std,
            do_pad=do_pad,
            pad_size=pad_size,
            input_data_format=input_data_format,
        )

        if data_format is not None:
            image = to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)

        return image, original_size, reshaped_input_size

    def _preprocess_mask(
        self,
        segmentation_map: ImageInput,
        do_resize: Optional[bool] = None,
        mask_size: Dict[str, int] = None,
        do_pad: Optional[bool] = None,
        mask_pad_size: Optional[Dict[str, int]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ) -> Tuple[np.ndarray, Tuple[int, int]]:
        """
        Method to preprocess a segmentation mask.

        Args:
            self: The instance of the SamImageProcessor class.
            segmentation_map (ImageInput): The input segmentation map to be preprocessed.
            do_resize (Optional[bool]): Flag indicating whether resizing should be performed. Default is None.
            mask_size (Dict[str, int]): Dictionary containing the target size for the mask after resizing.
            do_pad (Optional[bool]): Flag indicating whether padding should be applied. Default is None.
            mask_pad_size (Optional[Dict[str, int]]): Dictionary containing the padding size for the mask.
            input_data_format (Optional[Union[str, ChannelDimension]]): Format of the input data. Default is None.

        Returns:
            Tuple[np.ndarray, Tuple[int, int]]: The preprocessed segmentation map as an int64 NumPy array, and
                the (height, width) of the original segmentation map.

        Raises:
            None
        """
        segmentation_map = to_numpy_array(segmentation_map)

        # Add channel dimension if missing - needed for certain transformations
        if segmentation_map.ndim == 2:
            added_channel_dim = True
            segmentation_map = segmentation_map[None, ...]
            input_data_format = ChannelDimension.FIRST
        else:
            added_channel_dim = False
            if input_data_format is None:
                input_data_format = infer_channel_dimension_format(segmentation_map, num_channels=1)

        original_size = get_image_size(segmentation_map, channel_dim=input_data_format)

        segmentation_map, _ = self._preprocess(
            image=segmentation_map,
            do_resize=do_resize,
            size=mask_size,
            resample=PILImageResampling.NEAREST,
            do_rescale=False,
            do_normalize=False,
            do_pad=do_pad,
            pad_size=mask_pad_size,
            input_data_format=input_data_format,
        )

        # Remove extra channel dimension if added for processing
        if added_channel_dim:
            segmentation_map = segmentation_map.squeeze(0)
        segmentation_map = segmentation_map.astype(np.int64)

        return segmentation_map, original_size

    def preprocess(
        self,
        images: ImageInput,
        segmentation_maps: Optional[ImageInput] = None,
        do_resize: Optional[bool] = None,
        size: Optional[Dict[str, int]] = None,
        mask_size: Optional[Dict[str, int]] = None,
        resample: Optional["PILImageResampling"] = None,
        do_rescale: Optional[bool] = None,
        rescale_factor: Optional[Union[int, float]] = None,
        do_normalize: Optional[bool] = None,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_pad: Optional[bool] = None,
        pad_size: Optional[Dict[str, int]] = None,
        mask_pad_size: Optional[Dict[str, int]] = None,
        do_convert_rgb: Optional[bool] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        data_format: ChannelDimension = ChannelDimension.FIRST,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        **kwargs,
    ):
        """
        Preprocess an image or batch of images.

        Args:
            images (`ImageInput`):
                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
            segmentation_maps (`ImageInput`, *optional*):
                Segmentation map to preprocess.
            do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                Whether to resize the image.
            size (`Dict[str, int]`, *optional*, defaults to `self.size`):
                Controls the size of the image after `resize`. The longest edge of the image is resized to
                `size["longest_edge"]` whilst preserving the aspect ratio.
            mask_size (`Dict[str, int]`, *optional*, defaults to `self.mask_size`):
                Controls the size of the segmentation map after `resize`. The longest edge of the segmentation map is
                resized to `mask_size["longest_edge"]` whilst preserving the aspect ratio.
            resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
                `PILImageResampling` filter to use when resizing the image e.g. `PILImageResampling.BILINEAR`.
            do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
                Whether to rescale the image pixel values by rescaling factor.
            rescale_factor (`int` or `float`, *optional*, defaults to `self.rescale_factor`):
                Rescale factor to apply to the image pixel values.
            do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
                Whether to normalize the image.
            image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
                Image mean to normalize the image by if `do_normalize` is set to `True`.
            image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
                Image standard deviation to normalize the image by if `do_normalize` is set to `True`.
            do_pad (`bool`, *optional*, defaults to `self.do_pad`):
                Whether to pad the image.
            pad_size (`Dict[str, int]`, *optional*, defaults to `self.pad_size`):
                Controls the size of the padding applied to the image. The image is padded to `pad_size["height"]` and
                `pad_size["width"]` if `do_pad` is set to `True`.
            mask_pad_size (`Dict[str, int]`, *optional*, defaults to `self.mask_pad_size`):
                Controls the size of the padding applied to the segmentation map. The segmentation map is padded to
                `mask_pad_size["height"]` and `mask_pad_size["width"]` if `do_pad` is set to `True`.
            do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
                Whether to convert the image to RGB.
            return_tensors (`str` or `TensorType`, *optional*):
                The type of tensors to return. Can be one of:

                - Unset: Return a list of `np.ndarray`.
                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `mindspore.Tensor`.
                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
            data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
                The channel dimension format for the output image. Can be one of:

                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - Unset: Use the channel dimension format of the input image.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the input image. If unset, the channel dimension format is inferred
                from the input image. Can be one of:

                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
        """
        do_resize = do_resize if do_resize is not None else self.do_resize
        size = size if size is not None else self.size
        size = get_size_dict(max_size=size, default_to_square=False) if not isinstance(size, dict) else size
        mask_size = mask_size if mask_size is not None else self.mask_size
        mask_size = (
            get_size_dict(max_size=mask_size, default_to_square=False)
            if not isinstance(mask_size, dict)
            else mask_size
        )
        resample = resample if resample is not None else self.resample
        do_rescale = do_rescale if do_rescale is not None else self.do_rescale
        rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
        do_normalize = do_normalize if do_normalize is not None else self.do_normalize
        image_mean = image_mean if image_mean is not None else self.image_mean
        image_std = image_std if image_std is not None else self.image_std
        do_pad = do_pad if do_pad is not None else self.do_pad
        pad_size = pad_size if pad_size is not None else self.pad_size
        pad_size = get_size_dict(pad_size, default_to_square=True)
        mask_pad_size = mask_pad_size if mask_pad_size is not None else self.mask_pad_size
        mask_pad_size = get_size_dict(mask_pad_size, default_to_square=True)
        do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb

        images = make_list_of_images(images)

        validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self._valid_processor_keys)

        if not valid_images(images):
            raise ValueError(
                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
                "mindspore.Tensor, tf.Tensor or jax.ndarray."
            )

        if segmentation_maps is not None:
            segmentation_maps = make_list_of_images(segmentation_maps, expected_ndims=2)

            if not valid_images(segmentation_maps):
                raise ValueError(
                    "Invalid segmentation map type. Must be of type PIL.Image.Image, numpy.ndarray, "
                    "mindspore.Tensor, tf.Tensor or jax.ndarray."
                )
        validate_preprocess_arguments(
            do_rescale=do_rescale,
            rescale_factor=rescale_factor,
            do_normalize=do_normalize,
            image_mean=image_mean,
            image_std=image_std,
            do_pad=do_pad,
            size_divisibility=pad_size,  # Here _preprocess needs do_pad and pad_size.
            do_resize=do_resize,
            size=size,
            resample=resample,
        )

        images, original_sizes, reshaped_input_sizes = zip(
            *(
                self._preprocess_image(
                    image=img,
                    do_resize=do_resize,
                    size=size,
                    resample=resample,
                    do_rescale=do_rescale,
                    rescale_factor=rescale_factor,
                    do_normalize=do_normalize,
                    image_mean=image_mean,
                    image_std=image_std,
                    do_pad=do_pad,
                    pad_size=pad_size,
                    do_convert_rgb=do_convert_rgb,
                    data_format=data_format,
                    input_data_format=input_data_format,
                )
                for img in images
            )
        )

        data = {
            "pixel_values": images,
            "original_sizes": original_sizes,
            "reshaped_input_sizes": reshaped_input_sizes,
        }

        if segmentation_maps is not None:
            segmentation_maps, original_mask_sizes = zip(
                *(
                    self._preprocess_mask(
                        segmentation_map=mask,
                        do_resize=do_resize,
                        mask_size=mask_size,
                        do_pad=do_pad,
                        mask_pad_size=mask_pad_size,
                        input_data_format=input_data_format,
                    )
                    for mask in segmentation_maps
                )
            )

            # masks should start out the same size as input images
            assert all(
                original_im_size == original_mask_size
                for original_im_size, original_mask_size in zip(original_sizes, original_mask_sizes)
            ), "Segmentation maps should be the same size as input images."

            data["labels"] = segmentation_maps

        return BatchFeature(data=data, tensor_type=return_tensors)

    def post_process_masks(
        self,
        masks,
        original_sizes,
        reshaped_input_sizes,
        mask_threshold=0.0,
        binarize=True,
        pad_size=None,
        return_tensors="ms",
    ):
        """
        Remove padding and upscale masks to the original image size.

        Args:
            masks (`Union[List[mindspore.Tensor], List[np.ndarray], List[tf.Tensor]]`):
                Batched masks from the mask_decoder in (batch_size, num_channels, height, width) format.
            original_sizes (`Union[mindspore.Tensor, tf.Tensor, List[Tuple[int,int]]]`):
                The original sizes of each image before it was resized to the model's expected input shape, in (height,
                width) format.
            reshaped_input_sizes (`Union[mindspore.Tensor, tf.Tensor, List[Tuple[int,int]]]`):
                The size of each image as it is fed to the model, in (height, width) format. Used to remove padding.
            mask_threshold (`float`, *optional*, defaults to 0.0):
                The threshold to use for binarizing the masks.
            binarize (`bool`, *optional*, defaults to `True`):
                Whether to binarize the masks.
            pad_size (`Dict[str, int]`, *optional*, defaults to `self.pad_size`):
                The target size the images were padded to before being passed to the model. If None, the target size is
                assumed to be the processor's `pad_size`.
            return_tensors (`str`, *optional*, defaults to `"ms"`):
                The tensor framework to return. Only `"ms"` (MindSpore tensors) is supported; any other value raises
                a `ValueError`.

        Returns:
            (`List[mindspore.Tensor]`): One mask tensor per image, each in (batch_size, num_channels, height, width)
            format, where (height, width) is given by the corresponding entry of `original_sizes`.
        """
        if return_tensors == "ms":
            return self._post_process_masks_ms(
                masks=masks,
                original_sizes=original_sizes,
                reshaped_input_sizes=reshaped_input_sizes,
                mask_threshold=mask_threshold,
                binarize=binarize,
                pad_size=pad_size,
            )
        else:
            raise ValueError("return_tensors must be 'ms'.")

    def _post_process_masks_ms(
        self, masks, original_sizes, reshaped_input_sizes, mask_threshold=0.0, binarize=True, pad_size=None
    ):
        """
        Remove padding and upscale masks to the original image size.

        Args:
            masks (`Union[List[mindspore.Tensor], List[np.ndarray]]`):
                Batched masks from the mask_decoder in (batch_size, num_channels, height, width) format.
            original_sizes (`Union[mindspore.Tensor, List[Tuple[int,int]]]`):
                The original sizes of each image before it was resized to the model's expected input shape, in (height,
                width) format.
            reshaped_input_sizes (`Union[mindspore.Tensor, List[Tuple[int,int]]]`):
                The size of each image as it is fed to the model, in (height, width) format. Used to remove padding.
            mask_threshold (`float`, *optional*, defaults to 0.0):
                The threshold to use for binarizing the masks.
            binarize (`bool`, *optional*, defaults to `True`):
                Whether to binarize the masks.
            pad_size (`Dict[str, int]`, *optional*, defaults to `self.pad_size`):
                The target size the images were padded to before being passed to the model. If None, the target size is
                assumed to be the processor's `pad_size`.

        Returns:
            (`List[mindspore.Tensor]`): One mask tensor per image, each in (batch_size, num_channels, height, width)
            format, where (height, width) is given by the corresponding entry of `original_sizes`.
        """
        requires_backends(self, ["mindspore"])
        pad_size = self.pad_size if pad_size is None else pad_size
        target_image_size = (pad_size["height"], pad_size["width"])
        if isinstance(original_sizes, (mindspore.Tensor, np.ndarray)):
            original_sizes = original_sizes.tolist()
        if isinstance(reshaped_input_sizes, (mindspore.Tensor, np.ndarray)):
            reshaped_input_sizes = reshaped_input_sizes.tolist()
        output_masks = []
        for i, original_size in enumerate(original_sizes):
            if isinstance(masks[i], np.ndarray):
                masks[i] = mindspore.Tensor(masks[i], dtype=mindspore.float32)
            elif not isinstance(masks[i], mindspore.Tensor):
                raise ValueError("Input masks should be a list of `mindspore.Tensor` or a list of `np.ndarray`")
            interpolated_mask = F.interpolate(masks[i], target_image_size, mode="bilinear", align_corners=False)
            interpolated_mask = interpolated_mask[..., : reshaped_input_sizes[i][0], : reshaped_input_sizes[i][1]]
            interpolated_mask = F.interpolate(interpolated_mask, original_size, mode="bilinear", align_corners=False)
            if binarize:
                interpolated_mask = interpolated_mask > mask_threshold
            output_masks.append(interpolated_mask)

        return output_masks

    def post_process_for_mask_generation(
        self, all_masks, all_scores, all_boxes, crops_nms_thresh, return_tensors="ms"
    ):
        """
        Post-processes the generated masks by running the Non-Maximum Suppression algorithm on the predicted masks.

        Args:
            all_masks (`Union[List[mindspore.Tensor], List[tf.Tensor]]`):
                List of all predicted segmentation masks
            all_scores (`Union[List[mindspore.Tensor], List[tf.Tensor]]`):
                List of all predicted iou scores
            all_boxes (`Union[List[mindspore.Tensor], List[tf.Tensor]]`):
                List of all bounding boxes of the predicted masks
            crops_nms_thresh (`float`):
                Threshold for NMS (Non Maximum Suppression) algorithm.
            return_tensors (`str`, *optional*, defaults to `"ms"`):
                The tensor framework to return. Only `"ms"` (`mindspore.Tensor`) is supported.
        """
        if return_tensors == "ms":
            return _postprocess_for_mg(all_masks, all_scores, all_boxes, crops_nms_thresh)
        raise ValueError("return_tensors must be 'ms'.")

    def generate_crop_boxes(
        self,
        image,
        target_size,
        crop_n_layers: int = 0,
        overlap_ratio: float = 512 / 1500,
        points_per_crop: Optional[int] = 32,
        crop_n_points_downscale_factor: Optional[int] = 1,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        return_tensors: str = "ms",
    ):
        """
        Generates a list of crop boxes of different sizes. Each layer has (2**i)**2 boxes for the ith layer.

        Args:
            image (`np.array`):
                Input original image
            target_size (`int`):
                Target size of the resized image
            crop_n_layers (`int`, *optional*, defaults to 0):
                If >0, mask prediction will be run again on crops of the image. Sets the number of layers to run, where
                each layer has 2**i_layer number of image crops.
            overlap_ratio (`float`, *optional*, defaults to 512/1500):
                Sets the degree to which crops overlap. In the first crop layer, crops will overlap by this fraction of
                the image length. Later layers with more crops scale down this overlap.
            points_per_crop (`int`, *optional*, defaults to 32):
                Number of points to sample from each crop.
            crop_n_points_downscale_factor (`int`, *optional*, defaults to 1):
                The number of points-per-side sampled in layer n is scaled down by crop_n_points_downscale_factor**n.
            input_data_format (`str` or `ChannelDimension`, *optional*):
                The channel dimension format of the input image. If not provided, it will be inferred.
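            return_tensors (`str`, *optional*, defaults to `"ms"`):
                The tensor framework to return. Only `"ms"` (`mindspore.Tensor`) is supported; any other value raises
                a `ValueError`. The cropped images are always returned as NumPy arrays.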
        """
        crop_boxes, points_per_crop, cropped_images, input_labels = _generate_crop_boxes(
            image,
            target_size,
            crop_n_layers,
            overlap_ratio,
            points_per_crop,
            crop_n_points_downscale_factor,
            input_data_format,
        )
        if return_tensors == "ms":
            crop_boxes = mindspore.tensor(crop_boxes)
            points_per_crop = mindspore.tensor(points_per_crop)
            # cropped_images stays as np
            input_labels = mindspore.tensor(input_labels)
        else:
            raise ValueError("return_tensors must be 'ms'.")
        return crop_boxes, points_per_crop, cropped_images, input_labels

    def filter_masks(
        self,
        masks,
        iou_scores,
        original_size,
        cropped_box_image,
        pred_iou_thresh=0.88,
        stability_score_thresh=0.95,
        mask_threshold=0,
        stability_score_offset=1,
        return_tensors="ms",
    ):
        """
        Filters the predicted masks, keeping only those that meet several criteria. First, the IoU score must be
        greater than `pred_iou_thresh`. Second, the stability score must be greater than `stability_score_thresh`.
        The method also converts the predicted masks to bounding boxes and pads the predicted masks if necessary.

        Args:
            masks (`Union[mindspore.Tensor, tf.Tensor]`):
                Input masks.
            iou_scores (`Union[mindspore.Tensor, tf.Tensor]`):
                List of IoU scores.
            original_size (`Tuple[int,int]`):
                Size of the original image.
            cropped_box_image (`np.array`):
                The cropped image.
            pred_iou_thresh (`float`, *optional*, defaults to 0.88):
                The threshold for the iou scores.
            stability_score_thresh (`float`, *optional*, defaults to 0.95):
                The threshold for the stability score.
            mask_threshold (`float`, *optional*, defaults to 0):
                The threshold for the predicted masks.
            stability_score_offset (`float`, *optional*, defaults to 1):
                The offset for the stability score used in the `_compute_stability_score` method.
            return_tensors (`str`, *optional*, defaults to `"ms"`):
                If `"ms"`, returns `mindspore.Tensor`. If `"tf"`, returns `tf.Tensor`.
        """
        if return_tensors == "ms":
            return self._filter_masks(
                masks=masks,
                iou_scores=iou_scores,
                original_size=original_size,
                cropped_box_image=cropped_box_image,
                pred_iou_thresh=pred_iou_thresh,
                stability_score_thresh=stability_score_thresh,
                mask_threshold=mask_threshold,
                stability_score_offset=stability_score_offset,
            )
        elif return_tensors == "tf":
            return self._filter_masks_tf(
                masks=masks,
                iou_scores=iou_scores,
                original_size=original_size,
                cropped_box_image=cropped_box_image,
                pred_iou_thresh=pred_iou_thresh,
                stability_score_thresh=stability_score_thresh,
                mask_threshold=mask_threshold,
                stability_score_offset=stability_score_offset,
            )

    def _filter_masks(
        self,
        masks,
        iou_scores,
        original_size,
        cropped_box_image,
        pred_iou_thresh=0.88,
        stability_score_thresh=0.95,
        mask_threshold=0,
        stability_score_offset=1,
    ):
        """
        Filters the predicted masks, keeping only those that meet several criteria. First, the IoU score must be
        greater than `pred_iou_thresh`. Second, the stability score must be greater than `stability_score_thresh`.
        The method also converts the predicted masks to bounding boxes and pads the predicted masks if necessary.

        Args:
            masks (`mindspore.Tensor`):
                Input masks.
            iou_scores (`mindspore.Tensor`):
                List of IoU scores.
            original_size (`Tuple[int,int]`):
                Size of the original image.
            cropped_box_image (`np.array`):
                The cropped image.
            pred_iou_thresh (`float`, *optional*, defaults to 0.88):
                The threshold for the iou scores.
            stability_score_thresh (`float`, *optional*, defaults to 0.95):
                The threshold for the stability score.
            mask_threshold (`float`, *optional*, defaults to 0):
                The threshold for the predicted masks.
            stability_score_offset (`float`, *optional*, defaults to 1):
                The offset for the stability score used in the `_compute_stability_score` method.

        """
        requires_backends(self, ["mindspore"])
        original_height, original_width = original_size
        iou_scores = iou_scores.flatten(start_dim=0, end_dim=1)
        masks = masks.flatten(start_dim=0, end_dim=1)

        if masks.shape[0] != iou_scores.shape[0]:
            raise ValueError("masks and iou_scores must have the same batch size.")

        batch_size = masks.shape[0]

        keep_mask = ops.ones(batch_size, dtype=mindspore.bool_)

        if pred_iou_thresh > 0.0:
            keep_mask = keep_mask & (iou_scores > pred_iou_thresh)

        # compute stability score
        if stability_score_thresh > 0.0:
            stability_scores = _compute_stability_score(masks, mask_threshold, stability_score_offset)
            keep_mask = keep_mask & (stability_scores > stability_score_thresh)

        scores = iou_scores[keep_mask]
        masks = masks[keep_mask]

        # binarize masks
        masks = masks > mask_threshold
        converted_boxes = _batched_mask_to_box(masks)

        keep_mask = ~_is_box_near_crop_edge(
            converted_boxes, cropped_box_image, [0, 0, original_width, original_height]
        )

        scores = scores[keep_mask]
        masks = masks[keep_mask]
        converted_boxes = converted_boxes[keep_mask]

        masks = _pad_masks(masks, cropped_box_image, original_height, original_width)
        # conversion to RLE is necessary to run non-maximum suppression
        masks = _mask_to_rle(masks)

        return masks, scores, converted_boxes

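The padding removal and upscaling done by `post_process_masks` can be illustrated without MindSpore. The following is a minimal NumPy sketch, not the library's implementation: the helper names `resize_nearest` and `unpad_and_upscale` are hypothetical, it handles a single 2D mask, and it swaps the bilinear interpolation used in the source for nearest-neighbour sampling to keep the example short. It assumes the image was padded at the bottom and right, which is what the `[: height, : width]` crop in the source implies.

```python
import numpy as np


def resize_nearest(mask: np.ndarray, size: tuple) -> np.ndarray:
    """Resize a (H, W) array to size = (height, width) by nearest-neighbour sampling."""
    src_h, src_w = mask.shape
    dst_h, dst_w = size
    rows = np.arange(dst_h) * src_h // dst_h
    cols = np.arange(dst_w) * src_w // dst_w
    return mask[rows[:, None], cols[None, :]]


def unpad_and_upscale(low_res_mask, original_size, reshaped_input_size,
                      pad_size=(1024, 1024), mask_threshold=0.0, binarize=True):
    # 1. Upscale the low-resolution logits to the padded model input size.
    mask = resize_nearest(low_res_mask, pad_size)
    # 2. Crop away the padding so only the resized image region remains.
    mask = mask[: reshaped_input_size[0], : reshaped_input_size[1]]
    # 3. Resize the cropped region back to the original image size.
    mask = resize_nearest(mask, original_size)
    # 4. Optionally threshold the logits into a boolean mask.
    return mask > mask_threshold if binarize else mask


# A 256x256 logit map with positive values in the top-left corner.
low_res = np.full((256, 256), -1.0)
low_res[:64, :64] = 1.0

# A 480x640 image resized so its longest edge is 1024 becomes 768x1024.
result = unpad_and_upscale(low_res, original_size=(480, 640), reshaped_input_size=(768, 1024))
print(result.shape, result.dtype)
```

The crop in step 2 is why `reshaped_input_sizes` must be passed alongside `original_sizes`: without it, the padded rows or columns would be stretched into the final mask.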
`mindnlp.transformers.models.sam.image_processing_sam.SamImageProcessor.__init__(do_resize=True, size=None, mask_size=None, resample=PILImageResampling.BILINEAR, do_rescale=True, rescale_factor=1 / 255, do_normalize=True, image_mean=None, image_std=None, do_pad=True, pad_size=None, mask_pad_size=None, do_convert_rgb=True, **kwargs)` ¶

Initializes an instance of the SamImageProcessor class.

PARAMETER	DESCRIPTION
`self`	The instance of the class.
`do_resize`	Determines whether resizing of images should be performed. Defaults to True. TYPE: `bool` DEFAULT: `True`
`size`	The desired size of the images. Defaults to {'longest_edge': 1024}. The size can be specified as a dictionary with keys 'longest_edge' or 'height' and 'width'. If not provided as a dictionary, it is converted to a dictionary with the 'longest_edge' key. TYPE: `Dict[str, int]` DEFAULT: `None`
`mask_size`	The desired size of the segmentation masks. Defaults to {'longest_edge': 256}. The size can be specified as a dictionary with keys 'longest_edge' or 'height' and 'width'. If not provided as a dictionary, it is converted to a dictionary with the 'longest_edge' key. TYPE: `Dict[str, int]` DEFAULT: `None`
`resample`	The resampling method to use during image resizing. Defaults to PILImageResampling.BILINEAR. TYPE: `PILImageResampling` DEFAULT: `BILINEAR`
`do_rescale`	Determines whether rescaling of pixel values should be performed. Defaults to True. TYPE: `bool` DEFAULT: `True`
`rescale_factor`	The factor to multiply pixel values by during rescaling. Defaults to 1 / 255. TYPE: `Union[int, float]` DEFAULT: `1 / 255`
`do_normalize`	Determines whether normalization of pixel values should be performed. Defaults to True. TYPE: `bool` DEFAULT: `True`
`image_mean`	The mean values to subtract from pixel values during normalization. Defaults to None, which uses the IMAGENET_DEFAULT_MEAN. TYPE: `Optional[Union[float, List[float]]]` DEFAULT: `None`
`image_std`	The standard deviation values to divide pixel values by during normalization. Defaults to None, which uses the IMAGENET_DEFAULT_STD. TYPE: `Optional[Union[float, List[float]]]` DEFAULT: `None`
`do_pad`	Determines whether padding of images should be performed. Defaults to True. TYPE: `bool` DEFAULT: `True`
`pad_size`	The desired size of the padded images. Defaults to None, which uses {'height': 1024, 'width': 1024}. The size can be specified as a single integer, representing both height and width. TYPE: `int` DEFAULT: `None`
`mask_pad_size`	The desired size of the padded segmentation masks. Defaults to None, which uses {'height': 256, 'width': 256}. The size can be specified as a single integer, representing both height and width. TYPE: `int` DEFAULT: `None`
`do_convert_rgb`	Determines whether conversion to RGB color space should be performed. Defaults to True. TYPE: `bool` DEFAULT: `True`
`**kwargs`	Additional keyword arguments to be passed to the parent class constructor. DEFAULT: `{}`

RETURNS	DESCRIPTION
`None`	None.

Source code in mindnlp\transformers\models\sam\image_processing_sam.py

def __init__(
    self,
    do_resize: bool = True,
    size: Dict[str, int] = None,
    mask_size: Dict[str, int] = None,
    resample: PILImageResampling = PILImageResampling.BILINEAR,
    do_rescale: bool = True,
    rescale_factor: Union[int, float] = 1 / 255,
    do_normalize: bool = True,
    image_mean: Optional[Union[float, List[float]]] = None,
    image_std: Optional[Union[float, List[float]]] = None,
    do_pad: bool = True,
    pad_size: int = None,
    mask_pad_size: int = None,
    do_convert_rgb: bool = True,
    **kwargs,
) -> None:
    """
    Initializes an instance of the SamImageProcessor class.

    Args:
        self: The instance of the class.
        do_resize (bool): Determines whether resizing of images should be performed. Defaults to True.
        size (Dict[str, int]): The desired size of the images. Defaults to {'longest_edge': 1024}.
            The size can be specified as a dictionary with keys 'longest_edge' or 'height' and 'width'.
            If not provided as a dictionary, it is converted to a dictionary with the 'longest_edge' key.
        mask_size (Dict[str, int]): The desired size of the segmentation masks. Defaults to {'longest_edge': 256}.
            The size can be specified as a dictionary with keys 'longest_edge' or 'height' and 'width'.
            If not provided as a dictionary, it is converted to a dictionary with the 'longest_edge' key.
        resample (PILImageResampling): The resampling method to use during image resizing.
            Defaults to PILImageResampling.BILINEAR.
        do_rescale (bool): Determines whether rescaling of pixel values should be performed. Defaults to True.
        rescale_factor (Union[int, float]): The factor to multiply pixel values by during rescaling.
            Defaults to 1 / 255.
        do_normalize (bool): Determines whether normalization of pixel values should be performed.
            Defaults to True.
        image_mean (Optional[Union[float, List[float]]]): The mean values to subtract from pixel values
            during normalization. Defaults to None, which uses the IMAGENET_DEFAULT_MEAN.
        image_std (Optional[Union[float, List[float]]]): The standard deviation values to divide pixel values
            by during normalization. Defaults to None, which uses the IMAGENET_DEFAULT_STD.
        do_pad (bool): Determines whether padding of images should be performed. Defaults to True.
        pad_size (int): The desired size of the padded images. Defaults to None,
            which uses {'height': 1024, 'width': 1024}. The size can be specified as a single integer, representing
            both height and width.
        mask_pad_size (int): The desired size of the padded segmentation masks. Defaults to None,
            which uses {'height': 256, 'width': 256}. The size can be specified as a single integer,
            representing both height and width.
        do_convert_rgb (bool): Determines whether conversion to RGB color space should be performed. Defaults to True.
        **kwargs: Additional keyword arguments to be passed to the parent class constructor.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(**kwargs)
    size = size if size is not None else {"longest_edge": 1024}
    size = get_size_dict(max_size=size, default_to_square=False) if not isinstance(size, dict) else size

    pad_size = pad_size if pad_size is not None else {"height": 1024, "width": 1024}
    pad_size = get_size_dict(pad_size, default_to_square=True)

    mask_size = mask_size if mask_size is not None else {"longest_edge": 256}
    mask_size = (
        get_size_dict(max_size=mask_size, default_to_square=False)
        if not isinstance(mask_size, dict)
        else mask_size
    )

    mask_pad_size = mask_pad_size if mask_pad_size is not None else {"height": 256, "width": 256}
    mask_pad_size = get_size_dict(mask_pad_size, default_to_square=True)

    self.do_resize = do_resize
    self.size = size
    self.mask_size = mask_size
    self.resample = resample
    self.do_rescale = do_rescale
    self.rescale_factor = rescale_factor
    self.do_normalize = do_normalize
    self.image_mean = image_mean if image_mean is not None else IMAGENET_DEFAULT_MEAN
    self.image_std = image_std if image_std is not None else IMAGENET_DEFAULT_STD
    self.do_pad = do_pad
    self.pad_size = pad_size
    self.mask_pad_size = mask_pad_size
    self.do_convert_rgb = do_convert_rgb
    self._valid_processor_keys = [
        "images",
        "segmentation_maps",
        "do_resize",
        "size",
        "mask_size",
        "resample",
        "do_rescale",
        "rescale_factor",
        "do_normalize",
        "image_mean",
        "image_std",
        "do_pad",
        "pad_size",
        "mask_pad_size",
        "do_convert_rgb",
        "return_tensors",
        "data_format",
        "input_data_format",
    ]
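
The size defaults described above can be sketched in plain Python. This is a simplified illustration, not the library's code: `to_longest_edge` and `to_square` are hypothetical helpers that stand in for the `get_size_dict` calls in the constructor, and they skip the key validation the real function may perform.

```python
from typing import Dict, Optional, Union

SizeArg = Optional[Union[int, Dict[str, int]]]


def to_longest_edge(size: SizeArg, default: int) -> Dict[str, int]:
    """Hypothetical helper: mirrors how `size` and `mask_size` are resolved."""
    if size is None:
        return {"longest_edge": default}
    if isinstance(size, dict):
        return size  # dictionaries are kept unchanged
    return {"longest_edge": size}  # a bare int becomes the longest edge


def to_square(pad_size: SizeArg, default: int) -> Dict[str, int]:
    """Hypothetical helper: mirrors how `pad_size` and `mask_pad_size` are resolved."""
    if pad_size is None:
        return {"height": default, "width": default}
    if isinstance(pad_size, dict):
        return pad_size
    return {"height": pad_size, "width": pad_size}  # a bare int means a square


resolved = {
    "size": to_longest_edge(None, 1024),
    "mask_size": to_longest_edge(512, 256),
    "pad_size": to_square(None, 1024),
    "mask_pad_size": to_square(128, 256),
}
print(resolved)
```

Passing `None` falls back to the documented defaults, while an integer is expanded into the matching dictionary form.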

`mindnlp.transformers.models.sam.image_processing_sam.SamImageProcessor.filter_masks(masks, iou_scores, original_size, cropped_box_image, pred_iou_thresh=0.88, stability_score_thresh=0.95, mask_threshold=0, stability_score_offset=1, return_tensors='ms')` ¶

Filters the predicted masks, keeping only those that meet several criteria. The first criterion is that the IoU score must be greater than pred_iou_thresh. The second criterion is that the stability score must be greater than stability_score_thresh. The method also converts the predicted masks to bounding boxes and pads the predicted masks if necessary.

PARAMETER	DESCRIPTION
`masks`	Input masks. TYPE: `Union[mindspore.Tensor, tf.Tensor]`
`iou_scores`	List of IoU scores. TYPE: `Union[mindspore.Tensor, tf.Tensor]`
`original_size`	Size of the original image. TYPE: `Tuple[int,int]`
`cropped_box_image`	The cropped image. TYPE: `np.array`
`pred_iou_thresh`	The threshold for the iou scores. TYPE: `float`, optional, defaults to 0.88 DEFAULT: `0.88`
`stability_score_thresh`	The threshold for the stability score. TYPE: `float`, optional, defaults to 0.95 DEFAULT: `0.95`
`mask_threshold`	The threshold for the predicted masks. TYPE: `float`, optional, defaults to 0 DEFAULT: `0`
`stability_score_offset`	The offset for the stability score used in the `_compute_stability_score` method. TYPE: `float`, optional, defaults to 1 DEFAULT: `1`
`return_tensors`	If `ms`, returns `mindspore.Tensor`. If `tf`, returns `tf.Tensor`. TYPE: `str`, optional, defaults to `ms` DEFAULT: `'ms'`

Source code in mindnlp\transformers\models\sam\image_processing_sam.py

def filter_masks(
    self,
    masks,
    iou_scores,
    original_size,
    cropped_box_image,
    pred_iou_thresh=0.88,
    stability_score_thresh=0.95,
    mask_threshold=0,
    stability_score_offset=1,
    return_tensors="ms",
):
    """
    Filters the predicted masks, keeping only those that meet several criteria. The first criterion is that the
    IoU score must be greater than `pred_iou_thresh`. The second criterion is that the stability score must be
    greater than `stability_score_thresh`. The method also converts the predicted masks to bounding boxes and
    pads the predicted masks if necessary.

    Args:
        masks (`Union[mindspore.Tensor, tf.Tensor]`):
            Input masks.
        iou_scores (`Union[mindspore.Tensor, tf.Tensor]`):
            List of IoU scores.
        original_size (`Tuple[int,int]`):
            Size of the original image.
        cropped_box_image (`np.array`):
            The cropped image.
        pred_iou_thresh (`float`, *optional*, defaults to 0.88):
            The threshold for the iou scores.
        stability_score_thresh (`float`, *optional*, defaults to 0.95):
            The threshold for the stability score.
        mask_threshold (`float`, *optional*, defaults to 0):
            The threshold for the predicted masks.
        stability_score_offset (`float`, *optional*, defaults to 1):
            The offset for the stability score used in the `_compute_stability_score` method.
        return_tensors (`str`, *optional*, defaults to `ms`):
            If `ms`, returns `mindspore.Tensor`. If `tf`, returns `tf.Tensor`.
    """
    if return_tensors == "ms":
        return self._filter_masks(
            masks=masks,
            iou_scores=iou_scores,
            original_size=original_size,
            cropped_box_image=cropped_box_image,
            pred_iou_thresh=pred_iou_thresh,
            stability_score_thresh=stability_score_thresh,
            mask_threshold=mask_threshold,
            stability_score_offset=stability_score_offset,
        )
    elif return_tensors == "tf":
        return self._filter_masks_tf(
            masks=masks,
            iou_scores=iou_scores,
            original_size=original_size,
            cropped_box_image=cropped_box_image,
            pred_iou_thresh=pred_iou_thresh,
            stability_score_thresh=stability_score_thresh,
            mask_threshold=mask_threshold,
            stability_score_offset=stability_score_offset,
        )
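
The two score criteria can be sketched without any tensor library. The stability score below follows the formulation used in the original SAM reference code (IoU between the mask binarized at a raised and at a lowered threshold); whether the library's `_compute_stability_score` matches it in every detail is an assumption. The names `stability_score` and `filter_by_scores` are hypothetical, and the box conversion, crop-edge check, padding and RLE steps are left out.

```python
from typing import List, Sequence

Mask = List[List[float]]  # one mask as rows of logits


def stability_score(mask: Mask, mask_threshold: float, offset: float) -> float:
    """IoU between the mask binarized at a high and at a low threshold."""
    values = [v for row in mask for v in row]
    high = sum(v > mask_threshold + offset for v in values)  # intersection
    low = sum(v > mask_threshold - offset for v in values)  # union
    return high / low if low else 0.0


def filter_by_scores(
    masks: Sequence[Mask],
    iou_scores: Sequence[float],
    pred_iou_thresh: float = 0.88,
    stability_score_thresh: float = 0.95,
    mask_threshold: float = 0.0,
    stability_score_offset: float = 1.0,
) -> List[int]:
    """Return the indices of masks that pass both thresholds (strictly greater)."""
    kept = []
    for index, (mask, iou) in enumerate(zip(masks, iou_scores)):
        if iou <= pred_iou_thresh:
            continue
        score = stability_score(mask, mask_threshold, stability_score_offset)
        if score <= stability_score_thresh:
            continue
        kept.append(index)
    return kept


crisp = [[10.0, 10.0], [-10.0, -10.0]]  # logits far from the threshold
fuzzy = [[10.0, 0.5], [-0.5, -10.0]]  # logits close to the threshold
kept = filter_by_scores([crisp, fuzzy, crisp], [0.95, 0.97, 0.50])
print(kept)
```

Only the first mask survives: the second fails the stability check because small threshold shifts change its area, and the third fails the IoU check.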

`mindnlp.transformers.models.sam.image_processing_sam.SamImageProcessor.generate_crop_boxes(image, target_size, crop_n_layers=0, overlap_ratio=512 / 1500, points_per_crop=32, crop_n_points_downscale_factor=1, input_data_format=None, return_tensors='ms')` ¶

Generates a list of crop boxes of different sizes. Each layer has (2**i)**2 boxes for the ith layer.

PARAMETER	DESCRIPTION
`image`	Input original image TYPE: `np.array`
`target_size`	Target size of the resized image TYPE: `int`
`crop_n_layers`	If >0, mask prediction will be run again on crops of the image. Sets the number of layers to run, where each layer has `2**i_layer` number of image crops. TYPE: `int`, optional, defaults to 0 DEFAULT: `0`
`overlap_ratio`	Sets the degree to which crops overlap. In the first crop layer, crops will overlap by this fraction of the image length. Later layers with more crops scale down this overlap. TYPE: `float`, optional, defaults to 512/1500 DEFAULT: `512 / 1500`
`points_per_crop`	Number of points to sample from each crop. TYPE: `int`, optional, defaults to 32 DEFAULT: `32`
`crop_n_points_downscale_factor`	The number of points-per-side sampled in layer n is scaled down by `crop_n_points_downscale_factor**n`. TYPE: `List[int]`, optional, defaults to 1 DEFAULT: `1`
`input_data_format`	The channel dimension format of the input image. If not provided, it will be inferred. TYPE: `str` or `ChannelDimension`, optional DEFAULT: `None`
`return_tensors`	If `ms`, returns `mindspore.Tensor`. Any other value raises a `ValueError`. TYPE: `str`, optional, defaults to `ms` DEFAULT: `'ms'`

Source code in mindnlp\transformers\models\sam\image_processing_sam.py

def generate_crop_boxes(
    self,
    image,
    target_size,
    crop_n_layers: int = 0,
    overlap_ratio: float = 512 / 1500,
    points_per_crop: Optional[int] = 32,
    crop_n_points_downscale_factor: Optional[List[int]] = 1,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
    return_tensors: str = "ms",
):
    """
    Generates a list of crop boxes of different sizes. Each layer has (2**i)**2 boxes for the ith layer.

    Args:
        image (`np.array`):
            Input original image
        target_size (`int`):
            Target size of the resized image
        crop_n_layers (`int`, *optional*, defaults to 0):
            If >0, mask prediction will be run again on crops of the image. Sets the number of layers to run, where
            each layer has 2**i_layer number of image crops.
        overlap_ratio (`float`, *optional*, defaults to 512/1500):
            Sets the degree to which crops overlap. In the first crop layer, crops will overlap by this fraction of
            the image length. Later layers with more crops scale down this overlap.
        points_per_crop (`int`, *optional*, defaults to 32):
            Number of points to sample from each crop.
        crop_n_points_downscale_factor (`List[int]`, *optional*, defaults to 1):
            The number of points-per-side sampled in layer n is scaled down by crop_n_points_downscale_factor**n.
        input_data_format (`str` or `ChannelDimension`, *optional*):
            The channel dimension format of the input image. If not provided, it will be inferred.
        return_tensors (`str`, *optional*, defaults to `ms`):
            If `ms`, returns `mindspore.Tensor`. Any other value raises a `ValueError`.
    """
    crop_boxes, points_per_crop, cropped_images, input_labels = _generate_crop_boxes(
        image,
        target_size,
        crop_n_layers,
        overlap_ratio,
        points_per_crop,
        crop_n_points_downscale_factor,
        input_data_format,
    )
    if return_tensors == "ms":
        crop_boxes = mindspore.tensor(crop_boxes)
        points_per_crop = mindspore.tensor(points_per_crop)
        # cropped_images stays as np
        input_labels = mindspore.tensor(input_labels)
    else:
        raise ValueError("return_tensors must be 'ms'.")
    return crop_boxes, points_per_crop, cropped_images, input_labels
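
To make the layer structure concrete, the following sketch generates crop boxes so that layer `i` holds `(2**i)**2` boxes. It is modeled on the crop generation in the original SAM reference code; the internal `_generate_crop_boxes` is not shown on this page and may differ, and `sketch_crop_boxes` is a hypothetical name. Point sampling and image cropping are omitted.

```python
import math
from itertools import product
from typing import List, Tuple


def sketch_crop_boxes(
    image_size: Tuple[int, int], crop_n_layers: int, overlap_ratio: float = 512 / 1500
) -> Tuple[List[List[int]], List[int]]:
    """Return XYXY crop boxes and the layer index of each box (illustrative only)."""
    height, width = image_size
    short_side = min(height, width)

    # Layer 0 is always the full image: (2**0)**2 == 1 box.
    boxes = [[0, 0, width, height]]
    layers = [0]

    for layer in range(1, crop_n_layers + 1):
        crops_per_side = 2**layer
        # Later layers have more crops per side, so the overlap is scaled down.
        overlap = int(overlap_ratio * short_side * (2 / crops_per_side))
        crop_w = math.ceil((overlap * (crops_per_side - 1) + width) / crops_per_side)
        crop_h = math.ceil((overlap * (crops_per_side - 1) + height) / crops_per_side)

        x_starts = [int((crop_w - overlap) * i) for i in range(crops_per_side)]
        y_starts = [int((crop_h - overlap) * i) for i in range(crops_per_side)]
        for x0, y0 in product(x_starts, y_starts):
            # Clamp each box to the image bounds.
            boxes.append([x0, y0, min(x0 + crop_w, width), min(y0 + crop_h, height)])
            layers.append(layer)
    return boxes, layers


boxes, layers = sketch_crop_boxes((1000, 1500), crop_n_layers=2)
for layer in sorted(set(layers)):
    print(layer, layers.count(layer))
```

With `crop_n_layers=0` only the full-image box is produced, which matches the default behaviour of running mask prediction once on the whole image.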

`mindnlp.transformers.models.sam.image_processing_sam.SamImageProcessor.pad_image(image, pad_size, data_format=None, input_data_format=None, **kwargs)` ¶

Pad an image to (pad_size["height"], pad_size["width"]) with zeros to the right and bottom.

PARAMETER	DESCRIPTION
`image`	Image to pad. TYPE: `np.ndarray`
`pad_size`	Size of the output image after padding. TYPE: `Dict[str, int]`
`data_format`	The data format of the image. Can be either "channels_first" or "channels_last". If `None`, the `data_format` of the `image` will be used. TYPE: `str` or `ChannelDimension`, optional DEFAULT: `None`
`input_data_format`	The channel dimension format of the input image. If not provided, it will be inferred. TYPE: `str` or `ChannelDimension`, optional DEFAULT: `None`

Source code in mindnlp\transformers\models\sam\image_processing_sam.py

def pad_image(
    self,
    image: np.ndarray,
    pad_size: Dict[str, int],
    data_format: Optional[Union[str, ChannelDimension]] = None,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
    **kwargs,
) -> np.ndarray:
    """
    Pad an image to `(pad_size["height"], pad_size["width"])` with zeros to the right and bottom.

    Args:
        image (`np.ndarray`):
            Image to pad.
        pad_size (`Dict[str, int]`):
            Size of the output image after padding.
        data_format (`str` or `ChannelDimension`, *optional*):
            The data format of the image. Can be either "channels_first" or "channels_last". If `None`, the
            `data_format` of the `image` will be used.
        input_data_format (`str` or `ChannelDimension`, *optional*):
            The channel dimension format of the input image. If not provided, it will be inferred.
    """
    output_height, output_width = pad_size["height"], pad_size["width"]
    input_height, input_width = get_image_size(image, channel_dim=input_data_format)

    pad_width = output_width - input_width
    pad_height = output_height - input_height

    padded_image = pad(
        image,
        ((0, pad_height), (0, pad_width)),
        data_format=data_format,
        input_data_format=input_data_format,
        **kwargs,
    )
    return padded_image
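
The padding arithmetic is easy to check on a tiny single-channel image stored as nested lists. This is a minimal sketch of the same right-and-bottom zero padding; `pad_bottom_right` is a hypothetical name, and the explicit size check is an addition of this sketch rather than something the method above performs.

```python
from typing import Dict, List


def pad_bottom_right(image: List[List[int]], pad_size: Dict[str, int]) -> List[List[int]]:
    """Zero-pad a 2D image on the right and bottom to the requested size."""
    output_height, output_width = pad_size["height"], pad_size["width"]
    input_height, input_width = len(image), len(image[0])

    pad_height = output_height - input_height
    pad_width = output_width - input_width
    if pad_height < 0 or pad_width < 0:
        # Added for this sketch: padding cannot shrink an image.
        raise ValueError("pad_size must be at least as large as the image")

    padded = [list(row) + [0] * pad_width for row in image]  # pad on the right
    padded += [[0] * output_width for _ in range(pad_height)]  # pad at the bottom
    return padded


padded = pad_bottom_right([[1, 2], [3, 4]], {"height": 3, "width": 4})
for row in padded:
    print(row)
```

The original pixels stay in the top-left corner, so coordinates of prompts and masks in the unpadded image remain valid after padding.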

`mindnlp.transformers.models.sam.image_processing_sam.SamImageProcessor.post_process_for_mask_generation(all_masks, all_scores, all_boxes, crops_nms_thresh, return_tensors='ms')` ¶

Post-processes the generated masks by running the Non-Maximum Suppression (NMS) algorithm on the predicted masks.

PARAMETER	DESCRIPTION
`all_masks`	List of all predicted segmentation masks TYPE: `Union[List[mindspore.Tensor], List[tf.Tensor]]`
`all_scores`	List of all predicted iou scores TYPE: `Union[List[mindspore.Tensor], List[tf.Tensor]]`
`all_boxes`	List of all bounding boxes of the predicted masks TYPE: `Union[List[mindspore.Tensor], List[tf.Tensor]]`
`crops_nms_thresh`	Threshold for NMS (Non Maximum Suppression) algorithm. TYPE: `float`
`return_tensors`	If `ms`, returns `mindspore.Tensor`. Other values are not handled, and the method returns `None`. TYPE: `str`, optional, defaults to `ms` DEFAULT: `'ms'`

Source code in mindnlp\transformers\models\sam\image_processing_sam.py

def post_process_for_mask_generation(
    self, all_masks, all_scores, all_boxes, crops_nms_thresh, return_tensors="ms"
):
    """
    Post-processes the generated masks by running the Non-Maximum Suppression (NMS) algorithm on the predicted masks.

    Args:
        all_masks (`Union[List[mindspore.Tensor], List[tf.Tensor]]`):
            List of all predicted segmentation masks
        all_scores (`Union[List[mindspore.Tensor], List[tf.Tensor]]`):
            List of all predicted iou scores
        all_boxes (`Union[List[mindspore.Tensor], List[tf.Tensor]]`):
            List of all bounding boxes of the predicted masks
        crops_nms_thresh (`float`):
            Threshold for NMS (Non Maximum Suppression) algorithm.
        return_tensors (`str`, *optional*, defaults to `ms`):
            If `ms`, returns `mindspore.Tensor`. Other values are not handled, and the method returns `None`.
    """
    if return_tensors == "ms":
        return _postprocess_for_mg(all_masks, all_scores, all_boxes, crops_nms_thresh)
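
For readers unfamiliar with NMS, the idea can be sketched as a greedy loop over boxes sorted by score. This is a generic illustration of the algorithm, not the body of `_postprocess_for_mg`, which is not shown on this page; `box_iou` and `greedy_nms` are hypothetical names, and boxes are assumed to be in (x0, y0, x1, y1) format.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def greedy_nms(boxes: Sequence[Box], scores: Sequence[float], iou_threshold: float) -> List[int]:
    """Keep the best-scoring boxes and drop any box overlapping a kept one too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
keep = greedy_nms(boxes, scores, iou_threshold=0.5)
print(keep)
```

A lower `crops_nms_thresh` removes more near-duplicate masks produced by overlapping crops, while a higher value keeps more of them.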

`mindnlp.transformers.models.sam.image_processing_sam.SamImageProcessor.post_process_masks(masks, original_sizes, reshaped_input_sizes, mask_threshold=0.0, binarize=True, pad_size=None, return_tensors='ms')` ¶

Remove padding and upscale masks to the original image size.

PARAMETER	DESCRIPTION
`masks`	Batched masks from the mask_decoder in (batch_size, num_channels, height, width) format. TYPE: `Union[List[mindspore.Tensor], List[np.ndarray], List[tf.Tensor]]`
`original_sizes`	The original sizes of each image before it was resized to the model's expected input shape, in (height, width) format. TYPE: `Union[mindspore.Tensor, tf.Tensor, List[Tuple[int,int]]]`
`reshaped_input_sizes`	The size of each image as it is fed to the model, in (height, width) format. Used to remove padding. TYPE: `Union[mindspore.Tensor, tf.Tensor, List[Tuple[int,int]]]`
`mask_threshold`	The threshold to use for binarizing the masks. TYPE: `float`, optional, defaults to 0.0 DEFAULT: `0.0`
`binarize`	Whether to binarize the masks. TYPE: `bool`, optional, defaults to `True` DEFAULT: `True`
`pad_size`	The target size the images were padded to before being passed to the model. If None, the target size is assumed to be the processor's `pad_size`. TYPE: `int`, optional, defaults to `self.pad_size` DEFAULT: `None`
`return_tensors`	If `"ms"`, return MindSpore tensors. Any other value raises a `ValueError`. TYPE: `str`, optional, defaults to `"ms"` DEFAULT: `'ms'`

RETURNS	DESCRIPTION
`Union[mindspore.Tensor, tf.Tensor]`	Batched masks in (batch_size, num_channels, height, width) format, where (height, width) is given by original_size.

Source code in mindnlp\transformers\models\sam\image_processing_sam.py

def post_process_masks(
    self,
    masks,
    original_sizes,
    reshaped_input_sizes,
    mask_threshold=0.0,
    binarize=True,
    pad_size=None,
    return_tensors="ms",
):
    """
    Remove padding and upscale masks to the original image size.

    Args:
        masks (`Union[List[mindspore.Tensor], List[np.ndarray], List[tf.Tensor]]`):
            Batched masks from the mask_decoder in (batch_size, num_channels, height, width) format.
        original_sizes (`Union[mindspore.Tensor, tf.Tensor, List[Tuple[int,int]]]`):
            The original sizes of each image before it was resized to the model's expected input shape, in (height,
            width) format.
        reshaped_input_sizes (`Union[mindspore.Tensor, tf.Tensor, List[Tuple[int,int]]]`):
            The size of each image as it is fed to the model, in (height, width) format. Used to remove padding.
        mask_threshold (`float`, *optional*, defaults to 0.0):
            The threshold to use for binarizing the masks.
        binarize (`bool`, *optional*, defaults to `True`):
            Whether to binarize the masks.
        pad_size (`int`, *optional*, defaults to `self.pad_size`):
            The target size the images were padded to before being passed to the model. If None, the target size is
            assumed to be the processor's `pad_size`.
        return_tensors (`str`, *optional*, defaults to `"ms"`):
            If `"ms"`, return MindSpore tensors. Any other value raises a `ValueError`.

    Returns:
        (`Union[mindspore.Tensor, tf.Tensor]`): Batched masks in (batch_size, num_channels, height, width) format, where
        (height, width) is given by original_size.
    """
    if return_tensors == "ms":
        return self._post_process_masks_ms(
            masks=masks,
            original_sizes=original_sizes,
            reshaped_input_sizes=reshaped_input_sizes,
            mask_threshold=mask_threshold,
            binarize=binarize,
            pad_size=pad_size,
        )
    else:
        raise ValueError("return_tensors must be 'ms'.")
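
The order of operations implied by "remove padding and upscale" can be sketched on a single mask stored as nested lists. This is an assumption-laden illustration: `_post_process_masks_ms` is not shown on this page, the step order follows the upstream Hugging Face implementation, and nearest-neighbour resizing is swapped in here for simplicity where a real implementation would typically interpolate. `resize_nearest` and `post_process_mask` are hypothetical names.

```python
from typing import Dict, List, Tuple


def resize_nearest(mask: List[List[float]], out_h: int, out_w: int) -> List[List[float]]:
    """Nearest-neighbour resize of a 2D grid (a simplification of interpolation)."""
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)] for r in range(out_h)]


def post_process_mask(
    mask: List[List[float]],
    original_size: Tuple[int, int],
    reshaped_input_size: Tuple[int, int],
    pad_size: Dict[str, int],
    mask_threshold: float = 0.0,
    binarize: bool = True,
) -> list:
    # 1. Upscale the low-resolution mask to the padded model input size.
    upscaled = resize_nearest(mask, pad_size["height"], pad_size["width"])
    # 2. Remove the padding by cropping to the resized (unpadded) image size.
    height, width = reshaped_input_size
    cropped = [row[:width] for row in upscaled[:height]]
    # 3. Resize to the original image size.
    restored = resize_nearest(cropped, original_size[0], original_size[1])
    # 4. Optionally turn logits into a boolean mask.
    if binarize:
        return [[value > mask_threshold for value in row] for row in restored]
    return restored


low_res = [[5.0, -5.0], [-5.0, -5.0]]
result = post_process_mask(
    low_res,
    original_size=(8, 4),
    reshaped_input_size=(4, 2),
    pad_size={"height": 4, "width": 4},
)
print(len(result), len(result[0]))
```

Cropping before the final resize matters: the padded region carries no image content, so resizing it along with the mask would distort the result.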

`mindnlp.transformers.models.sam.image_processing_sam.SamImageProcessor.preprocess(images, segmentation_maps=None, do_resize=None, size=None, mask_size=None, resample=None, do_rescale=None, rescale_factor=None, do_normalize=None, image_mean=None, image_std=None, do_pad=None, pad_size=None, mask_pad_size=None, do_convert_rgb=None, return_tensors=None, data_format=ChannelDimension.FIRST, input_data_format=None, **kwargs)` ¶

Preprocess an image or batch of images.

PARAMETER	DESCRIPTION
`images`	Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set `do_rescale=False`. TYPE: `ImageInput`
`segmentation_maps`	Segmentation map to preprocess. TYPE: `ImageInput`, optional DEFAULT: `None`
`do_resize`	Whether to resize the image. TYPE: `bool`, optional, defaults to `self.do_resize` DEFAULT: `None`
`size`	Controls the size of the image after `resize`. The longest edge of the image is resized to `size["longest_edge"]` whilst preserving the aspect ratio. TYPE: `Dict[str, int]`, optional, defaults to `self.size` DEFAULT: `None`
`mask_size`	Controls the size of the segmentation map after `resize`. The longest edge of the segmentation map is resized to `mask_size["longest_edge"]` whilst preserving the aspect ratio. TYPE: `Dict[str, int]`, optional, defaults to `self.mask_size` DEFAULT: `None`
`resample`	`PILImageResampling` filter to use when resizing the image e.g. `PILImageResampling.BILINEAR`. TYPE: `PILImageResampling`, optional, defaults to `self.resample` DEFAULT: `None`
`do_rescale`	Whether to rescale the image pixel values by rescaling factor. TYPE: `bool`, optional, defaults to `self.do_rescale` DEFAULT: `None`
`rescale_factor`	Rescale factor to apply to the image pixel values. TYPE: `int` or `float`, optional, defaults to `self.rescale_factor` DEFAULT: `None`
`do_normalize`	Whether to normalize the image. TYPE: `bool`, optional, defaults to `self.do_normalize` DEFAULT: `None`
`image_mean`	Image mean to normalize the image by if `do_normalize` is set to `True`. TYPE: `float` or `List[float]`, optional, defaults to `self.image_mean` DEFAULT: `None`
`image_std`	Image standard deviation to normalize the image by if `do_normalize` is set to `True`. TYPE: `float` or `List[float]`, optional, defaults to `self.image_std` DEFAULT: `None`
`do_pad`	Whether to pad the image. TYPE: `bool`, optional, defaults to `self.do_pad` DEFAULT: `None`
`pad_size`	Controls the size of the padding applied to the image. The image is padded to `pad_size["height"]` and `pad_size["width"]` if `do_pad` is set to `True`. TYPE: `Dict[str, int]`, optional, defaults to `self.pad_size` DEFAULT: `None`
`mask_pad_size`	Controls the size of the padding applied to the segmentation map. The image is padded to `mask_pad_size["height"]` and `mask_pad_size["width"]` if `do_pad` is set to `True`. TYPE: `Dict[str, int]`, optional, defaults to `self.mask_pad_size` DEFAULT: `None`
`do_convert_rgb`	Whether to convert the image to RGB. TYPE: `bool`, optional, defaults to `self.do_convert_rgb` DEFAULT: `None`
`return_tensors`	The type of tensors to return. Can be one of: Unset: Return a list of `np.ndarray`. `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`. `TensorType.PYTORCH` or `'pt'`: Return a batch of type `mindspore.Tensor`. `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`. `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`. TYPE: `str` or `TensorType`, optional DEFAULT: `None`
`data_format`	The channel dimension format for the output image. Can be one of: `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format. `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format. Unset: Use the channel dimension format of the input image. TYPE: `ChannelDimension` or `str`, optional, defaults to `ChannelDimension.FIRST` DEFAULT: `FIRST`
`input_data_format`	The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of: `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format. `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format. `"none"` or `ChannelDimension.NONE`: image in (height, width) format. TYPE: `ChannelDimension` or `str`, optional DEFAULT: `None`
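
The numeric effect of the resize, rescale and normalize parameters can be sketched as follows. The rounding rule in `longest_edge_shape` is an assumption based on the upstream Hugging Face implementation, both function names are hypothetical, and the mean and standard deviation are the standard ImageNet values that `IMAGENET_DEFAULT_MEAN` and `IMAGENET_DEFAULT_STD` refer to.

```python
from typing import Sequence, Tuple

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def longest_edge_shape(height: int, width: int, longest_edge: int) -> Tuple[int, int]:
    """Output (height, width) when the longest edge is resized, keeping the aspect ratio."""
    scale = longest_edge / max(height, width)
    return int(height * scale + 0.5), int(width * scale + 0.5)


def rescale_and_normalize(
    pixel: Sequence[float],
    rescale_factor: float = 1 / 255,
    mean: Sequence[float] = IMAGENET_MEAN,
    std: Sequence[float] = IMAGENET_STD,
) -> Tuple[float, ...]:
    """Apply `do_rescale` then `do_normalize` to one RGB pixel."""
    return tuple((c * rescale_factor - m) / s for c, m, s in zip(pixel, mean, std))


shape = longest_edge_shape(480, 640, longest_edge=1024)
red = rescale_and_normalize((255, 0, 0))
print(shape)
print(red)
```

Images that are already in the 0 to 1 range should skip the rescale step (`do_rescale=False`), otherwise the values are scaled twice before normalization.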

Source code in mindnlp\transformers\models\sam\image_processing_sam.py

def preprocess(
    self,
    images: ImageInput,
    segmentation_maps: Optional[ImageInput] = None,
    do_resize: Optional[bool] = None,
    size: Optional[Dict[str, int]] = None,
    mask_size: Optional[Dict[str, int]] = None,
    resample: Optional["PILImageResampling"] = None,
    do_rescale: Optional[bool] = None,
    rescale_factor: Optional[Union[int, float]] = None,
    do_normalize: Optional[bool] = None,
    image_mean: Optional[Union[float, List[float]]] = None,
    image_std: Optional[Union[float, List[float]]] = None,
    do_pad: Optional[bool] = None,
    pad_size: Optional[Dict[str, int]] = None,
    mask_pad_size: Optional[Dict[str, int]] = None,
    do_convert_rgb: Optional[bool] = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
    data_format: ChannelDimension = ChannelDimension.FIRST,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
    **kwargs,
):
    """
    Preprocess an image or batch of images.

    Args:
        images (`ImageInput`):
            Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
            passing in images with pixel values between 0 and 1, set `do_rescale=False`.
        segmentation_maps (`ImageInput`, *optional*):
            Segmentation map to preprocess.
        do_resize (`bool`, *optional*, defaults to `self.do_resize`):
            Whether to resize the image.
        size (`Dict[str, int]`, *optional*, defaults to `self.size`):
            Controls the size of the image after `resize`. The longest edge of the image is resized to
            `size["longest_edge"]` whilst preserving the aspect ratio.
        mask_size (`Dict[str, int]`, *optional*, defaults to `self.mask_size`):
            Controls the size of the segmentation map after `resize`. The longest edge of the segmentation map is
            resized to `mask_size["longest_edge"]` whilst preserving the aspect ratio.
        resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
            `PILImageResampling` filter to use when resizing the image e.g. `PILImageResampling.BILINEAR`.
        do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
            Whether to rescale the image pixel values by rescaling factor.
        rescale_factor (`int` or `float`, *optional*, defaults to `self.rescale_factor`):
            Rescale factor to apply to the image pixel values.
        do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
            Whether to normalize the image.
        image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
            Image mean to normalize the image by if `do_normalize` is set to `True`.
        image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
            Image standard deviation to normalize the image by if `do_normalize` is set to `True`.
        do_pad (`bool`, *optional*, defaults to `self.do_pad`):
            Whether to pad the image.
        pad_size (`Dict[str, int]`, *optional*, defaults to `self.pad_size`):
            Controls the size of the padding applied to the image. The image is padded to `pad_size["height"]` and
            `pad_size["width"]` if `do_pad` is set to `True`.
        mask_pad_size (`Dict[str, int]`, *optional*, defaults to `self.mask_pad_size`):
            Controls the size of the padding applied to the segmentation map. The image is padded to
            `mask_pad_size["height"]` and `mask_pad_size["width"]` if `do_pad` is set to `True`.
        do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
            Whether to convert the image to RGB.
        return_tensors (`str` or `TensorType`, *optional*):
            The type of tensors to return. Can be one of:

            - Unset: Return a list of `np.ndarray`.
            - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
            - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `mindspore.Tensor`.
            - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
            - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
        data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
            The channel dimension format for the output image. Can be one of:

            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - Unset: Use the channel dimension format of the input image.
        input_data_format (`ChannelDimension` or `str`, *optional*):
            The channel dimension format for the input image. If unset, the channel dimension format is inferred
            from the input image. Can be one of:

            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
    """
    do_resize = do_resize if do_resize is not None else self.do_resize
    size = size if size is not None else self.size
    size = get_size_dict(max_size=size, default_to_square=False) if not isinstance(size, dict) else size
    mask_size = mask_size if mask_size is not None else self.mask_size
    mask_size = (
        get_size_dict(max_size=mask_size, default_to_square=False)
        if not isinstance(mask_size, dict)
        else mask_size
    )
    resample = resample if resample is not None else self.resample
    do_rescale = do_rescale if do_rescale is not None else self.do_rescale
    rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
    do_normalize = do_normalize if do_normalize is not None else self.do_normalize
    image_mean = image_mean if image_mean is not None else self.image_mean
    image_std = image_std if image_std is not None else self.image_std
    do_pad = do_pad if do_pad is not None else self.do_pad
    pad_size = pad_size if pad_size is not None else self.pad_size
    pad_size = get_size_dict(pad_size, default_to_square=True)
    mask_pad_size = mask_pad_size if mask_pad_size is not None else self.mask_pad_size
    mask_pad_size = get_size_dict(mask_pad_size, default_to_square=True)
    do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb

    images = make_list_of_images(images)

    validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self._valid_processor_keys)

    if not valid_images(images):
        raise ValueError(
            "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
            "mindspore.Tensor, tf.Tensor or jax.ndarray."
        )

    if segmentation_maps is not None:
        segmentation_maps = make_list_of_images(segmentation_maps, expected_ndims=2)

        if not valid_images(segmentation_maps):
            raise ValueError(
                "Invalid segmentation map type. Must be of type PIL.Image.Image, numpy.ndarray, "
                "mindspore.Tensor, tf.Tensor or jax.ndarray."
            )
    validate_preprocess_arguments(
        do_rescale=do_rescale,
        rescale_factor=rescale_factor,
        do_normalize=do_normalize,
        image_mean=image_mean,
        image_std=image_std,
        do_pad=do_pad,
        size_divisibility=pad_size,  # Here _preprocess needs do_pad and pad_size.
        do_resize=do_resize,
        size=size,
        resample=resample,
    )

    images, original_sizes, reshaped_input_sizes = zip(
        *(
            self._preprocess_image(
                image=img,
                do_resize=do_resize,
                size=size,
                resample=resample,
                do_rescale=do_rescale,
                rescale_factor=rescale_factor,
                do_normalize=do_normalize,
                image_mean=image_mean,
                image_std=image_std,
                do_pad=do_pad,
                pad_size=pad_size,
                do_convert_rgb=do_convert_rgb,
                data_format=data_format,
                input_data_format=input_data_format,
            )
            for img in images
        )
    )

    data = {
        "pixel_values": images,
        "original_sizes": original_sizes,
        "reshaped_input_sizes": reshaped_input_sizes,
    }

    if segmentation_maps is not None:
        segmentation_maps, original_mask_sizes = zip(
            *(
                self._preprocess_mask(
                    segmentation_map=mask,
                    do_resize=do_resize,
                    mask_size=mask_size,
                    do_pad=do_pad,
                    mask_pad_size=mask_pad_size,
                    input_data_format=input_data_format,
                )
                for mask in segmentation_maps
            )
        )

        # masks should start out the same size as input images
        assert all(
            original_im_size == original_mask_size
            for original_im_size, original_mask_size in zip(original_sizes, original_mask_sizes)
        ), "Segmentation maps should be the same size as input images."

        data["labels"] = segmentation_maps

    return BatchFeature(data=data, tensor_type=return_tensors)
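
The per-image steps that follow resizing (rescale, normalize, pad) can be illustrated with a small NumPy sketch. This is not the library's `_preprocess_image`: the helper name `rescale_normalize_pad` is hypothetical, the input is assumed to be in channels-last layout, and zero padding on the bottom and right edges is an assumption about how `pad_size` is applied.

```python
import numpy as np

def rescale_normalize_pad(image, rescale_factor, image_mean, image_std, pad_size):
    """Hypothetical helper: rescale, normalize, then zero-pad an (H, W, C) image."""
    # Rescale pixel values, e.g. from [0, 255] to [0, 1].
    image = image.astype(np.float32) * rescale_factor
    # Normalize each channel with its mean and standard deviation.
    mean = np.asarray(image_mean, dtype=np.float32)
    std = np.asarray(image_std, dtype=np.float32)
    image = (image - mean) / std
    # Assumption: padding is added on the bottom and right only, filled with zeros.
    height, width = image.shape[:2]
    pad_h = pad_size["height"] - height
    pad_w = pad_size["width"] - width
    if pad_h < 0 or pad_w < 0:
        raise ValueError("pad_size must be at least as large as the image")
    return np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode="constant")

image = np.full((3, 4, 3), 255, dtype=np.uint8)
out = rescale_normalize_pad(
    image, 1 / 255, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5], {"height": 5, "width": 5}
)
print(out.shape)  # (5, 5, 3)
```

Padding after normalization means the padded region holds exact zeros rather than normalized background values.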

`mindnlp.transformers.models.sam.image_processing_sam.SamImageProcessor.resize(image, size, resample=PILImageResampling.BICUBIC, data_format=None, input_data_format=None, **kwargs)` ¶

Resize an image so that its longest edge equals `size["longest_edge"]`, preserving the aspect ratio.

PARAMETER	DESCRIPTION
`image`	Image to resize. TYPE: `np.ndarray`
`size`	Dictionary in the format `{"longest_edge": int}` specifying the size of the output image. The longest edge of the image will be resized to the specified size, while the other edge will be resized to maintain the aspect ratio. TYPE: `Dict[str, int]`
`resample`	`PILImageResampling` filter to use when resizing the image e.g. `PILImageResampling.BILINEAR`. TYPE: `PILImageResampling` DEFAULT: `BICUBIC`
`data_format`	The channel dimension format for the output image. If unset, the channel dimension format of the input image is used. Can be one of: `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format. `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format. TYPE: `ChannelDimension` or `str`, optional DEFAULT: `None`
`input_data_format`	The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of: `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format. `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format. TYPE: `ChannelDimension` or `str`, optional DEFAULT: `None`

RETURNS	DESCRIPTION
`ndarray`	`np.ndarray`: The resized image.
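
The longest-edge rule described above can be sketched as a small shape calculation. The function name `longest_edge_shape` is hypothetical, and rounding to the nearest integer is an assumption about how the output shape is derived; the library's own `_get_preprocess_shape` may differ in detail.

```python
from typing import Tuple

def longest_edge_shape(old_shape: Tuple[int, int], longest_edge: int) -> Tuple[int, int]:
    """Hypothetical helper: output (height, width) for a longest-edge resize."""
    old_h, old_w = old_shape
    # One scale factor for both edges keeps the aspect ratio.
    scale = longest_edge / max(old_h, old_w)
    # Assumption: round to the nearest integer pixel count.
    new_h = int(old_h * scale + 0.5)
    new_w = int(old_w * scale + 0.5)
    return new_h, new_w

print(longest_edge_shape((480, 640), 1024))  # (768, 1024)
```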

Source code in mindnlp\transformers\models\sam\image_processing_sam.py

def resize(
    self,
    image: np.ndarray,
    size: Dict[str, int],
    resample: PILImageResampling = PILImageResampling.BICUBIC,
    data_format: Optional[Union[str, ChannelDimension]] = None,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
    **kwargs,
) -> np.ndarray:
    """
    Resize an image so that its longest edge equals `size["longest_edge"]`, preserving the aspect ratio.

    Args:
        image (`np.ndarray`):
            Image to resize.
        size (`Dict[str, int]`):
            Dictionary in the format `{"longest_edge": int}` specifying the size of the output image. The longest
            edge of the image will be resized to the specified size, while the other edge will be resized to
            maintain the aspect ratio.
        resample:
            `PILImageResampling` filter to use when resizing the image e.g. `PILImageResampling.BILINEAR`.
        data_format (`ChannelDimension` or `str`, *optional*):
            The channel dimension format for the output image. If unset, the channel dimension format of the input
            image is used. Can be one of:

            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
        input_data_format (`ChannelDimension` or `str`, *optional*):
            The channel dimension format for the input image. If unset, the channel dimension format is inferred
            from the input image. Can be one of:

            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.

    Returns:
        `np.ndarray`: The resized image.
    """
    size = get_size_dict(size)
    if "longest_edge" not in size:
        raise ValueError(f"The `size` dictionary must contain the key `longest_edge`. Got {size.keys()}")
    input_size = get_image_size(image, channel_dim=input_data_format)
    output_height, output_width = self._get_preprocess_shape(input_size, size["longest_edge"])
    return resize(
        image,
        size=(output_height, output_width),
        resample=resample,
        data_format=data_format,
        input_data_format=input_data_format,
        **kwargs,
    )

`mindnlp.transformers.models.sam.image_processing_sam.batched_nms(boxes, scores, idxs, iou_threshold)` ¶

Performs non-maximum suppression in a batched fashion.

Each index value corresponds to a category, and NMS will not be applied between elements of different categories.

PARAMETER	DESCRIPTION
`boxes`	boxes where NMS will be performed. They are expected to be in `(x1, y1, x2, y2)` format with `0 <= x1 < x2` and `0 <= y1 < y2`. TYPE: `Tensor[N, 4]`
`scores`	scores for each one of the boxes TYPE: `Tensor[N]`
`idxs`	indices of the categories for each one of the boxes. TYPE: `Tensor[N]`
`iou_threshold`	discards all overlapping boxes with IoU > iou_threshold TYPE: `float`

RETURNS	DESCRIPTION
`Tensor`	int64 tensor with the indices of the elements that have been kept by NMS, sorted in decreasing order of scores TYPE: `Tensor`
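
One common way to keep categories separate is the "coordinate trick": shift the boxes of each category by an offset larger than any coordinate, so boxes from different categories can never overlap, then run ordinary NMS once. The NumPy sketch below illustrates the idea. It is not the MindSpore implementation, and the names `greedy_nms` and `batched_nms_coordinate_trick` are chosen for this example only.

```python
import numpy as np

def greedy_nms(boxes, scores, iou_threshold):
    """Plain greedy NMS; returns kept indices in decreasing score order."""
    order = np.argsort(-scores, kind="stable")
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Intersection of the best box with every remaining box.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)
        # Keep only boxes whose overlap does not exceed the threshold.
        order = rest[iou <= iou_threshold]
    return np.array(keep, dtype=np.int64)

def batched_nms_coordinate_trick(boxes, scores, idxs, iou_threshold):
    if boxes.size == 0:
        return np.empty((0,), dtype=np.int64)
    # Shift each category by more than the largest coordinate.
    offsets = idxs.astype(boxes.dtype) * (boxes.max() + 1)
    return greedy_nms(boxes + offsets[:, None], scores, iou_threshold)

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [1, 1, 11, 11]], dtype=np.float32)
scores = np.array([0.9, 0.8, 0.7], dtype=np.float32)
idxs = np.array([0, 0, 1])
kept = batched_nms_coordinate_trick(boxes, scores, idxs, iou_threshold=0.5)
print(kept)  # [0 2]
```

Box 1 is suppressed by box 0 because both belong to category 0, while the identical box 2 survives because it belongs to category 1.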

Source code in mindnlp\transformers\models\sam\image_processing_sam.py

def batched_nms(
    boxes: mindspore.Tensor,
    scores: mindspore.Tensor,
    idxs: mindspore.Tensor,
    iou_threshold: float,
) -> mindspore.Tensor:
    """
    Performs non-maximum suppression in a batched fashion.

    Each index value corresponds to a category, and NMS
    will not be applied between elements of different categories.

    Args:
        boxes (Tensor[N, 4]): boxes where NMS will be performed. They
            are expected to be in ``(x1, y1, x2, y2)`` format with ``0 <= x1 < x2`` and ``0 <= y1 < y2``.
        scores (Tensor[N]): scores for each one of the boxes
        idxs (Tensor[N]): indices of the categories for each one of the boxes.
        iou_threshold (float): discards all overlapping boxes with IoU > iou_threshold

    Returns:
        Tensor: int64 tensor with the indices of the elements that have been kept by NMS, sorted
            in decreasing order of scores
    """
    # Benchmarks that drove the following thresholds are at
    # https://github.com/pytorch/vision/issues/1311#issuecomment-781329339
    if boxes.numel() > (4000 if mindspore.get_context('device_target') == "CPU" else 20000):
        return _batched_nms_vanilla(boxes, scores, idxs, iou_threshold)
    else:
        return _batched_nms_coordinate_trick(boxes, scores, idxs, iou_threshold)

`mindnlp.transformers.models.sam.image_processing_sam.nms(boxes, scores, iou_threshold)` ¶

Performs non-maximum suppression (NMS) on a set of bounding boxes.

PARAMETER	DESCRIPTION
`boxes`	A tensor of shape (N, 4) representing the coordinates of the N bounding boxes. Each bounding box is defined by four values: (x_min, y_min, x_max, y_max). TYPE: `Tensor`
`scores`	A tensor of shape (N,) representing the scores associated with each bounding box. TYPE: `Tensor`
`iou_threshold`	The Intersection over Union (IoU) threshold used for NMS. Bounding boxes with IoU greater than or equal to this threshold will be suppressed. TYPE: `float`

RETURNS	DESCRIPTION
`Tensor`	`mindspore.Tensor`: A tensor containing the indices of the selected bounding boxes after NMS. The shape of the returned tensor is (M,), where M is the number of selected bounding boxes.

RAISES	DESCRIPTION
`TypeError`	If any of the input arguments are not of the expected type.
`ValueError`	If the shape of 'boxes' and 'scores' tensors are incompatible or if 'iou_threshold' is not within the valid range.
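
To make the role of `iou_threshold` concrete, the following pure-Python sketch computes the IoU of two boxes in `(x_min, y_min, x_max, y_max)` format. The helper `box_iou` is hypothetical and written for illustration; it is not part of this module.

```python
def box_iou(box_a, box_b):
    """Hypothetical helper: IoU of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

iou = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
print(round(iou, 3))  # 0.333
```

With an `iou_threshold` of 0.3, the lower-scoring of these two boxes would be suppressed; with a threshold of 0.5, both would be kept.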

Source code in mindnlp\transformers\models\sam\image_processing_sam.py

def nms(boxes: mindspore.Tensor, scores: mindspore.Tensor, iou_threshold: float):
    """
    Performs non-maximum suppression (NMS) on a set of bounding boxes.

    Args:
        boxes (mindspore.Tensor): A tensor of shape (N, 4) representing the coordinates of the N bounding boxes. 
            Each bounding box is defined by four values: (x_min, y_min, x_max, y_max).
        scores (mindspore.Tensor): A tensor of shape (N,) representing the scores associated with each bounding box.
        iou_threshold (float): The Intersection over Union (IoU) threshold used for NMS. 
            Bounding boxes with IoU greater than or equal to this threshold will be suppressed.

    Returns:
        mindspore.Tensor: A tensor containing the indices of the selected bounding boxes after NMS. 
            The shape of the returned tensor is (M,), where M is the number of selected bounding boxes.

    Raises:
        TypeError: If any of the input arguments are not of the expected type.
        ValueError: If the shape of 'boxes' and 'scores' tensors are incompatible or if 'iou_threshold'
            is not within the valid range.
    """
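    # NMSWithMask expects an (N, 5) tensor: four box coordinates followed by the score.
    # `ops.stack` cannot combine an (N, 4) tensor with an (N,) tensor, so concatenate instead.
    box_with_score = ops.cat((boxes, scores.reshape(-1, 1)), 1)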
    _, _, selected_mask = _get_cache_prim(mindspore.ops.NMSWithMask)(iou_threshold)(box_with_score)
    return ops.nonzero(selected_mask).reshape(-1)

`mindnlp.transformers.models.sam.modeling_sam` ¶

MindSpore SAM model.

`mindnlp.transformers.models.sam.modeling_sam.SamAttention` ¶

Bases: Module

SAM's attention layer that allows for downscaling the size of the embedding after projection to queries, keys, and values.