How to perform instance segmentation using Mask R-CNN

Key takeaways:

  • Mask R-CNN is a sophisticated computer vision model that integrates object detection and semantic segmentation, enabling precise identification and pixel-level segmentation of objects in images.

  • The backbone network of Mask R-CNN extracts features from the input image, generating feature maps that capture essential details such as edges and textures. Popular backbone architectures include ResNet and the Feature Pyramid Network (FPN).

  • Mask R-CNN predicts segmentation masks for objects within each RoI, providing detailed outlines of object shapes that go beyond traditional detection methods.

Mask R-CNN (Mask Region-based Convolutional Neural Network) is an advanced computer vision model that combines object detection and semantic segmentation, enabling it to identify objects within an image and precisely segment them at the pixel level.

Mask R-CNN components

Mask R-CNN comprises several essential components that together achieve object detection and instance segmentation. The combination of these components allows Mask R-CNN to achieve state-of-the-art results in various computer vision tasks. These components include:

Backbone network

The backbone network is responsible for feature extraction. It takes the entire image as input and generates feature maps that capture critical information for subsequent tasks. These features include edges, corners, shapes, and textures. Popular backbone network architectures, like ResNet and the Feature Pyramid Network (FPN), play a pivotal role in this step. These networks are pretrained on extensive datasets and are adept at extracting high-level features from images. The choice of backbone architecture can significantly influence the performance of Mask R-CNN.
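
To make this concrete, here is a minimal sketch of using a pretrained ResNet50 from tf.keras.applications as a feature extractor; the input size and the dummy image are illustrative, and this generic backbone is a stand-in rather than the exact network Mask R-CNN wires in.

import tensorflow as tf

# Load ResNet50 without its classification head to use it as a
# feature-extraction backbone
backbone = tf.keras.applications.ResNet50(include_top=False, weights='imagenet')
# A dummy input batch standing in for a real image
image = tf.random.uniform((1, 512, 512, 3))
feature_maps = backbone(image)
print(feature_maps.shape)  # (1, 16, 16, 2048)
A sketch of backbone feature extraction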

Region Proposal Network (RPN)

The feature maps produced by the backbone network are used by the Region Proposal Network (RPN), which is responsible for proposing potential object regions within an image. Instead of scanning the input image, the RPN scans the feature maps to find candidate regions. These regions are called regions of interest (RoIs) and are represented as rectangular bounding boxes. Each box is accompanied by a confidence score, which indicates the likelihood that it contains an object.
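
For intuition about where proposals begin, below is a minimal sketch of dense anchor generation over a feature map; in practice, the RPN scores and refines such anchors to produce RoIs. The stride, scales, and ratios are illustrative values, not the exact ones Mask R-CNN uses.

import numpy as np

# Generate one anchor box per (position, scale, ratio) combination,
# in [y1, x1, y2, x2] image coordinates
def generate_anchors(feature_h, feature_w, stride, scales, ratios):
    anchors = []
    for y in range(feature_h):
        for x in range(feature_w):
            cy, cx = y * stride, x * stride  # anchor center in image coordinates
            for s in scales:
                for r in ratios:
                    h, w = s * np.sqrt(r), s / np.sqrt(r)  # r is the h/w aspect ratio
                    anchors.append([cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2])
    return np.array(anchors)

anchors = generate_anchors(4, 4, stride=16, scales=[64, 128], ratios=[0.5, 1, 2])
print(anchors.shape)  # (96, 4)
A sketch of anchor generation for the RPN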

RoIAlign

RoIAlign extracts features from the RoIs within the feature maps generated by the backbone network. It transforms the variably sized RoIs into fixed-size feature maps, which are then used for classification and mask prediction. Unlike earlier pooling approaches that quantize RoI boundaries, RoIAlign uses bilinear interpolation to preserve precise spatial alignment, which is essential for tasks like object detection and instance segmentation.
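
For intuition, the sketch below approximates RoIAlign with tf.image.crop_and_resize, which bilinearly samples each normalized box into a fixed-size feature map; true RoIAlign samples several points per output bin, so this is a simplification rather than the exact operation.

import tensorflow as tf

# Dummy backbone output: one image's 32x32 feature map with 256 channels
feature_maps = tf.random.uniform((1, 32, 32, 256))
# Two RoIs in normalized [y1, x1, y2, x2] coordinates
rois = tf.constant([[0.1, 0.1, 0.6, 0.5],
                    [0.3, 0.4, 0.9, 0.8]])
# Both RoIs come from image 0 in the batch
box_indices = tf.zeros(2, dtype=tf.int32)
# Pool each RoI into a fixed 7x7 feature map via bilinear interpolation
pooled = tf.image.crop_and_resize(feature_maps, rois, box_indices, crop_size=(7, 7))
print(pooled.shape)  # (2, 7, 7, 256)
Approximating RoIAlign with crop_and_resize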

Object detection head

The detection head identifies and localizes objects in the image. This component is responsible for classifying objects and predicting bounding boxes. It takes the fixed-size feature maps from the RoIs and performs object classification, assigning a class label to each proposed object region that determines what the object is, along with refined bounding-box coordinates.
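
A minimal sketch of such a head, assuming 7×7 pooled RoI features and illustrative layer sizes, is shown below: a shared fully connected layer feeds a classification branch and a box-regression branch.

import tensorflow as tf
from tensorflow.keras import layers

num_classes = 81  # 80 COCO classes plus background
roi_features = tf.keras.Input(shape=(7, 7, 256))
x = layers.Flatten()(roi_features)
# Shared fully connected layer
x = layers.Dense(1024, activation='relu')(x)
# Classification branch: one logit per class
class_logits = layers.Dense(num_classes, name='class_logits')(x)
# Box branch: four refinement deltas per class
box_deltas = layers.Dense(num_classes * 4, name='box_deltas')(x)
head = tf.keras.Model(roi_features, [class_logits, box_deltas])
A sketch of the detection head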

Instance segmentation

Mask R-CNN accurately outlines the shape of objects within the proposed regions by predicting pixel-level segmentation masks for each RoI. This enables instance segmentation, which goes beyond traditional object detection by providing a detailed per-object mask. Mask R-CNN predicts these masks in parallel with the existing classification and bounding-box regression branches.
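
As a sketch of what the mask branch looks like, here is a small fully convolutional head loosely modeled on the paper's design; the layer count and sizes are illustrative.

import tensorflow as tf
from tensorflow.keras import layers

num_classes = 81  # 80 COCO classes plus background
roi_features = tf.keras.Input(shape=(14, 14, 256))
x = roi_features
# A stack of convolutions over the pooled RoI features
for _ in range(4):
    x = layers.Conv2D(256, 3, padding='same', activation='relu')(x)
# Upsample 14x14 -> 28x28 for a finer mask
x = layers.Conv2DTranspose(256, 2, strides=2, activation='relu')(x)
# One sigmoid mask per class
masks = layers.Conv2D(num_classes, 1, activation='sigmoid')(x)
mask_head = tf.keras.Model(roi_features, masks)
print(mask_head.output_shape)  # (None, 28, 28, 81)
A sketch of the mask branch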

Implementing Mask R-CNN in Python

When operating with TensorFlow 2.13, we replaced deprecated functions, such as tf.log with tf.math.log, and migrated from the keras.engine module to the keras.layers module. This ensures compatibility and adherence to the latest practices in TensorFlow.
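
The two substitutions look roughly like this; a minimal illustration, not the full set of changes.

import tensorflow as tf
from keras.layers import Layer   # was: from keras.engine import Layer

# tf.log was removed in TensorFlow 2.x; use tf.math.log instead
# was: log_probs = tf.log(probabilities)
log_probs = tf.math.log(tf.constant([0.5, 0.9]))
Updating deprecated TensorFlow and Keras calls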

We configure the Mask R-CNN model using a custom configuration. We define parameters such as the number of GPUs, the number of images per GPU, and the number of classes. The remaining parameters, such as VALIDATION_STEPS and STEPS_PER_EPOCH, are inherited from the base class in Mask_RCNN/mrcnn/config.py.

from mrcnn.config import Config

class MaskRCNN_config(Config):
    # The configuration name
    NAME = 'MaskRCNN_config_inference'
    # Number of GPUs to use. When using only a CPU, this needs to be set to 1.
    GPU_COUNT = 1
    # Number of images to train with on each GPU
    IMAGES_PER_GPU = 1
    # Number of classes (including background)
    NUM_CLASSES = 81

config = MaskRCNN_config()
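
To inspect the full configuration, including the inherited defaults, the base Config class in the matterport Mask_RCNN repository provides a display() helper:

# Print every configuration attribute, including inherited defaults
# such as VALIDATION_STEPS and STEPS_PER_EPOCH
config.display()
Displaying the configuration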

We initialize the model for inference, indicating that it will be used to make predictions on new, unseen data. We pass the configuration and set the current directory as the location where the model files and weights will be stored.

from mrcnn import model as model_lib
print('loading weights for Mask R-CNN model…')
model = model_lib.MaskRCNN(mode='inference', config=config, model_dir='./')
Initializing Mask R-CNN model for inference

We provide the model with pretrained weights trained on the COCO dataset, which is commonly used for training and evaluating object detection and segmentation models. Setting by_name to True ensures that the weights are correctly matched and loaded into the corresponding layers of the model.

model.load_weights('mask_rcnn_coco.h5', by_name=True)
Loading pretrained weights

We define a function to visualize an image with bounding boxes drawn around detected objects. We pass it the filename of the image and boxes_list, a list of bounding boxes. Each box is a tuple of four values, (y1, x1, y2, x2), denoting the coordinates of the top-left and bottom-right corners. The box colors are selected from a list of predefined colors, cycling through them if there are more boxes than colors.

from matplotlib import pyplot
from matplotlib.patches import Rectangle

# Draw an image with detected objects
def draw_image_with_boxes(filename, boxes_list):
    # Load the image
    data = pyplot.imread(filename)
    # Plot the image
    pyplot.imshow(data)
    # Get the context for drawing boxes
    ax = pyplot.gca()
    ax.set_xticks([])
    ax.set_yticks([])
    colors = [
        (1.0, 0.0, 0.0, 1.0),      # Red
        (0.0, 1.0, 0.0, 1.0),      # Green
        (0.0, 0.0, 1.0, 1.0),      # Blue
        (1.0, 1.0, 0.0, 1.0),      # Yellow
        (0.0, 0.0, 0.0, 1.0),      # Black
        (1.0, 1.0, 1.0, 1.0),      # White
        (0.502, 0.0, 0.502, 1.0),  # Purple
        (0.545, 0.271, 0.075, 1.0) # Brown
    ]
    # Plot each box
    for i, box in enumerate(boxes_list):
        # Get coordinates
        y1, x1, y2, x2 = box
        # Calculate width and height of the box
        width, height = x2 - x1, y2 - y1
        # Cycle through the colors so more than eight boxes don't raise an IndexError
        color = colors[i % len(colors)]
        # Create the shape
        rect = Rectangle((x1, y1), width, height, fill=False, color=color, lw=5)
        # Draw the box
        ax.add_patch(rect)
    # Show the plot
    pyplot.show()

We call the detect function of the model object, passing the image as input. The result of the detection process is stored in the results variable, which contains information about the detected objects, such as bounding box coordinates, class labels, and segmentation masks. The results structure contains:

  • 'rois': A NumPy array with the shape (N, 4), where N is the number of detected objects. Each row represents a bounding box for a detected object, defined by its top-left and bottom-right coordinates in the format [y1, x1, y2, x2].

  • 'class_ids': A NumPy array of class IDs for each of the detected objects. The class ID represents the class label to which the object belongs.

  • 'scores': A NumPy array with confidence scores or probabilities associated with each of the detected objects. These scores indicate the model’s confidence in the correctness of the detection. Higher scores typically represent greater confidence.

  • 'masks': A NumPy array containing the segmentation masks for each detected object. Each mask is binary, with True marking object pixels and False marking the background. In this implementation, the array has the shape (H, W, N), where H and W are the height and width of the image and N is the number of objects.

The results structure looks like this:

[
  {
    'rois': array([[ 53,  99, 345, 236],
                   [ 24, 246, 245, 433],
                   [133,  82, 203, 178],
                   [275, 154, 454, 477],
                   [ 54,   0, 367, 621],
                   [131, 419, 231, 551],
                   [168, 280, 237, 432],
                   [ 28, 488,  83, 528]], dtype=int32),
    'class_ids': array([ 1,  1, 16, 17, 58, 16, 16, 40], dtype=int32),
    'scores': array([0.9947732 , 0.99246526, 0.9918096 , 0.9914255 , 0.9877041 ,
                     0.9839083 , 0.78973764, 0.70102155], dtype=float32),
    'masks': array([[[False, False, False, ..., False, False, False],
                     [False, False, False, ..., False, False, False],
                     [False, False, False, ..., False, False, False],
                     ...,
                   ]])
  }
]

After detection is performed, we visualize the results by drawing bounding boxes on the input image. results[0]['rois'] retrieves the bounding box coordinates (RoIs).

results = model.detect([img], verbose=0)
draw_image_with_boxes('/usr/local/notebooks/cd.jpeg', results[0]['rois'])
Calling detect function for object detection.

The display_instances function uses this information to draw bounding boxes, overlay segmentation masks, label object classes, and display confidence scores on the image. Here, class_names is the list of class labels (for COCO, 81 names including the background) that maps each class ID to a human-readable name.

from mrcnn.visualize import display_instances
# Show the photo with bounding boxes, masks, class labels, and scores
display_instances(img, results[0]['rois'], results[0]['masks'],
                  results[0]['class_ids'], class_names, results[0]['scores'])
Displaying the result of instance segmentation.

Try it yourself

You may need to re-run the results = model.detect([img], verbose=0) cell in the .ipynb notebook. When running models on a CPU, TensorFlow can struggle to allocate or manage memory during the initial graph execution; the graph compilation and execution typically resolves itself on retry. Please note you may see warnings, but they do not affect the output.
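
If you prefer not to re-run the cell by hand, a simple retry loop achieves the same effect; the retry count and the broad exception handling below are illustrative choices, not part of the original notebook.

# Retry detection once if the first graph execution fails on a
# memory-constrained CPU
for attempt in range(2):
    try:
        results = model.detect([img], verbose=0)
        break
    except Exception as err:
        print(f'Attempt {attempt + 1} failed: {err}')
Retrying detection on failure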


Conclusion

In conclusion, Mask R-CNN is a powerful model for instance segmentation that effectively combines object detection and pixel-level segmentation. Its architecture, featuring components like the backbone network and the Region Proposal Network, allows for precise identification of objects in images. Mastering this model is essential for IT professionals and researchers tackling complex visual recognition challenges, making it a cornerstone of robust computer vision solutions in a rapidly evolving landscape.

Frequently asked questions



How to create masks for image segmentation?

Masks for image segmentation are created by assigning a label to each pixel in an image, indicating which object or region it belongs to. In practice, this is done by annotating images with binary masks or using tools like LabelMe or VGG Image Annotator. These masks are then used to train segmentation models like U-Net or Mask R-CNN.
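
As a minimal sketch, a segmentation mask is just a per-pixel boolean array; here a hypothetical 100×100 image gets a rectangular object region marked as foreground.

import numpy as np

# A binary mask: True marks pixels belonging to the object,
# False marks the background (sizes are illustrative)
mask = np.zeros((100, 100), dtype=bool)
mask[20:60, 30:80] = True
Creating a binary segmentation mask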


How do you implement instance segmentation?

Instance segmentation can be implemented using deep learning models like Mask R-CNN. It involves detecting objects in an image and segmenting them pixel-wise while differentiating between individual object instances. The key steps include object detection (bounding boxes) and generating segmentation masks for each detected object.


What is the difference between Mask R-CNN and Faster R-CNN?

The main difference is that Mask R-CNN extends Faster R-CNN by adding an additional branch for predicting segmentation masks for each detected object. While Faster R-CNN focuses on object detection with bounding boxes, Mask R-CNN enables pixel-level segmentation by creating a mask for each object.

