How to perform instance segmentation using Mask R-CNN

Key takeaways:

  • Mask R-CNN is a sophisticated computer vision model that integrates object detection and semantic segmentation, enabling precise identification and pixel-level segmentation of objects in images.

  • The backbone network of Mask R-CNN extracts features from the input image, generating feature maps that capture essential details such as edges and textures. Popular backbone architectures include ResNet and the Feature Pyramid Network (FPN).

  • Mask R-CNN predicts segmentation masks for objects within each RoI, providing detailed outlines of object shapes that go beyond traditional detection methods.

Mask R-CNN (Mask Region-based Convolutional Neural Network) is an advanced computer vision model that combines object detection and semantic segmentation, enabling it to identify objects within an image and precisely segment them at the pixel level.

Mask R-CNN components

Mask R-CNN comprises several essential components that together achieve object detection and instance segmentation. The combination of these components allows Mask R-CNN to achieve state-of-the-art results in various computer vision tasks. These components include:

Backbone network

The backbone network is responsible for feature extraction. It takes the entire image as input and generates feature maps that capture critical information for subsequent tasks. These features include edges, corners, shapes, and textures. Popular backbone network architectures, like ResNet and the Feature Pyramid Network (FPN), play a pivotal role in this step. These networks are pretrained on extensive datasets and are adept at extracting high-level features from images. The choice of backbone architecture can significantly influence the performance of Mask R-CNN.
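
To make this concrete, here is a minimal sketch of using a pretrained ResNet50 from tf.keras.applications as a feature extractor; the input size and the dummy image are illustrative, and this generic backbone is a stand-in rather than the exact network Mask R-CNN wires in.

import tensorflow as tf

# Load ResNet50 without its classification head to use it as a
# feature-extraction backbone
backbone = tf.keras.applications.ResNet50(include_top=False, weights='imagenet')
# A dummy input batch standing in for a real image
image = tf.random.uniform((1, 512, 512, 3))
feature_maps = backbone(image)
print(feature_maps.shape)  # (1, 16, 16, 2048)
A sketch of backbone feature extraction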

Region Proposal Network (RPN)

The feature maps produced by the backbone network are used by the Region Proposal Network (RPN), which is responsible for proposing potential object regions within an image. Instead of scanning the input image, the RPN scans the feature maps to find candidate regions. These regions are called regions of interest (RoIs) and are represented as rectangular bounding boxes. Each box is accompanied by a confidence score, which indicates the likelihood that it contains an object.
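
For intuition about where proposals begin, below is a minimal sketch of dense anchor generation over a feature map; in practice, the RPN scores and refines such anchors to produce RoIs. The stride, scales, and ratios are illustrative values, not the exact ones Mask R-CNN uses.

import numpy as np

# Generate one anchor box per (position, scale, ratio) combination,
# in [y1, x1, y2, x2] image coordinates
def generate_anchors(feature_h, feature_w, stride, scales, ratios):
    anchors = []
    for y in range(feature_h):
        for x in range(feature_w):
            cy, cx = y * stride, x * stride  # anchor center in image coordinates
            for s in scales:
                for r in ratios:
                    h, w = s * np.sqrt(r), s / np.sqrt(r)  # r is the h/w aspect ratio
                    anchors.append([cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2])
    return np.array(anchors)

anchors = generate_anchors(4, 4, stride=16, scales=[64, 128], ratios=[0.5, 1, 2])
print(anchors.shape)  # (96, 4)
A sketch of anchor generation for the RPN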

RoIAlign

RoIAlign extracts features from the RoIs within the feature maps generated by the backbone network. It transforms the variably sized RoIs into fixed-size feature maps, which are then used for classification and mask prediction. Unlike earlier pooling approaches that quantize RoI boundaries, RoIAlign uses bilinear interpolation to preserve precise spatial alignment, which is essential for tasks like object detection and instance segmentation.
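
For intuition, the sketch below approximates RoIAlign with tf.image.crop_and_resize, which bilinearly samples each normalized box into a fixed-size feature map; true RoIAlign samples several points per output bin, so this is a simplification rather than the exact operation.

import tensorflow as tf

# Dummy backbone output: one image's 32x32 feature map with 256 channels
feature_maps = tf.random.uniform((1, 32, 32, 256))
# Two RoIs in normalized [y1, x1, y2, x2] coordinates
rois = tf.constant([[0.1, 0.1, 0.6, 0.5],
                    [0.3, 0.4, 0.9, 0.8]])
# Both RoIs come from image 0 in the batch
box_indices = tf.zeros(2, dtype=tf.int32)
# Pool each RoI into a fixed 7x7 feature map via bilinear interpolation
pooled = tf.image.crop_and_resize(feature_maps, rois, box_indices, crop_size=(7, 7))
print(pooled.shape)  # (2, 7, 7, 256)
Approximating RoIAlign with crop_and_resize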

Object detection head

The detection head identifies and localizes objects in the image. This component is responsible for classifying objects and predicting bounding boxes. It takes the fixed-size feature maps from the RoIs and performs object classification, assigning a class label to each proposed object region that determines what the object is, along with refined bounding-box coordinates.
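
A minimal sketch of such a head, assuming 7×7 pooled RoI features and illustrative layer sizes, is shown below: a shared fully connected layer feeds a classification branch and a box-regression branch.

import tensorflow as tf
from tensorflow.keras import layers

num_classes = 81  # 80 COCO classes plus background
roi_features = tf.keras.Input(shape=(7, 7, 256))
x = layers.Flatten()(roi_features)
# Shared fully connected layer
x = layers.Dense(1024, activation='relu')(x)
# Classification branch: one logit per class
class_logits = layers.Dense(num_classes, name='class_logits')(x)
# Box branch: four refinement deltas per class
box_deltas = layers.Dense(num_classes * 4, name='box_deltas')(x)
head = tf.keras.Model(roi_features, [class_logits, box_deltas])
A sketch of the detection head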

Instance segmentation

Mask R-CNN accurately outlines the shape of objects within the proposed regions by predicting pixel-level segmentation masks for each RoI. This enables instance segmentation, which goes beyond traditional object detection by providing a detailed per-object mask. Mask R-CNN predicts these masks in parallel with the existing classification and bounding-box regression branches.
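
As a sketch of what the mask branch looks like, here is a small fully convolutional head loosely modeled on the paper's design; the layer count and sizes are illustrative.

import tensorflow as tf
from tensorflow.keras import layers

num_classes = 81  # 80 COCO classes plus background
roi_features = tf.keras.Input(shape=(14, 14, 256))
x = roi_features
# A stack of convolutions over the pooled RoI features
for _ in range(4):
    x = layers.Conv2D(256, 3, padding='same', activation='relu')(x)
# Upsample 14x14 -> 28x28 for a finer mask
x = layers.Conv2DTranspose(256, 2, strides=2, activation='relu')(x)
# One sigmoid mask per class
masks = layers.Conv2D(num_classes, 1, activation='sigmoid')(x)
mask_head = tf.keras.Model(roi_features, masks)
print(mask_head.output_shape)  # (None, 28, 28, 81)
A sketch of the mask branch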

Implementing Mask R-CNN in Python

When operating with TensorFlow 2.13, we replaced deprecated functions, such as tf.log with tf.math.log, and migrated from the keras.engine module to the keras.layers module. This ensures compatibility and adherence to the latest practices in TensorFlow.
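
The two substitutions look roughly like this; a minimal illustration, not the full set of changes.

import tensorflow as tf
from keras.layers import Layer   # was: from keras.engine import Layer

# tf.log was removed in TensorFlow 2.x; use tf.math.log instead
# was: log_probs = tf.log(probabilities)
log_probs = tf.math.log(tf.constant([0.5, 0.9]))
Updating deprecated TensorFlow and Keras calls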

We configure the Mask R-CNN model using a custom configuration. We define parameters such as the number of GPUs, the number of images per GPU, and the number of classes. The remaining parameters, such as VALIDATION_STEPS and STEPS_PER_EPOCH, are inherited from the base class in Mask_RCNN/mrcnn/config.py.

from mrcnn.config import Config

class MaskRCNN_config(Config):
    # The configuration name
    NAME = 'MaskRCNN_config_inference'
    # Number of GPUs to use. When using only a CPU, this needs to be set to 1.
    GPU_COUNT = 1
    # Number of images to train with on each GPU
    IMAGES_PER_GPU = 1
    # Number of classes (including background)
    NUM_CLASSES = 81

config = MaskRCNN_config()
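
To inspect the full configuration, including the inherited defaults, the base Config class in the matterport Mask_RCNN repository provides a display() helper:

# Print every configuration attribute, including inherited defaults
# such as VALIDATION_STEPS and STEPS_PER_EPOCH
config.display()
Displaying the configuration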

We initialize the model for inference, indicating that it will be used to make predictions on new, unseen data. We pass the configuration and set the current directory as the location where the model files and weights will be stored.

from mrcnn import model as model_lib
print('loading weights for Mask R-CNN model…')
model = model_lib.MaskRCNN(mode='inference', config=config, model_dir='./')
Initializing Mask R-CNN model for inference

We provide the model with pretrained weights trained on the COCO dataset, which is commonly used for training and evaluating object detection and segmentation models. Setting by_name to True ensures that the weights are correctly matched and loaded into the corresponding layers of the model.

model.load_weights('mask_rcnn_coco.h5', by_name=True)
Loading pretrained weights

We define a function to visualize an image with bounding boxes drawn around detected objects. We pass it the filename of the image and boxes_list, a list of bounding boxes. Each box is a tuple of four values, (y1, x1, y2, x2), denoting the coordinates of the top-left and bottom-right corners. The box colors are selected from a list of predefined colors, cycling through them if there are more boxes than colors.

from matplotlib import pyplot
from matplotlib.patches import Rectangle

# Draw an image with detected objects
def draw_image_with_boxes(filename, boxes_list):
    # Load the image
    data = pyplot.imread(filename)
    # Plot the image
    pyplot.imshow(data)
    # Get the context for drawing boxes
    ax = pyplot.gca()
    ax.set_xticks([])
    ax.set_yticks([])
    colors = [
        (1.0, 0.0, 0.0, 1.0),      # Red
        (0.0, 1.0, 0.0, 1.0),      # Green
        (0.0, 0.0, 1.0, 1.0),      # Blue
        (1.0, 1.0, 0.0, 1.0),      # Yellow
        (0.0, 0.0, 0.0, 1.0),      # Black
        (1.0, 1.0, 1.0, 1.0),      # White
        (0.502, 0.0, 0.502, 1.0),  # Purple
        (0.545, 0.271, 0.075, 1.0) # Brown
    ]
    # Plot each box
    for i, box in enumerate(boxes_list):
        # Get coordinates
        y1, x1, y2, x2 = box
        # Calculate width and height of the box
        width, height = x2 - x1, y2 - y1
        # Cycle through the colors so more than eight boxes don't raise an IndexError
        color = colors[i % len(colors)]
        # Create the shape
        rect = Rectangle((x1, y1), width, height, fill=False, color=color, lw=5)
        # Draw the box
        ax.add_patch(rect)
    # Show the plot
    pyplot.show()

We call the detect function of the model object, passing the image as input. The result of the detection process is stored in the results variable, which contains information about the detected objects, such as bounding box coordinates, class labels, and segmentation masks. The results structure contains:

  • 'rois': A NumPy array with the shape (N, 4), where N is the number of detected objects. Each row represents a bounding box for a detected object, defined by its top-left and bottom-right coordinates in the format [y1, x1, y2, x2].

  • 'class_ids': A NumPy array of class IDs for each of the detected objects. The class ID represents the class label to which the object belongs.

  • 'scores': A NumPy array with confidence scores or probabilities associated with each of the detected objects. These scores indicate the model’s confidence in the correctness of the detection. Higher scores typically represent greater confidence.

  • 'masks': A NumPy array containing the segmentation masks for each detected object. Each mask is binary, with True marking object pixels and False marking the background. In this implementation, the array has the shape (H, W, N), where H and W are the height and width of the image and N is the number of objects.

The results structure looks like this:

[
  {
    'rois': array([[ 53,  99, 345, 236],
                   [ 24, 246, 245, 433],
                   [133,  82, 203, 178],
                   [275, 154, 454, 477],
                   [ 54,   0, 367, 621],
                   [131, 419, 231, 551],
                   [168, 280, 237, 432],
                   [ 28, 488,  83, 528]], dtype=int32),
    'class_ids': array([ 1,  1, 16, 17, 58, 16, 16, 40], dtype=int32),
    'scores': array([0.9947732 , 0.99246526, 0.9918096 , 0.9914255 , 0.9877041 ,
                     0.9839083 , 0.78973764, 0.70102155], dtype=float32),
    'masks': array([[[False, False, False, ..., False, False, False],
                     [False, False, False, ..., False, False, False],
                     [False, False, False, ..., False, False, False],
                     ...,
                   ]])
  }
]

After detection is performed, we visualize the results by drawing bounding boxes on the input image. results[0]['rois'] retrieves the bounding box coordinates (RoIs).

results = model.detect([img], verbose=0)
draw_image_with_boxes('/usr/local/notebooks/cd.jpeg', results[0]['rois'])
Calling detect function for object detection.

The display_instances function uses this information to draw bounding boxes, overlay segmentation masks, label object classes, and display confidence scores on the image. Here, class_names is the list of class labels (for COCO, 81 names including the background) that maps each class ID to a human-readable name.

from mrcnn.visualize import display_instances
# Show the photo with bounding boxes, masks, class labels, and scores
display_instances(img, results[0]['rois'], results[0]['masks'],
                  results[0]['class_ids'], class_names, results[0]['scores'])
Displaying the result of instance segmentation.

Try it yourself

You may need to re-run the results = model.detect([img], verbose=0) cell in the .ipynb notebook. When running models on a CPU, TensorFlow can struggle to allocate or manage memory during the initial graph execution; the graph compilation and execution typically resolves itself on retry. Please note you may see warnings, but they do not affect the output.
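
If you prefer not to re-run the cell by hand, a simple retry loop achieves the same effect; the retry count and the broad exception handling below are illustrative choices, not part of the original notebook.

# Retry detection once if the first graph execution fails on a
# memory-constrained CPU
for attempt in range(2):
    try:
        results = model.detect([img], verbose=0)
        break
    except Exception as err:
        print(f'Attempt {attempt + 1} failed: {err}')
Retrying detection on failure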


Conclusion

In conclusion, Mask R-CNN is a powerful model for instance segmentation that effectively combines object detection and pixel-level segmentation. Its architecture, featuring components like the backbone network and the Region Proposal Network, allows for precise identification of objects in images. Mastering this model is essential for IT professionals and researchers tackling complex visual recognition challenges, making it a cornerstone of robust computer vision solutions in a rapidly evolving landscape.

Frequently asked questions



How to create masks for image segmentation?

Masks for image segmentation are created by assigning a label to each pixel in an image, indicating which object or region it belongs to. In practice, this is done by annotating images with binary masks or using tools like LabelMe or VGG Image Annotator. These masks are then used to train segmentation models like U-Net or Mask R-CNN.
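
As a minimal sketch, a segmentation mask is just a per-pixel boolean array; here a hypothetical 100×100 image gets a rectangular object region marked as foreground.

import numpy as np

# A binary mask: True marks pixels belonging to the object,
# False marks the background (sizes are illustrative)
mask = np.zeros((100, 100), dtype=bool)
mask[20:60, 30:80] = True
Creating a binary segmentation mask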


How do you implement instance segmentation?

Instance segmentation can be implemented using deep learning models like Mask R-CNN. It involves detecting objects in an image and segmenting them pixel-wise while differentiating between individual object instances. The key steps include object detection (bounding boxes) and generating segmentation masks for each detected object.


What is the difference between Mask R-CNN and Faster R-CNN?

The main difference is that Mask R-CNN extends Faster R-CNN by adding an additional branch for predicting segmentation masks for each detected object. While Faster R-CNN focuses on object detection with bounding boxes, Mask R-CNN enables pixel-level segmentation by creating a mask for each object.

