What is multitask cascaded convolutional neural network (MTCNN)?

The multitask cascaded convolutional networks (MTCNN) algorithm is a groundbreaking advancement in face detection and recognition. Introduced by Zhang et al. in 2016, MTCNN leverages a cascade of neural networks to detect, align, and extract facial features from digital images, achieving exceptional precision and efficiency.

Architecture of MTCNN

MTCNN features a sophisticated deep learning architecture composed of three cascaded networks that work together to identify faces and landmarks.

The MTCNN framework comprises three networks: P-Net, R-Net, and O-Net. First, the input image is scaled to various resolutions. The proposal network (P-Net) processes these scaled images, producing numerous potential bounding boxes containing faces. After initial processing, the refine network (R-Net) further refines these bounding boxes, filtering them to the most probable candidates. Finally, the output network (O-Net) generates the final bounding boxes and facial landmark coordinates, resulting in highly accurate face detection.

Architecture of MTCNN

In the figure above, MP refers to max pooling, and Conv means convolution.

Let’s explore the three networks in more detail.

How does MTCNN work?

The MTCNN framework employs a three-stage multitask deep convolutional network to detect faces in an image. Initially, the image is resized to multiple scales to create an image pyramid, which serves as the input to the three-stage cascaded process.
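As a concrete sketch of the pyramid step, the set of scales can be derived from the smallest face size we want to detect. The helper below is illustrative (not part of the MTCNN library); the 12-pixel cell size matches the P-Net's input window, and 0.709 (≈ √0.5, halving the image area each step) is a commonly used scale factor:

```python
def pyramid_scales(width, height, min_face_size=20, factor=0.709, cell_size=12):
    """Scales at which to resize the image so that faces down to
    min_face_size pixels map onto the P-Net's cell_size x cell_size window."""
    scales = []
    scale = cell_size / min_face_size      # largest scale needed
    min_side = min(width, height) * scale
    while min_side >= cell_size:           # stop once the image is too small
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales
```

For a 640×480 image with the defaults above, this yields a decreasing sequence of scales starting at 0.6, each step shrinking the image until its shorter side drops below 12 pixels.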

  • Stage 1 (Proposal network (P-Net)): The P-Net is a shallow, fully convolutional network that processes each scaled image and generates a set of candidate bounding boxes that may contain faces. Its convolutional filters produce feature maps from which it estimates the likelihood of a face being present in each region. The P-Net also regresses rough coordinates for the bounding boxes around detected faces, providing information about their position and size.

  • Stage 2 (Refine network (R-Net)): The R-Net refines the candidate bounding boxes produced by the P-Net. It crops the corresponding regions from the input image, resizes them to a fixed size, and passes them through convolutional and fully connected layers to classify each bounding box as either a face or a non-face. The R-Net also fine-tunes the bounding box coordinates to improve the accuracy of the detected face location.

  • Stage 3 (Output network (O-Net)): The O-Net provides detailed descriptions of the faces, including the positions of five facial landmarks, and produces the final bounding boxes. It takes the refined bounding boxes from the R-Net, crops the corresponding regions from the input image, resizes them, and processes them through convolutional and fully connected layers. The O-Net further refines the bounding box coordinates and extracts the coordinates of the five facial landmarks (typically the left eye center, right eye center, nose tip, left mouth corner, and right mouth corner) for each detected face.
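Between and after these stages, MTCNN prunes overlapping candidate boxes with non-maximum suppression (NMS): the highest-scoring box is kept, and any remaining candidate that overlaps it too much is discarded. A minimal sketch of the idea (the `iou` and `nms` helpers below are illustrative, not the library's API):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, threshold=0.5):
    """Return indices of boxes kept after greedy non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)          # highest-scoring remaining box
        keep.append(i)
        # drop every candidate that overlaps it above the threshold
        order = [j for j in order if iou(boxes[i], boxes[j]) <= threshold]
    return keep
```

Applying this between stages is what reduces the thousands of P-Net proposals to the handful of boxes the R-Net and O-Net actually process.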

Implementation

Now that we understand the MTCNN architecture and its stages, let’s implement a practical example using the MTCNN library to detect faces and facial landmarks in an image.

import matplotlib.pyplot as plt
import cv2
from matplotlib.patches import Rectangle, Circle
from mtcnn.mtcnn import MTCNN

image = cv2.imread('sample1.png') # Try sample2.png and sample3.png too
output_path = 'output.png' # Where the annotated image is saved
detector = MTCNN()
faces = detector.detect_faces(image)

plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
ax = plt.gca()

for face in faces:
    x, y, width, height = face['box']
    rect = Rectangle((x, y), width, height, fill=False, color='orange')
    ax.add_patch(rect)

    for key, value in face['keypoints'].items():
        dot = Circle(value, radius=12, color='red')
        ax.add_patch(dot)

plt.title('Face and Landmarks Identified')
plt.axis('off')
plt.savefig(output_path, dpi=300)
plt.show()

Explanation

The code above is explained in detail below:

  • Lines 1–4: We import the necessary libraries and modules:

    • matplotlib.pyplot as plt: For visualization and plotting

    • cv2: For image processing with OpenCV

    • Rectangle and Circle from matplotlib.patches: For drawing rectangles and circles on the plot

    • MTCNN from mtcnn.mtcnn: For face detection and facial landmark extraction

  • Line 8: We instantiate the MTCNN detector object using MTCNN().

  • Line 9: We detect the faces and facial landmarks in the image using the MTCNN detector’s detect_faces() method and store the results in the faces variable.

  • Lines 11–12: We convert the image from BGR to RGB format using cv2.cvtColor() to ensure compatibility with Matplotlib’s color format. Then, we set up the current axis for plotting using plt.gca() and store it in the variable ax.

  • Lines 14–21: We iterate over the detected faces. We extract the bounding box coordinates for each face and draw a rectangle around it using Rectangle. Then, we iterate over the facial landmarks, extracting the coordinates and drawing a circle at each landmark using Circle.

  • Lines 23–26: plt.title('Face and Landmarks Identified') sets the title, plt.axis('off') removes the axis labels, plt.savefig(output_path, dpi=300) saves a high-resolution copy of the plot to output_path, and plt.show() displays the plot with the faces and landmarks detected by MTCNN.
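It helps to know the shape of what detect_faces() returns: a list with one dict per detected face. The snippet below walks through one such entry; the coordinate values are illustrative (made up for this sketch), but the key names match the mtcnn package’s output format:

```python
# One entry of the list returned by detect_faces() (values illustrative):
sample_face = {
    'box': [120, 80, 90, 110],       # x, y, width, height of the face
    'confidence': 0.999,             # detection probability
    'keypoints': {                   # the five facial landmarks
        'left_eye': (150, 120), 'right_eye': (185, 118),
        'nose': (168, 145),
        'mouth_left': (152, 170), 'mouth_right': (182, 168),
    },
}

# The box is (top-left corner, width, height); convert to corner coordinates
# when an API expects (x1, y1, x2, y2) instead:
x, y, width, height = sample_face['box']
right, bottom = x + width, y + height
print((x, y, right, bottom))  # (120, 80, 210, 190)
```

This is why the plotting loop reads face['box'] for the rectangle and face['keypoints'] for the landmark circles.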

Applications of MTCNN

The MTCNN algorithm finds extensive application in various fields requiring accurate face detection and facial landmark recognition. Its robust architecture and efficient processing make it suitable for:

  • Face recognition systems: MTCNN is a fundamental component in biometric security systems, allowing for accurate identification and verification of individuals based on facial features.

  • Video surveillance: In surveillance systems, MTCNN enables real-time detection and tracking of faces, enhancing security measures in public spaces and sensitive areas.

  • Automated photo tagging: Social media platforms utilize MTCNN to automatically tag individuals in photos, improving user experience and engagement.

  • Emotion recognition: By detecting facial landmarks, MTCNN aids in recognizing emotions, facilitating applications in psychological research and human-computer interaction.

  • Augmented reality: MTCNN supports augmented reality applications by accurately overlaying virtual objects on detected faces in real time.

Conclusion

The multitask cascaded convolutional networks (MTCNN) algorithm significantly advances face detection and facial landmark localization. Its three-stage cascaded architecture, comprising P-Net, R-Net, and O-Net, demonstrates robust performance in various practical applications, from security systems to augmented reality. MTCNN enhances automation, security, and user interaction across diverse technological domains by efficiently detecting faces and extracting detailed facial landmarks. As advancements in deep learning continue, MTCNN remains a pivotal tool in leveraging facial recognition capabilities for research and industry applications.


Copyright ©2025 Educative, Inc. All rights reserved