How to load and train DETR using PyTorch

Key takeaways:

  • DETR is an object detection model that uses a transformer architecture to predict objects in a single end-to-end pass, achieving performance competitive with established detectors.  

  • Training DETR with PyTorch involves setting up the environment, preparing the dataset in COCO format, and running the training script with desired parameters.  

  • Evaluating the trained DETR model involves running a separate script with the checkpoint file to obtain metrics like mAP and IoU for performance assessment.

The DEtection TRansformer (DETR) model, introduced by Facebook AI Research, offers a novel approach to object detection by utilizing transformers. In this Answer, we’ll walk through the process of loading and training a DETR model using PyTorch.

What is DETR?

DETR is a state-of-the-art object detection model that leverages the transformer architecture. Unlike traditional object detection models that rely on complex handcrafted components like region proposal networks (RPNs) and anchor boxes, DETR uses a fully end-to-end trainable architecture. This makes it simpler and more efficient while achieving competitive performance.
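
Before training anything, we can verify that DETR loads and runs locally. Here's a minimal sketch that pulls a COCO-pretrained DETR through torch.hub (the detr_resnet50 entry point is exposed by the official repository) and runs a dummy forward pass:

import torch

# Load COCO-pretrained DETR with a ResNet-50 backbone from the official repository
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()

# Forward a dummy batch: DETR returns class logits and normalized boxes
dummy = torch.randn(1, 3, 800, 800)
with torch.no_grad():
    outputs = model(dummy)
print(outputs['pred_logits'].shape)  # torch.Size([1, 100, 92]): 100 queries, 91 classes + no-object
print(outputs['pred_boxes'].shape)   # torch.Size([1, 100, 4]): normalized (cx, cy, w, h)
Python snippet to load a pretrained DETR model via torch.hub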

Prerequisites

Before diving into training DETR, let’s ensure we have the following prerequisites installed:

  • Python 3.7 or later

  • PyTorch 1.5+ and torchvision 0.6+

  • pycocotools (for COCO dataset handling and evaluation)

  • scipy (used by DETR’s Hungarian matcher)

  • Git (to clone the repository)

Step 1: Clone the DETR repository

First, clone the official DETR repository from GitHub:

git clone https://github.com/facebookresearch/detr.git
Terminal command to clone the GitHub repository

Navigate to the cloned repository:

cd detr
Terminal command to navigate into the cloned directory

Step 2: Set up the environment

Set up the Python environment by installing the required dependencies:

pip install -r requirements.txt
Terminal command to install the dependencies

Step 3: Prepare the dataset

Before training DETR, we need to prepare the dataset. DETR supports datasets in the COCO format, which is a widely used standard for object detection tasks. If our dataset is not in COCO format, we must convert it.

COCO dataset directory hierarchy

The COCO (Common Objects in Context) dataset follows a specific directory hierarchy. Here’s how it typically looks:

- coco_dataset/
  - annotations/
    - instances_train.json
    - instances_val.json
  - train2017/
    - image1.jpg
    - image2.jpg
    - ...
  - val2017/
    - image1.jpg
    - image2.jpg
    - ...
COCO dataset file hierarchy
  • annotations/: This directory contains annotation files in JSON format. The two main files are instances_train.json for training set annotations and instances_val.json for validation set annotations. These files contain information about the images, such as image IDs, bounding box coordinates, and class labels.

  • train2017/: This directory contains the training images. Each image is typically in JPEG format and is named according to its image ID.

  • val2017/: Similarly, this directory contains the validation images.
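
Before training, it’s worth sanity-checking the annotation files. Here’s a minimal sketch using pycocotools; the path assumes the hierarchy shown above:

from pycocotools.coco import COCO

# Load the training annotations and print basic statistics
coco = COCO('coco_dataset/annotations/instances_train.json')
print(len(coco.imgs), 'images')
print(len(coco.anns), 'annotations')
print(len(coco.cats), 'categories')
Python snippet to sanity-check COCO annotations with pycocotools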

Converting the dataset to COCO format

If our dataset is not already in COCO format, we’ll need to convert it. We can annotate our data with a tool such as labelImg and then write a small conversion script (optionally with the help of the COCO API) to generate the required annotation files (instances_train.json and instances_val.json). Make sure the resulting directory structure matches the COCO hierarchy described above.
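
As a rough illustration, here’s a minimal sketch of such a conversion; the records list and the single 'cat' category are hypothetical stand-ins for whatever our own annotation source provides:

import json

# Hypothetical input: (filename, width, height, [(bbox, label), ...]) per image,
# where bbox is [x, y, width, height] in pixels
records = [
    ('image1.jpg', 640, 480, [([100, 120, 50, 80], 'cat')]),
]
categories = [{'id': 1, 'name': 'cat'}]
name_to_id = {c['name']: c['id'] for c in categories}

images, annotations = [], []
ann_id = 1
for img_id, (fname, w, h, boxes) in enumerate(records, start=1):
    images.append({'id': img_id, 'file_name': fname, 'width': w, 'height': h})
    for bbox, label in boxes:
        annotations.append({
            'id': ann_id, 'image_id': img_id, 'category_id': name_to_id[label],
            'bbox': bbox, 'area': bbox[2] * bbox[3], 'iscrowd': 0,
        })
        ann_id += 1

# Write a minimal COCO-style annotation file
with open('coco_dataset/annotations/instances_train.json', 'w') as f:
    json.dump({'images': images, 'annotations': annotations, 'categories': categories}, f)
Python sketch of a minimal conversion to COCO-format annotations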

Once our dataset is prepared in COCO format, we can train DETR using the provided scripts.

Step 4: Training

Now that we have prepared our dataset, we can train the DETR model using the provided training script. This script allows us to specify parameters such as batch size, number of epochs, learning rate, and more.

Training script parameters

The main training script is main.py, and it accepts several command-line parameters to customize the training process. Here’s a detailed explanation of each parameter:

  • --nproc_per_node: This parameter is passed to the torch.distributed.launch utility rather than to main.py itself, and it specifies the number of GPUs to use for training. If multiple GPUs are available, we can set this parameter to leverage them for faster training. For example, --nproc_per_node=4 would utilize four GPUs.

  • --coco_path: This parameter specifies the path to the root of our COCO-format dataset (the coco_dataset/ directory prepared earlier). For example: --coco_path /path/to/coco_dataset.

  • --batch_size: This parameter specifies the batch size per GPU, that is, the number of samples processed in each iteration of the training loop. Larger batch sizes can speed up convergence but require more memory. For example: --batch_size 2.

  • --epochs: The number of epochs specifies how many times the entire dataset is traversed during training; one epoch is one complete pass through the dataset. We can increase the number of epochs for longer training runs. For example: --epochs 500.

  • --output_dir: This parameter specifies the directory where the trained model checkpoints and logs will be saved. For example: --output_dir /path/to/output_dir.

  • --resume: This optional parameter allows us to resume training from a previously saved checkpoint. If we have already trained the model and want to continue from a specific checkpoint, we provide the path to the checkpoint file here. For example: --resume '/path/to/checkpoint.pth'.

Example training command

Here’s an example command to train the DETR model:

python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --coco_path /path/to/coco_dataset --batch_size 2 --epochs 500 --output_dir /path/to/output_dir --resume ''
Command to start training the DETR model

This command launches the training script with distributed data parallelism across four GPUs (--nproc_per_node=4). The --use_env flag tells the launcher to pass the distributed configuration through environment variables, which DETR’s main.py expects. The command points the script at our dataset (--coco_path /path/to/coco_dataset), specifies a per-GPU batch size of 2 (--batch_size 2), trains for 500 epochs (--epochs 500), saves checkpoints and logs to /path/to/output_dir (--output_dir /path/to/output_dir), and starts training from scratch (--resume '').

We can adjust the parameters according to our hardware configuration, dataset size, and training requirements.
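
One practical note before moving on: if our custom dataset has a different number of classes than COCO, a commonly used recipe is to start from a pretrained checkpoint but delete its classification head so that it is re-initialized for our classes during fine-tuning. Here’s a sketch; the URL is the official DETR-R50 checkpoint release, but it’s worth double-checking against the repository:

import torch

# Download the official DETR-R50 checkpoint and strip the COCO classification head
checkpoint = torch.hub.load_state_dict_from_url(
    'https://dl.fbaipublicfiles.com/detr/detr-r50-e632da11.pth',
    map_location='cpu', check_hash=True)
del checkpoint['model']['class_embed.weight']
del checkpoint['model']['class_embed.bias']

# Save the result and pass it to main.py via --resume to fine-tune
torch.save(checkpoint, 'detr-r50_no-class-head.pth')
Python sketch to prepare a pretrained checkpoint for fine-tuning on custom classes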

Step 5: Evaluation of the model

After training the DETR model, it’s essential to evaluate its performance on a separate validation set to assess its accuracy and generalization ability. The evaluation script allows us to measure metrics such as mAP (mean Average Precision) and IoU (Intersection over Union) on our validation data.

Evaluation script parameters

The evaluation script is also main.py; it accepts additional parameters for evaluation. Here’s a detailed explanation of each parameter:

  • --eval: This flag indicates that we want to perform evaluation. By including this flag, the script evaluates the trained model on the validation set instead of training. For example: --eval.

  • --resume: This parameter specifies the path to the saved checkpoint of the trained model, which will be loaded for evaluation. For example: --resume /path/to/checkpoint.pth.

  • --coco_path: As in training, this parameter specifies the path to the root of our COCO-format dataset; the validation split is used for evaluation. For example: --coco_path /path/to/coco_dataset.

  • --output_dir: Similar to the training process, this parameter specifies the directory where evaluation results will be saved. For example: --output_dir /path/to/output_dir.

Example evaluation command

Here’s an example command to evaluate the trained DETR model:

python main.py --eval --resume /path/to/checkpoint.pth --coco_path /path/to/coco_dataset --output_dir /path/to/output_dir
Command to perform evaluation of our model

This command performs evaluation (--eval) on the validation split of the dataset at /path/to/coco_dataset (--coco_path) using the trained model checkpoint located at /path/to/checkpoint.pth (--resume). The evaluation results will be saved to /path/to/output_dir (--output_dir).

Adjust the parameters according to the location of the checkpoint file and the desired output directory.

Interpretation of evaluation results

After running the evaluation script, we will obtain metrics such as mAP and IoU, which provide insights into the model’s performance. A higher mAP indicates better object detection accuracy, while higher IoU values imply better localization of objects in the images.
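
To make IoU concrete, here’s a tiny sketch computing it for two overlapping boxes in (x1, y1, x2, y2) format using torchvision:

import torch
from torchvision.ops import box_iou

# IoU between one predicted box and one ground-truth box
pred = torch.tensor([[100., 100., 200., 200.]])
gt = torch.tensor([[110., 110., 210., 210.]])
print(box_iou(pred, gt))  # tensor([[0.6807]])
Python snippet to compute IoU with torchvision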

Inspecting these metrics will help us understand how well the trained DETR model performs on our validation data, and we can identify areas for improvement if necessary.
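
Beyond aggregate metrics, it often helps to inspect predictions on individual images. Here’s a hedged sketch that loads our trained weights into the DETR-R50 architecture and keeps only confident detections; the checkpoint path and image name are placeholders, and if we trained with a non-default number of classes, we’d pass num_classes to torch.hub.load accordingly:

import torch
import torchvision.transforms as T
from PIL import Image

# Build the DETR-R50 architecture (no pretrained weights) and load our checkpoint
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=False)
state = torch.load('/path/to/checkpoint.pth', map_location='cpu')
model.load_state_dict(state['model'])
model.eval()

# Standard DETR preprocessing: resize, convert to tensor, ImageNet normalization
transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
img = Image.open('sample.jpg').convert('RGB')

with torch.no_grad():
    outputs = model(transform(img).unsqueeze(0))

# Drop the no-object class, then keep predictions above a confidence threshold
probs = outputs['pred_logits'].softmax(-1)[0, :, :-1]
keep = probs.max(-1).values > 0.7
print(outputs['pred_boxes'][0, keep])  # normalized (cx, cy, w, h) boxes
Python sketch to run inference with a trained DETR checkpoint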

Quiz

We’ll test our understanding of the concepts learned in this Answer with a short quiz.

1. What is the main architectural difference between DETR and traditional object detection models?

A) DETR uses region proposal networks (RPNs).

B) DETR uses anchor boxes.

C) DETR is fully end-to-end trainable and relies on transformers.

D) DETR uses convolutional neural networks exclusively.


Demo

For a working demo of the model, feel free to check this GitHub repository: https://github.com/thedeepreader/detr_tutorial.

After working through this Answer, we can load, train, and evaluate the DETR model on our own custom dataset. It also helps to go through the official documentation to pick up further details and familiarize ourselves with the workings of the model.

Frequently asked questions



What is object detection using DETR?

DETR is a deep-learning model for object detection and segmentation that uses transformers. It directly predicts object classes and locations by processing global image context, providing a more unified and simpler approach than previous methods reliant on region proposals.


What is the structure of DETR?

DETR consists of a CNN backbone for feature extraction, a transformer encoder-decoder module for processing these features, and a set-based Hungarian matching process for outputting predictions. This structure eliminates the need for traditional methods like anchors and non-maximum suppression.


What is the loss function in DETR object detection?

The loss function in DETR has two key steps:

  • First, it finds the optimal bipartite matching between predictions and ground truths using the Hungarian algorithm, which minimizes a pairwise matching cost built from class and box terms.
  • Second, it computes a loss over the matched pairs that penalizes classification errors and inaccuracies in bounding box predictions. This matching process ensures the model associates each prediction with at most one ground-truth object, using both class and localization accuracy as factors.
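
Under the hood, DETR’s matcher solves this assignment with scipy’s linear_sum_assignment (the Hungarian algorithm). Here’s a toy sketch with a made-up cost matrix:

import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: rows are predictions, columns are ground-truth objects;
# each entry combines classification and box costs (lower is better)
cost = np.array([[0.9, 0.1, 0.8],
                 [0.2, 0.7, 0.6],
                 [0.5, 0.4, 0.1]])
pred_idx, gt_idx = linear_sum_assignment(cost)
# Optimal one-to-one matching: [(0, 1), (1, 0), (2, 2)]
print([(int(i), int(j)) for i, j in zip(pred_idx, gt_idx)])
Python snippet illustrating bipartite matching with the Hungarian algorithm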
