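This section uses evaluation on the CIFAR-10 dataset as an example to show how to run distributed evaluation with MMEval, first with torch.distributed and then with MPI4Py. The code is available in [mmeval/examples/cifar10_dist_eval](https://github.com/open-mmlab/mmeval/tree/main/examples/cifar10_dist_eval).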

In [1]:
!pip install torch torchvision tqdm

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple


In [2]:
import torch
import torchvision as tv
import tqdm
from torch.utils.data import DataLoader

from mmeval import Accuracy

2022-11-16 22:05:56.763074: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0


## 1. Single-process evaluation

First, we need to load the CIFAR-10 test data, for which we can use the dataset class provided by TorchVision.

In [3]:
def get_eval_dataloader():
    dataset = tv.datasets.CIFAR10(
        root='./',
        train=False,
        download=True,
        transform=tv.transforms.ToTensor())
    return DataLoader(dataset, batch_size=1)

Next, we need to prepare the model to be evaluated. Here we use resnet18 from TorchVision.

In [4]:
def get_model(pretrained_model_fpath=None):
    model = tv.models.resnet18(num_classes=10)
    if pretrained_model_fpath is not None:
        model.load_state_dict(torch.load(pretrained_model_fpath))
    return model.eval()

With the dataset and the model in place, we can use the mmeval.Accuracy metric to evaluate the model's predictions.

In [5]:
eval_dataloader = get_eval_dataloader()
model = get_model('./cifar10_resnet18.pth').cuda()
# Instantiate `Accuracy` to compute top-1 and top-3 accuracy
accuracy = Accuracy(topk=(1, 3))

with torch.no_grad():
    for images, labels in tqdm.tqdm(eval_dataloader):
        predicted_score = model(images.cuda()).cpu()
        # Accumulate batch data; intermediate results are stored in `accuracy._results`
        accuracy.add(predictions=predicted_score, labels=labels)

# Call `accuracy.compute` to compute the metric
print(accuracy.compute())
# Call `accuracy.reset` to clear the intermediate results stored in `accuracy._results`
accuracy.reset()

Files already downloaded and verified


100%|███████████████████████████████████████████████████████████████████████████| 10000/10000 [00:33<00:00, 302.07it/s]


{'top1': 0.7458999752998352, 'top3': 0.8931000232696533}
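
To make the `add` / `compute` / `reset` cycle and the meaning of `top1` and `top3` concrete, here is a minimal pure-Python sketch. `TopKAccuracySketch` is a hypothetical stand-in written for illustration only; it is not MMEval's `Accuracy` implementation and only mirrors the calling pattern used above.

```python
from typing import Dict, List, Sequence, Tuple


class TopKAccuracySketch:
    """Hypothetical stand-in for a top-k accuracy metric (not MMEval's class)."""

    def __init__(self, topk: Sequence[int] = (1,)) -> None:
        self.topk = tuple(topk)
        # One entry per sample: for each k, whether the label is in the top k.
        self._results: List[Tuple[bool, ...]] = []

    def add(self, predictions: Sequence[Sequence[float]],
            labels: Sequence[int]) -> None:
        for scores, label in zip(predictions, labels):
            # Class indices sorted by descending score.
            ranked = sorted(
                range(len(scores)), key=lambda c: scores[c], reverse=True)
            self._results.append(
                tuple(label in ranked[:k] for k in self.topk))

    def compute(self) -> Dict[str, float]:
        total = len(self._results)
        return {
            f'top{k}': sum(hit[i] for hit in self._results) / total
            for i, k in enumerate(self.topk)
        }

    def reset(self) -> None:
        self._results.clear()


metric = TopKAccuracySketch(topk=(1, 3))
# Two batches of one sample each, with four classes.
metric.add(predictions=[[0.1, 0.6, 0.2, 0.1]], labels=[1])  # hit at top-1
metric.add(predictions=[[0.4, 0.3, 0.2, 0.1]], labels=[2])  # hit only at top-3
result = metric.compute()
print(result)
metric.reset()
```

The first sample counts toward both metrics, while the second counts only toward top-3, so top-3 accuracy can never be lower than top-1 accuracy.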


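## 2. Distributed evaluation with torch.distributed

MMEval implements two distributed communication backends for torch.distributed: TorchCPUDist and TorchCUDADist.

There are two ways to set the distributed communication backend for MMEval: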

In [6]:
from mmeval.core import set_default_dist_backend
from mmeval import Accuracy

# 1. Set the global default distributed communication backend
set_default_dist_backend('torch_cpu')

# 2. Pass `dist_backend` when initializing the metric
accuracy = Accuracy(dist_backend='torch_cpu')

Combining the single-process evaluation code above with dataset sharding and distributed initialization gives us distributed evaluation.

In [7]:
!cat cifar10_dist_eval/cifar10_eval_torch_dist.py

import torch
import torchvision as tv
import tqdm
from torch.utils.data import DataLoader, DistributedSampler

from mmeval import Accuracy


def get_eval_dataloader(rank=0, num_replicas=1):
    dataset = tv.datasets.CIFAR10(
        root='./',
        train=False,
        download=True,
        transform=tv.transforms.ToTensor())
    dist_sampler = DistributedSampler(
        dataset, num_replicas=num_replicas, rank=rank)
    data_loader = DataLoader(dataset, batch_size=1, sampler=dist_sampler)
    return data_loader, len(dataset)


def get_model(pretrained_model_fpath=None):
    model = tv.models.resnet18(num_classes=10)
    if pretrained_model_fpath is not None:
        model.load_state_dict(torch.load(pretrained_model_fpath))
    return model.eval()


def eval_fn(rank, process_num):
    torch.distributed.init_process_group(
        backend='gloo',
        init_method='tcp://127.0.0.1:2345',
        world_size=process_num,
        rank=rank)
    torch

In [8]:
!python cifar10_dist_eval/cifar10_eval_torch_dist.py

2022-11-16 22:10:49.905219: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-11-16 22:10:53.194848: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-11-16 22:10:53.194848: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-11-16 22:10:53.207048: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
100%|██████████████████████████████████████| 3334/3334 [00:11<00:00, 283.87it/s]
{'top1': 0.7458999752998352, 'top3': 0.8931000232696533}
{'top1': 0.7458999752998352, 'top3': 0.8931000232696533}
{'top1': 0.7458999752998352, 'top3': 0.8931000232696533}
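
The progress bar above shows 3334 samples per process even though the test set has 10000 images: with 3 replicas, `DistributedSampler` pads the index list to 10002 so that every process receives the same number of samples. The sketch below reproduces that sharding in pure Python and shows one way the padded duplicates could be dropped after gathering, which is presumably why the script also returns `len(dataset)`. It assumes no shuffling for readability, and `shard_indices` and `merge_shards` are hypothetical helper names, not part of PyTorch or MMEval.

```python
import math
from typing import List


def shard_indices(num_samples: int, num_replicas: int, rank: int) -> List[int]:
    """Pad indices to a multiple of num_replicas, then stride by rank."""
    per_rank = math.ceil(num_samples / num_replicas)
    total_size = per_rank * num_replicas
    indices = list(range(num_samples))
    # Repeat the first few indices so the list divides evenly.
    indices += indices[:total_size - num_samples]
    return indices[rank:total_size:num_replicas]


def merge_shards(shards: List[List[int]], size: int) -> List[int]:
    """Interleave per-rank results into dataset order and drop the padding."""
    merged = [item for group in zip(*shards) for item in group]
    return merged[:size]


num_samples, world_size = 10000, 3
shards = [shard_indices(num_samples, world_size, r) for r in range(world_size)]
shard_sizes = [len(s) for s in shards]
print(shard_sizes)

merged = merge_shards(shards, size=num_samples)
print(len(merged))
```

Without the final truncation, the two padded samples would be counted twice and the distributed result could differ slightly from the single-process one.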


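## 3. Distributed evaluation with MPI4Py

MMEval abstracts and decouples its distributed communication functionality, so although the examples above use PyTorch for the model and data loading, we can still use a distributed communication backend other than torch.distributed. The following shows how to use MPI4Py as the distributed communication backend.

First, install MPI4Py and openmpi. We recommend installing them with conda: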
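
As a rough picture of what this decoupling means, the sketch below has a metric depend only on a small communication interface, so the backend can be swapped without touching the metric. All class names here (`DistBackendSketch`, `SingleProcessBackend`, `FakeMultiProcessBackend`, `MeanMetric`) are hypothetical and chosen for illustration; they are not MMEval's actual classes, and the fake backend simulates other ranks in-process instead of doing real communication.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class DistBackendSketch(ABC):
    """Hypothetical minimal communication interface."""

    @abstractmethod
    def all_gather_object(self, obj: Any) -> List[Any]:
        """Return the objects contributed by every process."""


class SingleProcessBackend(DistBackendSketch):
    def all_gather_object(self, obj: Any) -> List[Any]:
        return [obj]


class FakeMultiProcessBackend(DistBackendSketch):
    """Simulates a gather by appending pre-recorded results of other ranks."""

    def __init__(self, peer_results: List[Any]) -> None:
        self._peer_results = peer_results

    def all_gather_object(self, obj: Any) -> List[Any]:
        return [obj] + self._peer_results


class MeanMetric:
    """Metric that only talks to the backend interface."""

    def __init__(self, backend: DistBackendSketch) -> None:
        self.backend = backend
        self._results: List[float] = []

    def add(self, values: List[float]) -> None:
        self._results.extend(values)

    def compute(self) -> float:
        gathered = self.backend.all_gather_object(self._results)
        flat = [v for part in gathered for v in part]
        return sum(flat) / len(flat)


local_metric = MeanMetric(SingleProcessBackend())
local_metric.add([1.0, 0.0])
local_value = local_metric.compute()

dist_metric = MeanMetric(FakeMultiProcessBackend(peer_results=[[1.0, 1.0]]))
dist_metric.add([1.0, 0.0])
dist_value = dist_metric.compute()
print(local_value, dist_value)
```

The metric code is identical in both cases; only the object passed as the backend changes, which is the property that lets a torch.distributed backend be replaced by an MPI-based one.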

In [9]:
!conda install -y openmpi mpi4py

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [10]:
!cat cifar10_dist_eval/cifar10_eval_mpi4py.py

import torch
import torchvision as tv
import tqdm
from mpi4py import MPI
from torch.utils.data import DataLoader, DistributedSampler

from mmeval import Accuracy


def get_eval_dataloader(rank=0, num_replicas=1):
    dataset = tv.datasets.CIFAR10(
        root='./',
        train=False,
        download=True,
        transform=tv.transforms.ToTensor())
    dist_sampler = DistributedSampler(
        dataset, num_replicas=num_replicas, rank=rank)
    data_loader = DataLoader(dataset, batch_size=1, sampler=dist_sampler)
    return data_loader, len(dataset)


def get_model(pretrained_model_fpath=None):
    model = tv.models.resnet18(num_classes=10)
    if pretrained_model_fpath is not None:
        model.load_state_dict(torch.load(pretrained_model_fpath))
    return model.eval()


def eval_fn(rank, process_num):
    torch.cuda.set_device(rank)
    eval_dataloader, total_num_samples = get_eval_dataloader(rank, process_num)
    model = get_model('./cifar10_resn

Use mpirun to launch the distributed evaluation:

In [1]:
# Launch 3 processes with mpirun
!mpirun -np 3 python cifar10_dist_eval/cifar10_eval_mpi4py.py

2022-11-16 22:12:59.873751: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-11-16 22:12:59.873752: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-11-16 22:12:59.874402: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
100%|██████████| 3334/3334 [00:11<00:00, 282.08it/s]
{'top1': 0.7458999752998352, 'top3': 0.8931000232696533}
{'top1': 0.7458999752998352, 'top3': 0.8931000232696533}
{'top1': 0.7458999752998352, 'top3': 0.8931000232696533}
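
Because the script listing above is cut off, it does not show how each process started by `mpirun` learns its rank and the total number of processes. A common way to do this with MPI4Py is sketched below; it is an assumption about the approach, not a reconstruction of the original script. The fallback branch is added here only so that the snippet also runs as a plain single process when MPI4Py is not installed.

```python
try:
    # Importable only when mpi4py and an MPI runtime are installed.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()        # index of this process, starting at 0
    world_size = comm.Get_size()  # total number of processes launched
except (ImportError, RuntimeError):
    # Fallback for environments without MPI: behave as a single process.
    rank, world_size = 0, 1

print(f'process {rank} of {world_size}')
```

Under `mpirun -np 3`, each of the three processes would run this code independently and see a different `rank`. These two values correspond to the `rank` and `process_num` arguments that `eval_fn` takes in the script.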

