{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fe12e252",
   "metadata": {},
   "source": [
    "# 微调分割\n",
    "\n",
    "参考：[TorchVision Object Detection Finetuning Tutorial](https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html)\n",
    "\n",
    "在 [Penn-Fudan 数据库](https://www.cis.upenn.edu/~jshi/ped_html/) 中对行人检测和分割预训练的 [Mask R-CNN](https://arxiv.org/abs/1703.06870) 模型进行微调。它包含 170 个图像 345 个实例的行人，使用它来说明如何在 `torchvision` 中使用新的特征，以训练一个自定义数据集上的实例分割模型。\n",
    "\n",
    "## 定义数据集\n",
    "\n",
    "用于训练对象检测、实例分割和人员关键点检测的参考脚本允许轻松地支持添加新的自定义数据集。数据集应该继承标准的 `torch.utils.data.Dataset` 类，并实现 `__len__` 和 `__getitem__`。\n",
    "\n",
    "唯一需要的指定的是数据集 `__getitem__` 应该返回：\n",
    "\n",
    "- image：PIL 图片尺寸 `(H, W)`\n",
    "- target：一个包含以下字段的字典。\n",
    "    - `boxes` (`FloatTensor[N, 4]`)：`N` 个边界框的坐标为 `[x0, y0, x1, y1]` 格式，取值范围为 $[0, W) \\times [0, H)$。\n",
    "    - `labels` (`Int64Tensor[N]`)：每个边界框的标签。`0` 表示始终是背景类。\n",
    "    - `image_id` (`Int64Tensor[1]`)：一个图像标识符。它应该在数据集中的所有图像之间是唯一的，并在评估期间使用。\n",
    "    - `area` (`Tensor[N]`)：边界框的面积。这是在使用 COCO 度量进行评估时使用的，用于分隔小、中、大盒子之间的度量分数。\n",
    "    - `iscrowd` (`UInt8Tensor[N]`): `iscrowd=True` 的实例在计算时会被忽略。\n",
    "    - （可选）`mask` (`UInt8Tensor[N, H, W]`):每个对象的分割掩码。\n",
    "    - （可选）`keypoints` (`FloatTensor[N, K, 3]`):对于 `N` 个对象中的每一个，它包含了 `[x, y, visibility]` 格式的 `K` 个关键点，定义了对象。`visibility=0` 表示关键点不可见。注意，对于数据扩展，翻转关键点的概念取决于数据表示，您可能应该为新的关键点表示调整 `references/detection/transforms.py`。\n",
    "\n",
    "如果您的模型返回上述方法，它们将使其同时适用于训练和评估，并将使用来自 `pycocotools` 的评估脚本，这些脚本可以与 `pip` 安装 `pycocotools` 一起安装。\n",
    "\n",
    ":::{note}\n",
    "对于 Windows，请使用命令从 `gautamchitnis` 安装 `pycocotools`：\n",
    "\n",
    "```shell\n",
    "pip install git+https://github.com/gautamchitnis/cocoapi.git@cocodataset-master#subdirectory=PythonAPI\n",
    "```\n",
    ":::\n",
    "\n",
    "`labels`上有一个注意事项。模型将第 `0` 类作为背景。如果数据集不包含背景类，则标签中不应该有 `0`。例如，假设您只有两个类，`cat` 和 `dog`，您可以定义 `1`（不是 `0`）表示猫，`2` 表示狗。所以，例如，如果一个图像同时具有两个类，你的标签张量应该看起来像 `[1, 2]`。\n",
    "\n",
    "此外，如果想在训练期间使用高宽比分组（以便每批只包含具有相似高宽比的图像），那么建议也实现一个 `get_height_and_width` 方法，它返回图像的高度和宽度。如果没有提供此方法，可以通过 `__getitem__` 查询数据集的所有元素，它将在内存中加载图像，比提供自定义方法要慢。\n",
    "\n",
    "## 为 PennFudan 编写自定义数据集\n",
    "\n",
    "[下载并解压 zip 文件](https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip) 后，有以下文件夹结构：\n",
    "\n",
    "```sh\n",
    "PennFudanPed/\n",
    "  PedMasks/\n",
    "    FudanPed00001_mask.png\n",
    "    FudanPed00002_mask.png\n",
    "    FudanPed00003_mask.png\n",
    "    FudanPed00004_mask.png\n",
    "    ...\n",
    "  PNGImages/\n",
    "    FudanPed00001.png\n",
    "    FudanPed00002.png\n",
    "    FudanPed00003.png\n",
    "    FudanPed00004.png\n",
    "```\n",
    "\n",
    "这是一对图像和分割蒙版的一个例子：\n",
    "\n",
    "![](images/ee.png)\n",
    "\n",
    "所以每个图像都有一个对应的分割蒙版，其中每个颜色对应一个不同的实例。为这个数据集编写一个 `torch.utils.data.Dataset` 类。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "0f595403",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import numpy as np\n",
    "import torch\n",
    "from PIL import Image\n",
    "\n",
    "\n",
    "class PennFudanDataset(torch.utils.data.Dataset):\n",
    "    def __init__(self, root, transforms):\n",
    "        self.root = root\n",
    "        self.transforms = transforms\n",
    "        # load all image files, sorting them to\n",
    "        # ensure that they are aligned\n",
    "        self.imgs = list(sorted(os.listdir(os.path.join(root, \"PNGImages\"))))\n",
    "        self.masks = list(sorted(os.listdir(os.path.join(root, \"PedMasks\"))))\n",
    "\n",
    "    def __getitem__(self, idx):\n",
    "        # load images and masks\n",
    "        img_path = os.path.join(self.root, \"PNGImages\", self.imgs[idx])\n",
    "        mask_path = os.path.join(self.root, \"PedMasks\", self.masks[idx])\n",
    "        img = Image.open(img_path).convert(\"RGB\")\n",
    "        # note that we haven't converted the mask to RGB,\n",
    "        # because each color corresponds to a different instance\n",
    "        # with 0 being background\n",
    "        mask = Image.open(mask_path)\n",
    "        # convert the PIL Image into a numpy array\n",
    "        mask = np.array(mask)\n",
    "        # instances are encoded as different colors\n",
    "        obj_ids = np.unique(mask)\n",
    "        # first id is the background, so remove it\n",
    "        obj_ids = obj_ids[1:]\n",
    "\n",
    "        # split the color-encoded mask into a set\n",
    "        # of binary masks\n",
    "        masks = mask == obj_ids[:, None, None]\n",
    "\n",
    "        # get bounding box coordinates for each mask\n",
    "        num_objs = len(obj_ids)\n",
    "        boxes = []\n",
    "        for i in range(num_objs):\n",
    "            pos = np.where(masks[i])\n",
    "            xmin = np.min(pos[1])\n",
    "            xmax = np.max(pos[1])\n",
    "            ymin = np.min(pos[0])\n",
    "            ymax = np.max(pos[0])\n",
    "            boxes.append([xmin, ymin, xmax, ymax])\n",
    "\n",
    "        # convert everything into a torch.Tensor\n",
    "        boxes = torch.as_tensor(boxes, dtype=torch.float32)\n",
    "        # there is only one class\n",
    "        labels = torch.ones((num_objs,), dtype=torch.int64)\n",
    "        masks = torch.as_tensor(masks, dtype=torch.uint8)\n",
    "\n",
    "        image_id = torch.tensor([idx])\n",
    "        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])\n",
    "        # suppose all instances are not crowd\n",
    "        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)\n",
    "\n",
    "        target = {}\n",
    "        target[\"boxes\"] = boxes\n",
    "        target[\"labels\"] = labels\n",
    "        target[\"masks\"] = masks\n",
    "        target[\"image_id\"] = image_id\n",
    "        target[\"area\"] = area\n",
    "        target[\"iscrowd\"] = iscrowd\n",
    "\n",
    "        if self.transforms is not None:\n",
    "            img, target = self.transforms(img, target)\n",
    "\n",
    "        return img, target\n",
    "\n",
    "    def __len__(self):\n",
    "        return len(self.imgs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e448c7ee",
   "metadata": {},
   "source": [
    "这就是数据集的全部内容。现在，定义一个可以对这个数据集执行预测的模型。\n",
    "\n",
    "## 定义模型\n",
    "\n",
    "使用 [Mask R-CNN](https://arxiv.org/abs/1703.06870)，它是基于 [Faster R-CNN](https://arxiv.org/abs/1506.01497) 的。Faster R-CNN 是一种模型，可以预测图像中潜在物体的边界框和类分数。\n",
    "\n",
    "![](./images/tv_image03.png)\n",
    "\n",
    "Mask R-CNN 增加了一个额外的分支到 Faster R-CNN，它也预测每个实例的分割掩码。\n",
    "\n",
    "![](./images/tv_image04.png)\n",
    "\n",
    "有两种常见的情况，人们可能想要修改 torchvision modelzoo 中的一个可用模型。第一个是当我们想从一个预先训练的模型开始，只是微调最后一层。另一种是当我们想要用一个不同的模型替换模型的主干时（例如，为了更快的预测）。\n",
    "\n",
    "让我们看看在接下来的部分中我们将如何完成这两个步骤。\n",
    "\n",
    "### 1. 从预先训练的模型进行微调\n",
    "\n",
    "让我们假设您希望从一个经过 COCO 训练的模型开始，并希望针对特定的类对其进行微调。下面是一种可能的方法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "6ffdca79",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torchvision\n",
    "from torchvision.models.detection.faster_rcnn import FastRCNNPredictor\n",
    "\n",
    "# load a model pre-trained pre-trained on COCO\n",
    "model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)\n",
    "\n",
    "# replace the classifier with a new one, that has\n",
    "# num_classes which is user-defined\n",
    "num_classes = 2  # 1 class (person) + background\n",
    "# get number of input features for the classifier\n",
    "in_features = model.roi_heads.box_predictor.cls_score.in_features\n",
    "# replace the pre-trained head with a new one\n",
    "model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21faea39",
   "metadata": {},
   "source": [
    "### 2 修改模型，以添加不同的骨干"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "6ce16b70",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torchvision\n",
    "from torchvision.models.detection import FasterRCNN\n",
    "from torchvision.models.detection.rpn import AnchorGenerator\n",
    "\n",
    "# load a pre-trained model for classification and return\n",
    "# only the features\n",
    "backbone = torchvision.models.mobilenet_v2(pretrained=True).features\n",
    "# FasterRCNN needs to know the number of\n",
    "# output channels in a backbone. For mobilenet_v2, it's 1280\n",
    "# so we need to add it here\n",
    "backbone.out_channels = 1280\n",
    "\n",
    "# let's make the RPN generate 5 x 3 anchors per spatial\n",
    "# location, with 5 different sizes and 3 different aspect\n",
    "# ratios. We have a Tuple[Tuple[int]] because each feature\n",
    "# map could potentially have different sizes and\n",
    "# aspect ratios\n",
    "anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),\n",
    "                                   aspect_ratios=((0.5, 1.0, 2.0),))\n",
    "\n",
    "# let's define what are the feature maps that we will\n",
    "# use to perform the region of interest cropping, as well as\n",
    "# the size of the crop after rescaling.\n",
    "# if your backbone returns a Tensor, featmap_names is expected to\n",
    "# be [0]. More generally, the backbone should return an\n",
    "# OrderedDict[Tensor], and in featmap_names you can choose which\n",
    "# feature maps to use.\n",
    "roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0'],\n",
    "                                                output_size=7,\n",
    "                                                sampling_ratio=2)\n",
    "\n",
    "# put the pieces together inside a FasterRCNN model\n",
    "model = FasterRCNN(backbone,\n",
    "                   num_classes=2,\n",
    "                   rpn_anchor_generator=anchor_generator,\n",
    "                   box_roi_pool=roi_pooler)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c66e4163",
   "metadata": {},
   "source": [
    "### 基于 PennFudan 数据集的实例分割模型\n",
    "\n",
    "在我们的例子中，我们希望从一个预先训练的模型进行微调，因为我们的数据集非常小，所以我们将采用第1种方法。\n",
    "\n",
    "这里我们还想计算实例分割掩码，所以我们将使用 Mask R-CNN："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "fb1a9a9e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torchvision\n",
    "from torchvision.models.detection.faster_rcnn import FastRCNNPredictor\n",
    "from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor\n",
    "\n",
    "\n",
    "def get_model_instance_segmentation(num_classes):\n",
    "    # load an instance segmentation model pre-trained pre-trained on COCO\n",
    "    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)\n",
    "\n",
    "    # get number of input features for the classifier\n",
    "    in_features = model.roi_heads.box_predictor.cls_score.in_features\n",
    "    # replace the pre-trained head with a new one\n",
    "    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)\n",
    "\n",
    "    # now get the number of input features for the mask classifier\n",
    "    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels\n",
    "    hidden_layer = 256\n",
    "    # and replace the mask predictor with a new one\n",
    "    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask,\n",
    "                                                       hidden_layer,\n",
    "                                                       num_classes)\n",
    "\n",
    "    return model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26472247",
   "metadata": {},
   "source": [
    "就是这样，这将使模型准备好训练和评估您的自定义数据集。\n",
    "\n",
    "## 把每个组件组合在一起\n",
    "\n",
    "在 `references/detection/` 中，有许多辅助函数来简化训练和评估检测模型。这里，我们将使用 `references/detection/engine.py`、 `references/detection/utils.py` 和 `references/detection/transforms.py`。只需要将 `references/detection ` 下的所有内容复制到你的文件夹中，并在这里使用它们。\n",
    "\n",
    "让我们编写一些辅助函数用于数据增强/转换："
   ]
  },
  {
   "cell_type": "markdown",
   "id": "94dc809e",
   "metadata": {},
   "source": [
    "待更....."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "850ab69a",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "jupytext": {
   "formats": "md:myst,ipynb"
  },
  "kernelspec": {
   "display_name": "Python 3.10.4 ('torch': conda)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  },
  "nbTranslate": {
   "displayLangs": [
    "*"
   ],
   "hotkey": "alt-t",
   "langInMainMenu": true,
   "sourceLang": "en",
   "targetLang": "fr",
   "useGoogleTranslate": true
  },
  "vscode": {
   "interpreter": {
    "hash": "20e538bd0bbffa4ce75068aaf85df10d4944f3fdb705eeec6781a4702773116f"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}