1. Official explanation

The page https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data contains this sentence:

For training command outputs and further details please see the training section of Google Colab Notebook.

Open that notebook (reaching it takes a few tricks, you know what I mean).
Here is a summary of what the notebook says about training.

actual training is much longer, around 300-1000 epochs, depending on your dataset
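--cfg selects the model file (models/yolov5s.yaml)
--data selects the dataset file (data/coco128.yaml)
--weights specifies the initial weights file (use --weights '' for random initialization)
All training results are saved to runs/exp0 for the first experiment, then runs/exp1, runs/exp2 etc. for subsequent experiments. (In my experiments the counter stopped at 10, and every later run kept updating exp10.)
TensorBoard is optional (I don't know how to use it yet...)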
A Mosaic Dataloader is used for training
View test_batch0_gt.jpg to see test batch 0 ground truth labels.
View test_batch0_pred.jpg to see test batch 0 predictions.
Training losses and performance metrics are saved to Tensorboard and also to a runs/exp0/results.txt logfile. results.txt is plotted as results.png after training completes.

And that's all there is. Clearly this doesn't help much with a deeper understanding; it's barely enough to get started.
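As an aside on the exp10 observation in the list above: one possible explanation is that the next run directory is chosen by sorting the existing names as strings. This is an assumption on my part, not something checked against the YOLOv5 source. The helper names below are hypothetical; the sketch only shows how a string sort would get stuck at exp10 while a numeric comparison would not.

```python
def next_run_lexicographic(existing):
    """Pick the next run name from the last entry of a string sort."""
    last = sorted(existing)[-1]          # string order: 'exp9' sorts after 'exp10'
    return 'exp%d' % (int(last[3:]) + 1)


def next_run_numeric(existing):
    """Pick the next run name from the largest numeric suffix."""
    return 'exp%d' % (max(int(name[3:]) for name in existing) + 1)


runs = ['exp%d' % i for i in range(11)]  # exp0 .. exp10 already exist

print(sorted(runs)[-1])                  # exp9
print(next_run_lexicographic(runs))      # exp10 (collides with an existing run)
print(next_run_numeric(runs))            # exp11
```

If the string-sort guess is right, renaming or removing old runs/exp* folders would sidestep the collision.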

2. Reading the source code

All the command-line arguments are defined here.

if __name__ == '__main__':
    check_git_status()
    parser = argparse.ArgumentParser()
    parser.add_argument('--cfg', type=str, default='models/yolov5s.yaml', help='model.yaml path')
    parser.add_argument('--data', type=str, default='data/coco128.yaml', help='data.yaml path')
    parser.add_argument('--hyp', type=str, default='', help='hyp.yaml path (optional)')
    parser.add_argument('--epochs', type=int, default=300)
    parser.add_argument('--batch-size', type=int, default=16)
    parser.add_argument('--img-size', nargs='+', type=int, default=[640, 640], help='train,test sizes')
    parser.add_argument('--rect', action='store_true', help='rectangular training')
    parser.add_argument('--resume', nargs='?', const='get_last', default=False,
                        help='resume from given path/to/last.pt, or most recent run if blank.')
    parser.add_argument('--nosave', action='store_true', help='only save final checkpoint')
    parser.add_argument('--notest', action='store_true', help='only test final epoch')
    parser.add_argument('--noautoanchor', action='store_true', help='disable autoanchor check')
    parser.add_argument('--evolve', action='store_true', help='evolve hyperparameters')
    parser.add_argument('--bucket', type=str, default='', help='gsutil bucket')
    parser.add_argument('--cache-images', action='store_true', help='cache images for faster training')
    parser.add_argument('--weights', type=str, default='', help='initial weights path')
    parser.add_argument('--name', default='', help='renames results.txt to results_name.txt if supplied')
    parser.add_argument('--device', default='', help='cuda device, i.e. 0 or 0,1,2,3 or cpu')
    parser.add_argument('--multi-scale', action='store_true', help='vary img-size +/- 50%%')
    parser.add_argument('--single-cls', action='store_true', help='train as single-class dataset')
    opt = parser.parse_args()

To summarize:

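cfg, data, weights: the three arguments covered above, which you need to set for your own dataset;
hyp: not needed for now; it points to a file of hyperparameters (learning rate and so on);
epochs: number of training epochs, default 300; should be set explicitly;
batch-size: how many samples are fed in per step; my memory can only handle 16, so I can leave it at the default of 16;
img-size: image sizes for the training and test sets (which I take to mean resolution), default 640 640; nargs='+' means one or more values can be given;
rect: adding '--rect' sets rect to True; I'm not sure of the effect (presumably it enables rectangular training);
resume: resume training from the given path/to/last.pt, or from the most recent run if no path is given (so this continues an earlier run rather than retraining from scratch);
notest: only test final epoch (so the trend during training presumably won't be visible);
evolve: evolve the hyperparameters (hyp); worth a try;
cache-images: cache images for faster training; worth a try;
name: renames results.txt to results_name.txt if supplied;
device: cuda device, i.e. 0 or 0,1,2,3 or cpu; mine already uses the GTX 1060 by default, so no change needed;
single-cls: train as single-class dataset; not needed for now;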

I didn't really understand the following ones:
noautoanchor: disable autoanchor check
nosave: only save final checkpoint
bucket: gsutil bucket (probably related to Google Cloud; I probably won't need it)
multi-scale: vary img-size +/- 50%
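The argparse behaviours mentioned in the list (nargs='+', action='store_true', and the nargs='?' plus const combination used by --resume) can be checked in isolation. The sketch below copies only those three definitions from the parser above; the show helper and the runs/exp0/weights/last.pt path are illustrative assumptions, not part of train.py.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--img-size', nargs='+', type=int, default=[640, 640])
parser.add_argument('--rect', action='store_true')
parser.add_argument('--resume', nargs='?', const='get_last', default=False)


def show(argv):
    """Parse a list of command-line tokens and return the three values."""
    opt = parser.parse_args(argv)
    return opt.img_size, opt.rect, opt.resume


# No flags at all: every option falls back to its default.
print(show([]))                                       # ([640, 640], False, False)

# nargs='+' accepts a single value too; --rect flips the flag to True.
print(show(['--img-size', '416', '--rect']))          # ([416], True, False)

# --resume without a value takes const; with a value it takes that value.
print(show(['--resume']))                             # ([640, 640], False, 'get_last')
print(show(['--resume', 'runs/exp0/weights/last.pt']))
```

So a bare --resume yields the string 'get_last', while omitting the flag leaves it False; how train.py acts on those values is not shown in the parser itself.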

After reading through this, my command line should become:

python train.py --epoch 53 --data .\data\junk2020.yaml --cfg .\models\yolov5s.yaml --weight runs\exp10\weights\best.pt --evolve --cache-images
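Note that the command spells the flags --epoch and --weight, while the parser defines --epochs and --weights. This works because argparse accepts unambiguous prefixes of long options by default. A minimal sketch with four of the options from the parser above (the best.pt value is just a placeholder):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--epochs', type=int, default=300)
parser.add_argument('--weights', type=str, default='')
parser.add_argument('--nosave', action='store_true')
parser.add_argument('--notest', action='store_true')

# '--epoch' and '--weight' are unique prefixes, so they resolve to the full names.
opt = parser.parse_args(['--epoch', '53', '--weight', 'best.pt'])
print(opt.epochs, opt.weights)

# '--no' matches both --nosave and --notest, so argparse rejects it and exits.
try:
    parser.parse_args(['--no'])
    ambiguous_rejected = False
except SystemExit:
    ambiguous_rejected = True
print(ambiguous_rejected)
```

Writing the full flag names is still safer, since a prefix can become ambiguous when new options are added.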

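Testing it:
There probably wasn't enough memory. The computer nearly crashed and I had to kill python midway, so leave out --cache-images if your machine can't handle it...
I also don't know where the hyp produced by evolve gets saved... I'll look into it tomorrow.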

python train.py --epoch 80 --data .\data\junk2020.yaml --cfg .\models\yolov5s.yaml --weight runs\exp10\weights\best.pt --evolve

The result: evolve raised an error

Traceback (most recent call last):
  File "train.py", line 449, in <module>
    print_mutation(hyp, results, opt.bucket)
  File "D:\ForSpeed\junk_yolov5\yolov5\utils\utils.py", line 823, in print_mutation
    b = '%10.3g' * len(hyp) % tuple(hyp.values())  # hyperparam values
TypeError: must be real number, not str

Since this is hard to debug, I'm dropping evolve for now.

3. Interpreting the visualized results

Here is what each plot in results.png shows:

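GIoU: presumably the mean GIoU loss; the smaller it is, the more accurate the boxes;
Objectness: presumably the mean objectness loss; the smaller it is, the more accurate the object detection;
Classification: presumably the mean classification loss; the smaller it is, the more accurate the classification;
Precision: precision (correct detections / all detections made);
Recall: recall (correct detections / all objects that should have been detected);
mAP@0.5 & mAP@0.5:0.95: in short, AP is the area under the curve obtained by plotting Precision against Recall, and m means the mean over classes. The number after @ is the IoU threshold used to decide whether a detection counts as positive or negative, and @0.5:0.95 means averaging over the thresholds 0.5:0.05:0.95.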
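To make these definitions concrete, here is a small sketch of IoU, precision and recall, and the 0.5:0.05:0.95 threshold list. It illustrates the definitions only and is not YOLOv5's evaluation code; the function names, the boxes and the counts are made up for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# The ten IoU thresholds behind mAP@0.5:0.95.
thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

gt = (0, 0, 10, 10)      # ground-truth box
pred = (2, 0, 12, 10)    # predicted box, shifted 2 px to the right
overlap = iou(gt, pred)  # intersection 80, union 120

# Thresholds at which this prediction still counts as a correct detection.
hits = [t for t in thresholds if overlap >= t]
print(round(overlap, 3), hits)

# 8 correct detections, 2 false alarms, 4 missed objects.
print(precision_recall(tp=8, fp=2, fn=4))
```

The same prediction is a hit at mAP@0.5 but a miss at the stricter thresholds, which is why mAP@0.5:0.95 is always the lower of the two curves.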

4. Fixing the evolve error

Traceback (most recent call last):
  File "train.py", line 449, in <module>
    print_mutation(hyp, results, opt.bucket)
  File "D:\ForSpeed\junk_yolov5\yolov5\utils\utils.py", line 823, in print_mutation
    b = '%10.3g' * len(hyp) % tuple(hyp.values())  # hyperparam values
TypeError: must be real number, not str

Look at the statement b = '%10.3g' * len(hyp) % tuple(hyp.values()): it collects all the values of the hyp dict into a tuple and formats every one of them with %10.3g.

hyp = {'optimizer': 'SGD',  # ['adam', 'SGD', None] if none, default is SGD
       'lr0': 0.01,  # initial learning rate (SGD=1E-2, Adam=1E-3)
       'momentum': 0.937,  # SGD momentum/Adam beta1
       'weight_decay': 5e-4,  # optimizer weight decay
       'giou': 0.05,  # giou loss gain
       'cls': 0.58,  # cls loss gain
       'cls_pw': 1.0,  # cls BCELoss positive_weight
       'obj': 1.0,  # obj loss gain (*=img_size/320 if img_size != 320)
       'obj_pw': 1.0,  # obj BCELoss positive_weight
       'iou_t': 0.20,  # iou training threshold
       'anchor_t': 4.0,  # anchor-multiple threshold
       'fl_gamma': 0.0,  # focal loss gamma (efficientDet default is gamma=1.5)
       'hsv_h': 0.014,  # image HSV-Hue augmentation (fraction)
       'hsv_s': 0.68,  # image HSV-Saturation augmentation (fraction)
       'hsv_v': 0.36,  # image HSV-Value augmentation (fraction)
       'degrees': 0.0,  # image rotation (+/- deg)
       'translate': 0.0,  # image translation (+/- fraction)
       'scale': 0.5,  # image scale (+/- gain)
       'shear': 0.0}  # image shear (+/- deg)

Looking at the values, the first one is the string 'SGD', which is why the formatting fails.
Change b = '%10.3g' * len(hyp) % tuple(hyp.values()) to
b = '%10s' * 1 % (list(hyp.values())[0],) + '%10.3g' * (len(hyp) - 1) % tuple( list(hyp.values())[1:])
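Both the failure and the patched line can be reproduced outside train.py with a shortened dict. The three-entry hyp below is a toy stand-in for the real one, and format_values is a hypothetical alternative that picks the format per value type instead of relying on the string being first.

```python
# Toy stand-in for the real hyp dict: one string value followed by numbers.
hyp = {'optimizer': 'SGD', 'lr0': 0.01, 'momentum': 0.937}

# Original statement: %g needs a number, so the string value raises TypeError.
try:
    b = '%10.3g' * len(hyp) % tuple(hyp.values())
    original_failed = False
except TypeError as err:
    original_failed = True
    print('original line fails:', err)

# Patched statement: %10s for the first value, %10.3g for the remaining ones.
values = list(hyp.values())
b = '%10s' % (values[0],) + '%10.3g' * (len(values) - 1) % tuple(values[1:])
print(repr(b))


def format_values(values):
    """Hypothetical alternative: choose the format by type, not by position."""
    return ''.join(('%10s' if isinstance(v, str) else '%10.3g') % v
                   for v in values)


print(format_values(values) == b)
```

The positional patch breaks again if the string entry ever moves within the dict; the type-based variant does not have that dependency.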
Train for one epoch to try it:

Traceback (most recent call last):
  File "train.py", line 449, in <module>
    print_mutation(hyp, results, opt.bucket)
  File "D:\ForSpeed\junk_yolov5\yolov5\utils\utils.py", line 837, in print_mutation
    x = np.unique(np.loadtxt('evolve.txt', ndmin=2), axis=0)  # load unique rows
  File "C:\Users\15518\AppData\Local\Programs\Python\Python37\lib\site-packages\numpy\lib\npyio.py", line 1146, in loadtxt
    for x in read_data(_loadtxt_chunksize):
  File "C:\Users\15518\AppData\Local\Programs\Python\Python37\lib\site-packages\numpy\lib\npyio.py", line 1074, in read_data
    items = [conv(val) for (conv, val) in zip(converters, vals)]
  File "C:\Users\15518\AppData\Local\Programs\Python\Python37\lib\site-packages\numpy\lib\npyio.py", line 1074, in <listcomp>
    items = [conv(val) for (conv, val) in zip(converters, vals)]
  File "C:\Users\15518\AppData\Local\Programs\Python\Python37\lib\site-packages\numpy\lib\npyio.py", line 781, in floatconv
    return float(x)
ValueError: could not convert string to float: 'SGD'

This time the failure is in np.loadtxt('evolve.txt', ndmin=2): the txt file now contains a string, so loading it as numbers fails.
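The traceback ends in return float(x), so the same error can be shown without numpy by converting a text row token by token. The rows below are made-up examples rather than real evolve.txt contents, and numeric_items is a hypothetical helper that keeps only the numeric hyp entries so that nothing non-numeric gets written out in the first place.

```python
def parse_row(line):
    """Convert one whitespace-separated text row to floats."""
    return [float(token) for token in line.split()]


good_row = '0.01 0.937 0.0005'
bad_row = 'SGD 0.01 0.937 0.0005'

print(parse_row(good_row))

# A single non-numeric token is enough to make the whole row unreadable.
try:
    parse_row(bad_row)
    bad_row_failed = False
except ValueError as err:
    bad_row_failed = True
    print('cannot parse:', err)


def numeric_items(hyp):
    """Hypothetical helper: keep only entries whose value is a number."""
    return {k: v for k, v in hyp.items() if isinstance(v, (int, float))}


print(numeric_items({'optimizer': 'SGD', 'lr0': 0.01, 'momentum': 0.937}))
```

An evolve.txt that already contains a row with 'SGD' would still fail to load, so such a file needs to be deleted or cleaned before the next attempt.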

Remove the first item and try again:

Traceback (most recent call last):
  File "train.py", line 437, in <module>
    hyp[k] = x[i + 7] * v[i]  # mutate
IndexError: index 18 is out of bounds for axis 0 with size 18