Object Detection and Text Detection Algorithms: Faster R-CNN, CTPN

Faster R-CNN Object Detection Algorithm

Towards Real-Time Object Detection with Region Proposal Networks

R-CNN: Regions with CNN features

  1. Input image
  2. Extract region proposals (~2k)
  3. Compute CNN features
  4. Classify regions

IoU Intersection over Union

A standard metric for measuring how accurately the corresponding objects are detected on a given dataset.

Predicted regions: bounding boxes

ground-truth bounding boxes (the approximate extent of the objects to detect, annotated manually in the training images)

IoU = \frac{Area\ of\ Overlap}{Area\ of\ Union}
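As a quick illustration, here is a minimal Python sketch of this IoU computation for two axis-aligned boxes; the (x1, y1, x2, y2) corner format and the function name are assumptions made for the example, not code from the original post.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    # Intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = area A + area B - intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# e.g. iou((0, 0, 10, 10), (5, 5, 15, 15)) -> 25 / 175 ≈ 0.143
```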


NMS (Non-Maximum Suppression)
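NMS keeps the highest-scoring box and discards lower-scoring boxes that overlap it too strongly. A greedy sketch, reusing the iou helper above; the 0.5 threshold is only an illustrative default.

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: boxes is a list of (x1, y1, x2, y2), scores a parallel list."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```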

Fast R-CNN


Selective Search

Anchors, sliding window, feature extraction

RPN Loss

Cls label: binary classification (object vs. background), assigned by the IoU between each anchor box and the ground-truth bounding boxes

Loc label

t_x^* = (x^*-x_a)/w_a, \quad t_y^* = (y^*-y_a)/h_a, \quad t_w^* = \log(w^*/w_a), \quad t_h^* = \log(h^*/h_a)

t_x = (x-x_a)/w_a, \quad t_y = (y-y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a)
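These targets can be computed directly from an anchor and its matched ground-truth box. A small sketch in center/size (cx, cy, w, h) form; the function names are illustrative, not from any specific implementation.

```python
import math

def encode(anchor, gt):
    """Regression targets t* for one anchor; both boxes as (cx, cy, w, h)."""
    xa, ya, wa, ha = anchor
    xg, yg, wg, hg = gt
    tx = (xg - xa) / wa
    ty = (yg - ya) / ha
    tw = math.log(wg / wa)
    th = math.log(hg / ha)
    return tx, ty, tw, th

def decode(anchor, t):
    """Invert the encoding: recover the predicted box from (tx, ty, tw, th)."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = t
    return (tx * wa + xa, ty * ha + ya, wa * math.exp(tw), ha * math.exp(th))
```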

Cls loss

Cross-entropy

Loc Loss

z_i = 0.5(x_i-y_i)^2/\beta, \quad \text{if } |x_i-y_i| < \beta
z_i = |x_i-y_i| - 0.5\beta, \quad \text{otherwise}
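The Loc loss applies this smooth L1 (Huber-style) function to each regression target. A direct transcription of the formula, with beta = 1.0 as an illustrative default:

```python
def smooth_l1(x, y, beta=1.0):
    """Smooth L1 between one prediction x and one target y (scalars)."""
    diff = abs(x - y)
    if diff < beta:
        return 0.5 * diff * diff / beta   # quadratic near zero
    return diff - 0.5 * beta              # linear for large errors
```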

RoI Head (Region of Interest)

Mask R-CNN

L = L_{cls} + L_{box} + L_{mask}

To this we apply a per-pixel sigmoid, and define L_{mask} as the average binary cross-entropy loss. For an RoI associated with ground-truth class k, L_{mask} is only defined on the k-th mask (other mask outputs do not contribute to the loss).
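A rough sketch of that per-pixel sigmoid plus binary cross-entropy, keeping only the k-th mask; the tensor shapes and the NumPy formulation are assumptions made for illustration, not the Mask R-CNN implementation itself.

```python
import numpy as np

def mask_loss(mask_logits, gt_mask, k):
    """mask_logits: (K, H, W) per-class mask logits for one RoI;
    gt_mask: (H, W) binary ground-truth mask; k: ground-truth class index."""
    logits = mask_logits[k]                      # only the k-th mask contributes
    p = 1.0 / (1.0 + np.exp(-logits))            # per-pixel sigmoid
    eps = 1e-7
    bce = -(gt_mask * np.log(p + eps) + (1 - gt_mask) * np.log(1 - p + eps))
    return bce.mean()                            # average binary cross-entropy
```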

RoI Align: no quantization (so no misalignment), floating-point coordinates are kept, and each region is further subdivided into sampling points.
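The key operation behind RoI Align is sampling the feature map at floating-point positions via bilinear interpolation instead of rounding to integer cells. A minimal sketch for one sampling point, assuming a 2-D NumPy-style feature map; the helper name is illustrative.

```python
def bilinear_sample(feat, y, x):
    """Bilinearly interpolate feat (H, W) at a floating-point location (y, x)."""
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0, x0] * (1 - dx) + feat[y0, x1] * dx
    bottom = feat[y1, x0] * (1 - dx) + feat[y1, x1] * dx
    return top * (1 - dy) + bottom * dy
```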

CTPN Text Detection Algorithm

Detecting Text in Natural Image with Connectionist Text Proposal Network

  • Detecting text in fine-scale proposals
  • Recurrent connectionist text proposals
  • Side-refinement

v_c = (c_y - c_y^a)/h^a, \quad v_c^* = (c_y^* - c_y^a)/h^a
v_h = \log(h/h^a), \quad v_h^* = \log(h^*/h^a)
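CTPN regresses only the vertical center and height of each fixed-width proposal. A small sketch of encoding and decoding these relative vertical coordinates for one anchor; the helper names are illustrative.

```python
import math

def encode_vertical(cy_gt, h_gt, cy_a, h_a):
    """Relative vertical targets (v_c*, v_h*) for one anchor."""
    return (cy_gt - cy_a) / h_a, math.log(h_gt / h_a)

def decode_vertical(v_c, v_h, cy_a, h_a):
    """Recover the predicted center y and height from (v_c, v_h)."""
    return v_c * h_a + cy_a, h_a * math.exp(v_h)
```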

Text line construction

o^* = (x^*_{side} - c^a_x)/w^a
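The side-refinement offset is measured from the anchor center and normalized by the anchor width. A minimal sketch of encoding the target and applying a predicted offset; names are illustrative.

```python
def encode_side(x_side_gt, cx_a, w_a):
    """Side-refinement target o* for the horizontal text-line boundary."""
    return (x_side_gt - cx_a) / w_a

def refine_side(o_pred, cx_a, w_a):
    """Refined x coordinate of the text-line side from the predicted offset."""
    return o_pred * w_a + cx_a
```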

Code

bounding box

CRNN Text Recognition Algorithm

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

  • CRNN

  • Code

  • CTC

  • lexicon-based

  • lexicon-free

feature sequence — receptive field

CRNN — CTC

\pi = \text{--hh-e-l-ll-oo--}, \quad B(\pi) = \text{hello}
p(l|y) = \sum_{\pi:B(\pi)=l} p(\pi|y), \quad p(\text{'hello'}|y) = \sum_{\pi:B(\pi)=\text{'hello'}} p(\pi|y)
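The mapping B first merges consecutive repeated characters and then removes blanks; greedy CTC decoding simply applies B to the per-frame argmax sequence. A sketch with '-' standing in for the blank symbol:

```python
def ctc_collapse(path, blank='-'):
    """B: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for ch in path:
        if ch != prev:          # merge repeated characters
            out.append(ch)
        prev = ch
    return ''.join(c for c in out if c != blank)   # remove blanks

# ctc_collapse('--hh-e-l-ll-oo--') -> 'hello'
```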

CTC Theory

p(l|x) = \sum_{\pi \in B^{-1}(l)} p(\pi|x)
h(x) = \arg\max_{l \in L^{\leq T}} p(l|x)
O^{ML}(S, N_w) = -\sum_{(x,z)\in S} \ln p(z|x) = -\sum_{(x,z)\in S} \ln \Big( \sum_{\pi \in B^{-1}(z)} p(\pi|x) \Big)

So that every path has a unique and valid representation in the graph, node transitions obey the following constraints:

  1. Transitions may only move to the lower right; other directions are not allowed
  2. There must be at least one blank between two identical characters
  3. Non-blank characters must not be skipped
  4. The path must start at one of the first two symbols
  5. The path must end at one of the last two symbols

forward-backward

Define the forward probability \alpha_t(s) as the sum of the probabilities of all prefix sub-paths that pass through node s at time t. For the label "ap" expanded with blanks, for example:

\alpha_3(4) = p(\text{-ap}) + p(\text{aap}) + p(\text{a-p}) + p(\text{app})

  • Case 1: the s-th symbol is the blank symbol

    \alpha_t(s) = (\alpha_{t-1}(s) + \alpha_{t-1}(s-1)) \cdot y^t_{seq(s)}

  • Case 2: the s-th symbol is the same as the (s-2)-th symbol

    \alpha_t(s) = (\alpha_{t-1}(s) + \alpha_{t-1}(s-1)) \cdot y^t_{seq(s)}

  • Case 3: neither Case 1 nor Case 2 applies (a combined forward-recursion sketch follows this list)

    \alpha_t(s) = (\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)) \cdot y^t_{seq(s)}
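Combining the three cases above gives the forward recursion; summing the forward variables of the last two nodes at the final frame yields p(l|x). A sketch under the assumption that y is a T × C matrix of per-frame symbol probabilities and seq is the label expanded with blanks (e.g. blank, a, blank, p, blank); all names are illustrative.

```python
def ctc_forward(y, seq, blank=0):
    """Forward probabilities alpha[t][s] for the blank-expanded label seq.
    y: T x C matrix, y[t][c] = probability of symbol c at frame t."""
    T, S = len(y), len(seq)
    alpha = [[0.0] * S for _ in range(T)]
    # Initialization: a path may start at the first blank or the first character.
    alpha[0][0] = y[0][seq[0]]
    if S > 1:
        alpha[0][1] = y[0][seq[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                       # stay on the same node
            if s >= 1:
                a += alpha[t - 1][s - 1]              # come from the previous node
            # Case 3 only: skip a blank when seq[s] is non-blank and differs from seq[s-2]
            if s >= 2 and seq[s] != blank and seq[s] != seq[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * y[t][seq[s]]
    # p(l|x): paths must end on the last character or the trailing blank.
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
```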


