Abstract
1-ms vision systems represent an extreme of temporal performance in video sensing, and a 1-ms dual-hand tracking system in particular leverages the dexterity of the hands and thus serves as a seamless and intuitive interface for Human-Computer Interaction. Deep CNNs promise high tracking robustness; however, neither existing GPU-based nor FPGA-based implementations address the tracking task at ultra-high speed. This paper proposes: (a) a paradigm that directly maps a deep CNN onto a hardwired circuit, so that the entire network runs in parallel and a high processing speed is obtained; the network requires no memory access, since all intermediate neural values are implicitly represented in hardware states, and condensed binarization is used to reduce resource utilization; (b) a hardware design of the hardwired network on FPGA, in which kernel-adapted convolutional trees are devised to maximize parallelism; the speed bottleneck of the network is thereby removed by implementing the convolutional layers as fine-grained pipelines built from unified components; (c) FPGA-GPU hetero complementation, which uses an auxiliary GPU network to compensate for the accuracy loss of the FPGA network without affecting its speed; the quick primary results from the FPGA are intermittently refined with delayed but accurate hints from the GPU. Implementation results show that the proposed method reaches 973 fps, taking merely 1.30 ms per $640\times 480$ image, while its accuracy is only 4.7% lower than that of the general method on test sequences. Video demonstrations are available at https://wcms.waseda.jp/em/5f9d020f136e7.
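The record itself carries no code, so the following is a minimal, hypothetical C sketch of the XNOR-popcount arithmetic that binarized convolutions generically reduce to, illustrating why a binarized network becomes cheap enough to hardwire. The function name `bconv3x3`, the bit packing, and the example values are assumptions for illustration only; the paper's actual "condensed binarization" and kernel-adapted convolutional trees are hardware-specific designs not reproduced here.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of a binarized 3x3 convolution tap via XNOR + popcount.
 * Weights and activations are in {-1, +1}, packed one bit per value
 * (bit = 1 encodes +1, bit = 0 encodes -1). This is the standard
 * BNN formulation, not the paper's condensed-binarization scheme. */
static inline int bconv3x3(uint16_t window, uint16_t kernel)
{
    /* XNOR over the 9 taps: a set bit marks a sign agreement. */
    uint16_t agree = (uint16_t)(~(window ^ kernel)) & 0x1FF;
    /* GCC/Clang builtin; counts agreeing sign pairs. */
    int pop = __builtin_popcount(agree);
    /* pop agreements contribute +1 each, (9 - pop) disagreements -1 each,
     * so the binary dot product is 2*pop - 9, in the range [-9, 9]. */
    return 2 * pop - 9;
}

int main(void)
{
    uint16_t window = 0x1B5; /* example packed activation window */
    uint16_t kernel = 0x0F0; /* example packed weight kernel */
    printf("binary dot product = %d\n", bconv3x3(window, kernel));
    return 0;
}
```

In hardware, each such XNOR-popcount collapses into a small tree of logic gates with no memory traffic, which is the property the hardwired mapping and its parallel, memory-free evaluation exploit.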
| Original language | English |
|---|---|
| Pages (from-to) | 8192-8203 |
| Number of pages | 12 |
| Journal | IEEE Transactions on Circuits and Systems for Video Technology |
| Volume | 32 |
| Issue number | 12 |
| DOI | |
| Publication status | Published - 1 Dec 2022 |
ASJC Scopus subject areas
- Media Technology
- Electrical and Electronic Engineering