June 10, 2020

Trying out finn 14 (tfc_end2end_example.ipynb part 9)

Continued from "Trying out finn 13 (tfc_end2end_example.ipynb part 8)".

Last time I reran tfc_end2end_example.ipynb from end2end_example with Vivado 2019.1, and the Vivado project was generated in "Inserting the IP into a PYNQ Overlay Shell" under "4. PYNQ hardware generation and deployment". This time I will continue from "Driver Generation" in the same section.

As before, I will study the notebook by translating its figures and text and quoting its code.

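Driver Generation
Now that the bitfile for the network has been synthesized, we generate PYNQ Python code that acts as the driver for this bitfile, package everything into a deployment folder, and copy it to the PYNQ board.

The generated driver is placed in the folder indicated by the pynq_driver_dir top-level metadata. The generated PYNQ Python driver code can be inspected as follows.

The actual driver code is in /tmp/finn_dev_masaaki/pynq_deployment_0fgmr9qo, which contains driver.py, a finn directory, input.npy, resizer.bit, and resizer.hwh.

Here is driver.py.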

import argparse

from pynq import Overlay
import numpy as np
from pynq import allocate
import time
from finn.util.data_packing import (
    finnpy_to_packed_bytearray,
    packed_bytearray_to_finnpy
)
from finn.core.datatype import DataType

class FINNAccelDriver():
    def __init__(self, N, bitfile):
        """Instantiate the FINN accelerator driver.
        Gets batchsize (N) as integer and path to bitfile as string."""
        self.N = N
        # input FINN DataType
        self.idt = DataType.BINARY
        # output FINN DataType
        self.odt = DataType.UINT32
        # input and output shapes
        self.ishape_normal = (N, 784)
        self.oshape_normal = (N, 10)
        self.ishape_folded = (N, 16, 49)
        self.oshape_folded = (N, 1, 10)
        self.ishape_packed = (N, 16, 7)   # datatype np.uint8
        self.oshape_packed = (N, 1, 40)  # datatype np.uint8
        # load bitfile and set up accelerator
        self.ol = Overlay(bitfile)
        self.dma = self.ol.axi_dma_0
        self.ctrl_regs = self.ol.resize_accel_0
        # neuron folding factor of output = iterations per sample
        self.itersPerSample = self.oshape_packed[-2]
        # AXI lite register offset for number of iterations
        # used by TLastMarker to signal end of transmission for AXI CDMA
        self.REG_OFFSET_NUM_ITERS = 0x10
        # set up TLastMarker with correct num. samples
        self.ctrl_regs.write(self.REG_OFFSET_NUM_ITERS, self.N*self.itersPerSample)

        # allocate a PYNQ buffer for the packed input and buffer
        self.ibuf_packed_device = allocate(shape=self.ishape_packed, dtype=np.uint8)
        self.obuf_packed_device = allocate(shape=self.oshape_packed, dtype=np.uint8)

    def fold_input(self, ibuf_normal):
        """Reshapes input in desired shape.
        Gets input data (ibuf_normal), checks if data is in expected normal shape.
        Returns folded input."""
        # ensure that shape is as expected
        assert ibuf_normal.shape == self.ishape_normal
        # convert to folded form
        ibuf_folded = ibuf_normal.reshape(self.ishape_folded)
        return ibuf_folded

    def pack_input(self, ibuf_folded):
        """Packs folded input and reverses both SIMD dim and endianness.
        Gets input data in folded shape and returns packed input data."""
        ibuf_packed = finnpy_to_packed_bytearray(
            ibuf_folded, self.idt, reverse_endian=True, reverse_inner=True
        )
        return ibuf_packed

    def unpack_output(self, obuf_packed):
        """Unpacks the packed output buffer from accelerator.
        Gets packed output and returns output data in folded shape."""
        obuf_folded = packed_bytearray_to_finnpy(
            obuf_packed, self.odt, self.oshape_folded, reverse_endian=True, reverse_inner=True
        )
        return obuf_folded

    def unfold_output(self, obuf_folded):
        """Unfolds output data to normal shape.
        Gets folded output data and returns output data in normal shape."""
        obuf_normal = obuf_folded.reshape(self.oshape_normal)
        return obuf_normal

    def copy_input_data_to_device(self, data):
        """Copies given input data to PYNQ buffer."""
        np.copyto(self.ibuf_packed_device, data)

    def execute(self):
        """Executes accelerator by setting up the DMA and
        waiting until all transfers complete. Uses only member variables and
        returns nothing."""
        dma = self.dma
        dma.sendchannel.transfer(self.ibuf_packed_device)
        dma.recvchannel.transfer(self.obuf_packed_device)
        dma.sendchannel.wait()
        dma.recvchannel.wait()

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Set exec mode, batchsize N, bitfile name, inputfile name and outputfile name')
    parser.add_argument('--exec_mode', help='Please select functional verification ("execute") or throughput test ("throughput_test")', default="execute")
    parser.add_argument('--batchsize', help='number of samples for inference', type=int, default=1)
    parser.add_argument('--bitfile', help='name of bitfile (i.e. "resizer.bit")', default="resizer.bit")
    parser.add_argument('--inputfile', help='name of input npy file (i.e. "input.npy")', default="input.npy")
    parser.add_argument('--outputfile', help='name of output npy file (i.e. "output.npy")', default="output.npy")
    # parse arguments
    args = parser.parse_args()
    exec_mode = args.exec_mode
    N = args.batchsize
    bitfile = args.bitfile
    inputfile = args.inputfile
    outputfile = args.outputfile

    # instantiate FINN accelerator driver and pass batchsize and bitfile
    finnDriver = FINNAccelDriver(N, bitfile)

    # for the remote execution the data from the input npy file has to be loaded,
    # packed and copied to the PYNQ buffer
    if exec_mode == "execute":
        # load desired input .npy file
        ibuf_normal = np.load(inputfile)
        ibuf_folded = finnDriver.fold_input(ibuf_normal)
        ibuf_packed = finnDriver.pack_input(ibuf_folded)
        finnDriver.copy_input_data_to_device(ibuf_packed)
    elif exec_mode != "throughput_test":
        raise Exception("Exec mode has to be set to remote_pynq or throughput_test")

    # for the throughput test the runtime of the network has to be measured
    if exec_mode == "throughput_test":
        # measure runtime of network
        start = time.time()
        # dictionary for results of throughput test
        res={}

    # execute accelerator
    finnDriver.execute()

    # measure run time and fill dictionary with results of the throughput test
    if exec_mode == "throughput_test":
        end = time.time()
        runtime = end - start
        res["runtime[ms]"] = runtime*1000
        res["throughput[images/s]"] = N / runtime
        res["DRAM_in_bandwidth[Mb/s]"] = np.prod(finnDriver.ishape_packed)*0.000001 / runtime
        res["DRAM_out_bandwidth[Mb/s]"] = np.prod(finnDriver.oshape_packed)*0.000001 / runtime
        file = open("nw_metrics.txt", "w")
        file.write(str(res))
        file.close()

    # if execution is selected unpack, unfold and save output to output npy file
    else:
        obuf_folded = finnDriver.unpack_output(finnDriver.obuf_packed_device)
        obuf_normal = finnDriver.unfold_output(obuf_folded)
        np.save(outputfile, obuf_normal)

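We can see that the generated driver defines a class that wraps the FINN accelerator. The constructor takes the batch size (N) as an integer and the bitfile as a string. It also holds the expected input/output shapes and handles instantiation of the accelerator by loading the bitfile and setting up the DMA and the buffers. Several member functions handle folding and packing of the data. The function copy_input_data_to_device copies the input data into the PYNQ buffer, and execute sets up the DMA channels and waits until all transfers are complete. The class is used in the main function, after the arguments passed to the script have been parsed. The driver can be used in two modes, "execute" and "throughput_test". By default, all arguments are set for "execute" mode: the batch size is 1 and the file names are set to the names used by the FINN transformations.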
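
The shape constants in the constructor are related by simple arithmetic. The following is a minimal sketch and not part of the generated driver: the helper names folded_shape and packed_shape are hypothetical, and the fold factor of 16 is simply read off ishape_folded above. It shows how 784 binary inputs become 16 words of 49 bits, each padded to 7 bytes, and how 10 UINT32 outputs occupy 40 bytes.

```python
def folded_shape(n, elems, folds):
    """Split `elems` values per sample into `folds` equal-sized words."""
    if elems % folds != 0:
        raise ValueError("elems must be divisible by folds")
    return (n, folds, elems // folds)


def packed_shape(folded, bits_per_elem):
    """Replace the innermost dimension by the number of bytes needed to hold it."""
    *outer, inner = folded
    total_bits = inner * bits_per_elem
    return (*outer, (total_bits + 7) // 8)  # round up to whole bytes


N = 1
ishape_folded = folded_shape(N, 784, 16)         # BINARY input: 1 bit per element
oshape_folded = folded_shape(N, 10, 1)           # UINT32 output: 32 bits per element
ishape_packed = packed_shape(ishape_folded, 1)
oshape_packed = packed_shape(oshape_folded, 32)
print(ishape_folded, ishape_packed)
print(oshape_folded, oshape_packed)
```

With N = 1 the results agree with the ishape_* and oshape_* values hard-coded in driver.py.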

In "execute" mode, the driver works as follows.

1. The data is loaded from "inputfile"
2. The data is folded with fold_input
3. The data is packed with pack_input
4. The data is copied to the device with copy_input_data_to_device
5. The accelerator is run with the execute method of FINNAccelDriver
6. The data is unpacked with unpack_output
7. The data is unfolded with unfold_output
8. The data is saved to "outputfile"
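
To get a feel for the fold and pack steps without a board, here is a small pure-Python round trip. It is only a sketch under stated assumptions: fold, pack_bits, and unpack_bits are hypothetical stand-ins, and the MSB-first bit order is an arbitrary choice for illustration that is not claimed to match finnpy_to_packed_bytearray with reverse_endian and reverse_inner.

```python
def fold(flat, rows):
    """Split a flat list into `rows` rows of equal length."""
    if len(flat) % rows != 0:
        raise ValueError("length must be divisible by rows")
    width = len(flat) // rows
    return [flat[i * width:(i + 1) * width] for i in range(rows)]


def pack_bits(bits):
    """Pack 0/1 values into bytes, zero-padding at the front to a whole byte.
    Bit order is MSB-first (illustrative assumption)."""
    padded_len = (len(bits) + 7) // 8 * 8
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value.to_bytes(padded_len // 8, "big")


def unpack_bits(data, nbits):
    """Inverse of pack_bits: recover the last `nbits` bits, MSB-first."""
    value = int.from_bytes(data, "big")
    return [(value >> (nbits - 1 - i)) & 1 for i in range(nbits)]


sample = [i % 2 for i in range(784)]            # made-up binary input
folded = fold(sample, 16)                       # 16 rows of 49 bits
packed = [pack_bits(row) for row in folded]     # 16 rows of 7 bytes
restored = [bit for p in packed for bit in unpack_bits(p, 49)]
print(len(folded), len(folded[0]), len(packed[0]), restored == sample)
```

The point is only that packing pads each 49-bit word to 7 bytes and that unpacking with the known bit width recovers the original data.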

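When "throughput_test" is selected as exec_mode, no real data needs to be loaded. The batch size N should be set to a high value (e.g. 1000), and the time measurement is implemented in Python. An empty dictionary (res) is created, and after the accelerator has run it is filled with metrics derived from the measured runtime and saved to a .txt file (nw_metrics.txt).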
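
The metrics can be reproduced from a measured runtime alone. The sketch below mirrors the formulas in driver.py; the function name throughput_metrics and the runtime of 0.5 s are made up for illustration. Because the packed buffers are uint8, the product of a packed shape is a byte count, so the bandwidth figures are, strictly speaking, megabytes per second even though the driver labels the keys "Mb/s". The sketch uses "MB/s" for that reason.

```python
from math import prod


def throughput_metrics(n, runtime_s, ishape_packed, oshape_packed):
    """Derive the throughput-test metrics from a measured runtime in seconds."""
    return {
        "runtime[ms]": runtime_s * 1000,
        "throughput[images/s]": n / runtime_s,
        # packed shapes count uint8 elements, i.e. bytes
        "DRAM_in_bandwidth[MB/s]": prod(ishape_packed) * 1e-6 / runtime_s,
        "DRAM_out_bandwidth[MB/s]": prod(oshape_packed) * 1e-6 / runtime_s,
    }


N = 1000
metrics = throughput_metrics(N, 0.5, (N, 16, 7), (N, 1, 40))  # 0.5 s is a placeholder
for key, value in metrics.items():
    print(key, value)
```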

From here you can either modify the driver and build your own application around the accelerator, or use the remote execution capability that FINN provides to check that it works.

June 8, 2020

Trying out finn 13 (tfc_end2end_example.ipynb part 8)

Continued from "Trying out finn 12 (tfc_end2end_example.ipynb part 7)".

Last time, the Vivado project was not generated in "Inserting the IP into a PYNQ Overlay Shell" under "4. PYNQ hardware generation and deployment" of tfc_end2end_example.ipynb from end2end_example. Troubleshooting showed that Vivado 2019.2 does not work and that 2019.1 is required. This time I will redo tfc_end2end_example.ipynb from the start with Vivado 2019.1.

As before, I will study the notebook by translating its figures and text and quoting its code.

First, the Docker image named finn_dev_masaaki was taking up about 8.4 GB, so I deleted it.
docker rmi <image ID>

I also deleted the finn directory and ran git clone again.
rm -rf finn
git clone https://github.com/Xilinx/finn.git

I set the VIVADO_PATH environment variable to the Vivado 2019.1 installation directory.
export VIVADO_PATH=/media/masaaki/Ubuntu_Disk/tools/Xilinx/Vivado/2019.1
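
Since the wrong Vivado version cost several attempts, a small guard before starting Docker could catch it early. This is only a sketch: vivado_version and check_vivado_path are hypothetical helpers, and they assume the default install layout in which the last path component is the version number.

```python
import os

REQUIRED_VIVADO = "2019.1"  # the version this flow turned out to need


def vivado_version(vivado_path):
    """Return the last path component, assumed to be the Vivado version."""
    return os.path.basename(os.path.normpath(vivado_path))


def check_vivado_path(environ):
    """Raise RuntimeError if VIVADO_PATH is unset or points at another version."""
    path = environ.get("VIVADO_PATH")
    if not path:
        raise RuntimeError("VIVADO_PATH is not set")
    version = vivado_version(path)
    if version != REQUIRED_VIVADO:
        raise RuntimeError(f"expected Vivado {REQUIRED_VIVADO}, found {version}")
    return version


# Example with an explicit mapping; pass os.environ to check the real environment.
example_env = {"VIVADO_PATH": "/media/masaaki/Ubuntu_Disk/tools/Xilinx/Vivado/2019.1"}
print(check_vivado_path(example_env))
```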

I ran Docker to install finn.
sh run-docker.sh

After the installation finished, I exited and then started Jupyter Notebook.
sh run-docker.sh notebook

I opened tfc_end2end_example.ipynb from end2end_example, ran it from the beginning up to "Inserting the IP into a PYNQ Overlay Shell", and, "what do you know!" (imitating the TV show 大改造!!劇的ビフォーアフター), the Vivado project had been created. So Vivado 2019.1 really is required.

I copied the /tmp/finn_dev_masaaki directory out of the Docker container.
docker cp 4cd592793b01:/tmp/finn_dev_masaaki .

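I opened the Vivado 2019.1 project in finn/finn_dev_masaaki/vivado_pynq_proj_4vynls70.

I opened the block design.
resize_accel_0 is the FINN part. There are also a DMA engine, data width converters, and so on. Run Block Automation is shown because the ZYNQ7 Processing System has not been configured. Will it be configured later?

The Address Editor view is shown here.

Synthesis, Place and Route
We are now ready for the final hardware generation step: synthesis, placement, and routing to generate the FPGA bitfile. This can be done either by running the synth_project.sh script in the generated Vivado PYNQ project directory inside Docker, or by running the SynthPYNQProject transformation. This step launches Vivado for synthesis and can take several hours.
I saved the model as tfc_w1_a1_post_synthesis.onnx.

I deleted the finn_dev_masaaki directory and copied /tmp/finn_dev_masaaki out of the Docker container again.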
docker cp 4cd592793b01:/tmp/finn_dev_masaaki .

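I started Vivado 2019.1 and opened the Vivado 2019.1 project in finn/finn_dev_masaaki/vivado_pynq_proj_4vynls70.

The Project Summary shows that the bitstream was generated successfully.

Looking at the block design, Run Block Automation is still shown. It seems bitstream generation completed without the ZYNQ7 Processing System ever being configured. Is that all right? Can it be configured afterwards? It is a mystery. If that works, it should be possible to compile generically and later reuse the result by adding only the ZYNQ7 Processing System settings.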

June 7, 2020

Trying out finn 12 (tfc_end2end_example.ipynb part 7)

Continued from "Trying out finn 11 (tfc_end2end_example.ipynb part 6)".

Last time I did IP Stitching in "3. Vivado HLS and IPI" of tfc_end2end_example.ipynb from end2end_example and checked the Vivado project. This time I will start from "4. PYNQ hardware generation and deployment".

As before, I will study the notebook by translating its figures and text and quoting its code.

4. PYNQ hardware generation and deployment

・ Inserting the IP into a PYNQ Overlay Shell
・ Synthesis, Place and Route
・ Driver Generation
・ Deployment and Remote Execution
・ Throughput Test on PYNQ Board

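The hardware design is almost ready. Next, we put it into a form suitable for use as a PYNQ overlay, then synthesize and deploy it.

Inserting the IP into a PYNQ Overlay Shell
To deploy the accelerator on a PYNQ platform, it has to be placed inside an appropriate shell that bridges it to the interfaces the underlying system exposes. FINN makes it easy to create a PYNQ-compatible overlay: the MakePYNQProject transformation inserts the stitched IP into an appropriate PYNQ shell, and the created PYNQ shell project directory can be looked up through metadata_props. This launches Vivado and may take a few minutes.

The Vivado project has not been generated? Perhaps make_project.sh was not executed.
I tried once more with the same result. I then tried again from the very beginning, and it still failed.

For reference, here is the output recorded in the tfc_end2end_example.ipynb notebook from end2end_example.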

ip_config.tcl resizer.cache resizer.ip_user_files resizer.xpr
make_project.sh resizer.hw resizer.srcs synth_project.sh

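There, a Vivado project directory is generated as shown.

Looking at /tmp/finn_dev_masaaki in Docker, vivado_pynq_proj_s42jfdkl has been added.

The notebook says that if you open the created Vivado project (.xpr) under the vivado_pynq_proj directory above, the system-level block design is displayed with the FINN-generated part of the design highlighted, and that various other components such as the DMA engine and data width converters are also instantiated. So let's look at the Vivado project.

I deleted the /tmp/finn_dev_masaaki directory that I had copied from the Docker container last time.
Then I copied /tmp/finn_dev_masaaki from the Docker container again.
docker cp 7b13e06d4195:/tmp/finn_dev_masaaki .

The copied finn_dev_masaaki directory contains vivado_pynq_proj_s42jfdkl, and as before it holds only three Tcl scripts.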
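
Comparing a project directory against the file list from the notebook can be automated. The following is a rough sketch: missing_outputs is a hypothetical helper, the expected names are taken from the notebook listing above, and the temporary directory with three script files only imitates the failing case (which three files were actually present is my assumption).

```python
import os
import tempfile

# Names from the notebook's directory listing
EXPECTED = [
    "ip_config.tcl", "make_project.sh", "synth_project.sh",
    "resizer.cache", "resizer.hw", "resizer.ip_user_files",
    "resizer.srcs", "resizer.xpr",
]


def missing_outputs(proj_dir, expected=EXPECTED):
    """Return the expected entries that do not exist in proj_dir."""
    return [name for name in expected
            if not os.path.exists(os.path.join(proj_dir, name))]


# Imitate a directory in which make_project.sh never produced a project.
with tempfile.TemporaryDirectory() as proj_dir:
    for script in ("ip_config.tcl", "make_project.sh", "synth_project.sh"):
        open(os.path.join(proj_dir, script), "w").close()
    missing = missing_outputs(proj_dir)

print(missing)
```

If resizer.xpr is in the returned list, Vivado did not create the project.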

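This seemed wrong, so I searched and found a post in the FINN Community.
According to it, Vivado 2019.2 does not work and 2019.1 is required. I am currently using Vivado 2019.2, so it looks like I need to switch to Vivado 2019.1.

Let's start over from the beginning with Vivado 2019.1.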

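June 6, 2020

Trying out finn 11 (tfc_end2end_example.ipynb part 6)

Continued from "Trying out finn 10 (tfc_end2end_example.ipynb part 5)".

Last time I ran tfc_end2end_example.ipynb from end2end_example up to "Synthesizing HLS to IP Blocks" in "3. Vivado HLS and IPI", entered the Docker container, and looked at the Vivado HLS projects. This time I will continue with "IP Stitching" in "3. Vivado HLS and IPI".

As before, I will study the notebook by translating its figures and text and quoting its code.

IP Stitching
We now have an IP block for each layer, and we use the CreateStitchedIP transformation to combine them into a larger IP that implements the whole network. Note that this transformation can only be applied to a graph that contains only HLS nodes on which the HLSSynthIP transformation has already been run, which is the last step we performed. Before invoking IP stitching, we use the ReplaceVerilogRelPaths transformation to convert the relative $readmemh paths in the generated IP blocks into absolute paths, which prevents errors later. This step invokes Vivado and may take a few minutes.

If you examine the nodes of the transformed model you will not see a difference, because IP stitching adds model-level metadata to the graph. This metadata can be accessed through .model.metadata_props of the ModelWrapper, through the get_metadata_prop function, or by clicking on the global input/output tensors in Netron.

The Vivado project and related files are in /tmp/finn_dev_masaaki inside the Docker container. I should be able to view them in Vivado, but the Vivado GUI did not run in Docker, so I will copy them out of the container later and try then.

The model is saved as tfc_w1_a1_ipstitch.onnx.

Now let's save the /tmp/finn_dev_masaaki directory from the Docker container locally and look at the Vivado project.
Following the article "Dockerコンテナからホストへファイルをコピーする" (copying files from a Docker container to the host), let's copy /tmp/finn_dev_masaaki out of the Docker container.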
docker ps
docker cp 7b13e06d4195:/tmp/finn_dev_masaaki .

The finn_dev_masaaki directory was copied.

The finn_dev_masaaki/vivado_stitch_proj_su3iq9q8 directory is the Vivado project directory. I opened finn_vivado_stitch_proj.xpr in Vivado 2019.2.

Opening the block design, I found the network from tfc_w1_a1_ipgen.onnx represented in the block design.

June 5, 2020

Trying out finn 10 (tfc_end2end_example.ipynb part 5)

Continued from "Trying out finn 9 (tfc_end2end_example.ipynb part 4)".

Last time I ran tfc_end2end_example.ipynb from end2end_example up to "Synthesizing HLS to IP Blocks" in "3. Vivado HLS and IPI", entered the Docker container, and started peeking into the Vivado HLS projects. This time I will continue looking at the Vivado HLS projects.

We start from inside code_gen_ipgen_StreamingFCLayer_Batch_0_av35euos.
I moved to the sol1 directory and looked at the contents of the syn, syn/report, and impl/ip directories.
In impl/ip there was xilinx_com_hls_StreamingFCLayer_Batch_0_1_0.zip, the IP in compressed form.

Here is StreamingFCLayer_Batch_0_csynth.rpt from syn/report.

================================================================
== Vivado HLS Report for 'StreamingFCLayer_Batch_0'
================================================================
* Date:           Tue Jun  2 19:58:23 2020

* Version:        2019.2 (Build 2698951 on Thu Oct 24 19:15:34 MDT 2019)
* Project:        project_StreamingFCLayer_Batch_0
* Solution:       sol1
* Product family: zynq
* Target device:  xc7z020-clg400-1

================================================================
== Performance Estimates
================================================================
+ Timing:
    * Summary:
    +--------+----------+----------+------------+
    |  Clock |  Target  | Estimated| Uncertainty|
    +--------+----------+----------+------------+
    |ap_clk  | 10.00 ns | 8.488 ns |   1.25 ns  |
    +--------+----------+----------+------------+

+ Latency:
    * Summary:
    +---------+---------+----------+----------+-----+-----+---------+
    |  Latency (cycles) |  Latency (absolute) |  Interval | Pipeline|
    |   min   |   max   |    min   |    max   | min | max |   Type  |
    +---------+---------+----------+----------+-----+-----+---------+
    |       70|       70| 0.700 us | 0.700 us |   70|   70|   none  |
    +---------+---------+----------+----------+-----+-----+---------+

    + Detail:
        * Instance:
        +--------------------------------+----------------------+---------+---------+----------+----------+-----+-----+---------+
        |                                |                      |  Latency (cycles) |  Latency (absolute) |  Interval | Pipeline|
        |            Instance            |        Module        |   min   |   max   |    min   |    max   | min | max |   Type  |
        +--------------------------------+----------------------+---------+---------+----------+----------+-----+-----+---------+
        |grp_Matrix_Vector_Activa_fu_28  |Matrix_Vector_Activa  |       67|       67| 0.670 us | 0.670 us |   67|   67|   none  |
        +--------------------------------+----------------------+---------+---------+----------+----------+-----+-----+---------+

        * Loop:
        N/A

================================================================
== Utilization Estimates
================================================================
* Summary:
+-----------------+---------+-------+--------+-------+-----+
|       Name      | BRAM_18K| DSP48E|   FF   |  LUT  | URAM|
+-----------------+---------+-------+--------+-------+-----+
|DSP              |        -|      -|       -|      -|    -|
|Expression       |        -|      -|       0|      2|    -|
|FIFO             |        -|      -|       -|      -|    -|
|Instance         |        -|      -|    2530|  25464|    -|
|Memory           |        -|      -|       -|      -|    -|
|Multiplexer      |        -|      -|       -|     45|    -|
|Register         |        -|      -|       5|      -|    -|
+-----------------+---------+-------+--------+-------+-----+
|Total            |        0|      0|    2535|  25511|    0|
+-----------------+---------+-------+--------+-------+-----+
|Available        |      280|    220|  106400|  53200|    0|
+-----------------+---------+-------+--------+-------+-----+
|Utilization (%)  |        0|      0|       2|     47|    0|
+-----------------+---------+-------+--------+-------+-----+

+ Detail:
    * Instance:
    +--------------------------------+----------------------+---------+-------+------+-------+-----+
    |            Instance            |        Module        | BRAM_18K| DSP48E|  FF  |  LUT  | URAM|
    +--------------------------------+----------------------+---------+-------+------+-------+-----+
    |grp_Matrix_Vector_Activa_fu_28  |Matrix_Vector_Activa  |        0|      0|  2530|  25464|    0|
    +--------------------------------+----------------------+---------+-------+------+-------+-----+
    |Total                           |                      |        0|      0|  2530|  25464|    0|
    +--------------------------------+----------------------+---------+-------+------+-------+-----+

    * DSP48E:
    N/A

    * Memory:
    N/A

    * FIFO:
    N/A

    * Expression:
    +-----------------------------------------------+----------+-------+---+----+------------+------------+
    |                 Variable Name                 | Operation| DSP48E| FF| LUT| Bitwidth P0| Bitwidth P1|
    +-----------------------------------------------+----------+-------+---+----+------------+------------+
    |grp_Matrix_Vector_Activa_fu_28_out_V_V_TREADY  |    and   |      0|  0|   2|           1|           1|
    +-----------------------------------------------+----------+-------+---+----+------------+------------+
    |Total                                          |          |      0|  0|   2|           1|           1|
    +-----------------------------------------------+----------+-------+---+----+------------+------------+

    * Multiplexer:
    +------------------------+----+-----------+-----+-----------+
    |          Name          | LUT| Input Size| Bits| Total Bits|
    +------------------------+----+-----------+-----+-----------+
    |ap_NS_fsm               |  27|          5|    1|          5|
    |in0_V_V_TREADY_int      |   9|          2|    1|          2|
    |weights_V_V_TREADY_int  |   9|          2|    1|          2|
    +------------------------+----+-----------+-----+-----------+
    |Total                   |  45|          9|    3|          9|
    +------------------------+----+-----------+-----+-----------+

    * Register:
    +---------------------------------------------+---+----+-----+-----------+
    |                     Name                    | FF| LUT| Bits| Const Bits|
    +---------------------------------------------+---+----+-----+-----------+
    |ap_CS_fsm                                    |  4|   0|    4|          0|
    |grp_Matrix_Vector_Activa_fu_28_ap_start_reg  |  1|   0|    1|          0|
    +---------------------------------------------+---+----+-----+-----------+
    |Total                                        |  5|   0|    5|          0|
    +---------------------------------------------+---+----+-----+-----------+

================================================================
== Interface
================================================================
* Summary:
+--------------------+-----+-----+--------------+--------------------------+--------------+
|      RTL Ports     | Dir | Bits|   Protocol   |       Source Object      |    C Type    |
+--------------------+-----+-----+--------------+--------------------------+--------------+
|ap_clk              |  in |    1| ap_ctrl_none | StreamingFCLayer_Batch_0 | return value |
|ap_rst_n            |  in |    1| ap_ctrl_none | StreamingFCLayer_Batch_0 | return value |
|in0_V_V_TDATA       |  in |   56|     axis     |          in0_V_V         |    pointer   |
|in0_V_V_TVALID      |  in |    1|     axis     |          in0_V_V         |    pointer   |
|in0_V_V_TREADY      | out |    1|     axis     |          in0_V_V         |    pointer   |
|weights_V_V_TDATA   |  in |  784|     axis     |        weights_V_V       |    pointer   |
|weights_V_V_TVALID  |  in |    1|     axis     |        weights_V_V       |    pointer   |
|weights_V_V_TREADY  | out |    1|     axis     |        weights_V_V       |    pointer   |
|out_V_V_TDATA       | out |   16|     axis     |          out_V_V         |    pointer   |
|out_V_V_TVALID      | out |    1|     axis     |          out_V_V         |    pointer   |
|out_V_V_TREADY      |  in |    1|     axis     |          out_V_V         |    pointer   |
+--------------------+-----+-----+--------------+--------------------------+--------------+
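
The headline numbers in this report can be checked by hand. The sketch below uses two hypothetical helpers: latency_us converts cycles to absolute time at the 10 ns target clock, and utilization_percent computes a whole-number percentage. Truncating rather than rounding is my reading of the report, since 25511 / 53200 is about 47.95% and the report prints 47.

```python
def latency_us(cycles, clock_ns):
    """Absolute latency in microseconds for a given cycle count and clock period."""
    return cycles * clock_ns / 1000.0


def utilization_percent(used, available):
    """Whole-number utilization, truncated toward zero."""
    return used * 100 // available


print(latency_us(70, 10.0))                 # top-level latency
print(latency_us(67, 10.0))                 # Matrix_Vector_Activa instance
print(utilization_percent(2535, 106400))    # FF
print(utilization_percent(25511, 53200))    # LUT
```

Nearly half of the xc7z020's LUTs are estimated for this single layer, while it uses no BRAM or DSP at all.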

Next, here is the function definition from top_StreamingFCLayer_Batch_0.cpp, the source code in /tmp/finn_dev_masaaki/code_gen_ipgen_StreamingFCLayer_Batch_0_av35euos.

void StreamingFCLayer_Batch_0(
                    hls::stream<ap_uint<49>> &in0,
                    hls::stream<ap_uint<784>> &weights,
                    hls::stream<ap_uint<16>> &out
                    )

Here is the VHDL from the HDL synthesized from this.
I quote only the entity declaration of StreamingFCLayer_Batch_0_StreamingFCLayer_Batch_0.vhd in the /tmp/finn_dev_masaaki/code_gen_ipgen_StreamingFCLayer_Batch_0_av35euos/project_StreamingFCLayer_Batch_0/sol1/syn/vhdl directory.

entity StreamingFCLayer_Batch_0_StreamingFCLayer_Batch_0 is
port (
    ap_clk : IN STD_LOGIC;
    ap_rst_n : IN STD_LOGIC;
    in0_V_V_TDATA : IN STD_LOGIC_VECTOR (55 downto 0);
    in0_V_V_TVALID : IN STD_LOGIC;
    in0_V_V_TREADY : OUT STD_LOGIC;
    weights_V_V_TDATA : IN STD_LOGIC_VECTOR (783 downto 0);
    weights_V_V_TVALID : IN STD_LOGIC;
    weights_V_V_TREADY : OUT STD_LOGIC;
    out_V_V_TDATA : OUT STD_LOGIC_VECTOR (15 downto 0);
    out_V_V_TVALID : OUT STD_LOGIC;
    out_V_V_TREADY : IN STD_LOGIC );
end;
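
Note that the C++ declaration uses ap_uint<49> for in0, while the synthesized in0_V_V_TDATA port is 56 bits wide. This is consistent with AXI4-Stream TDATA widths being a whole number of bytes, so the payload is padded up to the next multiple of 8. A one-line sketch of that rounding (axis_tdata_bits is a hypothetical name):

```python
def axis_tdata_bits(payload_bits):
    """Round a payload width up to the next multiple of 8 bits."""
    return (payload_bits + 7) // 8 * 8


for width in (49, 784, 16):
    print(width, "->", axis_tdata_bits(width))
```

The widths 784 and 16 are already byte multiples, which matches the unchanged weights_V_V_TDATA and out_V_V_TDATA ports.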

Next, let's look at code_gen_ipgen_StreamingDataWidthConverter_Batch_0_nzm_awqb.

It contains hls_syn_StreamingDataWidthConverter_Batch_0.tcl, the source file top_StreamingDataWidthConverter_Batch_0.cpp, and so on.

I moved to /tmp/finn_dev_masaaki/code_gen_ipgen_StreamingDataWidthConverter_Batch_0_nzm_awqb/project_StreamingDataWidthConverter_Batch_0/sol1 and looked at the syn and syn/report directories.

Let's look at StreamingDataWidthConverter_Batch_0_csynth.rpt.

================================================================
== Vivado HLS Report for 'StreamingDataWidthConverter_Batch_0'
================================================================
* Date:           Tue Jun  2 19:55:07 2020

* Version:        2019.2 (Build 2698951 on Thu Oct 24 19:15:34 MDT 2019)
* Project:        project_StreamingDataWidthConverter_Batch_0
* Solution:       sol1
* Product family: zynq
* Target device:  xc7z020-clg400-1

================================================================
== Performance Estimates
================================================================
+ Timing:
    * Summary:
    +--------+----------+----------+------------+
    |  Clock |  Target  | Estimated| Uncertainty|
    +--------+----------+----------+------------+
    |ap_clk  | 10.00 ns | 5.723 ns |   1.25 ns  |
    +--------+----------+----------+------------+

+ Latency:
    * Summary:
    +---------+---------+-----------+-----------+-----+-----+---------+
    |  Latency (cycles) |   Latency (absolute)  |  Interval | Pipeline|
    |   min   |   max   |    min    |    max    | min | max |   Type  |
    +---------+---------+-----------+-----------+-----+-----+---------+
    |        7|        7| 70.000 ns | 70.000 ns |    7|    7|   none  |
    +---------+---------+-----------+-----------+-----+-----+---------+

    + Detail:
        * Instance:
        +----------------------------------+------------------------+---------+---------+-----------+-----------+-----+-----+---------+
        |                                  |                        |  Latency (cycles) |   Latency (absolute)  |  Interval | Pipeline|
        |             Instance             |         Module         |   min   |   max   |    min    |    max    | min | max |   Type  |
        +----------------------------------+------------------------+---------+---------+-----------+-----------+-----+-----+---------+
        |grp_StreamingDataWidthCo_1_fu_26  |StreamingDataWidthCo_1  |        4|        4| 40.000 ns | 40.000 ns |    4|    4|   none  |
        +----------------------------------+------------------------+---------+---------+-----------+-----------+-----+-----+---------+

        * Loop:
        N/A

================================================================
== Utilization Estimates
================================================================
* Summary:
+-----------------+---------+-------+--------+-------+-----+
|       Name      | BRAM_18K| DSP48E|   FF   |  LUT  | URAM|
+-----------------+---------+-------+--------+-------+-----+
|DSP              |        -|      -|       -|      -|    -|
|Expression       |        -|      -|       0|      2|    -|
|FIFO             |        -|      -|       -|      -|    -|
|Instance         |        -|      -|      65|    241|    -|
|Memory           |        -|      -|       -|      -|    -|
|Multiplexer      |        -|      -|       -|     36|    -|
|Register         |        -|      -|       5|      -|    -|
+-----------------+---------+-------+--------+-------+-----+
|Total            |        0|      0|      70|    279|    0|
+-----------------+---------+-------+--------+-------+-----+
|Available        |      280|    220|  106400|  53200|    0|
+-----------------+---------+-------+--------+-------+-----+
|Utilization (%)  |        0|      0|   ~0   |   ~0  |    0|
+-----------------+---------+-------+--------+-------+-----+

+ Detail:
    * Instance:
    +----------------------------------+------------------------+---------+-------+----+-----+-----+
    |             Instance             |         Module         | BRAM_18K| DSP48E| FF | LUT | URAM|
    +----------------------------------+------------------------+---------+-------+----+-----+-----+
    |grp_StreamingDataWidthCo_1_fu_26  |StreamingDataWidthCo_1  |        0|      0|  65|  241|    0|
    +----------------------------------+------------------------+---------+-------+----+-----+-----+
    |Total                             |                        |        0|      0|  65|  241|    0|
    +----------------------------------+------------------------+---------+-------+----+-----+-----+

    * DSP48E:
    N/A

    * Memory:
    N/A

    * FIFO:
    N/A

    * Expression:
    +-------------------------------------------------+----------+-------+---+----+------------+------------+
    |                  Variable Name                  | Operation| DSP48E| FF| LUT| Bitwidth P0| Bitwidth P1|
    +-------------------------------------------------+----------+-------+---+----+------------+------------+
    |grp_StreamingDataWidthCo_1_fu_26_out_V_V_TREADY  |    and   |      0|  0|   2|           1|           1|
    +-------------------------------------------------+----------+-------+---+----+------------+------------+
    |Total                                            |          |      0|  0|   2|           1|           1|
    +-------------------------------------------------+----------+-------+---+----+------------+------------+

    * Multiplexer:
    +--------------------+----+-----------+-----+-----------+
    |        Name        | LUT| Input Size| Bits| Total Bits|
    +--------------------+----+-----------+-----+-----------+
    |ap_NS_fsm           |  27|          5|    1|          5|
    |in0_V_V_TREADY_int  |   9|          2|    1|          2|
    +--------------------+----+-----------+-----+-----------+
    |Total               |  36|          7|    2|          7|
    +--------------------+----+-----------+-----+-----------+

    * Register:
    +-----------------------------------------------+---+----+-----+-----------+
    |                      Name                     | FF| LUT| Bits| Const Bits|
    +-----------------------------------------------+---+----+-----+-----------+
    |ap_CS_fsm                                      |  4|   0|    4|          0|
    |grp_StreamingDataWidthCo_1_fu_26_ap_start_reg  |  1|   0|    1|          0|
    +-----------------------------------------------+---+----+-----+-----------+
    |Total                                          |  5|   0|    5|          0|
    +-----------------------------------------------+---+----+-----+-----------+

================================================================
== Interface
================================================================
* Summary:
+----------------+-----+-----+--------------+-------------------------------------+--------------+
|    RTL Ports   | Dir | Bits|   Protocol   |            Source Object            |    C Type    |
+----------------+-----+-----+--------------+-------------------------------------+--------------+
|ap_clk          |  in |    1| ap_ctrl_none | StreamingDataWidthConverter_Batch_0 | return value |
|ap_rst_n        |  in |    1| ap_ctrl_none | StreamingDataWidthConverter_Batch_0 | return value |
|in0_V_V_TDATA   |  in |   16|     axis     |               in0_V_V               |    pointer   |
|in0_V_V_TVALID  |  in |    1|     axis     |               in0_V_V               |    pointer   |
|in0_V_V_TREADY  | out |    1|     axis     |               in0_V_V               |    pointer   |
|out_V_V_TDATA   | out |    8|     axis     |               out_V_V               |    pointer   |
|out_V_V_TVALID  | out |    1|     axis     |               out_V_V               |    pointer   |
|out_V_V_TREADY  |  in |    1|     axis     |               out_V_V               |    pointer   |
+----------------+-----+-----+--------------+-------------------------------------+--------------+

project_StreamingDataWidthConverter_Batch_0 takes only 7 clock cycles.
As I recall, this block changes the width of the stream, so let's look at the function declaration in the source file top_StreamingDataWidthConverter_Batch_0.cpp.

void StreamingDataWidthConverter_Batch_0(hls::stream<ap_uint<16>> &in0, hls::stream<ap_uint<8>> &out)

It appears to convert from a 16-bit width to an 8-bit width.

Let's look at the entity declaration in the synthesized VHDL file StreamingDataWidthConverter_Batch_0_StreamingDataWidthConverter_Batch_0.vhd.

entity StreamingDataWidthConverter_Batch_0_StreamingDataWidthConverter_Batch_0 is
port (
    ap_clk : IN STD_LOGIC;
    ap_rst_n : IN STD_LOGIC;
    in0_V_V_TDATA : IN STD_LOGIC_VECTOR (15 downto 0);
    in0_V_V_TVALID : IN STD_LOGIC;
    in0_V_V_TREADY : OUT STD_LOGIC;
    out_V_V_TDATA : OUT STD_LOGIC_VECTOR (7 downto 0);
    out_V_V_TVALID : OUT STD_LOGIC;
    out_V_V_TREADY : IN STD_LOGIC );
end;

in0_V_V_TDATA is 16 bits wide and out_V_V_TDATA is 8 bits wide.
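
What such a down-conversion does to the data can be sketched in a few lines. This is an illustration only: split_words is a hypothetical function, and emitting the least significant byte first is an assumption that I have not checked against the HLS source.

```python
def split_words(words, in_bits, out_bits):
    """Split each in_bits-wide word into out_bits-wide words, low part first."""
    if in_bits % out_bits != 0:
        raise ValueError("in_bits must be a multiple of out_bits")
    mask = (1 << out_bits) - 1
    out = []
    for word in words:
        for i in range(in_bits // out_bits):
            out.append((word >> (i * out_bits)) & mask)
    return out


narrow = split_words([0xABCD, 0x0102], 16, 8)
print([hex(b) for b in narrow])
```

Each 16-bit input word produces two 8-bit output words, so the output stream carries twice as many transfers as the input stream.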

In the /tmp/finn_dev_masaaki/code_gen_ipgen_StreamingDataWidthConverter_Batch_0_nzm_awqb/project_StreamingDataWidthConverter_Batch_0/sol1/impl/ip directory, the ZIP-compressed IP file xilinx_com_hls_StreamingDataWidthConverter_Batch_0_1_0.zip had been generated.

FPGAの部屋

Category: finn
