Tensor Cores

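Tensor Cores are mixed-precision matrix multiply-accumulate accelerators developed by NVIDIA^[1]. They first shipped in 2017 in the data-center GPU Tesla V100 (Volta generation). In the GeForce and Quadro lines, they first appeared in the RTX series (Turing generation).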

Overview

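GPUs are processing units built for graphics, with high parallel computing capability. That parallelism attracted attention, and GPUs are now used as parallel computers in many fields, including HPC (GPGPU). As applications broadened, it became clear that some workloads, deep learning chief among them, need high throughput more than high numerical precision. Tensor Cores are the matrix multiply-accumulate accelerators that NVIDIA, the maker of the GeForce series and many other GPUs, developed to meet this demand^[1]^[2].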

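Tensor Cores are specialized for mixed-precision fused multiply-add (FMA) on matrices: a single instruction performs a low-precision matrix product plus a high-precision accumulator addition. For example, when multiply-accumulating FP32 matrices $A, B, C$, the product $BC$ is computed in FP16 and accumulated into the FP32 matrix $A$, giving $A + BC$.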

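Because a dedicated circuit (the Tensor Core) performs this operation, a single instruction processes a large number of arithmetic operations. The matrix product is low-precision and therefore cheap to compute, while the addition is high-precision and fused, so the accumulation step adds little further rounding error. For example, on the NVIDIA A100 GPU, plain FP32 peaks at 19.5 TFLOPS, whereas FP16 products on Tensor Cores reach 312 TFLOPS, which is 16 times the throughput. In this way Tensor Cores act as accelerators for mixed-precision matrix FMA.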
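The mixed-precision FMA described above can be emulated in software to see where the rounding happens. The following is a minimal sketch in pure Python, not a description of how the hardware is programmed: the helper names `to_fp16`, `to_fp32`, and `mixed_precision_fma` are hypothetical, and rounding the accumulator to FP32 after every addition is an assumption about ordering that real Tensor Cores need not follow.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 binary16 value (returned as a Python float)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 binary32 value (returned as a Python float)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mixed_precision_fma(a, b, c):
    """Return A + B @ C with B and C rounded to FP16 and an FP32-rounded accumulator.

    a, b, c are lists of lists of floats with shapes (m x n), (m x k), (k x n).
    """
    rows, inner, cols = len(b), len(c), len(c[0])
    result = [[to_fp32(a[i][j]) for j in range(cols)] for i in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = result[i][j]
            for k in range(inner):
                # Low-precision multiply: both factors are FP16 values.
                # Their product is exact in Python's double-precision float.
                prod = to_fp16(b[i][k]) * to_fp16(c[k][j])
                # High-precision accumulate: keep the running sum in FP32.
                # (Assumption: round after each addition.)
                acc = to_fp32(acc + prod)
            result[i][j] = acc
    return result


if __name__ == "__main__":
    a = [[1.0]]
    b = [[0.1, 0.2]]
    c = [[3.0], [4.0]]
    # Close to, but not exactly, [[2.1]] because 0.1 and 0.2 are rounded to FP16.
    print(mixed_precision_fma(a, b, c))
```

In this example the deviation from 2.1 comes entirely from rounding the inputs to FP16; the FP32 accumulation itself introduces no further error here.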

Support

Tensor Cores come in generations, and each generation differs in speed and in the precisions it supports.

Table. Tensor Core generations and supported precisions
Generation (architecture)	Multiply precision								Accumulate precision
Generation (architecture)	FP64	TF32	FP16	BF16	FP8	INT8	INT4	INT1	FP64	FP32	FP16	INT32
1 (Volta) ^[3]	-	-	✔	-	-	-	-	-	-	✔	✔	-
2 (Turing) ^[4]	-	-	✔	-	-	✔	✔	✔	-	✔	✔	✔
3 (Ampere) ^[5]	✔	✔	✔	✔	-	✔	✔	✔	✔	✔	✔	✔
4 (Hopper) ^[6]^[7]	✔	✔	✔	✔	✔	✔	-	-	✔	✔	✔	✔

TensorFloat-32

TensorFloat-32 (TF32) is one of the mixed-precision FMA modes of Tensor Cores^[8]^[9].

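Tensor Cores are accelerators for fast low-precision matrix products and cannot speed up FP32 FMA as is. In TF32 mode, FP32 inputs are cast internally to a 19-bit representation, their matrix product is computed at high speed on the Tensor Cores, and the result is finally added to an FP32 accumulator. In other words, TensorFloat-32 is a fast mode for FP32 FMA that uses reduced precision internally^[10].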

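Low-precision matrix products inevitably incur some numerical error. The 16-bit formats used previously (FP16 and BF16) keep the adverse effect to a minimum through mixed-precision computation, and the automatic mixed precision (AMP) features of frameworks made them usable with minimal code changes. Even so, in some cases the numerical error significantly affects inference accuracy, and some code change, however small, was still required.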

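TF32 uses a 19-bit representation made up of a 1-bit sign, an 8-bit exponent (the same as BF16), and a 10-bit mantissa (the same as FP16), which reduces the precision loss further compared with the 16-bit formats. Because the cast happens inside the Tensor Core and the result is accumulated in FP32, from the outside it looks as if FP32 FMA were executed as is, and no code change is needed. TF32 is therefore a mode that delivers several times the arithmetic throughput with minimal precision loss and no code changes. Because only the precision inside the core changes, it does not reduce data volume (for example, GPU memory usage). The speedup also exists only because Tensor Cores can process the 19-bit representation quickly; TF32 is strictly one of the specialized functions or modes of Tensor Cores^[9].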
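The 1-8-10 bit layout above can be illustrated by taking an FP32 bit pattern and discarding the 13 lowest mantissa bits. The sketch below is pure Python and only an illustration: the function names `to_tf32` and `tf32_fields` are hypothetical, the code handles finite inputs only, and plain truncation is an assumption made for simplicity rather than a statement about how the hardware rounds.

```python
import struct

# FP32 has 23 mantissa bits; TF32 keeps the top 10, so 13 bits are dropped.
DROPPED_BITS = 13
KEEP_MASK = 0xFFFFFFFF & ~((1 << DROPPED_BITS) - 1)  # 0xFFFFE000


def fp32_bits(x: float) -> int:
    """Return the 32-bit pattern of x after rounding it to FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits: int) -> float:
    """Interpret a 32-bit pattern as an FP32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_tf32(x: float) -> float:
    """Truncate x to sign + 8 exponent bits + 10 mantissa bits (finite inputs only)."""
    return bits_to_float(fp32_bits(x) & KEEP_MASK)


def tf32_fields(x: float):
    """Return (sign, biased exponent, 10-bit mantissa) of the truncated value."""
    bits = fp32_bits(x)
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = (bits >> DROPPED_BITS) & 0x3FF
    return sign, exponent, mantissa


if __name__ == "__main__":
    print(to_tf32(1.0 + 2**-10))  # still representable with a 10-bit mantissa
    print(to_tf32(1.0 + 2**-11))  # the extra bit is dropped
    print(to_tf32(1e30))          # the FP32 exponent range is preserved
    print(tf32_fields(0.1))
```

The value 1e30 lies far outside the FP16 range but survives here with roughly three significant decimal digits, which shows the point of pairing the BF16 exponent with the FP16 mantissa.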

Footnotes

[1] "Tensor Cores enable mixed-precision computing, dynamically adapting calculations to accelerate throughput while preserving accuracy." NVIDIA. NVIDIA Tensor Cores. Retrieved 2023-05-16.
[2] "Tensor Cores are specialized high-performance compute cores for matrix math operations that provide groundbreaking performance for AI and HPC applications." NVIDIA. NVIDIA A100 Tensor Core GPU Architecture. Retrieved 2023-05-16.
[3] "FP16 precision introduced on the Volta Tensor Core ... Volta GPU ... Results are accumulated into FP32 for mixed precision training or FP16 for inference." NVIDIA. NVIDIA A100 Tensor Core GPU Architecture. Retrieved 2023-05-16.
[4] "the INT8, INT4 and binary 1-bit precisions added in the Turing Tensor Core" NVIDIA. NVIDIA A100 Tensor Core GPU Architecture. Retrieved 2023-05-16.
[5] "In addition to FP16 precision ... INT8, INT4 and binary 1-bit precisions ... the A100 Tensor Core adds support for TF32, BF16 and FP64 formats" NVIDIA. NVIDIA A100 Tensor Core GPU Architecture. Retrieved 2023-05-16.
[6] "NVIDIA H100 GPU ... New fourth-generation Tensor Cores" NVIDIA. NVIDIA H100 Tensor Core GPU Architecture. Retrieved 2023-05-16.
[7] "FP8, FP16, BF16, TF32, FP64, and INT8 MMA data types are supported." NVIDIA. NVIDIA H100 Tensor Core GPU Architecture. Retrieved 2023-05-16.
[8] "TF32 is a new compute mode added to Tensor Cores in the Ampere generation of GPU architecture." NVIDIA Developer. Accelerating AI Training with NVIDIA TF32 Tensor Cores. Retrieved 2023-06-18.
[9] "TF32 is only exposed as a Tensor Core operation mode, not a type." NVIDIA Developer. Accelerating AI Training with NVIDIA TF32 Tensor Cores. Retrieved 2023-06-18.
[10] "All storage in memory and other operations remain completely in FP32, only convolutions and matrix-multiplications convert their inputs to TF32 right before multiplication." NVIDIA Developer. Accelerating AI Training with NVIDIA TF32 Tensor Cores. Retrieved 2023-06-18.
