Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance

Hiroyuki Ootomo and Rio Yokota

ECP multiprecision meeting, April 21

#### Achievements of this work

Our SGEMM emulation on Tensor Cores outperforms the theoretical peak performance of FP32 SIMT Core while achieving the same level of accuracy.



Paper: https://arxiv.org/abs/2203.03341

#### Related work

- Markidis *et al.* proposed a SGEMM emulation method on Tensor Cores. NVIDIA Tensor Core Programmability, Performance & Precision, https://arxiv.org/abs/1803.04014 However, their method does not fit the accuracy of FP32 SIMT Core.
- Feng *et al.* claimed to fix the accuracy of Markidis' method by recovering the mantissa length kept by the method.
  EGEMM-TC: accelerating scientific computing on tensor cores with extended precision, PPoPP'21
  Is it true? We have also investigated the mantissa length problem and concluded that it is not the problem of Markidis' method.

#### Related work : Markidis' method

Compute a multiplitation of FP32 matrices  $\mathbf{A}_{\text{F32}}$  and  $\mathbf{B}_{\text{F32}}.$ 

Split one FP32 matrix into two FP16 matrices

$$\begin{split} \mathbf{M}_{\mathsf{F16}} &\leftarrow \mathsf{toF16}\left(\mathbf{M}_{\mathsf{F32}}\right) \\ \Delta \mathbf{M}_{\mathsf{F16}} &\leftarrow \mathsf{toF16}\left(\mathbf{M}_{\mathsf{F32}} - \mathsf{toF32}\left(\mathbf{M}_{\mathsf{F16}}\right)\right) \end{split}$$

2 Multiply and accumulate using Tensor Core

$$\begin{split} \mathbf{C} &\leftarrow \left(\mathbf{A}_{\mathsf{F16}} + \Delta \mathbf{A}_{\mathsf{F16}}\right) \left(\mathbf{B}_{\mathsf{F16}} + \Delta \mathbf{B}_{\mathsf{F16}}\right) \\ &\sim \mathbf{A}_{\mathsf{F16}} \mathbf{B}_{\mathsf{F16}} + \Delta \mathbf{A}_{\mathsf{F16}} \mathbf{B}_{\mathsf{F16}} \\ &+ \mathbf{A}_{\mathsf{F16}} \Delta \mathbf{B}_{\mathsf{F16}} + \Delta \mathbf{A}_{\mathsf{F16}} \Delta \mathbf{B}_{\mathsf{F16}} \end{split}$$



#### Related work

- Markidis et al. proposed a SGEMM emulation method on Tensor Cores. NVIDIA Tensor Core Programmability, Performance & Precision, https://arxiv.org/abs/1803.04014 However, their method does not fit the accuracy of FP32 SIMT Core.
- Peng et al. claimed to fix the accuracy of Markidis' method by recovering the mantissa length kept by the method. EGEMM-TC: accelerating scientific computing on tensor cores with extended precision. PPoPP'21

Is it true? We have also investigated the mantissa length problem and concluded that it is not the problem of Markidis' method.

# Contribution (1/2)

- We have found that the rounding for accumulator inside Tensor Cores – RZ – causes the low accuracy of Markidis' method.
- To avoid this rounding, we use FP32 SIMT Core for the accumulation outside of Tensor Cores.



5/24

 $\mathbf{D}_{\mathrm{FP32}}$ 

# Contributions (2/2)

- Improve the accuracy of Markidis' method
  - 1 Calculat expectation mantissa length
  - 2 Found the causes the low accuracy: rounding inside Tensor Core
  - 3 Develop a method to avoid this rounding.
- 4 Reduce the underflow probability during the error correction by scaling error correction terms
- **5** Reduce computational complexity by omitting negligible error correction step
- **6** Demonstrate that our method outperforms the FP32 SIMT Core peak performance and consumes lower consumption while the the same level accuracy.

## (1/6) The expectation mantissa length of Markidis' method

 $\begin{array}{l} \mbox{[Markidis' method] } \mathbf{M}_{\text{F32}} \sim (\mathbf{M}_{\text{F16}} + \Delta \mathbf{M}_{\text{F16}}) \mbox{ where} \\ \mathbf{M}_{\text{F16}} \leftarrow \mbox{toF16} \left( \mathbf{M}_{\text{F32}} \right), \Delta \mathbf{M}_{\text{F16}} \leftarrow \mbox{toF16} \left( \mathbf{M}_{\text{F32}} - \mbox{toF32} \left( \mathbf{M}_{\text{F16}} \right) \right) \end{array}$ 

Feng *et al.* claimed that the Markidis' method can only keep 20 per 23 bits of the FP32 mantissa and this is the main cause of the low accuracy of the method.

#### ls it true?

#### - Answer from our investigation

No.

We found that the expectation mantissa length kept by Markidis' method is 22.75 per 23 bit.

Furthermore, we showed that this 0.25 bits of mantissa loss is not the main cause of the low accuracy of Markidis' method.

## (1/6) The expectation mantissa length of Markidis' method

To show that the mantissa loss is not the main cause of the low accuracy of Markidis' method, we conducted a small experiment.

- Compared Markidis' method to a preprocessed SGEMM, which sets the LSB of mantissa to zero.
- By this preprocess, the expectation of mantissa length becomes 22.5 bits (< 22.75 bits)</li>
- The accuracy of Markidis' method is worse than this preprocessed SGEMM.
- $\Rightarrow$  The 0.25 bits of mantissa loss is not the (main) cause of the low accuracy of Markidis' method.



## (2/6) Cause of Low Accuracy: Rounding Inside Tensor Core

- Made two types of Tensor Cores emulators which compute A<sub>FP16</sub>B<sub>FP16</sub> + C<sub>FP32</sub> using
   RN in C<sub>FP32</sub> addition.
  - **2** RZ in  $C_{FP32}$  addition.
- The accuracy of Markidis' method using "RZ in C<sub>FP32</sub> addition" is similar to the real Tensor Cores.



## (3/6) Avoiding the Rounding Inside Tensor Core

- We use FP32 SIMT adder to compute the C<sub>FP32</sub> addition using RN.
- This method avoids the direct RZ for C<sub>FP32</sub> addition.



## (3/6) Avoiding the Rounding Inside Tensor Core

With this approach, we improved the accuracy of Markidis' method.



 $\begin{array}{l} \label{eq:constraint} \mbox{(4/6) Reducing Underflow Probability with Scaling} \\ \mbox{One computation underflows with high probability in Markidis' method.} \\ \mbox{$M_{F16} \leftarrow toF16 (M_{F32})$} \\ \mbox{$\Delta M_{F16} \leftarrow toF16 (M_{F32} - toF32 (M_{F16}))$} \\ \end{array}$ 

This computation

We have investigated the underflow and gradual underflow probabilities.



 P<sub>u+gu</sub>: The underflow and gradual underflow probability.

 P<sub>u</sub> : The underflow probability.

## (4/6) Reducing Underflow Probability with Scaling

Markidis' method

$$\begin{split} \mathbf{M}_{\text{F16}} &\leftarrow \text{toF16}\left(\mathbf{M}_{\text{F32}}\right) \\ \Delta \mathbf{M}_{\text{F16}} &\leftarrow \text{toF16}\left(\mathbf{M}_{\text{F32}} - \text{toF32}\left(\mathbf{M}_{\text{F16}}\right)\right) \\ \mathbf{C} &\sim \mathbf{A}_{\text{F16}} \mathbf{B}_{\text{F16}} + \Delta \mathbf{A}_{\text{F16}} \mathbf{B}_{\text{F16}} + \mathbf{A}_{\text{F16}} \Delta \mathbf{B}_{\text{F16}} + \Delta \mathbf{A}_{\text{F16}} \Delta \mathbf{B}_{\text{F16}} \end{split}$$

 $\blacksquare$  Our prototype method scales the  $\Delta \mathbf{M}_{\text{F16}}$  computation.

$$\begin{split} \mathbf{M}_{\text{F16}} &\leftarrow \text{toF16}\left(\mathbf{M}_{\text{F32}}\right) \\ \Delta \mathbf{M}_{\text{F16}} &\leftarrow \text{toF16}\left(\left(\mathbf{M}_{\text{F32}} - \text{toF32}\left(\mathbf{M}_{\text{F16}}\right)\right) \times 2^{11}\right) \\ \mathbf{C} &\sim \mathbf{A}_{\text{F16}} \mathbf{B}_{\text{F16}} + \left(\Delta \mathbf{A}_{\text{F16}} \mathbf{B}_{\text{F16}} + \mathbf{A}_{\text{F16}} \Delta \mathbf{B}_{\text{F16}}\right) / 2^{11} + \Delta \mathbf{A}_{\text{F16}} \Delta \mathbf{B}_{\text{F16}} / 2^{11 \times 2} \end{split}$$

The "11" comes from the mantissa length of FP16 with implicit one bit, 10 + 1 = 11.

## (4/6) Reducing Underflow Probability with Scaling

The comparison of representation accuracy and range.

- Markidis' halfhalf : Represents one FP32 using two FP16s
- **Our halfhalf** : Markidis' halfhalf + scaling.
- Our tf32tf32 : Our halfhalf using TF32 (e8m10) instead of FP16.



## (5/6) Reducing computational complexity

Our prototype method

 $\mathbf{C} \sim \mathbf{A}_{\mathsf{F16}} \mathbf{B}_{\mathsf{F16}} + \left( \Delta \mathbf{A}_{\mathsf{F16}} \mathbf{B}_{\mathsf{F16}} + \mathbf{A}_{\mathsf{F16}} \Delta \mathbf{B}_{\mathsf{F16}} \right) / 2^{11} + \underline{\Delta \mathbf{A}_{\mathsf{F16}} \Delta \mathbf{B}_{\mathsf{F16}} / 2^{11 \times 2}}_{\mathsf{F16}}$ 

• Our final method omits " $+\Delta A_{F16}\Delta B_{F16}/2^{11\times2}$ " since the effect of this error correction computation is negligible to the FP32 23 bits of mantissa.

 $\mathbf{C} \sim \mathbf{A}_{\mathsf{F16}} \mathbf{B}_{\mathsf{F16}} + \left( \Delta \mathbf{A}_{\mathsf{F16}} \mathbf{B}_{\mathsf{F16}} + \mathbf{A}_{\mathsf{F16}} \Delta \mathbf{B}_{\mathsf{F16}} \right) / 2^{11}$ 



# (6/6) Evaluation

We have incorporated our method into NVIDIA CUTLASS 2.5.<sup>1</sup> to use the high performance functions, e.g. memory blocking strategies and thread allocations.

- Labels

- cutlass\_halfhalf : Uses FP16 Tensor Cores
- cutlass\_tf32tf32 : Uses TF32 Tensor Cores

Evaluation environment
 NVIDIA A100 40GB SXM4
 CUDA 11 3

<sup>1</sup>NVIDIA includes 3xTF32 SGEMM emulation into CUTLASS 2.8 independently from us.

# Accuracy evaluation (1/2)

We evaluate our implementation using various exponent distribution matrices.



- Error evaluation

$$\mathsf{RelativeResidual} = ||\mathbf{C}_{\mathsf{ref}} - \mathbf{C}||_F / ||\mathbf{C}_{\mathsf{ref}}||_F,$$

where  $|| \cdot ||_F$  is Frobenius norm and  $C_{ref}$  is a reference computation result in FP64.

# Accuracy evaluation (1/2)

- Test matrices



# Accuracy evaluation (2/2)



## Throughput evaluation



We have also measured the power consumption of each method and found thet our method consumes lower power compared to cuBLAS SGEMM.

#### To improve the throughput of our implementations...

- Reduce the shared memory bank conflict. Currently, the shared memory layout (skew) is not suitable to this error correction method and a lot of shared memory bank conflicts occur.
- Use mma.m16n8k16 instruction instead of mma.m16n8k8 for cutlass\_halfhalf. By using this instruction, we can reduce the latency. However, mma.m16n8k16 has an additional RZ between first half 8 accumulations and latter half 8 accumulations. We need to carefully investigate the effect.



#### Conclusion

We have improved Markidis' error correction method and demonstrated that our SGEMM emulation on Tensor Cores outperforms the theoretical peak performance of FP32 SIMT Core while achieving the same level of accuracy.



### Open problem

1 We haven't done the theoretical error analysis of avoiding RZ in our method. Although the RZ for adding  $C_{FP32}$  is avoided, the rounding for  $A_{FP16} \cdot B_{FP16}$  is still RZ.