Recently I came across a blog post from Apple with a chart showing the performance of the Apple Neural Engine (ANE).
(Figure: Apple Neural Engine performance chart)

As an iOS developer and the tech nerd I am, I was shocked to see Apple state that the FP16 performance of the A15 can reach close to 16 TFLOPS. For comparison, the FP16 performance of the RTX 2070 is 14.93 TFLOPS. I was curious how the A15 performs in practice, so I decided I'd better test it out.

1. Comparing the Apple Neural Engine (A15) against the RTX 2080

       ANE (A15)      RTX 2080
FP16   15.8 TFLOPS    20.14 TFLOPS
FP32   —              10.07 TFLOPS
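As a quick sanity check on these paper specs: Turing consumer GPUs execute FP16 at twice their FP32 rate, so the RTX 2080's FP16 figure follows directly from its FP32 one.

```python
# Peak-throughput arithmetic behind the table above.
rtx2080_fp32 = 10.07              # TFLOPS
rtx2080_fp16 = rtx2080_fp32 * 2   # 20.14 TFLOPS (2x FP32 on Turing)
ane_fp16 = 15.8                   # TFLOPS, Apple's figure for the A15 ANE

# On paper, the RTX 2080 has roughly 27% more FP16 throughput than the ANE.
ratio = rtx2080_fp16 / ane_fp16
print(f"RTX 2080 / ANE FP16 ratio: {ratio:.2f}")
```

So in theory the RTX 2080 should win the FP16 matchup; the rest of this post checks how that plays out in practice.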

How?

Since the ANE is specially optimized for convolutional networks, and I recently used RealESRGAN_x4plus to develop PixAI, I will use RealESRGAN_x4plus for the tests.
If you want to experience this model on your phone, you can download it from here.

Download on the App Store

Model Spec

Real-ESRGAN aims at developing Practical Algorithms for General Image/Video Restoration.

Model               Total Params    Param Size
RealESRGAN_x4plus   16,697,987      66.79 MB
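The param size column follows directly from FP32 storage: four bytes per parameter.

```python
# 16,697,987 FP32 parameters at 4 bytes each.
total_params = 16_697_987
param_size_mb = total_params * 4 / 1e6  # decimal megabytes
print(f"{param_size_mb:.2f} MB")        # -> 66.79 MB
```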

Model Structure

Test method:

  • Input: 1 x 3 x 256 x 256 RGB image
  • RTX2080: First do 10 inferences to warm up, then loop through 100 inferences and calculate the average inference time
  • MacBook Pro 2018 (2.3 GHz 4C8T Intel Core i5) with Metal GPU acceleration: measure the inference time with Xcode’s built-in performance testing tools (which report the median)
  • A15 Neural Engine: same as above
  • A15 Metal GPU: same as above

So how does the A15 chip fare in real-world performance?

This article is strictly focused on performance. For design, inputs, outputs, battery life, and so on, there are plenty of other resources out there.

2. Start Testing

RTX 2080, Batch Size 1, FP32

Test code:

import time

import torch

# First do 10 inferences to warm up, then loop through 100 inferences
# and calculate the average inference time. torch.cuda.synchronize()
# ensures the asynchronous CUDA kernels have finished before we read
# the clock, otherwise the timings would be meaningless.
with torch.no_grad():
    example_input = torch.rand(1, 3, 256, 256, dtype=torch.float32).to("cuda")
    esrganWrapper.eval()
    for _ in range(10):       # warm-up
        out = esrganWrapper(example_input)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):      # timed loop
        out = esrganWrapper(example_input)
    torch.cuda.synchronize()
    end = time.time()
avg = (end - start) / 100

Here the 100 timed inferences take 53.11 s in total, for an average of 531 ms per inference.

RTX 2080, Batch Size 6, FP32

Test code:

with torch.no_grad():
    # A batch of 6 RGB images: shape is 6 x 3 x 256 x 256
    # (batch dimension first, then the 3 color channels).
    example_input = torch.rand(6, 3, 256, 256, dtype=torch.float32).to("cuda")
    esrganWrapper.eval()
    for _ in range(2):        # warm-up
        out = esrganWrapper(example_input)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(25):       # 25 batches of 6 = 150 image inferences
        out = esrganWrapper(example_input)
    torch.cuda.synchronize()
    end = time.time()
avg = (end - start) / 150     # average per image

With a batch size of 6, VRAM usage is 7.6 GB, which is already at the limit of the RTX 2080's 8 GB.

Here the 25 timed batches (150 image inferences) take 75.54 s in total, for an average of 504 ms per image.
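To be explicit about the arithmetic: the timed loop runs 25 batches of 6 images each, which is why the total time is divided by 150 single-image inferences rather than by 25.

```python
# 25 timed batches x 6 images per batch = 150 image inferences.
batches, batch_size = 25, 6
total_time_s = 75.54                       # measured wall time of the timed loop
images = batches * batch_size              # 150
per_image_ms = total_time_s / images * 1e3
print(f"{per_image_ms:.0f} ms per image")  # -> 504 ms
```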

RTX 2080, Batch Size 6, FP16

The test code is the same as above, but with the model and input in FP16.

Here the 150 image inferences take 33.2 s in total, for an average of 221 ms per image.

MacBook Pro 2018

I use Xcode’s built-in Core ML performance testing tool for this section.

Note that Xcode reports the median, while we computed the average on the RTX 2080.

(Figure: Xcode performance report, MacBook Pro)

As the figure shows, the model runs entirely on the integrated GPU of the Intel i5 chip.

Here the median is 3238 ms, about 6 times slower than the RTX 2080.

Apple GPU (A15)

The test method here is the same as for the MacBook.

(Figure: Xcode performance report, A15 GPU)

As can be seen from the figure, the model runs entirely on the Apple GPU of the A15 chip.

Here the median is 2618 ms, about 5 times slower than the RTX 2080. It is still faster than the 2018 MacBook Pro and its Intel Iris Plus Graphics 655, though.

Apple Neural Engine (A15)

Finally, it's time to test our main character; let's hope it doesn't let us down.
The test method is the same as above, except that the compute units are limited to the CPU and the Neural Engine.
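For reference, here is a minimal sketch of how the compute units can be restricted when loading a Core ML model with coremltools (the .mlpackage filename is hypothetical, and this requires macOS plus a previously converted model):

```python
import coremltools as ct

# CPU_AND_NE excludes the GPU: layers the Neural Engine supports run
# there, and anything unsupported falls back to the CPU.
model = ct.models.MLModel(
    "RealESRGAN_x4plus.mlpackage",   # assumed name of the exported model
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
```

Xcode's performance test offers the same choice through its compute-unit selector.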

(Figure: Xcode performance report, A15 Neural Engine)

As the figure shows, the model runs entirely on the Neural Engine of the A15 chip.
In the end, the Apple Neural Engine doesn't let us down: it is even slightly faster than the RTX 2080 (10.07 TFLOPS) running FP32.
Here the median is 495 ms, which is 36 ms faster than the RTX 2080 at FP32!

3. Conclusion

The performance of the Apple Neural Engine did not disappoint: it even beats the ~220 W RTX 2080 running FP32. Although the ANE's FP16 did not reach its theoretical peak here, considering its power consumption, this level of performance is very impressive.

Device                 Inference Time    Advantage
RTX 2080 BS=1 FP32     531 ms            0% (baseline)
RTX 2080 BS=6 FP32     504 ms            +5.1%
RTX 2080 BS=6 FP16     221 ms            +58.4%
Intel Iris Plus 655    3238 ms           -510%
A15 GPU                2618 ms           -393%
A15 ANE                495 ms            +6.8%
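The advantage column is just each result's speedup relative to the RTX 2080 BS=1 FP32 baseline of 531 ms; recomputing it (values may differ from the table by rounding):

```python
# Speedup relative to the 531 ms baseline; positive = faster.
baseline_ms = 531
results_ms = {
    "RTX 2080 BS=6 FP32": 504,
    "RTX 2080 BS=6 FP16": 221,
    "Intel Iris Plus 655": 3238,
    "A15 GPU": 2618,
    "A15 ANE": 495,
}
advantage = {
    name: (baseline_ms - t) / baseline_ms * 100
    for name, t in results_ms.items()
}
for name, adv in advantage.items():
    print(f"{name}: {adv:+.1f}%")
```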

However, the ANE has its own shortcomings: it does not support every network op the way the GPU does. But for a low-power processor, the performance it delivers is already surprising enough.
