Recently I came across a blog post from Apple with a chart showing the performance of the Apple Neural Engine (ANE).
(Figure: Apple Neural Engine performance chart)

As an iOS developer and the tech nerd I am, I was shocked to see Apple state that the FP16 performance of the A15 can reach close to 16 TFLOPS. For comparison, the FP16 performance of the RTX 2070 is 14.93 TFLOPS. I was curious how the A15 performs in practice, so I decided I'd better test it out.

1. Comparing the Apple Neural Engine (A15) against the RTX 2080

       ANE (A15)      RTX 2080
FP16   15.8 TFLOPS    20.14 TFLOPS
FP32   —              10.07 TFLOPS
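As a quick sanity check on these paper specs: Turing consumer GPUs execute FP16 at twice their FP32 rate, so the RTX 2080's FP16 figure follows directly from its FP32 one.

```python
# Peak-throughput arithmetic behind the table above.
rtx2080_fp32 = 10.07              # TFLOPS
rtx2080_fp16 = rtx2080_fp32 * 2   # 20.14 TFLOPS (2x FP32 on Turing)
ane_fp16 = 15.8                   # TFLOPS, Apple's figure for the A15 ANE

# On paper, the RTX 2080 has roughly 27% more FP16 throughput than the ANE.
ratio = rtx2080_fp16 / ane_fp16
print(f"RTX 2080 / ANE FP16 ratio: {ratio:.2f}")
```

So in theory the RTX 2080 should win the FP16 matchup; the rest of this post checks how that plays out in practice.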

How?

Since the ANE is specially optimized for convolutional networks, and I recently used RealESRGAN_x4plus to develop PixAI, I will use RealESRGAN_x4plus for the tests.
If you want to experience this model on your phone, you can download it from here.

Download on the App Store

Model Spec

Real-ESRGAN aims at developing Practical Algorithms for General Image/Video Restoration.

Model               Total Params    Param Size
RealESRGAN_x4plus   16,697,987      66.79 MB
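The param size column follows directly from FP32 storage: four bytes per parameter.

```python
# 16,697,987 FP32 parameters at 4 bytes each.
total_params = 16_697_987
param_size_mb = total_params * 4 / 1e6  # decimal megabytes
print(f"{param_size_mb:.2f} MB")        # -> 66.79 MB
```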

Model Structure

Test method:

  • Input: 1 x 3 x 256 x 256 RGB image
  • RTX2080: First do 10 inferences to warm up, then loop through 100 inferences and calculate the average inference time
  • MacBook Pro 2018 (2.3 GHz 4C8T Intel Core i5) with Metal GPU acceleration: measure the inference time with Xcode’s built-in performance testing tools (which report the median)
  • A15 Neural Engine: same as above
  • A15 Metal GPU: same as above

So how does the A15 chip fare in real-world performance?

This article is strictly focused on performance. For design, inputs, outputs, battery life, and so on, there are plenty of other resources out there.

2. Start Testing

RTX 2080, Batch Size 1, FP32

Test code:

import time

import torch

# First do 10 inferences to warm up, then loop through 100 inferences
# and calculate the average inference time. torch.cuda.synchronize()
# ensures the asynchronous CUDA kernels have finished before we read
# the clock, otherwise the timings would be meaningless.
with torch.no_grad():
    example_input = torch.rand(1, 3, 256, 256, dtype=torch.float32).to("cuda")
    esrganWrapper.eval()
    for _ in range(10):       # warm-up
        out = esrganWrapper(example_input)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):      # timed loop
        out = esrganWrapper(example_input)
    torch.cuda.synchronize()
    end = time.time()
avg = (end - start) / 100

Here the 100 timed inferences take 53.11 s in total, for an average of 531 ms per inference.

RTX 2080, Batch Size 6, FP32

Test code:

with torch.no_grad():
    # A batch of 6 RGB images: shape is 6 x 3 x 256 x 256
    # (batch dimension first, then the 3 color channels).
    example_input = torch.rand(6, 3, 256, 256, dtype=torch.float32).to("cuda")
    esrganWrapper.eval()
    for _ in range(2):        # warm-up
        out = esrganWrapper(example_input)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(25):       # 25 batches of 6 = 150 image inferences
        out = esrganWrapper(example_input)
    torch.cuda.synchronize()
    end = time.time()
avg = (end - start) / 150     # average per image

With a batch size of 6, VRAM usage is 7.6 GB, which is already at the limit of the RTX 2080's 8 GB.

Here the 25 timed batches (150 image inferences) take 75.54 s in total, for an average of 504 ms per image.
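To be explicit about the arithmetic: the timed loop runs 25 batches of 6 images each, which is why the total time is divided by 150 single-image inferences rather than by 25.

```python
# 25 timed batches x 6 images per batch = 150 image inferences.
batches, batch_size = 25, 6
total_time_s = 75.54                       # measured wall time of the timed loop
images = batches * batch_size              # 150
per_image_ms = total_time_s / images * 1e3
print(f"{per_image_ms:.0f} ms per image")  # -> 504 ms
```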

RTX 2080, Batch Size 6, FP16

The test code is the same as above, but with the model and input in FP16.

Here the 150 image inferences take 33.2 s in total, for an average of 221 ms per image.

MacBook Pro 2018

I use Xcode’s built-in Core ML performance testing tool for this section.

Note that Xcode reports the median, while we computed the average on the RTX 2080.

(Figure: Xcode performance report, MacBook Pro)

As the figure shows, the model runs entirely on the integrated GPU of the Intel i5 chip.

Here the median is 3238 ms, about 6 times slower than the RTX 2080.

Apple GPU (A15)

The test method here is the same as for the MacBook.

(Figure: Xcode performance report, A15 GPU)

As can be seen from the figure, the model runs entirely on the Apple GPU of the A15 chip.

Here the median is 2618 ms, about 5 times slower than the RTX 2080. It is still faster than the 2018 MacBook Pro and its Intel Iris Plus Graphics 655, though.

Apple Neural Engine (A15)

Finally, it's time to test our main character; let's hope it doesn't let us down.
The test method is the same as above, except that the compute units are limited to the CPU and the Neural Engine.
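For reference, here is a minimal sketch of how the compute units can be restricted when loading a Core ML model with coremltools (the .mlpackage filename is hypothetical, and this requires macOS plus a previously converted model):

```python
import coremltools as ct

# CPU_AND_NE excludes the GPU: layers the Neural Engine supports run
# there, and anything unsupported falls back to the CPU.
model = ct.models.MLModel(
    "RealESRGAN_x4plus.mlpackage",   # assumed name of the exported model
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
```

Xcode's performance test offers the same choice through its compute-unit selector.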

(Figure: Xcode performance report, A15 Neural Engine)

As the figure shows, the model runs entirely on the Neural Engine of the A15 chip.
In the end, the Apple Neural Engine doesn't let us down: it is even slightly faster than the RTX 2080 (10.07 TFLOPS) running FP32.
Here the median is 495 ms, which is 36 ms faster than the RTX 2080 at FP32!

3. Conclusion

The performance of the Apple Neural Engine did not disappoint: it even beats the ~220 W RTX 2080 running FP32. Although the ANE's FP16 did not reach its theoretical peak here, considering its power consumption, this level of performance is very impressive.

Device                 Inference Time    Advantage
RTX 2080 BS=1 FP32     531 ms            0% (baseline)
RTX 2080 BS=6 FP32     504 ms            +5.1%
RTX 2080 BS=6 FP16     221 ms            +58.4%
Intel Iris Plus 655    3238 ms           -510%
A15 GPU                2618 ms           -393%
A15 ANE                495 ms            +6.8%
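The advantage column is just each result's speedup relative to the RTX 2080 BS=1 FP32 baseline of 531 ms; recomputing it (values may differ from the table by rounding):

```python
# Speedup relative to the 531 ms baseline; positive = faster.
baseline_ms = 531
results_ms = {
    "RTX 2080 BS=6 FP32": 504,
    "RTX 2080 BS=6 FP16": 221,
    "Intel Iris Plus 655": 3238,
    "A15 GPU": 2618,
    "A15 ANE": 495,
}
advantage = {
    name: (baseline_ms - t) / baseline_ms * 100
    for name, t in results_ms.items()
}
for name, adv in advantage.items():
    print(f"{name}: {adv:+.1f}%")
```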

However, the ANE has its own shortcomings: it does not support every network op the way the GPU does. But for a low-power processor, the performance it delivers is already surprising enough.
