Mandelbrot Fractal VFP version
Kuemmel (439) 189 posts 
With lots of help from Terje, I managed to code a VFP version, for single and double precision, of my Mandelbrot fractal benchmark FixFrac. You can find it here: FracVFP. As expected, the speed compared to my fixed point Mandelbrot code is not that high, and the results fit the instruction timings listed for the VFP instructions. I know it’s not really comparable to fixed point, but here are the results in seconds (800 MHz) to calculate that Mandelbrot picture:
So it’s about 5 times slower. But anyway, I’m really happy that it works, and it’s fun to code for the VFP, as you don’t have to deal with all the issues of fixed point math. There’s a nice Quick Reference Card from ARM for coding for the VFP. As I’m only using the basic FMUL/FADD/FSUB instructions in the time-critical section (due to the nature of the algorithm), I guess the benefit of the VFP over fixed point math could be considerably higher for code that needs the more complex instructions (VDIV, VSQRT).
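To illustrate why the inner loop only needs multiply, add and subtract, and what the "issues of fixed point math" look like, here is a minimal Python sketch of the escape-time iteration in both styles. It is not taken from FixFrac or FracVFP: the function names and the 24 fractional bits are assumptions chosen for the example.

```python
FRAC_BITS = 24  # assumed fixed point format for this sketch, not FixFrac's actual format


def iterate_float(cr, ci, max_iter):
    """Escape-time iteration z = z*z + c using only mul/add/sub."""
    zr = zi = 0.0
    for n in range(max_iter):
        zr2 = zr * zr
        zi2 = zi * zi
        if zr2 + zi2 > 4.0:          # |z|^2 > 4 means the point has diverged
            return n
        zi = 2.0 * zr * zi + ci
        zr = zr2 - zi2 + cr
    return max_iter


def to_fixed(x):
    """Convert a float to the assumed fixed point format."""
    return int(round(x * (1 << FRAC_BITS)))


def iterate_fixed(cr, ci, max_iter):
    """Same iteration with integers; every product must be rescaled by a shift."""
    cr_f, ci_f = to_fixed(cr), to_fixed(ci)
    four = 4 << FRAC_BITS
    zr = zi = 0
    for n in range(max_iter):
        # Python integers never overflow; real code needs a wide intermediate
        # product here (for example a 64-bit result from a 32x32 multiply).
        zr2 = (zr * zr) >> FRAC_BITS
        zi2 = (zi * zi) >> FRAC_BITS
        if zr2 + zi2 > four:
            return n
        zi = ((zr * zi) >> (FRAC_BITS - 1)) + ci_f   # shift by one less gives 2*zr*zi
        zr = zr2 - zi2 + cr_f
    return max_iter
```

The floating point version has no scaling or range bookkeeping at all, which is the convenience described above; the fixed point version trades that for cheap integer instructions.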
If you want to assemble the source code with ExtASM, make sure you have the latest version, as we had to deal with a small problem regarding the DCFD command (different encoding for FPA and Cortex-A8 VFP). Make sure your desktop is set to 16 million colours and a resolution of at least 800×600 when starting the application. It’s really just a first release, so there might still be some problems. Any comments welcome.
Terje Slettebø (285) 219 posts 
Great job, Michael! I just wanted to add that the “only” thing I did was to correct the extASM DCFD assembly directive (declare double-precision floating-point value) so that it worked for the VFP: It turns out that the FPA and VFP store double-precision values in different word order, so it now uses the VFP word order, and to get the original FPA word order, one needs to add an #fpa directive in the source code. Furthermore, it seems Michael got some help from reading the CubeDemo source code, as well as from the debugging tools there, which I’m happy that he found useful. Lastly, the VFP init code is originally from Jeffrey Lee. :) (It should also be unnecessary in later versions of RISC OS, being included in the VFPSupport module, although I haven’t tested that yet.) To my knowledge, this is the first Beagleboard RISC OS demo using the VFP/NEON unit, which is quite cool. :) Even though the timing may not be that impressive compared to fixed-point math, the time should be at least cut in half on Cortex-A9, with its faster VFP unit (Cortex-A8 only has a “VFP Lite” unit). Furthermore, using the NEON unit, it should be possible to get timings comparable to fixed-point, even today. The latest version of extASM hasn’t been uploaded yet, but I’ll do that tonight.
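The word-order difference can be made concrete with a small Python sketch. It assumes a little-endian ARM system, and that the FPA places the most significant word (sign and exponent) at the lower address while the VFP uses plain little-endian order; the function names are made up for the example and have nothing to do with extASM's internals.

```python
import struct


def double_words_vfp(value):
    """The two 32-bit words of a double in plain little-endian (VFP) memory order."""
    low, high = struct.unpack("<II", struct.pack("<d", value))
    return [low, high]


def double_words_fpa(value):
    """Same double with the words swapped: most significant word first (assumed FPA order)."""
    low, high = double_words_vfp(value)
    return [high, low]


for name, words in (("VFP", double_words_vfp(1.0)), ("FPA", double_words_fpa(1.0))):
    print(name, " ".join(f"&{w:08X}" for w in words))
```

A constant emitted in one order and loaded by the other unit ends up with its two halves exchanged, which is why the directive has to know which unit the data is meant for.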
Bryan Hogan (339) 238 posts 
Does anyone have the original FastBrot program written by Stephen Streater? I seem to remember that only took a few seconds to calculate on an 8MHz ARM2! It would be interesting to get this running on a BB and see how fast it goes. Plug – Stephen is the guest speaker at ROUGOL on Monday 17th January. It would be fun to have FastBrot running there. 
Trevor Johnson (329) 1650 posts 
Sorry, no. I guess you’ve already done some searches. It’s listed here and there’s a small chance that Simon Burrows in Nottingham is the same person. (I presume the [Edit: Or perhaps it’s more likely to be this Simon Burrows. (He was apparently a member of The ARM Club.)] 
Kuemmel (439) 189 posts 
@Terje: I think I’ll try to do a NEON version soon, just to learn more about that unit of the CPU. It’ll require a bit more time, as the Mandelbrot points you iterate in parallel within one SIMD instruction can reach the end of their iteration at different times while others still have to continue, so a lot more logic has to be implemented to do that fast. I did this in SSE2 on x86 a while ago, so it’ll be nice to see that running on RISC OS, too.
Kuemmel (439) 189 posts 
I did my first steps with the NEON unit. Even though I used it only in a non-SIMD way (making use of just one of the possible 4 single precision numbers in one 128-bit wide “Qx” register), it was way faster than the VFP for single precision. Compared to the numbers above, I got it done in 5.67 seconds! This again makes sense when looking at instruction cycles (e.g. VADD is about 9-10 cycles for VFP and 2 cycles for NEON on Cortex-A8). So I hope to soon be able to implement all the iteration logic using all 4 numbers instead of 1 without much overhead. In the ideal case (of course some overhead will be there due to the algorithm) a further speed-up by a factor of 4 is possible. That will beat the fixed point stuff by far. I guess if one only wants to use single precision, NEON is the way to do it, and that promises a real speed gain compared to fixed point math.
Terje Slettebø (285) 219 posts 
“I did my first steps with the NEON unit. Though even as I used it only in a non-SIMD way (making use of just one of the possible 4 single precision numbers in one 128-bit wide “Qx” register) it was way faster than the VFP for single precision.”
Cool. :)
Yes indeed: :) “Recommendations: For floating-point operations, use the NEON unit where possible, and only use the VFP unit when needed.”
Kuemmel (439) 189 posts 
Finally I managed to code the first real SIMD NEON version of my Mandelbrot fractal benchmark. You can find it here: FracNEONVFP. It also includes the VFP version. The NEON code is more than 10 times faster and still “mathematically equal” to the single precision VFP code. There are still some possibilities for optimisation, as I always wait for all 4 pixels in the Qx SIMD register to diverge. And of course I’m still learning more about NEON each day. One could implement logic to feed new pixels into the iteration chain, but that’s for later ;) At the moment I’m just impressed how fast it is…so “brave new world” of NEON for RISC OS for all applications that require fast single precision math. Here are the updated results in seconds (800 MHz). I also tuned the VFP code a bit.
@EDIT: Some small corrections due to some leftover junk code in FracVFP. Download and results corrected.
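The "wait for all 4 pixels" scheme described in this post can be sketched in Python by simulating the four lanes of a Qx register with lists. This is only an illustration of the idea, not the FracNEONVFP code: the names are invented, and the per-lane `if` stands in for what would be compare and mask instructions on NEON.

```python
LANES = 4  # four single precision values per 128-bit register


def iterate_block(points, max_iter):
    """Iterate 4 points in lockstep until every lane has diverged or hit max_iter.

    points is a list of 4 (cr, ci) tuples. Returns the per-lane iteration
    counts and the number of vector steps the whole block needed.
    """
    zr = [0.0] * LANES
    zi = [0.0] * LANES
    counts = [0] * LANES
    active = [True] * LANES
    vector_steps = 0
    while any(active):
        vector_steps += 1
        for k in range(LANES):       # one pass = one set of vector instructions
            if not active[k]:
                continue             # a finished lane just idles along
            cr, ci = points[k]
            zr2 = zr[k] * zr[k]
            zi2 = zi[k] * zi[k]
            if zr2 + zi2 > 4.0 or counts[k] == max_iter:
                active[k] = False    # lane done, but the block keeps going
                continue
            counts[k] += 1
            zi[k] = 2.0 * zr[k] * zi[k] + ci
            zr[k] = zr2 - zi2 + cr
    return counts, vector_steps
```

The block always costs as many steps as its slowest pixel, so near the boundary of the set a single deep pixel keeps three finished lanes waiting. That idle time is the remaining optimisation potential mentioned above.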

Trevor Johnson (329) 1650 posts 
This sounds pretty impressive :) And in case you’ve not all had enough of poor Christmas cracker jokes… What does the B in Benoît B Mandelbrot stand for? 
Kuemmel (439) 189 posts 
@Trevor :) I tuned my NEON code again. As mentioned before, I have now managed to implement all the logic so that pixels which have diverged or reached the maximum number of iterations are instantly replaced by new ones, so the full potential of SIMD is used. You can find it here: FracNEONVFP. It also includes the unchanged VFP version. I updated the results table in seconds (800 MHz) here again.
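The lane replacement idea can be sketched in Python as follows, again simulating the four lanes with lists. This is an assumption-laden illustration rather than the actual NEON code: the names, the pixel queue and the per-lane bookkeeping are invented for the example.

```python
from collections import deque

LANES = 4  # four single precision values per 128-bit register


def render_with_refill(points, max_iter):
    """Iterate all points, refilling a lane as soon as its pixel is finished.

    points is a list of (cr, ci) tuples. Returns the iteration count for each
    point (in input order) and the total number of vector steps.
    """
    results = [None] * len(points)
    pending = deque(enumerate(points))   # pixels not yet assigned to a lane
    lane_pixel = [None] * LANES          # index of the pixel each lane works on
    cr = [0.0] * LANES
    ci = [0.0] * LANES
    zr = [0.0] * LANES
    zi = [0.0] * LANES
    counts = [0] * LANES

    def load(k):
        """Put the next pending pixel into lane k, or mark the lane empty."""
        if pending:
            lane_pixel[k], (cr[k], ci[k]) = pending.popleft()
            zr[k] = zi[k] = 0.0
            counts[k] = 0
        else:
            lane_pixel[k] = None

    for k in range(LANES):
        load(k)

    vector_steps = 0
    while any(p is not None for p in lane_pixel):
        vector_steps += 1
        for k in range(LANES):
            if lane_pixel[k] is None:
                continue
            zr2 = zr[k] * zr[k]
            zi2 = zi[k] * zi[k]
            if zr2 + zi2 > 4.0 or counts[k] == max_iter:
                results[lane_pixel[k]] = counts[k]   # store the finished pixel
                load(k)                              # new pixel starts next step
                continue
            counts[k] += 1
            zi[k] = 2.0 * zr[k] * zi[k] + ci[k]
            zr[k] = zr2 - zi2 + cr[k]
    return results, vector_steps
```

With this scheme a slow pixel only occupies its own lane while the other lanes keep pulling fresh pixels from the queue. The price is the extra bookkeeping per lane, since each lane now needs its own pixel position and iteration counter.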
As I wrote similar x86 assembly code, I want to give you a feeling for how fast or slow the CA8 is from that perspective (even though Intel can of course use double precision (SSE2), so I have to halve the CA8 results for fair play). Even with that, the CA8 is still about 2 times faster clock for clock than an Intel Atom, and about 5 times slower than the latest Intel i7, which I still find very good, and of course I guess the CA8 is, as always, the winner in terms of performance/Watt ;) Now I would really like to see a CA9 result…praying for a port of RISC OS one day ;) I might try a real-time video-style zoom demo, as at low iteration depth the speed even at 800×600 can be more than 10 frames per second. My benchmark uses an iteration depth of 4096. As I looked more closely at NEON, it could also do wonders for integer stuff: the 128-bit wide registers can be used very efficiently for pairwise additions in single instructions (e.g. look here: Sum Integers). I guess this can speed up some data (e.g. image) manipulation by an order of magnitude. I’ll try some day what it can do for my fire benchmark…
W P Blatchley (147) 247 posts 
I haven’t had a chance to look at this yet, but I’m really looking forward to it! Thanks for making your efforts public. Would you consider putting some information on the ROOL Wiki about some of the things you’ve found out while working on this VFP / NEON code? It could benefit others in the future, I should think! 
Trevor Johnson (329) 1650 posts 
My data’s in the table in the ‘Benchmarks’ thread. 