FP support
David J. Ruck (33) 1503 posts |
Great stuff, any chance of the same thing for Fireworkz? |
Stuart Swales (8827) 1257 posts |
Quite likely, I would think ;-) |
Paolo Fabio Zaino (28) 1795 posts |
@ Stuart I’ve got my hands full ATM, but if you want I can give it a try this weekend while testing other stuff. If so link to pull down the source for build (or binary for test only) pleaseeeee :) P.S. Awesome work! |
Stuart Swales (8827) 1257 posts |
Have a whirl with this source tarball. Interested in comments at present as to whether to polish further – should we go to Code Review? Note that I haven’t yet implemented fetestexcept() and friends for the VFP world (expect NaNs and Infs rather than SIGFPE barfs). http://croftnuisk.co.uk/coltsoft-downloads/other/apcs_softpcs_20210924.zip In the end I had to change PipeDream very little to use this library; I didn’t get C99 double complex with Norcroft /softfp working so had to revert to my own old implementation, and change one trivial inline function to non-inline to stop a compiler barf. [Edit: I forgot about adding -DAPCS_SOFTPCS as well as using -apcs /softfp as the compiler doesn’t seem to define anything useful] |
Chris Gransden (337) 1150 posts |
I did a quick test with the flops.c benchmark.
|
Stuart Swales (8827) 1257 posts |
Thanks Chris! 10x, but could be 10x better, eh. As I mentioned somewhere else, I see it as a useful stepping-stone towards using some of the potential performance offered by new hardware without abandoning the old. I’m sure I’m not the only person who is pretty much tied into continuing to use Norcroft for RISC OS targets given various pragmas and globs of assembler. Chris: I tried with flops.c from the interweb to see what ops that used and get vastly different results to yours – could I have a copy please? Ta. |
Chris Gransden (337) 1150 posts |
I’ve just sent it. I’ll see if I can find and build something that is more of a real world test. |
Stuart Swales (8827) 1257 posts |
Thanks – results now believably closer. Mine are somewhat lower due to older HW (ARMX6@1GHz) but the gap between Norcroft -Otime with apcs_softpcs and gcc -O2 -mfpu=vfp (4.7.4) is also lower, about a factor of three to four, not ten. I do see a factor of ten still between Norcroft FPA and Norcroft with apcs_softpcs. |
David Pitt (3386) 1248 posts |
Some results using this flops.c built with GCC 4.7.4:
*gcc flops.c -o flops -mfpu=vfp
On the 1.5GHz Titanium:
FLOPS C Program (Double Precision), V2.0 18 Dec 1992
Iterations = 512000000
NullTime (usec) = 0.0014
MFLOPS(1) = 291.6420
MFLOPS(2) = 295.3921
MFLOPS(3) = 343.1982
MFLOPS(4) = 347.2868
On the RPi400 at 2.4GHz:
Iterations = 512000000
NullTime (usec) = 0.0009
MFLOPS(1) = 697.6939
MFLOPS(2) = 811.5966
MFLOPS(3) = 886.8430
MFLOPS(4) = 898.4188 |
Stuart Swales (8827) 1257 posts |
David: Try with -O2, might get closer to Chris’ results. |
Chris Gransden (337) 1150 posts |
I used -O3. While trying to link something else I get an undefined symbol for __apcs_softpcs__lrintf. |
Stuart Swales (8827) 1257 posts |
Ah, overzealous bit of macro-ing in apcs_softpcs.h! Thanks. lrint and llrint (and friends) didn’t need wrapping, and might not benefit much from VFP-ing as their implementation in the C library is pure ARM w/o FPA. [Edit: the above is true for lrint/lrintl/llrint/llrintl but NOT for lrintf/llrintf when compiled with |
Rick Murray (539) 13440 posts |
Norcroft really needs to move away from emitting FPA instructions, and these examples are (yet another) demonstration why. I noticed a few versions ago it has some options for the FPU type. I don’t think they do anything, but maybe it’s planned? Fingers crossed! |
David Pitt (3386) 1248 posts |
David: Try with -O2, might get closer to Chris’ results.
Thanks both, O2 good O3 better. (RPi400 2400MHz)
*gcc flops.c -o flops -mfpu=vfp -O2
*flops
FLOPS C Program (Double Precision), V2.0 18 Dec 1992
Iterations = 512000000
NullTime (usec) = 0.0000
MFLOPS(1) = 1669.2163
MFLOPS(2) = 1127.8841
MFLOPS(3) = 1596.2417
MFLOPS(4) = 1860.7029
*
*gcc flops.c -o flops -mfpu=vfp -O3
*flops
FLOPS C Program (Double Precision), V2.0 18 Dec 1992
Iterations = 512000000
NullTime (usec) = 0.0000
MFLOPS(1) = 1889.5671
MFLOPS(2) = 1158.4400
MFLOPS(3) = 1661.5248
MFLOPS(4) = 2009.1419
* |
Stuart Swales (8827) 1257 posts |
Not just the compiler, Rick. The C library is chokka with FPA assembler. For instance the lrintf code could be sped-up usefully by having a VFP code branch as well as the existing FPA branch just to do the callee-narrowing prior to the common ARM bit. But how much are we prepared to annoy people who for good reasons are still running old hardware (“it works for me in my setup”) or emulators (“don’t have hardware at work, just RPCEmu on the laptop”)? I wouldn’t bother to release a VFP-only version of PipeDream or Fireworkz as the performance gains in those applications wouldn’t be worth it for 99% of users, whereas something that usefully boosts performance for anyone with modern-ish hardware without unduly penalising the other users looks like a win to me. Anyone who needs to run at highest performance needs to grab code and compile it to suit their needs. |
Chris Gransden (337) 1150 posts |
Commenting out lrintf in the header got it to link. Here are the results for twolame converting a wav file to mp2 on a RPi CM4 @2.4GHz:
fpa: 374.7 secs
softpcs (APCS_SOFTPCS_RUNTIME_SWITCH: TRUE): 16.18 secs
softpcs (APCS_SOFTPCS_RUNTIME_SWITCH: FALSE): 15.02 secs
gcc 4.7.4 vfp: 3.89 secs |
Stuart Swales (8827) 1257 posts |
Wow! That’s a win. Edit: Note that you can assemble the library with APCS_SOFTPCS_RUNTIME_SWITCH set to {FALSE} for more performance when you know the target will have VFP. That setting eliminates an LDR/LDR/TEQ/BEQ sequence for each basic f.p. operator. e.g. on my 1.0GHz ARMX6:
*flops-vfpf [no run-time switch, so VFP required for basic operations, VFP with FPA fallback for library functions]
Iterations = 128000000
NullTime (usec) = 0.0020
MFLOPS(1) = 73.4334
MFLOPS(2) = 73.0166
MFLOPS(3) = 78.5507
MFLOPS(4) = 82.1033
*flops-vfps [run-time switch for VFP/FPA for everything]
Iterations = 128000000
NullTime (usec) = 0.0020
MFLOPS(1) = 60.0289
MFLOPS(2) = 61.3677
MFLOPS(3) = 65.3769
MFLOPS(4) = 67.8630
This benchmark is just really exercising the basic f.p. operators. I forgot to mention, these figures are WITHOUT -Otime for flops.c as that degraded it slightly… Several of the individual methods seem to run faster when flops.c is compiled with -arch 2 -cpu 3 as it uses LDM (and STM) rather than two LDRs to move the double precision values around. |
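The cost of the run-time switch described above can be pictured in C. This is only a minimal sketch: the real library is written in assembler, and the names RUNTIME_SWITCH, have_vfp, fp_init, add_vfp, add_fpa and soft_add are all hypothetical stand-ins, not identifiers from apcs_softpcs. Both "implementations" here are plain C additions so that the sketch runs anywhere.

```c
#include <assert.h>

/* Hypothetical stand-in for the APCS_SOFTPCS_RUNTIME_SWITCH setting. */
#ifndef RUNTIME_SWITCH
#define RUNTIME_SWITCH 1
#endif

/* Hypothetical flag, set once at start-up after probing for VFP. */
static int have_vfp = 0;

static void fp_init(int vfp_present)
{
    assert(vfp_present == 0 || vfp_present == 1);
    have_vfp = vfp_present;
}

/* Stand-ins for the two code paths; both are plain C in this sketch. */
static double add_vfp(double a, double b) { return a + b; }
static double add_fpa(double a, double b) { return a + b; }

/* The kind of helper a /softfp compiler calls for every a + b. */
static double soft_add(double a, double b)
{
#if RUNTIME_SWITCH
    /* Load the flag, test it, branch: paid on every single operation. */
    if (!have_vfp)
        return add_fpa(a, b);
#endif
    /* With the switch compiled out, VFP is simply assumed to exist. */
    return add_vfp(a, b);
}
```

Building with RUNTIME_SWITCH set to 0 removes the test and branch from soft_add entirely, which is the trade the {FALSE} setting makes: a little more speed in exchange for requiring VFP.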
Rick Murray (539) 13440 posts |
I know. It will need some stuff duplicated, but then maybe a little bit of smarts will be able to set up the Stubs jump table accordingly if called using a new SWI (LibInitAPCS32VFP or something?) in order that older FPA software also works as expected.
My thoughts on this are that it isn’t really an annoyance as such. Everything they own and everything they use won’t mysteriously cease to function. The only difference is that upgrades and new releases of some things won’t work. Firstly, there is precedent from Acorn (think of all the RiscPC extended stuff that was never officially made available for older machines (why do you think Dummy Dynamic Areas were created?)). Secondly, there is precedent, take a look at https://www.riscosports.co.uk/vfp/ and note that it isn’t aimed at anything pre-5.2x with VFP. Thirdly, this should be a question for each individual author. Some bend over backwards to use StubsG to support “damn near everything”, while others figure after over a quarter of a century, the RiscPC has had a good run, but it shouldn’t be a millstone preventing future progress. “Because ancient machines” is a pretty lousy excuse for not having the DDE compiler support hardware maths, and that sort of logic might push people less lazy than me to GCC. I wouldn’t move, as I suck at maths so my code doesn’t tend to be maths heavy, but Chris has provided yet another example of the limitations of emulated FP. I mean, literally, six odd minutes (FPA) versus a mite under four seconds (VFP). I’m not entirely certain what softpcs actually is, but even that hands FPA its arse on a plate, running in at sixteen seconds. Which is way closer to four than it is to six freaking minutes! As such, softpcs would seem an acceptable alternative (how might I use this in my programs? (Norcroft compiler)), as, really, it’s FPA that’s not fit for purpose… 1 It used to be, but these days I try to avoid turning on the power hungry Windows box. |
Stuart Swales (8827) 1257 posts |
RE: apcs_softpcs – see my post from 19 hours ago (I have no idea how to paste links to individual posts here) |
David Pitt (3386) 1248 posts |
At the required post click on the “19 hours” link, that is the link required, copy it from the URL bar. I do this in a second browser window to avoid losing my place. |
Chris Gransden (337) 1150 posts |
Down from 16.18 secs to 15.02 secs. |
Stuart Swales (8827) 1257 posts |
I did wonder about having the first instruction of each function being @David: Thanks – I can only see links to topics and posts but have now found the post id to use by HTML inspection. Let’s see if I can do it: https://www.riscosopen.org/forum/forums/2/topics/3457?page=1#posts-45080 was what inspired me to do this. |
David Pitt (3386) 1248 posts |
It only works from within the topics but not from “Recent post”. Contemplating this message look up at the one above, the time, above the name, is a link to that post including the |
Rick Murray (539) 13440 posts |
Mmm, just read bits of it on my phone. It looks like the sort of FP support that was provided with TurboC way back when – use hardware if available, else emulate. It’s a good compromise. Thanks. ;-) |
Rick Murray (539) 13440 posts |
It’s hiding. Don’t use Recent Posts, go into the actual thread. Then look at the posting time above the user’s icon. There’s your link. [alternate: Firefox, install the Display #Anchors add-on] |
Stuart Swales (8827) 1257 posts |
It could be an even better compromise if the fallback emulation wasn’t to FPEmulator here but a generic softfp library! This was just really a proof-of-concept a few days ago that went somewhat better than I expected. Anyone trying to use apcs_softpcs: note Thanks for the hints about where that specific link is hiding! |
Colin Ferris (399) 1755 posts |
What about having a rmodule – a bit like SharedClib – ie SharedFpe. Then different versions for various hardware options. |
Stuart Swales (8827) 1257 posts |
The best module to implement this would of course be the SharedCLibrary ;-) |
Steve Drain (222) 1620 posts |
A bit like my Float module of 2016? This was never properly finished, but has a registered release name of SmartFP. You could base something on it or I could pick it up again. If VFP context switching is possible it uses VFP but otherwise FPE, allowing for the float word order. It also provides VFP based transcendental operations such as SIN and EXP, so no falling back to FPE for these. |
Stuart Swales (8827) 1257 posts |
What I’ve tried to provide in the apcs_softpcs ‘package’ is something that can help boost C application performance for really very minimal changes: authors just have to add one #include, one initialisation call, compile with -apcs /softfp and link with the helper library to provide functionality expected by the compiler in that mode. Then you get an application that performs much better on hardware with VFP without much degradation with FPA (given the cost of emulation). It’s not as fast as it would be compiled purely for VFP, but I don’t think that’s the right way to go just yet except for specialist applications (use gcc). It can be made a wee bit faster by removing the run-time selection of VFP/FPA use; possibly the best way to do that would be to implement the required compiler support functions currently provided by apcs_softpcs in the SharedCLibrary as Ben thought way back (https://www.riscosopen.org/forum/forums/2/topics/3457#posts-45080) so that the SCL provided by your system would provide best performance on that system.
Some fool had to do it :-) It helped re-energise my remaining grey cell. I’d forgotten that you’d implemented transcendentals in the Float module otherwise I might have gone looking there; the first RISC OS source I looked at for the state of VFPSupport didn’t have an implementation of SWI VFPSupport_ElementaryFunctions so thought that was a future feature! Then I downloaded the current source :-) |
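The "one #include, one initialisation call" recipe above might look roughly like this in an application source file. It is a hedged sketch only: the header name apcs_softpcs.h and the APCS_SOFTPCS define come from this thread, but the name of the initialisation routine is not given here, so it is left as a comment rather than guessed. fp_startup and mean are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Step 1: the one extra #include, only in soft-float builds. */
#if defined(APCS_SOFTPCS)
#include "apcs_softpcs.h"
#endif

/* Step 2: call this once at start-up, before any floating-point work. */
static void fp_startup(void)
{
#if defined(APCS_SOFTPCS)
    /* The package's single initialisation call goes here. Its name is
       not stated in this thread, so it is deliberately not invented. */
#endif
}

/* Ordinary floating-point code needs no source changes at all. */
static double mean(const double *values, size_t count)
{
    double sum = 0.0;
    size_t i;

    assert(values != NULL && count > 0);
    for (i = 0; i < count; i++)
        sum += values[i];
    return sum / (double)count;
}
```

Steps 3 and 4 happen outside the source: compile with -apcs /softfp and -DAPCS_SOFTPCS, then link with the helper library. Built without the define, the same file compiles as a normal FPA application.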
GavinWraith (26) 1540 posts |
Support libraries for numerical functions should contain an entry for Horner’s method – inputs: a pointer to an array of coefficients |
Stuart Swales (8827) 1257 posts |
apcs_softpcs is just designed to provide the minimum required to adapt an existing C application to using VFP if possible. Currently basic f.p. operators (+,-,*,/,==,!=,<,<=,>,>=) are provided in VFP, along with select C library functions (e.g. isgreater()). It does also use VFPSupport for implementing the standard transcendental functions that would normally be provided by the C library such as cos(). Hopefully, if this idea catches on, I will bother to provide VFP functions for more library functions. If there’s a call for a numerical function library, then I’d be happy to contribute to that as a separate project. It could usefully use the same run-time switching to adapt to being executed on VFP and FPA systems. Fireworkz and PipeDream do have the SERIESSUM spreadsheet function, and use Horner’s method to evaluate various spreadsheet functions, so I know what you are on about. [Edit: I thought that SERIESSUM evaluation used Horner’s method too, but I was wrong! Hadn’t done it that way as the coefficient array can be arranged horizontally or vertically, so had nested loops. That was a pretty easy fix – which uncovered a compiler bug (nothing to do with /softfp, I hasten to add).] |
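For readers unfamiliar with Horner’s method, here is a minimal C sketch. It is not the Fireworkz or PipeDream code. The stride parameter is an assumption made for illustration: it shows one way a single loop could serve coefficients laid out along a row or down a column of a row-major grid, instead of nested loops.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluate c[0] + c[1]*x + ... + c[count-1]*x^(count-1) by Horner's
   method: one multiply and one add per coefficient, highest power first.
   'stride' is the distance between successive coefficients in memory:
   1 for a contiguous run, the row width for a column of a row-major grid. */
static double horner(const double *coeffs, size_t count, size_t stride,
                     double x)
{
    double result = 0.0;
    size_t i = count;

    assert(coeffs != NULL && stride > 0);
    while (i-- > 0)
        result = result * x + coeffs[i * stride];
    return result;
}
```

For example, with coefficients 1, 2, 3 and x = 2 this computes (3 x 2 + 2) x 2 + 1 = 17, whether the three values are adjacent (stride 1) or interleaved with other data (stride 2).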
Steve Drain (222) 1620 posts |
VFPSupport_ElementaryFunctions has appeared in the six years since Float. I seem to remember Jeffrey saying that they were going to be better there than in a separate module and it looks as though that is just what he has implemented. I think that is the right place, too, so nothing more for Float. Now I will have to look whether BASICVFP uses this. It did not when I last looked a good while ago, but I bet it does now. ;-) |
Stuart Swales (8827) 1257 posts |
You may be pleasantly surprised – I was! No great fanfare as I recall. |
Martin Avison (27) 1449 posts |
I suspect it was part of this |
Stuart Swales (8827) 1257 posts |
Thanks Martin. My memory is awful. I did say that I was down to one brain cell… Anyhow, here’s an update. Some additional notes, source file names contracted so you can see better what they are in a standard Filer view, and a pre-built library for those who just want to use it: http://croftnuisk.co.uk/coltsoft-downloads/other/apcs_softpcs_20210926.zip |
Matthew Phillips (473) 690 posts |
Thanks Stuart, I’ll download the latest and take a look. I spent a bit of time hacking some code about this morning to try to get one of our applications to compile. I found I had to comment out the bits of your header file to do with time.h as the compiler was complaining about a duplicate definition. Not sure whether this was a fault in my code, but as I’m not using time.h the quickest fix was to remove it. I’ll have another look at doing it properly and report the exact error if I’m still having problems. My main aim was to compile the application and see what the speed improvement might be. Turned out to be useful, but not so dramatic that I could tell without timing it! A task that took 51 seconds on a Pi3 sped up to 44 seconds, using a recent ROM image. I think that makes a 16% speed improvement, though I may be getting my percentages in a mess. |
Matthew Phillips (473) 690 posts |
By the way, the linker threw up so many warnings about “code/data or FP calling standard conflict” that I did not notice it had actually produced a binary for a good few minutes! |
Stuart Swales (8827) 1257 posts |
Thanks for feedback, Matthew. It’d possibly be useful for the apcs_softpcs.h not to #define things relating to various headers like time.h if that header hadn’t been included! It’s only difftime() there that needs redirecting to a function that returns the double in ARM registers rather than F0.
Is that more recent than June? That’s when VFPSupport started to support the elementary functions. |
Matthew Phillips (473) 690 posts |
No discernable difference in speed between VFPSupport 0.13 and VFPSupport 0.16. I’m not sure whether there should be, or whether that only affects BASIC. |
Matthew Phillips (473) 690 posts |
By “recent” I meant 5.29 downloaded today. Should have been more specific. My timings (all Raspberry Pi 3) were: RISC OS 5.25 (11-May-18), VFPSupport 0.13: 43 seconds Original version of application: 51 seconds. I had no intuition as to what the speed improvement might be: it’s a complex application relying on lots of other things, including DrawFile_Render. I’m not even sure how much it relies on basic floating point operations and how much is trigonometry. |
Stuart Swales (8827) 1257 posts |
apcs_softpcs uses VFPSupport to create/destroy VFP contexts, and with the advent of 0.16, sin, cos, exp and friends. All the other VFP arithmetic is done by apcs_softpcs, so VFPSupport version should not matter for that. |
Stuart Swales (8827) 1257 posts |
Sadly unavoidable (unless some guru is kind enough to point out the magic settings) as areas need to have the VFP attribute to assemble VFP instructions! I suppose I could polish it by hand-mangling out the VFP attribute from each area. :-) |
Stuart Swales (8827) 1257 posts |
An interesting observation on applications like Matthew’s: if your application isn’t that f.p. intensive, you won’t get much gain by using hardware f.p. If your workload is:
total(normal) = overhead + fp(normal) = 100s
and fp(normal) is, say, just 20% of that total, that leaves the unavoidable overhead being 80s, whatever speed the f.p. part can be run at. Moving to apcs_softpcs on a system with hardware f.p. could reduce the time consumed by the f.p. part by a factor of five to ten: fp(sfp) = fp(normal) x 0.1. So (sfp == run-time switchable f.p. between FPA and VFP):
total(sfp) = overhead + fp(sfp) = 80s + 0.1 x 20s = 82s
So why use apcs_softpcs? You could just recompile using gcc to VFP, but then would have to ship two binaries if you have customers that can’t move over to new ARM hardware with VFP, or who use emulation. If you get yet another factor of ten improvement over apcs_softpcs (which can only execute one f.p. instruction at a time) by using fully optimised gcc/VFP (which can pipeline them) you have fp(vfp) = fp(sfp) x 0.1. So:
total(vfp) = overhead + fp(vfp) = 80s + 0.1 x (0.1 x 20s) = 80.2s
a far less significant hike in overall performance than the previous step. Which doesn’t look to me like it’s worth maintaining two builds for, for these types of application, says the Devil’s Advocate ;-) |
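The arithmetic above is Amdahl’s law in miniature, and can be put into a small C helper for trying other fractions and speed-ups. A minimal sketch; total_time and its parameter names are made up for illustration.

```c
#include <assert.h>

/* total = overhead + fp_time / fp_speedup, where fp_time is the part of
   the original run spent in floating point and overhead is the rest. */
static double total_time(double total_normal, double fp_fraction,
                         double fp_speedup)
{
    double fp_time, overhead;

    assert(fp_fraction >= 0.0 && fp_fraction <= 1.0 && fp_speedup > 0.0);
    fp_time = total_normal * fp_fraction;
    overhead = total_normal - fp_time;
    return overhead + fp_time / fp_speedup;
}
```

With the figures from the post, total_time(100.0, 0.2, 10.0) gives the 82s case and total_time(100.0, 0.2, 100.0) gives the 80.2s case. Even an infinitely fast f.p. unit could not get this workload below 80s.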
Matthew Phillips (473) 690 posts |
Yes, interesting calculations. When the application is RiscOSM which is very processor-intensive, it is tempting to build an apcs_softpcs version to give users on modern machines a speed boost. Unfortunately it also gives users of machines without VFP a speed disadvantage. Is it fair to speed up the experience for those who are already on faster machines at the expense of those who are on slower ones? So we may need to issue a “normal” APCS build for older machines anyway. But the ability to do this without having to rebuild the entire application and libraries using gcc instead of Norcroft is very welcome! |
Stuart Swales (8827) 1257 posts |
Ah, that’s interesting, Matthew – it is proving to be that much slower for the older systems? I hadn’t noticed myself, and had found that some code paths were being made faster by the compiler avoiding issuing SFM/LFM at procedure entry/exit when the code path being used through the procedure didn’t always use f.p.. But I think the point about keeping using a toolchain that we ‘know and love’ is very valid and lowers the barrier to adopting measures like these. |
Matthew Phillips (473) 690 posts |
A 1:40,000 map of London took 2 minutes instead of 1:50 on our Iyonix using the VFP version. So a 9% increase in the time taken. On the Pi3 we got a 13% improvement. |
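The percentages quoted in this thread depend on whether you measure the change in elapsed time or the change in speed, which is easy to mix up. A small C sketch of both, using the timings given in the posts; the function names are made up for illustration.

```c
#include <assert.h>

/* Percentage change in elapsed time; negative means the run got shorter. */
static double time_change_percent(double before_s, double after_s)
{
    assert(before_s > 0.0);
    return (after_s - before_s) / before_s * 100.0;
}

/* Percentage change in speed (work done per second); positive is faster. */
static double speed_change_percent(double before_s, double after_s)
{
    assert(before_s > 0.0 && after_s > 0.0);
    return (before_s / after_s - 1.0) * 100.0;
}
```

For the Pi3 (51s down to 44s) the elapsed time falls by roughly 13.7% while the speed rises by roughly 15.9%, so the "16%" and "13%" figures appear to describe the same measurement from two directions. For the Iyonix (110s up to 120s) the elapsed time rises by about 9.1%.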
Rick Murray (539) 13440 posts |
Is it fair to hold back and restrict the experience of users of faster machines because of those who wish to remain using ancient slower hardware? The question is valid both ways around. The answer? That’s harder. ;-) However, I would suggest that the ones you offer the best experience to (old or new) should be your majority. If most of your users have modern fast machines, then implement this as the speed benefit is obvious and worthwhile. Aim to please the majority. |
Stuart Swales (8827) 1257 posts |
Indeed! |
Matthew Phillips (473) 690 posts |
Stuart, I was puzzled in the apcs_softpcs header file that you have #if defined(APCS_SOFTPCS) near the top. How is this different from #ifdef APCS_SOFTPCS I was wanting to put similar conditions in my application so as to be able to compile soft float and FPE versions, and I was wondering whether I am missing something. |
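As far as the C preprocessor is concerned, the two forms in the question test exactly the same thing for a single macro. The practical difference is that #if defined(...) can be combined with other conditions, which #ifdef cannot. A minimal sketch; EXAMPLE_FORCE_FPA, the USING_SOFTPCS_* macros, FP_BUILD_NAME and fp_build_name are hypothetical names for illustration only.

```c
#include <assert.h>

/* Form 1: #ifdef tests a single macro name. */
#ifdef APCS_SOFTPCS
#define USING_SOFTPCS_A 1
#else
#define USING_SOFTPCS_A 0
#endif

/* Form 2: #if defined(...) gives the same answer for a single name. */
#if defined(APCS_SOFTPCS)
#define USING_SOFTPCS_B 1
#else
#define USING_SOFTPCS_B 0
#endif

/* Only the #if form can combine tests with &&, || and !. */
#if defined(APCS_SOFTPCS) && !defined(EXAMPLE_FORCE_FPA)
#define FP_BUILD_NAME "softfp"
#else
#define FP_BUILD_NAME "FPA"
#endif

static const char *fp_build_name(void)
{
    /* The two single-macro forms always agree. */
    assert(USING_SOFTPCS_A == USING_SOFTPCS_B);
    return FP_BUILD_NAME;
}
```

So for switching between a soft-float build and an FPE build on one define, either spelling should do; the #if defined form just leaves room to add further conditions later.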