FP support
Pages: 1 2 3 4 5 6 7 8 9 10 11 12
Steve Pampling (1551) 8163 posts |
I would suggest that, rather than being stuck with having to keep a link to the past, we currently choose to stick with that link. |
Theo Markettos (89) 919 posts |
The mailing list is the canonical place, but I’ve asked for you :)
To build things with GCCSDK today you basically just do:
The autobuilder will do all that, with some heuristics (eg it’ll try to figure out the name and version of your package from the source code path). You can start by making an empty
The point in essence is that it’s easy to solve a problem once (fix up GCCSDK to work on Windows, port XYZ program to RISC OS). But next week the upstream folks (GCC, cygwin, Ubuntu, XYZ, dependent libraries) are likely to change something. We want to make it as easy as possible not only to port something once, but to port the new something for every single upstream change, and to do so with zero manual work. We’re almost there. We build essentially vanilla sources with no RISC OS changes – everything is in the compatibility libraries (Unixlib, SDL, DRenderer, ChoX11). This is the only way we can achieve sufficient scale. We can handle more customised RISC OS ports, but each one requires someone to make the changes and keep them up to date – and for the most part those people don’t exist. Then the problem becomes that someone needs to fix the breakages in our builds as upstream evolves. That’s why building GCCSDK for Windows is one thing, but supporting those builds for 15 years and catching all the breakages, including breakages in building all the packages GCCSDK/Win32 might be asked to build, with minimal manpower, is quite another. |
Steve Drain (222) 1620 posts |
Doesn’t GCC use its own float code?
Hence the understandable desire to get the FPE to use VFP when it can. I am sure it is possible, but having looked at the source I would not know where to start. ;-(
Substitute calls to Float, or something similar, and the decisions are made. It does not deal with the context switching, although I have some thoughts about that.
Please use UAL. You could use my VFP StrongHelp manual instead of an old ARMARM.
If that is feasible then it must be the way to go. I outlined that BASIC would have to do something similar. What does Charm do?
Very true, but that number is only for the arithmetic operations, and aiming for compatibility will reduce the advantage somewhat. |
Rick Murray (539) 13815 posts |
We have a choice? I liken “disregarding older1 software” to “foot, meet heavy artillery”. 1 Interesting definition of “older” being “old, current, and some future”. ;-) |
Rick Murray (539) 13815 posts |
No idea. I don’t use GCC. Perhaps somebody could compile some code using
It looks like an awful mess designed to catch, interpret, and dispatch FP instructions to some code that pretends to do the same thing in long-winded fashion. Simply (!) rip out all the ARM code and drop in calls to the VFP instead. It won’t be as fast as native FP due to the exception handling, but it ought to be faster than an emulation. Unfortunately I’m inclined to think that a transcendental number is something to do with Buddhism. I looked at the FPEm code and was lost at the comments, never mind the code, so I’m afraid I’m probably the LEAST qualified person here. However, take a look at MultFPE in mixed.RiscOS.Sources.HWSupport.FPASC.coresrc.s.arith. I do not know how many similarities there are between FPA and VFP. Specifically, I do not know if it is necessary to do the normalisation and such, or if this is handled by the hardware. It might be possible to rip out all of that code and replace it with Assuming we want to modify FPEmulator, that is… I’m inclined to want to leave it because, while it is not terribly quick, it is proven.
At the moment it isn’t that much faster than using FPEmulator; though I would be interested to plug in some code to jump directly to your SWI handler to see how much of a saving can be made by bypassing the SWI call mechanism.
;-) I guess I’ll need to pencil in the UAL versions of the opcodes into the ARM ARM.
Certainly, it is a highly artificial example. That’s why I said that if the compiler’s default is to use FPA instructions (as it does now) then nothing will change for code compiled in the future. Those who use a handful of FP instructions for dealing with some small float use will hardly notice the impact of using FPEm. Thanks to you, I now know that I can handle a double load using two single loads backwards; so no pratting around swapping the word order in ARM code. This means the entire conversion process can be used inline (replace FPE’s LDFD with two VLDR calls; ditto for saving), so aside from inserting calls to the stub routines to sort out contexts, it looks like we might get away with the primary overhead being no more than an extra instruction for loading/saving. |
David Feugey (2125) 2709 posts |
Thanks Theo
So basically ro-config replaces configure, and ro-make replaces make. There is still the problem of configure commands that will not work because of bad assumptions about the system. Can I use traditional configure to generate the makefile, then ro-make?
Yes, thanks. But it’s the first time I’ve got a clear explanation of how it works. The only documentation I could find was talking about making (quite complex) scripts to make clean ports for RISC OS. My needs are much simpler: configure (x86-GCC or ARM-GCC, depending on which will work), then make. |
jim lesurf (2082) 1438 posts |
Yes. That’s why it seems feasible/logical to me for the FPE on a given machine to simply act like a ‘HAL’ and trap FP instructions to generate a sequence that works optimally on that specific hardware. You make the FPE module’s behaviour hardware-targeted instead of having to generate multiple versions of the code compiled by the compiler. Write once. Run everywhere. The word order is a complication because it means a decision between:
A) Stick with the old ordering and have any calculations on newer hardware shuffled.
B) Have a ‘new order’ flag in the compiled code to show it has word ordering in the new arrangement. (Thus older code will be found to lack this flag, and must have its ordering re-arranged on the fly – until the programmer re-compiles.)
Whether the FPE shoves in VFP or something else should depend on the machine. None of that need block having a new compiler that – when told by the programmer to do so – writes targeted, optimised code which then doesn’t get trapped or changed on the targeted hardware. But it means that those who prefer can write and compile without finding they need to make multiple versions for distribution. FWIW, personally I have the habit of providing source code so people could modify or recompile anyway. But I’m trying to avoid a situation where I and others have to keep generating multiple versions of things. The computer is meant to be working for us, not the other way around. 8-] Jim |
Dave Higton (1515) 3502 posts |
I’m sure I read a statement somewhere that there are multiple variations in modern FP systems – VFP/NEON. But are there different VFP instructions – as in, one VFP instruction doing different things on different platforms, not just a superset/subset of instructions? I’m wondering if we could (in principle!) do things like this: Old software written for the FPA/FPE continues to work because either the FPE is there (usually; slow of course) or the FPA is there. New software should at some point start to use VFP instructions, and has to create/manipulate/destroy FP contexts appropriately. There is the possibility to write an updated FPE that makes use of VFP instructions – but could that new FPE do what is necessary in terms of FP contexts? Naive questions as usual – this lot is at least partly beyond my understanding. |
Jeffrey Lee (213) 6048 posts |
In principle it’s a pure superset/subset situation – no instructions have been changed to do wildly different things. However some features do get dropped (e.g. short vector mode), which will cause the undefined instruction vector to be taken and will require support code to emulate the behaviour (VFPSupport will already do this for short vector mode). Plus there’s one fairly major compatibility issue, which is that for the systems where floating point exceptions can’t be trapped in hardware (which is most of them – the Pi 1 is the only current system which can trap them in hardware) there’s no way to provide full compatibility for the old behaviour (unless you disable the coprocessor and fall back to a full software emulation of VFP). So software written specifically for the Pi 1 which relies on the hardware to trap FP exceptions may fail in unpredictable ways on other systems.
Yes, that’s basically what’s been suggested at least a couple of times already (I think this thread is going round in circles a bit!) |
Steve Drain (222) 1620 posts |
A new version of the StrongHelp VFP manual is now available. This has a number of corrections and additions, but it now documents the VFPSupport module as well. I hope this is OK with ROOL. |
Steve Drain (222) 1620 posts |
Going round, but I like to think of it as an inward spiral. ;-) |
Rick Murray (539) 13815 posts |
Like the drain of a bath…eventually we’ll all disappear down the plughole. Um… I was reading the help file (looks good so far) and I wondered – what happens in the case of a single tasking VFP program in a taskwindow? Will the Wimp deal with the contexts? I just tried my test program three times in three task windows, but it only crashed ’cos I forgot to load the Float module. ;-) Edit: Wait – hang on… VADD, VSUB, VMUL, VDIV (etc) are described in the VFP page as being VFP2 instructions. Are you saying VFP couldn’t do anything much more than load and store registers? Would it be possible to include a page listing pre-UAL mnemonics and what they are called now? There’s resource material out there with the old naming convention (hint-nudge-hint). |
Jeffrey Lee (213) 6048 posts |
It’s fine by me.
Yes.
Looks like that’s my mistake – VFPv1 and VFPv2 have identical instruction sets. From an application programmers’ perspective there’s no difference between the two architectures – I think the only real difference between them is in the support code requirements.
I think I’ll leave that exercise to Steve ;-) |
Rick Murray (539) 13815 posts |
Houston – we have a problem. I’m not getting any result when trying to multiply using Float (via SWI call). If I load module v0.60 it works, if I load v0.65 it doesn’t. I noticed the instruction has a different register ordering to it (R1 is result, not R3), so I did a quick binary hack to the v0.60 module to make it read/write the same registers for MUL handling. With module version 0.60 I see the result 80779.85blahblah. If I load module version 0.65, I get the result 0.000000. Ditto if I force FPA mode by passing a null FP context. Simplified test code is:
The problem – a hacked version of v0.60 works, v0.65 doesn’t. It looks as if the entry check is completely different in v0.65. That just checks whether or not R0 is 0; the earlier module did a lot of things. Perhaps something in there explains the difference? [machine: basic Pi B rev 2; 256MiB] |
Steve Drain (222) 1620 posts |
I think there are the VFP common instructions, which will also be used for NEON, and then the VFP2 instructions, which are the ones that do something useful. At least, that is how I understood it from Jeffrey’s list. Is there any processor that does not have VFP2?
Pre-UAL has been deprecated for some while now and it is unlikely that I will do that. If you provide the list that you want I will attempt to incorporate it, though. ;-) |
Steve Drain (222) 1620 posts |
Good. Mistakes are always helpful. I am re-writing Float to try to make it more flexible, so I will have a careful look at what you have found. It is likely to be finger trouble and has slipped through the Test programs. ;-( |
Rick Murray (539) 13815 posts |
A quick (!) note about NEON. It is perhaps best to think of NEON not as a floating point unit, but as an application accelerator which can deal with floating point. I’ll give a short example (some information from the book Professional Embedded ARM Development).
Imagine you have a picture. Straight RGB data, one byte per colour element, three bytes per pixel. In memory, it would look like this:
Word 1 : R B G R
Word 2 : G R B G
Word 3 : B G R B
Now let’s assume that one of the functions you want to perform is to convert the image to greyscale. Simple, right? Just read the three values and average them. …. [fx: sound of record scratch] …. NEON can do it.
VLD3.8 {D0, D1, D2}, [R0]
That will load from R0 (pointer) the eight bit pixel data into the three registers given with a step value of three – meaning it can take RGBRGBRGBRGB style data and load it into D0 as RRRRRRRR, D1 as GGGGGGGG, and D2 as BBBBBBBB. You can then use
With NEON, we loaded, converted, and saved eight pixels of a weighted full colour image to greyscale in six instructions, by using some of NEON’s rather amazing capabilities – to load data in a variety of formats, and to hold multiple elements of bytewise data in the registers, promote them to halfword data and back to bytes, performing the calculation individually on multiple elements within the register at the same time.
VFP aims to deal with (traditional) floating point data, while NEON aims more at SIMD (processing multiple data elements in a single instruction). VFP is IEEE754 compliant; NEON is not. However, if you have a device with NEON (and I’m thinking about phones and such) and codecs to match, the use of NEON can greatly enhance the media capabilities of the device, making it possible to do more with less processing, which means longer battery life and a device that is less laggy. More here: http://www.arm.com/products/processors/technologies/neon.php |
Steve Drain (222) 1620 posts |
But not successfully. I hoped that changing context on each call using the FastAPI would be feasible, but it is only a bit faster than using the SWI here, and it also makes the Float direct calls unusable in USER mode. So back to the original requirement for contexts at task level. |
Rick Murray (539) 13815 posts |
What I don’t understand is why this isn’t working. If I set up registers, then branch into your module, how is this different from setting up registers and then placing the FP code inline?
Out of interest – why? I ADR R14 to the return address, then I push the calculated location into PC – effectively a branch. Your code is supposed to exit by picking up my R14 and pushing it into PC. As long as you preserve R14 around SWI calls (and possibly FPA code as it is emulated), why should it make any difference whether the client is in USR or SVC mode? You aren’t expected to preserve state or change mode so…….? |
Steve Drain (222) 1620 posts |
I am talking about speed. The overhead of changing context for each call is too large and makes for little advantage over using FPA. That I knew already. Using VFPSupport’s FastAPI rather than SWIs offered a speed gain, but it is not sufficient to make it worthwhile.
You can call Float SWI code direct in User mode. VFPSupport FastAPI must be called in SVC mode according to the documentation. Combine these and User mode is out. |
Rick Murray (539) 13815 posts |
Ah, I see. |
Steve Drain (222) 1620 posts |
It’s not checking context, it’s changing it. Here’s the problem as I see it:
When I first looked at this, a couple of years back, I realised that the second choice was a non-runner, because the overheads removed most of the advantage of VFP over FPE. So the first choice was the way to go. This last week I decided to revisit this in the light of the more recent FastAPI, which removes the SWI overhead, and the possibility of a context with much reduced register numbers. These failed to make sufficient gains to be worthwhile. Had it been worthwhile for speed, it would have made calling Float routines in User mode no longer possible: the FastAPI calls that would have been used by Float have to be called in a privileged mode and R14_svc must be corruptible. Does that clarify the situation? ;-) PS I still have to look at your problem code. ;-( 1 VFP is shorthand for VFP/NEON. |
Rick Murray (539) 13815 posts |
Okay. Here is minimal code that works with Float 0.60 (hacked to use new register assignment) and does not work with Float 0.65. Sorry there is no archive to download. My website now uses SFTP with a private crypto key (so even if you knew the insane password, you would be kicked out). Anything on RISC OS that can do that? ;^) Anyway… Here is the C code:
Here is the assembler part:
Here is an annotated disassembly of the interesting parts of the program:
Placing the current context in R5 and then putting it right back into R0 afterwards is an anachronism because this code originally looped four million times…and I forgot to take that bit out. ;-) Finally, what happens:
Hope this helps. |
Steve Drain (222) 1620 posts |
OK. I do not know whether I can solve it, but your code is odd. Here are some thoughts, although I am out of my depth with C.
Here is a shorter routine that looks right to me, but I have not checked it. It is for 0.65 only. Of course, this is quite the wrong way to employ Float and it will be rather slow.
|
Rick Murray (539) 13815 posts |
How so? Apart from the quirk I mentioned, it is pretty much the same as the code generated by C only with the FP stuff moved into an assembler routine (so it can be changed – the compiler can’t do that). Remember, the function is nominally laid out in three parts:
Then ignore the C and look at the assembler. ;-) The line The line Everything in between is fairly plain code, but you’ll notice every so often calls to a large pile of lines that read Why this nonsense? Because you can read the current stack frame (at the point of failure) by reading FP, SP, and SL and passing that to _kernel_unwind() to ‘walk’ the list of functions. This is the (in)famous “backtrace” and if the program is compiled with function names then you can work backwards from the point of failure so you can get an idea of where the program was and how it got there. It isn’t as good as a debugger, but it is better than “I did stuff and it crashed”. To give an example from one of my programs. I handle my own backtracing because CLib’s default spits it out as VDU text which is useful to nobody. My version logs it all in the error log, like this (the first two lines were output by the Menu_Load function to indicate that something was going wrong):
That dump was my program failing with Pain. This is the backtrace of the program at the point of the crash. SIGSEGV (type 5) means an invalid memory access. ZeroPain permitted my program to run, but killing ZeroPain caused it to crash – so I could look at the backtrace and see where it was crashing.
That makes sense – but the code is setting up a context, doing some FP, then removing the context – all in a handful of lines of code. There is nothing complex like multiple contexts going on here.
On entry to the assembler code, R0 is a pointer to where the output result is to be placed. If you look at the dump, you will see that the compiler does
Anachronism. I explained earlier that it was a leftover of the test code that loops doing the operation four million times. Putting R0 into R5 and then R5 back into R0 is silly and makes no sense in this context, just ignore it. ;-)
ZEROPAIN!!!!!! :-D If you look, you are placing the pointer to the result into R1. Not bad, but unfortunately Float_Start returns values in R0-R2, so it will corrupt R1. Worse, if there is no previous context, R1 will be zero, so Float_MUL will be asked to write the result to &0. ;-)
And, here you go… There’s MY facepalm moment. Duh. |