FP support
Pages: 1 2 3 4 5 6 7 8 9 10 11 12
Steve Pampling (1551) 8163 posts |
I would suggest that, rather than being stuck with having to keep a link to the past, we currently choose to stick with that link. |
Theo Markettos (89) 919 posts |
The mailing list is the canonical place, but I’ve asked for you :)
To build things with GCCSDK today you basically just do:
The autobuilder will do all that, with some heuristics (eg it’ll try to figure out the name and version of your package from the source code path). You can start by making an empty
The point in essence is that it’s easy to solve a problem once (fix up GCCSDK to work on Windows, port XYZ program to RISC OS). But next week the upstream folks (GCC, cygwin, Ubuntu, XYZ, dependent libraries) are likely to change something. We want to make it as easy as possible not only to port something once, but to port the new something for every single upstream change, and to do so with zero manual work. We’re almost there. We build essentially vanilla sources with no RISC OS changes – everything is in the compatibility libraries (Unixlib, SDL, DRenderer, ChoX11). This is the only way we can achieve sufficient scale. We can handle more customised RISC OS ports, but each one requires someone to make the changes and keep them up to date – and for the most part those people don’t exist. Then the problem becomes that someone needs to fix the breakages in our builds as upstream evolves. That’s why building GCCSDK for Windows is one thing, but supporting those builds for 15 years and catching all the breakages, including breakages in building all the packages GCCSDK/Win32 might be asked to build, with minimal manpower, is quite another. |
Steve Drain (222) 1620 posts |
Doesn’t GCC use its own float code?
Hence the understandable desire to get the FPE to use VFP when it can. I am sure it is possible, but having looked at the source I would not know where to start. ;-(
Substitute calls to Float, or something similar, and the decisions are made. It does not deal with the context switching, although I have some thoughts about that.
Please use UAL. You could use my VFP StrongHelp manual instead of an old ARMARM.
If that is feasible then it must be the way to go. I outlined that BASIC would have to do something similar. What does Charm do?
Very true, but that number is only for the arithmetic operations, and aiming for compatibility will reduce the advantage somewhat. |
Rick Murray (539) 13815 posts |
We have a choice? I liken “disregarding older1 software” to “foot, meet heavy artillery”. 1 Interesting definition of “older” being “old, current, and some future”. ;-) |
Rick Murray (539) 13815 posts |
No idea. I don’t use GCC. Perhaps somebody could compile some code using
It looks like an awful mess designed to catch, interpret, and dispatch FP instructions to some code that pretends to do the same thing in long-winded fashion. Simply (!) rip out all the ARM code and drop in calls to the VFP instead. It won’t be as fast as native FP due to the exception handling, but it ought to be faster than an emulation. Unfortunately I’m inclined to think that a transcendental number is something to do with Buddhism. I looked at the FPEm code and was lost at the comments, never mind the code, so I’m afraid I’m probably the LEAST qualified person here. However, take a look at MultFPE in mixed.RiscOS.Sources.HWSupport.FPASC.coresrc.s.arith. I do not know how many similarities there are between FPA and VFP. Specifically, I do not know if it is necessary to do the normalisation and such, or if this is handled by the hardware. It might be possible to rip out all of that code and replace it with Assuming we want to modify FPEmulator, that is… I’m inclined to want to leave it because, while it is not terribly quick, it is proven.
At the moment it isn’t that much faster than using FPEmulator; though I would be interested to plug in some code to jump directly to your SWI handler to see how much of a saving can be made by bypassing the SWI call mechanism.
;-) I guess I’ll need to pencil in the UAL versions of the opcodes into the ARM ARM.
Certainly, it is a highly artificial example. That’s why I said that if the compiler’s default is to use FPA instructions (as it does now) then nothing will change for code compiled in the future. Those who use a handful of FP instructions for dealing with some small float use will hardly notice the impact of using FPEm. Thanks to you, I now know that I can handle a double load using two single loads backwards; so no pratting around swapping the word order in ARM code. This means the entire conversion process can be used inline (replace FPE’s LDFD with two VLDR calls; ditto for saving), so aside from inserting calls to the stub routines to sort out contexts, it looks like we might get away with the primary overhead being no more than an extra instruction for loading/saving. |
David Feugey (2125) 2709 posts |
Thanks Theo
So basically ro-config replaces configure, and ro-make replaces make. There is still the problem of configure commands that will not work because of bad assumptions about the system. Can I use traditional configure to generate the makefile, then ro-make?
Yes, thanks. But it’s the first time I’ve got a clear explanation of how it works. The only documentation I could find was talking about making (quite complex) scripts to make clean ports for RISC OS. My needs are much simpler: configure (x86-GCC or ARM-GCC, depending on which will work), then make. |
jim lesurf (2082) 1438 posts |
Yes. That’s why it seems feasible/logical to me for the FPE on a given machine to simply act like a ‘HAL’ and trap FP instructions to generate a sequence that works optimally on that specific hardware. You make the FPE module’s behaviour hardware-targeted instead of having to generate multiple versions of the code compiled by the compiler. Write once. Run everywhere. The word order is a complication because it means a decision between:
A) Stick with the old ordering and have any calculations on newer hardware shuffled.
B) Have a ‘new order’ flag in the compiled code to show it has word ordering in the new arrangement. (Thus older code will be found to lack this flag, and must have its ordering re-arranged on the fly – until the programmer re-compiles.)
Whether the FPE shoves in VFP or something else should depend on the machine. None of that need block having a new compiler that – when told by the programmer to do so – writes targeted, optimised code which then doesn’t get trapped or changed on the targeted hardware. But it means that those who prefer can write and compile without finding they need to make multiple versions for distribution. FWIW, personally I have the habit of providing source code so people could modify or recompile anyway. But I’m trying to avoid a situation where I and others have to keep generating multiple versions of things. The computer is meant to be working for us, not the other way around. 8-] Jim |
Dave Higton (1515) 3502 posts |
I’m sure I read a statement somewhere that there are multiple variations in modern FP systems – VFP/NEON. But are there different VFP instructions – as in, one VFP instruction doing different things on different platforms, not just a superset/subset of instructions? I’m wondering if we could (in principle!) do things like this: Old software written for the FPA/FPE continues to work because either the FPE is there (usually; slow of course) or the FPA is there. New software should at some point start to use VFP instructions, and has to create/manipulate/destroy FP contexts appropriately. There is the possibility to write an updated FPE that makes use of VFP instructions – but could that new FPE do what is necessary in terms of FP contexts? Naive questions as usual – this lot is at least partly beyond my understanding. |
Jeffrey Lee (213) 6048 posts |
In principle it’s a pure superset/subset situation – no instructions have been changed to do wildly different things. However some features do get dropped (e.g. short vector mode), which will cause the undefined instruction vector to be taken and will require support code to emulate the behaviour (VFPSupport will already do this for short vector mode). Plus there’s one fairly major compatibility issue, which is that for the systems where floating point exceptions can’t be trapped in hardware (which is most of them – the Pi 1 is the only current system which can trap them in hardware) there’s no way to provide full compatibility for the old behaviour (unless you disable the coprocessor and fall back to a full software emulation of VFP). So software written specifically for the Pi 1 which relies on the hardware to trap FP exceptions may fail in unpredictable ways on other systems.
Yes, that’s basically what’s been suggested at least a couple of times already (I think this thread is going round in circles a bit!) |
Steve Drain (222) 1620 posts |
A new version of the StrongHelp VFP manual is now available. This has a number of corrections and additions, but it now documents the VFPSupport module as well. I hope this is OK with ROOL. |
Steve Drain (222) 1620 posts |
Going round, but I like to think of it as an inward spiral. ;-) |
Rick Murray (539) 13815 posts |
Like the drain of a bath…eventually we’ll all disappear down the plughole. Um… I was reading the help file (looks good so far) and I wondered – what happens in the case of a single tasking VFP program in a taskwindow? Will the Wimp deal with the contexts? I just tried my test program three times in three task windows, but it only crashed ’cos I forgot to load the Float module. ;-) Edit: Wait – hang on… VADD, VSUB, VMUL, VDIV (etc) are described in the VFP page as being VFP2 instructions. Are you saying VFP couldn’t do anything much more than load and store registers? Would it be possible to include a page listing pre-UAL mnemonics and what they are called now? There’s resource material out there with the old naming convention (hint-nudge-hint). |
Jeffrey Lee (213) 6048 posts |
It’s fine by me.
Yes.
Looks like that’s my mistake – VFPv1 and VFPv2 have identical instruction sets. From an application programmers’ perspective there’s no difference between the two architectures – I think the only real difference between them is in the support code requirements.
I think I’ll leave that exercise to Steve ;-) |
Rick Murray (539) 13815 posts |
Houston – we have a problem. I’m not getting any result when trying to multiply using Float (via SWI call). If I load module v0.60 it works, if I load v0.65 it doesn’t. I noticed the instruction has a different register ordering to it (R1 is result, not R3), so I did a quick binary hack to the v0.60 module to make it read/write the same registers for MUL handling. With module version 0.60 I see the result 80779.85blahblah. If I load module version 0.65, I get the result 0.000000. Ditto if I force FPA mode by passing a null FP context. Simplified test code is:
The problem – a hacked version of v0.60 works, v0.65 doesn’t. It looks as if the entry check is completely different in v0.65. That just checks whether or not R0 is 0; the earlier module did a lot of things. Perhaps something in there explains the difference? [machine: basic Pi B rev 2; 256MiB] |
Steve Drain (222) 1620 posts |
I think there are the VFP common instructions, which will also be used for NEON, and then the VFP2 instructions, which are the ones that do something useful. At least, that is how I understood it from Jeffrey’s list. Is there any processor that does not have VFP2?
Pre-UAL has been deprecated for some while now and it is unlikely that I will do that. If you provide the list that you want I will attempt to incorporate it, though. ;-) |
Steve Drain (222) 1620 posts |
Good. Mistakes are always helpful. I am re-writing Float to try to make it more flexible, so I will have a careful look at what you have found. It is likely to be finger trouble and has slipped through the Test programs. ;-( |
Rick Murray (539) 13815 posts |
A quick (!) note about NEON. It is perhaps best to think of NEON not as a floating point unit, but as an application accelerator which can deal with floating point. I’ll give a short example (some information from the book Professional Embedded ARM Development).
Imagine you have a picture. Straight RGB data, one byte per colour element, three bytes per pixel. In memory, it would look like this:
Word 1 : R B G R
Word 2 : G R B G
Word 3 : B G R B
Now let’s assume that one of the functions you want to perform is to convert the image to greyscale. Simple, right? Just read the three values and average them. …. [fx: sound of record scratch] …. NEON can do it.
VLD3.8 {D0, D1, D2}, [R0]
That will load from R0 (pointer) the eight bit pixel data into the three registers given with a step value of three – meaning it can take RGBRGBRGBRGB style data and load it into D0 as RRRRRRRR, D1 as GGGGGGGG, and D2 as BBBBBBBB. You can then use
With NEON, we loaded, converted, and saved eight pixels of a weighted full colour image to greyscale in six instructions, by using some of NEON’s rather amazing capabilities – to load data in a variety of formats, and to hold multiple elements of bytewise data in the registers, promote them to halfword data and back to bytes, performing the calculation individually on multiple elements within the register at the same time.
VFP aims to deal with (traditional) floating point data, while NEON aims more at SIMD (processing multiple data elements in a single instruction). VFP is IEEE754 compliant; NEON is not. However, if you have a device with NEON (and I’m thinking about phones and such) and codecs to match, the use of NEON can greatly enhance the media capabilities of the device, making it possible to do more with less processing, which means longer battery life and a device that is less laggy. More here: http://www.arm.com/products/processors/technologies/neon.php |
Steve Drain (222) 1620 posts |
But not successfully. I hoped that changing context on each call using the FastAPI would be feasible, but it is only a bit faster than using the SWI here, and it also makes the Float direct calls unusable in USER mode. So back to the original requirement for contexts at task level. |
Rick Murray (539) 13815 posts |
What I don’t understand is why this isn’t working. If I set up registers, then branch into your module, how is this different from setting up registers and then placing the FP code inline?
Out of interest – why? I ADR R14 to the return address, then I push the calculated location into PC – effectively a branch. Your code is supposed to exit by picking up my R14 and pushing it into PC. As long as you preserve R14 around SWI calls (and possibly FPA code as it is emulated), why should it make any difference whether the client is in USR or SVC mode? You aren’t expected to preserve state or change mode so…….? |
Steve Drain (222) 1620 posts |
I am talking about speed. The overhead of changing context for each call is too large and makes for little advantage over using FPA. That I knew already. Using VFPSupport’s FastAPI rather than SWIs offered a speed gain, but it is not sufficient to make it worthwhile.
You can call Float SWI code direct in User mode. VFPSupport FastAPI must be called in SVC mode according to the documentation. Combine these and User mode is out. |
Rick Murray (539) 13815 posts |
Ah, I see. |
Steve Drain (222) 1620 posts |
It’s not checking context, it’s changing it. Here’s the problem as I see it:
When I first looked at this, a couple of years back, I realised that the second choice was a non-runner, because the overheads removed most of the advantage of VFP over FPE. So the first choice was the way to go. This last week I decided to revisit this in the light of the more recent FastAPI, which removes the SWI overhead, and the possibility of a context with much reduced register numbers. These failed to make sufficient gains to be worthwhile. Had it been worthwhile for speed, it would have made calling Float routines in User mode no longer possible: the FastAPI calls that would have been used by Float have to be called in a privileged mode and R14_svc must be corruptible. Does that clarify the situation? ;-) PS I still have to look at your problem code. ;-( 1 VFP is shorthand for VFP/NEON. |
Rick Murray (539) 13815 posts |
Okay. Here is minimal code that works with Float 0.60 (hacked to use new register assignment) and does not work with Float 0.65. Sorry there is no archive to download. My website now uses SFTP with a private crypto key (so even if you knew the insane password, you would be kicked out). Anything on RISC OS that can do that? ;^) Anyway… Here is the C code:
Here is the assembler part:
Here is an annotated disassembly of the interesting parts of the program:
Placing the current context in R5 and then putting it right back into R0 afterwards is an anachronism because this code originally looped four million times…and I forgot to take that bit out. ;-) Finally, what happens:
Hope this helps. |
Steve Drain (222) 1620 posts |
OK. I do not know whether I can solve it, but your code is odd. Here are some thoughts, although I am out of my depth with C.
Here is a shorter routine that looks right to me, but I have not checked it. It is for 0.65 only. Of course, this is quite the wrong way to employ Float and it will be rather slow.
|
Rick Murray (539) 13815 posts |
How so? Apart from the quirk I mentioned, it is pretty much the same as the code generated by C only with the FP stuff moved into an assembler routine (so it can be changed – the compiler can’t do that). Remember, the function is nominally laid out in three parts:
Then ignore the C and look at the assembler. ;-) The line The line Everything in between is fairly plain code, but you’ll notice every so often calls to a large pile of lines that read Why this nonsense? Because you can read the current stack frame (at the point of failure) by reading FP, SP, and SL and passing that to _kernel_unwind() to ‘walk’ the list of functions. This is the (in)famous “backtrace” and if the program is compiled with function names then you can work backwards from the point of failure so you can get an idea of where the program was and how it got there. It isn’t as good as a debugger, but it is better than “I did stuff and it crashed”. To give an example from one of my programs. I handle my own backtracing because CLib’s default spits it out as VDU text which is useful to nobody. My version logs it all in the error log, like this (the first two lines were output by the Menu_Load function to indicate that something was going wrong):
That dump was my program failing with Pain. This is the backtrace of the program at the point of the crash. SIGSEGV (type 5) means an invalid memory access. ZeroPain permitted my program to run, but killing ZeroPain caused it to crash – so I could look at the backtrace and see where it was crashing.
That makes sense – but the code is setting up a context, doing some FP, then removing the context – all in a handful of lines of code. There is nothing complex like multiple contexts going on here.
On entry to the assembler code, R0 is a pointer to where the output result is to be placed. If you look at the dump, you will see that the compiler does
Anachronism. I explained earlier that it was a leftover of the test code that loops doing the operation four million times. Putting R0 into R5 and then R5 back into R0 is silly and makes no sense in this context, just ignore it. ;-)
ZEROPAIN!!!!!! :-D If you look, you are placing the pointer to the result into R1. Not bad, but unfortunately Float_Start returns values in R0-R2, so it will corrupt R1. Worse, if there is no previous context, R1 will be zero, so Float_MUL will be asked to write the result to &0. ;-)
And, here you go… There’s MY facepalm moment. Duh. |