FP support
David Feugey (2125) 2687 posts |
Quick question about FP support. What’s the state and possibilities for FPU support in… ??? |
jim lesurf (2082) 1330 posts |
Just want to echo the question to flag that I also would like FP hardware support. Jim |
David Feugey (2125) 2687 posts |
Note: there are two subjects here. That also supposes that the FPEmulator will trap FPA instructions for software that uses Hard Float mode on a computer without an FPU. |
Steffen Huber (91) 1945 posts |
Basic for all Basic code: certainly not, because Basic V has its own float implementation and does not use the FPE at all. Maybe Basic VI could be enhanced? |
Jeffrey Lee (213) 6046 posts |
Using VFP for BASIC64 is certainly a possibility. However there are a couple of annoyances which would need resolving:
For the DDE, objasm 4.0 introduced full support for the VFP/NEON instruction set, so assembler code can use it as much as it wants. C support is IIRC held back by lack of development time/money, and lack of a plan on what PCS should be used – i.e. whether to create a new version of the APCS or whether to use an existing version (e.g. whatever GCC currently uses – AAPCS?). The ARMv7 inline assembler bounty will get things part-way there, but it’s just the tip of the iceberg. For the FPEmulator, producing a version which uses VFP is something that I did consider early on in development but have later dismissed, for a few reasons:
|
Ben Avison (25) 445 posts |
Technically, objasm 4.01, but yes. Objasm supports a huge number of PCS variants (but then it’s easy for objasm, it doesn’t need to do anything to marshal the arguments itself, just set flags in the object file).
One of the easier approaches is to base it on the softfp PCS that the compiler already supports – which uses FPA endianness for doubles. The main thing lacking is an implementation in the C library of all the runtime support functions that it would need, though it would provide a simple way to avoid optimising for any one platform over the others, by ensuring that the ROM C library uses the best instruction set for the platform. For example, the double + double function _dadd could be implemented in the IOMD ROM C library as STMDB r13!, {r0-r3}, in the ROM C library for ARMv6+ as VMOV d0, r1, r0 (note, no particular penalty for FPA ordering), and for the ROM C library for the Iyonix, or for softloadable versions, you’d pinch the integer implementation from FPEmulator to do the same thing. Quite a lot of work to do the whole set of functions three times over, but it doesn’t need too much exposure to the compiler (other than breaking its assumption that even in the softfp case you can return floating point types in FPA registers, because they don’t actually exist in new hardware). |
Steve Drain (222) 1620 posts |
Here’s what I think I know, which is not comprehensive: FPE remains largely as it has been. Assembler can be written using FPA co-processor instructions that will use it. BASIC VI uses those instructions for its float type. Assembler can be written using VFP/NEON instructions for those processors that support them. The more recent versions of BASIC will assemble such instructions, but they are not documented under the HELP [ command. The VFP instructions are limited in scope and cannot directly do all that the FPA/FPE offers. There are single and double precision instructions, but not extended. There are no transcendental instructions: SIN, EXP etc. I have written two documentary aids:
A BASICHelp module that attempts to reproduce BASIC’s HELP command as a *command, with more information and the extended VFP instructions. This is not registered and uses a syntax that might not meet approval, so comments are welcome. I have not done anything with it for more than a year.
A VFP/NEON StrongHelp manual encapsulating the extensive information Jeffrey posted here a long while back. This is fairly sound for VFP, but needs more editing of the complex NEON instructions, although I doubt that these are likely to be needed any time soon. ;-)
I have also written a Float module that provides double precision floating point support through SWIs. This uses VFP when available and FPA when not, through a single interface. It does have transcendental functions implemented using VFP instructions. The SWI interface is a large overhead and the speed increase is small, but the code can be called directly for significantly better performance. This is not registered and comments are welcome. I have done some work with Basalt to implement double precision floats in BASIC V using my Float module, but this is not published. In reply to Jeffrey:
Float treats all external double precision floats as big-endian and converts to little-endian for VFP.
Float provides all the transcendental operations of FPA.
Float uses FPA in the absence of VFP.
Float has my best efforts at providing suitable algorithms. As far as I can judge, accuracy is as good as FPA, except for the POW function, which may lose one place in 17. The source is available, so if anyone wants to use those algorithms in a more suitable format, they have my blessing.
Float requires a program to create and release a context with Float_Start and Float_Stop, which do a little more than just VFPSupport.
Calling the code directly is a very significant gain over FPA, despite the overheads.
Float takes the reverse approach, but to the same end. Changing that approach would not be difficult. I did nearly all this about 18 months ago and have hardly visited it since. There are certain to be improvements to be made, so I would welcome some feedback. ;-) |
David Feugey (2125) 2687 posts |
Thanks for this very long (and useful) answer. I agree on the VFP point vs legacy FPA. Just need to get it in C :) For Basic, could I suggest opening a branch for a beta of BBC Basic VI? So if something does not work, it’ll be a compatibility problem we don’t need to solve. The same way some code made for BBC Basic IV does not work on V. Of course, if ABC can support BBC Basic VI, everything will be perfect. If you borrow trig functions from existing ones, it could even be a 5 minute job :) It could be a good idea to borrow some ideas from the PC version. The closer the two products are, the better it’ll be for us. For example, BBC Basic for Windows provides 80-bit floats and 64-bit integers. Unicode is supported too. Could be cool for BBC Basic VI, as there are efforts for Unicode support in RISC OS 5. Another fantastic idea would be the possibility to define and rewrite keywords. It was possible, but never done. Basically it’s a trick to implement in the parser. It could be fantastic for changing some features, or extending Basic (by loading a library of keywords implemented in Basic or assembly). A good way to make the core of BBC Basic smaller and to give ‘non power users’ the possibility to work on the evolution of the project. For more modern use, a parser that can accept keywords in lower case, too. |
Fred Graute (114) 627 posts |
A useful resource Steve, especially as I’m extending StrongED code colouring to cover ARMv7 and VFPv3. Unfortunately the links on the VFP page don’t work due to a rogue #prefix. I also notice that the negative multiply instructions (VNMUL etc) are missing. (They are also missing in BASIC V 1.60 but Debugger 1.90 does know them.)
That will be very handy for testing the new code colouring. Most instructions are already coloured correctly, just a few more to go.
Rick Murray (539) 13385 posts |
Do you have any statistics? Every so often when discussing CLib, the “quirk” of linking directly into the module itself comes up, and the suggestion of calling C functions via SWIs is often raised. Aside from the constant jumping in and out of SVC mode, I can only imagine the sort of speed hit that repeatedly calling the SWI handler would cause, over one load and two branches (which is what the jumptable method requires). Your post implies that you might have tested things and so have figures for direct entry vs SWI’d entry. Do you? If so, please share! |
David Feugey (2125) 2687 posts |
Note: Basic VI could definitely be a bounty. I have about 300 € ready for this one. |
jim lesurf (2082) 1330 posts |
Afraid I don’t understand all the above details. My situation is that I write programs that get compiled/linked using the ‘ROOL’ ‘C’ compiler. For precision, etc, these use double floats a lot of the time. So I simply wish for a situation I had eons ago when I had a machine with real FP hardware, i.e. that all the double floating point instructions got done by the relevant hardware. The result tended to be about a 20x speed hike. Makes a big difference when doing something like going through a 100 MB audio file and FFT’ing all the chunks in it. Maybe I could use some of the above I don’t understand, like ‘NEON’ or ‘Float’. But would that mean total re-writes of the programs? If so, not keen, as it would seem more sensible for the hardware to support the established language and not make added assumptions about what is available. Jim |
Steve Drain (222) 1620 posts |
I can see that #prefix command, but with the old version of StrongHelp that I mainly use it does not cause any problem, ie: the prefix is a null string. I have a much more fully edited version here and I will try to check it and upload it soon. Thanks.
If they are missing it is most likely because Jeffrey did not include them. The basic production of the manual was directly from his file. If you have details of what is missing, I will include it. |
Steve Drain (222) 1620 posts |
I think I have fooled you, and perhaps myself. My attention is almost entirely towards BASIC, so when I refer to calling SWIs I am really thinking of SYS. That adds quite a large additional overhead. At a theoretical level, a Float SWI call is in place of a single machine code instruction. With the FPE this actually hides a considerable amount of integer code, and the overheads may not be very significant. With VFP it could be just that one instruction, and any overhead is likely to be significant. Even for the transcendental operations that need a number of VFP instructions, SYS/SWI overheads seem to be significant. The testing I did is from BASIC and the code is included with the module. It is the usual FOR NEXT timing loops. I did not record timings, but noted the differences and satisfied myself that a direct call of the code was very much faster. It might be worth noting that BASIC VI has to have a fair amount of overhead before it calls just one FPA instruction to implement its floating point operations. ;-) |
Steve Drain (222) 1620 posts |
BASIC VI already exists; I claim my 300 euros. ;-) Seriously, I am puzzled by what you want to see, or imagine can be done. If you could produce an actual specification it could be assessed more accurately.
To some extent I think this reflects the underlying machine, but remember, ARM BASIC dates back more than 25 years and has not had the continual attention that BB4W has enjoyed from Richard Russell. If you want to use 80-bit floats, then you can write assembler for the FPE and hide it away in BASIC routines, but it will be slow and no modern ARM processor supports them. If you want to use 64-bit integers you can do the same, and I am not alone in writing a library to do this. It is also on my list of things to include in Basalt.
How possible? And why? I do extend the use of keywords with Basalt, but that is not integral to BASIC. I have long considered the possibility of modularising, but have yet to see a way. As for BASIC itself, the code is very unfriendly to such a concept, I think.
How is this different from libraries of PROCs and FNs? |
Dave Higton (1515) 3404 posts |
That’s what I thought. I was puzzled by these references. Didn’t BASIC VI exist in the 1980s? What became of it? |
Martin Avison (27) 1417 posts |
Enter *help basic64 at a command line! Basic VI is the FP version of Basic V |
Steve Drain (222) 1620 posts |
BASIC VI is the release of interpreter version 5 [1] that uses 64-bit floats. Otherwise BASIC V and VI are pretty well identical to the programmer. It was a soft-load option but is now included in the ROM.
[1] In case anyone quibbles with this terminology, look at the identity word in the BASIC module preceding the environment information pointer passed in R14 to CALL. In both cases it is &BA51C005. |
Rick Murray (539) 13385 posts |
The thing is, consider the following program:
The nonsense with _kernel_swi() is to prevent the compiler being smart and optimising out most of the code. ;-) This translates to:
It looks pretty good, right? The problem is, these are FPA instructions. If a hardware FPA is available, it will execute the instructions. If not, the ARM will raise an undefined instruction exception at which point the FPEmulator will step in and perform the operation. As you can see, executing this nice tight little three instruction FP multiply could involve something in the order of three hundred instructions being executed. If you are a maths nerd, you could probably do something better using fixed point and integer maths to fake it… There are two alternatives. The first is “VFP”, a newer type of floating point built into most ARMs made in the last decade.
I think. I reserve the right to be utterly wrong. At any rate, it takes about six FP instructions instead of hundreds of ARM ones. The alternative FP implementation, found on ARMv7A (Cortex-A), is called NEON. It is supposed to be three times faster than VFPv2 (ARMv5) and twice as fast as VFPv3?4? (ARMv6); but it comes with caveats. It isn’t IEEE compatible and only works with single precision mathematics.
Why? The program above could be recompiled to use VFP or NEON simply by passing flags to the compiler and recompiling. Unfortunately, passing “-cpu cortex-a8” to the compiler still generates FPA code. Maybe in the future it would be able to use something more appropriate to the processor type/family chosen. It would require a little more complication in CLib, though, in order for it to recognise different types of float in printf() and so on, for the float variations may be NEON, VFP, or FPE.
Sometimes you have to draw a line. Would you prefer half a dozen FP instructions, or hundreds of ARM instructions? Remember, I am talking about potentially one hundred ARM instructions per FP instruction. Sometime or other we will have to accept that going native is the only sensible option for programs that make heavy use of FP. For me, I am not that bothered, as I rarely use FP. I used to for working out percentages, but by rearranging the calculation I can do it in straight integer maths. For you, with the work you do on audio samples, I can imagine a lot of FP would be necessary. I wonder how much faster (less system load) AMPlayer would be if there was a VFP version… |
David Feugey (2125) 2687 posts |
I mean VII :)
Nothing, or many things. SupermanLee said that VFP support could break compatibility with some code. So just change the version, and give people the choice to use V, VI or VII. No more problems, and many opportunities to make other, bigger changes.
To provide Basic and non-system programmers a way to create and modify keywords. Basic programmers prove every day that they can make useful things. But they need simple interfaces to help RISC OS. The same with other parts of the OS (skeletons for image conversion modules, for example, could help non-system C developers to port things). I love plug-ins :)
You could define new keywords, or even change some existing ones. Richard, for example, made a big change to the sound command, available as a patch for BBC Basic for Windows. Just load it, and play. Changing BBC Basic itself is not for everyone, but extending it from Basic would be – IMHO – simpler. It’s just a parser thing anyway (the new sound command could simply be replaced on the fly by some FNnewsound) |
jim lesurf (2082) 1330 posts |
Sorry, afraid I still didn’t understand how to use VFP/NEON without changing my existing ‘C’ code. I think you may be answering a question I wasn’t asking! :-) I do understand how any FP instructions are caught, and lacking real access to FP hardware are emulated by bucketloads of int instructions, etc. That’s why the process is so tediously slow. What I’m asking is if there is a way now (or soon) of having the machine simply access and use FP hardware without my having to change my existing C code, etc. At present I’ll have lots of lines with things like a = b*c; z = g/pi; etc, where a, b, etc are all double precision floats, and of course all the calls like a = cos(pi2*f); ditto. How do I tell the machine now to handle the resulting compiled and linked code using FP hardware? And if not now, when/how? One point of course is to avoid users having to recompile if their machine lacks these hardware alternatives to the FPE. I understand the point of having an FPE which can trap and handle via integer or pass on to accessible FP hardware. It means the person compiling and linking doesn’t have to worry about generating multiple versions and ensuring the user runs the ‘correct’ one. The only worry being the dramatic difference in speed for the users between the two, which is unavoidable. FWIW I did give away my remaining FPA11 (IIRC the chip number) years ago. Maybe I should have kept it as a reminder of what we have since lost because Acorn seemed to decide this area simply didn’t matter. My guess is that what is needed is a modern update to the FPE which traps the instructions and then sends something appropriate to NEON, etc. But I’m not sure I’ve understood. Jim |
Rick Murray (539) 13385 posts |
You don’t, unless you want to drop to assembler and add your own routines. As it is, we are stuck with something that was “old” a quarter of a century ago.
That’s the question.
Or to give the compiler the ability to output VFP (NEON?) instructions and let the programmer decide? Personally, I’m concerned about the majority suffering for the minority. We ought to offer two versions of programs in that case – one for old machines, one for newer.
Oh hell yes! I never understood why Acorn seemed so against hardware FP, when it was being introduced on the competitor platform. I know that the FPEmulator is clever, but it is no match for real FP. |
Rick Murray (539) 13385 posts |
And if not now, when/how?
Actually… If we can forget about NEON for now, it might be simpler than it first seems. I have the FPA10 datasheet and it claims to be IEEE 754 compliant. I also have the VFP data in the ARM ARM 2 and it claims to be IEEE 754 compliant. In this case, I would imagine saving FP registers to memory would use the same format? With this in mind, I’m going to have a crack at rewriting the program I gave earlier to drop to an assembler routine using VFP instructions instead. With any luck, printf() will work, which would imply that, as long as the FP values are stored to memory, the CLib functions ought to still work. |
Dave Higton (1515) 3404 posts |
So let me ask a naive question or several. Could the shared C library discover what FP system (if any) a given platform has, and use the best available? Could BASIC use the shared C library? Should it? And presumably anyone writing code could use it, although the documentation may or may not be adequate. |
David Feugey (2125) 2687 posts |
Not with a VFPEmulator module
Perhaps because it’s much more difficult to design than an ALU? :)
Only needed for the VFP/Neon problem. For the other cases (FPU/no FPU), VFPEmulation will be OK. Of course, if there is no VFP emulation, we need two versions of a program: one for VFP, another for FPA (that will use FPEmulator). For speed issues and old software, FPEmulator should be able to use VFP if present. It’s what we call Soft FP. So 3 things here: For Basic and other ASM software, it can be directly VFP code. But you’ll need one of the two: For a potential BBC Basic VII, I suggest VFP mode, so just for modern motherboards… until we have a VFPEmulator module. Some functions will be a bit different, as between V and VI. It could also be the occasion to solve some problems with zero page. |
GavinWraith (26) 1531 posts |
Jim:
Not with a new Shared C Library.
Dave:
Yes. Charm does this already. See the files lib.src.fp and lib.src.maths in the Charm 2.6.6 distribution. Recompilation would be necessary if the C library were extended by functions to check which FP system were available. It may be that the C runtime would need changing to save and restore vfp state – I am not sure about that. Otherwise I think it might be possible to have different CLib modules to suit each FP system. |
Rick Murray (539) 13385 posts |
Okay. For the lulz. Here’s a C program (you’re on your own for the Makefile, but note that objasm will whinge like hell as we’re mixing FPA and VFP and pre-UAL and UAL):
And here’s some assembler to go with it:
Everything is double, as the C compiler appears to convert single (float) to double prior to calling printf(). The FPE code seems to be broken. It consistently outputs an insanely large value. It used to work, and the code is more or less equivalent to the FPE code generated by the compiler, so I don’t know what’s going wrong. I don’t see what can go wrong with “load a value, load another value, multiply them, store the result”… Doesn’t matter, really. The point is the timings. Um.
290cs vs 7cs. This is fairly consistent, running on a standard Pi, single tasking. In the time it takes to do this once with FPE, I could do it forty-one times with VFP. This is why the DDE ought to start supporting native hardware FP instead of ancient emulated FP. |
Steve Pampling (1551) 7921 posts |
I may be off track here, but isn’t one of the reasons for RO multimedia handling being a bit sucky something to do with the absence of decent FP? Or rather, the use of FPE instead of hardware FP? |
David Feugey (2125) 2687 posts |
This, and slow disc accesses. |
Steve Pampling (1551) 7921 posts |
I did say “one of the reasons”. Unless you have a magic wand, deal with problems one at a time. Faster disc access is at least partially inflicted by limitations of the current hardware. |
Rick Murray (539) 13385 posts |
Err, not really. Same hardware, different OS, makes the standard Pi quite a nice media playback system. One of the main issues is the closed nature of many of the GPUs. The things can be controlled with a binary blob supplied by the manufacturer which will slot into Linux and, together with the media framework, will provide what is necessary for HD H.264 video. You can see, by looking at our MPlayer port and its just-about 320×240 capabilities, exactly how much the GPU does assist. Without this, we’re kind of stuck. Anyway, for now, for today, it might be a nice idea if our compiler could perhaps make better use of the easily available facilities.
1 A possible alternative could be to supply code paths for VFP and non-VFP and select which one is used at runtime depending on the facilities of the host system? This might impact the efficiency of the compiler so… |
David Feugey (2125) 2687 posts |
Holidays? :) The problem is more on the Neon side. We could have a NeonEmulator, but the use of Neon code will not be very optimal on VFP (without Neon) systems. Conclusion: VFP is definitely possible without any drawback, except on FPA systems (rare today). Neon is possible, but with speed problems on VFP-only systems. One solution would be to have both sets of ASM code in the binary. A simple solution would be to have some *ifNeon CLI command to launch a specific RunImage if Neon is present (easy to make [compile twice], easy to remove if some people want to save space [remove the unneeded RunImage]). If you really want, some *ifFPA and *ifVFP commands could be provided too. IMHO, several binaries will be much more flexible than a fat binary. And it’s more RISCOSish too :) |
jim lesurf (2082) 1330 posts |
IIRC that was always the case, and I assumed it was because the IEEE compliance was based on double precision error/rounding specs. BTW, for me, IEEE compliance was important. And at least one version of the FPE emulation failed at one time. Most were fine, though. As was a faster 3rd party version I used for a while (I’ve forgotten who produced that). And back in the day when I still had a machine with real FP hardware I found that hardware was indeed 20 – 40 times faster than emulation. Made a big difference to programs that used a lot of floating point number bashing. I confess to being wary of having to generate multiple binary versions, etc. It opens up scope for odd failures as people start saying “doesn’t work here” or, even more frustrating, “I get a slightly different answer”. Seems to me that dealing with this via CLib/FPEmulator as the go-between makes most sense. But of course I have no idea what VFP/NEON entail, so what I’m saying may not be possible. In the past I mainly wanted FP hardware for engineering/academic/scientific calculations. These days I’d be in a boat with more passengers, as I feel it would help a lot with processing ‘AV’ data files and streams. These often involve bucketloads of data as well as requiring a lot of number crunching. Sometimes it can all be integer, but other times not. Jim |
Steve Drain (222) 1620 posts |
I have tried to keep an eye on Charm. When I looked several months ago I think I remember the floating point only extended to the arithmetic operations; now I see that all the operations offered by FPA are there. If I understand correctly, to exploit VFP you have to recompile the compiler (written in Charm), so object code will only work in one of the areas we might be concerned with, not both. I have not delved into the source, but I would be interested to find out how Charm provides the transcendental functions using VFP, and how it handles VFP contexts. Nevertheless, it is clearly an exemplar for changing the C compiler. |
Steve Drain (222) 1620 posts |
I have never been in doubt about the huge speed advantages of using VFP. My own tests confirmed it, but I thought it too obvious to make a point of. Thanks for your explicit numbers. ;-) |
Steve Drain (222) 1620 posts |
Have you looked at my Float module? I considered all the issues raised here when writing it. My solution may not be suitable as it stands, but it is a solution. First, as Jeffrey pointed out, you have to deal with context. At what level you do this is an important decision, and one that did not have to be made when using FPA. My solution is at task level – not per instruction nor system-wide. If BASIC were to do this, it would be during task initialisation with *Basic, with the context pointers stored in the spare words still available in the workspace. That is how I have experimented with it for Basalt. I expect that this can be done similarly in C. Next is the compatibility between FPA and VFP. The general feeling here seems to be for separate compilation depending on the processor, as with Charm, but I dislike the idea of having different code, despite arguments about what is the route forward. ;-) My solution is to set up a context on systems that can use VFP, but to have null pointers otherwise. The choice of which code to run is then made on whether there is a context or not. This single instruction and branch is of no significance to FPA and a minuscule delay to VFP. Then there is the problem of endianness:
I agree. Any existing data is stored for FPA, cf BASIC. So my solution is the same and requires the overhead of swapping registers before and after VFP code. I think this is acceptable for compatibility. A problem that has only been discussed tangentially, I think, is precision. Certainly double precision IEEE is needed, and that is where I stopped, because that is all BASIC requires. However, single precision is used. VFP can provide this, and I think we can ignore NEON. Extended precision is out of the question with VFP, but how important is that? Lastly is the problem of the transcendental operations not provided by VFP. This is not trivial, but it has been a problem for computers for a long time. As an amateur, my solutions took some time and effort to tease out, but I expect those with computer science qualifications could rustle them up. How many of those do we have here? The algorithms might need explanation, but that is for another time, if anyone is interested. ;-) A final comment. I do not see an FPE replacement using VFP as feasible, for all the reasons Jeffrey listed, and I think a VFPE alternative would impose unnecessary conditions on programs running on non-VFP machines, so I would rule both out. |
David Feugey (2125) 2687 posts |
Yep, an elegant solution. But I was just thinking that perhaps it’s time to get this by default.
We could then fall back to VFPEmulator-specific software functions made to be close to FPEmulator. That’s why I suggested Basic VII: a new solution with support for VFP/Neon (and non-VFP computers with VFPEmulator), the fastest way possible, but with small differences from BBC Basic VI that can lead to incompatible code. The same as between BBC Basic V and VI. And perhaps it will be a good time to add some features present in BBC Basic for Windows: “data structures, PRIVATE variables, long strings, event interrupts, an address-of operator, byte variables, a line continuation character, indirect procedure and function calls and improved numeric accuracy.” Directives for the (de)tokeniser (AllowLowcaseKeywords, RemoveFN, RemovePROC, AllowKeywordRewrites, Aliases, Renames) could be added too, to get something more modern (I do this today with a very limited and buggy preprocessor).
So the solution is not hardware FP?
A VFPEmulator, the same way as with current software that needs FPEmulator. Not really a big change for users. |
David Feugey (2125) 2687 posts |
From a strategic point of view, RISC OS attracts some people because of BBC Basic. And simply because it’s probably the fastest interpreter on ARM in the world (with no JIT). I think it’s really important to keep this advantage. On Windows, BBC Basic is one of the smallest, and the fastest, interpreters too. That’s really a good reason to use it, and to make cross-platform software (OK: games) with it. I’m OK with add-ons (such as the really excellent Basalt), and with support of legacy platforms, but we can also move on. Just to claim that we still have the fastest ARM interpreter in the world :) |
Steffen Huber (91) 1945 posts |
Can someone summarize the situation with GCC and float stuff? Before we got all that shiny new hardware, I remember that there were “hard float” (FPA/FPE) and “soft float” (an internal GCC math lib) being the choice. “Soft float” was a lot faster on non-FPU hardware. ISTR that libraries needed to be compiled for the correct calling standard. |
David Feugey (2125) 2687 posts |
On non-FPU hardware, Soft Float is the fastest solution, Hard Float the slowest. VFP support should be complete in both GCC and UnixLib. Is it available in stock GCC or in a specific beta version? I don’t know. |
GavinWraith (26) 1531 posts |
It probably depends on what program you run, but I would claim that Lua is faster. The Lua binary is about 88K; that includes extra libraries like lpeg (parsing expression grammars) and bc (big numbers). Basic 64 is smaller at 51K. But Lua does not have the nostalgia factor. |
David Feugey (2125) 2687 posts |
Lua has some libs that can lead to much faster results, but I doubt that each opcode is decoded and run with only a few ARM instructions. The BBC Basic engine is very optimised here (I could say the same of BBC Basic for Windows). That does not take away from Lua’s qualities anyway. |
Rick Murray (539) 13385 posts |
Well, there’s one way to settle this. CODEFIGHT!!!
|
GavinWraith (26) 1531 posts |
The standard Lua distribution has always had “#define LUA_NUMBER double” as
You would be right. Some of the Lua VM instructions are pretty complex, especially the ones dealing with tables. Almost every operation depends on whether its operands have metatables. So an addition (+), which in the simplest case would come down to an “ADD result, arg1, arg2” ARM instruction, might be implemented by arbitrarily long user code in Lua (or C or assembler), if either of the operands has been set up to demand it. This is the penalty that has to be paid for user-controlled syntax. In this sense, Lua is not so much a language as a language-kit. It mandates certain aspects (garbage-collected memory management, lexical scoping, multiple return values from functions) but leaves a great deal else free to be defined by the user. The intention behind the register-based Lua VM was that each instruction should do as much work as possible, to cut down on interpretive overhead. |
Rick Murray (539) 13385 posts |
Doesn’t this run the risk of marginalising the language away from serious use? After all, there are always two sides to the story. For example, all of my Windows programs are written in (true)VB because I didn’t grok how C programs started themselves up and I wasn’t confident enough to alter one of the demo apps to figure it out. VB, on the other hand, is overly friendly and extremely simple to use, even if you pay for it in efficiency (the 6502 emulator I started in VB was taking the piddly more than anything else). Perhaps a person might feel more confident with Lua?
Just like the FPE instructions…
Wouldn’t that be the same in BASIC? |
Jeffrey Lee (213) 6046 posts |
VFP support should be complete in both GCC and UnixLib

Currently the only way to get a VFP/NEON capable version of GCC is to build it yourself. It’s also worth pointing out that any programs compiled to use VFP/NEON (using GCC) will need SharedUnixLibrary 1.13 – which hasn’t seen a public release yet. If you build GCC yourself you’ll get a copy of it, but to avoid potential differences once the official 1.13 comes along I don’t think the GCC team will be happy with you distributing your own version.

Rick: The reason your do_fpe function returns the wrong value is the differing word order for doubles between FPA & VFP. It looks like objasm decided you wanted to use VFP word ordering, which is why your VFP code needs to swap the order on save (for interaction with the FPA CLib) but not on load.

Lack of VNMUL & friends in BASIC looks like an oversight – they should be there now in BASIC 1.61 |
Fred Graute (114) 627 posts |
Thanks for the quick fix, Jeffrey. Here are a few more anomalies I found while extending StrongED’s ASM colouring:
|
Jeffrey Lee (213) 6046 posts |
What’s the hex for those instructions? For VLDM/VSTM the register count is stored in an 8-bit field, so you could theoretically load/store up to 255 registers if the hardware had that many. So I suspect that the debugger is disassembling it correctly, and it’s actually the instruction which is at fault. Which would then lead on to a second question of how you assembled those instructions! |
Steve Drain (222) 1620 posts |
I have looked back at the long list of VFP instructions you provided, but I cannot see VNMUL & friends in it. Could you, or Fred, please post them or point me directly at the relevant ARM document? I will then add them to the VFP manual. However, I notice that I missed out the VFPv4 fused multiply instructions VF[N]MA and VF[N]MS. I cannot recall why, but I do not suppose that they are the same issue. |
GavinWraith (26) 1531 posts |
Lua has been designed with very specific aims. In particular, it is designed as a C library to be embedded in applications written in C. Lua as a separate programming language is something of a side-issue. The idea is that if you are after speed, you cater for that on the C side of things. So something like RiscLua, which is a statically compiled C application that interprets an appropriate dialect of Lua for RISC OS, is only showing half the story. So, yes, embedding Lua in a pre-existing number-crunching package, to make it easier to use, makes sense. Adding number-crunching facilities to RiscLua makes less sense IMHO. That is not to say that a future version of RiscLua won’t be using VFP or NEON. Lua 5.3, the latest version, addresses (after years of discussion on the forums) the problem that, whereas doubles may be a useful number type to expose to the user, the internal code is mostly interested in pointers, essentially an integer type. My personal preference is for no coercion and for keeping the types separate, but that is not seen as making things simple for the casual user. |