Thinking ahead: Supporting multicore CPUs

636 posts, 79 voices

Nov 27, 2016 11:10pm Steffen Huber (91) 1945 posts
> However… Didn’t somebody make a version of FPEmulator that pushed all the hard FP over to the PC card to be executed by the 486’s maths co-pro?
FPEPC by WSS: http://wss.co.uk/products.html#FPEPC

Jan 4, 2017 2:46pm Jeffrey Lee (213) 6048 posts
Over Christmas I was able to spend a bit more time working on SMP-related things. Compared to this post the status is that:
- Support for shareable pages is now in the OS
- I have a local copy of the BCM2835 HAL that has:
  - Support for the extra Pi 2/3 interrupts, and all the HAL interrupt calls are MP-safe
  - Implementation of a couple of new HAL calls that are used for starting the other cores
  - Implementation of a basic doorbell HAL device that can be used to trigger interrupts on other core(s)
- I’ve got a local copy of the kernel that’s had some changes made to make it more SMP-friendly:
  - Alternate ARMop implementations for ARMv7 which will broadcast the operations to the other cores (where possible)
  - Global OS_SynchroniseCodeAreas operations changed to only sync the RMA and application space (if OS is running in SMP-friendly mode – since global cache ops generally don’t broadcast)
- The test app has also seen some improvements:
  - It’s now a module
  - It now brings up all the cores instead of just the first
  - Each core has its own stacks & processor vectors (n.b. the code assumes that all multicore devices will implement the security extensions, so we can set the processor vectors to any arbitrary 32-byte aligned address instead of just the &00000000 or &ffff0000 supported by the base architecture. Using different addresses for each core helps make things simpler)
  - Added basic IRQ dispatcher and message queue data structure implementations
I’m currently in the middle of rewriting the code to be mostly C (before it was pure assembler) – although this presents some challenges I’m hoping it will be worth it when it comes to developing and experimenting with all the areas where IPC is required. Once the C rewrite is done I should be able to focus on fleshing out the microkernel side of things – adding a basic SWI to the module to allow tasks to be started, and reporting the task status back to the primary core when an abort or SWI is issued.
As things progress I’m starting to get a better idea of the challenges and design decisions that we’ll be facing when updating the OS. So at some point I’ll probably have to put together a doc which summarises all of the issues and potential solutions. I was hoping that I’d be able to get things to the point where I’d be able to submit the pending kernel + BCM HAL changes to CVS (i.e. I’d be confident that the changes are sensible and aren’t going to break things for regular users), but since there’s a lot of room for experimentation with different solutions to problems I suspect the better option would be to submit the changes to a new branch.

Jan 4, 2017 6:31pm Anthony Vaughan Bartram (2454) 456 posts	This sounds really cool Jeffrey. I’ve dropped you an e-mail – but it sounds like I was a bit behind. I was looking at some C rewriting. If I can help, please reply to my e-mail with any tasks I can pick up or a source download, even if it’s just code review (ASM or C). I’d love to have a go at running/reviewing the test app in any case. Thanks, Tony

Jan 4, 2017 9:07pm George T. Greenfield (154) 726 posts	Very encouraging that you’re making progress in this key area Jeffrey. All being well I should have a ‘spare’ Pi2 available for use running test modules/ROMs if and when required in a few weeks.

Jan 4, 2017 10:02pm Kuemmel (439) 384 posts	Sounds great Jeffrey! Can you explain how you will deal with shared data or memory access between the different tasks running on different CPUs? For example, if more than one task needs information from another one in the form of a changing variable or memory address? In the x86 world a “lock” prefix was invented to prevent any “leakage” in assembly language. After googling for something similar in the ARM world it seems that there’s LDREX and STREX… is that the way to go? Would be nice to have some basic example code to learn from once you are ready.

Jan 4, 2017 11:53pm Jeffrey Lee (213) 6048 posts
> In the x86 world a “lock” prefix was invented to prevent any “leakage” in assembly language. After googling for something similar in the ARM world it seems that there’s LDREX and STREX… is that the way to go?
There are actually a bunch of instructions that you need to use.
- LDREX and STREX provide atomic read-modify-write access to individual memory elements (byte, halfword, word, doubleword). They are an implementation of the LL/SC concept – unlike SWP they don’t lock the bus, they merely watch the indicated address for changes.
- When interacting with larger structures or when signaling changes to other cores the DMB, DSB and (occasionally) ISB instructions are required (to ensure memory accesses and certain instructions occur in “program order” when viewed by the other cores)
- To signal other cores that something has happened, you can either use an interrupt-based system (e.g. hardware mailboxes or FIFOs) or the SEV and WFE instructions
> Would be nice to have some basic example code to learn from once you are ready.
Correct use of the above instructions is difficult – if you’ve got access to the ARMv7 ARM have a read through the “barrier litmus tests” chapter for examples of how things can go wrong. So “basic example code” is likely to take the form of “here, use this C library which implements high-level synchronisation primitives” rather than examples of how to use the underlying instructions directly :-) (e.g. SyncLib has spinlock and mutex implementations which contain all the necessary barriers to make them suitable for general-purpose use. But if you use the barrier or CPU event functions directly then extra care is necessary.)
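
As a rough illustration of the read-modify-write point above, here is a minimal spinlock sketch in portable C11. It is not SyncLib's code: the names `spinlock_t`, `spin_trylock`, `spin_lock` and `spin_unlock` are hypothetical, and it assumes a compiler that provides `<stdatomic.h>` (which may not be true of every RISC OS toolchain). On ARMv7, compilers such as GCC typically lower the test-and-set to an LDREX/STREX loop and the acquire/release orderings to DMB barriers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical spinlock type for illustration; not SyncLib's API. */
typedef struct {
    atomic_flag held;
} spinlock_t;

#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once to take the lock. Returns true if the caller now owns it. */
bool spin_trylock(spinlock_t *lock)
{
    /* Atomic read-modify-write: sets the flag and returns its old value.
       Acquire ordering stops later accesses moving ahead of the lock. */
    return !atomic_flag_test_and_set_explicit(&lock->held,
                                              memory_order_acquire);
}

/* Spin until the lock is ours. */
void spin_lock(spinlock_t *lock)
{
    while (!spin_trylock(lock)) {
        /* Busy-wait. A real implementation might sleep in WFE here. */
    }
}

/* Release the lock. Release ordering makes the writes done while
   holding the lock visible before the flag is seen as clear. */
void spin_unlock(spinlock_t *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

A caller would declare `spinlock_t lock = SPINLOCK_INIT;` and bracket each access to the shared data with `spin_lock(&lock)` and `spin_unlock(&lock)`.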

Jan 5, 2017 7:20am Clive Semmens (2335) 3209 posts	“Barrier Litmus Tests” – ooh, nice! That wasn’t in the original wot I was involved in writing. Had huge fun* with writing scraps of code to test my understanding of these instructions, just to be sure I was getting the instruction documentation right…and no, don’t rely on me as an expert on them after all these years…the silicon I was running them on wasn’t even production silicon… * according to my dictionary, this means “trouble”
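
For readers without the ARMv7 ARM to hand, the simplest of the barrier cases is message passing: one core writes some data and then sets a flag, and another core polls the flag and then reads the data. The sketch below expresses that pattern in C11, assuming `<stdatomic.h>` is available; `mp_channel_t` and the `mp_*` functions are hypothetical names for illustration. The release store and acquire load play the role that DMB plays when writing the instructions by hand.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical one-shot channel: plain data guarded by an atomic flag. */
typedef struct {
    int payload;        /* ordinary, non-atomic data */
    atomic_bool ready;  /* set once the payload is valid */
} mp_channel_t;

void mp_init(mp_channel_t *ch)
{
    ch->payload = 0;
    atomic_init(&ch->ready, false);
}

/* Producer side: write the data, then publish the flag. The release
   store orders the payload write before the flag write. */
void mp_publish(mp_channel_t *ch, int value)
{
    ch->payload = value;
    atomic_store_explicit(&ch->ready, true, memory_order_release);
}

/* Consumer side: check the flag, then read the data. The acquire load
   orders the flag read before the payload read. */
bool mp_try_consume(mp_channel_t *ch, int *out)
{
    if (!atomic_load_explicit(&ch->ready, memory_order_acquire)) {
        return false;   /* nothing published yet */
    }
    *out = ch->payload;
    return true;
}
```

Without the release/acquire pair (or the equivalent barriers), the consumer could see the flag set and still read a stale payload.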

Jan 15, 2017 5:43pm Jeffrey Lee (213) 6048 posts
Over here is an archive containing the source for my current HAL+kernel changes and the SMP module. There’s also a prebuilt Pi ROM + SMP module so that people can experiment with it without needing to have a working ROM build environment, and a readme covering the functionality of the SMP module.
Everything’s still very much in the prototype phase, but it has reached the point where there’s a useful microkernel running on the other cores:
- There’s a SWI dispatcher, with the ability to register & deregister SWIs at runtime (the kernel only allows MP-safe SWIs to be called, so a registration system is used to allow code to indicate which SWIs are safe). By default the SMP module SWIs are available, along with a few simple kernel SWIs
- Each core has its own IRQ dispatcher; OS_ClaimDeviceVector and OS_ReleaseDeviceVector can be used to manage IRQ handlers
- There’s a simple thread scheduler – creating and running threads is how you’ll get your code running on the other cores
- If a thread triggers an exception or calls a non-X SWI which returns an error then the thread will be terminated (and the relevant information will be available via *SMPThreads or SMP_ExamineThread), so there’s at least a minimal safety net to allow you to continue after something goes wrong
The ROM should work fine on all models of Pi, but the SMP module will only do something interesting on a Pi 2 or 3. See the readme for a list of gotchas to watch out for. Also, there’s no sample code, so unless you’re planning on writing something yourself or reviewing the code there’s probably no point downloading the archive.
I’m not sure yet what the next step from here is going to be – there are a lot of areas that could be worked on, and a lot of scope to have people working on things in parallel. So if you’re interested in getting involved, watch this space for future information – but also feel free to poke holes in the code I’ve posted above.
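
The registration idea described above can be sketched as a small lookup table: only SWI numbers that have been explicitly registered are dispatched, and everything else is refused. This is an illustrative sketch only, not the SMP module's implementation. All names, the table size and the error codes are hypothetical, and a real per-core dispatcher would also need locking around the table.

```c
#include <stddef.h>

#define MAX_SAFE_SWIS 16            /* hypothetical table size */

enum {
    SWI_OK = 0,
    SWI_ERR_FULL = -1,
    SWI_ERR_DUPLICATE = -2,
    SWI_ERR_NOT_REGISTERED = -3
};

/* A handler receives the caller's registers and may modify them. */
typedef int (*swi_handler_t)(int *regs);

typedef struct {
    unsigned number;
    swi_handler_t handler;
} swi_entry_t;

typedef struct {
    swi_entry_t entries[MAX_SAFE_SWIS];
    size_t count;
} swi_table_t;

/* Linear search; returns the entry index, or count if not found. */
static size_t swi_find(const swi_table_t *t, unsigned number)
{
    size_t i;
    for (i = 0; i < t->count; i++) {
        if (t->entries[i].number == number) break;
    }
    return i;
}

/* Mark a SWI as safe to call by adding it to the table. */
int swi_register(swi_table_t *t, unsigned number, swi_handler_t handler)
{
    if (swi_find(t, number) != t->count) return SWI_ERR_DUPLICATE;
    if (t->count == MAX_SAFE_SWIS) return SWI_ERR_FULL;
    t->entries[t->count].number = number;
    t->entries[t->count].handler = handler;
    t->count++;
    return SWI_OK;
}

/* Remove a SWI by moving the last entry into its slot. */
int swi_deregister(swi_table_t *t, unsigned number)
{
    size_t i = swi_find(t, number);
    if (i == t->count) return SWI_ERR_NOT_REGISTERED;
    t->entries[i] = t->entries[t->count - 1];
    t->count--;
    return SWI_OK;
}

/* Call the handler only if the SWI has been registered. */
int swi_dispatch(swi_table_t *t, unsigned number, int *regs)
{
    size_t i = swi_find(t, number);
    if (i == t->count) return SWI_ERR_NOT_REGISTERED;
    return t->entries[i].handler(regs);
}

/* Example handler: adds regs[0] and regs[1], result in regs[0]. */
int demo_add_handler(int *regs)
{
    regs[0] = regs[0] + regs[1];
    return SWI_OK;
}
```

Refusing unregistered numbers by default is the design point: code has to opt in to being callable from another core.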

Jan 15, 2017 6:51pm Rick Murray (539) 13593 posts	Well… damn! That’s all I can say. Kinda think I might put some cash aside real soon now for a later Pi.

Jan 17, 2017 12:14am Anthony Vaughan Bartram (2454) 456 posts
Well this is sort of like Hello World (assuming the use of the prototype RISCOS.IMG on a Pi 2 and a prior rmload smp). Probably not cleaning up but I need to read the readme again….
    #include <stdio.h>
    #include <time.h>
    #include "kernel.h"
    #include "swis.h"

    /* Threading example. Sets caller parameter to '1' if thread is executed. */
    typedef int thread_t;

    #define USR 16
    #define SMP_CreateThread 0x0c1242

    // n.b. This should really thunk through assembler to set up a stack etc.
    // But as the example entry point only sets a value & for simplicity
    // this is not implemented.
    //
    thread_t ThreadCreate(const char * name, void * entry, void * param)
    {
        thread_t tid = -1;
        _kernel_swi_regs regs;
        _kernel_oserror err;

        regs.r[0] = (int)name;
        regs.r[1] = 0;           // Affinity mask
        regs.r[2] = 0;           // Pollword
        regs.r[3] = (int)param;  // Parameter
        regs.r[4] = 0;
        regs.r[5] = (int)entry;
        regs.r[6] = USR;
        _kernel_swi(SMP_CreateThread, &regs, &regs);
        tid = regs.r[0];
        return tid;
    }

    int ThreadEntry(void * param)
    {
        int * out = (int *)param;
        (*out) = 1;
        return 0;
    }

    int main(int argc, char * argv[])
    {
        time_t start;
        start = time(NULL);
        printf("Start Time %d\n\n", start);
        printf("Attempting to invoke thread\n");
        {
            volatile int rtn = 0;
            thread_t t = -1;
            t = ThreadCreate("first", (void*)&ThreadEntry, &rtn);
            while (time(NULL) < (start + 2));
            printf("tid = %x rtn = %d\n", t, rtn);
            if (rtn == 1) printf("Thread was called successfully\n");
        }
        return 0;
    }

Jan 17, 2017 1:21am Anthony Vaughan Bartram (2454) 456 posts	Worth being careful with the parameter address. As documented in the readme, memory allocated in application space sits at the same addresses for every task under the WIMP… So anything derived from the example above should really be run outside the WIMP, otherwise you can get errors (visible with SMPMetrics).

Jan 17, 2017 1:36pm Jeffrey Lee (213) 6048 posts
> n.b. Tried some of the documented textile suppression codes. Didn’t have any joy. Which one is supported (hence the total lack of indentation)? Thanks.
HTML <pre> tags usually work best for me.
> Probably not cleaning up but I need to read the readme again….
Yes, a call to SMP_DestroyThread is required to clean up a thread after it’s exited/terminated.
A couple of other things to be wary of with that code:
- When threads are created R13_usr will be zero. So for USR/SYS mode threads you’ll need a veneer which sets up a stack, and (for C) the APCS SL register. On the other hand, you will automatically get an SVC stack, and the fact that R12 is passed through means that modules can use standard CMHG veneers to set up their C environment (although it is a bit overkill to have the veneer save/restore r0-r9 on entry/exit)
- Technically ThreadEntry should return an int, since r0 will be used as the thread exit code
- ‘rtn’ should technically be ‘volatile int rtn’ to make sure the compiler doesn’t optimise out any reads/writes
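
The veneer itself has to be assembler, since it must load R13 before any C code runs, but the bookkeeping around it can be sketched in C. The fragment below only works out what value such a veneer would load into R13 for a caller-supplied block of memory. Everything here is hypothetical: the `thread_stack_t` type, the function name, the 256-byte minimum, and the choice of 8-byte alignment (borrowed from the AAPCS; older APCS variants only need word alignment). Setting up the APCS SL register is not covered.

```c
#include <stddef.h>
#include <stdint.h>

#define MIN_STACK_BYTES 256u   /* hypothetical minimum, for illustration */

typedef struct {
    uintptr_t limit;       /* lowest address of the stack block */
    uintptr_t initial_sp;  /* value a veneer would load into R13 */
} thread_stack_t;

/* ARM stacks are full-descending: R13 starts at the top of the block
   and moves down. Returns 0 on success, -1 if the block is too small. */
int thread_stack_prepare(thread_stack_t *s, void *block, size_t size)
{
    uintptr_t base = (uintptr_t)block;
    uintptr_t top = (base + size) & ~(uintptr_t)7;  /* align down to 8 */

    if (top <= base || top - base < MIN_STACK_BYTES) {
        return -1;
    }
    s->limit = base;
    s->initial_sp = top;
    return 0;
}
```

The block would normally be passed to the thread through the parameter register, so that the veneer can pick up `initial_sp` before branching to the real entry point.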

Jan 17, 2017 8:15pm Anthony Vaughan Bartram (2454) 456 posts	Hi Jeffrey, Thanks, I suspected as much. I thought of implementing a thunk in ASM and then using BL to invoke the user-supplied method. Looking at BSD kthreads, as discussed on e-mail, at present. I’ll correct the example above at least to include volatile and return an int (will add thread destruction later). I’ll try out the pre tag. Thanks, Tony

Jan 18, 2017 9:48pm Kuemmel (439) 384 posts	Hi Anthony, would that C code also work if translated to BASIC? Any chance you could do that, with e.g. 2 tasks created on 2 cores that each do something, like a small calculation or just printing some text individually… so we would have a nice Hello World for both main coding platforms on RISC OS that encourages people to start coding… including myself, being a lousy single-task coder…

Jan 18, 2017 11:56pm Anthony Vaughan Bartram (2454) 456 posts	Hi Kuemmel, Not at the moment I believe. Multi-threaded BBC BASIC. That would be quite cool. I think you would probably have to compile the BASIC to ASM and then a toolkit would be required to implement wrapper functions including setup/teardown i.e. BBC BASIC has no concept of re-entrant code & the BASIC module itself is not multicore safe. Of course, once some modules exist e.g. for background processing to do useful tasks, then BASIC could call that like any other SWI to do useful work e.g. anything that is processor intensive which you do not wish to block the UI. Tony

Jan 19, 2017 6:14pm Kuemmel (439) 384 posts	@Jeffrey, @Anthony: Maybe BBC BASIC could start itself individually as a thread when code is started on each core, and run BASIC code… maybe it’s too insane to believe it would work. But how is that done when I start, let’s say, 2 or more WIMP BASIC programs under RISC OS now, without the support of multiple cores? Is each WIMP program starting a new BBC BASIC interpreter to run? If that were the case, this could be assigned to cores somehow, I would guess?

Jan 19, 2017 6:56pm David Feugey (2125) 2709 posts	Hum, or a cut down version of BBC Basic that could run on other cores? (No GFX, no sound, limited SWI set, limited memory set).

Jan 19, 2017 8:13pm Rick Murray (539) 13593 posts
> But how is that done when I start, let’s say, 2 or more WIMP BASIC programs under RISC OS now, without the support of multiple cores? Is each WIMP program starting a new BBC BASIC interpreter to run?
No. The BASIC interpreter stores “state” in the area between LOMEM and PAGE, and I think some more up around HIMEM? I’m sure somebody will correct me here. Anyway, what happens is exactly what happens with C programs. The “state” of the program is held within its own application space. As such, when the Wimp is polled the registers are saved and the entire application memory is paged out¹. Another application is paged in¹ and the registers for that app are restored before the Wimp_Poll call of the new application returns, and everything just carries on from where it left off.
The view from the application: Calls Wimp_Poll, it’ll return with an event that needs to be handled.
The view from the Wimp: Application called Wimp_Poll. Shuffle it out of the way and find the next app. Repeat in a round-robin fashion, noting if the app is using PollIdle or doesn’t want specific events (like Null polls).
To consider: Between your app calling Wimp_Poll and the SWI returning to you, it is entirely possible that fifty other apps have polled fifty hundred thousand times and twenty minutes have gone by.
> If that were the case, this could be assigned to cores somehow, I would guess?
I think the problem that is going to rear its head the most with multi-core work is exactly how much of the internals of RISC OS are not re-entrant. You cannot access files on TickerV or CallAfter/CallEvery. You must use those to trigger a Callback (which is invoked when the system is next “not busy”, so could be any time really…) and then perform the file access.
While this sort of thing is fairly low level and may not be of consequence to many, it does raise the obvious question of how numerous operations on one core would be dealt with when we don’t necessarily know the status of the core running RISC OS Prime.
> Hum, or a cut down version of BBC Basic that could run on other cores? (No GFX, no sound, limited SWI set, limited memory set).
To be honest, that would be kind of pointless, don’t you think? A better idea, and one that might be doable (just) is to have a version of BASIC that implements a TUBE-like interface to something (module?) on RISC OS Prime. The other-core program (and not just BASIC) will be stalled awaiting the host module doing its thing and sending data (if any) back to the other core. In this way, other cores can access the usual system facilities, they just might take a speed hit awaiting the opportunity. And, yes, this does mean RISC OS Prime will be doing work for the other cores. Trust me, you only want one thing talking to devices in a controlled manner. Anything else is going to be messy.
I believe Jeffrey mentioned a TaskWindow that uses other cores? If so, that’s sort of like what I’m thinking of, and could well be a good start. The RISC OS mini kernels on the other cores should, necessarily, do as much as they can for themselves (otherwise it kind of ruins the point of having multiple cores to use), but it should absolutely NOT be afraid of kicking more complicated things back to the host. There’s only one keyboard, only one screen, so be like TUBE and let the host deal with that and the co-pro (co-core?) programs deal with them via the host.
¹ Lazy task swapping uses some sort of mechanism to only page in partial bits of memory. I’m not sure of the exact method, suffice to say that while whole-appspace swapping was acceptable for older slower machines with small apps (remember, the MEMC with 4MB installed would use 32K pages), modern ARMs with 4K pages repeatedly swapping 10-20MB appspace…was rather painful.
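
The TUBE-like arrangement suggested above can be pictured as a request queue: a program on another core posts a request and waits, while the host core takes requests off the queue and performs them. The sketch below shows only the queue, as a single-producer, single-consumer ring buffer in C11. It assumes `<stdatomic.h>` is available; `request_t`, `request_queue_t` and the `queue_*` functions are hypothetical names, and the reply path and any doorbell interrupt are left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_SLOTS 8   /* hypothetical size; must be a power of two */

/* A request from a secondary core: an operation code and one argument. */
typedef struct {
    int op;
    int arg;
} request_t;

typedef struct {
    request_t slots[QUEUE_SLOTS];
    atomic_size_t head;   /* next slot the host will read */
    atomic_size_t tail;   /* next slot the secondary core will write */
} request_queue_t;

void queue_init(request_queue_t *q)
{
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Called on the secondary core. Returns false if the queue is full. */
bool queue_post(request_queue_t *q, request_t r)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);

    if (tail - head == QUEUE_SLOTS) return false;
    q->slots[tail % QUEUE_SLOTS] = r;
    /* Release: the slot contents become visible before the new tail. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Called on the host core. Returns false if nothing is waiting. */
bool queue_take(request_queue_t *q, request_t *out)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);

    if (head == tail) return false;
    *out = q->slots[head % QUEUE_SLOTS];
    /* Release: the slot is free for reuse once head has advanced. */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

With one queue per secondary core, only the host ever talks to devices, which is the property being argued for.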

Jan 19, 2017 9:21pm Anthony Vaughan Bartram (2454) 456 posts	Hi Rick, David, Kuemmel et al, This thread is, and possibly has been through most of its life, an enjoyable list of speculative and exploratory ideas. However, I recommend re-reading Jeffrey’s post. It outlines a threading scheme with the option to list & permit safe SWI calls. MultiCore Prototype Development tasks are being tracked outside of this thread in order to action a series of changes to incrementally build a multi-core capability. In its simplest form, launching threads to perform background tasks is useful. Ideally, auto-scheduling safe portions of a process on different cores would be feasible. This is consistent with other OSes, which abstract a process away from its execution core/processor. If you would like to help with some of this work, please reach out to Jeffrey and me, and individual tasks could be assigned to you. Thanks, Tony

Jan 19, 2017 9:49pm David Feugey (2125) 2709 posts
> To be honest, that would be kind of pointless, don’t you think?
Perhaps, but in fact, what you suggest is what I was thinking about.

Feb 5, 2017 12:27am rob andrews (112) 200 posts	Hi Jeffrey, I see that you are on, so can I ask whether you are going to do an SMP module for OMAP5, or will the current one work with your changes to the kernel? If I try to run it, it comes back with the number of cores wrong, i.e. “Bad core count”. Do I need to do a build with your kernel changes?

Feb 5, 2017 12:56am Jeffrey Lee (213) 6048 posts	The plan is to have it so that the same SMP module can work with all of the multi-core machines. But in order to support this, each machine will need to have some changes made to its HAL. At the moment I’m trying to get OMAP4 working, so I can test out a new way of managing interrupts. There are also a couple of DMA/memory management things that need fixing before multi-core code can safely be used on other machines (the Pi and OMAP4 are fine since they don’t make much use of DMA). Once that’s all done it shouldn’t take much effort to get the other machines working (the changes that are being made to the OMAP4 HAL will be almost identical to the changes that will need making to the other HALs, since they all use the same type of interrupt controller).

Feb 5, 2017 2:01am rob andrews (112) 200 posts	Good news, I look forward to testing it when it becomes available.

Jun 14, 2017 1:10pm Jeffrey Lee (213) 6048 posts
Does anyone have any suggestions for some test code? I need something that I can use to detect any cache maintenance/memory management bugs – i.e. have a test harness which repeatedly runs the code on the other core(s) and watches for any failures while the main core messes with the system. Specifically, I’m looking for an algorithm which will touch a reasonably large amount of memory (lots of reading & writing, repeated reads & writes of the same location, etc.), will run for a reasonable amount of time (e.g. at least a second or two), and produces the same result when given the same input.
Further constraints are:
- The code must be C or assembler, and capable of running from the RMA
- (Obviously) the code must be MP-safe. For a moment I thought I could use the Squash compress/decompress SWIs for testing, then I looked at the source and saw that it uses static variables to store the state during each invocation. Doh!
- The core loop must be self-contained – no SWI calls, minimal C library calls (since the C library isn’t MP-safe yet)
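
One possible shape for such a test kernel, offered only as a sketch: fill a caller-supplied buffer from a seeded pseudo-random generator, scramble it in place for a number of passes so that the same locations are read and written repeatedly, then fold the buffer into a checksum. The function names and the mixing steps are hypothetical choices, not a recommendation of a particular algorithm. The code keeps all state in locals and in the caller's buffer, makes no SWI or C library calls, and returns the same value for the same inputs.

```c
#include <stddef.h>
#include <stdint.h>

/* Marsaglia's 32-bit xorshift generator; state must be non-zero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill, scramble and checksum a buffer of 'words' 32-bit words.
   The result depends only on words, seed and passes. */
uint32_t memory_stress(uint32_t *buf, size_t words, uint32_t seed,
                       unsigned passes)
{
    uint32_t state = seed ? seed : 1u;  /* avoid the all-zero state */
    uint32_t sum = 0;
    size_t i;
    unsigned p;

    if (words == 0) return 0;

    /* Fill the whole buffer with pseudo-random data. */
    for (i = 0; i < words; i++) {
        buf[i] = xorshift32(&state);
    }

    /* Each pass pairs every word with a pseudo-random partner, giving
       a mix of sequential and scattered reads and writes. */
    for (p = 0; p < passes; p++) {
        for (i = 0; i < words; i++) {
            size_t j = xorshift32(&state) % words;
            uint32_t t = buf[i];
            buf[i] = buf[j] + t;
            buf[j] = t ^ (uint32_t)i;
        }
    }

    /* Fold everything into a rotate-and-xor checksum. */
    for (i = 0; i < words; i++) {
        sum = ((sum << 1) | (sum >> 31)) ^ buf[i];
    }
    return sum;
}
```

A harness could run this on another core with a fixed seed, compare the returned checksum against a value computed once on the main core, and adjust `words` and `passes` until a run takes a second or two.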

Jun 14, 2017 1:48pm Alan Robertson (52) 420 posts	Would a Fast Fourier Transform algorithm be of use here?