Using another core on the RPi 2
Bill Antonia (2466) 120 posts |
I’ve been trying to run code on another core, without success. So far I’ve written a very simple module in assembler which has initialisation code and uses OS_Memory 13 to map in address &4000009C; this should be the location where the start address of the routine to be run is poked for the next core. The routine itself is set to change GPIO pin 25 repeatedly from low to high and back again, with a delay between each change; the output is to be observed using a Gertboard. The routine to run is part of the module, loops on itself, and makes no use of any stack or operating system calls, so apart from being in RMA space it is fairly isolated unless the module gets moved. The following is part of some code written by Peter Lemon; admittedly it’s not for RISC OS.
I have already written a module to manage the states of the GPIO pins (similar to Tank’s, but smaller and only for the Raspberry Pi) and a simple application to communicate through that module to the pins, and that works fine. Note the two modules are not related; the second is mentioned here just to say that I know how to communicate using the GPIO pins. As far as I’m aware it’s just a matter of storing the start address of the routine to be run in the correct place for each core to begin executing. |
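For anyone following along, the release mechanism being described reduces to a single store: map the QA7 peripheral block (physical &40000000) with OS_Memory 13, then write the physical address of the routine into core 1’s mailbox 3 write-set register at offset &9C. A minimal objasm-style sketch, with illustrative register choices:

```
; Sketch only: release core 1 by poking its mailbox 3 write-set register.
; r0 = physical address of the routine the core should execute
; r1 = logical address mapped by OS_Memory 13 for physical &40000000
        STR     r0, [r1, #&9C]     ; core 1 mailbox 3 write-set (&4000009C)
```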
Dave Higton (1515) 3404 posts |
Have you mapped in the hardware pin addresses too, so that the second core can access them? |
Bill Antonia (2466) 120 posts |
The GPIO memory was mapped as well, using OS_Memory 13, before entering the routine, as with my GPIO module, and pin 25 was set to output. A further thought from later this afternoon: does the second core use the same memory mapping as the primary core, or does it ignore the memory mapping set up by RISC OS? |
Jeffrey Lee (213) 6046 posts |
The additional cores will start with the MMU (and caches?) disabled. So you don’t need to worry about mapping anything in for them to use; you just need to make sure your code uses physical addresses for everything. For stuff to work reliably you’ll also have to:
An easy way of satisfying the above three constraints would be to store your code in some memory allocated by PCI_RAMAlloc. That call will also return the physical address of the buffer, which is useful for working out the start address of the core (the MMU is disabled so specifying a logical address won’t work) |
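In BBC BASIC terms, the allocation step might look like the following sketch (the register order matches the SYS call that appears later in this thread; check the PCI module documentation for the full API):

```
SYS "PCI_RAMAlloc",1024 TO log%,phys%
REM copy the relocatable routine to log%, then write phys% (the
REM physical address) to the core's mailbox as its start address
```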
Bill Antonia (2466) 120 posts |
So: |
Jeffrey Lee (213) 6046 posts |
PCI_RAMAlloc will map in the memory for you (for the primary core, obviously). It returns both a logical & physical address. |
Bill Antonia (2466) 120 posts |
Ahhhhh……… Thank you. I followed the link in your post about PCI_RAMAlloc and briefly saw the mention of physical address returned, I didn’t clock the logical address was also returned. Overall, is the list of “things to do” within the ball-park? I’ve looked through threads on the forum regarding multiprocessor capability but haven’t seen any examples (yet, not to say there are none). I’ve also looked at the RPi forums, particularly for “Bare metal” to get some clues. I have written some bare metal code but that was tested on the B+ so didn’t use more than one core. |
Jeffrey Lee (213) 6046 posts |
Yes, your list looks correct to me. |
Bill Antonia (2466) 120 posts |
Thank you. |
Bill Antonia (2466) 120 posts |
Got back to this recently. I know GPIO pin 25 is being set as an output and I am able to set the pin high and low using core 0, so there is nothing wrong with the first part of the code. Allocating space using the PCI call works, and the logical and physical addresses are saved. The relocatable code is then copied into that area and the physical start address of the copied code placed into one of the logical addresses of the core 1 mailbox, ready for execution. However, nothing happens. I’m using a Gertboard to sense GPIO pin 25, which is wired up correctly and does blink when using core 0. I’m not able to spot where it’s functioning incorrectly! Any clues? |
Jeffrey Lee (213) 6046 posts |
You probably need a memory barrier between copying the code to the PCI RAM and writing the start address to the mailbox (PCI RAM will be mapped as bufferable, so the CPU may run ahead and perform the mailbox write before the code has finished copying). Also, is the mailbox address correct? The code you posted up top uses &4000009C, but you’re using &40000090. |
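A minimal sketch of the ordering being described, using the CP15 encoding of the Data Synchronisation Barrier (register choices are illustrative; r0 holds the physical start address and r1 the logical address of the mailbox write-set register):

```
; ...copy loop has just finished...
        MOV     r4, #0
        MCR     p15, 0, r4, c7, c10, 4   ; DSB: drain the write buffer so the
                                         ; copied code reaches memory first
        STR     r0, [r1]                 ; then release the core via its mailbox
```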
Bill Antonia (2466) 120 posts |
Thank you, didn’t spot the &40000090/&4000009C difference! I’ll check that out when I set up tomorrow, along with the additional memory barriers. |
Bill Antonia (2466) 120 posts |
Changed the line “A7_Core1Mailbox0WriteSet EQU &90” to “A7_Core1Mailbox3WriteSet EQU &9C”, and changed the STR line accordingly, where the physical address of the start location of the copied relocatable code is posted. I’ve also added a memory barrier, “mcr p15, 0, a4, c7, c10, 5”, just after the “bne next_instruction”. However, still no joy. This proof of concept is proving challenging! |
Jeffrey Lee (213) 6046 posts |
If you try with the latest ROM you might have more success – the ARM boot stub where the other cores sit waiting for the mailbox write was getting wiped by RISC OS when it was doing the RAM clear during boot (and I’m hoping this was the cause of some random crashes I was seeing during the boot sequence on my Pi 2). I’ve now fixed the OS to wake up the other cores and put them into a different sleep loop, located within the HAL. You should be able to get them out of that sleep loop by writing to the same mailboxes as before (core 1 mailbox 3, core 2 mailbox 3, etc.) and then using the SEV instruction to send an event. If you’ve got a serial cable hooked up you should see the core briefly announce itself (‘1’, ‘2’ or ‘3’) before it branches to your code.

Also, one thing I’ve realised is that because the multi-core code will only ever work on the Pi 2 and above, you can use the ARMv7 DSB/DMB memory barrier instructions rather than the MCR barriers (less chance of making a typo!), assuming you’ve got a fairly recent version of the DDE (SEV, DSB/DMB and all the other ARMv7/v8 instructions are available in objasm 4).

Edit: Spoke too soon, just spotted a bug in my code where they’re reading from the wrong mailbox registers :-) Maybe I should actually check that the cores can be woken from the sleep loop this time! |
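For reference, the ARMv7 mnemonics correspond to the CP15 encodings seen earlier in the thread (the MCR forms still work on ARMv7, but are deprecated aliases):

```
        MCR     p15, 0, r0, c7, c10, 4   ; Data Synchronisation Barrier (old form)
        DSB                              ; equivalent ARMv7 mnemonic
        MCR     p15, 0, r0, c7, c10, 5   ; Data Memory Barrier (old form)
        DMB                              ; equivalent ARMv7 mnemonic
        SEV                              ; send event, to wake cores parked in WFE
```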
Jeffrey Lee (213) 6046 posts |
Fix is now checked in, so expect to see it in tomorrow’s ROM. I’ve also made it so that LR is set up to point to the sleep loop, so if you’re testing simple code sequences you can return back there afterwards (just try not to crash the cores, because they won’t have any processor vectors set up!)

Here’s some test code which works (prints “hi” to the serial port – although the debug code in the HAL means it’ll actually print “1hi1”):

ON ERROR PRINT REPORT$;" at ";ERL : END
DIM launcher% 1024
FOR pass=0 TO 2 STEP 2
P%=launcher%
[ OPT pass
SWI "OS_EnterOS"
; Copy the code over (PCI RAM not accessible from user mode, so can't assemble the code to it directly)
ADR r3,code_begin
MOV r4,#code_end-code_begin-4
.copy_loop
LDR r5,[r3,r4]
STR r5,[r2,r4]
SUBS r4,r4,#4
BGE copy_loop
; Start it running
DSB
STR r0, [r1]
DSB
SEV
SWI "OS_LeaveOS"
MOV pc, lr
.code_begin
LDR r0, uart
MOV r1, #ASC("h")
STRB r1, [r0]
MOV r1, #ASC("i")
STRB r1, [r0]
MOV pc, lr
.uart
EQUD &3F000000 + &201000
.code_end
]
NEXT pass
SYS "OS_Memory",13,&40000000,4096 TO ,,,qa7%
SYS "PCI_RAMAlloc",512 TO log%,phys%
A%=phys%
B%=qa7%+&9C : REM core 1 mbox 3
C%=log%
CALL launcher%

This will leak a bit of memory in the PCI heap; fixing that is left as an exercise for the reader (when the cores enter the sleep loop they’ll write a non-zero value to the core 0 mbox <N>, so if you clear that mailbox before starting the core you should be able to wait for it to become non-zero again and then free the memory). |
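A possible shape for that exercise, in BASIC. This sketch assumes a PCI_RAMFree SWI and assumes core 1’s sleep-loop write lands in core 0 mailbox 1, whose read/clear register would sit at QA7 offset &C4 – verify both assumptions against the documentation before relying on them:

```
mbox% = qa7% + &C4           : REM core 0 mailbox 1 read/clear (assumed offset)
!mbox% = -1                  : REM writing bits to read/clear clears them
REM ... write phys% to core 1 mailbox 3 and SEV, as above ...
REPEAT UNTIL !mbox% <> 0     : REM core has returned to the sleep loop
SYS "PCI_RAMFree",log%       : REM assumed SWI name; check the PCI module docs
```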
Jan Rinze (235) 367 posts |
Very interested in this. |
Jeffrey Lee (213) 6046 posts |
The extra cores will start with the MMU off, so you should be able to access video memory directly by physical address. However, finding the physical address might be a bit tricky – RISC OS will have mapped it using 1MB sections, which OS_Memory 0 doesn’t understand. I guess you could read the page tables directly, or use the mailbox interface to query the GPU (which should be fairly straightforward now that BCMSupport is available).
If you want a general solution which will work on all multicore devices then using a RAM-based message queue would make more sense. I think the only time we’d want to use machine-specific features for IPC would be when we need to trigger an interrupt on another core. When the OS receives that interrupt it would know it needs to check its RAM message queue for new messages. |
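As a sketch of that general approach (everything here is illustrative): a single shared word in RAM can act as a one-slot message queue, with barriers ordering the payload against the flag and WFE/SEV providing the wakeup:

```
; Producer (e.g. core 0): publish a payload, then raise the flag
        STR     r2, [r0]        ; r0 -> payload word in shared RAM
        DMB                     ; make the payload visible before the flag
        MOV     r3, #1
        STR     r3, [r1]        ; r1 -> flag word
        DSB                     ; ensure the flag store completes...
        SEV                     ; ...before waking any core waiting in WFE

; Consumer (e.g. core 1): wait for the flag, then read the payload
wait    WFE                     ; sleep until an event arrives
        LDR     r3, [r1]
        CMP     r3, #0
        BEQ     wait
        DMB                     ; order the flag read before the payload read
        LDR     r2, [r0]
```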