Using another core on the RPi 2
Bill Antonia (2466) 120 posts |
I’ve been trying to run code on another core, without success. So far I’ve written a very simple module in assembler which has initialisation code and uses OS_Memory 13 to map in address &4000009C; this should be the location where the start address of the routine to be run is poked for the next core. The routine itself is set to change GPIO pin 25 repeatedly from low to high and back again, with a delay between each change; the output is to be observed using a Gertboard. The routine to run is part of the module, loops on itself, and makes no use of any stack or operating system calls, so apart from being in RMA space it is fairly isolated unless the module gets moved. The following is part of some code written by Peter Lemon; admittedly it’s not for RISC OS.
I have already written a module to manage the states of the GPIO pins (similar to Tank’s, but smaller and only for the Raspberry Pi) and a simple application to communicate through that module to the pins, and that works fine. Note the two modules are not related; the second is mentioned here just to say that I know how to communicate using the GPIO pins. As far as I’m aware it’s just a matter of storing the start address of the routine to be run in the correct place for each core to begin executing. |
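For anyone following along, the release mechanism being described reduces to a single store: map the QA7 peripheral block (physical &40000000) with OS_Memory 13, then write the physical address of the routine into core 1’s mailbox 3 write-set register at offset &9C. A minimal objasm-style sketch, with illustrative register choices:

```
; Sketch only: release core 1 by poking its mailbox 3 write-set register.
; r0 = physical address of the routine the core should execute
; r1 = logical address mapped by OS_Memory 13 for physical &40000000
        STR     r0, [r1, #&9C]     ; core 1 mailbox 3 write-set (&4000009C)
```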
Dave Higton (1515) 3404 posts |
Have you mapped in the hardware pin addresses too, so that the second core can access them? |
Bill Antonia (2466) 120 posts |
The GPIO memory was mapped as well, using OS_Memory 13, before entering the routine, as with my GPIO module, and pin 25 was set to output. A further thought from later this afternoon: does the second core use the same memory mapping as the primary core, or does it ignore the memory mapping set up by RISC OS? |
Jeffrey Lee (213) 6046 posts |
The additional cores will start with the MMU (and caches?) disabled. So you don’t need to worry about mapping anything in for them to use; you just need to make sure your code uses physical addresses for everything. For stuff to work reliably you’ll also have to:
An easy way of satisfying the above three constraints would be to store your code in some memory allocated by PCI_RAMAlloc. That call will also return the physical address of the buffer, which is useful for working out the start address of the core (the MMU is disabled so specifying a logical address won’t work) |
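In BBC BASIC terms, the allocation step might look like the following sketch (the register order matches the SYS call that appears later in this thread; check the PCI module documentation for the full API):

```
SYS "PCI_RAMAlloc",1024 TO log%,phys%
REM copy the relocatable routine to log%, then write phys% (the
REM physical address) to the core's mailbox as its start address
```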
Bill Antonia (2466) 120 posts |
So: |
Jeffrey Lee (213) 6046 posts |
PCI_RAMAlloc will map in the memory for you (for the primary core, obviously). It returns both a logical & physical address. |
Bill Antonia (2466) 120 posts |
Ahhhhh……… Thank you. I followed the link in your post about PCI_RAMAlloc and briefly saw the mention of physical address returned, I didn’t clock the logical address was also returned. Overall, is the list of “things to do” within the ball-park? I’ve looked through threads on the forum regarding multiprocessor capability but haven’t seen any examples (yet, not to say there are none). I’ve also looked at the RPi forums, particularly for “Bare metal” to get some clues. I have written some bare metal code but that was tested on the B+ so didn’t use more than one core. |
Jeffrey Lee (213) 6046 posts |
Yes, your list looks correct to me. |
Bill Antonia (2466) 120 posts |
Thank you. |
Bill Antonia (2466) 120 posts |
Got back to this recently. I know GPIO pin 25 is being set as an output and I am able to set the pin high and low using core 0, so there is nothing wrong with the first part of the code. Allocating space using the PCI call works, and the logical and physical addresses are saved. The relocatable code is then copied into that area and the physical start address of the copied code placed into one of the logical addresses of the core 1 mailbox, ready for execution. However, nothing happens. I’m using a Gertboard to sense GPIO pin 25, which is wired up correctly and does blink when using core 0. I’m not able to spot where it’s functioning incorrectly! Any clues? |
Jeffrey Lee (213) 6046 posts |
You probably need a memory barrier between copying the code to the PCI RAM and writing the start address to the mailbox (PCI RAM will be mapped as bufferable, so the CPU may run ahead and perform the mailbox write before the code has finished copying). Also, is the mailbox address correct? The code you posted up top uses &4000009C, but you’re using &40000090. |
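A minimal sketch of the ordering being described, using the CP15 encoding of the Data Synchronisation Barrier (register choices are illustrative; r0 holds the physical start address and r1 the logical address of the mailbox write-set register):

```
; ...copy loop has just finished...
        MOV     r4, #0
        MCR     p15, 0, r4, c7, c10, 4   ; DSB: drain the write buffer so the
                                         ; copied code reaches memory first
        STR     r0, [r1]                 ; then release the core via its mailbox
```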
Bill Antonia (2466) 120 posts |
Thank you, didn’t spot the &40000090/&4000009C difference! I’ll check that out when I set up tomorrow, along with the additional memory barriers. |
Bill Antonia (2466) 120 posts |
Changed the line “A7_Core1Mailbox0WriteSet EQU &90” to “A7_Core1Mailbox3WriteSet EQU &9C”, and changed the STR line accordingly, where the physical address of the start location of the copied relocatable code is posted. I’ve also added a memory barrier, “mcr p15, 0, a4, c7, c10, 5”, just after the “bne next_instruction”. However, still no joy. This proof of concept is proving challenging! |
Jeffrey Lee (213) 6046 posts |
If you try with the latest ROM you might have more success – the ARM boot stub where the other cores sit waiting for the mailbox write was getting wiped by RISC OS when it was doing the RAM clear during boot (and I’m hoping this was the cause of some random crashes I was seeing during the boot sequence on my Pi 2). I’ve now fixed the OS to wake up the other cores and put them into a different sleep loop, located within the HAL. You should be able to get them out of that sleep loop by writing to the same mailboxes as before (core 1 mailbox 3, core 2 mailbox 3, etc.) and then using the SEV instruction to send an event. If you’ve got a serial cable hooked up you should see the core briefly announce itself (‘1’, ‘2’ or ‘3’) before it branches to your code.

Also, one thing I’ve realised is that because the multi-core code will only ever work on the Pi 2 and above, you can use the ARMv7 DSB/DMB memory barrier instructions rather than the MCR barriers (less chance of making a typo!), assuming you’ve got a fairly recent version of the DDE (SEV, DSB/DMB and all the other ARMv7/v8 instructions are available in objasm 4).

Edit: Spoke too soon, just spotted a bug in my code where they’re reading from the wrong mailbox registers :-) Maybe I should actually check that the cores can be woken from the sleep loop this time! |
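For reference, the ARMv7 mnemonics correspond to the CP15 encodings seen earlier in the thread (the MCR forms still work on ARMv7, but are deprecated aliases):

```
        MCR     p15, 0, r0, c7, c10, 4   ; Data Synchronisation Barrier (old form)
        DSB                              ; equivalent ARMv7 mnemonic
        MCR     p15, 0, r0, c7, c10, 5   ; Data Memory Barrier (old form)
        DMB                              ; equivalent ARMv7 mnemonic
        SEV                              ; send event, to wake cores parked in WFE
```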
Jeffrey Lee (213) 6046 posts |
Fix is now checked in, so expect to see it in tomorrow’s ROM. I’ve also made it so that LR is set up to point to the sleep loop, so if you’re testing simple code sequences you can return back there afterwards (just try not to crash the cores, because they won’t have any processor vectors set up!)

Here’s some test code which works (prints “hi” to the serial port – although the debug code in the HAL means it’ll actually print “1hi1”):

ON ERROR PRINT REPORT$;" at ";ERL : END
DIM launcher% 1024
FOR pass=0 TO 2 STEP 2
P%=launcher%
[ OPT pass
SWI "OS_EnterOS"
; Copy the code over (PCI RAM not accessible from user mode, so can't assemble the code to it directly)
ADR r3,code_begin
MOV r4,#code_end-code_begin-4
.copy_loop
LDR r5,[r3,r4]
STR r5,[r2,r4]
SUBS r4,r4,#4
BGE copy_loop
; Start it running
DSB
STR r0, [r1]
DSB
SEV
SWI "OS_LeaveOS"
MOV pc, lr
.code_begin
LDR r0, uart
MOV r1, #ASC("h")
STRB r1, [r0]
MOV r1, #ASC("i")
STRB r1, [r0]
MOV pc, lr
.uart
EQUD &3F000000 + &201000
.code_end
]
NEXT pass
SYS "OS_Memory",13,&40000000,4096 TO ,,,qa7%
SYS "PCI_RAMAlloc",512 TO log%,phys%
A%=phys%
B%=qa7%+&9C : REM core 1 mbox 3
C%=log%
CALL launcher%

This will leak a bit of memory in the PCI heap; fixing that is left as an exercise for the reader (when the cores enter the sleep loop they’ll write a non-zero value to the core 0 mbox <N>, so if you clear that mailbox before starting the core you should be able to wait for it to become non-zero again and then free the memory). |
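A possible shape for that exercise, in BASIC. This sketch assumes a PCI_RAMFree SWI and assumes core 1’s sleep-loop write lands in core 0 mailbox 1, whose read/clear register would sit at QA7 offset &C4 – verify both assumptions against the documentation before relying on them:

```
mbox% = qa7% + &C4           : REM core 0 mailbox 1 read/clear (assumed offset)
!mbox% = -1                  : REM writing bits to read/clear clears them
REM ... write phys% to core 1 mailbox 3 and SEV, as above ...
REPEAT UNTIL !mbox% <> 0     : REM core has returned to the sleep loop
SYS "PCI_RAMFree",log%       : REM assumed SWI name; check the PCI module docs
```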
Jan Rinze (235) 367 posts |
Very interested in this. |
Jeffrey Lee (213) 6046 posts |
The extra cores will start with the MMU off, so you should be able to access video memory directly by physical address. However, finding the physical address might be a bit tricky – RISC OS will have mapped it using 1MB sections, which OS_Memory 0 doesn’t understand. I guess you could read the page tables directly, or use the mailbox interface to query the GPU (which should be fairly straightforward now that BCMSupport is available).
If you want a general solution which will work on all multicore devices then using a RAM-based message queue would make more sense. I think the only time we’d want to use machine-specific features for IPC would be when we need to trigger an interrupt on another core. When the OS receives that interrupt it would know it needs to check its RAM message queue for new messages. |
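As a sketch of that general approach (everything here is illustrative): a single shared word in RAM can act as a one-slot message queue, with barriers ordering the payload against the flag and WFE/SEV providing the wakeup:

```
; Producer (e.g. core 0): publish a payload, then raise the flag
        STR     r2, [r0]        ; r0 -> payload word in shared RAM
        DMB                     ; make the payload visible before the flag
        MOV     r3, #1
        STR     r3, [r1]        ; r1 -> flag word
        DSB                     ; ensure the flag store completes...
        SEV                     ; ...before waking any core waiting in WFE

; Consumer (e.g. core 1): wait for the flag, then read the payload
wait    WFE                     ; sleep until an event arrives
        LDR     r3, [r1]
        CMP     r3, #0
        BEQ     wait
        DMB                     ; order the flag read before the payload read
        LDR     r2, [r0]
```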