PCI_RAMAlloc RAM slow?
Michael Grunditz (467) 531 posts
Don’t know which forum is right for this. I have a driver for a device with an internal DMA controller, and I need to pass the physical address of a RAM buffer to it.
Jeffrey Lee (213) 6046 posts
Memory allocated by PCI_RAMAlloc will be mapped as non-cacheable. This will make it slow for CPU access (especially read access – writes can usually be buffered). Access by DMA controllers should be the same performance as any other memory in the system.

When writing DMA drivers you need to be careful to use the right memory barriers at the right times, otherwise the CPU or DMA may see incorrect data. This generally involves executing a DMB_Write barrier just before you start the DMA transfer (to make sure any data previously written by the CPU arrives before the DMA reads or writes on top of it), and a DMB_Read barrier after you’ve detected that the DMA has completed (so that if you try to read data out of the buffer which was written by the DMA, the CPU won’t be using data prefetched from before the DMA had finished writing).

You can use arbitrary cacheable RAM for DMA, which will result in a buffer that is a lot faster for CPU access. But performing the correct cache maintenance operations, and dealing with remapping of physical pages by the OS, adds a lot of complexity. OS_Memory 19 was designed to help with this – it will perform the cache maintenance and barrier operations for you, and performs the logical → physical address translation.
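To make the barrier placement concrete, here is a minimal sketch of a transfer using an uncacheable buffer from PCI_RAMAlloc. The device registers (dma_addr_reg and so on) and the ‘go’/‘done’ bits are hypothetical; the inline dmb stands in for the DMB_Write/DMB_Read operations mentioned above (older cores use the equivalent CP15 operations instead):

    #include <stdint.h>

    /* Hypothetical device registers and buffer, purely for illustration */
    extern volatile uint32_t *dma_addr_reg;  /* where the device wants the physical address */
    extern volatile uint32_t *dma_len_reg;   /* transfer length register                     */
    extern volatile uint32_t *dma_go_reg;    /* start/status register                        */
    extern uint8_t *buffer;                  /* buffer allocated with PCI_RAMAlloc           */

    /* Data memory barrier (ARMv7 instruction form) */
    static inline void dmb(void)
    {
        __asm__ volatile ("dmb" ::: "memory");
    }

    void do_dma_transfer(uint32_t buffer_phys, uint32_t length)
    {
        /* ... CPU fills 'buffer' with any data the device needs ... */

        /* Write barrier: make sure the CPU's writes to the buffer (and any
           descriptors) have reached memory before the device starts        */
        dmb();

        *dma_addr_reg = buffer_phys;   /* physical address from PCI_RAMAlloc */
        *dma_len_reg  = length;
        *dma_go_reg   = 1;             /* hypothetical 'go' bit               */

        while ((*dma_go_reg & 2) == 0) /* hypothetical 'done' bit             */
            ;

        /* Read barrier: discard anything the CPU speculatively read from the
           buffer before the DMA finished writing it                          */
        dmb();

        /* ... CPU can now safely read the data written by the device ... */
    }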
Michael Grunditz (467) 531 posts
Thanks! Seems like OS_Memory 19 is the way to go, but I am uncertain how to read the docs for it. The Input and Output functions are described in terms of registers – are the functions supposed to point to a list of words? Am I correct that I need to call PCI_RAMAlloc first to actually get the buffer/region?

EDIT: I might understand. The Input function supplies the address of the DMA buffer that I can use for the device’s internal controller, and somehow the buffer will extend itself (?). I also guess that the Output function is used internally… OK, I have to admit I am confused! I am reading the SATA driver, since it uses OS_Memory 19 – it will take some time to understand. Should I do the RAM allocation inside the Input function? In the SATA driver it seems like both functions need to be implemented.
Jeffrey Lee (213) 6046 posts
The overall method for using OS_Memory 19 will be:
1. Allocate your buffer (any cacheable RAM will do).
2. Before starting the transfer, call OS_Memory 19 so that it can perform the cache maintenance and barriers, and give you the physical addresses (via the input and output functions described below).
3. Perform the DMA.
4. Once the transfer has finished, call OS_Memory 19 again so that it can perform the end-of-transfer cache maintenance.
For the typical use case, OS_Memory 19 uses the input function to get the logical address of your buffer, and the output function to tell you what the physical address is (which you’d then use to program the DMA controller).

There is some complexity here – the call is designed to support scatter-gather type I/O, so it will call the input function multiple times in order to get the address of each block (until the input function returns a block with a length of zero). The output function will also be called multiple times – and this may be a different number of times to the input function. This is because it breaks the buffers into physically contiguous blocks (ordinarily the OS doesn’t guarantee that RAM is physically contiguous – which is why PCI_RAMAlloc was created, to provide an easy way of getting physically contiguous RAM).

Also, if DMA is writing to cacheable RAM, and the start/end of an input block isn’t aligned to a cache line boundary, then the block will be split into pieces in order to make sure the middle section is cache line aligned. This could result in three calls to the output function – one for the start of the block (with the “bounce buffer must be used” flag set), one for the cache-aligned middle of the block (which can be used directly by the DMA), and one for the end of the block (again with the “bounce buffer must be used” flag set).

The reason for this is that when a dirty cache line gets written to memory by the CPU, it’s typically the entire cache line that gets written out, even if only one or two bytes had been changed. So a write by the CPU to a byte at the end of a cache line could inadvertently overwrite data which DMA had written to the start of the cache line. If your program was in full control of the memory surrounding the data buffer then this would be fine, but OS_Memory 19 assumes that the memory is being shared by other programs in the system (e.g. the buffer could be in the RMA), so for unaligned writes it’ll request that you perform that part of the transfer using a bounce buffer (which you’d typically allocate using PCI_RAMAlloc).

There are a couple of things which OS_Memory 19 doesn’t make easy to deal with. The first is that you can’t determine ahead of time how much memory to allocate for the address list/descriptor that’s going to be used by the DMA controller. Compared to OS_Memory 0 (which operates on the basis of 4K pages) it’ll probably be shorter, because for long transfers there are likely to be quite a few physically contiguous pages. But worst-case it could be longer, due to the cache alignment constraint.

The second problem is that you still need to deal with Service_PagesUnsafe, which can (theoretically) be triggered by any dynamic area grow/shrink operation, or by any callback (since the callback might grow/shrink a DA). This makes it awkward to grow the DMA descriptor on demand inside your output function implementation – the memory allocation might trigger pages to be moved, invalidating the list of physical addresses you’ve been building. So if you detect a call to Service_PagesUnsafe while you’re building the transfer list, it’s easiest to just throw away the list and start again.
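As a very rough illustration of the flow described above, this C sketch shows what the two callbacks end up doing: the input side hands over logical blocks (terminating with a zero-length block), and the output side records each physically contiguous – and possibly cache-line-split – block in the device’s transfer list. Everything here (dma_desc, MAX_DESCS, the callback signatures) is hypothetical: the real input/output functions are invoked by OS_Memory 19 with a register-based interface described in its documentation, so in practice they would sit behind small assembler veneers rather than plain C functions like these.

    #include <stdint.h>
    #include <stdbool.h>

    #define MAX_DESCS 64   /* worst case can exceed one block per page, because of
                              the cache-line alignment splitting                    */

    /* Hypothetical descriptor format for the device's internal DMA controller */
    typedef struct { uint32_t phys; uint32_t len; } dma_desc;

    static dma_desc descs[MAX_DESCS];
    static bool     via_bounce[MAX_DESCS];
    static unsigned num_descs;

    /* State for the input side: a single logical buffer, handed out once */
    static void    *in_addr;
    static uint32_t in_len;
    static bool     in_done;

    /* Input side: called repeatedly; hand back the next logical block, and a
       zero-length block once there is nothing left                           */
    static void input_func(void **addr, uint32_t *len)
    {
        if (in_done) {
            *addr = NULL;
            *len  = 0;                 /* zero length terminates the input */
        } else {
            *addr   = in_addr;
            *len    = in_len;
            in_done = true;
        }
    }

    /* Output side: called once per physically contiguous (and cache-line-split)
       block; record it in the transfer list for the device                      */
    static void output_func(uint32_t phys, uint32_t len, bool bounce)
    {
        if (num_descs == MAX_DESCS)
            return;                    /* real code would abort and rebuild the list */
        descs[num_descs].phys = phys;
        descs[num_descs].len  = len;
        via_bounce[num_descs] = bounce;  /* this part must go via the bounce buffer */
        num_descs++;
    }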
Michael Grunditz (467) 531 posts
OK
I need contiguous RAM. What happens if the initial buffer is allocated with PCI_RAMAlloc?
I don’t think I have any unaligned writes at all… but if I do, I am trying to understand how the Output function in the SATA driver works. I don’t really understand what’s happening in the bounce case.
I can’t separate what I transfer with DMA or the CPU – it is one or the other, and in this case it will always be DMA. I have started to experiment now, but it just hangs in the SWI call.
Jeffrey Lee (213) 6046 posts
It’ll be logically + physically contiguous, and the pages will be locked in place so that you don’t have to worry about Service_PagesUnsafe. OS_Memory 19 is a bit redundant if you’re using PCI_RAMAlloc.
The SATA code is a bit complicated, because the SATA controller only supports DMA that’s halfword aligned. So both the SATA logic and the OS_Memory 19 logic can cause data to be transferred via the bounce buffer.

Notes:

If OS_Memory 19 indicates that a bounce buffer must be used for a region, then it’s imperative not to perform DMA to that region, and only perform CPU access instead. Let’s say that you’re trying to perform a DMA write to a buffer located at &8000, and the cache line size on the system is 32 bytes (&20). If the transfer length was &110 bytes then the last 16 bytes would only partially cover a cache line. So OS_Memory 19 will break it into two blocks: one from &8000 to &8100, which will have the “use bounce buffer” flag clear, and one from &8100 to &8110, with the “use bounce buffer” flag set. To deal with this, your code would allocate a temporary buffer using PCI_RAMAlloc that’s 16 bytes in length (or re-use a buffer that’s been previously allocated). It can then instruct the device to write the first 256 bytes to the (physical address of) &8000, and the remaining 16 bytes to the bounce buffer. At the end of the transfer, you’ll call OS_Memory 19 again to allow it to perform the end-of-transfer cache maintenance, and then use the CPU to copy the 16 bytes from the bounce buffer to &8100.

If the DMA controller can only deal with a single physically contiguous block (and you can’t break it into multiple transfers), then you have the option of either forcing the entire transfer to go via a bounce buffer whenever you detect that OS_Memory 19 has returned more than one block, or using a buffer which you know will be physically contiguous to begin with (e.g. from PCI_RAMAlloc).

If lack of caching is slowing things down then you can always create your own DA, use a PreGrow handler to request a contiguous region that was detected by OS_Memory 12, and then handle the cache maintenance yourself (which isn’t too hard, since the appropriate cache ops are available via OS_MMUControl 2).
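As a sketch of how a driver might act on those two blocks, using the numbers from the worked example above – program_dma(), bounce_buf and bounce_phys are placeholder names, and the actual OS_Memory 19 and PCI_RAMAlloc calls are omitted:

    #include <stdint.h>
    #include <string.h>

    /* Placeholders for illustration: the bounce buffer would come from
       PCI_RAMAlloc, and program_dma() stands in for building the device's
       descriptor/address list                                              */
    extern uint8_t *bounce_buf;        /* logical address of the bounce buffer */
    extern uint32_t bounce_phys;       /* its physical address                 */
    extern void     program_dma(unsigned slot, uint32_t phys, uint32_t len);

    /* The numbers from the example: a DMA write of &110 bytes to a buffer at
       &8000 with 32-byte cache lines. OS_Memory 19 returns two blocks:
       &8000-&8100 (direct) and &8100-&8110 ("use bounce buffer" flag set).  */
    void example_transfer(uint8_t *buffer, uint32_t block1_phys)
    {
        program_dma(0, block1_phys, 0x100);  /* first 256 bytes: DMA writes directly   */
        program_dma(1, bounce_phys, 0x10);   /* last 16 bytes: DMA writes the bounce   */

        /* ... start the transfer and wait for the device to finish ... */

        /* call OS_Memory 19 again here (end-of-transfer flag) so it can do
           the final cache maintenance and barriers, then:                  */
        memcpy(buffer + 0x100, bounce_buf, 0x10);  /* CPU copies the tail back */
    }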
Michael Grunditz (467) 531 posts
I can have separate buffers, but it will complicate and slow down the process. This is how it works now:
The option of making the buffer cacheable directly from PCI_RAMAlloc is on my wishlist!
Jeffrey Lee (213) 6046 posts
What’s the code doing? That sounds like you have a serious bug in the code that’s interacting with the buffer. Uncached memory shouldn’t be anywhere near as slow as that.
Michael Grunditz (467) 531 posts
There is nothing between the RAM allocation → building up the DMA descriptors (using offsets into the buffer) → handing them to the DMA controller → the hardware does its thing → I try to get the contents (I need to wait until the hardware’s byte-count register reports a full transfer), and it is the last step that takes a long time.

Edit: it is twice RiscPC speed in some areas!