PCI_RAMAlloc RAM slow?
Michael Grunditz (467) 531 posts
Don’t know which forum is right for this. I have a driver for a device with an internal DMA controller, and I need to pass the physical address of a RAM buffer to it.
Jeffrey Lee (213) 6046 posts
Memory allocated by PCI_RAMAlloc will be mapped as non-cacheable. This will make it slow for CPU access (especially read access – writes can usually be buffered). Access by DMA controllers should be the same performance as any other memory in the system.

When writing DMA drivers you need to be careful to use the right memory barriers at the right times, otherwise the CPU or DMA may see incorrect data. This generally involves executing a DMB_Write barrier just before you start the DMA transfer (to make sure any data previously written by the CPU arrives before the DMA reads or writes on top of it), and a DMB_Read barrier after you’ve detected that the DMA has completed (so that if you try to read data out of the buffer which was written by the DMA, the CPU won’t be using data prefetched from before the DMA had finished writing).

You can use arbitrary cacheable RAM for DMA, which will result in a buffer that is a lot faster for CPU access. But performing the correct cache maintenance operations, and dealing with remapping of physical pages by the OS, adds a lot of complexity. OS_Memory 19 was designed to help with this – it will perform the cache maintenance and barrier operations for you, and performs the logical → physical address translation.
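To make the barrier placement concrete, here is a minimal sketch of a transfer using an uncacheable buffer from PCI_RAMAlloc. The device registers (dma_addr_reg and so on) and the ‘go’/‘done’ bits are hypothetical; the inline dmb stands in for the DMB_Write/DMB_Read operations mentioned above (older cores use the equivalent CP15 operations instead):

    #include <stdint.h>

    /* Hypothetical device registers and buffer, purely for illustration */
    extern volatile uint32_t *dma_addr_reg;  /* where the device wants the physical address */
    extern volatile uint32_t *dma_len_reg;   /* transfer length register                     */
    extern volatile uint32_t *dma_go_reg;    /* start/status register                        */
    extern uint8_t *buffer;                  /* buffer allocated with PCI_RAMAlloc           */

    /* Data memory barrier (ARMv7 instruction form) */
    static inline void dmb(void)
    {
        __asm__ volatile ("dmb" ::: "memory");
    }

    void do_dma_transfer(uint32_t buffer_phys, uint32_t length)
    {
        /* ... CPU fills 'buffer' with any data the device needs ... */

        /* Write barrier: make sure the CPU's writes to the buffer (and any
           descriptors) have reached memory before the device starts        */
        dmb();

        *dma_addr_reg = buffer_phys;   /* physical address from PCI_RAMAlloc */
        *dma_len_reg  = length;
        *dma_go_reg   = 1;             /* hypothetical 'go' bit               */

        while ((*dma_go_reg & 2) == 0) /* hypothetical 'done' bit             */
            ;

        /* Read barrier: discard anything the CPU speculatively read from the
           buffer before the DMA finished writing it                          */
        dmb();

        /* ... CPU can now safely read the data written by the device ... */
    }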
Michael Grunditz (467) 531 posts
Thanks! Seems like OS_Memory 19 is the way to go, but I am uncertain how to read the docs for it. The Input and Output functions are described in terms of registers – are the functions supposed to point to a list of words? Am I correct that I need to call PCI_RAMAlloc first to actually get the buffer/region?

EDIT: I might understand. The Input function supplies the address of the DMA buffer that I can use for the device’s internal controller, and somehow the buffer will extend itself (?). I also guess that the Output function is used internally… OK, I have to admit I am confused! I am reading the SATA driver, since it uses OS_Memory 19 – it will take some time to understand. Should I do the RAM allocation inside the Input function? In the SATA driver it seems like both functions need to be implemented.
Jeffrey Lee (213) 6046 posts
The overall method for using OS_Memory 19 will be:
1. Allocate your buffer (any cacheable RAM will do).
2. Before starting the transfer, call OS_Memory 19 so that it can perform the cache maintenance and barriers, and give you the physical addresses (via the input and output functions described below).
3. Perform the DMA.
4. Once the transfer has finished, call OS_Memory 19 again so that it can perform the end-of-transfer cache maintenance.
For the typical use case, OS_Memory 19 uses the input function to get the logical address of your buffer, and the output function to tell you what the physical address is (which you’d then use to program the DMA controller).

There is some complexity here – the call is designed to support scatter-gather type I/O, so it will call the input function multiple times in order to get the address of each block (until the input function returns a block with a length of zero). The output function will also be called multiple times – and this may be a different number of times to the input function. This is because it breaks the buffers into physically contiguous blocks (ordinarily the OS doesn’t guarantee that RAM is physically contiguous – which is why PCI_RAMAlloc was created, to provide an easy way of getting physically contiguous RAM).

Also, if DMA is writing to cacheable RAM, and the start/end of an input block isn’t aligned to a cache line boundary, then the block will be split into pieces in order to make sure the middle section is cache line aligned. This could result in three calls to the output function – one for the start of the block (with the “bounce buffer must be used” flag set), one for the cache-aligned middle of the block (which can be used directly by the DMA), and one for the end of the block (again with the “bounce buffer must be used” flag set).

The reason for this is that when a dirty cache line gets written to memory by the CPU, it’s typically the entire cache line that gets written out, even if only one or two bytes had been changed. So a write by the CPU to a byte at the end of a cache line could inadvertently overwrite data which DMA had written to the start of the cache line. If your program was in full control of the memory surrounding the data buffer then this would be fine, but OS_Memory 19 assumes that the memory is being shared by other programs in the system (e.g. the buffer could be in the RMA), so for unaligned writes it’ll request that you perform that part of the transfer using a bounce buffer (which you’d typically allocate using PCI_RAMAlloc).

There are a couple of things which OS_Memory 19 doesn’t make easy to deal with. The first is that you can’t determine ahead of time how much memory to allocate for the address list/descriptor that’s going to be used by the DMA controller. Compared to OS_Memory 0 (which operates on the basis of 4K pages) it’ll probably be shorter, because for long transfers there are likely to be quite a few physically contiguous pages. But worst-case it could be longer, due to the cache alignment constraint.

The second problem is that you still need to deal with Service_PagesUnsafe, which can (theoretically) be triggered by any dynamic area grow/shrink operation, or by any callback (since the callback might grow/shrink a DA). This makes it awkward to grow the DMA descriptor on demand inside your output function implementation – the memory allocation might trigger pages to be moved, invalidating the list of physical addresses you’ve been building. So if you detect a call to Service_PagesUnsafe while you’re building the transfer list, it’s easiest to just throw away the list and start again.
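As a very rough illustration of the flow described above, this C sketch shows what the two callbacks end up doing: the input side hands over logical blocks (terminating with a zero-length block), and the output side records each physically contiguous – and possibly cache-line-split – block in the device’s transfer list. Everything here (dma_desc, MAX_DESCS, the callback signatures) is hypothetical: the real input/output functions are invoked by OS_Memory 19 with a register-based interface described in its documentation, so in practice they would sit behind small assembler veneers rather than plain C functions like these.

    #include <stdint.h>
    #include <stdbool.h>

    #define MAX_DESCS 64   /* worst case can exceed one block per page, because of
                              the cache-line alignment splitting                    */

    /* Hypothetical descriptor format for the device's internal DMA controller */
    typedef struct { uint32_t phys; uint32_t len; } dma_desc;

    static dma_desc descs[MAX_DESCS];
    static bool     via_bounce[MAX_DESCS];
    static unsigned num_descs;

    /* State for the input side: a single logical buffer, handed out once */
    static void    *in_addr;
    static uint32_t in_len;
    static bool     in_done;

    /* Input side: called repeatedly; hand back the next logical block, and a
       zero-length block once there is nothing left                           */
    static void input_func(void **addr, uint32_t *len)
    {
        if (in_done) {
            *addr = NULL;
            *len  = 0;                 /* zero length terminates the input */
        } else {
            *addr   = in_addr;
            *len    = in_len;
            in_done = true;
        }
    }

    /* Output side: called once per physically contiguous (and cache-line-split)
       block; record it in the transfer list for the device                      */
    static void output_func(uint32_t phys, uint32_t len, bool bounce)
    {
        if (num_descs == MAX_DESCS)
            return;                    /* real code would abort and rebuild the list */
        descs[num_descs].phys = phys;
        descs[num_descs].len  = len;
        via_bounce[num_descs] = bounce;  /* this part must go via the bounce buffer */
        num_descs++;
    }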
Michael Grunditz (467) 531 posts
OK
I need contiguous RAM. What happens if the initial buffer is allocated with PCI_RAMAlloc?
I don’t think I have any unaligned writes at all… but if I do, I am trying to understand how the Output function in the SATA driver works. I don’t really understand what’s happening in the bounce case.
I can’t separate what I transfer with DMA or the CPU – it is one or the other, and in this case it will always be DMA. I have started to experiment now, but it just hangs in the SWI call.
Jeffrey Lee (213) 6046 posts
It’ll be logically + physically contiguous, and the pages will be locked in place so that you don’t have to worry about Service_PagesUnsafe. OS_Memory 19 is a bit redundant if you’re using PCI_RAMAlloc.
The SATA code is a bit complicated, because the SATA controller only supports DMA that’s halfword aligned. So both the SATA logic and the OS_Memory 19 logic can cause data to be transferred via the bounce buffer.

Notes:

If OS_Memory 19 indicates that a bounce buffer must be used for a region, then it’s imperative not to perform DMA to that region, and only perform CPU access instead. Let’s say that you’re trying to perform a DMA write to a buffer located at &8000, and the cache line size on the system is 32 bytes (&20). If the transfer length was &110 bytes then the last 16 bytes would only partially cover a cache line. So OS_Memory 19 will break it into two blocks: one from &8000 to &8100, which will have the “use bounce buffer” flag clear, and one from &8100 to &8110, with the “use bounce buffer” flag set. To deal with this, your code would allocate a temporary buffer using PCI_RAMAlloc that’s 16 bytes in length (or re-use a buffer that’s been previously allocated). It can then instruct the device to write the first 256 bytes to the (physical address of) &8000, and the remaining 16 bytes to the bounce buffer. At the end of the transfer, you’ll call OS_Memory 19 again to allow it to perform the end-of-transfer cache maintenance, and then use the CPU to copy the 16 bytes from the bounce buffer to &8100.

If the DMA controller can only deal with a single physically contiguous block (and you can’t break it into multiple transfers), then you have the option of either forcing the entire transfer to go via a bounce buffer whenever you detect that OS_Memory 19 has returned more than one block, or using a buffer which you know will be physically contiguous to begin with (e.g. from PCI_RAMAlloc).

If lack of caching is slowing things down then you can always create your own DA, use a PreGrow handler to request a contiguous region that was detected by OS_Memory 12, and then handle the cache maintenance yourself (which isn’t too hard, since the appropriate cache ops are available via OS_MMUControl 2).
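As a sketch of how a driver might act on those two blocks, using the numbers from the worked example above – program_dma(), bounce_buf and bounce_phys are placeholder names, and the actual OS_Memory 19 and PCI_RAMAlloc calls are omitted:

    #include <stdint.h>
    #include <string.h>

    /* Placeholders for illustration: the bounce buffer would come from
       PCI_RAMAlloc, and program_dma() stands in for building the device's
       descriptor/address list                                              */
    extern uint8_t *bounce_buf;        /* logical address of the bounce buffer */
    extern uint32_t bounce_phys;       /* its physical address                 */
    extern void     program_dma(unsigned slot, uint32_t phys, uint32_t len);

    /* The numbers from the example: a DMA write of &110 bytes to a buffer at
       &8000 with 32-byte cache lines. OS_Memory 19 returns two blocks:
       &8000-&8100 (direct) and &8100-&8110 ("use bounce buffer" flag set).  */
    void example_transfer(uint8_t *buffer, uint32_t block1_phys)
    {
        program_dma(0, block1_phys, 0x100);  /* first 256 bytes: DMA writes directly   */
        program_dma(1, bounce_phys, 0x10);   /* last 16 bytes: DMA writes the bounce   */

        /* ... start the transfer and wait for the device to finish ... */

        /* call OS_Memory 19 again here (end-of-transfer flag) so it can do
           the final cache maintenance and barriers, then:                  */
        memcpy(buffer + 0x100, bounce_buf, 0x10);  /* CPU copies the tail back */
    }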
Michael Grunditz (467) 531 posts
I can have separate buffers, but it will complicate and slow down the process. This is how it works now:
The option of making the buffer cacheable directly from PCI_RAMAlloc is on my wishlist!
Jeffrey Lee (213) 6046 posts
What’s the code doing? That sounds like you have a serious bug in the code that’s interacting with the buffer. Uncached memory shouldn’t be anywhere near as slow as that.
Michael Grunditz (467) 531 posts
There is nothing between the RAM allocation → building up the DMA descriptors (using offsets into the buffer) → handing them to the DMA controller → the hardware does its thing → I try to get the contents (I need to wait until the hardware’s byte-count register reports a full transfer), and it is the last step that takes a long time.

Edit: it is twice RiscPC speed in some areas!