2009-11-14

Woopsi Updates and DMA Mayhem

I’ve been on a graphics and refactoring kick recently, and the latest set of changes reflects that. I’ve been ripping out, refactoring and bugfixing graphics code throughout Woopsi.

The Rect struct, which describes a rectangle (x/y/width/height), is used all over the Woopsi code. However, it was nested within the Gadget class, which made using it intensely annoying. Any attempt to create a Rect had to use the fully-qualified Gadget::Rect name (or WoopsiUI::Gadget::Rect if working outside of the WoopsiUI namespace). To fix this, I have moved the Rect struct into a separate header file. Rects can now be created simply by using the typename “Rect”. Much better.
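
As a rough illustration of what a standalone rectangle type allows, here is a minimal sketch. The field types and the rectContains() helper are assumptions made for the example; they are not copied from Woopsi's header.

```cpp
#include <cstdint>

// Hypothetical standalone rectangle type. Field names follow the
// x/y/width/height description above; the exact types are an assumption.
struct Rect {
    std::int16_t x;
    std::int16_t y;
    std::int32_t width;
    std::int32_t height;
};

// Hypothetical helper: true if the point (px, py) lies inside the rect.
inline bool rectContains(const Rect& r, int px, int py) {
    return px >= r.x && px < r.x + r.width &&
           py >= r.y && py < r.y + r.height;
}
```

With the struct at namespace level, calling code can write "Rect r = { 10, 20, 30, 40 };" without any Gadget:: qualification.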

The SuperBitmap class included facade methods for drawing to the internal bitmap. For example, it had a “drawText()” method that would just call “_bitmap->drawText()“. Since the bitmap no longer includes drawing methods, this became “_graphics->drawText()“. “_graphics” is a pointer to a Graphics object that can draw to the bitmap. However, this means that the SuperBitmap class is more cumbersome than it needs to be. Why not simply expose the Graphics object and get rid of the facade methods?

I’ve now done this. Drawing to a SuperBitmap used to work like this (semi-pseudocode):

SuperBitmap* bitmap = new SuperBitmap();
bitmap->drawText("sometext");

It now works like this:

SuperBitmap* bitmap = new SuperBitmap();
bitmap->getGraphics()->drawText("sometext");

The examples directory contains two new demos: bitmapdrawing and gadgetdrawing. They’re almost identical, except the first draws to a bitmap (displayed via a SuperBitmap gadget), whilst the second draws directly to an AmigaWindow. The first is, therefore, a demo of how to do persistent drawing, whilst the second is a demo of how to do non-persistent (but mildly faster) drawing.

Whilst writing these examples I came across a number of bugs. Some of them were created whilst consolidating the Bitmap and GraphicsPort drawing methods into a new hierarchy, but some have been around for a while. The GraphicsPort::drawPixel() and drawXORPixel() methods now both clip correctly. The GraphicsPort::drawLine() method now draws to the correct framebuffer (it previously only drew to the bottom framebuffer). Graphics::drawBitmap() now clips correctly if the co-ordinates for the destination exceed the size of the destination bitmap, instead of crashing as it did previously.
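
The kind of clipping involved in the drawBitmap() fix can be sketched as an intersection between the requested rectangle and the destination bitmap. This is a simplified stand-in and not Woopsi's actual clipping code; the ClipResult struct and clipToDestination() function are hypothetical names.

```cpp
#include <algorithm>

// Hypothetical result type: the part of the requested area that is
// actually inside the destination. Width or height of 0 means "draw nothing".
struct ClipResult {
    int x;
    int y;
    int width;
    int height;
};

// Clip the rectangle (x, y, width, height) to a destination bitmap of
// destWidth x destHeight pixels.
inline ClipResult clipToDestination(int x, int y, int width, int height,
                                    int destWidth, int destHeight) {
    int x1 = std::max(x, 0);
    int y1 = std::max(y, 0);
    int x2 = std::min(x + width, destWidth);
    int y2 = std::min(y + height, destHeight);

    ClipResult result = { x1, y1, std::max(x2 - x1, 0), std::max(y2 - y1, 0) };
    return result;
}
```

A caller would skip the copy entirely when the clipped width or height is 0, and otherwise copy only the clipped area.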

Lastly, I’ve been trying to fix a long-standing problem with the DS’ data cache and its interaction with the DMA hardware. Here’s what happens when the ARM9 tries to access a piece of data:

  • ARM9 will attempt to read the data cache;
    • If data found, ARM9 will continue as normal;
    • If no data found, ARM9 will read main memory.

Now here’s what happens when the DMA hardware is used:

  • DMA will read data from main memory;
  • DMA will write data to main memory.

The DMA hardware cannot see the cache. Also, if the DMA hardware changes main memory but that memory is cached, the ARM9 will read the (outdated) cache instead of main memory.

Using the DMA therefore requires that the cache is correctly updated like this:

  • Write cache to main memory;
  • Use DMA to copy from main memory to main memory;
  • Mark the cache as invalid so the ARM9 fetches new data from main memory.
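
The three steps can be illustrated with a toy model. This is not DS code and does not use libnds; ToySystem and all of its members are hypothetical, and the "cache" is tracked per address instead of per cache line to keep the sketch short.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy model: "ram" is what the DMA sees, "cache" is what the CPU sees.
struct ToySystem {
    static constexpr std::size_t SIZE = 8;
    std::array<std::uint16_t, SIZE> ram{};
    std::array<std::uint16_t, SIZE> cache{};
    std::array<bool, SIZE> cached{};  // address currently held in the cache?
    std::array<bool, SIZE> dirty{};   // cached copy newer than RAM?

    // CPU writes land in the cache only (write-back behaviour)
    void cpuWrite(std::size_t addr, std::uint16_t value) {
        cache[addr] = value;
        cached[addr] = true;
        dirty[addr] = true;
    }

    // CPU reads prefer the cache and fall back to RAM on a miss
    std::uint16_t cpuRead(std::size_t addr) {
        if (!cached[addr]) {
            cache[addr] = ram[addr];
            cached[addr] = true;
        }
        return cache[addr];
    }

    // Step 1: write dirty cached data back to RAM
    void flush(std::size_t addr, std::size_t count) {
        for (std::size_t i = addr; i < addr + count; ++i) {
            if (cached[i] && dirty[i]) {
                ram[i] = cache[i];
                dirty[i] = false;
            }
        }
    }

    // Step 2: the DMA copies RAM to RAM and never looks at the cache
    void dmaCopy(std::size_t src, std::size_t dst, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) {
            ram[dst + i] = ram[src + i];
        }
    }

    // Step 3: drop cached copies so the next CPU read goes to RAM
    void invalidate(std::size_t addr, std::size_t count) {
        for (std::size_t i = addr; i < addr + count; ++i) {
            cached[i] = false;
            dirty[i] = false;
        }
    }
};
```

In this model, skipping flush() makes the DMA copy whatever stale value RAM held before the CPU write, and skipping invalidate() makes cpuRead() return the old cached value of the destination even though RAM has changed.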

Woopsi was only performing the first of these three actions, and it wasn’t doing so consistently. The result of this is that, no matter what I do, the last 4 pixels of every bitmap I attempt to blit to another bitmap are not drawn. Some research led me to two useful sources: a GBADEV topic and a blog post from cearn on coranac.com.

Cearn gives the code for replacement copy and fill functions that perform a variety of checks, such as ensuring that the cache is written to RAM before trying to copy and checking that the copy is working with legal data.

However, trying to replace Woopsi’s existing DMA code with cearn’s results in a very nasty and very immediate crash. I initially assumed that his solution was incorrect. From the information in the GBADEV forum it seems that all calls to DC_FlushRange() and DC_InvalidateRange() must be performed on memory that is aligned to 32-bit boundaries. His functions do not check or enforce this.

I wrote replacements that took the most useful parts of cearn’s code and mixed in some alignment-enforcing jiggery-pokery to ensure that all cache handling is done to the correct boundaries. This, however, failed in exactly the same way. The call to DC_InvalidateRange() kills Woopsi dead before it even appears on screen. Remove this, and it works - except those last damned pixels still aren’t drawn.
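
For illustration, one way of enforcing alignment is to widen a byte range outwards until it starts and ends on a cache line boundary. This is a sketch of the general technique and not the code I actually used; the AlignedRange struct and alignToCacheLines() function are hypothetical, and the 32-byte line size is an assumption (it is the figure cearn gives in the comments below).

```cpp
#include <cstdint>

// Assumed cache line size in bytes; change this if the assumption is wrong.
const std::uint32_t CACHE_LINE_SIZE = 32;

// Hypothetical result type: an address range covering whole cache lines.
struct AlignedRange {
    std::uint32_t start;
    std::uint32_t length;
};

// Expand [address, address + length) outwards to whole cache lines.
inline AlignedRange alignToCacheLines(std::uint32_t address, std::uint32_t length) {
    const std::uint32_t mask = CACHE_LINE_SIZE - 1;

    // Round the start down and the end up to the nearest line boundary
    std::uint32_t start = address & ~mask;
    std::uint32_t end = (address + length + mask) & ~mask;

    AlignedRange range = { start, end - start };
    return range;
}
```

Note that a widened range also covers whatever unrelated data shares the first and last cache lines, which matters a great deal more for an invalidate than for a flush.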

Some more research on GBADEV threw up this thread, in which it is determined that memcpy() is actually faster than the DMA when working with main memory. DMA is faster when working with VRAM. This does make sense. VRAM is uncached, so memcpy() will always have to go to main memory to fetch data. The DMA, on the other hand, does not need the cache to be flushed before it can see the latest state. The situation is reversed when dealing with main memory. memcpy() may be able to use the cache, whilst the cache must be flushed before the DMA can do its job. Using the DMA with main memory, therefore, will always result in cache misses somewhere along the line.

It then occurred to me that I could write a function that would use a for-loop when working with main memory and the DMA code when working with VRAM. The most obvious way to tackle this is to check if the source and destination pointers fall within the framebuffer address space. If so, use the DMA. If not, default to memcpy() instead. Some fiddling later and I have copy and fill methods that are theoretically faster than the original macros borrowed from PALib and draw everything correctly.
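
The address check at the heart of this can be sketched as a pair of small predicates. The function names are hypothetical, and the constants assume a 256x192 screen of 16-bit pixels with the two framebuffers at 0x06000000 and 0x06200000; only the decision is shown, not the DMA call itself.

```cpp
#include <cstdint>

// Assumed layout: one framebuffer of 256x192 16-bit pixels at each address
const std::uint32_t FRAMEBUFFER_BYTES = 256 * 192 * 2;
const std::uint32_t MAIN_FRAMEBUFFER = 0x06000000;
const std::uint32_t SUB_FRAMEBUFFER = 0x06200000;

// True if the address falls inside either framebuffer
inline bool isFramebufferAddress(std::uint32_t address) {
    bool inMain = (address >= MAIN_FRAMEBUFFER) &&
                  (address < MAIN_FRAMEBUFFER + FRAMEBUFFER_BYTES);
    bool inSub = (address >= SUB_FRAMEBUFFER) &&
                 (address < SUB_FRAMEBUFFER + FRAMEBUFFER_BYTES);
    return inMain || inSub;
}

// Use the DMA only when both ends of the copy are in framebuffer VRAM;
// everything else falls back to a CPU copy.
inline bool canUseDma(std::uint32_t source, std::uint32_t dest) {
    return isFramebufferAddress(source) && isFramebufferAddress(dest);
}
```

Keeping the test in its own function means the copy and fill routines can share it.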

I’m considering removing external access to the mutable u16 array inside bitmaps that inherit from the MutableBitmapBase class. This would allow the FrameBuffer class to be a wrapper around an SDL screen (conditional compilation shenanigans ahoy) and remove the double-copying bottleneck that currently exists. This would make Woopsi almost immediately portable to any handheld with a touchscreen and an SDL port.

To this end, all mutable bitmaps now have a blit() method, that can use the DMA hardware to copy a u16 array to a specific co-ordinate within itself, and a blitFill() method, that can use the DMA hardware to fill memory with a specified colour. There is now no reason, other than speed, for any object to get access to the non-const raw bitmap array within a Bitmap class.

Related to those methods, bitmaps also have a getData() method that will return a const pointer to the internal bitmap array at a specific set of co-ordinates. Instead of mucking about trying to calculate where rows of pixels start within the array, it’s now possible to just ask a bitmap to give you a pointer straight to any pixel.
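
A stripped-down sketch of the idea follows. ToyBitmap is a hypothetical stand-in, not the real Woopsi class: the method names match those described above, but the signatures are assumptions, a plain loop stands in for the DMA, and there is no bounds checking.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal bitmap that hides its pixel array
class ToyBitmap {
public:
    ToyBitmap(int width, int height)
        : _width(width), _height(height),
          _data(static_cast<std::size_t>(width * height), 0) {}

    // Const pointer straight to the pixel at (x, y)
    const std::uint16_t* getData(int x, int y) const {
        return _data.data() + x + (y * _width);
    }

    // Copy "count" pixels from source into the bitmap, starting at (x, y)
    void blit(int x, int y, const std::uint16_t* source, int count) {
        std::uint16_t* dest = _data.data() + x + (y * _width);
        for (int i = 0; i < count; ++i) {
            dest[i] = source[i];
        }
    }

    // Fill "count" pixels with a colour, starting at (x, y)
    void blitFill(int x, int y, std::uint16_t colour, int count) {
        std::uint16_t* dest = _data.data() + x + (y * _width);
        for (int i = 0; i < count; ++i) {
            dest[i] = colour;
        }
    }

private:
    int _width;
    int _height;
    std::vector<std::uint16_t> _data;
};
```

Because callers only ever see a const pointer, the storage behind the class could later be swapped for something else without touching the calling code.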

Comments

cearn on 2009-11-18 at 13:53 said:

It shouldn’t be necessary to align the addresses passed to DC_FlushRange() or DC_InvalidateRange(): those functions align to 32-byte (not bit) boundaries internally. It’s odd that this would cause a crash.

What might be the problem is the following. Suppose you have your two bitmaps (which I assume are dynamically allocated). Chances are that these are not 32-byte aligned. What DC_InvalidateRange() does is clean all the cache lines that include the given range. In case of unaligned heads or tails, that whole block will be invalidated as well, including things that shouldn’t have been invalidated at all. If, for example, the bitmap ends halfway the cacheline and right behind it is some recently updated pointer data (that is, it’s in cache), the correct pointer would be thrown away by the invalidate. I think you can imagine what would happen next.

Regarding why RAM-RAM DMA is so slow, I think it works a little differently than what’s written here. DMA doesn’t require a cache flush to do its job; it needs a cache flush to do its job correctly. It will happily copy data without it if you tell it to, it’ll just come out wrong. I think you may have switched memcpy() and DMA in some places: memcpy() doesn’t always need to use main memory (thanks to cache), and DMA does need a flush for it to see the most recent state.

Also note that RAM->VRAM DMA-copies are actually about twice as fast as you can get even with good assembly; it’s only for RAM->RAM that it’s slow. Apparently, there’s something about RAM-RAM copies that causes the DMA to be unable to do reads and writes in parallel (which I think is what happens normally), and it changes the timings of each read/write. http://nocash.emubase.de/gbatek.htm#dsdmatransfers has a little more on this.

Does the problem with the last 4 pixels occur for every bitmap, or only some? Is it something with DMA only, or with other copy methods as well?

Ant on 2009-11-18 at 18:34 said:

cearn :

What might be the problem is the following. Suppose you have your two bitmaps (which I assume are dynamically allocated). Chances are that these are not 32-byte aligned. What DC_InvalidateRange() does is clean all the cache lines that include the given range. In case of unaligned heads or tails, that whole block will be invalidated as well, including things that shouldn’t have been invalidated at all. If, for example, the bitmap ends halfway the cacheline and right behind it is some recently updated pointer data (that is, it’s in cache), the correct pointer would be thrown away by the invalidate. I think you can imagine what would happen next.

All of Woopsi’s bitmaps are 4-byte aligned, but not 32-byte aligned. Interesting; I wonder if that’s the problem. However, if I call DC_FlushRange before I use the DMA, then call DC_InvalidateRange afterwards, with the same range passed to both functions (which I am doing), I’d have thought that there should be no chance that there would be data invalidated by the second function call that wasn’t already flushed to main memory by the first function call.

Regarding why RAM-RAM DMA is so slow, I think it works a little differently than what’s written here. DMA doesn’t require a cache flush to do its job; it needs a cache flush to do its job correctly. It will happily copy data without it if you tell it to, it’ll just come out wrong. I think you may have switched memcpy() and DMA in some places: memcpy() doesn’t always need to use main memory (thanks to cache), and DMA does need a flush for it to see the most recent state.

I intentionally switched it in a few places, though I probably didn’t do a great job of explaining the situation. I was suggesting that memcpy() always needs to use main memory when working with VRAM.

As I understand it, VRAM isn’t cached, as it’s a physically separate region of memory mapped into the main address space at 0x06000000. Therefore, memcpy() does not benefit from cache hits when copying from VRAM. DMA, on the other hand, must always work with main memory regardless of which area of memory it is using. Whilst memcpy() could be accelerated by a cache hit, DMA has no such optimisation. Worse, in order for a DMA copy to work correctly, the cache must first be flushed. It makes sense, then, that DMA will be faster in an area of memory that the CPU does not cache, whilst memcpy() will be faster everywhere else.

I didn’t know about the RAM->VRAM speed. It’d be good if I could get the DMA working for that.

Does the problem with the last 4 pixels occur for every bitmap, or only some? Is it something with DMA only, or with other copy methods as well?

Just DMA copies. Using a for-loop works fine. Strangely, using a memcpy() crashes too. This suggests that something is very wrong somewhere, but it’s bizarre that a for-loop should work.

This is the copy routine I’m currently using:

void woopsiDmaCopy(const u16* source, u16* dest, u32 count) {

    // Get memory addresses of source and destination
    u32 srca = (u32)source;
    u32 dsta = (u32)dest;

    // Precalculate the size of a single framebuffer for speed
    u32 bmpSize = SCREEN_WIDTH * SCREEN_HEIGHT * 2;

    // Precalculate boundaries of framebuffer VRAM
    u32 bmp[4];
    bmp[0] = 0x06000000;
    bmp[1] = 0x06000000 + bmpSize;
    bmp[2] = 0x06200000;
    bmp[3] = 0x06200000 + bmpSize;

    // Use DMA hardware if both source and destination are within VRAM
    if (((srca >= bmp[0]) && (srca < bmp[1])) ||
        ((srca >= bmp[2]) && (srca < bmp[3]))) {
        if (((dsta >= bmp[0]) && (dsta < bmp[1])) ||
            ((dsta >= bmp[2]) && (dsta < bmp[3]))) {

            // libnds DMA functions work in bytes
            count *= 2;

            // Choose fastest DMA copy mode
            if((srca|dsta|count) & 3)
                dmaCopyHalfWords(3, source, dest, count);
            else
                dmaCopyWords(3, source, dest, count);

            return;
        }
    }

    // Cannot use DMA as not working exclusively with VRAM
    // Use for-loop instead
    for (u32 i = 0; i < count; i++) {
        *(dest + i) = *(source + i);
    }
}

I only ever need to copy u16 data and never try to copy to illegal areas of memory, so I’ve chopped out some of your checks/void pointers.

Interestingly, calling DC_InvalidateAll() crashes my DS in the same way as trying to invalidate a range.

cearn on 2009-11-21 at 11:22 said:

I so hope wordpress doesn’t mangle the html.

Ant :
cearn : What might be the problem is the following. Suppose you have your two bitmaps (which I assume are dynamically allocated). Chances are that these are not 32-byte aligned. What DC_InvalidateRange() does is clean all the cache lines that include the given range. In case of unaligned heads or tails, that whole block will be invalidated as well, including things that shouldn’t have been invalidated at all. If, for example, the bitmap ends halfway the cacheline and right behind it is some recently updated pointer data (that is, it’s in cache), the correct pointer would be thrown away by the invalidate. I think you can imagine what would happen next.

All of Woopsi’s bitmaps are 4-byte aligned, but not 32-byte aligned. Interesting; I wonder if that’s the problem. However, if I call DC_FlushRange before I use the DMA, then call DC_InvalidateRange afterwards, with the same range passed to both functions (which I am doing), I’d have thought that there should be no chance that there would be data invalidated by the second function call that wasn’t already flushed to main memory by the first function call.

Yeah, I agree with that. Now that I think of it, maybe the invalidate isn’t even necessary if you flush the range before the DMA. Since a flush should always be safe to perform, maybe that’d be a better idea. Then again, if it still crashes, maybe it’s something else entirely :(

Ant : As I understand it, VRAM isn’t cached, as it’s a physically separate region of memory mapped into the main address space at 0x06000000. Therefore, memcpy() does not benefit from cache hits when copying from VRAM. DMA, on the other hand, must always work with main memory regardless of which area of memory it is using. Whilst memcpy() could be accelerated by a cache hit, DMA has no such optimisation. Worse, in order for a DMA copy to work correctly, the cache must first be flushed. It makes sense, then, that DMA will be faster in an area of memory that the CPU does not cache, whilst memcpy() will be faster everywhere else. I didn’t know about the RAM→VRAM speed. It’d be good if I could get the DMA working for that.

I’m still a little fuzzy on what you mean with “main memory”. I thought you meant “main RAM” (that is, the 0x02000000 range), but now I’m not so sure. If you simply mean “any uncached memory”, then I understand what you’re saying. Main RAM is only relevant to the copying if either the source or destination pointer points there. Even though it’s unlikely to happen, a palRAM→VRAM would not use main RAM in any way. (Yeah, I know I’m being pedantic here, but loose definitions have a tendency to become problematic when it comes to tech stuff.)

I still think the analysis isn’t quite right, though. The thing about cache isn’t that memcpy() is fast because of it, but rather that it’s not as slow as the pure waitstates would suggest. Before you can get a cache-hit, you have to have had a cache-miss first, which loads up a whole cache line and actually costs more than a single access. DMA seems to have a method to get round the RAM waitstates entirely, as evidenced by the main-RAM→VRAM transfer speed. Only for mRAM→mRAM is DMA ridiculously slow.

Here are some numbers for copying and filling memory (in cycles/bytes) for word-aligned cases. Maybe they can be of use in understanding what exactly goes on here. armcpy/set() are my own asm functions, swiFillH is effectively a halfword-copy loop (what you’re using right now); the rest should be self-explanatory.

copy                  memcpy  armcpy  swiCopyH  dmaCopyH  dmaCopyW  dmaCopySafe
main RAM → main RAM   3.00    1.69    4.75      8.00      4.50      4.83
main RAM → VRAM       2.38    1.47    4.69      1.04      0.79      0.96
VRAM → main RAM       2.25    1.28    4.50      1.02      1.02      1.18
VRAM → VRAM           2.69    1.25    5.50      1.00      1.00      1.00

fill                  memset  armset  swiFillH  dmaFillH  dmaFillW  dmaFillSafe
0 → main RAM          2.25    0.94    4.00      1.02      0.77      0.92
0 → VRAM              1.44    0.66    3.50      1.00      0.75      0.75

Ant :
cearn : Does the problem with the last 4 pixels occur for every bitmap, or only some? Is it something with DMA only, or with other copy methods as well?

Just DMA copies. Using a for-loop works fine. Strangely, using a memcpy() crashes too. This suggests that something is very wrong somewhere, but it’s bizarre that a for-loop should work.

Yeah. I know that memcpy() can copy incorrectly to VRAM at times, but I can’t think of anything that should cause it to actually crash. If it crashes consistently, it may be possible to find out what’s going on by going through the assembly.

Ant : This is the copy routine I’m currently using: *snipzors code* I only ever need to copy u16 data and never try to copy to illegal areas of memory, so I’ve chopped out some of your checks/void pointers. Interestingly, calling DC_InvalidateAll() crashes my DS in the same way as trying to invalidate a range.

That woopsiDmaCopy() function should work alright. I think a call to swiCopy() will be faster than a manual loop though.

That DC_InvalidateAll() crashes things doesn’t really surprise me. The cache has the most up-to-date data. Invalidating it tosses that all out and you’re effectively stuck with data with different timestamps. It’s like if you try to commit parts of last year’s Woopsi into the current SVN: it might work, but the likelihood is pretty small.

Ant on 2009-11-24 at 10:44 said:

cearn :

I’m still a little fuzzy on what you mean with “main memory”. I thought you meant “main RAM” (that is, the 0x02000000 range), but now I’m not so sure. If you simply mean “any uncached memory”, then I understand what you’re saying. Main RAM is only relevant to the copying if either the source or destination pointer points there. Even though it’s unlikely to happen, a palRAM→VRAM would not use main RAM in any way. (Yeah, I know I’m being pedantic here, but loose definitions have a tendency to become problematic when it comes to tech stuff.)

Yep, that’s it - any uncached memory.

cearn :

Here are some numbers for copying and filling memory (in cycles/bytes) for word-aligned cases. Maybe they can be of use in understanding what exactly goes on here. armcpy/set() are my own asm functions, swiFillH is effectively a halfword-copy loop (what you’re using right now); the rest should be self-explanatory.

Looks like armcpy beats just about everything else by a pretty wide margin, except for those occasions when the DMA actually works properly. Is the source to that somewhere on your site?

cearn on 2009-11-26 at 18:55 said:

I’ve zipped it here: http://www.coranac.com/files/misc/armfuncs.zip . I’m using a few macros to create the standard assembly function boilerplate. They make defining functions easier, but I could convert them to pure asm if you want. I have tested these functions quite a bit and there shouldn’t be any problem with regards to alignment, but if memcpy() makes stuff go bewm I really can’t say how this will affect things.

Ant on 2009-11-28 at 17:08 said:

Thanks! I’ll have to try and work out how to include this…