3D support for ATI 6xx/7xx update
jbridgman
[July 14 status update : glxgears buffer aging problem found & fixed, texture code pushed but not sure of status. Buffer swap (back to front copy) being done on CPU rather than GPU so glxgears gives ~15fps rather than ~1500. Most testing is being done on rv730/770; there were a few color problems on 6xx and RS780 last time I looked, probably still there for now. See Alex's blog entry for more details - www.botchco.com/agd5f/]

The 6xx/7xx 3D driver is starting to do useful things again after moving over to the radeon-rewrite mesa code base. As of last night, it seems to be behaving properly on 14 of the 63 tests in progs/redbook, drawing incorrectly on 24, and either not drawing or crashing on the remaining 25. Cooper found that the following tests rendered correctly:

hello, plane, torus, list, aargb, smooth, tess, varray, tesswind, model, anti, bezcurve, picksquare, and cube.

The driver currently segfaults after running ~10 frames of glxgears - Richard is looking into that. Alex is hooking up the texture code to the new bufmgr code, and Cooper is going through the redbook tests and picking off problems as he finds them. Development and testing are primarily being done on discrete graphics cards; I'm not sure of the current status on 760/780/790 IGP parts.

The r6xx-rewrite branch was synced up with mesa master recently, so merging it into master should be fairly straightforward. A bigger question is how and when we decide that the API to the drm is not likely to require further changes, ie when it makes sense to ask to merge the new drm ioctl support into the main kernel tree. The changes are relatively small and the chance of breaking any *other* support is extremely low, so maybe there's a chance of making 2.6.31 but we haven't had that discussion yet AFAIK.

If you want to try the latest code (with the caveat that it is *not* ready for general use), you want the r6xx-rewrite branch of mesa/mesa, and the r6xx-r7xx-3d branch of ~agd5f/drm. The drm code works with 2.6.28 and earlier, but has problems with 2.6.29 and higher.

How the X (aka 2D) driver affects 3D performance
jbridgman
I see this discussion come up on IRC from time to time. The most common observation is that "2D acceleration made my 3D run slower", but there are other interactions worth understanding as well.

3D acceleration is in development but right now nearly everyone is using software rendering when running open source drivers on a 6xx or 7xx family GPU. As a result, 3D performance is determined primarily by (a) how fast the CPU is, and (b) how quickly the CPU can access the buffer where drawing is being done.

Over the last few months users have been switching from "shadowfb" acceleration to hardware acceleration using the EXA and Xv APIs. Shadowfb acceleration still uses software rendering, but places the primary copy of the frame buffer in system memory rather than video memory, and periodically blits the updated frame buffer contents to video memory to make it appear on your screen. Shadowfb acceleration gives you good performance because the software rendering is being done in system memory, which the CPU can access very quickly, rather than in GPU memory, which is much slower for CPU accesses.
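The shadowfb idea can be sketched in a few lines. This is a toy model only, not the actual X server code: the class name, the row-granular dirty tracking, and the use of a second bytearray to stand in for video memory are all assumptions made for illustration.

```python
class ShadowFramebuffer:
    """Toy model of shadowfb: draw in system memory, blit dirty rows to video memory."""

    def __init__(self, width, height, bytes_per_pixel=4):
        self.width = width
        self.height = height
        self.bpp = bytes_per_pixel
        self.pitch = width * bytes_per_pixel
        # Primary copy of the frame buffer, in system memory (fast for the CPU).
        self.shadow = bytearray(self.pitch * height)
        # Stand-in for the scanout buffer in video memory (slow for the CPU).
        self.vram = bytearray(self.pitch * height)
        # (first_row, last_row) touched since the last flush, or None.
        self.dirty = None

    def put_pixel(self, x, y, value):
        """Software rendering only ever touches the shadow copy."""
        offset = y * self.pitch + x * self.bpp
        self.shadow[offset:offset + self.bpp] = value.to_bytes(self.bpp, "little")
        if self.dirty is None:
            self.dirty = (y, y)
        else:
            self.dirty = (min(self.dirty[0], y), max(self.dirty[1], y))

    def flush(self):
        """Copy the dirty rows to video memory; return the number of bytes copied."""
        if self.dirty is None:
            return 0
        first, last = self.dirty
        start = first * self.pitch
        end = (last + 1) * self.pitch
        self.vram[start:end] = self.shadow[start:end]
        self.dirty = None
        return end - start
```

The point of the model is that every put_pixel call hits system memory, and video memory is only touched in one bulk copy per flush.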

If you are using software-rendered 3D, you will actually see HIGHER performance when running shadowfb than when running with hardware-accelerated EXA and Xv. This has nothing to do with the acceleration itself but has everything to do with the location of the frame buffer. When shadowfb puts the frame buffer in system memory, this has the side-effect of speeding up software rendered 3D as well as 2D.

takeaway #1 - shadowfb acceleration in 2D makes glxgears run fast with software 3D (Yay !!)

Enabling hardware accelerated EXA and Xv in the driver moves the frame buffer back into video memory, since the GPU is now doing most of the drawing, but this has a side effect of making software-rendered 3D run more SLOWLY since the CPU is now drawing into video memory rather than into system memory. Video memory is fast for GPU drawing but slow for CPU drawing.

takeaway #2 - hardware acceleration in 2D makes 2D go fast but makes glxgears run slow with software 3D (Boo !!)

This situation only applies to software-rendered 3D, since (a) the GPU actually runs *faster* when the frame buffer is in video memory, and (b) you can't mix shadowfb with accelerated 3D anyways.

Even after we move to hardware accelerated 3D, there will still be cases where the X driver affects 3D performance, mostly related to the X driver's role in setting up video memory areas used by 3D, specifically attributes such as tiling.

In case anyone is interested, tiling in this context is basically swapping around some memory address bits so that a contiguous chunk of memory addresses forms a rectangular area on the screen rather than a horizontal line. This is important for a couple of reasons, all related to the fact that graphics operations tend to have 2D spatial locality rather than 1D locality.
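To make the address swizzle concrete, here is a minimal sketch comparing a linear layout with a simple tiled one. The 8x8 tile size and the function names are hypothetical; real hardware tiling modes (including the 6xx/7xx ones) are more involved than this.

```python
TILE_W = 8  # hypothetical tile width in pixels
TILE_H = 8  # hypothetical tile height in pixels


def linear_offset(x, y, pitch):
    """Pixel offset in a linear (row-major) buffer; pitch is in pixels."""
    return y * pitch + x


def tiled_offset(x, y, pitch):
    """Pixel offset in a buffer made of TILE_W x TILE_H tiles.

    pitch is in pixels and must be a multiple of TILE_W. Each tile occupies
    TILE_W * TILE_H consecutive offsets, so one contiguous chunk of memory
    covers a small rectangle on screen rather than part of a single line.
    """
    tiles_per_row = pitch // TILE_W
    tile_index = (y // TILE_H) * tiles_per_row + (x // TILE_W)
    within_tile = (y % TILE_H) * TILE_W + (x % TILE_W)
    return tile_index * (TILE_W * TILE_H) + within_tile
```

With power-of-two tile sizes and pitch, the divisions and remainders above are just a rearrangement of the bits of x and y, which is why hardware can do this cheaply. For example, with a pitch of 64 pixels, moving down one line from (0, 0) jumps 64 pixels in the linear layout but only 8 in the tiled one.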

GPUs tend to access locations which are close to previously accessed locations, but which are as likely to be up or down from the previous location as left or right. Since DRAM is organized into pages, and since transfers between GPU and memory happen in long bursts along a page, accesses to the *same* page are much faster than accesses to a *different* page from the previous access. Tiling greatly increases the chance of typical accesses being on the same page as previous locations, greatly improving performance.
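A rough way to see the page effect is to count how often a walk across the screen lands on a different DRAM page than the previous access. Everything numeric here is an assumption for illustration: 8x8 pixel tiles, a page holding 512 pixels, and a 1024 pixel pitch. None of it is taken from real 6xx/7xx hardware.

```python
TILE_W = 8          # hypothetical tile width in pixels
TILE_H = 8          # hypothetical tile height in pixels
PAGE_PIXELS = 512   # hypothetical DRAM page size, in pixels


def linear_offset(x, y, pitch):
    return y * pitch + x


def tiled_offset(x, y, pitch):
    tiles_per_row = pitch // TILE_W
    tile_index = (y // TILE_H) * tiles_per_row + (x // TILE_W)
    within_tile = (y % TILE_H) * TILE_W + (x % TILE_W)
    return tile_index * (TILE_W * TILE_H) + within_tile


def page_switches(coords, offset_fn, pitch):
    """Count accesses that land on a different page than the previous access."""
    switches = 0
    last_page = None
    for x, y in coords:
        page = offset_fn(x, y, pitch) // PAGE_PIXELS
        if last_page is not None and page != last_page:
            switches += 1
        last_page = page
    return switches


pitch = 1024
vertical = [(100, y) for y in range(64)]    # walk straight down 64 pixels
horizontal = [(x, 100) for x in range(64)]  # walk straight across 64 pixels

vertical_linear = page_switches(vertical, linear_offset, pitch)
vertical_tiled = page_switches(vertical, tiled_offset, pitch)
horizontal_linear = page_switches(horizontal, linear_offset, pitch)
horizontal_tiled = page_switches(horizontal, tiled_offset, pitch)
```

Under these assumed numbers the vertical walk changes page on every step with the linear layout (63 switches) but only when it crosses into a new tile row with the tiled layout (7 switches), while the short horizontal walk stays on one page either way.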

You may be wondering "why isn't tiling turned on all the time?". There are a bunch of reasons, but the main one is that CPUs use linear addressing, not tiled addressing, so when you are mixing software and hardware rendering it's possible that tiling may actually reduce overall performance. Typically buffers would be tiled and untiled when moving them between system and video memory. Note that tiling textures also helps performance, unless the texture is so small that it can fit entirely into cache.
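The tile and untile step when moving a buffer between system and video memory might look like the following sketch. It works on a plain Python list of pixels with a hypothetical 4x4 tile size; a real driver would normally have the GPU or a blit engine do this rather than the CPU.

```python
TILE_W = 4  # hypothetical tile width in pixels
TILE_H = 4  # hypothetical tile height in pixels


def tile_buffer(linear, width, height):
    """Reorder a row-major pixel list into tile order.

    width and height must be multiples of TILE_W and TILE_H.
    """
    tiled = []
    for ty in range(0, height, TILE_H):
        for tx in range(0, width, TILE_W):
            # Emit the rows of one tile back to back.
            for y in range(ty, ty + TILE_H):
                row_start = y * width + tx
                tiled.extend(linear[row_start:row_start + TILE_W])
    return tiled


def untile_buffer(tiled, width, height):
    """Inverse of tile_buffer: restore the row-major order the CPU expects."""
    linear = [None] * (width * height)
    i = 0
    for ty in range(0, height, TILE_H):
        for tx in range(0, width, TILE_W):
            for y in range(ty, ty + TILE_H):
                row_start = y * width + tx
                linear[row_start:row_start + TILE_W] = tiled[i:i + TILE_W]
                i += TILE_W
    return linear
```

Every pixel gets moved once in each direction, which is part of why mixing CPU and GPU rendering on the same buffer can cost more than tiling saves.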

Buffers which are used primarily for scanout (ie driving one or more displays) should ideally not be tiled, since displays really *do* access the memory in long horizontal lines. One option is to use a compositor as the bridge between tiled and linear buffers, with the result that textures and back buffers would be tiled while the front buffer would be linear.

takeaway #3 - tiling is a big pain but worth the effort - Keith wrote a great summary of the issues in his post :

keithp.com/blogs/UMA_Acceleration_Architecture/ (scroll down to "Tiled Surfaces")

Anyways, most of the interactions between the X driver and 3D performance will go away with time. Shifting from software- to hardware-accelerated 3D will make the biggest difference.

----- ADDENDUM (since I can't reply to IRC from the office yet) -----

A buffer where most of the activity is rendering should be tiled, of course, but one would not describe that as a "buffer which is primarily used for scanout". The issue is the linebuffer's ability to protect the display from memory delays, not the chip's ability to scanout from tiled surfaces. There are a number of system configurations where scanout can consume a fairly big chunk of the available memory bandwidth, and tiling only makes things worse, potentially resulting in display artifacts when acceleration is active.

When running with a compositor, there can be more reads from the buffer (driving the display) than writes (from the compositor copying render buffers), so scanout efficiency can be more important than update efficiency.  Since double-buffering would ideally be used for the compositor target, not the individual render targets, the compositor's front and back buffers would both be linear in that scenario with page-flipping rather than a back-to-front copy.

In cases where the compositor regenerated at the display frame rate, with back-to-front copies rather than page-flipping, you would essentially have two accesses related to copying for every access related to scanout, so tiling would make sense there as long as the linebuffer did not run dry resulting in visible artifacts. If page-flipping was being used, or if the compositor's refresh rate was slower than the display refresh rate, then linear memory would probably be more efficient as well as more robust.
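The "two accesses for every scanout access" arithmetic can be written down as a back-of-the-envelope model. The function names and the example resolution are assumptions, and the model deliberately ignores everything else competing for memory bandwidth (including the compositor's own reads of application buffers).

```python
def scanout_bytes_per_sec(width, height, bytes_per_pixel, display_hz):
    """Bytes read per second just to drive the display."""
    return width * height * bytes_per_pixel * display_hz


def copy_bytes_per_sec(width, height, bytes_per_pixel, compositor_hz, page_flip):
    """Bytes moved per second by back-to-front copies.

    Each copy reads the back buffer and writes the front buffer, so it costs
    two accesses per pixel. Page-flipping swaps pointers and copies nothing.
    """
    if page_flip:
        return 0
    return 2 * width * height * bytes_per_pixel * compositor_hz


# Hypothetical 1920x1200, 32 bits per pixel, 60 Hz display.
scanout = scanout_bytes_per_sec(1920, 1200, 4, 60)
copy_full_rate = copy_bytes_per_sec(1920, 1200, 4, 60, page_flip=False)
copy_half_rate = copy_bytes_per_sec(1920, 1200, 4, 30, page_flip=False)
copy_flipping = copy_bytes_per_sec(1920, 1200, 4, 60, page_flip=True)
```

In this model, copy traffic is twice the scanout traffic when the compositor regenerates at the display rate, equal to it at half the display rate, and zero with page-flipping, which matches the cases described above.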

I'm a bit fuzzy on the details of how a compositor fits into the back-to-front buffer flow under DRI2, since the ideal display stack would probably be a window-sized double-buffered render target for each individual app plus a screen-sized double-buffered render target for the compositor.

