Monthly Archives: August 2008

Direct3D 11 Details Part IV: Multithreaded Rendering

Direct3D 10 only allows graphics commands to be issued from a single thread (there is a multithreaded mode, but Microsoft explicitly warns against using it due to its poor performance). In an API such as Direct3D, issuing graphics commands involves a fair amount of CPU overhead. Given the trend towards increasing the number of cores on a processor rather than the performance of a single core, it is desirable to efficiently spread this work among multiple threads.

Direct3D 11 adds the ability to create display lists from multiple threads and execute them from the main rendering thread. In addition, the Device (which creates resources) has been separated from the Context (which issues graphics commands). This enables creating resources asynchronously. Deferred Contexts are used to create display lists and the Immediate Context issues graphics commands to the GPU, including the execution of display lists created on Deferred Contexts.
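To make the division of labor concrete, here is a toy simulation of the pattern in plain Python (no Direct3D involved). The class and method names (`DeferredContext`, `ImmediateContext`, `finish_command_list`, `execute_command_list`) are hypothetical stand-ins chosen for illustration, not the API's actual names. Worker threads only record; a single thread submits.

```python
import threading

class DeferredContext:
    """Hypothetical stand-in for a deferred context: records commands, executes nothing."""
    def __init__(self):
        self._commands = []

    def draw(self, mesh_name):
        self._commands.append(("draw", mesh_name))

    def finish_command_list(self):
        # Hand back the recorded display list and reset for the next recording.
        commands, self._commands = self._commands, []
        return tuple(commands)

class ImmediateContext:
    """Hypothetical stand-in for the immediate context: the only place work is submitted."""
    def __init__(self):
        self.submitted = []

    def execute_command_list(self, command_list):
        self.submitted.extend(command_list)

def record_batch(meshes, results, slot):
    # Runs on a worker thread; each worker owns its own deferred context.
    context = DeferredContext()
    for mesh in meshes:
        context.draw(mesh)
    results[slot] = context.finish_command_list()

def render_frame(batches):
    results = [None] * len(batches)
    workers = [threading.Thread(target=record_batch, args=(batch, results, slot))
               for slot, batch in enumerate(batches)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    # Back on the "main rendering thread": replay the lists in a fixed order.
    immediate = ImmediateContext()
    for command_list in results:
        immediate.execute_command_list(command_list)
    return immediate.submitted
```

Because each list is stored in its own slot and replayed in slot order, the submission order does not depend on which worker finishes first.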

Unlike the other features in Direct3D 11, multithreaded rendering is not a hardware feature at all. With the appropriate drivers, D3D10 (perhaps even D3D9) hardware will be able to perform multithreaded rendering efficiently (some level of multithreaded performance will be available even without new drivers, but it was unclear what the limitations would be in this case).

Direct3D 11 Details Part III: Compute Shaders & Unordered Memory

GPGPU (General-Purpose computation on GPU) approaches such as NVIDIA’s CUDA have become increasingly popular over the last few years, recently coming full-circle with various non-traditional rendering algorithms (perhaps this should be called GPGPUG?). However, the existing solutions are vendor-specific, often requiring reprogramming even for different GPUs from the same vendor. They also tend not to “play well” with the traditional graphics pipeline. For example, on GeForce 8000-series GPUs using CUDA there is a large delay when switching between CUDA and traditional graphics rendering.

Direct3D 11 introduces a new kind of shader called a Compute Shader. A compute shader is invoked as a regular array of threads. The threads are divided into groups. Each group has 32KB of memory shared among the threads in the group. Thus the threads can use partial results computed by other threads in the same group, improving performance. Threads can also perform random-access reads and writes to graphics resources such as textures, vertex arrays or render targets. These memory accesses are unordered, although various synchronization instructions exist to impose ordering when needed.
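As a rough illustration of why group-shared memory helps, the following Python sketch simulates one common use: a tree reduction, where each thread combines its value with a partial result written by another thread in the same group. This is my own toy model, not shader code; `GROUP_SIZE`, `group_sum` and `dispatch_sum` are hypothetical names, and the group size is kept tiny and must be a power of two.

```python
GROUP_SIZE = 8  # hypothetical; must be a power of two for this reduction

def group_sum(values):
    """Simulate one thread group summing its values through shared memory."""
    assert len(values) == GROUP_SIZE
    shared = list(values)  # each "thread" stores its own element in shared memory
    stride = GROUP_SIZE // 2
    while stride > 0:
        # On a GPU, all threads in the group would hit a barrier here.
        for thread_id in range(stride):  # only threads below the stride stay active
            shared[thread_id] += shared[thread_id + stride]
        stride //= 2
    return shared[0]

def dispatch_sum(data):
    """Simulate a dispatch: split the data into groups, then add the per-group results."""
    padded = list(data) + [0] * (-len(data) % GROUP_SIZE)
    partials = [group_sum(padded[start:start + GROUP_SIZE])
                for start in range(0, len(padded), GROUP_SIZE)]
    return sum(partials)
```

Within one step, active threads read only from the upper half of the live range and write only to the lower half, which is why running them one after another here gives the same result as running them in parallel.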

Pixel shaders can also perform random-access (unordered) writes. This allows them to write data structures such as linked lists that can then be processed by a compute shader, or vice-versa (pixel shaders have always had the ability to perform random access reads via texture lookups).
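One way to picture the linked-list case is the sketch below, again a plain Python simulation. It assumes a node buffer, a per-pixel head-pointer buffer, and atomic counter/exchange operations; those are my assumptions for illustration, and all names are hypothetical. The first function plays the role of the pixel shader pass, the second the compute shader pass that consumes the lists.

```python
END = -1  # marks the end of a list

def build_lists(fragments, num_pixels):
    """'Pixel shader' pass: push each (pixel, depth) fragment onto that pixel's list."""
    heads = [END] * num_pixels  # one head pointer per pixel
    nodes = []                  # shared node buffer of (depth, next_index)
    for pixel, depth in fragments:
        node_index = len(nodes)              # stands in for an atomic counter increment
        nodes.append((depth, heads[pixel]))  # new node points at the previous head
        heads[pixel] = node_index            # stands in for an atomic exchange
    return heads, nodes

def sorted_depths(heads, nodes, pixel):
    """'Compute shader' pass: walk one pixel's list and return its depths in order."""
    depths = []
    index = heads[pixel]
    while index != END:
        depth, index = nodes[index]
        depths.append(depth)
    return sorted(depths)
```

The fragments arrive in arbitrary order, so the list itself is unordered; any ordering is imposed afterwards by the pass that reads it.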

Several examples of compute shaders were shown at Gamefest, performing post-process operations such as finding the average luminance of a render target, or computing a luminance histogram (both used in tone mapping). For these operations, a 2X speedup was quoted over the best performance possible using pixel shaders.
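The Gamefest implementations were not published in detail, so the following is only a guess at the general shape of such a pass, simulated in Python. It assumes Rec. 709 luminance weights, colors in the 0 to 1 range, and a tiny group size; each group fills a local histogram (standing in for shared memory) before merging it into the global one. All names are hypothetical.

```python
def luminance(rgb):
    """Rec. 709 luminance of a linear RGB triple (the weights are my choice)."""
    red, green, blue = rgb
    return 0.2126 * red + 0.7152 * green + 0.0722 * blue

def luminance_histogram(pixels, num_bins=4, group_size=4):
    """Bin pixel luminance, one 'thread group' at a time."""
    total = [0] * num_bins
    for start in range(0, len(pixels), group_size):
        local = [0] * num_bins  # stands in for the group's shared memory
        for rgb in pixels[start:start + group_size]:
            bin_index = min(int(luminance(rgb) * num_bins), num_bins - 1)
            local[bin_index] += 1
        for bin_index, count in enumerate(local):  # one merge per group, not per pixel
            total[bin_index] += count
    return total

def average_luminance(pixels):
    return sum(luminance(rgb) for rgb in pixels) / len(pixels)
```

Accumulating locally first means contention on the global histogram scales with the number of groups rather than the number of pixels, which is one plausible source of a speedup over a pixel shader approach.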

Compute shaders can also perform operations such as computing summed-area tables and fast-Fourier transforms significantly faster than traditional GPU methods. Microsoft is looking into providing library functions to perform such operations.
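For readers who have not met summed-area tables, here is a small sequential reference in Python. It shows what the table contains and how it is used, not how a compute shader would build it (a GPU version would presumably run parallel prefix sums over rows and then columns). The function names are hypothetical.

```python
def summed_area_table(image):
    """sat[y][x] holds the sum of image[0..y][0..x], inclusive."""
    height, width = len(image), len(image[0])
    sat = [[0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            sat[y][x] = row_sum + (sat[y - 1][x] if y > 0 else 0)
    return sat

def box_sum(sat, x0, y0, x1, y1):
    """Sum of the inclusive rectangle (x0, y0)-(x1, y1) using at most four lookups."""
    total = sat[y1][x1]
    if x0 > 0:
        total -= sat[y1][x0 - 1]
    if y0 > 0:
        total -= sat[y0 - 1][x1]
    if x0 > 0 and y0 > 0:
        total += sat[y0 - 1][x0 - 1]
    return total
```

Once the table exists, any box filter costs the same four lookups regardless of its size, which is why building the table quickly matters.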

Microsoft speculated that algorithms such as A-buffer rendering and ray tracing could also be performed efficiently, but they don’t have any hard performance numbers for those.

Larrabee

Solid information about Intel’s new Larrabee architecture came out a few days ago; the Level of Detail blog has a good set of links. The major news is that Intel’s SIGGRAPH paper is now available for download from ACM’s Digital Library. Unfortunately, not everyone has access to this site’s resources (it costs money to subscribe). My contribution to the cause:

http://softwarecommunity.intel.com/UserFiles/en-us/File/larrabee_manycore.pdf

Thanks to Tom Forsyth for the link.

I’m excited by Larrabee not because of any particular technical feature (though I’m entirely savoring the paper itself, reading two pages a day at lunch), but rather by the fact that it opens up a whole new ecosystem for implementing graphics algorithms. Regardless of whether Larrabee wins or loses in the long run, it will have a huge effect in increasing our knowledge by helping us explore different hardware and software designs for rendering.

Direct3D 11 Details Part II: Tessellation

Direct3D 11 adds three new pipeline stages, with the goal of enabling efficient tessellation of higher order surfaces. This is the Direct3D 10 pipeline, as shown in “Real-Time Rendering, 3rd Edition”:

[Figure: Direct3D 10 Pipeline]

The color of each stage indicates whether it is fully programmable (green), configurable (yellow) or fixed function (blue). The stages are described more fully in the “Graphics Processing Unit” chapter of the book. Note that the “Geometry Shader” stage is new to Direct3D 10, but the other stages have been in the pipeline for quite a while.

The Direct3D 11 pipeline adds three new stages between the vertex and geometry shader stages (framed in red). Two of the new stages are programmable (the hull and domain shader stages) and one is configurable (the tessellator stage):

[Figure: Direct3D 11 Pipeline]

This pipeline operates on meshes represented as a series of surface patches. Triangle and quad surface patches are primitives in Direct3D 11 (there is also a tessellated line primitive). The shape of each patch is defined by a number of control points. These control points are transformed, skinned and/or morphed one by one in the vertex shader.

The hull shader is called for each patch, using the patch control points from the vertex shader as inputs. The hull shader has two main responsibilities. The first is to (optionally) convert the control points from one representation (basis) to another. For example, it can implement the technique introduced in Loop and Schaefer’s paper “Approximating Catmull-Clark Subdivision Surfaces with Bicubic Patches”. The control points are sent directly to the domain shader, bypassing the tessellator. The hull shader’s second responsibility is to compute appropriate tessellation factors, which are passed to the tessellator stage. This allows for adaptive tessellation, which can be used for continuous view-dependent LOD (level of detail). The tessellation factors are specified per patch edge, and range from 2 to 64. This means that each edge of the patch may be split into at least 2 (and as many as 64) triangle (or quad) edges.
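As an example of the second responsibility, here is one possible view-dependent heuristic written as ordinary Python rather than shader code. The formula, the `target_ratio` parameter and the function name are my own assumptions; only the 2 to 64 clamp comes from the range quoted above.

```python
import math

MIN_FACTOR, MAX_FACTOR = 2, 64  # the range quoted in the text

def edge_tess_factor(edge_midpoint, eye, edge_length, target_ratio=0.0625):
    """Hypothetical heuristic: split an edge until each segment's length,
    divided by its distance from the eye, is at most target_ratio."""
    distance = math.dist(edge_midpoint, eye)
    if distance == 0.0:
        return MAX_FACTOR
    wanted = edge_length / (distance * target_ratio)
    return max(MIN_FACTOR, min(MAX_FACTOR, math.ceil(wanted)))
```

Note that the factor here depends only on data belonging to the edge itself. Two patches sharing an edge would then compute the same factor for it, which is what a crack-free adaptive scheme needs.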

The tessellator is a fixed-function (but highly configurable) stage, which uses the tessellation factors to tessellate (subdivide) the patch into multiple triangle or quad primitives. The tessellator does not have access to the control points – all tessellation decisions are made based on configuration and the tessellation factors passed on from the hull shader. Each vertex resulting from the tessellation is output to the domain shader. Only the patch parametrization coordinates are passed on for each vertex.
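The key point, that the tessellator works purely in parameter space, can be seen in a deliberately simplified Python sketch. It assumes a quad patch with a single uniform integer factor, whereas the real stage takes per-edge factors and has more configuration than this; `tessellate_quad` is a hypothetical name.

```python
def tessellate_quad(factor):
    """Return parametric (u, v) coordinates and triangle index triples for a
    unit quad domain split into factor x factor cells. No control points are used."""
    n = factor
    coords = [(i / n, j / n) for j in range(n + 1) for i in range(n + 1)]
    triangles = []
    for j in range(n):
        for i in range(n):
            a = j * (n + 1) + i  # lower-left corner of the cell
            b = a + 1            # lower-right
            c = a + (n + 1)      # upper-left
            d = c + 1            # upper-right
            triangles.append((a, b, d))
            triangles.append((a, d, c))
    return coords, triangles
```

Nothing in this function knows where the patch is in space; it only produces a connectivity pattern and (u, v) pairs, leaving positions to the next stage.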

The domain shader operates on the patch parametrization coordinates of each vertex separately, although it can also access the transformed control points for the entire patch. The domain shader sends the complete data for the vertex (position, texture coordinates, etc.) to the geometry shader (or the clipping stage if no geometry shader is present). Effectively, it evaluates the surface representation at each vertex. Techniques such as displacement mapping can also be applied by this shader stage.
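To show what “evaluating the surface representation” means, here is a minimal Python sketch for one common representation, a bicubic Bézier patch with a 4x4 grid of control points. The choice of representation and the function names are my assumptions; a real domain shader would also produce normals, texture coordinates and so on.

```python
def bernstein3(t):
    """The four cubic Bernstein basis weights at parameter t."""
    s = 1.0 - t
    return (s * s * s, 3 * s * s * t, 3 * s * t * t, t * t * t)

def domain_shader(control_points, u, v):
    """Evaluate a bicubic Bezier patch at (u, v).

    control_points is a 4x4 grid of (x, y, z) tuples; rows follow v, columns follow u.
    """
    weights_u, weights_v = bernstein3(u), bernstein3(v)
    position = [0.0, 0.0, 0.0]
    for row in range(4):
        for col in range(4):
            weight = weights_v[row] * weights_u[col]
            for axis in range(3):
                position[axis] += weight * control_points[row][col][axis]
    return tuple(position)
```

Displacement mapping would fit in as one extra step at the end: offset the returned position along the surface normal by a height sampled at (u, v).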

Although Microsoft gave an example using Catmull-Clark subdivision surfaces, the programmability of the pipeline enables other surface representations to be used. Alternatively, the tessellation stages can be turned off and traditional triangle or quad meshes can be used.

Direct3D 11 Details Part I: Intro

I attended Gamefest 2008 last week. Gamefest (formerly called Meltdown) is a Microsoft-run Windows and Xbox 360 game development conference. This year there were two notable announcements: XNA Community games (discussed in a previous blog post) and the first public disclosure of Direct3D 11.

Direct3D is, of course, the API used by most Windows games, but its importance extends beyond Windows. Direct3D features guide the development of graphics hardware in general, so these features are bound to show up in future consoles, as well as in OpenGL.

The announcement that Direct3D 11 would not be tied to the next version of Windows (as many had feared), and would be available on Windows Vista, was very significant to Windows developers, many of whom complained about the tying of Direct3D 10 to Windows Vista. Direct3D 11 will also be available on Direct3D 9, 10, and 10.1 level graphics hardware (although the new features will not be available there, with the exception of some multithreading enhancements).

The fact that the Direct3D 11 API is a strict superset of the 10/10.1 API is also cause for relief among game developers. From Direct3D 9 to 10, the API went through extensive changes. These changes were mostly long-overdue cleanups and improvements, but they left developers supporting two very different APIs if they wanted to support the many customers using Windows XP and also expose the new Direct3D 10 hardware features.

This is the first part of a multi-part post which will summarize the essential facts about Direct3D 11, as known from the Gamefest slides. Eventually, the slides should show up on the XNA Presentations page.

Full disclosure of Direct3D 11 should occur later this year – the November 2008 DirectX SDK release will feature a preview version of the API, including full documentation and code samples.