_ __ ______ _ __
' ) ) / ' ) )
/--' __. __ , --/ __ __. _. o ____ _, / / _ , , , _
/  \_(_/|_/ (_/_ (_/ / (_(_/|_(__<_/ / <_(_)_ / (_
comments:
We did some volumetric rendering with raytracing and then "mixed" that with a
normal 3d scene (the scene was drawn using 3d acceleration).
Fair enough, it looks a bit blurry but it does give the impression of flying
through 3d clouds.
----
There is quite an impressive RT intro, Fresnel 2 by Kolor. You can get it at
http://www.kaoz.org/kolor
It has two versions, one for the PIII and the other for less powerful
machines. You need an OpenGL capable 3D card.
---------
The web site for Dr. David Rogers' book "An Introduction to NURBS" is at:
http://www.nar-associates.com/nurbs/nurbs.html
It includes information about the book, errata, source code, etc.
--------
The Graphics Gems site has gone through a major redesign (as well as getting
a more memorable URL), with all code now directly linked to the gems, which
have been organized by book, category, and author:
http://www.graphicsgems.org/
--------
Tidbits: an early draft of "Point in Polygon Strategies", and efficient code
for doing shaft culling, is now available at the site
http://www.erichaines.com/.
--------
If you just can't get enough on Pluecker coordinates, i.e. the articles in
RTNv10n3 and RTNv11n1 weren't enough for you, try:
http://www.flipcode.com/tutorials/tut_pluecker.shtml
It is derived from the RTN articles, but has some other information and may
be more approachable.
--------
A derivation of the formula for ray/general quadric intersection, from Paul
Kahler:
http://www.oakland.edu/~phkahler/techdoc/RayQuadric.doc
(it's a Word Doc file)
--------
The site:
http://www.schorsch.com/kbase/glossary/index.html
has a glossary of the various terms used in the area of lighting and related
fields. Reasonably done, though nothing deep. There are certainly a few terms
I'd never heard of before (e.g. "etendue").
--------
Where can you find minimal indirect illumination, no atmospheric effects,
perfectly reflective surfaces, and hard-edged shadows (depending on the
location and size of the sun)? Yes, space, the real home of classical ray
tracing. Check out movies and stills at:
http://www.spacebattles.com/
----------------------------------------------------------------------------
Octree Implementation and RTRT Speeds, by Paul Kahler, Vlastimil Havran,
Ingo Wald, John Stone, and Nick Chirkov
[excerpted from the Realtime Raytracing mailing list at
http://groups.yahoo.com/group/realtime_raytracing/. - EAH]
I've done some checking, and ran across RTNv12n2 where there is an article
covering many (all?) of the various papers on ray-octree traversal. From
reading the descriptions, it seems that the method I worked so hard to come
up with is #18 by Gargantini and Atkinson from 1993. Yet another algorithm
I've independently reinvented :-) There was no link to the actual paper or
code, so I haven't compared my implementation to theirs yet. I must say that
this algorithm (and others like it) is VERY sensitive to the details of the
implementation. Let me explain...
I started with a simple octree using 8 pointers to children at each node,
with NULL pointers where there are no children. I use axially aligned boxes
with subdivision always happening in the center (I have good reasons to back
up this choice, but that's another story). My first implementation recursed
to all 8 children as long as there was no NULL pointer. Each node was
responsible for determining if the ray hit it. The hit test treated the node
as the intersection of 3 slabs and checked for the existence of a non-empty
span of the ray passing through all 3. I used a table of 8 orders to check
the children in front to back order based on the direction the ray came from.
The biggest problem with this is that the ray can hit at most 4 of the
children; this means potentially many recursive calls to intersection tests
with nodes the ray does not hit. Note that the computation lies completely
with the child nodes; the parent only checks pointers for NULL.
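Paul's first-pass scheme can be sketched roughly as follows; the node layout
and all names here are my own illustration, not his actual code:

```cpp
#include <algorithm>
#include <cassert>

struct Vec3 { double x, y, z; };

struct Node {
    Vec3 lo, hi;        // axis-aligned box; subdivision always at the center
    Node* child[8];     // NULL pointers where there are no children
};

// Slab test: does the ray (org + t*dir, t >= 0) pass through the box?
// Treats the node as the intersection of three slabs and checks for a
// non-empty span of t surviving all three, as in the first version above.
bool hitsBox(const Vec3& org, const Vec3& dir,
             const Vec3& lo, const Vec3& hi)
{
    double t0 = 0.0, t1 = 1e30;
    const double o[3] = { org.x, org.y, org.z };
    const double d[3] = { dir.x, dir.y, dir.z };
    const double l[3] = { lo.x, lo.y, lo.z };
    const double h[3] = { hi.x, hi.y, hi.z };
    for (int a = 0; a < 3; ++a) {
        double inv = 1.0 / d[a];     // relies on IEEE infinities when d[a] == 0
        double tn = (l[a] - o[a]) * inv;
        double tf = (h[a] - o[a]) * inv;
        if (tn > tf) std::swap(tn, tf);
        t0 = std::max(t0, tn);
        t1 = std::min(t1, tf);
        if (t0 > t1) return false;   // the three spans have empty intersection
    }
    return true;
}
```

The recursion then calls hitsBox() on every non-NULL child, which is exactly
the redundancy discussed above: each node does its own hit test, and up to
half the children tested cannot be hit at all.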
I realized that a little work in the parent node can reduce the number of
children checked to 4 (at most) and that helped a LOT. I ultimately reached
the algorithm described in the abstract I found. I'm still left with some
open questions. I can still put more work into the parent nodes, but it will
be done even for children with NULL pointers - this will slow things down in
some cases and speed it up in others. I can also do more work related to
sorting; however, there are only 5 numbers whose order matters, and more
complex code may hurt things more than it helps - this one is not dependent
on the existence of children, so I should try it. I still need to find the
authors' code to compare - since this algorithm is so sensitive to what
optimization is done, it's conceivable that I can beat them. A modest 2x
performance increase would put it in direct competition with those grids
everyone keeps talking about. Don't laugh, I've already seen an increase of
more than 4x over my first implementation while keeping the same time
complexity. Also, there are some things that would be best handled by
combining sign-bits in floating point numbers as opposed to using conditional
code. This could reduce branch prediction errors enormously, but Pentium type
processors really aren't designed with that in mind. BTW, I expect Athlon to
smoke P4 running this code due to the branching complexity and the 20-stage
P4 pipes that will keep getting flushed - if the branches could be predicted,
we'd have an even better algorithm :-)
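The front-to-back table lookup, driven by direction sign bits rather than a
chain of comparisons, might look like the following; the child numbering
convention is my own assumption, not Paul's code:

```cpp
#include <cassert>
#include <cmath>

// Pack the signs of the ray direction into a 3-bit mask. Children are
// assumed numbered so that bits 0/1/2 of the child index select the
// +x/+y/+z half of the parent box.
int dirMask(double dx, double dy, double dz)
{
    return (std::signbit(dx) ? 1 : 0)
         | (std::signbit(dy) ? 2 : 0)
         | (std::signbit(dz) ? 4 : 0);
}
```

Visiting children in the order i ^ dirMask(dx, dy, dz) for i = 0..7 gives a
valid front-to-back order for a center-split octree: a child is only
order-constrained against neighbors differing from it in a single bit, and
the numeric sequence 0..7 respects all of those constraints.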
Anyway, there's a demo of it at:
http://www.oakland.edu/~phkahler/rtrt
[There is also an interesting article on doing what is essentially adaptive
supersampling of a low resolution version of the image (i.e. render every
eighth pixel to start) at Paul's site at
http://www.oakland.edu/~phkahler/rtrt/dre.html - EAH]
Select demo 4. It does a big fractal octree thing and lets you subdivide up
to 8 levels deep. Most rays from the camera hit the tree root, and with 2
light sources there's a lot more work to do also. Since the octree is fixed
in space, I attached the camera AND light sources as children of the same
transformation node in the scene graph; this makes it look like the octree is
spinning :-) BTW, I can turn off leaf visibility and place objects in the
octree, but that starts making speed dependent on other things than tree
traversal which is what I've been trying to optimize.
----
[I asked Paul for some timing numbers, and he tried some tests. -EAH]
My terrain dataset is 511x511 squares each divided into 2 triangles. The
vertices are offset vertically using 2 frequencies of Perlin noise. I have
one light source up in the sky, and turning off shadow rays only added 15% to
the frame rate because those rays trace very efficiently.
I ran some more controlled tests. 320x240 images with a ray every 8th pixel
for a total of 9600 eye rays per image. The camera was aimed 0.5 radians down
from straight horizontal (it thrashes if you get the horizon in view with too
many polys). This means every ray hit the ground and had 1 shadow ray cast.
The camera was high enough to see several hundred (perhaps a couple
thousand?) triangles at a time. My DRE was turned off so it was 19200 rays
per frame.
So... tracing a terrain grid of size 2047x2047x2 = 8,380,418 triangles I
got 4.0 FPS with some fluctuation up to 4.1 & 4.2 - call it 4.0 for a total
of 76800 rays per sec. The root of the octree was 11 levels deep (including
root and leaf). It took several minutes to build the terrain, and then
another minute for the thrashing to stop after it started rendering. Last
time I did this it allocated about 1GB of memory.
Next with 511x511x2 = 522,242 triangles in a 10 level octree I got... the
same result 4.0 FPS - not fluctuating up as often. It only takes a few
seconds to build the scene and the speed tops out in a few more. Tree depth
is really the determining factor, not polygon count.
Then I pointed the camera up higher so about 75% of the image was ground and
the rest was sky (default blue color). This gave about the same speed as the
previous test. While there are fewer rays, they have to do more work -
passing over the ground at lower levels of the tree. The horizon looked
really bad due to the pixel interpolation. I turned on DRE and it cleared up
but dropped to 3.2 - 3.4 FPS.
Summary: I'm at about 76K-80K rays/second in a 10 level octree with 50% being
shadow rays. If I point the camera straight down it speeds up to 130K
rays per second due to fewer intersection tests. All done on an Athlon 700
with 128M RAM.
Paul gave an update a few months later:
I am now getting 140K rays per second all the time (perhaps a dip to 130K)
and going up to 215K per second with the camera pointed down to avoid those
over the horizon rays.
[Paul got this speed boost from code tweaks like turning C++ into C, inlining,
and translating inner loops into assembler. - EAH]
----
Vlastimil Havran comments:
I agree about the sensitivity to implementation issues; I improved the
efficiency of my octree code about threefold by tuning and trying different
versions of the code. It seems that the actual implementation is still very
important.
Determining which child nodes have to be traversed is the real potential
source of inefficiency. What I would recommend is to get the
paper from J. Revelles et al, "An Efficient Parametric Algorithm for Octree
Traversal", and implement it, since the optimization idea in this paper is
just what you are searching for and is very promising, I think.
Unfortunately, I have not yet had time to reimplement it in my own source code.
The paper is available at: http://giig.ugr.es/~curena/papers/, and also
includes some pseudocode of the traversal algorithm.
----
Ingo Wald comments:
Two numbers from our interactive ray tracer [see
http://graphics.cs.uni-sb.de/rtrt/] may be of particular interest for you:
- about 90,000 rays per second for a 4-million triangle scene in which almost
all triangles are visible.
- about 150,000 to 600,000 primary rays per second (depending on view settings)
for a 1.5 million triangle scene with lots of occlusion.
(Both scenes are primary rays only, flat shading)
All that on a 800 MHz PIII.
[Part of the difference in speeds is because Ingo et al are using SIMD (SSE)
instructions in a clever way: they intersect four rays at once against a
single triangle. Like Paul, they also pay a lot of attention to and optimize
for cache lines and other architectural elements of their processor. -EAH]
----
John Stone notes (in a separate email; worth mentioning
here):
There's a heck of a lot that can be done to fit a ray tracer to an
architecture. Make sure to have tight loops that avoid branches, precompute
stuff, get the cache lines filled nicely by making data structures fit them,
and group like data together to maximize coherency (I've seen some very funky
code for z-buffer culling by Intel, for example, that keeps a few copies of a
mesh's data in different places to get better locality).
----
Nick Chirkov writes:
My ray tracer works on any polygonal model, builds a BSP, and traces 40,000
uniformly distributed rays/sec (Celeron 433) on the well-known Attic (~52,000
triangles) scene with a correct-hit ratio of 99.96%. All code is C++; SSE is
not used. I can get a speed up of 15% using Intel's C++ compiler and by
rewriting something like if(*(DWORD*)&float_value<0) to if (float_value<0.0f).
I can't use occlusion techniques because of shadow generation [I'm not sure
what this means. -EAH].
Tracing uniformly distributed rays is slower than tracing primary rays
because of CPU cache
misses. In real image generation it was ~60,000 rays/sec on the Attic scene
with 3 light sources with shadows.
I tested a scene with ~480,000 triangles and one light source. It was an old
ship and all triangles were within the view frustum. I got ~30,000 rays/sec.
[See Nick's page at
http://www.geocities.com/SiliconValley/Bay/5604/raytrace.htm]
----------------------------------------------------------------------------
Loose Octrees for Dynamic Raytracing, by Eric Haines
In a dynamic animation environment, one problem for ray tracers to solve is
updating the spatial ray tracing structure as quickly as possible. This is
especially true if multiprocessor machines are being used. Reinhard,
Smits, and Hansen give one solution in their work, "Dynamic Acceleration
Structures for Interactive Ray Tracing,"
http://www.cs.utah.edu/~reinhard/egwr/. In this paper they insert objects
into a grid. As an object moves outside the grid, a modulo operation is
applied to the object's location. So, say an object is in the rightmost grid
cell, #9, in a 10x10x10 grid. It moves outside the grid - instead of
recreating the grid, the object is moved to cell #0, but is noted as being in
a "copy" of the grid at X location 1. Another way to express it: each grid
coordinate has a number 0-9, but think of the world as being tiled with
repeating copies of this grid. This meta-grid is addressed with the higher
digits of the object's location number. Everything starts in meta-grid #0, so
a coordinate of 10 would mean meta-grid #1 along that axis, 20 means
meta-grid #2, etc. Now when a
ray travels through a grid cell containing something outside the normal grid,
the object's real location is also checked. If the meta-grid location does
not match the ray's meta-grid location, the object is not actually there, so
it's not tested. As time goes on and more objects move outside the grid, this
scheme becomes less efficient as more objects have to be tested but can never
be hit. See the paper for how the authors decide to regenerate the grid when
it becomes inefficient.
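The bookkeeping amounts to a floor division and a modulo per axis; here is a
one-axis sketch of my reading of the scheme, not the authors' code:

```cpp
#include <cassert>

// One axis of a 10-cell grid. An object at logical cell coordinate c
// (which may be negative or >= 10 once it wanders off the grid) is stored
// in physical cell (c mod 10) and tagged with meta-grid index (c div 10).
// A ray visiting a physical cell compares meta-grid tags before testing
// the objects stored there.
struct CellAddr { int cell; int meta; };

CellAddr wrapCell(int c, int n = 10)
{
    // Floor division so that negative coordinates wrap correctly.
    int meta = (c >= 0) ? c / n : -((-c + n - 1) / n);
    return { c - meta * n, meta };
}
```

An object drifting from cell #9 to coordinate 10 thus lands in physical cell
#0 of meta-grid #1, just as in the example above, with no grid rebuild.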
What's clever about their scheme is that when an object moves, it is quick to
remove it from the grid and reinsert it. The grid does not have to be
regenerated. This scheme can also work with a hierarchy of grids (i.e. nested
grids). The authors note that normal octrees suffer from non-constant
insertion and deletion times, as the tree has to be traversed and an object
may get put into two or more nodes.
Thatcher Ulrich's "loose octree" spatial partitioning scheme has some
interesting features. Meant for collision detection, it may also have
application to real-time ray tracing. The basic idea is that each octree
node's box actually encloses twice the space, in each direction, of the
node's nominal cell. That is, normally an octree node does not overlap its
neighbors - space is precisely partitioned. In Ulrich's scheme, the octree
node box is extended by half its size in each of the six face directions.
Anything listed in this node is inside this extended box.
This makes for a less-efficient partitioning of space, but has a great
advantage for dynamic objects. As an object moves or otherwise changes, it
can be removed and inserted in the tree "instantly". Say you want to insert
spheres into this structure. The radius of the sphere determines exactly what
level of the octree you need to insert it at. For example, if the extended
octree node at some level is, say, 12x12x12 units, then a sphere with a
radius of 3 or less must fit inside this extended node if the center of the
sphere is inside the unextended octree node. If the radius is 1.5 or less, it
can be inserted in the next octree node down (6x6x6) or further, as it must
fit there. So just knowing the sphere's center and radius fully determines
which octree node to insert it into, without searching or bounds testing
(other than walking down the tree to the node itself).
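In code, the sphere example reduces to a short loop; this is a sketch
assuming a cubical root node of known size (the names are mine):

```cpp
#include <cassert>

// Deepest octree level whose loose (2x-extended) node is guaranteed to
// contain a sphere of the given radius whenever the sphere's center lies
// in the unextended node. A node at depth d has unextended size
// rootSize / 2^d and holds the sphere as long as radius <= size/2,
// matching the 12x12x12-extended / radius-3 example in the text.
int insertionDepth(double rootSize, double radius, int maxDepth)
{
    int d = 0;
    double size = rootSize;
    // Descend while the sphere would still fit one level further down.
    while (d < maxDepth && radius <= (size * 0.5) * 0.5) {
        size *= 0.5;
        ++d;
    }
    return d;
}
```

Together with the sphere's center (which picks the cell within that level),
this fixes the insertion node with no searching or bounds testing.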
Similarly, deletion from the octree is quick: each object exists in one and
only one octree node, which can be found immediately and so deleted from. It
might even be faster to hash the octree nodes by their level and address (as
Glassner did in his original scheme) to more quickly delete from them.
This gives at least some hope that octrees could be used in a dynamic ray
tracing system. Another nice feature of octrees is that if an object moves
outside the bounding box of the entire octree, this octree can become a
sub-node of a larger octree made on the fly that also encloses the space the
object moved to. I have to admit that the loose octree structure seems pretty
inefficient to trace rays against, but really I am just brainstorming here,
presenting a possible use for a clever, new data structure with some useful
properties. I can imagine combining loose octrees with other schemes, e.g.
also creating bounding boxes within each populated node lazily, as needed, to
further speed intersection testing.
See more about the loose octree idea at
http://www.tulrich.com/geekstuff/partitioning.html, and read more about it in
the book "Game Programming Gems." There are some optimizations I do not
mention here, like being able to sometimes push some objects one level deeper
into the octree.
----------------------------------------------------------------------------
Processor and Compiler Comparison as of August 2001, by John Stone
I have various new info on the compiler fronts.
For x86 machines, I've run a bunch of benchmarks on Intel P4 and AMD Athlon
machines, and have had interesting results. Along with this, I've tested gcc,
the Portland Group compilers, and the beta Intel compilers for Linux.
On the hardware front, for the testing I've done with Tachyon (John's ray
tracer, source at http://jedi.ks.uiuc.edu/~johns/raytracer/), the Athlon
machines have consistently annihilated the P4 machines with which I've
compared them. My 1.2 GHz Athlon at home is still posting an SPD balls time
of 2.6 seconds, versus about 3.0 for a 1.7 GHz P4 we have at work. I tested
on a 1.4 GHz Athlon previously and got 2.2 seconds!
From the testing I did recently, the PG compilers were good, but the most
recent versions of gcc are now at about the same level of performance. (I
haven't tested recently with all combinations on the same box, but I believe
my results indicate that if the PG compilers are still better than gcc, the
margin has narrowed to about 3% or so in performance.)
In all cases, the new beta Intel compiler has proven to produce code that
runs *slower* than what I've been getting out of gcc, though since it is a
beta compiler, I can't complain too much yet. This was true when testing the
produced binaries on both an Athlon and on a P4, with Tachyon. We _have_ seen
some performance benefits running some other codes, on the older generation
Pentiums, but not yet on the P4. I suspect that Intel just needs a few more
months of work before their new linux compiler is as competitive as we'd
expect. I think the main advantages it has right now are much more
sophisticated vectorizing features which benefit linear algebra oriented
codes, but don't help ray tracers very much. (Tachyon has very few
vectorizable loops presently, unfortunately...)
Though the timings I have below span over a couple of months, none of the
core code in Tachyon has changed for quite a while.
----------------------------------------------------------------------------
Triangles per Vertex, by Eric Haines
I noticed a line in Peter Shirley's book "Realistic Ray Tracing":
    Typically, a large mesh has each vertex being stored by about six
    triangles, although there can be any number for extreme cases.
This is true enough, but the result is stronger and deeper than that, and it
is something I only recently realized (well, in 1998), so I'm passing it on
here.
It turns out that the Euler characteristic precludes there ever being, on
average, more or fewer than about six triangles per vertex for large meshes
consisting entirely of triangles, surprisingly enough. The formula is:
V + F - E = 2
(V=vertices, F=faces, E=edges) for a closed mesh (i.e. a solid object)
without any holes (i.e. topologically equivalent to a sphere; though a donut
merely changes the formula to V + F - E = 0; for an open mesh without holes
it's V + F - E = 1). The formula holds in general, but we can constrain it
when dealing with connected triangles in a closed mesh.
If every entity in the mesh is a triangle, we know that the number of edges
is going to be exactly 3/2 * the number of faces, since each face has three
edges and each edge in a closed mesh is shared by two faces. Substitute:
V + F - 3/2*F = 2
V ~= 1/2 * F
(I'm dropping the "2" constant, since it's relatively unimportant for large
meshes). We also know that each triangle has 3 vertices, giving 3*F
vertex-face incidences in all; for the total number of vertices to be 1/2 the
number of faces, each vertex *must* be shared by an average of 3*F / (F/2) =
6 triangles. I love this result, as it applies to any
mesh. You can draw the mesh so that one vertex is shared by many triangles,
but then the rest of the vertices are shared by proportionally fewer,
balancing out to 6 - there's no way around it.
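The result is easy to check numerically: 4-to-1 midpoint subdivision of a
closed triangle mesh takes (V, E, F) to (V+E, 2E+3F, 4F), so starting from an
icosahedron one can watch 3F/V creep toward 6:

```cpp
#include <cassert>

// Start from an icosahedron (V=12, E=30, F=20) and apply midpoint
// subdivision: every edge gains a vertex, every face becomes four faces,
// every edge splits in two plus three new interior edges per face.
// V + F - E stays 2 throughout, and the triangles-per-vertex ratio
// 3F/V climbs toward 6 as the mesh grows.
struct Mesh { long V, E, F; };

Mesh subdivide(Mesh m)
{
    return { m.V + m.E, 2 * m.E + 3 * m.F, 4 * m.F };
}
```

After six subdivisions (81,920 faces) the ratio is already within a small
fraction of a percent of 6, while V + F - E is still exactly 2.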
If the mesh is open, then the ratio will change, with the average number of
triangles going down. Topologically, with an open mesh (one not forming a
solid object, but having some edges that belong to only a single triangle)
you can think of the mesh as solid, but including a polygon with many edges
along the outside of the mesh. This many-sided polygon changes the ratio.
[I wrote this note to Pete, and he had just learned the same fact the day
before receiving my note! I figure if neither of us knew this until recently,
then it was worth presenting here.]
----------------------------------------------------------------------------
Bounding Box Intersection via Origin Location, by Eric Haines
Schmalstieg and Tobler had a clever article two years ago:
Dieter Schmalstieg and Robert F. Tobler, "Fast projected area computation for
three-dimensional bounding boxes," journal of graphics tools, 4(2):37-43,
1999. http://jgt.akpeters.com/papers/SchmalstiegTobler99/
The idea is that a bounding box can be thought of as six planes that divide
space into 27 parts (nothing new so far, it's used in orthographic view
volume clipping). Take the location of the eye and see which of the 27 parts
it's in. For each of the 27 positions there is a set of faces visible from
that position (1, 2, or 3; essentially which faces face that direction). Go a
step further, though: in a table for the 27 entries you can do better than
storing the faces that are visible. Instead, you store the list of box
vertices, 4 or 6, that make up the silhouette outline of the box from that
position. This silhouette list does not change within the volume of space (of
the 27) that you're in. You can use this list of vertices to project onto the
screen and directly compute the screen area of the convex polygon formed.
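The classification itself is just two comparisons per axis; here is a sketch
of the indexing (the 27 silhouette vertex lists themselves are in Schmalstieg
and Tobler's paper and are not reproduced here):

```cpp
#include <cassert>

// Classify the eye against the box's six planes: each axis contributes
// 0 (below the box's slab), 1 (inside it), or 2 (above it), giving one
// of 27 regions. The region index selects a precomputed table entry
// listing the 4 or 6 box corners forming the silhouette from there.
int regionIndex(const double eye[3], const double lo[3], const double hi[3])
{
    int idx = 0, scale = 1;
    for (int a = 0; a < 3; ++a) {
        int r = (eye[a] < lo[a]) ? 0 : (eye[a] > hi[a]) ? 2 : 1;
        idx += r * scale;
        scale *= 3;
    }
    return idx;   // 13 == (1,1,1): eye inside the box, no silhouette
}
```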
I was wondering if this algorithm might be useful for ray/box intersection. I
have not tried it out, but the idea's to use these silhouettes for 2D point
in polygon testing, and also find the distance along the ray by using the two
box corners that bound the ray's direction. There is only one closest (and
one farthest) box vertex for a given ray direction, and the index of this
vertex can be determined once for any given ray direction. This is an idea
from shaft culling (http://www.erichaines.com), where the closest box vertex
to a plane is entirely determined by the normal of the plane (and a ray does
define such a plane, its direction serving as the plane's normal and its
origin as a point on the plane).
To begin, the closest and farthest vertex can be cast upon the ray's
direction to find out how far these vertices are along the ray. This is
simple: the vector from the ray's origin to the vertex dotted with the ray
direction gives the distance to the vertex. If the ray's extent (and rays
often are actually line segments, having a maximum distance to them, but
"line segment tracing" doesn't sound as good as "ray tracing") overlaps the
box's closest and farthest distances, then the ray is likely to hit the box.
This is not a perfect test, but is conservative: the test will sometimes say
there's an overlap when there's not, but will never miss a box that should be
hit.
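Selecting the two corners and projecting them onto the ray might look like
this (my sketch, not taken from the shaft-culling code):

```cpp
#include <cassert>

// The box corner nearest along direction d has, per axis, the lo
// coordinate where d is non-negative and the hi coordinate where it is
// negative; the farthest corner is the opposite one. Dotting the vector
// from the ray origin to each corner with the direction gives the
// conservative [near, far] span used in the overlap test above.
void spanAlongRay(const double org[3], const double dir[3],
                  const double lo[3], const double hi[3],
                  double& tNear, double& tFar)
{
    tNear = tFar = 0.0;
    for (int a = 0; a < 3; ++a) {
        double n = (dir[a] >= 0.0) ? lo[a] : hi[a];   // nearest corner coord
        double f = (dir[a] >= 0.0) ? hi[a] : lo[a];   // farthest corner coord
        tNear += (n - org[a]) * dir[a];               // dot(corner - org, dir)
        tFar  += (f - org[a]) * dir[a];
    }
}
```

If the ray's [0, maxDistance] extent misses the [tNear, tFar] span entirely,
the box can be rejected without any further work.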
If the ray survives this test, say the ray starts in a corner region. You now
have a polygon outline. Instead of computing the ray's intersection with 3
slabs or 3 faces (the traditional ray/box intersection methods), form a plane
at the closest corner that is perpendicular to the ray's direction. In fact,
you have all you need of this plane already, as the ray's direction is the
plane's normal, and the plane's distance is actually irrelevant. The idea is
to use the ray's direction to cast the ray's origin and the silhouette
vertices onto a 2D plane. If the ray's 2D position is inside the vertex
list's 2D polygon, the ray is definitely inside the box. It doesn't really
matter what the frame of reference is for the 2D plane itself, anything will
do.
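The final containment check is then an ordinary 2D point-in-convex-polygon
test on the projected silhouette; for example, assuming the projected
vertices arrive in counterclockwise order:

```cpp
#include <cassert>

// Once the ray origin and the silhouette vertices are projected to 2D,
// the origin is inside the convex silhouette iff it lies on the same
// side of every edge. Each edge test is a single 2D cross product.
bool insideConvex(const double p[2], const double (*v)[2], int n)
{
    for (int i = 0; i < n; ++i) {
        const double* a = v[i];
        const double* b = v[(i + 1) % n];
        double cross = (b[0] - a[0]) * (p[1] - a[1])
                     - (b[1] - a[1]) * (p[0] - a[0]);
        if (cross < 0.0) return false;   // origin is outside this edge
    }
    return true;
}
```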
That's about it, and I hope it makes sense. To recap, you first look at the
problem as how far along the ray is the box. If you survive this test, you
see whether, looking along the ray's direction, the ray's origin is inside
the convex polygon formed by the silhouette vertices. I have my doubts as to
whether this test is any faster than traditional tests (the operation of
actually casting onto a 2D plane looks to be slow), but I thought it was
interesting that there was another way to skin this cat. This algorithm does
have the interesting feature that it appears to avoid using division at all;
it is mostly just dot products to cast vertices onto lines and planes.
----------------------------------------------------------------------------
END OF RTNEWS