14.08.2004 Master Thesis published: RForce method (NV38Box) + Purcell's raytracer in BrookGPU (BrookBox)
I have introduced a new method called RForce and released my master's thesis on this topic.

Features of the RForce method:
- dynamic GPU raytracer running in real time for simple scenes and low resolutions
- scene = 16 triangle-mesh objects of up to 64 triangles each (up to 1024 triangles in total)
- uses AABBs as the acceleration structure
- per-pixel shading for both primary and secondary rays
- no dynamic flow control, as opposed to Purcell's method
- hybrid algorithm (primary rays = classic rasterization)
- nVidia only: NV38 path, NV40 path, and a brute-force path for comparison
- the implementation traces only secondary reflection rays, but the method is more flexible in general
- tested on WinXP, GeForce 5950 Ultra, ForceWare 60.xx
- slow at resolutions above 128x128 (but working)
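
The AABB acceleration structure from the list above can be illustrated with a standard slab test. This is a minimal CPU sketch in Python, not the shader code from the thesis; the function name `ray_hits_aabb` and the tuple layout are assumptions made for illustration only.

```python
import math

def ray_hits_aabb(origin, direction, box_min, box_max):
    """Slab test: True if the ray origin + t*direction (t >= 0) touches the box."""
    t_near, t_far = 0.0, math.inf
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray is parallel to this slab: it misses unless the origin lies inside it.
            if o < lo or o > hi:
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        # Shrink the parametric interval in which the ray is inside all slabs so far.
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True
```

With a scene of 16 objects, a ray would first be tested against the 16 boxes, and only the (at most 64) triangles of the boxes it hits would need an exact intersection test.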

See some screenshots of raytraced scenes:

An English translation of the main parts of the thesis is available under the RForce Method menu item.
The thesis CD can be downloaded as a single zip file, thesiscd.zip [53MB].

A by-product is a Purcell-style raytracer implemented in BrookGPU (also included on the CD).
25.05.2004 NV38Box progress and vista method
I started implementing NV38Box, a shell application for a set of GPU raytracing methods fine-tuned for the NV38 chip. In the end it should contain Purcell's raytracer and, optionally, my own ideas.

Last week I started working on a new idea for doing ray tracing another way on my NV38. Let's call it the vista method. I don't know of any papers related to this method. I'm not sure whether this approach is suitable for GPUs, and I'm still not sure whether it works at all. Last week I ran a set of implementation tests to find out whether it is doable with the current OpenGL API.

Vista Method
Let's start with Purcell's GPU raytracer scheme:

start:
  generate rays
  loop while any rays active
    if (oracle()) traverse
    else intersect
  goto start if there are more rays to do

From my experience with test scenes in the BrookGPU implementation, I need hundreds or thousands of iterations of that loop. I have a very simple oracle that repeats 5x traverse and 5x intersect in a row for a scene with a 10x10x10 grid in Brook (see the posts below). Purcell discussed optimizations using NV_occlusion_query and early z-culling in his dissertation. I hope they can help a lot, but I don't believe they would let me run thousands of passes on nVidia hardware in real time. Flow control is a problem on NV3x GPUs, so I need a method that gets by with very little flow control. Maybe I could trade some precision for it. What about moving into image space and letting the rasterizer do the hard work of finding ray-triangle intersections in fewer passes? Is it possible? What about aliasing problems? Many other questions arise ...

OK, let's try it:
The method is based on:
1. the depth peeling idea for generating lookup maps
2. rasterizing rays as lines in world space
3. rasterizing ray hits as points in camera space
4. "rendering into a vertex array", done twice

Here is a rough scheme:
frame preprocess:
  generate vista maps
start:
  generate rays
  find hits
  goto start if there are more rays to do

I tried this method on my previous simple test scene: three walls and one sphere (~2400 tris). Depth complexity is 2. All example pictures are taken from this scene. The second picture is an OpenGL rendering with lighting disabled.

generate vista maps:
I render 6x8 vista maps. The camera uses an orthogonal projection, is placed outside the scene box and faces the center of the scene box. 6x means "take all cube face directions" (FRONT, BACK, LEFT, RIGHT, TOP, BOTTOM). 8x means "perform depth peeling for level 0 up to level 7". Eight levels are not needed here because the current scene has only 2 depth levels, but they fit the resolution of the pbuffer well. I use a pixel shader that writes the triangle id and fragment z. See picture:

After this step I have 48 vista maps, each 128x128 pixels. The scene was rendered using VBOs and render_to_depth_texture. Shadow mapping hardware was used to perform the second depth test (see Cass Everitt's paper). Every rasterized pixel knows its triangle id and z value.
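
The depth peeling used for the vista maps can be sketched on the CPU for a single texel. This is only an illustration under assumed names (`peel_layers`, a list of `(z, triangle_id)` fragments); on the GPU the "second depth test" is done by the shadow mapping hardware, not by a Python loop.

```python
def peel_layers(fragments, num_layers):
    """fragments: list of (z, triangle_id) tuples covering one vista map texel.
    Returns num_layers entries ordered front to back; None marks an empty layer."""
    layers = []
    prev_z = float("-inf")
    for _ in range(num_layers):
        # Second depth test: keep only fragments strictly behind the previous layer.
        behind = [f for f in fragments if f[0] > prev_z]
        if not behind:
            layers.append(None)
            continue
        # Ordinary depth test: the nearest surviving fragment wins this layer.
        nearest = min(behind)
        layers.append(nearest)
        prev_z = nearest[0]
    return layers
```

For the test scene with depth complexity 2, only the first two of the eight levels would contain data for any texel; the remaining levels stay empty.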

find hits:
First, see the example pictures of how the pbuffer memory is organized (note that the buffer is not shown at full resolution and the colors encode the data):

Hits are detected this way:
1. I have rendered the scene from the eye view. For every pixel that should trace a ray, I know the ray origin and the ray end. I copy this data into vertex array 1 (VA1). (Note: if I don't want to trace a ray for a pixel, I just fill in coordinates like start=[-1,-1,-1] and end=[-1,-1,-1] and hope the line gets completely clipped.)
2. I draw VA1 as GL_LINES using the rayinspector pixel shader. Ray origin and ray end are given in world space. I render VA1 48 times (once per vista map) with the proper MODELVIEW matrix. The result of the line rasterization is a set of confirmed hit points (they appear as yellow points in the picture; the rest is black, i.e. discarded). In each hit pixel I store the ray's address in eye space in [x,y], z = time of the hit event, and w = id of the triangle that was hit.
3. I copy the hit points into VA2.
4. I draw VA2 as GL_POINTS using the hitinspector pixel shader. This reports the hits to their pixels in the eye view. DEPTH_TEST must be enabled with GL_LESS to sort the hits in each pixel by time.

RayInspector is a pixel shader used for rasterizing the fragments of a line. One line represents one ray. I render each ray 48 times (once per vista map). The line is rasterized in world space from the ray's origin to the ray's end. In every fragment I have a chance to test the ray against the triangle whose ID is stored in the current vista map. The ray vs. triangle test is done analytically (see the related links for the paper, or see Purcell's paper). I need to reduce the number of these tests for two reasons:
1. every pixel of vista map space can remember only one hit, so flooding it with the same hit multiple times would waste this valuable space
2. the ray-triangle intersection code is a complex fragment program
I can test the z value of the incoming line fragment against the z value stored in the vista map. If they are not close enough (within some specified interval), I should discard the fragment. This should help prevent flooding vista map space with redundant hits, but on the other hand it can miss some hits. How big is this error? Can I defer the ray-triangle intersection computation? I hope this z-interval test can be accelerated by the shadow mapping hardware (in the texture units) rather than computed in the fragment program itself.
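
The per-fragment work described above can be sketched as a cheap z-interval filter followed by an analytic ray-triangle test. The intersection routine below is the well-known Möller-Trumbore algorithm, used here as a stand-in because the page does not say which routine the shader implements; `inspect_fragment` and `z_tolerance` are hypothetical names.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test: return the ray parameter t of the hit, or None."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:          # ray is parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv         # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t >= 0.0 else None

def inspect_fragment(frag_z, vista_z, z_tolerance, origin, direction, tri):
    """Discard the fragment early unless its depth is close to the vista map depth."""
    if abs(frag_z - vista_z) > z_tolerance:
        return None             # corresponds to 'discard' in the fragment program
    return ray_triangle(origin, direction, *tri)
```

The choice of `z_tolerance` is exactly the open question raised in the text: a small value suppresses redundant hits but can also reject valid ones.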

HitInspector is a pixel shader used for rasterizing the fragment(s) of the points. One point represents one hit produced by RayInspector. A point has position [x,y,z], where [x,y] are the screen coordinates of a pixel in the eye view (the pixel that sent the ray, i.e. the pixel whose line was rasterized). [z] is the time at which the hit was detected. Using the depth test with GL_LESS, we force the rasterizer to return only the first hit for every pixel. [w] is a texture coordinate storing the triangle ID, that is, the ID of the triangle hit by the ray. With this information the next shader pass can compute ray shading. If the ray from [x,y] hit no triangle of the scene, no "hit pixel" is rasterized at position [x,y].
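
What the GL_LESS depth test does for the hit points can be mimicked on the CPU as follows. This is a sketch only; `resolve_hits` and the `(x, y, t, triangle_id)` tuple layout are assumptions chosen to mirror the [x,y,z,w] description above.

```python
def resolve_hits(hits, width, height):
    """hits: iterable of (x, y, t, triangle_id).
    Returns {(x, y): (t, triangle_id)} with the earliest hit per pixel,
    which is the result a depth test with GL_LESS leaves in the buffer."""
    nearest = {}
    for x, y, t, tri in hits:
        if not (0 <= x < width and 0 <= y < height):
            continue  # point falls outside the viewport and is clipped
        if (x, y) not in nearest or t < nearest[(x, y)][0]:
            nearest[(x, y)] = (t, tri)
    return nearest
```

Pixels whose rays produced no hit simply have no entry, matching the case where no "hit pixel" is rasterized.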

Method's cons:
1. scene must have a limited depth complexity (no problem for me)
2. aliasing problems (oh, we know it from shadow mapping)
3. pathological cases (ray perpendicular to triangle both perpendicular to vista map view)
4. lost hits due to limit "one hit per vista map texel" (limit amount of reflective materials in the scene, increase resolution, limit somehow hit flooding, etc.)

Method's pros:
1. works for dynamic scenes (cool !!!)
2. better GPU utilization than Purcell's ?

Current implementation (= problems)
I'm using OpenGL 1.5 + the GLEW library, Cg 1.2 + the Cg runtime, the vp30 and fp30 profiles, and the RenderTexture class, with ForceWare 61.12 drivers on a GeForce FX 5950U GPU.
allocated pbuffer 1024x1024 with parameters "rgba=32f doublebuffer depth texRECT depthTexRECT"
main loop
  render scene from eye view
  switch to pbuffer
    render 6x8 vista maps in front 
    // area [0..8x128, 0..6x128]
    render from eye view into front (write ray origins) 
    // area [0..255, 6x128...6x128+255]
    render from eyeview into front (write ray ends) 
    // area[256..256+255, 6x128...6x128+255]
    render vertical line grid (1px spacer) into back 
    // area [0..510, 6x128...6x128+255] 
    // (written every even line) output origins
    render vertical line grid (1px spacer) into back 
    // area [1..511, 6x128...6x128+255] 
    // (written every odd line) output ends
    render vertical line grid (1px spacer) into back 
    // area [512+0..512+510, 6x128...6x128+255] 
    // (written every even line) output ends
    render vertical line grid (1px spacer) into back 
    // area [512+1..512+511, 6x128...6x128+255] 
    // (written every odd line) output origins
    glReadPixels([0..511, 6x128...6x128+255], PBO1) from back
    glReadPixels([512+0..512+511, 6x128...6x128+255], PBO1+offset) from back
    // now I have line origins and ends in the vertex buffer; 
    // the same information is stored in texture coordinates but swapped, 
    // so every vertex knows both the origin and the end 
    // of the line it belongs to (and so does every pixel)
    rebind PBO1 as a VBO1 and render lines using rayinspector into back
    // batch of 256x256 lines is rendered 48 times (once per vista map), ufff
    glReadPixels([0..8x128, 0...6x128], PBO2) from back
    // now I have hits in PBO2; [x,y] is the address of the original pixel in the eye view
    rebind PBO2 as a VBO2 and render points using hitinspector into front
    // area [512..512+255, 6x128...6x128+255]
  switch back to main context
end loop
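
The pbuffer areas named in the loop above imply a simple tile layout, which can be checked with a little arithmetic. This is a sketch under assumptions: the mapping of depth levels to columns and cube faces to rows is a guess consistent with the area [0..8x128, 0..6x128], and the names `vista_tile` and `eye_region` are hypothetical.

```python
TILE = 128        # vista map size in pixels
EYE = 256         # eye-view region size in pixels
PBUFFER = 1024    # pbuffer is 1024x1024
FACES = ["FRONT", "BACK", "LEFT", "RIGHT", "TOP", "BOTTOM"]
LEVELS = 8        # depth peeling levels 0..7

def vista_tile(face, level):
    """Return (x, y, width, height) of one vista map inside the pbuffer.
    Assumed layout: one column per depth level, one row per cube face."""
    row = FACES.index(face)
    return (level * TILE, row * TILE, TILE, TILE)

def eye_region(slot):
    """Return (x, y, width, height) of the slot-th 256x256 region above the vista maps.
    In the loop above: slot 0 = ray origins, slot 1 = ray ends, slot 2 = reported hits."""
    return (slot * EYE, len(FACES) * TILE, EYE, EYE)
```

The 48 vista maps fill a 1024x768 block, and the 256-pixel-high strip above it brings the total to exactly 1024x1024.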

1. rendering vista maps: using render_to_depth, binding the depth buffer, and GL_TEXTURE_COMPARE_MODE_ARB with GL_COMPARE_R_TO_TEXTURE_ARB for the second depth test. A static VBO holds the whole scene geometry (one glDrawArrays call).
2. Because of the lack of MRTs on nVidia, I render my scene twice from the eye view. The first render outputs ray origins, the second outputs ray ends. Then I need 4 renders of the line grid to interleave the data for reading via PBO and binding as a VBO for GL_LINES. I don't know a better "readback" scheme yet. The grid consists of 256 vertical lines, each 256 pixels long (spaced with a 1px gap). The grid is stored in a static VBO on the GPU. The interleaver simply reads from the front buffer and copies into the back buffer.
3. PBO -> VBO is done the standard way. For lines I need to call glReadPixels twice (I need to feed the texture coordinates of each vertex with the other endpoint of its line). A line needs to know the ray origin and direction for the analytic ray-triangle intersection (in every fragment).
4. lines are rasterized by calling glDrawArrays(GL_LINES,...) 48 times
5. points are rasterized by calling glDrawArrays(GL_POINTS,...)
All vertex and fragment programs are compiled.
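
The interleaving from steps 2 and 3 can be sketched on the CPU to show what the two arrays look like once they are bound for GL_LINES. The function name `build_line_arrays` is hypothetical, and plain Python lists stand in for the PBO/VBO memory.

```python
def build_line_arrays(origins, ends):
    """Interleave per-pixel ray origins and ends into GL_LINES order.
    positions: [o0, e0, o1, e1, ...]  (two vertices per line)
    texcoords: [e0, o0, e1, o1, ...]  (each vertex carries the other endpoint)"""
    positions, texcoords = [], []
    for o, e in zip(origins, ends):
        positions += [o, e]
        texcoords += [e, o]
    return positions, texcoords
```

Because the texture coordinates are swapped relative to the positions, every vertex (and therefore every interpolated fragment) can recover both the origin and the end of its ray.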

ufff, I think that is enough for now.
10.05.2004 Raytracer (primary ray, secondary reflection and refraction rays + shadow rays)
Today I extended and bugfixed the previous version of the raycaster. The main concept stayed the same. I fixed some singular cases with shadow rays and debugged the reflection code. Here is a simple example of a 2-level raytracer in BrookGPU.
Note: Unfortunately, I have reached the HW limitations of the nv30gl profile of BrookGPU (max textures per kernel). Now I'm running only the CPU path.
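
For reference, the directions of secondary reflection and refraction rays follow the standard formulas sketched below. This is generic textbook math in Python, not the Brook kernel code; it assumes unit-length vectors, a normal pointing against the incoming ray, and `eta` as the ratio of refractive indices (incident over transmitted).

```python
import math

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def reflect(d, n):
    """Mirror the unit direction d about the unit normal n: r = d - 2(d.n)n."""
    k = 2.0 * dot(d, n)
    return tuple(di - k * ni for di, ni in zip(d, n))

def refract(d, n, eta):
    """Snell refraction of unit direction d at unit normal n.
    Returns None on total internal reflection."""
    cos_i = -dot(d, n)
    k = 1.0 - eta * eta * (1.0 - cos_i * cos_i)
    if k < 0.0:
        return None
    return tuple(eta * di + (eta * cos_i - math.sqrt(k)) * ni
                 for di, ni in zip(d, n))
```

With `eta = 1.0` the refracted ray continues undeflected, which is a handy sanity check when debugging singular cases.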

Scene: 'Checker board with sphere', 1440 triangles, 2 lights, voxels 10x10x10, 4000 passes, resolution 512x512

Next focus in 1): continue implementing the full Purcell raytracer in BrookGPU, fix remaining bugs, test refraction, test more levels
06.05.2004 Raycaster + primary shadow rays
Today I finished a working prototype of a more advanced raycaster. The program is able to trace primary shadow rays. Multiple point lights are supported. The current version is also prepared to be extended with secondary reflected and refracted rays. I simplified the previous raycaster by merging the intersector_init and intersector_step kernels into one kernel. I introduced a new kernel called dispatcher, which is responsible for preparing a new ray when the previous ray is finished. The current version has the following main loop scheme:
samplephases = NUMRAYS per pixel
shadowphases = lights+1

[ samplephases*shadowphases
    [ 5x
    [ 5x
  shader // do blending

Scene: 'Triangle', 1 triangle, 2 lights, voxels 10x10x10, 450 passes, resolution 256x256

Scene: 'Torus', 492 triangles, 2 lights, voxels 10x10x10, 1500 passes, resolution 256x256

Scene: 'Torus', 492 triangles, 2 lights, voxels 10x10x10, 1500 passes, resolution 1024x1024

Next focus in 1): continue implementing the full Purcell raytracer in BrookGPU (reflected and refracted rays), fix some remaining issues with shadows
03.05.2004 Raycaster implementation using BrookGPU
Two weeks ago I started working hard on this project again (the last decent work was done in January, and then I had to stop, mainly because of driver bugs).
I have quite good news here: I have a working raycaster written in BrookGPU. At the moment only the CPU path works correctly; I have some issues with the nv30gl path. This is not a big problem, because I'm using Brook for fast prototyping of algorithms and for debugging purposes.

I have Purcell's raycaster running for simple scenes. It is NOT realtime at the moment.
The voxel grid is 10x10x10 (1000 voxels). Currently I have 5 kernels: traverser_init, traverser_step, intersector_init, intersector_step, shader. The oracle is very simple: the algorithm loops 5x traverser step + 5x intersector step. The last step is the shading pass. Pixels can be shaded with three types of shading algorithms: SIMPLE, FLAT or SMOOTH.
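
Stepping a ray through a uniform voxel grid can be sketched with a 3D-DDA in the style of Amanatides and Woo. The page does not say which traversal scheme the traverser kernels use, so treat this as an assumed stand-in; `traverse_grid` and its parameters are hypothetical, and each loop iteration roughly corresponds to one traverser step.

```python
import math

def traverse_grid(origin, direction, grid_size=10, cell=1.0, max_steps=64):
    """Return the voxel indices visited by a ray that starts inside the grid
    [0, grid_size*cell)^3, in visiting order."""
    idx = [int(math.floor(o / cell)) for o in origin]
    step, t_max, t_delta = [], [], []
    for i in range(3):
        d = direction[i]
        if d > 0:
            step.append(1)
            t_max.append(((idx[i] + 1) * cell - origin[i]) / d)  # t of next boundary
            t_delta.append(cell / d)                             # t between boundaries
        elif d < 0:
            step.append(-1)
            t_max.append((idx[i] * cell - origin[i]) / d)
            t_delta.append(-cell / d)
        else:
            step.append(0)
            t_max.append(math.inf)
            t_delta.append(math.inf)
    visited = []
    for _ in range(max_steps):
        if any(c < 0 or c >= grid_size for c in idx):
            break                         # ray has left the grid
        visited.append(tuple(idx))
        axis = t_max.index(min(t_max))    # cross the nearest voxel boundary
        idx[axis] += step[axis]
        t_max[axis] += t_delta[axis]
    return visited
```

On hardware without dynamic flow control, every pass advances all rays by one such step, which is why the pass counts listed for the scenes grow so quickly.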

Scene: 'Triangle', 1 triangle, 1 light, voxels 10x10x10, 250 passes, resolution 256x256

Scene: 'Box', 12 triangles, 1 light, voxels 10x10x10, 250 passes, resolution 256x256

Scene: 'Sphere', 960 triangles, 1 light, voxels 10x10x10, 400 passes, resolution 256x256

Next focus in 1): implement the full Purcell raytracer in BrookGPU (shadow rays, reflected and refracted rays)
29.03.2004 Thesis 'Ray Tracing on a Stream Processor' by Tim Purcell
I was waiting for this wonderful thesis, because it contains key information for my future research.
Once I overcome the implementation problems, I want to focus on:
1) fast implementation using BrookGPU
2) a fine implementation of Purcell's raytracer on FX5950 using Cg1.2+OpenGL1.5 and the fp30 profile (some work already done)
3) my raytracer using bounding volumes as the acceleration structure (and more)
I'm leaving for a week to attend the 12th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision 2004.

I hope to come across some GPU-related topics there.
19.01.2004 First implementation problems :-/
I started implementing Purcell's raytracer on the GPU as a reference implementation for my future research. I have chosen the Cg1.1+OpenGL1.5+fp30 platform. I have written a simple framework for my project (wrappers dealing with fragment programs via the Cg runtime, a pbuffer class, a pbuffer state machine, a quad renderer etc.). I started to use cgc.exe and the Cg runtime heavily for compiling, binding and running my program kernels. At the moment I'm not very familiar with fp30 assembler (NV_fragment_program), mainly because its only documentation is the rather poor extension specification. There were a lot of try-it-again or do-it-another-way situations. I learned many new things, mainly about the pbuffer, render_to_texture and NV_fragment_program extensions. Some of my implementation problems are described here.

At this point I have a working Ray Generator kernel and Grid Traverser kernel as described in Purcell's paper.
For traversing the grid structure I used the method described here.
I have also successfully tested the ray-triangle intersection routine (described here).
I use imdebug for image debugging - a great tool!

Implementation details:
I'm using double-buffered floating-point buffers. I use render_to_texture with a ping-pong scheme within the buffer. A simple state machine tracks the next output rectangle and binds the last output rectangle as the input texture for the next pass. I avoid switching pbuffers: all work is done within a single pbuffer, switching surfaces with the glDrawBuffer function. At the moment I have run into trouble with the Cg runtime. Maybe my speed expectations were too high, but this is my first attempt at implementing a complex GPU algorithm.
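
The ping-pong scheme can be illustrated with two plain Python lists standing in for the two surfaces of the pbuffer. This is a sketch under assumptions: `run_passes` is a hypothetical name, and each "kernel" is just a per-element function rather than a fragment program.

```python
def run_passes(initial, kernels):
    """Apply each kernel in turn, reading from one buffer and writing to the other.
    After every pass the roles swap, so the last output becomes the next input."""
    buffers = [list(initial), [0.0] * len(initial)]
    src = 0
    for kernel in kernels:
        dst = 1 - src
        for i, value in enumerate(buffers[src]):
            buffers[dst][i] = kernel(value)   # read only from src, write only to dst
        src = dst                             # swap roles for the next pass
    return buffers[src]
```

For example, `run_passes([1.0, 2.0], [lambda v: v + 1, lambda v: v * 2])` runs two passes and returns `[4.0, 6.0]`. A pass never reads and writes the same surface, which is the constraint the state machine described above has to enforce.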

I have studied new papers. I added the interesting ones to the workspace (check the papers/important folder).
I browsed nearly all of the GPGPU forums looking for similar implementation solutions.
08.01.2004 The "Day D"
Until now, it was just theory. From now on, it becomes real hard coding :).
Yes, yes, I bought a GigaByte MAYA GV-N595U256V 256MB for my Inferno research. Since last time I have done some work reading papers and specifications. I have gained quite deep knowledge of nVidia OpenGL extensions. I also browsed tons of web pages ;). I have also received some positive reactions from visitors of this site and the GPGPU pages.
05.12.2003 Site improvements and studying
I did quite a bit more work on Inferno last month compared with last year ;-). Here is a brief report:
I have been sniffing around internet forums looking for GPU-related projects and info.
I found the great site www.gpgpu.org (thanks to jezeq).

Papers I have studied:
OpenGL 2.0 specifications (overview first reading)
Photon Mapping on Programmable Graphics Hardware - awesome paper
Implementing Ray Casting in Tetrahedral Meshes with Programmable Graphics Hardware

The Inferno site changes:
"Workspace" section opened - the workspace is dedicated to files that I'm working with and that are related to the Inferno project.
"Pros & Cons" section opened - text comparing the pros and cons of raytracing on GPUs.
Site statistics (not public).

Future tasks:
"Research Focus" section on the site. At that point the site will be announced on gpgpu.org
pbuffer OpenGL extension
DirectX 9.0 features
Homepage of Inferno Project, best viewed in Internet Explorer or Mozilla browsers, min. resolution 800x600, (c) Antonin Hildebrand 2004