Tried it...
First, your code didn't compile anything, because you didn't compile the actual object but the one that you fed into the super(Object3D) call, which doesn't work. So "Block" should actually look like this:
package env;

import com.threed.jpct.*;

/**
 *
 * @author User
 */
public class Block extends Object3D
{
    private static final long serialVersionUID = 1L;

    public static Object3D MODEL = null;

    public static Object3D getModel()
    {
        if (MODEL == null)
        {
            MODEL = Primitives.getCube(0.5f);
            MODEL.setTexture("box");
            MODEL.setEnvmapped(Object3D.ENVMAP_ENABLED);
            MODEL.setCollisionMode(Object3D.COLLISION_CHECK_OTHERS);
            MODEL.build();
            MODEL.compile();
            MODEL.rotateY((float) -Math.PI / 4f);
        }
        return MODEL;
    }

    public Block()
    {
        super(getModel(), true);
        shareCompiledData(MODEL);
        compile();
    }
}
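The key idea here - build and compile the model once, then let every instance reuse it - can be sketched engine-free in plain Java. Model, SharedModelDemo and the field names below are stand-ins for illustration, not jPCT classes:

```java
// Stand-in for a compiled mesh; not a jPCT class.
class Model {
    final String name;
    Model(String name) { this.name = name; }
}

class Block {
    // Built and "compiled" once, shared by all instances.
    private static Model MODEL = null;

    static Model getModel() {
        if (MODEL == null) {
            MODEL = new Model("cube"); // build() + compile() would happen here
        }
        return MODEL;
    }

    final Model shared;

    Block() {
        shared = getModel(); // every Block reuses the same compiled data
    }
}

public class SharedModelDemo {
    public static void main(String[] args) {
        Block a = new Block();
        Block b = new Block();
        System.out.println(a.shared == b.shared); // prints "true"
    }
}
```

This is why the lazy initialization in getModel() matters: the expensive build/compile work happens on the first Block only, and every later Block just picks up the shared result.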
With that, the block will actually be compiled. However, that doesn't really help much (if at all)...
(But you should do it anyway...).
So, what's the reason for this? I did some benchmarking with your test code and discovered some minor bottlenecks, which I've fixed in this beta jar:
http://www.jpct.net/download/beta/jpct.jar
However, even with this jar, it's still not much faster. So I decided to tweak the settings for this particular scene (i.e. all blocks in view, nothing clipped, nothing culled, nothing transparent etc.) by adding these lines:
Config.doSorting=false;
Config.alwaysSort=false;
Config.glMultiPassSorting=false;
Config.useFrustumCulling=false;
The first three lines disable sorting completely. It's not needed in your scene, but it might be needed in more realistic scenes.
The last line disables frustum culling, which isn't helpful if all objects of a scene are in view.
With that, performance increases slightly on my machine (Core2 Quad @ 3.2GHz, Radeon HD 5870, Windows Vista, Java 6), but it's still pretty low. So I did some more detailed benchmarks to see where the time is actually spent.
The time taken for a complete render cycle is around 34ms. This consists of:
- Engine work, i.e. iterating over all objects, setting up transformation matrices, setting engine states... that kind of stuff: 12ms
- Render time, i.e. setting up OpenGL states and drawing the display lists: 22ms
The latter 22ms split into:
- Setting up GL and similar actions: 9ms
- Rendering the display lists: 13ms
So we have 13ms that we can't avoid, because that's plain display list drawing that has to be done no matter what (which means that the highest possible framerate when making only display list calls (which isn't possible) is somewhere around 76fps).
That leaves us with 9ms for GL setup and 12ms for engine work. Those 12ms split again into 9ms of pure processing code and 3ms of OpenGL state management. Those 3ms can't be optimized away without killing performance on everything that isn't as simple as this test scene.
So we have 2 * 9ms left. The parts that take this time are already pretty well optimized. It might be possible to squeeze out a millisecond here or there, but that's pretty much what it takes.
This application is a very special case: it renders very simple objects, but tons of them. All the overhead (GL's and jPCT's) stacks up pretty high in this case. I don't really see how I can improve this much further. I would like to try your direct-GL test case to see how that one really performs and whether it's in line with the assumptions about maximum possible performance that I made above. Maybe I'm doing something really stupid that I just don't see right now...