VR application debugging is a matter of getting insight into how the application is structured and executed, gathering data to evaluate actual performance, evaluating it against expectation, then methodically isolating and eliminating problems.
When analyzing or debugging, it is crucial to proceed in a controlled way so that you know specifically what change results in a different outcome. Focus on bottlenecks first. Only compare apples to apples, and change one thing at a time (e.g., resolution, hardware, quality, configuration).
Always be sure to profile, as systems are full of surprises. We recommend starting with simple code, and optimizing as you go – don’t try to optimize too early.
We recommend creating a 2D, non-VR version of your camera rig so you can swap between VR and non-VR perspectives. This allows you to spot check your scenes, and it may be useful if you want to do profiling with third-party tools (e.g., Adreno Profiler).
It can be useful to disable Multithreaded Rendering in Player Settings during performance debugging. This will slow down the renderer, but also give you a clearer view of where your frame time is going. Be sure to turn it back on when you’re done!
Before debugging performance problems, establish clear targets to use as a baseline for calibrating your performance.
These targets can give you a sense of where to aim, and what to look at if you’re not making frame rate or are having performance problems.
Below you will find some general guidelines for establishing your baselines, given as approximate ranges unless otherwise noted.
- 60 FPS (required by Oculus)
- 50-100 draw calls per frame
- 50,000-100,000 triangles or vertices per frame
Unity Profiling Tools
This section details tools provided by Unity to help you diagnose application problems and bottlenecks.
Unity comes with a built-in profiler (see Unity’s Profiler manual). The Unity Profiler provides per-frame performance metrics, which can be used to help identify bottlenecks.
To use the Unity Profiler with a Rift application, select Development Build and Autoconnect Profiler in Build Settings and build your application. When you launch your application, the Profiler will open automatically.
You may profile your application as it is running on your Android device using adb or Wi-Fi. For steps on how to set up remote profiling for your device, please refer to the Android section of the following Unity documentation: https://docs.unity3d.com/Documentation/Manual/Profiler.html.
The Unity Profiler displays CPU utilization for the following categories: Rendering, Scripts, Physics, GarbageCollector, and Vsync. It also provides detailed information regarding Rendering Statistics, Memory Usage (including a breakdown of per-object type memory usage), Audio and Physics Simulation statistics.
GPU Usage data for Android is not available at this time.
The Unity profiler only displays performance metrics for your application. If your app isn’t performing as expected, you may need to gather information on what the entire system is doing.
Show Rendering Statistics
Unity provides an option to display real-time rendering statistics such as FPS, draw calls, triangle and vertex counts, and VRAM usage. While in the Game View, press the Stats button above the view to display an overlay showing real-time render statistics. Viewing stats in the Editor can help you analyze and improve batching for your scene by indicating how many draw calls are being issued and how many are being saved by batching (the Overdraw render mode is helpful for this as well).
Show GPU Overdraw
Unity provides a specific render mode for viewing overdraw in a scene. From the Scene View Control Bar, select OverDraw in the drop-down Render Mode selection box.
In this mode, translucent colors will accumulate providing an overdraw “heat map” where more saturated colors represent areas with the most overdraw.
Unity Built-in Profiler
Unity Built-in Profiler (not to be confused with Unity Profiler) provides frame rate statistics through logcat, including the number of draw calls, min/max frametime, number of tris and verts, et cetera.
To use this profiler, connect to your device over Wi-Fi using ADB over TCPIP as described in the Wireless usage section of Android’s adb documentation. Then run adb logcat while the device is docked in the headset.
See Unity’s Measuring Performance with the Built-in Profiler for more information. For more on using adb and logcat, see Android Debugging in the Mobile SDK documentation.
Oculus Performance Heads-Up Display (HUD) for the Rift
The Oculus Performance Heads-Up Display (HUD) is an important, easy-to-use tool for viewing timings for render, latency, and performance headroom in real time as you run an application in the Oculus Rift. The HUD is easily accessible through the Oculus Debug Tool provided with the PC SDK. For more details, see the Performance Heads-Up Display and Oculus Debug Tool sections of the Oculus Rift Developers Guide.
Oculus Remote Monitor for Gear VR
Oculus Remote Monitor is a client for Windows and Mac OS X that connects to VR applications running on remote devices to capture, store, and display the streamed-in data. It provides visibility into Android VR and GLES activity, and includes low-res rendered image snapshots for a visual reference to its timeline-based display. Remote Monitor is available for download from our Downloads page.
The Remote Monitor client uses VrCapture, a low-overhead remote monitoring library. VrCapture is designed to help debug behavior and performance issues in mobile VR applications. VrCapture is included automatically in any project built with Unity 5, or compiled with the Legacy Integration.
For more information on setup, configuration, and usage, please see VrCapture and Oculus Remote Monitor.
Additional Third-Party Tools
ETW + GPUView
Event Tracing for Windows (ETW) is a trace utility provided by Windows for performance analysis. GPUView provides a window into both GPU and CPU performance with DirectX applications. It is precise, has low overhead, and covers the whole Windows system. It also supports custom event manifests.
ETW profiles the whole system, not just the GPU. For a sample debug workflow using ETW to investigate queueing and system-level contention, see Example Workflow: PC below.
Windows 10 introduces TraceLogging, which builds on ETW.
Systrace
Reports complete Android system utilization. Available here: http://developer.android.com/tools/help/systrace.html
Mac OpenGL Monitor
An OpenGL debugging and optimizing tool for OS X. Available here: https://developer.apple.com/library/mac/technotes/tn2178/_index.html#//apple_ref/doc/uid/DTS40007990
In this guide, we take a look at three of the areas commonly involved with slow application performance: pixel fill, draw call overhead, and slow script execution.
Pixel fill is a function of overdraw and of fragment shader complexity. Unity shaders are often implemented as multiple passes (draw diffuse part, draw specular part, and so forth). This can cause the same pixel to be touched multiple times. Transparency does this as well. Your goal is to touch almost all pixels on the screen only one time per frame.
Unity’s Frame Debugger (described in Unity Profiling Tools) is very useful for getting a sense of how your scene is drawn. Watch out for large sections of the screen that are drawn and then covered, or for objects that are drawn multiple times (e.g., because they are touched by multiple lights).
Z-testing is faster than drawing a pixel. Unity does culling and opaque sorting via bounding box. Therefore, large background objects (like your Skybox or ground plane) may end up being drawn first (because the bounding box is large) and filling a lot of pixels that will not be visible. If you see this happen, you can move those objects to the end of the queue manually. See Material.renderQueue in Unity’s Scripting API Reference for more information.
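A minimal sketch of the manual reordering described above (the queue value is an example; 2000 is Unity’s default Geometry queue, and higher values draw later):

```csharp
using UnityEngine;

// Attach to a large background object (skybox quad, ground plane) so it
// draws after closer geometry has filled the depth buffer.
public class DrawLast : MonoBehaviour
{
    void Start()
    {
        // 2000 = default "Geometry" queue; higher values render later.
        GetComponent<Renderer>().sharedMaterial.renderQueue = 2500;
    }
}
```

Note that this modifies the shared material, so every object using that material is reordered together, which is usually what you want for a skybox or ground plane.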
Frame Debugger will clearly show you shadows, offscreen render targets, et cetera.
Modern PC hardware can push a lot of draw calls at 90 fps, but the overhead of each call is still high enough that you should try to reduce them. On mobile, draw call optimization is your primary scene optimization.
Draw call optimization is usually about batching multiple meshes together into a single VBO with the same material. This is key in Unity because the state change related to selecting a new VBO is relatively slow. If you select a single VBO and then draw different meshes out of it with multiple draw calls, only the first draw call is slow.
Unity batches well when given properly formatted source data. Generally:
- Batching is only possible for objects that share the same material pointer.
- Batching doesn’t work on objects that have multiple materials.
- Implicit state changes (e.g. lightmap index) can cause batching to end early.
Here is a quick checklist for maximizing batching:
- Use as few textures in the scene as possible. Fewer textures require fewer unique materials, so they are easier to batch. Use texture atlases.
- Bake lightmaps at the largest atlas size possible. Fewer lightmaps require fewer material state changes. Gear VR can push 4096 lightmaps without too much trouble, but watch your memory footprint.
- Be careful not to accidentally instance materials. Note that accessing Renderer.material automatically creates an instance (!) and opts that object out of batching. Use Renderer.sharedMaterial instead whenever possible.
- Watch out for multi-pass shaders. Add noforwardadd to your shaders whenever you can to prevent more than one directional light from applying. Multiple directional lights generally break batching.
- Mark all meshes that never move as Static in the editor. Note that this will cause the meshes to be combined into a mega mesh at build time, which can increase load time and app size on disk, though usually not in a material way. You can also create a static batch at runtime (e.g., after generating a procedural level out of static parts) using StaticBatchingUtility.
- Watch your static and dynamic batch count vs the total draw call count using the Profiler, internal profiler log, or stats gizmo.
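The material-instancing pitfall from the checklist above can be illustrated with a short sketch (the class name is illustrative):

```csharp
using UnityEngine;

public class MaterialAccess : MonoBehaviour
{
    void Start()
    {
        Renderer r = GetComponent<Renderer>();

        // BAD: .material silently clones the material, giving this object a
        // unique material pointer and opting it out of batching.
        // r.material.color = Color.red;

        // OK: .sharedMaterial keeps the same pointer for every renderer
        // that uses this material, so batching still applies.
        Debug.Log(r.sharedMaterial.name);
    }
}
```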
Unity’s C# implementation is fast, and slowdown from script is usually the result of a mistake and/or an inadvertent block on slow external operations such as memory allocation. The Unity Profiler can help you find and fix these scripts.
Try to avoid foreach, lambda, and LINQ constructs, as these allocate memory needlessly at runtime. Use a for loop instead. Also be wary of loops that concatenate strings.
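As a sketch of the allocation-free style this suggests (relevant mainly to the older Mono runtime these Unity versions shipped with):

```csharp
using System.Collections.Generic;
using System.Text;

public static class LoopStyle
{
    // Allocation-heavy on old Mono: foreach may box an enumerator, and
    // string concatenation creates a new string every iteration.
    public static string Bad(List<int> values)
    {
        string s = "";
        foreach (int v in values) s += v + ",";
        return s;
    }

    // Cheaper: index-based loop plus a reusable StringBuilder.
    public static string Good(List<int> values, StringBuilder sb)
    {
        sb.Length = 0;
        for (int i = 0; i < values.Count; i++) sb.Append(values[i]).Append(',');
        return sb.ToString();
    }
}
```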
GameObject creation and destruction takes time. If you have a lot of objects to create and destroy (say, several hundred in a frame), we recommend pooling them.
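A minimal pooling sketch along those lines (names are illustrative; a production pool would pre-warm and cap its size):

```csharp
using System.Collections.Generic;
using UnityEngine;

public class SimplePool : MonoBehaviour
{
    public GameObject prefab;                        // assigned in the Inspector
    readonly Stack<GameObject> free = new Stack<GameObject>();

    public GameObject Spawn(Vector3 position)
    {
        GameObject go = free.Count > 0 ? free.Pop() : Instantiate(prefab);
        go.transform.position = position;
        go.SetActive(true);
        return go;
    }

    public void Despawn(GameObject go)
    {
        go.SetActive(false);    // deactivate and recycle instead of Destroy
        free.Push(go);
    }
}
```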
Don’t move colliders unless they have a rigidbody on them. Creating a rigidbody and settingisKinematic will stop physics from doing anything but will make that collider cheap to move. This is because Unity maintains two collider structures, a static tree and a dynamic tree, and the static tree has to be completely rebuilt every time any static object moves.
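A sketch of the kinematic-rigidbody trick (attach to any collider you intend to move):

```csharp
using UnityEngine;

[RequireComponent(typeof(Collider))]
public class CheapMovingCollider : MonoBehaviour
{
    void Awake()
    {
        Rigidbody rb = gameObject.AddComponent<Rigidbody>();
        rb.isKinematic = true;  // no simulation, but the collider now lives
                                // in the dynamic tree and is cheap to move
    }
}
```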
Note that coroutines execute in the main thread, and you can have multiple instances of the same coroutine running on the same script.
We recommend targeting around 1-2 ms maximum for all Mono execution time.
PC Debug Workflow
In this guide, we’ll use the example of a hypothetical stuttering app scene and walk through basic debugging steps.
Where to Start
Begin by running the scene with the Oculus Performance HUD.
If the scene drops more than one frame every five seconds, check the render time. If it’s more than 8 ms, have a look at GPU utilization. Otherwise, look at optimizing CPU utilization. If observed latency is greater than 30 ms, have a look at queueing.
CPU Profiling (Unity Profiler)
Look for the tallest bars in the CPU Usage graph in the Unity Profiler. Sort hierarchy by Total CPU time, and expand to see which objects and calls take the most time.
If you find garbage collection spikes, don’t allocate memory each frame.
GPU Profiling (Unity Profiler)
Are your rendering stats too high? (For reference baselines, see Performance Targets.)
Check for hogs in your hierarchy or timeline view, such as any single object that takes 8 ms to render. The GPU may also wait for long stalls on CPU. Other potential problem areas are mesh rendering, shadows, vsync, and subsystems.
Queueing and System-Level Contention
In this example, we’ll use Event Tracing for Windows (ETW) and GPUView (see Other Tools for an overview) with Windows 8.1.
- Install the Windows 8.1 SDK for ETW.
- Start Oculus event tracing:
  cd C:\Program Files (x86)\Oculus\Tools\ETW
  ovrlog
- Run your app.
- Stop Oculus event tracing.
- Run ovrlog again.
- Open trace\Merged.etl in GPUView (GPUView is usually located at C:\Program Files (x86)\Windows Kits\8.1\Windows Performance Toolkit\gpuview\GPUView.exe).
- Highlight your processes.
- Show vsyncs.
- Zoom in on problem area.
So you’ve decided to build a VR game in Unity and have settled on the Samsung Gear VR as your target platform. Getting it up and running on the device was easy enough, but there’s a problem—the frame rate is just too low. Your reticle snaps and sticks, there are flickering black bars on the sides of your vision, and motion looks like somebody just kicked the camera operator in the shins. You’ve read about how important it is to maintain a solid frame rate, and now you know why—in mobile VR, anything less than 60 frames per second doesn’t just look bad, it feels bad. Your high-end PC runs the game at about 1000 frames per second, but it sounds like a jet engine and actually levitates slightly when the fans really get going. What you need is a way to optimize your masterpiece to run on a mobile chipset. This series of articles is designed to help you do just that.
Part 1: The Gear VR Environment and Traits of Efficient VR Games
This isn’t an all-encompassing exposé on performance optimization for the Gear VR–it’s more of a quick start. In this first post we’ll discuss the Gear VR hardware and traits of a well-designed mobile VR application. A follow-up post will cover performance improvement for apps you’ve already built. I’ve elected to base this article on the behavior (and quirks) of Unity, as it seems to be very popular amongst Gear VR developers. Still, the concepts presented here should apply to just about any game engine.
Know Your Hardware
Before you tear your project apart looking for inefficiencies, it’s worth thinking a little about the performance characteristics of mobile phones. Generally speaking, mobile graphics pipelines rely on a pretty fast CPU that is connected to a pretty fast GPU by a pretty slow bus and/or memory controller, and an OpenGL ES driver with a lot of overhead. The Gear VR runs on the Samsung Note 4 and the Samsung Galaxy S6. These two product lines actually represent a number of different hardware configurations:
- The Note 4 comes in two chipset flavors. Devices sold in North America and Europe are based on Qualcomm’s Snapdragon chipset (specifically, a Snapdragon 805), while those sold in South Korea and some other parts of Asia include Samsung’s Exynos chipset (the Exynos 5433). The Snapdragon is a quad-core CPU configuration, while the Exynos has eight cores. These devices sport two different GPUs: the Adreno 420 and Mali-T760, respectively.
- The Note 4 devices are further segmented by operating system. Most run Android 4.4.4 (KitKat) but Android 5 (Lollipop) is now available as an update on most carriers around the world. The Exynos-based Note 4 devices all run Android 5.
- The Galaxy S6 devices are all based on the same chipset: the Exynos 7420 (with a Mali-T760M8 GPU). There is actually a second version of the S6, the Galaxy S6 Edge, but internally it is the same as the S6.
- All Galaxy S6 devices ship with Android 5.
If this seems like a lot to manage, don’t worry: though the hardware varies from device to device, the performance profile of all these devices is pretty similar (with one serious exception—see “Gotchas” below). If you can make it fast on one device, it should run well on all the others.
As with most mobile chipsets, these devices have pretty reliable characteristics when it comes to 3D graphics performance. Here are the things that generally slow Gear VR projects down (in order of severity):
- Scenes requiring dependent renders (e.g., shadows and reflections) (CPU / GPU cost).
- Binding VBOs to issue draw calls (CPU / driver cost).
- Transparency, multi-pass shaders, per-pixel lighting, and other effects that fill a lot of pixels (GPU / IO cost).
- Large texture loads, blits, and other forms of memcpy (IO / memory controller cost).
- Skinned animation (CPU cost).
- Unity garbage collection overhead (CPU cost).
On the other hand, these devices have relatively large amounts of RAM and can push quite a lot of polygons. Note that the Note 4 and S6 are both 2560×1440 displays, though by default we render to two 1024×1024 textures to save fill rate.
Know the VR Environment
VR rendering throws hardware performance characteristics into sharp relief because every frame must be drawn twice, once for each eye. In Unity 4.6.4p3 and 5.0.1p1 (the latest releases at the time of this writing), that means that every draw call is issued twice, every mesh is drawn twice, and every texture is bound twice. There is also a small amount of overhead involved in putting the final output frame together with distortion and TimeWarp (budget for 2 ms). It is reasonable to expect optimizations that will improve performance in the future, but as of right now we’re stuck with drawing the whole frame twice. That means that some of the most expensive parts of the graphics pipeline cost twice as much time in VR as they would in a flat game.
With that in mind, here are some reasonable targets for Gear VR applications.
- 50 – 100 draw calls per frame
- 50k – 100k polygons per frame
- As few textures as possible (but they can be large)
- 1 ~ 3 ms spent in script execution (Unity Update())
Bear in mind that these are not hard limits; treat them as rules of thumb.
Also note that the Oculus Mobile SDK introduces an API for throttling the CPU and GPU to control heat and battery drain (see OVRModeParams.cs for sample usage). These methods allow you to choose whether the CPU or GPU is more important for your particular scene. For example, if you are bound on draw call submission, clocking the CPU up (and the GPU down) might improve overall frame rate. If you neglect to set these values, your application will be throttled down significantly, so take time to experiment with them.
Finally, Gear VR comes with Oculus’s Asynchronous TimeWarp technology. TimeWarp provides intermediate frames based on very recent head pose information when your game starts to slow down. It works by distorting the previous frame to match the more recent head pose, and while it will help you smooth out a few dropped frames now and then, it’s not an excuse to run at less than 60 frames per second all the time. If you see black flickering bars at the edges of your vision when you shake your head, that indicates that your game is running slowly enough that TimeWarp doesn’t have a recent enough frame to fill in the blanks.
Designing for Performance
The best way to produce a high-performance application is to design for it up-front. For Gear VR applications, that usually means designing your art assets around the characteristics of mobile GPUs.
Before you start, make sure that your Unity project settings are organized for maximum performance. Specifically, ensure that the following values are set:
- Static batching
- Dynamic batching
- GPU skinning
- Multithreaded Rendering
- Default Orientation to Landscape Left
Since we know that draw calls are usually the most expensive part of a Gear VR application, a fantastic first step is to design your art to require as few draw calls as possible. A draw call is a command to the GPU to draw a mesh or a part of a mesh. The expensive part of this operation is actually the selection of the mesh itself. Every time the game decides to draw a new mesh, that mesh must be processed by the driver before it can be submitted to the GPU. The shader must be bound, format conversions might take place, et cetera; the driver has CPU work to do every time a new mesh is selected. It is this selection process that incurs the most overhead when issuing a draw call.
However, that also means that once a mesh (or, more specifically, a vertex buffer object, or VBO) is selected, we can pay the selection cost once and draw it multiple times. As long as no new mesh (or shader, or texture) is selected, the state will be cached in the driver and subsequent draw calls will issue much more quickly. To leverage this behavior to improve performance, we can actually wrap multiple meshes up into a single large array of verts and draw them individually out of the same vertex buffer object. We pay the selection cost for the whole mesh once, then issue as many draw calls as we can from meshes contained within that object. This trick, called batching, is much faster than creating a unique VBO for each mesh, and is the basis for almost all of our draw call optimization.
All of the meshes contained within a single VBO must have the same material settings for batching to work properly: the same texture, the same shader, and the same shader parameters. To leverage batching in Unity, you actually need to go a step further: objects will only be batched properly if they have the same material object pointer. To that end, here are some rules of thumb:
- Macrotexture / Texture Atlases: Use as few textures as possible by mapping as many of your models as possible to a small number of large textures.
- Static Flag: Mark all objects that will never move as Static in the Unity Inspector.
- Material Access: Be careful when accessing Renderer.material. This will duplicate the material and give you back the copy, which will opt that object out of batching consideration (as its material pointer is now unique). Use Renderer.sharedMaterial.
- Ensure batching is turned on: Make sure Static Batching and Dynamic Batching are both enabled in Player Settings (see below).
Unity provides two different methods to batch meshes together: static batching and dynamic batching.
When you mark a mesh as static, you are telling Unity that this object will never move, animate, or scale. Unity uses that information to automatically batch together meshes that share materials into a single, large mesh at build time. In some cases, this can be a significant optimization; in addition to grouping meshes together to reduce draw calls, Unity also burns transformations into the vertex positions of each mesh, so that they do not need to be transformed at runtime. The more parts of your scene that you can mark as static, the better. Just remember that this process requires meshes to have the same material in order to be batched.
Note that since static batching generates new conglomerate meshes at build time, it may increase the final size of your application binary. This usually isn’t a problem for Gear VR developers, but if your game has a lot of individual scenes, and each scene has a lot of static mesh, the cost can add up. Another option is to use StaticBatchingUtility.Combine at runtime to generate the batched mesh without bloating the size of your application (at the cost of a one-time significant CPU hit and some memory).
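A sketch of the runtime option (assuming the generated level pieces are parented under one root object and share materials):

```csharp
using UnityEngine;

public class LevelBatcher : MonoBehaviour
{
    public GameObject levelRoot;   // parent of the procedurally placed pieces

    void Start()
    {
        // One-time CPU and memory cost; the children must not move afterwards.
        StaticBatchingUtility.Combine(levelRoot);
    }
}
```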
Finally, be careful to ensure that the version of Unity you are using supports static batching (see “Gotchas” below).
Unity can also batch meshes that are not marked as static as long as they conform to the shared material requirement. If you have the Dynamic Batching option turned on, this process is mostly automatic. There is some overhead to compute the meshes to be batched every frame, but it almost always yields a significant net win in terms of performance.
Other batching Issues
Note that there are a few other ways you can break batching. Drawing shadows and other multi-pass shaders requires a state switch and prevents objects from batching correctly. Multi-pass shaders can also cause the mesh to be submitted multiple times, and should be treated with caution on the Gear VR. Per-pixel lighting can have the same effect: using the default Diffuse shader in Unity 4, the mesh will be resubmitted for each light that touches it. This can quickly blow out your draw call and poly count limits. If you need per-pixel lighting, try setting the total number of simultaneous lights in the Quality Settings window to one. The closest light will be rendered per-pixel, and surrounding lights will be calculated using spherical harmonics. Even better, drop all pixel lights and rely on Light Probes. Also note that batching usually doesn’t work on skinned meshes. Transparent objects must be drawn in a certain order and therefore rarely batch well.
The good news is that you can actually test and tune batching in the editor. Both the Unity Profiler (Unity Pro only) and the Stats pane on the Game window can show you how many draw calls are being issued and how many are being saved by batching. If you organize your geometry around a very small number of textures, make sure that you do not instance your materials, and mark static objects with the Static Flag, you should be well on your way to a very efficient scene.
Transparency, Alpha Test, and Overdraw
As discussed above, mobile chipsets are often “fill-bound,” meaning that the cost of filling pixels can be the most expensive part of the frame. The key to reducing fill cost is to try to draw every pixel on the screen only once. Multi-pass shaders, per-pixel lighting effects (such as Unity’s default specular shader), and transparent objects all require multiple passes over the pixels that they touch. If you touch too many of these pixels, you can saturate the bus.
As a best practice, try to limit the Pixel Light Count in Quality Settings to one. If you use more than one per-pixel light, make sure you know which geometry it is being applied to and the cost of drawing that geometry multiple times. Similarly, strive to keep transparent objects small. The cost here is touching pixels, so the fewer pixels you touch, the faster your frame can complete. Watch out for transparent particle effects like smoke that may touch many more pixels than you expect with mostly-transparent quads.
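The Pixel Light Count limit can also be applied from script, which is handy for per-device quality tiers (a sketch):

```csharp
using UnityEngine;

public class LightBudget : MonoBehaviour
{
    void Awake()
    {
        // Matches the "Pixel Light Count" field in Quality Settings:
        // only the closest light is rendered per-pixel.
        QualitySettings.pixelLightCount = 1;
    }
}
```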
Also note that you should never use alpha test shaders, such as Unity’s cutout shader, on a mobile device. The alpha test operation (as well as clip(), or an explicit discard in the fragment shader) forces some common mobile GPUs to opt out of certain hardware fill optimizations, making it extremely slow. Discarding fragments mid-pipe also tends to cause a lot of ugly aliasing, so stick to opaque geometry or alpha-to-coverage for cutouts.
Before you can test the performance of your scene reliably, you need to ensure that your CPU and GPU throttling settings are set. Because VR games push mobile phones to their limit, you are required to select a weighting between the CPU and GPU. If your game is CPU-bound, you can downclock the GPU in order to run the CPU at full speed. If your app is GPU-bound you can do the reverse. And if you have a highly efficient app, you can downclock both and save your users a bunch of battery life to encourage longer play sessions. See “Power Management” in the Mobile SDK documentation for more information about CPU and GPU throttling.
The important point here is that you must select a CPU and GPU throttle setting before you do any sort of performance testing. If you fail to initialize these values, your app will run in a significantly downclocked environment by default. Since most Gear VR applications tend to be bound on CPU-side driver overhead (like draw call submission), it is common to set the clock settings to favor the CPU over the GPU. An example of how to initialize throttling targets can be found in OVRModeParams.cs, which you can copy and paste into a script that executes on game startup.
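The canonical sample is OVRModeParams.cs, as noted above; in later versions of the Oculus Utilities the same knobs are exposed roughly like this (the level values are illustrative, and the exact API varies by SDK version):

```csharp
using UnityEngine;

public class ClockLevels : MonoBehaviour
{
    void Start()
    {
        // Favor the CPU for a draw-call-bound scene; levels are small
        // integers (typically 0-3), higher = faster clocks and more heat.
        OVRPlugin.cpuLevel = 3;
        OVRPlugin.gpuLevel = 1;
    }
}
```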
Gotchas
Here are some tricky things you should keep in the back of your mind while considering your performance profile.
- One particular device profile, specifically the Snapdragon-based Note 4 running Android 5, is slower than everything else; the graphics driver seems to contain a regression related to draw call submission. Games that are already draw call bound may find that this new overhead (which can be as much as a 20% increase in draw call time) is significant enough to cause regular pipeline stalls and drop the overall frame rate. We’re working hard with Samsung and Qualcomm to resolve this regression. Snapdragon-based Note 4 devices running Android 4.4, as well as Exynos-based Note 4 and S6 devices, are unaffected.
- Though throttling the CPU and GPU dramatically reduces the amount of heat generated by the phone, it is still possible for heavy-weight applications to cause the device to overheat during long play sessions. When this happens, the phone warns the user, then dynamically lowers the clock rate of its processors, which usually makes VR applications unusable. If you are working on performance testing and manage to overheat your device, let it sit without the game running for a good five minutes before testing again.
- Unity 4 Free does not support static batching or the Unity Profiler. However, Unity 5 Personal Edition does.
- The S6 does not support anisotropic texture filtering.
That’s all for now. In the next post, we’ll discuss how to go about debugging real-world performance problems.
For more information on optimizing your Unity mobile VR development, see “Best Practices: Mobile” in our Unity Integration guide.
Part 2: Solving Performance Problems
In my last post I discussed ways to build efficient Gear VR games and the traits of Gear VR devices. In this installment I’ll focus on ways to debug Unity applications that are not sufficiently performant on those devices.
Even if you’ve designed your scene well and set reasonable throttling values, you may find that your game does not run at a solid 60 frames per second on the Gear VR device. The next step is to decipher what’s going on using three tools: Unity’s internal profiler log, Unity’s Profiler, and the Oculus Remote Monitor.
The very first thing you should do when debugging Unity performance is to turn on the Enable Internal Profiler option in Player Settings. This will spit a number of important frame rate statistics to the logcat console every few seconds, and should give you a really good idea of where your frame time is going.
To illustrate the common steps to debugging performance, let’s look at some sample data from a fairly complicated scene in a real game running on a Note 4 Gear VR device:
Android Unity internal profiler stats:
cpu-player> min: 8.8 max: 44.3 avg: 16.3
cpu-ogles-drv> min: 5.1 max: 6.0 avg: 5.6
cpu-present> min: 0.0 max: 0.3 avg: 0.1
frametime> min: 14.6 max: 49.8 avg: 22.0
draw-call #> min: 171 max: 177 avg: 174 | batched: 12
tris #> min: 153294 max: 153386 avg: 153326 | batched: 2362
verts #> min: 203346 max: 203530 avg: 203411 | batched: 3096
player-detail> physx: 0.1 animation: 0.1 culling 0.0 skinning: 0.0
batching: 0.4 render: 11.6 fixed-update-count: 1 .. 1
mono-scripts> update: 0.9 fixedUpdate: 0.0 coroutines: 0.0
mono-memory> used heap: 3043328 allocated heap: 3796992
max number of collections: 0 collection total duration: 0.0
To capture this sample, you’ll need to connect to your device over Wi-Fi using ADB over TCPIP and run adb logcat while the device is docked in the headset (for more information, see “Android Debugging” in the Mobile SDK documentation).
What the sample above tells us is that our average frame time is 22 ms, which is about 45 frames per second—way below our target of 60 frames per second. We can also see that this scene is heavy on the CPU—16.3 ms of our 22 ms total is spent on the CPU. We’re spending 5 ms in the driver (“cpu-ogles-drv”), which suggests that we’re sending the driver down some internal slow path. The probable culprit is pretty clear: at 174 draw calls per frame, we’re significantly over our target of 100. In addition, we’re pushing more polygons than we would like. This view of the scene doesn’t really tell us what’s happening on the GPU, but it tells us that we can explain our 45 fps frame rate just by looking at the CPU, and that reducing draw calls should be the focus of our attention.
This data also shows that we have regular spikes in the frametime (represented by max frametime of 49.8 ms). To understand where those are coming from, the next step is to connect the Unity Profiler and look at its output.
As expected, the graph shows a regular spike. During non-spike periods, our render time is similar to the values reported above, and there is no other significant contributor to our final frametime.
WaitForPresent
WaitForPresent (and its cousin, Overhead) appears in the profiler as a repeated cost that comes along and destroys our frame. In fact, it does not represent mysterious work being performed. Rather, WaitForPresent records the amount of time that the render pipeline has stalled.
One way to think of the render pipeline is to imagine a train station. Trains leave at reliable times—every 16.6 ms. Let’s say the train only holds one person at a time, and there is a queue of folks waiting to catch a ride. As long as each person in the queue can make it to the platform and get on the next train before it leaves, you’ll be able to ship somebody out to their destination every 16 ms. But if even one guy moves too slowly—maybe he trips on his shoelaces—he’ll not only miss his train, he’ll be sitting there waiting on the platform for another 16 ms for the next train to come. Even though he might have only been 1 ms late for his train, missing it means that he has to sit there and wait a long time.
In a graphics pipeline, the train is the point at which the front buffer (currently displayed) and back buffer (containing the next frame to display) swap. This usually happens when the previous frame finishes its scanout to the display. Assuming that the GPU can execute all of the render commands buffered for that frame in a reasonable amount of time, there should be a buffer swap every 16 ms. To maintain a 60 frames-per-second refresh rate, the game must finish all of its work for the next frame within 16 ms. When the CPU takes too long to complete a frame, even if it’s only late by 1 ms, the swap period is missed, the scanout of the next frame begins using the previous frame’s data, and the CPU has to sit there and wait for the next swap to roll around. To use the parlance of our example above, the train is the swap and the frame commands issued by the CPU are the passengers.
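The train-station behavior can be sketched numerically. This is an illustrative model, not Unity code: swaps occur at fixed vsync intervals (assumed here to be 16.6 ms), and a frame that misses one waits for the next.

```python
import math

VSYNC_MS = 16.6  # one vsync interval at (approximately) 60 Hz

def present_time(finish_ms, vsync_ms=VSYNC_MS):
    """Return the time of the first swap at or after the frame finishes."""
    return math.ceil(finish_ms / vsync_ms) * vsync_ms

# A frame finished in 16.0 ms catches the first swap...
print(present_time(16.0))  # -> 16.6
# ...but a frame just 1.1 ms slower misses that swap and ships a full
# vsync interval later; the CPU idles for the difference.
print(present_time(17.1))  # -> 33.2
```

This is why a frame that is only 1 ms late costs a full 16.6 ms of WaitForPresent: the swap schedule is quantized, so lateness is always rounded up to the next vsync.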
WaitForPresent indicates that this sort of pipeline stall has occurred and records the amount of time the CPU idles while waiting on the GPU. Though less common, this can also happen if the CPU finishes its work very quickly and has to wait for the next swap.
In this particular example, it’s pretty clear that our frame rate is inconsistent enough that we cannot schedule our pipeline reliably. The way to fix WaitForPresent is thus to ignore it in the profiler and concentrate on optimizing everything else, which in this case means reducing the number of draw calls we have in the scene.
Other Profiler Information
The Unity profiler is very useful for digging into all sorts of other runtime information, including memory usage, draw calls, texture allocations, and audio overhead. For serious performance debugging, it’s a good idea to turn off Multithreaded Rendering in Player Settings. This will slow the renderer down considerably, but it will also give you a clearer view of where your frame time is going. When you’re done with optimizations, remember to turn Multithreaded Rendering back on.
In addition to draw call batching, other common areas of performance overhead include overdraw (often caused by large transparent objects or poor occlusion culling), skinning and animation, physics overhead, and garbage collection (usually caused by memory leaks or other repeated allocations). Watch for these as you dig into the performance of your scene. Also remember that displaying the final VR output, which includes warping and TimeWarp overhead, costs about 2 ms every frame.
Oculus Remote Monitor
OVRMonitor is a tool recently released with the Oculus Mobile SDK. It helps developers understand how pipelining works and identify pipeline stalls. It can also wirelessly stream low-resolution, unwarped video from a Gear VR device, which is useful for usability testing.
OVRMonitor is currently in development, but this early version can still be used to visualize the graphics pipeline for Gear VR applications. Here’s a shot of the tool inspecting a game running the same scene discussed above:
The yellow bar represents the vertical sync interrupt that indicates that scanout for a frame has completed. The image at the top of the window is a capture of the rendered frame, and the left side of the image is aligned to the point in the frame where drawing began. The red bar in the middle of the image shows the TimeWarp thread, and you can see it running in parallel with the actual game. The blue area at the bottom indicates the load on the CPU and GPU, which is constant (in this case, all four CPU cores are running).
This shot actually shows one of the WaitForPresent spikes we saw above in the Unity Profiler. The frame in the middle of the timeline began too late to complete by the next vertical blank, and as a result the CPU blocked for a full frame (evidenced by the missing screenshot in the subsequent frame and the 16.25 ms WarpSwapInternal thread time).
OVRMonitor is a good way to get a sense of what is happening in your graphics pipeline from frame to frame. It can be used with any Gear VR app built against the latest SDK. See the documentation in SdkRoot/Tools/OVRMonitor for details. More documentation and features are coming soon.
Tips and Tricks
Here are a few performance tricks we’ve used or heard about from other developers. These are not guaranteed solutions for all VR applications, but they might give you some ideas about potential solutions for your particular scene.
- Draw big, opaque meshes last. Try sorting your skybox into the Geometry+1 render queue so that it draws after all other opaque geometry. Depending on your scene, this allows many of the pixels covered by the skybox to be rejected by the depth test before they are shaded, saving fill time. Ground planes and other static, opaque objects that touch a lot of pixels and are likely to be mostly occluded by other objects are also candidates for this optimization.
- Dynamically change your CPU / GPU throttling settings. You can change your throttling settings at any time. If you are able to run most of your game at a low setting but have one or two particularly challenging scenes, consider cranking the CPU or the GPU up just during those scenes. You can also drop the speed of one or both processors in order to save battery life and reduce heat during scenes that are known to be simple. For example, why not set GPU to 0 during a scene load?
- Update render targets infrequently. If you have secondary render textures that you are drawing the scene to, try updating them at a lower frequency than the main scene. For example, a stereoscopically rendered mirror might refresh its reflection for only one eye each frame. This effectively lowers the frame rate of the mirror to 30 frames per second, but because one eye or the other receives new data every frame, it looks acceptable for subtle movements.
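One way to sketch that half-rate schedule (illustrative Python, not Unity code; the helper name is hypothetical): use frame parity to decide which eye’s reflection is re-rendered, so each eye is refreshed every other frame.

```python
def mirror_eye_to_refresh(frame_index):
    """Pick which eye's mirror reflection to re-render this frame,
    alternating by frame parity so each eye updates at half rate."""
    return "left" if frame_index % 2 == 0 else "right"

# Over four frames, each eye receives fresh mirror data every other frame.
print([mirror_eye_to_refresh(f) for f in range(4)])
# -> ['left', 'right', 'left', 'right']
```

The same parity trick generalizes to any expensive secondary render target: split the work across frames so that no single frame pays the full cost.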
- Lower the target eye resolution. By trading a little visual quality, you can often significantly improve the performance of a fill-bound game by slightly lowering the size of the render target for each eye. OVRManager.virtualTextureScale is a value between 0.0 and 1.0 that controls the size of the render output. Dropping resolution slightly when running on older devices is often an easy way to support slower hardware.
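The savings here are quadratic, because fill cost tracks pixel count and the scale applies to both axes of the render target. A quick sketch of the arithmetic (the helper name is just for illustration):

```python
def relative_fill_cost(scale):
    """Fraction of pixels shaded relative to full resolution: pixel count
    scales with the square of the per-axis render-target scale."""
    return scale * scale

# Dropping the scale from 1.0 to 0.8 shades only about 64% of the pixels,
# roughly a one-third fill saving for a modest loss in sharpness.
print(round(relative_fill_cost(0.8), 2))  # -> 0.64
```

This is why even a small reduction in eye-buffer resolution can rescue a fill-bound scene: a 10% drop per axis already removes about 19% of the shaded pixels.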
- Compress your textures. All Gear VR devices support the ASTC compression format, which can significantly reduce the size of your textures. Note that as of this writing, Unity 4.6 expects ASTC compression to go hand-in-hand with OpenGL ES 3.0. If you use ASTC under GLES 2.0, your textures will be decompressed on load, which will probably lengthen your application’s start-up time significantly. ETC1 is a lower-quality but universally supported alternative for GLES 2.0 users.
- Use texture atlases. As described above, large textures that can be shared across a number of meshes will batch efficiently. Avoid lots of small individual textures.
For more information on optimizing your Unity mobile VR development, see “Best Practices: Mobile” in our Unity Integration guide.