10. Manual Diagnostics: pt.2 Tooling-level

In the previous episode, I showed how to debug shaders by outputting values as colors. But there are still many cases where that is not enough. GPU frame capture tools can help a lot, but in practice I have also run into situations where they are not enough: cases where you need more complex GPU–CPU interaction at runtime, or where the issue has to be caught earlier in the pipeline because diagnosing it purely on the GPU side would be much harder. In this episode, we go beyond shaders and color outputs, and look at ways to inspect and process GPU data on the CPU side.

Separate shader editor

If you need to prototype a custom shader effect or procedural graphics, building a convenient sandbox can take a lot of time and setup, even with AI assistance. For this kind of work, I recommend using an external shader editor. There are many options, but I use KodeLife. I already mentioned it in the previous episode. It is simple enough to get started quickly, and in my experience their support is fast and helpful, so do not hesitate to contact them if you run into issues.

Texture viewing

If you need to inspect the contents of a texture during a normal debug session, you can select the texture in Xcode’s Variable View and press Space, or click the eye icon.

The problem is that Metal scheduling is asynchronous. At that point, the texture may not have been processed yet, because the command buffer has not been committed and completed. As a result, you may only see the initial state of the texture. Later, when command buffer execution has finished, you may no longer have convenient access to that texture from the debugger.

For stable access, I usually copy the texture I need and inspect the copy from the command buffer’s completion handler. It is easier to do this with small prepared helpers:

import Metal

@discardableResult
func makeSharedCopy(
    // (1)
    of source: MTLTexture,
    // (2)
    commandBuffer: MTLCommandBuffer,
    // (3)
    completion: @escaping (MTLTexture) -> Void
) -> MTLTexture? {
    // (4)
    guard source.sampleCount == 1 else {
        return nil
    }

    // (5)
    let descriptor = MTLTextureDescriptor()
    descriptor.textureType = source.textureType
    descriptor.pixelFormat = source.pixelFormat
    descriptor.width = source.width
    descriptor.height = source.height
    descriptor.depth = source.depth
    descriptor.mipmapLevelCount = source.mipmapLevelCount
    descriptor.sampleCount = source.sampleCount
    descriptor.arrayLength = source.arrayLength
    descriptor.usage = source.usage
    // Shared storage makes the copy readable from the CPU.
    descriptor.storageMode = .shared
    descriptor.cpuCacheMode = .defaultCache

    guard let destination = commandBuffer.device.makeTexture(descriptor: descriptor) else {
        return nil
    }
    guard let blitEncoder = commandBuffer.makeBlitCommandEncoder()
    else {
        return nil
    }

    // (6)
    for mipLevel in 0..<source.mipmapLevelCount {
        let width = max(1, source.width >> mipLevel)
        let height = max(1, source.height >> mipLevel)

        if source.textureType == .type3D {
            // (7)
            let depth = max(1, source.depth >> mipLevel)
            blitEncoder.copy(
                from: source,
                sourceSlice: 0,
                sourceLevel: mipLevel,
                sourceOrigin: MTLOrigin(x: 0, y: 0, z: 0),
                sourceSize: MTLSize(width: width, height: height, depth: depth),
                to: destination,
                destinationSlice: 0,
                destinationLevel: mipLevel,
                destinationOrigin: MTLOrigin(x: 0, y: 0, z: 0)
            )
        } else {
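            // (8)
            // Cube textures store 6 faces per array element, and each face is its own slice.
            let isCube = source.textureType == .typeCube || source.textureType == .typeCubeArray
            let sliceCount = source.arrayLength * (isCube ? 6 : 1)
            for slice in 0..<sliceCount {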
                blitEncoder.copy(
                    from: source,
                    sourceSlice: slice,
                    sourceLevel: mipLevel,
                    sourceOrigin: MTLOrigin(x: 0, y: 0, z: 0),
                    sourceSize: MTLSize(width: width, height: height, depth: 1),
                    to: destination,
                    destinationSlice: slice,
                    destinationLevel: mipLevel,
                    destinationOrigin: MTLOrigin(x: 0, y: 0, z: 0)
                )
            }
        }
    }

    // (9)
    blitEncoder.endEncoding()
    commandBuffer.addCompletedHandler { _ in
        // (10)
        completion(destination)
    }

    return destination
}

(1) The original texture you want to copy.
(2) The command buffer used to encode the GPU copy operation. Ideally, use the same command buffer as your processing pass, so the copy happens after the texture has been produced or modified.
(3) A callback used to handle the shared copy once the GPU work is finished.
(4) This function only supports regular single-sample textures. Multisample textures contain multiple samples per pixel, and a blit copy does not resolve them into a normal image. For MSAA textures, resolve them into a single-sample texture first.
(5) Here we copy the source texture's descriptor attributes. If you do not need an exact structural copy, you may rely on default descriptor values. In many simple cases, dimensions, pixel format, usage, and storage mode are enough.
(6) If the texture has multiple mip levels, you may need to copy all of them.
(7) For 3D textures, copy each mip level as an entire 3D volume.
(8) For texture arrays, copy every slice of every mip level.
(9) End the blit encoder after encoding all copy commands.
(10) Add a completion handler to the command buffer, because the destination texture contains valid copied data only after the GPU has finished executing the command buffer.

NOTE: addCompletedHandler does not overwrite existing completion handlers. You can register multiple completion handlers on the same command buffer, but you must add them before calling commit().

Usage of the helper could look like this. Set a breakpoint at print, then use Quick Look on texture:

// ...
makeSharedCopy(of: texture, commandBuffer: commandBuffer) { texture in
    print("Texture: \(texture.width)x\(texture.height)")
}
// ...

You can also read the texture contents into a vImage_Buffer, create a CGImage, or convert the data into any other representation that is convenient for debugging.

Intermediate render targets

If your graphics or image-processing pipeline has multiple stages, you often need to know exactly where a problem starts. The final output may be wrong, but that does not tell you whether the issue comes from the input data, an early processing stage, a later render pass, or post-processing.

A practical solution is to expose intermediate render targets. Instead of replacing the normal pipeline output, you can write additional debug textures from selected stages and inspect them later in a completion handler, a debug UI, or a texture viewer.

// (1)
private func captureStagePreview(step: ProcessorStep, texture: MTLTexture, commandBuffer: MTLCommandBuffer) {
    // (2)
    makeSharedCopy(of: texture, commandBuffer: commandBuffer) { [weak self] sharedTexture in
        guard let self,
              // (3)
              let buffer = downloadSharedTextureToVImageBuffer(sharedTexture) else {
            return
        }
        defer {
            free(buffer.data)
        }
        // (4)
        guard let image = makeOwnedCGImageFromRGBA8Buffer(buffer) else {
            return
        }
        // (5)
        DispatchQueue.main.async { [weak self] in
            guard let self else { return }
            self.stagePreviewImages[step.rawValue] = image
            self.publishStagePreviews()
        }
    }
}

(1) Another helper in my processing class (shown for illustration only) that captures a texture into a UI-ready image.
(2) Copy the texture with makeSharedCopy (its implementation is shown above).
(3) Download (or map) the shared texture's contents into a vImage_Buffer.
(4) Create an image that owns a copy of the buffer contents, because we do not know how long the texture will live.
(5) Set the UI image value. All UI updates must happen on the main thread.

The main advantage over GPU frame capture is that you can see changes in your debug output interactively while the application is running.

For a compute encoder, this is usually simple: bind an additional writable texture and write the values you want to inspect. For a render encoder, you need to add another color attachment and configure the fragment shader to write to multiple outputs.

The important part is that the normal pipeline should keep working as before. Debug outputs should observe the pipeline, not change its behavior. And because this is only diagnostic scaffolding, it should be disabled in release builds with preprocessor directives or another debug-only mechanism.

Also keep in mind that extra debug passes, copies, and readbacks can perturb frame timing, so profile with diagnostics disabled before drawing performance conclusions.

Difference views

This is not a special rendering or compute trick, but a general image-processing technique. I will describe it without code examples, because it can be implemented in many ways and at different levels of the pipeline.

The main idea is to visualize the difference between two images. This can help you detect whether one image is shifted, blurred, has different colors, contains missing details, or differs in any other way that matters for your task.

For example, you can compare:

output before and after a shader change
GPU output against a CPU reference
current frame against history
input and output of a processing stage
two versions of the same render target with different parameters

A simple absolute difference view is often enough to reveal that something changed. For more specific cases, you can use a signed difference, amplify the difference with a scale factor, compare only one channel, or show the difference as a heatmap.

This is especially useful when the images look almost identical at first glance, but small errors accumulate or appear only in specific regions.
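
Although the technique does not depend on any particular implementation, a CPU-side version is easy to sketch. The following assumes both images have already been downloaded as tightly packed RGBA8 bytes of the same size. The names absDiffRGBA8 and maxDiff, and the gain parameter, are hypothetical and exist only for this illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical helper: per-channel absolute difference of two RGBA8 images,
// multiplied by `gain` and clamped to 255 so that small errors become visible.
// Alpha is forced to 255 so the result is always opaque in a viewer.
std::vector<uint8_t> absDiffRGBA8(const std::vector<uint8_t>& a,
                                  const std::vector<uint8_t>& b,
                                  int gain)
{
    assert(a.size() == b.size() && a.size() % 4 == 0);
    std::vector<uint8_t> out(a.size());
    for (size_t i = 0; i < a.size(); i += 4) {
        for (size_t c = 0; c < 3; ++c) {
            int d = std::abs(int(a[i + c]) - int(b[i + c])) * gain;
            out[i + c] = uint8_t(std::min(d, 255));
        }
        out[i + 3] = 255;
    }
    return out;
}

// Largest amplified difference over the color channels; 0 means the images match.
int maxDiff(const std::vector<uint8_t>& diff)
{
    int result = 0;
    for (size_t i = 0; i < diff.size(); ++i) {
        if (i % 4 != 3) {
            result = std::max(result, int(diff[i]));
        }
    }
    return result;
}
```

The difference image can be shown in the same preview UI as any other debug texture, and a single number such as maxDiff is handy as a quick regression check. The same arithmetic could also run in a small compute kernel if downloading both images is too slow for your case.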

Mip injection

If a texture has mip levels, you can manually fill those levels with diagnostic data. This can be done from the CPU side if the texture is accessible, or with a small compute kernel if you want to generate the contents on the GPU.

This is useful for checking how LOD selection works. For example, you can fill each mip level with a different color or pattern, then look at the rendered result to see which levels are actually being sampled.

This can help diagnose issues such as unexpected blur, shimmering, wrong texture gradients, missing mipmaps, or incorrect sampler configuration.

Keep in mind that the result also depends on the sampler settings. Depending on your task, you may need to test different mip_filter values, such as nearest or linear mip filtering.
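
As a rough sketch of the CPU-side variant, the code below generates one solid-colored RGBA8 buffer per mip level. The type MipLevelData, the function makeDebugMipChain, and the four-color palette are assumptions made up for this example. Uploading the bytes into the texture is not shown:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical container for the diagnostic contents of one mip level.
struct MipLevelData {
    uint32_t width;
    uint32_t height;
    std::vector<uint8_t> rgba; // width * height * 4 bytes, tightly packed
};

// Fills every mip level with its own solid color:
// level 0 red, 1 green, 2 blue, 3 yellow, then the palette repeats.
std::vector<MipLevelData> makeDebugMipChain(uint32_t width,
                                            uint32_t height,
                                            uint32_t levelCount)
{
    assert(width > 0 && height > 0 && levelCount < 32);
    static const uint8_t palette[4][4] = {
        {255, 0, 0, 255}, {0, 255, 0, 255}, {0, 0, 255, 255}, {255, 255, 0, 255}
    };
    std::vector<MipLevelData> levels;
    for (uint32_t level = 0; level < levelCount; ++level) {
        MipLevelData data;
        // Each level halves the previous one, but never drops below 1 pixel.
        data.width = std::max<uint32_t>(1, width >> level);
        data.height = std::max<uint32_t>(1, height >> level);
        data.rgba.resize(size_t(data.width) * data.height * 4);
        const uint8_t* color = palette[level % 4];
        for (size_t i = 0; i < data.rgba.size(); ++i) {
            data.rgba[i] = color[i % 4];
        }
        levels.push_back(std::move(data));
    }
    return levels;
}
```

For a CPU-accessible RGBA8 texture, each buffer could then be written into its level with replace(region:mipmapLevel:withBytes:bytesPerRow:), using width * 4 as the row stride. With nearest mip filtering the rendered colors show exactly which level was chosen; with linear mip filtering you will see blends between neighboring level colors.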

Atomic statistics

Sometimes you need more than a visual debug output. For example, you may want to count how many pixels, elements, or threads match a specific condition. In that case, you can use atomic types and atomic operations on the GPU.

#include <metal_stdlib>
using namespace metal;

kernel void krnCountAlive(texture2d<float, access::read> in [[texture(0)]],
                          device atomic_uint *counter [[buffer(0)]],
                          uint2 gid [[thread_position_in_grid]])
{
    // Skip threads that fall outside the texture when the grid is padded.
    if (gid.x >= in.get_width() || gid.y >= in.get_height()) {
        return;
    }
    if (in.read(gid).r == 1.0) {
        atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
    }
}

This example counts pixels where the red channel is equal to 1.0. In this case, it is implemented as a separate compute pass, but you can also integrate the same idea into the pass where the relevant data is produced or processed.

The same approach can be used from fragment and vertex shaders too, as long as the shader has access to a writable buffer in device address space.

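For diagnostic counters, shared buffers are often convenient because you can read them from the CPU side without an additional copy step. Reset the counter to zero before each dispatch, and read it only after the command buffer has completed.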

CPU tests

Metal Shading Language is based on C++, so you can extract most of the math and logic into a separate .h file and include it from your .metal file.

The goal is to reuse the same code on the CPU side and cover it with tests. This is especially useful for procedural effects, coordinate transformations, sampling logic, SDFs, noise functions, color conversions, and other math-heavy shader code.

The main problem is that not every Metal type or function maps cleanly to CPU-side libraries such as simd.h. Because of that, you may need a small compatibility wrapper that exposes the same names on both the Metal side and the CPU side:

#ifndef MATH_COMPAT_H
#define MATH_COMPAT_H

#if defined(__METAL_VERSION__)

#include <metal_stdlib>
using metal::float2;
using metal::float3;
inline float2 xy(float3 value) { return value.xy; }
inline float3 xyx(float2 value) { return value.xyx; }
inline float2 zy(float3 value) { return value.zy; }

#else

#include <simd/simd.h>
#include <cmath>
using float2 = simd_float2;
using float3 = simd_float3;
inline float2 xy(float3 value) { return simd_make_float2(value.x, value.y); }
inline float3 xyx(float2 value) { return simd_make_float3(value.x, value.y, value.x); }
inline float2 zy(float3 value) { return simd_make_float2(value.z, value.y); }
#endif

#endif

Then you can test your GPU-ready common code with XCTest from .mm files without additional wrappers.

This approach works best for pure math and deterministic logic. It is less useful for code that depends heavily on GPU-specific behavior: derivatives, texture sampling, threadgroup memory, synchronization, address spaces, interpolation, precision differences, or sampler state. Those parts still need to be tested on the GPU side.
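
To make this concrete, here is a minimal sketch of a shared function and a CPU-side check. It uses plain assert so that it builds anywhere; in an XCTest target the same checks would use XCTest assertions instead. The names smoothEdge, nearlyEqual, and testSmoothEdge are hypothetical, and the function deliberately avoids library calls so the same source could compile on both sides:

```cpp
#include <cassert>
#include <cmath>

// --- Would live in the shared header included by both .metal and .mm files ---

// Hermite smoothing between edge0 and edge1, written with plain arithmetic only.
inline float smoothEdge(float edge0, float edge1, float x)
{
    float t = (x - edge0) / (edge1 - edge0);
    t = t < 0.0f ? 0.0f : (t > 1.0f ? 1.0f : t);
    return t * t * (3.0f - 2.0f * t);
}

// --- Would live in the CPU test file ---

inline bool nearlyEqual(float a, float b, float eps = 1e-6f)
{
    return std::fabs(a - b) <= eps;
}

inline void testSmoothEdge()
{
    // Values outside the range are clamped to 0 and 1.
    assert(nearlyEqual(smoothEdge(0.0f, 1.0f, -1.0f), 0.0f));
    assert(nearlyEqual(smoothEdge(0.0f, 1.0f, 2.0f), 1.0f));
    // The midpoint of the range maps to 0.5.
    assert(nearlyEqual(smoothEdge(0.0f, 1.0f, 0.5f), 0.5f));
    // A quarter of the way through: 0.25^2 * (3 - 0.5) = 0.15625.
    assert(nearlyEqual(smoothEdge(0.0f, 2.0f, 0.5f), 0.15625f));
}
```

Comparing with a tolerance rather than exact equality is a deliberate choice here, because the GPU and the CPU may not round intermediate results the same way.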

Conclusion

Use existing tools if you can.
Build a toolset of helpers: texture capture, mip injection, and so on.
Reuse deterministic shader math on the CPU side and cover it with tests (TDD with shaders).
Keep diagnostics isolated from production code and disabled in release builds.
