In the previous episode, I showed how to debug shaders by outputting values as colors. But there are still many cases where that is not enough. GPU frame capture tools can help a lot, but in practice I have also run into situations where they are not enough: cases where you need more complex GPU–CPU interaction at runtime, or where the issue has to be caught earlier in the pipeline because diagnosing it purely on the GPU side would be much harder. In this episode, we go beyond shaders and color outputs, and look at ways to inspect and process GPU data on the CPU side.
If you need to prototype a custom shader effect or procedural graphics, building a convenient sandbox can take a lot of time and setup, even with AI assistance. For this kind of work, I recommend using an external shader editor. There are many options, but I use KodeLife. I already mentioned it in the previous episode. It is simple enough to get started quickly, and in my experience their support is fast and helpful, so do not hesitate to contact them if you run into issues.

If you need to inspect the contents of a texture during a normal debug session, you can select the texture in Xcode’s variables view and press Space, or click the eye icon, to open Quick Look.

The problem is that Metal scheduling is asynchronous. At that point, the texture may not have been processed yet, because the command buffer has not been committed and completed. As a result, you may only see the initial state of the texture. Later, when command buffer execution has finished, you may no longer have convenient access to that texture from the debugger.
For stable access, I usually copy the texture I need and inspect the copy from the command buffer’s completion handler. A small helper prepared in advance makes this easier:
import Metal

@discardableResult
func makeSharedCopy(
    // (1)
    of source: MTLTexture,
    // (2)
    commandBuffer: MTLCommandBuffer,
    // (3)
    completion: @escaping (MTLTexture) -> Void
) -> MTLTexture? {
    // (4)
    guard source.sampleCount == 1 else {
        return nil
    }
    // (5)
    let descriptor = MTLTextureDescriptor()
    descriptor.textureType = source.textureType
    descriptor.pixelFormat = source.pixelFormat
    descriptor.width = source.width
    descriptor.height = source.height
    descriptor.depth = source.depth
    descriptor.mipmapLevelCount = source.mipmapLevelCount
    descriptor.sampleCount = source.sampleCount
    descriptor.arrayLength = source.arrayLength
    descriptor.usage = source.usage
    descriptor.storageMode = .shared
    descriptor.cpuCacheMode = .defaultCache
    descriptor.resourceOptions = [.storageModeShared]
    guard let destination = commandBuffer.device.makeTexture(descriptor: descriptor) else {
        return nil
    }
    guard let blitEncoder = commandBuffer.makeBlitCommandEncoder()
    else {
        return nil
    }
    // (6)
    for mipLevel in 0..<source.mipmapLevelCount {
        let width = max(1, source.width >> mipLevel)
        let height = max(1, source.height >> mipLevel)
        if source.textureType == .type3D {
            // (7)
            let depth = max(1, source.depth >> mipLevel)
            blitEncoder.copy(
                from: source,
                sourceSlice: 0,
                sourceLevel: mipLevel,
                sourceOrigin: MTLOrigin(x: 0, y: 0, z: 0),
                sourceSize: MTLSize(width: width, height: height, depth: depth),
                to: destination,
                destinationSlice: 0,
                destinationLevel: mipLevel,
                destinationOrigin: MTLOrigin(x: 0, y: 0, z: 0)
            )
        } else {
            // (8)
            for slice in 0..<source.arrayLength {
                blitEncoder.copy(
                    from: source,
                    sourceSlice: slice,
                    sourceLevel: mipLevel,
                    sourceOrigin: MTLOrigin(x: 0, y: 0, z: 0),
                    sourceSize: MTLSize(width: width, height: height, depth: 1),
                    to: destination,
                    destinationSlice: slice,
                    destinationLevel: mipLevel,
                    destinationOrigin: MTLOrigin(x: 0, y: 0, z: 0)
                )
            }
        }
    }
    // (9)
    blitEncoder.endEncoding()
    commandBuffer.addCompletedHandler { _ in
        // (10)
        completion(destination)
    }
    return destination
}
NOTE:
addCompletedHandler does not overwrite existing completion handlers. You can register multiple completion handlers on the same command buffer, but you must add them before calling commit().
Usage of the helper could look like this. Set a breakpoint at print, then use Quick Look on texture:
// ...
makeSharedCopy(of: texture, commandBuffer: commandBuffer) { texture in
    print("Texture: \(texture.width)x\(texture.height)")
}
// ...

You can also read the texture contents into a vImage_Buffer, create a CGImage, or convert the data into any other representation that is convenient for debugging.
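As a minimal sketch of such a readback, the helper below copies mip level 0 of a shared texture into a plain byte array. The name readPixelBytes is hypothetical, and the sketch assumes a 2D texture with a 4-bytes-per-pixel format in shared storage whose GPU work has already completed (for example, inside the completion handler):

```swift
import Metal

/// Reads mip level 0 of a 2D texture into a tightly packed byte array.
/// Assumptions: 4 bytes per pixel (.rgba8Unorm or .bgra8Unorm), .shared storage,
/// and the GPU has already finished writing to the texture.
func readPixelBytes(from texture: MTLTexture) -> [UInt8]? {
    guard texture.textureType == .type2D,
          texture.storageMode == .shared else {
        return nil
    }
    switch texture.pixelFormat {
    case .rgba8Unorm, .bgra8Unorm:
        break
    default:
        return nil
    }

    let bytesPerRow = texture.width * 4
    var bytes = [UInt8](repeating: 0, count: bytesPerRow * texture.height)
    bytes.withUnsafeMutableBytes { pointer in
        // Width and height are always at least 1, so the base address is non-nil.
        texture.getBytes(
            pointer.baseAddress!,
            bytesPerRow: bytesPerRow,
            from: MTLRegionMake2D(0, 0, texture.width, texture.height),
            mipmapLevel: 0
        )
    }
    return bytes
}
```

From there the bytes can be wrapped in whatever representation suits the debugging session. Other pixel formats would need their own bytes-per-pixel handling.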
If your graphics or image-processing pipeline has multiple stages, you often need to know exactly where a problem starts. The final output may be wrong, but that does not tell you whether the issue comes from the input data, an early processing stage, a later render pass, or post-processing.
A practical solution is to expose intermediate render targets. Instead of replacing the normal pipeline output, you can write additional debug textures from selected stages and inspect them later in a completion handler, a debug UI, or a texture viewer.
// (1)
private func captureStagePreview(step: ProcessorStep, texture: MTLTexture, commandBuffer: MTLCommandBuffer) {
    // (2)
    makeSharedCopy(of: texture, commandBuffer: commandBuffer) { [weak self] sharedTexture in
        guard let self,
              // (3)
              var buffer = downloadSharedTextureToVImageBuffer(sharedTexture) else {
            return
        }
        defer {
            free(buffer.data)
        }
        // (4)
        guard let image = makeOwnedCGImageFromRGBA8Buffer(buffer) else {
            return
        }
        // (5)
        DispatchQueue.main.async { [weak self] in
            guard let self else { return }
            self.stagePreviewImages[step.rawValue] = image
            self.publishStagePreviews()
        }
    }
}
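The helper copies the stage texture into a shared texture, downloads it into a vImage buffer, converts it into a CGImage, and publishes the image on the main thread. The main advantage over GPU frame capture is that you can see changes in your debug output interactively while the application is running.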

For a compute encoder, this is usually simple: bind an additional writable texture and write the values you want to inspect. For a render encoder, you need to add another color attachment and configure the fragment shader to write to multiple outputs.
The important part is that the normal pipeline should keep working as before. Debug outputs should observe the pipeline, not change its behavior. And because this is only diagnostic scaffolding, it should be disabled in release builds with preprocessor directives or another debug-only mechanism.
Also keep in mind that extra debug passes, copies, and readbacks can perturb frame timing, so profile with diagnostics disabled before drawing performance conclusions.
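For the render encoder case, the CPU-side setup could be sketched as follows. The function name attachDebugOutput and the choice of attachment index 1 are assumptions for illustration; the sketch also assumes that the debug texture was created with .renderTarget usage, that it matches the size and sample count of the main render target, and that the fragment function returns a struct with a member marked [[color(1)]]:

```swift
import Metal

/// Adds a debug color attachment at index 1 to both descriptors.
/// Assumptions: index 1 is not used by the normal pipeline, and the fragment
/// function writes the diagnostic value to an output marked [[color(1)]].
func attachDebugOutput(
    _ debugTexture: MTLTexture,
    pipelineDescriptor: MTLRenderPipelineDescriptor,
    renderPassDescriptor: MTLRenderPassDescriptor
) {
    // The pipeline state must know the pixel format of every attachment it writes.
    pipelineDescriptor.colorAttachments[1].pixelFormat = debugTexture.pixelFormat

    // The render pass binds the actual texture and keeps its contents after the pass.
    renderPassDescriptor.colorAttachments[1].texture = debugTexture
    renderPassDescriptor.colorAttachments[1].loadAction = .clear
    renderPassDescriptor.colorAttachments[1].clearColor = MTLClearColor(red: 0, green: 0, blue: 0, alpha: 0)
    renderPassDescriptor.colorAttachments[1].storeAction = .store
}

// Hypothetical call site, compiled only into debug builds:
//
// #if DEBUG
// attachDebugOutput(debugTexture,
//                   pipelineDescriptor: pipelineDescriptor,
//                   renderPassDescriptor: renderPassDescriptor)
// #endif
```

The pipeline descriptor has to be configured before the pipeline state is created, so the shader side would need a matching debug-only switch.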
This is not a special rendering or compute trick, but a general image-processing technique. I will describe it without code examples, because it can be implemented in many ways and at different levels of the pipeline.
The main idea is to visualize the difference between two images. This can help you detect whether one image is shifted, blurred, has different colors, contains missing details, or differs in any other way that matters for your task.
For example, you can compare:
A simple absolute difference view is often enough to reveal that something changed. For more specific cases, you can use a signed difference, amplify the difference with a scale factor, compare only one channel, or show the difference as a heatmap.

This is especially useful when the images look almost identical at first glance, but small errors accumulate or appear only in specific regions.
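Although the technique can live at any level of the pipeline, a CPU-side version is short enough to show the idea. This is only one possible sketch: the function name is hypothetical, and it assumes both images are tightly packed RGBA8 byte arrays of the same size.

```swift
/// Amplified absolute difference of two RGBA8 images stored as byte arrays.
/// Assumptions: both arrays have the same length, a multiple of 4 (R, G, B, A).
/// Alpha is forced to 255 so the result stays visible when displayed.
func amplifiedAbsoluteDifference(_ a: [UInt8], _ b: [UInt8], scale: Int = 8) -> [UInt8]? {
    guard a.count == b.count, a.count % 4 == 0, scale > 0 else {
        return nil
    }
    var result = [UInt8]()
    result.reserveCapacity(a.count)
    for index in 0..<a.count {
        if index % 4 == 3 {
            // Alpha channel: keep the difference image opaque.
            result.append(255)
        } else {
            // Color channel: absolute difference, amplified and clamped to 0...255.
            let difference = abs(Int(a[index]) - Int(b[index]))
            result.append(UInt8(min(255, difference * scale)))
        }
    }
    return result
}

let reference: [UInt8] = [10, 20, 30, 255]
let candidate: [UInt8] = [12, 20, 0, 255]
print(amplifiedAbsoluteDifference(reference, candidate) ?? [])
// prints [16, 0, 240, 255]
```

A scale factor of 1 gives the plain absolute difference; larger values make small deviations easier to see at the cost of clipping large ones.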
If a texture has mip levels, you can manually fill those levels with diagnostic data. This can be done from the CPU side if the texture is accessible, or with a small compute kernel if you want to generate the contents on the GPU.
This is useful for checking how LOD selection works. For example, you can fill each mip level with a different color or pattern, then look at the rendered result to see which levels are actually being sampled.

This can help diagnose issues such as unexpected blur, shimmering, wrong texture gradients, missing mipmaps, or incorrect sampler configuration.
Keep in mind that the result also depends on the sampler settings. Depending on your task, you may need to test different mip_filter values, such as nearest or linear mip filtering.
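The CPU-side variant could look roughly like the sketch below. The function name and the color palette are assumptions chosen for illustration, and the sketch only handles 2D .rgba8Unorm textures that are not in private storage:

```swift
import Metal

/// Fills every mip level of a CPU-accessible RGBA8 2D texture with a solid,
/// level-specific color. Returns false if the texture is not supported.
@discardableResult
func fillMipLevelsWithDebugColors(_ texture: MTLTexture) -> Bool {
    guard texture.textureType == .type2D,
          texture.pixelFormat == .rgba8Unorm,
          texture.storageMode != .private else {
        return false
    }
    // Hypothetical palette; levels beyond the palette wrap around.
    let palette: [[UInt8]] = [
        [255, 0, 0, 255],   // level 0: red
        [0, 255, 0, 255],   // level 1: green
        [0, 0, 255, 255],   // level 2: blue
        [255, 255, 0, 255], // level 3: yellow
    ]
    for level in 0..<texture.mipmapLevelCount {
        let width = max(1, texture.width >> level)
        let height = max(1, texture.height >> level)
        let color = palette[level % palette.count]

        var pixels = [UInt8]()
        pixels.reserveCapacity(width * height * 4)
        for _ in 0..<(width * height) {
            pixels.append(contentsOf: color)
        }
        // Upload the solid color into this mip level only.
        texture.replace(
            region: MTLRegionMake2D(0, 0, width, height),
            mipmapLevel: level,
            withBytes: pixels,
            bytesPerRow: width * 4
        )
    }
    return true
}
```

With this palette, a surface that renders green is sampling level 1, and a blend between two colors suggests linear mip filtering between adjacent levels.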
Sometimes you need more than a visual debug output. For example, you may want to count how many pixels, elements, or threads match a specific condition. In that case, you can use atomic types and atomic operations on the GPU.
#include <metal_stdlib>
using namespace metal;

kernel void krnCountAlive(texture2d<float, access::read> in [[ texture(0) ]],
                          device atomic_uint *counter [[buffer(0)]],
                          uint2 gid [[thread_position_in_grid]])
{
    // Skip threads that fall outside the texture bounds.
    int2 size(in.get_width(), in.get_height());
    if (any(int2(gid) >= size)) {
        return;
    }
    if (in.read(gid).r == 1.0) {
        atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
    }
}
This example counts pixels where the red channel is equal to 1.0. In this case, it is implemented as a separate compute pass, but you can also integrate the same idea into the pass where the relevant data is produced or processed.
The same approach can be used from fragment and vertex shaders too, as long as the shader has access to a writable buffer in device address space.
For diagnostic counters, shared buffers are often convenient because you can read them from the CPU side without an additional copy step.
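On the CPU side, a shared counter buffer could be handled as in the sketch below. The helper names are hypothetical, and the sketch assumes a single 32-bit counter bound at buffer index 0, which is read only after the command buffer has completed:

```swift
import Metal

/// Creates a CPU-readable buffer that holds one 32-bit counter, reset to zero.
func makeCounterBuffer(device: MTLDevice) -> MTLBuffer? {
    guard let buffer = device.makeBuffer(
        length: MemoryLayout<UInt32>.stride,
        options: .storageModeShared
    ) else {
        return nil
    }
    resetCounter(buffer)
    return buffer
}

/// Resets the counter explicitly so the same buffer can be reused across passes.
func resetCounter(_ buffer: MTLBuffer) {
    buffer.contents().storeBytes(of: 0, as: UInt32.self)
}

/// Reads the counter. Call this only after the GPU work has completed.
func readCounter(_ buffer: MTLBuffer) -> UInt32 {
    buffer.contents().load(as: UInt32.self)
}

// Hypothetical usage with an existing compute encoder and command buffer:
//
// computeEncoder.setBuffer(counterBuffer, offset: 0, index: 0) // matches [[buffer(0)]]
// commandBuffer.addCompletedHandler { _ in
//     print("Matching pixels: \(readCounter(counterBuffer))")
// }
```

If the same buffer is reused every frame, resetting it while a previous command buffer is still in flight would corrupt that frame's count, so a small pool of buffers may be safer.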
The Metal Shading Language is based on C++, so you can extract most of the math and logic into a separate .h file and include it from your .metal file.
The goal is to reuse the same code on the CPU side and cover it with tests. This is especially useful for procedural effects, coordinate transformations, sampling logic, SDFs, noise functions, color conversions, and other math-heavy shader code.
The main problem is that not every Metal type or function maps cleanly to CPU-side libraries such as simd.h. Because of that, you may need a small compatibility wrapper that maps similar names to the Metal and CPU worlds:
#ifndef MATH_COMPAT_H
#define MATH_COMPAT_H
#if defined(__METAL_VERSION__)
#include <metal_stdlib>
using metal::float2;
using metal::float3;
inline float2 xy(float3 value) { return value.xy; }
inline float3 xyx(float2 value) { return value.xyx; }
inline float2 zy(float3 value) { return value.zy; }
#else
#include <simd/simd.h>
#include <cmath>
using float2 = simd_float2;
using float3 = simd_float3;
inline float2 xy(float3 value) { return float2(value.x, value.y); }
inline float3 xyx(float2 value) { return float3(value.x, value.y, value.x); }
inline float2 zy(float3 value) { return float2(value.z, value.y); }
#endif
#endif
Then you can test your GPU-ready common code with XCTest from .mm files without additional wrappers.
This approach works best for pure math and deterministic logic. It is less useful for code that depends heavily on GPU-specific behavior: derivatives, texture sampling, threadgroup memory, synchronization, address spaces, interpolation, precision differences, or sampler state. Those parts still need to be tested on the GPU side.