ComputeShader

ComputeShader note.

ComputeShader

Base

Compute Shader Base

References

More Compute Shaders https://zhuanlan.zhihu.com/p/63223223
Game Engine Essays 0x28: Wave Intrinsics, Subgroup, and SIMD-group in modern graphics APIs https://zhuanlan.zhihu.com/p/469436345

Optimizing Blur with a ComputeShader

//=============================================================================
// Performs a separable Gaussian blur with a blur radius up to 5 pixels.
//=============================================================================

cbuffer cbSettings : register(b0)
{
    // We cannot have an array entry in a constant buffer that gets mapped onto
    // root constants, so list each element.

    int gBlurRadius;

    // Support up to 11 blur weights.
    float w0;
    float w1;
    float w2;
    float w3;
    float w4;
    float w5;
    float w6;
    float w7;
    float w8;
    float w9;
    float w10;
};

static const int gMaxBlurRadius = 5;


Texture2D gInput            : register(t0);
RWTexture2D<float4> gOutput : register(u0);

#define N 256
#define CacheSize (N + 2*gMaxBlurRadius)
groupshared float4 gCache[CacheSize];

[numthreads(N, 1, 1)]
void HorzBlurCS(int3 groupThreadID : SV_GroupThreadID,
                int3 dispatchThreadID : SV_DispatchThreadID)
{
    // Put in an array for easy indexing.
    float weights[11] = { w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10 };

    //
    // Fill local thread storage to reduce bandwidth.  To blur
    // N pixels, we will need to load N + 2*BlurRadius pixels
    // due to the blur radius.
    //

    // This thread group runs N threads.  To get the extra 2*BlurRadius pixels,
    // have 2*BlurRadius threads sample an extra pixel.
    if(groupThreadID.x < gBlurRadius)
    {
        // Clamp out of bound samples that occur at image borders.
        int x = max(dispatchThreadID.x - gBlurRadius, 0);
        gCache[groupThreadID.x] = gInput[int2(x, dispatchThreadID.y)];
    }
    if(groupThreadID.x >= N-gBlurRadius)
    {
        // Clamp out of bound samples that occur at image borders.
        int x = min(dispatchThreadID.x + gBlurRadius, gInput.Length.x-1);
        gCache[groupThreadID.x+2*gBlurRadius] = gInput[int2(x, dispatchThreadID.y)];
    }

    // Clamp out of bound samples that occur at image borders.
    gCache[groupThreadID.x+gBlurRadius] = gInput[min(dispatchThreadID.xy, gInput.Length.xy-1)];

    // Wait for all threads to finish.
    GroupMemoryBarrierWithGroupSync();

    //
    // Now blur each pixel.
    //

    float4 blurColor = float4(0, 0, 0, 0);

    for(int i = -gBlurRadius; i <= gBlurRadius; ++i)
    {
        int k = groupThreadID.x + gBlurRadius + i;

        blurColor += weights[i+gBlurRadius]*gCache[k];
    }

    gOutput[dispatchThreadID.xy] = blurColor;
}

[numthreads(1, N, 1)]
void VertBlurCS(int3 groupThreadID : SV_GroupThreadID,
                int3 dispatchThreadID : SV_DispatchThreadID)
{
    // Put in an array for easy indexing.
    float weights[11] = { w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10 };

    //
    // Fill local thread storage to reduce bandwidth.  To blur
    // N pixels, we will need to load N + 2*BlurRadius pixels
    // due to the blur radius.
    //

    // This thread group runs N threads.  To get the extra 2*BlurRadius pixels,
    // have 2*BlurRadius threads sample an extra pixel.
    if(groupThreadID.y < gBlurRadius)
    {
        // Clamp out of bound samples that occur at image borders.
        int y = max(dispatchThreadID.y - gBlurRadius, 0);
        gCache[groupThreadID.y] = gInput[int2(dispatchThreadID.x, y)];
    }
    if(groupThreadID.y >= N-gBlurRadius)
    {
        // Clamp out of bound samples that occur at image borders.
        int y = min(dispatchThreadID.y + gBlurRadius, gInput.Length.y-1);
        gCache[groupThreadID.y+2*gBlurRadius] = gInput[int2(dispatchThreadID.x, y)];
    }

    // Clamp out of bound samples that occur at image borders.
    gCache[groupThreadID.y+gBlurRadius] = gInput[min(dispatchThreadID.xy, gInput.Length.xy-1)];


    // Wait for all threads to finish.
    GroupMemoryBarrierWithGroupSync();

    //
    // Now blur each pixel.
    //

    float4 blurColor = float4(0, 0, 0, 0);

    for(int i = -gBlurRadius; i <= gBlurRadius; ++i)
    {
        int k = groupThreadID.y + gBlurRadius + i;

        blurColor += weights[i+gBlurRadius]*gCache[k];
    }

    gOutput[dispatchThreadID.xy] = blurColor;
}
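
The weights w0..w10 in cbSettings have to be filled in by the host before dispatch. A minimal host-side sketch in C++ follows. The function name CalcGaussWeights, the sigma parameter, and the rule radius = ceil(2 * sigma) are assumptions made for illustration, not something these notes specify.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical host-side helper: builds normalized 1D Gaussian weights for
// the w0..w10 slots of cbSettings. The sigma-to-radius rule is an assumption.
std::vector<float> CalcGaussWeights(float sigma)
{
    const int maxBlurRadius = 5;  // matches gMaxBlurRadius in the shader
    const int blurRadius = static_cast<int>(std::ceil(2.0f * sigma));
    assert(blurRadius >= 1 && blurRadius <= maxBlurRadius);

    std::vector<float> weights(2 * blurRadius + 1);
    const float twoSigma2 = 2.0f * sigma * sigma;
    float weightSum = 0.0f;

    // Unnormalized Gaussian, indexed the same way as weights[i + gBlurRadius].
    for (int i = -blurRadius; i <= blurRadius; ++i)
    {
        const float x = static_cast<float>(i);
        weights[i + blurRadius] = std::exp(-x * x / twoSigma2);
        weightSum += weights[i + blurRadius];
    }

    // Normalize so the blur preserves overall brightness.
    for (float& w : weights)
        w /= weightSum;

    return weights;
}
```

The returned vector has 2 * radius + 1 entries; its size determines gBlurRadius, and unused slots among w0..w10 can be left at zero.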

Tips:
Compute shaders can also use hardware linear filtering. The code above caches individual texel values, so hardware linear filtering cannot be applied to it.

Texture2D<float4> myTexture;
SamplerState linearClampSampler;

// Tip: the line below is wrong. A compute shader cannot derive the mipmap level automatically, so it must be specified explicitly.
// float4 color = myTexture.Sample(linearClampSampler, uv);
float4 color = myTexture.SampleLevel(linearClampSampler, uv, 0);
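
One common way to exploit hardware linear filtering in a blur (a different variant from the groupshared-cache code above) is to merge two neighbouring taps into a single bilinear fetch placed between them. The sketch below precomputes such merged taps on the host. The names Tap and MergeTapsForLinearSampling are hypothetical, and symmetric, positive weights are assumed.

```cpp
#include <cassert>
#include <vector>

struct Tap
{
    float offset;  // distance from the center texel, in texels
    float weight;  // combined weight of the merged taps
};

// Merges taps (1,2), (3,4), ... on one side of the kernel. A bilinear fetch at
// `offset` returns the two texels already blended in the ratio w1 : w2, so
// multiplying by (w1 + w2) reproduces w1 * t1 + w2 * t2 with one fetch.
std::vector<Tap> MergeTapsForLinearSampling(const std::vector<float>& weights)
{
    assert(weights.size() % 2 == 1);
    const int radius = static_cast<int>(weights.size() / 2);

    std::vector<Tap> taps;
    taps.push_back({0.0f, weights[radius]});  // the center tap stays as-is

    for (int i = 1; i <= radius; i += 2)
    {
        const float w1 = weights[radius + i];
        // An odd radius leaves the outermost tap without a partner.
        const float w2 = (i + 1 <= radius) ? weights[radius + i + 1] : 0.0f;
        const float w = w1 + w2;
        assert(w > 0.0f);
        const float offset = (i * w1 + (i + 1) * w2) / w;
        taps.push_back({offset, w});
    }
    return taps;
}
```

In the shader, each non-center tap would be sampled twice with SampleLevel, at +offset and -offset from the texel center, which roughly halves the number of fetches.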

GPUDrivenTerrain

DataStructure

The world is 10240m * 10240m.
The world is divided into nodes. The smallest node is 64m and is called a sector.
- A lod0 node is 64m
- Node size doubles with each lod level
Each node is divided into 8*8 patches.
- At lod0, patchXSize=64m/8=8m
- Patch size doubles with each lod level
Each patch is divided into 16*16 grids.
- At lod0, gridXSize=8m/16=0.5m
- Grid size doubles with each lod level

	patchXSize	gridPerPatch	gridXSize	gridXCount	gridPerNode	nodeXSize	nodeXCount	patchXPerNode	sectorXPerNode
Lod0	8m*1=8m	16*16	0.5m	10240m/0.5m=20480	128*128	128*0.5m=64m	10240m/64m=160	64m/8m=8	64m/64m=1
Lod1	8m*2=16m	16*16	1m	10240m/1m=10240	128*128	128*1m=128m	10240m/128m=80	128m/16m=8	128m/64m=2
Lod2	8m*4=32m	16*16	2m	10240m/2m=5120	128*128	128*2m=256m	10240m/256m=40	256m/32m=8	256m/64m=4
Lod3	8m*8=64m	16*16	4m	10240m/4m=2560	128*128	128*4m=512m	10240m/512m=20	512m/64m=8	512m/64m=8
Lod4	8m*16=128m	16*16	8m	10240m/8m=1280	128*128	128*8m=1024m	10240m/1024m=10	1024m/128m=8	1024m/64m=16
Lod5	8m*32=256m	16*16	16m	10240m/16m=640	128*128	128*16m=2048m	10240m/2048m=5	2048m/256m=8	2048m/64m=32
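
The rows of the table all follow from three constants (world size, sector size, and the 8*8 / 16*16 subdivision), so they can be computed instead of hard-coded. A small sketch; the struct LodInfo and the function ComputeLodInfo are hypothetical names.

```cpp
#include <cassert>

struct LodInfo
{
    float gridSize;       // metres per grid cell
    float patchSize;      // metres per patch
    float nodeSize;       // metres per node
    int nodeCountX;       // nodes along one world axis
    int sectorsPerNodeX;  // 64m sectors along one node axis
};

// Derives one row of the table above from the world constants.
LodInfo ComputeLodInfo(int lod)
{
    assert(lod >= 0 && lod <= 5);
    const float worldSize = 10240.0f;  // metres
    const float sectorSize = 64.0f;    // lod0 node size in metres
    const int patchesPerNodeX = 8;
    const int gridsPerPatchX = 16;

    LodInfo info;
    info.nodeSize = sectorSize * static_cast<float>(1 << lod);  // doubles per lod
    info.patchSize = info.nodeSize / patchesPerNodeX;
    info.gridSize = info.patchSize / gridsPerPatchX;
    info.nodeCountX = static_cast<int>(worldSize / info.nodeSize);
    info.sectorsPerNodeX = 1 << lod;
    return info;
}
```

The resulting per-lod node sizes, patch sizes, and node counts are presumably what arrays such as NodeSizeArr, PatchSizeArr, and NodeCountArr in the shader code hold.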

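Implementing the version without a height map

Creating the plane mesh

The patch is the smallest unit of rendering, and every patch uses the same plane as its mesh. The table above shows that the smallest patchSize is 8m, so the generated plane mesh is 8m*8m.

Tiling the scene with planes

Assuming all nodes are lod0, nodeLoc can be derived from nodeIndex, which gives nodePosition and in turn patchPosition.
Patches are generated with the following function: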

[numthreads(8,8,1)]
void BuildPatch(uint3 id : SV_DispatchThreadID, uint3 groupId:SV_GroupID, uint3 groupThreadId:SV_GroupThreadID)
{
    uint nodeId = groupId.x;
    uint2 nodeLoc = uint2(nodeId % NodeCountArr[0], nodeId / NodeCountArr[0]);
    uint2 patchLoc = groupThreadId.xy;

    Patch patch;
    patch.position = nodeLoc * NodeSizeArr[0] + patchLoc * PatchSizeArr[0];
    PatchListAppendBuffer.Append(patch);
}

Drawing uses the following call:

cmd.DrawMeshInstancedIndirect(setting.patchMesh, 0, setting.terrainMaterial, 0, setting.PatchIndirectArgs);
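
The argument buffer passed to DrawMeshInstancedIndirect holds five 32-bit integers per draw: index count per instance, instance count, start index location, base vertex location, and start instance location. The C++ mirror below sketches that layout. The struct and helper names are hypothetical, and it is an assumption that the instance count is later overwritten on the GPU with the number of appended patches (in Unity, for example, via ComputeBuffer.CopyCount at byte offset 4).

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the five-integer layout of an indexed, instanced indirect draw.
struct DrawIndexedIndirectArgs
{
    std::uint32_t indexCountPerInstance;
    std::uint32_t instanceCount;
    std::uint32_t startIndexLocation;
    std::uint32_t baseVertexLocation;
    std::uint32_t startInstanceLocation;
};

static_assert(sizeof(DrawIndexedIndirectArgs) == 5 * sizeof(std::uint32_t),
              "indirect draw arguments must be five tightly packed 32-bit values");

// Hypothetical helper: arguments for drawing `patchCount` instances of the
// shared patch mesh. The mesh is assumed to be a triangle list.
DrawIndexedIndirectArgs MakePatchDrawArgs(std::uint32_t meshIndexCount,
                                          std::uint32_t patchCount)
{
    assert(meshIndexCount % 3 == 0);
    DrawIndexedIndirectArgs args;
    args.indexCountPerInstance = meshIndexCount;
    args.instanceCount = patchCount;
    args.startIndexLocation = 0;
    args.baseVertexLocation = 0;
    args.startInstanceLocation = 0;
    return args;
}
```

For a plane of 16*16 grids with two triangles per grid, meshIndexCount would be 16 * 16 * 2 * 3 = 1536.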

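Building the quadtree

Starting from maxLod, traverse the nodes and check each node's distance to the camera. If the node is far away, it is not subdivided; otherwise it is subdivided.
Nodes that need no subdivision are added to QuadTreeBuffer; nodes that need subdivision are split into 4 nodes one level lower.
Tips:
The Loc of each lower-level node is computed as follows, so that nodes produced by subdividing adjacent nodes do not overlap: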

NodeListAppendBuffer.Append(nodeLoc * 2);
NodeListAppendBuffer.Append(nodeLoc * 2 + uint2(1, 0));
NodeListAppendBuffer.Append(nodeLoc * 2 + uint2(0, 1));
NodeListAppendBuffer.Append(nodeLoc * 2 + uint2(1, 1));

For example, take two adjacent lodMax nodes (1,1) and (1,2):
(1,1) subdivides into (2,2) (3,2) (2,3) (3,3)
(1,2) subdivides into (2,4) (3,4) (2,5) (3,5)
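
The traversal described above can be sketched on the CPU as a level-by-level loop, which is roughly what the GPU version does with one pass per lod. The subdivision criterion used here (distance to the node center smaller than nodeSize * distanceFactor) and all names are assumptions for illustration; the notes only say that near nodes are subdivided and far ones are not.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

struct NodeLoc { unsigned x; unsigned y; };
struct Node { NodeLoc loc; int lod; };

// Child locations one lod level lower, in the same order as the Append calls.
std::array<NodeLoc, 4> Subdivide(NodeLoc loc)
{
    return {{ {loc.x * 2, loc.y * 2},
              {loc.x * 2 + 1, loc.y * 2},
              {loc.x * 2, loc.y * 2 + 1},
              {loc.x * 2 + 1, loc.y * 2 + 1} }};
}

// Returns the final node list (the QuadTreeBuffer contents in the notes).
std::vector<Node> BuildQuadTree(float camX, float camZ, int maxLod, float distanceFactor)
{
    assert(maxLod >= 0 && maxLod <= 5);
    const float worldSize = 10240.0f;
    const float sectorSize = 64.0f;

    // Seed the work list with every node of the coarsest level.
    const float topNodeSize = sectorSize * static_cast<float>(1 << maxLod);
    const unsigned topCount = static_cast<unsigned>(worldSize / topNodeSize);
    std::vector<Node> current;
    for (unsigned y = 0; y < topCount; ++y)
        for (unsigned x = 0; x < topCount; ++x)
            current.push_back({{x, y}, maxLod});

    std::vector<Node> finalNodes;
    for (int lod = maxLod; lod >= 0; --lod)
    {
        const float nodeSize = sectorSize * static_cast<float>(1 << lod);
        std::vector<Node> next;
        for (const Node& node : current)
        {
            const float centerX = (node.loc.x + 0.5f) * nodeSize;
            const float centerZ = (node.loc.y + 0.5f) * nodeSize;
            const float dist = std::hypot(centerX - camX, centerZ - camZ);
            if (lod > 0 && dist < nodeSize * distanceFactor)
            {
                // Near the camera: replace the node with its four children.
                for (const NodeLoc& child : Subdivide(node.loc))
                    next.push_back({child, lod - 1});
            }
            else
            {
                finalNodes.push_back(node);  // far away, or already lod0
            }
        }
        current.swap(next);
    }
    return finalNodes;
}
```

Because each child Loc is derived from nodeLoc * 2, the final nodes always tile the world exactly once, whatever the camera position.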

Use the following code to debug the number of generated nodes:

uint[] tmpData = new uint[3];
setting.CSIndirectArgs.GetData(tmpData);
Debug.LogWarningFormat("tmpData {0}, {1}, {2}", tmpData[0],tmpData[1],tmpData[2]);

Building patches

Each node is split into 8*8 patches, and the position of every patch is computed.
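
A CPU sketch of this step, generalizing the lod0 BuildPatch kernel to any lod. It assumes that a patch position is the minimum corner of the patch in world space and that node and patch sizes double per lod as in the DataStructure table; the Patch fields and the function name are hypothetical.

```cpp
#include <cassert>
#include <vector>

struct Patch
{
    float x;  // world-space X of the patch's minimum corner
    float z;  // world-space Z of the patch's minimum corner
    int lod;
};

// Splits one quadtree node into its 8*8 patches.
std::vector<Patch> BuildPatches(unsigned nodeLocX, unsigned nodeLocY, int lod)
{
    assert(lod >= 0 && lod <= 5);
    const int patchesPerNodeX = 8;
    const float nodeSize = 64.0f * static_cast<float>(1 << lod);
    const float patchSize = nodeSize / patchesPerNodeX;

    std::vector<Patch> patches;
    patches.reserve(patchesPerNodeX * patchesPerNodeX);
    for (int py = 0; py < patchesPerNodeX; ++py)
    {
        for (int px = 0; px < patchesPerNodeX; ++px)
        {
            Patch patch;
            patch.x = nodeLocX * nodeSize + px * patchSize;
            patch.z = nodeLocY * nodeSize + py * patchSize;
            patch.lod = lod;
            patches.push_back(patch);
        }
    }
    return patches;
}
```

On the GPU the two loops collapse into one 8*8 thread group per node, with (px, py) taken from SV_GroupThreadID.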

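Rendering patches

In the vertex shader, the worldPos of the current vertex is computed from the position of its patch and the vertex's localPos.
The worldPos.xz of the vertex gives the sampling uv into heightTex, which in turn gives the vertex's worldPos.y.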
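
The two steps above amount to a scale-and-offset followed by a division. A CPU sketch under stated assumptions: the plane mesh has its origin at a corner and spans 0..8m, patch positions are minimum corners, the mesh is scaled by 2^lod, and heightTex covers the whole world so that uv = worldPos.xz / worldSize. All names are hypothetical.

```cpp
#include <cassert>

struct Float2 { float x; float y; };

struct TerrainVertex
{
    Float2 worldXZ;   // horizontal world position of the vertex
    Float2 heightUV;  // where to sample heightTex for worldPos.y
};

// Maps a vertex of the shared plane mesh into world space for one patch.
TerrainVertex PatchVertexToWorld(Float2 patchPos, int lod, Float2 localPos, float worldSize)
{
    assert(lod >= 0 && worldSize > 0.0f);
    const float scale = static_cast<float>(1 << lod);  // the 8m mesh grows with lod

    TerrainVertex v;
    v.worldXZ = { patchPos.x + localPos.x * scale, patchPos.y + localPos.y * scale };
    v.heightUV = { v.worldXZ.x / worldSize, v.worldXZ.y / worldSize };
    return v;
}
```

In a real vertex shader the height fetch would again need an explicit mip level (for example SampleLevel), since no screen-space derivatives are available there.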

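Implementing the height-map version

Introducing the height map

Each pixel of the height map stores the height of a PlaneMesh vertex, and the spacing between two adjacent pixels corresponds to one grid.
The sampling UV of a vertex can be computed from its world coordinates, and sampling the HeightMap at that UV gives the vertex height.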

	worldXSize(m)	gridXSize(m)	nodeXSize(m)	nodeXVertexCount	worldXVertexCount
Lod0	1024	0.5	0.5*16*8=64	16*8=128	1024/0.5=2048
	2048				2048/0.5=4096
Lod1	1024	1	1*16*8=128	128	1024/1=1024
	2048				2048/1=2048
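Node Lod subdivision

Once the height map is introduced, the position of a node's center is determined by the height map, and vertices within one node have different heights, so the previously computed camera-to-node distance becomes inaccurate. The following approach improves the accuracy: compute the maximum and minimum height within the node's extent, use their average as the node's height to compute the node's position, and then compute the camera-to-node distance.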
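
A sketch of that distance computation. The function name and parameter layout are hypothetical, and it is an assumption that the per-node minimum and maximum heights are precomputed (for example, stored in a small min/max texture).

```cpp
#include <cassert>
#include <cmath>

// Distance from the camera to a node whose height is taken as the midpoint
// of the minimum and maximum terrain height inside the node.
float NodeDistanceWithHeight(float camX, float camY, float camZ,
                             unsigned nodeLocX, unsigned nodeLocY, int lod,
                             float minHeight, float maxHeight)
{
    assert(lod >= 0 && minHeight <= maxHeight);
    const float nodeSize = 64.0f * static_cast<float>(1 << lod);

    // Horizontal center from the node Loc, vertical center from min/max height.
    const float centerX = (nodeLocX + 0.5f) * nodeSize;
    const float centerY = 0.5f * (minHeight + maxHeight);
    const float centerZ = (nodeLocY + 0.5f) * nodeSize;

    const float dx = centerX - camX;
    const float dy = centerY - camY;
    const float dz = centerZ - camZ;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}
```

The result replaces the purely horizontal distance in the subdivision test, so a camera high above or far below a node no longer triggers unnecessary subdivision.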

ERROR

TODO: runs incorrectly on D3D11; works fine on Vulkan.

References

https://github.com/wlgys8/GPUDrivenTerrainLearn
Moonlight Blade Mobile: How to apply GPU Driven to optimize rendering? https://mp.weixin.qq.com/s/m3e_F5FL3O23FPTGa54wgA

References

https://docs.unity3d.com/Manual/SL-ShaderCompileTargets.html
https://registry.khronos.org/OpenGL/extensions/ANDROID/ANDROID_extension_pack_es31a.txt
[Unity] Introduction to and basic use of Compute Shaders https://zhuanlan.zhihu.com/p/368307575 (backed up to Youdao Cloud Notes)
Compute shader support for mobile in 2022 https://forum.unity.com/threads/compute-shader-support-for-mobile-in-2022.1305024/
GPU Instancing mobile compatibility report https://zhuanlan.zhihu.com/p/72717290
ComputeShader mobile compatibility report https://zhuanlan.zhihu.com/p/68886986
SSBO https://www.khronos.org/opengl/wiki/Shader_Storage_Buffer_Object
What does the "Async" in AsyncCompute in the graphics pipeline refer to? https://www.zhihu.com/question/276526226
Advanced API Performance: Async Compute and Overlap https://developer.nvidia.com/blog/advanced-api-performance-async-compute-and-overlap/
