ComputeShader
ComputeShader note.
ComputeShader
Base
References
- More Compute Shaders https://zhuanlan.zhihu.com/p/63223223
- 游戏引擎随笔 0x28:现代图形 API 的 Wave Intrinsics、Subgroup 以及 SIMD-group https://zhuanlan.zhihu.com/p/469436345
Optimizing Blur with a ComputeShader
```hlsl
//=============================================================================
// Performs a separable Gaussian blur with a blur radius up to 5 pixels.
//=============================================================================

cbuffer cbSettings : register(b0)
{
    // We cannot have an array entry in a constant buffer that gets mapped onto
    // root constants, so list each element.
    int gBlurRadius;

    // Support up to 11 blur weights.
    float w0;
    float w1;
    float w2;
    float w3;
    float w4;
    float w5;
    float w6;
    float w7;
    float w8;
    float w9;
    float w10;
};

static const int gMaxBlurRadius = 5;

Texture2D gInput : register(t0);
RWTexture2D<float4> gOutput : register(u0);

#define N 256
#define CacheSize (N + 2*gMaxBlurRadius)
groupshared float4 gCache[CacheSize];

[numthreads(N, 1, 1)]
void HorzBlurCS(int3 groupThreadID : SV_GroupThreadID,
                int3 dispatchThreadID : SV_DispatchThreadID)
{
    // Put in an array for easy indexing.
    float weights[11] = { w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10 };

    //
    // Fill local thread storage to reduce bandwidth. To blur
    // N pixels, we will need to load N + 2*BlurRadius pixels
    // due to the blur radius.
    //

    // This thread group runs N threads. To get the extra 2*BlurRadius pixels,
    // have 2*BlurRadius threads sample an extra pixel.
    if(groupThreadID.x < gBlurRadius)
    {
        // Clamp out of bound samples that occur at image borders.
        int x = max(dispatchThreadID.x - gBlurRadius, 0);
        gCache[groupThreadID.x] = gInput[int2(x, dispatchThreadID.y)];
    }
    if(groupThreadID.x >= N-gBlurRadius)
    {
        // Clamp out of bound samples that occur at image borders.
        int x = min(dispatchThreadID.x + gBlurRadius, gInput.Length.x-1);
        gCache[groupThreadID.x+2*gBlurRadius] = gInput[int2(x, dispatchThreadID.y)];
    }

    // Clamp out of bound samples that occur at image borders.
    gCache[groupThreadID.x+gBlurRadius] = gInput[min(dispatchThreadID.xy, gInput.Length.xy-1)];

    // Wait for all threads to finish.
    GroupMemoryBarrierWithGroupSync();

    //
    // Now blur each pixel.
    //
    float4 blurColor = float4(0, 0, 0, 0);

    for(int i = -gBlurRadius; i <= gBlurRadius; ++i)
    {
        int k = groupThreadID.x + gBlurRadius + i;
        blurColor += weights[i+gBlurRadius]*gCache[k];
    }

    gOutput[dispatchThreadID.xy] = blurColor;
}

[numthreads(1, N, 1)]
void VertBlurCS(int3 groupThreadID : SV_GroupThreadID,
                int3 dispatchThreadID : SV_DispatchThreadID)
{
    // Put in an array for easy indexing.
    float weights[11] = { w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10 };

    //
    // Fill local thread storage to reduce bandwidth. To blur
    // N pixels, we will need to load N + 2*BlurRadius pixels
    // due to the blur radius.
    //

    // This thread group runs N threads. To get the extra 2*BlurRadius pixels,
    // have 2*BlurRadius threads sample an extra pixel.
    if(groupThreadID.y < gBlurRadius)
    {
        // Clamp out of bound samples that occur at image borders.
        int y = max(dispatchThreadID.y - gBlurRadius, 0);
        gCache[groupThreadID.y] = gInput[int2(dispatchThreadID.x, y)];
    }
    if(groupThreadID.y >= N-gBlurRadius)
    {
        // Clamp out of bound samples that occur at image borders.
        int y = min(dispatchThreadID.y + gBlurRadius, gInput.Length.y-1);
        gCache[groupThreadID.y+2*gBlurRadius] = gInput[int2(dispatchThreadID.x, y)];
    }

    // Clamp out of bound samples that occur at image borders.
    gCache[groupThreadID.y+gBlurRadius] = gInput[min(dispatchThreadID.xy, gInput.Length.xy-1)];

    // Wait for all threads to finish.
    GroupMemoryBarrierWithGroupSync();

    //
    // Now blur each pixel.
    //
    float4 blurColor = float4(0, 0, 0, 0);

    for(int i = -gBlurRadius; i <= gBlurRadius; ++i)
    {
        int k = groupThreadID.y + gBlurRadius + i;
        blurColor += weights[i+gBlurRadius]*gCache[k];
    }

    gOutput[dispatchThreadID.xy] = blurColor;
}
```
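To sanity-check the shader's output, a CPU-side reference can help. The sketch below is written in Python rather than HLSL so it can be run directly. It assumes the weights `w0..w10` come from a normalized Gaussian whose radius is `ceil(2*sigma)`; that radius rule and the names `gauss_weights` and `horz_blur_row` are illustrative assumptions, not part of the shader above. The padded `cache` list plays the role of `gCache`.

```python
import math

MAX_BLUR_RADIUS = 5  # mirrors gMaxBlurRadius in the shader


def gauss_weights(sigma):
    """Return 2*radius+1 normalized weights.

    Assumption: radius = ceil(2*sigma), capped at MAX_BLUR_RADIUS.
    """
    radius = min(math.ceil(2.0 * sigma), MAX_BLUR_RADIUS)
    raw = [math.exp(-(i * i) / (2.0 * sigma * sigma))
           for i in range(-radius, radius + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def horz_blur_row(row, weights):
    """Blur one row of scalar pixels, clamping reads at the image borders."""
    radius = len(weights) // 2
    last = len(row) - 1
    # Equivalent of gCache: the row padded by `radius` clamped pixels per side.
    cache = [row[min(max(x, 0), last)]
             for x in range(-radius, len(row) + radius)]
    return [
        sum(weights[i + radius] * cache[x + radius + i]
            for i in range(-radius, radius + 1))
        for x in range(len(row))
    ]
```

Because the weights sum to 1, a constant row should come out unchanged, which is a quick check that the border clamping is right.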
Tips:
A compute shader can also use the hardware's linear interpolation. The code above caches individual texel values, so hardware linear interpolation does not apply to it.
```hlsl
Texture2D<float4> myTexture;
SamplerState linearClampSampler;

// Tip: the commented-out line below is wrong. A compute shader cannot
// compute the mipmap level automatically, so it must be specified manually.
// float4 color = myTexture.Sample(linearClampSampler, uv);
float4 color = myTexture.SampleLevel(linearClampSampler, uv, 0);
```
- https://github.com/discosultan/dx12-game-programming/blob/d78a3316f2a528562ec990ff10384c93fb6c722f/Samples/13-Blur/Shaders/Blur.hlsl#L34
- GroupMemoryBarrierWithGroupSync https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/groupmemorybarrierwithgroupsync?redirectedfrom=MSDN
- https://discussions.unity.com/t/how-to-sample-a-texture-in-a-compute-shader/804113
GPUDrivenTerrain
DataStructure
- The world is 10240m * 10240m.
- The world is divided into nodes. The smallest node is 64m and is called a sector.
  - A lod0 node is 64m.
  - The node size doubles with each lod level.
- Each node is divided into 8*8 patches.
  - At lod0, patchXSize = 64m/8 = 8m.
  - The patch size doubles with each lod level.
- Each patch is divided into 16*16 grids.
  - At lod0, gridXSize = 8m/16 = 0.5m.
  - The grid size doubles with each lod level.
Lod | patchXSize | gridPerPatch | gridXSize | gridXCount | gridPerNode | nodeXSize (m) | nodeXCount | patchXPerNode | sectorXPerNode |
---|---|---|---|---|---|---|---|---|---|
Lod0 | 8m*1=8m | 16*16 | 0.5m | 10240m/0.5=20480 | 128*128 | 128*0.5=64 | 10240/64=160 | 64/8=8 | 64/64=1 |
Lod1 | 8m*2=16m | 16*16 | 1m | 10240m/1=10240 | 128*128 | 128*1=128 | 10240/128=80 | 128/16=8 | 128/64=2 |
Lod2 | 8m*4=32m | 16*16 | 2m | 10240m/2=5120 | 128*128 | 128*2=256 | 10240/256=40 | 256/32=8 | 256/64=4 |
Lod3 | 8m*8=64m | 16*16 | 4m | 10240m/4=2560 | 128*128 | 128*4=512 | 10240/512=20 | 512/64=8 | 512/64=8 |
Lod4 | 8m*16=128m | 16*16 | 8m | 10240m/8=1280 | 128*128 | 128*8=1024 | 10240/1024=10 | 1024/128=8 | 1024/64=16 |
Lod5 | 8m*32=256m | 16*16 | 16m | 10240m/16=640 | 128*128 | 128*16=2048 | 10240/2048=5 | 2048/256=8 | 2048/64=32 |
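The table rows can be derived from four constants. The Python sketch below recomputes them per LOD; the function name `lod_params` and the dictionary keys are only illustrative, chosen to match the column names.

```python
WORLD_SIZE = 10240.0    # metres
SECTOR_SIZE = 64.0      # lod0 node size in metres
PATCHES_PER_NODE = 8    # per axis
GRIDS_PER_PATCH = 16    # per axis


def lod_params(lod):
    """Recompute one row of the table for the given LOD level."""
    scale = 2 ** lod  # every size doubles with each LOD level
    node_size = SECTOR_SIZE * scale
    patch_size = node_size / PATCHES_PER_NODE
    grid_size = patch_size / GRIDS_PER_PATCH
    return {
        "patchXSize": patch_size,
        "gridXSize": grid_size,
        "gridXCount": int(WORLD_SIZE / grid_size),
        "nodeXSize": node_size,
        "nodeXCount": int(WORLD_SIZE / node_size),
        "sectorXPerNode": int(node_size / SECTOR_SIZE),
    }
```

For example, `lod_params(5)["nodeXCount"]` gives 5, matching the Lod5 row.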
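Implementing the version without a heightmap
Creating the plane mesh
The patch is the smallest unit of rendering, and every patch uses the same plane as its mesh. The table above shows that the smallest patchSize is 8m, so the plane mesh we generate is 8m*8m.
Tiling the scene with planes
Assume every node is lod0. From nodeIndex we get nodeLoc, from that nodePosition, and from that patchPosition.
Use the following function to generate the patches: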
```hlsl
[numthreads(8, 8, 1)]
void BuildPatch(uint3 id : SV_DispatchThreadID, uint3 groupId : SV_GroupID, uint3 groupThreadId : SV_GroupThreadID)
{
    // One thread group per node; each of the 8*8 threads emits one patch.
    uint nodeId = groupId.x;
    uint2 nodeLoc = uint2(nodeId % NodeCountArr[0], nodeId / NodeCountArr[0]);
    uint2 patchLoc = groupThreadId.xy;

    Patch patch;
    patch.position = nodeLoc * NodeSizeArr[0] + patchLoc * PatchSizeArr[0];
    PatchListAppendBuffer.Append(patch);
}
```
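The index arithmetic in `BuildPatch` can be checked on the CPU. This is a minimal Python sketch of the same mapping; the function name `patch_position` and its parameters are illustrative stand-ins for `NodeCountArr[0]`, `NodeSizeArr[0]` and `PatchSizeArr[0]`.

```python
def patch_position(node_id, patch_loc, node_count_x, node_size, patch_size):
    """Map a linear node index plus a patch location to a world XZ position."""
    # Same row-major unpacking as the shader: x = id % count, y = id / count.
    node_loc = (node_id % node_count_x, node_id // node_count_x)
    return (
        node_loc[0] * node_size + patch_loc[0] * patch_size,
        node_loc[1] * node_size + patch_loc[1] * patch_size,
    )
```

With the lod0 values (160 nodes per axis, 64m nodes, 8m patches), node 161 is at nodeLoc (1, 1), so its last patch (7, 7) starts at (120, 120).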
Draw with the following call:
```csharp
cmd.DrawMeshInstancedIndirect(setting.patchMesh, 0, setting.terrainMaterial, 0, setting.PatchIndirectArgs);
```
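Building the quadtree
Starting from maxLod, traverse the nodes and check each node's distance to the camera. If the node is far away, do not subdivide it; otherwise subdivide it.
Nodes that do not need subdividing are added to QuadTreeBuffer. Nodes that do need subdividing are split into 4 nodes one level lower.
Tips:
Compute the Loc of the lower-level nodes as follows, so that the nodes produced by subdividing adjacent nodes do not overlap: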
```hlsl
NodeListAppendBuffer.Append(nodeLoc * 2);
NodeListAppendBuffer.Append(nodeLoc * 2 + uint2(1, 0));
NodeListAppendBuffer.Append(nodeLoc * 2 + uint2(0, 1));
NodeListAppendBuffer.Append(nodeLoc * 2 + uint2(1, 1));
```
For example, take two adjacent lodMax nodes, (1,1) and (1,2):
(1,1) subdivides into (2,2) (3,2) (2,3) (3,3)
(1,2) subdivides into (2,4) (3,4) (2,5) (3,5)
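The whole traversal can be sketched on the CPU as a level-by-level loop. In the Python sketch below, the subdivision test `dist < factor * size` and the `factor` parameter are assumptions made for illustration; the source only says that distant nodes are not subdivided. The names `split` and `build_quadtree` are likewise hypothetical.

```python
import math


def split(node_loc):
    """Return the 4 child locations one level lower, in the order used above."""
    x, y = node_loc
    return [(2 * x, 2 * y), (2 * x + 1, 2 * y),
            (2 * x, 2 * y + 1), (2 * x + 1, 2 * y + 1)]


def build_quadtree(camera_xz, max_lod, top_count_x, lod0_node_size, factor=2.0):
    """Return the final list of (lod, node_loc) pairs.

    Assumption: a node is subdivided when the camera is closer to its
    centre than factor * node size.
    """
    result = []
    current = [(x, y) for y in range(top_count_x) for x in range(top_count_x)]
    for lod in range(max_lod, -1, -1):
        size = lod0_node_size * (2 ** lod)
        next_level = []
        for loc in current:
            cx = (loc[0] + 0.5) * size
            cz = (loc[1] + 0.5) * size
            dist = math.hypot(camera_xz[0] - cx, camera_xz[1] - cz)
            if lod > 0 and dist < factor * size:
                next_level.extend(split(loc))   # goes to the next pass
            else:
                result.append((lod, loc))       # final node (QuadTreeBuffer)
        current = next_level
    return result
```

Since each split replaces one node with four children covering the same area, the final nodes always tile the world exactly once, whatever the threshold.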
Use the following code to debug the number of generated nodes:
```csharp
uint[] tmpData = new uint[3];
setting.CSIndirectArgs.GetData(tmpData);
Debug.LogWarningFormat("tmpData {0}, {1}, {2}", tmpData[0], tmpData[1], tmpData[2]);
```
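Building patches
Each node is split into 8*8 patches, which gives the position of every patch.
Rendering patches
- In the vertex shader, compute the current vertex's worldPos from the position of its patch and the vertex's localPos.
- Use the vertex's worldPos.xz to get the sampling uv for heightTex, and from that the vertex's worldPos.y.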
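The first step above can be sketched as follows. This Python snippet assumes the shared plane mesh has its origin at one corner and spans 8m, and that it is scaled by `2**lod` because patch size doubles per LOD level. Both points, and the name `vertex_world_xz`, are assumptions for illustration.

```python
def vertex_world_xz(patch_pos, local_pos, lod):
    """World XZ of a vertex: patch origin plus the scaled local position."""
    scale = 2 ** lod  # assumption: the 8m plane is stretched to the patch size
    return (
        patch_pos[0] + local_pos[0] * scale,
        patch_pos[1] + local_pos[1] * scale,
    )
```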
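Implementing the heightmap version
Introducing the heightmap
Each pixel of the heightmap stores the height of one PlaneMesh vertex, and the spacing between two adjacent pixels corresponds to one grid.
From a vertex's world coordinates we can compute its sampling UV, then sample the HeightMap to get the vertex's height.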
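A CPU-side sketch of that lookup, in Python. It assumes the heightmap covers the whole world starting at the world origin, that the first and last texels sit exactly on the world edges, and that heights are read with bilinear filtering. None of this is stated in the source, and `height_uv` and `sample_height` are hypothetical names.

```python
def height_uv(world_xz, world_size):
    """Normalize a world XZ position to [0, 1] UV (world starts at origin)."""
    return (world_xz[0] / world_size, world_xz[1] / world_size)


def sample_height(height_map, uv):
    """Bilinear lookup into a 2D list of heights, clamped at the edges."""
    rows, cols = len(height_map), len(height_map[0])
    fx = min(max(uv[0], 0.0), 1.0) * (cols - 1)
    fy = min(max(uv[1], 0.0), 1.0) * (rows - 1)
    x0, y0 = int(fx), int(fy)
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    tx, ty = fx - x0, fy - y0
    top = height_map[y0][x0] * (1 - tx) + height_map[y0][x1] * tx
    bottom = height_map[y1][x0] * (1 - tx) + height_map[y1][x1] * tx
    return top * (1 - ty) + bottom * ty
```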
Lod | worldXSize(m) | gridXSize(m) | nodeXSize(m) | nodeXVertexCount | worldXVertexCount |
---|---|---|---|---|---|
Lod0 | 1024 | 0.5 | 0.5*16*8=64 | 16*8=128 | 1024/0.5=2048 |
Lod0 | 2048 | 0.5 | 0.5*16*8=64 | 16*8=128 | 2048/0.5=4096 |
Lod1 | 1024 | 1 | 1*16*8=128 | 128 | 1024/1=1024 |
Lod1 | 2048 | 1 | 1*16*8=128 | 128 | 2048/1=2048 |
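Node LOD subdivision
Once the heightmap is introduced, the position of a node's center depends on the heightmap, and the vertices inside one node sit at different heights, so the camera-to-node distance computed earlier is no longer accurate. The following scheme improves the accuracy: compute the maximum and minimum heights within the node's range, use the average of the two as the node's height when computing the node's position, and then compute the distance from the camera to the node.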
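The scheme can be sketched in a few lines of Python. The names `node_center` and `camera_distance` are hypothetical, and `heights` stands for whatever height samples fall inside the node.

```python
import math


def node_center(node_loc, node_size, heights):
    """Node centre: XZ from the node location, Y from the min/max average."""
    y = 0.5 * (min(heights) + max(heights))
    return ((node_loc[0] + 0.5) * node_size, y, (node_loc[1] + 0.5) * node_size)


def camera_distance(camera_pos, center):
    """Euclidean distance between the camera and the node centre."""
    return math.sqrt(sum((c - n) ** 2 for c, n in zip(camera_pos, center)))
```

In practice the per-node minimum and maximum would probably be precomputed rather than scanned every frame, but that is a design choice outside what the source describes.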
ERROR
TODO: Runs incorrectly on D3D11; no problem on Vulkan.
References
- https://github.com/wlgys8/GPUDrivenTerrainLearn
- 天涯明月刀手游:如何应用 GPU Driven 优化渲染效果? https://mp.weixin.qq.com/s/m3e_F5FL3O23FPTGa54wgA
References
- https://docs.unity3d.com/Manual/SL-ShaderCompileTargets.html
- https://registry.khronos.org/OpenGL/extensions/ANDROID/ANDROID_extension_pack_es31a.txt
- 【Unity】Compute Shader 的基础介绍与使用 https://zhuanlan.zhihu.com/p/368307575 (Youdao Cloud backup)
- Compute shader support for mobile in 2022 https://forum.unity.com/threads/compute-shader-support-for-mobile-in-2022.1305024/
- GPU Instancing 手机兼容性报告 https://zhuanlan.zhihu.com/p/72717290
- ComputeShader 手机兼容性报告 https://zhuanlan.zhihu.com/p/68886986
- SSBO https://www.khronos.org/opengl/wiki/Shader_Storage_Buffer_Object
- 图形管线中的 AsyncCompute 中的 Async 是指什么? https://www.zhihu.com/question/276526226
- Advanced API Performance: Async Compute and Overlap https://developer.nvidia.com/blog/advanced-api-performance-async-compute-and-overlap/