Performance guide
Design faster. Render faster. Iterate faster.
Our AMD Ryzen™ Performance Guide will help guide you through the optimization process with a collection of tidbits, tips, and tricks which aim to support you in your performance quest.
Tools
PresentMon
PresentMon is a Command Line Interface (CLI) tool for logging frame times such as MsBetweenPresents.
Example:
PresentMon-1.6.0-x64.exe -process_name "MyGame.exe" -stop_existing_session -terminate_on_proc_exit -terminate_after_timed -timed 60 -output_file "%CD%\result\presentmon.csv"
Open Capture and Analysis Tool (OCAT)
OCAT is a Graphical User Interface (GUI) tool, based on PresentMon, with hotkey support for logging frame times.
Windows® Performance Toolkit
Windows Performance Analyzer (WPA)
WPA is a highly configurable tool for finding system performance bottlenecks and ideal for filtering and visualizing call stacks.
- WPA is included in the Windows SDK Windows Performance Toolkit and also available in the Microsoft Store.
- WPA opens logs created by wpr.exe or xperf.exe.
- wpr.exe is included in all Windows 10 installations.
- xperf.exe is included in the Windows SDK.
- See https://docs.microsoft.com/en-us/windows-hardware/test/wpt/windows-performance-analyzer
- See https://developer.microsoft.com/en-us/windows/downloads/windows-10-sdk/
GPUView
GPUView is a tool for analyzing GPU performance with regard to direct memory access (DMA) buffer processing.
- GPUView allows you to find times where the GPU Hardware Queue is empty or times where the Process Context CPU Queue is empty.
- Ideally, the GPU Hardware Queue should be near 100% busy.
- GPUView is included in the Windows SDK Windows Performance Toolkit.
- See https://docs.microsoft.com/en-us/windows-hardware/drivers/display/using-gpuview
- See https://developer.microsoft.com/en-us/windows/downloads/windows-10-sdk/
Visual Studio Concurrency Visualizer
You can use the Concurrency Visualizer for Visual Studio to locate performance bottlenecks, CPU underutilization, thread contention, cross-core thread migration, synchronization delays, DirectX activity, areas of overlapped I/O, and other information.
AMD µProf
- Find performance bottlenecks using CPU hardware Performance Monitoring Counters (PMCs)
- Instruction Based Sampling (IBS) has disassembly instruction accurate attribution but with limited counter coverage.
- Event Based Sampling (EBS) has more counters available but less accurate attribution. It is typically accurate within a few instructions. AMD Dev Techs often use EBS counters in the Assess Performance (Extended) profile.
- See https://developer.amd.com/amd-uprof/
Radeon GPU Profiler (RGP)
RGP is a low-level performance analysis tool for DirectX® 12, Vulkan®, and OpenCL™ applications on Radeon™ GPUs.
- The Overview > Frame summary can quickly indicate whether the application is CPU-bound (GPU idle > 5%), based on the few frames captured.
- See https://github.com/GPUOpen-Tools/radeon_gpu_profiler
Compiling
Use the latest compiler and Windows SDK
- Get the latest build and link time improvements.
- Ensure you are using the latest C runtime optimizations.
- See https://devblogs.microsoft.com/cppblog/the-coalition-sees-27-9x-iteration-build-improvement-with-visual-studio-2019/
Add virus and threat protection exclusions
- Add project folders to virus and threat protection settings exclusions for faster build times.
- We have seen some projects compiling 20% faster!
Prefer Shipping configuration builds for CPU profiling
- Debug and development configuration builds may greatly reduce performance.
- Stats collection may cause cache pollution.
- Logging may create serialization points.
- Sometimes debug builds may disable multi-threading optimizations.
- While investigating open issues, developers may submit change requests which enable debug features on Test and Shipping configurations. Be sure to disable debug features before you ship!
- Some Unreal Engine settings to verify include:
- In Build.h, #define FORCE_USE_STATS and #define STATS should never be enabled during Shipping builds.
- It may be convenient to enable ALLOW_CONSOLE_IN_SHIPPING during game development.
- See master/Engine/Source/Runtime/Core/Public/Misc/Build.h
Disable Anti-Tamper for CPU profiling
- Build a binary similar to Shipping configuration but without Anti-Tamper or Anti-Cheat tools which may prevent CPU profiling tools from properly loading symbols.
Testing
Audit Content
Run Unreal Engine UE4Editor MapCheck to find errors.
Use Unity® AssetPostprocessor to enforce minimum standards.
Ask artists and QA for scene recommendations
- It is important to profile potential optimizations using representative content. Not all scenes are created equal, and there is not always one best scene.
- Indoor scenes may have heavy occlusion.
- Outdoor forests may have many masked materials.
- Large crowds may represent a good stress test for AI, navmesh, physics, animation, and rendering workloads.
- A consistent in-game time of day is an important consideration when minimizing run-to-run variation.
- Time of day may trigger specific world events such as rush hour where there are larger crowds or different lighting composition between day and night.
Use the default Platform Clock setting
- Use the default platform clock setting for best performance with high precision and low latency.
- Default:
bcdedit.exe /deletevalue useplatformclock
- Setting useplatformclock to yes should only be used for debugging. However, some overclocking tools may set it to yes:
bcdedit.exe /set useplatformclock yes
- See https://docs.microsoft.com/en-us/windows-hardware/drivers/devtest/bcdedit--set
Test the cold shader cache first time user experience
- Be sure to clear the application shader cache if it has one.
- The end user will often not be running the same scene back to back as a developer might.
- The example below clears the Microsoft®, AMD, and NVIDIA® shader caches:
rem Run as administrator
rem Disable Steam Shader Pre-Caching before running this script
rem Reboot after running this script to clear any shaders still in system memory
setlocal enableextensions
cd /d "%~dp0"
rmdir /s /q "%LOCALAPPDATA%\D3DSCache"
rmdir /s /q "%LOCALAPPDATA%\AMD\DxCache"
rmdir /s /q "%LOCALAPPDATA%\AMD\GLCache"
rmdir /s /q "%LOCALAPPDATA%\AMD\VkCache"
rmdir /s /q "%ProgramData%\NVIDIA Corporation\NV_Cache"
rmdir /s /q "%ProgramFiles(x86)%\Steam\steamapps\shadercache"
Analyze frame times
- When doing performance analysis, prefer averages and percentiles over min and max metrics.
- It only takes one bad frame for min and max to no longer be representative of the average experience.
- Be sure to collect sufficient samples when comparing 3 sigma and higher.
- Determine the coefficient of variation over many test iterations.
- Under 3% is good in our experience.
- High variation is indicative of an inconsistent test scene.
- We recommend setting static seed values for dynamically generated content and fixing variables like time of day.
- If higher variation is unavoidable, the user should increase their number of benchmark runs proportionally.
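The metrics above can be sketched in a few lines of C++. This is a minimal illustration, not part of any AMD tool: the function names are hypothetical, the percentile uses the nearest-rank definition (one of several in common use), and the coefficient of variation is the sample standard deviation divided by the mean.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Arithmetic mean of a non-empty sample set.
double mean(const std::vector<double>& v)
{
    assert(!v.empty());
    return std::accumulate(v.begin(), v.end(), 0.0) / static_cast<double>(v.size());
}

// Nearest-rank percentile of a non-empty sample set, with p in (0, 100].
// The vector is taken by value so the caller's data is not reordered.
double percentile(std::vector<double> v, double p)
{
    assert(!v.empty() && p > 0.0 && p <= 100.0);
    std::sort(v.begin(), v.end());
    const double n = static_cast<double>(v.size());
    const std::size_t rank = static_cast<std::size_t>(std::ceil(p / 100.0 * n));
    return v[rank - 1];
}

// Coefficient of variation: sample standard deviation divided by the mean.
// Needs at least two samples, for example one average per benchmark run.
double coefficientOfVariation(const std::vector<double>& v)
{
    assert(v.size() >= 2);
    const double m = mean(v);
    double sumSq = 0.0;
    for (double x : v)
    {
        sumSq += (x - m) * (x - m);
    }
    return std::sqrt(sumSq / static_cast<double>(v.size() - 1)) / m;
}
```

For example, feed percentile() the per-frame times from one run, and feed coefficientOfVariation() one average per run; a result below 0.03 corresponds to the 3% guideline above.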
Profiling
Disable Memory Integrity if needed
Hypervisor-Protected Code Integrity (HVCI) is labelled Memory Integrity in the Windows Security app.
- HVCI can be accessed via Settings > Update and Security > Windows Security > Device security > Core isolation details > Memory Integrity.
- You may need to disable Memory Integrity for some tools to function such as AMD µProf.
- See https://support.microsoft.com/en-us/windows/device-protection-in-windows-security-afa11526-de57-b1c5-599f-3a4c6a61c5e2
Add symbols
The symstore and symbol path can be powerful tools for loading vendor symbols and providing hints to tools which do not check the local directory.
- Edit the system environment variables for _NT_SYMBOL_PATH.
- Example:
_NT_SYMBOL_PATH=cache*c:\symbols;srv*https://download.amd.com/dir/bin;srv*https://driver-symbols.nvidia.com/;srv*http://msdl.microsoft.com/download/symbols
- Install the Windows 10 SDK Debuggers including symchk.exe and symstore.exe. Adding "C:\Program Files (x86)\Windows Kits\10\Debuggers\x64" to the PATH is recommended.
- Store symbols for your project.
- Example:
symstore.exe add /r /f *.pdb /s c:\symbols /t "MyProject"
- See https://developer.microsoft.com/en-us/windows/downloads/windows-10-sdk/
- See https://docs.microsoft.com/en-us/windows/win32/debug/using-symstore
Determine if CPU-bound
Typically, the application is CPU-bound if GPU Idle > 5%.
- Look for bubbles of idle work on the GPU in tools such as RGP, GPUView, and the Visual Studio Concurrency Visualizer.
- There are multiple tools and methods available for developers to detect boundedness:
- Radeon GPU Profiler (RGP)
- GPUView
- Warning: Adapter Hardware Queue 3D is a good measure of GPU %Busy but be sure to zoom to a selection which trims out the head and tail of the log which may be missing events.
- Warning: This capture is typically limited to a few seconds which may be too broad to see smaller idle periods. Consider using the zoom function to limit scope to a few frames at a time.
- Example:
rem run as administrator
rem add "C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\gpuview" to path
setlocal enableextensions
cd /d "%~dp0"
rem switch active foreground window back to the game application
timeout.exe /t 5
call log.cmd light
timeout.exe /t 5
call log.cmd
rem open Merged.etl
- Windows Performance Recorder & Windows Performance Analyzer
- Warning: The Windows Performance Analyzer’s GPU Utilization (FM) GPU by Process excludes GPU Idle time in Percentage calculation. Fortunately, you can open the etl file in GPUView.
- Note this capture is typically limited to a few seconds.
- Example:
rem run as administrator
setlocal enableextensions
cd /d "%~dp0"
rem switch active foreground window back to the game application
timeout.exe /t 5
wpr.exe -start gpu -filemode
timeout.exe /t 5
wpr.exe -stop out.etl
rem open out.etl
- Visual Studio Concurrency Visualizer
- The Threads View shows DirectX GPU Engine utilization, which may be used to zoom into regions where the GPU is idle for further analysis of blocked threads.
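The 5% rule of thumb can be applied to exported timing data with a small helper. The sketch below is an illustration only: the BusyInterval type and both function names are hypothetical, and it assumes you have already extracted non-overlapping GPU busy intervals for a trimmed capture window from one of the tools above.

```cpp
#include <cassert>
#include <vector>

// One interval during which the GPU hardware queue was busy, in microseconds.
// Intervals are assumed to be non-overlapping and inside the capture window.
struct BusyInterval
{
    double startUs;
    double endUs;
};

// Percentage of the capture window during which the GPU was idle.
double gpuIdlePercent(const std::vector<BusyInterval>& busy,
                      double windowStartUs,
                      double windowEndUs)
{
    assert(windowEndUs > windowStartUs);
    double busyUs = 0.0;
    for (const BusyInterval& b : busy)
    {
        busyUs += b.endUs - b.startUs;
    }
    const double windowUs = windowEndUs - windowStartUs;
    return 100.0 * (windowUs - busyUs) / windowUs;
}

// Rule of thumb from this guide: more than 5% GPU idle suggests the
// application is CPU-bound.
bool isLikelyCpuBound(double idlePercent, double thresholdPercent = 5.0)
{
    return idlePercent > thresholdPercent;
}
```

Trim the head and tail of the capture before choosing the window, as recommended for GPUView above, so missing events do not count as idle time.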
Verify UE4 Parallel Rendering
- While investigating open issues, developers may submit change requests which enable debug features on Test and Shipping configurations. Some debug features may greatly reduce performance due to disabling parallel rendering.
- Check UE4 Parallel Rendering CVARs before shipping.
Command | Recommended Value |
---|---|
r.rhicmdbypass | 0 |
r.rhicmdusedeferredcontexts | 1 |
r.rhicmduseparallelalgorithms | 1 |
r.rhithread.enable | 1 |
Verify Parallel DX12 PipelineState Creation
Use a cold shader cache while verifying parallel DX12 pipeline state creation.
- Install the Windows SDK Windows Performance Toolkit.
- Add the GPUView folder to the PATH.
rem run as administrator
rem clear the shader cache first
rem start logging
call log.cmd
rem collect samples while game is starting and calling D3D12.dll!CDevice::CreatePipelineState
rem stop logging and write Merged.etl
call log.cmd
- Open the merged etl log file with the Windows Performance Analyzer.
- Add CPU Usage (Precise) and CPU Usage (Sampled) Flame by Process, Stack graphs.
- Find all D3D12.dll!CDevice::CreatePipelineState within the Flame by Process, Stack.
This find command highlights the samples of interest in the CPU Usage (Precise) graph.
Verify Parallel DX12 Command List Generation
- Install the Windows SDK Windows Performance Toolkit.
- Add the GPUView folder to the PATH.
rem run as administrator
rem add "C:\Program Files (x86)\Windows Kits\10\Windows Performance Toolkit\gpuview" to path
setlocal enableextensions
cd /d "%~dp0"
rem switch active foreground window back to the game application
timeout.exe /t 5
call log.cmd
rem collect samples while game is playing and rendering frames. 1 second should be more than enough data.
timeout.exe /t 1
call log.cmd
- Add GPU Utilization, CPU Usage (Precise), and Generic Events graphs.
- Zoom into a single frame between two Present markers.
- In the Generic Events graph, move the CPU Column next to the Task name then filter and expand Command List.
Debugging
WinDbg
WinDbg may be used for setting breakpoints, logging, skipping functions, editing memory, or editing registers.
- For any function, the first four integer or pointer arguments are in RCX, RDX, R8, and R9. Arguments five and higher are passed on the stack.
- Note Steam games often require a steam_appid.txt file or SteamAppId system environment variable to launch an executable from WinDbg.
- Verify DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE was used:
- DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE (2) is recommended for optimal performance on hybrid graphics systems.
- These WinDbg commands may help:
bp dxgi!CDXGIFactory::EnumAdapterByGpuPreference ".printf \"FOUND DXGIFactory::EnumAdapterByGpuPreference DXGI_GPU_PREFERENCE=%x\\n\",@r8"
- Verify GetLogicalProcessorInformation(Ex) calls with non-zero input buffer lengths return success:
- Some applications incorrectly assume the buffer size and may crash, especially on systems with many logical processors.
- Test if the first call has input buffer length 0 to get the buffer length to malloc.
- Test that all calls with non-zero input buffer lengths return success (return 1).
- These WinDbg commands may help:
bp kernelbase!GetLogicalProcessorInformation "bp /1 @$ra \".printf \\\"GetLogicalProcessorInformation returned %i\\\", @rax; .echo; g\"; .printf \"GetLogicalProcessorInformation input buffer length 0x%x\", poi(@rdx); .echo; g"
bp kernelbase!GetLogicalProcessorInformationEx "bp /1 @$ra \".printf \\\"GetLogicalProcessorInformationEx returned %i\\\", @rax; .echo; g\"; .printf \"GetLogicalProcessorInformationEx input buffer length 0x%x\", poi(@r8); .echo; g"
Integrated Graphics
Test for Integrated Graphics
The DirectX APIs refer to Accelerated Processing Units (APUs) or Integrated Graphics parts via the term Unified Memory Architecture (UMA).
DirectX 12
bool isUMA(ID3D12Device* pDevice)
{
bool result = false;
D3D12_FEATURE_DATA_ARCHITECTURE data = {};
if (S_OK == pDevice->CheckFeatureSupport(
D3D12_FEATURE_ARCHITECTURE,
&data,
sizeof(data)))
{
result = data.UMA;
}
return result;
}
//
// Copyright (c) 2021 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
// THE SOFTWARE.
//
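#include <stdio.h>    // printf
#include <dxgi1_4.h>  // IDXGIFactory4, IDXGIAdapter3
#include <d3d12.h>    // ID3D12Device, D3D12CreateDevice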
#pragma comment( lib, "dxgi" )
#pragma comment( lib, "d3d12" )
bool isUMA(ID3D12Device* pDevice)
{
bool result = false;
D3D12_FEATURE_DATA_ARCHITECTURE data = {};
if (S_OK == pDevice->CheckFeatureSupport(
D3D12_FEATURE_ARCHITECTURE,
&data,
sizeof(data)))
{
result = data.UMA;
}
return result;
}
int main()
{
ID3D12Device* pDevice = nullptr;
if (SUCCEEDED(D3D12CreateDevice(
NULL,
D3D_FEATURE_LEVEL_11_0,
__uuidof(ID3D12Device),
(void**)&pDevice)))
{
IDXGIFactory* pFactory;
IDXGIFactory4* pFactory4;
if (SUCCEEDED(CreateDXGIFactory(__uuidof(IDXGIFactory), (void**)(&pFactory)))
&& SUCCEEDED(pFactory->QueryInterface(__uuidof(IDXGIFactory4), (void**)&pFactory4)))
{
LUID luid = pDevice->GetAdapterLuid();
IDXGIAdapter* pAdapter;
DXGI_ADAPTER_DESC desc;
if (SUCCEEDED(pFactory4->EnumAdapterByLuid(luid, __uuidof(IDXGIAdapter), (void**)&pAdapter))
&& SUCCEEDED(pAdapter->GetDesc(&desc)))
{
printf("DedicatedVideoMemory %I64u\n", desc.DedicatedVideoMemory);
printf("DedicatedSystemMemory %I64u\n", desc.DedicatedSystemMemory);
printf("SharedSystemMemory %I64u\n", desc.SharedSystemMemory);
printf("isUMA %i\n", isUMA(pDevice));
SIZE_T budget = desc.DedicatedVideoMemory;
if (isUMA(pDevice))
{
budget += desc.DedicatedSystemMemory + desc.SharedSystemMemory;
}
IDXGIAdapter3* pAdapter3 = nullptr;
DXGI_QUERY_VIDEO_MEMORY_INFO info = {};
if (SUCCEEDED(pAdapter->QueryInterface(__uuidof(IDXGIAdapter3), (void**)&pAdapter3))
&& SUCCEEDED(pAdapter3->QueryVideoMemoryInfo(0, DXGI_MEMORY_SEGMENT_GROUP_LOCAL, &info)))
{
budget = info.Budget;
}
printf("budget %I64u\n", budget);
}
}
}
}
DirectX 11.3
bool isUMA(ID3D11Device* pDevice)
{
bool result = false;
ID3D11Device3* pD3D11Device3 = nullptr;
if (S_OK == pDevice->QueryInterface(IID_PPV_ARGS(&pD3D11Device3)) && pD3D11Device3)
{
D3D11_FEATURE_DATA_D3D11_OPTIONS2 data = {};
if (S_OK == pD3D11Device3->CheckFeatureSupport(
D3D11_FEATURE_D3D11_OPTIONS2,
&data,
sizeof(data)))
{
result = data.UnifiedMemoryArchitecture;
}
pD3D11Device3->Release();
}
return result;
}
//
// Copyright (c) 2021 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
// THE SOFTWARE.
//
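#include <stdio.h>    // printf
#include <dxgi1_4.h>  // IDXGIAdapter3
#include <d3d11_3.h>  // ID3D11Device3, D3D11_FEATURE_DATA_D3D11_OPTIONS2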
#pragma comment( lib, "dxgi" )
#pragma comment( lib, "d3d11" )
bool isUMA(ID3D11Device* pDevice)
{
bool result = false;
ID3D11Device3* pD3D11Device3 = nullptr;
if (S_OK == pDevice->QueryInterface(IID_PPV_ARGS(&pD3D11Device3)) && pD3D11Device3)
{
D3D11_FEATURE_DATA_D3D11_OPTIONS2 data = {};
if (S_OK == pD3D11Device3->CheckFeatureSupport(
D3D11_FEATURE_D3D11_OPTIONS2,
&data,
sizeof(data)))
{
result = data.UnifiedMemoryArchitecture;
}
pD3D11Device3->Release();
}
return result;
}
int main()
{
UINT flags = 0; // optionally D3D11_CREATE_DEVICE_SINGLETHREADED
D3D_FEATURE_LEVEL featureLevels[] = { D3D_FEATURE_LEVEL_11_0 };
UINT numFeatureLevels = ARRAYSIZE(featureLevels);
D3D_FEATURE_LEVEL featureLevel;
ID3D11Device* pDevice = nullptr;
ID3D11DeviceContext* pImmediateContext = nullptr;
if (SUCCEEDED(D3D11CreateDevice(
NULL,
D3D_DRIVER_TYPE_HARDWARE,
NULL,
flags,
featureLevels,
numFeatureLevels,
D3D11_SDK_VERSION,
&pDevice,
&featureLevel,
&pImmediateContext)))
{
IDXGIDevice* pDXGIDevice = nullptr;
IDXGIAdapter* pAdapter = nullptr;
DXGI_ADAPTER_DESC desc;
if (SUCCEEDED(pDevice->QueryInterface(__uuidof(IDXGIDevice), (void**)&pDXGIDevice))
&& SUCCEEDED(pDXGIDevice->GetAdapter(&pAdapter))
&& SUCCEEDED(pAdapter->GetDesc(&desc)))
{
printf("DedicatedVideoMemory %I64u\n", desc.DedicatedVideoMemory);
printf("DedicatedSystemMemory %I64u\n", desc.DedicatedSystemMemory);
printf("SharedSystemMemory %I64u\n", desc.SharedSystemMemory);
printf("isUMA %i\n", isUMA(pDevice));
SIZE_T budget = desc.DedicatedVideoMemory;
if (isUMA(pDevice))
{
budget += desc.DedicatedSystemMemory + desc.SharedSystemMemory;
}
IDXGIAdapter3* pAdapter3 = nullptr;
DXGI_QUERY_VIDEO_MEMORY_INFO info = {};
if (SUCCEEDED(pAdapter->QueryInterface(__uuidof(IDXGIAdapter3), (void**)&pAdapter3))
&& SUCCEEDED(pAdapter3->QueryVideoMemoryInfo(0, DXGI_MEMORY_SEGMENT_GROUP_LOCAL, &info)))
{
budget = info.Budget;
}
printf("budget %I64u\n", budget);
}
}
}
Calculate VRAM Budget appropriately for Integrated Graphics
Integrated graphics parts which share their video memory with the CPU require special considerations when detecting VRAM budgets.
DirectX
Preferred method:
IDXGIAdapter3* pAdapter3 = nullptr;
DXGI_QUERY_VIDEO_MEMORY_INFO info = {};
if (SUCCEEDED(pAdapter->QueryInterface(__uuidof(IDXGIAdapter3), (void**)&pAdapter3))
&& SUCCEEDED(pAdapter3->QueryVideoMemoryInfo(0, DXGI_MEMORY_SEGMENT_GROUP_LOCAL, &info)))
{
budget = info.Budget;
}
Alternative method:
DXGI_ADAPTER_DESC desc;
if (SUCCEEDED(pAdapter->GetDesc(&desc)))
{
SIZE_T budget = desc.DedicatedVideoMemory;
if (isUMA(pDevice))
{
budget += desc.DedicatedSystemMemory + desc.SharedSystemMemory;
}
}
- DedicatedVideoMemory: This represents the actual local memory on discrete GPUs and the dedicated carve-out system memory on integrated GPUs.
- DedicatedSystemMemory: This value is always zero on AMD GPUs.
- SharedSystemMemory: This is determined by the GPU KMD and may return up to half of system memory.
- UMA: Unified Memory Architecture used in integrated GPUs.
- DedicatedVideoMemorySize alone may be insufficient to run some gaming applications on systems with integrated graphics (UMA).
- For systems with integrated graphics (UMA), developers should query SharedSystemMemorySize then rely on the GPU KMD and the vidMm to assign system memory optimally.
- Use DX12 (or DX11.3) CheckFeatureSupport to query UMA.
- See https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/d3dkmthk/ns-d3dkmthk-_d3dkmt_segmentsizeinfo
- See https://docs.microsoft.com/en-us/windows-hardware/drivers/display/calculating-graphics-memory
Optimize Scalability for Integrated Graphics
Sometimes feature scaling may be required in order to achieve acceptable framerates on thermally limited platforms.
- Straightforward changes to try for scaling include:
- Use DXGI_FORMAT_R11G11B10_FLOAT rather than DXGI_FORMAT_R16G16B16A16_FLOAT.
- Reduce shadow map quality.
- Reduce volumetric fog quality.
- Disable Ambient Occlusion.
- The following related Unreal Engine CVars may be helpful:
- r.SceneColorFormat
- r.AmbientOcclusionLevels
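To see why the scene color format matters on bandwidth-limited integrated graphics, compare the footprint of one render target in each format. The helper below is a hypothetical sketch: it assumes an uncompressed target with no MSAA and no row padding, and the constant names are illustrative. DXGI_FORMAT_R11G11B10_FLOAT is 32 bits per pixel and DXGI_FORMAT_R16G16B16A16_FLOAT is 64 bits per pixel.

```cpp
#include <cassert>
#include <cstdint>

// Bits per pixel of the two scene color formats discussed above.
constexpr std::uint32_t kBitsR11G11B10Float = 32;     // DXGI_FORMAT_R11G11B10_FLOAT
constexpr std::uint32_t kBitsR16G16B16A16Float = 64;  // DXGI_FORMAT_R16G16B16A16_FLOAT

// Size in bytes of one uncompressed render target (no MSAA, no padding).
std::uint64_t renderTargetBytes(std::uint32_t width,
                                std::uint32_t height,
                                std::uint32_t bitsPerPixel)
{
    assert(bitsPerPixel % 8 == 0);
    return static_cast<std::uint64_t>(width) * height * (bitsPerPixel / 8);
}
```

At 1920x1080 this gives 8,294,400 bytes for the 32-bit format versus 16,588,800 bytes for the 64-bit format, so each full-screen read or write of scene color moves half as much data.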
Hybrid Graphics
Select the optimal GPU for Hybrid Graphics
Additional considerations may be necessary to ensure the expected GPU is utilized in hybrid graphics platforms.
- Windows 10 v1803 added IDXGIFactory6::EnumAdapterByGpuPreference.
- Use DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE for game applications.
- WinDbg may be used to test if DXGI_GPU_PREFERENCE=2 (DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE).
bp dxgi!CDXGIFactory::EnumAdapterByGpuPreference ".printf \"FOUND DXGIFactory::EnumAdapterByGpuPreference DXGI_GPU_PREFERENCE=%x\\n\",@r8"
- The user may change preferences per application in Graphics settings.
- Example from Dell G5 15 Special Edition (5505)
Memory
Optimize memcpy/memset
- Update the compiler for the latest memcpy, memset, and other C runtime optimizations.
- Aligning memcpy source and destination to a 4096 byte page boundary may reduce Zen 2 store to load forwarding events (See STLIOther in AMD µProf).
- Aligning data to a 4096 byte page boundary may benefit probe filtering on AMD Threadripper™ and EPYC™ processors.
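One portable way to obtain page-aligned buffers is the C++17 aligned operator new. The sketch below is an illustration under that assumption; the helper names (allocatePageAligned, isPageAligned, PageAlignedBuffer) are hypothetical and not part of any AMD or Microsoft API. Whether the alignment helps a given copy should be confirmed by profiling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <new>

constexpr std::size_t kPageSize = 4096;

// Deleter matching the aligned operator new used below.
struct PageAlignedDeleter
{
    void operator()(std::byte* p) const noexcept
    {
        ::operator delete(p, std::align_val_t{kPageSize});
    }
};

using PageAlignedBuffer = std::unique_ptr<std::byte, PageAlignedDeleter>;

// Allocates `bytes` of storage whose start address is a multiple of 4096.
PageAlignedBuffer allocatePageAligned(std::size_t bytes)
{
    assert(bytes > 0);
    void* p = ::operator new(bytes, std::align_val_t{kPageSize});
    return PageAlignedBuffer(static_cast<std::byte*>(p));
}

// True if the pointer sits on a 4096-byte boundary.
bool isPageAligned(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % kPageSize == 0;
}
```

With both the source and destination allocated this way, a memcpy between them starts on a page boundary on both sides.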
Avoid false sharing
- Using alignas with the native cache line size (64 bytes) may reduce false sharing.
- Use aligned memory allocators such as _aligned_malloc or C++17 aligned new.
- Prefer thread local storage and local variables over process shared data.
- Try using per thread range indices such that thread ranges avoid sharing the same 64 byte cache line or 4096 byte page.
- Try copying data rather than using process shared data.
- Padding or reordering a struct may reduce false sharing in some cases where variables which share the same cache line are used by more than one thread.
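The alignas advice can be sketched with per-thread counters that each occupy their own cache line. This is a minimal illustration: PaddedCounter and countInParallel are hypothetical names, the 64 byte line size is the value cited above, and storing an over-aligned type in std::vector relies on C++17 aligned allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLineSize = 64; // native cache line size cited above

// Each counter occupies its own cache line, so threads updating different
// counters do not invalidate each other's lines.
struct alignas(kCacheLineSize) PaddedCounter
{
    std::uint64_t value = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLineSize, "one counter per cache line");
static_assert(alignof(PaddedCounter) == kCacheLineSize, "counters start on a line boundary");

// Runs `threadCount` threads, each incrementing only its own padded counter,
// then returns the sum of all counters.
std::uint64_t countInParallel(std::size_t threadCount, std::uint64_t incrementsPerThread)
{
    assert(threadCount > 0);
    std::vector<PaddedCounter> counters(threadCount);
    std::vector<std::thread> threads;
    threads.reserve(threadCount);
    for (std::size_t t = 0; t < threadCount; ++t)
    {
        threads.emplace_back([&counters, t, incrementsPerThread]
        {
            for (std::uint64_t i = 0; i < incrementsPerThread; ++i)
            {
                ++counters[t].value;
            }
        });
    }
    for (std::thread& th : threads)
    {
        th.join();
    }

    std::uint64_t total = 0;
    for (const PaddedCounter& c : counters)
    {
        total += c.value;
    }
    return total;
}
```

Without the alignas, eight adjacent 8 byte counters would share one 64 byte line and every increment could contend with the other threads.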
Prefer data access patterns matching hardware prefetcher behaviors
- Streaming
- Uses history of memory access patterns to fetch additional sequential lines in ascending or descending order.
- Stride
- Uses memory access history of individual instructions to fetch additional lines when each access is a constant distance from the previous one.
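The two patterns can be illustrated with a row-major matrix. The sketch below uses hypothetical function names: the first loop walks memory in ascending order (the streaming case), and the second reads one column, so each access is a constant number of bytes from the previous one (the stride case).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums a row-major matrix by walking memory in ascending address order.
// Consecutive accesses touch sequential cache lines (streaming pattern).
long long sumRowMajor(const std::vector<int>& m, std::size_t rows, std::size_t cols)
{
    assert(m.size() == rows * cols);
    long long sum = 0;
    for (std::size_t r = 0; r < rows; ++r)
    {
        for (std::size_t c = 0; c < cols; ++c)
        {
            sum += m[r * cols + c];
        }
    }
    return sum;
}

// Sums one column of a row-major matrix. Each access is a constant
// cols * sizeof(int) bytes from the previous one (stride pattern).
long long sumColumn(const std::vector<int>& m, std::size_t rows, std::size_t cols, std::size_t column)
{
    assert(m.size() == rows * cols && column < cols);
    long long sum = 0;
    for (std::size_t r = 0; r < rows; ++r)
    {
        sum += m[r * cols + column];
    }
    return sum;
}
```

Access patterns that follow neither shape, such as chasing pointers to scattered allocations, give the hardware prefetchers little to work with.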
Use Software Prefetch instructions for linked data structures experiencing cache misses
- Use Software Prefetch instructions on linked data structures, such as std::vector<T*>, experiencing cache misses.
- Tune prefetch distance to account for memory latency. In our experience, four iterations into the future is a good place to start tuning.
- Use NTA on use once data.
- While in dual-thread mode, beware that too many software prefetches from one thread may evict the working set of the other thread from their shared caches.
- Remove Ineffective Software Prefetches found by PMCx052.
- The AMD µProf Assess Performance (Extended) profile may help find Data Cache refills from DRAM.
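A minimal sketch of the prefetch advice for a std::vector of pointers follows. It assumes an x86 target and uses the _mm_prefetch intrinsic from xmmintrin.h; the Node type, the function name, and the starting distance of four iterations are illustrative, and the distance should be tuned with a profiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#include <xmmintrin.h> // _mm_prefetch, _MM_HINT_T0, _MM_HINT_NTA

struct Node
{
    int value;
};

// Starting point suggested above; tune per workload.
constexpr std::size_t kPrefetchDistance = 4;

// Sums values reached through a vector of pointers, prefetching the object
// that will be dereferenced kPrefetchDistance iterations from now.
long long sumWithPrefetch(const std::vector<Node*>& nodes)
{
    long long sum = 0;
    const std::size_t n = nodes.size();
    for (std::size_t i = 0; i < n; ++i)
    {
        if (i + kPrefetchDistance < n)
        {
            // Use _MM_HINT_NTA instead for data that is read only once.
            _mm_prefetch(reinterpret_cast<const char*>(nodes[i + kPrefetchDistance]), _MM_HINT_T0);
        }
        assert(nodes[i] != nullptr);
        sum += nodes[i]->value;
    }
    return sum;
}
```

Measure before and after: a prefetch that does not reduce cache misses only adds instructions and should be removed, as noted above.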
Synchronization
Use Modern Sync APIs
Modern sync APIs include std::mutex, std::shared_mutex, SRWLock, and EnterCriticalSection.
- These may be faster than and consume less power than WaitForSingleObject or user spin locks.
- Some modern sync APIs leverage AMD’s mwaitx instruction efficiently to wait on an address or timeout.
- Legacy sync APIs may have unneeded Syscall overhead.
- User spin locks may consume OS thread scheduling resources unnecessarily since the OS scheduler may be unable to determine if it should yield to another program thread rather than spin.
- It is generally recommended to issue sleep/wait instructions rather than spin locks.
- Even when waiting on the GPU, calls like SetEventOnCompletion() may be as efficient as the old fence polling model while avoiding starving other threads or unnecessarily consuming power.
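As an illustration of preferring a standard lock over a hand-written spin lock, the sketch below guards a counter with std::shared_mutex. The SharedCounter class and the driver function are hypothetical examples, not code from an engine; how the lock waits internally depends on the standard library and OS.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// A counter guarded by std::shared_mutex: many readers may hold the lock at
// once, while a writer gets exclusive access. Waiting is left to the lock
// implementation rather than a user-written spin loop.
class SharedCounter
{
public:
    void increment()
    {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        ++value_;
    }

    long long read() const
    {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return value_;
    }

private:
    mutable std::shared_mutex mutex_;
    long long value_ = 0;
};

// Hypothetical driver: several threads increment the same counter.
long long incrementFromThreads(int threadCount, int incrementsPerThread)
{
    assert(threadCount > 0 && incrementsPerThread >= 0);
    SharedCounter counter;
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; ++t)
    {
        threads.emplace_back([&counter, incrementsPerThread]
        {
            for (int i = 0; i < incrementsPerThread; ++i)
            {
                counter.increment();
            }
        });
    }
    for (std::thread& th : threads)
    {
        th.join();
    }
    return counter.read();
}
```

If the data is written as often as it is read, a plain std::mutex is usually the simpler choice; profile both before deciding.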
Test application scalability from 1 to %NUMBER_OF_PROCESSORS%
This advice is specific to AMD processors and is not general guidance for all processor vendors.
Generally, applications show SMT benefits and use of all logical processors is recommended. However, games often suffer from SMT contention on the main or render threads during gameplay.
- One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count.
- Avoid setting thread pool size as a constant.
- Profile your application/game to determine the ideal thread count.
- Game initialization, including decompressing assets and compiling/warming shaders, may benefit from logical processors using SMT dual-thread mode.
- Game play may prefer physical core count using SMT single-thread mode.
- We recommend creating developer options to:
- Set Max Thread Pool Size.
- Force Thread Pool Size.
- Force SMT.
- Force Single NUMA Node (implicitly Group).
- Profile against multiple CPUs. There is no hard and fast rule here.
- The best thread count heuristic may vary between low and high core count CPUs.
- While a 12 core CPU may benefit from an idle thread in your game to handle interrupts from the Operating System and 3rd party apps, a 6 core may require the availability of every compute resource.
- Developers may tune the low cores threshold for optimal performance on different core count CPUs.
- AMD µProf may be used to show the actual thread concurrency histogram for a process.
- See:
CPU Core Counts
This sample code correctly detects the physical and logical cores of today’s modern processors, along with the processor vendor and family.
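The recommendations above can be combined into a small, tunable heuristic. The sketch below is one possible shape, not AMD's heuristic: every name is hypothetical, the default lowCoreThreshold of 8 is an assumed placeholder to be tuned by profiling, and the physical and logical counts are expected to come from a core-count query such as the sample referenced above.

```cpp
#include <algorithm>
#include <cassert>

enum class WorkloadPhase { Initialization, Gameplay };

// Hypothetical developer options mirroring the ones recommended above.
struct ThreadPoolOptions
{
    unsigned maxThreadPoolSize = 0;   // 0 means no cap
    unsigned forceThreadPoolSize = 0; // 0 means not forced
    unsigned lowCoreThreshold = 8;    // assumed tuning value, not from the guide
};

unsigned chooseWorkerCount(unsigned physicalCores,
                           unsigned logicalProcessors,
                           WorkloadPhase phase,
                           const ThreadPoolOptions& options)
{
    assert(physicalCores >= 1 && logicalProcessors >= physicalCores);
    if (options.forceThreadPoolSize != 0)
    {
        return options.forceThreadPoolSize;
    }

    unsigned count = 0;
    if (phase == WorkloadPhase::Initialization)
    {
        // Asset decompression and shader compilation may benefit from SMT.
        count = logicalProcessors;
    }
    else
    {
        // Gameplay: one worker per physical core. Above the threshold, leave
        // one core free for the OS and third-party applications.
        count = physicalCores;
        if (physicalCores > options.lowCoreThreshold)
        {
            count -= 1;
        }
    }

    if (options.maxThreadPoolSize != 0)
    {
        count = std::min(count, options.maxThreadPoolSize);
    }
    return std::max(count, 1u);
}
```

With the default options, a 6 core, 12 thread CPU gets 6 gameplay workers and a 12 core, 24 thread CPU gets 11; exposing the options at runtime makes it easy to profile alternatives.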
Now watch the presentations!
AMD Ryzen Processor Software Optimization (Let’s Build 2020) – YouTube link
Join AMD Game Engineering team members for an introduction to the AMD Ryzen family of processors followed by advanced optimization topics.
Learn about the high-performance AMD Zen 2 microarchitecture and profiling tools. Gain insight into code optimization opportunities and lessons learned.
Microsoft® Game Stack Live: AMD Ryzen Processor Software Optimization
Join AMD on an adventure thru Zen 2 and Zen 3 processors which power today’s game consoles and PCs. Dive into instruction sets, cache hierarchies, resource sharing, and simultaneous multi-threading. Journey across the sands of silicon to master microarchitecture and uncover best practices!