Features
The Radeon RX 9000-series is the first GPU family built on RDNA4, a microarchitecture that incorporates multiple improvements over the prior generation. Key among these are substantial upgrades to hardware acceleration of both raytracing and machine learning/AI workloads, two areas where RDNA3 broadly trailed the competition in the consumer space. Each RDNA4 Compute Unit incorporates purpose-built hardware for these workloads, so performance scales directly with the number of CUs, much as it does for conventional rasterization workloads.
A 'full-fat' Navi48 GPU is equipped with 64 Compute Units and is present on the RX 9070 XT. Cut-down Navi48 GPUs with 56 CUs are present on RX 9070 GPUs. While there is scope for Navi48-based SKUs with fewer CUs in the future, larger RDNA4 GPUs with more CUs would need to be based on a new core.
Lithography and Topology
The RX 9070’s Navi48 core is manufactured by TSMC on their 4nm-class N4P process, a denser lithography than the combination of N5 and N6 utilised for the RX 7000-series GPUs. While not a significant revision to the process compared to prior steps down, it should allow the RX 9000-series GPUs to clock higher at lower operating voltages than previous generations and thus improve overall efficiency. Alternatively, it could allow AMD to push the core further, getting more performance than comparatively complex older GPUs at the highest stable TDP.
AMD have also returned to a monolithic core design for Navi48 rather than the separated chiplet style utilised by the desktop RX 7000-series GPUs. This approach significantly simplifies GPU packaging despite a relatively large die size, but would be hard to scale up cost-effectively. It is strongly indicative of a GPU generation with a ‘one-size-fits-all’ approach to the performance segment.
Memory
The RX 9070 and 9070 XT both utilise 16GB of GDDR6 VRAM, the same memory standard present on the RX 5000-series from 2019. This memory is clocked at 20Gb/s and communicates over a 256-bit bus for 640GB/s of total bandwidth. It’s notable that, unlike NVIDIA with the RTX 5070, AMD have reduced neither the frame buffer size nor the bus width when stepping down the half-tier from XT/Ti to non-XT/non-Ti.
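The bandwidth figure follows directly from the per-pin data rate and the bus width. As a minimal sketch of that arithmetic (the function name is purely illustrative):

```python
def memory_bandwidth_gb_s(data_rate_gbit_s: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s: per-pin rate (Gb/s) x bus width (bits) / 8 bits per byte."""
    return data_rate_gbit_s * bus_width_bits / 8


# 20Gb/s GDDR6 on a 256-bit bus, as quoted for the RX 9070 and 9070 XT
print(memory_bandwidth_gb_s(20, 256))  # 640.0
```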
AMD opted for GDDR6 over GDDR7 primarily due to cost and the targeted performance window of the 9070-series. The 16GB capacity matters more: given the issues AAA titles have had hitting frame buffer limits on 8GB-10GB GPUs in recent years, it could be the design choice that has the largest impact on long-term performance. Certainly, moving down to 12GB or even 8GB looks unwise for even a mid-tier part today.
More deeply, RDNA4 also incorporates additional out-of-order queues for memory requests and returns. Raytracing workloads are, in AMD’s judgement, highly sensitive to memory latency, particularly during traversal of bounding volume hierarchy (BVH) structures. Previously, a fast memory return requested by a simple shader could be delayed by the memory requests of slower shader operations ahead of it in the queue, leading to unnecessary latency. The new approach allows requests to be satisfied out of order, returning the faster request as soon as it is ready rather than making it wait in the queue.
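The effect of out-of-order returns can be illustrated with a toy model. This is only a sketch: it assumes one request is issued per cycle, and the latencies are made-up numbers rather than anything measured on RDNA4 hardware.

```python
def return_cycles(latencies, in_order):
    """Cycle at which each request's data is returned to its shader.

    Request i is issued on cycle i and its data is ready `latencies[i]` cycles later.
    With in-order returns, a request cannot return before the one ahead of it.
    """
    returned = []
    previous_return = 0
    for issue_cycle, latency in enumerate(latencies):
        ready = issue_cycle + latency
        if in_order:
            ready = max(ready, previous_return)
            previous_return = ready
        returned.append(ready)
    return returned


# Hypothetical latencies: one slow request followed by two fast ones
latencies = [400, 40, 40]
print(return_cycles(latencies, in_order=True))   # [400, 400, 400]
print(return_cycles(latencies, in_order=False))  # [400, 41, 42]
```

In the in-order case the two fast requests inherit the slow request's latency; out of order, each returns as soon as its own data is ready.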
Raytracing
RDNA4 marks the debut of AMD’s 3rd Generation raytracing accelerators, a technology upgrade which aims to double the throughput of this aspect of the rendering pipeline through a handful of key underlying improvements:
Enhanced Ray Accelerators with computation units rated at eight ray/box and two ray/triangle intersections, roughly doubling base computation capability. This is further augmented by acceleration of the Ray Hardware Stack Management system to allow faster and more efficient allocation of resources.
Adoption of a BVH8 Structure reduces traversal steps and latency while primitive node compression reduces BVH sizes.
Meanwhile, AMD's use of Oriented Bounding Boxes allows each bounding box in the BVH to be independently oriented rather than aligned identically, with rays transformed to match a box's rotation as they enter it. This reduces the rate of false intersections and hence traversal steps, improving traversal performance by an average of 10%.
Accelerated Shading through dynamic VGPR (vector general purpose registers) management leading to better register occupancy, while the previously mentioned out-of-order memory returns will in general improve latency.
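To see why oriented boxes reduce false intersections, consider a long, thin object lying diagonally. The 2D sketch below is a simplified illustration, not AMD's hardware algorithm: it uses a standard slab test, and the box dimensions and ray are invented for the example. The ray is rotated into the box's local frame before testing, mirroring the transform described above.

```python
import math


def ray_hits_box(origin, direction, half_extents):
    """Slab test against a box centred on the origin and aligned to the axes."""
    t_near, t_far = 0.0, math.inf
    for o, d, h in zip(origin, direction, half_extents):
        if abs(d) < 1e-12:          # ray parallel to this pair of slabs
            if abs(o) > h:
                return False
            continue
        t1, t2 = (-h - o) / d, (h - o) / d
        t_near = max(t_near, min(t1, t2))
        t_far = min(t_far, max(t1, t2))
        if t_near > t_far:
            return False
    return True


def to_box_space(vec, angle):
    """Rotate a 2D vector by -angle, i.e. into the oriented box's local frame."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * vec[0] + s * vec[1], -s * vec[0] + c * vec[1])


def ray_hits_obb(origin, direction, half_extents, angle):
    """Transform the ray to match the box's rotation, then run the same slab test."""
    return ray_hits_box(to_box_space(origin, angle),
                        to_box_space(direction, angle),
                        half_extents)


def enclosing_aabb(half_extents, angle):
    """Half-extents of the axis-aligned box that encloses the rotated box."""
    c, s = abs(math.cos(angle)), abs(math.sin(angle))
    hx, hy = half_extents
    return (c * hx + s * hy, s * hx + c * hy)


# Hypothetical thin box rotated 45 degrees, and a ray running parallel to it
half_extents, angle = (5.0, 0.1), math.radians(45)
origin, direction = (-10.0, -8.0), (1.0, 1.0)

aabb_hit = ray_hits_box(origin, direction, enclosing_aabb(half_extents, angle))
obb_hit = ray_hits_obb(origin, direction, half_extents, angle)
print(aabb_hit, obb_hit)  # True False
```

The axis-aligned box reports a hit that would trigger further traversal for nothing, while the oriented box correctly rejects the ray at the cost of one extra transform.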
The net result should be that RDNA4 hardware significantly outpaces RDNA3 in raytraced workloads on hardware with equivalent rasterization capabilities, potentially allowing the RX 9070 XT to rival higher GPU tiers of the previous generation. While overall raytracing throughput should roughly double, you’re unlikely to see quite that level of uplift in the render process as a whole, given that raytracing is only a portion of the work that goes into constructing a frame.
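This is Amdahl's law in action. As a rough sketch, assuming raytracing accounts for some fraction of frame time (the fractions below are hypothetical and will vary by game and settings) and that only that portion gets the 2x speed-up:

```python
def frame_speedup(rt_fraction, rt_speedup=2.0):
    """Whole-frame speed-up when only the raytracing share of frame time is accelerated."""
    return 1.0 / ((1.0 - rt_fraction) + rt_fraction / rt_speedup)


# Hypothetical shares of frame time spent on raytracing
for rt_fraction in (0.2, 0.4, 0.6):
    print(f"{rt_fraction:.0%} raytracing -> {frame_speedup(rt_fraction):.2f}x overall")
```

Even with 40% of frame time spent on raytracing, doubling raytracing throughput yields only around a 1.25x overall uplift.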
AI Acceleration
‘AI’ is something of a nebulous term at present; however, in the realm of computer graphics it now generally refers to the process of ‘training’ and ‘inferencing’, i.e. processing relevant data to train a model or neural network to perform a particular task, or processing data through the trained model/NN to do that task. It’s on this paradigm that proprietary ‘deep learning’ upscaling technologies such as NVIDIA DLSS are based, and this burgeoning field is swiftly becoming a key tool for keeping pace with the multifaceted demands of modern gaming workloads.
AI data structures, known as ‘tensors’, are commonly stored in the 16-bit formats also used in graphics rendering and GPGPU workloads; increasingly, however, models are optimised to use narrower formats such as 8-bit and even 4-bit types. Processing those narrower types natively, and so using hardware resources more efficiently, is key to accelerating their performance on a GPU.
In contrast to RDNA3, RDNA4 supports 8-bit floating point formats (FP8) alongside 16-bit and 4-bit data types, which in turn unlocks support for data encoded in both the E4M3 and E5M2 8-bit formats. These two formats are important in modern deep learning for both training and inferencing techniques, particularly in the processing of data batches with mixed data types.
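The names E4M3 and E5M2 describe how the eight bits are divided: one sign bit, then four or five exponent bits and three or two mantissa bits, trading precision for range. The decoder below is a sketch that assumes the conventions of the OCP 8-bit floating point specification (bias 7 and no infinities for E4M3; bias 15 with IEEE-style infinities and NaNs for E5M2). The article does not state which FP8 variant RDNA4 implements, so treat that as an assumption.

```python
import math

# (exponent bits, mantissa bits, bias, has IEEE-style inf/NaN encodings)
E4M3 = (4, 3, 7, False)
E5M2 = (5, 2, 15, True)


def decode_fp8(byte, fmt):
    """Decode one FP8 byte to a Python float under the assumed OCP conventions."""
    exp_bits, man_bits, bias, ieee_specials = fmt
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exponent = (byte >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1

    if exponent == exp_max:
        if ieee_specials:                      # E5M2: inf or NaN
            return sign * math.inf if mantissa == 0 else math.nan
        if mantissa == man_max:                # E4M3: single NaN pattern, no inf
            return math.nan
    if exponent == 0:                          # subnormal numbers (and zero)
        return sign * mantissa * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + mantissa / (1 << man_bits)) * 2.0 ** (exponent - bias)


print(decode_fp8(0x7E, E4M3))  # 448.0   (largest finite E4M3 value)
print(decode_fp8(0x7B, E5M2))  # 57344.0 (largest finite E5M2 value)
print(decode_fp8(0x01, E4M3))  # 0.001953125 (smallest positive E4M3 value)
```

E4M3 offers finer steps within a smaller range, while E5M2 covers a far wider range with coarser steps, which is why the two are often used for different stages of training and inferencing.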
A further concept that RDNA4 leverages is that of sparsity, i.e. skipping the work associated with zero-valued entries in a trained model’s data rather than processing every element. This allows more efficient use of hardware resources and reduces latency, which is exceptionally important to time-critical outputs such as generated frames.
FidelityFX Super Resolution 4
The RX 9000-series will be the first hardware to support AMD FSR4, a new upscaling process that leverages machine learning rather than algorithmic techniques. Similar in scope to NVIDIA’s DLSS 2.0, it utilises an AMD-trained ‘Model for Super Resolution’ to upscale low-resolution frames to near-native quality, boosting frame rates and improving experiences. AMD are touting FSR4 as a pathway to 4K gaming on RDNA4, particularly if also using Frame Interpolation (i.e. Frame Generation) akin to the feature widely advertised in NVIDIA’s RTX 50-series launch.
Frames upscaled by the model are generated from low-resolution frame data, including colour and depth, as well as motion vector data. The latter was critical in NVIDIA’s development of DLSS into a workable technology through its 2.0 revision, and should mean that FSR4 is a complete experience out of the box.
AMD's Model for Super Resolution makes extensive use of the FP8 capabilities of RDNA4. Because of this dependency the model is not suitable for RDNA3 hardware in its current form, hence FSR4's RX 9000-series exclusivity.
In narrow circumstances, use of AI upscaling can improve certain aspects of image quality, even over frames rendered at full resolution. Typically this will be because of post-processing anti-aliasing and pixel-binning not quite showing particularly jagged and spiky objects properly; as such, we’d probably describe it as an infrequent up-side rather than a core aspect of the technology.
FSR 4 can be utilised in the 30+ games that support FSR 3.1 at the time of writing, with 75+ expected by the end of 2025. As an extension of the FSR 3.1 API, it can be enabled with a simple driver-level toggle; however it requires support at the game-engine level just as FSR 3.1 does.
PCI-Express 5 Support
The RX 9000-series is AMD's first to support the PCIe 5.0 standard, doubling the per-lane data transfer rate from 16 to 32 GT/s. A 16-lane GPU will therefore have access to a total bandwidth of up to 63.015 GB/s in each direction, should it need it.
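The 63.015 GB/s figure accounts for the 128b/130b line encoding used by PCIe 3.0 and later, in which 128 of every 130 transferred bits carry payload. A minimal sketch of the calculation (the function name is illustrative):

```python
def pcie_bandwidth_gb_s(transfer_rate_gt_s, lanes, payload_bits=128, encoded_bits=130):
    """Usable one-way bandwidth in GB/s after line-encoding overhead."""
    return transfer_rate_gt_s * lanes * payload_bits / encoded_bits / 8


print(round(pcie_bandwidth_gb_s(32, 16), 3))  # 63.015 (PCIe 5.0 x16)
print(round(pcie_bandwidth_gb_s(16, 16), 3))  # 31.508 (PCIe 4.0 x16)
```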
Despite being an impressive technology, increased I/O bandwidth will struggle to be of use to current-generation GPUs. Ancillary technologies such as DirectStorage, which could itself leverage the fast read speeds of PCIe Gen5 NVMe SSDs, have the potential to make use of it in the future if and when videogames adopt them. Currently it's not a feature that will sway purchasing decisions.