NVIDIA Gaming Magic: maximum control, higher quality, more FPS
The resource demands of modern games are growing so rapidly that the first thing to check with each new AAA release is the recommended system requirements. As graphics become more realistic, with extensive use of ray tracing and high-quality physics models, gaming PCs require ever more performance. Fortunately, almost all of these "heavyweights" support NVIDIA graphics technologies that can radically improve the situation. With the arrival of GeForce RTX 50 series graphics cards based on the Blackwell architecture, the room for optimization has grown even further. First and foremost, this means DLSS 4, Multi Frame Generation, and the NVIDIA Reflex 2 latency-reduction technology. Let's take a closer look at how these technologies work and check their effectiveness in practice.
Read more about Blackwell architecture and AI computing capabilities in a separate article.
DLSS 4 with "transformers"
NVIDIA has been developing DLSS (Deep Learning Super Sampling), its neural-network-based smart upscaling technology, for over six years. It significantly improves gaming performance and can even improve image quality.
With the announcement of the GeForce RTX 50 graphics cards, developers also took a significant step in the development of DLSS, moving from traditional convolutional neural networks (CNN) to more advanced Transformer models in its latest iteration – DLSS 4.
Although Convolutional Neural Networks (CNNs) are quite efficient, they only analyze neighboring pixels in an image. This limited perception can lead to defects such as afterimages of moving objects, blurring, and flickering, especially in dynamic scenes or with complex geometry.
Unlike CNNs, transformers can analyze each pixel in a frame and assess its relative importance across the entire frame and even across multiple consecutive frames. The transformer "understands" the context of the image as a whole, not just its local parts. This results in fewer artifacts, especially when the camera or objects are moving quickly. Moving objects look much sharper.
The new DLSS 4 models use twice as many parameters as their CNN predecessors, allowing them to have a deeper understanding of scenes. Although using transformers requires significantly more computation, the improvement in the quality of the final image justifies the expense.
The transformer model also typically requires significantly more graphics card memory, which leaves room for optimization. The developers recently released SDK 310.3.0 (Software Development Kit), which reduces the memory consumption of the transformer model by about 20%.
The introduction of the transformer model brings a new level of image-detail preservation to DLSS Super Resolution upscaling.
Ray Reconstruction technology has also been improved with the new model. Transformers provide more accurate noise removal and better reproduction of light, shadows, and reflections because they more effectively capture long-range frame-by-frame dependencies.
Transformer models will also be used to implement DLAA (Deep Learning Anti-Aliasing), which is actually a DLSS Super Resolution mode that works with a 1x scaling factor. That is, DLAA does not scale the image up, but simply applies AI to achieve the highest quality smoothing at native resolution.
Because the transformer model improves the quality of the underlying image-reconstruction process, DLAA reduces "stair-stepping" and flickering along object edges and preserves fine texture details that could be slightly blurred by previous CNN-based versions of DLAA. Transformers handle motion and global context better, which minimizes the artifacts that could occur previously.
So with the transition to the transformer model, we can count on improved DLAA quality, delivering an exceptionally clean and realistic image without upscaling, provided, of course, that the graphics card already delivers sufficient FPS at native resolution. Otherwise, DLSS upscaling remains the more relevant option.
The DLSS transformer model, introduced at the beginning of this year, remained in beta status until recently. The technology is now mature enough for mass adoption, so we can expect a gradual move away from CNNs toward broader use of the transformer model for higher-quality DLSS.
Multi Frame Generation
A fundamental feature of solutions based on the NVIDIA Blackwell architecture and one of the main options of DLSS 4 was support for Multi Frame Generation (MFG) technology. If the previous version of DLSS 3 with Frame Generation could generate one additional frame between those rendered by the GPU, then MFG is capable of generating up to three AI frames for each traditionally rendered one.
The first generation of Frame Generation support appeared in GeForce RTX 40 graphics cards, which included a dedicated hardware unit, the Optical Flow Accelerator (OFA), used to quickly compute optical flow (pixel motion vectors) between two frames.
In Blackwell, NVIDIA has abandoned the separate OFA, completely transferring this function to high-performance AI models running on tensor cores. This approach to calculating optical flow is much more flexible and accurate, which is critical for predicting motion when generating multiple intermediate frames at once. The productive tensor cores of the 5th generation Blackwell with support for the FP4 format are able to efficiently perform the complex calculations required to quickly and accurately generate three additional frames at once. This allows the AI model to deeply analyze the scene, motion and lighting, creating "additional" frames.
With DLSS in Performance mode combined with Multi Frame Generation (4x mode), 15 out of every 16 displayed pixels are generated by AI. This brings us to the impact and overall importance of the tensor-core-based AI accelerator.
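The arithmetic behind the "15 of 16 pixels" figure is easy to verify. A quick sketch (the ratios follow from how the modes are described: Performance mode renders at half resolution per axis, and MFG 4x displays four frames per rendered one):

```python
# DLSS Performance mode renders at 1/2 resolution per axis, i.e. 1/4 of the pixels.
# MFG 4x mode displays 4 frames for every traditionally rendered one.

def traditionally_rendered_fraction(upscale_per_axis: int, frames_per_rendered: int) -> float:
    """Fraction of displayed pixels that come from classic rasterization."""
    rendered_pixel_fraction = 1 / (upscale_per_axis ** 2)  # 1/4 in Performance mode
    return rendered_pixel_fraction / frames_per_rendered   # only 1 of N frames is rendered

frac = traditionally_rendered_fraction(upscale_per_axis=2, frames_per_rendered=4)
print(frac)        # 0.0625 -> 1 pixel in 16
print(1 - frac)    # 0.9375 -> 15 of 16 pixels are AI-generated
```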
The actual performance increase in games is very significant: with DLSS 4 and MFG, frame rates can rise up to eight times over the baseline, especially in the most resource-intensive modes with high-quality ray tracing or even more demanding path tracing.
When Multi Frame Generation produces this many additional frames, timing their delivery to the monitor becomes extremely important. Blackwell therefore received a significantly upgraded display engine with a new hardware feature called High Speed HW Flip Metering. This module lets the GPU control "flips" (frame-buffer switches) and display timings extremely precisely, ensuring that all generated frames (up to three for each actually rendered frame) are inserted into the stream smoothly, without micro-stutters, while maintaining low latency.
Previously, in DLSS 3 (based on the Ada Lovelace architecture), the process of synchronizing and scheduling the display of generated frames relied heavily on the central processing unit (CPU) and software mechanisms. However, even when the overall frame rate was high, the moments when frames were displayed on the screen could be uneven, which was visually perceived as "micro-stuttering" or a loss of smoothness. In addition, this took up some CPU resources, which could limit performance in games that were heavily dependent on the CPU.
With the advent of Multi Frame Generation in DLSS 4, which generates up to three additional AI frames for each traditionally rendered one, the problem of accurate and smooth frame display becomes even more critical. Managing such a large stream of "artificial" frames with software becomes extremely difficult. That is why NVIDIA developed High Speed HW Flip Metering and integrated it directly into the Display Engine module.
NVIDIA claims that Hardware Flip Metering reduces frame timing variability by up to 5x. This means that each frame is displayed on the screen at very consistent intervals, creating a smooth picture, even at very high frame rates.
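What "frame timing variability" means can be illustrated with a toy comparison. The millisecond intervals below are hypothetical, chosen only to show how two frame streams with the same average fps can differ in jitter:

```python
import statistics

# Hypothetical frame-to-frame display intervals (ms), both averaging ~8 ms (~125 fps).
software_paced = [4.0, 12.0, 5.0, 11.0, 4.0, 12.0]   # uneven delivery: "micro-stutter"
hardware_paced = [8.0, 8.1, 7.9, 8.0, 8.1, 7.9]      # evenly metered delivery

for name, intervals in [("software-paced", software_paced), ("hardware-paced", hardware_paced)]:
    jitter = statistics.stdev(intervals)  # standard deviation = pacing variability
    print(f"{name}: {jitter:.2f} ms jitter")
```

Both streams show the same frame rate on an fps counter, but the first one is perceived as stuttery; Hardware Flip Metering targets exactly this variability.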
So, in addition to the increased performance of the 5th generation tensor cores, the presence of the High Speed HW Flip Metering hardware unit is one of the most important reasons why DLSS 4 Multi Frame Generation is implemented exclusively on video cards based on GPUs with the Blackwell architecture. To calculate and organize the output of a large number of frames, most of which are generated using neural networks, additional resources and hardware solutions are required that are not implemented in video cards of previous generations.
NVIDIA Reflex 2
For dynamic games, the system's response time to user actions is a crucial parameter, especially in competitive titles, where fractions of a second can decide the outcome of a pivotal duel: defeat or victory.
At the platform level, this is called system latency – the time that elapses from the moment of input (for example, a mouse click) until the corresponding action is displayed on the monitor. This time consists of several stages:
- Peripheral Latency: The time from when you press a button on your mouse/keyboard to when the signal reaches your PC.
- Game Latency: The time it takes for the CPU to process input and prepare a new frame for the GPU.
- Render Latency: The time it takes for the GPU to render a frame.
- Display Latency: The time it takes for a monitor to process and display a frame.
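The stages above simply add up. A rough illustration (the millisecond values are hypothetical, chosen only to show the arithmetic; real values depend on hardware and settings):

```python
# Hypothetical per-stage latencies in milliseconds.
latency_ms = {
    "peripheral": 1.0,   # mouse/keyboard signal reaching the PC
    "game": 8.0,         # CPU processes input and prepares the frame
    "render": 12.0,      # GPU renders the frame
    "display": 5.0,      # monitor processes and shows the frame
}

total = sum(latency_ms.values())
print(f"End-to-end system latency: {total:.1f} ms")  # 26.0 ms
```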
In total, all stages can add up to tens of milliseconds. In practice, even such relatively small values prevent a feeling of complete control over the situation. In casual play a small lag and delayed reaction are usually not critical, but in competitive titles they can matter.
To reduce system latency, NVIDIA offers NVIDIA Reflex technology. This mechanism eliminates the rendering queue. Typically, the CPU can prepare frames faster than the GPU can render them. This results in the formation of a "render queue" - a buffer of frames waiting for the GPU to process. The larger this queue, the higher the latency. Reflex synchronizes the CPU and GPU, preventing the CPU from being too far ahead. It ensures that the CPU sends frames to the GPU "just-in-time", eliminating or significantly reducing this queue. In this case, system responsiveness can improve by up to 50%.
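The effect of the render queue can be sketched with a toy model (hypothetical numbers; Reflex itself operates inside the driver and game engine, this only illustrates why queue depth translates into latency):

```python
def render_latency_ms(gpu_frame_time_ms: float, queued_frames: int) -> float:
    """A newly submitted frame waits behind every frame already in the queue."""
    return gpu_frame_time_ms * (queued_frames + 1)

GPU_FRAME_TIME = 10.0  # ms per frame, i.e. the GPU runs at 100 fps

# Without Reflex: the CPU runs ahead and, say, 3 frames sit in the queue.
print(render_latency_ms(GPU_FRAME_TIME, queued_frames=3))  # 40.0 ms render-side latency

# With Reflex: the CPU submits "just in time" and the queue stays empty.
print(render_latency_ms(GPU_FRAME_TIME, queued_frames=0))  # 10.0 ms
```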
Alongside the Blackwell architecture, NVIDIA introduced the second generation of its latency-reduction technology: NVIDIA Reflex 2. In addition to eliminating the render queue, Reflex 2 adds the Frame Warp mechanism, which deforms nearly finished frames.
Reflex 2 allows the CPU to evaluate the latest data about mouse movement and camera position just before the rendered frame is sent to the display. Based on this latest data, the frame that is almost finished is "warped" or "deformed". This means that the pixels of the frame are shifted to reflect the most recent camera position or player mouse movement. For example, if you quickly rotate the mouse when the frame is almost finished, Frame Warp will "shift" the image to better match your latest movement.
Since "warping" can create empty areas at the edge of the frame (when it shifts) or around certain objects, Reflex 2 uses predictive rendering algorithms and Inpainting technology to fill in these empty spaces, making the process invisible to the eye.
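A toy sketch of the warp-plus-holes idea described above (purely illustrative: the real Frame Warp shifts full 2D frames using camera and motion data inside the driver; here a single row of pixels stands in for a frame):

```python
def warp_row(pixels: list, shift: int) -> list:
    """Shift a row of pixels; positions with no source data become holes (None)
    that an inpainting pass would then have to fill."""
    out = [None] * len(pixels)
    for i, p in enumerate(pixels):
        j = i + shift
        if 0 <= j < len(pixels):
            out[j] = p
    return out

row = ["a", "b", "c", "d", "e"]
warped = warp_row(row, 2)  # camera moved: image content shifts right by 2 pixels
print(warped)              # [None, None, 'a', 'b', 'c'] -> two holes at the edge
```

The `None` holes at the edge correspond to the empty areas that Reflex 2 fills with inpainting.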
As a result, NVIDIA Reflex 2 can reduce system latency by up to 75% in certain cases. The technology debuts on GeForce RTX 50 series graphics cards, with support for other GeForce RTX models to follow in a future update.
Hands-on experiments with GeForce RTX 5080
To consolidate the theory and explore the capabilities of graphics cards based on NVIDIA Blackwell GPUs in more depth, we experimented with the GeForce RTX 5080 16 GB, a near-flagship model that, despite its considerable cost, remains in the cohort of solutions for demanding gamers with realistic needs. For the practical part, we used the ASUS TUF Gaming GeForce RTX 5080 16GB OC version.
A video card of this class definitely deserves a separate review, but this time we will focus on the general capabilities of the GeForce RTX 5080. Here we will just briefly note that we are dealing with a 3.6-slot, three-fan "beauty" that received a factory overclocked GPU.
Instead of the reference 2295/2617 MHz, the GPU's base/boost clocks are 2295/2700 MHz. The GDDR7 memory chips operate at an effective 30,000 MHz; with a 256-bit bus, total memory bandwidth is 960 GB/s. Recall that the memory bandwidth of the predecessor GeForce RTX 4080 was 716 GB/s, and only the flagship RTX 4090 with its 384-bit bus offered ~1 TB/s.
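The 960 GB/s figure follows directly from the memory specifications; a quick sanity check:

```python
def memory_bandwidth_gbs(effective_mhz: float, bus_width_bits: int) -> float:
    """Peak bandwidth = effective transfer rate * bus width in bytes."""
    bytes_per_transfer = bus_width_bits / 8       # 256 bits -> 32 bytes
    return effective_mhz * 1e6 * bytes_per_transfer / 1e9

print(memory_bandwidth_gbs(30_000, 256))  # 960.0 GB/s for the RTX 5080
print(memory_bandwidth_gbs(22_400, 256))  # 716.8 GB/s for the RTX 4080
```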
Interestingly, in the RTX 50 line, it is the GeForce RTX 5080 that uses GDDR7 chips with the highest standard operating frequency – 1875 MHz (effective 30,000 MHz).
A massive cooler with a chunky heatsink, three fans, and a flow-through design should provide effective cooling, even for a graphics card with a TGP of 320 W.
To test the video card, we used a system unit with the following configuration:
- Graphics card: ASUS TUF Gaming GeForce RTX 5080 16GB GDDR7 OC Edition (TUF-RTX5080-O16G-GAMING)
- Processor: AMD Ryzen 7 9800X3D (8/16; 4.7/5.2 GHz; 96 MB L3)
- Cooling: ASUS ROG STRIX LC III 360 ARGB
- Motherboard: ASUS TUF GAMING X870-PLUS WIFI
- Memory: G.Skill Trident Z5 Neo RGB DDR5-6000 64GB (2x32GB) (F5-6000J2836G32GX2-TZ5NR)
- Drive: Crucial E100 1TB (CT1000E100SSD8)
- Power supply: ASUS ROG STRIX 1000W 80+ Gold (ROG-STRIX-1000G)
- Case: ASUS TUF Gaming GT502 Horizon ARGB
Note the use of the fastest gaming processor, the Ryzen 7 9800X3D with 3D V-Cache: its additional L3 cache radically affects frame rates in certain titles. Liquid cooling is the norm for such a CPU, and 64 GB of DDR5-6000 memory is becoming standard practice for platforms of this class.
The developer recommends equipping system units with GeForce RTX 5080 series video cards with power supplies with a capacity of 850+ W. So the "kilowatt" ASUS ROG STRIX 1000W 80+ Gold will definitely not be superfluous, especially if you want to experiment with overclocking components.
We started our practical experiments with the 3DMark test suite. In addition to the ASUS TUF Gaming GeForce RTX 5080 16GB OC results, the charts include previously obtained figures for a representative of the previous generation (Ada Lovelace), the GeForce RTX 4080 SUPER 16GB. And to give a clear picture of progress across several generations, the averaged performance of the GeForce RTX 3080 10GB, a representative of the Ampere generation, is shown alongside.
The classic 3DMark tests show the ASUS TUF Gaming GeForce RTX 5080 16GB OC ahead of its predecessor by 16–26%, while the gap to the GeForce RTX 3080 10GB is nearly double (87–97%). And these are graphics cards of the same class, separated by only five years of announcement dates.
When evaluating DLSS performance, especially with frame generation, one is once again struck by how significant the impact of AI mechanisms can be: they scale fps not by percentages but by multiples.
In real games at 4K with maximum settings and classic rendering without ray tracing, the GeForce RTX 5080's advantage over the RTX 4080 SUPER is about 17% on average. Unfortunately, we could not test the GeForce RTX 3080 10GB under identical conditions; according to data available online, the actual performance difference is ~80%. Moreover, in a number of titles 10 GB of memory would likely be insufficient for maximum quality settings.
If you want to test the maximum capabilities of graphics cards, it is worth using modes with high-quality ray tracing. However, as the charts show, with RT at 4K resolution even the GeForce RTX 5080 cannot do without additional "help". In Cyberpunk 2077 with RT Overdrive enabled, the frame counter shows an average of only 20 fps. So all hopes here rest on AI, specifically its practical implementation in the form of DLSS.
But DLSS alone is not enough here. Without frame generation you can count on results far better than the baseline (~50 fps), but the threshold of comfortable play is reached only when Frame Generation is activated. The Ada Lovelace model stops there, while the Blackwell-based GeForce RTX 5080 is only gaining momentum: with Multi Frame Generation activated we get an average of 166 fps. And this, let's recall, is 4K resolution in RT Overdrive mode.
In Alan Wake 2, the situation is generally similar. With an initial frame rate of 15–18 fps, frame generation was able to "accelerate" the performance to 50–60 fps, while MFG allowed the RTX 5080 to reach 113 fps.
We also experimented with DOOM: The Dark Ages, which recently officially received support for Path Tracing and DLSS 4, Ray Reconstruction, and of course Multi Frame Generation. The highest quality implementation of ray tracing requires corresponding performance.
In the basic configuration at 4K with maximum graphics quality and Path Tracing enabled, even the GeForce RTX 5080 averages 20 fps. Upscaling with the Performance profile brings us close to a relatively comfortable 60 fps. But for more headroom in dynamic scenes, frame generation is again indispensable: Frame Generation raises the average to 100 fps, and the 170 fps achieved with MFG is already a reason to choose DLSS modes with a lower upscaling factor.
By the way, before activating MFG, it is recommended to bring the base frame rate to 60 fps. This will allow you to get the best results.
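The 60 fps guideline is easy to reason about with a simplified model (generation overhead ignored): MFG multiplies the displayed frame rate, but input is still sampled at the base rate, so a low base rate stays sluggish no matter how high the output fps:

```python
def mfg_output(base_fps: float, frames_per_rendered: int) -> tuple:
    """Idealized MFG model: each rendered frame yields N displayed frames,
    while input responsiveness follows the rendered (base) frame rate."""
    displayed_fps = base_fps * frames_per_rendered
    input_interval_ms = 1000 / base_fps  # new input is reflected once per rendered frame
    return displayed_fps, input_interval_ms

print(mfg_output(60, 4))  # (240, ~16.7 ms)  -> smooth picture AND responsive input
print(mfg_output(20, 4))  # (80, 50.0 ms)    -> smooth-looking, but input lags behind
```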
The magic of turning 20 FPS into 170 FPS is a vivid example of the multidimensional progress of visual technologies. Actively involving AI opens up new horizons for optimization and for improving the experience across many areas. As Moore's Law in its classic interpretation loses relevance due to physical limits, further development requires new solutions. The architectural changes in Blackwell show how NVIDIA is transforming its overall approach to computing, shifting the focus to neural networks, which enables significant progress while bypassing the traditional limits of raw compute growth.
A wide selection of NVIDIA GeForce RTX 50 graphics cards is available in the Telemart.ua online store