Revised Cache Organisation and Policies
Intel’s current crop of CPUs operates with three distinct cache tiers – Level 1, Level 2 and Level 3; the lower the level, the faster the processor can access the data held in that cache. Size, latency and replacement policy all contribute significantly to CPU performance, although different workloads tend to be suited to different caching structures and algorithms.
Skylake-X increases the L2 cache size from 256KB to 1MB per core, a four-fold increase. Meanwhile, the L3 cache has changed significantly: not only has its size been reduced – down from 2MB per core to 1.375MB per core – but the rules governing the replacement of data within the cache have changed as well.
Previously Intel’s L3 cache was inclusive – data held in L2 is replicated in L3 in its entirety, so data pushed out of L2 will still be available from L3 for some time. Of course, this means that the L3 cache is usually large in comparison to L2 – in the case of Skylake, L3 is eight times the size of L2 (2MB vs 256KB per core). Retaining an inclusive L3 cache on Skylake-X would likely have greatly increased its size, complexity and the CPU’s overall die size.
Skylake-X transitions to a non-inclusive L3 cache alongside an L2 cache only slightly smaller in size. A non-inclusive cache is a type of victim cache; it differs from a strictly exclusive policy in that lines present in L2 are not automatically removed from L3. Instead, data modified in or evicted from L2 is moved into L3, preserving it in this older form until it is in turn ejected from L3.
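The victim-cache behaviour described above can be sketched in a few lines of code. This is a toy model under simplifying assumptions (fully associative caches, LRU replacement, whole lines tracked by address); the class and parameter names are illustrative, not Intel's design.

```python
from collections import OrderedDict

class VictimCacheSketch:
    """Toy model of a non-inclusive (victim) L3: lines enter L3 only
    when evicted from L2, rather than being duplicated on every fill."""

    def __init__(self, l2_lines, l3_lines):
        self.l2 = OrderedDict()  # insertion order doubles as LRU order
        self.l3 = OrderedDict()
        self.l2_lines, self.l3_lines = l2_lines, l3_lines

    def access(self, addr):
        if addr in self.l2:                 # L2 hit: refresh LRU position
            self.l2.move_to_end(addr)
            return "L2 hit"
        if addr in self.l3:                 # L3 hit: promote back into L2
            del self.l3[addr]
            result = "L3 hit"
        else:
            result = "miss"
        self.l2[addr] = True                # fill L2 only (non-inclusive)
        if len(self.l2) > self.l2_lines:    # L2 eviction: victim moves to L3
            victim, _ = self.l2.popitem(last=False)
            self.l3[victim] = True
            if len(self.l3) > self.l3_lines:
                self.l3.popitem(last=False)
        return result
```

With a two-line L2, accessing A, B, then C evicts A from L2 into L3, so a fourth access to A comes back as an L3 hit rather than a full miss – the "preserved in older form" behaviour the policy is named for.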
As with Skylake, Skylake-X’s L2 cache is private to each CPU core, whereas the L3 cache is shared amongst all cores.
The benefit of a larger L2 cache is a greater chance that the data you want to access is already in that cache, i.e. a higher ‘hit rate’. It is also particularly well suited to environments with a high level of replication in data use – chiefly enterprise environments, including cloud services – which likely motivated the change. Cloud-based services are expected to be a major market for the Xeon variants of this CPU family.
Unfortunately these decisions are by their nature trade-offs: a larger L2 cache will tend to increase the number of CPU cycles needed to read and load data, and the more complex L3 replacement policy appears to hurt L3 latency as well, which increases from 44 cycles to 77.
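The trade-off can be made concrete with the standard average memory access time (AMAT) formula. The 44- and 77-cycle L3 figures come from the text; the L2 latencies, hit rates and memory latency below are purely illustrative assumptions, chosen to show how a higher L2 hit rate can outweigh a slower L3.

```python
def amat(l2_cycles, l2_hit, l3_cycles, l3_hit, mem_cycles):
    """Average access time (cycles) for a lookup past L1:
    try L2, fall through to L3, then to main memory."""
    return l2_cycles + (1 - l2_hit) * (l3_cycles + (1 - l3_hit) * mem_cycles)

# Skylake-style: small fast L2, 44-cycle L3 (hit rates assumed)
old = amat(l2_cycles=12, l2_hit=0.80, l3_cycles=44, l3_hit=0.70, mem_cycles=200)
# Skylake-X-style: larger, slightly slower L2 with a higher assumed
# hit rate, but a 77-cycle L3
new = amat(l2_cycles=14, l2_hit=0.90, l3_cycles=77, l3_hit=0.70, mem_cycles=200)
```

Under these assumed numbers the old layout averages 32.8 cycles and the new one 27.7 – but tilt the L2 hit rates the other way and the slower L3 dominates, which is exactly why the net effect is workload-dependent.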
The changes to the L2 cache in particular can have a significant net impact on IPC, and hence overall performance, but that will depend greatly on the workload. We’ll see how it performs in a bit.
Scalability is the trending term in computing this season, and that’s certainly the case with Intel’s Core X family of processors. The underlying architecture of these HEDT CPUs is shared with Intel’s workstation- and enterprise-class Xeons, which this spring were also granted the slightly long-winded title of Intel® Xeon® Scalable processors; this time, though, it’s no mere marketing spiel.
One problem central to scaling CPUs beyond two cores is that of cross-core communication. Each core needs to communicate with the others in the CPU, but the further the path between two cores the higher the latency will be.
When CPUs reached six cores, Intel introduced a ring bus topology for cross-core communication. Essentially, each core was connected to its two neighbours (one on each side), so lines of communication flowed in a ring around the cores. Maximum latency between any two cores scaled with the total number of cores, and so long as this number remained relatively low, latency stayed under control.
In this generation the number of cores continues to grow, now reaching 18 on consumer-class Skylake-X CPUs alone, and something has to give. With Skylake-X the ring topology is no more; enter the Mesh.
Mesh architecture conceptual representation. Red lines represent horizontal and vertical wires connecting CPU components; green squares represent switches at intersections.
The Mesh topology is inherited from Knights Landing, AKA Xeon Phi, Intel’s supercomputing platform which supports up to 72 cores. In a primer on the design, Akhilesh Kumar, Intel’s Principal Engineer in Datacenter Processor Architecture, describes an array of cores with communication nodes routing traffic between adjacent cores. You can read the primer here. Each core communicates directly with two, three or four adjacent cores, depending on its location in the array (cores on edges and corners communicate with three and two other cores respectively). A target core can then be referenced relative to the originating core by describing the pathway needed to reach it.
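Why a mesh scales better than a ring falls out of simple geometry: worst-case distance on a bidirectional ring is half the core count, while on a 2D mesh it is the corner-to-corner Manhattan distance, which grows only with the square root of the core count. A quick sketch (grid layout assumed near-square; these helper names are my own):

```python
import math

def ring_max_hops(n_cores):
    """Worst-case hops on a bidirectional ring: halfway around."""
    return n_cores // 2

def mesh_max_hops(n_cores):
    """Worst-case hops on a 2D mesh: corner to corner across
    a near-square grid (Manhattan distance)."""
    cols = math.ceil(math.sqrt(n_cores))
    rows = math.ceil(n_cores / cols)
    return (rows - 1) + (cols - 1)

for n in (6, 18, 72):
    print(f"{n} cores: ring {ring_max_hops(n)} hops, "
          f"mesh {mesh_max_hops(n)} hops")
```

At 6 cores the two are comparable, but at 18 cores the ring’s worst case is 9 hops against the mesh’s 7, and at Xeon Phi’s 72 cores it is 36 against 15 – hence the switch once core counts kept climbing.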
It would have been possible to achieve a similar result using a central routing node, but the amount of traffic that node would need to handle would be far higher. By making each core a little more complex, the load is better balanced and the architecture as a whole is more scalable.
That’s a good thing, too. The Skylake-X platform will reach 18 cores when the Core i9-7980XE arrives later this year, and with AMD and Intel in a race to bring more cores to the desktop, the sky is virtually the limit.
Native AVX-512 Support
AVX-512 is a set of 512-bit extensions to the 256-bit Advanced Vector Extensions (AVX) SIMD instructions for x86. AVX and AVX2 are both supported by Intel and AMD’s current crop of desktop CPUs, but among consumer CPUs AVX-512 is currently exclusive to Skylake-X.
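The practical significance of the wider registers is simply more elements processed per instruction. A trivial sketch of the arithmetic (the function name is mine, not an Intel or ISA term):

```python
def lanes(register_bits, element_bits):
    """Elements processed per SIMD instruction: one register's
    worth of data divided into equal-width lanes."""
    return register_bits // element_bits

# 32-bit floats per instruction:
#   AVX/AVX2 (256-bit registers) -> 8 lanes
#   AVX-512  (512-bit registers) -> 16 lanes
# 64-bit doubles: AVX-512 still handles 8 per instruction,
# matching AVX2's single-precision throughput.
```

Doubling the lane count is why AVX-512 can roughly double throughput on vector-friendly workloads – and also why it draws so much more power, which leads to the clocking behaviour below.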
Skylake-X also supports an overclocking ratio offset for AVX-512 operations. AVX-512 workloads substantially increase the thermal output of the CPU, so a negative ratio offset allows the CPU to selectively downclock when facing these tasks and remain within thermal limits. As with the similar AVX2 offset on mainstream Skylake CPUs, enthusiasts who overclock their CPU will be able to take advantage of this to improve long-term stability when mixed workloads are encountered.
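The offset mechanics are straightforward multiplier arithmetic. The ratios below are hypothetical overclocking values of my own choosing, not Intel specifications; only the 100MHz base clock is a standard platform figure.

```python
BASE_CLOCK_MHZ = 100  # standard BCLK on this platform

def effective_mhz(core_ratio, avx512_offset, running_avx512):
    """Effective frequency: the negative AVX-512 offset lowers the
    multiplier only while 512-bit instructions are executing."""
    ratio = core_ratio - (avx512_offset if running_avx512 else 0)
    return ratio * BASE_CLOCK_MHZ

# Hypothetical overclock: 45x all-core ratio with a -5 AVX-512 offset
normal = effective_mhz(45, 5, running_avx512=False)  # 4500 MHz
avx512 = effective_mhz(45, 5, running_avx512=True)   # 4000 MHz
```

The appeal for overclockers is that the 4500MHz setting no longer has to be stable under worst-case AVX-512 heat; the chip simply drops to the lower multiplier for those stretches and returns to full speed afterwards.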