Lunar Lake Says Farewell to Hyperthreading As Intel Usher In AI Era

by Tim Harmer, 04.06.2024 19:15:25


Just like their competitors, Intel spent the vast majority of their Computex 2024 Keynote extolling the virtues of AI hardware acceleration, the industry sector exhibiting the most growth and investor interest. New Xeon 6 silicon with an all E-core implementation - up to 144 of them - will be joined by a variant boasting Performance cores. The Gaudi 3 AI Accelerator was also re-introduced as their purpose-built solution for AI training and inferencing in the datacenter. But for consumers it was the new Lunar Lake laptops that took centre stage.

The keynote itself was a little light on details if you really wanted to get into the guts of the new technologies involved. One clear aspect of the imminent availability of Lunar Lake laptops, however, was the sheer number of partners throwing their weight behind this new class of system. Global partners like ASUS, ACER and GIGABYTE will have models in the channel fast and at scale; the heavy implication was that Intel would be the only chipmaker (particularly among x86 chipmakers) likely to have these Microsoft Copilot+ validated models shipping in significant quantities.



This inevitably leads to a broader discussion of Lunar Lake's capabilities. What makes it special? What changes did Intel make in the transition from Meteor Lake? What are the implications for Arrow Lake, the upcoming desktop CPU generation? Some, but not all, of these questions were answered in supplementary material made available under embargo during the pre-Computex Intel Architecture Day.

What is Lunar Lake?

Lunar Lake is the codename for Intel's upcoming SoC that will replace Meteor Lake in low power laptops and other suitable hardware applications this year. The processor's major selling point is the integration of a 48 TOPS Neural Processing Unit for hardware-accelerated AI functionality, a feature that was a requirement for Microsoft's new suite of Windows Copilot Runtime applications that leverage AI. These 'AI PCs' are likely to become more prevalent if locally processed AI takes off, but could easily become a development dead-end if software publishers fail to convince a relatively uninformed customer base of its value.



First up, the SoC package. You will have spotted the two DRAM chips immediately: two LPDDR5x-8533 memory modules installed directly onto the SoC package, reducing the footprint of the combined CPU and memory subsystem as well as its power requirements and latency. Lunar Lake SoCs will be equipped with 16 or 32GB of RAM which cannot be upgraded; more performance at a trade-off, much like Apple M-series SoCs.

The die section of the chip is assembled using Intel's Foveros tile packaging technology, which allows chiplets manufactured on different process technologies to be placed in close proximity. Each tile sits on the base tile, where small bumps with a tiny pitch handle signalling between tiles, keeping them tightly packed together to reduce latency and power requirements.

Lunar Lake has three main visible tiles. The largest is the Compute tile, manufactured using TSMC's N3B process and containing the CPU, NPU and GPU logic as well as exclusive or shared cache pools. The second is the Platform Controller Tile, manufactured on TSMC's N6 process, which comprises the PCIe controller and other features that don't scale particularly well to smaller process nodes. Finally, in the bottom-left there's a stabiliser tile intended purely for mechanical stability.

The large U around the left, bottom and right of the die area stiffens the SoC package, ensuring it's not fragile when used outside of a lab.



When zooming in on the Compute die the internal structures become quite apparent. Running along the top is the 128-bit dual-channel memory controller, located as close as physically possible to the LPDDR5X DRAM. A large section to the left encloses the eight Xe2 GPU cores and their associated cache. The regular structure to the right comprises the six NPU cores, above which sits the media engine. To the right of that is a large 8MB cache structure, below which is the cluster of four small E-Cores connected by a cross of cache. The last major structure is the four P-Cores complete with 12MB of L3 cache.

Power islands allow controllers to selectively disable sections to save power when not in use, a methodology that's been used to variable effect in both SoCs and GPUs.

This layout alone differs significantly from Meteor Lake, and not only because of all that area allocated to the NPU. It delineates between P-Core and E-Core zones more explicitly than before, and allocates a large shared 'memory side-cache' primarily used by the E-Cores and NPU rather than having the P-Cores and E-Cores closely share resources. The anticipated result is an E-Core that is significantly more performant than the previous generation, bringing us to the next headline.

Goodbye to Hyperthreading... for now?

Thread management and scheduling via the Thread Director has been one of the major hurdles of the many-core revolution. Simultaneous Multithreading (or Hyperthreading, as Intel brands it) has nevertheless remained a part of the core design even as core counts ballooned on desktop and mobile. The technology comes at a cost, however: die area and power requirements, both of which are at a premium in mobile applications.

Intel's thread management utilises a preferred-priority system that exploits single-threaded operation on discrete cores first, turning to hyperthreaded execution only as a last resort. Many workloads which would previously make use of hyperthreading on P-Cores now tend to be off-loaded onto E-Cores, and only a tiny fraction need the full potential of E-Cores plus hyperthreaded P-Cores, at least on mobile. However, even when inactive, the hyperthreading circuitry still consumes some power.
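The preference order described above can be sketched as a toy scheduler: fill idle P-cores first, then E-cores, and only spill onto SMT sibling threads last. This is purely illustrative - the core counts and slot names are assumptions, not Thread Director's actual logic - and setting `smt_per_p=1` models a Lunar Lake-style configuration with no SMT at all:

```python
# Toy model of a preferred-priority scheduler: fill idle P-cores first,
# then E-cores, and only fall back to SMT sibling slots last. This is an
# illustration of the ordering described in the article, not Intel's
# actual Thread Director implementation.

def assign_threads(num_threads, p_cores=4, e_cores=4, smt_per_p=2):
    # Build slots in preference order: one primary slot per P-core,
    # then E-core slots, then the remaining SMT siblings on P-cores.
    slots = (
        [f"P{i}" for i in range(p_cores)]
        + [f"E{i}" for i in range(e_cores)]
        + [f"P{i}-smt" for i in range(p_cores) for _ in range(smt_per_p - 1)]
    )
    return slots[:num_threads]

print(assign_threads(6))   # fits on P-cores plus two E-cores; no SMT needed
print(assign_threads(10))  # only now spills onto SMT siblings
```

With `smt_per_p=1` there are simply no SMT slots to spill onto, which is the intuition behind removing the circuitry entirely on this platform.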

Quite reasonably, Intel wondered whether Hyperthreading was strictly necessary on this platform, and came to the conclusion that it wasn't. Instead they beefed up single-threaded performance on the Lion Cove P-Core - notably adding an intermediate cache between L1 and L2 - and allocated more space to the Skymont E-Cores with a larger cache pool, resulting in a considerably more efficient overall design tailored to power-constrained platforms.



To give some idea of the capabilities of Skymont E-Cores, Intel believes that they have roughly the same IPC as Raptor Lake P-Cores (while operating at lower clock speeds) and represent an improvement of over 30% across all workloads compared with the previous generation of E-Cores.

This doesn't mean that Hyperthreading/SMT is dead. Intel have described Lion Cove as a flexible architecture whose Hyperthreading capability is simply disabled on Lunar Lake, so we can expect to see a variant with the technology enabled on desktop. It also looks highly likely to remain a fixture of datacenter CPUs, where performance rather than power efficiency is the primary design factor.

Xe2, the basis of Battlemage.



Lunar Lake also incorporates Intel's latest generation Xe2 GPU architecture. We're hoping to go into more depth on the release of Intel's 2nd Generation 'Battlemage' discrete GPUs but it's worth going over some of the highlights:

- Considerably revised microarchitecture over Alchemist
- Native SIMD16 rather than SIMD8 ALUs with support for SIMD32
- Better utilisation
- Improved game compatibility to reduce driver optimisation requirements
- Faster and more efficient Ray Tracing Unit
- Provides more TOPS to augment NPU AI processing
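The move from native SIMD8 to SIMD16 is easiest to see with a little arithmetic: a wider ALU covers the same thread group in fewer instruction issues, at the cost of wasting more lanes when a group is only partially full. A rough back-of-envelope sketch, with functions and numbers that are illustrative rather than Intel's own model:

```python
import math

# Back-of-envelope model of SIMD issue counts and lane utilisation.
# Illustrative only: real Xe2 scheduling is far more involved.

def issues_needed(lanes, simd_width):
    """Instruction issues required to cover `lanes` work-items."""
    return math.ceil(lanes / simd_width)

def utilisation(lanes, simd_width):
    """Fraction of executed SIMD lanes doing useful work."""
    return lanes / (issues_needed(lanes, simd_width) * simd_width)

# A 32-wide thread group: SIMD16 halves the issue count versus SIMD8.
print(issues_needed(32, 8), issues_needed(32, 16))   # 4 vs 2

# The trade-off: partially filled groups waste more of a wider ALU.
print(utilisation(20, 8), utilisation(20, 16))       # ~0.833 vs 0.625
```

Fewer issues per thread group means less front-end and scheduling overhead, which is one route to the "better utilisation" Intel is claiming.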

In the estimation of many, Intel failed to leverage the available silicon on Alchemist GPUs, leading to unwarranted material cost for a GPU in its performance bracket. Improved utilisation of all its available resources is therefore perhaps the most important aspect of Xe2's improved architecture from a profitability perspective; something that will be critically important to the discrete-GPU project's long-term viability.



With luck the changes will make Lunar Lake laptops more suitable for gaming than previous low-power laptop implementations, but the final point on the list will raise an eyebrow or two amongst investors. Intel's competitors strongly touted the power of their dedicated NPUs, but by leveraging the Xe2 GPU for up to 70 additional TOPS on top of the NPU's 48, Lunar Lake may be the more capable consumer AI inferencing platform for Windows Copilot Runtime applications.
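Taking the keynote's headline figures at face value, the combined budget is simple arithmetic. Bear in mind these are peak quoted TOPS rather than sustained throughput, and the 40 TOPS baseline below is Microsoft's published Copilot+ NPU requirement, not a figure from this keynote:

```python
# Combined peak AI throughput from the figures quoted above.
# Peak marketing TOPS, not sustained real-world performance.
npu_tops = 48   # Lunar Lake NPU, per Intel's keynote
gpu_tops = 70   # up-to additional TOPS from the Xe2 GPU, as quoted

platform_tops = npu_tops + gpu_tops
print(platform_tops)  # 118 peak TOPS across NPU + GPU

# Microsoft's Copilot+ baseline requires a 40+ TOPS NPU on its own;
# the GPU contribution is a bonus on top of that qualification.
print(npu_tops >= 40)  # True
```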

It would certainly be ironic if Intel stole a march on AMD with more powerful and versatile on-chip GPU features.

Rumours of x86's Death Appear to have been Exaggerated

Despite having the initiative stolen by Qualcomm and their Snapdragon X Elite AI platform last month, Intel were bullish about Lunar Lake being the best AI platform for Windows Copilot+, shrugging off commentary that ARM's efficiency will kill off demand for x86-based solutions in the low power AI-enabled space.



A plethora of Lunar Lake laptops from Intel's partners are expected in Q3 2024 alongside major updates to Windows Copilot+. Arrow Lake meanwhile, the desktop replacement for Raptor Lake that will incorporate some of Lunar Lake's advancements, is expected later in the year.

In Summary

You'd be forgiven for thinking that Intel's Computex 2024 showing was only concerned with surface-level appearances and AI, but dig into the released materials and you'll find that Chipzilla has been the most open and comprehensive of the PC chipmaking trio. A lot of their plans are contingent on the success of open AI models, both in the consumer and enterprise space, so nothing is cut-and-dried. That being said, they at least have a track record of availability in the consumer space that their competitors struggle to match.



