This has puzzled me for a while. The cited system has 2x89.6 GB/s bandwidth, but a single CCD can do at most 64GB/s of sequential reads. Are claims like "Apple Silicon having 400GB/s" meaningless? I understand a typical single logical CPU can't do more than 50-70GB/s, and it seems like a group of CPUs typically shares a memory controller which is similarly limited.
To rephrase: is it possible to reach 100% memory bandwidth utilization with only 1 or 2 CPUs doing the work per CCD?
ryao 22 hours ago [-]
On Zen 3, I am able to use nearly the full 51.2GB/sec from a single CPU core. I have not tried using two as I got so close to 51.2GB/sec that I had assumed that going higher was not possible. Off the top of my head, I got 49-50GB/sec, but I last measured a couple years ago.
By the way, if the cores were able to load things at full speed, they would be able to use 640GB/sec each. That is 2 AVX-512 loads per cycle at 5GHz. Of course, they never are able to do this due to memory bottlenecks. Maybe Intel’s Xeon Max series with HBM can, but I would not be surprised to see an unadvertised internal bottleneck there too. That said, it is so expensive and rare that few people will ever run code on one.
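For anyone curious what such a measurement looks like, here's a minimal single-threaded sequential-read sketch (not the exact benchmark used above; the 1 GiB buffer size and clock source are arbitrary choices). Built with something like gcc -O2 -march=native, a single pass over a buffer much larger than L3 lands in the right ballpark on an idle machine:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        size_t bytes = 1ull << 30;                    /* 1 GiB, far larger than L3 */
        size_t n = bytes / sizeof(uint64_t);
        uint64_t *buf = malloc(bytes);
        if (!buf) return 1;
        for (size_t i = 0; i < n; i++) buf[i] = i;    /* fault the pages in first */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++) sum += buf[i]; /* sequential 64-bit reads */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("%.1f GB/s (checksum %llu)\n", bytes / s / 1e9,
               (unsigned long long)sum);              /* print sum so the loop isn't elided */
        free(buf);
        return 0;
    }

Repeating the pass and taking the best of several runs tightens the number up, but even this crude version should get close to the per-core limits being discussed here.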
buildbot 20 hours ago [-]
People have studied the Xeon Max! Spoiler: yes, it's limited to ~23GB/s per core. It can't achieve anywhere close to the theoretical bandwidth of the HBM even with all cores active. It's a pretty bad design in my opinion.
https://www.ixpug.org/images/docs/ISC23/McCalpin_SPR_BW_limi...
It is still an integer factor better in overall total bandwidth than DDR5 SPR; I think they went for minimal investment and time to market for the SPR-with-HBM product rather than heavy investment to hit full bandwidth utilization, which may have made sense for Intel overall given the business context.
KeplerBoy 24 hours ago [-]
Aren't those 400 GB/s a figure which only applies when the GPU, with its much wider interface, is accessing the memory?
bobmcnamara 21 hours ago [-]
That figure is at the memory controller.
It applies as a maximum speed limit all the time, but it's unlikely that a CPU alone would cause the memory controller to reach it. Why it matters is that it causes increased latency whenever other bus controllers are competing for bandwidth, but I don't think Apple has documented the internal bus architecture or the performance counters needed to observe this.
doctorpangloss 20 hours ago [-]
Another POV is that maybe the max memory bandwidth figure is too vague to guide people optimizing libraries. It would be nice if Apple Silicon was as fast as "400GB/s" sounds. Grounded closer to reality, the parts are 65W.
KeplerBoy 17 hours ago [-]
But those 65 watts deliver state-of-the-art FLOPS per watt.
jmb99 16 hours ago [-]
> The cited system has 2x89.6 GB/s bandwidth.
The following applies for certain only to the Zen4 system; I have no experience with Zen5.
That is the theoretical max bandwidth of the DDR5 memory (/controller) running at 5600 MT/s (roughly: 5600MT/s × 2 channels × 64 bits/T ÷ 8 bits/byte = 89.6GB/s). There is also a bandwidth limitation between the memory controller (IO die) and the cores themselves (CCDs), along the Infinity Fabric. The Infinity Fabric runs at a different clock speed than the cores, their cache(s), and the memory controller; by default, 2/3 of the memory controller clock. So, if the Memory controller's CLocK (MCLK) is 2800MHz (for 5600MT/s), the FCLK (Infinity Fabric CLocK) will run at 1866.66MHz. With 32 bytes per clock of read bandwidth, you get 59.7GB/s maximum sequential memory read bandwidth per CCD<->IOD interconnect.
Many systems (read: motherboard manufacturers) will overclock the FCLK when applying automatic overclocking (such as when selecting XMP/EXPO profiles), and I believe some EXPO profiles specify an FCLK overclock as well. (Note that 5600MT/s RAM is overclocked; the fastest officially supported Zen4 memory speed is 5200MT/s, and most memory kits default to lower JEDEC speeds until overclocked with their built-in profiles.) In my experience, Zen4 will happily accept FCLK up to 2000MHz, while Zen4 Threadripper (7000 series) seems happy up to 2200MHz. This particular system has the FCLK overclocked to 2000MHz, which will hurt latency[0] (due to not being 2/3 of MCLK) but increase bandwidth. 2000MHz × 32 bytes/cycle = 64GB/s read bandwidth, as quoted in the article.
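To make the arithmetic explicit, here are the same limits as a few lines of C (the transfer rate, channel count, and FCLK values are simply the ones discussed above, not queried from hardware):

    #include <stdio.h>

    int main(void) {
        /* DRAM limit: transfers/s x channels x 8 bytes per 64-bit transfer */
        double dram = 5600e6 * 2 * 8;               /* = 89.6 GB/s */
        /* Per-CCD read limit: FCLK x 32 bytes per fabric clock */
        double fclk_default = 2.0 / 3.0 * 2800e6;   /* 1866.67 MHz at DDR5-5600 */
        double ccd_default  = fclk_default * 32;    /* = 59.7 GB/s */
        double ccd_oc       = 2000e6 * 32;          /* FCLK overclocked to 2 GHz = 64 GB/s */

        printf("DRAM limit:              %.1f GB/s\n", dram / 1e9);
        printf("Per-CCD limit (stock):   %.1f GB/s\n", ccd_default / 1e9);
        printf("Per-CCD limit (2 GHz):   %.1f GB/s\n", ccd_oc / 1e9);
        return 0;
    }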
First: these are theoretical maximums. Even the most "perfect" benchmark won't hit these, and if they do, there are other variables at play not being taken into account (likely lower level caches). You will never, ever see theoretical maximum memory bandwidth in any real application.
Second: no, it is not possible to see maximum memory bandwidth on Zen4 from only one CCD, assuming you have sufficiently fast DDR5 that the FCLK cannot be equal to the MCLK. This is an architecture limitation, although rarely hit in practice for most of the target market. A dual-CCD chip has sufficient memory bandwidth to saturate the memory before the Infinity Fabric (but as alluded to in the article, unless tuned incredibly well, you'll likely run into contention issues and either hit a latency or bandwidth wall in real applications). My quad-CCD Threadripper can achieve nearly 300GB/s, due to having 8 (technically 16) DDR5 channels operating at 5800MT/s and FCLK at 2200MHz; I would need an octo-CCD chip to achieve maximum memory bandwidth utilization.
Third: no, claims like "Apple Silicon having 400GB/s" are not meaningless. Those numbers are achieved the exact same way as above, and the same way Nvidia determines their maximum memory bandwidth on their GPUs. Platform differences (especially CPU vs GPU, but even CPU vs CPU since Apple, AMD, and Intel all have very different topologies) make the numbers incomparable to each other directly. As an example, Apple Silicon can probably achieve higher per-core memory bandwidth than Zen4 (or 5), but also shares bandwidth with the GPU; this may not be great for gaming applications, for instance, where memory bandwidth requirements will be high for both the CPU and GPU, but may be fine for ML inference since the CPU sits mostly idle while the GPU does most of the work.
[0] I'm surprised the author didn't mention this. I can only assume they didn't know about it, and haven't tested other frequencies or read much on the overclocking forums about Zen4. Which is fair enough; it's a very complicated topic with a lot of hidden nuances.
bpye 15 hours ago [-]
> Note that 5600MT/s RAM is overclocked; the fastest officially supported Zen4 memory speed is 5200MT/s
This specifically changed in Zen 5; the max supported speed is now 5600MT/s.
neonsunset 12 hours ago [-]
Easily. The memory subsystem on AMD's consumer parts is embarrassingly weak (as it is on desktop and portable consumer devices in general, save for Apple's and select bespoke designs).
jeffbee 21 hours ago [-]
There are large differences in load/store performance across implementations. On Apple Silicon, for example, a single M1 Max core can stream about 100GB/s all by itself. This is a significant advantage over competing designs that are built to hit that kind of memory bandwidth only with all-core workloads. For example, five generations of Intel Xeon processors, from Sandy Bridge through Skylake, were built to achieve about 20GB/s streams from a single core. That is one reason why the M1 was so exceptional at the time it was released: the 1T memory performance is much better than what you get from everyone else.
As far as claims of the M1 Max having > 400GB/s of memory bandwidth, this isn't achievable from CPUs alone. You need all CPUs and GPUs running full tilt to hit that limit. In practice you can hit maybe 250GB/s from CPUs if you bring them all to bear, including the efficiency cores. This is still extremely good performance.
btw, what's about as important is that in practice you don't need to write super clever code to do that; these 68GB/s are easy to reach with textbook code, without any cleverness
zamadatix 11 hours ago [-]
68 GB/s of memory read/write can easily be reached (assuming the memory bandwidth is there to reach) on any current architecture by running a basic loop adding 64-bit scalars. What could be even less clever than that?
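A sketch of that kind of loop scaled across cores, for anyone who wants to try it (thread count and buffer size are arbitrary; compile with -pthread; the timing includes thread startup, which is negligible at this buffer size):

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NTHREADS 8
    #define N (1ull << 27)                     /* 2^27 uint64s = 1 GiB total */

    static uint64_t *buf;
    static uint64_t sums[NTHREADS];

    static void *worker(void *arg) {
        size_t id = (size_t)arg, chunk = N / NTHREADS;
        uint64_t s = 0;
        for (size_t i = id * chunk; i < (id + 1) * chunk; i++) s += buf[i];
        sums[id] = s;                          /* store so the loop can't be elided */
        return NULL;
    }

    int main(void) {
        buf = malloc(N * sizeof(uint64_t));
        if (!buf) return 1;
        for (size_t i = 0; i < N; i++) buf[i] = i;   /* fault pages in first */

        pthread_t t[NTHREADS];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (size_t i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("aggregate: %.1f GB/s\n", N * sizeof(uint64_t) / s / 1e9);
        free(buf);
        return 0;
    }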
> From a single core perspective, meaning from a single software thread, things are quite impressive for the chip, as it’s able to stress the memory fabric to up to 102GB/s. This is extremely impressive and outperforms any other design in the industry by multiple factors, we had already noted that the M1 chip was able to fully saturate its memory bandwidth with a single core and that the bottleneck had been on the DRAM itself. On the M1 Max, it seems that we’re hitting the limit of what a core can do – or more precisely, a limit to what the CPU cluster can do.
Wow
Agingcoder 1 day ago [-]
Proper thread placement and NUMA handling do have a massive impact on modern AMD CPUs - significantly more so than on Xeon systems.
This might be anecdotal, but I’ve seen performance improve by 50% on some real world workloads.
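For a concrete idea of what "placement" involves on Linux, a minimal sketch using pthread affinity plus libnuma (link with -lnuma; the CPU and node numbers are illustrative, not a recommendation):

    /* Pin the calling thread to CPU 0 and allocate its working set on NUMA
     * node 0, so memory accesses stay local to that socket/CCD complex.   */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) { puts("no NUMA support"); return 1; }

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                        /* a CPU assumed to live on node 0 */
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        size_t bytes = 1ull << 30;
        void *buf = numa_alloc_onnode(bytes, 0); /* back the buffer with node-0 RAM */
        if (!buf) return 1;

        /* ... run the latency/bandwidth-sensitive work here ... */

        numa_free(buf, bytes);
        return 0;
    }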
bob1029 21 hours ago [-]
NUMA feels like a really big deal on AMD now.
I recently refactored an evolutionary algorithm from Parallel.ForEach over one gigantic population to an isolated population+simulation per thread. The difference is so dramatic (100x+) that the loss of large-scale population dynamics seems to be more than offset by the # of iterations you can achieve per unit time.
Communicating information between threads of execution should be assumed to be growing more expensive (in terms of latency) as we head further in this direction. More threads is usually not the answer for most applications. Instead, we need to back up and review just how fast one thread can be when the dependent data is in the right place at the right time.
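Not the original C# code, but the shape of that refactor looks roughly like this C sketch: each thread owns its population, RNG state, and best-so-far result, and nothing crosses threads until the very end. The population size, generation count, and "evolution" step are placeholders:

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NTHREADS    8
    #define POP_SIZE    1024            /* placeholder population size */
    #define GENERATIONS 10000           /* placeholder iteration count */

    static double best[NTHREADS];       /* one result slot per thread  */

    /* Tiny xorshift PRNG so threads never share RNG state. */
    static uint64_t next_rand(uint64_t *state) {
        *state ^= *state << 13;
        *state ^= *state >> 7;
        *state ^= *state << 17;
        return *state;
    }

    static void *island(void *arg) {
        size_t id = (size_t)arg;
        uint64_t rng = 0x9E3779B97F4A7C15ull ^ (id + 1);  /* per-thread seed */
        double *pop = malloc(POP_SIZE * sizeof(double));  /* private population */
        if (!pop) return NULL;
        for (int i = 0; i < POP_SIZE; i++)
            pop[i] = (double)(next_rand(&rng) % 1000);

        double b = pop[0];
        for (int g = 0; g < GENERATIONS; g++) {
            /* placeholder step: mutate one member, keep it if better (minimizing) */
            int i = next_rand(&rng) % POP_SIZE;
            double candidate = pop[i] + ((int64_t)(next_rand(&rng) % 21) - 10) * 0.1;
            if (candidate < pop[i]) pop[i] = candidate;
            if (pop[i] < b) b = pop[i];
        }
        best[id] = b;                   /* the only cross-thread write, at the end */
        free(pop);
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (size_t i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, island, (void *)i);
        for (size_t i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        for (int i = 0; i < NTHREADS; i++)
            printf("island %d best: %f\n", i, best[i]);
        return 0;
    }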
Agingcoder 18 hours ago [-]
Yes - I almost view the server as a small cluster in a box, with an internal network and the associated performance impact when you start going out of the box.
bobmcnamara 21 hours ago [-]
Is cross thread latency more expensive in time, or more expensive relative to things like local core throughput?
bob1029 20 hours ago [-]
Time and throughput are inseparable quantities. I would interpret "local core throughput" as being the subclass of timing concerns wherein everything happens in a smaller physical space.
I think a different way to restate the question would be: What are the categories of problems for which the time it takes to communicate cross-thread more than compensates for the loss of cache locality? How often does it make sense to run each thread ~100x slower so that we can leverage some aggregate state?
The only headline use cases I can come up with for using more than <modest #> of threads are hosting VMs in the cloud and running simulations/rendering in an embarrassingly parallel manner. I don't think gaming benefits much beyond a certain point - humans have their own timing issues. Hosting a web app and ferrying the user's state between 10 different physical cores under an async call stack is likely not the most ideal use of the computational resources, and this scenario will further worsen as inter-thread latency increases.
hobs 23 hours ago [-]
When I was caring more about hardware configuration for databases on big virtual-machine hosts, not configuring NUMA was an absolute performance killer - more than 50% performance on almost any hardware, because as soon as you left the socket the interconnect suuuuucked.
cebert 2 days ago [-]
George’s detailed analysis always impresses me. I’m amazed with his attention to detail.
geerlingguy 1 day ago [-]
It's like Anandtech of old, though the articles usually lag product launches a little further. Probably due to lack of resources (in comparison to Anandtech at its height).
I feel like I've learned a bit after every deep dive.
ip26 20 hours ago [-]
He goes far deeper than I remember Anandtech going.
IanCutress 15 hours ago [-]
Just to highlight, this one's Chester :)
AbuAssar 1 day ago [-]
Great deep dive into AMD's Infinity Fabric!
The balance between bandwidth, latency, and clock speeds shows both clever engineering and limits under pressure.
Makes me wonder how these trade-offs will evolve in future designs. Thoughts?
Cumpiler69 1 day ago [-]
IMHO these internal and external high-speed interconnects will become more and more important in the future: Moore's law is dying, GHz aren't increasing, and newer fab nodes are becoming monstrously expensive, so connecting cheaper-to-make dies together is the only way to scale compute performance for consumer applications where cost matters. Apple did the same on the high-end M chips.
The only challenge is that SW also needs to be rewritten to use these new architectures efficiently, otherwise we see performance decreases instead of increases.
sylware 24 hours ago [-]
You would need fine-grained hardware configuration from the software, based on that software's semantics and task - if that's even possible in a shared hardware environment.
Video game consoles with a shared GPU (for 3D) and CPU had to choose: favor the GPU with high bandwidth and high latency, or the CPU with low latency and lower bandwidth. Since a video game console is mostly GPU, they went for GDDR, namely high bandwidth with high latency.
On Linux, you have alsa-lib, which handles sharing the audio device among the various applications. They had to choose a reasonable default hardware configuration for everyone: it is currently stereo 48kHz, and it is moving to the maximum number of channels at a maximum of 48kHz, with left and right channels.