Caching Analysis

It does not take us long to find a suspect for the lower single-threaded performance of the CMT-enabled module: the instruction cache.

Instruction Cache Hit Rate

The instructions of Cinebench and 7-Zip fit almost perfectly in the instruction cache, but the same cannot be said of our MS SQL Server benchmark. The 8-way 32KB instruction caches of the latest Intel CPUs are clearly not large enough for it, which sheds some light on why the Opteron 6174 performed so well in this benchmark: the older AMD CPU has up to 40% fewer instruction cache misses.

The 2-way 64KB instruction cache was clearly not the optimal choice for caching two threads: the hit rate goes from an excellent 97% down to a mediocre 95% once we enable the second integer thread. Put differently, the miss rate rises from 3% to 5%, roughly two-thirds more misses. It will take some engineering, but increasing the associativity of the L1 instruction cache seems necessary to make sure that the two CMT threads do not hinder each other. Let's move on to the data cache.
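To make the associativity argument concrete, here is a minimal sketch of a set-associative cache with LRU replacement. It is not a model of the actual Bulldozer front end: the class, the `hit_rate` helper, the 64-byte line size, and the deliberately pathological access pattern (each thread loops over two lines that all map to the same set) are assumptions chosen for illustration only.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy set-associative cache with LRU replacement (hypothetical model)."""

    def __init__(self, size_bytes, ways, line_size=64):
        self.ways = ways
        self.line_size = line_size
        self.num_sets = size_bytes // (ways * line_size)
        # One OrderedDict per set; key order tracks recency (oldest first).
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.accesses = 0

    def access(self, address):
        line = address // self.line_size
        index = line % self.num_sets
        tag = line // self.num_sets
        entries = self.sets[index]
        self.accesses += 1
        if tag in entries:
            entries.move_to_end(tag)  # mark as most recently used
            self.hits += 1
            return True
        if len(entries) >= self.ways:
            entries.popitem(last=False)  # evict the least recently used line
        entries[tag] = None
        return False


def hit_rate(ways, threads, rounds=1000, size_bytes=64 * 1024):
    """Hit rate when `threads` interleaved threads share one cache."""
    cache = SetAssociativeCache(size_bytes, ways)
    # Assumed stride: with a 64KB cache, multiples of 32KB all land in set 0
    # for both the 2-way and the 4-way configuration.
    stride = 32 * 1024
    # Each thread repeatedly touches two lines of its own.
    working_sets = [
        [(2 * t + k) * stride for k in range(2)] for t in range(threads)
    ]
    for _ in range(rounds):
        for k in range(2):
            for lines in working_sets:  # interleave the threads
                cache.access(lines[k])
    return cache.hits / cache.accesses


print("2-way, 1 thread :", hit_rate(ways=2, threads=1))
print("2-way, 2 threads:", hit_rate(ways=2, threads=2))
print("4-way, 2 threads:", hit_rate(ways=4, threads=2))
```

In this toy pattern a single thread fits in the two ways of its set and hits almost every time, two threads evict each other's lines on every access, and doubling the associativity at the same capacity restores the hit rate. Real code is nowhere near this adversarial, which is why the measured drop is from 97% to 95% rather than to zero, but the mechanism is the same.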

Data Cache Hit Rate

Reducing the data cache from 64KB to 16KB was probably necessary in order to keep the die size of the module under control. (A Bulldozer module is less than 80 mm², while two Magny-Cours cores are good for 115 mm².) However, this reduction comes at a price: the data cache suffers twice as many misses as before. Intel's 8-way cache does a bit better, but not spectacularly so. Now let's check out the L2 caches.

L2 Cache Hit Rate

The very low L2 cache hit rates on the older Opteron and Xeon seem like a fluke, but they are not. In the case of Cinebench, don't forget that this benchmark has an extremely low miss rate in the L1 cache, so most of the easy-to-cache code and data is already there. The relatively high L2 cache miss rate on the Xeon means that 44% of the less than 1% of accesses that miss the L1 also miss the L2 cache, or in other words, almost nothing. The data is almost perfectly cached inside the caches: the data cache hit rate is 99.99%. Most of the L2 cache misses come from a few rarely used instructions.
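The "44% of less than 1%" arithmetic can be written out as a short sketch. The function name is hypothetical, the 1% figure is only the upper bound quoted above, and the model assumes that every L1 miss turns into exactly one L2 access.

```python
def global_l2_miss_rate(l1_miss_rate, l2_local_miss_rate):
    """Fraction of all accesses that miss both the L1 and the L2.

    Assumes every L1 miss becomes exactly one L2 access (a simplification).
    """
    return l1_miss_rate * l2_local_miss_rate


l1_miss = 0.01        # upper bound: "less than 1%" of accesses miss the L1
l2_local_miss = 0.44  # 44% of those L1 misses also miss the L2

rate = global_l2_miss_rate(l1_miss, l2_local_miss)
print(f"At most {rate:.2%} of all accesses go past the L2")
```

Under these assumptions fewer than one access in 200 leaves the L2, which is why a poor-looking local L2 hit rate says little about overall performance here.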

The same is true for the relatively bad hit rate of the Opteron 6174 L2 cache in 7-Zip. The Opteron has a higher L1 data cache hit rate than the other CPUs, so its L2 cache is accessed less often. The bad L2 hit rate is not the reason for the lower performance of the older Opteron. That brings us to the final area of analysis...


84 Comments


  • ArteTetra - Wednesday, May 30, 2012 - link

    "A core this complex in my opinion has not been optimized to its fullest potential. Expect better performance when AMD introduces later steppings of this core with regard to power consumption and higher clock frequencies."

    You don't say?
  • JohanAnandtech - Thursday, May 31, 2012 - link

    A quote by a reader, not ours :-). The idea is probably that Bulldozer was AMD's very first implementation of their new architecture.
  • haplo602 - Wednesday, May 30, 2012 - link

    now this was a great read. finally something interesting (the consumer benchmarks are NOT interesting anymore for me).

    I hope there will be a differential analysis once you have Piledriver CPUs available.
  • JohanAnandtech - Thursday, May 31, 2012 - link

    Piledriver analysis: definitely. Thanks for the encouraging words :-)
  • mikato - Friday, June 1, 2012 - link

    I agree - great critical thinking in this article! This subject definitely needed more research.
  • Spunjji - Wednesday, June 6, 2012 - link

    +1. This is the sort of thing I come here for!
  • Beenthere - Wednesday, May 30, 2012 - link

    Expecting Vishera to be an Intel killer is foolish as it's not going to happen and there is no need for it to happen. Ivy Bridge is very much like FX in that it's only 5% faster than SB and runs hot. At least FX chips OC and scale well unlike Ivy Bridge.

    If AMD can use some of the techniques employed in Trinity they should be able to get a 15+% improvement over the FX CPUs. This combined with higher clockspeeds now that GloFo has sorted 32nm production should provide a nice performance bump in Vishera.

    95% of consumers do not buy the fastest, most over-hyped and over-priced CPU on the planet for their PC or server apps. Mainstream use is what AMD is shooting for at the moment and doing pretty well at it. Eventually they will release APUs for all PC market segments that perform well, use less power and cost less than discrete CPU/GPU combo. THAT is what 95% of the X86 world will be using.
  • Homeles - Wednesday, May 30, 2012 - link

    "Ivy Bridge is very much like FX in that it's only 5% faster than SB and runs hot"

    I think you need to go read about Intel's tick-tock strategy.

    Also, unlike Bulldozer, Ivy Bridge was a step forward. A small one, but performance per watt went up, while with Bulldozer it often went backwards.

    Process maturity from GloFo will help, but probably not as much as you would think.

    Finally, "95% of users" aren't going to benefit best from a processor built with server workloads in mind. Even with server workloads, Bulldozer fails to deliver. APUs are definitely the future, but keep in mind that Intel's had an APU out for as long as AMD has. If you think that AMD's somehow going to pull a fast one on Intel, you're delusional. Intel and Nvidia as well are very, very well aware of heterogeneous computing.
  • The_Countess - Wednesday, May 30, 2012 - link

    looking at how much the performance per watt went up with piledriver compared with llano, I think they'll have a lot more headroom on the desktop and server space to increase the clock frequencies to where they were supposed to be with the bulldozer launch.
  • Homeles - Wednesday, May 30, 2012 - link

    Yeah, Piledriver will likely perform the way AMD had intended Bulldozer to perform.
