Caching Analysis

It does not take us long to find a suspect for the lower single-threaded performance of the CMT enabled module: the instruction cache.

Instruction Cache Hitrate

The instructions of Cinebench and 7-Zip fit almost perfectly in the instruction cache, but that cannot be said about our MS SQL Server SQL statements. The 8-way 32KB Instruction caches of the latest Intel CPUs are clearly not large enough and shed some light on why the Opteron 6174 performed so well in this benchmark. The older AMD CPU has up to 40% fewer instruction cache misses.

The 2-way 64KB instruction cache was clearly not the optimal choice for caching two threads: the hit rate goes from an excellent 97% down to a mediocre 95% once we enable the second integer thread. It will take some engineering, but increasing the associativity of the L1 instruction cache seems necessary to make sure that the two CMT threads do not hinder each other. Let's move on to the data cache.

Data Cache hit rate

Reducing the data cache from 64KB to 16KB was probably necessary in order to keep the die size of the module under control. (A Bulldozer module is less than 80 mm², while two Magny-Cours cores are good for 115 mm².) However this reduction comes with a price: the data cache suffers twice as many misses as before. Intel's 8-way cache does a bit better, but it is not spectacular. Now let's check out the L2 caches.

L2 Cache hit rate

The very low L2 cache hit rates on the older Opteron and Xeon seem like a fluke but that is not the case. In  the case of Cinebench, don't forget that this benchmark has an extremely low miss rate in the L1 cache, so most of the easy to cache code and data is already there. The relatively high L2 cache miss rate on the Xeon means that 44% of less than 1% misses the L2 cache--or in other words, almost nothing. The data is almost perfectly cache inside the caches and the data cache hit rate is 99.99%. Most of the L2 cache misses are a few hardly used instructions.

The same is true for the relatively bad hit rate of the Opteron 6174 L2 cache in 7-Zip. The Opteron has a higher L1 data cache hit rate than the other CPUs, so the L2 cache is less accessed. The bad L2 hit rate is not the reason for the lower performance of the older Opteron. Which brings us to the final area of analysis....

IPC Analysis Branch Prediction Analysis
Comments Locked

84 Comments

View All Comments

  • thunderising - Wednesday, May 30, 2012 - link

    Glad AMD has "Greater Performance" planned sometime in the future. Wow!
  • themossie - Wednesday, May 30, 2012 - link

    "There are also other factors at play, though, as it's already known that StarCraft II doesn't use more than two cores; theinstead, it's likely the..."

    (feel free to remove comment after fixing this)
  • Nightraptor - Wednesday, May 30, 2012 - link

    I am wondering if it would be possible to compare the processor performance of the Trinity A10 with a underclocked FX-4100 set to the same frequencies (I don't know if it is possible to disable the L3 cache on the FX-4100). This might give us a rough idea of how much the improvements of Piledriver have bought us. Just doing rough math in my head it would seem that they have to be pretty significant given how a FX-4100 compared to the Phenom II X4's (it lost alot, if not most of the time t of the time). The new A10 Trinity's on the other hand seem to win most of the time compared to the old architecture. Given that the A10 is a Piledriver based FX-4XXX series equivalent minus the L3 cache it would seem that Piledriver brought very significant enhancements. Either that or the Phenom II era processors responded much more poorly to the lack of L3 than Piledriver does.
  • coder543 - Wednesday, May 30, 2012 - link

    I was hoping they would be doing the same thing, even though it would be challenging to draw real information out of comparing a desktop processor and a mobile processor.
  • SleepyFE - Wednesday, May 30, 2012 - link

    +1
  • kyuu - Wednesday, May 30, 2012 - link

    +2

    If you can figure out some way to do a comparison and analysis of Piledriver's performance vs. Bulldozer, I think a great many of us are interested to see that. From benchmarks, it seems like Piledriver improved a great deal over Bulldozer, but it's difficult to tell without being able to compare two similar processors.
  • Aone - Wednesday, May 30, 2012 - link

    You can compare A10-4600M or A8-4500M versus mobile Llano or Phenom or even Turion to see tweaked BD is nothing of spectacular. For instance, in most cases A8-4500M (2.3GHz base) loses to Llano A8-3500M (1.5GHz base).
  • Nightraptor - Wednesday, May 30, 2012 - link

    Where are you getting the benchmarks from that a Trinity loses to Llano. In almost all the benchmarks I have been able to find (with the exception of a few) it seems that Trinity beats Llano, hence the original post. If the Piledriver enhancements were very minor I would've expected Trinity (a hacked quad core) to lose to Llano most of the time (a true quad core). This didn't appear to happen - at least not in the anandtech review.
  • Aone - Wednesday, May 30, 2012 - link

    "http://www.notebookcheck.net/AMD-A-Series-A8-4500M...
    Look at "Show comparison chart". Great info!
  • Nightraptor - Thursday, May 31, 2012 - link

    I'm not a big fan of the reliability of that website - They tend to be pretty scant on the test circumstances and configurations. Furthermore I'm curious where they are getting the informaiton for the A8-4500 as to the best of my knowledge the only Trinity in the wild at the moment is the A10 which AMD sent out in a custom made review laptop. All they list is a "K75D Sample".

Log in

Don't have an account? Sign up now