IBM

POWER7 & Power PC Processor Cores

IBM’s Next Generation Server Processor

POWER7 Core
The new POWER7 core has a total of 12 execution units, including two fixed-point pipelines bit-aligned to the two load store unit (LSU) pipes. For POWER7 we expanded the capability of the load store pipes to execute simple fixed-point instructions. The core has four merged vector and scalar FPU pipelines capable of four double-precision multiply-add operations per cycle, or 8 flops per cycle. The instruction fetch unit (IFU) also executes branch and condition register instructions, and contains our branch prediction logic, which has been improved for POWER7. The decimal floating-point unit, introduced in POWER6 to accelerate commercial applications, is also part of this core. POWER7 features a more flexible instruction sequencing unit (ISU) capable of dispatching 6 instructions per cycle, including 2 branches, and issuing up to 8 instructions per cycle. In POWER7 we took advantage of out-of-order execution to switch from a dedicated recovery unit to a distributed one that uses the branch redirect capability of the out-of-order machine. Looking at the cache hierarchy, you can see the 32KB I-cache along with the 32KB D-cache. Each is serviced by the tightly coupled 256KB L2 cache, which is also bit-aligned to the load store unit pipes.
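The 8 flops per cycle figure follows from counting each multiply-add as two floating-point operations. A minimal sketch of that arithmetic is below; the pipeline count comes from the text, while the clock frequency passed to `peak_gflops` is a hypothetical value chosen only for illustration and is not stated in this document.

```python
# Per-core figures taken from the text above.
FP_PIPELINES = 4      # merged vector/scalar FPU pipelines
FLOPS_PER_FMA = 2     # one multiply plus one add per fused operation


def peak_flops_per_cycle(pipelines=FP_PIPELINES, flops_per_fma=FLOPS_PER_FMA):
    """Peak double-precision flops one core can retire per cycle."""
    return pipelines * flops_per_fma


def peak_gflops(cores, clock_ghz):
    """Theoretical peak GFLOPS for a chip; clock_ghz is an assumed input."""
    return cores * peak_flops_per_cycle() * clock_ghz


if __name__ == "__main__":
    print(peak_flops_per_cycle())                # 8
    # 4.0 GHz is a hypothetical clock, not a figure from the text.
    print(peak_gflops(cores=8, clock_ghz=4.0))   # 256.0
```

This is a theoretical ceiling only; sustained throughput depends on instruction mix and memory behavior.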

POWER7 Processor Chip
POWER7 is fabricated in IBM's 45nm silicon-on-insulator technology using copper interconnect and embedded DRAM (eDRAM) for the L3. The chip is 567 mm² and contains 1.2B transistors. However, considering that each eDRAM cell has the function of a 6T SRAM cell, the chip has the equivalent function of a 2.7B transistor chip.

eDRAM is one of the key innovations on the POWER7 chip. It allows IBM to bring the large 32MB shared L3 cache on chip, which has multiple advantages. First, it reduces latency: by eliminating off-chip driver and receiver delays, the L3 can be accessed in as little as 1/6 the cycles of our previous-generation off-chip cache. Second, it improves L3 bandwidth per core: using on-chip interconnect with 11 levels of metal, we can provide each core with 32B buses to and from the L3, providing twice the bandwidth per core of an off-chip cache. Third, it requires less power, because eDRAM uses 1/5 the standby power of a traditional SRAM cell and no longer has power-hungry off-chip drivers in the access path. Fourth, eDRAM takes 1/3 the area of traditional SRAM. The large eDRAM L3 cache also acts as a filter, reducing the per-core traffic to the memory and SMP interconnect. In summary, eDRAM provides 1/6 the latency, twice the bandwidth, and 1/5 the standby power in 1/3 the area.

The chip, as you can see, has 8 processor cores, each with twelve execution units and capable of running 4-way SMT. To feed the eight high-performance cores, the chip has two memory controllers, one on each side of the chip. Each memory controller supports 4 channels of DDR3 memory. Combined, these 8 channels provide 100 GB/s of sustained memory bandwidth. On the top and bottom of the chip are IBM's SMP links, providing 360 GB/s of coherency bandwidth to make SMP systems scalable to 32 sockets.
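To put the chip-level numbers in per-core terms, the sketch below simply divides the figures quoted above. It assumes the L3 capacity and the sustained memory bandwidth are shared evenly across cores and channels, which is a simplification for illustration and not a statement about how the hardware actually arbitrates them. The function name is hypothetical.

```python
# Chip-level figures quoted in the text above.
CORES = 8
MEMORY_CONTROLLERS = 2
CHANNELS_PER_CONTROLLER = 4
SUSTAINED_MEMORY_BW_GBS = 100.0   # GB/s across all channels
L3_CAPACITY_MB = 32.0


def per_unit_shares():
    """Divide chip-level resources evenly (an assumed, idealized split)."""
    channels = MEMORY_CONTROLLERS * CHANNELS_PER_CONTROLLER
    return {
        "channels": channels,
        "bandwidth_per_channel_gbs": SUSTAINED_MEMORY_BW_GBS / channels,
        "bandwidth_per_core_gbs": SUSTAINED_MEMORY_BW_GBS / CORES,
        "l3_per_core_mb": L3_CAPACITY_MB / CORES,
    }


if __name__ == "__main__":
    for name, value in per_unit_shares().items():
        print(f"{name}: {value}")
```

Under that even-split assumption, each channel and each core sees 12.5 GB/s of sustained memory bandwidth, and each core has a 4MB share of the L3.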

This 7th-generation Power Architecture® chip adds a balanced multi-core design, eDRAM technology, and SMT4 to the POWER innovation portfolio, providing over 4 times the performance in the same power envelope as the previous generation. The balanced design allows the chip to scale from single-socket blades to 32-socket, 1024-thread high-end systems. POWER7 is the building block for the clustered peta-scale PERCS and NCSA Blue Waters projects. Pictured on the top right is one of IBM's entry system cards featuring the POWER7 core. In this single memory controller configuration, the eight DIMM slots are seen on the right and the processor is hiding under the copper heat sink on the left. The 4 memory buffer chips are between the processor and the DIMMs, with the SMP connector at the bottom. All three operating systems, AIX, Linux, and IBM's I-series operating system, are operational and running in multiple system configurations. POWER7 system general availability will be early 2010.
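The 1024-thread figure is the product of sockets, cores per chip, and SMT ways. A short sketch of that calculation follows; the function name and the list of intermediate socket counts are illustrative choices, not configurations named in the text.

```python
def hardware_threads(sockets, cores_per_socket=8, smt_ways=4):
    """Hardware threads in a system built from 8-core, SMT4 chips."""
    return sockets * cores_per_socket * smt_ways


if __name__ == "__main__":
    # Socket counts other than 1 and 32 are shown only for illustration.
    for sockets in (1, 2, 4, 8, 16, 32):
        print(f"{sockets:2d} sockets -> {hardware_threads(sockets):4d} threads")
```

A single-socket blade exposes 32 hardware threads, and the 32-socket high-end system reaches 1024.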

To combat the classic power management trade-off of wake-up latency versus power reduction for architected idle modes, the POWER architecture defines 4 idle modes of increasing depth: doze, nap, sleep, and rvwinkle (power gating). POWER7 chose to implement two knee-of-the-curve design points (see above). To optimize for best wake-up time, we implemented nap mode. In nap mode, clocks are turned off to the execution units and frequency is reduced, but caches and TLBs remain coherent to reduce wake-up time. For optimal power reduction at the expense of some wake-up latency, POWER7 implemented a sleep function.

Sleep provides almost all the power reduction of power gating, but eliminates the wake-up penalty of re-initializing configuration registers. In sleep, the caches are purged and all the clocks are turned off. This allows us to drop the voltage below Vmin to a retention level called Vretention. At Vretention the 45nm CMOS chip has almost no leakage current, but still retains state when the voltage is ramped back up and the clocks are turned on again.
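The trade-off between nap and sleep can be illustrated with a small selection routine: pick the deepest mode whose wake-up latency the workload can tolerate. This is only a sketch of the general idea, not POWER7 firmware or operating system logic. The `IdleMode` type, the `choose_idle_mode` function, and every power and latency number below are hypothetical placeholders, since the text gives no concrete values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleMode:
    name: str
    relative_power: float      # fraction of idle power still drawn (hypothetical)
    wakeup_latency_us: float   # time to resume execution (hypothetical)


# The two modes POWER7 implements, with made-up illustrative numbers.
MODES = [
    IdleMode("nap", relative_power=0.6, wakeup_latency_us=1.0),
    IdleMode("sleep", relative_power=0.1, wakeup_latency_us=1000.0),
]


def choose_idle_mode(max_wakeup_latency_us, modes=MODES):
    """Return the lowest-power mode that wakes up fast enough, or None."""
    eligible = [m for m in modes if m.wakeup_latency_us <= max_wakeup_latency_us]
    if not eligible:
        return None  # no idle mode meets the latency bound; stay active
    return min(eligible, key=lambda m: m.relative_power)


if __name__ == "__main__":
    for budget_us in (0.5, 10.0, 5000.0):
        mode = choose_idle_mode(budget_us)
        print(budget_us, mode.name if mode else "stay active")
```

With these placeholder numbers, a tight latency budget selects nap and a relaxed one selects sleep, mirroring the two design points described above.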

This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002.

Contact Information

IBM

One Orchard Road
Armonk, NY, 10504
USA

toll-free: 888.SHOP.IBM
www.ibm.com
