EddyHawk's Info List
---
MicroProcessors
---
INFO SOURCES
---
Martin Malík/HWiNFO/V4.7.x
Tomáš Navrátil/NSSI/V0.5x
Michael Colin/PCConfig/V9.33
http://www.tomshardware.com
http://www.anandtech.com
http://www.x86.org
http://www.amd.com
---
INTEL
---
4004
 note:
  the 1st microprocessor (4bit, 1971)
  commissioned by the Japanese company Busicom for a calculator
8080
 8bit
8080A
8085
note: only for PC
8086 and 8088
 release date: 1978
 16bit
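 word-alignment has no effect for data fetching on the 8088 (8bit bus); on the
  8086 a word at an odd address needs an extra bus cycle
 20 address lines (2^20 = 1Mb mem addressing)
 CISC
 first use of pipelining (instruction prefetch queue)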
 8086 & 8088 are binary compatible, but not pin compatible
 29,000 transistors
 separate FPU (8087)
 8088
  8bit-bus version of 8086
  release date: Jun 1979
 speed variation: 4.7-10Mhz
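 note (illustrative sketch, not from the info sources): the address-line
  arithmetic above in a few lines of Python. The helper name
  addressable_bytes is made up for this example; the 24, 32 and 36 line
  counts are the ones quoted for later processors in this list.

```python
def addressable_bytes(address_lines: int) -> int:
    """Number of distinct byte addresses reachable with this many address lines."""
    return 2 ** address_lines


MB = 2 ** 20  # this list writes it as "Mb", meaning megabytes

# 20 lines = 8086/8088; 24, 32, 36 = later entries in this list
for lines in (20, 24, 32, 36):
    print(f"{lines} address lines -> {addressable_bytes(lines) // MB} Mb")
```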
80186 & 80188
 bug: IDIV
 note: 80188 is 8bit-bus version of 80186
 successful as embedded processor (in hi-performance disk controller)
 chip integration
 fault tolerance protection (tries to trap invalid instructions & recover)
 speed variation: 6-40Mhz
80286
 release: Feb 1982
 more opcodes:
 -ENTER
 -LEAVE
 -PUSH immediate
 -extended IMUL, SHL, SHR
 shift+add is faster than MUL
 protected mode (16bit)
 4 more address lines (2^(20+4) = 16Mb mem addressing)
 134,000 transistors
 separate FPU (80287)
 speed variation: 6-20Mhz
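 note (illustrative sketch): the "shift+add is faster than MUL" trick means
  replacing a multiply by a constant with shifts and adds. The example below
  only shows the arithmetic identity for x*10; the function name is made up,
  and Python integers are not 16bit registers, so this says nothing about
  real cycle counts.

```python
def times10_shift_add(x: int) -> int:
    """Multiply by 10 without a multiply: 10*x = 8*x + 2*x."""
    return (x << 3) + (x << 1)  # (x * 8) + (x * 2)


print(times10_shift_add(7))
```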
80386
 year: 1985/1986
 32bit addressing (max 4Gb mem)
 more opcodes: MOVSD
 enhanced PM (switch with RM w/o resetting processor)
 v86 mode: rm prog can be executed in pm environment
 bug:
 -early version can't switch back to protected mode from real mode
 -32 bit multiply
 -STOSB
 the 3 bugs above only apply to the 16Mhz version & were soon corrected
 -POPAD
  not found until mid 1990 (nearly all Intel's & AMD's 386 DX/SX still have
   the error)
 -INSB
 -MOVSB
 optimize: dec jnz is faster than loop
 PIQ (Prefetch Instruction Queue) ?
 16-40Mhz
 cache
  external L2 cache (on motherboard)
 separate FPU (80387)
 SX
  16bit, max 16Mb mem
  80376: runs exclusively in PM
 SL
  386SX version with internal static core, i.e. its clock can be reduced to
   0 MHz, so that its power consumption is nearly 0 mW.
RapidCAD
 a 2-piece chip set, pin-compatible with 386DX and 387DX, but containing 486
  structures
80486
 year: 1989
 more opcodes: BSWAP
 less clock cycles for most instructions execution
 integrated 80487 fpu
  80bit register
 provision for multitasking
 pipelined execution
 DX (33Mhz)
  8kb L1 cache
 SX
  '486DX without fpu', separate fpu (80487SX) is available for upgrade
   but 486SX is actually an 80486DX with a non-functional (later removed) fpu
   487SX is actually an 80486DX with some pins relocated; it disables the
    486SX electronically & fully replaces it while running :) But the disabled
    486SX still consumes energy & produces heat
   Conclusion:
    using 486SX+487SX = using 2 486DX, one disabling another :)
    this thing is the predecessor of 'OverDrive' :)
 SX/J
  only 16bit datapath
 SL
  power-saving 486SX
 DX2: doubled core clock frequency (50 vs 25, 66 vs 33Mhz)
  1st mcpu whose core clock is higher than its bus clock
 DX4: tripled core clock frequency (100 vs 33Mhz)
  integrated L1 asynchronous cache (16kb)
 DX4 Write Back Enhanced
  more enhancements are power saving structures, enhanced v86,
   page size extensions, cpuid instruction
  note: latest DX4 (or 486-S) is released after early Pentium (mid 1993)
 clock speed: 20-100Mhz
80586 (Pentium, P5)
 bug: FDIV (became widely known in late 1994)
 date: 1993
 5V
 8kb (data) & 8kb (instruction) L1 cache
  separate cache is considered more effective because of the different access
   patterns of instructions & data. instructions tend to be retrieved
   sequentially & reused more frequently
 3.1m transistors
 RISC?
 more instructions: RDMSR, RDTSC, WRMSR, CPUID
  note: CPUID is found on every Intel processor from 1993 on (including some 486)
 superscalar: more than 1 execution unit, so it can execute up to
  2 instructions per clock cycle
 note: Intel tried to conceal many programming enhancements of P5
 speed variation in Mhz: 60,66,90,100,120,133,150,166,180,200,233
 the last with mcpu clock equal to bus clock: P5-60/66Mhz
 P5-150Mhz = P5-133Mhz in disguise
 branch prediction
  at a branch instruction, we can't determine which instruction
   sequence should be prefetched until the jmp is executed. If the processor
   chooses an IQ before the jmp & the chosen IQ is found wrong after the jmp is
   executed, the pipeline must be refilled from a different queue, which is
   costly, especially with a larger PIQ
  branch prediction was then added to try to guess which IQ will be taken,
   based on a record of previous jmps (Branch Target Buffer)
  EdH: branch-prediction is crap! Simply fill the pipeline until locating a
   jmp, wait until the jmp is executed and then continue to fill pipeline
   before another next jmp. it will be faster than branch-prediction but
   without building such a complex mechanism into the mcpu
   this 'enhancement' forces programmers to reassemble their programs to
    avoid misprediction to improve performance, which means more work to do
    than before <sigh> yes, another intel's gift after 64kb limit :)
   i bet a bottle of beer that 'my' solution is simpler (no need for
    research & implementation), faster (no chance of fail, no need to execute
    branch prediction) and compatible (no need to modify the program to be
    'optimized for Pentium' & then modify it again to be 'optimized for
    Pentium II' & so on)
    even the worst of the worst of 'my' solution is just like processor
     without pipeline and that's impossible since no program consists of only
     jmps. most of the time the pipeline will be full. even if 'w/o pipeline'
     state is ever reached, it will ensure mcpu long-life & less-power-needed
     for not doing refill & refill again
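  note (illustrative sketch, not from the info sources): a toy model of the
   two approaches discussed above. It assumes a single branch tracked by one
   2-bit saturating counter, which is a common textbook scheme and not
   necessarily what the P5 does internally; all names are made up. It counts
   how often the predictor guesses wrong on a loop branch, next to the number
   of jmps a wait-for-every-jmp scheme would stall on. How the two compare in
   real cycles depends on the stall and refill penalties, which are not
   modelled here.

```python
def count_mispredictions(outcomes, counter=2):
    """2-bit saturating counter: 0-1 predicts not taken, 2-3 predicts taken."""
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        # move the counter toward the actual outcome, clamped to 0..3
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# a loop-closing jmp: taken 9 times, then falls through; the loop is run 10 times
outcomes = ([True] * 9 + [False]) * 10
branches = len(outcomes)                  # stalls if we always wait for the jmp
misses = count_mispredictions(outcomes)   # pipeline refills with the predictor

print(f"jmps executed: {branches}, mispredicted: {misses}")
```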
 OverDrive: 486-pin-compatible Pentium
 MMX
  2.8V
  32kb L1 cache
  introduced at the beginning of 1997
  MMX (MultiMedia eXtensions)
   57 new SIMD instructions, 64bit register
   register is mapped to FPU register
Pentium Pro
 code name: P6
 date: Nov 1995
 more instruction: CMOVx
 bug: FIST
 first version is unable to outperform Pentium MMX
 30% faster than Pentium
 default slow vidmem write
 RISC
 5.5m transistors (+15.5m for cache)
 4 more address lines
  max 64Gb mem addressing
 16kb (data) & 16kb (instruction) L1 cache
 256/512/1024kb full speed L2 cache on a separate die in the same package
  but very expensive to manufacture
 the last to use 'Socket' (but 'Socket' is used again on P-IV)
 note: Intel tried to conceal most of P6 programming features, but failed
 2 level branch prediction, much more powerful than previous branch
  prediction
  4-bit history
  16 x 2 bit pattern (total 32 bit)
  capable of:
   learning repetitive branch patterns
   handling 16 different patterns
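  note (illustrative sketch, not from the info sources): a minimal two-level
   predictor with the sizes quoted above, i.e. a 4-bit history selecting one
   of 16 two-bit counters. It tracks a single branch only, and the function
   name and starting counter values are assumptions made for this example;
   the real P6 mechanism has more to it. It shows how a repeating
   taken/taken/not-taken pattern is learned after a short warm-up.

```python
def run_two_level(outcomes, history_bits=4):
    """Return a list of booleans: True where the prediction was correct."""
    history = 0                               # last 4 outcomes, 1 = taken
    counters = [1] * (1 << history_bits)      # 16 two-bit counters (0..3)
    mask = (1 << history_bits) - 1
    correct = []
    for taken in outcomes:
        predicted_taken = counters[history] >= 2
        correct.append(predicted_taken == taken)
        # train the counter selected by the current history
        c = counters[history]
        counters[history] = min(c + 1, 3) if taken else max(c - 1, 0)
        # shift the new outcome into the history register
        history = ((history << 1) | int(taken)) & mask
    return correct


pattern = [True, True, False] * 20            # repetitive branch pattern
correct = run_two_level(pattern)
print("wrong guesses:", correct.count(False))
print("all correct after warm-up:", all(correct[6:]))
```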
Pentium II (PII):
 0.25 micron
 4ns SRAM 16kb instruction & 16kb data L1 cache
 bug: FIST
 uses Slot-1 (Pentium Pro incompatible)
 = Pentium Pro (P6) core + MMX
 Klamath: 0.35 micron, 1/2 speed 512kb L2 cache
 Deschutes: 1/2 speed 512kb L2 cache (4.4ns SRAM)
  7.5m transistors
 Dixon (PII Xeon): 256kb full speed on-die L2 cache
  for server
  uses Slot-2
 max 64Gb main mem addressing, but only 512Mb is cacheable due to L2 cache limitation
 note: there are 2 types of L2 cache: ECC & non-ECC
 PII-300- Mhz may have L2 non-ECC cache
 PII-300+ Mhz always have L2 ECC cache
Celeron:
 'Covington'
  stripped Pentium II (the 'castrated one'): no L2 cache
  reportedly slower than a Pentium MMX running at half its clock speed
 'Mendocino' (300A)
  19m transistors
  128kb on-die L2 cache
 'Coppermine -128' (533A)
  28m transistors
  128kb ATC L2 cache (on-die)
  533-766Mhz
   0.18 micron
   66Mhz bus speed
   SSE
  800Mhz
   100Mhz bus speed
 'Tualatin' or? 'Coppermine-T' (mobile Celeron/PIII ?)
  0.13 micron
  1.45-1.475V
  256kb or 512kb L2 cache
  differential clocking (?)
  1.13-1.26Ghz
  133Mhz bus speed
  an overclocked Tualatin comes close to Willamette performance
Pentium III (PIII):
 date: Feb 1999
 up to 28m transistors
 Slot-1
 5 instruction per clock
 essentially Pentium II with new instructions:
  SSE
  processor serial number
 16kb L1 instruction & 16kb L1 data cache
 Katmai
  100Mhz FSB?
  9.5m transistors
 B
  using 133Mhz FSB
 E
  using Coppermine core (0.18 micron, on die & full speed L2 cache)
 EB
  Coppermine with 133Mhz FSB
 Coppermine
  date: 26 Oct 1999
  SpeedStep technology (for mobile mcpu)
   reducing mcpu clock speed to prolong battery life
  based on Katmai core
   plus 256kb full speed on-die L2 cache & 256bit data path
  28m transistors
  0.18 micron
  1.75V
  Giga-Coppermine: 1Ghz
   date: 8 March 2000 (2 days after AMD Giga Athlon)
   weaker FPU than G-Athlon's
 MMX2 or? SSE? or KNI (Katmai New Instructions)
  ISSE (Internet Streaming SIMD (Single Instruction Multiple Data) Extension)
   50 floating point SIMD instructions,
    12 integer SIMD instructions,
    8 cacheability instructions (total 70 instructions)
   8 registers, each is 128bit wide
  speed: ? - 1,066Mhz (failed at 1.13Ghz)
 Cascades (PIII Xeon)
Pentium 4:
 'Willamette'
   date: Nov 2000
   1.7 V
   Socket423
   Socket478?
   42m transistors
   requires dual-channel RDRAM
   slower than PIII at the same clock speed
   speed variant: 1.3-1.7Ghz
   quad pumped (400Mhz) FSB
   NetBurst architecture (to achieve as high as possible clock speed)
    Advanced Dynamic Execution
     Execution Trace Cache (replacing the L1 instruction cache)
      caching decoder results (micro-ops, µOP) instead of x86 instructions
       to avoid stalled pipelining when decoders are busy
      estimated size: 92-96kb
     8kb L1 data cache (to enable very low latencies -> 2 clock cycles)
     Hardware Prefetch
      'guess' next data & 'prefetch' it to cache
     Enhanced Branch Prediction
      to support Execution Trace Cache
    Hyper Pipeline
     no less than 20 stages pipeline
    Rapid Execution Engine
     2 ALUs & AGUs running at double the processor clock
    SSE2
     144 new instructions: double precision floating point & 128bit integer SIMD
 'Northwood'
  date: Jan 2002
  Socket478
  1.3 V
  0.13 micron
  5-10% faster than Willamette
   shrunk Willamette
  copper interconnects
  512kb L2 cache
  533Mhz FSB
  speed: 2-2.2Ghz
 Foster (PIV Xeon)
  SMT, 0-2Mb L3 cache
  speed: 1.4-1.7Ghz
Itanium
 'Merced'
   733Mhz
   too hot to reach higher clock speed
 'McKinley'
  largely developed by HP, may be more successful than 'Merced'
 64bit
 VLIW/EPIC
  requires advanced compiler technology
  not compatible with previous mcpu, compatibility is provided through
   hardware translation
StrongARM
 206Mhz
 note: for embedded system
XScale
 note: for embedded system
Timna
 integrated cpu+chipset -> low cost
 cancelled
Prescott
 800Mhz FSB
Tejas
 1.2Ghz FSB
---
AMD (Advanced Micro Devices)
---
note:
 only for PC
 AMD mcpu requires more cooling because it generates more heat than Intel's
Am386
 1st AMD 32bit x86 mcpu
 speed variation: 16-40Mhz
 max overclocked: 80Mhz
Am486DX4 Write Back Enhanced
 8kb L1 asynchronous cache
 120Mhz
 EdH: this processor in my 2nd PC was first reported as 120Mhz, then as
  95Mhz, by Martin Malík's HWiNFO V4.7.5
Am5x86
 16kb write-back L1 cache, speed >= 133Mhz
 Am5x86-P75
K5 (Krypton)
 16kb L1 instruction cache, 8kb L1 data cache
 Pentium pin-compatible, but late to market & very slow
 uses P-rating, like Cyrix 6x86
K6
 AMD bought NexGen when Atiq Raza/NexGen was finishing the design of Nx686
  the Nx686 core was then used to create K6
 2.8V
 32kb L1 instruction cache, 32kb L1 data cache
 supports MMX
 1st AMD mcpu capable of competing with Intel's
 K6-2 (+3DNow!)
  adding its own (MMX-like/additional MMX) instruction: 3DNow!
  0.25 micron
  64kb L1 cache
  on-board L2 cache
  speed variant: 300-500Mhz
 K6-2+
  0.18 micron
  128kb on-die L2 cache
  on-board L3 cache
  PowerNow!
  speed: 475-550Mhz
  note: intended for laptop, only sold to computer manufacturers & OEM customers
 K6-3 (Sharptooth)
  0.25 micron
  21m transistors
  Super7
  6 instructions per clock
  on-board L3 cache
  3DNow! -> 21 SIMD instructions
  speed variant: 400-450Mhz
  64kb L1 cache, 256kb on-die L2 cache
  100Mhz bus speed
 K6-3+
  0.18 micron
  2.0V
  PowerNow!
  speed: 400-500Mhz
  more overclockable than K6-3
  note: intended for laptop, only sold to computer manufacturers & OEM customers
K7 (Athlon)
 Athlon 1
 22m transistors
 0.25 micron
 SlotA
 9 instructions per clock
  3 integer, 3 floating point, 3 address calculation pipelined units
  3 parallel full x86 decoders
 clock tech: source synchronous (clock forwarding)
 advanced dynamic branch prediction
  EdH: whatever it means :)
 Enhanced 3DNow!
  +19 integer instructions
  +5 DSP instructions
 Supports MMX
 512kb L2 cache (64bit data path) with:
  1/2 clock speed: 500-700Mhz
  2/5 clock speed: 750-850Mhz (0.18 micron)
  1/3 clock speed: 900-1,000Mhz (0.18 micron)
  note: AMD was unable to find faster cache for Athlon
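  note (illustrative sketch, not from the info sources): the L2 dividers
   above, turned into actual cache clocks. The table is copied from this
   entry; the names L2_DIVIDERS and l2_clock_mhz are made up. It shows the
   cache clock staying in roughly the same range while the core clock doubles.

```python
from fractions import Fraction

# (lowest core Mhz, highest core Mhz, L2 clock as a fraction of core clock)
L2_DIVIDERS = [
    (500, 700, Fraction(1, 2)),
    (750, 850, Fraction(2, 5)),
    (900, 1000, Fraction(1, 3)),
]


def l2_clock_mhz(core_mhz: int) -> float:
    """L2 cache clock for a given Athlon core clock, per the table above."""
    for low, high, ratio in L2_DIVIDERS:
        if low <= core_mhz <= high:
            return float(core_mhz * ratio)
    raise ValueError(f"no divider listed for {core_mhz}Mhz")


for core in (500, 700, 750, 850, 900, 1000):
    print(f"core {core}Mhz -> L2 about {l2_clock_mhz(core):.0f}Mhz")
```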
 200Mhz bus speed (DEC Alpha EV6 bus protocol), up to 400+ Mhz
 no thermal protection: vulnerable to sudden thermal death
 Athlon 2 or K75
 0.18 micron
 Giga Athlon: 1Ghz
  date: 6 March 2000 (2 days earlier than Intel Giga-Coppermine)
  better FPU than G-Coppermine
 Thunderbird (Athlon 3)
  SlotA/SocketA
  37m transistors
  6-layer metal copper
  copper interconnect
   but there is an aluminum version of Thunderbird (if < 1Ghz?)
  0.18 micron
  24 entries L1 cache data TLB
  256kb full speed on-die L2 cache, connected with 64bit data path
   4Mbit SRAM
  200Mhz (Athlon-B?) or 266Mhz DDR FSB (Athlon-C)
  Lightning Data Transport
  speed variant: 650-1,400Mhz
  uses G-share algo: branch prediction
  1+ Ghz
   some use a different core, identified by the code AXIA
 Mustang (Athlon Ultra)
  0.18 micron
  copper? interconnect
  optimized Thunderbird
  dual pumped 266Mhz DDR FSB
  512kb/1Mb/2Mb L2 cache
  cancelled? redirected as Thunderbird?
 Palomino (AthlonXP (desktop), AthlonMP (MPS), Athlon 4 (mobile))
  release: Oct? 2001
  1.5V ?
  37.5m transistors
  3-7% faster than Thunderbird
   auto hw data prefetch
   40 entries L1 cache data TLB
   exclusive L1/L2 cache TLB
   removed TLB serializing algo (has bad impact on very large database program)
  reduced power consumption
  heat protection
   lesser heat
   thermal diode
  3DNow! Pro: full SSE implementation
  PowerNow!
   can reduce mcpu clock speed down to 500Mhz to prolong battery life by 30%
  133Mhz bus speed
  speed: 1,333-1,600Mhz (1500+ - 2000+ MODEL number)
 Duron
  Spitfire
   date: June 2000
   1.5-1.6V
   0.18 micron
   25m transistors
   shrunk & optimized Athlon core
   aluminum interconnect
   200Mhz DDR FSB
   64kb instruction & 64kb data L1 cache (synchronous)
   64kb full speed on-die L2 cache, connected with 64bit data path
    (synchronous)
    exclusive cache: data isn't duplicated between L1 & L2 cache
   socketA
   speed variant: 550-950Mhz
   uses G-share algo: branch prediction
   no thermal protection: vulnerable to sudden thermal death
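   note (illustrative sketch, not from the info sources): what the exclusive
    cache buys on Spitfire, using the sizes listed above. The function name
    is made up, and the inclusive case assumes strict inclusion (every L1
    line is also held in L2), which is a simplification.

```python
def unique_capacity_kb(l1_kb: int, l2_kb: int, exclusive: bool) -> int:
    """Amount of distinct data the L1+L2 pair can hold at once."""
    if exclusive:
        # no line is stored twice, so the sizes simply add up
        return l1_kb + l2_kb
    # strict inclusion: L2 also holds a copy of everything in L1,
    # so distinct data is bounded by the L2 size
    return l2_kb


duron_l1 = 64 + 64   # instruction + data L1
duron_l2 = 64

print("exclusive:", unique_capacity_kb(duron_l1, duron_l2, exclusive=True), "kb")
print("inclusive:", unique_capacity_kb(duron_l1, duron_l2, exclusive=False), "kb")
```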
  Morgan
   stripped Palomino
   1.75V
   25.18m transistors
   thermal diode (?)
   pluggable to existing socketA motherboard, only require BIOS update
   speed: 1-1.2Ghz
Appaloosa
 after Morgan
 0.13 micron
Thoroughbred
 after Palomino 
 0.13 micron
Barton
 after Thoroughbred
 0.13 micron SOI
Hammer-line (K8?)
 8th generation
 SSE2
 800Mhz FSB
 x86-64 tech (64bit mcpu)
  direct extension of 32bit architecture
 ClawHammer
  single Barton
  1-2 way MP Barton
 SledgeHammer
  4-8 way MP Barton
  2Ghz?
---
Cyrix
---
note:
 Cyrix mcpu is always CISC design
 on Cyrix/TI/IBM mcpus, enabling the Negate-Lock pin causes previously
  noncacheable locked cycles to be executed as unlocked cycles, so they may be
  cached. This results in a speed increase of up to 10% on machines without
  L2 cache (such as laptops) and up to 5% with 256kb PB cache.
4x86 or Cx486
 first Cyrix
 actually 386-compatible chip
 manufactured by TI
 Cx486DLC
  1kb internal cache
 Cx486SLC
  1kb internal cache
 Cx486S or M6
  486SX-pincompatible chip
  2kb internal cache
 Cx486DX or M7
  486DX-pincompatible chip
  8kb internal cache
  with FPU
5x86 or Cx5x86
 actually 486-compatible chip
 manufactured by TI
 independently optimized by Cyrix (beyond Intel's)
 clock speed: 100, 120Mhz
6x86 or Cx6x86 or M1
 Cyrix designed it, IBM makes it & sells as IBM 6x86, but Cyrix sells
  some of it as Cyrix 6x86
 actually 586 pin-compatible, not 686 (nomenclature)
 uses P-rating, like AMD K5 (because at the same clock speed it is slightly
  faster than Pentium)
 clock speed:
  100Mhz (P120+) 50Mhz bus
  110Mhz (P133+) 55Mhz bus
  120Mhz (P150+) 60Mhz bus
  133Mhz (P166+) 66Mhz bus
  150Mhz (P200+) 75Mhz bus
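 note (illustrative sketch, not from the info sources): the clock table above
  as data, to show that every 6x86 entry runs its core at about 2x the bus
  clock. The numbers are copied from the table; the names CYRIX_6X86 and
  multiplier are made up. The 133/66 entry comes out slightly above 2 only
  because both figures are rounded.

```python
# (P-rating, core clock in Mhz, bus clock in Mhz), copied from the table above
CYRIX_6X86 = [
    ("P120+", 100, 50),
    ("P133+", 110, 55),
    ("P150+", 120, 60),
    ("P166+", 133, 66),
    ("P200+", 150, 75),
]


def multiplier(core_mhz: float, bus_mhz: float) -> float:
    """Core clock divided by bus clock."""
    return core_mhz / bus_mhz


for rating, core, bus in CYRIX_6X86:
    print(f"{rating}: {core}Mhz core / {bus}Mhz bus = x{multiplier(core, bus):.2f}")
```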
MediaGX or Gx86
 date: 1997
 integrated mcpu
 note: around this time Cyrix was acquired by National Semiconductor ?
GXm (Gx86 + MMX)
M II or M2 or 6x86MX
 1st appearance: 1998
 >= 300mhz
 2.8V
 socket 7
 64kb cache
 supports MMX
 equal to 6x86MX but:
  0.25 micron
  at higher clock (PR266)
---
NEC
---
introduced: 1985
note:
 ~20% faster at the same clock speed than Intel's equivalent
 supports 8018(6/8) instructions
 8080 emulation mode (can run Intel 8080 code directly)
 V20
  8088 pin-compatible
 V30
  8086 pin-compatible
---
UMC
---
note: entered the market in the 486 era, but withdrew due to patent infringement claims
U5S or U5SX
 486SX-pincompatible, but slightly faster
SD
 486DX-pincompatible, but slightly faster
SLV
486DX2
SX2
---
TEXAS INSTRUMENTS
---
 based on Cyrix core + their own enhancement
 TI486
  SXL
   Cx486DLC compatible
   8kb internal cache
  Potomac
---
IBM
---
 486SLC
  486SX-compatible
  16bit datapath
  16kb internal cache
 486BL (Blue Lightning)
  internal clock-(doubler/tripler)
  16bit datapath
---
HARRIS, SIEMENS, FUJITSU, KRUGER, HITACHI
---
 OEM/clone producers of 8086 & 80286
---
CENTAUR
---
 C6
note:
 always contain MMX
 Pentium-compatible pin
---
NEXGEN
---
note: start with 386-compatible
Nx586
 32kb internal cache
 uses 386 instruction set
 no (cpuid instruction & fpu)
---
MOTOROLA
---
 68000
note: for Apple Macintosh
 DragonBall
  33Mhz
  MX1
  note: for Palm
---
VIA
---
 note: NSC sold Cyrix to VIA in 1999
 VIA Cyrix III (Joshua)
  Socket370 (compatible with PIII & Celeron)
  support:
   3DNow! using dual pipelined FPU
   MMX
   133Mhz FSB, but supports 66/100Mhz FSB as well
   64kb on-die L1 cache, 256kb on-die L2 cache
   two 7-stage pipeline
   2 level translation buffer
   512-entry BTB
   fuse-repair tech (could double yields & keep cost down)
   speculative execution
   has 'write combining' feature
  PR433-PR533
   0.18 micron
  PR600
   0.15 micron
 'Samuel'
  by: IDT/VIA
  after Joshua
  enhanced FPU
  will end performance-rating (p-rating, PR)
  clock speed: 500-600Mhz
 'Samuel 2' (?)
 'Ezra'
  after Samuel
  will support SSE & 3DNow!
  Socket370 or SocketA
 WinChip
 2A
 3
 C6
 note: special mcpu for Windows(tm)
---
TRANSMETA
---
 Crusoe
  VLIW
  TM3120 400Mhz
   slower than piii 500mhz
  700Mhz
  targeted for low-end market
---
SUN
---
SPARC
 4 instructions per clock
SPARC (newer)
 7 instructions per clock
UltraSPARC-III
 64bit
 600-900Mhz
---
Other or Unknown
---
Zilog Z-80
Rise mP6
Rise mP6 II
NSC Geode GX1
 note: successor of Cyrix's
---
MISC
---
PR or P-Rating or Performance Rating or Pentium? Rating:
 rating invented by AMD and/or Cyrix to show that their lower-clocked
  mcpu is comparable with a higher-clocked Intel Pentium
 example: AMD K5 133Mhz has PR200, which means that this K5 has equal
  performance to an Intel Pentium 200Mhz, even though it only runs at 133Mhz
 AMD no longer used PR for K6 but brought it back in a different form
  (MODEL number) for Athlon Palomino and later
 Cyrix plans to stop PR on Cyrix III 'Samuel'
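 note (illustrative sketch, not from the info sources): what a P-rating
  implies per clock, using the K5 example above. The function name is made
  up, and the ratio is only what the vendor's rating claims, not a measured
  result.

```python
def pr_per_mhz(pr_number: int, actual_mhz: int) -> float:
    """Claimed Pentium-equivalent Mhz delivered per real Mhz."""
    return pr_number / actual_mhz


# AMD K5 at 133Mhz, rated PR200
print(f"K5 PR200: x{pr_per_mhz(200, 133):.2f} per clock vs Pentium (claimed)")
```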
---
EdH: mcpu is getting overwhelming. They are supposed to be general purpose
 but with unnecessary additions of MMX, SSE, SSE2, 3DNow!, etc. People
 wanting good 3D gaming will buy a 3D gfx card. Parties needing heavy-duty
 floating-point capabilities will buy a specialized computer.
 Not to mention rather useless stuff like (advanced dynamic) (branch
 prediction/execution), rapid execution engine, netburst, etc
 Tell me, do you want a newer mcpu to run your software faster or a newer mcpu
 that forces you to reoptimize your software? larger (V/U) pipeline and branch
 prediction are supposed to improve performance immediately, but why do
 programmers still invest a lot of time rewriting their code to avoid
 misprediction, to fit in the mcpu cache, and so on?