EddyHawk's Info List

--- MicroProcessors ---

--- INFO SOURCES ---
Martin Malix/HWiNFO/V4.7.x
Tomáš Navratil/NSSI/V0.5x
Michael Colin/PCConfig/V9.33
http://www.tomshardware.com
http://www.anandtech.com
http://www.x86.org
http://www.amd.com

--- INTEL ---
4004
  note: the 1st microprocessor, requested by a Japanese company for a calculator (?)
8080
  8bit
8080A
8085
note: only for PC
8086 and 8088
  release date: 1978
  16bit
  word-alignment has no effect for data fetching
  20 address lines (2^20 = 1MB mem addressing)
  CISC
  first use of pipelining
  8086 & 8088 are binary compatible, but not pin compatible
  29,000 transistors
  separate FPU (8087)
8088
  8bit-bus version of 8086
  release date: Jun 1979
  speed variation: 4.7-10MHz
80186 & 80188
  bug: IDIV
  note: 80188 is 8bit-bus version of 80186
  successful as embedded processor (in hi-performance disk controller)
  chip integration
  fault tolerance protection (tries to trap an invalid instruction & recover from it)
  speed variation: 6-40MHz
80286
  release: Feb 1982
  more opcodes:
    -ENTER
    -LEAVE
    -PUSH immediate
    -extended IMUL, SHL, SHR
  shift+add is faster than MUL
  protected mode (16bit)
  4 more address lines (2^(20+4) = 16MB mem addressing)
  134,000 transistors
  separate FPU (80287)
  speed variation: 6-20MHz
80386
  year: 1985/1986
  32bit addressing (max 4GB mem)
  more opcodes: MOVSD
  enhanced PM (switch with RM w/o resetting processor)
  v86 mode: RM prog can be executed in PM environment
  bug:
    -early version can't switch back to protected mode from real mode
    -32 bit multiply
    -STOSB
    3 above bugs only apply to 16MHz version & were soon corrected
    -POPAD
      not found until mid 1990 (nearly all Intel's & AMD's 386 DX/SX still have the error)
    -INSB
    -MOVSB
  optimize: dec jnz is faster than loop
  PIQ (Prefetch Instruction Queue) ?
  16-40MHz
  cache
    external L2 cache (on motherboard)
  separate FPU (80387)
  SX
    16bit, max 16MB mem
  80376: runs exclusively in PM
  SL
    386SX version with internal static core, i.e. its clock can be reduced to 0 MHz, so that its power consumption is nearly 0 mW.
  RapidCAD
    a 2-piece chip, pin-compatible with 386DX and 387DX, but containing 486 structures
80486
  year: 1989
  more opcodes: BSWAP
  fewer clock cycles for most instruction execution
  integrated 80487 FPU
    80bit register
  provision for multitasking
  pipelined execution
  DX (33MHz)
    8KB L1 cache
  SX
    '486DX without FPU', separate FPU (80487SX) is available for upgrade
    but 486SX is actually 80486DX with non-functional/later removed FPU
    487SX is actually 80486DX with some pins relocated,
    it will disable 486SX electronically & fully replace it while running :)
    But the disabled 486SX still consumes energy & causes heat
    Conclusion: using 486SX+487SX = using 2 486DX, one disabling another :)
    this thing is the predecessor of 'OverDrive' :)
  SX/J
    only 16bit datapath
  SL
    power-saving 486SX
  DX2:
    doubled core clock frequency (50 vs 25, 66 vs 33MHz)
    1st mcpu whose mcpu clock is higher than bus clock
  DX4:
    tripled core clock frequency (100 vs 33MHz)
    integrated L1 asynchronous cache (16KB)
  DX4 Write Back Enhanced
    more enhancements are power saving structures, enhanced v86, page size extensions, cpuid instruction
  note: latest DX4 (or 486-S) was released after early Pentium (mid 1993)
  clock speed: 20-100MHz
80586 (Pentium, P5)
  bug: FDIV (found in Dec 1994/Jan 1995)
  date: 1993
  5V
  8KB (data) & 8KB (instruction) L1 cache
    separate caches are considered more effective because of the different access patterns of instructions & data.
    instructions tend to be retrieved sequentially & are more frequently reused
  3.1m transistors
  RISC?
  more instructions: RDMSR, RDTSC, WRMSR, CPUID
    note: CPUID is found on any Intel processor from 1993 on (including some 486)
  superscalar: more than 1 execution unit:
    can execute more than 1 instruction
    up to 2 instructions per clock cycle
  note: Intel tried to conceal many programming enhancements of P5
  speed variation in MHz: 60,66,90,100,120,133,150,166,180,200,233
    the last with mcpu clock equal to bus clock: P5-60/66MHz
    P5-150MHz = P5-133MHz in disguise
  branch prediction
    at a branch instruction, we can't determine which instruction sequence should be prefetched until the jmp is executed. If the processor chooses an IQ before the jmp & the chosen IQ is found wrong after the jmp is executed, we must refill the pipeline with a different queue, which is costly, especially on a larger PIQ
    so branch prediction is added to try to guess which IQ will be taken, based on a record of previous jmps (Branch Target Buffer)
    EdH: branch-prediction is crap! Simply fill the pipeline until locating a jmp, wait until the jmp is executed and then continue to fill the pipeline until the next jmp. it will be faster than branch-prediction but without building such a complex mechanism into the mcpu
    this 'enhancement' forces programmers to reassemble their programs to avoid misprediction to improve performance, which means more work to do than before
    yes, another intel's gift after the 64KB limit :)
    i bet a bottle of beer that 'my' solution is simpler (no need for research & implementation), faster (no chance of fail, no need to execute branch prediction) and compatible (no need to modify the program to be 'optimized for Pentium' & then modify it again to be 'optimized for Pentium II' & so on)
    even the worst of the worst of 'my' solution is just like a processor without pipeline and that's impossible since no program consists of only jmps. most of the time the pipeline will be full.
    even if the 'w/o pipeline' state is ever reached, it will ensure mcpu long-life & less-power-needed for not doing refill & refill again
  OverDrive: 486-pin-compatible
Pentium MMX
  2.8V
  32KB L1 cache
  introduced at the beginning of 1997
  MMX (MultiMedia eXtensions)
    57 new SIMD instructions, 64bit register
    register is mapped to FPU register
Pentium Pro
  code name: P6
  date: Nov 1995
  more instruction: CMOVx
  bug: FIST
  first version is unable to outperform Pentium MMX
  30% faster than Pentium
  default slow vidmem write
  RISC
  5.5m transistors (+15.5m for cache)
  4 more address lines
    max 64GB mem addressing
  16KB (data) & 16KB (instruction) L1 cache
  256/512/1024KB on-die L2 cache
    but very expensive to manufacture
  the last to use 'Socket' (but 'Socket' is used again on P-IV)
  note: Intel tried to conceal most of P6 programming features, but failed
  2 level branch prediction, much more powerful than previous branch prediction
    4-bit history
    16 x 2 bit pattern (total 32 bit)
    capable of learning repetitive branch patterns
    handles 16 different patterns
Pentium II (PII):
  0.25 micron
  4ns SRAM
  16KB instruction & 16KB data L1 cache
  bug: FIST
  uses Slot-1 (Pentium Pro incompatible)
  = Pentium Pro (P6) core + MMX
  Klamath:
    no L2 cache
  Deschutes:
    1/2 speed 512KB L2 cache (4.4ns SRAM)
    7.5m transistors
  Dixon (PII Xeon):
    256KB full speed on-die L2 cache
    for server
    uses Slot-2
    max 64GB main mem addressing, but actually only max 512MB due to cache limitation
  note: there are 2 types of L2 cache: ECC & non-ECC
    PII-300- MHz may have L2 non-ECC cache
    PII-300+ MHz always have L2 ECC cache
Celeron:
  'Covington'
    stripped Pentium II (the 'castrated one'): no L2 cache
    reportedly slower than 'half of its speed' Pentium MMX
  'Mendocino' (300A)
    19m transistors
    128KB on-die L2 cache
  'Coppermine-128' (533A)
    28m transistors
    128KB ATC L2 cache (on-die)
    533-766MHz
      0.18 micron
      66MHz bus speed
      SSE
    800MHz
      100MHz bus speed
  'Tualatin' or? 'Coppermine-T' (mobile Celeron/PIII ?)
    0.13 micron
    1.45-1.475V
    256KB or 512KB L2 cache
    differential clocking (?)
    1.13-1.26GHz
    133MHz bus speed
    overclocked Tualatin is so close to Willamette performance
Pentium III (PIII):
  date: Feb 1999
  up to 28m transistors
  Slot-1
  5 instructions per clock
  essentially Pentium II with new instructions: SSE
  processor serial number
  16KB L1 instruction & 16KB L1 data cache
  Katmai
    100MHz FSB?
    9.5m transistors
  B
    using 133MHz FSB
  E
    using Coppermine core (0.18 micron, on die & full speed L2 cache)
  EB
    Coppermine with 133MHz FSB
  Coppermine
    date: 26 Oct 1999
    SpeedStep technology (for mobile mcpu)
      reducing mcpu clock speed to prolong battery life
    based on Katmai core plus 256KB full speed on-die L2 cache & 256bit data path
    28m transistors
    0.18 micron
    1.75V
    Giga-Coppermine: 1GHz
      date: 8 March 2000 (2 days after AMD Giga Athlon)
      lesser FPU than G-Athlon's
  MMX2 or? SSE? or KNI (Katmai New Instructions)
    ISSE (Internet Streaming SIMD (Single Instruction Multiple Data) Extension)
    50 floating point SIMD instructions, 12 integer SIMD instructions, 8 cacheability instructions (total 70 instructions)
    8 registers, each is 128bit wide
  speed: ? - 1,066MHz (failed at 1.13GHz)
  Cascades (PIII Xeon)
Pentium 4:
  'Willamette'
    date: Nov 2000
    1.7 V
    Socket423
    Socket478?
    42m transistors
    requires dual-channel RDRAM
    slower than PIII
    speed variant: 1.3-1.7GHz
    quad pumped (400MHz) FSB
    NetBurst architecture (to achieve as high as possible clock speed)
      Advanced Dynamic Execution
      Execution Trace Cache (replacing L1 Instruction Trace Cache)
        caching decoder result (mOP) instead of x86 instructions to avoid stalled pipelining when decoders are busy
        estimated size: 92-96KB
      8KB L1 data cache (to enable very low latencies -> 2 clock cycles)
      Hardware Prefetch
        'guess' next data & 'prefetch' it to cache
      Enhanced Branch Prediction
        to support Execution Trace Cache
      Hyper Pipeline
        no less than 20 stages pipeline
      Rapid Execution Engine
        2 'double of processor clock' ALUs & AGUs
      SSE2
        144 double precision floating point SIMD 128bit instructions
  'Northwood'
    date: Sep 2001
    Socket478
    1.3 V
    0.13 micron
    5-10% faster than Willamette
    shrunk Willamette
    copper interconnects
    512KB L2 cache
    533MHz FSB
    speed: 2-2.2GHz
  Foster (PIV Xeon)
    SMT, 0-2MB L3 cache
    speed: 1.4-1.7GHz
Itanium
  'Merced'
    733MHz
    too hot to reach higher clock speed
  'McKinley'
    largely developed by HP, may be more successful than 'Merced'
  64bit
  VLIW/EPIC
    requires advanced software compiler technology
  not compatible with previous mcpu, compatibility is provided through hardware translation
StrongARM
  206MHz
  note: for embedded system
XScale
  note: for embedded system
Timna
  integrated cpu+chipset -> low cost
  cancelled
Prescott
  800MHz FSB
Tejas
  1.2GHz FSB

--- AMD (Advanced Micro Devices) ---
note: only for PC
AMD mcpu requires more cooling because it generates more heat than Intel's
Am386
  1st AMD product
  speed variation: 16-40MHz
  max overclocked: 80MHz
Am486DX4 Write Back Enhanced
  8KB L1 asynchronous cache
  120MHz
  EdH: this processor (in my 2nd cpu) was 1st reported as 120MHz, & 2nd reported as 95MHz by Martin Malix's HWiNFO V4.7.5
Am5x86
  16KB write-back L1 cache, speed >= 133MHz
  Am5x86-P75
K5 (Krypton)
  8KB L1 instruction cache, 16KB L1 data cache
  Pentium pin-compatible, but late to market & very slow
  P-rating user
    beside Cyrix 6x86
K6
  AMD bought NexGen when Atiq Raza/NexGen finished the design of Nx686
  Nx686 core was then used to create K6
  2.8V
  32KB L1 instruction cache, 32KB L1 data cache
  supports MMX
  1st AMD mcpu capable of competing with Intel's
K6-2 (+3DNow!)
  adding its own (MMX-like/additional MMX) instructions: 3DNow!
  0.25 micron
  64KB L1 cache
  on-board L2 cache
  speed variant: 300-500MHz
K6-2+
  0.18 micron
  128KB on-die L2 cache
  on-board L3 cache
  PowerNow!
  speed: 475-550MHz
  note: intended for laptop, only sold to computer manufacturers & OEM customers
K6-3 (Sharptooth)
  0.25 micron
  21m transistors
  Super7
  6 instructions per clock
  on-board L3 cache
  3DNow! -> 21 SIMD instructions
  speed variant: 400-450MHz
  64KB L1 cache, 256KB on-die L2 cache
  100MHz bus speed
K6-3+
  0.18 micron
  2.0V
  PowerNow!
  speed: 400-500MHz
  more overclockable than K6-3
  note: intended for laptop, only sold to computer manufacturers & OEM customers
K7 (Athlon)
  Athlon 1
    22m transistors
    0.25 micron
    SlotA
    9 instructions per clock
      3 integer, 3 floating point, 3 address calculation pipelined units
    3 parallel full x86 decoders
    clock tech: source synchronous (clock forwarding)
    advanced dynamic branch prediction
      EdH: whatever it means :)
    Enhanced 3DNow!
      +19 integer instructions
      +5 DSP instructions
    Supports MMX
    512KB L2 cache (64bit data path) with:
      1/2 clock speed: 500-700MHz
      2/5 clock speed: 750-850MHz (0.18 micron)
      1/3 clock speed: 900-1,000MHz (0.18 micron)
      note: AMD was unable to find faster cache for Athlon
    200MHz bus speed (DEC Alpha EV6 bus protocol), up to 400+ MHz
    no thermal protection: vulnerable to sudden thermal death
  Athlon 2 or K75
    0.18 micron
    Giga Athlon: 1GHz
      date: 6 March 2000 (2 days earlier than Intel Giga-Coppermine)
      better FPU than G-Coppermine
  Thunderbird (Athlon 3)
    SlotA/SocketA
    37m transistors
    6-layer metal
    copper interconnect
      but there are aluminum versions of Thunderbird (if < 1GHz?)
    0.18 micron
    24 entries L1 cache data TLB
    256KB full speed on-die L2 cache, connected with 64bit data path
      4Mbit SRAM
    200MHz (Athlon-B?) or 266MHz DDR FSB (Athlon-C)
    Lightning Data Transport
    speed variant: 650-1,400MHz
    uses G-share algo: branch prediction
    1+ GHz
      some use different core, using code called AXIA
  Mustang (Athlon Ultra)
    0.18 micron
    copper? interconnect
    optimized Thunderbird
    dual pumped 266MHz DDR FSB
    512KB/1MB/2MB L2 cache
    cancelled? redirected as Thunderbird?
  Palomino (AthlonXP (desktop), AthlonMP (MPS), Athlon 4 (mobile))
    release: Oct? 2001
    1.5V ?
    37.5m transistors
    3-7% faster than Thunderbird
    auto hw data prefetch
    40 entries L1 cache data TLB
    exclusive L1/L2 cache TLB
    removed TLB serializing algo (has bad impact on very large database programs)
    reduced power consumption
    heat protection
      lesser heat
      thermal diode
    3DNow! Pro: full SSE implementation
    PowerNow!
      capable of reducing mcpu clock speed down to 500MHz to prolong battery life by 30%
    133MHz bus speed
    speed: 1,333-1,600MHz (1500+ - 2000+ MODEL number)
  Duron
    Spitfire
      date: June 2000
      1.5-1.6V
      0.18 micron
      25m transistors
      shrunk & optimized Athlon core
      aluminum interconnect
      200MHz DDR FSB
      64KB instruction & 64KB data L1 cache (synchronous)
      64KB full speed on-die L2 cache, connected with 256bit data path (synchronous)
      exclusive cache: data isn't duplicated between L1 & L2 cache
      socketA
      speed variant: 550-950MHz
      uses G-share algo: branch prediction
      no thermal protection: vulnerable to sudden thermal death
    Morgan
      stripped Palomino
      1.75V
      25.18m transistors
      thermal diode (?)
      pluggable into existing socketA motherboard, only requires BIOS update
      speed: 1-1.2GHz
    Appaloosa
      after Morgan
      0.13 micron
  Thoroughbred
    after Palomino
    0.13 micron
  Barton
    after Thoroughbred
    0.13 micron
    SOI
Hammer-line (K8?)
  8th generation
  SSE2
  800MHz FSB
  x86-64 tech (64bit mcpu)
    direct extension of 32bit architecture
  ClawHammer
    single Barton
    1-2 way MP Barton
  SledgeHammer
    4-8 way MP Barton
    2GHz?
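
note: the Thunderbird and Duron entries above only say "uses G-share algo: branch prediction". As a rough illustration of what that term usually means, here is a minimal Python sketch of the general gshare idea: a table of 2-bit saturating counters indexed by the branch address XORed with a global history of recent outcomes. It is an assumption-laden toy, not AMD's actual circuit; the class name, the 10-bit table size and the branch address 0x40 are all hypothetical.

```python
class GsharePredictor:
    """Toy gshare sketch: 2-bit counters indexed by (pc XOR global history).

    Table size and initial counter state are arbitrary choices for the
    illustration, not figures taken from any real Athlon or Duron.
    """

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # last outcomes, 1 bit each
        self.counters = [1] * (1 << index_bits)   # 1 = weakly not-taken

    def _index(self, pc):
        # The XOR is the defining step of gshare: the same branch seen with a
        # different recent history lands on a different counter.
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the real outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


def accuracy(predictor, pc, outcomes):
    """Fraction of outcomes predicted correctly for one branch address."""
    hits = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            hits += 1
        predictor.update(pc, taken)
    return hits / len(outcomes)


if __name__ == "__main__":
    # A loop-like branch: taken three times, then not taken, repeated.
    pattern = [True, True, True, False] * 100
    print(accuracy(GsharePredictor(), 0x40, pattern))
```

After a short warm-up the history alone identifies where in the repeating pattern the branch is, so the toy predictor gets nearly every outcome of this loop-like branch right.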

--- Cyrix ---
note: Cyrix mcpu is always CISC design
on Cyrix/TI/IBM mcpus, enabling the Negate-Lock pin causes previously noncacheable locked cycles to be executed as unlocked cycles, which may therefore be cached. This results in a speed increase of up to 10% on machines without L2 cache, such as laptops, and up to 5% with 256KB PB cache.
4x86 or Cx486
  first Cyrix
  actually 386-compatible chip
  manufactured by TI
  Cx486DLC
    1KB internal cache
  Cx486SLC
    1KB internal cache
  Cx486S or M6
    486SX-pin-compatible chip
    2KB internal cache
  Cx486DX or M7
    486DX-pin-compatible chip
    8KB internal cache
    with FPU
5x86 or Cx5x86
  actually 486-compatible chip
  manufactured by TI
  independently optimized by Cyrix (beyond Intel's)
  clock speed: 100, 120MHz
6x86 or Cx6x86 or M1
  Cyrix designed it, IBM makes it & sells it as IBM 6x86, but Cyrix sells some of it as Cyrix 6x86
  actually 586 pin-compatible, not 686 (nomenclature)
  P-rating user beside AMD K5 (because at the same clock speed it is slightly faster than Pentium)
  clock speed:
    100MHz (P120+) 50MHz bus
    110MHz (P133+) 55MHz bus
    120MHz (P150+) 60MHz bus
    133MHz (P166+) 66MHz bus
    150MHz (P200+) 75MHz bus
MediaGX or Gx86
  date: 1997
  integrated mcpu
  note: this time Cyrix is acquired by National Semiconductor ?
  GXm (Gx86 + MMX)
M II or M2 or 6x86MX
  1st appearance: 1998
  >= 300MHz
  2.8V
  socket 7
  64KB cache
  supports MMX
  equal to 6x86MX but:
    0.25 micron
    at higher clock (PR266)

--- NEC ---
introduced: 1985
note: ~20% faster at the same clock speed than Intel's equivalent
supports 8018(6/8) instructions
Z-80 mode (can run Z-80 code directly)
V20
  8088 pin-compatible
V30
  8086 pin-compatible

--- UMC ---
note: entered the market in the 486 era, but withdrew due to infringement
U5S or U5SX
  486SX-pin-compatible, but slightly faster
SD
  486DX-pin-compatible, but slightly faster
SLV
486DX2
SX2

--- TEXAS INSTRUMENTS ---
based on Cyrix core + their own enhancement
TI486 SXL
  Cx486DLC compatible
  8KB internal cache
Potomac

--- IBM ---
486SLC
  486SX-compatible
  16bit datapath
  16KB internal cache
486BL (Blue Lightning)
  internal clock-(doubler/tripler)
  16bit datapath

--- HARRIS, SIEMENS, FUJITSU, KRUGER, HITACHI ---
OEM/clone producers of 8086 & 80286

--- CENTAUR ---
C6
  note: always contains MMX
  Pentium-compatible pin

--- NEXGEN ---
note: started with 386-compatible
Nx586
  32KB internal cache
  uses 386 instruction set
  no (cpuid instruction & fpu)

--- MOTOROLA ---
68000
  note: for Apple Macintosh
DragonBall
  33MHz
  MX1
  note: for Palm

--- VIA ---
note: NSC sold Cyrix to VIA in 1999
VIA Cyrix III (Joshua)
  Socket370 (compatible with PIII & Celeron)
  support:
    3DNow! using dual pipelined FPU
    MMX
  133MHz FSB, but supports 66/100MHz FSB as well
  64KB on-die L1 cache, 256KB on-die L2 cache
  two 7-stage pipelines
  2 level translation buffer
  512-entry BTB
  fuse-repair tech (could double yields & keep cost down)
  speculative execution
  has 'write combining' feature
  PR433-PR533
    0.18 micron
  PR600
    0.15 micron
'Samuel'
  by: IDT/VIA
  after Joshua
  enhanced FPU
  will end performance-rating (p-rating, PR)
  clock speed: 500-600MHz
'Samuel 2' (?)
'Ezra'
  after Samuel
  will support SSE & 3DNow!
  Socket370 or SocketA
WinChip
  2A
  3
  C6
  note: special mcpu for Windows(tm)

--- TRANSMETA ---
Crusoe
  VLIW
  TM3120
    400MHz
    slower than PIII 500MHz
  700MHz
  targeted for low-end market

--- SUN ---
SPARC
  4 instructions per clock
SPARC (newer)
  7 instructions per clock
UltraSPARC-III
  64bit
  600-900MHz

--- Other or Unknown ---
Zilog Z-80
Rise mP6
Rise mP6 II
NSC Geode GX1
  note: successor of Cyrix's

--- MISC ---
PR or P-Rating or Performance Rating or Pentium? Rating:
  rating invented by AMD and/or Cyrix to show that their lower-clock-speed mcpu is comparable with Intel's Pentium at a higher clock speed
  example: AMD K5 133MHz has PR200, which means that this K5 has equal performance to an Intel Pentium 200MHz, even though it only runs at 133MHz
  AMD no longer uses PR for K6 but returned to it in a different form (MODEL number) for Athlon Palomino and later
  Cyrix plans to stop PR on Cyrix III 'Samuel'

---
EdH:
mcpu is getting overwhelming. They are supposed to be general purpose but with unnecessary addition of MMX, SSE, SSE2, 3DNow!, etc. People wanting good 3D gaming will buy a 3D gfx card. Parties needing heavy-duty floating-point capabilities will buy a specialized computer. Not to mention rather useless stuff like (advanced dynamic) (branch prediction/execution), rapid execution engine, netburst, etc
Tell me, do you want a newer mcpu to run your software faster, or a newer mcpu that forces you to reoptimize your software?
larger (V/U) pipelines and branch prediction are supposed to improve performance immediately, so why do programmers still invest a lot of time rewriting their code to avoid misprediction, to fit in the mcpu cache, and so on?
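
note: a rough way to weigh "wait until the jmp is executed" against branch prediction is a cycles-per-instruction (CPI) estimate. The Python sketch below is only a textbook-style model, not a measurement of any mcpu in this list: the base CPI of 1, the branch fraction, the stall length and the misprediction penalty are all assumed, hypothetical numbers, and the function names are made up for the illustration. Which side comes out ahead depends entirely on the figures plugged in.

```python
def cpi_stall(branch_fraction, resolve_cycles):
    """CPI when the pipeline simply waits at every branch.

    Assumes an ideal base CPI of 1 and that every branch costs
    resolve_cycles idle cycles before fetching can continue.
    """
    return 1.0 + branch_fraction * resolve_cycles


def cpi_predict(branch_fraction, accuracy, mispredict_penalty):
    """CPI with a predictor: only wrong guesses pay the refill penalty.

    Same ideal base CPI of 1; accuracy is the fraction of branches
    guessed correctly (0.0 to 1.0).
    """
    return 1.0 + branch_fraction * (1.0 - accuracy) * mispredict_penalty


if __name__ == "__main__":
    # Hypothetical workload: 1 branch in 5 instructions, 4 cycles either to
    # resolve a branch or to refill the pipeline after a wrong guess.
    print("stall at every branch :", cpi_stall(0.2, 4))
    for acc in (0.5, 0.8, 0.9):
        print("predict, accuracy", acc, ":", cpi_predict(0.2, acc, 4))
```

With an accuracy of 0 the two functions give the same result when the stall length equals the refill penalty, which is the "worst of the worst" case; the gap between them grows with pipeline depth and with predictor accuracy.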