PDP-11/70 CPU core and SoC :: Performance
Performance
Microarchitecture
Because the w11a microarchitecture is very similar to that of the original 11/70 processor, the KB11-C CPU, the instruction timing in clock cycles is also very similar. A register-register operation takes two clock cycles; a more involved case such as "add @r1,a(r2)" takes 12 cycles. Notable exceptions are the MUL (5 instead of 22 cycles) and DIV (23 instead of 46 cycles) instructions.
Clock Rate
On Spartan-class FPGAs the w11a systems run at a clock frequency of at least 50 MHz. Specifically:
FPGA          board     system    clock    comment
xc3s1000-4    S3BOARD   w11a_s3   50 MHz   no DCM
xc3s1200e-4   Nexys2    w11a_n2   58 MHz   uses DCM
xc6slx16-2    Nexys3    w11a_n3   80 MHz   uses DCM
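To give a feeling for what the cycle counts mean in wall-clock terms, the following minimal sketch converts the cycle counts quoted above into nanoseconds for each system clock. The helper name `instruction_time_ns` is made up for this illustration, and the result is a best case that ignores cache misses and memory latency.

```python
# Cycle counts as quoted in the text; clock rates as listed in the table.
CYCLES = {
    "register-register": 2,
    "add @r1,a(r2)": 12,
    "MUL": 5,
    "DIV": 23,
}
CLOCK_MHZ = {"w11a_s3": 50, "w11a_n2": 58, "w11a_n3": 80}


def instruction_time_ns(cycles, clock_mhz):
    """Best-case execution time in ns (hypothetical helper, no memory stalls)."""
    return cycles * 1000.0 / clock_mhz


for system, mhz in CLOCK_MHZ.items():
    times = ", ".join(
        f"{name}: {instruction_time_ns(n, mhz):.0f} ns" for name, n in CYCLES.items()
    )
    print(f"{system} ({mhz} MHz): {times}")
```

At 50 MHz, for example, a two-cycle register-register operation comes out at 40 ns.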
Expected Performance
Compared to KB11-C CPU: The KB11-C CPU had a 150 ns micro cycle time, corresponding to a rate of about 6.7 MHz. Both the w11a and the 11/70 have a cache, which greatly reduces the impact of memory latencies. One therefore expects the 50 MHz w11a to be faster than the original PDP-11/70 by a factor of about 50/6.7, or 7.5.
Compared to J11 CPU: This later ASIC implementation of the 11/70 ran at a clock rate of up to 20 MHz. It needed 4 clocks per microcycle, resulting in a 200 ns micro cycle time. However, the J11 had a significantly improved microarchitecture, yielding a cpi (cycles-per-instruction) value better by up to a factor of two. One therefore expects the w11a to be faster than the fastest J11-based system, the PDP-11/93, by a factor of at least (50/20)*(4/1)*(1/2), or 5.
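The two estimates above are simple ratios and can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the text; the function and parameter names are invented for the illustration, and the default values are the figures quoted above (50 MHz w11a, 150 ns KB11-C micro cycle, 20 MHz J11 with 4 clocks per microcycle and a cpi better by a factor of two).

```python
def speedup_vs_kb11c(w11a_mhz=50.0, kb11c_cycle_ns=150.0):
    """Ratio of the w11a clock rate to the KB11-C micro cycle rate."""
    kb11c_mhz = 1000.0 / kb11c_cycle_ns  # about 6.7 MHz
    return w11a_mhz / kb11c_mhz


def speedup_vs_j11(w11a_mhz=50.0, j11_mhz=20.0,
                   clocks_per_microcycle=4, j11_cpi_gain=2.0):
    """(50/20) * (4/1) * (1/2) with the default values."""
    return (w11a_mhz / j11_mhz) * clocks_per_microcycle / j11_cpi_gain


print(f"w11a vs KB11-C: {speedup_vs_kb11c():.1f}x")
print(f"w11a vs J11:    {speedup_vs_j11():.1f}x")
```

Passing a different `w11a_mhz`, for example 80 for the Nexys3 system, scales both estimates linearly with the clock rate.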
Benchmarks
The Dhrystone 2 and Tower of Hanoi benchmark codes from the 'BYTE UNIX Benchmark' were used to compare the w11a with real PDP-11s and other processors. The w11a values were measured on the FPGA boards; the comparison values were obtained from Michael Schneider's benchmark collection:
Type              OS        CPU (MHz)            Dhry2   Hanoi   Dhry  Hanoi   Dhry
                                                 [lps]   [lps]   /MHz   /MHz   /Han
w11a_s3 V0.5      BSD 2.11  w11a (50)            11510   160.8    230    3.2   71.6
w11a_n2 V0.5      BSD 2.11  w11a (50)            11519   160.4    230    3.2   71.8
w11a_n2 V0.51     BSD 2.11  w11a (58)            13218   186.1    228    3.2   71.0
w11a_n3 V0.54     BSD 2.11  w11a (80)            17797   252.2    222    3.1   70.6
pdp-11/53+        BSD 2.11  KDJ11-SD (4.5)*        828    12.2    184    2.7   67.8
Mac SE/30         A/UX      68030 (16)            3042    81.8    190    5.1   37.2
SUN 3/60          NetBSD    68020 (20)            6934   121.3    346    6.1   57.3
DECstation 2100   NetBSD    R2000 (12)           13206   155.5   1100   13.0   85.2
NeXT N1100        NetBSD    68040 (25)           26882   386.1   1075   15.4   69.6
HP 9000/433t      NetBSD    68040 (40)           55763   960.3   1394   24.0   58.1
NCR system 3230   NetBSD    i486DX/2 (66)        63464   993.1    961   15.0   63.9
NCR system 3230   NetBSD    i486DX/4 (100)       75010  1022.3    750   10.2   73.4
Power Mac G4      Gentoo    PPC7455 (1400)       3713k   46.6k   2652   33.3   79.6
Lenovo TS S10     Gentoo    i686 E8400(3000)    16464k  262.6k   5488   87.5   62.7
Note that the J11 system (marked with *) is listed with the effective microcycle rate of 4.5 MHz rather than the chip clock rate of 18 MHz. This is consistent with Bob Supnik's notes on the J11, where the J11 is classified as '4.5 MHz', and gives more meaningful values for the Dhry/MHz ('Dhrystone per MHz') column. For a fair comparison it is also important to note that the PDP-11/53+ systems had no cache and were therefore about a factor of 2.3 slower than a PDP-11/93 with cache (see comparison), which explains the large factor between the w11a_s3 and the 11/53 benchmark results.
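The three derived columns of the table are plain ratios of the measured values. As an illustration, the sketch below recomputes them for a few rows. The function `normalize` and the row selection are choices made for this example; the J11 clock is entered as 18 MHz divided by 4 clocks per microcycle, which gives the 4.5 MHz effective rate discussed above. The results should agree with the table up to the rounding used there.

```python
# (label, clock in MHz, Dhry2 [lps], Hanoi [lps]); values copied from the table.
ROWS = [
    ("w11a_s3 V0.5", 50.0, 11510.0, 160.8),
    ("w11a_n3 V0.54", 80.0, 17797.0, 252.2),
    ("pdp-11/53+", 18.0 / 4, 828.0, 12.2),  # effective J11 microcycle rate
]


def normalize(clock_mhz, dhry_lps, hanoi_lps):
    """Return (Dhry/MHz, Hanoi/MHz, Dhry/Han) for one table row."""
    return dhry_lps / clock_mhz, hanoi_lps / clock_mhz, dhry_lps / hanoi_lps


for label, mhz, dhry, hanoi in ROWS:
    dhry_mhz, hanoi_mhz, dhry_han = normalize(mhz, dhry, hanoi)
    print(f"{label:<14} {dhry_mhz:7.1f} {hanoi_mhz:6.2f} {dhry_han:6.2f}")
```

Using the 18 MHz chip clock instead would divide the per-MHz values of the J11 row by four, which is why the effective rate is the more useful basis for comparison.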
The Dhrystone, Tower of Hanoi and 'syscall' benchmarks were also run on a simulated PDP-11 using simh version V3.8-1 and natively on a Linux system. In both cases a Kubuntu 10.4 system with an Intel Core2 Duo E8400 CPU was used, with cpufreq fixed to 3 GHz.
System        Platform               (MHz)    Dhry2    Hanoi  syscall
                                              [lps]    [lps]    [lps]
2.11BSD       w11a_s3 V0.5            (50)    11510    160.8     7080
2.11BSD       w11a_n2 V0.5            (50)    11519    160.4     6888
2.11BSD       w11a_n2 V0.51           (58)    13218    186.1     7616
2.11BSD       w11a_n3 V0.54           (80)    17797    252.2    10375
2.11BSD       simh on Intel E8400     (--)    17174    250.0    10713
Ubuntu 10.4   Intel E8400           (3000)   10785k    74.1k    1020k
Some observations are:
- The Nexys2 and Nexys3 boards have a significantly larger main memory latency than the S3BOARD. Because Dhry2 and Hanoi run almost completely in cache and rarely write, they execute with equal speed on the S3BOARD and the Nexys2 at the same 50 MHz clock. syscall is more sensitive to memory latencies, due either to more cache misses or to delays from write-throughs.
- The simulated PDP-11 on a modern 3 GHz PC is about as fast as the current FPGA implementation on a Spartan-6. However, there is certainly room for improvement on the FPGA side, through a faster device (e.g. a Virtex-6 instead of a Spartan-6) and/or an improved microarchitecture (like the J11 or even better).
- Comparing the arithmetic benchmarks (Dhry2 and Hanoi) with the 'syscall' benchmark on simh-simulated 2.11BSD and native Linux suggests that the system call overhead, normalized by processor speed, is larger for contemporary Linux than for 2.11BSD.
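The last observation can be made quantitative by dividing the 'syscall' rate by one of the arithmetic rates measured on the same system, which removes the raw processor speed from the comparison. The sketch below does this for the simh and native Linux rows of the second table. The ratio is a measure chosen for this illustration, not one defined in the benchmark suite, and the names `RESULTS` and `syscall_ratio` are hypothetical.

```python
# Loops per second, copied from the simh and native Linux rows of the table.
RESULTS = {
    "2.11BSD on simh": {"dhry2": 17174.0, "hanoi": 250.0, "syscall": 10713.0},
    "native Linux": {"dhry2": 10785e3, "hanoi": 74.1e3, "syscall": 1020e3},
}


def syscall_ratio(result, reference):
    """syscall loops per loop of the reference benchmark on the same system."""
    return result["syscall"] / result[reference]


for name, result in RESULTS.items():
    print(
        f"{name:<16} syscall/Dhry2 = {syscall_ratio(result, 'dhry2'):.3f}, "
        f"syscall/Hanoi = {syscall_ratio(result, 'hanoi'):.1f}"
    )
```

A smaller ratio means that a system call costs more relative to ordinary computation. With the numbers above, syscall/Dhry2 is about 0.62 for 2.11BSD on simh and about 0.09 for native Linux, and the Hanoi-based ratio points the same way.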
