
Hennessy and Patterson

Name: Cudder !MhMRSATORI 2014-07-13 3:42

Ostensibly one of the most widely used books for studying computer architecture so I had a look, and... WTF? Possible future CPU designers are being fed with tripe like this?

http://i62.tinypic.com/xakqr.png

Despite all the focus on MIPS and performance, it is suspiciously missing any real benchmarks of MIPS processors.

They have an interesting definition of a "desktop computer":
http://i60.tinypic.com/4lq2j7.png

"Heineken and Pilsner" would be a better name for this book, as its authors appear to be as knowledgeable about real-world computer architecture as drunken fools.

Name: Anonymous 2014-07-13 4:48

Never trust 3rd party research. Always test everything yourself.

Name: Anonymous 2014-07-13 5:16

What book do you recommend then?

Name: Anonymous 2014-07-13 6:19

>>1
All of the information on real world microprocessors in H&P becomes outdated before it's put to press. This is especially obvious if you're reading an older edition. Fortunately, H&P is really just an introductory text so, outdated performance observations aside, there isn't a lot of room for demagoguery.

Hennessy does have a well known anti-Intel bias that shows up in most of his newer writings. It seems to me that he has never gotten over the fact that MIPS failed to take over the world, and now that he is no longer close to the industry the textbooks are one of the few outlets he has for his bile.

Name: Anonymous 2014-07-13 6:31

>>1
http://i62.tinypic.com/xakqr.png

The assertion that avoiding rep on the x86 is universally good for performance is obviously bogus.

However, this benchmark comparison is futile because the textbook doesn't bother to mention which CPU was used for the test. Sure rep movs on a Nehalem will be pretty fast but the authors could have been using a 386 for all you know. Also the text implies that their load/store code was manually unrolled; the sample code that was used for benchmarking is not.

Name: Anonymous 2014-07-13 6:47

>>1
Cuuudderr! It's really you! I scanned the old /prog/ db five times to see if someone had copy-pasted your post!

Name: Anonymous 2014-07-13 7:55

>>1
Shalom!

Name: Anonymous 2014-07-13 9:17

MIPS is mediocre at best and RISC is a futile design philosophy, but that has never stopped dusty academics from circlejerking before...

Name: Cudder !MhMRSATORI 2014-07-13 12:00

>>4
In some ways MIPS did "take over the world", mostly in unbranded Chinese devices like digital picture frames, tablets, and media players. Many routers and set-top-boxes too. Not because of performance, but because it's a cheap small core that just about any ASIC designer can put in an SoC after reading e.g. the P&H book. I would bet most of the MIPS-compatible cores out there are not actually licensed from MIPS...

>>5
On the 8086/8 and the rest of the cacheless x86 (up to the '386), REP MOVS is the fastest - all it costs is a single bus cycle to fetch the instruction, and every cycle after that is spent reading/writing the data to be moved. Any other way would be slower since most of the bus bandwidth would be spent on instruction fetches.

For the 486 and Pentium I found this:

http://computer-programming-forum.com/46-asm/5157e945c6bac0ef.htm

Which shows that REP MOVS does get beaten by 38% by an unrolled loop on an Am486DX4-120 and 23% by going through the FPU on a P90 (could this be the "third version" H&P mention?), but on the 33MHz 486 it's still the fastest. No sign of that 200% difference though.

Sure rep movs on a Nehalem will be pretty fast but the authors could have been using a 386 for all you know.

Leaving aside the fact that REP MOVS was fastest on the 386, the only period in which there were significant gains from avoiding it would be in the 486 ~ Pentium era, so it would be absolutely insane to be talking about this - except perhaps as a historical point - in the fourth edition of a book released in 2012.

Or maybe they just benchmarked a ridiculously tiny copy - akin to seeing how long it takes to use a car to travel 5 meters and comparing it to walking.

I'm tempted to get out the old 386, 486, and PMMX and do some real testing with the same code I used for the above...

Name: Anonymous 2014-07-13 13:18

>>9
Shalom!

Name: Anonymous 2014-07-13 13:33

IS THAT REALLY YOU CUDDER-SAMA
OH HOW I MISSED YOU

Name: Anonymous 2014-07-13 13:57

SHAAAAALOOOOOOOOMMMMM Cudder-chan!!

Name: Anonymous 2014-07-13 14:14

>>12
so how's ADHD med working for you?

Name: Anonymous 2014-07-13 14:49

*fartz in joy*

Name: Anonymous 2014-07-13 18:53

Corporate whores. Keep insulting academia because you're too dumb for math higher than calculus. Nice intel propaganda here, how much they paid you? Don't worry, intel won't be able to keep up in the quantum world. Nanotechnology is not the future, quantum is.

Name: Anonymous 2014-07-14 1:23

>>9
Leaving aside the fact that REP MOVS was fastest on the 386

Which is a good point to make, actually.

the only period in which there were significant gains from avoiding it would be in the 486 ~ Pentium era, so it would be absolutely insane to be talking about this - except perhaps as a historical point - in the fourth edition of a book released in 2012.

The first edition of H&P was published in late 1990 so it's conceivable they did indeed run the test on a 486 originally and never bothered to revisit it (knowing, perhaps, that doing so would undermine their original argument).

Name: Anonymous 2014-07-14 8:35

>>1
Use a different picture hosting service next time, Cudder-san. Tinypic won't work if Javashit is disabled.

Name: Anonymous 2014-07-14 9:23

>>17

implying Cudder isn't an Intel-JS shill

Name: Cudder !MhMRSATORI 2014-07-14 12:45

Maybe H&P didn't completely make up everything, since I think I found the results they used, along with the code:

http://now.cs.berkeley.edu/Td/bcopy.html

Indeed, as I suspected, the 200% improvement is only achievable on specific combinations of P5 processors and chipsets, and completely disappeared with the P6; there are two 486 results there too, in which REP MOVS has only a <10% loss against the others. This makes me really want to get those old machines out to see if I can reproduce these results...

Trying their code on my Nehalem, the huge 64x unrolled loop is within 1% of the non-unrolled loop (sometimes faster, sometimes slower - probably the same considering measurement noise), while going through the FPU is basically identical to using MMX (not surprisingly, they both move 8 bytes at a time.) In other words, nothing magic about their code.

Interestingly, using SSE2 non-temporal moves maintains the same speed regardless of size (not surprising since it bypasses the cache), and with huge sizes is the fastest by <10% again, but that's a bit like cheating in that the difference quickly vanishes if we need to access the data that was copied (very common situation) - the cache misses that were avoided during the copy just appear later.

>>17
It works just fine with curl.

Name: Anonymous 2014-07-14 13:11

>>19
Shalom!

Name: Anonymous 2014-07-14 13:21

>>19
haha, bcopy

Name: Anonymous 2014-07-14 13:48

>>18
That was a truly epic meme /b/ro!

Name: not >>17 2014-07-14 18:59

>>19
I share the sentiment with >>17. Imgur is fine if you want to keep the image forever.

These days I am using http://webmup.com/, 'cause "Webm is master race!"

Name: Anonymous 2014-07-14 23:30

>>19
Interestingly, using SSE2 non-temporal moves maintains the same speed regardless of size (not surprising since it bypasses the cache), and with huge sizes is the fastest by <10% again, but that's a bit like cheating in that the difference quickly vanishes if we need to access the data that was copied (very common situation) - the cache misses that were avoided during the copy just appear later.

To be fair it's also possible you'd not need to access the data that was just copied (I think framebuffers are the canonical example, but hardware with high bandwidth requirements and no DMA engine is sort of uncommon these days).

Name: Anonymous 2014-07-15 0:34

>>8
The success of x86 in personal computers was largely historical accident. I'm sure if RISC architectures had decades of mass-market commercial pressure to drive performance improvements, they'd be in the same position that x86 hardware is in now.

Name: Anonymous 2014-07-15 0:37

>>25

The success of white man was largely historical accident. I'm sure if niggers had decades of evolutionary pressure to drive performance improvements, they'd be in the same position that white man is in now.

Name: Anonymous 2014-07-15 0:39

>>26

The success of the G-d's Chosen People was largely historical accident. I'm sure if Gentiles were chosen by G-d to drive performance improvements, they'd be in the same position that the Jews are in now.

Name: Anonymous 2014-07-15 0:46

>>26-27
Making racist variants of my argument is not a counterargument, my friend(s).

Name: Anonymous 2014-07-15 0:55

>>28

Your argument is phallacious. In other words your argument is a phallus.

Name: Anonymous 2014-07-15 1:25

>>29

My peepee is hard right now

Name: Cudder !MhMRSATORI 2014-07-15 12:49

>>24
15GB/s is good for nearly 600FPS of 24-bit 4096x2160 uncompressed video. Framebuffers don't need anywhere near that amount of bandwidth.

>>25
In that case the RISCs would've slowly evolved toward something with the complexity of x86 once they realised that the new bottleneck was memory bandwidth and not instruction decode/execute. Just look at all the instructions ARM have been adding, Thumb mode, etc.

Also got a chance to do this test on the Pentium MMX 200 (P54C), and it's different again from the P5s benchmarked above:

http://i58.tinypic.com/2jfjd3t.png

The MMX and FPU on this one are bloody fast at small sizes - we do see them beating REP MOVSD by ~2x - but a plain register-copy loop, the unrolled register loops (2x and 64x), and REP MOVSD are essentially identical in performance despite the unrolled loop being extremely cache-bloating. It looks like MOVSB/MOVSW haven't been optimised yet, as they definitely move only half and a quarter as much data as MOVSD.

This is why CISC has so much potential. Complex instructions don't get slower, they get faster over time and give hardware designers more room to optimise with each new microarchitecture and make existing software faster. RISC is a dumb, overly simplistic, and ridiculously shortsighted design decision by people who don't understand the laws of physics.

I think all the claims of P&H have been thoroughly debunked now, unless anyone has a Sandy Bridge or newer and wants to test?

Name: Anonymous 2014-07-15 13:13

>>31
What do you think about the OpenRISC 1200? Are there any free (as in freedom) CPU designs out there that aren't shit (by your standards)?

Name: Anonymous 2014-07-15 14:38

Hey /prog/ people,
long time no see.
Is Cudder-san still alive?

>>31
Cudder-san, is this you?

Name: Anonymous 2014-07-15 16:00

>>31
This is why CISC has so much potential. Complex instructions don't get slower, they get faster over time and give hardware designers more room to optimise with each new microarchitecture and make existing software faster. RISC is a dumb, overly simplistic, and ridiculously shortsighted design decision by people who don't understand the laws of physics.

RISC benefits from the same hardware advances that CISC does. The x86 benefits more because it was further behind to start with. Historic RISC designs have been able to defer implementing features that the x86 had earlier but that's not really a strong argument for the x86 per se.

Name: Anonymous 2014-07-16 0:48

http://dis.4chan.org/read/prog/1374577564
http://en.wikipedia.org/w/index.php?title=Cannonlake&action=history
Look at the date Cudder posted that thread, and what the wiki page was renamed to. Coincidence?

Name: Anonymous 2014-07-16 0:54

>>35
So? Cudder is not the only semi company stiff who posts here.

Name: Cudder !MhMRSATORI 2014-07-16 10:58

>>32
OpenRISC is just another MIPS clone (complete with branch delay slot!) with some even more stupid design decisions like defaulting to big-endian and an Asm syntax where every single goddamn instruction has a mnemonic starting with "l" or "l.". WTF.

>>34
The difference is that, when the hardware makes it possible to implement more complex operations, it's much harder to recognise a series of simple instructions and recombine them into one op for the dedicated hardware unit than it is to split a complex instruction into simpler uops when dedicated hardware units are not available. And the multiple simple instructions take more space in cache and more memory bandwidth to fetch.

A good example of this is an integer division instruction. x86 has had one ever since the 8086, and while it was slow (it's still faster than a manual shift-subtract loop if you don't know the divisor), software could use it whenever it wanted to divide, and its performance was increased a huge amount throughout the years as newer models came out. Many RISCs started out without any divide instruction, because it didn't fit the definition of "simple" and hardware at the time couldn't do divides in 1 clock cycle. Software had to either call a divide function in a library or inline a shift-sub loop.

At some point they figured out how to make faster division hardware and added instructions for it, which means no gain in performance at all for software that inlined a division loop or statically linked in a library containing one, while those that could benefit from an updated library will still take the extra call/return overhead. Attempting to recognise the nearly infinite possibilities of instruction sequences to implement a divide loop and send that to the hardware divider would take far more complex circuitry than if they had a divide instruction that just internally expanded to the equivalent shift-sub loop on CPUs without fast divide hardware, and there's no way to recover the cost of fetching those instructions. (This is what the 8086's DIV did.) It took ARM seven revisions to add an integer divide, and it's still considered "optional". Maybe this is OK for small embedded cores but it's absolutely idiotic for something intended to do more computation and be general-purpose.

Name: Anonymous 2014-07-16 11:04

>>37
Shalom!

Name: Anonymous 2014-07-16 11:10

>>37
That's actually one of your best arguments for CISC yet.

Name: Anonymous 2014-07-16 12:36

>>37
What do you think about m68k?

Name: Anonymous 2014-07-16 15:33

>>37
stupid design decisions like defaulting to big-endian

You'd think that people who favor minimalist designs would realize how dumb this is.

It took ARM seven revisions to add an integer divide, and it's still considered "optional". Maybe this is OK for small embedded cores but it's absolutely idiotic for something intended to do more computation and be general-purpose.

The flip side of this is that ARM was able to get away with doing absolutely nothing for seven generations, and their embedded cores remain free from the requirement to retain a crappy divide unit that no one will use anyway. Meanwhile x86 had to keep it around for years and can never be rid of it, even in embedded use cases where they might really want to. Adding slow instructions because you think future generations will be able to do things faster is a bet Intel didn't always win (see: almost every instruction that was added for the 286).

Name: Anonymous 2014-07-16 20:46

>>37

And the multiple simple instructions take more space in cache and memory bandwidth to fetch.
1. modern caches are huge.
2. simple instructions have orders of magnitude faster execution.

Name: Anonymous 2014-07-16 21:17

>>42

3. simple instructions take less cache space and are easier to cache

Name: Anonymous 2014-07-17 0:41

>>43
simple instructions take less cache space

False. Your typical RISC instruction set wastes cache space for most instructions because all instructions have the same length, even for common cases where a shorter encoding is possible. Many common x86 instructions occupy only 1-2 bytes; equivalent RISCs use 4 bytes for nearly everything.

Name: Anonymous 2014-07-17 0:54

>>44

Nothing stops you from using 16-bit RISC

Name: Anonymous 2014-07-17 1:02

>>44
You got a Thumb up your arse.

Name: Anonymous 2014-07-17 2:06

>>46
At least mine is not 4 bytes long.

Name: Cudder !MhMRSATORI 2014-07-17 12:42

>>39
It's the same one I've been repeating for over 20 years (albeit in slightly modified form) - back then it was mostly prediction, but I knew that raising the clock frequency would have its limits.

>>40
Backwards-endian, minimum length of 2 bytes per instruction, can't use GPRs for memory addressing, and an encoding that is one of the most random ones I've seen (e.g. compare http://goldencrystal.free.fr/M68kOpcodes.pdf , http://i.stack.imgur.com/07zKL.png , and http://www.z80.info/decoding.htm )... it's CISC done wrong, like the VAX.

>>41
The amount of software you can write without ever having to divide is tiny - probably an application better served by an 8-bit or even 4-bit MCU, and when you do have to divide, it's far better for the hardware to have a divide instruction than to have to do it manually with individual instructions, because at best it'll be the same speed as a hardware divide instruction expanding to the same series of uops but using more memory, and at worst it'll be both slower and bigger. The funny thing is that although you seem to think embedded applications don't need divide instructions, the only ARM cores that are guaranteed to have divide instructions are the embedded ones - ARMv7-{M,R}, and ARMv7-R only in Thumb mode.

>>42
modern caches are huge.
That's the sort of idiotic thinking that lead to the wasteful state of software today, and a similar attitude ("the Earth is huge, we can dump whatever into the oceans and atmosphere and it'll be OK") is why the environment is getting fucked up. If we don't realise that all resources are intrinsically limited and use them responsibly, everyone suffers because we're all sharing it. See also: programs that think they can use all the RAM and HDD they want, or that network bandwidth is free, etc.

simple instructions have orders of magnitude faster execution
When you need orders of magnitude more of them, no. Also see what I wrote above about recognising and combining simple instructions to be executed faster by dedicated hardware.

>>45,46
That doesn't solve the problem of every single instruction being the same length, which means that the simpler and more common operations aren't any shorter.

This is what an ARM memcpy looks like:
https://android.googlesource.com/platform/bionic/+/5b349fc/libc/arch-arm/bionic/memcpy.S
https://android.googlesource.com/platform/bionic/+/5b349fc/libc/arch-arm/bionic/memcpy.a15.S
Do we really need all that bloat in the cache? A disgustingly large amount of that code is just for handling different alignment cases, something that the hardware could easily figure out by itself (and ARM has thankfully realised this, later versions do handle alignment in hardware.) It's still no match for REP MOVS, however.

RISC: stupid processors by stupid hardware designers, for stupid compilers, stupid programmers, and stupid users.

Name: Anonymous 2014-07-17 14:19

>>48
Isn't it wasteful of die area to include loads of ucode for all the functions that aren't implemented in hardware, as opposed to just letting users use regular code to implement them? Users can recompile their damned software anyway when the functionality is implemented in hardware.

Name: Anonymous 2014-07-17 15:10

>>48
You don't need, nor should you be using, any instructions other than bit-and/or/xor, jif (jump if the condition is true), one instruction for loading a register from memory, one for writing a register to memory, and a final one for communicating with hardware.

Name: Anonymous 2014-07-17 15:32

>>49
How wasteful. You only need one instruction (subleq).

Name: Anonymous 2014-07-17 15:58

>>49
"Wasteful" depends on what you are trying to accomplish. If you have a large installed base of software that can't be re-compiled, adding compatibility features to your hardware starts making more sense.

One thing you may notice is that vendors with complex instruction sets like Intel's don't like to sell small, low cost chips where instruction decode begins to occupy a significant share of the die. They'd much rather sell you an expensive, high transistor count design where the area used by decode is totally dominated by caches, etc.

Name: Anonymous 2014-07-17 17:42

>>52
software that can't be re-compiled
I don't care about closed-source software.

Name: Anonymous 2014-07-17 18:52

What do you think is the best designed processor(s)?

Name: Anonymous 2014-07-17 19:50

>>54
The Symbolics Genera Lisp Machine, of course.

Name: Anonymous 2014-07-17 20:16

>>55
Processor, not computer. Are you one of those fucks that refer to their computer as ``CPU''?

Name: Anonymous 2014-07-17 20:41

>>55
Symbolics

Angered by this perceived infraction of his privacy, Stallman sent an email to Symbolics threatening to wrap himself in dynamite and blow up their building. Although this caused some panic at Symbolics, Stallman never made good on the threat (Newquist 194-196).

Name: Anonymous 2014-07-17 20:51

>>56
It's a list processor, twit.

Name: Anonymous 2014-07-17 20:54

>>58
Your low intelligence and low IQ test scores are evident. faggot

Name: Anonymous 2014-07-17 21:01

>>59
Stop posting.

Name: Anonymous 2014-07-17 21:31

>>60
The individual is stronger than the herd (you and your faggot friends).

Name: Anonymous 2014-07-17 21:47

>>60
Stop posting.

Name: Anonymous 2014-07-17 21:54

>>62
blah blah you're autistic blah blah
