1) Write a subroutine that accepts an array of unsigned ints and returns 4. It must operate on the array in-place and partition it so that all nonzero values are at the beginning of the array, and all zero values are moved to the end. For example, the input [0, 2, 0, 0, 4, 1, 4, 5] could be changed to [2, 4, 1, 4, 5, 0, 0, 0]. The relative order of the nonzero values is unimportant.
>>40 Because it isn't taught in school and appers are lazy niggers
Name:
Cudder !MhMRSATORI2014-11-27 15:13
>>37 Better to save some cache space, a miss will take more cycles than you save with uop fusion... and LOOP is basically as fast on a modern CPU when you're limited by memory bandwidth.
Name:
Anonymous2014-11-27 18:43
Eh, now that actually tested it I realized there's no [reg - reg] addressing mode (been awhile since I did x86), and the subtract obviously didn't do element wise, just number of bytes, gotta divide by 4. Revised and working version: ; esi=array, ecx=len partition: mov edi, esi L1: lodsd test eax, eax jz L2 stosd L2: loop L1 mov ecx, esi sub ecx, edi shr ecx, 2 xor eax, eax rep stosd ret
there's no [reg - reg] addressing mode (been awhile since I did x86), and the subtract obviously didn't do element wise, just number of bytes, gotta divide by 4.
how kum u miss det 1 coddo? i thot u wer a kool dude what knoed da ass embly
Name:
Anonymous2014-11-28 7:19
INTEGER FUNCTION ZP(A, N) INTEGER N, A(N)
INTEGER DEST, TEMP, I DEST = 1 DO 10 I = 1, N IF (A(I) .NE. 0) THEN TEMP = A(I) A(I) = A(DEST) A(DEST) = TEMP DEST = DEST + 1 END IF 10 CONTINUE
ZP = 4 RETURN END
Name:
Anonymous2014-11-28 8:52
Not a big C guy so critique is welcome.
int f(unsigned int *array, size_t length) { int i = -1; unsigned int *end = &array[length/sizeof(unsigned int)-1];
>>48 That struck me as quite an interesting approach, particularly because I only thought of one way to do it (which has been done in this thread already), but then I kind of stopped there and didn't think that there might be other ways. Nice work.
Some criticiism relating to C:
One suggestion is to pass the number of elements in the array to length, instead of the size in bytes. Also using a for or while statement instead of label+goto is usually considered better practice.
Might also be a good idea to stick with either int or size_t, and avoid mixing them (they've different ranges).
One issue that pops out at me is array+i -- this is done when i stores -1, so array+i will result in an invalid pointer.
Name:
Anonymous2014-11-29 3:38
>>43 So either all real compiler writers are morons, or there's more to performance optimization than using archaic complex instructions to avoid hypothetical icache misses. I wonder which it is.
Name:
Cudder !MhMRSATORI2014-11-29 14:56
>>52 Yes they are. Ever seen any real compiler output?
Q: If you believe your other arguments, you can do the reference counting or local pooling, and point out when it's actually wrong.
Right. The philosophical answer to you guys is: compilers will eventually get smart enough to deal with these problems better than a programmer can. This has happened time and again. [For instance] compilers generate better assembly code [than programmers do].
>>54 Compilers simply aren't able to generate the kind of fast code that a human expert can. The world would be better off if we relied on human experts to do all our code generation.
No they bloody well don't. This stupid myth has to die. It's all just wishful thinking. Ditto for those ultra-high-level languages being more performant than C "because their compilers can do more optimisation" - keep dreaming. Maybe the hoards of incompetents that CS educations are shitting out these days can't program anymore, but that's a different issue... or perhaps it's intentional: if you make programmers stupid enough that almost all of them don't know Asm anymore, or if they do it's only from looking at the god-awful shit that compilers emit, then they certainly can't generate code better than a compiler because they've never seen anything better! A disgusting self-fulfilling prophecy.
As for the rest of that link... WTF. It's like that guy is seriously stoned.
>>55 Have you seen any real compiler output? And 420 cycles in 1 byte is better than 420 cycles and 1680 bytes.
Name:
Anonymous2014-11-29 16:50
Ditto for those ultra-high-level languages being more performant than C "because their compilers can do more optimisation" - keep dreaming.
TIL: Cudder is a crack addict.
Name:
Anonymous2014-11-29 16:50
I went to the University of Washington and [then] I got hired by this company called Geoworks, doing assembly-language programming, and I did it for five years. To us, the Geoworkers, we wrote a whole operating system, the libraries, drivers, apps, you know: a desktop operating system in assembly. 8086 assembly! It wasn't even good assembly! We had four registers! [Plus the] si [register] if you counted, you know, if you counted 386, right? It was horrible.
I mean, actually we kind of liked it. It was Object-Oriented Assembly. It's amazing what you can talk yourself into liking, which is the real irony of all this. And to us, C++ was the ultimate in Roman decadence. I mean, it was equivalent to going and vomiting so you could eat more. They had IF! We had jump CX zero! Right? They had "Objects". Well we did too, but I mean they had syntax for it, right? I mean it was all just such weeniness. And we knew that we could outperform any compiler out there because at the time, we could!
So what happened? Well, they went bankrupt. Why? Now I'm probably disagreeing – I know for a fact that I'm disagreeing with every Geoworker out there. I'm the only one that holds this belief. But it's because we wrote fifteen million lines of 8086 assembly language. We had really good tools, world class tools: trust me, you need 'em. But at some point, man...
The problem is, picture an ant walking across your garage floor, trying to make a straight line of it. It ain't gonna make a straight line. And you know this because you have perspective. You can see the ant walking around, going hee hee hee, look at him locally optimize for that rock, and now he's going off this way, right?
This is what we were, when we were writing this giant assembly-language system. Because what happened was, Microsoft eventually released a platform for mobile devices that was much faster than ours. OK? And I started going in with my debugger, going, what? What is up with this? This rendering is just really slow, it's like sluggish, you know. And I went in and found out that some title bar was getting rendered 140 times every time you refreshed the screen. It wasn't just the title bar. Everything was getting called multiple times.
Because we couldn't see how the system worked anymore!
>>65 Working from Agner's tables, using loop is a losing decision unless you happen to be using one of the few AMD architectures you've highlighted. I don't understand why you are so adamant about using it in the general case.
>>65 Real world example of a palette lookup blitting loop: align 64 @@: mov eax, dword ptr [rsi] add rsi, r11 movzx edx, ah movzx ebp, al shr eax, 16 movzx ebx, ah movzx eax, al pinsrd xmm0, dword ptr[r10+rbp*4], 0 pinsrd xmm0, dword ptr[r10+rdx*4], 1 pinsrd xmm0, dword ptr[r10+rax*4], 2 pinsrd xmm0, dword ptr[r10+rbx*4], 3 movdqa xmmword ptr [rdi], xmm0 add rdi, 16 dec ecx jnz @b ret Replacing dec+jnz with loop slows things down from ~900 fps to ~830 fps This is on an Intel i7 920 (Nehalem) The loop code is exactly 64 bytes long and fits an icache line
Name:
Cudder !MhMRSATORI2014-11-30 16:27
>>67,69 I didn't mention that the parallel execution of uops means that unless you're saturating the decode bandwidth by some other means, it doesn't matter - I have benchmarks that show absolutely NO difference between dec/jnz and loop. In the general case you should optimise for size, and only the areas that are speed-constrained may need some "expansion".
>>69 That's only an 8% difference, and this is that rare case where it does matter since you're saturating the decoders with all those big SSE instructions.
Have you tried doing 8 bytes at a time with an expanded lookup table (64k elements) using pinsrq?
Name:
Anonymous2014-12-03 21:13
Okay /prog/! Let's keep moving forward and all become satori programmers together! Yeah!
2) Write a program that simulates physical card shuffling by partitoning and merging the deck. You can do this in place or using temporary structures at your own discretion, and you can use as many helper functions as you like.
The input will always contain 52 cards. You can represent them as pairs of characters (strings, tuples, etc.) identifying rank and suite: "6H" is the six of hearts, "AS" is the ace of spades, "XC" is the ten of clubs, and "QD" is the queen of diamonds. You can generate your own input in sorted order or hard-code a particular deck.