

Question

I've rolled my own SIMD-accelerated math library. It's gotten pretty complete, so naturally I went to conduct speed tests and optimize it.

Btw, this isn't premature optimization: the lib is functionally complete, and I really need it to be fast now.

So anyway, I'm testing some vector dot product methods against the ones in Microsoft's DirectXMath. The difference between my vectors and the ones in DirectXMath is that in DirectXMath XMVECTOR is just a naked __m128, while mine is a __m128 inside a vector_simd class that is 16-byte aligned and uses an aligned allocator.

Now one would assume that with all possible optimizations enabled in release mode they would compile to the same thing. I mean int a; and int arr[1]; compile to the same thing in release mode, just like my array template class has the same performance as a raw array and so on... but to my surprise my class's methods came up 2 times slower. I even tried just pasting the SSE code from DirectXMath into my class's method, it still came out slower. So the only thing left was the difference that one is a class and the other is raw.

It seemed for a while that the cause was the overhead of accessing it as a class member and/or the extra overhead from the constructor (which is empty...). However, I put all the definitions of my vector class methods in its header and it still came up 2x slower than DirectXMath.

Maybe the overhead comes from the __m128 being a class member?

I don't understand why the compiler can't optimize that away; isn't it like having a struct with a single integer in it? Has anyone else had such issues?

Explanation / Answer

Working out why code is slower can be exceptionally difficult, but for a gross difference like a doubling of execution time, a profiler will give you a good idea of at least where to start looking. In particular, watch for any place where data is copied around, especially inside a loop; that can easily get expensive. (I'd suggest a tool like cachegrind, but I don't know if it is available for your platform.)
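To make the point about copies inside a loop concrete, here is a minimal sketch, not the asker's actual code. `CountedVec`, `dot_by_value`, `dot_by_ref`, and `copies_in_loop` are all hypothetical names, and a plain `float[4]` stands in for the `__m128` member so the example builds anywhere. The copy constructor increments a counter purely to make each requested copy visible.

```cpp
#include <cassert>
#include <cstddef>

// Counter used only to make copies observable in this sketch.
static std::size_t g_copies = 0;

// Hypothetical stand-in for a SIMD vector wrapper class.
// A real wrapper would hold an __m128 instead of float[4].
struct CountedVec {
    float v[4];

    CountedVec(float x, float y, float z, float w) : v{x, y, z, w} {}
    CountedVec(const CountedVec& o) : v{o.v[0], o.v[1], o.v[2], o.v[3]} {
        ++g_copies;  // every copy the language requests is counted here
    }
};

// Pass by value: each call asks for a copy of both arguments.
float dot_by_value(CountedVec a, CountedVec b) {
    return a.v[0] * b.v[0] + a.v[1] * b.v[1] + a.v[2] * b.v[2] + a.v[3] * b.v[3];
}

// Pass by const reference: no copies are requested.
float dot_by_ref(const CountedVec& a, const CountedVec& b) {
    return a.v[0] * b.v[0] + a.v[1] * b.v[1] + a.v[2] * b.v[2] + a.v[3] * b.v[3];
}

// Runs the chosen variant in a loop and returns how many copies were made.
std::size_t copies_in_loop(bool by_value, std::size_t iterations) {
    CountedVec a(1.0f, 2.0f, 3.0f, 4.0f);
    CountedVec b(4.0f, 3.0f, 2.0f, 1.0f);
    g_copies = 0;
    double sum = 0.0;
    for (std::size_t i = 0; i < iterations; ++i) {
        sum += by_value ? dot_by_value(a, b) : dot_by_ref(a, b);
    }
    // Each dot product is 4 + 6 + 6 + 4 = 20.
    assert(sum == 20.0 * static_cast<double>(iterations));
    return g_copies;
}
```

Calling `copies_in_loop(true, n)` reports two copies per iteration, while `copies_in_loop(false, n)` reports none. With a trivially copyable wrapper and full inlining an optimizer may well remove such copies, so this only shows where they are requested at the language level; whether they survive in the generated code is exactly what a profiler or the disassembly will tell you.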

Alignment problems are actually easier to detect and fix. In particular, if you have a structure like this:

struct {
    eightByteQuantity a;
    eightByteQuantity b;
};

Setting its alignment to 16 bytes can easily end up forcing each of the members onto its own 16-byte boundary (for example, when the alignment is applied to the member type rather than to the struct as a whole), doubling the effective size. Be aware that the size of things like vtable pointers must also be paid for.
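A quick way to check whether this is happening is to ask the compiler for the sizes directly. The sketch below is only an illustration: the type names are hypothetical, `std::uint64_t` stands in for `eightByteQuantity`, and standard `alignas` is used in place of any compiler-specific alignment attribute.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical 8-byte payload standing in for eightByteQuantity.
using Eight = std::uint64_t;

// Alignment applied to the struct as a whole: no extra padding is needed.
struct alignas(16) WholeAligned {
    Eight a;
    Eight b;
};

// Alignment applied to each member: b is pushed out to offset 16.
struct MemberAligned {
    alignas(16) Eight a;
    alignas(16) Eight b;
};

// Adding a virtual function adds a vtable pointer, which no longer fits
// in 16 bytes, so the size is rounded up to the next multiple of 16.
struct alignas(16) WithVtable {
    virtual ~WithVtable() = default;
    Eight a;
    Eight b;
};

static_assert(alignof(WholeAligned) == 16, "struct is 16-byte aligned");
static_assert(sizeof(WholeAligned) == 16, "two 8-byte members, no padding");
static_assert(sizeof(MemberAligned) == 32, "per-member alignment doubles the size");
static_assert(sizeof(WithVtable) > sizeof(WholeAligned), "the vtable pointer is paid for");
```

Dropping a similar `static_assert(sizeof(vector_simd) == sizeof(__m128), "...")` next to the wrapper class would be one cheap way to rule out hidden padding or a stray virtual function as the cause.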
