Good news! I've been labouring on a library for just that purpose: https://github.com/t0rakka/mango
Check out the include/mango/math/ folder for the C++ operator-overloaded SIMD vector classes. The underlying low-level code is in the simd/ folder and has implementations for various architectures, so the code also acts as a portable SIMD abstraction. The different levels (native -> simd -> math) interoperate seamlessly, in case some super exotic instruction must be used.
I can't stress enough that these abstractions are crafted to add no overhead whatsoever when using the primitives. The calling conventions, parameter passing, everything has been single-mindedly designed to be as efficient as possible. We don't want register spilling; everything runs in registers as much as humanly possible, or at least any spilling that does happen won't be caused by bad choices in the library!
The low-level API can be described as "functional": nothing is passed by reference, objects are not modified by the functions, and the result of a calculation is always returned by value.
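To make the "functional" idea concrete, here's a minimal sketch of the pattern. This is not mango's actual code: `vec4f`, `add`, `mul`, `splat` and `madd` are hypothetical names made up for illustration, and a plain float array stands in for the native register type (`__m128`, `float32x4_t` and so on) that a real backend would wrap.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a 4-lane float vector; NOT mango's actual type.
// A real backend would wrap a native SIMD register instead of an array.
struct vec4f
{
    float lane[4];
};

// Low-level "functional" layer: arguments are taken by value and a new value
// is returned. Nothing is modified in place and nothing is passed by reference.
inline vec4f add(vec4f a, vec4f b)
{
    vec4f r;
    for (std::size_t i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

inline vec4f mul(vec4f a, vec4f b)
{
    vec4f r;
    for (std::size_t i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

// Broadcast one scalar to all four lanes.
inline vec4f splat(float s)
{
    return vec4f{{s, s, s, s}};
}

// Higher-level layer: operator overloads that simply forward to the
// functional layer, so both levels can be mixed freely.
inline vec4f operator + (vec4f a, vec4f b) { return add(a, b); }
inline vec4f operator * (vec4f a, vec4f b) { return mul(a, b); }

// Example built purely from the operators: a * b + c per lane.
inline vec4f madd(vec4f a, vec4f b, vec4f c)
{
    return a * b + c;
}

// Small usage example: the inputs are left untouched.
inline void usage_example()
{
    vec4f a{{1.0f, 2.0f, 3.0f, 4.0f}};
    vec4f r = madd(a, splat(2.0f), splat(1.0f));
    assert(r.lane[0] == 3.0f && r.lane[3] == 9.0f);
    assert(a.lane[0] == 1.0f); // a was passed by value, so it is unchanged
}
```

The point of the by-value style is that every function is a pure "inputs in, result out" mapping, which tends to make it easier for the compiler to keep intermediate values in registers after inlining.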
Here's some random code I wrote recently so you can take a peek at what the API looks like in use:
int32x4 coverageMask = (cx0 & cx1 & cx2) < zero;
uint32 mask = coverageMask.mask();
if (mask)
{
    // integer values -> float, used as interpolation weights below
    float32x4 c0 = convert<float32x4>(cx0);
    float32x4 c1 = convert<float32x4>(cx1);
    float32x4 c2 = convert<float32x4>(cx2);

    // reciprocal of the interpolated w, then the interpolated depth
    float32x4 w = 1.0f / (c0 * block.w[0] + c1 * block.w[1] + c2 * block.w[2]);
    float32x4 depth = (c0 * block.depth[0] + c1 * block.depth[1] + c2 * block.depth[2]) * w;

    // depth test, restricted to the covered lanes
    float32x4 depthMask = (depth < depthBuffer[0]) & reinterpret<float32x4>(coverageMask);
    int32x4 colorMask = reinterpret<int32x4>(depthMask);

    // interpolate the color channels
    float32x4 r = (c0 * block.color[0].xxxx + c1 * block.color[1].xxxx + c2 * block.color[2].xxxx) * w;
    float32x4 g = (c0 * block.color[0].yyyy + c1 * block.color[1].yyyy + c2 * block.color[2].yyyy) * w;
    float32x4 b = (c0 * block.color[0].zzzz + c1 * block.color[1].zzzz + c2 * block.color[2].zzzz) * w;

    // pack as (r << 16) | (g << 8) | b
    int32x4 v0 = convert<int32x4>(r);
    int32x4 v1 = convert<int32x4>(g);
    int32x4 v2 = convert<int32x4>(b);
    int32x4 color = v2 | (v1 << 8) | (v0 << 16);

    // write only the lanes that passed; the others keep the old buffer contents
    colorBuffer[0] = select(colorMask, color, colorBuffer[0]);
    depthBuffer[0] = select(depthMask, depth, depthBuffer[0]);
}
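If the compare / mask / select pattern in the snippet above is unfamiliar, here's a rough scalar model of what it does per lane. Everything in it is an assumption made for illustration: `int4`, `less_than4`, `sign_mask` and `select4` are hypothetical names, not mango's API, and the argument order of `select4` just mirrors how `select` appears to be used above (mask first, then the value to take where the mask is set, then the fallback).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 4-lane integer vector; NOT mango's definition.
struct int4
{
    std::int32_t lane[4];
};

// Lane-wise "less than": a true lane becomes all ones (-1), a false lane 0.
inline int4 less_than4(int4 a, int4 b)
{
    int4 m;
    for (int i = 0; i < 4; ++i)
        m.lane[i] = (a.lane[i] < b.lane[i]) ? -1 : 0;
    return m;
}

// Collect the sign bit of each lane into the low four bits of an integer,
// so a single scalar test can tell whether any lane is set.
inline std::uint32_t sign_mask(int4 m)
{
    std::uint32_t bits = 0;
    for (int i = 0; i < 4; ++i)
        bits |= (static_cast<std::uint32_t>(m.lane[i]) >> 31) << i;
    return bits;
}

// Bitwise blend: take bits from a where the mask is set, from b elsewhere.
inline int4 select4(int4 mask, int4 a, int4 b)
{
    int4 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = (a.lane[i] & mask.lane[i]) | (b.lane[i] & ~mask.lane[i]);
    return r;
}
```

With this model, `if (mask)` is an early-out that skips all the work when no lane is covered, and the two `select` calls at the end are masked stores: lanes that fail the test just get their old buffer value written back.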