In this case, the loop is load-store bound on every mainstream architecture, so you will never saturate the ALUs. For each element, whether it is processed as a scalar or as a SIMD vector of any width, the best possible codegen requires 2 loads, 1 floating-point add, and 1 store.
Common big-core microarchitectures can sustain 2-4 fadd instructions (scalar or vector) per cycle, but typically only sustain 2 or 3 memory operations per cycle (again, scalar or vector, and only if they hit caches). So the limiting factor on how fast you're going to go is necessarily load-store bandwidth.
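For concreteness, here is a minimal C sketch of the kind of loop under discussion; the function name, array names, and the `double` element type are assumptions chosen to match the generated code shown below:

#include <stddef.h>

// Hypothetical version of the loop in question: per element, two loads
// (a[i] and b[i]), one floating-point add, and one store (c[i]).
void vector_add(double *restrict c, const double *restrict a,
                const double *restrict b, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}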
Unrolling is still beneficial because of other considerations, however; beyond the real computational work, which is fixed no matter how you write the loop, there is loop overhead in the form of an index update plus a compare and branch, which adds two or three instructions and uOps to each iteration. If you only do one element per iteration, that means you have to be able to turn over 5-7 instructions and 6-7 uOps per cycle to saturate the machine, and not all uArches are capable of doing so. Amortizing the loop overhead across two elements cuts that down to 8-11 instructions and maybe 10 uOps for two elements' worth of memory traffic, which takes at least two cycles anyway, i.e. roughly four or five instructions per cycle, which is achievable on many mainstream big-core designs.
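To make the bookkeeping concrete, here is a hand-unrolled (2x) version of the hypothetical loop above. This is only an illustration of where the savings come from, not the transformation the compiler literally performs (which also vectorizes, as shown below):

#include <stddef.h>

// Illustrative 2x unroll: the index update, compare, and branch are now
// shared between two elements' worth of loads, adds, and stores.
// Assumes n is even; a real version would need a scalar cleanup loop.
void vector_add_unrolled(double *restrict c, const double *restrict a,
                         const double *restrict b, size_t n) {
    for (size_t i = 0; i < n; i += 2) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
    }
}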
Thus, it is pretty common for compilers to unroll this loop 2x to get better performance. E.g. the compiler-optimized loop on Apple Silicon, which processes two 128-bit vectors (four doubles) per iteration:
0: ldp q0, q1, [x12, #-0x10] // load two vectors from a
ldp q2, q3, [x11, #-0x10] // load two vectors from b
fadd.2d v0, v0, v2 // add first vectors from a and b
fadd.2d v1, v1, v3 // add second vectors
stp q0, q1, [x10, #-0x10] // store two result vectors to c
add x12, x12, #0x20 // advance pointer to a
add x11, x11, #0x20 // advance pointer to b
add x10, x10, #0x20 // advance pointer to c
subs x9, x9, #0x4 // subtract 4 from loop count
b.ne 0b // repeat loop if not done
and the compiler-optimized loop for x86_64 (SSE2 only, without AVX or AVX-512):
0: movupd xmm0, xmmword ptr [r15 + 8*rcx + 32] // first vector from a
movupd xmm1, xmmword ptr [r15 + 8*rcx + 48] // second vector from a
movupd xmm2, xmmword ptr [r14 + 8*rcx + 32] // first vector from b
addpd xmm2, xmm0 // sum first vectors
movupd xmm0, xmmword ptr [r14 + 8*rcx + 48] // second vector from b
addpd xmm0, xmm1 // sum second vectors
movupd xmmword ptr [r12 + 8*rcx + 32], xmm2 // store first result
movupd xmmword ptr [r12 + 8*rcx + 48], xmm0 // store second result
add rcx, 4 // update index
cmp rax, rcx // check for loop termination
jne 0b