For Loops: Patterns and Low-Level Tour

I wanted the challenge of taking an elementary programming concept (‘for’ loops) and writing an article for advanced programmers. All of the patterns discussed today are applicable to high-level languages that have ‘for’ loops, but we will examine ‘for’ loops at a low level to discover why things are the way they are. By “low level,” I mean compilers and hardware. Examples are in C, C++, x86 assembly, and, of course, JS++. In addition to covering for loop patterns for everyday programming, I’m going to cover for loops from the perspective of optimizing compilers.

Length/Size Caching

Compilers may not automatically cache your loop length or array size just because the array you’re looping does not change. In JS++:

int[] arr = [ 1, 2, 3 ];

for (int i = 0; i < arr.length; ++i) {
    // ...
}

// is not the same as:

for (int i = 0, len = arr.length; i < len; ++i) {
    // ...
}

Now, for more advanced users, I'm going to show you why. We consider C/C++ optimizing compilers to be the state of the art. As of this writing, the latest gcc 11.2 (and clang 13) do not perform this optimization for this basic case:

// Assuming 'v' is const std::vector<int> &
for (size_t i = 0; i != v.size(); ++i) {
    printf("%zu\n", i);
}

... with special emphasis on the type being a const reference.

On -O3, the comparison on each iteration becomes:

.L3:
        mov     rsi, rbx
        mov     edi, OFFSET FLAT:.LC0
        xor     eax, eax
        add     rbx, 1
        call    printf
        mov     rax, QWORD PTR [rbp+8] # occurs
        sub     rax, QWORD PTR [rbp+0] # every
        sar     rax, 2                 # iteration
        cmp     rbx, rax
        jne     .L3

I'm using printf versus std::cout here because the generated assembly code is easier to read. Furthermore, it doesn't do any exception bookkeeping.

Now, if we cache the size:

for (size_t i = 0, sz = v.size(); i != sz; ++i) {
    printf("%zu\n", i);
}
        push    rbp
        push    rbx
        sub     rsp, 8
        mov     rbp, QWORD PTR [rdi+8] # before
        sub     rbp, QWORD PTR [rdi]   # the
        sar     rbp, 2                 # loop
        je      .L1
        xor     ebx, ebx
.L3:
        mov     rsi, rbx
        mov     edi, OFFSET FLAT:.LC0
        xor     eax, eax
        add     rbx, 1
        call    printf
        cmp     rbx, rbp
        jne     .L3

The assembly code is subtracting a start pointer from an end pointer (at a +8 byte offset, denoting a 64-bit size type). If the pointers are equal, there are zero elements, so we jump to the basic block immediately following the for loop. SAR rbp, 2 is equivalent to >> 2 (division by 4, sizeof(int)). (pointer_end - pointer_start) / sizeof(T) gives us the number of elements. We can confirm in the libstdc++ source code:

struct _Vector_impl_data
{
    pointer _M_start;
    pointer _M_finish;
    // ...
};
      /**  Returns the number of elements in the %vector.  */
      _GLIBCXX_NODISCARD _GLIBCXX20_CONSTEXPR
      size_type
      size() const _GLIBCXX_NOEXCEPT
      { return size_type(this->_M_impl._M_finish - this->_M_impl._M_start); }

The problem, from the perspective of compilers, is that the side effect (writing to standard output) causes the optimizer to discard the loop length caching optimization. The fix is simple: use only pure functions if you want a compiler to automatically apply this optimization. The alternative is to simply cache the size if no mutations occur inside the loop.

One more reason compilers may not perform this optimization, which is important to JS++ but maybe not for C/C++, is compile times. The analysis required to prove the loop length does not change can be expensive.

There are a lot of moving parts: the language, the compiler, the CPU cache architecture, and so on. Whether or not you need this optimization at all should depend on your benchmarks. If the size is in the L1 CPU cache, it would make no practical difference unless you needed to shave 3-4 cycles per iteration (e.g. high-frequency trading). The key takeaway for general developers is that you cannot assume the compiler will do this for you—even when it seems obvious that the size or loop length never changes.

Unsigned Reverse Iteration

Oftentimes, I'm surprised when programmers don't know how to manually loop in reverse when the data type is unsigned. Yes, C++ has rbegin/rend/crbegin/crend, but I'm talking about manual reverse iteration.

First, let's see what does not work. I'll use the JS++ syntax because, in my opinion, it's the most readable for a general audience:

int[] arr = [ /* ... */ ];
for (unsigned int i = (unsigned int) arr.length - 1;
     i > 0;
     i--)
{
    // incorrect
}

The above code will never print the first element.

"No problem," you say to yourself. "I'll just compare against -1."

Wait. You're going to compare an unsigned (non-negative) value against -1? Two things can happen. In C and C++, -1 will wrap. If you're comparing a 64-bit number, -1 becomes 0xFFFFFFFFFFFFFFFF. Your loop condition will never be true (because i will never be "greater than" UINT64_MAX), and the optimizing compiler will simply eliminate the loop. In JS/JS++, your i counter variable will wrap, and you'll get an infinite loop. (The difference, then, is that C/C++ will wrap the RHS; while JS++ will wrap the LHS. JS++ does this for type system soundness reasons, which extend beyond the scope of for loops.)

The code discussed would be perfectly fine if the container size type were signed. Thus, I presented the "intuitive" (but incorrect) method of unsigned reverse iteration. Instead, the proper way to do unsigned reverse iteration looks something like this:

for (unsigned int i = arr.length; i --> 0; ) {
    // ...
}

There's a curiously-titled Stack Overflow thread on this:

What is the "-->" operator in C/C++?

The above code can be rewritten as:

for (unsigned int i = arr.length; (i--) > 0; ) {
    // ...
}

You should prefer the latter—for readability.

Interestingly, being that unsigned reverse iteration is deceptively unintuitive, it was brought up as a consideration to make all JS++ container size types signed. I was in the unsigned camp, but, besides being easier, what pushed us into the signed types direction is because JS++ is a superset of JavaScript (ECMAScript 3). If you want to see a bug in language design and specification for a very popular language (JavaScript), please read Design Notes: Why isn’t System.Array.length an ‘unsigned int’? As far as I know, JS++ was the first to uncover this design bug.

The common misconception with signed container size types is that you have to pay for two bounds checks instead of one to check for negative index accesses:

if (i < 0 && i >= arr.length) {
    // out of bounds
}

However, besides being more beginner-friendly, signed container size types are actually equally efficient to unsigned:

if ((unsigned int) i >= (unsigned int) arr.length) {
    // out of bounds
}

If the index is a negative value, it will wrap (and, thus, will always be greater than the length property cast to unsigned because length is limited to the signed System.Integer32.MAX_VALUE). In other words, the "negative index check" is redundant. The casts do not result in additional instructions.

Labeled break/continue

This brings me to an example that will not work in C or C++, because those languages do not have labeled break/continue statements.

Somebody actually emailed me to remark on how impressed he was with the quality of the JS++ documentation. He spent all these years programming and never knew about labeled break/continue in JavaScript! The following will work in JS++, JS, and Java. I'll just take it directly from the JS++ documentation:

outerLoop: for (int x = 1; x <= 5; ++x) {
    innerLoop: for (int y = 1; y <= 5; ++y) {
        break outerLoop;
    }
}

Notice we labelled the loops (outerLoop and innerLoop) at lines 1 and 2, respectively. We also provided a label (outerLoop) to the break statement at line 3. By referring to the outerLoop in our break statement, we were able to exit the outermost loop; without the label, we would not have been able to exit the outermost loop. (The innermost loop would have been exited since it is the closest loop to the break statement.)

Source: Label Statement (JS++ Documentation)

One reason we might want to break out of an outer loop is if we are searching a jagged array. Once the element is found, we break out of both loops by breaking out of the outer loop.

Looping in Row-Major Order

Given the following C code, would it take more CPU cycles to loop the columns first or the rows first?

const int rows = 64, cols = 64;
int matrix[rows][cols];

In the code above, assume int requires 32 bits of storage space (4 bytes). Furthermore, we'll assume a cache line size of 64 bytes. We can also be assured the arrays are contiguous. From the C11 language specification, 6.2.5 Types:

An array type describes a contiguously allocated nonempty set of objects with a particular member object type, called the element type.

For visualization, treat each cell as consuming 256 bytes of memory (sizeof(int) ✕ 64 cols ✕ 1 row). The matrix will look like this in memory:

0 1 2 3 4 63

Notice the "shape" of the matrix in memory is not a square, as we might expect in mathematics for a 64 x 64 matrix. Instead, each cell represents one row containing 64 columns consuming 256 bytes each.

Equipped with this visualization, let's first examine looping columns first:

for (int i = 0; i < cols; i++) {
    for (int j = 0; j < rows; j++) {
        // ...
    }
}

This code has an access pattern that looks like this:

r0
c0, c1, c2, …
r1
c0, c1, c2, …
r2
c0, c1, c2, …
r3
c0, c1, c2, …
r4
c0, c1, c2, …
r…
c0, c1, c2, …
r…
c0, c1, c2, …
r63
c0, c1, c2, …

I've prefixed the cell values with lowercase "r" to mark row 1, 2, 3, 4, and so on. Likewise, the columns are marked c0, c1, and so on. There's a problem here though. We're jumping around in memory.

Each "row" is actually divided into 64 columns, stored contiguously. You can imagine the storage as 4096 (64 x 64) columns stored contiguously in memory.

In the first iteration of the loop, we access the first column. We enter the innermost loop, which iterates the rows. We're at c0, r0. On the next iteration of the innermost loop, we're at c0, r1. Then we're at c0, r2, and so forth. When the innermost loop finishes, we start at c1, r0.

We have a cache line size of 64 bytes. At c0, r0, we can cache the first 16 columns of r0 (64 bytes / sizeof(int) = 16). Suppose we have an L1 cache hit with a cost of ~4 cycles and a cache miss of ~100 cycles. Thus, a cache miss would be roughly 25x slower. We have the first 16 columns cached (c0, c1, c2, c3, and so on for r0), but the innermost loop immediately jumps to r1. Thus, instead of getting the data we need from the cache, we have to fetch the data from DRAM... costing us 100 cycles. We have to pay for this penalty 64 times in the innermost loop.

Clearly, this would not be efficient. It is better to just access the contiguous data in the order which it is stored:

for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
        // ...
    }
}
c0
r0
c1
r0
c2
r0
c3
r0
c4
r0
c…
r0
c…
r0
c63
r0

Notice in the visualization, all of the columns of row 0 are accessed first. We do not move on to row 1 until all the columns of row 0 are accessed. Coincidentally, this results in fewer cache misses. We pay for a first fetch from DRAM (100 cycles) for the first 16 columns, the next accesses come from SRAM (~4 cycles), and we only pay for three more DRAM fetches for the full 64 columns. In the end, this cache-friendly code costs us ~10x fewer cycles in the innermost loop.

Loop-Invariant Code Motion

I had a family member studying computer science at a top 10 CS school. The homework assignment presented a multiple choice question, and she was asked, "Does this code do anything?" The choices were yes/no.

int main(void) {
    int x;
    for (int i = 0; i < 10; i++) x = 2;
    return 0;
}

The correct answer is "no." Loop-invariant code motion might sound intimidating or arcane, but we can break it up into its components: 1) loop invariant, and 2) code motion. First, the compiler begins by identifying "loop invariants"—essentially, code that does not change across loop iterations. The assignment to x = 2 does not change with each iteration. (If the assignment were x = i + 1 instead, it would not qualify as a "loop invariant.") The second half of the optimization, "code motion", permits us to move code (code motion = moving code) that does not change outside of the loop—if safe. I'm emphasizing "if safe" because, if the loop condition were never true (e.g. if we changed the condition to i != 0), x = 2 should never occur.

And that, in a nutshell, is loop-invariant code motion: moving code that doesn't change outside of the loop. The code should now look like this, in high-level code:

int main(void) {
    int x;
    x = 2; // moved outside loop
    for (int i = 0; i < 10; i++);
    return 0;
}

However, we can dive deeper into modern compilers. After loop-invariant code motion, the compiler can perform a second pass: dead code elimination. The for loop is no longer doing anything, and it gets eliminated. The variable x is written to but never read; thus, it can also be eliminated. The final program, in high-level code:

int main(void) {
    return 0;
}

At a low level, if we compile the dead code, gcc -S -O2 finally gives us proper code:

int main(void) {
    int x;
    x = 2; // moved outside loop
    for (int i = 0; i < 10; i++);
    return 0;
}
main:
        xor     eax, eax
        ret

In the end, my family member reported back that the correct answer was indeed "no." I asked if she gave my detailed explanation. She said, "No, because the professor would know I cheated." (😁)

There's a question relating to loop-invariant code motion (via gcc -fmove-loop-invariants) on Stack Overflow dating back 5 years, but the only answer is incorrect. The proper answer is buried in the gcc documentation:

Most optimizations are only enabled if an -O level is set on the command line. Otherwise they are disabled, even if individual optimization flags are specified.

Source: GCC Manual 3.10 Options That Control Optimization

Roger PoonRoger Poon
JS++ Designer and Project Lead. Follow me on Twitter or GitHub.