When the [[msvc::musttail]] attribute silently fails

11 Jan 2026

Python developers recently reported a 15% speedup when using the new MSVC musttail attribute to create a threaded interpreter. Unfortunately, I found that MSVC does not always generate a tail call when you use this attribute, which can lead to a stack overflow when interpreting a complex program.

Background

The musttail attribute, also supported by Clang, forces the compiler to generate a tail call for a return statement that calls another function. Instead of a CALL instruction to call this function followed by a RET to return from the current function, the compiler emits a JMP instruction that jumps to the next function without creating a new stack frame. MSVC recently added support for this attribute, which is useful for p-code interpreters.

P-code interpreters are traditionally coded as a giant switch statement inside a loop:

for (each instruction in pcode) {
    switch (opcode) {
        case ADD:
            // do the addition
            break;
        case MUL:
            // do the multiplication
            break;
        ...
    }
}

The switch statement compiles to an indirect jump. Every p-code instruction is dispatched through this single jump, so the processor sees one branch with many possible targets, which makes the next target hard to predict.
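To make the loop above concrete, here is a minimal runnable sketch of a switch-based interpreter. The tiny instruction set (PUSH, ADD, MUL, HALT), the Instr layout and the run_switch name are assumptions made up for illustration; they are not taken from Python or any other real interpreter.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical instruction set, for illustration only.
enum Opcode : std::uint8_t { PUSH, ADD, MUL, HALT };

struct Instr {
    Opcode op;
    int operand;  // used by PUSH only
};

// Runs the program on a small operand stack and returns the top value at HALT.
int run_switch(const Instr* pc) {
    std::vector<int> stack;
    for (;; ++pc) {
        switch (pc->op) {  // the single dispatch point shared by all opcodes
        case PUSH:
            stack.push_back(pc->operand);
            break;
        case ADD: {
            int rhs = stack.back();
            stack.pop_back();
            stack.back() += rhs;
            break;
        }
        case MUL: {
            int rhs = stack.back();
            stack.pop_back();
            stack.back() *= rhs;
            break;
        }
        case HALT:
            return stack.back();
        }
    }
}
```

A program such as PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT computes (2 + 3) * 4, and every one of its instructions passes through the same switch.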

If musttail is supported by your compiler, you can create a function for each p-code instruction and link these functions via a dispatch table:

function DoAdd(INSTR * instr) {
    // Do the addition
    // ...
    // Move to the next p-code instruction
    instr++;
    return dispatch_table[GetOpcode(instr)](instr); // musttail
}

function DoMul(INSTR * instr) {
    // Do the multiplication
    // ...
    // Move to the next p-code instruction
    instr++;
    return dispatch_table[GetOpcode(instr)](instr); // musttail
}

dispatch_table = {
    ADD: DoAdd,
    MUL: DoMul,
    ...
};

This compiles to an indirect jump in the epilogue of each function. When running your interpreter, the branch predictor keeps a separate branch history for each p-code instruction. For example, if your p-code usually runs MUL after ADD, the processor will remember this. That's why threaded code is usually faster.
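The pseudocode above could be written in C++ roughly as follows. This is a sketch under assumptions: the MUSTTAIL macro, the operand-stack pointer passed as a second argument, the handler names other than DoAdd and DoMul, and run_threaded are all invented for this example. The macro expands to [[clang::musttail]] on Clang and [[msvc::musttail]] on MSVC; on other compilers it expands to nothing, so the tail call is not guaranteed there.

```cpp
#include <cstdint>

#if defined(__clang__)
#  define MUSTTAIL [[clang::musttail]]
#elif defined(_MSC_VER)
#  define MUSTTAIL [[msvc::musttail]]
#else
#  define MUSTTAIL  // other compilers: no tail-call guarantee
#endif

// Hypothetical instruction set, for illustration only.
enum Opcode : std::uint8_t { PUSH, ADD, MUL, HALT };

struct Instr {
    Opcode op;
    int operand;  // used by PUSH only
};

// All handlers share one signature; interpreter state travels in the arguments.
using Handler = int (*)(const Instr* pc, int* sp);

extern const Handler dispatch_table[];

int DoPush(const Instr* pc, int* sp) {
    *sp++ = pc->operand;
    ++pc;  // move to the next p-code instruction
    MUSTTAIL return dispatch_table[pc->op](pc, sp);
}

int DoAdd(const Instr* pc, int* sp) {
    sp[-2] += sp[-1];
    --sp;
    ++pc;
    MUSTTAIL return dispatch_table[pc->op](pc, sp);
}

int DoMul(const Instr* pc, int* sp) {
    sp[-2] *= sp[-1];
    --sp;
    ++pc;
    MUSTTAIL return dispatch_table[pc->op](pc, sp);
}

// HALT ends the chain with an ordinary return of the top of the stack.
int DoHalt(const Instr*, int* sp) {
    return sp[-1];
}

// Indexed by Opcode, so the order must match the enum.
const Handler dispatch_table[] = { DoPush, DoAdd, DoMul, DoHalt };

int run_threaded(const Instr* program) {
    int stack[64] = {};  // fixed-size operand stack, enough for the sketch
    return dispatch_table[program->op](program, stack);
}
```

Each handler ends with its own indirect jump, which is what gives the branch predictor a separate history per opcode. Keeping the signatures identical matters because the musttail attributes generally require the caller and callee to match.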

The MSVC problem

The newly added [[msvc::musttail]] attribute is ignored when the function is moderately complex (so that it saves non-volatile registers on the stack) and has multiple returns (some of them without a tail call):

#include <stdio.h>
#include <windows.h>

void __declspec(noinline) increment(int x) {
    printf("%d\n", x + 1);
}

void incrementIfPositive(int x) {
    DWORD64 a = GetTickCount64();
    DWORD64 b = GetTickCount64();
    DWORD64 c = GetTickCount64();
    if (c == 0) {
        return;
    }

    [[msvc::musttail]]
    return increment(x + (int)(b - a + c / 2));
}

This is a made-up example, but a similar early return happens in a real interpreter when handling an exception. Assembly output:

; 18   :     if (a == 0) {

  test  rax, rax
  je    SHORT $LN1@incrementIfPositive

...

; 25   :     [[msvc::musttail]]
; 26   :     return increment(x + (int)(b - a + c / 2));

  shr  rax, 1
  lea  ecx, DWORD PTR [rbx+42]
  sub  eax, edi
  add  ecx, eax
  call ?increment@@YAXH@Z
  mov  rbx, QWORD PTR [rsp+48]
$LN1@incrementIfPositive:

; 27   : }

  add  rsp, 32          ; 00000020H
  pop  rdi
  ret  0
?incrementIfPositive@@YAXH@Z ENDP

Despite the [[msvc::musttail]] attribute, the Visual C++ compiler generates a call to the increment function. I think it’s because the function epilogue is quite long (with add rsp, 32 and pop rdi instructions), so the compiler does not want to duplicate it for the if (c == 0) case. Instead, the compiler generates a conditional jump to $LN1@incrementIfPositive when c == 0, but this prevents the musttail optimization.

Visual C++ also does not report a compilation error, as the documentation says it should. It just ignores the musttail attribute and generates a call instruction instead of jmp / rex_jmp.
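Because the failure is silent, one way to notice it without reading the assembly listing is a runtime check that the stack does not grow along a chain of tail calls. The sketch below is only a heuristic, and all of its names (MUSTTAIL, note_stack_position, countdown, stack_growth) are made up for this example. It records the address of a local variable at every step and reports the spread between the lowest and highest address seen.

```cpp
#include <cstdint>

#if defined(__clang__)
#  define MUSTTAIL [[clang::musttail]]
#elif defined(_MSC_VER)
#  define MUSTTAIL [[msvc::musttail]]
#else
#  define MUSTTAIL  // other compilers: no tail-call guarantee
#endif

// Lowest and highest stack addresses observed so far.
static std::uintptr_t g_lo = UINTPTR_MAX;
static std::uintptr_t g_hi = 0;

// Records roughly where the current stack frame sits.
static void note_stack_position() {
    char marker = 0;
    const auto here = reinterpret_cast<std::uintptr_t>(&marker);
    if (here < g_lo) g_lo = here;
    if (here > g_hi) g_hi = here;
}

// Hypothetical stand-in for a handler that chains to the next one.
int countdown(int remaining) {
    note_stack_position();
    if (remaining == 0) {
        return 0;  // early return without a tail call
    }
    MUSTTAIL return countdown(remaining - 1);
}

// Returns how many bytes of stack the chain of `steps` calls spanned.
std::uintptr_t stack_growth(int steps) {
    g_lo = UINTPTR_MAX;
    g_hi = 0;
    countdown(steps);
    return g_hi - g_lo;
}
```

If stack_growth(10000) stays close to zero, the calls were most likely compiled as jumps. A value that grows in proportion to the step count suggests that real call instructions were emitted. In a real interpreter the same probe would go into the handlers themselves, since whether the attribute is honored depends on the shape of each function.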

A workaround that I found is to create a useless handle_exception function and tail-call it instead of returning early:

int g_x;

void __declspec(noinline) handle_exception(int x) {
    // Do something to avoid optimizing out this function
    g_x = x;
}

void incrementIfPositive(int x) {
    DWORD64 a = GetTickCount64();
    if (a == 0) {
        return handle_exception(x);
    }
    // The rest of the code is the same
    // ...

Assembly output in this case:

; 18   :     if (a == 0) {

  test  rax, rax
  jne  SHORT $LN2@incrementIfPositive

; 27   : }

  add  rsp, 32
  pop  rdi

; 19   :         return handle_exception(x);

  jmp  ?handle_exception@@YAXH@Z

...

; 24   : 
; 25   :     [[msvc::musttail]]
; 26   :     return increment(x + (int)(b - a + c / 2));

  shr  rax, 1
  lea  ecx, DWORD PTR [rbx+42]
  sub  eax, edi
  add  ecx, eax
  mov  rbx, QWORD PTR [rsp+48]

; 27   : }

  add  rsp, 32
  pop  rdi

; 24   : 
; 25   :     [[msvc::musttail]]
; 26   :     return increment(x + (int)(b - a + c / 2));

  jmp  ?increment@@YAXH@Z
?incrementIfPositive@@YAXH@Z ENDP

Here, a tail call is correctly generated. Unfortunately, this workaround does not help if you have more than one early return from the function. Even if you create several useless functions, it still won't work.

Conclusion

I encountered this problem when trying to apply the musttail optimization to the regular expression engine in Aba Search and Replace. I don't know if it affects the Python interpreter, but I reported the bug to Microsoft and it is now under consideration.

