CUDA BIP39 Kernel Bug: When Negative Shifts Silently Corrupt Your Entropy

A single missing else if in a CUDA BIP39 kernel silently corrupted entropy packing: a negative shift amount wrapped around instead of failing. The bug surfaced when the standard "abandon × 11 about" test vector returned zero results on an H100, even though the kernel compiled and ran without errors.

Here is the bug, the CUDA semantics that make it invisible, and the fix.

Background: BIP39 Entropy Structure

A 12-word BIP39 mnemonic encodes 132 bits: 128 bits of entropy plus a 4-bit checksum. The checksum is the first nibble of SHA256 of the entropy bytes.

Word 1    Word 2   ...   Word 12
[11 bits][11 bits] ... [11 bits] = 132 total bits
└─────────────── 128 bits ─────────────────┘└─ 4 bits ─┘
              entropy                        checksum

Each word index (0–2047) contributes 11 bits, packed MSB-first (big-endian, per BIP39 spec). For 12 words:

| Words | Global bit positions | Content |
|-------|---------------------|---------|
| 1–11 | 0–120 | Pure entropy |
| 12 (MSBs: word-index bits 10–4) | 121–127 | 7 bits entropy |
| 12 (LSBs: word-index bits 3–0) | 128–131 | 4 bits checksum |

In a correct implementation, checksum bits (128–131) are excluded from the 16-byte entropy buffer and verified separately. The bug prevents this separation.
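
The layout above can be sketched on the host as a bit-by-bit split into a 16-byte entropy buffer and a separate checksum nibble. This is a minimal illustration, not the kernel's code: `Decoded12` and `decode12` are hypothetical names, and the byte-array layout is an assumption chosen so that every shift amount stays small and non-negative.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper types, not taken from the original kernel.
struct Decoded12 {
    std::array<std::uint8_t, 16> entropy{};  // global bits 0..127
    std::uint8_t checksum = 0;               // global bits 128..131
};

// Splits 12 word indices (0..2047 each) into entropy bytes and checksum nibble.
inline Decoded12 decode12(const std::array<std::uint16_t, 12>& idx) {
    Decoded12 out;
    for (int w = 0; w < 12; ++w) {
        for (int b = 0; b < 11; ++b) {
            const int bit_pos = w * 11 + b;                  // 0..131, MSB-first
            const unsigned bit = (idx[w] >> (10 - b)) & 1u;  // shift is always 0..10
            if (bit_pos < 128) {
                // Bit 7 is the most significant bit of each entropy byte.
                out.entropy[bit_pos / 8] |= static_cast<std::uint8_t>(bit << (7 - bit_pos % 8));
            } else {
                // bit_pos 128 is the most significant bit of the checksum nibble.
                out.checksum |= static_cast<std::uint8_t>(bit << (131 - bit_pos));
            }
        }
    }
    return out;
}
```

Because the checksum has its own destination, no branch can write it into the entropy buffer, whatever the shift semantics of the target.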

The Bug

Our kernel stored the 132-bit sequence in two uint64_t registers, MSB-first:

uint64_t high = 0;  // global bits 0..63  (MSB at bit 63)
uint64_t low  = 0;  // global bits 64..127 (MSB at bit 63, LSB at bit 0)

The packing loop:

// BUGGY version
for (int b = 0; b < WORD_BITS; b++) {
    int bit_pos = bit_start + b;
    int bit_val = (idx >> (WORD_BITS - 1 - b)) & 1;

    if (bit_pos < 64) {
        high |= ((uint64_t)bit_val) << (63 - bit_pos);
    } else {
        // bit_pos 64–127: shift = 63 - (bit_pos-64), range [63..0] → OK
        // bit_pos 128–131: shift = 63 - (64..67) = -1..-4  → NEGATIVE
        low |= ((uint64_t)bit_val) << (63 - (bit_pos - 64));
    }
}

For bit_pos = 128: shift = 63 - (128 - 64) = -1.
For bit_pos = 131: shift = 63 - (131 - 64) = -4.

In ISO C/C++, left-shifting by a negative amount is undefined behavior (C11 §6.5.7), so the compiler owes you neither a diagnostic nor any particular result. Here, the result was not zero.

CUDA PTX: Negative Shifts Are Defined (and Wrong)

The C++ expression is undefined, but the compiled kernel still executes some shift. In our build, the result matched a shift whose amount had been reinterpreted as unsigned and reduced modulo the 64-bit operand width:

effective amount = (unsigned)amount & 63      // behavior observed in our build

(uint64_t)x << -1  →  (-1 as unsigned = 0xFFFF...FFFF) & 63 = 63  →  x << 63
(uint64_t)x << -4  →  (-4 as unsigned = 0xFFFF...FFFC) & 63 = 60  →  x << 60

Do not rely on this. The PTX ISA documents shl.b64 as taking an unsigned 32-bit shift amount and clamping amounts greater than the register width, which would yield zero rather than a wrapped shift. What an out-of-range C++ shift compiles to can therefore vary with compiler version, optimization level, and target architecture. In every case there is no crash and no warning: the instruction executes and returns a legal 64-bit value.
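
The wrap-around arithmetic can be reproduced on the host by masking the shift amount explicitly, which keeps the C++ well defined. The helpers below are a sketch with hypothetical names (`shl_wrap64`, `shl_clamp64`); the second one shows, for contrast, a clamping shift under which the same out-of-range amounts would produce zero.

```cpp
#include <cstdint>

// Emulates a 64-bit left shift whose amount is reduced modulo 64.
// The explicit mask keeps the host-side expression well defined.
inline std::uint64_t shl_wrap64(std::uint64_t x, int amount) {
    const unsigned masked = static_cast<unsigned>(amount) & 63u;  // -1 -> 63, -4 -> 60
    return x << masked;
}

// Emulates a clamping shift: any amount outside 0..63 yields zero.
inline std::uint64_t shl_clamp64(std::uint64_t x, int amount) {
    const unsigned a = static_cast<unsigned>(amount);  // negative values become very large
    return a < 64u ? (x << a) : std::uint64_t{0};
}
```

With wrap-around, a set bit shifted by -1 lands on bit 63; with clamping it disappears. Neither outcome is an error the hardware reports.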

Detection limitation: You cannot use compute-sanitizer (formerly cuda-memcheck) to catch this. compute-sanitizer detects memory access errors and race conditions, not arithmetic operations that produce wrong-but-valid values. The negative shift returns a legal uint64_t — sanitizers have no way to know it is incorrect.

How Checksum Bits Overwrite Entropy

The critical issue: the negative shift wraps around and overwrites bits within the valid entropy region — specifically entropy[8] bits 4–7 — rather than causing an out-of-bounds memory access.

For bit_pos = 128 (checksum bit 0), shift = 63:

Sets bit 63 of low
Bit 63 of low = global entropy bit 64 → entropy[8] bit 7

For bit_pos = 129 (checksum bit 1), shift = 62:

Sets bit 62 of low = global entropy bit 65 → entropy[8] bit 6

For bit_pos = 130 (checksum bit 2), shift = 61:

Sets bit 61 of low = global entropy bit 66 → entropy[8] bit 5

For bit_pos = 131 (checksum bit 3), shift = 60:

Sets bit 60 of low = global entropy bit 67 → entropy[8] bit 4

The checksum bits do not cause a buffer overflow. They write back into valid memory, specifically into entropy[8] bits 4–7. The result is silent entropy corruption: SHA256 is then computed over the corrupted bytes, and its checksum nibble no longer matches the checksum bits carried by the last word.

Worked example: "abandon × 11 about"

"about" = word index 3 = binary 00000000011. Only bits at positions 130 and 131 have bit_val = 1.

| bit_pos | Shift formula | Effective shift | low bit set | Entropy byte corrupted |
|---------|--------------|---------------|---------------|----------------------|
| 130 | 63-(130-64)=-3 | -3 & 63 = 61 | bit 61 | entropy[8] bit 5 → +0x20 |
| 131 | 63-(131-64)=-4 | -4 & 63 = 60 | bit 60 | entropy[8] bit 4 → +0x10 |

entropy[8] = (low >> 56) & 0xFF;
// low = 0x3000000000000000 (bits 61,60 set)
// (0x3000000000000000 >> 56) = 0x30
// entropy[8] = 0x30  (correct: 0x00)

SHA256 of the corrupted 16-byte entropy produces a wrong checksum nibble. The checksum filter rejects the candidate. Zero results.
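
The worked example can be replayed on the host. The sketch below emulates the unguarded loop, with an explicit `& 63u` standing in for the wrap-around described above. That mask is an assumption made so the host code is well defined; `Packed`, `pack_unguarded_emulated`, and `entropy_byte8` are hypothetical names, not the kernel source.

```cpp
#include <array>
#include <cstdint>

struct Packed {
    std::uint64_t high = 0;  // global bits 0..63
    std::uint64_t low = 0;   // global bits 64..127
};

// Host-side emulation of the unguarded packing loop for 12 words.
inline Packed pack_unguarded_emulated(const std::array<std::uint16_t, 12>& idx) {
    constexpr int WORD_BITS = 11;
    Packed p;
    for (int w = 0; w < 12; ++w) {
        for (int b = 0; b < WORD_BITS; ++b) {
            const int bit_pos = w * WORD_BITS + b;
            const std::uint64_t bit_val = (idx[w] >> (WORD_BITS - 1 - b)) & 1u;
            if (bit_pos < 64) {
                p.high |= bit_val << (63 - bit_pos);
            } else {
                const int shift = 63 - (bit_pos - 64);  // -1..-4 for bit_pos 128..131
                // Assumed wrap-around: reduce the amount modulo 64.
                p.low |= bit_val << (static_cast<unsigned>(shift) & 63u);
            }
        }
    }
    return p;
}

// Entropy byte 8 is the top byte of `low`.
inline std::uint8_t entropy_byte8(const Packed& p) {
    return static_cast<std::uint8_t>((p.low >> 56) & 0xFF);
}
```

Feeding eleven zeros followed by index 3 reproduces the values in the worked example: `low` becomes 0x3000000000000000 and entropy byte 8 becomes 0x30.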

The Fix

One else if:

-    } else {
+    } else if (bit_pos < 128) {

Full fixed code:

if (bit_pos < 64) {
    high |= ((uint64_t)bit_val) << (63 - bit_pos);
} else if (bit_pos < 128) {  // entropy bits only
    low |= ((uint64_t)bit_val) << (63 - (bit_pos - 64));
}
// bit_pos 128–131: checksum bits, deliberately not written into high/low.
// They are verified after the loop: (last word's idx & 0x0F) vs SHA256(entropy)[0] >> 4

With this guard, checksum bits are never written into low, so bits 60–63 of low hold only entropy bits 64–67. The checksum itself is the low 4 bits of the last word's index, and it is compared against the top nibble of the first SHA256 byte.
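
A host-side sketch of the guarded loop, with the checksum taken directly from the last word's index, might look like this. `Packed12` and `pack_guarded` are hypothetical names; the kernel's actual surrounding code is not shown in this article.

```cpp
#include <array>
#include <cstdint>

struct Packed12 {
    std::uint64_t high = 0;     // global bits 0..63
    std::uint64_t low = 0;      // global bits 64..127
    std::uint8_t checksum = 0;  // global bits 128..131
};

inline Packed12 pack_guarded(const std::array<std::uint16_t, 12>& idx) {
    constexpr int WORD_BITS = 11;
    constexpr int ENTROPY_BITS = 128;
    Packed12 p;
    for (int w = 0; w < 12; ++w) {
        for (int b = 0; b < WORD_BITS; ++b) {
            const int bit_pos = w * WORD_BITS + b;
            const std::uint64_t bit_val = (idx[w] >> (WORD_BITS - 1 - b)) & 1u;
            if (bit_pos < 64) {
                p.high |= bit_val << (63 - bit_pos);
            } else if (bit_pos < ENTROPY_BITS) {
                p.low |= bit_val << (63 - (bit_pos - 64));  // shift stays in 0..63
            }
            // bit_pos 128..131 are intentionally not written into high/low.
        }
    }
    // For 12 words the checksum is the low 4 bits of the last word's index.
    p.checksum = static_cast<std::uint8_t>(idx[11] & 0x0F);
    return p;
}
```

Every shift amount in this version is provably within 0..63, so the result does not depend on how the target treats out-of-range shifts.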

Committed as f698621, verified on H100 sm_90 (CUDA 12.4).

Affected Search Space

The bug triggers only when checksum bits in the last word's index are non-zero, i.e., when any of the word's 4 lowest bits are 1.

Words with index & 0xF == 0: 128 words (unaffected)
Words with index & 0xF != 0: 1920 words (93.75% of vocabulary) — silently rejected

If you are searching for a missing word in the last position, roughly 93.75% of your search space is silently discarded. The kernel runs to completion and reports zero results.

For other word counts:

| Words | Entropy bits | Checksum bits | Affected vocab % |
|-------|-------------|---------------|-----------------|
| 12 | 128 | 4 | 93.75% (1/16 safe) |
| 15 | 160 | 5 | 96.88% |
| 18 | 192 | 6 | 98.44% |
| 21 | 224 | 7 | 99.22% |
| 24 | 256 | 8 | 99.61% |
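
The percentages follow from the checksum width alone: a last word is unaffected only when all of its checksum bits are zero, which is one index in every 2^c for a c-bit checksum. A small sketch of that arithmetic, using hypothetical helper names:

```cpp
#include <cstdint>

// BIP39 uses words * 11 bits in total, of which 1/33 are checksum bits
// (4 for 12 words, 8 for 24 words).
constexpr int checksum_bits(int words) { return words * 11 / 33; }
constexpr int entropy_bits(int words) { return words * 11 - checksum_bits(words); }

// Share of last-word indices whose checksum bits are not all zero, in percent.
constexpr double affected_percent(int words) {
    return 100.0 * (1.0 - 1.0 / static_cast<double>(1u << checksum_bits(words)));
}
```

`entropy_bits(words)` is also the bound to use in a generalized guard of the form `bit_pos < entropy_bit_count`.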

Test Vector Verification

Before fix:

./bip39_crack --target 9858EfFD232B4033E47d90003D41EC34EcaEda94 --missing-pos 11
# Searched 2048 candidates: 0 found  ← WRONG

After fix:

./bip39_crack --target 9858EfFD232B4033E47d90003D41EC34EcaEda94 --missing-pos 11
# ID: 0  →  abandon abandon abandon ... abandon about  ← FOUND ✓

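Standard BIP39 test vectors (mnemonic and entropy pairs from the BIP39 reference vectors; the ETH address is the commonly cited derivation for this mnemonic with an empty passphrase):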

# Test 1: zero entropy — exercises the checksum-bit boundary (this exact bug)
mnemonic: abandon × 11 about
entropy:  00000000000000000000000000000000
ETH (m/44'/60'/0'/0/0): 0x9858EfFD232B4033E47d90003D41EC34EcaEda94

# Test 2: all-ones entropy — catches off-by-one in checksum extraction
mnemonic: zoo zoo zoo zoo zoo zoo zoo zoo zoo zoo zoo wrong
entropy:  ffffffffffffffffffffffffffffffff

# Test 3: mixed — catches byte-order (endianness) bugs  
mnemonic: legal winner thank year wave sausage worth useful legal winner thank yellow
entropy:  7f7f7f7f7f7f7f7f7f7f7f7f7f7f7f7f
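
The checksum side of these vectors can be checked on the host without any GPU code. The sketch below is a minimal single-block SHA-256 restricted to 16-byte inputs, written only to illustrate the check. The function names are hypothetical, and a real tool would normally use an audited hash implementation instead.

```cpp
#include <array>
#include <cstdint>

// SHA-256 round constants (FIPS 180-4).
static const std::uint32_t K256[64] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2};

static inline std::uint32_t rotr32(std::uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }

// First digest byte of SHA-256 over exactly 16 bytes (one padded 64-byte block).
inline std::uint8_t sha256_first_byte_16(const std::array<std::uint8_t, 16>& msg) {
    std::uint8_t block[64] = {0};
    for (int i = 0; i < 16; ++i) block[i] = msg[i];
    block[16] = 0x80;  // padding: a single 1 bit after the message
    block[63] = 0x80;  // message length in bits (128), big-endian in the last 8 bytes

    std::uint32_t w[64];
    for (int i = 0; i < 16; ++i) {
        w[i] = (std::uint32_t(block[4 * i]) << 24) | (std::uint32_t(block[4 * i + 1]) << 16) |
               (std::uint32_t(block[4 * i + 2]) << 8) | std::uint32_t(block[4 * i + 3]);
    }
    for (int i = 16; i < 64; ++i) {
        const std::uint32_t s0 = rotr32(w[i - 15], 7) ^ rotr32(w[i - 15], 18) ^ (w[i - 15] >> 3);
        const std::uint32_t s1 = rotr32(w[i - 2], 17) ^ rotr32(w[i - 2], 19) ^ (w[i - 2] >> 10);
        w[i] = w[i - 16] + s0 + w[i - 7] + s1;
    }

    std::uint32_t a = 0x6a09e667, b = 0xbb67ae85, c = 0x3c6ef372, d = 0xa54ff53a;
    std::uint32_t e = 0x510e527f, f = 0x9b05688c, g = 0x1f83d9ab, h = 0x5be0cd19;
    for (int i = 0; i < 64; ++i) {
        const std::uint32_t S1 = rotr32(e, 6) ^ rotr32(e, 11) ^ rotr32(e, 25);
        const std::uint32_t ch = (e & f) ^ (~e & g);
        const std::uint32_t t1 = h + S1 + ch + K256[i] + w[i];
        const std::uint32_t S0 = rotr32(a, 2) ^ rotr32(a, 13) ^ rotr32(a, 22);
        const std::uint32_t maj = (a & b) ^ (a & c) ^ (b & c);
        const std::uint32_t t2 = S0 + maj;
        h = g; g = f; f = e; e = d + t1;
        d = c; c = b; b = a; a = t1 + t2;
    }
    // Only the first output word is needed for the checksum nibble.
    return static_cast<std::uint8_t>((0x6a09e667u + a) >> 24);
}

// 12-word check: top nibble of the first digest byte vs low 4 bits of the last index.
inline bool checksum_ok_12(const std::array<std::uint8_t, 16>& entropy, std::uint16_t last_idx) {
    return (sha256_first_byte_16(entropy) >> 4) == (last_idx & 0x0F);
}
```

For the all-zero entropy of Test 1, the check should accept a last-word index of 3 ("about") and reject its neighbours.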

The Same Pattern in Other Tools

johncantrell97/bip39-solver-gpu — `just_seed.cl`

// just_seed.cl line 33 — int32 overflow
indices[0] = (mnemonic_hi & (2047 << 53)) >> 53;

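2047 is a 32-bit int, so the shift happens at 32-bit width. OpenCL C reduces the shift amount to its low 5 bits for a 32-bit operand: 53 & 31 = 21. The result, 2047 << 21 = 0xFFE00000, has the sign bit set and is negative as an int. Sign-extended to ulong for the AND, the mask becomes 0xFFFFFFFFFFE00000 instead of the intended 0xFFE0000000000000. Assuming mnemonic_hi is a 64-bit ulong, the following >> 53 discards every extra bit the bad mask lets through, so this particular line still yields the top 11 bits. The same idiom applied to a lower field would not.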

The working file int_to_address.cl avoids the issue entirely:

indices[0] = (mnemonic_hi >> 53) & 2047;  // right-shift: always positive, naturally bounded

Right-shift with mask extraction is safer than left-shift accumulation for bit extraction. The former is bounded by the operand; the latter requires explicit range guards.
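
The difference between the two idioms can be illustrated in host C++. The first helper emulates the mask arithmetic described above (amount reduced to 5 bits, then sign extension); it is a model of that description, not OpenCL code, and all function names are hypothetical.

```cpp
#include <cstdint>

// Emulates `2047 << 53` evaluated on a 32-bit int with the amount reduced
// to 53 & 31 = 21, then sign-extended to 64 bits.
inline std::uint64_t emulated_bad_mask() {
    const std::uint32_t narrow = 2047u << (53u & 31u);                 // 0xFFE00000
    const std::int32_t as_signed = static_cast<std::int32_t>(narrow);  // sign bit set
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(as_signed));
}

// Intended mask: the literal is widened first, so the shift happens at 64 bits.
inline std::uint64_t intended_mask() { return std::uint64_t{2047} << 53; }

// Right-shift extraction: constant in-range shift, result bounded by the mask.
inline std::uint32_t top11(std::uint64_t hi) {
    return static_cast<std::uint32_t>((hi >> 53) & 0x7FFu);
}
```

`top11` needs no wide literal at all, which is why the right-shift form is harder to get wrong.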

hashcat PR #4522 (mode 36000)

Not affected. Mode 36000 brute-forces the BIP39 passphrase given a known mnemonic. The mnemonic arrives as pre-formed UTF-8; no 11-bit-per-word packing occurs on the GPU.

Implementation Guidelines

- Guard entropy writes against checksum positions. For 12-word: if (bit_pos < 128). Generalized: if (bit_pos < entropy_bit_count). One line prevents the entire bug class.
- Prefer right-shift extraction over left-shift accumulation. (hi >> offset) & 0x7FF is bounded by construction. A left-shift loop over more bits than the register width requires explicit guards.
- Validate against BIP39 spec test vectors before any real search. The "abandon × 11 about" vector exercises exactly this boundary. If it fails, your checksum filter is wrong.
- compute-sanitizer will not catch this. The negative shift returns a valid uint64_t. Sanitizers only surface memory errors, not arithmetic producing wrong values. Use known test vectors and assert-based validation during development.
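
One way to apply the assert-based validation mentioned above is to route every variable shift through a small checked helper. The two functions below are a sketch with hypothetical names. They are written as host C++, and the same pattern should carry over to device code, where `assert` is also available.

```cpp
#include <cassert>
#include <cstdint>

// Returns false instead of shifting when `shift` is outside 0..63.
inline bool checked_shl64(std::uint64_t x, int shift, std::uint64_t& out) {
    if (shift < 0 || shift > 63) return false;
    out = x << shift;
    return true;
}

// Debug-build variant: trips an assert on an out-of-range shift.
inline std::uint64_t asserted_shl64(std::uint64_t x, int shift) {
    assert(shift >= 0 && shift <= 63 && "shift amount out of range");
    return x << shift;
}
```

In the unguarded loop, the shift of -1 at bit_pos 128 would have tripped the assert on the first candidate instead of silently producing zero results.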

About Our Work

At Innora, we build GPU-accelerated key-analysis tooling for authorized forensic and security testing on behalf of protocol teams and asset recovery firms. Validating against spec test vectors — not just "compiles and runs" — is a prerequisite for any real engagement.

If your protocol involves wallet derivation, PBKDF2 seed generation, or ECDSA key management at scale, we offer a 24h focused security review covering derivation path correctness, nonce reuse, and GPU implementation bugs.

Code at innora.ai/security | @Innora_sg

Feng Ning (风宁)
