CUDA BIP39 Kernel Bug: When Negative Shifts Silently Corrupt Your Entropy
A single missing else if in a CUDA BIP39 kernel silently corrupted entropy packing due to CUDA's wrap-around negative-shift semantics. The bug surfaced when the standard "abandon × 11 about" test vector returned zero results on an H100 — despite the kernel compiling and running without errors.
Here is the bug, the CUDA semantics that make it invisible, and the fix.
Background: BIP39 Entropy Structure
A 12-word BIP39 mnemonic encodes 132 bits: 128 bits of entropy plus a 4-bit checksum. The checksum is the first nibble of SHA256 of the entropy bytes.
  Word 1    Word 2     ...    Word 12
[11 bits] [11 bits]    ...   [11 bits]    = 132 total bits
└──────────────── 128 bits ────────────────┘└─ 4 bits ─┘
                  entropy                     checksum
Each word index (0–2047) contributes 11 bits, packed MSB-first (big-endian, per BIP39 spec). For 12 words:
| Words | Global bit positions | Content |
|-------|---------------------|---------|
| 1–11 | 0–120 | Pure entropy |
| 12 (MSBs: word-index bits 10–4) | 121–127 | 7 bits entropy |
| 12 (LSBs: word-index bits 3–0) | 128–131 | 4 bits checksum |
In a correct implementation, checksum bits (128–131) are excluded from the 16-byte entropy buffer and verified separately. The bug prevents this separation.
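The split described above can be sketched on the host in plain C++. This is a minimal illustration, not the kernel's code: SplitMnemonic and split_12_words are hypothetical names, and the sketch assumes a 12-word mnemonic whose word indices are already known.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper (not from the original kernel): holds the two parts
// of a 12-word mnemonic after separating entropy from checksum.
struct SplitMnemonic {
    std::array<std::uint8_t, 16> entropy{};  // global bits 0..127
    std::uint8_t checksum = 0;               // global bits 128..131
};

inline SplitMnemonic split_12_words(const std::array<std::uint16_t, 12>& idx) {
    SplitMnemonic out;
    for (int w = 0; w < 12; ++w) {
        for (int b = 0; b < 11; ++b) {
            const int pos = w * 11 + b;                       // global bit position
            const unsigned bit = (idx[w] >> (10 - b)) & 1u;   // MSB-first
            if (pos < 128) {
                // Entropy bit: byte pos/8, MSB of each byte first.
                out.entropy[pos / 8] |= static_cast<std::uint8_t>(bit << (7 - pos % 8));
            } else {
                // Checksum bit: kept out of the entropy buffer.
                out.checksum |= static_cast<std::uint8_t>(bit << (131 - pos));
            }
        }
    }
    return out;
}
```

For the "abandon × 11 about" vector (eleven indices of 0 followed by 3), this yields sixteen zero entropy bytes and a checksum nibble of 3.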
The Bug
Our kernel packed the bit sequence into two uint64_t registers, MSB-first. Together they hold the 128 entropy bits and leave no room for the 4 checksum bits:
uint64_t high = 0; // global bits 0..63 (MSB at bit 63)
uint64_t low = 0; // global bits 64..127 (MSB at bit 63, LSB at bit 0)
The packing loop:
// BUGGY version
for (int b = 0; b < WORD_BITS; b++) {
    int bit_pos = bit_start + b;
    int bit_val = (idx >> (WORD_BITS - 1 - b)) & 1;
    if (bit_pos < 64) {
        high |= ((uint64_t)bit_val) << (63 - bit_pos);
    } else {
        // bit_pos 64–127: shift = 63 - (bit_pos-64), range [63..0] → OK
        // bit_pos 128–131: shift = 63 - (64..67) = -1..-4 → NEGATIVE
        low |= ((uint64_t)bit_val) << (63 - (bit_pos - 64));
    }
}
For bit_pos = 128: shift = 63 - (128 - 64) = -1.
For bit_pos = 131: shift = 63 - (131 - 64) = -4.
In ISO C/C++, left-shifting by a negative amount is undefined behavior (C11 §6.5.7), and CUDA C++ inherits that rule. In our kernel the shift did not trap or return zero: the shift amount was effectively reduced modulo the operand width, so the result was a non-zero value in the wrong place.
CUDA PTX: Negative Shifts Are Defined (and Wrong)
The behavior we observed matches a shl.b64 whose shift amount is masked modulo the operand bit width, with the signed amount reinterpreted as an unsigned value:
shl.b64 result, src, amount
effective amount = amount & 63 // amount reinterpreted as unsigned
(uint64_t)x << -1 → (0xFFFF...FFFF) & 63 = 63 → x << 63
(uint64_t)x << -4 → (0xFFFF...FFFC) & 63 = 60 → x << 60
No crash. No warning. The instruction executes and returns a plausible-looking non-zero value. We saw this on an H100 (sm_90) with CUDA 12.4. Because the source-level shift is undefined behavior, check the shl entry in the PTX ISA reference and the code your compiler actually emits before assuming the same result on another toolchain or architecture.
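The masking model above can be reproduced on the host without invoking undefined behavior. The sketch below is an emulation only: shl_wrapped is a hypothetical helper that applies the mask explicitly, and it says nothing about what a particular compiler emits.

```cpp
#include <cstdint>

// Emulates the wrap-around described above: the signed amount is
// reinterpreted as unsigned and reduced modulo 64 before shifting.
// Illustration only; this is not the compiler's code generation.
inline std::uint64_t shl_wrapped(std::uint64_t x, int amount) {
    const unsigned effective = static_cast<unsigned>(amount) & 63u;
    return x << effective;  // effective is always in [0, 63]
}
```

Under this model, shl_wrapped(1, -1) sets bit 63 and shl_wrapped(1, -4) sets bit 60, which are the positions used in the rest of the analysis.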
Detection limitation: You cannot use compute-sanitizer (formerly cuda-memcheck) to catch this. compute-sanitizer detects memory access errors and race conditions, not arithmetic operations that produce wrong-but-valid values. The negative shift returns a legal uint64_t, so a sanitizer has no way to know it is incorrect.
How Checksum Bits Overwrite Entropy
The critical issue: the negative shift wraps around and overwrites bits within the valid entropy region — specifically entropy[8] bits 4–7 — rather than causing an out-of-bounds memory access.
For bit_pos = 128 (checksum bit 0), shift = 63:
- Sets bit 63 of low
- Bit 63 of low = global entropy bit 64 → entropy[8] bit 7
For bit_pos = 129 (checksum bit 1), shift = 62:
- Sets bit 62 of low = global entropy bit 65 → entropy[8] bit 6
For bit_pos = 130 (checksum bit 2), shift = 61:
- Sets bit 61 of low = global entropy bit 66 → entropy[8] bit 5
For bit_pos = 131 (checksum bit 3), shift = 60:
- Sets bit 60 of low = global entropy bit 67 → entropy[8] bit 4
The checksum bits do not cause a buffer overflow. They land in valid storage, specifically entropy[8] bits 4–7, and silently corrupt the entropy. SHA256 is then computed over the corrupted bytes, and its first nibble does not match the checksum nibble the kernel reads from low >> 60.
Worked example: "abandon × 11 about"
"about" = word index 3 = binary 00000000011. Only bits at positions 130 and 131 have bit_val = 1.
| bit_pos | Shift formula | PTX effective | low bit set | Entropy byte corrupted |
|---------|--------------|---------------|---------------|----------------------|
| 130 | 63-(130-64)=-3 | -3 & 63 = 61 | bit 61 | entropy[8] bit 5 → +0x20 |
| 131 | 63-(131-64)=-4 | -4 & 63 = 60 | bit 60 | entropy[8] bit 4 → +0x10 |
entropy[8] = (low >> 56) & 0xFF;
// low = 0x3000000000000000 (bits 61,60 set)
// (0x3000000000000000 >> 56) = 0x30
// entropy[8] = 0x30 (correct: 0x00)
SHA256 of the corrupted 16-byte entropy produces a wrong checksum nibble. The checksum filter rejects the candidate. Zero results.
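The worked example can be replayed on the host. The following is a sketch under stated assumptions: pack_buggy, Packed, shl_mod64 and entropy_byte_8 are hypothetical names, the outer loop over twelve words is assumed, and the wrap-around is emulated with an explicit mask because the native C++ shift would be undefined.

```cpp
#include <array>
#include <cstdint>

constexpr int WORD_BITS = 11;

// Emulates the wrap-around shift described in the article (amount & 63).
inline std::uint64_t shl_mod64(std::uint64_t x, int amount) {
    return x << (static_cast<unsigned>(amount) & 63u);
}

struct Packed {
    std::uint64_t high = 0;  // global bits 0..63
    std::uint64_t low = 0;   // global bits 64..127
};

// Host-side replay of the buggy packing loop for twelve word indices.
inline Packed pack_buggy(const std::array<std::uint16_t, 12>& words) {
    Packed p;
    for (int w = 0; w < 12; ++w) {
        const int bit_start = w * WORD_BITS;
        const unsigned idx = words[w];
        for (int b = 0; b < WORD_BITS; ++b) {
            const int bit_pos = bit_start + b;
            const std::uint64_t bit_val = (idx >> (WORD_BITS - 1 - b)) & 1u;
            if (bit_pos < 64) {
                p.high |= bit_val << (63 - bit_pos);
            } else {
                // No upper bound: bit_pos 128..131 gives shifts -1..-4,
                // which the emulation wraps to 63..60.
                p.low |= shl_mod64(bit_val, 63 - (bit_pos - 64));
            }
        }
    }
    return p;
}

inline std::uint8_t entropy_byte_8(const Packed& p) {
    return static_cast<std::uint8_t>((p.low >> 56) & 0xFF);
}
```

Feeding eleven zeros followed by 3 ("about") leaves high at 0, sets low to 0x3000000000000000, and makes entropy_byte_8 return 0x30 instead of 0x00.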
The Fix
One else if:
- } else {
+ } else if (bit_pos < 128) {
Full fixed code:
if (bit_pos < 64) {
    high |= ((uint64_t)bit_val) << (63 - bit_pos);
} else if (bit_pos < 128) { // entropy bits only
    low |= ((uint64_t)bit_val) << (63 - (bit_pos - 64));
}
// bit_pos 128–131: checksum bits, never written into high or low.
// Verify them after the loop: (last word's idx & 0x0F) vs SHA256(entropy)[0] >> 4
With this guard, bits 60–63 of low hold only entropy bits 64–67 after the packing loop (all zero for the "abandon × 11 about" vector). The checksum bits from the last word's index reached low >> 60 only because the unguarded loop wrapped them there. With the fix in place they have to be read from the word index itself.
Committed as f698621, verified on H100 sm_90 (CUDA 12.4).
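A host-side sketch of the guarded loop is shown below. It is not the committed kernel code: pack_fixed, PackedFixed and checksum_matches are hypothetical names, the checksum nibble is assumed to come from the last word's index, and the first SHA256 byte is passed in by the caller rather than computed here.

```cpp
#include <array>
#include <cstdint>

constexpr int WORD_BITS = 11;
constexpr int ENTROPY_BITS = 128;  // 12-word mnemonic

struct PackedFixed {
    std::uint64_t high = 0;     // global bits 0..63
    std::uint64_t low = 0;      // global bits 64..127
    std::uint8_t checksum = 0;  // global bits 128..131
};

inline PackedFixed pack_fixed(const std::array<std::uint16_t, 12>& words) {
    PackedFixed p;
    for (int w = 0; w < 12; ++w) {
        const unsigned idx = words[w];
        for (int b = 0; b < WORD_BITS; ++b) {
            const int bit_pos = w * WORD_BITS + b;
            const std::uint64_t bit_val = (idx >> (WORD_BITS - 1 - b)) & 1u;
            if (bit_pos < 64) {
                p.high |= bit_val << (63 - bit_pos);
            } else if (bit_pos < ENTROPY_BITS) {  // the guard from the fix
                p.low |= bit_val << (63 - (bit_pos - 64));
            }
            // bit_pos 128..131 is skipped: every shift above stays in [0, 63].
        }
    }
    // Checksum nibble comes straight from the last word's low 4 bits.
    p.checksum = static_cast<std::uint8_t>(words[11] & 0x0F);
    return p;
}

// sha256_first_byte: first byte of SHA256 over the 16 entropy bytes,
// computed elsewhere (assumption: the caller has a SHA256 routine).
inline bool checksum_matches(const PackedFixed& p, std::uint8_t sha256_first_byte) {
    return p.checksum == (sha256_first_byte >> 4);
}
```

Keeping the checksum in its own field means the entropy registers never see bits 128–131, so the comparison cannot be affected by whatever the shift hardware does with out-of-range amounts.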
Affected Search Space
The bug triggers only when checksum bits in the last word's index are non-zero, i.e., when any of the word's 4 lowest bits are 1.
- Words with index & 0xF == 0: 128 words (unaffected)
- Words with index & 0xF != 0: 1920 words (93.75% of vocabulary), silently rejected
If you are searching for a missing word in the last position, roughly 93.75% of your search space is silently discarded. The kernel runs to completion and reports zero results.
For other word counts:
| Words | Entropy bits | Checksum bits | Affected vocab % |
|-------|-------------|---------------|-----------------|
| 12 | 128 | 4 | 93.75% (1/16 safe) |
| 15 | 160 | 5 | 96.88% |
| 18 | 192 | 6 | 98.44% |
| 21 | 224 | 7 | 99.22% |
| 24 | 256 | 8 | 99.61% |
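The percentages in the table follow from the checksum length: a last word is unaffected only when all of its checksum bits are zero, which is 1 in 2^CS words. A small sketch of that arithmetic, with hypothetical function names, assuming the standard BIP39 word counts (12, 15, 18, 21, 24):

```cpp
// BIP39 uses one checksum bit per 32 entropy bits, which works out to
// one checksum bit per three words.
constexpr int checksum_bits(int words) { return words / 3; }

constexpr int entropy_bits(int words) { return words * 11 - checksum_bits(words); }

// Share of the 2048-word vocabulary whose checksum bits are not all zero.
constexpr double affected_percent(int words) {
    return 100.0 * (1.0 - 1.0 / static_cast<double>(1u << checksum_bits(words)));
}
```

For 12 words this gives 4 checksum bits, 128 entropy bits, and an affected share of 93.75%.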
Test Vector Verification
Before fix:
./bip39_crack --target 9858EfFD232B4033E47d90003D41EC34EcaEda94 --missing-pos 11
# Searched 2048 candidates: 0 found ← WRONG
After fix:
./bip39_crack --target 9858EfFD232B4033E47d90003D41EC34EcaEda94 --missing-pos 11
# ID: 0 → abandon abandon abandon ... abandon about ← FOUND ✓
Standard BIP39 test vectors (from the BIP39 spec):
# Test 1: zero entropy — exercises the checksum-bit boundary (this exact bug)
mnemonic: abandon × 11 about
entropy: 00000000000000000000000000000000
ETH (m/44'/60'/0'/0/0): 0x9858EfFD232B4033E47d90003D41EC34EcaEda94
# Test 2: all-ones entropy — catches off-by-one in checksum extraction
mnemonic: zoo zoo zoo zoo zoo zoo zoo zoo zoo zoo zoo wrong
entropy: ffffffffffffffffffffffffffffffff
# Test 3: mixed — catches byte-order (endianness) bugs
mnemonic: legal winner thank year wave sausage worth useful legal winner thank yellow
entropy: 7f7f7f7f7f7f7f7f7f7f7f7f7f7f7f7f
The Same Pattern in Other Tools
johncantrell97/bip39-solver-gpu — just_seed.cl
// just_seed.cl line 33 — int32 overflow
indices[0] = (mnemonic_hi & (2047 << 53)) >> 53;
2047 is a 32-bit int. 2047 << 53 in OpenCL: shift amount masked to 5 bits → 53 % 32 = 21. Result: 2047 << 21 = 0xFFE00000 overflows int32, becomes negative. Sign-extended to ulong: mask becomes 0xFFFFFFFFFFE00000 — wrong.
The working file int_to_address.cl avoids the issue entirely:
indices[0] = (mnemonic_hi >> 53) & 2047; // right-shift: always positive, naturally bounded
Right-shift with mask extraction is safer than left-shift accumulation for bit extraction. The former is bounded by the operand; the latter requires explicit range guards.
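The mask arithmetic described above can be checked on the host. This is an emulation in C++, not OpenCL: the 5-bit masking of the shift amount and the sign extension are written out explicitly, and the function names are hypothetical.

```cpp
#include <cstdint>

// Emulates `2047 << 53` with a 32-bit int left operand as described above:
// the shift amount is reduced to its low 5 bits (53 & 31 = 21), and the
// resulting int is sign-extended when widened to 64 bits.
inline std::uint64_t mask_from_int_literal() {
    const std::uint32_t shifted = 2047u << (53u & 31u);  // 0xFFE00000
    // Reinterpret as signed 32-bit (two's complement wrap on mainstream compilers).
    const std::int32_t as_int = static_cast<std::int32_t>(shifted);
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(as_int));
}

// The mask the expression was presumably meant to build, using a 64-bit operand.
inline std::uint64_t mask_from_u64_literal() {
    return std::uint64_t{2047} << 53;
}

// Right-shift extraction: no mask constant wider than 11 bits is needed.
inline unsigned first_index_right_shift(std::uint64_t mnemonic_hi) {
    return static_cast<unsigned>((mnemonic_hi >> 53) & 2047u);
}
```

The emulated int-literal mask comes out as 0xFFFFFFFFFFE00000, while the 64-bit version is 0xFFE0000000000000. The right-shift form sidesteps the question because its only constant is 2047.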
hashcat PR #4522 (mode 36000)
Not affected. Mode 36000 brute-forces the BIP39 passphrase given a known mnemonic. The mnemonic arrives as pre-formed UTF-8; no 11-bit-per-word packing occurs on the GPU.
Implementation Guidelines
- Guard entropy writes against checksum positions. For 12-word: if (bit_pos < 128). Generalized: if (bit_pos < entropy_bit_count). One line prevents the entire bug class.
- Prefer right-shift extraction over left-shift accumulation. (hi >> offset) & 0x7FF is bounded by construction. A left-shift loop over more bits than the register width requires explicit guards.
- Validate against BIP39 spec test vectors before any real search. The "abandon × 11 about" vector exercises exactly this boundary. If it fails, your checksum filter is wrong.
- compute-sanitizer will not catch this. The negative shift returns a valid uint64_t. Sanitizers only surface memory errors, not arithmetic producing wrong values. Use known test vectors and assert-based validation during development.
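The second guideline can be sketched as a host-side extractor that reads word indices back out of the packed form using right shifts only. The names bit_at and word_index are hypothetical, and the layout (hi for bits 0–63, lo for bits 64–127, a separate 4-bit checksum) is the one assumed throughout this article for 12 words.

```cpp
#include <cstdint>

// Returns the bit at global position pos (0..131), MSB-first.
// Every shift amount below is within [0, 63] by construction.
inline unsigned bit_at(std::uint64_t hi, std::uint64_t lo, unsigned checksum4, int pos) {
    if (pos < 64) return static_cast<unsigned>((hi >> (63 - pos)) & 1u);
    if (pos < 128) return static_cast<unsigned>((lo >> (127 - pos)) & 1u);
    return (checksum4 >> (131 - pos)) & 1u;  // pos 128..131
}

// Rebuilds the 11-bit index of word `word` (0..11).
inline unsigned word_index(std::uint64_t hi, std::uint64_t lo, unsigned checksum4, int word) {
    unsigned idx = 0;
    for (int b = 0; b < 11; ++b) {
        idx = (idx << 1) | bit_at(hi, lo, checksum4, word * 11 + b);
    }
    return idx;
}
```

With zero entropy and a checksum of 3, word 11 comes back as index 3 ("about"), which makes this a convenient round-trip check against the packing loop during development.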
About Our Work
At Innora, we build GPU-accelerated key-analysis tooling for authorized forensic and security testing on behalf of protocol teams and asset recovery firms. Validating against spec test vectors — not just "compiles and runs" — is a prerequisite for any real engagement.
If your protocol involves wallet derivation, PBKDF2 seed generation, or ECDSA key management at scale, we offer a 24h focused security review covering derivation path correctness, nonce reuse, and GPU implementation bugs.
Code at innora.ai/security | @Innora_sg
