Bringing Metal to a crypto backdoor fight! Exploiting the GPU and the 90s crypto wars to crack the APT Down code signing keys

The APT Down leak contained four code signing certificates and the passphrase only for the most recent one. Since the passphrase was found in the usual rockyou.txt wordlist, I was curious to see if the remaining three could be cracked using the same wordlist.

I started this project by writing a small utility to decrypt the PVK key, as it could be easily tested with the known passphrase. The code appeared correct, but it wasn’t working. Then, I had to context switch into some consulting work, and this project went into the TODO bin.

Last Friday, I had some free time and decided to have another go at this so it wouldn’t get lost forever. Once again, the code appeared to be okay, even if it was using CoreCrypto instead of OpenSSL (older OpenSSL versions have support for PVK).

There is some information about the PVK format available online. This page from ArchiveTeam has most of the information needed to parse the file. The first time I read it, I noticed the caveat about the RC4 encryption key:

There are two possible ways that the password is used to make the RC4 key. They both concatenate the salt bytes with the ASCII encoded password and calculate the SHA1 hash. The first method uses the SHA1 hash as the RC4 key, the second method uses only the first 5 bytes of the SHA1 hash followed by 11 zero bytes. This second method (using only 40 bits of the SHA1 hash) is an historic limitation to comply with the US export restrictions on strong encryption in the 1990s.

Ah, the famous weak crypto export wars of the 90s. The problem is that for some reason my brain went into denial and assumed the keys wouldn’t be weakly encrypted. In reality, they were using the weak 40-bit key when I decided to test it.
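The weak derivation described in the quote boils down to a truncation. A minimal sketch of that step, assuming the 20-byte SHA1(salt || password) digest has already been computed elsewhere; the function name pvk_weak_key is hypothetical and not taken from the actual tool:

```c
#include <stdint.h>
#include <string.h>

#define SHA1_LEN    20
#define RC4_KEY_LEN 16

/* Builds the export-grade RC4 key: the first 5 bytes of the digest
 * followed by 11 zero bytes. `digest` is assumed to already hold
 * SHA1(salt || password). */
void pvk_weak_key(const uint8_t digest[SHA1_LEN], uint8_t key[RC4_KEY_LEN])
{
    memset(key, 0, RC4_KEY_LEN);   /* 11 trailing zero bytes (and the rest) */
    memcpy(key, digest, 5);        /* only 40 bits of the hash survive */
}
```

Only those 5 bytes are unknown to an attacker, which is why the search space is 2^40 no matter how strong the passphrase is.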

The famous Russian proverb is “Trust, but verify”, but I’ll get another monitor sticker saying “Don’t assume, verify”.

Anyway, after fixing and testing the code, I decided to go the bruteforce way since 40-bit should be a cheap lunch for today’s CPUs. I built a GCD and a pthreads-based util for this purpose.
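For a CPU bruteforcer, the 40-bit space has to be divided among the workers. A minimal sketch of one way to do it, assuming each thread gets a contiguous slice; worker_range and the slicing scheme are hypothetical and not necessarily what the actual util does:

```c
#include <stdint.h>

#define KEYSPACE_40 (1ULL << 40)

/* Gives worker `tid` (0-based) of `nthreads` a contiguous slice
 * [*start, *end) of the 40-bit keyspace. The last worker absorbs
 * any remainder when the split is not exact. */
void worker_range(unsigned tid, unsigned nthreads, uint64_t *start, uint64_t *end)
{
    uint64_t chunk = KEYSPACE_40 / nthreads;
    *start = chunk * tid;
    *end = (tid == nthreads - 1) ? KEYSPACE_40 : *start + chunk;
}
```

With 32 threads, each worker would cover 2^35 candidate keys.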

The following structure defines the PVK header:

struct pvk_header {
    uint32_t magic;     // always 0xb0b5f11e
    uint32_t reserved;
    uint32_t key_type;
    uint32_t encrypted; // 1 - encrypted, 0 otherwise
    uint32_t salt_len;  // usually 16 bytes, 0 if not encrypted
    uint32_t key_len;
    // followed by the salt data
};
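A minimal sketch of how this header could be read from a file buffer, assuming the fields are stored little-endian and that key_len counts the bytes that follow the salt. The helper names read_le32 and pvk_parse_header are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

#define PVK_MAGIC      0xb0b5f11eu
#define PVK_HEADER_LEN 24

struct pvk_header {
    uint32_t magic, reserved, key_type, encrypted, salt_len, key_len;
};

/* Reads a little-endian 32-bit value regardless of host byte order. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 if the buffer is too short, the magic is
 * wrong, or the declared salt and key data do not fit in the buffer. */
int pvk_parse_header(const uint8_t *buf, size_t len, struct pvk_header *out)
{
    if (len < PVK_HEADER_LEN)
        return -1;
    out->magic     = read_le32(buf);
    out->reserved  = read_le32(buf + 4);
    out->key_type  = read_le32(buf + 8);
    out->encrypted = read_le32(buf + 12);
    out->salt_len  = read_le32(buf + 16);
    out->key_len   = read_le32(buf + 20);
    if (out->magic != PVK_MAGIC)
        return -1;
    /* Assumption: key_len covers everything that follows the salt. */
    if ((uint64_t)PVK_HEADER_LEN + out->salt_len + out->key_len > len)
        return -1;
    return 0;
}
```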

Since we are dealing with encrypted PVK files, there is a salt, and it’s 16 bytes for all leaked files. Another header follows the salt data:

struct blob_header {
    uint8_t  type;      // PUBLICKEYBLOB, PRIVATEKEYBLOB, SIMPLEBLOB
    uint8_t  version;   // Version number of the key blob format. This currently must always have a value of "0x02".
    uint16_t reserved;
    uint32_t key_alg;   // Algorithm identifier for the key contained by the key blob. 
                        // Some examples are CALG_RSA_SIGN, CALG_RSA_KEYX, CALG_RC2, and CALG_RC4.
                        // https://learn.microsoft.com/en-us/windows/win32/seccrypto/alg-id
};

Everything that follows the blob_header is encrypted. According to Microsoft documentation, we are dealing with an encrypted RSA key. The decrypted data should start with the following structure:

struct rsa_header {
    uint32_t magic;     // This should be set to "RSA1" (0x31415352) for public keys and to "RSA2" (0x32415352) for private keys.
    uint32_t bitlen;    // Number of bits in the modulus. In practice, this must always be a multiple of eight.
    uint32_t pubexp;    // The public exponent.
};

The bitlen field will most probably be 0x800 (2048 bits) or 0x400 (1024 bits). The smaller PVK files are 636 bytes long, and the others are 1212 bytes, a strong signal that we are dealing with 1024-bit and 2048-bit RSA keys.
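The file sizes can be checked with a bit of arithmetic. The sketch below assumes the Microsoft RSA private key blob layout after the rsa_header: a modulus and a private exponent of bitlen/8 bytes each, plus five CRT values of bitlen/16 bytes each. The function name is hypothetical:

```c
#include <stdint.h>

/* Expected size in bytes of an encrypted PVK file holding an RSA
 * private key, given the modulus size and the salt length. */
uint32_t pvk_expected_size(uint32_t bitlen, uint32_t salt_len)
{
    uint32_t pvk_hdr  = 24;  /* struct pvk_header  */
    uint32_t blob_hdr = 8;   /* struct blob_header */
    uint32_t rsa_hdr  = 12;  /* struct rsa_header  */
    /* modulus + private exponent, then prime1, prime2, exponent1,
     * exponent2 and coefficient (assumed blob layout) */
    uint32_t body = 2 * (bitlen / 8) + 5 * (bitlen / 16);
    return pvk_hdr + salt_len + blob_hdr + rsa_hdr + body;
}
```

With a 16-byte salt this gives 636 bytes for a 1024-bit key and 1212 bytes for a 2048-bit key, matching the leaked files.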

Initially, I tried to use just the rsa_header structure’s magic value to validate the bruteforce tests. However, this produces too many candidates. Using the bitlen eliminates this problem, and since we have a good guess about its correct value for each key, we just need to try to decrypt the 8 initial bytes to make everything faster.
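A minimal sketch of such a candidate test, assuming plain RC4 and little-endian fields. The names rc4_crypt and pvk_try_key are hypothetical; this is not the code from the actual bruteforcer:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RSA2_MAGIC 0x32415352u  /* "RSA2" */

/* Standard RC4: key scheduling (KSA) followed by in-place XOR of
 * `n` bytes of `data` with the generated keystream (PRGA). */
void rc4_crypt(const uint8_t *key, size_t key_len, uint8_t *data, size_t n)
{
    uint8_t S[256], t;
    unsigned i, j = 0;

    for (i = 0; i < 256; i++)
        S[i] = (uint8_t)i;
    for (i = 0; i < 256; i++) {
        j = (j + S[i] + key[i % key_len]) & 0xff;
        t = S[i]; S[i] = S[j]; S[j] = t;
    }
    i = j = 0;
    for (size_t k = 0; k < n; k++) {
        i = (i + 1) & 0xff;
        j = (j + S[i]) & 0xff;
        t = S[i]; S[i] = S[j]; S[j] = t;
        data[k] ^= S[(S[i] + S[j]) & 0xff];
    }
}

/* Decrypts only the first 8 encrypted bytes with the candidate key and
 * returns 1 if they hold the RSA2 magic and the expected bitlen. */
int pvk_try_key(const uint8_t key[16], const uint8_t enc[8], uint32_t want_bitlen)
{
    uint8_t out[8];
    memcpy(out, enc, sizeof out);
    rc4_crypt(key, 16, out, sizeof out);

    uint32_t magic  = (uint32_t)out[0] | ((uint32_t)out[1] << 8) |
                      ((uint32_t)out[2] << 16) | ((uint32_t)out[3] << 24);
    uint32_t bitlen = (uint32_t)out[4] | ((uint32_t)out[5] << 8) |
                      ((uint32_t)out[6] << 16) | ((uint32_t)out[7] << 24);
    return magic == RSA2_MAGIC && bitlen == want_bitlen;
}
```

Checking 64 bits instead of 32 makes a false positive across the 2^40 candidates very unlikely, while the per-key cost stays dominated by the 256-step key schedule.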

The Mac Mini M4 is fast but doesn’t have enough threads for this CPU bruteforce task, so I used a Ryzen 3950X with 16 cores and 32 threads instead. Five hours later, I finally got the key for the 2005 PVK file. Initially, I built a new decrypted PVK file by hand and used an older OpenSSL version (1.0.2g works fine) to convert the PVK to PEM format, and voilà, the private key was finally unlocked.

Five hours isn’t a bad result, but there were still two keys left to crack, and I didn’t have much patience to wait for the results. It’s the perfect task for GPUs!

A brief look into hashcat reveals that there’s no support for this encryption type, other than for PDFs and Office documents that used the same “encryption” in the past. What could be more fun than writing some hashcat support for this? Writing something ourselves that uses the GPU to bruteforce the 40-bit space. Since the only modern GPUs I have are Apple Silicon based, it was an opportunity to play with Metal.

GPUs are definitely something I don’t deal with, so I went lazy and asked an LLM to write a Metal-based 40-bit RC4 bruteforcer. As often happens, the initial LLM outputs didn’t work, but after fixing the errors it appeared to do something and bruteforce some test keys much faster than the CPU version.

The LLM magic is often superficial, and when you start going deeper it cracks easily. For some reason, the same code wasn’t able to find some test keys. Time to debug the problem instead of playing the stochastic parrot lottery.

In this case, Apple’s documentation is (surprisingly!) quite useful, and the LLM code is pretty much a documentation carbon copy, minus the errors and RC4 algorithm.

It turns out the code wasn’t increasing the key offset, so it was bruteforcing the same initial group of keys over and over. So much for “thinking” and “intelligence” LLM hype!
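The bug is easy to picture with a plain C simulation of the host-side dispatch loop. This is a sketch, not the actual Metal code: BATCH_SIZE, kernel_thread and the comparison against a known secret are hypothetical stand-ins for the dispatch size, the compute kernel, and the RC4 check.

```c
#include <stdint.h>

#define BATCH_SIZE 1024u  /* hypothetical number of GPU threads per dispatch */

/* Stand-in for the compute kernel: thread `tid` tests key `base + tid`. */
static int kernel_thread(uint64_t base, uint32_t tid, uint64_t secret)
{
    return base + tid == secret;
}

/* Walks the keyspace one batch at a time and returns the key that was
 * found, or UINT64_MAX if the keyspace is exhausted. */
uint64_t bruteforce(uint64_t keyspace, uint64_t secret)
{
    /* The fix: `base` must advance after every dispatch. Without the
     * increment, every batch retests keys 0..BATCH_SIZE-1. */
    for (uint64_t base = 0; base < keyspace; base += BATCH_SIZE) {
        for (uint32_t tid = 0; tid < BATCH_SIZE; tid++) {
            if (base + tid < keyspace && kernel_thread(base, tid, secret))
                return base + tid;
        }
    }
    return UINT64_MAX;
}
```

This also explains why some test keys were found and others were not: any test key that falls inside the first batch is found even with the bug present.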

After fixing and testing the code, it was time to unleash it on the same key cracked by the CPU version. Don’t assume, verify!

This time, it took 102 minutes to find the key, compared to the 300 minutes for the CPU version. Approximately 476 billion keys were tested at an approximate rate of 77 MH/s. The 2004 paper The Effectiveness of Brute Force Attacks on RC4 extrapolates a 4 MH/s rate on a 500-unit FPGA running at 10 MHz. A 1995 40-bit RC4 challenge by Hal Finney had rates between 1.5 MH/s and 4.5 MH/s. In 1997, a 28 MH/s rate was achieved using what was a top-200 supercomputer at the time. Details can be found here.
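The relationship between rate and search time is simple division. A small sketch; the function name is hypothetical:

```c
#include <stdint.h>

/* Seconds needed to test `keys` candidates at `rate_mhs` million keys
 * per second. */
double search_seconds(uint64_t keys, double rate_mhs)
{
    return (double)keys / (rate_mhs * 1e6);
}
```

At 77 MH/s, exhausting the whole 2^40 space would take roughly 14,279 seconds, or just under four hours. Assuming the search starts at zero and the key index is the 0x6EE6729D0D value shown in the bruteforcer output further down, the same formula gives about 103 minutes, in line with the measured time.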

The 1024-bit keys took approximately one to two hours to complete. A significant amount of time was wasted searching keyspace that was far from the actual key in all cases. One of the referenced attempts in 1995 has the following note:

keyspace more than 99.6% (key was #7EF0 and #8000-#FFFF was searched first!).

It might be a good idea to implement a strategy to split the search space and alternate between searching above and below the midpoint. However, that’s all hindsight, given that all keys were located above the midpoint in this case.
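One way to sketch that alternating strategy is to cut the keyspace into an even number of chunks and visit them from the midpoint outward. This is a hypothetical illustration of the idea and was not implemented in the bruteforcer:

```c
#include <stdint.h>

/* Maps step number `step` (0, 1, 2, ...) to a chunk index that starts
 * at the midpoint and alternates outward:
 *   mid, mid-1, mid+1, mid-2, mid+2, ...
 * `nchunks` must be even and non-zero, and `step` must be below it. */
uint64_t alternating_chunk(uint64_t step, uint64_t nchunks)
{
    uint64_t mid  = nchunks / 2;
    uint64_t dist = (step + 1) / 2;   /* distance from the midpoint */
    return (step % 2 == 0) ? mid + dist : mid - dist;
}
```

With 8 chunks the visiting order is 4, 3, 5, 2, 6, 1, 7, 0. Every chunk is still covered exactly once, so the worst case is unchanged; only the order differs.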

Since this was a fun exercise, it provided an opportunity to learn a bit more about Metal for compute and try to improve the performance. I implemented a few small changes and managed to clock around 122 MH/s. Not a bad improvement :-).

As a reference point (which I only checked after I improved performance), the latest hashcat v7.1.2 benchmark does between 100 and 173.1 MH/s for mode RC4 40-bit DropN (I don’t quite understand the large variance in results; it rarely hits 170 MH/s):

% ./hashcat -m 33500 --benchmark
hashcat (v7.1.2) starting in benchmark mode
(...)
METAL API (Metal 368.52)
========================
* Device #01: Apple M4, skipped

OpenCL API (OpenCL 1.2 (Jul 11 2025 19:18:49)) - Platform #1 [Apple]
====================================================================
* Device #02: Apple M4, GPU, 5461/10922 MB (1024 MB allocatable), 10MCU

Benchmark relevant options:
===========================
* --backend-devices-virtmulti=1
* --backend-devices-virthost=1
* --optimized-kernel-enable

------------------------------------
* Hash-Mode 33500 (RC4 40-bit DropN)
------------------------------------

Speed.#02........:   173.1 MH/s (46.49ms) @ Accel:1024 Loops:1024 Thr:32 Vec:1

An impactful code change involved the shape of the threadgroup. The initial LLM-generated code used a one-dimensional threadgroup with the maximum number of threads:

NSUInteger threadGroupSize = pipeline.maxTotalThreadsPerThreadgroup;
MTLSize threadgroupSize = MTLSizeMake(threadGroupSize, 1, 1);

However, after reviewing the Metal documentation and related articles, I found that the following shape performs much better, as it aligns better with the hardware capabilities and maximizes occupancy through uniform work distribution across all GPU cores:

NSUInteger w = pipeline.threadExecutionWidth;
NSUInteger h = pipeline.maxTotalThreadsPerThreadgroup / w;
MTLSize threadgroupSize = MTLSizeMake(w, h, 1);

In this case, both w and h values are 32.

I ran this improved version against the 2005 key. At a rate of ~116 MH/s, it took 68 minutes to find the key, a significant improvement over the initial version. Cutting the height in half gave even better results: a threadgroup size of 32x16 was approximately 6 MH/s faster than a 32x32 grid.

The main problem is that RC4 isn’t GPU friendly due to its memory usage. We need almost 300 bytes per GPU thread, along with all the random memory accesses.

We can use Xcode Instruments to profile and attempt to understand the behavior. The L1 cache appears to be frequently invalidated, and this might be one of the reasons why RC4 is so hard on the GPU.

Profiling hashcat generates a different result, although the rates aren’t much better.

I definitely need to further explore this GPU profiling and optimization topic to better understand what is going on. Profiling the RC4 algorithm in segments clearly reveals that the KSA random accesses and writes become the primary bottleneck. I wish I had more material to display on profiling and optimization, but I wasn’t happy with my research results. The behavior appears too inconsistent, documentation is lacking, and Apple’s videos aren’t particularly helpful. It’s an area to explore with more time and patience.

Now we can finally demonstrate all the necessary steps to crack the 2005 private key. First we start by bruteforcing the key:

% ./metal_bruteforce
--------------------------------------------
    40-bit RC4 Metal based bruteforcer
    (c) fG!, 2025, All rights reserved.
  reverser@put.as - https://reverse.put.as
--------------------------------------------

Cracking...
Checked keys up to 0x6EE6000000. Current rate is 115.099 MH/s. 
✅  Valid key found: 6EE6729D0D0000000000000000000000
🔑  Key to use with pvk_decrypt: 0xd9d72e66e
⏱  Key was found in 34.008 seconds.
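The two key values in the output above appear to be the same five bytes in opposite byte order. A minimal sketch of that conversion, assuming pvk_decrypt takes the key as an integer whose least significant byte is the first RC4 key byte; the function name is hypothetical:

```c
#include <stdint.h>

/* Packs the first 5 RC4 key bytes into a 40-bit integer, with the
 * first key byte as the least significant byte. */
uint64_t key_bytes_to_le40(const uint8_t key[5])
{
    uint64_t v = 0;
    for (int i = 4; i >= 0; i--)
        v = (v << 8) | key[i];
    return v;
}
```

For the key bytes 6E E6 72 9D 0D this yields 0xd9d72e66e, the value printed for use with pvk_decrypt.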

Next we can decrypt the PVK file:

$ ./pvk_decrypt -i myprivatekey-2005.pvk -o decrypted.pvk -k 0xd9d72e66e
--------------------------------------------
             40-bit PVK Decryptor
    (c) fG!, 2025, All rights reserved.
  reverser@put.as - https://reverse.put.as
--------------------------------------------

⏳  Executing PVK decryption...
🔑  Encryption key: 6ee6729d0d0000000000000000000000
👍  Decrypted content appears to contain valid RSA private key.
✍️  Writing decrypted PVK file...

✅  All done! Now you can use OpenSSL to convert the keys. Tested with openssl-1.0.2g, newer versions removed PVK support.

And finally extract the RSA private key using an older OpenSSL version:

$ apps/openssl rsa -inform PVK -outform PEM -in decrypted.pvk -out 2005.pem

$ apps/openssl rsa -in 2005.pem -text -inform PEM
Private-Key: (2048 bit)
modulus:
    00:aa:13:57:10:49:41:04:7b:0f:2b:f6:71:d3:e3:
    c6:67:e1:ce:44:f0:ba:85:df:75:0b:4e:a7:93:c3:
(...)

Now that we have the private key, we can sign and encrypt anything we want. Of course the certificates are expired, but that’s a different problem.

Metal compute definitely looks interesting. The API and the Metal Shading Language don’t appear to be a mess, nor does the documentation. It could be a valuable tool for specific problems in the future. It was definitely a good surprise. Easy to use, hard to master and optimize!

The code is available on GitHub:

metal_bruteforce: the Metal based bruteforcer.
pvk_decrypt: the util to generate a decrypted PVK file after the key is found.

One has to wonder how many keys the Five Eyes SIGINT boys & girls cracked all this time ;-). If the quantum breakthrough ever does happen (or already did?), all that storage in Utah is going to become very alive!

Have fun,
fG!

Some references: