This Week In Security: Use-After-Free For Dummies, WiFi Cracking, And PHP-FPM

In a brilliant write-up, [Stephen Tong] brings us his “Use-After-Free for Dummies”. It’s a surprising tale of a vulnerability that really shouldn’t exist, and a walkthrough of how to complete a capture-the-flag challenge. The vulnerable binary is running on a Raspberry Pi, which turns out to be very important. It’s a multithreaded application that uses lock-free data sharing through a pair of integers readable by multiple threads. Those ints are declared using the volatile keyword, which is a useful way to tell a compiler not to optimize too heavily, as the value may get changed by another thread.

On an x86 machine, this approach works flawlessly, as all the out-of-order execution features are guaranteed to be globally transparent. Put another way, even if thread one can speed up execution by modifying shared memory ahead of time, the CPU will make the shared memory changes visible in the proper order. When that shared memory is controlling concurrent access, it’s really important that ordering happens the way you expect. What was a surprise to me is that the ARM platform does not provide that global memory ordering. While out-of-order execution will be transparent to the thread making the changes, other threads and processes may observe those actions out of order. An example may help:

volatile int value;
volatile int ready;

// Thread 1
value = 123; // (1)
ready = 1; // (2)

// Thread 2
while (!ready); // (3)
print(value); // (4)

This is one of [Stephen]’s examples. If this were set up to run in two threads, on an x86 machine you would have a guarantee that (4) would always print 123. On ARM, there’s no such guarantee, and you may very well print an uninitialized value. It’s a race condition. Now you may look at this and wonder, like I did, how anyone programs anything for ARM chips. First, even though memory reordering is a thing, ARM guarantees consistency within a single thread, so this quirk only affects multi-threaded programming. And second, libraries for multi-threaded programming offer semantics for marking memory accesses that need to be properly ordered across threads.
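
As a concrete illustration, here’s a minimal sketch (my own, not code from [Stephen]’s writeup) of how C11 atomics express that ordering requirement. The release store pairs with the acquire load, so once thread 2 observes ready == 1, the earlier write to value is guaranteed to be visible:

#include <stdatomic.h>
#include <stdio.h>

int value;
atomic_int ready;

// Thread 1: a release store publishes every write made before it.
void producer(void) {
    value = 123;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

// Thread 2: the acquire load synchronizes with the release store,
// so value is guaranteed to read 123 here, even on ARM.
void consumer(void) {
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;
    printf("%d\n", value);
}

On x86 these orderings compile down to plain loads and stores; on ARM the compiler emits the barrier instructions for you.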

The actual exploitable binary in question uses a circular queue for the inter-process buffer, and tracks a head and tail location to determine how full the buffer is. One process puts data in, the second reads it out. The vulnerability is that when the buffer is completely full, memory access reordering can result in a race condition. This ring buffer gets filled with pointers, and when an attacker wins the race, the same pointer is used twice. In essence, the program now has two references to the same object. Without any further tricks, this results in a double-free error when the second reference is released.
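
To make the flawed pattern concrete, here’s a minimal sketch of a volatile-only single-producer, single-consumer pointer ring. This is my own illustration of the pattern, not the CTF binary’s actual code:

#include <stddef.h>

#define SLOTS 16

void *volatile slots[SLOTS];
volatile unsigned head, tail; // producer advances head, consumer advances tail

int push(void *p) {
    if (head - tail == SLOTS) return 0; // buffer full
    slots[head % SLOTS] = p;            // (A) store the pointer
    head = head + 1;                    // (B) publish the slot
    return 1; // on ARM, another core may observe (B) before (A)
}

void *pop(void) {
    if (head == tail) return NULL;      // buffer empty
    void *p = slots[tail % SLOTS];      // a stale pointer, if the producer's
    tail = tail + 1;                    // stores arrived out of order
    return p;
}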

What are the tricks we could use to make this into an exploit? First, know that what we have is two references to one object, and that object contains a pointer to a string whose length is entirely controlled by user-provided data. We can trigger a release of one of those references, which leads to the object getting freed, but we still hold another reference, which now points to freed memory. To turn this into an arbitrary read, a very clever trick is used. Before freeing our doubly-referenced object, we allocate another object and store a long-ish string in it. Then we free the doubly-referenced object, followed by the object holding the long string. Finally, we allocate one more object, but the string we store is crafted to look like a valid object. Memory gets reallocated in last-in, first-out order, so the string lands in the reclaimed memory we still have a reference to. The program expects the object to contain a pointer to a string, so our fake object can point to arbitrary memory, which we can then read.
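
That last-in, first-out reuse is easy to see on a typical glibc system. A quick sketch of my own (the exact reuse behavior depends on the allocator):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    char *a = malloc(64);
    char *b = malloc(64);

    free(a); // freed first...
    free(b); // ...freed last, so b sits on top of the free list

    char *c = malloc(64); // gets b's chunk back (last in, first out)
    char *d = malloc(64); // gets a's chunk back
    printf("c == b? %s\n", c == b ? "yes" : "no");
    printf("d == a? %s\n", d == a ? "yes" : "no");
    return 0;
}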

The last trick is arbitrary write, which is even harder to pull off. The trick here is to actually perform the double free, but manipulate the heap so it doesn’t result in a segfault. We can use the trick above to write arbitrary data to a freed memory location. Because the location has made it onto the free list twice, the allocator still considers it free even though it’s also in use. The heap allocator uses a clever trick to manage reclaimed memory chunks, storing a pointer to the next reclaimed location inside each free chunk. Write the location you want to overwrite into that free chunk, and then allocate another chunk. The allocator now thinks your arbitrary location is the next free memory location to use, so the next allocation is your arbitrary write. The writeup has more details, as well as the rest of the exploitation chain, so be sure to read the whole thing.
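
Boiled down to a toy, the free-list trick looks something like the sketch below. Fair warning: this is an illustration of the idea, not a working exploit. Modern glibc mangles the next pointer and detects naive double frees, and the writeup’s actual chain is more involved.

#include <stdlib.h>

long target; // stand-in for the arbitrary location we want to write

int main(void) {
    char *a = malloc(32);
    free(a); // a's chunk now heads the free list

    // The use-after-free write: a freed chunk's first bytes hold the
    // "next free chunk" pointer, so aim it at our target.
    *(void **)a = &target;

    malloc(32);              // pops a's chunk off the free list
    long *evil = malloc(32); // the allocator hands back &target
    *evil = 0x41414141;      // the arbitrary write
    return 0;
}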

How Secure is that WiFi?

[Ido Hoorvitch] of CyberArk had some pandemic-induced time on his hands, and opted to collect packet captures of 5000 password-protected WiFi networks around Tel Aviv. In the old days, you had to capture a full 4-way handshake to have any chance at breaking WPA encryption. In 2018 a new technique was discovered, where a single authentication response is all that’s required to attempt to crack the key, no active user required. The magic string here is the PMKID: an HMAC-SHA-1 hash over network details, keyed with a value derived from the WPA password by a key derivation function.
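
For reference, the construction looks something like this (simplified; everything except the passphrase is visible on the air, which is what makes offline guessing possible):

// PMK   = PBKDF2-HMAC-SHA1(passphrase, SSID, 4096 iterations, 32 bytes)
// PMKID = first 16 bytes of HMAC-SHA1(PMK, "PMK Name" | MAC_AP | MAC_STA)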

The popular tool Hashcat can take advantage of a GPU to accelerate cracking a PMKID. SHA-1 hashes are one of the things GPUs are particularly good at, after all. [Ido]’s rig of 8 Quadros managed almost 7 million hash calculations per second. The problem with trying to crack a WPA key is that while it must be at least 8 characters long, it can be much longer, making for an enormous search space. The first trick [Ido] used was to take advantage of a common password source: a cell phone number. In Tel Aviv, that means the password is 05 followed by 8 more digits. That’s a searchable key space, and of the 5000 sniffed networks, nearly half fell to this approach. Next was pointing Hashcat at a dictionary file, to automatically try known passwords. Between the dictionary attack and constraint-based approaches like the cell number format, 70% of the targeted networks were cracked. The takeaway? Use a long password that isn’t easily guessed and won’t turn up in a constrained search.
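
The cell number attack is a textbook Hashcat mask attack. A run along these lines (the capture file name is hypothetical; -m 22000 is Hashcat’s modern WPA mode, and each ?d matches a single digit) covers all hundred million 05-prefixed numbers:

hashcat -m 22000 -a 3 captures.hc22000 05?d?d?d?d?d?d?d?d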

Google Use After Free PoC

Reported in June of this year by the Security For Everyone team, CVE-2021-30573 now has a published PoC. This vulnerability was fixed in Chrome/Chromium 92. The triggering code is a bit of simple but very malformed HTML. Trying to parse this code just by looking at it, I immediately called it “cursed” HTML, so it’s no wonder Chrome had trouble with it, too.

<select class="form-control">
<option style="font-size: 1rem;" value="
<"">
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA(abbreviated)
</>">a
</option>
</select>

PHP Worker to Root

A bug in PHP-FPM discovered by Ambionics Security allows jumping from control over a PHP worker straight to system root. While it’s a serious problem, this isn’t a remote code execution vulnerability. Some other technique needs to be used first to take over a PHP worker process. This means an attacker would need to be able to run PHP code, and then also find a way to escape the PHP “sandbox.” While not trivial, there are techniques and bugs that make this possible.

The problem is that the inter-process communication mechanism is shared mapped memory, and far too much of the data structure is made available to the individual workers. A worker can modify the main data structure, causing the top-level process to write to arbitrary memory locations. While the location may be arbitrary, the actual data writes are extremely limited in this vulnerability. In fact, it boils down to two write primitives: set-0-to-1 and clear-1168-bytes. That may not seem like much, but there are a lot of flags that can be toggled by setting a value to 1, and the rest of the exploit makes heavy use of that technique. The real trick is to generate arbitrary error messages, then use the 0-to-1 primitive to corrupt the data structures of those messages.
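
To see why that design is scary, here’s a toy model of the trust boundary, with made-up structure names; this is my own sketch, not PHP-FPM’s actual layout. The parent blindly acts on a structure that a compromised worker can rewrite at will:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// Illustrative only: a shared structure the parent process trusts.
struct scoreboard {
    char *log_path;    // the parent dereferences this
    int   needs_flush; // a flag a worker can flip (the set-0-to-1 idea)
};

int main(void) {
    struct scoreboard *sb = mmap(NULL, sizeof *sb,
                                 PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    static char path[] = "/tmp/fpm.log";
    sb->log_path = path;

    if (fork() == 0) {                     // the "worker"
        sb->needs_flush = 1;               // innocent-looking flag flip
        sb->log_path = (char *)0x41414140; // corrupted shared pointer
        _exit(0);
    }
    wait(NULL);
    if (sb->needs_flush) // the parent now acts on attacker-controlled data
        printf("parent would write via %p\n", (void *)sb->log_path);
    return 0;
}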

The vulnerability has been around for a very long time, since PHP 5.3.7. It’s fixed in 8.0.12, 7.4.25, and 7.3.32. One final wrinkle here is that PHP 7.3 is still in security support, but the fix was considered an invasive change, and the PHP maintainers initially opted not to push it to the older version. After some back and forth on the bug discussion, the right call was made, and 7.3.32 was released with the fix.

GitLab in the Wild

HN Security had a client report something suspicious, and it turned out to be CVE-2021-22205 in use in the wild. This bug is a problem in ExifTool, where a DjVu file can execute arbitrary Perl code. This RCE was abused to make new users admins on the attacked system. If you’re running GitLab, make sure you’re up to date. Versions 13.10.3, 13.9.6, and 13.8.8 were released with the fix on April 14 of this year. It appears that the 14.* versions were never vulnerable, as version 14.0 was released after this fix. HackerOne and the entire bug bounty community have their share of problems, but the disclosure thread for this one is an example of a program run correctly.

27 thoughts on “This Week In Security: Use-After-Free For Dummies, WiFi Cracking, And PHP-FPM”

  1. With ARM compilers, using volatile isn’t enough for such operations (depending on the compiler).
    You also have to use memory barrier instructions (again, syntax depending on the compiler) to enforce that data is flushed to/from memory before proceeding past a critical operation. Then any other thread can depend on the data being consistent.
    When porting to ARM, yes, this can be an extra headache, but really the programmers doing the port should know all this stuff already.

    1. Yes. This is because of cache incoherence. Each core/thread sees a copy of a variable in its own cache, which may hold a different value until a memory barrier instruction is issued.

        1. Are you sure? That would mean information about pending memory accesses is part of a thread’s context. It’s the only reason I can imagine why a different thread on the same core would see a different value in memory. Otherwise, how would the same thread (on the same core) be able to properly retrieve the value it wrote a few instructions earlier?

          1. Yes, I am sure. The simple explanation for the need for memory barriers is that the instruction and data pipelines on ARM take different routes, so your code can be executing away without the data having made it to memory (this improves performance).

            The memory barrier instruction is a synchronisation point between the two pipelines.

            If you don’t do it, then later code can execute that references the same memory, and reads the value in from physical memory rather than the value still traversing the earlier write pipeline. You might think of the pipeline values as tiny caches, but that would probably confuse the matter, as the data and instruction caches are much bigger, use different hardware, and have a different “synchronisation API”.

            ARM do a course on this stuff, I did it as part of an M3 and M7 programming course.
            I have also experienced these opportunities for bugs first hand while writing device drivers and fixing linked list code.

          2. Oh, and in case I didn’t make this clear: for your “multiple threads” you don’t even need an RTOS or similar, if you use IRQs. Just consider an IRQ to be a different thread from the main code, and add memory barriers where needed.

    2. “really the programmers doing the port should know all this stuff already.”

      When a “high level” language is incapable of hiding the basic chip architecture from the application programmer, it is most certainly a grave flaw in the language.

      WHY does every C programmer have to memorize thousands of pages of chip manuals and work out the arcane details of multithreaded memory coherency for a long laundry list of architectures? Why do we all have to be experts on something the language could be handling transparently? Is this really what we “should” be doing?

      And where are these “programmers doing the port”? They do not exist! They are exactly the same programmers who have been making security disaster after security disaster for decades now, why do you expect them to magically get better at their jobs?

        1. The compiler is not going to figure out that you forgot to use the barrier instructions that were not needed when the original developer wrote the code for x86_64. They retired and are no longer available so you have to pick over the code yourself to find the race conditions.

          1. True – and it’s all “part of the fun” when porting code between architectures.
            But now that I’ve been doing ARM (“do no ARM” said the x86 programmer :-) ) for a few years, this stuff is second nature, and when I look at such code I’m internally thinking “needs a barrier there”, “needs a barrier here”, etc.

            I suppose it would therefore be possible to get a compiler/preprocessor to do this for you, maybe with some simple rules, then produce a copy of the same code with the barriers added, which you could save over the original. That would be pretty neat.

      1. I’ve written C code on countless embedded targets over a couple of decades now, and I’ve never “memorised thousands of pages of chip manuals”. Sure, I have to read them, but I can write code to handle it and then forget most of the details. Or use an existing library where someone else has done the work.

        It’s ridiculous to assume you can blindly execute *any* code on a different platform and expect it to work perfectly. Even if there was a language that cleanly hid those details across every platform… is that supposed to have magically appeared out of the aether? Someone has to have written it, and they have to do it in a language that gives access to those platform features. If this magic language has hidden it all, it obviously can’t be self-hosting… which means someone has to be doing it in C *anyway*. If it’s exposed via some mechanism then you’ll have someone try to dumbly use THAT instead of the intended method, and we’re right back here again.

        Don’t blindly throw ‘volatile’ around unless you understand what it does, which means understanding the platform you’re on. If you can’t do that, use one of the libraries that already do the job properly. Which is *functionally identical* to having it built into the language.

      1. Memory barriers have huge performance penalties on modern processors. The execution pipeline must be flushed. You might just as well throw out 30 years of innovation and compile for 80386 because that’s what kind of performance you will get.

    3. I’m looking at it and thinking: bad software design, with no thread-safe mutex lock. If you have multiple threads accessing the exact same memory, why would you ever assume the RAM would be in a consistent state? Me, if I were carrying out something like a multi-threaded Fast Fourier Transform (for the sake of an example), I’d treat any RAM shared between threads as read-only, and allocate a mutex lock per thread for the block of memory being processed in parallel. Only once all locks had been released by all the threads would I consider it safe to allow a different thread to write to that block of RAM (with an additional mutex lock). I would end up using more memory, but I would be thread safe. I guess the problem is whether people fully think things through or half-ass the job (Bailey, the “I have no idea what I’m doing” dog, always comes to mind https://i.imgur.com/ZQM77OT.jpeg RIP: 2009-2016).

      1. “I’d treat any RAM shared between threads as read only and allocate a mutex lock per thread”

        This is a performance-killing disaster and completely unnecessary in most circumstances. You can get far better performance and still ensure thread safety with lockless algorithms; you have to code it up yourself and you must be very careful, but the rewards are big.

        This is why we need a new language, it should be easy to do the right thing.

        1. “This is a performance killing disaster and completely unnecessary in most circumstances”
          It was a contrived example, but at the end of the day it depends on the size of the FFT: if it was a 128-point FFT then yes, it would totally destroy performance, but if it was a 33554432-point FFT, the overhead becomes insignificant.

  2. The example is kind of contrived, because you could use a byte-sized type for the variables and just ignore the whole ordering problem. A smart compiler would figure this out and do it for you, but alas, C compilers are too stupid to infer typing.

    1. Man, you really hate C. Using pointer(s), show us where C did ya wrong.
      Just messing with ya, you can like/hate whatever you like/hate.
      Its all 01101111 01101110 01100101 01110011 00100000 01100001 01101110 01100100 00100000 01111010 01100101 01110010 01101111 01110011 00100000 01110100 01101111 00100000 00101011 00100000 00101101 00100000 01110110 01101111 01101100 01110100 01100001 01100111 01100101 00100000 01100110 01110010 01101111 01101101 00100000 01110110 01100001 01110010 01101001 01100001 01110100 01101001 01101111 01101110 01110011 00100000 01100100 01100101 01100110 01101001 01101110 01100101 01100100 00100000 01100010 01111001 00100000 01100011 01101111 01101110 01110011 01100101 01101110 01110011 01110101 01110011 00100000 01101111 01100110 00100000 01110000 01100001 01110010 01100001 01101101 01100101 01110100 01100101 01110010 01110011 00101110 00100000 01010100 01101000 01100101 01101110 00100000 01100001 01100111 01100001 01101001 01101110 00100000 01111001 01101111 01110101 00100000 01101000 01100001 01110110 01100101 00100000 01110100 01101111 00100000 01100101 01111000 01100011 01110101 01110011 01100101 00100000 01101101 01100101 00100000 01001001 00100111 01101101 00100000 01100001 00100000 01101100 01101001 01110100 01110100 01101100 01100101 00100000 01100010 01101001 01110100 00100000 01000010 01110010 01101001 01110100 01101001 01110011 01101000 01101100 01111001 00100000 01101101 01100001 01100100 . have a fun day

  3. There are no guarantees on x86 either (MSVC has an exception that’s non-standard). The compiler is free to reorder ordinary accesses across volatiles; it is not allowed to reorder volatiles with respect to each other, though. And even if the cache might reorder accesses (I believe ARM is notorious for ordering writes in memory address order), the CPU itself may also do this when reordering accesses in its pipeline.
