Happy Valentine’s Day! And what’s more romantic than a post about analyzing an x64 crash dump? If you haven’t picked up a card already, feel free to print this out and hand it to your significant other.
Way back in December, we started looking at the fundamentals of x64 crash analysis so that we could work up to analyzing an actual x64 crash. If you haven’t already, I suggest that you read the following posts in order, since we’ll put all of them into practice here:
x64 Trap Frames
x64 Calling Convention
x64 Stack Frame layout
Reconstructing parameters from x64 crash dumps
With that out of the way, we can start our analysis the way we always do with any crash by running !analyze -v:

The bugcheck code in this case is PAGE_FAULT_IN_NONPAGED_AREA (0x50). In order to solve this crash we should probably cover what exactly this crash code is indicating.
The kernel virtual address space in Windows contains lots of different memory regions that serve different purposes. Regardless of the purpose of the region, one characteristic that all kernel memory has is whether or not the memory is pageable. When a memory region is pageable, the Memory Manager (Mm) is free to take the contents of the physical page of memory, write it out to disk, and then invalidate the virtual address. The next time someone tries to read the contents of that address, a page fault occurs and the contents are brought back into memory from disk. Once the memory is again resident, the Mm fixes the virtual address pointer and resumes the thread. When a memory region is non-pageable, it means that the Mm promises to never page out the memory and invalidate the virtual address in this manner.
Having non-pageable memory is important in Windows because it is the only kind of memory that you are allowed to access at IRQL DISPATCH_LEVEL or above (see my previous post here for more on IRQL). The reason for this is that you are not allowed to perform any wait operations at IRQL DISPATCH_LEVEL or above and by using pageable memory you’re implicitly stating that you can wait if the memory that you’re trying to access is not resident in memory.
With that out of the way, we can understand what the particular bugcheck code means. These non-pageable regions are only guaranteed to be valid if you have a valid outstanding resource allocation from the Mm. Take, for example, non-paged pool, which is the kernel equivalent to the user mode heap with the exception that memory allocated from this pool is guaranteed to never be paged out to disk. However, that does not mean that every address within the non-paged pool area is valid at all times. The Mm may delay programming a particular virtual address in this region until it is about to return a pool allocation to a particular caller, or mark the virtual address as invalid when someone frees a valid pool allocation. If someone tries to access one of these invalid addresses, a page fault will occur and the Mm will inspect the invalid virtual address to decide what needs to be done with it. If this virtual address corresponds to a region that is guaranteed to not page fault when valid, the Mm calls KeBugCheck with a bugcheck code of PAGE_FAULT_IN_NONPAGED_AREA. If you think about it, this is the only reasonable thing that the Mm can do since there is no way to recover from this state (you can argue that there are things that could be done, but that’s in the realm of fault tolerant systems and not relevant to the discussion).
We can now break down the text associated with the bugcheck code and understand a bit more of what it means:
Invalid system memory was referenced. This cannot be protected by try-except, it must be protected by a Probe. Typically the address is just plain bad or it is pointing at freed memory.
Invalid system memory was referenced.
This bugcheck is always the result of dereferencing a bad kernel virtual address.
This cannot be protected by try-except, it must be protected by a Probe.
If you dereference a bad user virtual address, a structured exception is raised that your driver can catch in a structured exception handling (SEH) block. When an invalid kernel address is accessed, there is no structured exception raised and the system simply bugchecks. The comment here about being protected by a probe is misleading. There is no way to validate a kernel address other than dereferencing it and hoping for the best. The idea is that kernel callers are trusted, thus if a kernel component hands you a kernel virtual address you must assume that it is valid. What the comment here is referring to is that you should not be touching kernel virtual addresses that originated from user mode. You can avoid this situation by calling ProbeForRead or ProbeForWrite on any address handed to you from user mode, which will raise an exception if the address is a kernel virtual address. This is only useful if you’re performing METHOD_NEITHER I/O and is not relevant to our conversation.
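To make the idea of a probe concrete, here is a minimal Python sketch of the kind of range check involved. It is a toy model, not the kernel routine: the names `toy_probe_for_read`, `is_user_range`, and `ToyAccessViolation` are hypothetical, the boundary constant is an assumed value for the top of the probeable user range on x64 Windows, and the alignment check that the real routines also perform is left out.

```python
# Toy model of the range check behind a probe. Everything here is illustrative.
# ASSUMPTION: highest probeable user-mode boundary on x64 Windows.
ASSUMED_USER_PROBE_LIMIT = 0x00007FFFFFFF0000


class ToyAccessViolation(Exception):
    """Stands in for the structured exception a probe would raise."""


def is_user_range(address: int, length: int) -> bool:
    """Return True if [address, address + length) lies below the assumed limit."""
    if length == 0:
        # A zero-length buffer touches nothing, so there is nothing to reject.
        return True
    end = address + length
    # Python integers do not wrap, so also reject ranges that run past 64 bits.
    return end <= ASSUMED_USER_PROBE_LIMIT and end < 2**64


def toy_probe_for_read(address: int, length: int) -> None:
    """Raise if the buffer is not entirely inside the user-mode range."""
    if not is_user_range(address, length):
        raise ToyAccessViolation(f"{address:#018x} is not a user-mode buffer")


if __name__ == "__main__":
    toy_probe_for_read(0x0000000000401000, 16)  # user address: passes silently
    try:
        toy_probe_for_read(0xFFFFBA8007122A88, 8)  # kernel address: rejected
    except ToyAccessViolation as exc:
        print("rejected:", exc)
```

Note what the check does and does not tell you: it only says whether the range is a user-mode range. It says nothing about whether a kernel address is valid, which is exactly why a probe would not have helped in this crash.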
Typically the address is just plain bad or it is pointing at freed memory.
This is the short version of what we’ve been talking about up until now. If you have page faulted on an address in the non-paged area it means that you have dereferenced something that is not a valid memory allocation. Generally this means that it’s a garbage value (e.g. uninitialized pointer), freed memory, or some kind of corruption (e.g. a pointer value from a corrupted data structure).
Now we should have a much better idea as to what we’re looking at and the kind of bug that we’re looking for. According to the !analyze output, the invalid address that caused the wreck was 0xffffba80`07122a88 and we were attempting to write to the address (parameter 2 from the bugcheck information). Let’s look at the trap frame output and attempt to identify the invalid address in the faulting instruction:

If you haven’t read my previous post about x64 trap frames, the output above will likely be confusing. There are two pointer dereferences in the above output: the instruction pointer RIP and RBX. Neither one of these values is 0xffffba80`07122a88 and, in fact, this looks to be a NULL pointer dereference since RBX is zero. However, as we know, trap frames on x64 do not contain non-volatile register state, and RBX is a non-volatile register. So, in order to recover the value of RBX at the time of the crash, we’ll need to scroll back through the assembly and find a volatile register that either shadows RBX in this frame or that we can use to derive RBX.
The first step to getting this information will be to execute the .trap command and get ourselves into the correct trap frame. We don’t need to find the trap frame address ourselves since it was already present in the !analyze output:

Now that our registers are back, we can go back through the disassembly and figure out where RBX came from prior to the dump. I generally do this by bringing up the disassembly window (Alt+7) since I find that a bit more convenient in this situation than trying to use the keyboard shortcuts to navigate the assembly. Bringing up the window and scrolling back a bit shows RBX coming from RDX a few instructions earlier:

If we go back and view the RDX register value contents, we’ll see that they match the pointer value from the first parameter to the bugcheck:

We could have intuited the value of RBX based on the bugcheck information and our knowledge of x64 trap frames, though I like to take this extra step to make sure that I understand these things and also get my bearings with the dump. Also, in this case I know that the RBX value came from RDX, which is likely the second parameter to the function. Thus, if I can find a function prototype I can know what the type of the structure should be. This all provides me with greater context for the dump and a greater chance that I’ll have some success in analyzing it.
Let’s review what we have up to this point:
1) Someone has tried to write to an invalid system address
2) The address was passed as the second parameter to this routine
The next logical step that we need to explore is, “what kind of address is this supposed to be and why is this address invalid?”
For the first part of this question, we can typically use the !address extension and figure out what region this address lies in. Unfortunately, at the time of this writing that command does not work on Windows 7 and determining this without that extension is beyond the scope of this article. There is one thing we can quickly check though and that is whether or not this address lies in one of the various executive pools, which we can do with the !pool command:

Based on that output, it’s likely that this is not a pool allocation. The fact that it states that it is corrupt or free pool doesn’t really mean much; it should really say, “I have no idea what this is, but it doesn’t look like pool to me.” While that doesn’t provide us much positive information, we at least know that it’s not likely a bad pool address due to being used after it was freed. This at least removes a class of bugs and narrows our search a bit.
Since it’s not pool and !address doesn’t work, the best we can do now is inspect the PTE contents and see if there’s anything interesting there:

According to the !pte output, this address is not only bad it’s really bad. At no level of the page translation process does this address have valid information. To me, this screams that this address is the result of an uninitialized pointer reference or a data corruption. Since the crash occurs in a Microsoft supplied component, the fact that this crash would come from an uninitialized pointer is very unlikely and so my sights are set firmly on some type of corruption. But what kind?
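For readers who want to see what “every level of the page translation process” refers to, the following Python sketch splits the faulting address into the four table indices and the page offset used by standard x64 4-level paging with 4KB pages. The function name `split_x64_va` is hypothetical, and the PXE/PPE/PDE/PTE labels are simply the names the debugger uses for those levels. The sketch only does the index arithmetic; it cannot tell you whether any entry is valid, which is what !pte reads out of the dump.

```python
def split_x64_va(va: int) -> dict:
    """Split a 64-bit virtual address into x64 4-level paging indices.

    Each table index is 9 bits wide; the low 12 bits are the byte offset
    within a 4KB page.
    """
    return {
        "pxe_index": (va >> 39) & 0x1FF,  # PML4 entry
        "ppe_index": (va >> 30) & 0x1FF,  # page directory pointer entry
        "pde_index": (va >> 21) & 0x1FF,  # page directory entry
        "pte_index": (va >> 12) & 0x1FF,  # page table entry
        "page_offset": va & 0xFFF,
    }


if __name__ == "__main__":
    # The faulting address reported by !analyze in this dump.
    for name, value in split_x64_va(0xFFFFBA8007122A88).items():
        print(f"{name:12} {value:#x}")
```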
When I first analyzed this dump, I stayed at this point for almost 24 hours (luckily not straight!). I just couldn’t see what the corruption was or where it came from. I spent the time to go through every other thread in the system and even searched the kernel address space for other references to this address, hoping for some light to shine on a clue. Since this was a Filter Manager structure that was corrupt, I also checked all of the current Filter Manager mini-filters with the !fltkd.filters command and was bummed to find only in-box, Microsoft-supplied filters running.
At this point I took the advice that I give to all students: I walked away from the dump. Anyone who claims that crash dump analysis isn’t difficult is either lying or doesn’t get presented with many challenging dumps. Sometimes walking away from the dump gives you the fresh perspective and eyes that you need to spot something minuscule, such as a missing bit.
Before calling it quits on the dump, I gave it one last look and noticed something curious about the faulting pointer value when compared to the other values in the trap frame: only the top four hex digits of the address were 0xf, not the top five like the other registers.

I felt like Archimedes in the bathtub, though I expressed my excitement in a slightly less dramatic fashion (and without all of the nudity). What if this was a single bit flip error? It’s possible that due to some sort of cosmic error there was a single bit in this address that should have come back as 1 but instead read as 0. So, I flipped the bit and found what was an entirely plausible pool address that indicated it was a valid Filter Manager allocation:
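The arithmetic behind that hunch is easy to check outside the debugger. The Python sketch below tries all 64 single-bit flips of the faulting address and keeps only the candidates whose top five hex digits are 0xf, the pattern seen in the other registers. The helper names are hypothetical, and the five-digit filter is a heuristic taken from that observation, not a general rule. The script only finds candidates; confirming that one of them really is a pool allocation still takes a debugger command such as !pool.

```python
FAULTING_ADDRESS = 0xFFFFBA8007122A88  # from the !analyze output


def leading_f_digits(value: int) -> int:
    """Count how many of the top hex digits of a 64-bit value are 'f'."""
    text = f"{value:016x}"
    return len(text) - len(text.lstrip("f"))


def single_bit_flips(value: int):
    """Yield (bit_number, flipped_value) for every bit of a 64-bit value."""
    for bit in range(64):
        yield bit, value ^ (1 << bit)


def plausible_candidates(value: int, wanted_f_digits: int = 5) -> list:
    """Keep the flips that restore the expected run of leading 'f' digits."""
    return [
        (bit, flipped)
        for bit, flipped in single_bit_flips(value)
        if leading_f_digits(flipped) >= wanted_f_digits
    ]


if __name__ == "__main__":
    for bit, flipped in plausible_candidates(FAULTING_ADDRESS):
        print(f"bit {bit}: {flipped:#018x}")
```

Because the odd digit is 0xb (binary 1011) where 0xf (binary 1111) was expected, only one bit differs, so the search has very few ways to succeed. If a pointer needed two or more flips to look sane, a single-bit hardware error would be a much weaker explanation.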

This gave me a plausible explanation for the dump: hardware failure. I wanted to collect at least one other piece of information to support this, so I decided to check to see if there were any physical pages marked as bad in this system. While this wouldn’t provide any type of solution, it would at least be another indicator that this machine was having hardware-related issues. Thus, I inspected the state of the Page Frame Database in this machine using the !memusage command and did indeed find ~7MB of bad pages:
