Code: PDF hunter

So of late I’ve been playing around a lot with Scapy and pcap files, mostly for my sniffMyPackets project but also because it teaches me more about network forensics and Python. The other area I’m starting to learn about is malware analysis, and I’ve been spending some time looking at the Honeynet Project challenges.

One of the challenges is to find the malicious content within a PDF file that is provided to you in a pcap file. Normally I would just reach for NetworkMiner and rebuild the file(s) that way, but I wanted to see if I could write some code myself.

The goal of my code was simple: parse through a pcap file, identify a PDF and then rebuild the file so that a tool such as exiftool or file correctly identifies it as a PDF, and so that you can open the PDF and view the content (if you want to).

I follow a certain process when I’m carving up pcap files; it’s not rocket science, really just common sense. First, find the packets you are interested in (I tend to use a mix of Wireshark and Scapy for this), and then look for something you can use to filter down to the packets you want before getting into the nitty-gritty of carving them up.

For this piece of code I need some way of identifying a PDF file in a pcap file. As most PDF files will appear in a pcap file as part of an HTTP conversation, I parse each packet, and if the packet has a Raw layer (a Raw layer in Scapy is essentially the payload of a packet) I look for the string ‘Content-Type: application/pdf’. If this is matched I store the TCP ACK number as a variable for use later.
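To make that search concrete, here is a minimal sketch of the idea. It does not use Scapy itself: Segment is a hypothetical stand-in for the seq, ack and Raw payload you would pull out of each packet, and find_pdf_ack is an illustrative name, not a function from my actual code.

```python
from collections import namedtuple

# Hypothetical stand-in for the fields pulled out of each Scapy packet:
# TCP sequence number, TCP ACK number and the Raw payload as bytes.
Segment = namedtuple("Segment", ["seq", "ack", "payload"])

PDF_CONTENT_TYPE = b"content-type: application/pdf"


def find_pdf_ack(segments):
    """Return the ACK number of the first segment announcing a PDF, or None."""
    for seg in segments:
        # HTTP header names are case-insensitive, so compare in lowercase.
        if PDF_CONTENT_TYPE in seg.payload.lower():
            return seg.ack
    return None


# Made-up example traffic: one request and the start of the response.
segments = [
    Segment(seq=100, ack=500,
            payload=b"GET /report.pdf HTTP/1.1\r\nHost: example.test\r\n\r\n"),
    Segment(seq=500, ack=148,
            payload=b"HTTP/1.1 200 OK\r\nContent-Type: application/pdf\r\n\r\n%PDF-1.4\n"),
]
pdf_ack = find_pdf_ack(segments)
```

In this made-up capture pdf_ack ends up holding the ACK number of the response packet, which is the value used to pick out the rest of the download.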

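Once I have the ACK number I need to find all the packets that relate to it in order to get the whole file. It turns out the ACK is the same for all the packets that the PDF download is in (something I didn’t realise until I started this). That makes sense once you think about it: the server’s ACK number only advances when the client sends more data, and during the download the client is normally just acknowledging what it receives. So it’s a simple case of using the following code to find all the packets I’m after: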

for p in pkts:
    if p.haslayer(TCP) and p.haslayer(Raw) and (p.getlayer(TCP).ack == int(ack) or p.getlayer(TCP).seq == int(ack)):
        raw = p.getlayer(Raw).load

If either the TCP ACK or SEQ matches our stored ACK variable we get the Raw layer and store it in a Python list. This means that we now have (hopefully) all the packets that make up the PDF stored nicely away, and because it’s a TCP conversation they should usually be in the right order in the capture (out-of-order or retransmitted segments would need extra handling).
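If you don’t want to rely on capture order, one option is to sort the matching payloads by sequence number and skip repeats. The sketch below is not what my code does; it is a simplified alternative that only matches on the ACK number and assumes no sequence number wrap-around and no partially overlapping segments. Segment and reassemble are hypothetical names standing in for fields extracted from Scapy packets.

```python
from collections import namedtuple

# Hypothetical stand-in for the fields pulled out of each Scapy packet.
Segment = namedtuple("Segment", ["seq", "ack", "payload"])


def reassemble(segments, ack):
    """Join the payloads of segments sharing `ack`, ordered by sequence number."""
    by_seq = {}
    for seg in segments:
        if seg.ack == ack and seg.payload:
            # Keep the first copy seen for each sequence number; later
            # copies are treated as retransmissions and ignored.
            by_seq.setdefault(seg.seq, seg.payload)
    return b"".join(by_seq[seq] for seq in sorted(by_seq))


# Made-up capture: segments arrive out of order, one is retransmitted and
# one belongs to a different conversation (different ACK number).
captured = [
    Segment(seq=1009, ack=148, payload=b"body\n"),
    Segment(seq=1000, ack=148, payload=b"%PDF-1.4\n"),
    Segment(seq=7000, ack=999, payload=b"unrelated"),
    Segment(seq=1009, ack=148, payload=b"body\n"),
    Segment(seq=1014, ack=148, payload=b"%%EOF\n"),
]
pdf_bytes = reassemble(captured, 148)
```

The dictionary keyed on sequence number is what removes the duplicate, and sorting the keys puts the payloads back into stream order before they are joined.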

Now that we have all the packets we write them out to a temporary file. It’s a temporary file because if you were to open it in a text editor you would see the HTTP headers at the top and the bottom, which means that if you ran file against it you would get back a file type of “data” and not “PDF” (which is what we are after).
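As a rough illustration of why the headers matter, the sketch below checks whether some bytes begin with the PDF signature. This is only an approximation of the kind of leading-signature check that identification tools perform, not a reimplementation of file, and starts_with_pdf_magic is a hypothetical helper name.

```python
PDF_MAGIC = b"%PDF-"


def starts_with_pdf_magic(data):
    """Return True if the bytes begin with the PDF signature."""
    return data.startswith(PDF_MAGIC)


# Made-up contents: what the temporary file looks like (HTTP headers first)
# compared with the same document once the headers are gone.
with_headers = b"HTTP/1.1 200 OK\r\nContent-Type: application/pdf\r\n\r\n%PDF-1.4\n%%EOF\n"
without_headers = b"%PDF-1.4\n%%EOF\n"

temp_file_ok = starts_with_pdf_magic(with_headers)
clean_file_ok = starts_with_pdf_magic(without_headers)
```

With the HTTP response line sitting at the very start, the signature is not where it is expected, which is why the headers have to be carved off.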

So we then have to do some Python magic (well, I think it’s magic) to slice the rubbish out. This is the part that took me the longest to figure out. If you have ever looked at a PDF file in a text editor (I wouldn’t blame you if you haven’t), you will have noticed that it starts with “%PDF-” and ends with “%%EOF”, so finding the start of a PDF file is easy. The problem is that a PDF file can have multiple %%EOF markers towards the end of the file, and I kept cutting at the wrong point.

To fix this I came up with a bit of a long-winded way of carving the temporary file up (see the code below):

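# Open the temp file, cut the HTTP headers out and then save it again as a PDF
total_lines = ''
firstcut = ''
secondcut = ''
final_cut = ''

f = open(tmpfile, 'r').readlines()

total_lines = len(f)

# Find the line holding the PDF header
for x, line in enumerate(f):
    if start in line:
        firstcut = int(x)
        # Stop at the first match so a later %PDF- string doesn't move the cut
        break

# Find the line after the last %%EOF (each match overwrites the previous one)
for y, line in enumerate(f):
    if end in line:
        secondcut = int(y) + 1

# Drop everything before the PDF header
f = f[firstcut:]

# Drop any trailing lines after the last %%EOF
if int(total_lines) - int(secondcut) != 0:
    final_cut = int(total_lines) - int(secondcut)
    f = f[:-final_cut]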

If you read Python, awesome; if you don’t, here’s what happens.

First off I open the temporary file and count the number of lines. I then look for the variable I declared at the start of the code as start (which is this: start = str('%PDF-')); if that’s matched, the line number is stored as the variable firstcut.

I then need to find the last cut. I look for the variable end (which is this: end = str('%%EOF')). Remember I said a PDF can have multiple EOF statements? I get round that because Python overwrites the variable secondcut each time it’s matched, so the last line with EOF is always the one used. I also add +1 to the line number because if I didn’t, the next chunk of code would actually cut the final %%EOF off the file (I know this because I did it, before realising what was happening).

So we now do a simple little if statement to make sure that there is something at the end of the file to cut (sometimes there isn’t in the pcap files I’ve used/made), and if there is we slice the bad HTTP headers out before saving the file. If there isn’t anything to cut then we just save the file.
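For comparison, the same carve can be sketched on raw bytes rather than lines, using the first “%PDF-” and the last “%%EOF” as byte offsets. This is an alternative to the line-based code above, not what my code does, and carve_pdf is a hypothetical name. One difference to be aware of: it stops exactly at the end of the final %%EOF, so any newline after it is dropped.

```python
PDF_START = b"%PDF-"
PDF_END = b"%%EOF"


def carve_pdf(data):
    """Return the bytes from the first %PDF- to the last %%EOF, or None."""
    first = data.find(PDF_START)
    if first == -1:
        return None
    # rfind searches from the end, so multiple %%EOF markers are handled
    # by always taking the final one.
    last = data.rfind(PDF_END)
    if last < first:
        return None
    return data[first:last + len(PDF_END)]


# Made-up temporary file contents: headers, a PDF with two %%EOF markers,
# then the start of another HTTP response.
raw = (b"HTTP/1.1 200 OK\r\nContent-Type: application/pdf\r\n\r\n"
       b"%PDF-1.4\nobj\n%%EOF\nmore\n%%EOF\n"
       b"HTTP/1.1 304 Not Modified\r\n")
carved = carve_pdf(raw)
```

Working on bytes also avoids any text decoding of the binary parts of the PDF, at the cost of holding the whole file in memory.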

Hopefully that makes sense to non-python people (I can but hope).

I’ve tested this on a number of different pcap files that have PDF downloads in them and it works: I can open and view the PDF, and if I run file or exiftool against it then it appears as a normal PDF. I’m sure there are some cases where it won’t work 100%, but if you find something that doesn’t, let me know so I can try to fix it.

The code can be found here: (in my ever-growing GitHub repo). Oh and I’ve added this function into my sniffMyPackets transform pack.

