Wednesday, March 20, 2013

APT PDFs and metadata extraction

One of the modules in our new Rapid Reverse Engineering class is artifact extraction.  For this section of the class, students use a Python module we created for extracting artifacts and metadata from samples.  One of the more interesting pieces of metadata that attackers leave behind is the software the malicious file was created with. In this case I was looking at some PDFs.  I then realized that while I extract this information for individual samples, I had never run a test across a large set of known APT malware to see what comes out. So I set out on a quick adventure, and wow, was I surprised by the results.
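As a rough sketch of the idea (not the actual class module): the /Producer and /Creator strings can often be pulled straight out of a PDF's raw bytes with a regex. Real-world PDFs can store these fields as hex strings or inside compressed object streams, so a full parser will catch cases this misses.

```python
import re

# Match literal-string /Producer and /Creator entries in the raw bytes.
FIELD_RE = re.compile(rb'/(Producer|Creator)\s*\(([^)]*)\)')

def extract_creator_metadata(pdf_bytes):
    """Return a dict mapping field name -> value for literal-string fields."""
    return {m.group(1).decode('ascii'): m.group(2).decode('latin-1')
            for m in FIELD_RE.finditer(pdf_bytes)}

# Hypothetical Info-dictionary fragment standing in for a real sample.
sample = b'1 0 obj << /Producer (pdfeTeX-1.21a) /Creator (TeX) >> endobj'
print(extract_creator_metadata(sample))
# {'Producer': 'pdfeTeX-1.21a', 'Creator': 'TeX'}
```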

I ended up with the following pie graph:

The sample size was roughly 300+ known APT samples that we have.  It wasn't our whole PDF sample set, but it was a decent size to start with.  The top of the list looked like this:

Acrobat Web Capture 8.0 (15%)
Adobe LiveCycle Designer ES 8.2 (15%)
Acrobat Web Capture 9.0 (8%)
Python PDF Library - (7%)
Acrobat Distiller 9.0.0 (Windows) (7%)
Acrobat Distiller 6.0.1 (Windows) (7%)
pdfeTeX-1.21a (7%)
Adobe Acrobat 9.2.0 (4%)
Adobe PDF Library 9.0 (4%)
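Turning per-sample producer strings into a breakdown like the one above is a simple tally. A minimal sketch, where the mini-corpus is entirely hypothetical and just stands in for the real sample set:

```python
from collections import Counter

def producer_breakdown(producers):
    """Return (producer, percent) pairs, most common first."""
    counts = Counter(producers)
    total = sum(counts.values())
    return [(name, round(100 * n / total)) for name, n in counts.most_common()]

# Hypothetical mini-corpus standing in for the ~300 real APT samples.
corpus = ['Acrobat Web Capture 8.0'] * 3 + ['Python PDF Library -']
print(producer_breakdown(corpus))
# [('Acrobat Web Capture 8.0', 75), ('Python PDF Library -', 25)]
```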

A number of things about this data amazed me. One was the attackers' lack of opsec, and another was the old versions of the software they are using. From the offensive perspective, if you are dealing with targets that have the resources to do deep-level forensics and operations, then every little bit of opsec is needed. It only takes a small amount of data to put together a large piece of the puzzle.

From the defensive position, it points out the ability for defense organizations to do some early detection.  I doubt that most organizations are actually keeping track of or analyzing what types of clean, business-case PDFs come through the front doors.  What do the normal clean PDFs coming through your front doors actually look like?  Are the clean business-case PDFs being created by the
"Python PDF Library -" software? This is a piece of software that is no longer maintained. If you have a standard set of PDFs that come through your front doors and they aren't using strange libraries such as pyPDF, then it might be time to create a nice little Snort signature and alert on it.  I wouldn't recommend blocking at that level (unless you are up for it), but alerting on something simple like that can pay extremely large dividends for response/defense teams. Imagine telling your CIO/CISO that you detected and remediated an APT attack coming through the front door with a simple Snort sig.
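That baseline idea can be sketched in a few lines: compare each inbound PDF's producer string against what your clean traffic normally uses, and flag anything new. The allowlist entries below are made-up examples, not a vetted baseline; each shop would have to build its own from its own clean documents.

```python
# Hypothetical baseline of Producer strings seen in known-clean business PDFs.
KNOWN_GOOD_PRODUCERS = {
    'Acrobat Distiller 9.0.0 (Windows)',
    'Microsoft Office Word 2007',
}

def should_alert(producer):
    """Flag any Producer string not seen in the clean-PDF baseline."""
    return producer not in KNOWN_GOOD_PRODUCERS

print(should_alert('Python PDF Library -'))               # True: not in the baseline
print(should_alert('Acrobat Distiller 9.0.0 (Windows)'))  # False: normal business PDF
```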

Some honorable mentions that didn't make it into the top 10:

Advanced PDF Repair at
Acrobat Web Capture 6.0 (wow that is old)
¦  d o P D F       V e r   6 . 2   B u i l d   2 8 8   ( W i n d o w s   X P     x 3 2 )  *Yes, that is the way it shows up
alientools PDF Generator 1.52
PDFlib 7.0.3 (C++/Win32)

My point is that you must look at your data sets and see what type of information you can glean from them. This idea might be feasible in your organization and it might not, but you as the defender have the ability to determine that for yourself.

At the end of April (25-26th) we are debuting Rapid Reverse Engineering in New York City with Trail Of Bits. Rapid Reverse Engineering is a class designed to help students learn how to rapidly assess files in incident response scenarios.


Anonymous said...

Aren't most APT PDFs reused from other places/victims? I think the metadata might be related to the originator (another victim, maybe?) and not the attacker in this case. The tool they use to inject the exploit code most likely doesn't update most (or any) of those fields.

I would be interested to see whether the creation date (in the metadata) fell during a time when pyPDF was still supported, and when the exploit became known. That could tell you some interesting things.

Kyle Maxwell said...

I'm not confident that most of these PDFs are re-used, though it certainly does bear further investigation. In the investigations I've seen, they have generally been targeted material.

Anonymous said...

Yeah almost all the APT1 payload PDFs were from other origins.

Given the pretty even distribution you see in your data, I believe recommending that people block or build signatures based on this metadata is just hyperbole to promote your training classes. I'm really sad to see this stuff when it happens in infosec.

Anonymous said...

There is definitely some possibility in keying off this metadata. Of course, if you have better decoding/detection/magic working against your e-mail or otherwise inbound PDFs, then this might be a moot point. These artifacts, perhaps combined with other fields such as author or time, might help whittle down the false positives.

Source: been doing this exact crap for too long.