When analyzing malicious PDF documents, it helps to understand the format of the PDF object structure. Looking for malicious content inside a file often involves interpreting that structure. A PDF object is enclosed between “obj” and “endobj”. Between these keywords there are usually two components: an object dictionary and a stream. The object dictionary consists of keys and values enclosed in “<<” and “>>”, while the stream is a sequence of bytes. A stream shall consist of zero or more bytes bracketed between the keywords stream (followed by newline) and endstream.
The snippet below shows a normal PDF object structure:
obj 1 0
<< /Length 12 >>
stream
HELLO WORLD!
endstream
endobj
Object 1 0 contains a dictionary (between << and >>) with the key /Length and the value 12, which is the number of bytes in the stream. Below the dictionary, the stream holds the string “HELLO WORLD!” just before endstream. Finally, the object is closed with the endobj keyword, which marks the end of object 1 0.
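As a rough illustration of how the dictionary and the stream relate, the Python sketch below reads the /Length value and slices that many bytes after the stream keyword. The function name read_stream_by_length and the sample bytes are hypothetical, chosen only for this example, and the approach trusts the declared /Length, which only holds when that value is accurate.

```python
import re

# The well-formed object from the snippet above, as raw bytes.
SAMPLE = (
    b"obj 1 0\n"
    b"<< /Length 12 >>\n"
    b"stream\n"
    b"HELLO WORLD!\n"
    b"endstream\n"
    b"endobj\n"
)


def read_stream_by_length(data):
    """Return the stream bytes using the declared /Length (hypothetical helper)."""
    length_match = re.search(rb"/Length\s+(\d+)", data)
    # The stream keyword is followed by a newline; the data starts right after it.
    marker = re.search(rb"stream\r?\n", data)
    if length_match is None or marker is None:
        return None
    length = int(length_match.group(1))
    start = marker.end()
    return data[start:start + length]


print(read_stream_by_length(SAMPLE))
```

For this sample the declared length is exactly the 12 bytes of the string, so the slice ends just before the newline that precedes endstream.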
Although the PDF object structure is rather easy to understand, it can also be easily manipulated in many ways for malicious purposes. The main reason for such manipulation is to break the analysis process, particularly for PDF analysis tools.
How can the PDF object structure be manipulated? Usually, attackers omit some of the syntax or keywords required within the object. PDF readers such as Adobe Reader, however, seem to treat the result as a valid structure. For example:
Object without “endobj”
obj 1 0
<< /Length 1337 >>
stream
HELLO WORLD!
endstream
Object without “endstream”
obj 1 0
<< /Length 1337 >>
stream
HELLO WORLD!
endobj
So-called bluff trick
obj 1 0
<< /Length 1337 >>
stream
HELLO WORLD!endstream\n
endstream
endobj
In the three examples above, we can see that even when some components are dropped from (or added to) the structure, the PDF reader can still render the text without generating any error.
The last snippet uses the bluff trick to confuse security tools about where the stream ends. When a pattern-matching technique is used, the script or tool might not get the complete stream content because it cannot tell the first endstream from the second. Proper handling of these manipulations should be considered thoroughly in order to get a reliable extraction.
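To make the confusion concrete, here is a minimal Python sketch, assuming a tool that stops at the first endstream it sees, compared with one that searches up to the last endstream. The names naive_stream and last_marker_stream are hypothetical, and the second function assumes the input holds a single object.

```python
import re

# The bluff-trick object: a fake "endstream" sits inside the stream data.
BLUFF = (
    b"obj 1 0\n"
    b"<< /Length 1337 >>\n"
    b"stream\n"
    b"HELLO WORLD!endstream\n"
    b"endstream\n"
    b"endobj\n"
)


def naive_stream(data):
    """Lazy match: stops at the FIRST endstream, so the bluff truncates it."""
    m = re.search(rb"stream\r?\n(.*?)endstream", data, re.DOTALL)
    return m.group(1) if m else None


def last_marker_stream(data):
    """Greedy match: runs to the LAST endstream (assumes one object in data)."""
    m = re.search(rb"stream\r?\n(.*)endstream", data, re.DOTALL)
    return m.group(1) if m else None


print(naive_stream(BLUFF))
print(last_marker_stream(BLUFF))
```

The lazy version returns only the bytes before the fake keyword, while the greedy version keeps the fake endstream as part of the content. Neither is a complete answer on its own, since a greedy match would run across object boundaries in a real file.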
Generalizing security tools so that they work under any conditions encountered is a crucial task. Pattern matching alone will not work. Understanding the format of the PDF object helps a lot in generalizing the analysis tools.
For example, in a typical manipulation, attackers cannot get rid of the “endstream” and “endobj” keywords simultaneously. Instead, either “endstream” or “endobj” or both will exist. As a rough solution, a regular expression like />>.*?stream(.*?)(endstream|endobj)/m can be implemented reliably with the aid of other filtering mechanisms.
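The regular expression above can be tried out in Python roughly as follows. This is a sketch under two assumptions: that the /m flag is meant to let the dot match newlines (as it does in Ruby), which corresponds to re.DOTALL in Python, and that each input holds a single object. The helper name extract_stream and the sample bytes are hypothetical.

```python
import re

# Same pattern as in the text; re.DOTALL lets "." cross line breaks.
STREAM_RE = re.compile(rb">>.*?stream(.*?)(endstream|endobj)", re.DOTALL)

NO_ENDOBJ = b"obj 1 0\n<< /Length 1337 >>\nstream\nHELLO WORLD!\nendstream\n"
NO_ENDSTREAM = b"obj 1 0\n<< /Length 1337 >>\nstream\nHELLO WORLD!\nendobj\n"


def extract_stream(obj_bytes):
    """Return (content, closing keyword), or None if nothing matches."""
    m = STREAM_RE.search(obj_bytes)
    if m is None:
        return None
    # Trim the line breaks that surround the stream data.
    return m.group(1).strip(b"\r\n"), m.group(2)


print(extract_stream(NO_ENDOBJ))
print(extract_stream(NO_ENDSTREAM))
```

Both manipulated objects yield the same content, with the second tuple element showing which keyword closed the match. On the bluff-trick object the lazy group would still stop at the first endstream, which is one case the additional filtering has to cover.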