Working with Embedded OLE Objects in Java

September 29, 2012 - 9 minutes read - 1860 words

This document presents a proof of concept for reading and writing OLE objects in Java. The research was done using Apache POI. Please go through the references for more information about Apache POI.

This article covers a POC that fetches OLE objects from RIF-formatted XML. Beyond that, we also need to store them in some viewable format.

Given an XML file in RIF format, parse it, read the embedded OLE objects, and represent them in some viewable format using Java.

The XML was given in RIF (Requirements Interchange Format), which is used to exchange requirements along with the associated metadata. The given XML was generated by IBM's DOORS tool. It embeds attachments as Base64-encoded text under the tag <rif-xhtml:object>, as follows:

<rif-xhtml:object name="2cdb48aa776249cfa74c46ef450dffb9" classid="00020906-0000-0000-c000000000000046"> 0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAPgADAP7/CQAGAAAAAAAAAAAAAAAP
AAAAAQAAAAAAAAAAEAAAAgAAAAEAAAD+////AAAAAAAAAAAEAAAA+AAAAPkA
AAD6AAAA+wAAAPwAAAD9AAAA/gAAAP8AAAAAAQAAAQEAAAIBAAADAQAA/way
..
..
..
</rif-xhtml:object>

This matches the following information from the RIF 1.0 standard (p. 28):

Object embedding in the XHTML name space

With RIF, it is also possible to embed arbitrary objects (e.g. pictures) within text. This functionality is based on Microsoft’s clipboard and OLE (Object Linking and Embedding) technology and is realized within the XHTML name space by including the “Object Module”. The following table gives an overview of the XML element and attributes from the Object Module that are used for embedding objects:

Element	Attributes	Attribute types	Minimal Content Model
Object	classid	URI	(PCDATA
	data	URI
	height	Length
	name	CDATA
	type	ContentType
	width	Length

Each rif-xhtml:object tag has a “name” and a “classid” attribute, where classid = “00020906-0000-0000-c000000000000046” is the class ID for a Microsoft Word document according to Windows Registry Hacks. So we only handle class ID “00020906-0000-0000-c000000000000046” while parsing the XML.
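
As a quick sanity check on the payload shown above, its first Base64 characters can be decoded and compared with the OLE2 compound file signature (D0 CF 11 E0 A1 B1 1A E1). The following is a minimal sketch, assuming Java 8 or later so that `java.util.Base64` can stand in for the commons-codec class used in the POC; `OleSignatureCheck` and `looksLikeOle2` are names made up for this illustration.

```java
import java.util.Arrays;
import java.util.Base64;

final class OleSignatureCheck {
    // Signature bytes found at the start of an OLE2 compound file.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns true if the Base64 text decodes to data that starts with the OLE2 signature. */
    static boolean looksLikeOle2(String base64Text) {
        byte[] decoded;
        try {
            // The MIME decoder skips line breaks, which the RIF payload contains.
            decoded = Base64.getMimeDecoder().decode(base64Text);
        } catch (IllegalArgumentException e) {
            return false; // not decodable as Base64
        }
        if (decoded.length < OLE2_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(decoded, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    public static void main(String[] args) {
        // First line of the sample payload from the XML excerpt above.
        String firstLine = "0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAPgADAP7/CQAGAAAAAAAAAAAAAAAP";
        System.out.println(looksLikeOle2(firstLine)); // prints: true
    }
}
```

A check like this could run before writing a decoded attachment to disk, so that payloads which are not OLE2 containers are skipped early.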

OLE File System

Normally, these OLE documents are stored in subdirectories of the OLE filesystem. The exact location of the embedded documents will vary depending on the type of the master document, and the exact directory names will differ each time. To figure out exactly which directory to look in, you will either need to process the appropriate OLE 2 linking entry in the master document, or simply iterate over all the directories in the filesystem.

As a general rule, you will find the same OLE 2 entries in the subdirectories as you would have found at the root of the filesystem were the document not embedded.

Apache POIFS File System

POIFS file systems are essentially normal files stored on a Java-compatible platform’s native file system. They are typically identified by names ending in a four character extension noting what type of data they contain. For example, a file ending in “.xls” would likely contain spreadsheet data, and a file ending in “.doc” would probably contain a word processing document. POIFS file systems are called “file system”, because they contain multiple embedded files in a manner similar to traditional file systems. Along functional lines, it would be more accurate to call these POIFS archives. For the remainder of this document it is referred to as a file system in order to avoid confusion with the “files” it contains.

POIFS file systems are compatible with those document formats used by a well-known software company’s popular office productivity suite and programs outputting compatible data. Because the POIFS file system does not provide compression, encryption or any other worthwhile feature, it's not a good choice unless you require interoperability with these programs.

The POIFS file system does not encode the documents themselves. For example, if you had a word processor file with the extension “.doc”, you would actually have a POIFS file system with a document file archived inside of that file system.
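
To make the container idea concrete: the header at the start of such a file describes the container itself (format version, sector sizes), not the document stored inside it. The sketch below reads a few header fields with plain `java.nio`, assuming the OLE2 compound file header layout (minor version at byte 24, major version at 26, byte-order mark at 28, sector shifts at 30 and 32). `CompoundHeader` is a hypothetical helper for illustration and is not part of POI.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Base64;

final class CompoundHeader {
    final int majorVersion;
    final int minorVersion;
    final int sectorSize;
    final int miniSectorSize;

    private CompoundHeader(int major, int minor, int sectorSize, int miniSectorSize) {
        this.majorVersion = major;
        this.minorVersion = minor;
        this.sectorSize = sectorSize;
        this.miniSectorSize = miniSectorSize;
    }

    /** Reads version and sector sizes from the first bytes of a compound file. */
    static CompoundHeader parse(byte[] data) {
        if (data.length < 34) {
            throw new IllegalArgumentException("need at least 34 header bytes");
        }
        // All multi-byte header fields are little-endian.
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int minor = buf.getShort(24) & 0xFFFF;
        int major = buf.getShort(26) & 0xFFFF;
        int byteOrder = buf.getShort(28) & 0xFFFF;
        if (byteOrder != 0xFFFE) {
            throw new IllegalArgumentException("unexpected byte-order mark");
        }
        int sectorShift = buf.getShort(30) & 0xFFFF;
        int miniShift = buf.getShort(32) & 0xFFFF;
        if (sectorShift > 30 || miniShift > 30) {
            throw new IllegalArgumentException("implausible sector shift");
        }
        // Sizes are stored as powers of two.
        return new CompoundHeader(major, minor, 1 << sectorShift, 1 << miniShift);
    }

    public static void main(String[] args) {
        // First line of the sample payload from the RIF excerpt (45 bytes once decoded).
        byte[] head = Base64.getDecoder()
                .decode("0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAPgADAP7/CQAGAAAAAAAAAAAAAAAP");
        CompoundHeader h = parse(head);
        System.out.println("major version " + h.majorVersion
                + ", sector size " + h.sectorSize
                + ", mini sector size " + h.miniSectorSize);
    }
}
```

For the sample payload this reports major version 3 with 512-byte sectors and 64-byte mini sectors, values that say nothing yet about the Word document archived inside the container.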

Proof of Concept

Prerequisites/Dependencies

Eclipse IDE
commons-codec-1.7.jar – for base64 encoding/decoding
poi-3.8-20120326.jar – Apache POI Jar
poi-scratchpad-3.8-20120326.jar – Apache POI scratchpad jar, which contains the MS Word API.
Sample XML in RIF format.

The constants file

package com.tk.poc;

public class Constants {
    public static final String CLASS_ID_OFFICE = "00020906-0000-0000-c000000000000046";
    public static final String ATTR_NAME = "name";
    public static final String ATTR_CLASS_ID = "classid";
    public static final String TAG_OLE = "rif-xhtml:object";
}

Parser

Use a SAX parser to parse the given XML and get the content of each “rif-xhtml:object” tag. The parser generates a separate file for each rif-xhtml:object tag with classid “00020906-0000-0000-c000000000000046”, and Base64-decodes the content before writing it to the file.

package com.tk.poc;
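
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.apache.commons.codec.binary.Base64;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
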
public class Parser extends DefaultHandler {
    private StringBuilder stringBuilder;
    private String name;
    private List<String> generatedFiles = new ArrayList<String>(10);
    /**
     * Parses the XML file and returns the names of the generated files.
     * 
     * @param filePath
     *            path of the XML file.
     * @return names of the files written, one per matching OLE object.
     */
    public List<String> parseXML(String filePath) {
        stringBuilder = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            SAXParser parser = factory.newSAXParser();
            parser.parse(filePath, this);
        } catch (ParserConfigurationException e) {
            System.out.println("ParserConfig error");
        } catch (SAXException e) {
            System.out.println("SAXException : xml not well formed");
        } catch (IOException e) {
            System.out.println("IO error");
        }
        return generatedFiles;
    }
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        name = null;
        if (Constants.TAG_OLE.equalsIgnoreCase(qName)
                && Constants.CLASS_ID_OFFICE.equals(attributes.getValue(Constants.ATTR_CLASS_ID))) {
            stringBuilder = new StringBuilder();
            name = attributes.getValue(Constants.ATTR_NAME);
        }
    }
    @Override
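    public void characters(char[] ch, int start, int length) throws SAXException {
        // Collect text only while inside a matching OLE object element;
        // the parser may deliver the content in several chunks.
        if (name != null) {
            stringBuilder.append(ch, start, length);
        }
    }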
    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (name != null && Constants.TAG_OLE.equalsIgnoreCase(qName)) {
            FileOutputStream outputStream = null;
            try {
                String filePath = name;
                outputStream = new FileOutputStream(filePath);
                byte[] base64 = Base64.decodeBase64(stringBuilder.toString());
                outputStream.write(base64);
                generatedFiles.add(filePath);
            } catch (FileNotFoundException e) {
                System.out.println("File not found : " + name);
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                try {
                    if (outputStream != null) {
                        outputStream.close();
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
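
For experimenting without touching the disk, the same SAX approach can collect the decoded payloads in memory. This is only a sketch under a few assumptions: it uses `java.util.Base64` (Java 8+) instead of commons-codec, the class name `InMemoryOleExtractor` and the tiny sample XML with its namespace URI are invented for the example, and the payload is truncated to a few bytes.

```java
import java.io.StringReader;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class InMemoryOleExtractor extends DefaultHandler {
    private static final String TAG_OLE = "rif-xhtml:object";
    private static final String CLASS_ID_WORD = "00020906-0000-0000-c000000000000046";

    private final Map<String, byte[]> payloads = new LinkedHashMap<>();
    private StringBuilder text; // non-null only while inside a matching element
    private String name;

    /** Parses the XML text and returns decoded payloads keyed by the "name" attribute. */
    static Map<String, byte[]> extract(String xml) {
        InMemoryOleExtractor handler = new InMemoryOleExtractor();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return handler.payloads;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (TAG_OLE.equalsIgnoreCase(qName) && CLASS_ID_WORD.equals(attrs.getValue("classid"))) {
            name = attrs.getValue("name");
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length); // content may arrive in several chunks
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (text != null && TAG_OLE.equalsIgnoreCase(qName)) {
            // The MIME decoder ignores the line breaks inside the payload.
            payloads.put(name, Base64.getMimeDecoder().decode(text.toString()));
            text = null;
            name = null;
        }
    }

    public static void main(String[] args) {
        String xml = "<doc xmlns:rif-xhtml=\"urn:example:rif-xhtml\">"
                + "<rif-xhtml:object name=\"sample\" classid=\"" + CLASS_ID_WORD + "\">\n"
                + "0M8R4KGx\nGuEAAAAA\n</rif-xhtml:object></doc>";
        Map<String, byte[]> result = extract(xml);
        System.out.println(result.keySet() + " -> " + result.get("sample").length + " bytes");
        // prints: [sample] -> 12 bytes
    }
}
```

Keeping the bytes in a map makes the extraction step easy to unit test; writing the files, as the Parser above does, can then be a separate step.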

Reading OLE Object Using Apache POI

You can inspect the content of the OLE attachment with the following command:

java -classpath poi-3.8-20120326.jar org.apache.poi.poifs.dev.POIFSDump <filename>

It dumps the directory structure of the OLE object generated from the XML above.

You can also read and write OLE object properties. Here is an example that reads and prints the properties:

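import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;

import org.apache.poi.hpsf.NoPropertySetStreamException;
import org.apache.poi.hpsf.Property;
import org.apache.poi.hpsf.PropertySet;
import org.apache.poi.hpsf.PropertySetFactory;
import org.apache.poi.hpsf.Section;
import org.apache.poi.poifs.eventfilesystem.POIFSReader;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderListener;
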
public class OLEReader {
    public static void main(String[] args) throws FileNotFoundException, IOException {
        final String filename = "1797c9be9d034d8f907239afb20d6547";
        POIFSReader r = new POIFSReader();
        r.registerListener(new MyPOIFSReaderListener());
        r.read(new FileInputStream(filename));
    }
    static class MyPOIFSReaderListener implements POIFSReaderListener {
        public void processPOIFSReaderEvent(POIFSReaderEvent event) {
            PropertySet ps = null;
            try {
                ps = PropertySetFactory.create(event.getStream());
            } catch (NoPropertySetStreamException ex) {
                System.out.println("No property set stream: \"" + event.getPath() +  
                    event.getName() + "\"");
                return;
            } catch (Exception ex) {
                throw new RuntimeException("Property set stream \"" + event.getPath() + 
                  event.getName() + "\": " + ex);
            }
            /* Print the name of the property set stream: */
            System.out.println("Property set stream \"" + event.getPath() + event.getName() + 
                "\":");

            final long sectionCount = ps.getSectionCount();
            System.out.println("   No. of sections: " + sectionCount);

            /* Print the list of sections: */
            List<Section> sections = ps.getSections();
            int nr = 0;
            for (Section section : sections) {
                System.out.println("   Section " + nr++ + ":");
                String s = section.getFormatID().toString();
                System.out.println("      Format ID: " + s);
                /* Print the number of properties in this section. */
                int propertyCount = section.getPropertyCount();
                System.out.println("      No. of properties: " + propertyCount);

                /* Print the properties: */
                Property[] properties = section.getProperties();
                for (int i2 = 0; i2 < properties.length; i2++) {
                    /* Print a single property: */
                    Property p = properties[i2];
                    long id = p.getID();
                    long type = p.getType();
                    Object value = p.getValue();
                    System.out.println("      Property ID: " + id + ", type: " + type
                        + ", value: " + value);
                }
            }
        }
    }
}

It prints output similar to the following:

  No property set stream: "\0,4,2540,19841Table"
  No property set stream: "\0,4,2540,1984_OlePres000"
  Property set stream "\0,4,2540,1984_SummaryInformation":
     No. of sections: 1
     Section 0:
        Format ID: {F29F85E0-4FF9-1068-AB91-08002B27B3D9}
        No. of properties: 16
        Property ID: 1, type: 2, value: 1252
        Property ID: 2, type: 30, value: 
        Property ID: 3, type: 30, value: 
        Property ID: 4, type: 30, value: Caspari, Michael (415-Extern)
        Property ID: 5, type: 30, value: 
        Property ID: 6, type: 30, value: 
        Property ID: 7, type: 30, value: Normal.dotm
        Property ID: 8, type: 30, value: Caspari, Michael (415-Extern)
        Property ID: 9, type: 30, value: 2
        Property ID: 18, type: 30, value: Microsoft Office Word
        Property ID: 12, type: 64, value: Wed Sep 19 15:32:00 IST 2012
        Property ID: 13, type: 64, value: Wed Sep 19 15:33:00 IST 2012
        Property ID: 14, type: 3, value: 1
        Property ID: 15, type: 3, value: 6
        Property ID: 16, type: 3, value: 38
        Property ID: 19, type: 3, value: 0
  Property set stream "\0,4,2540,1984_DocumentSummaryInformation":
     No. of sections: 1
     Section 0:
        Format ID: {D5CDD502-2E9C-101B-9397-08002B2CF9AE}
        No. of properties: 12
        Property ID: 1, type: 2, value: 1252
        Property ID: 15, type: 30, value: ITI/OD
        Property ID: 5, type: 3, value: 1
        Property ID: 6, type: 3, value: 1
        Property ID: 17, type: 3, value: 43
        Property ID: 23, type: 3, value: 917504
        Property ID: 11, type: 11, value: false
        Property ID: 16, type: 11, value: false
        Property ID: 19, type: 11, value: false
        Property ID: 22, type: 11, value: false
        Property ID: 13, type: 4126, value: [B@19c26f5
        Property ID: 12, type: 4108, value: [B@c1b531
  No property set stream: "\0,4,2540,1984WordDocument"
  No property set stream: "\0,4,2540,1984_CompObj"
  No property set stream: "\0,4,2540,1984_Ole"
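
The numeric IDs and types in this output are easier to read with names attached. The sketch below maps the property IDs of the SummaryInformation stream, plus the variant type codes seen above, to their conventional names from the OLE property set format. It deliberately does not cover the DocumentSummaryInformation stream, whose IDs have different meanings, and `SummaryPropertyNames` is a hypothetical helper written for this article, not a POI class.

```java
import java.util.HashMap;
import java.util.Map;

final class SummaryPropertyNames {
    private static final int VT_VECTOR = 0x1000; // flag: value is a vector of the base type
    private static final Map<Integer, String> SUMMARY_IDS = new HashMap<>();
    private static final Map<Integer, String> TYPES = new HashMap<>();

    static {
        // Property IDs of the SummaryInformation stream.
        SUMMARY_IDS.put(1, "Code page");
        SUMMARY_IDS.put(2, "Title");
        SUMMARY_IDS.put(3, "Subject");
        SUMMARY_IDS.put(4, "Author");
        SUMMARY_IDS.put(5, "Keywords");
        SUMMARY_IDS.put(6, "Comments");
        SUMMARY_IDS.put(7, "Template");
        SUMMARY_IDS.put(8, "Last saved by");
        SUMMARY_IDS.put(9, "Revision number");
        SUMMARY_IDS.put(12, "Created");
        SUMMARY_IDS.put(13, "Last saved");
        SUMMARY_IDS.put(14, "Page count");
        SUMMARY_IDS.put(15, "Word count");
        SUMMARY_IDS.put(16, "Character count");
        SUMMARY_IDS.put(18, "Application name");
        SUMMARY_IDS.put(19, "Security");

        // Variant type codes that appear in the output above.
        TYPES.put(2, "VT_I2");
        TYPES.put(3, "VT_I4");
        TYPES.put(11, "VT_BOOL");
        TYPES.put(12, "VT_VARIANT");
        TYPES.put(30, "VT_LPSTR");
        TYPES.put(64, "VT_FILETIME");
    }

    /** Name of a SummaryInformation property ID, or a placeholder if it is not in the table. */
    static String describeId(long id) {
        return SUMMARY_IDS.getOrDefault((int) id, "Unknown (" + id + ")");
    }

    /** Name of a variant type code, including the vector flag if it is set. */
    static String describeType(long type) {
        boolean vector = (type & VT_VECTOR) != 0;
        int base = (int) (type & 0x0FFF);
        String name = TYPES.getOrDefault(base, "type " + base);
        return vector ? "VT_VECTOR | " + name : name;
    }

    public static void main(String[] args) {
        System.out.println(describeId(4) + " (" + describeType(30) + ")");
        // prints: Author (VT_LPSTR)
    }
}
```

Read this way, property 4 in the SummaryInformation section above is the author, 7 is the template (Normal.dotm), and 18 is the application name (Microsoft Office Word).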

Writing OLE Object to Viewable Format

An OLE object can be identified by its class ID and its properties. We noticed that the OLE object in the stream parsed from the XML sits inside a single folder under the root folder, and that the OLE object is actually an MS Word document. Here is a program that converts the OLE object stream to an MS Word document.

package com.tk.poc;
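
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.poifs.filesystem.DirectoryNode;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

/**
 * Reads the OLE object file and converts it to a Word document.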
 * 
 * @author kuldeep
 * 
 */
public class OLE2DocConvertor {
    /**
     * 
     * @param filePath
     * @return path of generated document
     */
    public static String convert(String filePath) {
        String output = filePath + ".doc";
        FileInputStream is = null;
        FileOutputStream out = null;
        try {
            is = new FileInputStream(filePath);
            POIFSFileSystem fs = new POIFSFileSystem(is);
            // Assuming there is one folder under root which is an OLE object
            // such as "/0,1,13204,13811" , "/0,1,16267,7627"
            DirectoryNode firstFolder = (DirectoryNode) fs.getRoot().getEntries().next();
            HWPFDocument embeddedWordDocument = new HWPFDocument(firstFolder);
            out = new FileOutputStream(output);
            embeddedWordDocument.write(out);
        
        } catch (FileNotFoundException e) {
            System.out.println("File paths are not correct: " + e.getMessage());
        } catch (IOException e) {
            System.out.println("Error in converting OLE to doc: " + e.getMessage());
        } finally {
            try {
                if (is != null) {
                    is.close();
                }
                if (out != null) {
                    out.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return output;
    }
}

Along similar lines, you can extract other kinds of embedded content as well, such as images, Excel sheets, or any other OLE object.
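
One small building block for handling other content types is a lookup from class ID to output file extension. The sketch below is an assumption-laden illustration: `ClassIdExtensions` is a made-up helper, the table holds only the Word class ID from this article and the class ID commonly registered for Excel 97-2003 worksheets, and the normalization exists because the RIF export above writes the class ID without the last hyphen.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class ClassIdExtensions {
    private static final Map<String, String> EXTENSIONS = new HashMap<>();

    static {
        // Illustrative entries only; extend the table for other OLE servers as needed.
        EXTENSIONS.put(normalize("00020906-0000-0000-C000-000000000046"), ".doc"); // Word document
        EXTENSIONS.put(normalize("00020820-0000-0000-C000-000000000046"), ".xls"); // Excel worksheet
    }

    /** Strips braces, hyphens and whitespace and lower-cases the class ID. */
    static String normalize(String classId) {
        if (classId == null) {
            return "";
        }
        return classId.replaceAll("[{}\\-\\s]", "").toLowerCase(Locale.ROOT);
    }

    /** Returns the extension for a known class ID, or ".bin" as a neutral fallback. */
    static String extensionFor(String classId) {
        return EXTENSIONS.getOrDefault(normalize(classId), ".bin");
    }

    public static void main(String[] args) {
        // Class ID exactly as it appears in the RIF export.
        System.out.println(extensionFor("00020906-0000-0000-c000000000000046")); // prints: .doc
    }
}
```

The conversion step itself would still need a reader that matches the content type; `HWPFDocument` only understands Word documents, so a spreadsheet would need the corresponding POI class instead.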

The Main

Here is the executor program, which takes the RIF XML as input and generates a separate document for each embedded element.

package com.tk.poc;

import java.util.List;

/**
 * Main class
 * @author kuldeep
 *
 */
public class Main {
    public static void main(String[] args) {
        String xmlFileName = "Export_Muster.xml";
        if (args.length > 0) {
            xmlFileName = args[0];
        }
        Parser parser = new Parser();
        System.out.println("Input file : " + xmlFileName);
        List<String> generatedFiles = parser.parseXML(xmlFileName);
        for (String fileName : generatedFiles) {
            String outputFile = OLE2DocConvertor.convert(fileName);
            System.out.println("Embedded OLE object converted to document : " + outputFile);
        }
    }
}

Conclusion

We were able to parse the RIF XML and separate the embedded attachments into individual viewable files.

References
