Originally published 2010-02-07
The Short of It
- Plan to use the Word 2007 default file format. Files of this format have the extension “docx.” (Older versions of Word require a Compatibility Pack in order to open these files.)
- In order to generate these files, plan to use the docx4j library.
- Peruse the docx4j Getting Started guide. It is available in either HTML or PDF format.
- Familiarize yourself with the docx4j project’s sample code.
- Locate and bookmark the Javadoc for the version of docx4j you are using. Here is the Javadoc for version 2.2.2.
- Create and save a template document in Word 2007. It should contain all elements in the desired structure of documents you wish to output, including the styling of those elements.
- Unzip the template document with any zip application. Copy the document.xml file from the “word” directory of the unzipped contents to somewhere for reference. Go through the other files in the “word” directory and copy to a location accessible by your application any files that contain styling, etc, that you wish to reproduce in the documents generated by your application. At a minimum you will probably want the styles.xml and numbering.xml files, and you may also want the settings.xml file.
- (Optional) Go through the styles.xml, numbering.xml, etc, you copied for use by your application and remove the RSIDs.
- Download the docx4j library and make it accessible to your application.
- Create a service class for your application with a static method that takes as its argument whatever data structure holds the data for the Word documents. Alternatively, add a toDocx() method to the data structure class itself.
- In the new method, create a WordprocessingMLPackage object and add the parts – styles.xml, numbering.xml, etc – you extracted from the template document. This is demonstrated in the ImportForeignPart.java sample code file.
- Write code in this new method to add XML elements containing your data, using the structure of the document.xml you extracted from your template document as a guide. The CreateWordprocessingMLDocument.java sample code file illustrates this.
- Somewhere in your application, call this new method with the data you wish to put in a Word document and save the returned WordprocessingMLPackage object to a file with the suffix “docx.”
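The unzip-and-copy step above can also be scripted rather than done by hand. The following is a minimal sketch that uses only the JDK's zip classes. The class name TemplatePartExtractor and the list of wanted parts are illustrative assumptions, not anything provided by docx4j.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration; not part of docx4j.
final class TemplatePartExtractor {

    // Parts named in the steps above; adjust the list for your own template.
    static final List<String> WANTED = Arrays.asList(
            "word/document.xml",   // kept only as a reference for the structure
            "word/styles.xml",
            "word/numbering.xml",
            "word/settings.xml");

    /** Copies each wanted part that exists in the docx into targetDir. */
    static List<Path> extract(Path docx, Path targetDir) throws IOException {
        List<Path> written = new ArrayList<>();
        Files.createDirectories(targetDir);
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            for (String name : WANTED) {
                ZipEntry entry = zip.getEntry(name);
                if (entry == null) {
                    continue; // the template simply does not contain this part
                }
                // Drop the "word/" prefix so the files land directly in targetDir.
                Path out = targetDir.resolve(name.substring(name.lastIndexOf('/') + 1));
                try (InputStream in = zip.getInputStream(entry)) {
                    Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
                }
                written.add(out);
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java TemplatePartExtractor.java template.docx parts
        for (Path p : extract(Paths.get(args[0]), Paths.get(args[1]))) {
            System.out.println(p);
        }
    }
}
```

Parts that are missing from the template are skipped silently, so the same code should work whether or not the template contains numbered lists.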
The Long of It
One of the projects I’ve been working on a lot lately is a tool that generates Word documents from data stored in one of the enterprise applications we use. Because one of my coworkers had already written Java code that retrieved the desired data from the application in question, and because some quick googling suggested that there weren’t any good libraries for manipulating Microsoft Office files in Ruby while there were a few in Java, I decided to go the Java route.
There were many, many different suggestions that popped up when I googled for creating Word documents in Java, some better than others, which is part of what motivated me to write this blog post. One of the more reputable suggestions was to use the Apache POI library. Apache POI has become the preeminent open source project for interacting with Microsoft Office files in Java, so this made a lot of sense, and I planned to use the library in my project.
However, when the day actually came to use the Apache POI library, I was disappointed to find that its support for Word documents was quite limited. This is in contrast to its support for Excel spreadsheets, which is quite robust. (As far as I can tell, most of the advertorial claims about what Apache POI can do refer to its Excel support, not its Word support.) Apparently, while the Word libraries were still being developed, the individual in charge of the Word component took a job with a company whose non-disclosure agreement with Microsoft forbade him from working on the open source project. That was many moons ago, and he was never replaced. As such, the Word libraries in Apache POI are still in the project’s “scratchpad” section and haven’t been fully developed.
This situation results in many practical limitations to using Apache POI for generating Word documents. The libraries can only open existing Word documents; they cannot create one from scratch. This can be circumvented by keeping a blank dummy Word file handy for the application to open again and again, but it is indicative of how primordial the Word libraries of POI are. Once you have a document object opened in this way, you will find far more pressing limitations. POI’s Word libraries can apparently only add basic paragraphs to the document. Anything more complicated than a paragraph, even a structure as simple as a table, is beyond Apache POI’s reach.
This was deemed insufficient, and so I was left to scramble to find another approach to generating Word documents in Java. This turned out to be fortuitous, since POI revolves around the old OLE2-based Office formats, and ever since Office 2007, Microsoft has been pushing the new Office Open XML standard for its Office files.
This standard, which is the default format for all Office files in Office 2007, is basically just several XML files zipped up into a single archive file. I googled for this format, and found a document with the hope-inducing title “Creating Word Document in Office Open XML Format using Java” on openxmldeveloper.org, but this turned out mostly to be about how to zip up the various parts of a docx file into one archive using Java, which I could figure out for myself, and so I eventually stopped reading it.
With an eye toward that approach, I created a docx file in Word 2007, unzipped it, and peered inside. This is what I found.
Digging through the files, I found that all the content was in the document.xml file. As such I conceived of an approach in which I created a template document, repeatedly copied all the other files from it besides the document.xml, used Java’s XML libraries to piece together the data I wanted in a new document.xml file, and used Java’s Zip libraries to put everything together. I therefore set about to learn the structure of the document.xml file. A fragment of such a file is reproduced below.
<w:document>
<w:body>
<w:p w:rsidR="00E04EDA" w:rsidRDefault="00E04EDA" w:rsidP="00E04EDA">
<w:pPr>
<w:pStyle w:val="Title"/>
</w:pPr>
<w:r>
<w:t>Document Title</w:t>
</w:r>
</w:p>
<w:p w:rsidR="00E04EDA" w:rsidRPr="00545795" w:rsidRDefault="00E04EDA" w:rsidP="00E04EDA">
<w:pPr>
<w:pStyle w:val="Heading1"/>
<w:rPr>
<w:rFonts w:ascii="Arial Unicode MS" w:eastAsia="Arial Unicode MS" w:hAnsi="Arial Unicode MS" w:cs="Arial Unicode MS"/>
</w:rPr>
</w:pPr>
<w:bookmarkStart w:id="0" w:name="_Toc210812666"/>
<w:r w:rsidRPr="00545795">
<w:rPr>
<w:rFonts w:ascii="Arial Unicode MS" w:eastAsia="Arial Unicode MS" w:hAnsi="Arial Unicode MS" w:cs="Arial Unicode MS"/>
</w:rPr>
<w:t>Introduction</w:t>
</w:r>
<w:bookmarkEnd w:id="0"/>
...
The structure started to make sense to me, but one nagging question I kept having was, What are these RSIDs, and why are they everywhere? It turns out that these are identifiers that Word uses to keep track of revisions and what changes were made as part of which revisions. When one is creating a document that should be considered created all at once, these can safely be eliminated.
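As a rough illustration of eliminating them, the sketch below strips RSID attributes such as w:rsidR, w:rsidRDefault and w:rsidP from a chunk of WordprocessingML using a regular expression. The class name RsidStripper is made up for this example, and a regex is a shortcut that assumes the attributes are written with double quotes, as Word writes them. It does not touch the separate w:rsids list that Word keeps in settings.xml.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of docx4j.
final class RsidStripper {

    // Matches leading whitespace plus any attribute whose name starts with
    // "w:rsid", e.g. w:rsidR="00E04EDA" or w:rsidRDefault="00E04EDA".
    private static final Pattern RSID_ATTRIBUTE =
            Pattern.compile("\\s+w:rsid\\w*=\"[^\"]*\"");

    /** Returns the XML text with every w:rsid* attribute removed. */
    static String strip(String xml) {
        return RSID_ATTRIBUTE.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        String paragraphTag =
                "<w:p w:rsidR=\"00E04EDA\" w:rsidRDefault=\"00E04EDA\" w:rsidP=\"00E04EDA\">";
        System.out.println(strip(paragraphTag)); // prints <w:p>
    }
}
```

Other attributes, such as the w:val on a w:pStyle element, are left alone because their names do not start with w:rsid.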
The basic structure of the document.xml is that it contains a body, and a body consists of paragraphs, which one can think of as text separated by newlines, and paragraphs consist of runs, which one can think of as text that may share the same line. Properties – such as font face, bolding, etc – can be applied at either the run or the paragraph level and are best abstracted out into a style. Specifications of a style are in the styles.xml file.
Thus whenever there is text that is to be separated from the rest, another paragraph should be created. Whenever text, even on the same line, is to be styled at all differently, it should be contained in its own run. More information can be gleaned from the ECMA specification of the Office Open XML formats.
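To make the body, paragraph and run nesting concrete, here is a minimal sketch that writes a document.xml with one styled paragraph per entry, each containing a single run, using the JDK's StAX writer. The class and method names are invented for this example. The style names passed in are assumed to exist in the accompanying styles.xml.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper for illustration; not part of docx4j.
final class DocumentXmlSketch {

    // The WordprocessingML namespace that the "w" prefix is bound to.
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Writes <w:p> with a paragraph style and one run holding the text. */
    static void paragraph(XMLStreamWriter xml, String styleId, String text)
            throws XMLStreamException {
        xml.writeStartElement("w", "p", W);
        xml.writeStartElement("w", "pPr", W);    // paragraph properties
        xml.writeEmptyElement("w", "pStyle", W); // reference to a style in styles.xml
        xml.writeAttribute("w", W, "val", styleId);
        xml.writeEndElement();                   // closes w:pPr
        xml.writeStartElement("w", "r", W);      // a run: text sharing one format
        xml.writeStartElement("w", "t", W);
        xml.writeCharacters(text);               // special characters are escaped
        xml.writeEndElement();                   // closes w:t
        xml.writeEndElement();                   // closes w:r
        xml.writeEndElement();                   // closes w:p
    }

    /** Each row is {styleId, text}; returns the full document.xml as a string. */
    static String build(String[][] styledParagraphs) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartDocument();
        xml.writeStartElement("w", "document", W);
        xml.writeNamespace("w", W);
        xml.writeStartElement("w", "body", W);
        for (String[] p : styledParagraphs) {
            paragraph(xml, p[0], p[1]);
        }
        xml.writeEndElement(); // closes w:body
        xml.writeEndElement(); // closes w:document
        xml.writeEndDocument();
        xml.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(build(new String[][] {
                {"Title", "Document Title"},
                {"Heading1", "Introduction"}}));
    }
}
```

Text that needs different formatting within one paragraph would get its own w:r element with its own run properties, which this sketch leaves out for brevity.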
I began contemplating all the utility code I would create to manage the manipulation of these documents based on Java’s Zip and XML libraries. Fortunately, I found that someone else had already done all this work for me, and the result is the open source docx4j library.
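For a sense of the hand-rolled utility code that docx4j makes unnecessary, here is a minimal sketch of the template-repackaging idea described earlier: copy every entry of the template docx except word/document.xml, then append a freshly generated document.xml. It uses only java.util.zip. The class name DocxRepackager is an assumption of this example, and the sketch makes no attempt to validate that the result is a docx Word will accept.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration; not part of docx4j.
final class DocxRepackager {

    static final String DOCUMENT_PART = "word/document.xml";

    /**
     * Writes a new archive at output that contains every file entry of the
     * template except word/document.xml, followed by the supplied XML as the
     * new word/document.xml. template and output must be different files.
     */
    static void repackage(Path template, Path output, String documentXml) throws IOException {
        try (ZipInputStream in = new ZipInputStream(Files.newInputStream(template));
             ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(output))) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                if (entry.isDirectory() || DOCUMENT_PART.equals(entry.getName())) {
                    continue; // skip directories and the part being replaced
                }
                // A fresh ZipEntry avoids reusing the template's size and CRC metadata.
                out.putNextEntry(new ZipEntry(entry.getName()));
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                out.closeEntry();
            }
            out.putNextEntry(new ZipEntry(DOCUMENT_PART));
            out.write(documentXml.getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java DocxRepackager.java template.docx output.docx new-document.xml
        String xml = new String(Files.readAllBytes(Paths.get(args[2])), StandardCharsets.UTF_8);
        repackage(Paths.get(args[0]), Paths.get(args[1]), xml);
    }
}
```

Even this small piece leaves out relationships, content types and any checking of the generated XML, which gives an idea of how much bookkeeping a dedicated library takes over.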
Now armed with docx4j, it was much easier to use the approach I envisioned to generate Word documents. I simply created a template document with all the structure and styling I desired, extracted the styles.xml and numbering.xml parts, and used docx4j to create new docx packages, to import the styles.xml and numbering.xml parts into them, and to create the document.xml part. This became the tutorial above in “The Short of It.” I linked copiously to the best of the docx4j documentation in the tutorial, as some of the links from Google and elsewhere seem to be misdirections. (At one point in this whole endeavor I was baffled by the lack of sample code in the docx4j project. This was because I was following a link to an older revision of the project’s source code repository.)