Examples of Extracting DOC and DOCX Metadata and Text Using Poi

UPDATE: I wrote this post in early 2013, and the code worked perfectly at the time. I have no idea whether it will work in your situation today, since things have probably changed. If it works for you, either as-is or after a minor change or two, or if it doesn’t work for you at all, please leave a comment. The number of hits this post gets is a little sad: it suggests that the documentation on this subject is lacking.

Anyway…here’s the original post:

It’s kind of sad, but I couldn’t find any simple examples of how to extract DOC and DOCX metadata and text from Word documents using Poi. So here are a couple. They are pulled directly out of working code, not made-up examples. In other words, they work for me.


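import java.io.File;
import java.io.FileInputStream;
import java.text.SimpleDateFormat;
import java.util.Map;

// PDFBox 1.x package names; in PDFBox 2.x, PDFTextStripper lives in org.apache.pdfbox.text.
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.poi.POIXMLProperties.CoreProperties;
import org.apache.poi.hpsf.PropertySet;
import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.poifs.filesystem.DirectoryEntry;
import org.apache.poi.poifs.filesystem.DocumentEntry;
import org.apache.poi.poifs.filesystem.DocumentInputStream;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;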

private String processDOC(File file, String outputDateMask, Map<String, String> metadata) throws Throwable {
    // Open the file once; the same POIFS container backs both the text and the metadata.
    try (FileInputStream fs = new FileInputStream(file)) {
        POIFSFileSystem poifs = new POIFSFileSystem(fs);
        HWPFDocument doc = new HWPFDocument(poifs);
        DirectoryEntry dir = poifs.getRoot();
        // The summary information is stored as its own stream inside the container.
        DocumentEntry siEntry = (DocumentEntry) dir.getEntry(SummaryInformation.DEFAULT_STREAM_NAME);
        DocumentInputStream dis = new DocumentInputStream(siEntry);
        PropertySet ps = new PropertySet(dis);
        SummaryInformation si = new SummaryInformation(ps);
        if (si.getTitle() != null) {
            metadata.put("title", si.getTitle());
        }
        if (si.getAuthor() != null) {
            metadata.put("author", si.getAuthor());
        }
        if (si.getSubject() != null) {
            metadata.put("subject", si.getSubject());
        }
        if (si.getKeywords() != null) {
            metadata.put("keywords", si.getKeywords());
        }
        // Prefer the last-saved date; fall back to the creation date.
        SimpleDateFormat sdf = new SimpleDateFormat(outputDateMask);
        if (si.getLastSaveDateTime() != null) {
            metadata.put("published", sdf.format(si.getLastSaveDateTime()));
        } else if (si.getCreateDateTime() != null) {
            metadata.put("published", sdf.format(si.getCreateDateTime()));
        }
        return doc.getText().toString();
    }
}


private String processPDF(File file, String outputDateMask, Map<String, String> metadata) throws Throwable {
    PDFTextStripper pdfS = new PDFTextStripper();
    // PDDocument.load already returns a new document, so there is no need to construct one first.
    PDDocument pdoc = PDDocument.load(file);
    try {
        PDDocumentInformation info = pdoc.getDocumentInformation();
        if (info.getTitle() != null) {
            metadata.put("title", info.getTitle());
        }
        if (info.getAuthor() != null) {
            metadata.put("author", info.getAuthor());
        }
        if (info.getSubject() != null) {
            metadata.put("subject", info.getSubject());
        }
        if (info.getKeywords() != null) {
            metadata.put("keywords", info.getKeywords());
        }
        if (info.getCreator() != null) {
            metadata.put("creator", info.getCreator());
        }
        if (info.getProducer() != null) {
            metadata.put("producer", info.getProducer());
        }
        // Prefer the modification date; fall back to the creation date.
        SimpleDateFormat sdf = new SimpleDateFormat(outputDateMask);
        if (info.getModificationDate() != null) {
            metadata.put("published", sdf.format(info.getModificationDate().getTime()));
        } else if (info.getCreationDate() != null) {
            metadata.put("published", sdf.format(info.getCreationDate().getTime()));
        }
        return pdfS.getText(pdoc);
    } finally {
        // Always release the document, even if extraction fails.
        pdoc.close();
    }
}
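
For DOCX metadata specifically, here is a rough sketch that does not use Poi at all. It relies on the fact that a DOCX file is a ZIP archive whose core properties normally sit in docProps/core.xml, and it reads that entry with nothing but the JDK. The class name DocxCoreProps and its two helper methods are made up for this illustration, the "published" value is left as the raw timestamp string rather than run through a date mask, and it does not extract the document text. Treat it as a starting point under those assumptions, not as tested production code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class, not part of the original post and not a Poi API.
public class DocxCoreProps {

    // Reads docProps/core.xml from a DOCX stream and copies a few fields into a map,
    // using the same keys as the DOC and PDF examples (title, author, subject, ...).
    public static Map<String, String> readCoreProperties(InputStream docx) throws Exception {
        Map<String, String> metadata = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"docProps/core.xml".equals(entry.getName())) {
                    continue;
                }
                // Buffer the entry first so the XML parser never touches the zip stream.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
                dbf.setNamespaceAware(true);
                // Refuse DOCTYPE declarations, since the file may come from an untrusted source.
                dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
                Document xml = dbf.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(buf.toByteArray()));
                copyFields(xml, metadata);
                break;
            }
        }
        return metadata;
    }

    private static void copyFields(Document xml, Map<String, String> metadata) {
        String created = null;
        String modified = null;
        NodeList children = xml.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            String text = node.getTextContent().trim();
            if (text.isEmpty()) {
                continue;
            }
            switch (node.getLocalName()) {
                case "title":    metadata.put("title", text); break;
                case "creator":  metadata.put("author", text); break;
                case "subject":  metadata.put("subject", text); break;
                case "keywords": metadata.put("keywords", text); break;
                case "modified": modified = text; break;
                case "created":  created = text; break;
                default: break;
            }
        }
        // Prefer the modified timestamp; fall back to the created timestamp.
        if (modified != null) {
            metadata.put("published", modified);
        } else if (created != null) {
            metadata.put("published", created);
        }
    }

    // Builds a tiny in-memory archive holding only docProps/core.xml, for demonstration.
    public static byte[] sampleDocx() throws IOException {
        String coreXml = "<cp:coreProperties"
                + " xmlns:cp=\"http://schemas.openxmlformats.org/package/2006/metadata/core-properties\""
                + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\""
                + " xmlns:dcterms=\"http://purl.org/dc/terms/\">"
                + "<dc:title>Sample</dc:title>"
                + "<dc:creator>Example Author</dc:creator>"
                + "<dcterms:created>2013-01-15T09:00:00Z</dcterms:created>"
                + "</cp:coreProperties>";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("docProps/core.xml"));
            zip.write(coreXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readCoreProperties(new ByteArrayInputStream(sampleDocx())));
    }
}
```

To use it on a real file, pass a FileInputStream for the .docx to readCoreProperties. If the archive has no docProps/core.xml entry, the returned map is simply empty.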

About jeffmershon

Director of Program Management at SiriusXM.

2 Responses to Examples of Extracting DOC and DOCX Metadata and Text Using Poi

  1. Frank says:

    Doesn’t work for DOCX 😦

  2. jeffmershon says:

    Frank,

    Like I said in the post, it worked for me (as of the time of the post, at least). Unfortunately, I’m at a new company, so I can’t go back and see if it’s still working, or what versions of POI and DOCX files it was tested against.
