Examples of Extracting DOC and DOCX Metadata and Text Using Apache POI

UPDATE: I wrote this post in early 2013, and at the time the code worked perfectly. Things have probably changed since then, so I have no idea whether it will work in your situation today. If it works for you, directly or after a minor change or two, or if it doesn’t work for you, please leave a comment. The fact that this post gets so many hits is sad: it suggests that the documentation on this subject is lacking…

Anyway…here’s the original post:

Kinda sad, I think, but I couldn’t find any simple examples of how to extract metadata and text from DOC and DOCX Word documents using Apache POI. So here are a couple. These are pulled directly out of working code, not made-up examples. In other words, they work for me…

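import java.io.File;
import java.io.FileInputStream;
import java.text.SimpleDateFormat;
import java.util.Map;

import org.apache.poi.POIXMLProperties.CoreProperties;
import org.apache.poi.hpsf.PropertySet;
import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.poifs.filesystem.DirectoryEntry;
import org.apache.poi.poifs.filesystem.DocumentEntry;
import org.apache.poi.poifs.filesystem.DocumentInputStream;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
// PDFBox imports for processPDF; the package of PDFTextStripper assumes PDFBox 1.x (it moved to org.apache.pdfbox.text in 2.x).
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.util.PDFTextStripper;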

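private String processDOC(File file, String outputDateMask, Map<String, String> metadata) throws Throwable {
    // The file is opened twice: once as a raw OLE2 container (metadata), once as a Word document (text).
    FileInputStream fs = new FileInputStream(file);
    POIFSFileSystem poifs = new POIFSFileSystem(fs);
    fs = new FileInputStream(file);
    HWPFDocument doc = new HWPFDocument(fs);
    DirectoryEntry dir = poifs.getRoot();
    SummaryInformation si = null;
    // The right-hand side of this line was cut off in the post; DEFAULT_STREAM_NAME is assumed here.
    DocumentEntry siEntry = (DocumentEntry) dir.getEntry(SummaryInformation.DEFAULT_STREAM_NAME);
    DocumentInputStream dis = new DocumentInputStream(siEntry);
    PropertySet ps = new PropertySet(dis);
    si = new SummaryInformation(ps);
    // The bodies of the checks below were lost from the post; the metadata.put calls and key names are assumptions.
    if (si.getTitle() != null) { metadata.put("title", si.getTitle()); }
    if (si.getAuthor() != null) { metadata.put("author", si.getAuthor()); }
    if (si.getSubject() != null) { metadata.put("subject", si.getSubject()); }
    if (si.getKeywords() != null) { metadata.put("keywords", si.getKeywords()); }
    SimpleDateFormat sdf = new SimpleDateFormat(outputDateMask);
    if (si.getLastSaveDateTime() != null) {
        metadata.put("date", sdf.format(si.getLastSaveDateTime()));
    } else if (si.getCreateDateTime() != null) {
        metadata.put("date", sdf.format(si.getCreateDateTime()));
    }
    return doc.getText().toString();
}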
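
The title promises DOCX, so here is a minimal sketch of what a DOCX counterpart to processDOC could look like. It is not pulled out of the working code: the class name DocxSketch, the helper putIfPresent, and the metadata keys are made up for illustration. The CoreProperties import assumes POI 4.0 or later (older releases keep it under org.apache.poi, as in the import list above), and closing XWPFDocument in a try-with-resources block also assumes a reasonably recent POI.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Map;

import org.apache.poi.ooxml.POIXMLProperties.CoreProperties; // assumption: POI 4.0+
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

final class DocxSketch {

    /** Copies core properties into metadata (hypothetical keys) and returns the document text. */
    static String processDOCX(File file, String outputDateMask, Map<String, String> metadata)
            throws IOException {
        try (InputStream in = new FileInputStream(file);
             XWPFDocument doc = new XWPFDocument(in)) {
            CoreProperties core = doc.getProperties().getCoreProperties();
            putIfPresent(metadata, "title", core.getTitle());
            putIfPresent(metadata, "author", core.getCreator());
            putIfPresent(metadata, "subject", core.getSubject());
            putIfPresent(metadata, "keywords", core.getKeywords());

            // Same rule as processDOC: prefer the last-modified date, fall back to creation.
            Date when = core.getModified() != null ? core.getModified() : core.getCreated();
            if (when != null) {
                metadata.put("date", new SimpleDateFormat(outputDateMask).format(when));
            }
            return new XWPFWordExtractor(doc).getText();
        }
    }

    // Hypothetical helper: skips properties the document does not set.
    private static void putIfPresent(Map<String, String> metadata, String key, String value) {
        if (value != null) {
            metadata.put(key, value);
        }
    }
}
```

Calling DocxSketch.processDOCX(file, "yyyy-MM-dd", metadata) should fill the map and return the body text, much as processDOC does for the older binary format.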

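private String processPDF(File file, String outputDateMask, Map<String, String> metadata) throws Throwable {
    PDFTextStripper pdfS = new PDFTextStripper();
    PDDocument pdoc = PDDocument.load(file);
    PDDocumentInformation info = pdoc.getDocumentInformation();
    // The bodies of the checks below were lost from the post; the metadata.put calls and key names are assumptions.
    if (info.getTitle() != null) { metadata.put("title", info.getTitle()); }
    if (info.getAuthor() != null) { metadata.put("author", info.getAuthor()); }
    if (info.getSubject() != null) { metadata.put("subject", info.getSubject()); }
    if (info.getKeywords() != null) { metadata.put("keywords", info.getKeywords()); }
    if (info.getCreator() != null) { metadata.put("creator", info.getCreator()); }
    if (info.getProducer() != null) { metadata.put("producer", info.getProducer()); }
    SimpleDateFormat sdf = new SimpleDateFormat(outputDateMask);
    // PDFBox returns dates as Calendar, so convert with getTime() before formatting.
    if (info.getModificationDate() != null) {
        metadata.put("date", sdf.format(info.getModificationDate().getTime()));
    } else if (info.getCreationDate() != null) {
        metadata.put("date", sdf.format(info.getCreationDate().getTime()));
    }
    String text = pdfS.getText(pdoc);
    pdoc.close();
    return text;
}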
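
Both methods above use the same date rule: take the last-saved (or modification) date if the document has one, otherwise fall back to the creation date, and format it with outputDateMask. Here is a small standalone sketch of that rule using only the JDK. The class and method names (MetadataDates, formatBestDate, formatBestCalendar) are made up for illustration and are not part of POI or PDFBox.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

final class MetadataDates {

    /**
     * Formats the modification date when present, otherwise the creation date.
     * Returns null when the document carries neither date.
     */
    static String formatBestDate(Date modified, Date created, String outputDateMask) {
        Date chosen = (modified != null) ? modified : created;
        if (chosen == null) {
            return null;
        }
        // SimpleDateFormat is not thread-safe, so create one per call.
        return new SimpleDateFormat(outputDateMask).format(chosen);
    }

    /** Same rule for APIs that hand back Calendar values instead of Date. */
    static String formatBestCalendar(Calendar modified, Calendar created, String outputDateMask) {
        Date m = (modified != null) ? modified.getTime() : null;
        Date c = (created != null) ? created.getTime() : null;
        return formatBestDate(m, c, outputDateMask);
    }
}
```

With a helper like this, each method would only need to put the result into the metadata map when it is not null, rather than repeating the if/else chain.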


About jeffmershon

Director of Program Management at SiriusXM.