Simple Example of Extracting Metadata and Text from PDF Using PDFBox

Below is a simple example of how to pull text and metadata out of a PDF file using PDFBox. Much simpler to understand than using POI with DOC and DOCX, but maybe that’s just me!

import java.io.File;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.util.PDFTextStripper; // PDFBox 1.x package; 2.x moved it to org.apache.pdfbox.text

// Copies the document's metadata into the supplied map and returns the extracted text.
private String processPDF(File file, String outputDateMask, Map<String, String> metadata) throws IOException {
    PDFTextStripper pdfS = new PDFTextStripper();
    PDDocument pdoc = PDDocument.load(file);
    try {
        PDDocumentInformation info = pdoc.getDocumentInformation();
        // Only store the fields that are actually present in the PDF.
        if (info.getTitle() != null) {
            metadata.put("title", info.getTitle());
        }
        if (info.getAuthor() != null) {
            metadata.put("author", info.getAuthor());
        }
        if (info.getSubject() != null) {
            metadata.put("subject", info.getSubject());
        }
        if (info.getKeywords() != null) {
            metadata.put("keywords", info.getKeywords());
        }
        if (info.getCreator() != null) {
            metadata.put("creator", info.getCreator());
        }
        if (info.getProducer() != null) {
            metadata.put("producer", info.getProducer());
        }
        // Prefer the modification date; fall back to the creation date.
        SimpleDateFormat sdf = new SimpleDateFormat(outputDateMask);
        if (info.getModificationDate() != null) {
            metadata.put("published", sdf.format(info.getModificationDate().getTime()));
        } else if (info.getCreationDate() != null) {
            metadata.put("published", sdf.format(info.getCreationDate().getTime()));
        }
        return pdfS.getText(pdoc);
    } finally {
        pdoc.close(); // release the document even if extraction fails
    }
}
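
The outputDateMask argument is passed straight to SimpleDateFormat, so it has to be a valid SimpleDateFormat pattern. Here is a small standalone sketch of just that formatting step, using a hand-built Calendar in place of the one PDFBox returns. The class name DateMaskExample and the helper formatDate are made up for illustration, and pinning the time zone and locale is my addition to keep the output predictable; the method above uses the JVM defaults.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

public class DateMaskExample {
    // Hypothetical helper: formats a Calendar with a SimpleDateFormat pattern,
    // the same way the "published" entry is built in processPDF.
    static String formatDate(Calendar date, String outputDateMask) {
        // Locale.ROOT and the calendar's own time zone are assumptions added here
        // so the result does not depend on the machine running the code.
        SimpleDateFormat sdf = new SimpleDateFormat(outputDateMask, Locale.ROOT);
        sdf.setTimeZone(date.getTimeZone());
        return sdf.format(date.getTime());
    }

    public static void main(String[] args) {
        // Stand-in for info.getModificationDate(): 5 March 2012, 14:30:00 UTC.
        Calendar modified = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        modified.clear();
        modified.set(2012, Calendar.MARCH, 5, 14, 30, 0);

        System.out.println(formatDate(modified, "yyyy-MM-dd"));           // 2012-03-05
        System.out.println(formatDate(modified, "yyyy-MM-dd'T'HH:mm:ss")); // 2012-03-05T14:30:00
    }
}
```

An invalid mask makes the SimpleDateFormat constructor throw IllegalArgumentException, so it may be worth validating the mask once up front rather than per file.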

About jeffmershon

Director of Program Management at SiriusXM.

2 Responses to Simple Example of Extracting Metadata and Text from PDF Using PDFBox

  1. Santiago says:

    Great example! Do you think it would be hard to run the extraction on a thread?

    • jeffmershon says:

      Santiago,

      According to the PDFBox site, PDFBox is NOT thread safe, so only one thread should access a given PDDocument at a time. That doesn’t mean you can’t multithread at all: you could create a separate PDDocument within each thread, or perhaps create a pool of PDDocuments and assign one to each thread when you spawn it. It really depends on what you’re trying to do. You would be better served taking that question to someone who writes code for a living.
      Jeff
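
      One way to read the "one document per thread" idea is to give each file its own task on a thread pool, so that whatever the task opens is never shared. This is only a rough sketch under that assumption, not a tested PDFBox recipe. The class ParallelExtraction and method extractAll are hypothetical names, and the extractor is passed in as a function so the sketch runs without PDFBox; a real extractor would load its own PDDocument, use its own PDFTextStripper, and close the document before returning.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelExtraction {
    // Runs the extractor once per file on a fixed-size pool. Each task calls the
    // extractor on its own, so any document it opens stays on a single thread.
    static Map<String, String> extractAll(List<File> files,
                                          Function<File, String> extractor,
                                          int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (File f : files) {
                Callable<String> task = () -> extractor.apply(f);
                pending.add(pool.submit(task));
            }
            // Collect results in the same order the files were given.
            Map<String, String> textByPath = new LinkedHashMap<>();
            for (int i = 0; i < files.size(); i++) {
                try {
                    textByPath.put(files.get(i).getPath(), pending.get(i).get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("extraction failed", e.getCause());
                }
            }
            return textByPath;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<File> files = new ArrayList<>();
        files.add(new File("a.pdf"));
        files.add(new File("b.pdf"));
        // Stand-in extractor for illustration only; it never touches the file.
        Function<File, String> fake = f -> "text of " + f.getName();
        System.out.println(extractAll(files, fake, 2));
    }
}
```

      Because every task builds and discards its own objects, nothing needs locking; the cost is that each file pays the full load time, which is usually fine when the files are independent.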
