Simple Example of Extracting Metadata and Text from PDF Using PDFBox

Below is a simple example of how to pull text and metadata out of a PDF file using PDFBox. It’s much simpler to understand than using POI with DOC and DOCX, but maybe that’s just me!



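import java.io.File;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
// PDFBox 1.x package; in 2.x the class is org.apache.pdfbox.text.PDFTextStripper.
import org.apache.pdfbox.util.PDFTextStripper;

// Copies the document's metadata into the supplied map and returns its text.
// The map keys ("title", "author", ...) are assumed names; use whatever suits you.
private String processPDF(File file, String outputDateMask, Map<String, String> metadata) throws IOException {
    PDFTextStripper pdfS = new PDFTextStripper();
    PDDocument pdoc = PDDocument.load(file);
    try {
        PDDocumentInformation info = pdoc.getDocumentInformation();
        if (info.getTitle() != null) metadata.put("title", info.getTitle());
        if (info.getAuthor() != null) metadata.put("author", info.getAuthor());
        if (info.getSubject() != null) metadata.put("subject", info.getSubject());
        if (info.getKeywords() != null) metadata.put("keywords", info.getKeywords());
        if (info.getCreator() != null) metadata.put("creator", info.getCreator());
        if (info.getProducer() != null) metadata.put("producer", info.getProducer());
        SimpleDateFormat sdf = new SimpleDateFormat(outputDateMask);
        // Prefer the modification date; fall back to the creation date.
        if (info.getModificationDate() != null) {
            metadata.put("date", sdf.format(info.getModificationDate().getTime()));
        } else if (info.getCreationDate() != null) {
            metadata.put("date", sdf.format(info.getCreationDate().getTime()));
        }
        return pdfS.getText(pdoc);
    } finally {
        pdoc.close(); // always release the file handle
    }
}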
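
The outputDateMask argument is an ordinary SimpleDateFormat pattern, and PDFBox hands back the creation and modification dates as java.util.Calendar values (or null when the PDF does not record them). Here is a small sketch of just that formatting step, using only the JDK. The class and method names (DateMaskDemo, formatDate) are made up for illustration, and formatting in the Calendar's own time zone is my choice, not something the method above does.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;

public class DateMaskDemo {

    /** Formats a Calendar with the given mask, or returns null when there is no date. */
    public static String formatDate(Calendar date, String outputDateMask) {
        if (date == null) {
            return null;
        }
        // Locale.ROOT keeps the output independent of the machine's default locale.
        SimpleDateFormat sdf = new SimpleDateFormat(outputDateMask, Locale.ROOT);
        // Format in the calendar's own time zone so the recorded wall-clock time is kept.
        sdf.setTimeZone(date.getTimeZone());
        return sdf.format(date.getTime());
    }
}
```

With a mask of "yyyy-MM-dd", a Calendar set to 15 March 2012 comes out as 2012-03-15.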


2 Responses to Simple Example of Extracting Metadata and Text from PDF Using PDFBox

  1. Santiago says:

    Great example! Do you think it would be hard to run the extraction on a thread?

    • jeffmershon says:


      According to the PDFBox site, PDFBox is NOT thread safe, so only one thread can access a single PDDocument at a time. That doesn’t mean you can’t multithread at all: you would create a PDDocument within each thread, or perhaps create a pool of PDDocuments and assign one to a thread when you spawn it. It would really depend on what you’re trying to do. You would be better served taking that question to someone who writes code for a living.
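
For what it's worth, the "one document per thread" idea might look roughly like the sketch below. It uses only the JDK and knows nothing about PDFBox: ParallelExtractor, Extractor, and extractAll are hypothetical names, and the Extractor hook is where a call such as processPDF would go. Because that method loads and closes its own PDDocument, no document is ever shared between threads.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelExtractor {

    /** Hypothetical hook: wraps whatever does the per-file work, e.g. processPDF. */
    public interface Extractor {
        String extract(File file) throws Exception;
    }

    /**
     * Runs the extractor once per file on a fixed-size thread pool.
     * Each task is expected to open and close its own document.
     */
    public static Map<String, String> extractAll(List<File> files, Extractor extractor, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (File f : files) {
                pending.add(pool.submit(() -> extractor.extract(f)));
            }
            // Collect results in submission order; get() rethrows any task failure.
            Map<String, String> textByPath = new LinkedHashMap<>();
            for (int i = 0; i < files.size(); i++) {
                textByPath.put(files.get(i).getPath(), pending.get(i).get());
            }
            return textByPath;
        } finally {
            pool.shutdown();
        }
    }
}
```

You would call it with something like extractAll(files, f -> processPDF(f, "yyyy-MM-dd", new HashMap<String, String>()), 4). Giving each task its own metadata map matters too, since a plain HashMap is no more thread safe than a PDDocument.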

