Monday, June 11, 2007

Access document attributes?


Recently someone posted a query on the ColdFusion Forums asking about cfpdf and accessing document attributes.
=========================
I was wondering if there was a way you could use CFPDF to access document attributes and extract embedded images and text from a PDF? Would there be a way for us to access text blocks created by the user (along with the x:y coordinates)?
 
For example, A LOT of sites convert the PDF to a JPEG and create area maps to simulate a zoom effect. Just looking for a way to deconstruct the PDF using the new CFPDF tag...
=========================
 
Most of what was requested is possible using the cfpdf tag.
 
You can extract document attributes such as metadata using cfpdf action="getinfo".
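
As a rough sketch of the getinfo action (the path, the file placeholder and the variable name pdfInfo are my own assumptions, not something from the forum post), something like this should read the document properties into a structure and dump them:

```cfml
<!--- Hypothetical path; point this at any PDF on your server --->
<cfset sourcefile = "<Webroot>\ddx-textExtract\<Any pdf>">

<!--- getinfo reads the document properties into the structure named in "name" --->
<cfpdf action="getinfo" source="#sourcefile#" name="pdfInfo">

<!--- Dump everything that came back, then print two of the keys --->
<cfdump var="#pdfInfo#">
<cfoutput>Title: #pdfInfo.Title# / Pages: #pdfInfo.TotalPages#</cfoutput>
```

The cfdump is the quickest way to see which keys your ColdFusion version actually returns before relying on any one of them.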
 
You can extract text using the cfpdf tag with action="processddx". (Code for extracting text from a PDF follows below.)
 
You can't extract embedded images from a PDF. You can, however, create JPEG images from PDF pages using action="thumbnail", and you can also specify the output image format in this tag.
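
A minimal sketch of the thumbnail action, assuming a hypothetical output directory (thumbdir) and the same placeholder style for paths used elsewhere in this post:

```cfml
<!--- Hypothetical paths; replace with a real PDF and a writable directory --->
<cfset sourcefile = "<Webroot>\ddx-textExtract\<Any pdf>">
<cfset thumbdir = "<Webroot>\ddx-textExtract\thumbs">

<!--- Render page 1 as a full-size PNG; format also accepts jpeg and tiff --->
<cfpdf action="thumbnail"
       source="#sourcefile#"
       destination="#thumbdir#"
       pages="1"
       format="png"
       scale="100"
       overwrite="yes">
```

One image file per requested page is written into the destination directory. Setting scale to 100 keeps the full page size, which is what you would want as the base image for a zoom effect like the one described in the question.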
 
===========================
DDX File:
 
<?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
   <DocumentText result="Out1">
      <PDF source="Doc1"/>
   </DocumentText>
</DDX>
 

CFM File:
 
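<!--- Paths to the DDX file, the source PDF and the XML file that will receive the text --->
<cfset ddxfile = "<Webroot>\ddx-textExtract\doc_text.ddx">
<cfset sourcefile1 = "<Webroot>\ddx-textExtract\<Any pdf having text>">
<cfset destinationfile = "<Webroot>\ddx-textExtract\ddx_result_doc_text.xml">
 
<!--- The key Doc1 must match the source attribute of the PDF element in the DDX file --->
<cfset inputStruct=StructNew()>
<cfset inputStruct.Doc1="#sourcefile1#">
 
<!--- The key Out1 must match the result attribute of the DocumentText element in the DDX file --->
<cfset outputStruct=StructNew()>
<cfset outputStruct.Out1="#destinationfile#">
 
<cfpdf action="processddx" ddxfile="#ddxfile#" inputfiles="#inputStruct#" outputfiles="#outputStruct#" name="ddxVar">
 
<!--- ddxVar.Out1 holds the processing status for Out1; the extracted text itself is in the destination file --->
<cfoutput>#ddxVar.Out1#</cfoutput>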
 
==========================

4 comments:

Alexander said...

This is great stuff you have been doing here! It works like a charm. One request... is it possible that you show an example using DDX, to split a multipage pdf document into single pdf files?
Thanks so much,
Alexander

Ahamad said...

Hi Alexander,
Thanks...

Yes, you can split a PDF file into multiple PDF docs. Here is how you can achieve this.

ooph... blogspot is not allowing me to post code in the contents.

I will do a new post and let you know.

Ahamad said...

Hi,
I have posted the code here...

http://cfpdf.blogspot.com/2007/07/split-pdf-file-into-multiple-pdf-docs.html

ahamad said...

The above example to extract text is also posted here...

http://cf-examples.net/index.cfm/2008/6/18/Extract-Text-From-PDF