Access document attributes?
Recently someone posted a query on the ColdFusion 8 Forums asking about cfpdf and accessing document attributes.
=========================
I was wondering if there was a way you could use CFPDF to access document attributes and extract embedded images and text from a PDF? Would there be a way for us to access text blocks created by the user (along with the x:y coordinates)?
For example, A LOT of sites convert the PDF to a JPEG and create area maps to simulate a zoom effect. Just looking for a way to deconstruct the PDF using the new CFPDF tag...
=========================
=========================
Most of what was requested is possible with the cfpdf tag.
You can extract document attributes such as metadata using cfpdf action="getinfo".
You can extract text using cfpdf action="processddx" (the code for extracting text from a PDF follows below).
You can't extract images from a PDF. You can, however, create JPEG images from PDF pages using action="thumbnail", and you can also specify the output image format in this tag.
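As a rough illustration of the getinfo and thumbnail actions mentioned above, here is a minimal sketch. The file name sample.pdf and the variable names are hypothetical, and the sketch assumes that when no destination is given the thumbnails are written to a thumbnails directory next to the CFM page; check the cfpdf documentation for your ColdFusion version before relying on that.

```cfml
<!--- Assumption: sample.pdf is a hypothetical file sitting next to this CFM page. --->
<cfset sourcefile = ExpandPath("sample.pdf")>

<!--- getinfo returns a structure of document properties (title, author, page count, and so on). --->
<cfpdf action="getinfo" source="#sourcefile#" name="pdfInfo">
<cfdump var="#pdfInfo#">

<!--- thumbnail renders the listed pages as image files; format may be jpeg, png or tiff. --->
<cfpdf action="thumbnail" source="#sourcefile#" pages="1" format="jpeg" scale="100" overwrite="yes">
```

Setting scale to 100 asks for a full-size page image rather than a small preview, which is closer to what the "convert the PDF to a JPEG" approach in the question needs.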
===========================
DDX File:
<?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
<DocumentText result="Out1">
<PDF source="Doc1"/>
</DocumentText>
</DDX>
CFM File:
<cfset ddxfile = "<Webroot>\ddx-textExtract\doc_text.ddx">
<cfset sourcefile1 = "<Webroot>\ddx-textExtract\<Any pdf having text>">
<cfset destinationfile = "<Webroot>\ddx-textExtract\ddx_result_doc_text.xml">
<cfset inputStruct=StructNew()>
<cfset inputStruct.Doc1="#sourcefile1#">
<cfset outputStruct=StructNew()>
<cfset outputStruct.Out1="#destinationfile#">
<cfpdf action="processddx" ddxfile="#ddxfile#" inputfiles="#inputStruct#" outputfiles="#outputStruct#" name="ddxVar">
<cfoutput>#ddxVar.Out1#</cfoutput>
==========================
4 comments:
This is great stuff you have been doing here! It works like a charm. One request... is it possible that you show an example using DDX, to split a multipage pdf document into single pdf files?
Thanks so much,
Alexander
Hi Alexander,
Thanks...
Yes, you can split a PDF file into multiple PDF docs. Here is how you can achieve this.
ooph... blogspot is not allowing me to post code in the contents.
I will do a new post and let you know.
Hi,
I have posted the code here...
http://cfpdf.blogspot.com/2007/07/split-pdf-file-into-multiple-pdf-docs.html
The above example to extract text is also posted here...
http://cf-examples.net/index.cfm/2008/6/18/Extract-Text-From-PDF