Pulling Content out of Word with ColdFusion 9

I had a 1 + 1 = 2 moment the other day. I was fooling around with ColdFusion’s ability to turn Word docs into PDFs. At first glance it’s pretty simple and straightforward:
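A minimal sketch of the conversion, using cfdocument's srcfile attribute. The file names test.docx and test.pdf are placeholders, and per the comments below, OpenOffice has to be installed and configured on the server for this to work:

```cfml
<!--- Convert a Word document to PDF. --->
<!--- test.docx and test.pdf are placeholder names, not files from the post. --->
<!--- Assumes OpenOffice is installed and registered in the ColdFusion Administrator. --->
<cfdocument
    format="pdf"
    srcfile="#expandPath('./test.docx')#"
    filename="#expandPath('./test.pdf')#"
    overwrite="true">
</cfdocument>
```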

Word to PDF is nice to have, but as features go, it’s a pretty small bullet point. Don’t get me wrong, you get fidelity to the original, including fonts, layouts, and images. But it’s still just converting a Word document to a PDF.

That is, until you remember that ColdFusion 9 can now pull content out of PDFs. So now you can do this:
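Roughly, that means pointing cfpdf's extracttext action at the PDF generated from the Word document. This is a sketch under assumptions: test.pdf is a placeholder for the converted file, wordText is a made-up variable name, and type="string" asks for plain text rather than XML:

```cfml
<!--- Extract the text of the PDF that was generated from the Word document. --->
<!--- test.pdf and wordText are placeholder names chosen for this example. --->
<cfpdf
    action="extracttext"
    source="#expandPath('./test.pdf')#"
    type="string"
    name="wordText" />

<!--- Dump the result to see what came back. --->
<cfdump var="#wordText#" />
```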

This yields the content of the original Word document. Now that’s cool.

Published by tpryan

15 thoughts on “Pulling Content out of Word with ColdFusion 9”

Mingo Hagen says:

August 4, 2009 at 9:18 am

While cool, I’d like to have <cfdocument action="extracttext">.

Ben Nadel says:

August 4, 2009 at 9:19 am

Awesome! I didn’t even know ColdFusion could convert word documents into PDFs! Bitchy!

John Farrar says:

August 4, 2009 at 11:05 pm

Now that has some serious potential. Will have to start dumping and see what pragmatic use this can have! 🙂

Ben Spencer says:

August 11, 2009 at 1:57 am

I haven’t downloaded CF9 yet, but this example uses .docx as the document format. Can the same be done for the old .doc format, which wasn’t XML based?

Terrence Ryan says:

August 11, 2009 at 4:39 pm

Ben: I haven’t tested that myself, but I’m pretty sure we’ve said it works. 😉

Ben Spencer says:

August 11, 2009 at 5:27 pm

Thanks Terrence, I read up on it in the end. Yep, it should do .doc quite nicely (with whatever quirks are associated with OpenOffice).

Use #1 for me: Produce thumbnails of word docs for document management application.

Juan Escalada says:

September 22, 2009 at 7:10 am

Dear Terrence,

A client of mine is asking if a document could be uploaded so that the document’s footnotes would be stripped out and, together with the associated paragraph, be emailed to different people (as in, footnote 1 and its paragraph goes to Adam for check-up and footnote 2 goes to Joe)…
I could convince the client to use a PDF if that made things any easier… But I’d appreciate your insight to know whether this would be at all possible…

Thanks in advance, Juan Escalada.

Don Blaire says:

April 23, 2010 at 7:22 pm

One of our departments downloads a Word doc that they want to get the content from. This blog was just what I was looking for. Thanks.

Tad says:

May 17, 2010 at 4:02 pm

Terrence – thanks for that. I’m trying to use CF9 to pull summary information out of DOCX files. HWPF doesn’t do it, wondering if CF9 can do it based on DOCX support. Do you know the best way of going about such? Leery to do it by converting to pdf first and then using to extract, as the summary info (createTime, etc) may have been mangled.

Virginia Neal says:

May 28, 2010 at 7:30 pm

The new extracttext action is nice to have and I am able to use it to simply pull the text of a pdf document.

However, I have a real need to keep the general format of the document as well. This would simply be things like centering, indenting/tabs, line breaks, etc.

I had hoped that using useStructure="true" along with honourspaces="true" would return the basic format of the document, but that does not seem to be the case.

Do you know if it is possible to maintain the basic formatting of a PDF document?

Thanks

BTW – the PDF documents that I am working with began as Word and were converted to PDF using your suggested approach (thank you).

Cheyenne Throckmorton says:

June 14, 2010 at 4:14 pm

@Virginia

I was running into a similar problem in the last week. I used an alternative solution that worked in my use case and may or may not help. I still converted the Word documents to PDFs, then used the thumbnail capability to create JPG images at 100% resolution and displayed those images to the end user, which worked well for us.

While I am looking forward to diving more into the capabilities here and with DDX, I think the overall problem we both encountered is that DDX is about the actual content of the PDF devoid of styling, similar to how HTML “is supposed to be”. With HTML you add in a CSS file to style, and I believe there is a similar type of document that combines with the DDX to create your fully stylized PDF.

That’s obviously a lot more work to dig into and I only had the time to do the thumbnail solution for now, but hope that helps.

@Terrence thanks for the blog and great tips that even got me started on solving my use case.

Nick says:

August 7, 2010 at 4:49 pm

Just want to confirm that an OpenOffice install on the server is needed to convert Word documents to PDFs using ColdFusion 9.

Please correct me if I’m wrong, but I just ran Terrence’s code and received an error that an OpenOffice install was required.

Ed says:

October 18, 2010 at 5:14 pm

Yes you need OpenOffice installed, then add the directory in the admin settings under ‘Documents’ section. I tried it w/o the software and it did make a “pdf” file but with garbage in it as if you’re viewing a word doc with textpad.

“When you use cfdocument to convert a document file, the tag first checks for an OpenOffice installation. When the OpenOffice installation is found, the tag processes the rich text conversion through the OpenOffice libraries.”

One odd thing to note is that if the file extension is ".doc" it returned garbage as the PDF output (even with OpenOffice), but with the same file renamed to ".docx" it worked.

vakantiehuis says:

February 12, 2011 at 9:31 pm

Nice site! There are still very few good sites on this subject.
Glad to have found your post!
Unfortunately, I can’t create a bookmark to http://www.terrenceryan.com in Firefox. 😦 Do you know why that is?

Regards, Barbara

Anupam says:

February 1, 2017 at 10:39 am

The problem with this is that sometimes the converted PDF doesn’t have readable text.
All the text gets mixed up in the PDF.

It would be better if there were a way to extract text from a Word document directly. Is there any way to do that in ColdFusion?
