Pulling Content out of Word with ColdFusion 9

I had a 1 + 1 = 2 moment the other day. I was fooling around with ColdFusion’s ability to turn Word docs into PDFs. At first glance it’s pretty simple and straightforward.
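
The conversion is a single cfdocument call. What follows is a minimal sketch rather than the exact snippet from the post: the file names are hypothetical, and per the comments below it assumes OpenOffice is installed and configured in the ColdFusion Administrator.

```cfm
<!--- Hypothetical file names, chosen only for illustration. --->
<cfset wordFile = expandPath("./sample.docx")>
<cfset pdfFile  = expandPath("./sample.pdf")>

<!--- Convert the Word document to a PDF written to disk. --->
<cfdocument format="pdf"
            srcfile="#wordFile#"
            filename="#pdfFile#"
            overwrite="true">
</cfdocument>
```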

Word to PDF is nice to have, but as features go, it’s a pretty small bullet point. Don’t get me wrong, you get fidelity to the original, including fonts, layouts, and images. But it’s still just converting a Word document to a PDF.

That is, until you remember that you can pull content out of PDFs now in ColdFusion 9. So now you can chain the two: convert the Word document to a PDF, then extract the text from that PDF.
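
Here is a rough sketch of that chain, not the post's original code. The file and variable names are hypothetical, and `type="string"` is an assumption made to get plain text back instead of XML.

```cfm
<!--- Hypothetical file names, chosen only for illustration. --->
<cfset wordFile = expandPath("./sample.docx")>
<cfset pdfFile  = expandPath("./sample.pdf")>

<!--- Step 1: Word to PDF (assumes OpenOffice is configured on the server). --->
<cfdocument format="pdf" srcfile="#wordFile#" filename="#pdfFile#" overwrite="true">
</cfdocument>

<!--- Step 2: pull the text out of the PDF into a variable named "extracted". --->
<cfpdf action="extracttext" source="#pdfFile#" name="extracted" type="string">

<!--- Show the extracted text, escaped so it is safe to render in HTML. --->
<cfoutput><pre>#htmlEditFormat(extracted)#</pre></cfoutput>
```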

This will yield you the content of the original Word document. Now that’s cool.

15 thoughts on “Pulling Content out of Word with ColdFusion 9”

  1. I haven’t downloaded CF9 yet, but this example uses .docx as the document format. Can the same be done for the old .doc format, which wasn’t XML-based?

  2. Thanks Terrence, I read up on it in the end. Yep, it should do .doc quite nicely (with whatever quirks are associated with OpenOffice).

    Use #1 for me: produce thumbnails of Word docs for a document management application.

  3. Dear Terrence,

    A client of mine is asking whether a document could be uploaded so that the document’s footnotes would be stripped out and, together with the associated paragraph, be emailed to different people (as in, footnote 1 and its paragraph go to Adam for check-up and footnote 2 goes to Joe)…
    I could convince the client to use a PDF if that made things any easier… But I’d appreciate your insight to know whether this would be at all possible…

    Thanks in advance, Juan Escalada.

  4. One of our departments downloads a Word doc that they want to get the content from. This blog was just what I was looking for. Thanks.

  5. Terrence, thanks for that. I’m trying to use CF9 to pull summary information out of DOCX files. HWPF doesn’t do it, so I’m wondering if CF9 can do it based on its DOCX support. Do you know the best way of going about this? I’m leery of converting to PDF first and then using cfpdf to extract, as the summary info (createTime, etc.) may have been mangled.

  6. The new extracttext action is nice to have, and I am able to use it to simply pull the text of a PDF document.

    However, I have a real need to keep the general format of the document as well. This would simply be things like centering, indenting/tabs, line breaks, etc.

    I had hoped that using useStructure="true" along with honourspaces="true" would return the basic format of the document, but that does not seem to be the case.

    Do you know if it is possible to maintain the basic formatting of a PDF document?

    Thanks

    BTW – the PDF documents that I am working with began as Word and were converted to PDF using your suggested approach (thank you).

  7. @Virginia

    I ran into a similar problem last week. I used an alternative solution that worked for my use case and may or may not help. I still converted the Word documents to PDFs, then used the thumbnail capability to create JPG images at 100% resolution and displayed those images to the end user, which worked well for us.

    While I am looking forward to diving more into the capabilities here and with DDX, I think the overall problem we both encountered is that DDX is about the actual content of the PDF, devoid of styling, similar to how HTML “is supposed to be”. With HTML you add in a CSS file to style, and I believe there is a similar type of document that combines with the DDX to create your fully styled PDF.

    That’s obviously a lot more work to dig into, and I only had time to do the thumbnail solution for now, but I hope that helps.

    @Terrence thanks for the blog and great tips that even got me started on solving my use case.

  8. Just want to confirm that an OpenOffice install on the server is needed to convert Word documents to PDFs using ColdFusion 9.

    Please correct me if I’m wrong, but I just ran Terrence’s code and received an error that an OpenOffice install was required.

  9. Yes, you need OpenOffice installed; then add its directory in the admin settings under the ‘Documents’ section. I tried it without the software and it did make a “pdf” file, but with garbage in it, as if you were viewing a Word doc in TextPad.

    “When you use cfdocument to convert a document file, the tag first checks for an OpenOffice installation. When the OpenOffice installation is found, the tag processes the rich text conversion through the OpenOffice libraries.”

    One odd thing to note: if the file extension was ‘.doc’, the PDF output was garbage (even with OpenOffice), but with the same file renamed to ‘.docx’ it worked.

  10. The problem with this is that sometimes the converted PDF doesn’t have readable text…
    All the text gets mixed up in the PDF…

    It would be better if there were a way to extract text from a Word document directly… Is there any way to do that in ColdFusion?
