Mirth Community - View Single Post - Mirth Tools: User defined functions
Post #56 by narupley (Mirth Employee), 05-21-2018, 08:32 AM
Extract Text From PDF

Quote:
UPDATE: I've created a public GitHub repository to track these example channels, code templates, scripts, or whatever else!

https://github.com/nextgenhealthcare/connect-examples

To start with, I only added the ones I wrote, because I didn't want to presume and add code from others without their explicit permission. Pull requests welcome!

This comes up from time to time...

extractTextFromPDF: Extracts and returns all text from a PDF. Uses the built-in iText library, version 2.1.7.

Parameters:
  • pdfBytes: The raw byte array for the PDF.

Examples:
  • Extract text from a Base64-encoded PDF string:
    Code:
    var pdfBytes = FileUtil.decode(pdfBase64String);
    var pdfText = extractTextFromPDF(pdfBytes);
  • Extract text from a PDF message attachment (before 3.6):
    Code:
    // getContent() returns a byte array of Base64-encoded ASCII bytes
    var attachmentContent = getAttachments().get(0).getContent();
    // Convert to a Base64 string
    var attachmentBase64String = new java.lang.String(attachmentContent, 'US-ASCII');
    // Convert to raw PDF bytes
    var pdfBytes = FileUtil.decode(attachmentBase64String);
    // Extract the text
    var pdfText = extractTextFromPDF(pdfBytes);
  • Extract text from a PDF message attachment (3.6 and later):
    Code:
    // Passing true for base64Decode means the content is already raw PDF bytes
    var pdfBytes = getAttachments(true).get(0).getContent();
    // Extract the text
    var pdfText = extractTextFromPDF(pdfBytes);
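
A mistake that is easy to make with the attachment examples above is passing bytes that are still Base64-encoded. As a rough guard, here is a minimal sketch of a check that the decoded bytes begin with the "%PDF-" marker. The name looksLikePdf is made up for illustration and is not part of Mirth Connect or iText. The check is deliberately strict, so a PDF with stray bytes before its header would be rejected.

```javascript
// Hypothetical helper (not from the original post): returns true when the
// first five bytes spell "%PDF-", the marker that normally begins a PDF file.
function looksLikePdf(bytes) {
	var magic = [0x25, 0x50, 0x44, 0x46, 0x2D]; // "%PDF-"

	if (bytes == null || bytes.length < magic.length) {
		return false;
	}

	for (var i = 0; i < magic.length; i++) {
		// Mask to 0-255 so signed Java bytes and plain JS numbers compare alike
		if ((bytes[i] & 0xFF) !== magic[i]) {
			return false;
		}
	}

	return true;
}
```

If this returns false for attachment content, one thing to check is whether the bytes still need Base64 decoding, since Base64-encoded PDF data starts with "JVBERi0" rather than "%PDF-".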

The code:
Code:
/**
	Extracts and returns all text from a PDF. Uses the built-in iText library, version 2.1.7.

	@param {byte[]} pdfBytes - The raw byte array for the PDF.
	@return {String} The extracted text.
*/
function extractTextFromPDF(pdfBytes) {
	var text = new java.lang.StringBuilder();
	var reader = new com.lowagie.text.pdf.PdfReader(pdfBytes);
	
	try {
		var extractor = new com.lowagie.text.pdf.parser.PdfTextExtractor(reader);
		var pages = reader.getNumberOfPages();
		
		for (var i = 1; i <= pages; i++) {
			text.append(extractor.getTextFromPage(i));
			if (i < pages) {
				text.append('\n\n');
			}
		}
	} finally {
		reader.close();
	}

	return text.toString();
}
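
The loop in the function above puts a blank line between pages and adds nothing after the last page. Here is a plain-JavaScript sketch of just that joining behavior, using a stubbed page source so it runs without a PDF or iText. The names joinPageText and getPageText are made up for illustration.

```javascript
// Hypothetical stand-in for the page loop in extractTextFromPDF.
// getPageText is an assumed callback: it takes a 1-based page number and
// returns that page's text (the role extractor.getTextFromPage plays above).
function joinPageText(pageCount, getPageText) {
	var parts = [];

	for (var i = 1; i <= pageCount; i++) {
		parts.push(String(getPageText(i)));
	}

	// Blank line between pages, no separator after the last page
	return parts.join('\n\n');
}

// Stubbed three-page document
var fakePages = ['first page', 'second page', 'third page'];
var joined = joinPageText(fakePages.length, function (pageNumber) {
	return fakePages[pageNumber - 1];
});
```

A single-page document comes back with no separator at all, and a page count of zero yields an empty string.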
Attached Files
Extract Text From PDF.xml (1.4 KB)
Last edited by narupley; 06-08-2018 at 11:37 AM.