Document Processing Modules

Document modules depend on external system tools, and each one must be enabled as a capability. Set the following environment variables:

CAPS_SOFFICE=true # LibreOffice (PDF conversion, DOCX)
CAPS_TESSERACT=true # OCR
CAPS_MAGICK=true # ImageMagick
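A preflight check can catch a capability that is enabled but not actually installed. The sketch below is illustrative only: the mapping from each flag to an executable name (`soffice`, `tesseract`, `magick`) is an assumption not stated in this documentation, and `missing_binaries` is a hypothetical helper. Note that ImageMagick 6 installs `convert` rather than `magick`.

```python
import os
import shutil

# Assumed mapping from capability flag to executable name (not confirmed by the docs).
CAPABILITY_BINARIES = {
    "CAPS_SOFFICE": "soffice",
    "CAPS_TESSERACT": "tesseract",
    "CAPS_MAGICK": "magick",
}


def missing_binaries(env=None, which=shutil.which):
    """Return flags that are set to "true" but whose binary is not on PATH."""
    env = os.environ if env is None else env
    missing = []
    for flag, binary in CAPABILITY_BINARIES.items():
        enabled = env.get(flag, "").strip().lower() == "true"
        if enabled and which(binary) is None:
            missing.append(flag)
    return missing


if __name__ == "__main__":
    for flag in missing_binaries():
        print(f"{flag} is enabled but its binary was not found on PATH")
```

The `which` parameter is injectable so the check can be exercised without the tools installed.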

PDF Extract — Text Extraction

Extracts text from a PDF file.

appId: pdf-extract

{
  "appId": "pdf-extract",
  "ref": "readPdf",
  "args": {
    "path": "{{ $prev.downloadFile.path }}",
    "pages": "all"
  }
}

Output: { "text": "Invoice #INV-001\nDate: 2024-01-15\nAmount: $1,250.00\n...", "pageCount": 3, "pages": ["Page 1 text...", "Page 2 text...", "Page 3 text..."] }
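The extracted `text` is plain text, so a downstream step usually has to pull structured fields out of it. As a minimal sketch, assuming the invoice layout shown in the sample output, regular expressions can recover the number, date, and amount. `parse_invoice` and its patterns are hypothetical and would need adapting to real documents.

```python
import re
from decimal import Decimal

# Sample taken from the "text" field of the output above (truncated part omitted).
SAMPLE_TEXT = "Invoice #INV-001\nDate: 2024-01-15\nAmount: $1,250.00\n"


def parse_invoice(text):
    """Extract invoice fields from extracted PDF text; missing fields become None."""
    number = re.search(r"Invoice #(\S+)", text)
    date = re.search(r"Date:\s*(\d{4}-\d{2}-\d{2})", text)
    amount = re.search(r"Amount:\s*\$([\d,]+\.\d{2})", text)
    return {
        "number": number.group(1) if number else None,
        "date": date.group(1) if date else None,
        # Strip thousands separators before converting to an exact decimal.
        "amount": Decimal(amount.group(1).replace(",", "")) if amount else None,
    }


if __name__ == "__main__":
    print(parse_invoice(SAMPLE_TEXT))
```

`Decimal` is used instead of `float` so that currency amounts are not subject to binary rounding.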

PDF Merge — Combine PDFs

{
  "appId": "pdf-merge",
  "ref": "combineDocs",
  "args": {
    "files": [
      "{{ $prev.coverPage.path }}",
      "{{ $prev.reportBody.path }}",
      "{{ $prev.appendix.path }}"
    ],
    "output": "output/combined-report-{{ $date.format '2006-01-02' }}.pdf"
  }
}

Output: { "path": "output/combined-report-2024-01-15.pdf", "pageCount": 24, "size": 204800 }

PDF Split — Separate Pages

{
  "appId": "pdf-split",
  "args": {
    "path": "{{ $prev.downloadPdf.path }}",
    "outputDir": "{{ $prev.tmpDir.path }}pages/",
    "pages": "1-3,5,7-10"
  }
}

Output: { "files": [{"path":"pages/page-1.pdf"}, ...], "count": 8 }
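The `pages` argument combines single pages and inclusive ranges. The sketch below shows one way such a spec could be expanded into page numbers. It assumes 1-based pages and inclusive ranges, and `expand_pages` is a hypothetical helper, not the module's actual parser.

```python
def expand_pages(spec):
    """Expand a spec such as "1-3,5,7-10" into a list of 1-based page numbers."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.extend(range(start, end + 1))  # ranges are inclusive
        else:
            pages.append(int(part))
    return pages


if __name__ == "__main__":
    print(expand_pages("1-3,5,7-10"))
```

For `"1-3,5,7-10"` this yields eight page numbers: 1, 2, 3, 5, 7, 8, 9, and 10.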

HTML to PDF — Generate PDFs from HTML

{
  "appId": "html-to-pdf",
  "ref": "generateInvoicePdf",
  "args": {
    "html": "{{ $prev.renderTemplate.result }}",
    "output": "invoices/{{ $prev.args.invoiceId }}.pdf",
    "options": {
      "pageSize": "A4",
      "marginTop": "20mm",
      "marginBottom": "20mm",
      "marginLeft": "15mm",
      "marginRight": "15mm"
    }
  }
}

Output: { "path": "invoices/INV-001.pdf", "size": 45312 }

OCR — Text Extraction from Images

Extracts text from scanned documents or images using Tesseract.

appId: ocr

{
  "appId": "ocr",
  "ref": "scanDocument",
  "args": {
    "path": "{{ $prev.downloadScan.path }}",
    "language": "eng",
    "outputMode": "text"
  }
}

Output: { "text": "INVOICE\nDate: January 15, 2024\n...", "confidence": 94.2 }
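Because OCR quality varies with scan quality, it can help to gate downstream steps on the reported `confidence`. The following is a rough sketch that assumes `confidence` is on a 0 to 100 scale, as the sample value suggests. The threshold of 80 and the `needs_review` helper are illustrative choices, not part of the module.

```python
def needs_review(result, min_confidence=80.0):
    """Flag an OCR result for manual review if it is empty or low-confidence."""
    text = result.get("text", "").strip()
    confidence = result.get("confidence", 0.0)  # assumed 0-100 scale
    return not text or confidence < min_confidence


if __name__ == "__main__":
    sample = {"text": "INVOICE\nDate: January 15, 2024\n", "confidence": 94.2}
    print(needs_review(sample))
```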

Supported languages: eng (English), deu (German), fra (French), spa (Spanish), and any Tesseract language pack installed on the system.
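To see which language packs are installed, Tesseract provides the `--list-langs` flag. The sketch below parses that listing. The helper names are hypothetical, and reading both output streams is a precaution because some Tesseract releases print the list to stderr.

```python
import subprocess


def parse_list_langs(output):
    """Return language codes from `tesseract --list-langs` output."""
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blanks and the header line that precedes the codes.
        if line and not line.startswith("List of available languages"):
            codes.append(line)
    return codes


def installed_languages():
    """Run Tesseract and return the installed language codes."""
    proc = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_list_langs(proc.stdout + "\n" + proc.stderr)


if __name__ == "__main__":
    print(installed_languages())
```

A workflow could compare the `language` argument against this list before running the OCR step.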

ImageMagick — Image Processing

{
  "appId": "imagemagick-convert",
  "ref": "convertImage",
  "args": {
    "input": "{{ $prev.downloadScan.path }}",
    "output": "{{ $prev.tmpDir.path }}converted.png",
    "format": "png",
    "resize": "1200x",
    "quality": 85
  }
}

Common operations: convert (format/resize), compress, rotate, crop.
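In ImageMagick geometry syntax, a value such as `1200x` sets the width and lets the height follow from the aspect ratio, while `x800` does the same for the height. The sketch below estimates the resulting dimensions for these two forms only. `resized_dimensions` is a hypothetical helper, and its rounding may differ from ImageMagick's by a pixel.

```python
def resized_dimensions(width, height, geometry):
    """Estimate output size for a width-only ("1200x") or height-only ("x800") geometry."""
    target_w, _, target_h = geometry.partition("x")
    if target_w and not target_h:
        scale = int(target_w) / width  # width fixed, height follows
    elif target_h and not target_w:
        scale = int(target_h) / height  # height fixed, width follows
    else:
        raise ValueError(f"only width-only or height-only geometry is handled: {geometry!r}")
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    print(resized_dimensions(2400, 1800, "1200x"))
```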

DOCX — Parse Word Documents

{
  "appId": "docx-parse",
  "ref": "readContract",
  "args": {
    "path": "{{ $prev.downloadDoc.path }}"
  }
}

Output: { "text": "CONTRACT AGREEMENT\n\nThis agreement...", "paragraphs": [...], "tables": [...] }