Document Processing Modules
Document modules require the corresponding system capabilities to be enabled. Set them in the environment:
    CAPS_SOFFICE=true    # LibreOffice (PDF conversion, DOCX)
    CAPS_TESSERACT=true  # OCR
    CAPS_MAGICK=true     # ImageMagick

PDF Extract — Text Extraction
Extracts text from a PDF file.
appId: pdf-extract
    {
      "appId": "pdf-extract",
      "ref": "readPdf",
      "args": {
        "path": "{{ $prev.downloadFile.path }}",
        "pages": "all"
      }
    }

Output:

    {
      "text": "Invoice #INV-001\nDate: 2024-01-15\nAmount: $1,250.00\n...",
      "pageCount": 3,
      "pages": ["Page 1 text...", "Page 2 text..."]
    }
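A downstream step will often need structured fields rather than the raw `text` string. The following is a minimal Python sketch of that idea, not part of the module itself: the regular expressions and the `parse_invoice` helper are hypothetical and match only the layout of the sample output shown above.

```python
import re
from decimal import Decimal

# Sample `text` value taken from the pdf-extract output above (trailing "..." dropped).
SAMPLE_TEXT = "Invoice #INV-001\nDate: 2024-01-15\nAmount: $1,250.00\n"

# Hypothetical patterns: they fit this sample layout only.
INVOICE_RE = re.compile(r"Invoice #(?P<number>[A-Z]+-\d+)")
DATE_RE = re.compile(r"Date:\s*(?P<date>\d{4}-\d{2}-\d{2})")
AMOUNT_RE = re.compile(r"Amount:\s*\$(?P<amount>[\d,]+\.\d{2})")


def parse_invoice(text: str) -> dict:
    """Pull the invoice number, date, and amount out of extracted PDF text."""
    fields = {}
    if m := INVOICE_RE.search(text):
        fields["number"] = m.group("number")
    if m := DATE_RE.search(text):
        fields["date"] = m.group("date")
    if m := AMOUNT_RE.search(text):
        # Drop thousands separators before converting to an exact decimal.
        fields["amount"] = Decimal(m.group("amount").replace(",", ""))
    return fields


if __name__ == "__main__":
    print(parse_invoice(SAMPLE_TEXT))
```

Fields that do not match are simply left out of the result, so callers can tell a missing value from an empty one.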
PDF Merge — Combine PDFs
    {
      "appId": "pdf-merge",
      "ref": "combineDocs",
      "args": {
        "files": [
          "{{ $prev.coverPage.path }}",
          "{{ $prev.reportBody.path }}",
          "{{ $prev.appendix.path }}"
        ],
        "output": "output/combined-report-{{ $date.format '2006-01-02' }}.pdf"
      }
    }

Output: { "path": "output/combined-report-2024-01-15.pdf", "pageCount": 24, "size": 204800 }
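The `'2006-01-02'` argument in the `output` template resembles a Go-style reference-time layout, in which `2006`, `01`, and `02` stand for year, month, and day. That reading is an assumption; this page does not state which layout syntax the template engine uses. Under that assumption, a small Python sketch of how the layout maps to the path in the sample output (the helper names are hypothetical):

```python
from datetime import date

# Subset of Go reference-time tokens mapped to strftime directives.
# Assumption: the template engine follows Go's layout convention.
GO_TO_STRFTIME = {"2006": "%Y", "01": "%m", "02": "%d"}


def go_layout_to_strftime(layout: str) -> str:
    """Translate a Go-style date layout (year/month/day only) to strftime."""
    for token, directive in GO_TO_STRFTIME.items():
        layout = layout.replace(token, directive)
    return layout


def merged_report_path(day: date) -> str:
    """Build the merged report path the way the template above would."""
    stamp = day.strftime(go_layout_to_strftime("2006-01-02"))
    return f"output/combined-report-{stamp}.pdf"


if __name__ == "__main__":
    print(merged_report_path(date(2024, 1, 15)))
```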
PDF Split — Separate Pages
    {
      "appId": "pdf-split",
      "args": {
        "path": "{{ $prev.downloadPdf.path }}",
        "outputDir": "{{ $prev.tmpDir.path }}pages/",
        "pages": "1-3,5,7-10"
      }
    }

Output: { "files": [{"path":"pages/page-1.pdf"}, ...], "count": 8 }
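The `pages` argument combines single pages and ranges in one comma-separated string. As a rough illustration of how such a spec can be expanded, here is a Python sketch; it assumes ranges are inclusive and 1-based, which this page does not state explicitly, and `expand_pages` is a hypothetical helper rather than the module's own parser.

```python
def expand_pages(spec: str) -> list[int]:
    """Expand a page spec such as "1-3,5,7-10" into a list of page numbers.

    Assumes inclusive, 1-based ranges (an assumption, not documented behavior).
    """
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


if __name__ == "__main__":
    print(expand_pages("1-3,5,7-10"))  # [1, 2, 3, 5, 7, 8, 9, 10]
```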
HTML to PDF — Generate PDFs from HTML
    {
      "appId": "html-to-pdf",
      "ref": "generateInvoicePdf",
      "args": {
        "html": "{{ $prev.renderTemplate.result }}",
        "output": "invoices/{{ $prev.args.invoiceId }}.pdf",
        "options": {
          "pageSize": "A4",
          "marginTop": "20mm",
          "marginBottom": "20mm",
          "marginLeft": "15mm",
          "marginRight": "15mm"
        }
      }
    }

Output: { "path": "invoices/INV-001.pdf", "size": 45312 }
OCR — Text Extraction from Images
Extracts text from scanned documents or images using Tesseract.
appId: ocr
    {
      "appId": "ocr",
      "ref": "scanDocument",
      "args": {
        "path": "{{ $prev.downloadScan.path }}",
        "language": "eng",
        "outputMode": "text"
      }
    }

Output: { "text": "INVOICE\nDate: January 15, 2024\n...", "confidence": 94.2 }
Supported languages: eng (English), deu (German), fra (French), spa (Spanish), and any Tesseract language pack installed on the system.
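Because the usable `language` values depend on which packs are installed, it can help to list them before configuring a step. The sketch below shells out to `tesseract --list-langs`, a standard Tesseract CLI flag. The helper names are hypothetical, and the header-skipping logic assumes the usual "List of available languages ..." first line.

```python
import subprocess


def parse_list_langs(output: str) -> list[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.append(line)
    return langs


def installed_languages() -> list[str]:
    """Ask the local Tesseract binary which language packs it can see."""
    proc = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Older builds may print the list to stderr, so both streams are read.
    return parse_list_langs(proc.stdout + "\n" + proc.stderr)


if __name__ == "__main__":
    print(installed_languages())
```

The parser is kept separate from the subprocess call so it can be exercised without Tesseract installed.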
ImageMagick — Image Processing
    {
      "appId": "imagemagick-convert",
      "ref": "convertImage",
      "args": {
        "input": "{{ $prev.downloadScan.path }}",
        "output": "{{ $prev.tmpDir.path }}converted.png",
        "format": "png",
        "resize": "1200x",
        "quality": 85
      }
    }

Common operations: convert (format/resize), compress, rotate, crop.
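To make the arguments above more concrete, the following Python sketch shows one plausible translation into an ImageMagick command line. How the module actually invokes ImageMagick is not documented here, so the mapping and the `magick_args` helper are assumptions. The `-resize` and `-quality` options and the `format:filename` output prefix are standard ImageMagick features.

```python
def magick_args(args: dict) -> list[str]:
    """Build an ImageMagick argv list from step args (hypothetical mapping)."""
    cmd = ["magick", args["input"]]
    if "resize" in args:
        # A geometry such as "1200x" fixes the width and keeps the aspect ratio.
        cmd += ["-resize", str(args["resize"])]
    if "quality" in args:
        cmd += ["-quality", str(args["quality"])]
    output = args["output"]
    if "format" in args:
        # An explicit "png:" style prefix forces the output format.
        output = f"{args['format']}:{output}"
    cmd.append(output)
    return cmd


if __name__ == "__main__":
    step_args = {
        "input": "scan.tif",
        "output": "converted.png",
        "format": "png",
        "resize": "1200x",
        "quality": 85,
    }
    print(magick_args(step_args))
```

Note that for PNG output, ImageMagick interprets `-quality` as a compression level and filter setting rather than a lossy quality factor, so the value affects file size and encoding time but not the pixels.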
DOCX — Parse Word Documents
    {
      "appId": "docx-parse",
      "ref": "readContract",
      "args": {
        "path": "{{ $prev.downloadDoc.path }}"
      }
    }

Output: { "text": "CONTRACT AGREEMENT\n\nThis agreement...", "paragraphs": [...], "tables": [...] }