Document Processing
Why Document Processing Matters
Documents are central to how business works. Contracts, invoices, forms, reports, permits, medical records, compliance certificates—they flow through every function.
But documents are also messy. Formats vary. Scans are images, not searchable text. Information is buried in layouts you don’t control. Extracting data manually is slow and error-prone.
flow8’s document processing capabilities let you:
- Extract structured data from unstructured PDFs (invoices, contracts, forms).
- Generate documents on demand (contracts, reports, letters, invoices).
- Process scanned images with OCR (Tesseract) to make them searchable and extractable.
- Convert between formats (Word to PDF, Excel to PDF, HTML to PDF).
- Manipulate PDFs (merge, split, rotate, add watermarks).
- Store and organize documents (automatically file based on type or content).
All in a single flow, without specialist tools or manual intervention.
Document Operations
Extracting Text from PDFs
Scenario: A client sends a contract as a PDF. You need to extract key terms (parties, effective date, expiration, payment terms, liability caps) and populate a legal database.
Without flow8: A lawyer reads the contract (~15 minutes) and manually enters the terms into a spreadsheet.
With flow8:
- Contract arrives via email.
- Extract PDF Text flowlet: reads the PDF, outputs all text.
- AI Data Extraction flowlet: given the text and a list of fields to extract, AI returns structured JSON.
- Update Database flowlet: writes extracted fields to a contract management database.
- Notify Team flowlet: emails the legal team with extracted summary.
Outcome: 15 minutes of manual work → 30 seconds of automated processing.
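The hand-off between the AI Data Extraction step and the Update Database step can be sketched as follows. This is a minimal illustration, not flow8's implementation: the field names, the `parse_extraction` helper, and the shape of the JSON reply are all assumptions chosen to mirror the contract terms listed in the scenario.

```python
import json

# Hypothetical field list mirroring the scenario; not a flow8 schema.
REQUIRED_FIELDS = ["parties", "effective_date", "expiration_date",
                   "payment_terms", "liability_cap"]


def parse_extraction(raw_json: str):
    """Parse the model's JSON reply and list the required fields that are empty."""
    data = json.loads(raw_json)
    missing = [name for name in REQUIRED_FIELDS
               if data.get(name) in (None, "", [])]
    return data, missing


# A reply the extraction step might return (made-up values).
reply = json.dumps({
    "parties": ["Acme Corp", "Example LLC"],
    "effective_date": "2024-01-01",
    "expiration_date": "2025-01-01",
    "payment_terms": "Net 30",
    "liability_cap": None,
})

fields, missing = parse_extraction(reply)
# Write to the database only when nothing is missing; otherwise ask a human.
next_step = "update_database" if not missing else "manual_review"
```

Checking for missing fields before the database write keeps an incomplete extraction from silently producing a partial contract record.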
Optical Character Recognition (OCR)
Scenario: You receive 100 scanned loan applications (images/low-quality PDFs). Each needs to be read, understood, and filed in the right folder. Right now, someone manually reads each image and types the data into your loan system.
Without flow8: 100 scans × 5 minutes = 500 minutes, or 8+ hours of manual data entry.
With flow8:
- Scanned document uploaded to folder or email.
- OCR (Tesseract) flowlet: recognizes text from the image. Outputs searchable text.
- AI Classification flowlet: reads the extracted text, determines document type (application, proof of income, employment verification).
- Extract Applicant Data flowlet: AI extracts name, SSN, income, employment from the text.
- File Document flowlet: stores the scan in the right folder structure by applicant name and document type.
- Log Data flowlet: creates a record in your loan system with applicant details.
Outcome: 100 scans processed in ~2 minutes (parallel processing). All data extracted and logged automatically. Documents are now searchable (OCR text is stored with PDFs).
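The OCR, filing, and parallel-processing steps above might look roughly like this outside flow8. It is a sketch under assumptions: the helper names and the folder layout are hypothetical, and the only external detail relied on is the Tesseract command line, where passing `stdout` as the output base prints the recognized text.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePosixPath


def tesseract_command(image_path: str, lang: str = "eng") -> list:
    """Build a Tesseract CLI call that prints recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def _slug(text: str) -> str:
    """Collapse whitespace to underscores so names are safe as folder names."""
    return "_".join(text.strip().split())


def filing_path(applicant: str, doc_type: str, filename: str) -> str:
    """Assumed layout: /<applicant>/<document type>/<file>."""
    return str(PurePosixPath("/", _slug(applicant), _slug(doc_type), filename))


def process_all(scans: list, ocr, workers: int = 8) -> dict:
    """Run the supplied OCR callable over every scan concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(scans, pool.map(ocr, scans)))
```

In practice the `ocr` callable would run the command, for example with `subprocess.run(tesseract_command(path), capture_output=True, text=True).stdout`. Threads suit this case because each worker mostly waits on an external process.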
PDF Generation
Scenario: Your sales team closes deals daily. For each deal, you need to generate a custom contract (terms vary by customer), get it signed, and file it. Currently, contracts are Word templates that an admin manually customizes (20–30 minutes per contract).
With flow8:
- Deal is won in Salesforce.
- Webhook triggers flow8.
- Fetch Deal Details flowlet: pulls customer name, amount, products, term from Salesforce.
- Fetch Customer Data flowlet: pulls address, contact, billing info from CRM.
- AI Generate Terms flowlet: given customer data and deal terms, AI generates contract text (payment terms, liability, termination clauses) tailored to customer risk profile.
- Generate PDF flowlet: renders the contract as a formatted PDF using an HTML template.
- Send for E-Signature flowlet: emails PDF to customer with DocuSign link.
- Store Contract flowlet: stores PDF in OneDrive, links to deal record in Salesforce.
Outcome: Contract generated and sent for signature in < 1 minute. No admin time. Contracts are consistent and audit-friendly.
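The "HTML template" part of the Generate PDF step can be illustrated with the Python standard library. This is only a sketch: the template markup, the field names, and `render_contract_html` are assumptions, and the final HTML-to-PDF rendering is left to the flowlet.

```python
from html import escape
from string import Template

# Hypothetical contract template; placeholders are filled from deal data.
CONTRACT_TEMPLATE = Template("""\
<html><body>
<h1>Service Agreement</h1>
<p>Customer: $customer</p>
<p>Amount: $amount</p>
<p>Term: $term</p>
<div class="terms">$terms</div>
</body></html>
""")


def render_contract_html(deal: dict) -> str:
    """Escape every value, then substitute it into the HTML template.

    Template.substitute raises KeyError if a placeholder has no value,
    so an incomplete deal record fails loudly instead of rendering blanks.
    """
    safe = {key: escape(str(value)) for key, value in deal.items()}
    return CONTRACT_TEMPLATE.substitute(safe)
```

Escaping matters here because customer names and AI-generated terms may contain characters such as `&` or `<` that would otherwise break the markup.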
PDF Merging and Splitting
Scenario: A legal team is preparing a case file for court. It includes a cover letter, motion, exhibits, declarations, and appendices (50+ pages). These are stored as separate PDFs. They need to be combined into a single file with a table of contents and proper pagination.
Without flow8: Admin manually opens each PDF, merges them in the right order, renumbers pages (~1 hour).
With flow8:
- Trigger: Admin clicks “Compile Case File” in flow8.
- Fetch PDFs flowlet: retrieves cover letter, motion, exhibits, declarations from OneDrive in the right order.
- Merge PDFs flowlet: combines them into one document.
- Add Pagination flowlet: adds page numbers and footer with case number.
- Store and Notify flowlet: saves the compiled file to OneDrive, emails link to the team.
Outcome: 1-hour manual task → 30 seconds.
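The table of contents and pagination in this scenario come down to running page offsets over the merge order. A small sketch, assuming the page count of each part is already known (the function names and the footer wording are hypothetical):

```python
def table_of_contents(parts, first_page=1):
    """Return (title, start_page, end_page) for each part, in merge order."""
    toc = []
    page = first_page
    for title, page_count in parts:
        if page_count < 1:
            raise ValueError(f"{title!r} has no pages")
        toc.append((title, page, page + page_count - 1))
        page += page_count
    return toc


def footer_text(case_number: str, page: int, total: int) -> str:
    """Footer line stamped on every page of the compiled file."""
    return f"Case {case_number} | Page {page} of {total}"


# Made-up page counts for the documents named in the scenario.
case_file = [("Cover letter", 1), ("Motion", 12),
             ("Exhibits", 30), ("Declarations", 8)]
toc = table_of_contents(case_file)
total_pages = toc[-1][2]
```

Because the offsets are derived from the merge order, reordering the exhibits only requires recomputing the list, not renumbering pages by hand.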
Format Conversion
Scenario: Finance team receives invoices in various formats: some as PDFs, some as Excel files, some as Word docs. All need to be converted to PDF for archival and e-signature. Right now, this requires manual conversion in Adobe or online tools.
Without flow8: 50 invoices × 2 minutes each = 100 minutes of clicking and uploading to converters.
With flow8:
- Invoices uploaded to a folder.
- Detect File Type flowlet: determines if PDF, Excel, Word, etc.
- Convert to PDF flowlet (using LibreOffice):
  - If Excel/Word: convert using LibreOffice.
  - If already PDF: pass through.
- Store Converted flowlet: saves all PDFs to archive folder.
Outcome: 50 invoices converted in parallel in ~1 minute. All are now in uniform PDF format.
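The detect-then-convert branch above can be sketched as follows. The helper names and the extension list are assumptions; the command itself uses LibreOffice's headless `--convert-to` and `--outdir` options. PDFs are recognized by their `%PDF-` signature rather than by file name.

```python
from pathlib import Path
from typing import Optional

# Extensions treated as office documents in this sketch (an assumption).
OFFICE_EXTENSIONS = {".doc", ".docx", ".xls", ".xlsx", ".odt", ".ods"}


def detect_type(filename: str, head: bytes) -> str:
    """Classify a file as 'pdf', 'office', or 'unknown'.

    `head` is the first few bytes of the file; PDFs start with b'%PDF-'.
    """
    if head.startswith(b"%PDF-"):
        return "pdf"
    if Path(filename).suffix.lower() in OFFICE_EXTENSIONS:
        return "office"
    return "unknown"


def conversion_command(filename: str, head: bytes, out_dir: str) -> Optional[list]:
    """Return the LibreOffice command to run, or None to pass a PDF through."""
    kind = detect_type(filename, head)
    if kind == "pdf":
        return None
    if kind == "office":
        return ["soffice", "--headless", "--convert-to", "pdf",
                "--outdir", out_dir, filename]
    raise ValueError(f"unsupported file type: {filename}")
```

A caller would read the header with something like `open(path, "rb").read(8)` and pass any returned command to `subprocess.run`.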
HTML to PDF
Scenario: Your reporting system generates reports as HTML (formatted, styled). You want to email them as PDFs. Right now, you copy HTML to a browser, “print to PDF,” and email.
With flow8:
- Report data pulled from database.
- Format as HTML flowlet: generates HTML report with formatting (header, footer, tables, charts).
- Convert HTML to PDF flowlet: renders HTML as PDF.
- Email Report flowlet: attaches PDF and sends via email.
Outcome: Reports are always consistent, always properly formatted, always delivered on schedule.
Image Processing
Scenario: Your product images (from suppliers) are in various sizes and formats. You need to resize them to a standard size for your e-commerce site. Manual resizing is tedious.
With flow8:
- New image uploaded.
- Resize Image flowlet (ImageMagick): resizes to standard dimensions (e.g., 500x500px).
- Optimize flowlet: compresses for web (reduces file size).
- Store flowlet: saves to CDN or product database.
Outcome: Bulk image processing is automated and consistent.
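For the resize and optimize steps, a sketch of the underlying arithmetic and of one possible ImageMagick invocation follows. The function names and the quality setting of 85 are assumptions; `-resize`, `-strip`, and `-quality` are standard ImageMagick options, and the `magick` entry point is the ImageMagick 7 name (version 6 uses `convert`).

```python
def fit_within(width: int, height: int, box: int = 500):
    """Scale (width, height) to fit inside a box x box square, keeping aspect ratio.

    Note that smaller images are enlarged as well as larger ones shrunk.
    """
    scale = min(box / width, box / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def resize_command(src: str, dst: str, box: int = 500, quality: int = 85) -> list:
    """Build an ImageMagick call that resizes, strips metadata, and compresses."""
    return ["magick", src, "-resize", f"{box}x{box}",
            "-strip", "-quality", str(quality), dst]
```

ImageMagick's `-resize 500x500` already preserves the aspect ratio, so `fit_within` is only needed when the final dimensions must be known in advance, for example to record them in the product database.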
Document Lifecycle: A Complete Example
Let’s walk through a real-world example: an invoice processing workflow.
The Process
- Invoice Arrives (PDF or image)
  - Scanned invoice uploaded to a folder or emailed.
- Extract Data (OCR + AI)
  - OCR flowlet: if image/low-quality PDF, extract text.
  - AI Data Extraction flowlet: reads invoice text, extracts vendor, invoice number, date, line items, total amount, due date.
- Validation
  - Router flowlet: checks extracted data against rules.
    - Missing required fields? Flag for manual review.
    - Amount exceeds approval limit? Route to manager approval.
    - Vendor not in system? Create vendor record.
- Reconciliation
  - If PO exists: match invoice to PO (Invoice.PoNumber = PO.Number).
  - If no match: alert the AP team.
  - Check for duplicates: is this invoice already in the system? Comparison flowlet searches prior invoices.
- Approval Gate (if over threshold)
  - If amount > $5,000: NeedsApproval flowlet sends to manager.
  - Manager reviews AI-extracted summary and approves/rejects.
  - If rejected, routes to exception queue.
- Post to Accounting
  - Create bill in QuickBooks with extracted data.
  - Schedule payment based on terms.
  - Attach original PDF to QB record.
- Archive and Notify
  - Store invoice PDF in Google Drive in folder: /Vendor Name/YYYY-MM-DD Invoice Number.
  - Log transaction in your invoice database with all metadata.
  - Email confirmation to AP team and requester.
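The validation, reconciliation, and approval rules in this process can be expressed as a single routing function. The sketch below is an illustration under assumptions, not flow8's Router flowlet: the `Invoice` fields, the route names, and the order in which the checks run are all hypothetical. Only the $5,000 threshold and the folder pattern come from the steps above.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional

APPROVAL_LIMIT = Decimal("5000")  # threshold from the Approval Gate step
REQUIRED = ("vendor", "invoice_number", "invoice_date", "total")


@dataclass
class Invoice:
    vendor: Optional[str]
    invoice_number: Optional[str]
    invoice_date: Optional[date]
    total: Optional[Decimal]
    po_number: Optional[str] = None


def route(inv: Invoice, open_pos: set, seen: set) -> str:
    """Decide the next step for one extracted invoice."""
    if any(getattr(inv, name) in (None, "") for name in REQUIRED):
        return "manual_review"          # missing required fields
    if (inv.vendor, inv.invoice_number) in seen:
        return "duplicate"              # already in the system
    if inv.po_number is not None and inv.po_number not in open_pos:
        return "alert_ap_team"          # PO referenced but no match found
    if inv.total > APPROVAL_LIMIT:
        return "manager_approval"       # over the approval threshold
    return "post_to_accounting"


def archive_folder(inv: Invoice) -> str:
    """Folder name following /Vendor Name/YYYY-MM-DD Invoice Number."""
    return f"/{inv.vendor}/{inv.invoice_date.isoformat()} {inv.invoice_number}"
```

Amounts are held as `Decimal` rather than `float` so that a total of exactly $5,000.00 compares against the limit without rounding surprises.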
Time Impact
Before:
- 5 invoices per hour per person (data entry + verification).
- 50 invoices per week = 10 person-hours.
- At roughly 200 invoices per week (40 person-hours), 1 FTE is dedicated to invoice processing.
After:
- 500 invoices per hour (all processed in parallel).
- 50 invoices per week = < 5 minutes for review/exceptions.
- No dedicated FTE.
Annual savings at the 1-FTE volume: a $70K salary plus benefits (about 30% on top) = $91K, plus error reduction, plus faster cash reconciliation.
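A quick check of the arithmetic behind these figures, as a sketch. The 40-hour work week is an assumption that does not come from the text; the other inputs are the numbers quoted above.

```python
INVOICES_PER_HOUR_MANUAL = 5
INVOICES_PER_WEEK = 50
HOURS_PER_WEEK = 40          # assumed full-time week
SALARY = 70_000
FULLY_LOADED_COST = 91_000   # salary + benefits, as quoted

# Manual effort at the quoted weekly volume.
manual_hours_per_week = INVOICES_PER_WEEK / INVOICES_PER_HOUR_MANUAL

# Weekly volume one full-time person can handle manually.
invoices_for_one_fte = INVOICES_PER_HOUR_MANUAL * HOURS_PER_WEEK

# Benefits rate implied by the $70K -> $91K figure.
benefits_rate = round(FULLY_LOADED_COST / SALARY - 1, 2)
```

With these inputs, 50 invoices per week is 10 hours of manual work, one full-time person covers about 200 invoices per week, and the implied benefits rate is 30% of salary.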
Document Intelligence with AI
Beyond extracting raw text, flow8 combines OCR with AI to understand documents at a deeper level.
Intelligent Classification
Scenario: You receive documents from customers (contracts, POs, invoices, complaints, feedback forms). You need to know what type of document each is and route accordingly.
Without AI: Manual inspection, 2 minutes per document.
With AI + OCR + flow8:
- Document uploaded.
- OCR extracts text.
- AI classification: “This is a contract. It’s a service agreement. Risk level: medium. Recommended action: legal review.”
- Router routes to appropriate team based on classification.
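The routing step at the end of this list can be sketched as a lookup with a confidence floor. The label names, team names, and the 0.8 threshold are assumptions for illustration; the idea of a confidence value alongside the classification follows the Classify (AI) module's output.

```python
# Hypothetical mapping from document type to the team that handles it.
ROUTES = {
    "contract": "legal",
    "purchase_order": "procurement",
    "invoice": "accounts_payable",
    "complaint": "customer_support",
    "feedback_form": "customer_success",
}


def route_document(label: str, confidence: float, threshold: float = 0.8) -> str:
    """Send confident, known classifications to a team; everything else to a human."""
    if confidence < threshold or label not in ROUTES:
        return "manual_review"
    return ROUTES[label]
```

Falling back to manual review for low-confidence or unrecognized labels keeps a misclassified document from being silently routed to the wrong team.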
Data Extraction with Context
Scenario: An invoice has a line item “2 hrs consulting @ $150/hr.” You need to extract the hours (2), rate ($150), and type of service (consulting).
Simple regex extraction fails when:
- Format varies (“2 hrs consulting…” vs “Consulting Services - 2 hours @ $150/hr”).
- Text is handwritten (scanned).
- Abbreviations are unclear (“hrs” vs “hours,” “cons.” vs “consulting”).
With AI: the model uses context. It reads “2 hrs consulting @ $150/hr” and identifies 2 hours, $150/hr, and consulting, even when the format varies.
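The brittleness of pattern matching is easy to demonstrate. The regular expression below is a hypothetical one written for the first invoice format; it parses that format but returns nothing for the second, even though both lines carry the same information.

```python
import re

# Pattern tailored to "<hours> hrs <service> @ $<rate>/hr" and nothing else.
LINE_ITEM = re.compile(
    r"(?P<hours>\d+(?:\.\d+)?)\s*hrs\s+(?P<service>\w+)\s*@\s*"
    r"\$(?P<rate>\d+(?:\.\d+)?)/hr"
)


def parse_line_item(text: str):
    """Return hours, service, and rate, or None when the format differs."""
    match = LINE_ITEM.search(text)
    if match is None:
        return None
    return {
        "hours": float(match["hours"]),
        "service": match["service"],
        "rate": float(match["rate"]),
    }


first = parse_line_item("2 hrs consulting @ $150/hr")
second = parse_line_item("Consulting Services - 2 hours @ $150/hr")
```

Here `first` holds the three extracted values while `second` is `None`. Each new vendor format would need another pattern, which is the maintenance burden that context-aware extraction avoids.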
Compliance and Auditability
Every document processed by flow8 is logged:
- Original document — stored in secure archive.
- Extraction metadata — what data was extracted, confidence scores, any manual corrections.
- Processing history — who reviewed it, what approvals were granted, when.
- Retention policy — how long to keep the document (configurable by type, e.g., 7 years for contracts, 3 years for invoices).
When auditors ask “Can you show us all contracts processed for Vendor X in 2024?”, you can pull a complete list with all metadata in seconds.
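An audit query like the one above, together with the retention rule, might be modeled as follows. This is a sketch under assumptions: the `AuditRecord` fields and function names are hypothetical, and the retention periods are the example values given in the list (7 years for contracts, 3 years for invoices).

```python
from dataclasses import dataclass
from datetime import date

# Example retention periods from the text, in years.
RETENTION_YEARS = {"contract": 7, "invoice": 3}


@dataclass(frozen=True)
class AuditRecord:
    document_id: str
    doc_type: str
    vendor: str
    processed_on: date


def retention_end(record: AuditRecord) -> date:
    """Date until which the document must be kept."""
    years = RETENTION_YEARS[record.doc_type]
    day = record.processed_on
    try:
        return day.replace(year=day.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap target year.
        return day.replace(year=day.year + years, day=28)


def contracts_for(records, vendor: str, year: int):
    """All contracts processed for one vendor in one calendar year."""
    return [r for r in records
            if r.doc_type == "contract"
            and r.vendor == vendor
            and r.processed_on.year == year]
```

Storing the document type and processing date on every record is what makes both the auditor's question and the retention deadline answerable from metadata alone.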
Document Processing Modules at a Glance
| Module | Input | Output | Use Case |
|---|---|---|---|
| Extract PDF Text | PDF file | Text string | Read text from PDF |
| OCR (Tesseract) | Image file or low-quality PDF | Text string | Read scanned documents |
| Generate PDF | HTML template + data | PDF file | Create reports, contracts, letters |
| Merge PDFs | List of PDF files | Single merged PDF | Combine documents |
| Split PDF | PDF file, page ranges | Multiple PDF files | Separate by page or section |
| Convert to PDF | Word, Excel, HTML file | PDF file | Standardize format |
| Image Processing | Image file | Resized/processed image | Resize, crop, compress images |
| Extract Data (AI) | Text/document | Structured JSON | Pull fields from documents |
| Classify (AI) | Document text | Classification + confidence | Determine document type |
| Add Watermark | PDF file | Watermarked PDF | Mark confidential or draft documents |
Getting Started
- Identify document-heavy processes in your business: intake forms, contracts, invoices, reports, applications.
- Pick one process to automate (start with high volume, low complexity).
- Build the flow in flow8: OCR → Extract → Validate → Route/Archive.
- Measure results: time saved, errors eliminated, documents processed per day.
- Expand to other processes.
Most organizations see ROI within the first month: the time savings from a single document workflow (intake, invoicing, or report generation) typically justify flow8’s cost.
Summary
Document processing in flow8 is:
- Fast — seconds instead of minutes per document.
- Scalable — processes 10 or 10,000 documents the same way.
- Accurate — OCR + AI extract data reliably, even from messy documents.
- Auditable — full history of every document, every extraction, every decision.
- Integrated — documents flow seamlessly between systems (email → OCR → extract → store → CRM update → archive).
If documents are a bottleneck in your business, flow8 is the solution.