Where an extraction rule gets its value from
At a Glance
- Difficulty: Intermediate
- Time required: ~15 minutes
- Prerequisites: Understanding Data Extraction
- What you'll learn: The eight data sources and when to use each
What is a data source?
Before an extraction rule can process a value, it must be clear where that value comes from. That is
exactly what the data source defines. While the data type determines how a value is interpreted,
the data source determines where it comes from - the visible document text, a barcode, the metadata,
the file information and more.
You select the data source in the rule editor under "General". Depending on your choice,
the program shows the matching settings.
Determine data from document text
The most important and most common source. The value is read from the visible text of the PDF - for
example via a keyword and the adjacent data area. The tutorial "Understanding Data Extraction"
explains the basics.
Important: This source only works with PDFs that contain searchable text.
Pure image PDFs (scans) must first be made searchable using text recognition (OCR).
Determine data from QR or barcode
Reads the content of QR codes and barcodes from the document. This is especially useful when
documents already carry a code that contains a unique identifier - such as a case or document number.
Example: Incoming documents carry a QR code with the case number. You read the
code and name the file after the case number.
Use metadata of the document
Accesses the metadata stored in the PDF - such as title, author, subject or the creation and
modification date held in the document. This information is not part of the visible text but belongs to the
properties of the file.
Example: You sort documents into different folders based on the author stored
in the PDF.
Use file information
Uses properties of the file itself - the file name, the path and the file system dates
(created/modified). Handy when the file name or storage location already contains usable information.
Example: The file name already contains a customer number that you want to
reuse for further filing.
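As a rough illustration of what this source provides, the following Python sketch collects the file name, folder and modification date with the standard library and pulls a customer number out of the name. The pattern ("K" followed by five digits), the sample file name and the helper name file_information are assumptions made for this example; they are not part of the program.

```python
import re
from datetime import datetime
from pathlib import Path

# Hypothetical pattern: a customer number written as "K" followed by 5 digits.
CUSTOMER_NUMBER = re.compile(r"K\d{5}")

def file_information(path):
    """Collect the file properties named above for one file."""
    p = Path(path)
    info = {"name": p.name, "stem": p.stem, "folder": str(p.parent)}
    if p.exists():
        # Modification date from the file system. The creation date is
        # platform dependent and therefore left out of this sketch.
        info["modified"] = datetime.fromtimestamp(p.stat().st_mtime)
    match = CUSTOMER_NUMBER.search(p.stem)
    info["customer_number"] = match.group(0) if match else None
    return info

info = file_information("inbox/K10482_invoice_march.pdf")
print(info["customer_number"])  # K10482
```

The file does not have to exist for the name-based part to work, which mirrors the point above: the file name alone can already carry the value.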
Use custom text
Provides a fixed text that you specify yourself - independent of the document content. This is useful
for fixed building blocks or as a fallback value: create a second rule with the same name, and it
supplies the value whenever the actual extraction finds nothing.
Note: With this source, only the Text data type is available.
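The fallback idea can be sketched in a few lines of Python. This only illustrates the principle "first rule that finds a value wins"; how the program evaluates rules internally is not described here, and the names resolve, order_number_from_text and order_number_fallback as well as the keyword "Order no.:" are invented for the example.

```python
def resolve(rules, document_text):
    """Return the first non-empty value among rules that share a name."""
    for rule in rules:
        value = rule(document_text)
        if value:
            return value
    return None

def order_number_from_text(text):
    # Stand-in for a real extraction: the word after the keyword "Order no.:".
    keyword = "Order no.:"
    if keyword not in text:
        return None
    rest = text.split(keyword, 1)[1].split()
    return rest[0] if rest else None

def order_number_fallback(text):
    # Custom text: a fixed value, independent of the document content.
    return "UNKNOWN"

# Both functions play the role of two rules with the same name.
rules = [order_number_from_text, order_number_fallback]
print(resolve(rules, "Order no.: 4711 Date: 2024-03-05"))  # 4711
print(resolve(rules, "No keyword in this document"))       # UNKNOWN
```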
Use placeholder value
This source builds on the result of another rule. This lets you process already extracted values
further or combine several values, without setting up the same extraction again.
Example: One rule reads the invoice date. A second rule takes this value and
outputs the date in a different format.
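To make the invoice date example concrete, here is a minimal Python sketch of a second step that reuses an already extracted value instead of reading the document again. The rule names InvoiceDate and InvoiceDateIso, the sample date and the two date formats are assumptions chosen for illustration.

```python
from datetime import datetime

def reformat_date(value, source_format="%d.%m.%Y", target_format="%Y-%m-%d"):
    """Parse an already extracted date string and return it in another notation."""
    return datetime.strptime(value, source_format).strftime(target_format)

# Result of the first rule (made-up value).
extracted = {"InvoiceDate": "05.03.2024"}

# The second rule builds on that result rather than extracting again.
extracted["InvoiceDateIso"] = reformat_date(extracted["InvoiceDate"])
print(extracted["InvoiceDateIso"])  # 2024-03-05
```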
Use form data
Reads the content of fillable PDF form fields (such as text fields or check boxes). This requires the
PDF to contain real form fields - not just printed text. You can find a detailed guide at
Extract PDF form data.
Use sequential number
Generates an automatically incrementing number - for example a consecutive document number. The
numbering is managed via named counters that you maintain centrally. Several rules or profiles that use the same
counter share one sequence of numbers that is guaranteed to be unique and gap-free.
Example: Each processed invoice receives a consecutive internal number such as
000123, 000124, 000125 - with a freely selectable start value and format.
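The behaviour of a named counter with a start value and a zero-padded format can be sketched as follows. This is an in-memory illustration only: the class Counters, the counter name InvoiceNo and the width parameter are hypothetical, and a real implementation would have to store the counter persistently and protect it against simultaneous access to keep the sequence unique and gap-free.

```python
class Counters:
    """Named counters; everything that uses the same name shares one sequence."""

    def __init__(self):
        self._next = {}

    def define(self, name, start=1):
        # Register a counter with a freely selectable start value.
        self._next[name] = start

    def next_value(self, name, width=6):
        # Hand out the current number, then advance the counter by one.
        number = self._next[name]
        self._next[name] = number + 1
        return str(number).zfill(width)

counters = Counters()
counters.define("InvoiceNo", start=123)
print([counters.next_value("InvoiceNo") for _ in range(3)])
# ['000123', '000124', '000125']
```

Because both rules and profiles would ask the same named counter for the next value, no number is handed out twice and none is skipped.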