Extract from PDF

This action extracts text and images from a PDF document contained as binary data in a selected binary variable.

Typically, the PDF document has been downloaded into the variable using an Extract Target step. The output from the "Extract from PDF" action is an HTML page containing the text and images extracted from the PDF document. In subsequent steps, the desired information can then be extracted from the page, in the same way as for other HTML pages.

Note that PDF documents do not contain structure information such as tables or paragraphs. They contain only the positions of text and graphics, which may or may not be arranged to look like tables or paragraphs. This can make it difficult to extract the desired information from PDF documents. However, the Extract from PDF step applies some heuristics to group the text into HTML paragraphs based on the available position information.
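
To illustrate how position information alone can be turned into paragraphs, here is a minimal Python sketch. It is not the heuristic used by the Extract from PDF step, which is not documented here. The Fragment type, the group_into_paragraphs function, and the two thresholds are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text with its position on the page (hypothetical type)."""
    text: str
    x: float  # left edge, in points
    y: float  # distance from the top of the page, in points


def group_into_paragraphs(fragments, line_tolerance=2.0, paragraph_gap=18.0):
    """Group positioned fragments into paragraphs (lists of line strings).

    Both thresholds are assumptions for illustration only.
    """
    # Read top to bottom, then left to right.
    ordered = sorted(fragments, key=lambda f: (f.y, f.x))

    # Step 1: fragments with nearly the same y belong to the same line.
    lines = []  # each entry is (y of the line, fragments on the line)
    for frag in ordered:
        if lines and abs(frag.y - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))

    # Step 2: a large vertical gap between lines starts a new paragraph.
    paragraphs = []
    previous_y = None
    for y, frags in lines:
        text = " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        if previous_y is None or y - previous_y > paragraph_gap:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)
        previous_y = y
    return paragraphs


sample = [
    Fragment("world", 60, 100.0),
    Fragment("Hello", 10, 100.5),
    Fragment("Second line", 10, 112.0),
    Fragment("New paragraph", 10, 150.0),
]
print(group_into_paragraphs(sample))
```

A sketch like this also shows why the result can be wrong: a table row and a line of running text look identical when only positions are known.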

Properties

The "Extract from PDF" action can be configured using the following properties:

PDF Variable:
The binary variable containing the PDF document as binary data.
Include Images:
Specifies whether embedded images should be extracted. Note that not all images and graphics can be extracted from PDF documents; it depends on the way they have originally been embedded in the document.
Include Positioning:
Specifies whether the positions of the texts should be extracted. The positions may be useful to derive the structure of the document.
Include Formatting:
Specifies whether the formatting (font names, sizes etc.) of the texts should be extracted. Like the positions, the formatting may be useful to derive the structure of the document.
Merge Text:
By default, the converter that generates the HTML from the PDF merges text on the same line into one HTML element, even if the text is represented as separate pieces in the PDF document. Though this is often desirable, in some cases text that was originally far apart is merged and appears to be right next to each other. A typical case where it is desirable to turn this feature off is a document that contains more than one column. With the feature turned off, the converter attempts to preserve the column structure.
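
The effect of the Merge Text property can be pictured with a small Python sketch. It only models the behavior described above and does not reproduce the converter itself. The line_elements function and its (x, text) input format are assumptions made for the example.

```python
def line_elements(fragments, merge_text=True):
    """Return the text elements produced for one line of the page.

    fragments: list of (x, text) tuples that share the same line
    (a hypothetical input format used only for this illustration).
    """
    ordered = sorted(fragments, key=lambda f: f[0])
    texts = [text for _, text in ordered]
    if merge_text:
        # One element per line: text from different columns runs together.
        return [" ".join(texts)]
    # One element per text piece: each column keeps its own element.
    return texts


# A single line taken from a two-column page.
two_column_line = [
    (320, "Column two starts here"),
    (50, "end of column one."),
]

print(line_elements(two_column_line, merge_text=True))
print(line_elements(two_column_line, merge_text=False))
```

With merging on, the two columns end up in one element and read as a single sentence. With merging off, they stay separate, so later steps can tell them apart, for example by using the positions from Include Positioning.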