Performing Common Tasks

This section covers some common extraction tasks that you should be familiar with.

Extracting Only Part of a Text

If you want to extract only a part of the text in a tag, then you can use patterns on the text in the tag. For example, you might want to extract the name "Bob Smith" from the following text: "The article is written by Bob Smith." To do this, use the Extract data converter (do not confuse this with the Extract step action) and configure it as shown below.

Using the Extract Data Converter

The principle is to configure the Pattern property so that it matches the entire text, with the text to extract matched by a subpattern enclosed in parentheses. In this case, the pattern is ".*by\s(.*)\.", which means that everything between "by " and the final period is matched by the subpattern and becomes the extracted text. For more information on patterns, see Patterns.
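The same principle can be tried outside the product. The following is a minimal sketch in Python, assuming that Python's re syntax behaves like the product's pattern syntax for this simple pattern; the function name extract_author is hypothetical and is not part of the Extract data converter.

```python
import re

# Same pattern as in the text: the whole text must match, and the
# parenthesized subpattern captures the part to keep.
PATTERN = re.compile(r".*by\s(.*)\.")


def extract_author(text):
    """Return the text captured by the subpattern, or None if the text does not match."""
    match = PATTERN.fullmatch(text)  # fullmatch requires the pattern to cover the entire text
    return match.group(1) if match else None


print(extract_author("The article is written by Bob Smith."))  # prints: Bob Smith
```

Because the pattern has to match the entire text, a text without "by " followed by a closing period yields no result at all rather than a partial one.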

Converting Content

Conversion is used whenever you want to normalize content, such as when one text should be replaced by another text. For example, you might want to normalize country codes to their natural language description, e.g. "US" should be normalized to "United States". For plain text conversions, you should use the Convert Using List data converter. For conversions based on patterns or expressions, you should use the If Then data converter.
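The difference between the two kinds of conversion can be sketched in Python as follows. This is only an analogy, not how the Convert Using List or If Then data converters are implemented. The mapping contents, the function names, and the choice to leave unlisted text unchanged are all assumptions made for illustration.

```python
import re

# Hypothetical conversion list: text to replace -> replacement text.
COUNTRY_NAMES = {"US": "United States", "DK": "Denmark"}


def convert_using_list(text, mapping):
    """Plain text conversion: look the text up in a list.

    Assumption: text that is not in the list is returned unchanged.
    """
    return mapping.get(text, text)


def convert_if_then(text):
    """Pattern-based conversion: if the text looks like a two-letter code,
    then normalize it and convert it; otherwise leave it as it is."""
    candidate = text.strip()
    if re.fullmatch(r"[A-Za-z]{2}", candidate):
        return convert_using_list(candidate.upper(), COUNTRY_NAMES)
    return text


print(convert_using_list("US", COUNTRY_NAMES))  # prints: United States
print(convert_if_then(" us "))                  # prints: United States
```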

Number Extraction and Formatting

Whenever you want to extract a number from some content, use the Extract Number data converter. If the number needs further formatting, add a Format Number data converter after it to reformat the output of the Extract Number data converter.
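The two-stage idea, first extract and then format, can be illustrated with a short Python sketch. It does not reproduce the behavior of the Extract Number or Format Number data converters. The function names, the sample text, and the assumption that numbers use "," as the thousands separator and "." as the decimal separator are all hypothetical.

```python
import re

# Assumption: "," separates thousands and "." separates decimals.
NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def extract_number(text):
    """Stage 1: find the first number in the text and return it as a float."""
    match = NUMBER.search(text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))


def format_number(value, decimals=2):
    """Stage 2: reformat the extracted number with a fixed number of decimals."""
    return f"{value:,.{decimals}f}"


price = extract_number("Price: USD 1,299.5 incl. tax")
print(format_number(price))  # prints: 1,299.50
```

Keeping the two stages separate means the formatting step never has to deal with the surrounding text, only with a clean number.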

Extracting the Date from a Text

Extract dates in the same way as numbers. Use the Extract Date data converter to extract the date from any text. Extract Date uses patterns to find the date; the pattern does not have to match the entire text, only the date itself. The extracted date is then converted to a standard date format, which you can reformat in any way using a Format Date data converter.
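As a rough illustration of this extract-then-format flow, the following Python sketch finds a date inside a longer text, converts it to a date value, and then reformats it. It is not the Extract Date implementation. The function name, the sample text, and the supported input form (day, English month name, year) are assumptions.

```python
import re
from datetime import datetime

# The pattern only has to match the date, not the whole text.
DATE = re.compile(r"\d{1,2} [A-Za-z]+ \d{4}")


def extract_date(text):
    """Return the first date of the form '3 March 2021' found in the text, or None."""
    match = DATE.search(text)
    if match is None:
        return None
    try:
        # Assumes English month names (the default C locale).
        return datetime.strptime(match.group(0), "%d %B %Y").date()
    except ValueError:
        return None  # matched the shape of a date, but not a valid one


found = extract_date("Posted on 3 March 2021 by the editor")
print(found.isoformat())            # standard form: 2021-03-03
print(found.strftime("%d/%m/%Y"))   # reformatted: 03/03/2021
```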

Using the Extract Date Data Converter

Below are two videos explaining how to extract dates from text.

Video Tutorial on Simple Date Extraction

Video Tutorial on Complex Date Extraction

Extracting Only a Subset of the Tags in the Found Tag

Sometimes, you want to extract from a range of tags rather than a single tag. The Extract action lets you specify a range of tags by specifying the first tag and the last tag in the range.

For example, consider extracting the body text of an article in which the body is made up of individual sections, each in its own tag, while the article title and author are contained in other tags. To extract only the body text, without the title and author, use the Extract action and configure it so that only the range of tags spanning the body is extracted.
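The effect of extracting a range of tags can be sketched in Python with the standard library XML parser. This is only an analogy to the Extract action. The sample markup, the function name, and the use of child positions to identify the first and last tag in the range are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical article: title and author, then three body sections, then a footer.
ARTICLE = """<div>
  <h1>Title</h1>
  <p class="author">By Bob Smith</p>
  <p>First section.</p>
  <p>Second section.</p>
  <p>Third section.</p>
  <p class="footer">Footer</p>
</div>"""


def extract_tag_range(parent, first_index, last_index):
    """Return the text of the child tags from first_index to last_index, inclusive."""
    children = list(parent)
    selected = children[first_index:last_index + 1]
    return " ".join("".join(child.itertext()).strip() for child in selected)


root = ET.fromstring(ARTICLE)
print(extract_tag_range(root, 2, 4))  # prints: First section. Second section. Third section.
```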