Patterns

What follows is a video on patterns and their use in Design Studio. The first half is a lecture-like presentation of the syntax; the second half looks closer at some use-case examples in Design Studio. Answers to the problems given in the video can be found at the bottom of the page.

Video Transcript:

Hello. This video will take a closer look at regular expressions, called patterns in Kapow Katalyst. The first half of the video will be a lecture-like presentation of the syntax including wild cards, sets, subpatterns, repetition operators, alternate subpatterns, and subpattern references. The second half will go through three examples in Design Studio using patterns to create conditions and tag finders and to perform data conversion. If you're already familiar with regular expressions you might want to skip directly to the examples.

As mentioned earlier, regular expressions are called patterns in Kapow Katalyst, and will be referred to as patterns for the remainder of this video.

The Wild Card

A pattern is a way to put a string of characters into more general terms by using symbols to represent strings of characters. You might be familiar with the concept from doing searches on your computer where it is sometimes possible to use wild card symbols to represent any character. Doing a search for 'ca*' (the asterisk being the wild card in this example) might return "cap", "car", "can", and so on. Patterns embrace this same concept while expanding to a much more extensive syntax, which will be presented here.

Kapow Katalyst uses the Perl5 syntax for its patterns. In this syntax the wild card character is symbolized with '.' (a dot or period) which corresponds to any single character including all symbols, whitespace, and any other special characters you could think of. This correspondence is called matching, so the pattern 'ca.' is for example said to match "cap", "car", "can", or any other string of "c" followed by "a" followed by any single character. Similarly the pattern '.a.' matches "nap", "tan", "sad", or any other string of three characters with an "a" as the middle character. It does not, however, match "an" since each '.' in the pattern has to match up against exactly one character. Similarly it does not match "cans" since the pattern has to match the entire string, not just part of it.
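The whole-string matching described above can also be tried outside Design Studio. The following is a minimal sketch in Python, assuming that its `re` module behaves closely enough to the Perl5-style syntax for these simple patterns. The helper `matches` is hypothetical and not part of Kapow Katalyst. It uses `re.fullmatch` because the pattern has to match the entire string, and `re.DOTALL` so that '.' also matches line breaks.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True only if the pattern matches the entire text."""
    # re.DOTALL makes '.' match line breaks as well as every other character.
    return re.fullmatch(pattern, text, flags=re.DOTALL) is not None

print(matches("ca.", "cap"))   # True
print(matches(".a.", "tan"))   # True
print(matches(".a.", "an"))    # False: each '.' needs exactly one character
print(matches(".a.", "cans"))  # False: the whole string must be matched
```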

We can test whether a pattern matches a given string directly in Design Studio by using the Pattern Editor. The Pattern Editor can for example be found by inserting a Test Tag step into a robot and clicking "Edit…" below the pattern field in the step action view. The Pattern Editor has three sections. At the top it is possible to type in a pattern, which is then matched to the string typed into the Input field on the left. Clicking the Test button or using the shortcut Ctrl+Enter will tell you whether the pattern matches the input.

Try typing '.a.' in the Pattern field and "can" in the Input field. Then use the shortcut Ctrl+Enter. The Output field will now display "The pattern matches the input." We can ignore the rest of the output for now. If we on the other hand type just "an" in the input field and press Ctrl+Enter we receive the message that "The pattern does not match the input." As I go over more of the pattern syntax, try to experiment with the Pattern Editor to test your understanding of the material.

Although not stated explicitly, we are now able to match in two different ways: either we can match character to character (i.e. the pattern 'a' matches the string "a") or we can use the wild card symbol '.' to match any character. Additional direct character matching includes the ones listed in the table here.

Pattern    Matches the string
'\n'       A line break character.
'\r'       A carriage return character.
'\t'       A tab character.
'\.'       "."
'\\'       "\"

Any other symbol used by the pattern syntax can also be explicitly matched by preceding it by a backslash '\'.

Sets

The next step is to match a semi-known character. By semi-known I mean that we only want to match the character with one character in a set of characters. A set of characters is stated in a pattern by using '[]' (brackets). An example is '[abc]' (the set of a, b, c) which will match either "a", "b" or "c" but will not match any characters other than these three.

If you wish to include a range of characters in a set, it can be done using a '-' (dash or hyphen). '[abc]' can therefore be written as '[a-c]' (the set of characters: a through c). In words, '[a-c]' means match any character in the range from "a" to "c". The two ways of defining sets can be combined to get something like '[a-dkx-z]' (the set of a through d, k, and x through z), which is equivalent to writing '[abcdkxyz]' (writing out all those characters in a set) or saying match any character which is either in the range "a" to "d", is "k", or is in the range "x" to "z".

It is also possible to define sets negatively by using '[^]' (a caret at the beginning of the set). An example is '[^a-c]' (the negative set of a through c) which will match any one character excluding "a", "b", and "c".
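The set syntax can be checked with a few lines of code. This is a small sketch in Python, assuming its `re` module treats these sets the same way as the Perl5 syntax. The helper name `matches` is made up for the illustration.

```python
import re
import string

def matches(pattern: str, text: str) -> bool:
    """Return True only if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

print(matches("[abc]", "b"))    # True
print(matches("[a-c]", "d"))    # False: "d" is outside the range a through c
print(matches("[^a-c]", "z"))   # True: anything except "a", "b", "c"
print(matches("[^a-c]", "a"))   # False

# The range form and the written-out form accept exactly the same letters.
same = all(
    matches("[a-dkx-z]", ch) == matches("[abcdkxyz]", ch)
    for ch in string.ascii_lowercase
)
print(same)                     # True
```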

In the Pattern Editor, try using sets to match (1) any digit (2) any whitespace character (3) anything that is not a digit. You can pause the video if you want to take a moment to think about these problems before seeing the answers (the answers are all given at the bottom of this page).

There are certain shortcuts which can be used for sets that are often used. Here is a table showing some of the most important ones.

Shorthand form    Set
'\d'              '[0-9]' (Any digit)
'\D'              '[^0-9]' (Any non-digit)
'\s'              '[ \n\r\t]' (Any whitespace character)
'\S'              '[^ \n\r\t]' (Any non-whitespace character)
'\w'              '[a-zA-Z0-9_]' (Any word character)
'\W'              '[^a-zA-Z0-9_]' (Any non-word character)

Note that the shorthand form can also be used inside sets. For example '[\d\s]' includes all digit and whitespace characters.
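To see that a shorthand and its written-out set accept the same characters, the two can be compared one character at a time. The sketch below uses Python's `re` module as a stand-in, which is an assumption: Python's shorthands are Unicode-aware by default, so `re.ASCII` is passed to bring them closer to the table. Even then, Python's '\s' also accepts form feed and vertical tab, so it is only spot-checked here. The helper `same_matches` is hypothetical.

```python
import re
import string

# Shorthand forms and the explicit sets they are said to correspond to.
PAIRS = {
    r"\d": "[0-9]",
    r"\D": "[^0-9]",
    r"\w": "[a-zA-Z0-9_]",
    r"\W": "[^a-zA-Z0-9_]",
}

def same_matches(shorthand: str, explicit: str, alphabet: str) -> bool:
    """True if both patterns accept exactly the same single characters."""
    return all(
        bool(re.fullmatch(shorthand, ch, re.ASCII)) == bool(re.fullmatch(explicit, ch))
        for ch in alphabet
    )

for shorthand, explicit in PAIRS.items():
    print(shorthand, explicit, same_matches(shorthand, explicit, string.printable))

# A set built from the digit and whitespace shorthands.
digit_or_space = r"[\d\s]"
print(bool(re.fullmatch(digit_or_space, "7", re.ASCII)))   # True
print(bool(re.fullmatch(digit_or_space, "\t", re.ASCII)))  # True
print(bool(re.fullmatch(digit_or_space, "x", re.ASCII)))   # False
```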

Subpatterns

Next we will need to talk about subpatterns within patterns. Terms we have talked about so far such as a character 'a', a set '[abc]', an escaped character '\d' or the wildcard '.' can each be seen as a subpattern. Alternatively we can create our own subpatterns by grouping together other subpatterns using '()'. We could for example create a subpattern from '[ctb]an' by writing '([ctb]an)'.

It is important to recognize these since I will now be introducing some operators which work on the entire subpattern they follow.

Repetition Operators

Operators in patterns allow us to match repetitions of a subpattern by following them with one of the operators given in the table.

Repetition Operator      Meaning
'{m,n}' where n ≥ m      Matches between m and n repetitions (inclusively) of the preceding subpattern.
'{m,}'                   Matches m or more repetitions of the preceding subpattern.

For example the pattern 'a{1,}' would match the string "a", "aa", "aaa", or any number of repetitions of 'a'. The pattern '([bn]a){3,3}' would match 'banana', 'babana', 'nabana', or any other string of either "b" or "n" followed by "a" repeated three times. Try it out for yourself.
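The two repetition examples above can be reproduced as follows. This is a sketch in Python, on the assumption that its `re` module handles '{m,n}' the same way as the Perl5 syntax. `matches` is a hypothetical helper.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True only if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# 'a{1,}' needs at least one "a".
print(matches("a{1,}", "aaa"))             # True
print(matches("a{1,}", ""))                # False

# '([bn]a){3,3}' needs exactly three repetitions of "ba" or "na".
print(matches("([bn]a){3,3}", "banana"))   # True
print(matches("([bn]a){3,3}", "babana"))   # True
print(matches("([bn]a){3,3}", "bana"))     # False: only two repetitions
print(matches("([bn]a){3,3}", "bananana")) # False: four repetitions
```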

As for the sets, there are also shorthand versions of the most useful repetition operators as shown in this table.

Shorthand operator    Corresponds to
'{m}'                 '{m,m}'
'?'                   '{0,1}'
'*'                   '{0,}'
'+'                   '{1,}'

Try using what we have learned so far to match (4) anything (5) either "color" spelled without a "u" or "colour" spelled with a "u" (6) any four digit number. The answers follow (at the bottom of the page).

One of the often used patterns is '.*' which matches anything: any string even if it's empty.

Now try extending this and find patterns that match (7) any text containing at least one digit (8) any text containing just one digit. Here is a list of the syntax you may need (video only).

The syntax used in the answers is very useful when matching specific subpatterns within a string.

Alternative Subpatterns

We discussed how to match alternative characters earlier, but what about matching alternative subpatterns? If we have N subpatterns 'p1' through 'pN', we can match any one of these subpatterns using '(p1|p2|…|pN)' (parentheses and vertical bars as shown here). The pattern given here '(abc|a{5}|\d)' would for example match either "abc", "aaaaa" or any single digit.

Try using alternative subpatterns to make a pattern that matches (9) a string which does not contain just one digit. Here, again, is the syntax you might need. The answer is given at the bottom of the page.

There is no not operator in the syntax, instead the answer uses two alternatives. The first alternative matches a string with no digits, the second matches any string containing at least two digits.
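Alternatives are easy to experiment with in code as well. The sketch below again uses Python's `re` module as an assumed stand-in for the Pattern Editor, with a hypothetical `matches` helper. It tries the example pattern and the two-alternative pattern from answer (9).

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True only if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

example = r"(abc|a{5}|\d)"
print(matches(example, "abc"))     # True
print(matches(example, "aaaaa"))   # True
print(matches(example, "7"))       # True
print(matches(example, "42"))      # False: '\d' is a single digit

# Answer (9): no digits at all, or at least two digits.
not_one_digit = r"(\D*|.*\d.*\d.*)"
print(matches(not_one_digit, "no digits"))  # True
print(matches(not_one_digit, "a1b2"))       # True
print(matches(not_one_digit, "a1b"))        # False: exactly one digit
```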

Subpattern References

The last major part of the syntax to cover is subpattern references. Any substring, "s1" through "sN", matched by a parenthesized subpattern, '(p1)' through '(pN)' in any one pattern, can be referenced by using '\1' through '\N', where the subpatterns are numbered from left to right in the order they are stated in the pattern. Matching '([chm])(at)' to "cat" for example, we could use the reference '\1' to refer to "c" and '\2' to refer to "at".

The string matched by the entire pattern can always be referenced by '\0'.

Notice here that we are referring to the string matched by that subpattern rather than the subpattern itself. A reference to the subpattern '(abc)' would of course yield 'abc' whereas a reference to the subpattern '(\d)' would only match whatever digit was matched by the original subpattern.

As an example consider matching a string containing a quote by using the pattern '.*(['"]).*\1.*' (anything followed by a single or double quote followed by anything followed by a reference followed by anything). This may look confusing but the only thing you really need to notice is that the reference will match the same type of quote which was matched by the subpattern. In other words, this pattern would match both the string He said "hello" with double quotes and He said 'hello' with single quotes. I have purposefully not quoted the two strings here to avoid confusion.

As I will show you later in Design Studio, subpatterns can also be referred to in certain expressions outside of patterns. This is useful when extracting certain parts of a matched string. Taking our quotes example we could add parentheses around the subpattern enclosed by quotes '.*(['"])(.*)\1.*'. Now we are able to extract the quote in Design Studio.
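The quote example can be sketched in Python, assuming its `re` module resolves '\1' the same way. The function name `quoted_text` and the sample sentences are invented for the illustration. Group 2 plays the role of the extra subpattern around the quoted text.

```python
import re

# Group 1 captures the opening quote; \1 demands the same quote to close it.
QUOTE_PATTERN = r""".*(['"])(.*)\1.*"""

def quoted_text(sentence: str):
    """Return the text between matching quotes, or None if nothing matches."""
    m = re.fullmatch(QUOTE_PATTERN, sentence)
    return m.group(2) if m else None

print(quoted_text('He said "hello" to me'))   # hello
print(quoted_text("He said 'hello' to me"))   # hello
print(quoted_text("He said \"hello' to me"))  # None: the quotes differ
```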

Here is another problem. Try using subpattern references to match (10) four of the same digit (11) a string where at least two characters are the same. The answers are given at the bottom of the page.

Fewer Repetitions

When using subpattern references it is handy to know the following. By default, the repetition pattern operators (*, +, {...}) will match as many repetitions of the preceding pattern as possible. You can put a "?" after a repetition operator to instead make it match as few repetitions as possible.

(12) Try matching a subpattern to the first occurrence of a digit in a string. The answer is given at the bottom of the page.

Removing '?' would result in matching the subpattern to the last occurrence of a digit in the string.
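The difference between the two forms shows up in which digit the subpattern captures. Here is a minimal sketch in Python, assuming the same greedy and reluctant behavior as in the Perl5 syntax. The helper `captured_digit` and the sample string are made up.

```python
import re

def captured_digit(pattern: str, text: str):
    """Return what the first parenthesized subpattern matched, or None."""
    m = re.fullmatch(pattern, text)
    return m.group(1) if m else None

text = "a1b2c3"
# '.*?' takes as few characters as possible, so the first digit is captured.
print(captured_digit(r".*?(\d).*", text))  # 1
# '.*' takes as many characters as possible, so the last digit is captured.
print(captured_digit(r".*(\d).*", text))   # 3
```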

Using Patterns in Design Studio

Now that we have learned the syntax of patterns it is time to look at the various use-cases in Design Studio.

Conditions

Creating conditions is the first way of using patterns intelligently in robots. The Test Tag step action is particularly relevant in this context so let's go over a common use case.

Here I have a robot which extracts all the engineering jobs that LinkedIn has listed for Denmark. The robot uses a loop to extract the URL, title, and company name from each job and return them to the user. But let's say I only want to extract from jobs which contain the words "Copenhagen" and "software", indicating that they are probably looking for software engineers in Copenhagen.

First, I insert a new step after the For Each step and assign to it the Test Tag action by clicking on the new step to select it and choose Test Tag from the drop down in the step action view. I ensure that the tag finder finds the entire job post of the current iteration of the loop. Then I iterate through the loop until I find a job offering which matches the criteria I am about to set. This makes it easier to test that the pattern I write will actually work.

Going to the action tab in the step view, I first choose to match only against text (not the entire HTML), then press edit on the pattern. I am now in the Pattern Editor and I can type a pattern to be matched. Since I do not know the order in which the two words "software" and "Copenhagen" might occur, I need to make two alternative subpatterns. In the first alternative I can have Copenhagen followed by anything followed by software. In the second alternative I write the same but in reverse order. Finally I add "any text" before and after the alternatives and press Ctrl+Enter to test whether the pattern matches. It matches!
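The transcript does not spell out the pattern typed in the video. One way to write the pattern as described (two alternatives, with "any text" before and after) is sketched below in Python. The exact pattern, the helper `keep_job`, and the sample job texts are all assumptions made for illustration. `re.DOTALL` is used so that "any text" can span line breaks in a job post.

```python
import re

# Either order of the two words, with any text before, between, and after.
JOB_PATTERN = r".*(Copenhagen.*software|software.*Copenhagen).*"

def keep_job(job_text: str) -> bool:
    """True if the job text contains both words (case-sensitive), in any order."""
    return re.fullmatch(JOB_PATTERN, job_text, flags=re.DOTALL) is not None

jobs = [
    "Senior software engineer\nLocation: Copenhagen",
    "Copenhagen office seeks software developers",
    "Mechanical engineer, Aarhus",
    "Sales manager, Copenhagen",
]
kept = [job for job in jobs if keep_job(job)]
print(len(kept))  # 2
```

As written, the pattern is case-sensitive, so a post that only says "Software" with a capital letter would be skipped.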

I close the Pattern Editor and set the Test Tag step to Skip the Following Steps if the Pattern Does Not Match the Found Tag. This way the job post will be skipped if it does not contain the two words specified.

I now go ahead and run the robot in Debug Mode. As expected only a few results are extracted and they should all contain the words "software" and "Copenhagen".

Tag Finders

Patterns can also be used in tag finders. This can be very useful if you know the structure of the information you are looking for but you do not know where on the page it is located. This robot for example goes to multiple different sites to extract the price of a certain pair of headphones. Since we cannot know where on the page to find the price, patterns play a crucial part in determining exactly this.

Let me show you how to set up the extraction step. I'll delete the one I already have, insert a new step and choose for it the Extract action. To configure the step I start by inserting a number converter which extracts the number from any text I might extract. Then I choose to extract into the price attribute of the variable I have made for this robot.

Going to the Finders tab in the step view, I click plus to add a Tag Finder. I locate the price on the page. I can see that it is isolated in its own tag, with nothing else in that tag. This is typical so we will let our pattern match this case. In the Finders View there is a field called Tag Pattern. Immediately we can write the pattern '\$[\d\.]+' (dollar sign followed by one or more digits or dots). The pattern is designed to match any tag containing only a dollar sign followed by a decimal number. I click the magnifying glass in the upper right corner of the page view, which shows me what the Tag Finder finds. Unfortunately it finds the cart balance instead of the headphone price. The cart balance will always be $0 for these kinds of sites, so to avoid this mistake, I will make sure that the first digit in my tag is not a zero. Fortunately, the steep price of headphones ensures that the price will never start with a zero. Rewriting the pattern I get '\$[1-9][\d\.]+' (dollar sign followed by a digit which is not a zero followed by one or more digits or dots) which finds the correct price on the page when I click the magnifying glass.
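The effect of the two tag patterns can be imitated on a list of tag texts. This is only a sketch in Python: the Tag Finder itself works on the page structure, and the list `tag_texts` and the helper `find_tags` are hypothetical stand-ins for it.

```python
import re

FIRST_TRY = r"\$[\d\.]+"       # dollar sign, then one or more digits or dots
REFINED = r"\$[1-9][\d\.]+"    # same, but the first digit must not be zero

def find_tags(pattern: str, texts: list) -> list:
    """Return the tag texts that the pattern matches in their entirety."""
    return [t for t in texts if re.fullmatch(pattern, t)]

# Made-up tag texts from an imaginary shop page.
tag_texts = ["$0.00", "Price: $129.99", "$129.99", "Add to cart"]

print(find_tags(FIRST_TRY, tag_texts))  # ['$0.00', '$129.99']
print(find_tags(REFINED, tag_texts))    # ['$129.99']
```

Note that the refined pattern needs at least two characters after the dollar sign, so a tag containing only "$5" would not be found.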

Before testing the robot I go to the error handling tab of the Extract step and choose to Ignore and Continue on error. If the Tag Finder fails to find the price on the page it should just return the default value of the price attribute which is set to -1. This gives me a clear indication that the robot was not able to find the price. Going to Debug Mode and looking at the results from an earlier execution of this robot, we see that many of the prices are extracted correctly. The method is of course flawed but it can be surprisingly effective at times.

Data Conversion

The final use for patterns is to convert data from one form into another… For this we can either use one of the data converter lists embedded in a step or use the dedicated Convert Variables step.

In this very simple example, I am extracting the author and date from a blog post. Unfortunately, the two pieces of information are contained by the same string of text and are therefore extracted collectively by the extract step. I will now show you how to separate these two pieces of information using patterns in data converters.

The extract step has a data converter list located in the step action view. The data converter list can be used to convert the extracted text before it is assigned to a variable. I click the plus and choose Extract to insert a data converter which can extract part of the string. A new window opens where I can configure the Extract data converter. At the top there is a pattern, and at the bottom there is a test input and a test output similar to those of the Pattern Editor. The idea with the Extract converter is to write a pattern which matches the entire input string, and then specify the subpattern to be extracted by using parentheses. By default, the entire string is matched AND extracted, resulting in identical input and output strings.

If I want to exclude something from the extracted string I just have to write it outside of the subpattern. Let me precede the subpattern with '.* by ' (any text followed by "space", b, y, "space"). Now the entire string is still matched, but only the name of the author will be part of the subpattern, and therefore the author's name will be extracted as shown in the Test Output field. The plain text ' by ' forces the two instances of '.*' (any text) to match the date and the author name respectively.
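The idea behind the Extract converter (match the whole string, keep only the parenthesized part) can be sketched in Python. The function `extract` and the sample input are hypothetical. The real blog text from the video is not given in the transcript.

```python
import re

def extract(pattern: str, text: str):
    """Match the whole text and return the first subpattern, or None."""
    m = re.fullmatch(pattern, text)
    return m.group(1) if m else None

# Made-up input with a date and an author in the same string.
sample = "January 5, 2012 by Jane Doe"

print(extract(r"(.*)", sample))        # January 5, 2012 by Jane Doe
print(extract(r".* by (.*)", sample))  # Jane Doe
```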

I can now close the configuration window and execute the extract step. The author name is now correctly assigned to my variable.

Let me go back to the extract step and quickly demonstrate another converter which uses patterns. I remove Extract and add the Advanced Extract converter instead. Then I write the same pattern as I used before except that I make subpatterns out of both instances of '.*' (any text). The Test Output is now still the same as the Test Input. This is because Advanced Extract enables me to choose which subpattern I would like to extract by using subpattern references in the Output Expression field.

In expressions, subpattern references are made using the '$' symbol followed by the reference number. Right now the expression refers to the entire matched pattern but if I change it to '$1' I only get the first subpattern, extracting the date, and if I write '$2' I only get the author name which is matched by the second subpattern.

Note that it is also possible to add text, combine subpatterns, and do simple string manipulation using the expression field. For example I could write an expression which recombines the two substrings but in reverse order. For more information on expressions click the question mark next to the expressions field.
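A rough Python analogue of choosing and recombining subpatterns is shown below. It is a sketch under assumptions: the sample input is invented, and Python's `Match.expand` writes references as '\1' and '\2' where Design Studio expressions use '$1' and '$2'.

```python
import re

# Both instances of "any text" are now subpatterns.
PATTERN = r"(.*) by (.*)"
sample = "January 5, 2012 by Jane Doe"  # made-up input

m = re.fullmatch(PATTERN, sample)
date_part = m.group(1)               # corresponds to '$1'
author_part = m.group(2)             # corresponds to '$2'
reversed_text = m.expand(r"\2, \1")  # the two substrings in reverse order

print(date_part)      # January 5, 2012
print(author_part)    # Jane Doe
print(reversed_text)  # Jane Doe, January 5, 2012
```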

Finally I would also like to recommend the Replace Pattern data converter, which replaces instances of a specified pattern in a string.

Those were the final words on patterns. Feel free to review any parts of the video you found useful or go to help.kapowsoftware.com to find even more answers.

Answers to Problems

Problem Number    Answer
(1)               '[0-9]'
(2)               '[ \n\r\t]'
(3)               '[^0-9]'
(4)               '.*'
(5)               'colou?r'
(6)               '\d{4}'
(7)               '.*\d.*'
(8)               '\D*\d\D*'
(9)               '(\D*|.*\d.*\d.*)'
(10)              '(\d)\1{3}'
(11)              '.*(.).*\1.*'
(12)              '.*?(\d).*'
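For readers who want to check the answers without opening the Pattern Editor, the table can be run against a few sample strings. This is a sketch in Python, assuming its `re` module agrees with the Perl5 syntax for these patterns. The sample strings and the helper `failures` are invented for the check.

```python
import re

# (problem number, pattern, strings that should match, strings that should not)
CASES = [
    (1, r"[0-9]", ["5"], ["a", "55"]),
    (2, r"[ \n\r\t]", [" ", "\t"], ["x"]),
    (3, r"[^0-9]", ["a"], ["5"]),
    (4, r".*", ["", "anything"], []),
    (5, r"colou?r", ["color", "colour"], ["colr"]),
    (6, r"\d{4}", ["2024"], ["123", "12345"]),
    (7, r".*\d.*", ["a1b"], ["abc"]),
    (8, r"\D*\d\D*", ["a1b"], ["a1b2", "abc"]),
    (9, r"(\D*|.*\d.*\d.*)", ["abc", "a1b2"], ["a1b"]),
    (10, r"(\d)\1{3}", ["7777"], ["7778"]),
    (11, r".*(.).*\1.*", ["abca", "aa"], ["abc"]),
    (12, r".*?(\d).*", ["a1b2"], ["abc"]),
]

def failures() -> list:
    """Return a list of (problem, text, reason) for every unexpected result."""
    bad = []
    for number, pattern, should_match, should_not_match in CASES:
        for text in should_match:
            if re.fullmatch(pattern, text) is None:
                bad.append((number, text, "expected a match"))
        for text in should_not_match:
            if re.fullmatch(pattern, text) is not None:
                bad.append((number, text, "expected no match"))
    return bad

print(failures())  # []
```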