Crawl Pages

The Crawl Pages action loops through the pages of a web site. In effect, it crawls the web site one web page at a time. Hence, the first iteration crawls the first page, the second iteration crawls the second page, and so on.

Note: The Crawl Pages action only exists in the Classic browser; it cannot be used in WebKit.

The Crawl Pages action accepts a loaded page as part of the input, such as the start page of the web site. The output contains the next crawled web page.

Properties

The Crawl Pages action can be configured using the following properties:

Basic Tab

This tab contains the basic crawling properties.

Crawling Strategy:
This property specifies the strategy (i.e. method) of crawling. The Breadth First crawling strategy visits all pages at one depth before moving on to the next depth, thereby minimizing the page depth at each point of the crawling. The Depth First crawling strategy follows each chain of links as deep as possible before backtracking, thereby maximizing the page depth.
Maximum Depth:
This property specifies the maximum depth of a page. The depth of a page is its distance from the first page measured in number of clicks and/or number of items that the mouse must be moved over (e.g. in a popup menu). The depth of the first page is zero. If a page exceeds the maximum depth, then it will not be crawled.
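The two strategies and the depth limit can be illustrated with a small sketch. This is not the product's implementation; `get_links` is a hypothetical helper standing in for clicking the tags on a loaded page:

```python
from collections import deque

def crawl(start, get_links, strategy="breadth-first", max_depth=3):
    # Illustrative sketch only; get_links(page) is a hypothetical helper
    # returning the pages linked from a page.
    frontier = deque([(start, 0)])  # (page, depth); the first page has depth 0
    visited = {start}
    order = []
    while frontier:
        # Breadth First takes the oldest entry (a queue), visiting shallow
        # pages first; Depth First takes the newest entry (a stack),
        # following links as deep as possible before backtracking.
        if strategy == "breadth-first":
            page, depth = frontier.popleft()
        else:
            page, depth = frontier.pop()
        order.append(page)
        if depth == max_depth:  # pages beyond the maximum depth are not crawled
            continue
        for link in get_links(page):
            if link not in visited:
                visited.add(link)
                frontier.append((link, depth + 1))
    return order
```

For a site where page A links to B and C, and B links to D, Breadth First visits A, B, C, D while Depth First visits A, C, B, D; with a maximum depth of 0, only the first page is crawled.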
Ignore Pages with Errors:
This property specifies whether pages with errors are skipped silently. Note that an error is only generated if this property is unchecked, and if the general options of the action do not specify that the particular type of error (e.g. JavaScript error or load error) should be ignored.
Options:
The robot's options can be overridden with the step's own options. An option that is marked with an asterisk in the Options Dialog will override the one from the robot's configuration. All other options will be the same as specified for the robot.

Crawling Tab

Crawl these Windows:

These properties specify which windows are crawled.

Frames:
This property specifies whether frames are crawled.
Popup Windows:
This property specifies whether popup windows are crawled. Popup windows are defined as top-level windows other than the window that was the current window at the start of the crawling.

The starting point of the crawling is the current window and - if the Frames property is checked - its frames. Other top-level windows present at the start will only be crawled if the Popup Windows property is checked, and not until new pages have been loaded into them.

Click these Tags:
These properties specify the HTML tags that the Crawl Pages action should attempt to click.
Links:
Hyperlinks (A tags).
Buttons:
input type="button", input type="submit", and input type="image" tags.
Image Maps:
Images with client-side image maps. Note that the image tag itself must be within the crawled area of the page, while the map need not be.
Other Clickable Tags:
Tags with JavaScript onClick event handlers.
Other:
Automatically Handle Popup Menus:
This property specifies whether to automatically include popup menus in the crawled area of the page. It only takes effect if a partial area of the page has been selected for crawling, either by setting up one or more tag finders for the first page or - for subsequent pages - by making a Crawling Rule with a Crawl Selected Parts of Page definition.
Move Mouse Over Tags:
This property specifies whether the mouse should be moved over tags that support the relevant JavaScript event handlers (onMouseOver, onMouseEnter or onMouseMove). This is typically necessary for popup menus.

Rules Tab

The first page is handled specially: Whether it is output is determined by the Output the Input Page property on the Output tab. If only a particular area of the first page should be crawled, the area(s) are selected using tag finders on the Crawl Pages step.

For pages other than the first page, crawling rules can be set up.

Crawling Rules:

Each crawling rule has the following properties:

Apply to these Pages
This property specifies a condition on the URLs of the pages to which this rule applies.
How to Crawl
This property specifies how the page should be crawled.
Crawl Entire Page
The entire page should be crawled.
Crawl Selected Parts of Page
Only parts of the page should be crawled. The included and excluded areas of the page are specified using tag finders, which can conveniently be copied from an existing step. If no included areas are specified, the entire page - except the specified excluded areas - is crawled.
Do Not Crawl
The page should not be crawled.
Output the Page
This property specifies whether the page should be output.
Rule Description
Here, you may specify a custom description of the crawling rule. This description will be shown in the list of crawling rules.

If multiple rules apply to a given page, the last matching rule in the list takes effect, overriding the preceding rules. For example, you can first create a general rule stating that all pages in the domain yourdomain.com should be crawled, and then add a more specific rule stating that the page http://yourdomain.com/uninteresting.html should not be crawled.
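The "last matching rule wins" behavior can be sketched as follows. The rule list and the regular-expression matching here are illustrative assumptions, not the product's actual URL-condition mechanism:

```python
import re

def resolve_rule(url, rules, default):
    # Sketch of rule resolution: rules are checked in list order, and the
    # last rule whose URL pattern matches the page takes effect.
    chosen = default
    for pattern, action in rules:
        if re.search(pattern, url):
            chosen = action  # a later match overrides an earlier one
    return chosen

# A general rule followed by a more specific override, as in the example above.
rules = [
    (r"^http://yourdomain\.com/", "Crawl Entire Page"),
    (r"^http://yourdomain\.com/uninteresting\.html$", "Do Not Crawl"),
]
```

With these rules, http://yourdomain.com/page.html is crawled entirely, while http://yourdomain.com/uninteresting.html falls under the later, more specific rule and is not crawled.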

For all Other Pages

This property specifies how pages are handled when no specific crawling rule applies to them. The first page is excluded, since it is handled separately.

Crawl Entire Page
The entire page is crawled and output.
Do Not Crawl
The page is neither crawled nor output.
Crawl Only These Domains
This property specifies the domains that may be crawled. If left blank, all domains may be crawled. Multiple domains can be specified, separated by spaces.
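A sketch of the domain check, under the assumption that a domain in the list also matches its subdomains (the product's exact matching rules may differ):

```python
from urllib.parse import urlparse

def domain_allowed(url, domains_property):
    # Sketch of the "Crawl Only These Domains" check. An empty property
    # allows every domain; otherwise the URL's host must equal one of the
    # space-separated domains (subdomain matching here is an assumption).
    domains = domains_property.split()
    if not domains:
        return True
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in domains)
```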

Note that a page that is set to be neither crawled nor output will not be loaded if the link that points to it is an anchor or area tag with no JavaScript event handlers. If JavaScript event handlers are involved, or if the page is loaded through JavaScript execution in general, be aware that the page may be loaded anyhow. Even so, it will not be output.

If at any time during the crawling one of the windows (be it a frame or a top-level window) should be output, all of the windows will be made available to the steps following the step with the Crawl Pages action.

Visited Pages Tab

Skip Already Visited Pages:
This property specifies whether already visited pages should be skipped, which is usually the case. The following properties specify how visited pages are detected:
Detect Already Visited Pages by URL:
This property specifies whether visited pages should be detected using their URL. For anchor tags with no JavaScript event handlers, this is done by checking the linked URL so the page will not be loaded a second time. In other cases (buttons, tags with JavaScript event handlers etc.) and for anchor tags with a non-visited linked URL, the resolved URL of the page is checked after it has been loaded.
Detect Already Visited Pages by Content:
This property specifies whether visited pages should be detected by content. This ensures that pages with different URLs but identical content are not crawled again. For instance, http://www.yourdomain.com/ and http://www.yourdomain.com/index.html may point to the same page even though the URLs are different.
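The two detection mechanisms can be sketched as a tracker that remembers both the URLs and a hash of the content of visited pages. The class and the use of SHA-256 are illustrative assumptions:

```python
import hashlib

class VisitedTracker:
    # Sketch of detecting already visited pages by URL and/or by content.
    def __init__(self, by_url=True, by_content=True):
        self.by_url, self.by_content = by_url, by_content
        self.urls, self.digests = set(), set()

    def seen(self, url, content):
        # Hash the page content so that pages with different URLs but
        # identical content are recognized as the same page.
        digest = hashlib.sha256(content.encode()).hexdigest()
        hit = (self.by_url and url in self.urls) or \
              (self.by_content and digest in self.digests)
        self.urls.add(url)
        self.digests.add(digest)
        return hit
```

For example, if http://www.yourdomain.com/ and http://www.yourdomain.com/index.html return identical content, the second is reported as already visited even though its URL is new.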

Output Tab

Output the Input Page:
This property specifies whether the first page should be output. If enabled, the output of the first iteration (iteration 1) equals the input.
Output Page Again if Changed:
This property specifies whether a given page should be output again if clicking or moving the mouse over some tag does not result in a page load. For instance, moving the mouse over an item that opens a popup menu will not result in a page load, so if you want to process the page with the popup menu visible, this property must be checked. Note that regardless of the value of this property, the page is always crawled again to detect any added tags.
Show Overview Page:
This property specifies whether to open a new window showing an overview page. The overview page contains a list of the URLs from each step up to the current point of the crawling. The URLs of pages that were visited but not output are shown in gray.
Store Current Depth Here:
This property specifies a variable into which the current depth is stored.
Store Current Path Here:
This property specifies a variable into which the current path is stored. The elements of the path are separated by semicolons, where each element consists of a space-separated list of the URLs at the current point of the crawling.
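Assuming the format described above, the path string could be built and taken apart like this (a sketch, not the product's own code):

```python
def format_path(path):
    # path is a list of path elements, each a list of window URLs.
    # Elements are separated by semicolons; URLs within an element by spaces.
    return ";".join(" ".join(urls) for urls in path)

def parse_path(text):
    # Inverse of format_path: recover the list of URL lists.
    return [element.split() for element in text.split(";")]
```

For a crawl that started at http://a.com/ and is currently on a page with two frames, the stored path might look like "http://a.com/;http://a.com/frame1 http://a.com/frame2".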

Examples

How to Crawl an Entire Site

In this example, we wish to crawl an entire site.

  1. Add a step with the Load Page action that loads the main page.
  2. Add a new step and choose the Crawl Pages action.
  3. On the Rules tab, add a Crawling Rule that applies to all pages in the site, e.g. by specifying the domain that the pages belong to or by making a pattern that the URL should match. For these pages, the rule should specify "Crawl Entire Page" and "Output the Page".
  4. On the Rules tab, set the "For all Other Pages" property to "Do Not Crawl".
  5. After the step with the Crawl Pages action, add steps to handle each page, e.g. by extracting information into returned variables.
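The steps above can be sketched as a single loop: crawl every page whose URL matches the site's pattern, output each page, and leave everything else uncrawled. As before, `get_links` is a hypothetical helper, not part of the product:

```python
import re
from collections import deque

def crawl_site(start, get_links, url_pattern):
    # Sketch of the example: start from the loaded main page, crawl every
    # page whose URL matches the site's pattern ("Crawl Entire Page" and
    # "Output the Page"), and do not crawl anything else.
    queue, visited = deque([start]), {start}
    while queue:
        url = queue.popleft()
        yield url  # hand the page to the following extraction steps
        for link in get_links(url):
            if link not in visited and re.search(url_pattern, link):
                visited.add(link)
                queue.append(link)
```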

How to Crawl a Popup Menu

In this example, we wish to discover all the pages that a popup menu links directly to. We do not wish to continue crawling from these pages.

  1. Add a step with the Load Page action that loads the main page.
  2. Add a new step and choose the Crawl Pages action.
  3. Select the menu bar as a named tag.
  4. Make sure that the "Automatically Handle Popup Menus" option on the Crawling tab is checked.
  5. On the Rules tab, add a Crawling Rule saying that for "All URLs" we "Do Not Crawl", but "Output the Page".
  6. After the step with the Crawl Pages action, add steps to handle each page, e.g. by extracting information into returned variables.