Understanding Tag Paths

To understand the Tag Finder, you first need to understand tag paths. A tag path is a compact text representation of where a tag is located on a page. Consider this tag path:

html.body.div.a

This tag path refers to an <a>-tag inside a <div>-tag inside a <body>-tag inside an <html>-tag.

A tag path can match more than one tag on the same page. For example, the tag path above will match all of the <a>-tags on this page except the third one, which is inside a <p>-tag rather than a <div>-tag:

<html>
  <body>
    <div>
      <a href="url...">Link 1</a>
      <a href="url...">Link 2</a>
    </div>
    <p>
      <a href="url...">Link 3</a>
    </p>
    <div>
      <a href="url...">Link 4</a>
      <a href="url...">Link 5</a>
      <a href="url...">Link 6</a>
    </div>
  </body>
</html>
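
The matching described above can be reproduced with a short Python sketch. It is only an illustration, not how the Tag Finder itself works: it uses the standard html.parser module, the class name PathCollector is made up for this example, and it compares the full path from the top-level tag exactly.

```python
from html.parser import HTMLParser

PAGE = """
<html>
  <body>
    <div>
      <a href="url...">Link 1</a>
      <a href="url...">Link 2</a>
    </div>
    <p>
      <a href="url...">Link 3</a>
    </p>
    <div>
      <a href="url...">Link 4</a>
      <a href="url...">Link 5</a>
      <a href="url...">Link 6</a>
    </div>
  </body>
</html>
"""


class PathCollector(HTMLParser):
    """Hypothetical helper: records the dotted path of every <a>-tag's text."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the currently open tags, outermost first
        self.found = []   # (dotted path, link text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Ignore whitespace between tags; keep text directly inside an <a>-tag.
        if data.strip() and self.stack and self.stack[-1] == "a":
            self.found.append((".".join(self.stack), data.strip()))


collector = PathCollector()
collector.feed(PAGE)
collector.close()

matches = [text for path, text in collector.found if path == "html.body.div.a"]
print(matches)  # ['Link 1', 'Link 2', 'Link 4', 'Link 5', 'Link 6']
```

"Link 3" is left out because its path is html.body.p.a.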

You can use indexes to refer to specific tags among tags of the same type at that level. Consider this tag path:

html.body.div[1].a[0]

This tag path refers to the first <a>-tag in the second <div>-tag in a <body>-tag inside an <html>-tag. So, on the page above, this tag path would only match the "Link 4" <a>-tag. Note that indexes in tag paths start from 0. If no index is specified for a given tag on a tag path, the path matches any tag of that type at that level, as we saw in the first tag path above. If the index is negative, the matching tags are counted backwards from the last matching tag, which has index -1. Consider this tag path:

html.body.div[-1].a[-2]

This tag path refers to the second-to-last <a>-tag in the last <div>-tag in a <body>-tag inside an <html>-tag. So, on the page above, this tag path would only match the "Link 5" <a>-tag.
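
The index rules can be tried out with the following Python sketch. It is a simplified illustration under stated assumptions, not the Tag Finder's own implementation: the function names parse_step, select, and find_tags are invented here, the page is parsed with xml.etree.ElementTree (which works because the sample page is well-formed), and the path is matched from the top-level tag without an implicit asterisk.

```python
import re
import xml.etree.ElementTree as ET

PAGE = """
<html>
  <body>
    <div>
      <a href="url...">Link 1</a>
      <a href="url...">Link 2</a>
    </div>
    <p>
      <a href="url...">Link 3</a>
    </p>
    <div>
      <a href="url...">Link 4</a>
      <a href="url...">Link 5</a>
      <a href="url...">Link 6</a>
    </div>
  </body>
</html>
"""

STEP = re.compile(r"^([A-Za-z][A-Za-z0-9]*)(?:\[(-?\d+)\])?$")


def parse_step(step):
    """Split a step such as 'div[-1]' into ('div', -1); no index gives None."""
    m = STEP.match(step)
    if m is None:
        raise ValueError(f"unsupported step: {step!r}")
    name, index = m.groups()
    return name, (int(index) if index is not None else None)


def select(candidates, name, index):
    """Keep tags of the given type; apply the index among those tags only."""
    same_type = [el for el in candidates if el.tag == name]
    if index is None:
        return same_type          # no index: every tag of that type matches
    try:
        return [same_type[index]]  # 0-based; negative counts from the end
    except IndexError:
        return []


def find_tags(root, tag_path):
    """Resolve an indexed tag path, starting at the top-level tag."""
    steps = [parse_step(s) for s in tag_path.split(".")]
    name, index = steps[0]
    current = select([root], name, index)
    for name, index in steps[1:]:
        next_level = []
        for parent in current:
            next_level.extend(select(list(parent), name, index))
        current = next_level
    return current


root = ET.fromstring(PAGE.strip())
print([el.text for el in find_tags(root, "html.body.div[1].a[0]")])    # ['Link 4']
print([el.text for el in find_tags(root, "html.body.div[-1].a[-2]")])  # ['Link 5']
```

Python list indexing happens to follow the same convention as tag paths (0 is the first item, -1 is the last), which keeps the sketch short.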

You can use an asterisk ('*') to mean any number of tags of any type. For example, the tag path

html.*.table.*.a

refers to an <a>-tag located anywhere inside a <table>-tag, which itself can be located anywhere inside an <html>-tag. There is an implicit asterisk in front of any tag path, so you can simply write "table" instead of "*.table" to refer to any table tag on the page. The only exception is a tag path that starts with a period ('.'): such a path has no implicit asterisk in front of it, so it must match from the first (i.e. top-level) tag of the page.

With asterisks, you can create tag paths that are more robust against changes in the page, since you can leave out insignificant tags that are liable to change over time, such as layout-related tags. However, using asterisks also increases the risk of accidentally locating the wrong tag.
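
As a rough illustration of how asterisks and the leading period behave, the Python sketch below matches a tag path against the list of tag names leading from the top-level tag down to a candidate tag. The function names matches and _match are invented for this example, indexes and '|' lists are left out, and the sketch assumes that an asterisk may also stand for zero tags, which the description above does not state explicitly.

```python
def matches(tag_path, ancestors):
    """Return True if tag_path matches the list of tag names `ancestors`.

    `ancestors` runs from the top-level tag down to the candidate tag,
    e.g. ["html", "body", "table", "tr", "td", "a"].
    """
    if tag_path.startswith("."):
        # Leading period: no implicit asterisk, match from the top-level tag.
        steps = tag_path[1:].split(".")
    else:
        # Otherwise an implicit asterisk sits in front of the path.
        steps = ["*"] + tag_path.split(".")
    return _match(steps, ancestors)


def _match(steps, names):
    if not steps:
        return not names  # both lists must be used up together
    head, rest = steps[0], steps[1:]
    if head == "*":
        # Assumption: the asterisk may skip zero or more tags of any type.
        return any(_match(rest, names[k:]) for k in range(len(names) + 1))
    return bool(names) and names[0] == head and _match(rest, names[1:])


deep_link = ["html", "body", "div", "table", "tr", "td", "a"]
plain_link = ["html", "body", "div", "a"]
table = ["html", "body", "table"]

print(matches("html.*.table.*.a", deep_link))   # True
print(matches("html.*.table.*.a", plain_link))  # False
print(matches("table", table))                  # True
print(matches(".table", table))                 # False
```

The second result shows the robustness trade-off: the path tolerates any layout tags around the table, but it still requires a <table>-tag somewhere above the link.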

You can provide a list of possible tags by separating them with '|', as in this tag path:

html.*.p|div|td.a

This tag path refers to an <a>-tag inside a <p>-, <div>-, or <td>-tag located anywhere inside an <html>-tag.
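
One way to picture a '|' list is as an alternation in a regular expression. The Python sketch below translates a tag path into a regular expression over a slash-terminated list of tag names; the function name to_regex and the "html/body/p/a/" encoding are assumptions made for this example only. It ignores indexes and the implicit leading asterisk, and it treats an asterisk as zero or more tags.

```python
import re


def to_regex(tag_path):
    """Translate a tag path with '*' and '|' into a compiled regular expression.

    The expression is meant to be matched against tag names written as
    "html/body/p/a/", i.e. every name followed by a slash.
    """
    parts = []
    for step in tag_path.split("."):
        if step == "*":
            parts.append(r"(?:[^/]+/)*")          # any number of tags of any type
        else:
            names = "|".join(re.escape(n) for n in step.split("|"))
            parts.append(f"(?:{names})/")          # exactly one of the listed tags
    return re.compile("^" + "".join(parts) + "$")


pattern = to_regex("html.*.p|div|td.a")

print(bool(pattern.match("html/body/p/a/")))            # True
print(bool(pattern.match("html/body/table/tr/td/a/")))  # True
print(bool(pattern.match("html/body/span/a/")))         # False
```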

In a tag path, text on a page is referred to just like any other tag, using the keyword "text". Although text is not technically a tag, it is treated and viewed as such in a tag path. For example, consider this HTML:

<html>
  <body>
    <a href="url...">Link 1</a>
    <a href="url...">Link 2</a>
  </body>
</html>

The tag path "html.body.a[1].text" would refer to the text "Link 2".
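
For comparison, the same lookup can be sketched in Python with the standard xml.etree.ElementTree module. This is only an analogy: ElementTree exposes text as an attribute of the enclosing tag rather than as a tag of its own, and the variable names are chosen for this example.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html>
  <body>
    <a href="url...">Link 1</a>
    <a href="url...">Link 2</a>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())   # root is the <html>-tag
links = root.findall("body/a")       # the <a>-tags directly inside <body>
second_link = links[1]               # index 1 is the second tag, as in a[1]

print(second_link.text)  # Link 2
```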