Need help extracting href and title strings

This comment was posted to reddit on Jul 07, 2015 at 4:34 pm and was deleted within 1 hour(s) and 29 minutes.

The shell, like all orthodox UNIX stuff, is a language oriented around lines of text. HTML, by contrast, is a stream-based language: it has no concept of a line, so the data you want might sit on a single line or be spread across 1000 different lines, and you have no idea what other elements or attributes might get inserted to screw up an ad-hoc attempt at parsing.
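
To make that concrete, here is a small sketch of how a line-oriented match goes wrong. The two anchors in the sample are made up for illustration (the names are borrowed from the output further down), and the grep pattern is just one plausible ad-hoc attempt, not anything from the original question:

```shell
# Two equivalent anchors: one on a single line, one split across two lines.
html='<a href="/wiki/Issy_Smith" title="Issy Smith">Issy Smith</a>
<a href="/wiki/Myles_Standish"
   title="Myles Standish">Myles Standish</a>'

# A line-oriented pattern that expects href and title on the same line
# only sees the first anchor; the second one is valid HTML but is missed.
matches=$(printf '%s\n' "$html" | grep -c 'href="[^"]*" title="[^"]*"')
echo "anchors found by grep: $matches (the document has 2)"
```

Both anchors mean the same thing to a browser, but grep only counts the one that happens to fit on a single line.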

You need a proper HTML parser to accomplish this job. The hxselect command from the HTML-XML-Utils package will let you extract specific elements from HTML documents, one per line, in a predictable format:

$ hxselect -s '\n' a <test.html
<a href="/wiki/William_Tecumseh_Sherman" title="William Tecumseh Sherman">William Tecumseh Sherman</a>
<a href="/wiki/Military_service_of_Ian_Smith" title="Military service of Ian Smith">Military service of Ian Smith</a>
<a href="/wiki/Issy_Smith" title="Issy Smith">Issy Smith</a>
<a href="/wiki/Oerip_Soemohardjo" title="Oerip Soemohardjo">Oerip Soemohardjo</a>
<a href="/wiki/Myles_Standish" title="Myles Standish">Myles Standish</a>
<a href="/wiki/Ronald_Stuart" title="Ronald Stuart">Ronald Stuart</a>

That output is something the shell can work with:

hxselect -s '\n' a <test.html | while IFS='' read -r line; do
    line=${line#'<a '}     # trim '<a ' from start
    line=${line%%>*}       # trim everything after first > from end
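    unset href title       # don't carry values over from the previous anchor
    # interpret the attributes as shell variable assignments; this runs the
    # tag's text as shell code, so only use it on HTML you trust
    eval "$line"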
    echo "The href is: $href; the title is: $title"
done
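
If the HTML comes from somewhere you don't control, a variant without eval may be preferable. The following is only a sketch: `get_attr` is a hypothetical helper, not part of HTML-XML-Utils, and it assumes each attribute is preceded by a single space and double-quoted, as in the hxselect output shown above. The here-document stands in for the hxselect pipeline so the example runs on its own:

```shell
# get_attr NAME TAG: print the value of attribute NAME found in start tag TAG.
# Hypothetical helper; assumes the form ' name="value"' (space, double quotes).
get_attr() {
    case $2 in
        *" $1=\""*) ;;              # attribute is present, carry on
        *) return 1 ;;              # attribute is missing, print nothing
    esac
    _v=${2#*" $1=\""}               # drop everything up to the opening quote
    printf '%s\n' "${_v%%\"*}"      # drop the closing quote and the rest
}

while IFS='' read -r line; do
    tag=${line%%>*}                 # keep only the start tag
    href=$(get_attr href "$tag")
    title=$(get_attr title "$tag")
    echo "The href is: $href; the title is: $title"
done <<'EOF'
<a href="/wiki/Issy_Smith" title="Issy Smith">Issy Smith</a>
<a href="/wiki/Myles_Standish" title="Myles Standish">Myles Standish</a>
EOF
```

In real use, drop the here-document and pipe `hxselect -s '\n' a <test.html` into the loop instead. Nothing from the document is ever executed, and an anchor without a title just yields an empty string.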
