soup2breve

soup2breve - Converting HTML to Brevé

soup2breve is a Python script that uses the excellent Beautiful Soup HTML parser by Leonard Richardson to parse an existing HTML document and convert it to a Brevé template. Beautiful Soup is a very forgiving parser with its roots in HTML screen-scraping, which lets soup2breve handle even badly formed HTML without problems. To use soup2breve, Beautiful Soup must either be installed or, since it is available as a single .py file, that file must be accessible on the Python path.

By default soup2breve outputs the entire document, from the html tag down, as a single Brevé template. Processing that template with Brevé recreates the original HTML document almost exactly. There are two differences: any run of consecutive ASCII whitespace is compressed to a single space, and any run of whitespace containing a carriage return is replaced with a single carriage return.
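The whitespace rule can be illustrated with a short sketch. This is not soup2breve's own code: `normalize_whitespace` is a hypothetical name, and the sketch assumes that "carriage return" here means any line break (`\n` or `\r`).

```python
import re

# One or more consecutive ASCII whitespace characters.
_WS_RUN = re.compile(r'[ \t\r\n\f\v]+')


def normalize_whitespace(text):
    """Illustrative only: collapse whitespace runs as described above.

    A run that contains a line break becomes a single newline;
    any other run becomes a single space.
    """
    def collapse(match):
        run = match.group(0)
        return '\n' if ('\n' in run or '\r' in run) else ' '

    return _WS_RUN.sub(collapse, text)


spaces_only = normalize_whitespace("a   b")       # 'a b'
with_break = normalize_whitespace("a \n\n  b")    # 'a\nb'
```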

To convert an HTML document, invoke soup2breve with the name of the document, e.g.:

./soup2breve index.html

By default the Brevé template will be output on stdout, so to capture the output simply redirect it to the requisite file e.g.:

./soup2breve index.html > index.b

Extending soup2breve

In some scenarios you may want to do additional processing on the HTML during conversion, before the Brevé tags are generated. For example, an HTML document may contain absolute links that you want converted to relative links. Or the document may link to images that the web designer sources locally during development but that, in production, should be served from a completely separate domain from the main HTML.

soup2breve has the notion of plug-in tag handlers to allow customised handling of any tag during the conversion process. There are several custom tag handlers in the existing soup2breve.py file, and in fact one is mandatory for the correct handling of the meta http-equiv tag sometimes used in HTML documents. This tag requires special handling because it uses 'http-equiv' as an attribute name. Because of the hyphen, this is not a valid name for a Python keyword argument, so that particular tag must be output using the meta(**dict) form to ensure that the 'http-equiv' name is passed as a string.
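The underlying Python restriction can be shown without Brevé at all. In the sketch below, `meta` is a hypothetical stand-in function that accepts arbitrary keyword arguments, not the real Brevé tag object.

```python
def meta(**attrs):
    """Hypothetical stand-in for a tag: simply returns its attributes."""
    return attrs


# The keyword form is rejected by the parser, because http-equiv is read
# as the expression "http minus equiv" rather than as a name.
try:
    compile("meta(http-equiv='refresh')", "<example>", "eval")
    keyword_form_ok = True
except SyntaxError:
    keyword_form_ok = False

# The **dict form passes the name as a plain string, so the hyphen is fine.
attrs = meta(**{"http-equiv": "refresh", "content": "30"})
```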

The custom meta tag handler from soup2breve.py is reproduced below:

# special handling for the http-equiv tag
def meta_handler(tag, output, indent, handlers):
    if tag.get('http-equiv', None):
        a = []
        for key, val in tag.attrs:
            # handle BeautifulSoup encoding substitution
            if "%SOUP-ENCODING%" in val:
                val = val.replace("%SOUP-ENCODING%", DEFAULT_ENCODING)
            a.append('"%s_":"%s"' %(key, val))
        output.append('%smeta ( **{ %s } ),\n' %(current_indent(indent), ', '.join(a)))
        return CONTENT_HANDLED
    else:
        return NOT_HANDLED

The parameters passed to a tag handler are as follows:

  • tag - the actual tag instance; will be a Beautiful Soup object such as Tag, NavigableString, etc.
  • output - the results of the processing so far; a list of strings to which you should append the results of any processing done in this handler
  • indent - the current indent level; an integer
  • handlers - the current set of handlers; a dictionary; this is useful if you want to take over control of the recursion into the document from this point on and would like to reuse or modify the active handler list
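As a rough illustration of how the four parameters fit together, here is a sketch of a handler that turns absolute links into relative ones. Everything outside the handler body is a stand-in so that the example runs on its own: the status-code values, the `current_indent` implementation, the `FakeTag` class and the `SITE_ROOT` constant are assumptions, not taken from soup2breve.py. The return value is one of the status codes described in the next paragraph.

```python
# Stand-ins for names that soup2breve.py provides; the values are assumed.
NOT_HANDLED, CONTENT_NOT_HANDLED, CONTENT_HANDLED = range(3)
SITE_ROOT = 'http://www.example.com'    # hypothetical absolute prefix


def current_indent(indent):
    return '    ' * indent              # assumed: four spaces per level


class FakeTag:
    """Mimics only the parts of a Beautiful Soup Tag that the handler uses."""
    def __init__(self, name, attrs):
        self.name = name
        self.attrs = attrs              # list of (key, value) pairs


def anchor_handler(tag, output, indent, handlers):
    parts = []
    for key, val in tag.attrs:
        # Strip the site root so that the link becomes relative.
        if key == 'href' and val.startswith(SITE_ROOT):
            val = val[len(SITE_ROOT):] or '/'
        parts.append('%s_="%s"' % (key, val))
    output.append('%s%s ( %s )' % (current_indent(indent), tag.name,
                                   ', '.join(parts)))
    # Only the tag itself was written; children are left to the processor.
    return CONTENT_NOT_HANDLED


out = []
tag = FakeTag('a', [('href', 'http://www.example.com/about.html'),
                    ('class', 'nav')])
status = anchor_handler(tag, out, 1, {})
```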

A tag handler can return 3 possible status codes that determine the action taken by the main processor when the handler returns. These are:

  • NOT_HANDLED: the handler did nothing. The processor will treat the tag as per normal
  • CONTENT_NOT_HANDLED: the handler has only processed the tag itself, not any children of the tag. The processor will not output anything for the current tag, but if it has children will process them as normal
  • CONTENT_HANDLED: the handler processed the tag and any children completely. The processor will simply continue processing the next tag, if any

As can be seen in the meta tag handler above, it only processes the tag in the case where the meta tag actually has an existing 'http-equiv' attribute, which it indicates by returning CONTENT_HANDLED in this instance and NOT_HANDLED for all others.
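To make the three status codes concrete, the sketch below shows one way a processor could act on them. It is a simplified model rather than the actual soup2breve implementation: the `Node` class, the `process` function, the two sample handlers and the constant values are all invented for illustration.

```python
# Assumed values; the real constants live in soup2breve.py.
NOT_HANDLED, CONTENT_NOT_HANDLED, CONTENT_HANDLED = range(3)


class Node:
    """Tiny stand-in for a parsed tag."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def process(node, output, indent, handlers):
    handler = handlers.get(node.name)
    status = handler(node, output, indent, handlers) if handler else NOT_HANDLED
    if status == CONTENT_HANDLED:
        return                          # tag and children are already done
    if status == NOT_HANDLED:
        # Default treatment of the tag itself.
        output.append('%s%s\n' % ('    ' * indent, node.name))
    # NOT_HANDLED and CONTENT_NOT_HANDLED both recurse into the children.
    for child in node.children:
        process(child, output, indent + 1, handlers)


def skip_subtree(node, output, indent, handlers):
    return CONTENT_HANDLED              # emit nothing, skip the children


def rename_tag(node, output, indent, handlers):
    output.append('%ssection\n' % ('    ' * indent))
    return CONTENT_NOT_HANDLED          # tag written, children still pending


tree = Node('html', [Node('head', [Node('title')]),
                     Node('body', [Node('p')])])
out = []
process(tree, out, 0, {'head': skip_subtree, 'body': rename_tag})
result = ''.join(out)
```

Here the head subtree disappears entirely, body is replaced by the handler's own output, and the p child is still processed as normal.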

The tag handlers must be passed to the conversion function (convert_file() in the example below) as a dictionary keyed by the name of the tag that each handler is for, e.g.:

my_handlers = dict(
                   link=null_handler,
                   script=script_handler,
                   body=body_handler,
                   img=img_handler,
                   a=anchor_handler,
                   )

result = convert_file(filename, my_handlers)
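The null_handler used above is not reproduced on this page. Given the status codes, a handler that discards a tag and all of its children would presumably look something like the sketch below. This is an assumption about its behaviour, and the constant value is a stand-in.

```python
CONTENT_HANDLED = 2     # stand-in; the real constant is in soup2breve.py


def null_handler(tag, output, indent, handlers):
    """Assumed behaviour: emit nothing and claim the whole subtree."""
    return CONTENT_HANDLED


out = ['html [\n']
status = null_handler(None, out, 1, {})
```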

Another example of a tag handler is shown below:

def img_handler(tag, output, indent, handlers):
    assert tag['src'].startswith('./') or tag['src'].startswith('http://')
    output.append('%s%s' %(current_indent(indent), tag.name))
    if hasattr(tag, 'attrs') and tag.attrs:
        output.append(' ( ')
        i = 0
        for key, val in tag.attrs:
            if i:
                output.append(', ')
            if key=='src' and val.startswith('./'):
                output.append('src=%s' %'''h.img_url('%s%s')''' %(handlers['relative_url'],val[1:]))
            else:
                output.append('%s_="%s"' %(key, val))
            i+= 1
        output.append(' ),\n')
    return CONTENT_HANDLED

This handler for the 'img' tag shows that it is possible to add some sanity checking to the Brevé conversion process in addition to changing the HTML content during processing. The assertion is a simple way of ensuring that the original HTML conforms to an agreed specification: every image source must start with either './' or 'http://'. The handler then replaces any relative link with a call to a function (in this case a Pylons helper function that generates the full URL, e.g. http://images.example.com/my_image.png) that is executed when the Brevé template is processed. Note that the handler reads handlers['relative_url'], so the handlers dictionary must also carry that entry.
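The helper itself is not shown here. A minimal sketch of what such a function might do, assuming a fixed image host, is given below. The name `img_url` mirrors the `h.img_url` call in the handler, but the body and the `IMAGE_BASE` constant are assumptions; the import is the Python 3 location of `urljoin`.

```python
from urllib.parse import urljoin

IMAGE_BASE = "http://images.example.com/"   # hypothetical image host


def img_url(path):
    """Illustrative helper: map a site-relative path onto the image host."""
    return urljoin(IMAGE_BASE, path.lstrip("/"))


full = img_url("/my_image.png")
```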

Using soup2breve with Brevé template inheritance

Another use case for soup2breve is the situation where you have some existing HTML that needs to fit into a Brevé site that uses template inheritance to define slots for various content to be inserted (see the sub-section titled 'Template Inheritance' in the special directives section). For example, you may have a master file 'index.b' that defines the 'content' slot:

html [
     body [
         slot ( 'content' )
     ]
]

In order to use the existing HTML with this template you need to customise the soup2breve output to use the 'inherits' tag and put the HTML body in an 'override' tag as shown below:

inherits ( 'index' ) [
    override ( 'content' ) [
        # ... contents of html body ...
    ]
]

soup2breve's tag handlers can handle this situation as follows:

  1. Define a handler for the initial html tag which outputs the 'inherits' tag and takes control of the recursive processing (it also filters out any unnecessary blank lines):

    def html_handler(tag, output, indent, handlers):
        output.append('''inherits ( 'index' ) [\n''')
        l = len(tag.head)
        for t in tag.head:
            # if not the only tag and it is a newline then ignore it
            # this gets rid of the newline at the start/end of tags
            if l > 1 and (t=='\n' or t==' '):
                continue
            # otherwise convert it
            convert(t, output, indent+1, handlers=handlers)
        convert(tag.body, output, indent+1, handlers=handlers)
        output.append(']')
        return CONTENT_HANDLED
    
  2. Define a handler for the body tag that outputs the 'override' tag at the appropriate indent level:

    def body_handler(tag, output, indent, handlers):
        subs = dict(indent=current_indent(indent))
        output.append('''%(indent)soverride ( 'content' )''' %subs)
        return CONTENT_NOT_HANDLED
    
  3. Define a set of handlers that discard any other tags:

    my_handlers = dict(html=html_handler,
                       head=null_handler,
                       title=null_handler,
                       meta=null_handler,
                       link=null_handler,
                       script=null_handler,
                       body=body_handler)
    
    result = convert_file(filename, my_handlers)
    

That's it - the result above will be in the desired format.

Rendered using Brevé 1.2.8
Copyright © 2007, Cliff Wells