soup2breve is a Python script that leverages the excellent Beautiful Soup HTML parser by Leonard Richardson to parse an existing HTML document and convert it to a Brevé template. Beautiful Soup is a very forgiving HTML parser that has its roots in HTML screen-scraping and as such allows soup2breve to handle even badly-formed HTML without problems. To use soup2breve you will either need to have Beautiful Soup installed or, as Beautiful Soup is available as a single .py file, this file must be accessible on the Python path.
By default soup2breve outputs the entire document, from the html tag down, as a single Brevé template. Processing that template with Brevé recreates the original HTML document almost exactly. The only differences are that consecutive ASCII whitespace characters are compressed to a single space, and that any whitespace containing an embedded carriage return is replaced by that single carriage return.
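The whitespace rule can be illustrated with a short sketch. This is not soup2breve's own code: the function name normalise_whitespace is hypothetical, and the sketch assumes that "carriage return" covers both "\r" and "\n" line breaks and that a run containing a line break collapses to a single "\n".

```python
import re

# One or more consecutive ASCII whitespace characters.
_WHITESPACE_RUN = re.compile(r"[ \t\r\n\f\v]+")


def normalise_whitespace(text):
    """Collapse whitespace runs roughly as described above (illustrative only)."""

    def collapse(match):
        run = match.group(0)
        # Assumption: a run containing a line break becomes one newline,
        # any other run becomes one space.
        if "\n" in run or "\r" in run:
            return "\n"
        return " "

    return _WHITESPACE_RUN.sub(collapse, text)


print(repr(normalise_whitespace("<p>Hello    world</p>  \n   <p>Bye</p>")))
```

A document that is already normalised in this way should survive the round trip unchanged.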
To convert an HTML document, invoke soup2breve with the name of the document, e.g.:
./soup2breve index.html
By default the Brevé template is written to stdout, so to capture the output simply redirect it to the required file, e.g.:
./soup2breve index.html > index.b
There are scenarios where, during an HTML to Brevé conversion, you may want to do some additional processing on the HTML before the Brevé tags are generated. For example, an HTML document may have absolute links that you want converted to relative links, or it may link to images that the web designer sources locally during development but that should be served from a completely separate domain from the main HTML in production.
soup2breve supports plug-in tag handlers that allow customised handling of any tag during the conversion process. There are several custom tag handlers in the existing soup2breve.py file, and one is in fact mandatory for the correct handling of the meta http-equiv tag sometimes used in HTML documents. This tag requires special handling because of its 'http-equiv' attribute: the name contains a hyphen, so it is not a valid Python keyword argument name, and the tag must be output using the meta(**dict) form to ensure that the 'http-equiv' name is passed as a string.
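The keyword argument restriction is plain Python behaviour and can be checked without Brevé. In the sketch below, meta is a hypothetical stand-in that simply returns the attributes it receives, and compiles is a helper invented for this illustration.

```python
def meta(**attrs):
    # Hypothetical stand-in for a tag factory: it just returns the
    # attributes it was given. It is not Brevé's real meta tag.
    return attrs


def compiles(source):
    """Return True if source is a syntactically valid Python expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


# 'http-equiv' parses as the expression "http - equiv", so it cannot be
# used as a keyword argument name.
keyword_form_ok = compiles('meta(http-equiv="refresh")')

# Unpacking a dict passes the same name as a plain string key.
attrs = meta(**{"http-equiv": "refresh", "content": "5"})

print(keyword_form_ok)
print(attrs["http-equiv"])
```

The dict form works because the attribute names are ordinary strings by the time the call is made, so Python never has to parse them as identifiers.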
The custom meta tag handler from soup2breve.py is reproduced below:
# special handling for the http-equiv tag
def meta_handler(tag, output, indent, handlers):
    if tag.get('http-equiv', None):
        a = []
        for key, val in tag.attrs:
            # handle BeautifulSoup encoding substitution
            if "%SOUP-ENCODING%" in val:
                val = val.replace("%SOUP-ENCODING%", DEFAULT_ENCODING)
            a.append('"%s_":"%s"' %(key, val))
        output.append('%smeta ( **{ %s } ),\n' %(current_indent(indent), ', '.join(a)))
        return CONTENT_HANDLED
    else:
        return NOT_HANDLED
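Every tag handler receives four parameters, as the handler signatures in this section show: tag, the Beautiful Soup tag being converted; output, the list of strings to which the handler appends the generated Brevé source; indent, the current indentation level; and handlers, the dictionary of tag handlers in use, which is passed on in any recursive call to convert().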
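A tag handler returns one of three status codes, which determine the action taken by the main processor when the handler returns. The handlers in this section use CONTENT_HANDLED (the handler has dealt with the tag and everything inside it, so the processor does nothing further with the tag), NOT_HANDLED (the handler did nothing, so the processor converts the tag as usual) and CONTENT_NOT_HANDLED (the handler has output the tag itself, and the processor still converts the tag's contents).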
The meta tag handler above processes the tag only when it actually has an 'http-equiv' attribute, in which case it returns CONTENT_HANDLED; for all other meta tags it returns NOT_HANDLED.
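How a processor might act on the status codes can be sketched with a toy dispatcher. Everything here is hypothetical: the Node class, the constant values, convert_node and the output layout are invented for illustration, and the meaning given to CONTENT_NOT_HANDLED is inferred from how the handlers in this document use it. The real logic lives in soup2breve.py and may differ.

```python
# Assumed values; the real constants live in soup2breve.py.
NOT_HANDLED, CONTENT_NOT_HANDLED, CONTENT_HANDLED = range(3)


class Node:
    """Hypothetical stand-in for a parsed tag: a name plus child nodes."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def convert_node(node, output, indent, handlers):
    """Toy dispatcher showing one way the status codes could be acted on."""
    pad = "    " * indent
    handler = handlers.get(node.name)
    status = handler(node, output, indent, handlers) if handler else NOT_HANDLED

    if status == CONTENT_HANDLED:
        # The handler emitted (or discarded) the tag and everything in it.
        return
    if status == NOT_HANDLED:
        # No handler output: emit the tag name ourselves.
        output.append(pad + node.name)
    # For NOT_HANDLED and CONTENT_NOT_HANDLED the children are still ours.
    output.append(" [\n")
    for child in node.children:
        convert_node(child, output, indent + 1, handlers)
    output.append(pad + "],\n")


def drop(node, output, indent, handlers):
    return CONTENT_HANDLED  # discard the tag and its contents


def rename_div(node, output, indent, handlers):
    output.append("    " * indent + "section")  # emit a replacement tag name
    return CONTENT_NOT_HANDLED  # let the dispatcher convert the children


tree = Node("body", [Node("script"), Node("div", [Node("p")])])
output = []
convert_node(tree, output, 0, {"script": drop, "div": rename_div})
print("".join(output))
```

In this sketch the script node disappears entirely, the div is emitted under a different name with its children intact, and the untouched tags fall through to the default path.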
The tag handlers are passed to the conversion function as a dictionary keyed by the name of the tag that each handler is for, e.g.:
my_handlers = dict(
    link=null_handler,
    script=script_handler,
    body=body_handler,
    img=img_handler,
    a=anchor_handler,
)
result = convert_file(filename, my_handlers)
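The null_handler used above is not reproduced in this document. A minimal sketch of what such a discard-everything handler might look like follows, assuming that returning CONTENT_HANDLED without appending anything makes the processor skip the tag. The constant's value is a placeholder, and the real null_handler in soup2breve.py may differ.

```python
CONTENT_HANDLED = "content-handled"  # placeholder value, assumed for this sketch


def null_handler(tag, output, indent, handlers):
    """Discard a tag: append nothing and report the tag as fully handled."""
    return CONTENT_HANDLED


output = []
status = null_handler(None, output, 0, {})
print(status == CONTENT_HANDLED, output)
```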
Another example of a tag handler is shown below:
def img_handler(tag, output, indent, handlers):
    assert tag['src'].startswith('./') or tag['src'].startswith('http://')
    output.append('%s%s' %(current_indent(indent), tag.name))
    if hasattr(tag, 'attrs') and tag.attrs:
        output.append(' ( ')
        i = 0
        for key, val in tag.attrs:
            if i:
                output.append(', ')
            if key=='src' and val.startswith('./'):
                output.append('src=%s' %'''h.img_url('%s%s')''' %(handlers['relative_url'],val[1:]))
            else:
                output.append('%s_="%s"' %(key, val))
            i+= 1
        output.append(' ),\n')
    return CONTENT_HANDLED
This handler for the 'img' tag shows that it is possible to add sanity checking to the Brevé conversion process as well as changing the HTML content during processing. The assertion is a simple way of ensuring that the original HTML conforms to an agreed specification: every image source must either be relative (starting with './') or be an absolute http:// URL. The handler then replaces any relative link with a call to a function that is executed when the Brevé template is processed (in this case a Pylons helper function that generates the full URL, e.g. http://images.example.com/my_image.png).
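The string rewriting at the heart of the img handler can be isolated into a small function to make the two output forms easier to see. The function name rewrite_img_src and the example prefix "/static" are hypothetical; the format strings mirror the ones used in the handler.

```python
def rewrite_img_src(src, relative_url):
    """Return the Brevé attribute text for an img src value (illustrative)."""
    if src.startswith("./"):
        # Drop the leading '.' and defer URL building to a template-time call.
        return "src=h.img_url('%s%s')" % (relative_url, src[1:])
    # Anything else is passed through as a quoted attribute.
    return 'src_="%s"' % src


print(rewrite_img_src("./logo.png", "/static"))
print(rewrite_img_src("http://example.com/a.png", "/static"))
```

Note that the relative form is emitted unquoted, so it appears in the template as a Python expression rather than as a string literal.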
Another use case for soup2breve is the situation where you have some existing HTML that needs to fit into a Brevé site that uses template inheritance to define slots into which content is inserted (see the sub-section titled 'Template Inheritance' in the special directives section). For example, you may have a master file 'index.b' that defines the 'content' slot:
html [
    body [
        slot ( 'content' )
    ]
]
In order to use the existing HTML with this template you need to customise the soup2breve output to use the 'inherits' tag and put the HTML body in an 'override' tag as shown below:
inherits ( 'index' ) [
    override ( 'content' ) [
        # ... contents of html body ...
    ]
]
soup2breve's tag handlers can handle this situation as follows:
Define a handler for the initial html tag which outputs the 'inherits' tag and takes control of the recursive processing (it also filters out any unnecessary blank lines):
def html_handler(tag, output, indent, handlers):
    output.append('''inherits ( 'index' ) [\n''')
    l = len(tag.head)
    for t in tag.head:
        # if not the only tag and it is a newline then ignore it
        # this gets rid of the newline at the start/end of tags
        if l > 1 and (t=='\n' or t==' '):
            continue
        # otherwise convert it
        convert(t, output, indent+1, handlers=handlers)
    convert(tag.body, output, indent+1, handlers=handlers)
    output.append(']')
    return CONTENT_HANDLED
Define a handler for the body tag that outputs the 'override' tag at the appropriate indent level:
def body_handler(tag, output, indent, handlers):
    subs = dict(indent=current_indent(indent))
    output.append('''%(indent)soverride ( 'content' )''' %subs)
    return CONTENT_NOT_HANDLED
Define a set of handlers that discard any other tags:
my_handlers = dict(html=html_handler,
                   head=null_handler,
                   title=null_handler,
                   meta=null_handler,
                   link=null_handler,
                   script=null_handler,
                   body=body_handler)
result = convert_file(filename, my_handlers)
That's it - the result above will be in the desired format.