Markus Heberling Coding == Relaxing


Parsing Flattr Buttons in HTML

I recently built this little tool, FlattrStar, which can autoflattr starred items from several services. To be able to do this, it needs to parse the HTML of a website for Flattr buttons.

I am using TagSoup to parse the HTML into a Scala XML structure. This is probably not needed and could be replaced by regular expressions or something similar, but I wanted to try the Scala XML support. You need the "org.ccil.cowan.tagsoup" % "tagsoup" % "1.2.1" jar on your classpath.
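To illustrate the regular expression alternative mentioned above, here is a minimal sketch that pulls the href of every <link rel="payment"> tag out of raw HTML using only the Scala standard library. This is not the TagSoup-based code FlattrStar uses; the object name PaymentLinks and its helpers are made up for this example, and a regex like this will miss unquoted attributes and does not decode HTML entities.

```scala
object PaymentLinks {
  // Matches a whole <link ...> tag and captures its attribute text.
  private val LinkTag = """(?is)<link\b([^>]*)>""".r
  // Matches name="value" or name='value' pairs (quoted values only).
  private val Attr = """(?is)([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""".r

  // Turns the attribute text of one tag into a lower-cased name -> value map.
  private def attributes(tagBody: String): Map[String, String] =
    Attr.findAllMatchIn(tagBody).map { m =>
      val value = Option(m.group(2)).getOrElse(m.group(3))
      m.group(1).toLowerCase -> value
    }.toMap

  // Returns the href of every <link rel="payment"> tag, in document order.
  def find(html: String): List[String] =
    LinkTag.findAllMatchIn(html).toList
      .map(m => attributes(m.group(1)))
      .filter(_.get("rel").exists(_.equalsIgnoreCase("payment")))
      .flatMap(_.get("href"))
}
```

A real parser such as TagSoup copes much better with broken markup, which is the main reason to prefer it over this shortcut.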

I started FlattrStar to learn the Play Framework and Scala, so the parsing is also done in Scala. Since this is my first Scala project, the code might be a bit suboptimal. Corrections and improvements are appreciated.

Now to the code itself. The goal is to find most of the Flattr links that are out there in the wild. flattrParseHtml is called with a function that does the actual flattring and returns Some(_) if the flattr was successful and None if it wasn't. It also takes a list of URLs to parse for buttons.
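The callback contract can be sketched roughly like this. It is not the real flattrParseHtml: the names parseAndFlattr and findCandidates are hypothetical, findCandidates stands in for all the HTML parsing, and I am assuming here that the first successful flattr per page is enough.

```scala
object FlattrParseSketch {
  // flattr: tries to flattr one candidate URL, Some(item) on success, None on failure.
  // findCandidates: hypothetical stand-in for the HTML parsing of one page.
  def parseAndFlattr[A](pageUrls: Seq[String],
                        findCandidates: String => Seq[String])
                       (flattr: String => Option[A]): Seq[A] =
    pageUrls.flatMap { page =>
      // The iterator is lazy, so flattr is not called again after the first success.
      findCandidates(page).iterator.map(flattr).collectFirst { case Some(item) => item }
    }
}
```

Returning an Option from the callback keeps the "did it work?" decision with the caller, so the parsing code never needs to know how flattring is actually done.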

First I look for <link rel="payment"> entries and try to flattr them. If nothing is found, I look for buttons with rel or rev attributes that start with flattr, and for links with data-flattr-uid attributes, and construct an auto submit URL from them. There might be several such links on a page, for example one for the whole site and one for a specific article. I choose the one with the longest link, since that will usually be the one specific to this article.

If that doesn't give anything, I search for script elements that contain flattr_uid and flattr_url variables and construct an auto submit URL from them. The last step is to parse all links that have some kind of flattr reference in them. This could be an <a href="...">Flattr This</a> or <a href="..."><img src="...flattr.png"></a> or something else where flattr occurs inside the <a> element. I grab all auto submit URLs that are hidden inside these links and try to flattr the longest one, as explained above.
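The three building blocks of this search (build an auto submit URL, prefer the longest candidate, fall through the strategies in order) could look roughly like the sketch below. All names are invented for this example, and the endpoint and its user_id and url parameters are my assumption about the auto submit format, so check them against the Flattr documentation before relying on them.

```scala
import java.net.URLEncoder

object FlattrCandidates {
  // Assumed auto submit endpoint, not taken from the FlattrStar source.
  val AutoSubmitBase = "https://flattr.com/submit/auto"

  // Builds an auto submit URL from a Flattr user id and the URL of the thing.
  def autoSubmitUrl(uid: String, url: String): String = {
    def enc(s: String) = URLEncoder.encode(s, "UTF-8")
    s"$AutoSubmitBase?user_id=${enc(uid)}&url=${enc(url)}"
  }

  // Prefers the longest candidate, which is usually the article-specific one.
  def mostSpecific(candidates: Seq[String]): Option[String] =
    if (candidates.isEmpty) None else Some(candidates.maxBy(_.length))

  // Runs the strategies in order and stops at the first one that succeeds.
  // Later strategies are never evaluated once an earlier one returns Some.
  def firstSuccess[A](strategies: Seq[() => Option[A]]): Option[A] =
    strategies.iterator.map(s => s()).collectFirst { case Some(a) => a }
}
```

Each parsing step would then become one entry in the strategies list, for example a function that collects the candidates for that step, picks one with mostSpecific and hands it to the flattr callback.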

For every URL I encounter in this process, I first follow the redirects, because sometimes flattrable URLs are hidden behind them. Unfortunately, an auto submit URL is also redirected, so I have to stop following redirects as soon as I encounter such a URL. You need HttpClient on your classpath for the redirect handling.
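Here is a minimal sketch of that redirect logic. To keep it free of dependencies it uses java.net.HttpURLConnection instead of the HttpClient library I actually use, and it recognises auto submit URLs by a simple substring check, which is an assumption. The names Redirects, locationOf and resolve are made up for the example, as is the limit of five hops.

```scala
import java.net.{HttpURLConnection, URL}
import scala.annotation.tailrec

object Redirects {
  // Assumption: an auto submit URL can be recognised by this substring.
  def isAutoSubmit(url: String): Boolean = url.contains("flattr.com/submit/auto")

  // One hop: the redirect target if the server answers with a 3xx and a Location header.
  def locationOf(url: String): Option[String] = {
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    try {
      conn.setInstanceFollowRedirects(false) // we want to see each hop ourselves
      conn.setRequestMethod("HEAD")
      val code = conn.getResponseCode
      if (code >= 300 && code < 400)
        // Resolve relative Location values against the current URL.
        Option(conn.getHeaderField("Location")).map(loc => new URL(new URL(url), loc).toString)
      else None
    } finally conn.disconnect()
  }

  // Follows redirects, but stops at an auto submit URL or after maxHops hops.
  @tailrec
  def resolve(url: String,
              next: String => Option[String] = locationOf,
              maxHops: Int = 5): String =
    if (maxHops <= 0 || isAutoSubmit(url)) url
    else next(url) match {
      case Some(target) => resolve(target, next, maxHops - 1)
      case None         => url
    }
}
```

Passing the single-hop lookup in as a function makes the stopping rule easy to try out with a plain Map of fake redirects instead of real HTTP requests.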

This is not perfect and not all flattrable things are found. For example Flattr buttons from are not found, since they put the Flattr stuff in their ad delivery system and I couldn't parse that.

I declare this source as public domain, so you can use it as you like. It would be great, though, if you shared improvements and feedback.
