Parsing Flattr Buttons in HTML

I recently built this little tool, FlattrStar, which can autoflattr starred items from several services. To be able to do this, it needs to parse the HTML of a website for Flattr buttons.

I am using TagSoup to parse the HTML into a Scala XML structure. This is probably not needed and could be replaced by regular expressions or something similar, but I wanted to try the Scala XML support. You need the "org.ccil.cowan.tagsoup" % "tagsoup" % "1.2.1" jar on your classpath.
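If the project is built with sbt (which Play projects usually are), the TagSoup coordinates above translate into a single line in the build definition. This is only a sketch of that one setting. Other dependencies the code relies on, such as HttpClient, are left out because the post does not name their versions.

```scala
// build.sbt fragment: adds TagSoup with the coordinates mentioned in the text
libraryDependencies += "org.ccil.cowan.tagsoup" % "tagsoup" % "1.2.1"
```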

I started FlattrStar to learn the Play Framework and Scala, so the parsing is also done in Scala. Since this is my first Scala project, the code might be a bit suboptimal. Corrections and improvements are appreciated.

Now to the code itself. The goal is to find most Flattr links that are out there in the wild. flattrParseHtml is called with two arguments: a function that does the actual flattring and returns Some(_) if the flattr was successful and None if it wasn't, and a list of URLs to parse for buttons.
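The contract of that function argument can be shown in isolation. The following is a minimal sketch, not the code FlattrStar uses: fakeFlattr, firstSuccess and the sample URL are made-up names and values for illustration. It tries several lookups in order and stops at the first one whose result can be flattred, which is the same idea as the chain of early returns in the full source.

```scala
object FallbackSketch {
  // Hypothetical stand-in for the real flattr call: succeeds only for auto submit URLs.
  def fakeFlattr(url: String): Option[String] =
    if (url.startsWith("https://flattr.com/submit/auto")) Some("flattred " + url) else None

  // Tries each lookup in order and returns the first successful flattr result.
  // The iterator is lazy, so lookups after the first success are never run.
  def firstSuccess[R](flattrFunction: String => Option[R],
                      lookups: Seq[() => Option[String]]): Option[R] =
    lookups.iterator
      .map(lookup => lookup().flatMap(flattrFunction))
      .collectFirst { case Some(result) => result }

  // Example lookups: the first finds nothing, the second finds an auto submit URL,
  // the third would fail if it were ever evaluated.
  val exampleLookups: Seq[() => Option[String]] = Seq(
    () => None,
    () => Some("https://flattr.com/submit/auto?user_id=alice&url=http://example.org/post"),
    () => sys.error("never evaluated")
  )
}
```

Calling FallbackSketch.firstSuccess(FallbackSketch.fakeFlattr, FallbackSketch.exampleLookups) returns the result of the second lookup and never touches the third.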

First I look for <link rel="payment"> entries and try to flattr them. If nothing is found, I look for FlattrButton links whose rel or rev attribute starts with flattr, or that carry a data-flattr-uid attribute, and construct an auto submit URL from them. There might be several such links on a page, like one for the whole site and one for a specific article. I choose the one with the longest link, since that will mostly be the one specific to this article. If that doesn't give anything, I search for script elements that contain flattr_uid and flattr_url variables and construct an auto submit URL from them. The last step is to parse all links that have some kind of Flattr reference in them. This could be an <a href="…">Flattr This</a> or <a href="…"><img src="…flattr.png"></a> or something else where flattr occurs inside the <a> element. I grab all auto submit URLs that are hidden inside these links and try to flattr the longest one, as explained above.
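The string handling behind the button and script steps can be tried without any HTML parsing. Here is a small sketch under some assumptions: the sample attribute value "flattr;uid:alice;category:text", the user alice and the example.org URLs are invented, and the helper names only loosely mirror the ones in the full source.

```scala
object ButtonSketch {
  private val AutoSubmit = "https://flattr.com/submit/auto"

  // Splits a rel/rev value such as "flattr;uid:alice;category:text" into key/value pairs.
  // Entries without a colon (like the leading "flattr") are dropped.
  def relAsMap(rel: String): Map[String, String] =
    rel.split(";").toList
      .map(_.split(":", 2))
      .collect { case Array(key, value) => key.trim -> value.trim }
      .toMap

  // Builds an auto submit URL from a user id and the link target.
  def autoSubmit(uid: String, url: String): String =
    AutoSubmit + "?user_id=" + uid + "&url=" + url

  // Picks the longest candidate, assuming it is the article-specific one.
  def mostSpecific(urls: Seq[String]): Option[String] =
    urls.reduceLeftOption((a, b) => if (a.length > b.length) a else b)

  // Patterns for the two script variables; [^']* stays inside one quoted literal.
  private val UidReg = "var flattr_uid = '([^']*)';".r
  private val UrlReg = "var flattr_url = '([^']*)';".r

  // Yields an auto submit URL only if both variables are present in the script text.
  def fromScript(script: String): Option[String] =
    for {
      uid <- UidReg.findFirstMatchIn(script).map(_.group(1))
      url <- UrlReg.findFirstMatchIn(script).map(_.group(1))
    } yield autoSubmit(uid, url)
}
```

For instance, ButtonSketch.relAsMap("flattr;uid:alice;category:text").get("uid") gives Some("alice"), and ButtonSketch.mostSpecific picks the article URL over a shorter site-wide one.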

For every URL I encounter in this process, I first resolve the redirects, because sometimes flattrable URLs are hidden behind them. Unfortunately, an auto submit URL is also redirected, so I have to stop following redirects as soon as I encounter such a URL. You need Apache HttpClient on your classpath for the redirect stuff.
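The stop rule is easier to see without the HTTP plumbing. The sketch below is a simplified model rather than the HttpClient-based code from the full source: nextLocation is a hypothetical lookup that says where a URL redirects to, and maxHops is an added safeguard against redirect loops that the original does not have.

```scala
import scala.annotation.tailrec

object RedirectSketch {
  // Follows redirects starting at `url`, but stops as soon as an auto submit URL shows up.
  // `nextLocation` returns Some(target) if the URL redirects and None if it does not.
  @tailrec
  def resolve(url: String, nextLocation: String => Option[String], maxHops: Int = 10): String =
    if (url.contains("flattr.com/submit/auto") || maxHops <= 0) url
    else nextLocation(url) match {
      case Some(target) => resolve(target, nextLocation, maxHops - 1)
      case None         => url
    }

  // Made-up redirect table: a short link leads to an auto submit URL,
  // which itself would redirect on to the thing page.
  val exampleRedirects: Map[String, String] = Map(
    "http://short.example/x" ->
      "https://flattr.com/submit/auto?user_id=alice&url=http://example.org/post",
    "https://flattr.com/submit/auto?user_id=alice&url=http://example.org/post" ->
      "https://flattr.com/thing/1"
  )
}
```

Resolving "http://short.example/x" with u => RedirectSketch.exampleRedirects.get(u) ends at the auto submit URL and does not continue to the thing page.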

This is not perfect, and not all flattrable things are found. For example, Flattr buttons on taz.de are not found, since they put the Flattr stuff in their ad delivery system and I couldn't parse that.

I declare this source public domain, so you can use it as you like. It would be great, though, if you gave back improvements and feedback.

package jobs

import java.net.URL
import play.api.Logger
import xml.Node
import org.apache.http.impl.client.{DefaultRedirectStrategy, DefaultHttpClient}
import org.apache.http.client.methods.{HttpUriRequest, HttpGet}
import org.apache.http.protocol.{ExecutionContext, HttpContext, BasicHttpContext}
import org.apache.http.{HttpHost, HttpResponse, HttpRequest, HttpStatus}

object Flattr {
  def flattrParseHtml[R](flattrFunction: String => Option[R], htmlUrls: Seq[String]): Option[R] = {
    val htmlRoot = htmlUrls.flatMap(url => {

      val parserFactory = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
      val parser = parserFactory.newSAXParser()
      val connection = new URL(url).openConnection()
      if (connection == null || connection.getContentType == null) {
        Logger.error("Cannot load html, could not connect to " + url)
        None

      } else if (!connection.getContentType.contains("html")) {
        Logger.error("Cannot load html, invalid content type " + connection.getContentType)
        None

      } else {
        try {
          val source = new org.xml.sax.InputSource(connection.getInputStream)
          val adapter = new scala.xml.parsing.NoBindingFactoryAdapter

          Some(adapter.loadXML(source, parser))
        } catch {
          case t: Throwable =>
            Logger.error("Cannot load html, ignoring", t)
            None
        }
      }
    })

    //go to webpage check for payment link
    val htmlPaymentLinks = htmlRoot.map(_ \\ "link").toList.flatten
      .filter(l => (l \ "@rel").text.equals("payment"))
      .map(_ \ "@href").headOption
      .map(_.text.replace("&amp;", "&")) //undo HTML escaping of & in the href

    val htmlPaymentThing =
      htmlPaymentLinks.flatMap(href => flattrFunction(href))

    if (!htmlPaymentThing.isEmpty) {
      Logger.info("Found payment link in html")
      return htmlPaymentThing
    }

    //nothing found? parse html of webpage
    val htmlLinks = htmlRoot.flatMap(root => {
      (root \\ "a")
        .filter(l => (l \ "@class").text.equals("FlattrButton"))
        .filter(l => (l \ "@rel").text.startsWith("flattr") || (l \ "@rev").text.startsWith("flattr") || (l \ "@data-flattr-uid").nonEmpty)
    })
      .flatMap(autoSubmitFromFlattrButton(_))
      .reduceLeftOption((a, b) => if (a.length() > b.length()) a else b) //the longest url should be the most specific one
      .flatMap(url => flattrFunction(url))

    if (!htmlLinks.isEmpty) {
      Logger.info("Found button in HTML that points to thing")
      return htmlLinks
    }

    //nothing found? parse html of webpage -> look for flattr java script
    val htmlScriptLinks = htmlRoot
      .flatMap(findAutoflattrFromJavascript(_)).flatten.headOption
      .flatMap(url => flattrFunction(url))

    if (!htmlScriptLinks.isEmpty) {
      Logger.info("Found java script in HTML that points to thing")
      return htmlScriptLinks
    }

    //nothing found? parse html of webpage -> look for all links that have flattr inside
    val htmlLinks2 = (htmlRoot \\ "a")
      .filter(_.toString().toLowerCase.contains("flattr"))
      .map(_ \ "@href")
      .map(_.text)
      .map(getRedirectedUrl(_))
      .filter(_.toLowerCase.startsWith("https://flattr.com/submit/auto")) // only autoflattr links are supported
      .reduceLeftOption((a, b) => if (a.length() > b.length()) a else b) //the longest url should be the most specific one
      .flatMap(url => flattrFunction(url))

    if (!htmlLinks2.isEmpty) {
      Logger.info("Found custom link in HTML that points to thing")
      return htmlLinks2
    }

    //nothing found
    None
  }

  //builds auto submit urls from inline scripts that define flattr_uid and flattr_url
  def findAutoflattrFromJavascript(root: Node): List[Option[String]] = {
    //[^']* stays inside one quoted literal, a greedy .* could run on to a later quote on the same line
    val uidReg = "var flattr_uid = '([^']*)';".r
    val urlReg = "var flattr_url = '([^']*)';".r

    (root \\ "script").toList
      .map(_.text)
      .filter(_.contains("flattr_uid"))
      .map { s =>
        //only yields a url if both variables are present in the script
        for {
          uid <- uidReg.findFirstMatchIn(s).map(_.group(1))
          url <- urlReg.findFirstMatchIn(s).map(_.group(1))
        } yield "https://flattr.com/submit/auto?user_id=" + uid + "&url=" + url
      }
  }

  def autoSubmitFromFlattrButton(a: Node) = {
    val url = (a \ "@href").text
    val uid = ((a \ "@data-flattr-uid").toList.map(_.text) ++ ((a \ "@rel") ++ (a \ "@rev")).flatMap(r => relAsMap(r.text).get("uid"))).headOption

    uid.map(uid => "https://flattr.com/submit/auto?user_id=" + uid + "&url=" + url)
  }

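  //turns a rel/rev value like "key:value;key:value" into a map
  private def relAsMap(rel: String) = {
    rel.split(";").filter(_.contains(":")).map(entry => {
      //split on the first colon only, so values that contain colons stay intact
      val s = entry.split(":", 2)
      (s(0), s(1))
    }).toMap
  }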

  def getRedirectedUrl(url: String): String = {
    val httpClient = new DefaultHttpClient()
    try {
      httpClient.setRedirectStrategy(DetectAutoFlattrRedirectStrategy)
      val httpget = new HttpGet(url)
      val context = new BasicHttpContext()
      val response = httpClient.execute(httpget, context)
      if (response.getStatusLine.getStatusCode != HttpStatus.SC_OK) {
        //just return the old url
        url
      } else
        getCurrentUrl(context)
    } catch {
      case e: Throwable => {
        Logger.error("Error finding redirects. Ignoring.", e)
        url
      }
    } finally {
      //release the connection, otherwise every call leaks one
      httpClient.getConnectionManager.shutdown()
    }
  }

  private object DetectAutoFlattrRedirectStrategy extends DefaultRedirectStrategy {
    override def isRedirected(
                               request: HttpRequest,
                               response: HttpResponse,
                               context: HttpContext): Boolean = {
      if (super.isRedirected(request, response, context)) {
        val currentUrl = getCurrentUrl(context)
        if (currentUrl.contains("flattr.com/submit/auto")) {
          //auto submit urls redirect to the thing, but we don't want that
          response.setStatusCode(HttpStatus.SC_OK)
          return false
        }
        return true
      }
      false
    }
  }

  private def getCurrentUrl(context: HttpContext): java.lang.String = {
    val currentReq = context.getAttribute(
      ExecutionContext.HTTP_REQUEST).asInstanceOf[HttpUriRequest]
    val currentHost = context.getAttribute(
      ExecutionContext.HTTP_TARGET_HOST).asInstanceOf[HttpHost]
    val currentUrl = if (currentReq.getURI.isAbsolute) currentReq.getURI.toString else (currentHost.toURI + currentReq.getURI)
    currentUrl
  }
}