Processing real-world HTML as if it were XML in scala

Using arbitrary SAX sources in scala 2.8

By default, scala reads XML files using the standard SAX parser that comes with your JRE. If you want to change this, you have to supply another one. Since scala 2.8, several methods of the scala XML framework accept a SAX parser to use, for example this method in FactoryAdapter:

def loadXML (source: InputSource, parser: SAXParser) : Node

But unfortunately, all these methods take a SAXParser, which is the wrapper returned by the built-in SAX parser factory that can be configured via command line parameters or global properties when starting the JVM. SAXParser is not the interface implemented by normal SAX parsers. That interface is called XMLReader (there is a cheap joke about cache invalidation and off-by-one errors lurking somewhere around here).
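
To make the difference concrete, here is a minimal sketch that uses only the parser shipped with the JRE: the factory hands out a SAXParser wrapper, and the XMLReader that does the actual work is obtained from it with getXMLReader. The handler class ElementNameCollector and the sample document are made up for illustration.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{Attributes, InputSource, XMLReader}
import org.xml.sax.helpers.DefaultHandler
import scala.collection.mutable.ListBuffer

// Hypothetical handler: records the qualified name of every element
// in document order.
class ElementNameCollector extends DefaultHandler {
  val names = ListBuffer[String]()
  override def startElement(uri: String, localName: String,
                            qName: String, attrs: Attributes): Unit = {
    names += qName
    ()
  }
}

// javax.xml.parsers.SAXParser: the JAXP wrapper the scala methods expect
val saxParser = SAXParserFactory.newInstance().newSAXParser()

// org.xml.sax.XMLReader: the interface a SAX parser actually implements
val reader: XMLReader = saxParser.getXMLReader()

val handler = new ElementNameCollector
reader.setContentHandler(handler)
reader.parse(new InputSource(new StringReader("<a><b/><c/></a>")))
```

Going from a SAXParser to its XMLReader is one method call. The other direction has no such shortcut, which is why a parser that only offers an XMLReader needs the workaround shown further down.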

Some libraries give access to a SAXParser that you can use directly, as in:

val parserFactory = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
val parser = parserFactory.newSAXParser()
val source = new org.xml.sax.InputSource("http://www.scala-lang.org")
val adapter = new scala.xml.parsing.NoBindingFactoryAdapter
adapter.loadXML(source, parser)

If you have a normal SAX parser, you must still derive some ugly classes that directly use the XMLReader you've got and ignore any SAXParser handling code added in 2.8, like this:

import org.xml.sax.InputSource

import scala.xml._
import parsing._

class HTML5Parser extends NoBindingFactoryAdapter {

  override def loadXML(source : InputSource, _p: SAXParser) = {
    loadXML(source)
  }

  def loadXML(source : InputSource) = {
    import nu.validator.htmlparser.{sax,common}
    import sax.HtmlParser
    import common.XmlViolationPolicy

    val reader = new HtmlParser
    reader.setXmlPolicy(XmlViolationPolicy.ALLOW)
    reader.setContentHandler(this)
    reader.parse(source)
    rootElem
  }
}

This compiles if you put this line in your sbt project definition:

  val html5parser = "nu.validator.htmlparser" % "htmlparser" % "1.2.1"

and should these days be the preferred way to read arbitrary HTML, as it uses the HTML5 reference parser with well-defined, deterministic error handling.
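
The dependency line above is written for a Scala-class project definition. If your build uses a build.sbt file instead (an assumption about your setup, sbt 0.10 or later), the same artifact would be declared roughly like this, with the coordinates copied unchanged from above:

```scala
// build.sbt: same group, artifact and version as in the project definition above
libraryDependencies += "nu.validator.htmlparser" % "htmlparser" % "1.2.1"
```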

The old way to use a SAX parser for scala 2.7

In scala 2.7 you always had to supply a new FactoryAdapter that supplies an appropriate XMLReader. This can be done by implementing the abstract method getReader in the following trait.

import _root_.org.xml.sax.{XMLReader,InputSource}
import _root_.scala.xml.{Node,TopScope}

trait SAXFactoryAdapter extends NonBindingFactoryAdapter {

  /** The method [getReader] has to be implemented by
      concrete subclasses */
  def getReader() : XMLReader;

  override def loadXML(source : InputSource) : Node = {
    val reader = getReader()
    reader.setContentHandler(this)
    scopeStack.push(TopScope)
    reader.parse(source)
    scopeStack.pop
    return rootElem
  }
}

The NonBindingFactoryAdapter used here is a variant of scala's standard scala.xml.parsing.NoBindingFactoryAdapter that has been turned into a trait:

import _root_.scala.xml.parsing.FactoryAdapter
import _root_.scala.xml.factory.NodeFactory
import _root_.scala.xml.{Elem,MetaData,NamespaceBinding,
			 Node,Text,TopScope}

trait NonBindingFactoryAdapter extends FactoryAdapter
                               with NodeFactory[Elem] {

  def nodeContainsText(localName: String) = true

  // methods for NodeFactory[Elem]
  /** constructs an instance of scala.xml.Elem */
  protected def create(pre: String, label: String,
                       attrs: MetaData, scpe: NamespaceBinding,
		       children: Seq[Node]): Elem =
    Elem( pre, label, attrs, scpe, children:_* )
  
	
  // -- methods for FactoryAdapter
  def createNode(pre: String, label: String,
                 attrs: MetaData, scpe: NamespaceBinding,
                 children: List[Node] ): Elem =
    Elem( pre, label, attrs, scpe, children:_* )
	 
  def createText(text: String) = Text(text)
	
  def createProcInstr(target: String, data: String) =
    makeProcInstr(target, data)
}
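
The getReader pattern of SAXFactoryAdapter does not depend on the scala XML classes. Here is a stripped-down sketch of the same idea that runs against the JRE alone: the trait is its own content handler and leaves only the choice of XMLReader to subclasses. TextCollector, loadText and JreTextCollector are hypothetical names, and the trait merely concatenates character data instead of building a tree.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{InputSource, XMLReader}
import org.xml.sax.helpers.DefaultHandler

// Hypothetical stand-in for SAXFactoryAdapter: handles the SAX events
// itself and only asks subclasses which XMLReader to use.
trait TextCollector extends DefaultHandler {

  /** Has to be implemented by concrete subclasses. */
  def getReader(): XMLReader

  private val text = new java.lang.StringBuilder

  override def characters(ch: Array[Char], start: Int, length: Int): Unit = {
    text.append(ch, start, length)
    ()
  }

  def loadText(source: InputSource): String = {
    text.setLength(0)               // forget text from a previous run
    val reader = getReader()
    reader.setContentHandler(this)  // same wiring as in loadXML above
    reader.parse(source)
    text.toString
  }
}

// Concrete subclass that uses the parser bundled with the JRE.
class JreTextCollector extends TextCollector {
  def getReader(): XMLReader =
    SAXParserFactory.newInstance().newSAXParser().getXMLReader()
}

val collected = new JreTextCollector().loadText(
  new InputSource(new StringReader("<p>Hello, <b>world</b>!</p>")))
```

Swapping in another parser then means writing one more subclass with a different getReader, which is the whole point of the trait.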

Using a DOM parser in scala 2.7

Using a DOM parser is a little bit more tricky, as scala assumes SAX input. We have to get rid of a method in FactoryAdapter by making it always throw an exception and replace it by an equivalent method that operates on a DOM node. Then we have to override all other load methods to call this method instead.

import _root_.java.io.{InputStream, InputStreamReader, Reader,
		       File, FileDescriptor, FileInputStream}
import _root_.org.apache.xalan.xsltc.trax.DOM2SAX
import _root_.org.xml.sax.InputSource
import _root_.scala.xml.{Node,TopScope}

trait DOMFactoryAdapter extends NonBindingFactoryAdapter {

  def getDOM(reader: Reader) : _root_.org.w3c.dom.Node

  /** loading from a SAX source is useless here */
  override def loadXML(unused : InputSource) : Node = {
    throw(new Exception("Not Implemented"))
  }
	  
  def loadXML(dom: _root_.org.w3c.dom.Node) : Node = {
    val dom2sax = new DOM2SAX(dom)
    dom2sax.setContentHandler(this)
    scopeStack.push(TopScope)
    dom2sax.parse()
    scopeStack.pop
    return rootElem
  }

  /** loads XML from given file */
  override def loadFile(file: File): Node = {
    val is = new FileInputStream(file)
    val elem = load(is)
    is.close
    elem
  }
 
  /** loads XML from given file descriptor */
  override def loadFile(fileDesc: FileDescriptor): Node = {
    val is = new FileInputStream(fileDesc)
    val elem = load(is)
    is.close
    elem
  }

  /** loads XML from given file */
  override def loadFile(fileName: String): Node = {
    val is = new FileInputStream(fileName)
    val elem = load(is)
    is.close
    elem
  }

  /** loads XML from given InputStream */
  override def load(is: InputStream): Node =
    load(new InputStreamReader(is))

  /** loads XML from given Reader */
  override def load(reader: Reader): Node =
    loadXML(getDOM(reader))

  /** loads XML from given sysID */
  override def load(sysID: String): Node = {
    val is = new java.net.URL(sysID).openStream()
    val elem = load(is)
    is.close
    elem
  }
}
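
DOM2SAX is a class from Xalan's XSLTC implementation. If it is not on your classpath, the same replay of a DOM tree as SAX events can presumably be had from the standard JAXP transformation API, by running an identity transform from a DOMSource into a SAXResult. A sketch under that assumption, with a made-up NameCollector handler standing in for the factory adapter:

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.transform.TransformerFactory
import javax.xml.transform.dom.DOMSource
import javax.xml.transform.sax.SAXResult
import org.xml.sax.{Attributes, InputSource}
import org.xml.sax.helpers.DefaultHandler
import scala.collection.mutable.ListBuffer

// Hypothetical handler: records element names as the events arrive.
class NameCollector extends DefaultHandler {
  val names = ListBuffer[String]()
  override def startElement(uri: String, localName: String,
                            qName: String, attrs: Attributes): Unit = {
    names += qName
    ()
  }
}

// Build a DOM tree first, as a DOM-only HTML sanitizer would.
val builderFactory = DocumentBuilderFactory.newInstance()
builderFactory.setNamespaceAware(true)
val dom = builderFactory.newDocumentBuilder()
  .parse(new InputSource(new StringReader("<a><b/><c/></a>")))

// A Transformer created without a stylesheet copies its input unchanged,
// so this walks the DOM tree and fires SAX events on the handler.
val handler = new NameCollector
TransformerFactory.newInstance().newTransformer()
  .transform(new DOMSource(dom), new SAXResult(handler))
```

In DOMFactoryAdapter the handler would be the adapter itself, and the scopeStack bookkeeping around the call would stay as it is.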

Reading HTML

To read HTML, we might as well tell scala that the empty HTML elements don't contain any text we might be interested in:

import _root_.scala.xml.parsing.FactoryAdapter

trait HTMLFactoryAdapter extends FactoryAdapter {

  val emptyElements = Set("area", "base", "br", "col", "hr", "img",
                          "input", "link", "meta", "param")

  def nodeContainsText(localName: String) =
    !(emptyElements contains localName)
}
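
One thing to watch out for: the lookup is case sensitive, and nekoHTML reports upper-case element names in its default configuration (see the example session further down), so the set above would presumably never match there. A small sketch of the difference; nodeContainsTextIgnoringCase is a hypothetical variant, not part of the code in this post.

```scala
import java.util.Locale

val emptyElements = Set("area", "base", "br", "col", "hr", "img",
                        "input", "link", "meta", "param")

// The check as written in HTMLFactoryAdapter: case sensitive.
def nodeContainsText(localName: String): Boolean =
  !(emptyElements contains localName)

// Hypothetical variant: normalise the name before the lookup.
def nodeContainsTextIgnoringCase(localName: String): Boolean =
  !(emptyElements contains localName.toLowerCase(Locale.ROOT))
```

With these definitions, nodeContainsText("BR") still answers true, while the normalising variant treats "BR" like "br".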

Now all we have to do is implement the getReader (or getDOM) methods for the sanitizing HTML parsers we want to use. Only two of the sanitizers compared by Ben McCann a year ago support SAX, TagSoup and nekoHTML, so I present example code for these two. In addition, I present code for one of the DOM parsers, HTMLCleaner. Generalizing it to other parsers should be trivial.

TagSoup

import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl

class TagSoupFactoryAdapter extends SAXFactoryAdapter
                            with HTMLFactoryAdapter {
  private val parserFactory = new SAXFactoryImpl
  parserFactory.setNamespaceAware(true)

  def getReader() = parserFactory.newSAXParser().getXMLReader()
}
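
The setNamespaceAware(true) call is what makes a JAXP factory report namespace URIs to the content handler at all. The following sketch shows the effect with the JRE's own parser rather than with TagSoup (whose behaviour may well differ); firstElementNamespace is a hypothetical helper.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{Attributes, InputSource}
import org.xml.sax.helpers.DefaultHandler

// Hypothetical helper: parses the document and returns the namespace URI
// that the parser reports for the root element.
def firstElementNamespace(xml: String, namespaceAware: Boolean): String = {
  var found: Option[String] = None
  val handler = new DefaultHandler {
    override def startElement(uri: String, localName: String,
                              qName: String, attrs: Attributes): Unit =
      if (found.isEmpty) found = Some(uri)
  }
  val factory = SAXParserFactory.newInstance()
  factory.setNamespaceAware(namespaceAware)
  factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler)
  found.getOrElse("")
}

val svg = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100"/>"""

val withNamespaces    = firstElementNamespace(svg, namespaceAware = true)
val withoutNamespaces = firstElementNamespace(svg, namespaceAware = false)
```

Only the namespace-aware run hands the SVG namespace URI to the handler. Without the flag the scala tree built from those events has nothing to bind the element to.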

nekoHTML

import org.cyberneko.html.parsers.SAXParser

class NekoHTMLFactoryAdapter extends SAXFactoryAdapter
                             with HTMLFactoryAdapter {
  def getReader() = new SAXParser
}

HTMLCleaner

import _root_.java.io.Reader
import _root_.org.htmlcleaner.{HtmlCleaner,DomSerializer}

class HTMLCleanerFactoryAdapter extends DOMFactoryAdapter
                                with HTMLFactoryAdapter {
  private val cleaner = new HtmlCleaner
  private val props = cleaner.getProperties()
  private val serializer = new DomSerializer(props, true)

  def getDOM(reader: Reader) = {
    val node = cleaner.clean(reader)
    serializer.createDOM(node);
  }
}

This is not very efficient: HTMLCleaner first parses the document into an internal tree format, then it transforms it to a standard DOM tree, and then we build a scala XML tree from the DOM tree, so we build a total of three tree representations for the document.

Using it

Now you can put this code into a package (I chose to call mine de.hars.scalaxml) and use it to parse some HTML files. Here is an example session (sorry for the line length, but the important parts are at the beginning of the lines):

$ scala -cp build/scalaxml.jar:/usr/share/java/tagsoup-1.2.jar:/usr/share/java/xercesImpl.jar:/usr/share/java/nekohtml.jar:/usr/share/java/htmlcleaner2_1.jar:/usr/share/java/xalan2.jar
Welcome to Scala version 2.7.3.final (Java HotSpot(TM) Client VM, Java 1.5.0_16).
Type in expressions to have them evaluated.
Type :help for more information.

scala> import de.hars.scalaxml._
import de.hars.scalaxml._

scala> val url = "http://www.scala-lang.org"
url: java.lang.String = http://www.scala-lang.org

scala> new TagSoupFactoryAdapter load url
res0: scala.xml.Node =
<html xml:lang="en" lang="en">
<head>
<title>The Scala Programming Language</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"></meta>
<link href="/rss.xml" title="Front page feed" type="application/rss+xml" rel="alternate"></link>
<link type="image/x-icon" href="/sites/default/files/favicon.gif" rel="shortcut icon"></link>
<link href="/sites/...
scala> new NekoHTMLFactoryAdapter load url
res1: scala.xml.Node =
<HTML xml:lang="en" lang="en" xmlns="http://www.w3.org/1999/xhtml">
<HEAD>
  <TITLE>The Scala Programming Language</TITLE>
  <META content="text/html; charset=utf-8" http-equiv="Content-Type"></META>
<LINK href="/rss.xml" title="Front page feed" type="application/rss+xml" rel="alternate"></LINK>

<LINK type="image/x-icon" href="/sites/default/files/favicon.gif"...
scala> new HTMLCleanerFactoryAdapter load url
res2: scala.xml.Node =
<html>
<head>
<title>The Scala Programming Language</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
<link type="application/rss+xml" title="Front page feed" rel="alternate" href="/rss.xml"></link>
<link type="image/x-icon" rel="shortcut icon" href="/sites/default/files/favicon.gif"></link>
<link type="text/css" rel="stylesheet" medi...

(TagSoup and HTMLCleaner output has been wrapped by hand.)

One thing that is quite obvious is that the default configurations are problematic. TagSoup and HTMLCleaner seem to have some problems with namespaces, and nekoHTML turns every tag into uppercase. So all have problems with modern pages that are already XML. Loading a page like Sam Ruby's that contains SVG gives suboptimal results. Where the page contains

<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100" viewBox="0 0 100 100">
  <path d="M57,11c40-22,42-2,35,12c8-27-15-20-30-11z" fill="#47b"/>
  <path d="M36,56h56c4-60-83-60-86-6c13-16,26-26,36-30l-29,53c20,23,64,26,79-12h-30c0,14-26,12-25-5zM37,43c0-17,26-17,26,0zM39,89c-10,7-42,15-26-16l29-52c-15,6-36,40-37,48c-12,35,14,37,37,20" fill="#47b"/>
</svg>

you lose the namespace with TagSoup

<svg viewbox="0 0 100 100" height="100" width="100">
  <path fill="#47b" d="M57,11c40-22,42-2,35,12c8-27-15-20-30-11z"></path>
  <path fill="#47b" d="M36,56h56c4-60-83-60-86-6c13-16,26-26,36-30l-29,53c20,23,64,26,79-12h-30c0,14-26,12-25-5zM37,43c0-17,26-17,26,0zM39,89c-10,7-42,15-26-16l29-52c-15,6-36,40-37,48c-12,35,14,37,37,20"></path>
</svg>

and get wrong tag names with NekoHTML (the double xmlns is a bug in scala's XML library)

<SVG viewbox="0 0 100 100" height="100" width="100" xmlns="http://www.w3.org/2000/svg" xmlns="http://www.w3.org/1999/xhtml">
  <PATH fill="#47b" d="M57,11c40-22,42-2,35,12c8-27-15-20-30-11z"></PATH>
  <PATH fill="#47b" d="M36,56h56c4-60-83-60-86-6c13-16,26-26,36-30l-29,53c20,23,64,26,79-12h-30c0,14-26,12-25-5zM37,43c0-17,26-17,26,0zM39,89c-10,7-42,15-26-16l2..."></PATH>
</SVG>

But the clear loser (at least in the default configuration) is HTMLCleaner which totally garbles the structure:

<svg width="100" viewbox="0 0 100 100" height="100">
  <path fill="#47b" d="M57,11c40-22,42-2,35,12c8-27-15-20-30-11z">
  <path fill="#47b" d="M36,56h56c4-60-83-60-86-6c13-16,26-26,36-30l-29,53c20,23,64,26,79-12h-30c0,14-26,12-25-5zM37,43c0-17,26-17,26,0zM39,89c-10,7-42,15-26-16l29-52c-15,6-36,40-37,48c-12,35,14,37,37,20">
</path></path></svg>

Caveat emptor, or at least read the documentation and fix the code according to your needs.

The code

Here is the source code for 2.7 as a tar.gz (you will probably have to change some paths in the build.xml).