Using arbitrary SAX sources in Scala 2.8
By default, Scala reads XML files using the standard SAX parser that comes with your JRE. If you want to change this, you have to supply another one. Since Scala 2.8, several methods of the Scala XML framework accept a SAX parser to use, for example this method in FactoryAdapter:
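As a rough illustration of handing a parser to the library explicitly (this is not the snippet from the original post), here is a minimal sketch. It assumes scala.xml is on the classpath (bundled with Scala 2.8, a separate scala-xml module since Scala 2.11) and uses XML.withSAXParser; the object name ExplicitParserExample is made up for the example.

```scala
import javax.xml.parsers.SAXParserFactory
import scala.xml.XML

// Hypothetical example object, not part of the original code.
object ExplicitParserExample {
  // Build a SAXParser by hand instead of relying on the library default.
  val factory = SAXParserFactory.newInstance()
  factory.setNamespaceAware(false)
  val parser = factory.newSAXParser()

  // XML.withSAXParser returns a loader that parses with the given parser.
  val doc = XML.withSAXParser(parser).loadString("<a><b>text</b></a>")
}
```

The parser here is still the JRE's own, so this only shows where a different SAXParser would be plugged in.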
But unfortunately, all these methods take a SAXParser, which
is the wrapper returned by the built-in SAX parser factory that can be
configured via command line parameters or global properties when starting
the JVM.
SAXParser is not the interface implemented by normal SAX
parsers. That interface is called XMLReader (there is a cheap
joke about cache invalidation and off-by-one errors lurking somewhere around
here).
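To make the distinction concrete, here is a small sketch that uses only JDK classes. It obtains a SAXParser from the built-in factory, pulls the underlying XMLReader out of it, and then drives that XMLReader directly. The names ReaderVersusParser and NameCollector are invented for this example.

```scala
import java.io.StringReader
import javax.xml.parsers.{SAXParser, SAXParserFactory}
import org.xml.sax.{Attributes, InputSource, XMLReader}
import org.xml.sax.helpers.DefaultHandler
import scala.collection.mutable.ListBuffer

// Hypothetical example object, not part of the original code.
object ReaderVersusParser {
  // Handler that records the qualified name of every start tag it sees.
  class NameCollector extends DefaultHandler {
    val names = ListBuffer[String]()
    override def startElement(uri: String, localName: String,
                              qName: String, attributes: Attributes): Unit = {
      names += qName
    }
  }

  // The JAXP wrapper type, as produced by the built-in factory.
  val parser: SAXParser = SAXParserFactory.newInstance().newSAXParser()

  // The SAX2 interface that parser libraries actually implement.
  val reader: XMLReader = parser.getXMLReader

  // Drive the XMLReader directly, without going through SAXParser.parse.
  def elementNames(xml: String): List[String] = {
    val handler = new NameCollector
    reader.setContentHandler(handler)
    reader.parse(new InputSource(new StringReader(xml)))
    handler.names.toList
  }
}
```

A third-party parser typically hands you only the XMLReader half of this picture, which is why the SAXParser-based methods do not help directly.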
Some libraries give access to a SAXParser that you can use
directly like in
If you have a normal SAX parser, you must still derive some ugly classes
that directly use the XMLReader you've got and ignore
any SAXParser handling code added in 2.8, like this:
This compiles if you put this line in your sbt project definition:
and should these days be the preferred way to read arbitrary HTML, as it
uses the HTML5 reference parser with well-defined, deterministic error
handling.
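A different route, which this post does not take, would be to wrap the XMLReader in a small SAXParser subclass so that it fits the SAXParser-based methods. The sketch below assumes that the inherited SAXParser.parse(InputSource, DefaultHandler) delegates to getXMLReader, as the JDK implementation does. The class XMLReaderSAXParser and the demo object are hypothetical names, and I have not tried this against scala.xml itself.

```scala
import java.io.StringReader
import javax.xml.parsers.{SAXParser, SAXParserFactory}
import org.xml.sax.{Attributes, InputSource, SAXException, XMLReader}
import org.xml.sax.helpers.DefaultHandler

// Hypothetical adapter (not part of scala.xml or the JDK):
// presents an arbitrary XMLReader through the SAXParser API.
class XMLReaderSAXParser(reader: XMLReader) extends SAXParser {
  override def getXMLReader(): XMLReader = reader

  // The SAX1 Parser interface is deprecated; this sketch does not offer it.
  override def getParser(): org.xml.sax.Parser =
    throw new UnsupportedOperationException("SAX1 Parser is not supported")

  // Readers may not recognize every feature; report false in that case.
  private def feature(name: String): Boolean =
    try reader.getFeature(name)
    catch { case _: SAXException => false }

  override def isNamespaceAware(): Boolean =
    feature("http://xml.org/sax/features/namespaces")
  override def isValidating(): Boolean =
    feature("http://xml.org/sax/features/validation")

  override def setProperty(name: String, value: AnyRef): Unit =
    reader.setProperty(name, value)
  override def getProperty(name: String): AnyRef =
    reader.getProperty(name)
}

object XMLReaderSAXParserDemo {
  // Stand-in for a third-party reader: the JDK's own XMLReader.
  def demoReader(): XMLReader =
    SAXParserFactory.newInstance().newSAXParser().getXMLReader

  // Counts start tags by going through the inherited SAXParser.parse.
  def countElements(xml: String): Int = {
    var count = 0
    val handler = new DefaultHandler {
      override def startElement(uri: String, localName: String,
                                qName: String, attributes: Attributes): Unit = {
        count += 1
      }
    }
    new XMLReaderSAXParser(demoReader())
      .parse(new InputSource(new StringReader(xml)), handler)
    count
  }
}
```

Whether this is less ugly than deriving adapter classes is a matter of taste; it mainly saves overriding the load methods.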
The old way to use a SAX parser for Scala 2.7
In Scala 2.7 you always had to supply a new FactoryAdapter that supplies an appropriate XMLReader. This can be
done by implementing the abstract method getReader in the
following trait.
The NonBindingFactoryAdapter used here is a variant
of Scala's standard
scala.xml.parsing.NoBindingFactoryAdapter that has been
turned into a trait:
Using a DOM parser in Scala 2.7
Using a DOM parser is a little bit more tricky, as Scala assumes SAX input. We have to get rid of a method in FactoryAdapter
by making it always throw an exception and replace it by an equivalent
method that operates on a DOM node.
Then we have to override all other load methods to call this method
instead.
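The core of such a DOM-based method is a recursive walk over the DOM tree. The sketch below shows only that walk, using JDK classes and a tiny stand-in tree type instead of scala.xml nodes. The names DomWalk, Tree, Element and TextNode are all invented for the example, and comments and processing instructions are simply dropped.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.{Node => DomNode}

// Hypothetical example object; Tree is a stand-in for scala.xml nodes.
object DomWalk {
  sealed trait Tree
  case class Element(name: String, attributes: Map[String, String],
                     children: List[Tree]) extends Tree
  case class TextNode(text: String) extends Tree

  // Child nodes of a DOM node as a Scala list.
  def childNodes(node: DomNode): List[DomNode] = {
    val list = node.getChildNodes
    (0 until list.getLength).map(i => list.item(i)).toList
  }

  // Convert one DOM node; node types other than element and text yield None.
  def convert(node: DomNode): Option[Tree] = node.getNodeType match {
    case DomNode.ELEMENT_NODE =>
      val attrs = node.getAttributes
      val attrMap = (0 until attrs.getLength).map { i =>
        val a = attrs.item(i)
        a.getNodeName -> a.getNodeValue
      }.toMap
      val kids = childNodes(node).flatMap(c => convert(c).toList)
      Some(Element(node.getNodeName, attrMap, kids))
    case DomNode.TEXT_NODE | DomNode.CDATA_SECTION_NODE =>
      Some(TextNode(node.getNodeValue))
    case _ => None
  }

  // Parse a string with the JDK's DOM parser and convert the root element.
  def parse(xml: String): Tree = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val doc = builder.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
    convert(doc.getDocumentElement).get
  }
}
```

In the real adapter the Element and TextNode constructors would be replaced by whatever the FactoryAdapter uses to create Scala XML nodes.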
Reading HTML
To read HTML, we might as well tell Scala that the empty HTML elements don't contain any text we might be interested in:
Now all we have to do is implement the getReader
methods for the sanitizing HTML parsers we want to use.
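The remark about empty HTML elements can be sketched as a simple predicate with the same shape as FactoryAdapter's nodeContainsText method. The element list below is my own choice (the void elements of HTML5), not necessarily the list used in the original code, and the object name HtmlEmptyElements is made up.

```scala
import java.util.Locale

// Hypothetical helper object, not part of the original code.
object HtmlEmptyElements {
  // Void elements as listed by HTML5; adjust for the HTML dialect at hand.
  val emptyElements: Set[String] = Set(
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "param", "source", "track", "wbr")

  // True if text inside an element with this name should be kept.
  // Names are compared case-insensitively, since some parsers
  // report upper-case tag names.
  def nodeContainsText(localName: String): Boolean =
    !emptyElements.contains(localName.toLowerCase(Locale.ENGLISH))
}
```

An adapter could override nodeContainsText to delegate to this predicate.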
Only two of the sanitizers
TagSoup
nekoHTML
HTMLCleaner
This is not very efficient: HTMLCleaner first parses the document into
an internal tree format, then it transforms it to a standard DOM tree,
and then we build a scala XML tree from the DOM tree, so we build a
total of three tree representations for the document.
Using it
Now you can put this code into a package (I chose to call mine de.hars.scalaxml) and use it to parse some HTML files.
Here is an example session
(sorry for the line length, but the important parts are at the beginning
of the lines):
$ scala -cp build/scalaxml.jar:/usr/share/java/tagsoup-1.2.jar:/usr/share/java/xercesImpl.jar:/usr/share/java/nekohtml.jar:/usr/share/java/htmlcleaner2_1.jar:/usr/share/java/xalan2.jar
Welcome to Scala version 2.7.3.final (Java HotSpot(TM) Client VM, Java 1.5.0_16).
Type in expressions to have them evaluated.
Type :help for more information.
scala> import de.hars.scalaxml._
import de.hars.scalaxml._
scala> val url = "http://www.scala-lang.org"
url: java.lang.String = http://www.scala-lang.org
scala> new TagSoupFactoryAdapter load url
res0: scala.xml.Node =
<html xml:lang="en" lang="en">
<head>
<title>The Scala Programming Language</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"></meta>
<link href="/rss.xml" title="Front page feed" type="application/rss+xml" rel="alternate"></link>
<link type="image/x-icon" href="/sites/default/files/favicon.gif" rel="shortcut icon"></link>
<link href="/sites/...
scala> new NekoHTMLFactoryAdapter load url
res1: scala.xml.Node =
<HTML xml:lang="en" lang="en" xmlns="http://www.w3.org/1999/xhtml">
<HEAD>
<TITLE>The Scala Programming Language</TITLE>
<META content="text/html; charset=utf-8" http-equiv="Content-Type"></META>
<LINK href="/rss.xml" title="Front page feed" type="application/rss+xml" rel="alternate"></LINK>
<LINK type="image/x-icon" href="/sites/default/files/favicon.gif"...
scala> new HTMLCleanerFactoryAdapter load url
res2: scala.xml.Node =
<html>
<head>
<title>The Scala Programming Language</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
<link type="application/rss+xml" title="Front page feed" rel="alternate" href="/rss.xml"></link>
<link type="image/x-icon" rel="shortcut icon" href="/sites/default/files/favicon.gif"></link>
<link type="text/css" rel="stylesheet" medi...
One thing that is quite obvious is that the default configurations
are problematic. TagSoup and HTMLCleaner seem to have some
problems with namespaces, and nekoHTML turns every tag
into uppercase.
So all of them have problems with modern pages that are already XML.
Loading a page like
Sam Ruby's that contains SVG gives suboptimal results.
Where the page contains
you lose the namespace with TagSoup
and get wrong tag names with NekoHTML (the double xmlns is a
bug in Scala's XML library)
But the clear loser (at least in the default configuration) is HTMLCleaner, which totally garbles the structure:
Caveat emptor, or at least read the documentation and fix the code according to your needs.
The code
Here is the source code for 2.7 as a tar.gz (you will probably have to change some paths in the build.xml).