Using arbitrary SAX sources in scala 2.8

By default, scala reads XML files using the standard SAX parser that comes with your JRE. If you want to change this, you have to supply another one. Since scala 2.8, several methods of the scala XML framework accept a SAX parser to use, for example this method in
FactoryAdapter:

But unfortunately, all these methods take a
SAXParser, which is the wrapper returned by the built-in SAX parser factory, the one that can be configured via command line parameters or global properties when starting the JVM.
SAXParser is not the interface implemented by normal SAX parsers. That interface is called
XMLReader (there is a cheap joke about cache invalidation and off-by-one errors lurking somewhere around here).
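
To make the distinction concrete, here is a minimal sketch that uses only JDK classes: SAXParserFactory hands out the SAXParser wrapper, and getXMLReader exposes the XMLReader it wraps. The names SaxParserVsXmlReader, NameCollector and elementNames are made up for this illustration and are not part of scala's XML library.

```scala
import java.io.StringReader
import javax.xml.parsers.{SAXParser, SAXParserFactory}
import org.xml.sax.{Attributes, InputSource, XMLReader}
import org.xml.sax.helpers.DefaultHandler
import scala.collection.mutable.ListBuffer

// Hypothetical helper object, for illustration only.
object SaxParserVsXmlReader {

  // A tiny SAX handler that records the qualified name of every element.
  class NameCollector extends DefaultHandler {
    val names = ListBuffer.empty[String]
    override def startElement(uri: String, localName: String,
                              qName: String, attrs: Attributes): Unit =
      names += qName
  }

  // The wrapper type that the JRE's built-in factory returns.
  def jreParser(): SAXParser = SAXParserFactory.newInstance().newSAXParser()

  // Works with any XMLReader, whether or not a SAXParser wraps it.
  def elementNames(reader: XMLReader, xml: String): List[String] = {
    val handler = new NameCollector
    reader.setContentHandler(handler)
    reader.parse(new InputSource(new StringReader(xml)))
    handler.names.toList
  }
}
```

Calling `SaxParserVsXmlReader.elementNames(SaxParserVsXmlReader.jreParser().getXMLReader(), "<a><b/><c/></a>")` unwraps the JRE parser with a single call. Going the other way, from a bare XMLReader to a SAXParser, means writing a subclass of the abstract SAXParser class, which is why an API that only accepts SAXParser is awkward for third-party parsers.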
Some libraries give access to a
SAXParser that you can use
directly like in

If you have a normal SAX parser, you must still derive some ugly classes
that directly use the
XMLReader you've got and ignore the
SAXParser handling code added in 2.8, like this:
This compiles if you put this line in your sbt project definition:

and should these days be the preferred way to read arbitrary HTML, as it
uses the HTML5 reference parser with well-defined, deterministic error
handling.

The old way to use a SAX parser for scala 2.7

In scala 2.7 you always had to supply a new FactoryAdapter that supplies an appropriate
XMLReader. This can be done by implementing the abstract method
getReader in the following trait. The
NonBindingFactoryAdapter used here is a variant of scala's standard
scala.xml.parsing.NoBindingFactoryAdapter that has been turned into a trait:

Using a DOM parser in scala 2.7

Using a DOM parser is a little bit more tricky, as scala assumes SAX input. We have to get rid of a method in
FactoryAdapter by making it always throw an exception and replace it with an equivalent method that operates on a DOM node. Then we have to override all other load methods to call this method instead.
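
A different route from the one described above, sketched here with JDK classes only: instead of overriding the load methods, an identity Transformer can replay a DOM tree as SAX events into any ContentHandler (FactoryAdapter is one, since it extends DefaultHandler). DomToSax, replay, parseDom and elementNames are hypothetical names, and I have not checked how this interacts with FactoryAdapter's internal state in any particular scala version, so treat it as an idea rather than a drop-in replacement.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.transform.TransformerFactory
import javax.xml.transform.dom.DOMSource
import javax.xml.transform.sax.SAXResult
import org.w3c.dom.Node
import org.xml.sax.{Attributes, ContentHandler, InputSource}
import org.xml.sax.helpers.DefaultHandler
import scala.collection.mutable.ListBuffer

// Hypothetical helper object, for illustration only.
object DomToSax {

  // Replays the DOM subtree below `node` as SAX events into `handler`,
  // using the JDK's identity transformation.
  def replay(node: Node, handler: ContentHandler): Unit =
    TransformerFactory.newInstance().newTransformer()
      .transform(new DOMSource(node), new SAXResult(handler))

  // Stand-in for a DOM-producing parser such as an HTML cleaner.
  def parseDom(xml: String): Node = {
    val factory = DocumentBuilderFactory.newInstance()
    factory.setNamespaceAware(true)
    factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)))
  }

  // Demonstrates replay() by collecting the element names it reports.
  def elementNames(node: Node): List[String] = {
    val names = ListBuffer.empty[String]
    replay(node, new DefaultHandler {
      override def startElement(uri: String, localName: String,
                                qName: String, attrs: Attributes): Unit =
        names += qName
    })
    names.toList
  }
}
```

The appeal of this design is that the SAX-based adapter code stays untouched; the price is an extra pass over the tree.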
Reading HTML

To read HTML, we might as well tell scala that the empty HTML elements don't contain any text we might be interested in:

Now all we have to do is implement the
getReader methods for the sanitizing HTML parsers we want to use. Only two of the sanitizers compared by Ben McCann a year ago support SAX, TagSoup and nekoHTML, so I present example code for these two. In addition, I present code for one of the DOM parsers, HTMLCleaner. Generalizing it to other parsers should be trivial.
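
For the empty-element step mentioned above, a minimal sketch of the idea: keep a set of the elements that HTML 4.01 declares EMPTY and answer false for them. In an adapter this predicate would back FactoryAdapter's abstract nodeContainsText method; the object and value names here are hypothetical, and the case-insensitive comparison is my own assumption, added because some parsers report upper-case names.

```scala
import java.util.Locale

// Hypothetical helper object, for illustration only.
object HtmlEmptyElements {

  // Elements declared EMPTY in the HTML 4.01 DTDs.
  val emptyElements: Set[String] = Set(
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param")

  // False for empty elements, true for everything else.
  // Names are lower-cased first so that "BR" and "br" behave the same.
  def nodeContainsText(localName: String): Boolean =
    !emptyElements.contains(localName.toLowerCase(Locale.ROOT))
}
```

With this, `HtmlEmptyElements.nodeContainsText("br")` is false while `HtmlEmptyElements.nodeContainsText("p")` is true, so whitespace that a sanitizer emits inside a `<br>` would not end up as a text node.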
Using it

Now you can put this code into a package (I chose to call mine
de.hars.scalaxml) and use it to parse some HTML files. Here is an example session (sorry for the line length, but the important parts are at the beginning of the lines):
$ scala -cp build/scalaxml.jar:/usr/share/java/tagsoup-1.2.jar:/usr/share/java/xercesImpl.jar:/usr/share/java/nekohtml.jar:/usr/share/java/htmlcleaner2_1.jar:/usr/share/java/xalan2.jar
Welcome to Scala version 2.7.3.final (Java HotSpot(TM) Client VM, Java 1.5.0_16).
Type in expressions to have them evaluated.
Type :help for more information.

scala> import de.hars.scalaxml._
import de.hars.scalaxml._

scala> val url = "http://www.scala-lang.org"
url: java.lang.String = http://www.scala-lang.org

scala> new TagSoupFactoryAdapter load url
res0: scala.xml.Node = <html xml:lang="en" lang="en"> <head> <title>The Scala Programming Language</title> <meta content="text/html; charset=utf-8" http-equiv="Content-Type"></meta> <link href="/rss.xml" title="Front page feed" type="application/rss+xml" rel="alternate"></link> <link type="image/x-icon" href="/sites/default/files/favicon.gif" rel="shortcut icon"></link> <link href="/sites/...

scala> new NekoHTMLFactoryAdapter load url
res1: scala.xml.Node = <HTML xml:lang="en" lang="en" xmlns="http://www.w3.org/1999/xhtml"> <HEAD> <TITLE>The Scala Programming Language</TITLE> <META content="text/html; charset=utf-8" http-equiv="Content-Type"></META> <LINK href="/rss.xml" title="Front page feed" type="application/rss+xml" rel="alternate"></LINK> <LINK type="image/x-icon" href="/sites/default/files/favicon.gif"...

scala> new HTMLCleanerFactoryAdapter load url
res2: scala.xml.Node = <html> <head> <title>The Scala Programming Language</title> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta> <link type="application/rss+xml" title="Front page feed" rel="alternate" href="/rss.xml"></link> <link type="image/x-icon" rel="shortcut icon" href="/sites/default/files/favicon.gif"></link> <link type="text/css" rel="stylesheet" medi...

One thing that is quite obvious is that the default configurations
are problematic. TagSoup and HTMLCleaner seem to have some
problems with namespaces, and nekoHTML turns every tag name into upper case.
So all of them have problems with modern pages that are already XML.
Loading a page like Sam Ruby's that contains SVG gives suboptimal results.
Where the page contains
you lose the namespace with TagSoup
and get wrong tag names with NekoHTML (the double
xmlns is a
bug in scala's xml library)
But the clear loser (at least in the default configuration) is HTMLCleaner which totally garbles the structure:
Caveat emptor, or at least read the
documentation and fix the code according to your needs.
The code

Here is the source code for 2.7 as a tar.gz (you will probably have to change some paths in the