Thursday, November 19, 2009

A Bad, Bad SPARQL Pattern & A Good smf:trace

I should have known better...

What's wrong with the following?
  SELECT ?a ?b ?c ?d
WHERE {
?subject :child ?a,?b,?c,?d .
?a a :A .
?b a :B .
?c a :C .
?d a :D .

}
It is equivalent to:
  SELECT ?a ?b ?c ?d
WHERE {
?subject :child ?a .
?subject :child ?b .
?subject :child ?c .
?subject :child ?d .
?a a :A .
?b a :B .
?c a :C .
?d a :D .

}
The answer is "performance". The evaluation may appear to run forever as the SPARQL engine tries every combination of ?a, ?b, ?c and ?d obtained from the :child statements (in an order that seems random each time) looking for the ones that match the type restrictions. The fix:
  SELECT ?a ?b ?c ?d
WHERE {
?subject :child ?a .
?a a :A .
?subject :child ?b .
?b a :B .
?subject :child ?c .
?c a :C .
?subject :child ?d .
?d a :D .

}
Perhaps less elegant by some measures but expressed this way the immediate type test of each variable reduces the remaining list of properties that need be checked in the pattern. This is the "check as you go" vs the initial "grab first and check later" approach which will cause your computer to overheat and shut off before evaluation ever completes (assuming at least a hundred instances).
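To get a feel for the difference, here is a toy JavaScript model of the two orderings. It is only a sketch under assumptions: the data (seven children, one type each), the function name countCandidates and the way candidates are counted are all invented for illustration, so the numbers will not match what the TBC profiler reports or how any real SPARQL engine plans the query.

```javascript
// Toy model of one ?subject with seven :child values, each of a single type.
// All names and data here are hypothetical, for illustration only.
const children = ["A", "B", "C", "D", "E", "F", "G"].map((type, i) => ({
  id: "child" + (i + 1),
  type,
}));

// Types required for ?a, ?b, ?c, ?d, in that order.
const wanted = ["A", "B", "C", "D"];

// Binds one variable per level and counts every candidate child tried.
// checkEarly = true  -> test the type right after binding ("check as you go")
// checkEarly = false -> bind all variables first, test types at the end
function countCandidates(children, wanted, checkEarly) {
  let tried = 0;
  const solutions = [];

  function extend(binding) {
    const depth = binding.length;
    if (depth === wanted.length) {
      // The late check is only needed when types were not tested on the way.
      if (checkEarly || binding.every((c, i) => c.type === wanted[i])) {
        solutions.push(binding.map((c) => c.id));
      }
      return;
    }
    for (const child of children) {
      tried += 1;
      // Prune immediately when the type does not fit this variable.
      if (checkEarly && child.type !== wanted[depth]) continue;
      extend([...binding, child]);
    }
  }

  extend([]);
  return { tried, solutions };
}

const late = countCandidates(children, wanted, false);
const early = countCandidates(children, wanted, true);
console.log("grab first, check later:", late.tried, "candidates tried");
console.log("check as you go:", early.tried, "candidates tried");
```

Both runs find the same single solution; the only difference is how many dead-end bindings are explored before getting there, and that gap widens quickly as children and variables are added.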

Just how bad is it? Painfully bad. Using the SPARQL Profiler feature in TBC 3.2 I set up a comparison test of the two patterns queried over a single instance with 7 :child properties. In the original "grab first check later" pattern the profiler reported 91,760,824 finds over the same number of triples. In the "check as you go" pattern the profiler reported only 76 finds over the same number of triples required. How many orders of magnitude difference is that? About six: the ratio is roughly 1.2 million to one. In my real world problem I had several hundred instances and a dozen :child nodes.

I ran into this while mapping an XML file imported into TopBraid Composer (TBC), which automatically converts it into RDF under the "Semantic XML" model. With the RDF representation you can then map the SXML representation into your target model. Previously I had been through this exercise with SPARQLMotion support, but this time around I wanted to try it with SPIN, following Holger Knublauch's "Ontology Mapping with SPIN Templates" blog entry. Really neat stuff once I got the SPARQL patterns right.

Along the way I also discovered smf:trace while trying to figure out where the performance bottleneck was. Initially I had thought it was the SPIN functions that I was using, but smf:trace allowed me to quickly realize that the functions were quite performant and not the culprit at all. smf:trace works like smf:buildURI and smf:buildString but echoes your string into the error log. Use smf:trace with a dummy assignment in a LET statement within the WHERE clause of a SPARQL expression, as per:

LET ( ?foo := smf:trace( "myFunction: {?result}" ) )

So the moral of the story: the way you would write N3 triples nice and concisely in an ontology body is not always the best way to express the same statements in a SPARQL query body.

Saturday, November 07, 2009

New TBC Feature - Group by Namespace

One of the greatest strengths of TopBraid Composer (TBC), and one that sets it apart from other ontology tools, is its ability to easily work with and manage collections of ontologies. When it is so easy to work with independent ontologies you'll find yourself going modular quite naturally. And why not? Modularity is a fundamental programming organization principle that you've used for years, decades even, and it makes reuse possible. Who writes monolithic programs anymore? The same goes for ontology composition.

So when the capability is at your fingertips again you'll find yourself using it. The unintended consequence, however, will be a proliferation of namespaces, typically but not necessarily at the level of one namespace per ontology file. As you get even more advanced in ontology asset management, you'll begin to split content under a single namespace across several files. Remember, namespaces and baseURIs are not one and the same.


With lots of namespaces in play, searching for a particular class or property takes increasing effort, whether from longer lists to scroll through or from longer lists of search results to sort through as a search string matches more and more names across your inventory. Here is where the new "Group by Namespace" feature comes to the rescue.


In the lower left corner of the TBC "Classes" and "Properties" views, on the left side of the search input area, you will now find an icon of three stacked items. By default the classes and properties are listed in alphabetical order, with hierarchy for the subclass and subproperty trees. Click this icon and the organization changes so that a list of namespace prefixes is shown at the root level. Each namespace prefix now acts like a folder that contains a flat, but still alphabetical, list of the properties or classes defined under that namespace.

This may not seem like such a great convenience at first, but when you work with hundreds or even thousands of items under tens and tens of namespaces the feature's utility quickly comes to light.

Tuesday, October 27, 2009

Counting on SPARQL Aggregates

This is largely a note to myself so I can remember the syntax for a DISTINCT count which is somewhat unintuitive. The regular syntax to get a count:
  SELECT (count(?foo) AS ?count)
WHERE { ... }
However, this will not be a unique count of the ?foo items. To make the count DISTINCT the syntax is:
  SELECT (count(DISTINCT ?foo) AS ?count)
WHERE { ... }
Not:
  SELECT DISTINCT (count(?foo) AS ?count)
WHERE { ... }
where the DISTINCT in this case has no impact whatsoever.
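The difference between the three forms can be mimicked outside SPARQL. The following JavaScript is just an analogy under assumptions: the fooBindings values are made up, and the arrays stand in for the solution rows a query engine would produce.

```javascript
// Hypothetical values bound to ?foo across the matched rows.
const fooBindings = ["ex:a", "ex:b", "ex:a", "ex:c", "ex:b"];

// count(?foo): every bound value is counted, duplicates included.
const count = fooBindings.length;

// count(DISTINCT ?foo): duplicates are removed before counting.
const countDistinct = new Set(fooBindings).size;

// SELECT DISTINCT (count(?foo) AS ?count): the aggregate already yields a
// single row, so removing duplicate rows afterwards changes nothing.
const resultRows = [{ count }];
const distinctRows = [
  ...new Map(resultRows.map((row) => [JSON.stringify(row), row])).values(),
];

console.log("count(?foo):", count);
console.log("count(DISTINCT ?foo):", countDistinct);
console.log("DISTINCT applied to the result row:", distinctRows[0].count);
```

The last line is the point of the note: DISTINCT outside the aggregate operates on the result rows, not on the values being counted.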

The use of the "AS" keyword is itself unintuitive as it breaks the norm for how variable assignment is expressed in the WHERE clause. In fact I think it would be good to drop it and keep the SPARQL vocabulary minimal. The only use of "AS" that I've encountered is to assign a variable in a SELECT statement in the form:
  SELECT (count(?foo) AS ?count) ...
The oddity here is that the variable ?count is assigned on the right hand side of the operator, "AS", instead of the left hand side as we've gotten used to with LET functions. Why not use LET here as well? For example:
  SELECT LET(?count := count(?foo)) ...


Wednesday, September 30, 2009

A New SPIN Cycle

My first SPIN cycle was about 10 months ago in the first part of December. It was short lived as I had to dive deep into other things quickly and the new concepts that I was struggling with then didn't get the chance to sink in before their memory would begin to fade.

At that time I was struck by SPIN's potential and the promise of having a semantic rule language that applied a language that I already knew and invoked daily: SPARQL. Working previously with SWRL had left a bad taste in my mouth for rule languages. Nothing against SWRL itself; it was the engines that processed it that seemed to be the problem. They were disappointingly slow, to say the least. Just to detect simple "uncle" relationships in a small family tree the engines would take tens of minutes, if they didn't crash altogether. The logician in me was drawn toward rule based specifications, but the pragmatist in me was left wondering why use rules when you could get the same information a thousand times faster with a SPARQL query?

It didn't occur to me to put the two together (not that I could have). Fortunately, it did occur to the very capable Holger Knublauch.


While I did not use SPIN directly during the intervening months, I did use it indirectly in setting up SPARQLMotion web services. To make a SPARQLMotion script accessible as a web service you first define a SPIN function and its incoming arguments, then point the function at a SPARQLMotion return module. Reapproaching SPIN at the end of September I found that the function side of it was already familiar, and that TopBraid Composer has evolved nicely to make working with SPIN a more natural experience.

SPIN now comes with a small library of useful functions in addition to the XPath (fn:) and ARQ (afn:) SPARQL functions that Jena supports as well as the SPARQLMotion functions (smf:). These functions serve the most common types of operations that you would encounter and provide the basic building blocks that you would need to build new functions in SPIN.

Consider the scenario that I found myself in recently where I wanted to convert only the first character of a string to uppercase. fn:upper-case was the closest thing available but it operates on the entire string. Not to worry, a few lines of SPARQL can handle it:

?resource rdfs:label ?label .
LET (?lcFirstChar := smf:regex(?label, "^(.).*$", "$1")) .
LET (?ucFirstChar := fn:upper-case(?lcFirstChar)) .
LET (?newLabel := smf:regex(?label, "^(.)", ?ucFirstChar)) .

This gets the job done but I didn't fancy copying and pasting the lines over and over again each time I need them in some new query. Here's where SPIN functions come in. With SPIN functions you can take useful fragments of SPARQL expressions from a WHERE clause and turn them into parameterized functions that you can simply use anywhere. The above can be defined in a new SPIN function as seen in this TopBraid screenshot:


Note that in the spin:body, ?arg1 is a special variable in SPIN that corresponds to the first function argument. We may now apply the function in a LET statement in any SPARQL block as per:

LET ( ?newLabel := myFunc:ucFirst( ?label ) )

It gets even better. The family of available SPARQL functions is useful up to a point, but shortly you may find that you need a little more horsepower. Enter JavaScript and SPINX for extension languages. JavaScript can be applied to write function bodies that operate on variables passed in from a SPARQL WHERE clause. Let's revisit ucFirst, but this time write its complement, lcFirst, for lowercasing:
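As a rough idea of the JavaScript involved, here is a minimal standalone sketch. It makes assumptions: the function and parameter names are mine, and how the code is attached to the SPIN function and how the argument reaches it are not shown, so treat it as the logic only and not as the exact function body used in TopBraid.

```javascript
// Lowercase only the first character and leave the rest of the string as is.
// (Standalone sketch; the name and parameter are illustrative.)
function lcFirst(text) {
  if (text.length === 0) return text; // nothing to change in an empty string
  return text.charAt(0).toLowerCase() + text.slice(1);
}

// The uppercase counterpart, shown for comparison with the SPARQL version.
function ucFirst(text) {
  if (text.length === 0) return text;
  return text.charAt(0).toUpperCase() + text.slice(1);
}

console.log(lcFirst("Hello World"));
console.log(ucFirst("hello world"));
```

Compared with the three LET statements of the pure SPARQL version, the string handling collapses into a single expression.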



On the SPARQL side the syntax is unchanged:

LET ( ?newLabel := myFunc:lcFirst( ?label ) )

In principle, any scripting language for which a JSR-223 scripting engine is available may be applied in SPIN functions. Note that I've been using the namespace "myFunc:" in these examples. The intention is that developers maintain the SPIN functions they develop in an ontology just for functions, creating a reusable library of functions that can be imported into new ontologies as needed.