Thursday, November 19, 2009

A Bad, Bad SPARQL Pattern & A Good smf:trace

I should have known better...

Whats wrong with the following:
  SELECT ?a ?b ?c ?d
WHERE {
?subject :child ?a,?b,?c,?d .
?a a :A .
?b a :B .
?c a :C .
?d a :D .

}
it is equivalent to:
  SELECT ?a ?b ?c ?d
WHERE {
?subject :child ?a .
?subject :child ?b .
?subject :child ?c .
?subject :child ?d .
?a a :A .
?b a :B .
?c a :C .
?d a :D .

}
the answer is "performance". The evaluation may run forever as the SPARQL engine attempts to try every permutation (random each time it seems) of ?a ?b ?c ?d obtained in the :child statements, that will match the type restrictions. The fix:
  SELECT ?a ?b ?c ?d
WHERE {
?subject :child ?a .
?a a :A .
?subject :child ?b .
?b a :B .
?subject :child ?c .
?c a :C .
?subject :child ?d .
?d a :D .

}
Perhaps less elegant by some measures but expressed this way the immediate type test of each variable reduces the remaining list of properties that need be checked in the pattern. This is the "check as you go" vs the initial "grab first and check later" approach which will cause your computer to overheat and shut off before evaluation ever completes (assuming at least a hundred instances).

Just how bad is it? Painfully bad. Using the SPARQL Profiler feature in TBC 3.2 I set up a comparison test of the two patterns queried over a single instance with 7 :child properties. In the original "grab first check later" pattern the profiler reported 91,760,824 finds over the same number of triples. In the "check as you go" pattern the profiler reported only 76 finds over the same number of triples required. How many orders of magnitude difference is that? In my real world problem I had several hundred instances and and a dozen :child nodes.

I ran into this while mapping an XML file imported into TopBraid Composer (TBC) which is automatically converted into RDF under the "Semantic XML" model. With the RDF representation you can then map the SXML representation into your target model. Previously I had been through this exercise with SPARQLMotion support, but this time around I wanted to try it with SPIN following the Holger Knublauch's "Ontology Mapping with SPIN Templates" blog entry. Really neat stuff once I got the SPARQL patterns right.

Along the way I also discovered smf:trace while trying to figure out where the performance bottleneck was. Initially I had thought it was the SPIN functions that I was using, but spin:trace allowed me to quickly realize that the functions were quite performant and the the culprit at all. smf:trace works like smf:buildURI and smf:buildString but echos your string into the error log. Use smf:trace with a dummy assignment in a LET statement within a WHERE clause of a SPARQL expression as per:

LET ( ?foo := smf:trace( "myFunction: {?result}" ) )

So moral of the story: how you would write N3 triples nice and concisely in an ontology body is not always the best way to express the same statements in a SPARQL query body.

Saturday, November 07, 2009

New TBC Feature - Group by Namespace

One of the greatest strengths of TopBraid Composer (TBC) that sets it apart from other ontology tools, is its ability to easily work with and manage collections of ontologies. When it is so easy to work with independent ontologies you'll find yourself going modular quite naturally. And why not? Modularity is a fundamental programming organization principle that you've used for years, decades even, that makes reuse possible. Who writes monolithic programs anymore? The same goes with ontology composition.

So when the capability is at your fingertips again you'll find yourself using it. The unintended consequence however will be a proliferation of namespaces. Typically, but not necessarily, at the level of one namespace per ontology file. As you get even more advanced in ontology asset management, you'll begin to split content, under a single namespace, across several files. Remember, namspaces and baseURIs are not one in the same.


With lots of namespaces in play searching for a particular class or property takes increasing effort. Either from longer lists to scroll through or from longer lists of search results to sort through when a search string shows up as a pattern in more and more names over your inventory. Here is where the new feature "Group by Namespace" comes to the rescue.


In the lower left corner of the TBC "Classes" and "Properties" views, on the left side of the search input area, you will find now an icon of three stacked items
( and ). By default the classes and properties are listed in alphabetical order, with hierarchy for the subclass and subproperty trees. Click this icon and the organization will change to where a list of namespace prefixes are shown at the root level. Each namespace prefix acts now like a folder that contains a flat, but still alphabetic, list of the properties or classes defined under that namespace.

This may not seem like a such a great convenience at first, but when you work with hundreds or even thousands of items under tens and tens of namespaces the feature's utility quickly comes to light.