Thursday, November 19, 2009

A Bad, Bad SPARQL Pattern & A Good smf:trace

I should have known better...

Whats wrong with the following:
  SELECT ?a ?b ?c ?d
WHERE {
?subject :child ?a,?b,?c,?d .
?a a :A .
?b a :B .
?c a :C .
?d a :D .

}
it is equivalent to:
  SELECT ?a ?b ?c ?d
WHERE {
?subject :child ?a .
?subject :child ?b .
?subject :child ?c .
?subject :child ?d .
?a a :A .
?b a :B .
?c a :C .
?d a :D .

}
the answer is "performance". The evaluation may run forever as the SPARQL engine attempts to try every permutation (random each time it seems) of ?a ?b ?c ?d obtained in the :child statements, that will match the type restrictions. The fix:
  SELECT ?a ?b ?c ?d
WHERE {
?subject :child ?a .
?a a :A .
?subject :child ?b .
?b a :B .
?subject :child ?c .
?c a :C .
?subject :child ?d .
?d a :D .

}
Perhaps less elegant by some measures but expressed this way the immediate type test of each variable reduces the remaining list of properties that need be checked in the pattern. This is the "check as you go" vs the initial "grab first and check later" approach which will cause your computer to overheat and shut off before evaluation ever completes (assuming at least a hundred instances).

Just how bad is it? Painfully bad. Using the SPARQL Profiler feature in TBC 3.2 I set up a comparison test of the two patterns queried over a single instance with 7 :child properties. In the original "grab first check later" pattern the profiler reported 91,760,824 finds over the same number of triples. In the "check as you go" pattern the profiler reported only 76 finds over the same number of triples required. How many orders of magnitude difference is that? In my real world problem I had several hundred instances and and a dozen :child nodes.

I ran into this while mapping an XML file imported into TopBraid Composer (TBC) which is automatically converted into RDF under the "Semantic XML" model. With the RDF representation you can then map the SXML representation into your target model. Previously I had been through this exercise with SPARQLMotion support, but this time around I wanted to try it with SPIN following the Holger Knublauch's "Ontology Mapping with SPIN Templates" blog entry. Really neat stuff once I got the SPARQL patterns right.

Along the way I also discovered smf:trace while trying to figure out where the performance bottleneck was. Initially I had thought it was the SPIN functions that I was using, but spin:trace allowed me to quickly realize that the functions were quite performant and the the culprit at all. smf:trace works like smf:buildURI and smf:buildString but echos your string into the error log. Use smf:trace with a dummy assignment in a LET statement within a WHERE clause of a SPARQL expression as per:

LET ( ?foo := smf:trace( "myFunction: {?result}" ) )

So moral of the story: how you would write N3 triples nice and concisely in an ontology body is not always the best way to express the same statements in a SPARQL query body.

No comments: