

3 Implementation options for natural language semantic wildcards

For search engines with the pre-defined wildcards, effective implementations already exist. Data structures called Parallel Suffix Arrays [6, 7] offer a time-efficient way to serve queries of a much richer query language than the one defined above. In the case of natural language wildcards, the implementation depends on the definition of matching. If matching is based purely on comparing dependency graphs, we found it reasonable to represent these graphs in a Prolog database. For matching at the level of grammatical constituents, the Apache Lucene full-text search system can be used. We briefly show what problems we encountered with the different approaches and then describe experimental results. Figure 1 depicts the general approach.

Fig. 1. Lucene analyzer chain for indexing semantic data

3.1 Natural language semantic wildcards and Apache Lucene

Lucene is a Java-based open-source information retrieval software library. It provides indexing and full-text search functionality that can be built into various applications. With Lucene we can index any textual data and store it in a schema-less index. To encode the semantic information in the index, the easiest solution is to work around the problem and store the semantic information as additional Lucene tokens. Therefore we need to write our own Tokenizer or TokenFilter classes that generate these artificial tokens. The SemanticFilter calls the Apache OpenNLP parser for the received input sentences and splits its output into tokens. The user's search query is then interpreted as a composition of Lucene SpanQueries. Figure 2 depicts the Lucene query for the query *somebody* will feed the *dog*.1
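As an illustration of the indexing side, below is a minimal sketch of such a token-injecting TokenFilter. It is not the authors' actual SemanticFilter: the class name, the way the artificial tokens are stacked onto the position of the original word (position increment 0), and the semanticLabelsFor() helper standing in for the OpenNLP-based analysis are all assumptions.

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

// Sketch of a filter that emits extra "semantic" tokens after each word.
public final class SemanticFilterSketch extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private final Deque<String> pending = new ArrayDeque<>();
  private AttributeSource.State savedState;

  public SemanticFilterSketch(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!pending.isEmpty()) {
      // emit an artificial token at the same position as the original word
      restoreState(savedState);
      termAtt.setEmpty().append(pending.poll());
      posIncAtt.setPositionIncrement(0);
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    pending.addAll(semanticLabelsFor(termAtt.toString()));
    if (!pending.isEmpty()) {
      savedState = captureState();
    }
    return true; // pass the original word through unchanged
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending.clear();
    savedState = null;
  }

  // Placeholder: in the real system these labels come from the OpenNLP analysis.
  private List<String> semanticLabelsFor(String word) {
    return Collections.emptyList();
  }
}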

1 In Figure 2 the notations "NEAR" and "OR" are used for clarity: they correspond to the SpanNearQuery and SpanOrQuery queries, respectively. For SpanNearQuery the matching tokens must appear in the order of the subqueries; for SpanOrQuery it is enough if only one of the subquery tokens matches.

Fig. 2. The Lucene query tree of *somebody* will feed the *dog*
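The query tree of Figure 2 can be assembled from Lucene span queries roughly as follows. This is a hedged sketch against the classic org.apache.lucene.search.spans API (newer Lucene versions host these classes in the queries module and also offer SpanNearQuery.Builder); the field name "content" and the artificial tokens standing for the *somebody* and *dog* wildcards (sem_person, sem_animal, ...) are placeholders, since the actual token vocabulary is not listed here.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SemanticSpanQueryExample {

  static SpanQuery term(String text) {
    return new SpanTermQuery(new Term("content", text));
  }

  public static void main(String[] args) {
    // *somebody*: OR over the artificial tokens the analyzer chain would emit
    SpanQuery somebody = new SpanOrQuery(term("sem_person"), term("sem_somebody"));
    // *dog*: OR over the literal word and its semantic token
    SpanQuery dog = new SpanOrQuery(term("dog"), term("sem_animal"));

    // NEAR: the sub-spans must follow each other in the given order (slop 0, inOrder = true)
    SpanQuery query = new SpanNearQuery(
        new SpanQuery[] { somebody, term("will"), term("feed"), term("the"), dog },
        0, true);
    System.out.println(query);
  }
}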

3.2 Natural language wildcards and Prolog language

In this section, we present a solution which makes it possible to construct indices which support the dependency approach. When we introduced the concept of dependency-based matching, it was already mentioned that in this case, dependencies extracted from the processed texts can be represented by directed, labelled graphs.

The vertices of such a graph are the words (tokens) of a given sentence S, and the edges are labelled with elements of the set of dependency relation types Rel. Writing D(S) for the set of dependencies extracted from S:

(w_i, w_j, r) ∈ D(S) exactly when the graph contains an edge from w_i to w_j labelled by r.

In this case, the directed graph can be represented by a Prolog program consisting of the following facts:

rel(w_i1, w_j1, r_1).

rel(w_i2, w_j2, r_2).

...

rel(w_ik, w_jk, r_k).

With the Prolog representation, the pattern matching can be entrusted to the Prolog runtime environment, as we shall see later. Of course, the above Prolog rule set represents the dependencies of only one sentence of a single document (or of only one of its fields). Since the search engine has to index multiple documents, and a document (in a particular field) typically contains more than one sentence, we have to make sure that the dependencies of the different sentences do not get mixed up within a document. The efficiency of search in the Prolog runtime environment can be crucial, because in the end the stored Prolog knowledge base is queried with a goal clause of the form

:- rel(w_1, w_2, r_1), ..., rel(w_k, w_l, r_m).

corresponding to the query; evaluating this goal clause performs the comparison between the dependency graph of the query and those of the stored sentences. The first idea could be that the dependencies of all fields (and of each block within each field) of all documents are stored in one large Prolog database. For a given field field of a given document doc, the field's sentences can be represented by the following Prolog code2:

% 1. sentence
rel(doc, 1, r_1_1, w_1_1, w_1_2, field).
...
rel(doc, 1, r_1_k, w_1_p, w_1_q, field).

% 2. sentence
rel(doc, 2, r_2_1, w_2_1, w_2_2, field).
...
rel(doc, 2, r_2_k, w_2_p, w_2_q, field).

...

% n. (last) sentence
rel(doc, n, r_n_1, w_n_1, w_n_2, field).
...
rel(doc, n, r_n_k, w_n_p, w_n_q, field).

In this case, an appropriate goal clause for the query can be as follows3:

:- rel(Doc, S, r_1, w_1, w_2, Field), ..., rel(Doc, S, r_m, w_p, w_q, Field).

Here Doc, S and Field are free variables, and the r and w values are the labels and words of the dependency graph of the search sentence.

However, regarding the implementation of a Prolog database built in this way, efficiency issues must be taken into account, especially in the case of large Prolog databases. Depending on which Prolog system is used, the evaluation time of the program can be optimized in various ways. For example, if we require that the text contain no special characters, the words and the other values (e.g., the Rel elements) can be stored as Prolog atoms instead of strings. We have tested two Prolog systems: SICStus Prolog [8] and TuProlog [9]. The atom-based Prolog database was always slightly faster than the string representation, but the biggest difference between the two representations was only 0.1203 seconds (in the case of the Bible corpus4).

2 In the Prolog code, n is the number of sentences in the field field of the document doc.

3 The goal clause corresponds to the query only in the case when the search term is composed of a single natural language sentence; the more complex cases are not discussed here.

4 We made measurements on two text documents which are available free of charge: one is the Tractatus Logico-Philosophicus by Ludwig Wittgenstein, and the other one is the English Bible.


This is probably because both SICStus and TuProlog represent atoms and strings with similar efficiency internally. A greater speed-up was achieved by finding the right order of the terms of the clauses. The speed difference is imperceptible in the case of small datasets (such as the Tractatus), but term indexing has a great impact on performance when dealing with large corpora.

3.3 Term indexing: the optimal term order

In the previous Prolog example we presented a format with which the dependencies of a text can be represented by Prolog facts. All such facts were of the form rel(doc, sentence, rel, word1, word2, field). Prolog engines usually index the facts by their first term, in this case by doc. Thus, for a given document identifier, producing the list of matching clauses is very effective, while for the rest of the terms it is not. To find the optimal term order we made some measurements. At first sight it may be surprising how much more efficient some orders are than others, both when we restrict ourselves to querying only the facts of our Prolog representation and when the representation is supplemented with a few simple rules (see the next section for the rules).

If we are just querying the facts and there are no rules, the result is as shown in Figure 3.

Fig. 3. Effectiveness of Prolog queries for different term orders (using only facts; lower values are better)

On the graph, the storage strategy documents first means the order (doc, sentence, rel, word1, word2, field), while relations first means the order (rel, word1, word2, doc, sentence, field). The strategy relations first (strings) uses the same order, but with the other representation of the relations (string values instead of atoms); the difference is negligible. Words first, then documents means the order (word1, word2, doc, sentence, rel, field), and finally words first, then relations means the (word1, word2, rel, doc, sentence, field) ordering. The results shown in Figure 4 are the averages of 10 runs, for five different search terms. It is clear that if we work with facts only, the documents first strategy is the slowest of the possible permutations6, and the orders that put the words first are the fastest. The reason is that the document argument is a free variable in the goal clause of the query.
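To make the strategies concrete: the orders above differ only in the order in which the six arguments of each fact are written out when the Prolog source file is generated. A small illustrative sketch follows (the method names, the quoting and the example field value are assumptions, not the paper's actual generator):

// Hedged sketch: the same dependency serialized under two term-order strategies.
final class FactWriter {

  // documents first: (doc, sentence, rel, word1, word2, field)
  static String documentsFirst(String doc, int sent, String rel, String w1, String w2, String field) {
    return String.format("rel('%s',%d,%s,'%s','%s','%s').", doc, sent, rel, w1, w2, field);
  }

  // words first, then relations: (word1, word2, rel, doc, sentence, field)
  static String wordsFirst(String doc, int sent, String rel, String w1, String w2, String field) {
    return String.format("rel('%s','%s',%s,'%s',%d,'%s').", w1, w2, rel, doc, sent, field);
  }

  public static void main(String[] args) {
    System.out.println(documentsFirst("bible", 32, "conj", "mahalalel", "years", "text"));
    System.out.println(wordsFirst("bible", 32, "conj", "mahalalel", "years", "text"));
  }
}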

However, the picture changes significantly when inference rules are used in addition to the Prolog facts during the search.

Fig. 4. Effectiveness of Prolog queries for different term orders (using facts and rules)

It is clear that the searches are now much slower than those shown in Figure 3; this is due to the introduction of rules. Depending on what kind of inference rules we work with, the obtained running times can differ from the above results. In these measurements we used two simple rules, dobj and prep, as we will see later.

It is also shown that the strategies that were effective when only facts are used for pattern matching become much slower when rules are also involved. So, if we want to store the dependencies derived from the text in a single Prolog knowledge base and inference rules are used, then it is worth using the documents first order. Of course, all of this matters only if the Prolog runtime environment in use supports term indexing.

6 The number of possible permutations is 6! / 2! = 360, or 5! / 2! = 60 if fields are omitted. However, for most of these permutations we received very similar results; we present only the most characteristic ones here.

Based on the measurements, we can be satisfied with the effectiveness of SICStus Prolog. However, these results are obtained after a compilation step. Compilation is a computationally rather intensive operation: for the Bible corpus it takes an average of 86 seconds (keep in mind that the Prolog representation in this case is a 24-megabyte source file). Once that is done, however, we can run fast and efficient queries on the index. The speed of a query, of course, depends not only on the term order of the facts, but also on the order of the terms (the edges of the dependency graph to be matched) in the goal clause of the query.

Evaluation of the goal clause proceeds sequentially over its terms, so if the first term of the goal clause is too general (that is, it matches a large number of facts and rule heads), the surplus computation is carried over to the matching of the rest of the terms and thus of the complete query. However, if we are lucky, the first term of the goal clause corresponding to the query is one such as

:-rel(DOC,S,'mahalalel','years',conj), rel(DOC,S,'day','that',det).

This is a favorable case, because 'mahalalel' and 'years' certainly occur together far less frequently in the English Bible than 'that day'. Therefore, in addition to term indexing (or in its absence), a further possibility is to increase efficiency with some metadata: for example, by automatically reordering the goal clauses (we know that in our case this does not change the result of the search), or by specifying our own preprocessing algorithms which, as a first step, filter the list of applicable Prolog rules by various domain-specific meta-information.
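A minimal sketch of such an automatic reordering is shown below, assuming a word-frequency map obtained from an inverted index like the one described in the next paragraph. The Subgoal type, the selectivity estimate and the example frequencies are illustrative only; the subgoals follow the five-argument rel form of the goal clause above.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hedged sketch: put the most selective (rarest) subgoals first in the goal clause.
final class GoalReordering {

  // One rel/5 subgoal of the query: rel(Doc, S, Word1, Word2, RelType).
  record Subgoal(String word1, String word2, String relType) {
    String toProlog() {
      return "rel(Doc,S,'" + word1 + "','" + word2 + "'," + relType + ")";
    }
  }

  // Rough selectivity estimate: the rarer word of the pair dominates.
  static long estimate(Subgoal g, Map<String, Long> wordFrequency) {
    return Math.min(wordFrequency.getOrDefault(g.word1(), 0L),
                    wordFrequency.getOrDefault(g.word2(), 0L));
  }

  static String reorderedGoal(List<Subgoal> subgoals, Map<String, Long> wordFrequency) {
    List<Subgoal> ordered = new ArrayList<>(subgoals);
    ordered.sort(Comparator.comparingLong(g -> estimate(g, wordFrequency)));
    StringJoiner goal = new StringJoiner(", ", ":- ", ".");
    for (Subgoal g : ordered) goal.add(g.toProlog());
    return goal.toString();
  }

  public static void main(String[] args) {
    Map<String, Long> freq = Map.of("that", 5000L, "day", 1200L, "mahalalel", 7L, "years", 900L);
    List<Subgoal> query = List.of(
        new Subgoal("day", "that", "det"),
        new Subgoal("mahalalel", "years", "conj"));
    // Prints the 'mahalalel' term first, because it contains the rarest word.
    System.out.println(reorderedGoal(query, freq));
  }
}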

As a kind of metadata-based filter, we implemented a simple in-memory inverted index in the Java programming language to store the Prolog representation. For seamless integration with the Java platform we chose TuProlog, a Prolog engine implemented in pure Java. The advantage is that it is available for free, and our search engine can be used without installing any Prolog runtime environment. Java integration was also necessary because the NLP tools we use are Java-based.
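A hedged sketch of the TuProlog side of this pipeline is given below: the (possibly pre-filtered) rel facts are loaded as a Theory and the goal clause is evaluated through the alice.tuprolog API. The concrete facts, the helper rule and the sentence numbers are made up for the example; only the API calls (Prolog, Theory, solve, solveNext) reflect the actual library.

import alice.tuprolog.Prolog;
import alice.tuprolog.SolveInfo;
import alice.tuprolog.Theory;

public class PrologSearchDemo {
  public static void main(String[] args) throws Exception {
    Prolog engine = new Prolog();
    // In the real system this string would be assembled from the entries of the
    // in-memory inverted index that survive the metadata-based pre-filtering.
    String theory =
        "rel(bible, 32, 'mahalalel', 'years', conj).\n" +
        "rel(bible, 32, 'day', 'that', det).\n" +
        "rel(bible, 57, 'day', 'that', det).\n" +
        // a made-up helper rule, only to show that facts and rules can be mixed
        "same_sentence(D, S, W1, W2) :- rel(D, S, W1, _, _), rel(D, S, W2, _, _).\n";
    engine.setTheory(new Theory(theory));

    SolveInfo info = engine.solve(
        "rel(Doc, S, 'mahalalel', 'years', conj), rel(Doc, S, 'day', 'that', det).");
    while (info.isSuccess()) {
      System.out.println(info.getSolution());   // the matching document and sentence
      if (!info.hasOpenAlternatives()) break;
      info = engine.solveNext();
    }
  }
}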