
Lucene Index

Indexing

To implement an index search, a few extra steps are needed: the site must be indexed and the results placed in folders on the file system.

Setup

Place the following search node into the web section of the web.config:

(code)

Create the following folders in the parent folder of the site root, e.g. if the root is c:\mysite\wwwroot then the folders should be created in c:\mysite.  The site indexer will place the indexed data in these folders.

Administrator note: The EonicWeb administrative Windows user account needs write permission on the Read and Write directories.

  • Index
  • Index/Read
  • Index/Write
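The folder layout above can also be created with a short script. The following is a minimal sketch in Python, not part of EonicWeb; the function name create_index_folders is hypothetical, and the script does not set the Windows permissions mentioned in the administrator note, which still need to be granted separately.

```python
from pathlib import Path


def create_index_folders(site_root: str) -> list[Path]:
    """Create Index, Index/Read and Index/Write beside (not inside) the site root."""
    # For a root of c:\mysite\wwwroot the parent is c:\mysite.
    parent = Path(site_root).resolve().parent
    folders = [
        parent / "Index",
        parent / "Index" / "Read",
        parent / "Index" / "Write",
    ]
    for folder in folders:
        # exist_ok=True makes the script safe to run more than once.
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```

Calling create_index_folders(r"c:\mysite\wwwroot") on the web server would create the three folders under c:\mysite if they do not already exist.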

 

Running the index

In order to search an index, you need to create an index to search! 

  • Log on to the admin side of the website you are working on.
  • Go to Content Area > Web Settings > Index Site.
  • The process may take several minutes, depending on the size of the site and the volume of content.
  • If you receive an EonicWeb error stating that access to a file is denied, you may need to check out the Index folder in Visual Studio/Source Safe.

 

Scheduling an index

Running an index can be automated according to a schedule.  As running an index is a resource-intensive process, it should be run at most once a day.

Talk to the Production or Development team at Eonic about setting up a schedule for indexing.

Customising an index

Overriding the location of the index folders

To override the default index folder location, add the following settings to the web section of the web.config:

(code)

Setting the types of content that are indexed

To set which types of content are indexed, add the following comma-separated list of content types to the web section of the web.config:

(code)

Indexing related content (not recommended)

By default, if content on a page has related content, that related content is NOT indexed against the parent content.  To override this, add the following setting to the web section of the web.config:

(code)

This is not recommended because it adds to the resources needed for indexing and is thought to reduce the quality and accuracy of the search results.

Creating bespoke indexing XSL files

Copy /ewcommon/xsl/Search/CleanPage.xsl to /xsl/Search.

Determining what is indexed

The CleanPage.xsl file transforms the content into the following format which is read by the indexer (comments added for explanation):

(code)

By default, content in the body element is indexed.

Meta tags are used to store information against a record (i.e. stored but not searchable), or to add further information that needs to be indexed.

As noted, pgid, name, artid, contenttype and abstract are required for an item to be processed.  In addition, name must be tokenized.

The abstract meta tag is returned in the search results, so you can use it to display a brief description of the content item.  The abstract can also be returned as XML, rather than as the plain text stored in the content attribute.  To do this, omit the content attribute and add the XML to be returned as a child of the meta node.

Making the text within a meta tag searchable (i.e. tokenizing)

Tokenized items are items where search terms within text can be found.

Untokenized items are items that are matched only by a search for the whole value or by a wildcard query.

e.g.

  • If the string “something else” is tokenized then it will be returned for a search on “else”
  • If the string “something else” is not tokenized, then it will only be returned for searches on “something else” or “some*” (or similar wildcard searches)
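The two cases above can be modelled with a few lines of Python. This is a simplified sketch for illustration only, not how Lucene or EonicWeb matches internally: the function matches is hypothetical, the whole query is treated as a single term, and analysis steps such as case folding are ignored.

```python
from fnmatch import fnmatchcase


def matches(value: str, query: str, tokenized: bool) -> bool:
    """Simplified model of matching one query term against one field value."""
    # Tokenized: the value is split into words, and each word is a term.
    # Untokenized: the whole value is kept as one single term.
    terms = value.split() if tokenized else [value]
    # fnmatchcase gives case-sensitive matching with * as a wildcard.
    return any(fnmatchcase(term, query) for term in terms)


# The examples from the list above:
print(matches("something else", "else", tokenized=True))             # True
print(matches("something else", "else", tokenized=False))            # False
print(matches("something else", "something else", tokenized=False))  # True
print(matches("something else", "some*", tokenized=False))           # True
```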

To tokenize a meta tag use the attribute tokenize="true".

Making the meta tag sortable

To make a meta tag sortable in search results add the attribute sortable="true".

Indexing numbers and dates

To index numbers or dates, use the type attribute with a value of "float" or "date" respectively.
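The source does not say why the type matters. One likely reason, stated here as an assumption, is that values treated as plain text sort character by character rather than by magnitude. The Python sketch below only illustrates that general difference; the function names and sample values are hypothetical, and it says nothing about how the indexer stores typed values.

```python
def sort_as_text(values: list[str]) -> list[str]:
    """Order values the way plain strings compare: character by character."""
    return sorted(values)


def sort_as_float(values: list[str]) -> list[str]:
    """Order the same values by their numeric magnitude."""
    return sorted(values, key=float)


prices = ["9.5", "10", "100", "25.25"]
print(sort_as_text(prices))   # ['10', '100', '25.25', '9.5']
print(sort_as_float(prices))  # ['9.5', '10', '25.25', '100']
```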

Excluding items from the index

Use a ROBOTS meta tag to exclude an item from the index and/or exclude the child pages and content from the index, e.g.

(code)

Optimising the index

Please consider what you index.   The following is an example of a bad tag for indexing:

(code)

In the example above, everything is indexed, including the tags and their attributes.  So a search for “class”, “picture” etc. will return this item, even though those words have nothing to do with its content.

Not only does this give poor-quality search results, it also stores unnecessary data in the index, which makes the index larger, slower to create and slower to search.

The example above can be optimised to the following.

(code)

If you really need to store tags, then store them in the abstract meta tag as XML.
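To see the effect of indexing markup rather than text, the difference can be reproduced outside the indexer. The sketch below uses Python's standard html.parser module to keep only the text between tags. In EonicWeb the equivalent work is done in CleanPage.xsl, so this is an illustration only; the class, function and sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects the text between tags and ignores tags and attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def visible_text(markup: str) -> str:
    """Return the markup's text content with whitespace collapsed."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Hypothetical markup: indexed raw, it would match "class" and "picture".
raw = '<div class="picture"><img src="a.jpg" alt="x"/><p>Blue widget</p></div>'
print(visible_text(raw))  # Blue widget
```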

Troubleshooting

If you are having problems with indexing, then you can do the following:

Turn on site search debugging by adding the following setting to the web section of the web.config:

(code)

This will output the HTML produced by CleanPage.xsl to the /Index/IndexedSite folder. This should allow you to observe what has been sent to the indexer for processing.
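When the debug folder contains many files, it can help to look at the most recent output first. The following Python sketch lists the newest files in a folder. The function name is hypothetical, the folder path must be supplied by you, and nothing is assumed about the names or extensions of the files the indexer writes.

```python
from pathlib import Path


def newest_debug_files(indexed_site_dir: str, limit: int = 5) -> list[Path]:
    """Return up to `limit` files from the folder, most recently modified first."""
    files = [p for p in Path(indexed_site_dir).iterdir() if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]
```

For a site rooted at c:\mysite\wwwroot with the default folder layout, the folder to pass in would presumably be c:\mysite\Index\IndexedSite.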

Use a tool to read the index, such as Luke

Luke is a Java application that reads indexes, allowing you to check what is actually in the index, as well as testing out searches.  It can be downloaded from http://code.google.com/p/luke/ .