Solr

SOLR

Contributor(s): Nathaniel Grove

For SOLR configuration documentation, see the SOLR section of the server administration & support guidelines.

Indexing THL Texts in SOLR

Indexing THL digital texts in SOLR is a two-part process. First, the texts to be indexed are digested into a SOLR add-doc, an XML document that follows the SOLR schema for that index. The schema defines the fields that SOLR will accept, how to index them, whether to store the data in the index in a retrievable form, and so forth. Once the original texts have been converted into valid SOLR add-docs, these are ingested into the SOLR index. If one ingests a SOLR add-doc for a text that is already in the index, SOLR updates the information for that text based on the new add-doc.

This process is the same for all SOLR indices. However, the means for achieving these two steps may be different.
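As a rough illustration of what an add-doc looks like, the sketch below writes a minimal one to standard output. The `<add>`/`<doc>`/`<field>` structure is SOLR's standard XML update format, but the function name `make_add_doc` and the field names (id, coll, ed, title) are hypothetical; the real fields are whatever the schema for the index defines, and the values are not XML-escaped here.

```shell
#!/bin/sh
# Sketch: write a minimal SOLR add-doc for one text to standard output.
# The field names below are assumptions, not the actual THL schema.
make_add_doc() {
    coll=$1 ed=$2 num=$3 title=$4
    cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
    <field name="id">${coll}-${ed}-${num}</field>
    <field name="coll">${coll}</field>
    <field name="ed">${ed}</field>
    <field name="title">${title}</field>
  </doc>
</add>
EOF
}

make_add_doc kt d 0049 "Example title"
```

If the id field were the index's unique key, posting a second add-doc with the same id would replace the earlier entry, which matches the update behavior described above.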

Digital Texts in the THL Catalogs

To index digital texts in the THL catalogs, I have written several shell scripts that can be run on the site if you have permission to log onto the various servers for development, staging, and production. Each server has its own version of SOLR, so the scripts must be run on each server. In the following instructions I use the Dev site as the basis, but the same instructions apply to Staging and Production apart from the location of the solrdocs folder.

Things to note:

  1. In general, to run a shell script in Linux, you prepend "./" to the name of the script.
  2. In commands, items in curly braces {} describe the value to put there, e.g. {text number} means put the number of the text here.
    1. {coll} means the collection, use its sigla, e.g. kt, ngb, dkcw, etc.
    2. {ed} means edition sigla. For collections without multiple editions, use the word "main".
    3. Text numbers are always 4-digit, pad with zeros if necessary, e.g., 0039
  3. In many cases, scripts that take parameters (collection and edition siglas and so forth) display a "usage" message describing how the script is supposed to be called when they are run without parameters or with the parameter "help".
  4. The text catalog has only a single index so that searches may be performed across collections and editions. Therefore, when optimizing, one does not need to specify the index.
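The 4-digit padding convention can be sketched as a small helper. The function names `pad_text_number` and `text_basename` are hypothetical and not part of the THL scripts; the `{coll}-{ed}-{num}` pattern is inferred from file names such as kt-d-0049-text.xml.

```shell
#!/bin/sh
# Sketch: normalise a text number to the 4-digit form the scripts expect.
pad_text_number() {
    # Strip leading zeros first so printf does not read "0039" as octal.
    n=$(printf '%s' "$1" | sed 's/^0*//')
    [ -n "$n" ] || n=0
    printf '%04d\n' "$n"
}

# Build the base name for a text from collection, edition, and number.
text_basename() {
    printf '%s-%s-%s\n' "$1" "$2" "$(pad_text_number "$3")"
}

pad_text_number 39        # prints 0039
text_basename kt d 49     # prints kt-d-0049
```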

Location of SOLR Indexing Scripts and Running Them

As of October 2012, the shell scripts for indexing THL Catalogs are found in the following locations:

  • Development: /usr/local/projects/thlib-texts/solr-home/solrdocs
  • Staging: /usr/local/projects/thdl-texts/current/solr-home/solrdocs
  • Production: /usr/local/projects/thdl-texts/current/solr-home/solrdocs

The following sections describe the scripts to be called for certain actions and the parameters they require. In all cases, the following steps must be done first:

  1. Open SecureCRT
  2. Log on to the server with your credentials and go to the appropriate folder by typing:
    cd /usr/local/projects/thlib-texts/solr-home/solrdocs
    for development or the other paths on the other servers.
  3. Run the script in question by typing:
    ./{scriptname}.sh {parameters}
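For those who move between servers often, the folder lookup can be sketched as a small function. The name `solrdocs_dir` is hypothetical; the paths are the ones listed above as of October 2012 and may have changed since.

```shell
#!/bin/sh
# Sketch: map a server name to its solrdocs folder (paths as of October 2012).
solrdocs_dir() {
    case $1 in
        dev|development)
            echo /usr/local/projects/thlib-texts/solr-home/solrdocs ;;
        staging|production)
            echo /usr/local/projects/thdl-texts/current/solr-home/solrdocs ;;
        *)
            echo "unknown server: $1" >&2
            return 1 ;;
    esac
}

# Usage, once logged on to the development server:
#   cd "$(solrdocs_dir dev)" && ./createOneTextAdd.sh help
solrdocs_dir staging
```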

Indexing All the Texts

Dedicated scripts index all the texts in a catalog with a single command. These are as follows:

Dege Kangyur

Run the following script:

./createKTTextAdds.sh

Other collections

To index all texts in other collections, use the createAllTextAdds.sh script which is called in the following manner:

./createAllTextAdds.sh {coll} {ed (optional)} {startnum (optional)}

For example,

./createAllTextAdds.sh dkcw

or

./createAllTextAdds.sh dkcw main 0100

In the first case, the edition sigla is not needed for indexing all the texts because the collection has only one main edition, but in the second case "main" must be specified as a placeholder so that the starting text number is not confused with the edition sigla.
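The reason for the placeholder becomes clearer with a sketch of how positional parameters are typically read in a shell script. This is a hypothetical illustration, not the source of createAllTextAdds.sh; the function name and the default starting number are assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of positional-parameter handling; the real
# createAllTextAdds.sh may differ. The defaults here are assumptions.
parse_all_text_args() {
    coll=$1
    ed=${2:-main}        # edition sigla, "main" when omitted
    startnum=${3:-0001}  # first text number to index (assumed default)
    echo "coll=$coll ed=$ed startnum=$startnum"
}

parse_all_text_args dkcw            # coll=dkcw ed=main startnum=0001
parse_all_text_args dkcw main 0100  # coll=dkcw ed=main startnum=0100
parse_all_text_args dkcw 0100       # coll=dkcw ed=0100 startnum=0001
```

In the third call the starting number lands in the edition slot, which is the confusion that specifying "main" avoids.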

Indexing One Text in a THL Catalog

To index just one text in a THL catalog, run the "createOneTextAdd.sh" script by typing:

./createOneTextAdd.sh {coll} {ed (optional)} {text number}

For example,

./createOneTextAdd.sh kt d 0049

where "kt" stands for Kangyur ("ngb" and "dkcw" are other collections), "d" stands for "sde dge", and "0049" is the text number. Note: you should put the 4-digit version of the number with leading zeros. The resulting output should look something like this:

[sds-deployer@dev solrdocs]$ ./createOneTextAdd.sh kt d 0049
********** Processing kt-d-0049-text.xml! **********
********** Adding to SOLR **********
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8080/thlib-solr-multicore/thlib-texts/update..
SimplePostTool: POSTing file kt-d-0049-text-add.xml
SimplePostTool: COMMITting Solr index changes..

What this shows is that the XML file of the text was processed through an XSLT stylesheet into a SOLR add-doc, and that add-doc was then ingested into SOLR. The WARNING is a standard one, but any errors that display should be reported.
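Spotting errors in longer output can be sketched as a filter that ignores the standard UTF-8 warning and flags anything else that looks like a problem. The function name `check_index_log` is hypothetical, and the search patterns are a guess at what a failure looks like rather than a list taken from SimplePostTool.

```shell
#!/bin/sh
# Sketch: scan indexing output on standard input for problems.
# Prints any suspicious lines and returns 1 if it finds one.
check_index_log() {
    # Drop the standard UTF-8 warning, then look for error-like lines.
    if grep -v 'WARNING: Make sure your XML documents are encoded in UTF-8' |
        grep -i -E 'error|fatal|exception|warning'; then
        return 1
    fi
    return 0
}

# Demonstration on two lines of the sample output above.
printf '%s\n' \
    'SimplePostTool: version 1.2' \
    'SimplePostTool: COMMITting Solr index changes..' |
    check_index_log && echo "no errors found"
```

With a real run, the script's output could be piped through the filter, for example `./createOneTextAdd.sh kt d 0049 2>&1 | check_index_log`.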

Optimizing an index

Optimizing is a SOLR routine that compacts the index, making it faster and more efficient. The scripts to add all texts of a collection automatically include commands to optimize the index when the adding is done, but when adding individual texts or groups of texts, this must be done by hand. So, when one is done indexing whatever individual documents need indexing, one can optimize the index by typing the following command:

./optimize.sh

which will result in

[sds-deployer@dev solrdocs]$ ./optimize.sh 
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8080/thlib-solr-multicore/thlib-texts/update..
SimplePostTool: POSTing file optimize.xml
SimplePostTool: COMMITting Solr index changes..
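When several individual texts need indexing, one way to avoid forgetting the final optimize is a small wrapper, sketched below. The names `run` and `index_texts_then_optimize` and the DRY_RUN switch are hypothetical; the wrapper only calls the two scripts described above and assumes it is run from the solrdocs folder.

```shell
#!/bin/sh
# Sketch: index several texts of one edition, then optimize once at the end.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

index_texts_then_optimize() {
    coll=$1 ed=$2
    shift 2
    for num in "$@"; do
        # Stop at the first failure so the error is not buried.
        run ./createOneTextAdd.sh "$coll" "$ed" "$num" || return 1
    done
    run ./optimize.sh
}

# Dry run: prints the three commands without touching the index.
DRY_RUN=1
index_texts_then_optimize kt d 0049 0050
```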

Indexing One Bibliographic Record

The command for indexing a single bibliographic record in a catalog is:

./indexOneBibl.sh {coll} {edition} {folder} {text number}

For example:

./indexOneBibl.sh kt d 0 0039

The result should look something like:

[sds-deployer@dev solrdocs]$ ./indexOneBibl.sh kt d 0 0039
Indexing one bibl: kt/d/0/kt-d-0039-bib.xml
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8080/thlib-solr-multicore/thlib-texts/update..
SimplePostTool: POSTing file kt-d-0039-add.xml
SimplePostTool: COMMITting Solr index changes..
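The relationship between the four parameters and the record being indexed can be sketched as follows. The function name `bibl_path` is hypothetical, and the path pattern is inferred only from the "Indexing one bibl" line in the sample output, not from the script source.

```shell
#!/bin/sh
# Sketch: rebuild the record path shown in the sample output
# ("kt/d/0/kt-d-0039-bib.xml") from the four parameters.
bibl_path() {
    coll=$1 ed=$2 folder=$3 num=$4
    printf '%s/%s/%s/%s-%s-%s-bib.xml\n' \
        "$coll" "$ed" "$folder" "$coll" "$ed" "$num"
}

bibl_path kt d 0 0039   # prints kt/d/0/kt-d-0039-bib.xml
```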

Provided for unrestricted use by the Tibetan and Himalayan Library