Contributor(s): Nathaniel Grove
For SOLR configuration documentation, see server administration & support guidelines#solr
Indexing THL digital texts in SOLR is a two-step process. First, the texts to be indexed are digested into SOLR add-docs, which are XML documents that follow the SOLR schema for that index. The schema defines the fields that SOLR will accept and how to index them, including whether to store the data in the index in a retrievable form. Second, once the original texts have been converted into valid SOLR add-docs, these are ingested into the SOLR index. If one ingests a SOLR add-doc for a text that is already in the index, SOLR updates the information for that text based on the new add-doc.
This process is the same for all SOLR indices. However, the means for achieving these two steps may differ from index to index.
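To make the two steps concrete, here is a minimal sketch rather than the THL scripts themselves. The <add>, <doc> and <field> elements are the standard SOLR XML update format, but the field names "id" and "title" are assumptions; the real fields are whatever the schema for the index defines. The helper names make_add_doc and post_add_doc are hypothetical.

```shell
#!/bin/sh
# Sketch only: the field names ("id", "title") are assumptions; the real
# fields are whatever the schema of the target index defines.

# Step 1: print a minimal SOLR add-doc for one text to standard output.
# Values are inserted as-is, so they must already be XML-safe.
make_add_doc() {
  doc_id=$1
  title=$2
  cat <<EOF
<add>
  <doc>
    <field name="id">${doc_id}</field>
    <field name="title">${title}</field>
  </doc>
</add>
EOF
}

# Step 2: ingest an add-doc file, then commit so the change is searchable.
# Only defined here, not called: it needs a running SOLR at the update URL.
post_add_doc() {
  file=$1
  url=$2
  curl -s "$url" -H 'Content-Type: text/xml; charset=utf-8' --data-binary @"$file" &&
    curl -s "$url" -H 'Content-Type: text/xml; charset=utf-8' --data-binary '<commit/>'
}
```

A possible use, assuming the update URL that appears in the script output further down: make_add_doc kt-d-0049 "Example title" > example-add.xml, followed by post_add_doc example-add.xml http://localhost:8080/thlib-solr-multicore/thlib-texts/update. Posting a second add-doc with the same unique key replaces the earlier entry.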
To index digital texts in the THL catalogs, I have written several shell scripts that can be run on the site if you have permission to log onto the various servers for development, staging, and production. Each server has its own version of SOLR, so the scripts must be run on each server. The following instructions use the Dev site as the basis, but the same instructions apply to Staging and Production, apart from the location of the solrdocs folder.
Things to note:
As of October 2012, the shell scripts for indexing THL Catalogs are found in the following locations:
The following sections describe the scripts to be called for certain actions and the parameters they require. In all cases, the following steps must be done first:
cd /usr/local/projects/thlib-texts/solr-home/solrdocs
./{scriptname}.sh {parameters}
Specific scripts have been written to index all the texts in a catalog with a single command. These are as follows:
Run the following script:
./createKTTextAdds.sh
To index all texts in other collections, use the createAllTextAdds.sh script which is called in the following manner:
./createAllTextAdds.sh {coll} {ed (optional)} {startnum (optional)}
For example,
./createAllTextAdds.sh dkcw
or
./createAllTextAdds.sh dkcw main 0100
In the first case, the edition sigla is not needed for indexing all the texts because the collection has only one main edition, but in the second case "main" must be specified as a placeholder so that the starting text number is not confused with the edition sigla.
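The placeholder rule can be sketched as a small helper that builds the command line. The function all_texts_cmd is hypothetical and not one of the THL scripts; it only prints the command it would run, inserting "main" when a starting number is given without an edition.

```shell
#!/bin/sh
# Hypothetical helper, not part of the THL scripts: builds the
# createAllTextAdds.sh command line from a collection, an optional
# edition sigla, and an optional starting text number.
all_texts_cmd() {
  coll=$1
  ed=$2
  start=$3
  # A start number without an edition needs the "main" placeholder,
  # otherwise the number would land in the edition position.
  if [ -n "$start" ] && [ -z "$ed" ]; then
    ed=main
  fi
  # Unquoted on purpose: empty optional arguments simply vanish.
  set -- ./createAllTextAdds.sh "$coll" $ed $start
  echo "$*"
}
```

For instance, all_texts_cmd dkcw "" 0100 prints ./createAllTextAdds.sh dkcw main 0100, while all_texts_cmd dkcw prints ./createAllTextAdds.sh dkcw.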
To index just one text in a THL catalog, run the "createOneTextAdd.sh" script by typing:
./createOneTextAdd.sh {coll} {ed (optional)} {text number}
For example,
./createOneTextAdd.sh kt d 0049
where "kt" stands for Kangyur ("ngb" and "dkcw" are the other collections), "d" stands for "sde dge", and "0049" is the text number. *Note:* give the text number in its 4-digit form, with leading zeros. The resulting output should look something like this:
[sds-deployer@dev solrdocs]$ ./createOneTextAdd.sh kt d 0049
********** Processing kt-d-0049-text.xml! **********
********** Adding to SOLR **********
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8080/thlib-solr-multicore/thlib-texts/update..
SimplePostTool: POSTing file kt-d-0049-text-add.xml
SimplePostTool: COMMITting Solr index changes..
This shows that the XML file of the text was processed through an XSLT stylesheet into a SOLR add-doc, and that the add-doc was then ingested into SOLR. The WARNING is a standard one and can be ignored, but any errors that display should be reported.
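When indexing many texts it can help to filter the output for problems automatically. The following is a rough sketch: find_index_errors is a hypothetical helper, and the patterns it searches for (fatal, error, exception) are assumptions about how failures appear in the output, not a documented list. The standard UTF-8 WARNING line is skipped.

```shell
#!/bin/sh
# Sketch: read script output on standard input and print any line that
# looks like a problem. The patterns are assumptions about how failures
# appear; the standard UTF-8 WARNING line is deliberately dropped first.
# Exit status is 0 when a suspicious line was found, non-zero otherwise.
find_index_errors() {
  grep -v 'WARNING: Make sure your XML documents are encoded in UTF-8' |
    grep -i -E 'fatal|error|exception'
}
```

One way to use it: ./createOneTextAdd.sh kt d 0049 2>&1 | find_index_errors && echo "Report the lines above". An empty result does not prove the indexing succeeded; it only means none of the assumed patterns appeared.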
Optimizing is a SOLR routine that compacts the index, making it faster and more efficient. The scripts that add all texts of a collection automatically optimize the index when the adding is done, but after adding individual texts or groups of texts this must be done by hand. When one has finished indexing whatever individual documents need indexing, one can optimize the index by typing the following command:
./optimize.sh
which will result in
[sds-deployer@dev solrdocs]$ ./optimize.sh
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8080/thlib-solr-multicore/thlib-texts/update..
SimplePostTool: POSTing file optimize.xml
SimplePostTool: COMMITting Solr index changes..
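Since optimizing is only needed once after a group of individual texts, the two steps can be combined. The wrapper below is a sketch under assumptions: index_texts and pad_textnum are hypothetical helpers, not existing THL scripts, and the wrapper assumes it is run from the solrdocs folder. Setting RUN=echo prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of a hypothetical wrapper: index several texts, then optimize once.
# Set RUN=echo to print the commands instead of executing them.

# Normalize a text number to four digits with leading zeros.
pad_textnum() {
  case $1 in
    ''|*[!0-9]*) echo "not a text number: $1" >&2; return 1 ;;
  esac
  # Strip leading zeros first so that e.g. 0049 is not parsed as octal.
  n=$(printf '%s' "$1" | sed 's/^0*//')
  printf '%04d\n' "${n:-0}"
}

# Usage: index_texts {coll} {ed} {text number}...
index_texts() {
  coll=$1
  ed=$2
  shift 2
  for num in "$@"; do
    padded=$(pad_textnum "$num") || return 1
    ${RUN:-} ./createOneTextAdd.sh "$coll" "$ed" "$padded" || return 1
  done
  # Optimize only after every text has been added.
  ${RUN:-} ./optimize.sh
}
```

For example, RUN=echo index_texts kt d 49 0050 lists the two createOneTextAdd.sh calls followed by a single ./optimize.sh, which is a convenient dry run before indexing for real.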
The command for indexing a single bibliographic record in a catalog is:
./indexOneBibl.sh {coll} {edition} {folder} {text number}
For example,
./indexOneBibl.sh kt d 0 0039
The result should look something like:
[sds-deployer@dev solrdocs]$ ./indexOneBibl.sh kt d 0 0039
Indexing one bibl: kt/d/0/kt-d-0039-bib.xml
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8080/thlib-solr-multicore/thlib-texts/update..
SimplePostTool: POSTing file kt-d-0039-add.xml
SimplePostTool: COMMITting Solr index changes..
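The sample output suggests how the four parameters map onto file names. The sketch below reconstructs those names; the pattern is inferred from this single example and is an assumption, not documented behaviour, and both function names are hypothetical.

```shell
#!/bin/sh
# Sketch: rebuild the file names seen in the sample output from the
# indexOneBibl.sh arguments. The naming pattern is inferred from that
# one example and is an assumption, not documented behaviour.

# Usage: bibl_source_path {coll} {edition} {folder} {text number}
bibl_source_path() {
  coll=$1; ed=$2; folder=$3; num=$4
  printf '%s/%s/%s/%s-%s-%s-bib.xml\n' "$coll" "$ed" "$folder" "$coll" "$ed" "$num"
}

# Usage: bibl_add_doc_name {coll} {edition} {text number}
bibl_add_doc_name() {
  coll=$1; ed=$2; num=$3
  printf '%s-%s-%s-add.xml\n' "$coll" "$ed" "$num"
}
```

With the arguments from the example, bibl_source_path kt d 0 0039 gives kt/d/0/kt-d-0039-bib.xml and bibl_add_doc_name kt d 0039 gives kt-d-0039-add.xml, matching the "Indexing one bibl" and "POSTing file" lines above. Which directory the source path is relative to is not stated here.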