Elasticsearch Guide
Slurm provides multiple Job Completion Plugins. These plugins are an orthogonal way to provide historical job accounting data for finished jobs.
In most installations, Slurm is already configured with an AccountingStorageType plugin — usually slurmdbd. In these situations, the information captured by a completion plugin is intentionally redundant.
The jobcomp/elasticsearch plugin can be used together with a web layer on top of the Elasticsearch server — such as Kibana — to visualize your finished jobs and the state of your cluster. Some of these visualization tools also let you easily create different types of dashboards, diagrams, tables, histograms and/or apply customized filters when searching.
Configuration
The plugin requires the libcurl library to be installed and usable on the controller, and the development libraries to be available at configure time. At Slurm configuration time, the configure script should emit a message like this if the appropriate library and headers have been successfully located:
checking whether libcurl is usable... yes
There are two configure options controlling whether, and where, to look for the library:
- --with-libcurl=PREFIX: look for the curl library in PREFIX/lib and its headers in PREFIX/include (the default PREFIX is the curl-config path or $PATH).
- --without-libcurl: do not use libcurl.
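For example, assuming libcurl was installed under the hypothetical prefix /usr/local, its location can be passed explicitly at configure time, after which the detection message shown above should be printed:
$ ./configure --with-libcurl=/usr/local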
The Elasticsearch instance should be running and reachable from every configured SlurmctldHost. Refer to the Elasticsearch Official Documentation for further details on setup and configuration.
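A simple reachability check from a controller host is to query the root endpoint of the Elasticsearch server, which returns basic cluster and version information (localhost and the default 9200 port are assumed here):
$ curl -XGET http://localhost:9200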
There are four slurm.conf options related to this plugin (a combined example is shown after this list):
- JobCompType is used to select the job completion plugin type to activate. It should be set to jobcomp/elasticsearch.
JobCompType=jobcomp/elasticsearch
- JobCompLoc should be set to the Elasticsearch server URL endpoint (including the port number and the target index).
JobCompLoc=<host>:<port>/<target>/_doc
NOTE: Since Elasticsearch 8.0, the APIs that accept types have been removed, moving to a typeless mode. In versions prior to 20.11, the Slurm elasticsearch plugin removed any trailing slashes from this option URL and appended a hardcoded /slurm/jobcomp suffix, representing the /index/type respectively. Starting with Slurm 20.11, the URL is fully configurable and is handed to the libcurl library functions as-is, without modification. This also allows users to index data from different clusters to the same server but into different indices.
NOTE: The Elasticsearch official documentation provides detailed information around these concepts, the type to typeless deprecation transition as well as reindex API references on how to copy data from one index to another if needed.
- JobCompParams should be set to comma-delimited options controlling the connection to the Elasticsearch server:
  - JobCompParams=timeout=5
  Use a timeout when communicating with the Elasticsearch server. After the timeout, error out and queue the job record to be retried after 30 seconds.
  - JobCompParams=connect_timeout=5
  Use a timeout when connecting to the Elasticsearch server. After the timeout, error out and queue the job record to be retried after 30 seconds.
- DebugFlags could include the Elasticsearch flag for extra debugging purposes.
DebugFlags=Elasticsearch
It is a good idea to turn this on initially until you have verified that finished jobs are properly indexed. Note that you do not need to manually create the Elasticsearch index, since the plugin will automatically do so when trying to index the first job document.
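Putting these options together, a minimal set of related slurm.conf lines could look like the following sketch, where the host, port, index name and timeout values are placeholders to adapt to your site:
JobCompType=jobcomp/elasticsearch
JobCompLoc=localhost:9200/slurm/_doc
JobCompParams=connect_timeout=5,timeout=5
DebugFlags=Elasticsearch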
Visualization
Once jobs are being indexed, it is a good idea to use a web visualization layer to analyze the data. Kibana is a recommended open-source data visualization plugin for Elasticsearch. Once installed, an Elasticsearch index name or pattern has to be configured to instruct Kibana to retrieve the data. Once data is loaded it is possible to create tables where each row is a finished job, ordered by any column you choose — the @end_time timestamp is suggested — and any dashboards, graphs, or other analysis of interest.
Testing and Debugging
For debugging purposes, you can use the curl command or any similar tool to perform REST requests against Elasticsearch directly. Some of the following examples using the curl tool may be useful.
Query information assuming a slurm index name, including the document count (which should be one per job indexed):
$ curl -XGET http://localhost:9200/_cat/indices/slurm?v
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   slurm 103CW7GqQICiMQiSQv6M_g   5   1          9            0    142.8kb        142.8kb
Query all indexed jobs in the slurm index:
$ curl -XGET 'http://localhost:9200/slurm/_search?pretty=true&q=*:*' | less
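The _count API can also be used to retrieve just the number of indexed job documents without returning their contents (the slurm index name is assumed, as above):
$ curl -XGET 'http://localhost:9200/slurm/_count?pretty'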
Delete the slurm index (caution!):
$ curl -XDELETE http://localhost:9200/slurm
{"acknowledged":true}
Query information about _cat options. More can be found in the official documentation.
$ curl -XGET http://localhost:9200/_cat
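If data ever needs to be copied from one index to another, as mentioned in the note about the reindex API above, a request along the following lines can be used (slurm and slurm-new are hypothetical source and destination index names):
$ curl -XPOST 'http://localhost:9200/_reindex' -H 'Content-Type: application/json' -d '
{
  "source": { "index": "slurm" },
  "dest": { "index": "slurm-new" }
}'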
Failure management
When the primary slurmctld is shut down, information about all completed but not yet indexed jobs held within the Elasticsearch plugin is saved to a file named elasticsearch_state, located in the StateSaveLocation. This permits the plugin to restore the information when slurmctld is restarted; the pending records are sent to the Elasticsearch server once the connection is restored.
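To check for pending records after a shutdown, the configured StateSaveLocation can be looked up and the state file inspected; the path below is only an example:
$ scontrol show config | grep StateSaveLocation
$ ls -l /var/spool/slurmctld/elasticsearch_state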
Acknowledgments
The Elasticsearch plugin was created as part of Alejandro Sanchez's Master's Thesis.
Last modified 5 April 2019