TM job has no output (tomogram number not found in wedge list) but the job is still running
I submitted a TM job and it ran for an hour without producing any output. Here is the error message:
Exception: Traceback (most recent call last):
File "/mpcdf/soft/SLE_15/packages/skylake/gapstop-tm/gcc_12-12.1.0-impi_2021.11-2021.11.0/0.3/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
return self._engine.get_loc(casted_key)
File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 2606, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 2630, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 62
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mpcdf/soft/SLE_15/packages/skylake/gapstop-tm/gcc_12-12.1.0-impi_2021.11-2021.11.0/0.3/lib/python3.10/site-packages/gapstop/template_match.py", line 205, in _run_tm
inp = setup(params, n_tiles, random_seed)
File "/mpcdf/soft/SLE_15/packages/skylake/gapstop-tm/gcc_12-12.1.0-impi_2021.11-2021.11.0/0.3/lib/python3.10/site-packages/gapstop/setup.py", line 28, in setup
pixelsize = (wedgelist.loc[params["tomo_num"]]["pixelsize"]
File "/mpcdf/soft/SLE_15/packages/skylake/gapstop-tm/gcc_12-12.1.0-impi_2021.11-2021.11.0/0.3/lib/python3.10/site-packages/pandas/core/indexing.py", line 1191, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/mpcdf/soft/SLE_15/packages/skylake/gapstop-tm/gcc_12-12.1.0-impi_2021.11-2021.11.0/0.3/lib/python3.10/site-packages/pandas/core/indexing.py", line 1431, in _getitem_axis
return self._get_label(key, axis=axis)
File "/mpcdf/soft/SLE_15/packages/skylake/gapstop-tm/gcc_12-12.1.0-impi_2021.11-2021.11.0/0.3/lib/python3.10/site-packages/pandas/core/indexing.py", line 1381, in _get_label
return self.obj.xs(label, axis=axis)
File "/mpcdf/soft/SLE_15/packages/skylake/gapstop-tm/gcc_12-12.1.0-impi_2021.11-2021.11.0/0.3/lib/python3.10/site-packages/pandas/core/generic.py", line 4301, in xs
loc = index.get_loc(key)
File "/mpcdf/soft/SLE_15/packages/skylake/gapstop-tm/gcc_12-12.1.0-impi_2021.11-2021.11.0/0.3/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
raise KeyError(key) from err
KeyError: 62
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mpcdf/soft/SLE_15/packages/skylake/gapstop-tm/gcc_12-12.1.0-impi_2021.11-2021.11.0/0.3/bin/gapstop", line 8, in <module>
sys.exit(sg_tm())
File "/mpcdf/soft/SLE_15/packages/skylake/gapstop-tm/gcc_12-12.1.0-impi_2021.11-2021.11.0/0.3/lib/python3.10/site-packages/gapstop/cli.py", line 210, in sg_tm
args.func(args)
File "/mpcdf/soft/SLE_15/packages/skylake/gapstop-tm/gcc_12-12.1.0-impi_2021.11-2021.11.0/0.3/lib/python3.10/site-packages/gapstop/cli.py", line 21, in _tm
tm.template_match(
File "/mpcdf/soft/SLE_15/packages/skylake/gapstop-tm/gcc_12-12.1.0-impi_2021.11-2021.11.0/0.3/lib/python3.10/site-packages/gapstop/template_match.py", line 331, in template_match
_run_tm(idx, line, n_tiles, random_seed, comm, restart, cleanup)
File "/mpcdf/soft/SLE_15/packages/skylake/gapstop-tm/gcc_12-12.1.0-impi_2021.11-2021.11.0/0.3/lib/python3.10/site-packages/gapstop/template_match.py", line 218, in _run_tm
raise RuntimeError(msg) from Exception(errc)
RuntimeError: Setup failed for row 0. Skipping ...
srun: got SIGCONT
slurmstepd: error: *** STEP 13686875.0 ON ravg1018 CANCELLED AT 2024-11-14T10:30:53 ***
slurmstepd: error: *** JOB 13686875 ON ravg1018 CANCELLED AT 2024-11-14T10:30:53 ***
srun: forcing job termination
When I looked into the wedge list file, there was no line for tomogram 62. However, the job was not killed automatically after this error and kept running until it was cancelled. It would be good to add a check that terminates the job when a KeyError is raised because a tomogram is not found in the wedge list, so that the job does not keep consuming a large amount of compute time without producing any results.
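To illustrate what I mean, here is a minimal sketch of such a check. It is not taken from the gapstop code: the function names `find_missing_tomograms` and `preflight_check` are hypothetical, and I am assuming the requested tomogram numbers and the tomogram numbers in the wedge list are both available as plain lists before the matching starts.

```python
import sys


def find_missing_tomograms(requested, available):
    """Return the requested tomogram numbers that are absent from the wedge list."""
    available = set(available)
    return sorted({tomo for tomo in requested if tomo not in available})


def preflight_check(requested, available):
    """Exit with a non-zero status if any requested tomogram is missing.

    Hypothetical helper: meant to run once, before any template matching
    is started, so the job fails fast instead of idling on the allocation.
    """
    missing = find_missing_tomograms(requested, available)
    if missing:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(f"Tomogram number(s) {missing} not found in wedge list; aborting.")
```

With the numbers from my case, `find_missing_tomograms([62], [60, 61, 63])` returns `[62]` (the wedge list entries here are made up), so `preflight_check` would stop the job right away. In an MPI run, exiting on a single rank may not be enough to end the whole job, so aborting through the MPI communicator might be needed instead; I have not checked how gapstop handles this.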