Possibility of running the scripts on a specific dataset type and a specific telescope
At the moment, the scripts that take the most time to run are hillas_preprocessing.py and add_orig_mc_tree.py, because they run over MAGIC calibrated data and, for simulated data, the number of calibrated files is quite large.
However, leaving aside the memory efficiency of the data processing, one thing can be noted: both scripts process the data samples sequentially, one after the other. The current order is: MC train sample, MC test sample, data train sample (OFF) and data test sample (ON). For each of those samples there are files for both M1 and M2. This applies fully to hillas_preprocessing.py; add_orig_mc_tree.py runs over simulated data only.
Therefore one could save time when running over all those data samples if the scripts had the option to run over a specific data type (MC or real data), a specific sample (train or test) and a specific telescope (M1 or M2). This is similar to what is done by MARS programs like sorcerer and star. In this way the user can run several instances of the script in parallel, e.g. one on the M1 MC train sample and one on the M2 MC train sample. To process all data samples (MC and real data, for both telescopes), the user could then have 8 instances of the script running at the same time.
To leave the user full flexibility, this change could work as follows:
- the user specifies only the type of data to process, i.e. real or MC --> both train and test samples for both telescopes are processed for that data type
- the user specifies the data type and the sample (train or test) --> files for both telescopes are processed for that data type and sample
- the user specifies the data type, the sample and the telescope (M1 or M2) --> only the files belonging to the specified combination of data type, sample and telescope are processed
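A possible command-line interface implementing these three levels of selection, sketched with argparse. The option names and the helper functions are assumptions for illustration, not the actual implementation:

```python
import argparse

def build_parser():
    """CLI sketch: only --data-type is required; omitting --sample or
    --telescope means 'process everything' along that dimension."""
    parser = argparse.ArgumentParser(description="Hypothetical selection options")
    parser.add_argument("--data-type", choices=["mc", "real"], required=True,
                        help="process simulated (mc) or real data")
    parser.add_argument("--sample", choices=["train", "test"],
                        help="if omitted, both train and test samples are processed")
    parser.add_argument("--telescope", choices=["M1", "M2"],
                        help="if omitted, both telescopes are processed")
    return parser

def selected_jobs(args):
    """Expand the parsed options into the list of (data_type, sample, telescope) jobs."""
    samples = [args.sample] if args.sample else ["train", "test"]
    telescopes = [args.telescope] if args.telescope else ["M1", "M2"]
    return [(args.data_type, s, t) for s in samples for t in telescopes]

# Example: specifying only the data type selects all four sample/telescope jobs.
args = build_parser().parse_args(["--data-type", "mc"])
print(selected_jobs(args))  # prints 4 (data_type, sample, telescope) tuples
```

Specifying all three options narrows the run down to a single job, which is what allows eight independent instances to cover the whole dataset in parallel.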
Otherwise, if we want to "force" users to run multiple instances of the scripts in parallel, the third option can be made mandatory.