Adding a new RailDatasetHolder

Because of the variety of formats of files in RAIL, and the variety of analysis flavors in a RailProject, it is useful to be able to have re-usable tools that wrap particular types datasets from a RailProject These are implemented as subclasses of the rail.plotting.dataset_holder.RailDatasetHolder class. A RailDatasetHolder is intended to take a particular set of inputs and extract a particular set of data from the RailProject. The inputs and outputs are all defined in particular ways to allow RailDatasetHolder objects to be integrated into larger data analysis pipelines.

New RailDatasetHolder Example

The following example has all of the required pieces of a RailDatasetHolder and almost nothing else.

class RailPZPointEstimateDataHolder(RailDatasetHolder):
    """Simple class for holding a dataset for plotting data that comes from a RailProject"""

    config_options: dict[str, StageParameter] = dict(
        name=StageParameter(str, None, fmt="%s", required=True, msg="Dataset name"),
        project=StageParameter(
            str, None, fmt="%s", required=True, msg="RailProject name"
        ),
        selection=StageParameter(
            str, None, fmt="%s", required=True, msg="RailProject data selection"
        ),
        flavor=StageParameter(
            str, None, fmt="%s", required=True, msg="RailProject analysis flavor"
        ),
        tag=StageParameter(
            str, None, fmt="%s", required=True, msg="RailProject file tag"
        ),
        algo=StageParameter(
            str, None, fmt="%s", required=True, msg="RailProject algorithm"
        ),
    )

    extractor_inputs: dict = {
        "project": RailProject,
        "selection": str,
        "flavor": str,
        "tag": str,
        "algo": str,
    }

    output_type: type[RailDataset] = RailPZPointEstimateDataset

    def __init__(self, **kwargs: Any):
        RailDatasetHolder.__init__(self, **kwargs)
        self._project: RailProject | None = None

    def __repr__(self) -> str:
        ret_str = (
            f"{self.config.extractor} "
            "( "
            f"{self.config.project}, "
            f"{self.config.selection}_{self.config.flavor}_{self.config.tag}_{self.config.algo}"
            ")"
        )
        return ret_str

    def get_extractor_inputs(self) -> dict[str, Any]:
        if self._project is None:
            self._project = RailDatasetFactory.get_project(self.config.project).resolve()
        the_extractor_inputs = dict(
            project=self._project,
            selection=self.config.selection,
            flavor=self.config.flavor,
            tag=self.config.tag,
            algo=self.config.algo,
        )
        self._validate_extractor_inputs(**the_extractor_inputs)
        return the_extractor_inputs

    def _get_data(self, **kwargs: Any) -> dict[str, Any] | None:
        return get_pz_point_estimate_data(**kwargs)

    @classmethod
    def generate_dataset_dict(
        cls,
        **kwargs: Any,
    ) -> list[dict[str, Any]]:

The required pieces, in the order that they appear are:

  1. The RailProjectDatasetHolder(RailDatasetHolder): defines a class called RailProjectDatasetHolder and specifies that it inherits from RailDatasetHolder.

  2. The config_options lines define the configuration parameters for this class, as well as their default values. Note that we are specifying a helper class to actually extract the data.

  3. The extractor_inputs = [('input', PqHandle)] and outputs = [('output', PqHandle)] define the inputs that will be based to the

  4. The output_type: type[RailDataset] = RailPZPointEstimateDataset line specifies that this class will return a RailPZPointEstimateDataset dataset.

  5. The __init__ method does any class-specific initialization, in this case defining that this class will store and project and extractor

  6. The __repr__ method is optional, here it gives a useful representation of the class

  7. The get_extractor_inputs() method does the first part of the actual work, note that it doesn’t take any arguments, that it uses the factories to find the helper objects and passes algo it’s configuration and validates it’s outputs

  8. The _get_data() method does the rest of actual work (in this case it passes it off to a utility function get_pz_point_estimate_data which knows how to extract data from the RailProject

  9. The generate_dataset_dict() can scan a RailProject and generate a dictionary of all the available datasets