What is dials-data

dials-data is a lightweight, simple, pure-Python package. It provides access to data files used in regression tests, but does not contain any of those data files itself.

Although we envisage it mostly being used in a cctbx/DIALS environment for tests in DIALS, dxtbx, xia2 and related packages, it has no dependencies on either cctbx or DIALS. In fact, all dependencies are explicitly declared in the setup.py file and are installable via standard setuptools/pip methods. This means dials-data can easily be used in other projects accessing the same data, and can be used in temporary environments such as Travis containers.

But - why?

In the past DIALS tests used an internal SVN repository called dials_regression to collect any file that would be useful in tests, but should not be distributed with a standard DIALS distribution. This made it easy to add files, from a single JSON file to example files from different detectors to whole datasets. Similarly, all a developer had to do to update the collection was to update the checked out SVN repository.

Over time the downsides of the SVN repository approach became obvious. A checked-out copy requires twice the storage space on the local disk. Sparse checkouts are possible, but become increasingly complicated as more files are added. This quickly becomes impractical in distributed testing environments. The disk space required for checkouts can be reduced by compressing the data, but then the data must be unpacked before it can be used in tests. By its nature the internal SVN repository was not publicly accessible.

The data files were too large to convert the repository to a git repository hosted on GitHub. In any case a git repository is not a good place to store large amounts of data either, as old versions of the data and retired datasets are kept forever, and sparse checkouts would be even more complex. Git LFS would raise the complexity even further and would incur associated costs.

A solution outside SVN/git was built with xia2_regression, which provided a command to download a copy of datasets from a regular webhost. This worked well for a while, but it was still complicated to use the data files in tests, as they first had to be downloaded in full.

With dxtbx, dials and xia2 moving to pytest, we extended the xia2_regression concept into the regression_data fixture to provide a simple way to access the datasets in tests. However, the data still needed to be downloaded separately, could not easily be used outside of the dials repository, and could not be used at all outside of a dials distribution. Adding data files was still a very involved process.

dials-data is the next iteration of our solution to this problem.

What can dials-data do

The entire pipeline, from adding new data files to the automatic download, storage, preparation, verification and provisioning of the files to tests, happens in a single, independent Python package.

Data files are versioned without being kept in an SVN or git repository. The integrity of data files can be guaranteed. Files are downloaded/updated as and when required. The provenance of the files is documented, so it is easy to identify who the author of the files is and under what license they have been made available. Anyone can create new datasets or update existing ones using GitHub pull requests.
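
The combination of on-demand download and integrity checking described above can be illustrated with a short sketch. This is not the dials-data implementation: the function names sha256_of and ensure_file, the fetch callable, and the choice of SHA-256 as the checksum are all assumptions made purely for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ensure_file(path: Path, expected_sha256: str, fetch) -> Path:
    """Return `path` once its contents match `expected_sha256`.

    `fetch` is a hypothetical callable that writes the file to `path`.
    It is only invoked when the file is missing or fails verification,
    so an intact local copy is never downloaded again.
    """
    if not path.is_file() or sha256_of(path) != expected_sha256:
        fetch(path)
        # Verify the freshly fetched copy before handing it to a test.
        if sha256_of(path) != expected_sha256:
            raise ValueError(f"checksum mismatch for {path}")
    return path
```

In a sketch like this, a test would call ensure_file with the recorded checksum and receive a path to a verified file. A corrupted or outdated local copy is replaced automatically, and a file that still fails verification after fetching raises an error rather than being passed silently to the test.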