
Plan, Package and Post Your Data!

You now have the opportunity to publish the data produced by your scientific experiments and projects in addition to publishing papers and articles. Publishing the data in this way will boost citations of your associated publications as well as attract separate citations for the data records. In turn, you will be able to study and re-use other scientists’ data to support your work, so that you can concentrate on making advances with a reduced risk of repeating experiments already done by others. Analysing their data may also give you new ideas for your own research.

HYDRALAB+ has provided a simple, three-step "ppprocess" for achieving this:

  1. Plan
    Take some time to plan and define the data outputs from your experiment(s).
  2. Package
    Arrange your output data into a package that will be easy for other scientists to understand.
  3. Post
    Upload your data package into the Zenodo repository through the HYDRALAB+ website.

 

 

Plan

If you take some time to plan and define the data which is going to be produced by your experiments then this will help to structure the experiments and maximise the impact of your findings. In particular, consider which datasets you would like to share with others. Your experiment may produce a vast amount of data – is it necessary to share all of it?

Also, consider how the results data will be offered to future users. If the overall package size is going to be large (greater than 1 GB, or with many files), break it down into sensible, well-described sub-packages. Users can then select those that they are interested in without having to download the whole package.
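When deciding whether and how to split a package, it can help to see how much data sits in each top-level folder. The following is a minimal Python sketch, assuming your package is a directory tree on disk. The function names and the reading of 1 GB as 10^9 bytes are assumptions made for illustration, not part of the HYDRALAB+ guidance.

```python
from pathlib import Path

ONE_GB = 10**9  # assumption: "1 GB" read as a decimal gigabyte


def subpackage_sizes(root):
    """Return {top-level folder: (total bytes, file count)} for a package directory."""
    root = Path(root)
    totals = {}
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # Files sitting directly in the package root are grouped under "."
        key = rel.parts[0] if len(rel.parts) > 1 else "."
        size, count = totals.get(key, (0, 0))
        totals[key] = (size + path.stat().st_size, count + 1)
    return totals


def needs_splitting(totals, limit=ONE_GB):
    """True if the whole package is larger than the chosen size limit."""
    return sum(size for size, _ in totals.values()) > limit
```

Folders that dominate the total are natural candidates for separate sub-packages, each with its own description.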

Your data will end up posted into a repository (see below). Each data package uploaded to the repository will automatically receive a unique Digital Object Identifier (DOI) to allow it to be referenced and located in the future: 

  • You may wish to create a single package and DOI for your project, containing results from a variety of experiments;
  • You may wish to create a single package and DOI for the results from each experiment;
  • You may wish to create a single data package and DOI for each results dataset.

You will know best how to present your data to future users, but remember to make sure that all necessary data and supporting information is included in each data package. Your data package should be complete and coherent at the point it is published.

Document all of this information as part of your Data Storage Report. The HYDRALAB+ Data Storage Report template can be found here: https://zenodo.org/record/1318030.

Package

It is important to package your data so that it is easy for others to analyse and understand. Sufficient supporting information should be provided alongside the data itself, including the Data Storage Report. It is up to you how much data you include in your package and how it is arranged. Please include a ‘README.txt’ file describing what is in the package. An example of a data package which has been prepared according to these guidelines can be found here: https://zenodo.org/record/1197273.

Overall, make sure your package has three "cccharacteristics":

  1. Completeness: Include everything necessary for others to understand your package; data, definitions, context.
  2. Clarity: Explain everything clearly and make no assumptions about others’ knowledge.
  3. Convenience: Use common formats, arrange the package tidily, use appropriate keywords.
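As an illustration of the ‘README.txt’ file mentioned above, the following Python sketch writes a simple README listing the contents of a package directory. The layout of the README, the function name and its arguments are assumptions for illustration only; no particular README structure is prescribed here.

```python
from pathlib import Path


def write_readme(package_dir, title, description, licence):
    """Write a plain-text README.txt that lists every file in the package."""
    package_dir = Path(package_dir)
    readme = package_dir / "README.txt"
    # List all files relative to the package root, excluding the README itself
    files = sorted(
        p.relative_to(package_dir).as_posix()
        for p in package_dir.rglob("*")
        if p.is_file() and p != readme
    )
    lines = [title, "=" * len(title), "", description, "",
             f"Licence: {licence}", "", "Contents:"]
    lines += [f"  {name}" for name in files]
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme
```

A generated file list is only a starting point: add a sentence or two for each file or folder so that the package meets the completeness and clarity characteristics.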

Data Format

Select a good file format for storing your data: one which is appropriate for the data structure and size; one which is usable and sustainable. Laboratory systems often have formats defined as standard, but you may have some opportunity to decide which you use.

  • Does the data format match the natural data structure of your data (i.e. flat, hierarchical, multidimensional)?
  • Does the data format allow you to comfortably store the entire final dataset?
  • Is the data format broadly understood within your community and acceptable to funders?
  • Is the data format supported by other communities and likely to be compatible with future common operating systems and applications?
  • Is there a broad range of software that can read / write the data format?
  • Are the terms and conditions for use of the read / write software favourable? Is it free? Is it proprietary?
  • Is the conversion process from the data format to / from other formats cheap and easy?

Many experimenters output their data in simple formats such as CSV or XLSX. Leading, more complex formats for long timeseries data include TimeseriesML or WaterML2: Part 1 - Timeseries. If your dataset is too large to be stored in these formats then a leading format for larger, multi-dimensional array-based data is netCDF.

Supporting Information

Supporting information needs to be supplied together with the data itself. All data should be accompanied by adequate metadata so that future users can understand and apply the contents. Sometimes adequate metadata is stored within the data structure format; sometimes an additional file is required. If possible, use an established metadata standard. Remember, the next person to use the dataset is likely to be you! How will you understand this data a year from now?

Dublin Core is an established metadata standard and an ISO standard (ISO 15836:2009). Another, more complex, ISO standard for describing metadata is ISO 19115/19139.
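As a rough illustration of what a separate metadata file might look like, the sketch below builds a small XML record using Dublin Core element names (title, creator, date, description, rights). The "metadata" wrapper element, the function name and the example values are hypothetical; check the requirements of your repository before settling on a layout.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_record(fields):
    """Build an XML string with one Dublin Core element per entry in fields."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


# Example values are invented for illustration
record = dublin_core_record({
    "title": "Wave flume experiment, run 12",
    "creator": "A. Researcher",
    "date": "2019-04-26",
    "description": "Water surface elevation measured at three wave gauges.",
    "rights": "CC BY 4.0",
})
```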

Parameter Names and Units

Too many data files have columns called ‘MyCol’ or ‘Col1’. Sometimes the units are omitted even when the column name is clear. Anyone who needs to understand your data needs both the parameter names and their units.

  • Avoid meaningless field names and remember to include the units.
  • If possible, take all parameter names and units from established vocabularies. When you use a vocabulary to describe parameters, include a reference to its on-line record.
  • If you use your own field names then make sure they are defined somewhere nearby.

Leading vocabularies include SeaDataNet, CF Standard Names, CSDMS Standard Names and ITTC Symbols and Terminology List.
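The sketch below shows one possible way to carry meaningful parameter names and units in a CSV header. The column names, the "name [unit]" convention and the function names are illustrative assumptions and are not taken from any of the vocabularies listed above; where a vocabulary term exists, prefer it and reference its on-line record.

```python
import csv
import io

# Hypothetical columns: (parameter name, unit)
COLUMNS = [
    ("time", "s"),
    ("water_surface_elevation", "m"),
    ("flow_velocity_x", "m/s"),
]


def header(columns):
    """Format each column as 'name [unit]' so units travel with the data."""
    return [f"{name} [{unit}]" for name, unit in columns]


def write_csv(rows, columns=COLUMNS):
    """Return CSV text with a descriptive header row followed by the data rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header(columns))
    writer.writerows(rows)
    return buffer.getvalue()
```

Compare a header such as "Col1,Col2,Col3" with "time [s],water_surface_elevation [m],flow_velocity_x [m/s]": the second can be understood without opening any other file.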

Spatial Characteristics

The spatial characteristics of the data must be clearly and completely defined. It is crucial that the full geometry of the experiment is described. The minimum level of information should include:

  • Appropriate maps, plans and lengthwise engineering drawings;
  • The locations of all instruments deployed;
  • Supporting tabular data giving associated numeric values;
  • Full topography including spatial extents;
  • Definition of datums and reference levels used, definition of local or general coordinate systems used;
  • Notes on resolution, accuracy and scaling.

Temporal Characteristics

The temporal characteristics of the data must be clearly and completely defined. This includes the articulation of time instances as well as time series data and temporal extents.

  • Where the data relates to real-world times then it must be articulated in or referenced against a full timestamp. This must include timezone modifiers and be of a form similar to: “YYYY-MM-DDThh:mm:ss+hh:mm” e.g. 11:44am on 26th April 2019 in British Summer Time is articulated as “2019-04-26T11:44:00+01:00”.
  • Where data relates to a notional experiment time, then all times must be related back to a single origin time at the start of the experiment time.
  • Time series data and temporal extents can be represented using either of the above two methods. Each time step can be represented as a full timestamp or an interval from an origin or the previous step.
  • The method used to derive any spectra from time series observations must also be included.
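The two ways of expressing time described above can be sketched in Python using the standard datetime module. The function names are hypothetical, and the fixed +01:00 offset is used only to reproduce the British Summer Time example.

```python
from datetime import datetime, timedelta, timezone

# Fixed +01:00 offset, matching the British Summer Time example above
BST = timezone(timedelta(hours=1))

# A full timestamp including its timezone modifier
stamp = datetime(2019, 4, 26, 11, 44, 0, tzinfo=BST)
full_timestamp = stamp.isoformat()  # "2019-04-26T11:44:00+01:00"


def elapsed_seconds(origin, stamps):
    """Express each timestamp as seconds since a single origin time."""
    return [(t - origin).total_seconds() for t in stamps]


def from_elapsed(origin, seconds):
    """Convert an interval from the origin back into a full timestamp."""
    return (origin + timedelta(seconds=seconds)).isoformat()
```

Whichever representation you choose, state the origin time and the units of the interval in the supporting information.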

Permissions

Remember to include information giving future users permission to use the data they have obtained. If possible, include an open license, with as few restrictions as possible, to allow others to use your data. Sometimes an embargo period is applied, to give the originators time to publish articles based on the data.

  • A suitable list of licenses is given here: http://opendefinition.org/licenses/. The default license for HYDRALAB+ is the Creative Commons Attribution 4.0 International (CC BY 4.0) license (https://creativecommons.org/licenses/by/4.0/legalcode).
  • If an embargo period is required for the publication of HYDRALAB+ experiments and research activities, select an embargo period of not more than two years.
  • The embargo period should be included in the data management plan.

Aspects Particularly Related to Physical Modelling Data Packages

Physical modelling studies can produce considerable quantities of data. This is particularly true of the raw instrument files which may be required for an expert practitioner to fully understand the datasets, but extends to some of the files necessary to fully represent the conditions (e.g. point clouds to describe topography). Often a large number of simulations are conducted and output is required as video. Clearly it is preferable to reduce the data package to a more convenient size, but often it is necessary to include the full data package which can amount to terabytes of data. When this is the case, strategies for user access as well as the fundamental storing of the data package must be considered – is it necessary for a user to download the whole package in order to get the information they need?

Moreover, data from instruments is typically part of a longer processing chain where raw instrument data files are repeatedly post-processed into successive formats and versions. Knowledge of the full processing chain utilized is often necessary for interpretation by experts.
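One possible way to document a processing chain is to keep a small machine-readable log alongside the data, with one entry per processing step. The Python sketch below is only an illustration: the JSON structure, the field names and the use of SHA-256 checksums to identify file versions are all assumptions, not a HYDRALAB+ requirement.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path):
    """Checksum a file in chunks so that large instrument files are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_step(chain, description, software, input_file, output_file):
    """Append one processing step, identifying the input and output file versions."""
    chain.append({
        "step": len(chain) + 1,
        "description": description,
        "software": software,
        "input": {"file": Path(input_file).name, "sha256": file_sha256(input_file)},
        "output": {"file": Path(output_file).name, "sha256": file_sha256(output_file)},
    })
    return chain


def save_chain(chain, path):
    """Write the processing chain as indented JSON for inclusion in the package."""
    Path(path).write_text(json.dumps(chain, indent=2), encoding="utf-8")
```

A log of this kind does not replace a written explanation of the processing, but it lets an expert trace which raw file each processed file came from.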

The differing nature of physical modelling experiments and numerical modelling experiments leads to issues at boundary conditions. The physical limitations of the laboratory do not apply in the virtual world. For example, wave reflections from hard boundaries – a common feature of physical models – cause differences in results from their numerical counterparts. This is not an issue that can be addressed by the protocols for the transfer of data.  However, information about these effects can be of use to the downstream numerical modellers and therefore should be included in the physical model data package where possible.

Factors such as these may, indeed, affect the design of the physical model. In this regard, when selecting physical modelling approaches, the largest possible model scale should be selected to ensure that the dominant forces are well represented.

Post

Those performing experiments as part of the HYDRALAB+ project are asked to store the dataset package from their experiment(s) in the Zenodo repository. This process will automatically give the dataset a DOI to allow it to be uniquely referenced. It can be done through the HYDRALAB+ website.

There are two ways to do this, depending on whether or not you are posting data associated with a Transnational Access Project. The Transnational Access interface simply contains a few more fields directly associated with those projects; otherwise it is the same.

1. If you are posting data associated with a Transnational Access Project:

  • Use the Transnational Access form provided on the HYDRALAB+ website (Add Dataset). Click on the ‘Add Dataset’ option under ‘TAKING PART’, ‘Transnational Access Projects’.
  • The form itself contains some additional metadata required by the Zenodo repository.
  • How-to instructions.
  • Video instructions.

2. If you are posting any other HYDRALAB+ data including JRA project data or project deliverables:

  • Go to https://hydralab.eu/. Log in and go to the 'Participant Area' menu and select 'DOI Datasets'. This will take you to https://hydralab.eu/participant-area/zenodo-doi/
  • Click on 'Add New Dataset' and fill in the form, also including the datasets you are uploading. Click on 'Save'.
  • A draft version of your data package will then appear in the 'DATASET LIST' shown. You can keep editing it until you are happy with the record.
  • When you are ready to publish it in Zenodo and you do not want to make any more changes, hit Publish in the 'Actions' column of the DATASET LIST table. Once you publish your data package you will not be able to change it, although you will be able to upload more recent versions.