Datasets

Datasets are how measurements, simulations, and analysis results become linked data. In the ontology, a dataset represents a collection of data points or files that describe or result from a process, test, or computation.

This section explains how to describe datasets, connect them to experiments, and make them machine-interpretable using EMMO and schema.org patterns.

1. What a Dataset Represents

A dataset can represent many kinds of information:

Measurement data from an ElectrochemicalTest (e.g., current, voltage, temperature vs. time)
Simulation outputs (e.g., potential distributions, reaction rates)
Process monitoring logs (e.g., pressure, temperature profiles)
Derived analysis data (e.g., fitted impedance spectra or extracted diffusion coefficients)

In descriptions based on EMMO and its domain ontologies, datasets are typed with the schema.org class:

schema:Dataset

and connected to the relevant test or process via:

hasInput — for data consumed by a process
hasOutput or hasResult — for data produced by a process or test

2. Minimal Dataset Structure

At minimum, each dataset should define a name, an encoding format, the variables measured, and a distribution that points to the file.

Example: Minimal Dataset

{
  "@context": "https://w3id.org/emmo/domain/electrochemistry/context",
  "@type": "schema:Dataset",
  "name": "Galvanostatic_Cycle_01.csv",
  "encodingFormat": "text/csv",
  "variableMeasured": ["Time / s", "Voltage / V", "Current / A"],
  "distribution": {
    "@type": "schema:DataDownload",
    "contentUrl": "https://zenodo.org/record/1234567/files/Cycle_01.csv",
    "encodingFormat": "text/csv"
  }
}
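A minimal sketch of how such a description could be checked before publishing. The key list mirrors the minimal example above; treating those keys as required is an assumption of this sketch, and the helper name `missing_minimal_keys` is hypothetical.

```python
# Keys taken from the minimal example; treating them as "required" is an
# assumption of this sketch, not a rule enforced by the ontology.
MINIMAL_KEYS = ("@type", "name", "encodingFormat", "variableMeasured", "distribution")


def missing_minimal_keys(dataset: dict) -> list:
    """Return the minimal keys that are absent or empty in a dataset description."""
    return [key for key in MINIMAL_KEYS if not dataset.get(key)]


dataset = {
    "@context": "https://w3id.org/emmo/domain/electrochemistry/context",
    "@type": "schema:Dataset",
    "name": "Galvanostatic_Cycle_01.csv",
    "encodingFormat": "text/csv",
    "variableMeasured": ["Time / s", "Voltage / V", "Current / A"],
    "distribution": {
        "@type": "schema:DataDownload",
        "contentUrl": "https://zenodo.org/record/1234567/files/Cycle_01.csv",
        "encodingFormat": "text/csv",
    },
}

incomplete = {"name": "x.csv"}

print(missing_minimal_keys(dataset))     # nothing missing for the full example
print(missing_minimal_keys(incomplete))  # lists every key that still needs a value
```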

3. Linking Datasets to Tests or Processes

Datasets usually describe the results of a process or test. You can connect them using hasResult (for outputs) or hasInput (for input data).

Example: Dataset linked to a test

{
  "@type": "GalvanostaticChargeDischargeTest",
  "hasTestObject": { "@type": "BatteryCell" },
  "hasResult": {
    "@type": "schema:Dataset",
    "name": "Cycle_01.csv",
    "encodingFormat": "text/csv",
    "variableMeasured": ["Time / s", "Voltage / V", "Current / A"]
  }
}
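When records are assembled programmatically, the link can be added with a small helper. The sketch below is one possible approach, not part of any EMMO tooling: `attach_result` is a hypothetical function that stores a single dataset as an object and switches to a JSON array once a test has several results.

```python
def attach_result(test: dict, dataset: dict) -> dict:
    """Return a copy of `test` with `dataset` added under hasResult."""
    linked = dict(test)  # shallow copy so the caller's record is not modified
    existing = linked.get("hasResult")
    if existing is None:
        linked["hasResult"] = dataset
    elif isinstance(existing, list):
        linked["hasResult"] = existing + [dataset]
    else:
        linked["hasResult"] = [existing, dataset]
    return linked


test = {
    "@type": "GalvanostaticChargeDischargeTest",
    "hasTestObject": {"@type": "BatteryCell"},
}
cycle_01 = {"@type": "schema:Dataset", "name": "Cycle_01.csv", "encodingFormat": "text/csv"}
cycle_02 = {"@type": "schema:Dataset", "name": "Cycle_02.csv", "encodingFormat": "text/csv"}

one_result = attach_result(test, cycle_01)
two_results = attach_result(one_result, cycle_02)
print([d["name"] for d in two_results["hasResult"]])
```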

4. Describing Variables in More Detail

Variables can be described not only by their names, but also by semantic meaning and units, using schema:variableMeasured entries linked to EMMO quantities.

Example: Semantic variable description

{
  "@type": "schema:PropertyValue",
  "name": "Voltage / V",
  "propertyID": "emmo:Voltage",
  "unitText": "V"
}

or embedded in the dataset:

{
  "@type": "schema:Dataset",
  "variableMeasured": [
    {
      "@type": "schema:PropertyValue",
      "name": "Voltage / V",
      "propertyID": "emmo:Voltage",
      "unitText": "V"
    },
    {
      "@type": "schema:PropertyValue",
      "name": "Current / A",
      "propertyID": "emmo:Current",
      "unitText": "A"
    }
  ]
}

This allows automatic mapping between dataset columns and ontology-defined quantities.
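As an illustration of that mapping, the sketch below turns column headers of the form "Name / unit" into schema:PropertyValue entries. The lookup table `QUANTITY_IDS` and the function `describe_column` are hypothetical, and the table only covers the terms used in this section; a real mapping would need to be checked against the ontology.

```python
# Hypothetical lookup from a column label to an EMMO term. Only the terms
# that appear in this section are listed.
QUANTITY_IDS = {
    "Time": "emmo:Time",
    "Voltage": "emmo:Voltage",
    "Current": "emmo:Current",
}


def describe_column(header: str) -> dict:
    """Build a schema:PropertyValue entry from a 'Name / unit' column header."""
    # partition() splits at the first "/", so units such as "mA/g" stay intact.
    label, _, unit = (part.strip() for part in header.partition("/"))
    entry = {"@type": "schema:PropertyValue", "name": header}
    if label in QUANTITY_IDS:
        entry["propertyID"] = QUANTITY_IDS[label]
    if unit:
        entry["unitText"] = unit
    return entry


headers = ["Time / s", "Voltage / V", "Current / A"]
variable_measured = [describe_column(h) for h in headers]
print(variable_measured[1])
```

Headers without a known label or without a unit simply produce an entry with fewer keys, so unmapped columns remain visible instead of being silently dropped.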

5. Connecting Metadata and Provenance

A dataset can include rich metadata to describe its origin, authorship, and licensing.

{
  "@type": "schema:Dataset",
  "name": "INR21700_HighRate_Test",
  "creator": [
    { "@type": "schema:Person", "name": "Dr. Jane Smith" },
    {
      "@type": "schema:Organization",
      "name": "SINTEF Industry",
      "url": "https://www.sintef.no"
    }
  ],
  "dateCreated": "2025-10-01",
  "license": "https://spdx.org/licenses/CC-BY-4.0",
  "distribution": {
    "@type": "schema:DataDownload",
    "contentUrl": "https://zenodo.org/records/12345/files/data.csv"
  }
}

You can also link datasets to:

Instruments (via wasGeneratedBy or hasEquipment)
Projects or experiments (via isPartOf)
Public repositories (via contentUrl or identifier for DOIs)
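For the DOI case, a small normalisation step helps keep identifiers consistent across records. The sketch below is an assumption about how this could be done: `doi_url` and `with_identifier` are hypothetical helpers that accept a bare DOI, a `doi:` prefix, or a full resolver URL and store the result under `identifier`.

```python
DOI_RESOLVER = "https://doi.org/"


def doi_url(doi: str) -> str:
    """Normalise a DOI given in common spellings to a resolver URL."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return DOI_RESOLVER + doi


def with_identifier(dataset: dict, doi: str) -> dict:
    """Return a copy of the dataset description carrying the DOI as identifier."""
    return {**dataset, "identifier": doi_url(doi)}


record = with_identifier(
    {"@type": "schema:Dataset", "name": "INR21700_HighRate_Test"},
    "doi:10.5281/zenodo.1234567",
)
print(record["identifier"])
```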

6. Grouping Datasets and Derived Data

Sometimes one test produces several datasets (e.g., raw current–voltage data and post-processed impedance fits). These can be grouped using hasPart or linked via isDerivedFrom.

Example: Linking raw and processed data

{
  "@type": "schema:Dataset",
  "name": "EIS_FittedParameters.json",
  "encodingFormat": "application/json",
  "isDerivedFrom": {
    "@type": "schema:Dataset",
    "name": "EIS_Spectrum.csv",
    "encodingFormat": "text/csv"
  }
}
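Because isDerivedFrom can be nested, the full lineage of a processed dataset can be recovered by following the links. A minimal sketch, assuming each dataset is derived from at most one parent and that parents are embedded as objects (not referenced by `@id`); `derivation_chain` is a hypothetical helper.

```python
def derivation_chain(dataset: dict) -> list:
    """List dataset names from the processed dataset back to the raw source."""
    names = []
    current = dataset
    # Follow embedded isDerivedFrom objects until none is left.
    while isinstance(current, dict):
        names.append(current.get("name", "<unnamed>"))
        current = current.get("isDerivedFrom")
    return names


fitted = {
    "@type": "schema:Dataset",
    "name": "EIS_FittedParameters.json",
    "encodingFormat": "application/json",
    "isDerivedFrom": {
        "@type": "schema:Dataset",
        "name": "EIS_Spectrum.csv",
        "encodingFormat": "text/csv",
    },
}

print(" <- ".join(derivation_chain(fitted)))
```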

7. Expressing Tabular Data Structure

For structured tabular data, you can describe the schema using the CSV on the Web (CSVW) model.

Example: CSVW schema embedded in JSON-LD

{
  "@context": "http://www.w3.org/ns/csvw",
  "@type": "Table",
  "url": "Cycle_01.csv",
  "tableSchema": {
    "columns": [
      { "name": "time", "titles": "Time / s", "datatype": "number", "propertyUrl": "emmo:Time" },
      { "name": "voltage", "titles": "Voltage / V", "datatype": "number", "propertyUrl": "emmo:Voltage" },
      { "name": "current", "titles": "Current / A", "datatype": "number", "propertyUrl": "emmo:Current" }
    ]
  }
}

This approach enables fully machine-readable, column-level annotation.
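A table description like this can be drafted from the CSV header itself. The sketch below uses only the Python standard library; `csvw_table` is a hypothetical helper. CSVW restricts a column's `name` to variable-like identifiers, so the sketch derives a short identifier for `name` and keeps the original header text in `titles`. It assumes every column is numeric and leaves out `propertyUrl`, which would still need to be filled in by hand or from a lookup.

```python
import csv
import io
import re


def csvw_table(url: str, csv_text: str) -> dict:
    """Draft a CSVW table description from the header row of a CSV file."""
    header = next(csv.reader(io.StringIO(csv_text)))
    columns = []
    for title in header:
        label = title.split("/")[0].strip()  # drop the " / unit" suffix
        columns.append({
            # Identifier-style name: lowercase, non-word characters become "_".
            "name": re.sub(r"\W+", "_", label).strip("_").lower(),
            "titles": title,
            "datatype": "number",  # assumption: all columns are numeric
        })
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "@type": "Table",
        "url": url,
        "tableSchema": {"columns": columns},
    }


sample = "Time / s,Voltage / V,Current / A\n0,3.20,0.50\n1,3.21,0.50\n"
table = csvw_table("Cycle_01.csv", sample)
print([c["name"] for c in table["tableSchema"]["columns"]])
```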

8. Datasets as Input and Output

Datasets aren’t only results — they can also serve as inputs for modeling or analysis processes.

Example: Dataset as input

{
  "@type": "ParameterEstimationProcess",
  "hasInput": {
    "@type": "schema:Dataset",
    "name": "HPPC_Data.csv"
  },
  "hasOutput": {
    "@type": "schema:Dataset",
    "name": "FittedParameters.json"
  }
}

This provides a clear provenance trail: which data were used, how they were processed, and what results were produced.
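The trail can be read back out of such a record as a list of directed edges. This is a rough sketch under the assumption that hasInput and hasOutput hold either a single embedded dataset or a list of them; `as_list` and `provenance_edges` are hypothetical helpers.

```python
def as_list(value) -> list:
    """Treat a missing value as empty and a single object as a one-item list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def provenance_edges(process: dict) -> list:
    """Return (source, target) pairs: input -> process and process -> output."""
    step = process.get("@type", "Process")
    edges = [(d.get("name"), step) for d in as_list(process.get("hasInput"))]
    edges += [(step, d.get("name")) for d in as_list(process.get("hasOutput"))]
    return edges


process = {
    "@type": "ParameterEstimationProcess",
    "hasInput": {"@type": "schema:Dataset", "name": "HPPC_Data.csv"},
    "hasOutput": {"@type": "schema:Dataset", "name": "FittedParameters.json"},
}

for source, target in provenance_edges(process):
    print(f"{source} -> {target}")
```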

9. Recommended Best Practices

Always specify a unique identifier (DOI, URI, or GitHub URL).
Include both human-readable metadata and machine-readable variable definitions.
Reference EMMO quantities in variableMeasured to link to physical meaning.
Use isDerivedFrom for processed datasets to maintain provenance.
Prefer standard MIME types (text/csv, application/json, etc.).
Store units in column headers (e.g., Voltage / V) or in associated metadata.
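Several of these practices can be checked automatically. The sketch below is a hypothetical `lint_dataset` function that covers three of them (identifier, EMMO references in variableMeasured, MIME type); the exact rules and messages are assumptions of this example, not an official validator.

```python
def lint_dataset(dataset: dict) -> list:
    """Return human-readable warnings for a dataset description."""
    warnings = []

    # Practice: always specify a unique identifier.
    if not (dataset.get("@id") or dataset.get("identifier")):
        warnings.append("no unique identifier (@id or identifier)")

    # Practice: reference EMMO quantities in variableMeasured.
    variables = dataset.get("variableMeasured", [])
    if isinstance(variables, (str, dict)):
        variables = [variables]
    for var in variables:
        if not (isinstance(var, dict) and var.get("propertyID")):
            name = var.get("name") if isinstance(var, dict) else var
            warnings.append(f"variable {name!r} has no propertyID")

    # Practice: prefer standard MIME types, which have the form type/subtype.
    fmt = dataset.get("encodingFormat", "")
    if "/" not in fmt:
        warnings.append(f"encodingFormat {fmt!r} is not a MIME type")

    return warnings


draft = {
    "@type": "schema:Dataset",
    "name": "Cycle_01.csv",
    "encodingFormat": "csv",
    "variableMeasured": [
        "Time / s",
        {"@type": "schema:PropertyValue", "name": "Voltage / V", "propertyID": "emmo:Voltage"},
    ],
}

for warning in lint_dataset(draft):
    print("warning:", warning)
```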

10. Example: Complete Dataset Description

{
  "@context": "https://w3id.org/emmo/domain/electrochemistry/context",
  "@type": "schema:Dataset",
  "@id": "https://doi.org/10.5281/zenodo.1234567",
  "name": "Galvanostatic Cycling Data for Zn-Air Cell",
  "description": "Charge-discharge cycling data recorded at 25°C with 1 M KOH electrolyte.",
  "encodingFormat": "text/csv",
  "variableMeasured": [
    { "@type": "schema:PropertyValue", "name": "Time / s", "propertyID": "emmo:Time" },
    { "@type": "schema:PropertyValue", "name": "Voltage / V", "propertyID": "emmo:Voltage" },
    { "@type": "schema:PropertyValue", "name": "Current / A", "propertyID": "emmo:Current" }
  ],
  "creator": { "@type": "schema:Person", "name": "Dr. Jane Smith" },
  "license": "https://spdx.org/licenses/CC-BY-4.0",
  "@reverse": {
    "hasResult": {
      "@type": "ElectrochemicalTest",
      "hasTestObject": { "@type": "ZincAirCell" }
    }
  },
  "distribution": {
    "@type": "schema:DataDownload",
    "contentUrl": "https://zenodo.org/records/1234567/files/ZnAir_Cycling.csv",
    "encodingFormat": "text/csv"
  }
}

11. Summary

Key takeaway

A Dataset in EMMO is not just a file — it’s a semantic description of what that file represents and how it connects to physical, experimental, and computational reality. Properly described datasets allow machines (and people) to find, interpret, and reuse your data with confidence.