Datasets
========

Datasets are how measurements, simulations, and analysis results become **linked data**.
In the ontology, a dataset represents a collection of **data points** or **files** that describe or result from a process, test, or computation.

This section explains how to describe datasets, connect them to experiments, and make them machine-interpretable using EMMO and ``schema.org`` patterns.

1. What a Dataset Represents
----------------------------

A dataset can represent many kinds of information:

- Measurement data from an **ElectrochemicalTest** (e.g., current, voltage, temperature vs. time)
- Simulation outputs (e.g., potential distributions, reaction rates)
- Process monitoring logs (e.g., pressure, temperature profiles)
- Derived analysis data (e.g., fitted impedance spectra or extracted diffusion coefficients)

In EMMO and its domain ontologies, datasets are modeled using the schema.org class ``schema:Dataset`` and connected to the relevant test or process via:

- ``hasInput``: for data consumed by a process
- ``hasOutput`` or ``hasResult``: for data produced by a process or test

2. Minimal Dataset Structure
----------------------------

At minimum, each dataset should define:

+----------------------+-----------------------+---------------------------------+
| Property             | Description           | Example                         |
+======================+=======================+=================================+
| ``@type``            | Dataset type          | ``schema:Dataset``              |
+----------------------+-----------------------+---------------------------------+
| ``name``             | Human-readable label  | ``"EIS_Spectrum.csv"``          |
+----------------------+-----------------------+---------------------------------+
| ``encodingFormat``   | MIME type             | ``"text/csv"``                  |
+----------------------+-----------------------+---------------------------------+
| ``distribution``     | File or download link | ``schema:DataDownload``         |
+----------------------+-----------------------+---------------------------------+
| ``variableMeasured`` | Variables or columns  | ``["Time / s", "Voltage / V"]`` |
+----------------------+-----------------------+---------------------------------+

**Example: Minimal Dataset**

.. code-block:: json

   {
     "@context": "https://w3id.org/emmo/domain/electrochemistry/context",
     "@type": "schema:Dataset",
     "name": "Galvanostatic_Cycle_01.csv",
     "encodingFormat": "text/csv",
     "variableMeasured": ["Time / s", "Voltage / V", "Current / A"],
     "distribution": {
       "@type": "schema:DataDownload",
       "contentUrl": "https://zenodo.org/record/1234567/files/Cycle_01.csv",
       "encodingFormat": "text/csv"
     }
   }

3. Linking Datasets to Tests or Processes
-----------------------------------------

Datasets usually describe the results of a process or test.
Connect them using ``hasResult`` (for outputs) or ``hasInput`` (for input data).

**Example: Dataset linked to a test**

.. code-block:: json

   {
     "@type": "GalvanostaticChargeDischargeTest",
     "hasTestObject": {
       "@type": "BatteryCell"
     },
     "hasResult": {
       "@type": "schema:Dataset",
       "name": "Cycle_01.csv",
       "encodingFormat": "text/csv",
       "variableMeasured": ["Time / s", "Voltage / V", "Current / A"]
     }
   }

4. Describing Variables in More Detail
--------------------------------------

Variables can be described not only by name, but also by semantic meaning and unit, using ``schema:variableMeasured`` entries linked to EMMO quantities.

**Example: Semantic variable description**

.. code-block:: json

   {
     "@type": "schema:PropertyValue",
     "name": "Voltage / V",
     "propertyID": "emmo:Voltage",
     "unitText": "V"
   }

Or, embedded in the dataset:

.. code-block:: json

   {
     "@type": "schema:Dataset",
     "variableMeasured": [
       {
         "@type": "schema:PropertyValue",
         "name": "Voltage / V",
         "propertyID": "emmo:Voltage",
         "unitText": "V"
       },
       {
         "@type": "schema:PropertyValue",
         "name": "Current / A",
         "propertyID": "emmo:Current",
         "unitText": "A"
       }
     ]
   }

This allows automatic mapping between dataset columns and ontology-defined quantities.

5. Connecting Metadata and Provenance
-------------------------------------

A dataset can include rich metadata describing its origin, authorship, and licensing.

.. code-block:: json

   {
     "@type": "schema:Dataset",
     "name": "INR21700_HighRate_Test",
     "creator": [
       {
         "@type": "schema:Person",
         "name": "Dr. Jane Smith"
       },
       {
         "@type": "schema:Organization",
         "name": "SINTEF Industry",
         "url": "https://www.sintef.no"
       }
     ],
     "dateCreated": "2025-10-01",
     "license": "https://spdx.org/licenses/CC-BY-4.0",
     "distribution": {
       "@type": "schema:DataDownload",
       "contentUrl": "https://zenodo.org/records/12345/files/data.csv"
     }
   }

You can also link datasets to:

- **Instruments** (via ``wasGeneratedBy`` or ``hasEquipment``)
- **Projects or experiments** (via ``isPartOf``)
- **Public repositories** (via ``contentUrl`` or ``identifier`` for DOIs)

6. Grouping Datasets and Derived Data
-------------------------------------

Sometimes one test produces several datasets (e.g., raw current–voltage data and post-processed impedance fits).
These can be grouped using ``hasPart`` or linked via ``isDerivedFrom``.

**Example: Linking raw and processed data**

.. code-block:: json

   {
     "@type": "schema:Dataset",
     "name": "EIS_FittedParameters.json",
     "encodingFormat": "application/json",
     "isDerivedFrom": {
       "@type": "schema:Dataset",
       "name": "EIS_Spectrum.csv",
       "encodingFormat": "text/csv"
     }
   }

7. Expressing Tabular Data Structure
------------------------------------

For structured tabular data, you can describe the schema using the **CSV on the Web (CSVW)** model.

**Example: CSVW schema embedded in JSON-LD**

.. code-block:: json

   {
     "@context": "http://www.w3.org/ns/csvw",
     "@type": "Table",
     "url": "Cycle_01.csv",
     "tableSchema": {
       "columns": [
         { "name": "Time / s", "datatype": "number", "propertyUrl": "emmo:Time" },
         { "name": "Voltage / V", "datatype": "number", "propertyUrl": "emmo:Voltage" },
         { "name": "Current / A", "datatype": "number", "propertyUrl": "emmo:Current" }
       ]
     }
   }

This approach enables fully machine-readable, column-level annotation.

8. Datasets as Input and Output
-------------------------------

Datasets are not only *results*; they can also serve as *inputs* for modeling or analysis processes.

**Example: Dataset as input**

.. code-block:: json

   {
     "@type": "ParameterEstimationProcess",
     "hasInput": {
       "@type": "schema:Dataset",
       "name": "HPPC_Data.csv"
     },
     "hasOutput": {
       "@type": "schema:Dataset",
       "name": "FittedParameters.json"
     }
   }

This provides a clear provenance trail: which data were used, how they were processed, and what results were produced.

9. Recommended Best Practices
-----------------------------

- Always specify a **unique identifier** (DOI, URI, or GitHub URL).
- Include both **human-readable metadata** and **machine-readable variable definitions**.
- Reference EMMO quantities in ``variableMeasured`` to link to physical meaning.
- Use ``isDerivedFrom`` for processed datasets to maintain provenance.
- Prefer standard MIME types (``text/csv``, ``application/json``, etc.).
- Store units in column headers (e.g., ``Voltage / V``) or in associated metadata.

10. Example: Complete Dataset Description
-----------------------------------------

Because ``hasResult`` points from a test to the data it produced, the dataset node below references its originating test through the JSON-LD ``@reverse`` keyword.

.. code-block:: json

   {
     "@context": "https://w3id.org/emmo/domain/electrochemistry/context",
     "@type": "schema:Dataset",
     "@id": "https://doi.org/10.5281/zenodo.1234567",
     "name": "Galvanostatic Cycling Data for Zn-Air Cell",
     "description": "Charge-discharge cycling data recorded at 25°C with 1 M KOH electrolyte.",
     "encodingFormat": "text/csv",
     "variableMeasured": [
       { "@type": "schema:PropertyValue", "name": "Time / s", "propertyID": "emmo:Time" },
       { "@type": "schema:PropertyValue", "name": "Voltage / V", "propertyID": "emmo:Voltage" },
       { "@type": "schema:PropertyValue", "name": "Current / A", "propertyID": "emmo:Current" }
     ],
     "creator": {
       "@type": "schema:Person",
       "name": "Dr. Jane Smith"
     },
     "license": "https://spdx.org/licenses/CC-BY-4.0",
     "@reverse": {
       "hasResult": {
         "@type": "ElectrochemicalTest",
         "hasTestObject": {
           "@type": "ZincAirCell"
         }
       }
     },
     "distribution": {
       "@type": "schema:DataDownload",
       "contentUrl": "https://zenodo.org/records/1234567/files/ZnAir_Cycling.csv",
       "encodingFormat": "text/csv"
     }
   }

11. Summary
-----------

+----------------------------+---------------------------------------+--------------------------------------------+
| **Concept**                | **Role**                              | **Ontological Representation**             |
+============================+=======================================+============================================+
| Dataset                    | Collection of structured data         | ``schema:Dataset``                         |
+----------------------------+---------------------------------------+--------------------------------------------+
| DataDownload               | File access information               | ``schema:DataDownload``                    |
+----------------------------+---------------------------------------+--------------------------------------------+
| VariableMeasured           | Defines dataset columns               | ``schema:PropertyValue`` linked to EMMO    |
+----------------------------+---------------------------------------+--------------------------------------------+
| hasInput / hasResult       | Link between processes and datasets   | Provenance connections                     |
+----------------------------+---------------------------------------+--------------------------------------------+
| isDerivedFrom              | Connects processed to raw data        | Enables traceability                       |
+----------------------------+---------------------------------------+--------------------------------------------+
| CSVW Table                 | Tabular metadata                      | Machine-readable schema definition         |
+----------------------------+---------------------------------------+--------------------------------------------+

**Key takeaway**

A **Dataset** in EMMO is not just a file: it is a *semantic description of what that file represents* and *how it connects* to physical, experimental, and computational reality.
Properly described datasets allow machines (and people) to *find, interpret, and reuse* your data with confidence.
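
The minimum properties listed in Section 2 lend themselves to a quick automated check before a dataset description is published. The following is only a minimal sketch using the Python standard library: the helper names ``missing_minimal_properties`` and ``variable_names`` are hypothetical and not part of EMMO or any existing tool, and the check assumes a single, non-expanded JSON-LD node that uses the same keys as the examples in this section.

.. code-block:: python

   import json

   # Properties that Section 2 lists as the minimum for a dataset description.
   MINIMAL_PROPERTIES = ("@type", "name", "encodingFormat", "distribution", "variableMeasured")


   def missing_minimal_properties(dataset):
       """Return the minimal properties that are absent or empty in a dataset node."""
       return [key for key in MINIMAL_PROPERTIES if not dataset.get(key)]


   def variable_names(dataset):
       """Collect column names from plain strings or schema:PropertyValue nodes."""
       variables = dataset.get("variableMeasured", [])
       if isinstance(variables, (str, dict)):
           variables = [variables]  # JSON-LD allows a single value instead of a list
       names = []
       for variable in variables:
           if isinstance(variable, str):
               names.append(variable)
           else:
               names.append(variable.get("name", ""))
       return names


   # Hypothetical dataset node that deliberately omits "distribution".
   example = json.loads("""
   {
     "@type": "schema:Dataset",
     "name": "Cycle_01.csv",
     "encodingFormat": "text/csv",
     "variableMeasured": [
       "Time / s",
       {"@type": "schema:PropertyValue", "name": "Voltage / V", "propertyID": "emmo:Voltage"}
     ]
   }
   """)

   print(missing_minimal_properties(example))  # ['distribution']
   print(variable_names(example))              # ['Time / s', 'Voltage / V']

A check like this only confirms that the expected keys are present; it does not validate the description against the ontology or resolve the ``@context``.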