Documenting a resource¶

In the tripper.datadoc sub-package, the documentation of resources is represented internally as JSON-LD documents, which are stored as Python dicts. However, the API tries to hide the complexities of JSON-LD behind simple interfaces. To support different use cases, the sub-package provides several interfaces for data documentation, including Python dicts, YAML files and tables. These are described further below.

Documenting as a Python dict¶

The API supports two Python dict representations, one for documenting a single resource and one for documenting multiple resources.

Single-resource dict¶

Below is a simple example of how to document a SEM image dataset as a Python dict:

>>> dataset = {
...     "@id": "kb:image1",
...     "@type": "sem:SEMImage",
...     "creator": {
...         "name": "Sigurd Wenner",
...     },
...     "description": "Back-scattered SEM image of cement, polished with 1 um diamond compound.",
...     "distribution": {
...         "downloadURL": "https://github.com/EMMC-ASBL/tripper/raw/refs/heads/master/tests/input/77600-23-001_5kV_400x_m001.tif",
...         "mediaType": "https://www.iana.org/assignments/media-types/image/tiff"
...     }
... }

The keywords are defined in the default JSON-LD context and documented under Predefined keywords.

This example uses two namespace prefixes that are not included in the predefined prefixes. We therefore have to define them explicitly:

>>> prefixes = {
...     "sem": "https://w3id.org/emmo/domain/sem/0.1#",
...     "kb": "http://example.com/kb/"
... }
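To illustrate what these prefixes mean, the sketch below expands a prefixed name such as kb:image1 into a full IRI. The helper expand_prefixed() is hypothetical and written only for this illustration. It is not part of the tripper API, which handles prefix expansion internally.

```python
# Illustration only: `expand_prefixed` is a hypothetical helper,
# not a tripper function.
prefixes = {
    "sem": "https://w3id.org/emmo/domain/sem/0.1#",
    "kb": "http://example.com/kb/",
}


def expand_prefixed(name, prefixes):
    """Return `name` with a known namespace prefix replaced by its URL.

    Names without a known prefix (including full IRIs such as
    "https://...") are returned unchanged.
    """
    prefix, sep, localname = name.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + localname
    return name


image_iri = expand_prefixed("kb:image1", prefixes)
type_iri = expand_prefixed("sem:SEMImage", prefixes)
print(image_iri)
print(type_iri)
```

The two printed IRIs are the values one would expect to find under "@id" and "@type" once the dict has been turned into a full JSON-LD document.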

Side note

This dict is actually a JSON-LD document with an implicit context. You can use told() to create a valid JSON-LD document from it. In addition to adding a @context field, this function also adds some implicit @type declarations.

>>> import json
>>> from tripper.datadoc import told
>>> d = told(dataset, prefixes=prefixes)
>>> print(json.dumps(d, indent=4))  # doctest: +SKIP

{
    "@context": "https://raw.githubusercontent.com/EMMC-ASBL/tripper/refs/heads/master/tripper/context/0.3/context.json",
    "@id": "http://example.com/kb/image1",
    "@type": "https://w3id.org/emmo/domain/sem/0.1#SEMImage",
    "creator": {
        "@type": [
            "http://xmlns.com/foaf/0.1/Agent",
            "https://w3id.org/emmo#EMMO_2480b72b_db8d_460f_9a5f_c2912f979046"
        ],
        "name": "Sigurd Wenner"
    },
    "description": "Back-scattered SEM image of cement, polished with 1 um diamond compound.",
    "distribution": {
        "@type": "http://www.w3.org/ns/dcat#Distribution",
        "downloadURL": "https://github.com/EMMC-ASBL/tripper/raw/refs/heads/master/tests/input/77600-23-001_5kV_400x_m001.tif",
        "mediaType": "https://www.iana.org/assignments/media-types/image/tiff"
    }
}

You can use store() to save this documentation to a triplestore. Since the prefixes "sem" and "kb" are not included in the Predefined prefixes, they have to be provided explicitly.

>>> from tripper import Triplestore
>>> from tripper.datadoc import store
>>> ts = Triplestore(backend="rdflib")
>>> d = store(ts, dataset, prefixes=prefixes)

The returned AttrDict instance is an updated copy of dataset (cast to a dict subclass with attribute access). It corresponds to a valid JSON-LD document and is the same as the one returned by told().

You can use ts.serialize() to list the content of the triplestore (defaults to turtle):

>>> print(ts.serialize())  # doctest: +SKIP

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix emmo: <https://w3id.org/emmo#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix kb: <http://example.com/kb/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sem: <https://w3id.org/emmo/domain/sem/0.1#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

kb:image1 a sem:SEMImage ;
    dcterms:creator [ a foaf:Agent,
                emmo:EMMO_2480b72b_db8d_460f_9a5f_c2912f979046 ;
            foaf:name "Sigurd Wenner"^^xsd:string ] ;
    dcterms:description "Back-scattered SEM image of cement, polished with 1 um diamond compound."^^rdf:langString ;
    dcat:distribution [ a dcat:Distribution ;
            dcat:downloadURL "https://github.com/EMMC-ASBL/tripper/raw/refs/heads/master/tests/input/77600-23-001_5kV_400x_m001.tif"^^xsd:anyURI ;
            dcat:mediaType <https://www.iana.org/assignments/media-types/image/tiff> ] .

Note that the image has implicitly been declared to be an individual of the classes dcat:Dataset and emmo:Dataset. This is because the type argument of store() defaults to "dataset".

Multi-resource dict¶

It is also possible to document multiple resources as a Python dict.

Note

Unlike the single-resource dict representation, the multi-resource dict representation is not valid (possibly incomplete) JSON-LD.

This dict representation accepts the following keywords:

@context: Optional user-defined context to be appended to the documentation of all resources.
prefixes: A dict mapping namespace prefixes to their corresponding URLs.
datasets/distributions/accessServices/generators/parsers/resources: A list of valid single-resource dicts of the given resource type.

See semdata.yaml for an example of a YAML representation of a multi-resource dict documentation.
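As a minimal sketch, a multi-resource dict that documents two datasets could look like the following. The keywords prefixes and datasets are the ones listed above. The resource kb:image2 and its description are made up for illustration.

```python
# Namespace prefixes shared by all resources in the documentation.
prefixes = {
    "sem": "https://w3id.org/emmo/domain/sem/0.1#",
    "kb": "http://example.com/kb/",
}

# Each entry under "datasets" is a single-resource dict.
# "kb:image2" is a hypothetical resource added for illustration.
multidoc = {
    "prefixes": prefixes,
    "datasets": [
        {
            "@id": "kb:image1",
            "@type": "sem:SEMImage",
            "description": "Back-scattered SEM image of cement, polished with 1 um diamond compound.",
        },
        {
            "@id": "kb:image2",
            "@type": "sem:SEMImage",
            "description": "Hypothetical second SEM image.",
        },
    ],
}

# List the identifiers of all documented datasets.
dataset_ids = [d["@id"] for d in multidoc["datasets"]]
print(dataset_ids)
```

Because the prefixes are stored inside the dict itself, they do not need to be repeated for each resource.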

Documenting as a YAML file¶

The save_datadoc() function saves a YAML file in multi-resource format to a triplestore. For example, semdata.yaml can be saved to a triplestore with:

>>> from tripper.datadoc import save_datadoc
>>> save_datadoc(  # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
...    ts,
...    "https://raw.githubusercontent.com/EMMC-ASBL/tripper/refs/heads/master/tests/input/semdata.yaml"
... )
{'@graph': [...], ...}

Documenting as table¶

The TableDoc class can be used to document multiple resources as rows in a table.

The table must have a header row with defined keywords (either predefined or provided with a custom context). Nested fields may be specified as dot-separated keywords. For example, the table

@id	distribution.downloadURL
:a	http://example.com/a.txt
:b	http://example.com/b.txt

corresponds to the following turtle representation:

:a dcat:distribution [
    a dcat:Distribution ;
    dcat:downloadURL "http://example.com/a.txt" ] .

:b dcat:distribution [
    a dcat:Distribution ;
    dcat:downloadURL "http://example.com/b.txt" ] .
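The dot notation can be pictured as a mapping from flat column names to nested dicts. The sketch below shows this idea with a hypothetical helper, row_to_dict(), written for this illustration. It is not how TableDoc is implemented, and it ignores details such as the implicit @type declarations.

```python
def row_to_dict(header, row):
    """Convert one table row to a nested dict.

    Dot-separated column names such as "distribution.downloadURL"
    become nested keys. Hypothetical helper, for illustration only.
    """
    result = {}
    for column, value in zip(header, row):
        *parents, leaf = column.split(".")
        target = result
        for key in parents:
            # Create intermediate dicts on demand.
            target = target.setdefault(key, {})
        target[leaf] = value
    return result


header = ["@id", "distribution.downloadURL"]
rows = [
    [":a", "http://example.com/a.txt"],
    [":b", "http://example.com/b.txt"],
]
docs = [row_to_dict(header, row) for row in rows]
print(docs[0])
```

Each resulting dict has the same shape as a single-resource dict, with the distribution given as a nested dict.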

The example below shows how to save all datasets listed in the CSV file semdata.csv to a triplestore.

>>> from tripper.datadoc import TableDoc

>>> td = TableDoc.parse_csv(
...     "https://raw.githubusercontent.com/EMMC-ASBL/tripper/refs/heads/master/tests/input/semdata.csv",
...     prefixes={
...         "sem": "https://w3id.org/emmo/domain/sem/0.1#",
...         "semdata": "https://he-matchmaker.eu/data/sem/",
...         "sample": "https://he-matchmaker.eu/sample/",
...         "mat": "https://he-matchmaker.eu/material/",
...         "dm": "http://onto-ns.com/meta/characterisation/0.1/SEMImage#",
...         "parser": "http://sintef.no/dlite/parser#",
...         "gen": "http://sintef.no/dlite/generator#",
...     },
... )
>>> td.save(ts)