Documenting a resource¶
The tripper.datadoc sub-package internally represents the documentation of a resource as a JSON-LD document stored as a Python dict. However, the API tries to hide the complexities of JSON-LD behind simple interfaces. To support different use cases, the sub-package provides several interfaces for data documentation, including Python dicts, YAML files and tables. These are described below.
Documenting as a Python dict¶
The API supports two Python dict representations, one for documenting a single resource and one for documenting multiple resources.
Single-resource dict¶
Below is a simple example of how to document a SEM image dataset as a Python dict:
>>> dataset = {
...     "@id": "kb:image1",
...     "@type": "sem:SEMImage",
...     "creator": {
...         "name": "Sigurd Wenner",
...     },
...     "description": "Back-scattered SEM image of cement, polished with 1 um diamond compound.",
...     "distribution": {
...         "downloadURL": "https://github.com/EMMC-ASBL/tripper/raw/refs/heads/master/tests/input/77600-23-001_5kV_400x_m001.tif",
...         "mediaType": "https://www.iana.org/assignments/media-types/image/tiff"
...     }
... }
The keywords are defined in the default JSON-LD context and documented under Predefined keywords.
This example uses two namespace prefixes that are not included in the predefined prefixes. We therefore have to define them explicitly:
>>> prefixes = {
...     "sem": "https://w3id.org/emmo/domain/sem/0.1#",
...     "kb": "http://example.com/kb/"
... }
Warning
Prefixes and keywords share the same namespace and must therefore be distinct.
This is a consequence of JSON-LD and cannot be changed by Tripper. A good rule of thumb is to write keywords out as full words and use short abbreviations (about 2-5 characters) for prefixes.
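The rule in the warning above can be checked before anything is stored. The following is a minimal sketch in plain Python, not part of Tripper: `find_prefix_keyword_clashes` is a hypothetical helper, and it only looks at the keywords actually used in the dict, not at the full list of predefined keywords.

```python
def find_prefix_keyword_clashes(resource: dict, prefixes: dict) -> set:
    """Return the names used both as a prefix and as a keyword.

    Hypothetical helper for illustration; not a Tripper function.
    """
    keywords = set()

    def collect(node):
        # Walk nested dicts and lists, collecting every key that is
        # not a JSON-LD keyword such as "@id" or "@type".
        if isinstance(node, dict):
            for key, value in node.items():
                if not key.startswith("@"):
                    keywords.add(key)
                collect(value)
        elif isinstance(node, list):
            for item in node:
                collect(item)

    collect(resource)
    return keywords & set(prefixes)


dataset = {
    "@id": "kb:image1",
    "creator": {"name": "Sigurd Wenner"},
    "description": "Back-scattered SEM image of cement.",
}
good_prefixes = {
    "sem": "https://w3id.org/emmo/domain/sem/0.1#",
    "kb": "http://example.com/kb/",
}
# A prefix named like a keyword, made up here to trigger a clash.
bad_prefixes = {"creator": "http://example.com/creator/"}

clashes_good = find_prefix_keyword_clashes(dataset, good_prefixes)
clashes_bad = find_prefix_keyword_clashes(dataset, bad_prefixes)
print(clashes_good)  # set()
print(clashes_bad)   # {'creator'}
```

Short prefixes such as "sem" and "kb" are unlikely to collide with keywords that are written out as full words, which is the point of the rule of thumb.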
Side note
This dict is actually a JSON-LD document with an implicit context.
You can use told() to create a valid JSON-LD document from it.
In addition to adding a @context field, this function also adds some implicit @type declarations.
>>> import json
>>> from tripper.datadoc import told
>>> d = told(dataset, prefixes=prefixes)
>>> print(json.dumps(d, indent=4)) # doctest: +SKIP
{
    "@context": "https://raw.githubusercontent.com/EMMC-ASBL/tripper/refs/heads/master/tripper/context/0.3/context.json",
    "@id": "http://example.com/kb/image1",
    "@type": "https://w3id.org/emmo/domain/sem/0.1#SEMImage",
    "creator": {
        "@type": [
            "http://xmlns.com/foaf/0.1/Agent",
            "https://w3id.org/emmo#EMMO_2480b72b_db8d_460f_9a5f_c2912f979046"
        ],
        "name": "Sigurd Wenner"
    },
    "description": "Back-scattered SEM image of cement, polished with 1 um diamond compound.",
    "distribution": {
        "@type": "http://www.w3.org/ns/dcat#Distribution",
        "downloadURL": "https://github.com/EMMC-ASBL/tripper/raw/refs/heads/master/tests/input/77600-23-001_5kV_400x_m001.tif",
        "mediaType": "https://www.iana.org/assignments/media-types/image/tiff"
    }
}
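In the output above, the prefixed names "kb:image1" and "sem:SEMImage" have been replaced by full IRIs. The sketch below illustrates that kind of prefix expansion in plain Python. It is not Tripper's implementation, and `expand_iri` is a hypothetical name chosen for this example.

```python
def expand_iri(value: str, prefixes: dict) -> str:
    """Expand "prefix:name" to a full IRI if the prefix is known.

    Illustrative only; Tripper's own expansion may differ in detail.
    """
    prefix, sep, name = value.partition(":")
    # Only expand known prefixes, and leave absolute IRIs such as
    # "https://..." untouched.
    if sep and prefix in prefixes and not name.startswith("//"):
        return prefixes[prefix] + name
    return value


prefixes = {
    "sem": "https://w3id.org/emmo/domain/sem/0.1#",
    "kb": "http://example.com/kb/",
}

expanded_id = expand_iri("kb:image1", prefixes)
expanded_type = expand_iri("sem:SEMImage", prefixes)
unchanged = expand_iri("https://example.com/x", prefixes)

print(expanded_id)    # http://example.com/kb/image1
print(expanded_type)  # https://w3id.org/emmo/domain/sem/0.1#SEMImage
print(unchanged)      # https://example.com/x
```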
You can use store() to save this documentation to a triplestore. Since the prefixes "sem" and "kb" are not included in the Predefined prefixes, they have to be provided explicitly.
>>> from tripper import Triplestore
>>> from tripper.datadoc import store
>>> ts = Triplestore(backend="rdflib")
>>> d = store(ts, dataset, prefixes=prefixes)
The returned AttrDict instance is an updated copy of dataset (cast to a dict subclass with attribute access).
It corresponds to a valid JSON-LD document and is the same as the one returned by told().
You can use ts.serialize() to list the content of the triplestore (defaults to turtle):
>>> print(ts.serialize()) # doctest: +SKIP
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix emmo: <https://w3id.org/emmo#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix kb: <http://example.com/kb/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sem: <https://w3id.org/emmo/domain/sem/0.1#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
kb:image1 a sem:SEMImage ;
    dcterms:creator [ a foaf:Agent,
                emmo:EMMO_2480b72b_db8d_460f_9a5f_c2912f979046 ;
            foaf:name "Sigurd Wenner"^^xsd:string ] ;
    dcterms:description "Back-scattered SEM image of cement, polished with 1 um diamond compound."^^rdf:langString ;
    dcat:distribution [ a dcat:Distribution ;
            dcat:downloadURL "https://github.com/EMMC-ASBL/tripper/raw/refs/heads/master/tests/input/77600-23-001_5kV_400x_m001.tif"^^xsd:anyURI ;
            dcat:mediaType <https://www.iana.org/assignments/media-types/image/tiff> ] .
Note that the image implicitly has been declared to be an individual of the classes dcat:Dataset and emmo:Dataset.
This is because the type argument of store() defaults to "dataset".
Multi-resource dict¶
It is also possible to document multiple resources as a Python dict.
Note
Unlike the single-resource dict representation, the multi-resource dict representation is not valid (possibly incomplete) JSON-LD.
The root of this dict representation accepts the following keywords:
- domain: Optional name of one or more domains to load keywords for. Defaults to "default".
- keywordfile: Optional YAML file with keyword definitions to parse. May also be a URI, in which case it will be accessed via HTTP GET.
- @context: Optional user-defined context to be appended to the documentation of all resources.
- base: Base IRI against which to resolve relative IRIs.
- prefixes: A dict mapping namespace prefixes to their corresponding URLs.
- <Class>: Class name followed by a list of valid single-resource dicts for the specified class. The class name must be defined by the domain, in a keywordfile or in a custom @context. The "default" domain already includes many common classes, like Resource, Dataset, Distribution, DataService, Agent...
See semdata.yaml for an example of a YAML representation of a multi-resource dict documentation.
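As a rough illustration of the layout described above, here is a small multi-resource dict written directly in Python. The second resource (kb:image2) and the `resources_by_class` helper are made up for this sketch. The helper only inspects the dict and does not call any Tripper function.

```python
multi_doc = {
    "prefixes": {
        "sem": "https://w3id.org/emmo/domain/sem/0.1#",
        "kb": "http://example.com/kb/",
    },
    # A class name followed by a list of single-resource dicts.
    "Dataset": [
        {
            "@id": "kb:image1",
            "@type": "sem:SEMImage",
            "description": "Back-scattered SEM image of cement.",
        },
        {
            "@id": "kb:image2",  # hypothetical second resource
            "@type": "sem:SEMImage",
            "description": "Hypothetical second image, for illustration only.",
        },
    ],
}

# Root keywords listed above; in this sketch every other key is
# treated as a class name.
ROOT_KEYWORDS = {"domain", "keywordfile", "@context", "base", "prefixes"}


def resources_by_class(doc: dict) -> dict:
    """Map each class name to the @id values of its resources."""
    return {
        key: [resource["@id"] for resource in value]
        for key, value in doc.items()
        if key not in ROOT_KEYWORDS
    }


ids = resources_by_class(multi_doc)
print(ids)  # {'Dataset': ['kb:image1', 'kb:image2']}
```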
Documenting as a YAML file¶
The save_datadoc() function allows you to save a YAML file in multi-resource format to a triplestore. For example, semdata.yaml can be saved to a triplestore with:
>>> from tripper.datadoc import save_datadoc
>>> save_datadoc( # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
...     ts,
...     "https://raw.githubusercontent.com/EMMC-ASBL/tripper/refs/heads/master/tests/input/semdata.yaml"
... )
{'@context': {...}, '@graph': [...], ...}
Documenting as a table¶
The TableDoc class can be used to document multiple resources as rows in a table.
The table must have a header row with defined keywords (either predefined or provided with a custom context).
Tripper allows multiple columns with the same keyword in the header.
Nested fields may be specified as dot-separated keywords, for example distribution.downloadURL.
The header keyword(s) may be followed by an optional suffix in square brackets with the following syntax (inspired by URL options):
name[label?key1=value1&key2=value2]
where
- name: is the column name that is mapped to a keyword (or set of dot-separated keywords) defined in the JSON-LD context.
The only formal requirement is that it cannot contain an opening square bracket ([), but it might be wise to be more strict.
- label: is a label for the column.
It is used to make the column unique or to group related columns.
The only formal requirement is that it cannot contain a question mark (?), but it is probably wise to be more strict.
A plain digit, as in @type[1], should be allowed.
- key: a key identifying an option for the column.
Should be a valid C or Python identifier (regex: [_a-zA-Z][_a-zA-Z0-9]*).
- value: the value of a key.
Should not contain an (unescaped) ampersand (&) or closing square bracket (]).
Currently recognised keys:
- unit: A unit symbol. All numbers in this column have this unit.
- sep: Separator character. Users often want to provide multiple values for a keyword in a single cell instead of duplicating the column. This option specifies the separator character used to split the cells of this column.
Warning
The syntax for the square brackets is currently experimental and may change in the future.
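To make the header syntax more concrete, the sketch below parses a header of the form name[label?key1=value1&key2=value2] with a regular expression. It is an illustration only, not Tripper's parser: `HEADER_RE` and `parse_header` are hypothetical names, escaping of & and ] is ignored, and a bracket in the middle of a dot-separated name is not handled.

```python
import re

# name, then an optional "[label?options]" suffix at the end of the header.
HEADER_RE = re.compile(
    r"^(?P<name>[^\[]+)"           # name: anything up to the first "["
    r"(?:\[(?P<label>[^?\]]*)"     # label: anything up to "?" or "]"
    r"(?:\?(?P<options>[^\]]*))?"  # options: "key=value" pairs joined by "&"
    r"\])?$"
)


def parse_header(header: str):
    """Split a column header into (name, label, options)."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"invalid column header: {header!r}")
    options = {}
    if match["options"]:
        for pair in match["options"].split("&"):
            key, _, value = pair.partition("=")
            options[key] = value
    return match["name"], match["label"] or "", options


print(parse_header("@id"))              # ('@id', '', {})
print(parse_header("@type[1]"))         # ('@type', '1', {})
print(parse_header("length[?unit=m]"))  # ('length', '', {'unit': 'm'})
print(parse_header("keyword[?sep=;]"))  # ('keyword', '', {'sep': ';'})
```

The options are split by hand instead of with a query-string parser, so that a value such as ";" is kept as-is.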
Examples of use cases¶
Ensure that columns are unique (e.g. for use with pandas):
| @id | @type[1] | @type[2] |
|---|---|---|
Grouping of columns (e.g. for DLite datamodels):
| @id | @type | distribution[1].downloadURL | distribution[1].mediaType | distribution[2].accessURL |
|---|---|---|---|---|
Specifying unit:
| @id | @type | length[?unit=m] |
|---|---|---|
| ex:my_length | emmo:Length | 3.2 |
Specifying a separator:
| @id | @type | keyword[?sep=;] |
|---|---|---|
| ex:mydata | emmo:Dataset | geology;stone;cave |
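The effect of the sep option on a single cell can be pictured as a plain string split. The following sketch assumes that surrounding whitespace and empty parts are dropped. That is an assumption made for this example rather than documented Tripper behaviour, and `split_cell` is a hypothetical helper.

```python
def split_cell(cell: str, sep: str = "") -> list:
    """Split a table cell into a list of values.

    Without a separator the cell is kept as a single value.
    """
    if not sep:
        return [cell.strip()]
    # Assumption: whitespace around values is not significant and
    # empty parts (e.g. from a trailing separator) are ignored.
    return [part.strip() for part in cell.split(sep) if part.strip()]


keywords = split_cell("geology;stone;cave", sep=";")
single = split_cell("geology;stone;cave")

print(keywords)  # ['geology', 'stone', 'cave']
print(single)    # ['geology;stone;cave']
```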
Complete example¶
For example, the table
| @id | distribution.downloadURL |
|---|---|
| :a | http://example.com/a.txt |
| :b | http://example.com/b.txt |
corresponds to the following turtle representation:
:a dcat:distribution [
    a dcat:Distribution ;
    dcat:downloadURL "http://example.com/a.txt" ] .
:b dcat:distribution [
    a dcat:Distribution ;
    dcat:downloadURL "http://example.com/b.txt" ] .
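The step from a table row with dot-separated headers to a nested single-resource dict can be sketched as follows. This is a simplified illustration in plain Python, not the TableDoc implementation: `nest_row` is a hypothetical helper, and it ignores square-bracket options and repeated columns.

```python
def nest_row(row: dict) -> dict:
    """Turn {"a.b": value} style columns into nested dicts."""
    nested = {}
    for header, value in row.items():
        *parents, leaf = header.split(".")
        node = nested
        # Create (or reuse) one nested dict per parent keyword.
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return nested


rows = [
    {"@id": ":a", "distribution.downloadURL": "http://example.com/a.txt"},
    {"@id": ":b", "distribution.downloadURL": "http://example.com/b.txt"},
]
docs = [nest_row(row) for row in rows]

print(docs[0])
# {'@id': ':a', 'distribution': {'downloadURL': 'http://example.com/a.txt'}}
```

Each resulting dict has the same shape as the single-resource dicts described earlier in this document.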
The example below shows how to save all datasets listed in the CSV file semdata.csv to a triplestore.
>>> from tripper.datadoc import TableDoc
>>> td = TableDoc.parse_csv(
...     "https://raw.githubusercontent.com/EMMC-ASBL/tripper/refs/heads/master/tests/input/semdata.csv",
...     prefixes={
...         "sem": "https://w3id.org/emmo/domain/sem/0.1#",
...         "semdata": "https://he-matchmaker.eu/data/sem/",
...         "sample": "https://he-matchmaker.eu/sample/",
...         "mat": "https://he-matchmaker.eu/material/",
...         "dm": "http://onto-ns.com/meta/characterisation/0.1/SEMImage#",
...         "par": "http://sintef.no/dlite/parser#",
...         "gen": "http://sintef.no/dlite/generator#",
...     },
... )
>>> d = td.save(ts)
Note: If you parse multiple CSV files that cross-reference each other's classes (e.g. a table of dataset types referenced as hasInput/hasOutput in a computations table), you must call update_context() with the first table's output before parsing the next table. Without this step, object properties whose values are classes defined in a previously parsed table will silently produce plain triples instead of owl:Restriction nodes. See Multi-table workflows for details and a worked example.