Datadoc¶
datadoc
is a command-line tool for data documentation which comes with Tripper.
It provides functionality for both populating a triplestore with data documentation and to search for and find the documentation at a later stage.
Running datadoc --help
from the shell will show the following help message:
usage: datadoc [-h] [--backend BACKEND] [--base-iri BASE_IRI]
[--database DATABASE] [--package PACKAGE] [--parse LOCATION]
[--parse-format PARSE_FORMAT]
[--prefixes PREFIX=URL [PREFIX=URL ...]]
{add,find,load} ...
Tool for data documentation. It allows populating and searching a triplestore
for existing documentation.
positional arguments:
{add,find,load} Subcommands:
add Populate the triplestore with data documentation.
find Find documented resources in the triplestore.
load Load documented dataset from a storage.
options:
-h, --help show this help message and exit
--backend BACKEND, -b BACKEND
Triplestore backend to use. Defaults to 'rdflib' - an
in-memory rdflib triplestore, that can be pre-loaded
with --parse.
--base-iri BASE_IRI, -B BASE_IRI
Base IRI of the triplestore.
--database DATABASE, -d DATABASE
Name of database to connect to (for backends
supporting it).
--package PACKAGE Only needed when `backend` is a relative module.
--parse LOCATION, -p LOCATION
Load triplestore from this location.
--parse-format PARSE_FORMAT, -F PARSE_FORMAT
Used with `--parse`. Format to use when parsing
triplestore.
--prefixes PREFIX=URL [PREFIX=URL ...], -P PREFIX=URL [PREFIX=URL ...]
Namespace prefixes to add to bind to the triplestore.
Currently, datadoc
has currently three sub-commands, add
, find
and load
for populating the triplestore, searching the triplestore and accessing a dataset documented the triplestore, respectively.
General options¶
-
The
--backend
,--base-iri
,--database
and--package
options are all for connecting to a triplestore. -
The
--parse
,--parse-format
and--prefixes
options are for pre-loading the triplestore with with triples from an external source, like a ntriples or turtle file, and for adding namespace prefixes. They are typically used with the default "rdflib" in-memory backend.
Subcommands¶
add¶
Populates the triplestore with data documentation.
Running datadoc add --help
will show the following help message:
usage: datadoc add [-h] [--input-format {yaml,csv}]
[--csv-options OPTION=VALUE [OPTION=VALUE ...]]
[--context CONTEXT] [--dump FILENAME] [--format FORMAT]
input
positional arguments:
input Path or URL to the input the triplestore should be
populated from.
options:
-h, --help show this help message and exit
--input-format {yaml,csv}, -i {yaml,csv}
Input format. By default it is inferred from the file
extension of the `input` argument.
--csv-options OPTION=VALUE [OPTION=VALUE ...]
Options describing the CSV dialect for --input-
format=csv. Common options are 'dialect', 'delimiter'
and 'quotechar'.
--context CONTEXT Path or URL to custom JSON-LD context for the input.
--dump FILENAME, -d FILENAME
File to dump the populated triplestore to.
--format FORMAT, -f FORMAT
Format to use with `--dump`. Default is "turtle".
The positional input
argument is the path or URL to the source from which to populate the triplestore.
The --input-format
, --csv-options
and --context
options provide more details about the input.
The --dump
and --format
options allow to dump the populated triplestore to file.
This is useful if you are working with an in-memory triplestore.
Example
The tests/input/
folder in the source code contain the semdata.csv
CSV file documenting four datasets, a SEM image, two nested dataset series and the sample the image was acquired from as shown in the figure below (click on it to see it in full size).
Running the following command from the root folder of the source code will populate an in-memory rdflib store with the data documented in the semdata.csv
file.
datadoc add tests/input/semdata.csv --context tests/input/semdata-context.json --dump kb.ttl
The --context
option provides a user-defined context defining prefixes and keywords used by the input.
The --dump
option dumps the in-memory triplestore to the file kb.ttl
.
If you open the file, you will notice that it in addition to the four datasets listed in the input, also include the SEMImage
class and its properties, providing structural documentation of the SEMImage
individuals.
Generated turtle file
@prefix chameo: <https://w3id.org/emmo/domain/characterisation-methodology/chameo#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix emmo: <https://w3id.org/emmo#> .
@prefix oteio: <https://w3id.org/emmo/domain/oteio#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://onto-ns.com/meta/matchmaker/0.2/SEMImage> a owl:Class ;
rdfs:subClassOf [ a owl:Restriction ;
owl:onClass <http://onto-ns.com/meta/matchmaker/0.2/SEMImage#data> ;
owl:onProperty emmo:EMMO_b19aacfc_5f73_4c33_9456_469c1e89a53e ;
owl:qualifiedCardinality "1"^^xsd:nonNegativeInteger ],
[ a owl:Restriction ;
owl:onClass <http://onto-ns.com/meta/matchmaker/0.2/SEMImage#labels> ;
owl:onProperty emmo:EMMO_b19aacfc_5f73_4c33_9456_469c1e89a53e ;
owl:qualifiedCardinality "1"^^xsd:nonNegativeInteger ],
[ a owl:Restriction ;
owl:onClass <http://onto-ns.com/meta/matchmaker/0.2/SEMImage#metadata> ;
owl:onProperty emmo:EMMO_b19aacfc_5f73_4c33_9456_469c1e89a53e ;
owl:qualifiedCardinality "1"^^xsd:nonNegativeInteger ],
[ a owl:Restriction ;
owl:onClass <http://onto-ns.com/meta/matchmaker/0.2/SEMImage#pixelheight> ;
owl:onProperty emmo:EMMO_b19aacfc_5f73_4c33_9456_469c1e89a53e ;
owl:qualifiedCardinality "1"^^xsd:nonNegativeInteger ],
[ a owl:Restriction ;
owl:onClass <http://onto-ns.com/meta/matchmaker/0.2/SEMImage#pixelwidth> ;
owl:onProperty emmo:EMMO_b19aacfc_5f73_4c33_9456_469c1e89a53e ;
owl:qualifiedCardinality "1"^^xsd:nonNegativeInteger ],
emmo:EMMO_194e367c_9783_4bf5_96d0_9ad597d48d9a ;
skos:prefLabel "SEMImage"@en ;
emmo:EMMO_967080e5_2f42_4eb2_a3a9_c58143e835f9 "SEM image with elemental mappings. Represented as a stack of elemental mapping\\nfollowed by the image formed by the back-scattered electrons (BSE).\\nSet `nelements=0` if you only have the back-scattered image.\\n"@en ;
oteio:hasURI "http://onto-ns.com/meta/matchmaker/0.2/SEMImage"^^xsd:anyURI .
<https://he-matchmaker.eu/data/sem/SEM_cement_batch2/77600-23-001/77600-23-001_5kV_400x_m001> a dcat:Dataset,
<https://w3id.com/emmo/domain/sem/0.1#SEMImage>,
emmo:EMMO_194e367c_9783_4bf5_96d0_9ad597d48d9a ;
dcterms:creator "Sigurd Wenner" ;
dcterms:description "Back-scattered SEM image of cement sample 77600 from Heidelberg, polished with 1 µm diamond compound." ;
dcterms:title "SEM image of cement" ;
dcat:contactPoint "Sigurd Wenner <Sigurd.Wenner@sintef.no>" ;
dcat:distribution [ a dcat:Distribution ;
<http://sintef.no/dlite/parser#> "http://sintef.no/dlite/parser#sem_hitachi" ;
dcat:downloadURL "https://github.com/EMMC-ASBL/tripper/raw/refs/heads/master/tests/input/77600-23-001_5kV_400x_m001.tif" ;
dcat:mediaType "image/tiff" ] ;
dcat:inSeries <https://he-matchmaker.eu/data/sem/SEM_cement_batch2/77600-23-001> ;
emmo:EMMO_f702bad4_fc77_41f0_a26d_79f6444fd4f3 <https://he-matchmaker.eu/material/concrete1> ;
oteio:hasDatamodel "http://onto-ns.com/meta/matchmaker/0.2/SEMImage" ;
oteio:hasDatamodelStorage "https://github.com/HEU-MatCHMaker/DataDocumentation/blob/master/SEM/datamodels/SEMImage.yaml" .
<https://he-matchmaker.eu/sample/SEM_cement_batch2/77600-23-001> a dcat:Dataset,
emmo:EMMO_194e367c_9783_4bf5_96d0_9ad597d48d9a,
chameo:Sample ;
dcterms:title "Series for SEM images for sample 77600-23-001." .
<http://onto-ns.com/meta/matchmaker/0.2/SEMImage#data> a owl:Class ;
rdfs:subClassOf [ a owl:Restriction ;
owl:hasValue <http://onto-ns.com/meta/matchmaker/0.2/SEMImage#data_dimension0> ;
owl:onProperty emmo:EMMO_0a9ae0cb_526d_4377_9a11_63d1ce5b3499 ],
[ a owl:Restriction ;
owl:onProperty emmo:EMMO_e5a34647_a955_40bc_8d81_9b784f0ac527 ;
owl:someValuesFrom emmo:EMMO_ac9e518d_b403_4d8b_97e2_06f9d40bac01 ],
emmo:EMMO_28fbea28_2204_4613_87ff_6d877b855fcd,
emmo:EMMO_50d6236a_7667_4883_8ae1_9bb5d190423a ;
skos:prefLabel "Data"@en ;
emmo:EMMO_967080e5_2f42_4eb2_a3a9_c58143e835f9 "Image data - a stack of images for each channel"@en ;
oteio:datasize "4"^^xsd:nonNegativeInteger .
<http://onto-ns.com/meta/matchmaker/0.2/SEMImage#data_dimension0> a emmo:EMMO_b4c97fa0_d82c_406a_bda7_597d6e190654 ;
skos:prefLabel "data_dimension0"@en ;
emmo:EMMO_23b579e1_8088_45b5_9975_064014026c42 "channels"^^xsd:string ;
emmo:EMMO_499e24a5_5072_4c83_8625_fe3f96ae4a8d <http://onto-ns.com/meta/matchmaker/0.2/SEMImage#data_dimension1> ;
emmo:EMMO_967080e5_2f42_4eb2_a3a9_c58143e835f9 "Number of channels."@en .
<http://onto-ns.com/meta/matchmaker/0.2/SEMImage#data_dimension1> a emmo:EMMO_b4c97fa0_d82c_406a_bda7_597d6e190654 ;
skos:prefLabel "data_dimension1"@en ;
emmo:EMMO_23b579e1_8088_45b5_9975_064014026c42 "height"^^xsd:string ;
emmo:EMMO_499e24a5_5072_4c83_8625_fe3f96ae4a8d <http://onto-ns.com/meta/matchmaker/0.2/SEMImage#data_dimension2> ;
emmo:EMMO_967080e5_2f42_4eb2_a3a9_c58143e835f9 "Number of pixels along the image height."@en .
<http://onto-ns.com/meta/matchmaker/0.2/SEMImage#data_dimension2> a emmo:EMMO_b4c97fa0_d82c_406a_bda7_597d6e190654 ;
skos:prefLabel "data_dimension2"@en ;
emmo:EMMO_23b579e1_8088_45b5_9975_064014026c42 "width"^^xsd:string ;
emmo:EMMO_967080e5_2f42_4eb2_a3a9_c58143e835f9 "Number of pixels along the image width."@en .
<http://onto-ns.com/meta/matchmaker/0.2/SEMImage#labels> a owl:Class ;
rdfs:subClassOf [ a owl:Restriction ;
owl:hasValue <http://onto-ns.com/meta/matchmaker/0.2/SEMImage#labels_dimension0> ;
owl:onProperty emmo:EMMO_0a9ae0cb_526d_4377_9a11_63d1ce5b3499 ],
[ a owl:Restriction ;
owl:onProperty emmo:EMMO_e5a34647_a955_40bc_8d81_9b784f0ac527 ;
owl:someValuesFrom emmo:EMMO_5f334606_f67d_4f0e_acb9_eeb21cb10c66 ],
emmo:EMMO_28fbea28_2204_4613_87ff_6d877b855fcd,
emmo:EMMO_50d6236a_7667_4883_8ae1_9bb5d190423a ;
skos:prefLabel "Labels"@en ;
emmo:EMMO_967080e5_2f42_4eb2_a3a9_c58143e835f9 "The label of each channel. For elemental mapping this should be the chemical symbol of the element or BSE for the back-scattered image."@en .
<http://onto-ns.com/meta/matchmaker/0.2/SEMImage#labels_dimension0> a emmo:EMMO_b4c97fa0_d82c_406a_bda7_597d6e190654 ;
skos:prefLabel "labels_dimension0"@en ;
emmo:EMMO_23b579e1_8088_45b5_9975_064014026c42 "channels"^^xsd:string ;
emmo:EMMO_967080e5_2f42_4eb2_a3a9_c58143e835f9 "Number of channels."@en .
<http://onto-ns.com/meta/matchmaker/0.2/SEMImage#metadata> a owl:Class ;
rdfs:subClassOf emmo:EMMO_194e367c_9783_4bf5_96d0_9ad597d48d9a,
emmo:EMMO_50d6236a_7667_4883_8ae1_9bb5d190423a ;
skos:prefLabel "Metadata"@en ;
emmo:EMMO_967080e5_2f42_4eb2_a3a9_c58143e835f9 "Reference to data model for SEM metadata."@en .
<http://onto-ns.com/meta/matchmaker/0.2/SEMImage#pixelheight> a owl:Class ;
rdfs:subClassOf [ a owl:Restriction ;
owl:onClass emmo:Metre ;
owl:onProperty emmo:EMMO_bed1d005_b04e_4a90_94cf_02bc678a8569 ;
owl:qualifiedCardinality "1"^^xsd:nonNegativeInteger ],
emmo:EMMO_50d6236a_7667_4883_8ae1_9bb5d190423a,
emmo:EMMO_52fa9c76_fc42_4eca_a5c1_6095a1c9caab ;
skos:prefLabel "Pixelheight"@en ;
emmo:EMMO_967080e5_2f42_4eb2_a3a9_c58143e835f9 "Height of each pixel."@en ;
oteio:datasize "8"^^xsd:nonNegativeInteger .
<http://onto-ns.com/meta/matchmaker/0.2/SEMImage#pixelwidth> a owl:Class ;
rdfs:subClassOf [ a owl:Restriction ;
owl:onClass emmo:Metre ;
owl:onProperty emmo:EMMO_bed1d005_b04e_4a90_94cf_02bc678a8569 ;
owl:qualifiedCardinality "1"^^xsd:nonNegativeInteger ],
emmo:EMMO_50d6236a_7667_4883_8ae1_9bb5d190423a,
emmo:EMMO_52fa9c76_fc42_4eca_a5c1_6095a1c9caab ;
skos:prefLabel "Pixelwidth"@en ;
emmo:EMMO_967080e5_2f42_4eb2_a3a9_c58143e835f9 "Width of each pixel."@en ;
oteio:datasize "8"^^xsd:nonNegativeInteger .
<https://he-matchmaker.eu/data/sem/SEM_cement_batch2> a dcat:Dataset,
<https://w3id.com/emmo/domain/sem/0.1#SEMImageSeries>,
emmo:EMMO_194e367c_9783_4bf5_96d0_9ad597d48d9a ;
dcterms:creator "Sigurd Wenner" ;
dcterms:description "..." ;
dcterms:title "Nested series of SEM images of cement batch2" ;
dcat:contactPoint "Sigurd Wenner <Sigurd.Wenner@sintef.no>" ;
dcat:distribution [ a dcat:Distribution ;
dcat:downloadURL "sftp://nas.aimen.es/P_MATCHMAKER_SHARE_SINTEF/SEM_cement_batch2" ;
dcat:mediaType "inode/directory" ] .
<https://he-matchmaker.eu/data/sem/SEM_cement_batch2/77600-23-001> a dcat:Dataset,
<https://w3id.com/emmo/domain/sem/0.1#SEMImageSeries>,
emmo:EMMO_194e367c_9783_4bf5_96d0_9ad597d48d9a ;
dcterms:creator "Sigurd Wenner" ;
dcterms:description "Back-scattered SEM image of cement sample 77600, polished with 1 µm diamond compound." ;
dcterms:title "Series of SEM image of cement sample 77600" ;
dcat:contactPoint "Sigurd Wenner <Sigurd.Wenner@sintef.no>" ;
dcat:distribution [ a dcat:Distribution ;
dcat:downloadURL "sftp://nas.aimen.es/P_MATCHMAKER_SHARE_SINTEF/SEM_cement_batch2/77600-23-001" ;
dcat:mediaType "inode/directory" ] ;
dcat:inSeries <https://he-matchmaker.eu/data/sem/SEM_cement_batch2> .
find¶
Search the triplestore for documented resources.
Running datadoc find --help
will show the following help message:
usage: datadoc find [-h] [--type TYPE]
[--criteria KEYWORD=VALUE [KEYWORD=VALUE ...]]
[--output FILENAME] [--format {iris,json,turtle,csv}]
options:
-h, --help show this help message and exit
--type TYPE, -t TYPE Either a resource type (ex: "dataset", "distribution",
...) or the IRI of a class to limit the search to.
--criteria KEYWORD=VALUE [KEYWORD=VALUE ...], -c KEYWORD=VALUE [KEYWORD=VALUE ...]
One of more additional matching criteria for resources
to find. Only resources with the given KEYWORD and
VALUE will be matched. The match is exact.
--output FILENAME, -o FILENAME
Write matching output to the given file. The default
is to write to standard output.
--format {iris,json,turtle,csv}, -f {iris,json,turtle,csv}
Output format to list the matched resources. The
default is to infer from the file extension if
--output is given. Otherwise it defaults to "iris".
The --type
and --criteria
options provide search criteria.
The --type
option an be any of the recognised resource types to limit the search to.
Alternatively, it may be the IRI of a class.
This limits the search to only resources that are individuals of this class.
The --output
options allows to write the matching output to file instead of standard output.
The --format
option controls how the search result should be presented.
The following formats are currently available:
- iris: Return the IRIs of matching resources.
- json: Return a JSON array with documentation of matching resources.
-
turtle: Return a turtle representation of matching resources.
Note: In case the matching resources are datasets with a
datamodel
keyword, the serialised data model will also be included in the turtle output. -
csv: Return a CSV table with the matching resources.
Examples
For all the examples below, we use --parse
option to pre-load the triplestore from the kb.ttl
file that we generated in the previous example.
Ex 1: List IRIs of all datasets:
$ datadoc --parse=kb.ttl find --type=dataset
https://he-matchmaker.eu/data/sem/SEM_cement_batch2/77600-23-001/77600-23-001_5kV_400x_m001
https://he-matchmaker.eu/sample/SEM_cement_batch2/77600-23-001
https://he-matchmaker.eu/data/sem/SEM_cement_batch2
https://he-matchmaker.eu/data/sem/SEM_cement_batch2/77600-23-001
Ex 2: List IRIs of all samples (individuals of chameo:Sample
):
$ datadoc --parse=kb.ttl find --type=chameo:Sample
https://he-matchmaker.eu/sample/SEM_cement_batch2/77600-23-001
Ex 3: List IRIs of all resources with a given title:
$ datadoc --parse=kb.ttl find --criteria title="Series of SEM image of cement sample 77600"
https://he-matchmaker.eu/data/sem/SEM_cement_batch2/77600-23-001
Ex 4: List all sample individuals as JSON:
$ datadoc --parse=kb.ttl find --type=chameo:Sample --format=json
[
{
"@id": "https://he-matchmaker.eu/data/sem/SEM_cement_batch2/77600-23-001",
"@type": [
"http://www.w3.org/ns/dcat#Dataset",
"https://w3id.com/emmo/domain/sem/0.1#SEMImageSeries",
"https://w3id.org/emmo#EMMO_194e367c_9783_4bf5_96d0_9ad597d48d9a"
],
"creator": "Sigurd Wenner",
"description": "Back-scattered SEM image of cement sample 77600, polished with 1 \u00b5m diamond compound.",
"title": "Series of SEM image of cement sample 77600",
"contactPoint": "Sigurd Wenner <Sigurd.Wenner@sintef.no>",
"distribution": {
"@id": "_:nac2552b949a94ef391080807ca2a02e4b14",
"@type": "http://www.w3.org/ns/dcat#Distribution",
"downloadURL": "sftp://nas.aimen.es/P_MATCHMAKER_SHARE_SINTEF/SEM_cement_batch2/77600-23-001",
"mediaType": "inode/directory"
},
"inSeries": "https://he-matchmaker.eu/data/sem/SEM_cement_batch2"
}
]
Ex 5: Show the documentation of a resource with a given IRI as JSON:
$ datadoc --parse=kb.ttl find --criteria @id=https://he-matchmaker.eu/data/sem/SEM_cement_batch2/77600-23-001 --format=json
This will show the same output as in Ex 4.
load¶
Loads documented dataset from a storage.
Running datadoc load --help
will show the following help message:
usage: datadoc load [-h] [--output FILENAME] iri
positional arguments:
iri IRI of dataset to load.
options:
-h, --help show this help message and exit
--output FILENAME, -o FILENAME
Write the dataset to the given file. The default is to
write to standard output.
Note
The load
subcommand is specific for datasets since it uses DCAT documentation of how to fetch the dataset.
The positional iri
argument is the IRI of the documented dataset to load.
The --output
option allows to write the dataset to a local file.
Example
Save the dataset https://he-matchmaker.eu/data/sem/SEM_cement_batch2/77600-23-001/77600-23-001_5kV_400x_m001
documented in kb.ttl
to file:
$ datadoc -p kb.ttl load https://he-matchmaker.eu/data/sem/SEM_cement_batch2/77600-23-001/77600-23-001_5kV_400x_m001 -o cement.tif
This should create the file cement.tif
containing the image data.