Base registers and authority data as precursors for Linked Data
Base registers are central components of a Linked Data ecosystem. Together with commonly used data models or ontologies, they ensure that data sets can be linked with each other even across organisational boundaries; without them, Linked Data would not be possible. Based on an ongoing project that aims to advance the publication of Linked Open Data by Swiss authorities, we describe the status quo and the planned measures to systematically promote the publication of relevant base registers and vocabularies.
As described in an earlier article (Estermann 2019), a project commissioned by eGovernment Switzerland will identify those data sets that can serve as base registers or central vocabularies in connection with the publication of Linked Open Data (LOD) by Swiss authorities. Their timely publication as Linked Open Data would promote the linking of public authority data. The importance of this topic in Switzerland was also evident at the Opendata.ch/2019 unconference held at the beginning of July: participants rated the question of which base registers and vocabularies Swiss authorities should publish as LOD as one of the most important, and it was dealt with in a workshop.
In order to identify those base registers and vocabularies that have the greatest potential for use in the context of Swiss government data, the Bern University of Applied Sciences conducted an initial screening of data sets as part of an eGovernment Switzerland project. Two approaches were pursued in parallel:
- Screening of existing databases of Swiss authorities with regard to their suitability as base registers or vocabularies.
- Screening of Wikidata for suitability as a base register or vocabulary in connection with data publication by Swiss authorities.
The screening was supplemented by a survey of Swiss authorities that already publish data as Linked Data or plan to do so in the near future. In the process, additional data from the field of memory institutions and digital humanities was identified, especially in the area of archives and libraries. In the following, the advantages and disadvantages of these different types of data sources are briefly discussed and initial shortlists are presented, which will then be commented on and supplemented by the Swiss LOD community in an open process.
Data holdings of Swiss authorities
Most of the data holdings of the Swiss authorities are created and maintained on the basis of a legal mandate. It can therefore be assumed not only that the data are of high quality, but also that the continuity of data publication is guaranteed, i.e. that the data will continue to be maintained and made available in the future.
However, the mere fact that data are provided by public authorities is no guarantee of quality. Data quality must be thought of as a process and only becomes tangible in connection with concrete applications. Diverse and frequent use generally increases data quality, since errors and deficiencies are often only discovered when the data are used. Some official data (e.g. commercial register, municipal directory) can be assumed to be used regularly and in different contexts; for others (e.g. cantonal monument lists), the previous context and frequency of use remain largely unknown.
Unfortunately, only a few public administration data sets are published as Linked Open Data today, and the feasibility of such publication and the willingness of the various data holders generally still need to be clarified. Based on the screening and the outcome of the above-mentioned workshop, we have drawn up an initial shortlist of data sets from Swiss public authorities that could serve as base registers or controlled vocabularies in connection with the publication of Swiss public authority data as Linked Open Data:
|Name||Responsible authority||Short description|
|UID register||FSO||All companies operating in Switzerland are listed in the UID register. The information on the companies is accessible to the administration (UID offices), to the company itself and partly to the public.|
|Commercial register||Cantonal commercial register offices||In Switzerland, the commercial registers are organised in a decentralised manner and are kept by the cantons. The commercial registers are public and serve to constitute and identify companies. Their purpose is to record and disclose facts relevant to commercial and corporate law, thereby helping to ensure legal certainty and protect third parties.|
|TERMDAT||Federal Chancellery (FC)||TERMDAT is the multilingual terminology database of the Swiss federal administration and contains, among other things, the official names of all federal offices. A prototype of a partial Linked Data implementation already exists.|
|Nomenclatures||FSO||Various FSO nomenclatures. In addition, versioned matching between postcodes and FSO municipality numbers would be desirable.|
|Official directory of localities||swisstopo||Official list of localities with postcode and perimeter.|
|Federal Register of Buildings and Dwellings (GWR)||FSO||Records the most important basic data on buildings and dwellings in Switzerland for statistical and administrative purposes.|
|NOGA||FSO||The “general classification of economic activities” (Nomenclature générale des activités économiques) ensures that sector names are used consistently in statistical evaluations.|
|ISCO||FSO||The International Standard Classification of Occupations ensures that occupational names are used consistently in statistical evaluations.|
This list should be understood as a suggestion of which existing datasets should be published as Linked Open Data with the highest priority from a usage perspective.
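To illustrate why a shared register matters for linking, the following is a minimal sketch, not a description of any existing implementation. It assumes two publishers that both reference companies by their UID; the base URI, the field names and the records are hypothetical placeholders.

```python
# Hypothetical base URI; a real register would define its own URI scheme.
BASE_URI = "https://example.org/register/uid/"


def uid_to_uri(uid: str) -> str:
    """Turn a company identifier into a shared URI (illustrative scheme)."""
    return BASE_URI + uid.strip().upper()


def link_by_register(left: list[dict], right: list[dict], key: str = "uid") -> dict[str, dict]:
    """Merge records from two publishers that reference the same register entry."""
    merged: dict[str, dict] = {}
    for record in left + right:
        uri = uid_to_uri(record[key])
        # Keep every attribute except the identifier itself, which is now the URI.
        attributes = {k: v for k, v in record.items() if k != key}
        merged.setdefault(uri, {}).update(attributes)
    return merged


# Placeholder records, not real companies.
statistics = [{"uid": "CHE-000.000.001", "employees": 12}]
permits = [{"uid": "che-000.000.001 ", "permit": "A-17"}]

linked = link_by_register(statistics, permits)
print(linked)
```

Because both publishers point to the same register entry, their records end up under one URI without any bilateral agreement on data formats; only the identifier has to be shared.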
Wikidata
Data sets in Wikidata have the advantage of very good coverage thanks to the crowdsourcing approach, and missing data can easily be created or added. In addition, data from Wikidata can be integrated immediately with a worldwide Linked Data cloud, since reconciliation with other data sets takes place during data entry rather than only after publication, as is often the case with other data sets. However, the crowdsourcing approach also leads to certain problems, especially with regard to data quality. Quality can only be ensured with additional effort, e.g. by restricting use to data backed by reliable sources. Furthermore, there is a considerable need for data cleansing and for harmonising modelling practices in various areas. Here too, based on the screening, we have drawn up an initial shortlist of data sets in Wikidata that could serve as base registers or controlled vocabularies in connection with the LOD publication of Swiss government data:
|Name||Wikidata query||No. of entries (June 2019)|
|Administrative units of Switzerland||https://w.wiki/53U||5139|
|Swiss memory institutions||https://w.wiki/5Gm||2169|
|People born in Switzerland||https://w.wiki/53V||24537|
|People who died in Switzerland||https://w.wiki/53X||13396|
|People with Swiss nationality||https://w.wiki/53Z||31006|
|People with a connection to Switzerland (citizenship, place of birth or death, place of work or residence)||https://w.wiki/53c||40549|
|Buildings in Switzerland||https://w.wiki/53f||20147|
|Swiss Cultural Property of National or Regional Importance (KGS Inventory)||https://w.wiki/53j||13121|
|Water bodies in Switzerland||https://w.wiki/53q||2942|
|Mountains in Switzerland||https://w.wiki/53r||7965|
|Human sex or gender (vocabulary)||https://w.wiki/546||10+|
|Materials from which objects are made (vocabulary)||https://w.wiki/548||3318|
|Colours used to identify objects (vocabulary)||https://w.wiki/54D||61|
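For readers unfamiliar with Wikidata queries, the following sketch shows the general shape such a query might take. It is not the query behind any of the short links in the table; the function name and the default limit are assumptions made for illustration. The identifiers used are standard Wikidata ones: Q39 (Switzerland), Q8502 (mountain), P31 (instance of), P279 (subclass of) and P17 (country).

```python
def build_class_in_country_query(class_qid: str, country_qid: str = "Q39", limit: int = 100) -> str:
    """Build a SPARQL query for items of a class (or one of its subclasses) in a country."""
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        # Instance of the class, or of any subclass of it.
        f"  ?item wdt:P31/wdt:P279* wd:{class_qid} .\n"
        # Located in the given country (default Q39: Switzerland).
        f"  ?item wdt:P17 wd:{country_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "de,fr,it,en" . }\n'
        "}\n"
        f"LIMIT {limit}"
    )


query = build_class_in_country_query("Q8502")  # Q8502: mountain
print(query)
```

The resulting string can be pasted into the Wikidata Query Service. Counts obtained this way will differ from the June 2019 figures in the table, since Wikidata changes continuously.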
It could also be interesting to publish official authority data directly in Wikidata. This would have the advantage of directly opening up a high potential for use in an international context, since the data can be combined more easily with data from other countries. Such an approach is particularly useful for topics that are also to be covered in Wikipedia articles. In order to ensure the semantic interoperability of data across national borders, appropriate coordination between the data publishing bodies is required. If this is not already being done elsewhere, this coordination can take place directly within the Wikidata community.
Data from the field of memory institutions and digital humanities
The National Library and the two archives interviewed also pointed out the importance of international authority data and vocabularies. These include, for example, the Gemeinsame Normdatei (GND), which is maintained cooperatively by the German National Library and the German-language library networks, as well as the Virtual International Authority File (VIAF) and the Dewey Decimal Classification, both of which are operated by the US-based Online Computer Library Center (OCLC). With regard to the networking of Swiss holdings, other authority data and directories that relate specifically to Switzerland also play a role:
|Name||Responsible institution||Short description|
|Gemeinsame Normdatei (GND)||German National Library||Authority file for persons, corporate bodies, conferences, geographic entities, subject headings and works. It is mainly used for the cataloguing of literature in libraries, but is also increasingly used by archives, museums, projects and in web applications.|
|Virtual International Authority File (VIAF)||OCLC||Virtual international authority file linking 25 national authority files via a concordance file.|
|Dewey Decimal Classification||OCLC||The most widely used international classification for indexing library holdings. It is mainly used in the English-speaking world.|
|Photography Metadata||Photo CH||Metadata on Swiss photographers and photography holdings (photographers, places of work, institutions, holdings, exhibitions).|
|Inventory of research libraries in Switzerland||Swissbib/UB Basel||Data on the approximately 900 Swiss research libraries connected to the library metacatalogue of Swissbib.|
|Authority files on Swiss history||histHub||Named entities (persons, places), typologies (professions, place types) and vocabularies (first names, concepts) relevant to historical holdings on Switzerland. Some of these are still under construction.|
|Metadata of the Historical Dictionary of Switzerland||HLS||Metadata on entries in the Historical Dictionary of Switzerland (coordinates, persons, organisations, links to GND and VIAF).|
|Metagrid||SAGW / Dodis||Concordance file for historical reference data with reference to Switzerland.|
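The idea of a concordance file, as mentioned for VIAF and Metagrid, can be sketched as follows. This is a simplified illustration, not how either service is implemented: it assumes that the input is a list of pairwise "same entity" links and groups them into clusters with a union-find structure. All identifiers are made-up placeholders.

```python
def build_concordance(links: list[tuple[str, str]]) -> list[set[str]]:
    """Group identifiers from different authority files into clusters of equivalents."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        # Follow parent pointers to the cluster representative, compressing the path.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        # Merge the clusters of the two linked identifiers.
        parent[find(a)] = find(b)

    clusters: dict[str, set[str]] = {}
    for identifier in parent:
        clusters.setdefault(find(identifier), set()).add(identifier)
    return list(clusters.values())


# Placeholder identifiers, not real authority records.
links = [
    ("gnd:X1", "viaf:Y1"),
    ("viaf:Y1", "hls:Z1"),
    ("gnd:X2", "viaf:Y2"),
]
clusters = build_concordance(links)
print(sorted(sorted(c) for c in clusters))
```

The second link connects the HLS placeholder to the GND placeholder indirectly via VIAF, which is the practical benefit of a concordance: each data holder only needs to link to one authority file to become reachable from all the others.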
Historicised databases as a major challenge
The availability and use of historicised data holdings poses a particular challenge. This topic is raised again and again in discussions about the publication of Open Government Data as Linked Data, including at the workshop mentioned above. The issue is not only availability itself, which is still incomplete today (for example, for municipal perimeters). It is also how different historicised data sets can be linked: this is often not easy today, as the various data sets follow different historicisation approaches.
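One common historicisation approach is to attach a validity period to every record. The sketch below illustrates this for a versioned matching between postcodes and municipality numbers; it is an assumption about how such a matching could be modelled, not a description of an existing register. The postcode and municipality numbers are made-up placeholders, and the model is deliberately simplified to one municipality per postcode at any given time, which real data would not guarantee.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Assignment:
    postcode: str
    municipality_no: int
    valid_from: date           # inclusive
    valid_to: Optional[date]   # exclusive; None means still valid


def municipality_on(assignments: list[Assignment], postcode: str, day: date) -> Optional[int]:
    """Return the municipality number assigned to a postcode on a given day."""
    for a in assignments:
        still_valid = a.valid_to is None or day < a.valid_to
        if a.postcode == postcode and a.valid_from <= day and still_valid:
            return a.municipality_no
    return None


# Placeholder history: the postcode moves to a new municipality number in 2015,
# as it might after a municipal merger.
history = [
    Assignment("0001", 9001, date(2000, 1, 1), date(2015, 1, 1)),
    Assignment("0001", 9002, date(2015, 1, 1), None),
]

print(municipality_on(history, "0001", date(2010, 6, 1)))
print(municipality_on(history, "0001", date(2020, 6, 1)))
```

Two data sets can only be joined reliably across time if both follow a compatible convention, for instance the same choice between inclusive and exclusive end dates. Differences of this kind are one reason why linking historicised data sets is difficult.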
As the survey of Swiss authorities that already publish data as Linked Data or plan to do so in the near future shows, the additional effort put into preparing the data and linking it with other holdings is motivated by the expectation that:
- improved search in the holdings can be offered in the future (e.g. multilingual search in historical holdings of the Federal Archives; geolocalised search in holdings of the State Archives of Basel-City);
- new insights can be generated (e.g. linking of FOEN data holdings or information from the commercial register with statistical key figures from the FSO; integration of semantically enriched archive catalogues in research environments); and
- transparency can be increased (e.g. tariffs of Swiss electricity suppliers; data from electricity market monitoring).
The tables above reflect the current status regarding the base registers and vocabularies that should be made available as Linked Data with the highest priority from a user perspective. In the coming weeks, we will be seeking further input from the Swiss LOD community to add to the tables and the list of possible usage scenarios, so that we end up with a broadly supported and prioritised list of base registers and vocabularies.
In a next step, we will work through this list in dialogue with the data holders in order to take into account not only the potential for use but also the evaluation criteria of “feasibility” and “willingness of the data holder” (see Estermann 2019). The result of this next step will be several data sets prepared for LOD, as well as an analysis of the challenges and hurdles involved in converting further data sets to Linked Data. Based on this analysis, recommendations for further action will then be formulated. The first part of the article has already been published.
- Estermann, B. (2019). “The central role of basic registers and norm data in breaking data silos”. In: SocietyByte, June-July 2019.