Transform a vocabulary from spreadsheet/CSV format
Transform a vocabulary into RVA editor spreadsheet format from any other spreadsheet/CSV format
1. Introduction
To support our partners in making their vocabularies available and browsable via the RVA portal, we provide this guide and an accompanying ingestion template, which together outline a transformation and ingestion process. Section 3 walks through a completed example of this process. If you have any additional questions about the transformation or ingestion of your vocabulary, please contact services@ardc.edu.au.
2. Getting started: Questions about your vocabulary
File formats
In what format is the vocabulary currently being maintained/stored?
Examples of formats in which a vocabulary may be stored: Spreadsheet, CSV, PDF, text, HTML, RDF, database tables, etc.
Has ARDC already developed a transformation and ingestion process for that format? Current processes can be found here.
Has your organisation developed a process to transform the current format to RDF? If not, we will work with you to develop a process.
Note: This guide covers vocabularies whose semantic model can be adequately expressed within the constraints of a spreadsheet or comma-separated values (CSV) format. For information about the transformation and ingestion of vocabularies that are maintained/stored in other formats, please consult our other transformation and ingestion guides.
Concept definitions
How are the vocabulary concepts currently being described?
What are the elements used to describe metadata about the concepts?
What do these elements mean?
How do the current elements used to describe metadata about the concepts map to the vocabulary ingestion template?
In order for your vocabulary to be ingested into Research Vocabularies Australia, the information provided in the original format needs to be translated into the ingestion template provided below.
Vocabulary ingestion template [spreadsheet-csv].xlsx
The template allows ARDC partners to indicate what information about the vocabulary should be captured within the following elements:
Element | Field |
---|---|
URI | <uri> |
Scheme | <scheme> |
Concept | <concept> |
Preferred label | <prefLabel> |
Alternate label | <altLabel> |
Hidden label | <hiddenLabel> |
Notation | <notation> |
Scope note | <scopeNote> |
Example | <example> |
Definition | <definition> |
Exact match | <exactMatch> |
Close match | <closeMatch> |
Related match | <relatedMatch> |
Broader match | <broaderMatch> |
Broader | <broader> |
Related | <related> |
And the following tag:
Language | @lang |
---|---|
Any language code scheme may be used. Guidelines for best practice in language tag usage can be found here.
Multiple languages: If your vocabulary is multilingual, each concept may have more than one prefLabel, as long as each prefLabel is designated with a different language tag. For example, the concept “potato” might have two preferred labels: prefLabel "Potato"@en; prefLabel "Kartoffel"@de.
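The one-prefLabel-per-language rule can be checked before ingestion. The following is a minimal sketch only: the function name `duplicate_pref_label_languages` and the (label, language tag) pair layout are assumptions made for illustration, not part of the ingestion template.

```python
from collections import Counter


def duplicate_pref_label_languages(pref_labels):
    """Return the language tags used by more than one preferred label.

    `pref_labels` is a list of (label, language_tag) pairs for ONE concept.
    Tags are compared case-insensitively, so "EN" and "en" count as the same.
    """
    counts = Counter(lang.lower() for _label, lang in pref_labels)
    return sorted(lang for lang, n in counts.items() if n > 1)


# The "potato" example from the text: one label per language, so no clash.
potato = [("Potato", "en"), ("Kartoffel", "de")]
print(duplicate_pref_label_languages(potato))

# A hypothetical clash: two preferred labels both tagged "en".
clash = [("Potato", "en"), ("Spud", "en"), ("Kartoffel", "de")]
print(duplicate_pref_label_languages(clash))
```

An empty result means every preferred label of the concept has its own language tag; any tag in the result needs one of its labels moved to an alternate label or retagged.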
This is not a complete list of all elements which can be captured for your vocabulary in the ARDC Vocabulary Service. If your organization captures extra information that does not fall under the listed elements or tag, we can work with you to create a solution for including that information in your transformation. Please contact services@ardc.edu.au if you have any questions about your transformation process.
Hierarchical structure
In order for the vocabulary to be ingested properly, the hierarchy (narrower and broader nature of the concepts) must be notated in a machine-readable way. This may require some reorganization of the concepts for insertion into the ingestion template.
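As one illustration of what "machine-readable" hierarchy can look like, the sketch below derives a broader concept for each label from its depth in an ordered list. The (depth, label) layout, the function name `broader_from_depth`, and the sample concepts are all assumptions for illustration; your vocabulary may record hierarchy differently, and the sketch assumes labels are unique.

```python
def broader_from_depth(rows):
    """Map each label to its broader label (None for top-level concepts).

    `rows` is a list of (depth, label) pairs in document order, where
    depth 0 is the top level of the hierarchy.
    """
    broader = {}
    ancestors = []  # ancestors[d] is the most recent label seen at depth d
    for depth, label in rows:
        del ancestors[depth:]  # forget labels at this depth or deeper
        broader[label] = ancestors[-1] if ancestors else None
        ancestors.append(label)
    return broader


# Hypothetical sample data, not taken from any real vocabulary.
sample = [
    (0, "Vegetables"),
    (1, "Root vegetables"),
    (2, "Potato"),
    (2, "Carrot"),
    (1, "Leaf vegetables"),
]
for label, parent in broader_from_depth(sample).items():
    print(label, "->", parent)
```

Recording the result as an explicit label-to-broader mapping makes it easy to spot concepts that ended up without a parent by mistake.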
Additional preprocessing considerations
What preprocessing needs to be done?
Are there any additional requirements of the vocabulary owners or other stakeholders that might impact the transformation or ingestion of the vocabulary into the ARDC Vocabulary Service?
Have all non-ingestible (non-ASCII) symbols been removed?
Is the vocabulary multilingual (does it include content in multiple languages)? If so, please provide ARDC with a list of languages used in the vocabulary prior to ingestion.
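The non-ASCII check in the list above can be automated. This is a minimal sketch under the assumption that the vocabulary cells have already been read into a list of strings; the function name `find_non_ascii` and the sample labels are hypothetical.

```python
def find_non_ascii(cells):
    """Return (cell_index, character) for every non-ASCII character found."""
    hits = []
    for index, text in enumerate(cells):
        for ch in text:
            if ord(ch) > 127:  # ASCII covers code points 0 to 127
                hits.append((index, ch))
    return hits


# Hypothetical sample cells: only the second one contains a non-ASCII character.
labels = ["Analytical Chemistry", "Café culture", "Plain text"]
print(find_non_ascii(labels))
```

The function only reports what it finds rather than deleting anything, so a person can decide how each flagged character should be handled.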
3. ANZSRC-FOR: An example transformation
The process of vocabulary transformation and ingestion has been performed on the ANZSRC-FOR vocabulary, and the artefacts from that process are provided here as an example for future use.
This is just one example of the transformation of a vocabulary, and is meant to be used as a learning tool. The steps taken in order to transform your vocabulary may vary from those outlined below. Please contact services@ardc.edu.au if you have any questions about your transformation process.
In what format is the vocabulary currently being maintained/stored?
The vocabulary was initially provided as a spreadsheet in Microsoft .xls format, which can be viewed here. An annotated version of this original document (in Google Docs spreadsheet format) can be viewed here. (Completed transformed versions of the ANZSRC-FOR are available in spreadsheet format and CSV format.)
How are the vocabulary concepts currently being described?
The ANZSRC-FOR vocabulary spreadsheet includes the title of the vocabulary, some column headings to explain how to read the spreadsheet (shown below in grey cells), names of vocabulary concepts, and codes that correspond to the concepts. This structure and code scheme are explained in detail by the Australian Bureau of Statistics here.

Preprocessing of the vocabulary
Examination of the vocabulary in its original spreadsheet format reveals that the given column headings (shown below in grey cells) do not provide all of the information we need to transform the spreadsheet. There are multiple types of information recorded in individual columns:

In order for ANZSRC-FOR to be ingested into the ARDC Vocabulary Service, the content provided in the original spreadsheet needs to be entered into the ingestion template provided by ARDC. The template allows us to indicate what original vocabulary content should be captured within the following elements:
Element | Field |
---|---|
URI | <uri> |
Concept | <concept> |
Notation | <notation> |
In the case of ANZSRC-FOR, the elements used are unique identifier, concept and notation. Because the original ANZSRC-FOR spreadsheet doesn’t include content such as concept definitions or alternate labels for concepts (and in fact, these pieces of information don’t exist for this particular vocabulary), those columns are left blank in the completed ANZSRC-FOR ingestion template example.
The ingestion template allows for ARDC partners to capture information about the hierarchical structure of their vocabulary and metadata about the concepts in one document.
Concept metadata
A number of steps were performed in order to properly record metadata about the ANZSRC-FOR concepts in the ingestion template.
1. The preferred labels were copied from columns B, C and D of the original spreadsheet into the “concept” columns of the template. The codes corresponding to those labels (from columns A, B and C of the original spreadsheet) were copied into the column titled “notation”, ensuring that each code was placed in the same row as its label.
For example:
A | B | C | D |
---|---|---|---|
Research Classification - Field of Research | | | |
Level 1 | Level 2 | Level 3 | |
02 | Physical Sciences | | |
 | 0201 | Astronomical and Space Sciences | |
 | | 020101 | Astrobiology |
 | | 020102 | Astronomical and Space Instrumentation |
becomes:
C | D | E | F |
---|---|---|---|
concept | concept | concept | notation |
Physical Sciences | | | 02 |
 | Astronomical and Space Sciences | | 0201 |
 | | Astrobiology | 020101 |
 | | Astronomical and Space Instrumentation | 020102 |
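The copy-and-paste step above could also be scripted. The following is a minimal sketch, not the process ARDC used: it assumes each original row has four cells (columns A to D), that a code is a digits-only cell, and that its label sits in the cell immediately to its right. The function name `to_template_rows` and the in-memory sample rows are illustrative.

```python
def to_template_rows(rows):
    """Convert staggered code/label rows into concept x3 + notation rows."""
    out = [["concept", "concept", "concept", "notation"]]
    for row in rows:
        cells = [cell.strip() for cell in row]
        # A code in column A, B or C marks hierarchy level 0, 1 or 2.
        for level in range(3):
            code, label = cells[level], cells[level + 1]
            if code.isdigit() and label:
                new_row = ["", "", "", code]  # notation goes in the last column
                new_row[level] = label        # label goes in its level's column
                out.append(new_row)
                break
        # Rows with no code (titles, column headings) are skipped.
    return out


original = [
    ["Research Classification - Field of Research", "", "", ""],
    ["02", "Physical Sciences", "", ""],
    ["", "0201", "Astronomical and Space Sciences", ""],
    ["", "", "020101", "Astrobiology"],
    ["", "", "020102", "Astronomical and Space Instrumentation"],
]
for template_row in to_template_rows(original):
    print(template_row)
```

Keeping each label in the column that matches its level preserves the hierarchy, while the notation column always holds the code from the same original row.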
2. Because ANZSRC-FOR is a monolingual vocabulary (English language), no language tags are necessary.
3. Unique identifiers for each concept were created from the preferred labels by converting the labels to lowercase and inserting hyphens between the words (using the =lower and =substitute spreadsheet functions), and by deleting all punctuation and any text within parentheses.
For example:
Preferred label of Analytical Chemistry becomes unique identifier analytical-chemistry
Preferred label of Automotive Combustion and Fuel Engineering (incl. Alternative/Renewable Fuels) becomes unique identifier automotive-combustion-and-fuel-engineering
Unique identifiers corresponding with labels were used to create URIs for each concept (using the predefined URI structure) and were inserted into the URI column of the spreadsheet.
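The identifier rules above can be expressed in a few lines of code as an alternative to spreadsheet functions. This is a sketch under stated assumptions: the function names `make_identifier` and `make_uri` are illustrative, and `BASE_URI` is a hypothetical placeholder because the predefined URI structure is not given in this guide.

```python
import re

# Hypothetical placeholder; substitute the predefined URI structure agreed for your vocabulary.
BASE_URI = "http://example.org/vocab/"


def make_identifier(pref_label):
    """Build a lowercase, hyphen-separated identifier from a preferred label."""
    text = re.sub(r"\([^)]*\)", " ", pref_label)  # delete text within parentheses
    # Delete punctuation (keeps letters, digits, underscores, whitespace, hyphens).
    text = re.sub(r"[^\w\s-]", "", text)
    words = text.lower().replace("-", " ").split()
    return "-".join(words)


def make_uri(pref_label):
    """Append the identifier to the (assumed) base URI."""
    return BASE_URI + make_identifier(pref_label)


print(make_identifier("Analytical Chemistry"))
print(make_identifier(
    "Automotive Combustion and Fuel Engineering (incl. Alternative/Renewable Fuels)"
))
print(make_uri("Analytical Chemistry"))
```

Because two different labels can reduce to the same identifier once punctuation and parenthesised text are removed, it is worth checking the generated identifiers for duplicates before building URIs from them.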
Completion of ANZSRC-FOR example
The completed vocabulary ingestion template for the ANZSRC-FOR is available in spreadsheet format and CSV format.