Suggested schema change
The proposal is to extend the vocabulary for the collection type attribute. Existing types are:
- catalogueOrIndex: collection of resource descriptions describing the content of one or more repositories or collective works
- collection: compiled content created as separate and independent works and assembled into a collective whole for distribution and use
- registry: collection of registry objects compiled to support the business of a given community
- repository: collection of physical or digital objects compiled for information and documentation purposes and/or for storage and safekeeping
- dataset: collection of physical or digital objects generated by research activities
Proposed additions to the collection @type value list:
- sourceCode: computer instructions written in a programming language (one or more software source files; a collection of source files within a software codebase), e.g. software used in research, such as models and workflows
- classificationScheme: a list or arrangement of terms used in a particular context eg. ontologies, thesauri
- publication: scholarly material consisting mainly of written text eg. journal publications, book chapters
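The extended value list can be sketched as a simple lookup, for example in a harvester that checks the type of incoming collection records. This is a minimal illustration only: the names EXISTING_TYPES, PROPOSED_TYPES and is_valid_collection_type are hypothetical and are not part of RIF-CS or any ANDS tool.

```python
# Existing collection @type values, as listed above.
EXISTING_TYPES = {"catalogueOrIndex", "collection", "registry", "repository", "dataset"}

# Proposed additions (not part of the vocabulary until the proposal is accepted).
PROPOSED_TYPES = {"sourceCode", "classificationScheme", "publication"}


def is_valid_collection_type(value, include_proposed=True):
    """Return True if value is an accepted collection @type (case-sensitive match)."""
    if include_proposed:
        allowed = EXISTING_TYPES | PROPOSED_TYPES
    else:
        allowed = EXISTING_TYPES
    return value in allowed
```

The include_proposed switch is an assumption made for this sketch, so that the same check can be run against the vocabulary before and after the change.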
Problem this suggestion addresses
These scenarios prompt us to extend the schema’s ability to encode research assets and to return to some of the concepts behind collections, collective works, resources, and items. The proposal returns to the ISO 2146 “collection” and its guidance for profiling and customising ISO 2146 for particular business domains. For the ANDS domain of research data, this document proposes options for extending collection types based on major ‘research assets’ (datasets, software, publications, standard vocabularies…). The collection sub-class in our domain covers content used as input and output of research.
Collective work is used in this International Standard to refer to compiled content created as separate and independent works and assembled into a collective whole for distribution and use. Examples include journals and newspapers, personal archives and datasets.
The approach under consideration is to extend the collection type list with some broad categories of assets relevant to the business of research (e.g. dataset, software, scholarly publication, classification scheme). The proposal establishes a model that may be extended in the future to other collective works that our community wants to register as output or input to research, since different research disciplines deal with different types of research materials. The proposed type list is not seen as exhaustive, but rather as a response to currently registered demand.
Granularity is an important concern in this proposal. Since collection types are obligatory for all schema users, the proposed collection types are deliberately generic so as to be relevant and intelligible to all users: dataset, source code, publication, classification scheme. It would be unacceptable to have dozens of terms at this level.
Of course further specificity is possible; source code, for example, may include a number of research applications such as models, workflows, or even more specifically “microarray spot detection and characterisation software”. Publications come in all shapes and sizes. Such finer-grained types are out of scope for this current proposal, although a logical future step would be to allow an optional second level of typing with values taken from global standards.
Collection vs Service
The proposal would allow “source code” to be a collection type. It is already possible to encode software in RIF-CS as a service type. The distinction between a Service of type ‘software’ and a Collection of type ‘sourceCode’ is:
- Service->software - an actionable service, e.g. a web service, API, provenance service, website or process (e.g. an executable file actually running as a process)
- Collection->sourceCode - a collective work of source code brought together into a collective whole for distribution and re-use
A service record might typically point to the web service interfaces of a system, whereas a collection record might point to a GitHub repository.
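The distinction can be illustrated by building the two kinds of record side by side. The following is a rough sketch using Python's standard ElementTree module. The helper make_registry_object, the keys, the group name and the titles are all hypothetical, the namespace URI is assumed and should be checked against the schema version in use, and a real RIF-CS record needs further elements (such as originatingSource) that are omitted here.

```python
import xml.etree.ElementTree as ET

# Assumed RIF-CS namespace; verify against the schema version in use.
RIFCS_NS = "http://ands.org.au/standards/rif-cs/registryObjects"


def make_registry_object(key, object_class, object_type, title):
    """Build a minimal registryObject holding one collection or service element."""
    if object_class not in ("collection", "service"):
        raise ValueError("object_class must be 'collection' or 'service'")
    ro = ET.Element(f"{{{RIFCS_NS}}}registryObject", {"group": "Example Group"})
    ET.SubElement(ro, f"{{{RIFCS_NS}}}key").text = key
    # The class (collection or service) is the element name; the type is an attribute.
    obj = ET.SubElement(ro, f"{{{RIFCS_NS}}}{object_class}", {"type": object_type})
    name = ET.SubElement(obj, f"{{{RIFCS_NS}}}name", {"type": "primary"})
    ET.SubElement(name, f"{{{RIFCS_NS}}}namePart").text = title
    return ro


# Source code held as a collective work (proposed collection type).
code_record = make_registry_object(
    "example.org/code/1", "collection", "sourceCode", "Model source code"
)
# The running, actionable counterpart (existing service type).
api_record = make_registry_object(
    "example.org/svc/1", "service", "software", "Model web service"
)
```

Calling ET.tostring(code_record, encoding="unicode") would serialise the sketch record for inspection.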
Publications and the ANDS registry
The proposal would allow “publications” to be a collection type in RIF-CS. This is not intended for use in Research Data Australia but for other systems that use RIF-CS and specialise in publications as a research asset to be registered. The ANDS RDA service specialises in research data and only accepts publication information as a dataset’s “Related Info” (which typically points to a publication repository somewhere else). There is no intention to change this practice. This proposal is intended to enable the RIF-CS schema to be used in other business contexts in the research sector where publication information is potentially independent of dataset information. The RIF-CS schema caters for a broader set of scenarios than the specific scope policies of the ANDS registry.
RIF-CS schema components affected
Impact on content providers
The addition of new types will allow data providers to:
- Reflect more accurately the kinds of objects that research organisations are registering in the ANDS registry using the collection element. See for example, this collection record in RDA describing software and this record describing a workflow.
- Allow the typing of such research assets within the ANDS registry, to allow segmenting of the ANDS holdings (e.g. faceted search for software).
- Allow RIF-CS compatible systems (not necessarily the ANDS Registry/RDA) to ‘ingest’ information about research assets (such as publications) that may not have connections with collections/datasets (whereas publications are currently always relatedInfo about a collection). As an example the Research Data Switchboard ingests information (using RIF-CS) about publications and then makes links with people, grants and datasets.
- Allow registries that use RIF-CS to include ontologies and other classification schemes which are considered as research assets
- Provide an extensible precedent in the schema to deal with future research asset types from specific disciplines eg humanities and social science.
- Allow resources such as models, which are sometimes cited or referenced in scholarly communications or counted as the output or impact of research, to be processed in the same way as citable datasets (e.g. exported to Thomson Reuters Citation Index).
When we allow collection->sourceCode, we will sometimes have software described as collection->dataset (as we currently have) and other times as collection->sourceCode. This will need to be kept in mind when considering system behaviour and user experience.
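The effect on faceted search can be sketched with a simple count per collection type. The records below are invented for illustration and facet_counts is a hypothetical helper, not a function of the ANDS registry. The point is only that software registered as a dataset before the new type existed would not appear under a sourceCode facet.

```python
from collections import Counter

# Hypothetical records: (title, collection type).
records = [
    ("Ocean model source", "sourceCode"),
    ("Rainfall observations", "dataset"),
    ("Legacy workflow scripts", "dataset"),  # software registered under the old practice
    ("Soil thesaurus", "classificationScheme"),
]


def facet_counts(recs):
    """Count records per collection type, as a faceted search might."""
    return Counter(ctype for _title, ctype in recs)


counts = facet_counts(records)
```

In this sketch the sourceCode facet holds one record although two of the four records describe software, which is the mixed state the paragraph above warns about.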
The vocabs.xml file will need to be amended to add the new vocabulary terms and their definitions.
The vocabularies.html will need to be regenerated to reflect the addition of new terms and the changes to the descriptions of the activity types.
A change will be required to the Content Providers Guide.
 This would require a structural change, most likely the introduction of a new schema element, eg ResourceType (mentioned in ISO 2146 but not implemented in RIF-CS) which would describe the resource type. Values (describing the fine grained resource type or types contained in the collection) could be taken from international controlled resource type lists appropriate to the collection type.
 “The service data element allows … services to be described in terms of access policies, service levels and obligations and protocol support” ISO 2146:2010(E) p 23.
 Note there is no plan to change the deposit policy