Research Link Australia: Information Model and Metadata

What RLA Expects of Data Contributors

RLA is seeking information that contributors can readily provide and that promotes the value of research organisations and their research activities. In return, RLA aims to provide information back to contributors to help address their own information challenges. This exchange will help us better understand the potential of Research Link Australia as national infrastructure of real value. An example project may involve providing RLA with administrative information about collaborating organisations (research and industry), researchers, funded projects or research activities, and outputs. RLA will link these within a larger data source for the partner to explore and gather insights.

RLA Information/Data Model

During the RLA project consultation and co-design phases, three important types of information were identified as helping to identify potential collaborators:

Research Capability information from both research and business/industry sectors.
Research capabilities can be represented by a researcher's profile (through their identifier), research outputs (publications, research data, IP and patents), funding received (e.g. ARC/NHMRC grants, other government and philanthropic funding), information obtained through an OrgID about a stakeholder that has research capabilities, research facilities and services owned by an institution, and business R&D activity.
Collaboration capability/culture fit: collaborators from either side understand each other's culture, language and priorities, and know how to work together.
Information about collaboration capability can be evidenced by successful partnerships, which can be inferred from past collaboration history, for example through ARC Linkage grants or CRC grants. However, no such indicator exists for the many researchers or SMEs (small to medium enterprises) who haven’t yet had such a collaboration.
Capacity: Capacity indicates whether potential collaborators, even if capable, are actually available for a collaboration.
While some of this capacity information could be obtained from an organisation’s HR systems, there is also a need to collect voluntary declarations by the research collaborators themselves.

What data does RLA collect?

Currently, RLA focuses on providing information on research capability and research collaboration capability. The RLA data model below shows how the two types of capabilities are interlinked. In particular:

Individuals’ and organisations’ research capabilities are demonstrated through their research input and output, e.g. publications, datasets, instruments, and funded projects.
Individuals’ and organisations’ collaboration capabilities are demonstrated through their collaborative work on past funded projects that have both research and industry participation (e.g. ARC Linkage Projects).

Each RLA entity/object, such as a researcher or expert, an organisation, a research input/output, or a funded activity, is represented by a data type or data object. The RLA data model, also known as the RLA graph, links each entity or data type to others in the model.
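To make the idea of linked entities concrete, the following is a minimal sketch of a graph in which organisations are connected through a funded activity. It is an illustration only and not the RLA implementation: the class names, the node keys, the "participates_in" relation and the example records are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    key: str   # hypothetical unique key, ideally a persistent identifier
    kind: str  # e.g. "researcher", "organisation", "publication", "grant"
    name: str


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)
    # adjacency map: node key -> set of (relation, neighbour key)
    links: dict = field(default_factory=lambda: defaultdict(set))

    def add(self, node: Node) -> None:
        self.nodes[node.key] = node

    def link(self, a: str, relation: str, b: str) -> None:
        """Record an undirected relationship between two existing nodes."""
        if a not in self.nodes or b not in self.nodes:
            raise KeyError("both ends of a relationship must be existing nodes")
        self.links[a].add((relation, b))
        self.links[b].add((relation, a))


def collaborators(graph: Graph, org_key: str) -> set:
    """Names of organisations sharing at least one funded activity with org_key."""
    found = set()
    for _, activity in graph.links[org_key]:
        if graph.nodes[activity].kind != "grant":
            continue
        for _, partner in graph.links[activity]:
            if partner != org_key and graph.nodes[partner].kind == "organisation":
                found.add(graph.nodes[partner].name)
    return found


def example_graph() -> Graph:
    """Build a tiny graph from made-up placeholder records."""
    g = Graph()
    g.add(Node("org:uni", "organisation", "Example University"))
    g.add(Node("org:co", "organisation", "Example Pty Ltd"))
    g.add(Node("grant:1", "grant", "Example Linkage Project"))
    g.link("org:uni", "participates_in", "grant:1")
    g.link("org:co", "participates_in", "grant:1")
    return g
```

Calling collaborators(example_graph(), "org:uni") walks from the university to its funded activity and back out to the other participating organisation, which is the kind of traversal that past collaboration history relies on.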

Minimum metadata required per RLA entity

As the RLA project advances, there will be the opportunity to incorporate new data sources into the RLA Graph. This includes the integration of significant databases such as patent information and instruments, as well as data from research management systems at both university and individual group levels. This section provides recommendations for the minimum and optimal properties required for such integrations.

The following Entity Relationship diagram represents the minimum required fields and relationships based on the RLA Data Model.

The recommended fields are selected based on two principles:

Principle 1: Minimise the number of properties, with the aim of simplifying the data contribution process.

Principle 2: Require only properties that support either a functional requirement of RLA or identify the entity across the RLA graph/data model.

The following metadata properties would satisfy the minimum requirements for the RLA functional requirements.

For Researchers, the essential properties include:

First Name: The given name of the researcher.
Last Name: The family name or surname of the researcher.
Identifier: One or more of the following:
- ORCID (preferred): A unique identifier for academic authors and contributors.
- Scopus Author ID: An identifier used within the Scopus database to associate researchers with their publications accurately.
- Website/URL: A link to the researcher's institutional webpage, providing access to their professional profile, publications, and contributions.
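Because ORCID is the preferred researcher identifier, it can be worth checking contributed values before they are used for matching. The sketch below assumes the check-digit scheme that ORCID documents (ISO 7064 MOD 11-2) and accepts either a bare iD or a full orcid.org URL. The function names are hypothetical and this is not part of any RLA tooling.

```python
import re

# Sixteen characters in four groups; the last one may be the check character X.
_ORCID_RE = re.compile(
    r"(?<![0-9])([0-9]{4})-?([0-9]{4})-?([0-9]{4})-?([0-9]{3}[0-9Xx])$"
)


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def normalise_orcid(value: str):
    """Return the iD as 0000-0000-0000-0000, or None if it is malformed."""
    match = _ORCID_RE.search(value.strip())
    if not match:
        return None
    compact = "".join(match.groups()).upper()
    if orcid_check_digit(compact[:15]) != compact[15]:
        return None  # checksum mismatch, most likely a typo
    return "-".join(compact[i:i + 4] for i in range(0, 16, 4))
```

A checksum pass only shows that the value is well formed; it does not confirm that the iD belongs to the named researcher.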

For Publications, key metadata encompasses:

Title: The publication title.
Abstract: A brief summary of the publication's content, outlining the main arguments, results, and conclusions.
Publication Type: The kind of publication (for example, journal article or conference paper), indicating the context and intended audience of the work.
DOI: Digital Object Identifier, providing a permanent and direct link to the publication.
Publication Year: The year the work was published, which is important for understanding the context and currency of the research.

For Grant-Project/Research Activity, the necessary information includes:

Title: The name of the grant or project.
Summary: A brief overview of the project, including its goals and expected outcomes.
Identifier: DOI (for a grant), GrantID, or RAiD (for a project), and/or Website/URL: A link to more detailed information about the grant or project, providing access to a landing page, findings, or reports.
Announcement Year: The year the grant was announced, giving context to the project's timeline and funding cycle.
Funder: The organisation providing the financial support, indicating the source of funding and potentially the project's thematic alignment.

For Organisations, the metadata should cover:

Name: The official name of the organisation.
Identifier: ROR, DOI, ABN (Australian Business Number, a unique identifier within Australia that is necessary for legal and financial transactions), or website (the organisation’s website, which provides a gateway to its activities, mission, and resources).
Country: The country where the organisation is located, which is important for understanding the geographical and regulatory context.
GeoLocation: The location where the organisation is based, offering more precise localisation and indicating potential collaboration opportunities.
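Where an ABN is supplied as the organisation identifier, a structural check can catch transcription errors before the record is matched. The following sketch assumes the published ABN checksum rule (subtract 1 from the first digit, apply the weights shown, and require the weighted sum to be divisible by 89). The function name is hypothetical.

```python
import re

# Weights applied to the 11 digits after 1 is subtracted from the first digit.
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def is_valid_abn(value: str) -> bool:
    """Return True if value is 11 digits (spaces allowed) with a valid checksum."""
    compact = "".join(value.split())
    if not re.fullmatch(r"[0-9]{11}", compact):
        return False
    digits = [int(ch) for ch in compact]
    digits[0] -= 1
    weighted_sum = sum(w * d for w, d in zip(ABN_WEIGHTS, digits))
    return weighted_sum % 89 == 0
```

This only confirms that the number is well formed. Whether the ABN is registered, and to which entity, would still need to be confirmed against the official register.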

For Instruments, the metadata should cover:

Title: The name of the instrument.
Identifier: DOI (or other resolvable identifiers)
AlternateIdentifier (Recommended): An identifier other than the primary Identifier applied to the resource being registered. This may be any alphanumeric string which is unique within its domain of issue.
Description: Technical information about the instrument and its capability.
HostInstitution: An institution responsible for the management of the instrument. This may include the legal owner, the operator, or an institute providing access to the instrument.
GeoLocation: Spatial region or named place where the instrument is hosted or where its data was gathered.

For Patents:

Invention Title: The title of the patent.
Application_number: The application number that uniquely identifies an application for an IP right.
Status: The current status of the IP right or IP application.
Inventor(s) Name: Inventor(s) of the patent.
Applicant(s) Name: Applicant(s) of the patent.

These properties ensure comprehensive coverage of research activities, resources, and affiliations, thereby fulfilling RLA's functional requirements for metadata. However, much other important information can be captured as part of the integration with the RLA system. Appendix I details a complete list of properties for each RLA data object.
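A contributor could check records against the minimum properties listed above before submission. The sketch below is one possible way to do this: the snake_case field names and entity keys are hypothetical (the lists above do not prescribe a serialisation), and the "any of" groups reflect the entities where several identifier options are offered.

```python
# Properties that must always be present, per entity type (hypothetical names).
REQUIRED = {
    "researcher": ["first_name", "last_name"],
    "publication": ["title", "abstract", "publication_type", "doi",
                    "publication_year"],
    "grant": ["title", "summary", "announcement_year", "funder"],
    "organisation": ["name", "country", "geo_location"],
    "instrument": ["title", "identifier", "description", "host_institution",
                   "geo_location"],
    "patent": ["invention_title", "application_number", "status",
               "inventor_names", "applicant_names"],
}

# Identifier options of which at least one must be present.
ANY_OF = {
    "researcher": ["orcid", "scopus_author_id", "url"],
    "grant": ["doi", "grant_id", "raid", "url"],
    "organisation": ["ror", "doi", "abn", "url"],
}


def missing_fields(entity_type: str, record: dict) -> list:
    """List the minimum properties that are absent or blank in record."""
    def present(name: str) -> bool:
        value = record.get(name)
        return value is not None and str(value).strip() != ""

    problems = [name for name in REQUIRED[entity_type] if not present(name)]
    options = ANY_OF.get(entity_type, [])
    if options and not any(present(name) for name in options):
        problems.append("one of: " + ", ".join(options))
    return problems
```

For example, a researcher record with only first and last names would be reported as missing "one of: orcid, scopus_author_id, url", while an empty list means the minimum set is satisfied.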

The ER Diagram below presents the complete metadata schema with all optional fields.

RLA data model - with optional metadata fields

Persistent Identifiers

The role of persistent identifiers is essential to support interoperability and long-term data integrity in the RLA graph/data model. Specifically, the following persistent identifiers play an important role in RLA metadata.

PIDs for Grants and Projects: Allocating PIDs (Persistent Identifiers) to research grants and projects enables the identification and disambiguation of these entities across the RLA graph. This is particularly important when a project has participants from different universities. While there is no globally accepted PID for Grants and Projects, there are three main options for allocating PIDs to projects and grants.
- Firstly, both Crossref and DataCite allow the minting of DOIs (Digital Object Identifiers) for grants.
- Secondly, a Persistent URL (PURL) can be used to transform local identifiers into Persistent URLs.
- Finally, RAiD (Research Activity Identifier Service) opens new opportunities to mint PIDs for research projects or activities.
ORCID for researchers: Allocating ORCID identifiers to researchers is crucial for disambiguating individual researchers across information ingested into the RLA from various universities. Furthermore, ORCID allows the RLA to connect researchers with a wealth of information from publishers and funding bodies. As such, adopting ORCID for the researcher information provided to RLA is highly recommended. If ORCID identifiers are not available in the contributed metadata, searching the ORCID API and filtering the results by related works in the graph can help identify the missing identifiers.
ROR or ABN for Organisations: Identifying the type, location, and domain of activity for organisations mentioned in the RLA graph is crucial for offering valuable insights into current or potential research collaborations. Internationally, http://ROR.org is a viable option for universities and research organisations, whereas the Australian Business Number (ABN) register serves as a comprehensive database of companies and all registered legal entities in Australia. Combining these PIDs offers adequate support for disambiguating organisations in the RLA graph. For new records provided to RLA, contributors should aim to map organisation names to one of these PIDs.
- Note: ROR has been omitted from the “Figure: RLA Optimum Metadata Nodes and Relationships” for simplicity. At the current stage of ROR development, most Australian organisations with a ROR already have a registered ABN.
DOI (or URL) for Publications and Datasets: Identifying publications and datasets with a DOI is highly recommended. However, for non-traditional research outputs where a DOI is not available, using a resolvable URL as an identifier can support disambiguation and facilitate the retrieval of complementary information from the webpage.
- Note: Dataset has been omitted from the “Figure: RLA Optimum Metadata” because very few RLA requirements at this stage of development benefit from dataset metadata.
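The DOI-first, URL-fallback rule for publications and datasets can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the DOI shape check is a common heuristic rather than an official grammar, and lowercasing relies on DOIs being case-insensitive.

```python
import re

# Common ways a DOI is written: resolver URL or "doi:" prefix.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:)\s*", re.IGNORECASE)
# Heuristic shape check: "10.", a 4-9 digit registrant code, "/", then a suffix.
_DOI_SHAPE = re.compile(r"^10\.[0-9]{4,9}/\S+$")


def normalise_doi(value: str):
    """Return the bare, lowercased DOI, or None if value does not look like one."""
    candidate = _DOI_PREFIX.sub("", value.strip()).lower()
    return candidate if _DOI_SHAPE.match(candidate) else None


def output_identifier(doi=None, url=None):
    """Prefer a DOI; fall back to a resolvable URL; otherwise return None."""
    normalised = normalise_doi(doi) if doi else None
    if normalised:
        return ("doi", normalised)
    if url and url.strip():
        return ("url", url.strip())
    return None
```

Normalising to a single bare form means that the same output contributed by two universities, once as a resolver URL and once with a "doi:" prefix, resolves to the same node key in the graph.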

Metadata Provenance

Capturing and preserving the provenance of metadata records for both nodes and relationships is essential for quality control and resolving any metadata issues. The following metadata elements support the required provenance data.

Data Source: Every new record added to the RLA graph should maintain its provenance. The “source” property will hold the required information, and the recommended approach is to use the domain of the data provider, such as “http://ipaustralia.gov.au”. This is the domain of the data contributor, not the data repository. For example, in the case of RAPID IP data discoverable via Research Data Australia (https://researchdata.edu.au/ip-rapid/2761752), the source will be “http://ipaustralia.gov.au” instead of “http://researchdata.edu.au”. Also, to provide consistency for machine-readable access points, the best practice is to change the source value to lowercase.
Local Identifier: Records from databases and institutional repositories often come with local identifiers (also known as primary keys). When there is no DOI, ORCID, or other global PIDs available, combining the source with the local identifier offers a reference point in the graph. Additionally, these local identifiers are crucial for future interoperability with the source repository, facilitating the ability to receive updates beyond the initial data import.
Last Update Information: One of the best practices in database design is to include in each record the date and time of the last update and the agent (person, software, service) responsible for the update. This approach facilitates debugging, supports time series analysis, and, in some cases, serves as an efficient tool for database replication. Therefore, it is recommended that each node and relationship in the RLA graph incorporate two properties:
- “last_update,” indicating the date and time of the last update to any of the properties, and
- “updated_by,” identifying the software component, service name, or user ID (in case of a manual update) that modified the record.
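The provenance properties described above can be applied in a single step at ingestion time. The sketch below is one possible approach and not the RLA implementation: the "source", "last_update" and "updated_by" names come from the text, while "local_id", the "#" separator in the reference key, and the function names are assumptions made for illustration.

```python
from datetime import datetime, timezone


def normalise_source(value: str) -> str:
    """Lowercase the provider domain and drop surrounding spaces and a trailing slash."""
    return value.strip().lower().rstrip("/")


def stamp(record: dict, source: str, local_id, updated_by: str, now=None) -> dict:
    """Return a copy of record carrying the provenance properties."""
    stamped = dict(record)
    stamped["source"] = normalise_source(source)
    stamped["local_id"] = str(local_id)  # hypothetical property name
    # ISO 8601 timestamp in UTC; `now` can be injected to make runs repeatable.
    stamped["last_update"] = (now or datetime.now(timezone.utc)).isoformat()
    stamped["updated_by"] = updated_by
    return stamped


def reference_key(record: dict) -> str:
    """Combine source and local identifier for use when no global PID exists."""
    return f'{record["source"]}#{record["local_id"]}'
```

Storing the timestamp in UTC keeps records from different contributors comparable, and the combined reference key gives a stable handle for receiving later updates from the source repository.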