Harvesting and Importing
GET Harvester- A single, one-off harvest. In internet terms, it is a simple HTTP GET. Combined with the Harvest Frequency option on your Harvester Settings, you can schedule a series of 'Direct' harvests at a set frequency, say weekly.
OAI-PMH Harvester- The OAI-PMH protocol allows fast, bulk harvests and includes features like resumption tokens and exception reporting. To use this option data providers need to have OAI-PMH capability. Most repository software includes an OAI-PMH data provider. Free, open-source OAI-PMH solutions are also available. More information on implementing an OAI-PMH data provider is available.
CKAN Harvester (custom method)- Harvester connects to CKAN API and downloads JSON in format specified in the Provider Type.
CSW Harvester (custom method)- Harvester connects to Catalogue Service for the Web (OGC CSW) implementation and downloads XML in format specified in the Provider Type.
There could be various reasons why records are not properly harvested by the ARDC Harvester. Often, it is not a problem with the Harvester but minor issues with the XML or with the harvest settings in your Data Source account. The following are the most common issues when experiencing this problem:
The Harvester uses the earliestDatestamp to check which records need to be harvested and added to the Registry. Make sure that this date is earlier than the datestamps of your records.
Quick Check:
In your browser's address bar, enter your harvest URI with "?verb=Identify" appended. For example: http://abc.org.au/oai/oai2.php?verb=Identify
You will see information similar to this:
<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2011-07-06T00:05:54Z</responseDate>
<request verb="Identify">http://abc.org.au/oai/oai2.php</request>
<Identify>
<repositoryName>Example</repositoryName>
<baseURL>http://abc.org.au/oai/oai2.php</baseURL>
<protocolVersion>2.0</protocolVersion>
<earliestDatestamp>2011-01-01T00:00:00Z</earliestDatestamp>
<deletedRecord>no</deletedRecord>
<granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
<adminEmail>abc@abc.org.au</adminEmail>
</Identify>
</OAI-PMH>
NOTE: For more information on the use of OAI-PMH verbs, see OAI-PMH Metadata Harvesting.
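The same check can be scripted. The following is a minimal sketch, not part of the ARDC tooling: it parses an Identify response and reports any record datestamps that fall before the earliestDatestamp. The function names and the sample record identifiers are hypothetical, and the sample response declares the standard OAI-PMH namespace, which the abbreviated browser view above omits.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Standard OAI-PMH 2.0 namespace, needed to find elements in a real response.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Sample Identify response modelled on the example above (illustrative only).
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2011-07-06T00:05:54Z</responseDate>
  <request verb="Identify">http://abc.org.au/oai/oai2.php</request>
  <Identify>
    <repositoryName>Example</repositoryName>
    <earliestDatestamp>2011-01-01T00:00:00Z</earliestDatestamp>
    <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
  </Identify>
</OAI-PMH>"""


def parse_oai_datestamp(text):
    """Parse either OAI-PMH granularity (day or second) as a UTC datetime."""
    fmt = "%Y-%m-%dT%H:%M:%SZ" if "T" in text else "%Y-%m-%d"
    return datetime.strptime(text.strip(), fmt).replace(tzinfo=timezone.utc)


def earliest_datestamp(identify_xml):
    """Return the earliestDatestamp of an Identify response."""
    root = ET.fromstring(identify_xml)
    text = root.findtext("oai:Identify/oai:earliestDatestamp", namespaces=OAI_NS)
    if text is None:
        raise ValueError("response has no earliestDatestamp element")
    return parse_oai_datestamp(text)


def records_before_earliest(identify_xml, record_datestamps):
    """List (identifier, datestamp) pairs dated before earliestDatestamp."""
    earliest = earliest_datestamp(identify_xml)
    return [
        (identifier, stamp)
        for identifier, stamp in record_datestamps.items()
        if parse_oai_datestamp(stamp) < earliest
    ]


if __name__ == "__main__":
    # Hypothetical record datestamps; in practice these come from ListRecords.
    stamps = {"rec-1": "2010-12-31T23:59:59Z", "rec-2": "2011-03-01T00:00:00Z"}
    print(records_before_earliest(SAMPLE_IDENTIFY, stamps))
```

Any record the function returns has a datestamp earlier than the earliestDatestamp, which is the situation described above.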
Double-check the date/time that you have set the harvester to run. Does it have a 'Z' at the end? Z is for Zulu time (also known as GMT). AEST is 10 hours ahead of Zulu time. It is probable that the date/time you have set for harvest has not yet been reached.
ARDC is investigating ways to ensure that you can set your harvest time in your local time, for a future release of ARDC software.
For more information, please send an email to services@ardc.edu.au.
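To see how a Zulu harvest time maps to local time, the conversion can be sketched as below. This is an illustration only: the function names are hypothetical, and the fixed +10:00 offset is an assumption that covers AEST but ignores daylight saving (AEDT).

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for AEST (UTC+10); assumption: daylight saving is not handled.
AEST = timezone(timedelta(hours=10), "AEST")


def zulu_to_aest(stamp):
    """Convert a 'YYYY-MM-DDThh:mm:ssZ' timestamp to AEST."""
    utc = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return utc.astimezone(AEST)


def harvest_time_reached(stamp, now=None):
    """Return True once the scheduled Zulu time has passed."""
    scheduled = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now >= scheduled


if __name__ == "__main__":
    print(zulu_to_aest("2011-07-06T00:05:54Z").isoformat())
```

A harvest scheduled for 09:00Z will therefore not start until 19:00 AEST on the same day.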
When you schedule a harvest, the harvester checks for all records from your earliestDatestamp (see item a above) up to the time the harvest is scheduled to run.
If the datestamp of a record is set in the future, the harvester treats the record as not yet available. As with the scheduled harvest time in item d above, review your record and correct the datestamp, or schedule your harvest to run at a time later than the record's datestamp.
Quick check:
In your browser's address bar, enter your harvest URI with "?verb=ListRecords&metadataPrefix=rif" appended. For example: http://abc.org.au/oai/oai2.php?verb=ListRecords&metadataPrefix=rif
You will see information similar to this:
<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2011-07-06T00:32:59Z</responseDate>
<request verb="ListRecords" metadataPrefix="rif">http://abc.org.au/oai/oai2.php</request>
<ListRecords>
<record>
<header>
<identifier>test-ARDC-org-au</identifier>
<datestamp>2011-07-06T14:25:02Z</datestamp>
</header>
<metadata>
...
...
</metadata>
</record>
</ListRecords>
</OAI-PMH>
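In the response above, the record's datestamp (14:25:02Z) is later than the responseDate (00:32:59Z), so the record is dated in the future. A small script can flag such records. This is a sketch under assumptions: the function name is hypothetical, and the sample adds the standard OAI-PMH namespace declaration that a real response carries.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace prefix in ElementTree's "{uri}tag" form.
OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Sample ListRecords response modelled on the example above (illustrative only).
SAMPLE_LIST_RECORDS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2011-07-06T00:32:59Z</responseDate>
  <ListRecords>
    <record>
      <header>
        <identifier>test-ARDC-org-au</identifier>
        <datestamp>2011-07-06T14:25:02Z</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_stamp(text):
    """Parse a day- or second-granularity OAI-PMH datestamp as UTC."""
    fmt = "%Y-%m-%dT%H:%M:%SZ" if "T" in text else "%Y-%m-%d"
    return datetime.strptime(text.strip(), fmt).replace(tzinfo=timezone.utc)


def future_dated_records(list_records_xml, now=None):
    """Return (identifier, datestamp) pairs whose datestamp is after 'now'.

    If 'now' is not given, the response's own responseDate is used.
    """
    root = ET.fromstring(list_records_xml)
    if now is None:
        now = parse_stamp(root.findtext(OAI + "responseDate"))
    flagged = []
    for header in root.iter(OAI + "header"):
        stamp = header.findtext(OAI + "datestamp")
        if stamp is not None and parse_stamp(stamp) > now:
            flagged.append((header.findtext(OAI + "identifier"), stamp.strip()))
    return flagged


if __name__ == "__main__":
    print(future_dated_records(SAMPLE_LIST_RECORDS))
```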
Check the configuration of your OAI-PMH implementation to make sure that your resumption token does not expire before harvest completion.
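OAI-PMH lets a provider attach an optional expirationDate attribute to the resumptionToken element. Where your provider supplies it, the token lifetime can be estimated as sketched below. The function name and sample values are hypothetical, and the sketch assumes second-granularity timestamps.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

OAI = "{http://www.openarchives.org/OAI/2.0/}"
STAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

# Illustrative partial-list response; token value and sizes are made up.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2011-07-06T00:32:59Z</responseDate>
  <ListRecords>
    <resumptionToken expirationDate="2011-07-06T01:32:59Z"
                     completeListSize="500" cursor="0">token-1</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def token_lifetime_seconds(response_xml):
    """Seconds between responseDate and the token's expirationDate.

    Returns None when there is no token or no expirationDate attribute.
    """
    root = ET.fromstring(response_xml)
    token = root.find(".//" + OAI + "resumptionToken")
    if token is None or "expirationDate" not in token.attrib:
        return None
    issued = datetime.strptime(root.findtext(OAI + "responseDate").strip(), STAMP_FORMAT)
    expires = datetime.strptime(token.attrib["expirationDate"].strip(), STAMP_FORMAT)
    # Both values are UTC, so the difference does not need timezone handling.
    return (expires - issued).total_seconds()


if __name__ == "__main__":
    print(token_lifetime_seconds(SAMPLE_PAGE))
```

If the lifetime is shorter than the time a harvester needs to process one page of records, the harvest may fail partway through.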
The following are some of the common problems with the Data Source URI:
- missing "http://" in the URI field of the Data Source account
- the URI is inaccessible from the internet.
- the URI, although accessible from the internet, requires authentication
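These checks can be roughly approximated in a script. The sketch below is illustrative only, with hypothetical function names. Note that a probe run from inside your institution's network does not prove that the URI is reachable from the public internet, which is where the harvester connects from.

```python
import urllib.error
import urllib.request
from urllib.parse import urlparse


def uri_format_problems(uri):
    """Check the URI text itself, without touching the network."""
    problems = []
    parsed = urlparse(uri.strip())
    if parsed.scheme not in ("http", "https"):
        problems.append("missing http:// or https:// prefix")
    elif not parsed.netloc:
        problems.append("no host name")
    return problems


def probe_uri(uri, timeout=10):
    """Try to fetch the URI and describe the outcome in one short string."""
    try:
        with urllib.request.urlopen(uri, timeout=timeout) as response:
            return "ok (HTTP %d)" % response.status
    except urllib.error.HTTPError as err:
        if err.code in (401, 403):
            return "requires authentication (HTTP %d)" % err.code
        return "HTTP error %d" % err.code
    except urllib.error.URLError as err:
        return "not reachable: %s" % err.reason


if __name__ == "__main__":
    print(uri_format_problems("abc.org.au/oai/oai2.php"))
```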
If none of the above solved your problem, please send an email to services@ardc.edu.au.
The RDA Registry is capable of harvesting content that is not RIF-CS XML. However, an XSLT that generates a RIF-CS XML representation of the retrieved content must be made available to the ARDC Harvester. The absence of a valid XSLT will result in a harvest error.
To understand how this works, visit the Data source harvest configuration page, particularly the section 'Configuring a Provider Type other than RIF-CS'.
To initiate an instantaneous harvest, do the following:
Log in to the RDA Registry
Go to your data source dashboard
Click the 'Import from Harvester' button
Your scheduled harvest will be cancelled and an immediate harvest will be initiated. Based on a defined frequency, a new recurring harvest will be scheduled after the completion of the harvest. For any issue or concern, please email services@ardc.edu.au.
These are two common reasons why your feed may not ingest all of your records into the RDA Registry:
Incorrect harvester settings
If you are using an OAI provider, make sure that the parameters in the Harvester Settings of your data source account are set correctly. The Harvest Method should be 'OAI-PMH Harvester' and the URI should point to the base OAI URL. We have seen cases where institutions use the 'GET Harvester' Harvest Method with the URI set to something like 'http://<base URL>?verb=ListRecords&metadataPrefix=rif'.
A URI ending in '?verb=ListRecords&metadataPrefix=rif' harvests only the first set of records, because an OAI harvest relies on the resumption token to continue through the rest of the records. For instance, if your OAI provider returns 100 records per response but your feed contains 500 in total, a GET harvest will collect only the first 100.
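The difference between the two methods can be illustrated with a toy model. The sketch below does not contact a real provider and is not the ARDC Harvester's code: fetch_page is a hypothetical stand-in that serves records in pages and hands back a resumption token while more remain.

```python
def fetch_page(records, page_size, token=None):
    """Stand-in for an OAI provider: return (batch, next_token).

    The token is simply the next start offset, encoded as a string.
    """
    start = int(token) if token else 0
    batch = records[start:start + page_size]
    next_start = start + page_size
    next_token = str(next_start) if next_start < len(records) else None
    return batch, next_token


def single_get(records, page_size):
    """What a one-off GET of '?verb=ListRecords&metadataPrefix=rif' sees."""
    batch, _ignored_token = fetch_page(records, page_size)
    return batch


def oai_harvest(records, page_size):
    """Follow resumption tokens until the provider stops issuing them."""
    harvested, token = [], None
    while True:
        batch, token = fetch_page(records, page_size, token)
        harvested.extend(batch)
        if token is None:
            break
    return harvested


if __name__ == "__main__":
    feed = ["record-%d" % n for n in range(500)]
    print(len(single_get(feed, 100)), len(oai_harvest(feed, 100)))
```

With 500 records served 100 at a time, the single GET collects 100 records while the token-following loop collects all 500.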
A record already exists in another data source
If your institution has more than one data source or if your institution also feeds records to other providers, there is a possibility that the record(s) in your feed may have already been published in Research Data Australia. Check the Harvester log in your data source Dashboard to find out which record(s) already exist in other data source(s).
If you need assistance, please email services@ardc.edu.au.
There are many reasons why a harvest may be stalled. This does not mean that the RDA Registry harvester or importer is down or not functioning.
Here is one way to re-start the harvest:
If the above didn't work, please send an email to services@ardc.edu.au.