Writing a custom harvest type for the ARDC Harvester
The ARDC Harvester comes with four basic harvest types:
- GET
- OAI-PMH
- CKAN
- CSW
Each harvest type is implemented as a Python class that extends the Harvester class.
The harvest types are located in the harvest_handlers directory, and each harvest type should be self-contained within its own file.
The simplest harvest type, GETHarvester, serves as an example of how a harvest handler should be written:
from Harvester import *


class GETHarvester(Harvester):
    """
    {
        "id": "GETHarvester",
        "title": "GET Harvester",
        "description": "simple GET Harvester to fetch a single metadata document",
        "params": [
            {"name": "uri", "required": "true"},
            {"name": "xsl_file", "required": "false"}
        ]
    }
    """

    def harvest(self):
        self.getHarvestData()
        self.storeHarvestData()
        self.runCrossWalk()
        self.postHarvestData()
        self.finishHarvest()
The GETHarvester harvest handler shows the minimum that a harvest handler needs: a class definition that subclasses the Harvester class, a class docstring that describes the harvest handler, and a harvest(self) method.
The subclass may override any of the methods of the Harvester class.
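As a rough sketch of how a custom handler might override one of these methods, the example below defines a hypothetical ExampleHarvester that replaces the getHarvestData step. The real Harvester base class is not reproduced here, so a minimal stand-in with the same method names as in the GETHarvester example is defined to keep the sketch runnable on its own. The stand-in's constructor, its steps list, and all method bodies are assumptions for illustration only. In an actual handler file you would use `from Harvester import *` instead of the stand-in.

```python
class Harvester:
    """Stand-in for the real base class (assumption: placeholder bodies only)."""

    def __init__(self):
        self.steps = []  # records the order in which the harvest steps ran

    def getHarvestData(self):
        self.steps.append("getHarvestData")

    def storeHarvestData(self):
        self.steps.append("storeHarvestData")

    def runCrossWalk(self):
        self.steps.append("runCrossWalk")

    def postHarvestData(self):
        self.steps.append("postHarvestData")

    def finishHarvest(self):
        self.steps.append("finishHarvest")


class ExampleHarvester(Harvester):
    """
    {
        "id": "ExampleHarvester",
        "title": "Example Harvester",
        "description": "hypothetical harvester that customises the fetch step",
        "params": [
            {"name": "uri", "required": "true"}
        ]
    }
    """

    def getHarvestData(self):
        # Overrides the base-class step; a real handler would fetch data here.
        self.steps.append("custom getHarvestData")

    def harvest(self):
        # Same sequence of steps as GETHarvester, using the overridden fetch.
        self.getHarvestData()
        self.storeHarvestData()
        self.runCrossWalk()
        self.postHarvestData()
        self.finishHarvest()


handler = ExampleHarvester()
handler.harvest()
print(handler.steps)
```

Only getHarvestData is overridden here; the remaining steps fall through to the base class, which is the pattern the GETHarvester example relies on as well.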
The docstring at the beginning of the class is used to provide the Registry with information about the harvest type. Although it was planned to use this information to determine and populate some of the fields in the Registry user interface, this is not yet fully implemented.
The docstring must take the form of a JSON object with the following properties:
- id: The unique identifier for this harvest type.
- title: The name of the harvest type as it appears in the “Harvest Method” dropdown in the “Harvester Settings” of the Registry.
- description: A brief description of the harvest type, displayed to the right of the “Harvest Method” dropdown in the “Harvester Settings” of the Registry.
- params: An array of the other parameters required to complete the specification of the harvest of a particular data source. For example, this might include a URI, or the type of harvester crosswalk to be used. Each array element is an object with these keys:
  - name: An identifier for this parameter.
  - required: One of the strings "true" or "false", indicating whether or not a Registry user must specify a value for this parameter.
Please see the implementation of the existing harvest types for examples.
This docstring is then parsed, and the result is passed to the ARDC Registry to populate the available harvest type field on the Data Source Settings page.
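Because a malformed docstring would presumably prevent the harvest type from being described correctly, it can be useful to check a new handler's docstring before deploying it. The sketch below is one way to do that with the standard library only. It does not reproduce how the Harvester itself parses the docstring; the function validate_handler_docstring and the class DemoHarvester are hypothetical names introduced for illustration, and treating all four properties as mandatory is an assumption based on the description above.

```python
import json

REQUIRED_PROPERTIES = ("id", "title", "description", "params")


def validate_handler_docstring(handler_class):
    """Parse a handler's class docstring as JSON and check the documented keys."""
    doc = handler_class.__doc__
    if not doc:
        raise ValueError(f"{handler_class.__name__} has no docstring")

    spec = json.loads(doc)  # raises json.JSONDecodeError if the JSON is malformed
    if not isinstance(spec, dict):
        raise ValueError("docstring must be a JSON object")

    missing = [key for key in REQUIRED_PROPERTIES if key not in spec]
    if missing:
        raise ValueError(f"missing properties: {', '.join(missing)}")

    for param in spec["params"]:
        if "name" not in param:
            raise ValueError("every params entry needs a 'name'")
        # 'required' is documented as the string "true" or "false", not a boolean.
        if param.get("required") not in ("true", "false"):
            raise ValueError(f"param {param['name']!r}: 'required' must be \"true\" or \"false\"")

    return spec


class DemoHarvester:
    """
    {
        "id": "DemoHarvester",
        "title": "Demo Harvester",
        "description": "hypothetical harvest type used to exercise the check",
        "params": [
            {"name": "uri", "required": "true"},
            {"name": "xsl_file", "required": "false"}
        ]
    }
    """


spec = validate_handler_docstring(DemoHarvester)
print(spec["title"], [p["name"] for p in spec["params"]])
```

A check like this could be run against each class in the harvest_handlers directory as part of a test suite, so that errors in the JSON surface before the handler is installed.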