In my last blog post I covered the basics of CKAN and how you can set up your own CKAN data portal using Docker. Now let’s move on to some of the most used CKAN extensions, starting with the one for data validation.

In the past 6 years we’ve worked on numerous CKAN data portals, many of which use ckanext-validation, an extension that validates the tabular data in a dataset and displays validation badges (valid or invalid data) on the CKAN portal interface.

By default, the data is validated against the default schema and options, but you can also validate it against your own schema and options. The create/update resource form has corresponding fields that let the user tweak the validation settings. If the data schema only covers part of the data, you can use the infer_fields and order_fields validation options to infer the remaining fields. The validation process can be tweaked by passing any of the options supported by Goodtables as a JSON object. Here is an example:

{
    "headers": 3,
    "delimiter": ";",
    "skip_rows": ["#"],
    "skip_checks": ["blank-row"],
    "checks": ["duplicate-row"],
    "infer_fields": true,
    "order_fields": true
}
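Because the options are entered as JSON text, it can help to check them before pasting them into the resource form. The following is a minimal sketch using only the Python standard library. The function name parse_validation_options and the table of expected types are assumptions made for illustration; they are not part of ckanext-validation or Goodtables. The sketch flags, for example, a boolean that was written as a quoted string.

```python
import json

# Expected Python types for the options shown in the example above.
# This table is an illustrative assumption, not an official list.
EXPECTED_TYPES = {
    "headers": int,
    "delimiter": str,
    "skip_rows": list,
    "skip_checks": list,
    "checks": list,
    "infer_fields": bool,
    "order_fields": bool,
}


def parse_validation_options(text):
    """Parse a JSON options string and return (options, problems)."""
    options = json.loads(text)  # raises ValueError on malformed JSON
    if not isinstance(options, dict):
        raise ValueError("validation options must be a JSON object")
    problems = []
    for key, value in options.items():
        expected = EXPECTED_TYPES.get(key)
        if expected is None:
            continue  # option not covered by this sketch; leave it alone
        # bool is a subclass of int in Python, so reject it explicitly
        # where a plain integer is expected.
        wrong_bool = isinstance(value, bool) and expected is not bool
        if wrong_bool or not isinstance(value, expected):
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return options, problems


options, problems = parse_validation_options(
    '{"headers": 3, "infer_fields": "true"}'
)
print(problems)  # ['infer_fields: expected bool, got str']
```

An empty problems list only means the types look plausible; whether an option is actually accepted is still decided by the validation library itself.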

Additionally, every uploaded tabular file is validated automatically in the background after the resource is created, and the results are stored against each resource. If the uploaded data is valid, the “Data Success” badge is displayed next to the resource; otherwise, the “Data Invalid” badge is shown.

If the data is invalid, the data creator or the user can click the validation badge to see which column or row has invalid data, and then easily fix the invalid values. Validation reports are displayed in the user interface and describe the issues found in the data, both at the structure level (missing headers, blank rows, etc.) and at the data schema level (wrong data types, values out of range, etc.).
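To give a feel for what such a report contains, here is a minimal sketch that groups the errors of a report by type and by location. The report layout used here (a "valid" flag and a list of "tables", each with "errors" carrying "code", "row-number", "column-number" and "message") is an assumption based on the Goodtables report format, and the sample report is hand-written for illustration, not real output.

```python
from collections import Counter


def summarize_report(report):
    """Return (is_valid, error counts by code, list of error locations)."""
    counts = Counter()
    locations = []
    for table in report.get("tables", []):
        for error in table.get("errors", []):
            counts[error.get("code", "unknown")] += 1
            locations.append(
                (
                    error.get("row-number"),
                    error.get("column-number"),
                    error.get("message", ""),
                )
            )
    return report.get("valid", False), dict(counts), locations


# Hypothetical report, hand-written for illustration only.
sample_report = {
    "valid": False,
    "error-count": 2,
    "tables": [
        {
            "errors": [
                {
                    "code": "blank-row",  # structure-level issue
                    "row-number": 4,
                    "column-number": None,
                    "message": "Row 4 is completely blank",
                },
                {
                    "code": "type-or-format-error",  # schema-level issue
                    "row-number": 7,
                    "column-number": 2,
                    "message": "Value in row 7, column 2 has the wrong type",
                },
            ]
        }
    ],
}

is_valid, counts, locations = summarize_report(sample_report)
for row, column, message in locations:
    print(f"row {row}, column {column}: {message}")
```

Structure-level errors such as blank rows usually have no column number, which is why the location tuple allows None.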

Why do you need ckanext-validation on your portal?

The first and main reason is so that you can ensure all your uploaded data is correct.

Additionally, many CKAN portals use ckanext-harvest, an extension that makes it possible to harvest data from one CKAN portal into another. Harvested data is not always valid, so the validation extension checks it automatically. That way you can confirm that everything you’ve harvested is correct, or fix it if it isn’t.

How to set up ckanext-validation

This extension has been tested with CKAN 2.4 to 2.7.

It is strongly recommended to use it alongside ckanext-scheming to define the necessary extra fields in the default CKAN schema. If you want to use asynchronous validation with background jobs and are using CKAN 2.6 or lower, ckanext-rq is also needed. 

Installation

To install ckanext-validation, activate your CKAN virtual env and run:

git clone https://github.com/frictionlessdata/ckanext-validation.git
cd ckanext-validation
pip install -r requirements.txt
python setup.py develop

Create the database tables by running:

paster validation init-db -c ../path/to/ini/file

Configuration

Once installed, add the validation plugin to the ckan.plugins configuration option in your INI file:

ckan.plugins = ... validation

Note: if you are using CKAN 2.6 or lower together with asynchronous validation, also add the rq plugin (provided by ckanext-rq, mentioned above) to ckan.plugins.

The extension requires changes in the CKAN schema. The easiest way to add those is by using ckanext-scheming. Use these two configuration options to link to the dataset schema (replace it with your own if you need to customize it) and the required presets:

scheming.dataset_schemas = ckanext.validation.examples:ckan_default_schema.json
scheming.presets = ckanext.scheming:presets.json
                   ckanext.validation:presets.json

Use the following configuration options to choose the operation modes:

ckanext.validation.run_on_create_async = True|False (Defaults to True)
ckanext.validation.run_on_update_async = True|False (Defaults to True)

ckanext.validation.run_on_create_sync = True|False (Defaults to False)
ckanext.validation.run_on_update_sync = True|False (Defaults to False)

By default validation will be run against the following formats: CSV, XLSX and XLS. You can modify these formats using the following option:

ckanext.validation.formats = csv xlsx

You can also provide validation options that will be used by default when running the validation:

ckanext.validation.default_validation_options={
    "skip_checks": ["blank-row", "duplicate-header"],
    "headers": 3}
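Since these settings are plain INI values, with the default options being JSON spread over several lines, a quick stand-alone check can catch typos before CKAN is restarted. The sketch below uses only the Python standard library. The function name read_validation_settings, the sample INI text and the [app:main] section name are assumptions for illustration; CKAN reads its configuration through its own machinery, not through this code. The defaults mirror the ones listed above.

```python
import configparser
import json

# Defaults as described in the configuration section above.
DEFAULT_MODES = {
    "ckanext.validation.run_on_create_async": True,
    "ckanext.validation.run_on_update_async": True,
    "ckanext.validation.run_on_create_sync": False,
    "ckanext.validation.run_on_update_sync": False,
}

# Hypothetical INI fragment, for illustration only.
SAMPLE_INI = """
[app:main]
ckan.plugins = validation
ckanext.validation.run_on_create_sync = True
ckanext.validation.formats = csv xlsx
ckanext.validation.default_validation_options = {
    "skip_checks": ["blank-row", "duplicate-header"],
    "headers": 3}
"""


def read_validation_settings(ini_text, section="app:main"):
    """Return (modes, formats, default options) read from INI text."""
    # Interpolation is disabled so "%" characters in other settings
    # do not raise errors.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    modes = {
        key: parser.getboolean(section, key, fallback=default)
        for key, default in DEFAULT_MODES.items()
    }
    formats = parser.get(
        section, "ckanext.validation.formats", fallback="csv xlsx xls"
    ).split()
    raw_options = parser.get(
        section, "ckanext.validation.default_validation_options", fallback="{}"
    )
    # json.loads raises ValueError if the multi-line value is not valid JSON.
    return modes, formats, json.loads(raw_options)


modes, formats, options = read_validation_settings(SAMPLE_INI)
print(formats)  # ['csv', 'xlsx']
print(options["headers"])  # 3
```

Note that the continuation lines of the JSON value must stay indented in the INI file, otherwise they are read as separate settings.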


If you are using a cloud-based storage backend for uploads, check the “Private datasets” section of the official readme for other configuration settings that might be relevant.


If you want to dive deeper into the installation and configuration of the ckanext-validation extension, go to the official readme. If you need help, you can always reach out to us and our CKAN team.

About Petar Efnushev

Computer whisperer at Keitaro