
CSVW - CSV on the Web

csvw.org

CSV on the Web

CSVW is a standard for describing and clarifying the content of CSV tables.

This site explains CSVW and suggests some tools you can use for working with it.

The UK Government Digital Service recommends CSVW:

Use the CSV on the Web (CSVW) standard to add metadata to describe the contents and structure of comma-separated values (CSV) data files.

GDS Recommended Open Standards for Government

Learn why to use CSVW

Use CSVW to annotate CSV files with JSON metadata

Here the metadata provides column names and datatypes. It also declares that the geo-coordinates use the Ordnance Survey National Grid.

grit_bins.csv

42,425584,439562
43,425301,439519
44,425379,439596
45,425024,439663
46,424915,439697
48,425157,440347
49,424784,439681
50,424708,439759
51,424913,440642
52,425342,440376
...

grit_bins.json

{

"@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],

"tables": [{

"url": "http://opendata.leeds.gov.uk/downloads/gritting/grit_bins.csv",

"tableSchema": {

"columns": [

{

"name": "location",

"datatype": "integer"

},

{

"name": "easting",

"datatype": "decimal",

"propertyUrl": "http://data.ordnancesurvey.co.uk/ontology/spatialrelations/easting"

},

{

"name": "northing",

"datatype": "decimal",

"propertyUrl": "http://data.ordnancesurvey.co.uk/ontology/spatialrelations/northing"

}

],

"aboutUrl": "#{location}"

}

}],

"dialect": {

"header": false

}

}
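The aboutUrl value above ("#{location}") is a URI template that expands once per row, identifying the thing each row describes. CSVW uses RFC 6570 URI Templates; the sketch below handles only simple {name} variables, and the expand helper is illustrative, not part of any CSVW library.

```python
import re

# Minimal sketch of expanding a CSVW URI template such as the
# aboutUrl "#{location}" above. Real implementations support full
# RFC 6570 URI Templates; this handles only simple {name} variables.
def expand(template: str, row: dict) -> str:
    return re.sub(r"\{(\w+)\}", lambda m: str(row[m.group(1)]), template)

# First data row of grit_bins.csv, keyed by the column names declared
# in the metadata document.
uri = expand("#{location}", {"location": 42, "easting": 425584, "northing": 439562})
```

With the first row of grit_bins.csv, this yields the fragment identifier "#42" for grit bin number 42.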

Standardise dialects

CSVW adopts the CSV Dialect standard to tell CSV parsers whether the table is delimited with commas or tabs, or whether to expect double quotes etc.

Express types

Table cells may be annotated with a datatype to indicate whether they contain numbers or dates etc. This makes parsing trivial. Users waste less time preparing the data and can instead get started using it right away.

Machine-readable footnotes

Don't treat data tables like spreadsheets, cluttering them with footnotes that can only be read by humans! Having a separate metadata file means that both the annotations and the data table itself are cleaner and easier to process. Indeed, using a standard format makes it practically effortless.

Relate tables together

Declare indices and foreign key constraints between tables, letting CSV work like a relational database. You can share multiple tables in a neat package without having to squash everything into one page.

Disambiguate resources

Identify the resources described by a CSV file using URI templates. Understanding the data model helps users to interpret the data - for example, to distinguish which columns refer to which entities.

Create Linked-data

Create 5 Star Open Data. The CSVW standard describes a csv2rdf procedure for translating CSV into linked-data. Enjoy a default translation out of the box or customise your model with the three URI templates that apply to each cell.

Learn how to make CSVW

A tribute to CSVW made by Swirrl and shared under the Creative Commons Attribution-ShareAlike license.

We'd be delighted to receive contributions. If you've got an idea for a guide or tool, please create an issue on the site's repository.

CSV on the Web: A Primer

CSV is one of the most popular formats for publishing data on the web. It is concise, easy to understand by both humans and computers, and aligns nicely to the tabular nature of most data.

But CSV is also a poor format for data. There is no mechanism within CSV to indicate the type of data in a particular column, or whether values in a particular column must be unique. It is therefore hard to validate and prone to errors such as missing values or differing data types within a column.

The CSV on the Web Working Group has developed standard ways to express useful metadata about CSV files and other kinds of tabular data. This primer takes you through the ways in which these standards work together, covering:

What we mean by "tabular data" and "CSV"

Where files that provide metadata about CSV live

How to create a schema to validate the content of a CSV file

How to specify how a CSV file should be converted to RDF or JSON

How to provide other documentation and metadata about a CSV file

Where possible, this primer links back to the normative definitions of terms and properties in the standards. Nothing in this primer overrides those normative definitions.

The CSV on the Web Working Group was chartered to produce a recommendation "Access methods for CSV Metadata" as well as recommendations for "Metadata vocabulary for CSV data" and "Mapping mechanism to transforming CSV into various formats (e.g. RDF, JSON, or XML)". This non-normative document is a primer that describes how these standards work together for new readers. The normative standards are:

Model for Tabular Data and Metadata on the Web

Metadata Vocabulary for Tabular Data

Generating JSON from Tabular Data on the Web

Generating RDF from Tabular Data on the Web

Introduction

We'll begin with some basic concepts.

What is tabular data and CSV?

Tabular data is any data that can be arranged in a table, like the one here:

      | column 1                   | column 2                   | column 3
row 1 | cell in column 1 and row 1 | cell in column 2 and row 1 | cell in column 3 and row 1
row 2 | cell in column 1 and row 2 | cell in column 2 and row 2 | cell in column 3 and row 2
row 3 | cell in column 1 and row 3 | cell in column 2 and row 3 | cell in column 3 and row 3

There are lots of syntaxes for expressing tabular data on the web. You can put it in HTML tables, pass it around as Excel spreadsheets, or store it in an SQL database.

One easy way to pass around tabular data is as CSV: as comma-separated values. A CSV file writes each row on a separate line and each cell is separated from the next with a comma. The values of cells can be written with double quotes around them; this is necessary when a cell value contains a line break or a comma. So the tabular data above can be expressed in CSV as:

cell in column 1 and row 1,cell in column 2 and row 1,cell in column 3 and row 1

cell in column 1 and row 2,cell in column 2 and row 2,cell in column 3 and row 2

cell in column 1 and row 3,cell in column 2 and row 3,cell in column 3 and row 3

or, with double quotes around cell values:

"cell in column 1 and row 1","cell in column 2 and row 1","cell in column 3 and row 1"

"cell in column 1 and row 2","cell in column 2 and row 2","cell in column 3 and row 2"

"cell in column 1 and row 3","cell in column 2 and row 3","cell in column 3 and row 3"
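Both serialisations can be produced with any CSV library. As a sketch using Python's standard csv module, QUOTE_ALL forces the double-quoted variant, and either variant parses back to the same cells:

```python
import csv
import io

# The 3x3 table from the text, as rows of cell values.
rows = [[f"cell in column {c} and row {r}" for c in (1, 2, 3)]
        for r in (1, 2, 3)]

# csv.writer quotes only when necessary by default; QUOTE_ALL forces
# double quotes around every cell, producing the second variant above.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n").writerows(rows)
quoted = buf.getvalue()

# Parsing either serialisation yields the same cell values.
parsed = list(csv.reader(io.StringIO(quoted)))
```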

CSV files usually have an additional row at the top called a header row, which gives human-readable names or titles for each of the columns. Here is a sample CSV file that contains a header row:

"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"

"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"

"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"

"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

Column titles are a type of annotation on a column, not part of the data itself. For example, they aren't included when you're counting the rows of data in a table:

       | column 1 | column 2      | column 3  | column 4  | column 5   | column 6    | column 7
titles | country  | country group | name (en) | name (fr) | name (de)  | latitude    | longitude
row 1  | AT       | eu            | Austria   | Autriche  | Österreich | 47.6965545  | 13.34598005
row 2  | BE       | eu            | Belgium   | Belgique  | Belgien    | 50.501045   | 4.47667405
row 3  | BG       | eu            | Bulgaria  | Bulgarie  | Bulgarien  | 42.72567375 | 25.4823218

See also:

Tabular Data Models in [[tabular-data-model]]

How can you provide extra information about CSV data?

You can provide extra information, known as metadata, about CSV files using a JSON metadata file. If you're just providing metadata about one file, the easiest thing to do is to name the metadata file by adding -metadata.json to the end of the name of the CSV file. For example, if your CSV file is called countries.csv then call the metadata file countries.csv-metadata.json.

The simplest metadata file you can create contains a single table description and looks like:

{

"@context": "http://www.w3.org/ns/csvw",

"url": "countries.csv"

}

Metadata files must always include the @context property with a value "http://www.w3.org/ns/csvw": this enables implementations to tell that these are CSV metadata files. The url property points to the CSV file that the metadata file describes.

These metadata documents should be served from a web server with a media type of application/csvm+json if possible.

The description of a table within a metadata file can include:

documentation about the CSV file, such as a description, authorship or licensing information, which can provide context for people using the CSV file

a definition of the structure of the CSV file, which can be used to validate the data that it holds

instructions that enable processors to transform the CSV file into other formats, such as JSON or RDF

By default, if implementations can't find a metadata file by appending -metadata.json to the filename of the CSV file, they'll just look for a file called csv-metadata.json in the same directory as the CSV file.
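That default lookup order can be sketched in a few lines (the function name is illustrative, not part of any CSVW library; the full standard also allows discovery via HTTP Link headers and a site-wide /.well-known/csvm configuration):

```python
from urllib.parse import urljoin

# Sketch of the default metadata-discovery order described above:
# first "<file>-metadata.json", then "csv-metadata.json" in the same
# directory as the CSV file.
def candidate_metadata_urls(csv_url: str) -> list:
    return [
        csv_url + "-metadata.json",             # e.g. countries.csv-metadata.json
        urljoin(csv_url, "csv-metadata.json"),  # csv-metadata.json alongside the CSV
    ]
```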

Metadata files can also describe several CSV files at once, using a slightly different syntax:

{

"@context": "http://www.w3.org/ns/csvw",

"tables": [{

"url": "countries.csv"

}, {

"url": "country-groups.csv"

}, {

"url": "unemployment.csv"

}]

}

Here, the tables property holds an array of table descriptions, each with the URL of the CSV file that it's describing. The metadata file as a whole describes a group of tables. This is usually used when the tables relate to each other in some way: perhaps they're data in the same format from different periods of time, or perhaps they reference each other.

See also:

Locating Metadata in [[tabular-data-model]]

Metadata Format in [[tabular-metadata]]

How can you provide extra information about the columns in a CSV file?

You can give information about the columns in a CSV file through a table schema. The simplest thing you can do is say what those columns are called. For example, if you have some CSV like this:

"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"

"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"

"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"

"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

You can say that the table contains seven columns named as they are in this CSV file like so:

{

"@context": "http://www.w3.org/ns/csvw",

"url": "countries.csv",

"tableSchema": {

"columns": [{

"titles": "country"

},{

"titles": "country group"

},{

"titles": "name (en)"

},{

"titles": "name (fr)"

},{

"titles": "name (de)"

},{

"titles": "latitude"

},{

"titles": "longitude"

}]

}

}

A validator can check that the CSV file holds the expected columns (both the right number of columns and columns with the expected titles).
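A minimal sketch of that check (a real CSVW validator also handles title arrays, language maps and virtual columns; check_titles is an illustrative name):

```python
import csv
import io

def check_titles(csv_text: str, schema: dict) -> bool:
    """Compare a CSV header row against the column titles declared in
    a table schema like the one above."""
    header = next(csv.reader(io.StringIO(csv_text)))
    expected = [col["titles"] for col in schema["columns"]]
    return header == expected

schema = {"columns": [{"titles": t} for t in [
    "country", "country group", "name (en)", "name (fr)",
    "name (de)", "latitude", "longitude"]]}

csv_text = ('"country","country group","name (en)","name (fr)",'
            '"name (de)","latitude","longitude"\n'
            '"at","eu","Austria","Autriche","Österreich",'
            '"47.6965545","13.34598005"\n')
```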

See also:

Schemas in [[tabular-metadata]]

Columns in [[tabular-metadata]]

What tools implement or use CSV on the Web?

This Note can't keep an up-to-date list of the tools that implement or otherwise use CSV on the Web. Instead, look at:

CSV on the Web implementation reports

Entries tagged as CSV on the Web tools on the CSV on the Web Community Group blog

Documenting CSVs

Providing metadata about CSVs can be useful simply in providing extra information to anyone who wants to work with them.

How can you provide documentation about a CSV file?

Here's an example that includes some extra descriptive documentation about a number of CSV files:

{

"@context": "http://www.w3.org/ns/csvw",

"dc:title": "Unemployment in Europe (monthly)",

"dc:description": "Harmonized unemployment data for European countries.",

"dc:creator": "Eurostat",

"tables": [{

"url": "countries.csv",

"dc:title": "Countries"

}, {

"url": "country-groups.csv",

"dc:title": "Country groups"

}, {

"url": "unemployment.csv",

"dc:title": "Unemployment (monthly)",

"dc:description": "The total number of people unemployed"

}]

}

This example uses Dublin Core as a vocabulary for providing metadata. You can tell that's the vocabulary that's being used because the terms like dc:title and dc:description begin with the prefix dc, which stands for Dublin Core.

There are several different metadata vocabularies in common use around the web. Some people use Dublin Core. Some people use schema.org. Some people use DCAT. All of these vocabularies can be used independently or together. A publisher could alternatively use:

{

"@context": "http://www.w3.org/ns/csvw",

"schema:name": "Unemployment in Europe (monthly)",

"schema:description": "Harmonized unemployment data for European countries.",

"schema:creator": { "schema:name": "Eurostat" },

"tables": [{

"url": "countries.csv",

"schema:name": "Countries"

}, {

"url": "country-groups.csv",

"schema:name": "Country groups"

}, {

"url": "unemployment.csv",

"schema:name": "Unemployment (monthly)",

"schema:description": "The total number of people unemployed"

}]

}

It's not clear at the moment which metadata vocabulary will give publishers the most benefits. Search engines are likely to recognise schema.org. RDF-based systems are more likely to recognise Dublin Core.

More generally, you can use prefixed properties like these on any of the objects in a metadata document. The prefixes that are recognised are those used in the RDFa 1.1 Initial Context. Other properties must be named with full URLs.

See also:

Common Properties in [[tabular-metadata]]

How can you provide documentation about the columns in a CSV file?

You can use metadata properties like the ones used for tables for individual columns as well. For example:

{

"@context": "http://www.w3.org/ns/csvw",

"url": "countries.csv",

"tableSchema": {

"columns": [{

"titles": "country",

"dc:description": "The ISO two-letter code for a country, in lowercase."

},{

"titles": "country group",

"dc:description": "A lowercase two-letter code for a group of countries."

},{

"titles": "name (en)",

"dc:description": "The official name of the country in English."

},{

"titles": "name (fr)",

"dc:description": "The official name of the country in French."

},{

"titles": "name (de)",

"dc:description": "The official name of the country in German."

},{

"titles": "latitude",

"dc:description": "The latitude of an indicative point in the country."

},{

"titles": "longitude",

"dc:description": "The longitude of an indicative point in the country."

}]

}

}

See also:

Columns in [[tabular-metadata]]

Common Properties in [[tabular-metadata]]

What about when metadata like this is structured?

We've already seen an example where metadata supplied about a table is structured. Look at the schema:creator here:

{

"@context": "http://www.w3.org/ns/csvw",

"schema:name": "Unemployment in Europe (monthly)",

"schema:description": "Harmonized unemployment data for European countries.",

"schema:creator": { "schema:name": "Eurostat" },

"tables": [{

"url": "countries.csv"

}, {

"url": "country-groups.csv"

}, {

"url": "unemployment.csv"

}]

}

The metadata supplied for a table or group of tables can have nested objects and arrays of values within it; it is interpreted as if it were (a minimal version of) [[json-ld]]. Particular patterns that are useful are:

Use nested objects to create structure; for example you could supply the website and telephone number of the creator of a table as well as their name:

"schema:creator": {

"schema:name": "Eurostat",

"schema:url": "http://ec.europa.eu/eurostat",

"schema:contactPoint": [{

"schema:telephone": "+44 20 300 63103",

"schema:availableLanguage": "English"

}, {

"schema:telephone": "+49 30 223 86 029",

"schema:availableLanguage": "Deutsch"

}, {

"schema:telephone": "+33 1 77 68 89 07",

"schema:availableLanguage": "français"

}]

}

Use @type to indicate the type of objects if there's an appropriate type to use, for example in this case Eurostat is a governmental organisation:

"schema:creator": {

"schema:name": "Eurostat",

"@type": "schema:GovernmentOrganization"

}

Use an @id property to provide values that are URLs for things about which more information is available at that URL. For example, rather than providing all the information about Eurostat embedded within the metadata, you could just reference out to it:

"schema:creator": { "@id": "http://ec.europa.eu/eurostat" }

or you could include the @id along with other basic properties to make it more readable for humans.

See also:

Common Properties in [[tabular-metadata]]

How should you annotate individual cells?

There's no standardised facility in the CSV on the Web specifications for annotating individual cells, but there is a hook that will enable best practice about how to do that to emerge: the notes property on a table description can contain objects that represent annotations.

The W3C Web Annotation Working Group is working on a vocabulary for annotations themselves. This vocabulary includes the concept of a target for an annotation and its body (the content of the annotation).

If you are annotating a single cell, the target needs to point to that cell. The easiest way to do that is to use fragment identifiers for CSV files as defined in [[RFC7111]]. These fragment identifiers reference cells based on their position within the original CSV file, with the first row in the CSV file (usually the header row) counted as 1. For example, with the CSV file:

"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"

"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"

"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"

"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

the cell containing Belgique is at #cell=3,4. It's also possible to refer to ranges of cells with this syntax and to use * to refer to the last row in the file. For example, to target a comment on all the locations in the CSV file you could use the fragment identifier #cell=2,6-*,7.
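A sketch of resolving a single-cell fragment identifier against the CSV above (ranges and * from RFC 7111 are left out, and cell_at is an illustrative helper, not a standard API):

```python
import csv
import io
import re

COUNTRIES = ('"country","country group","name (en)","name (fr)",'
             '"name (de)","latitude","longitude"\n'
             '"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"\n'
             '"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"\n'
             '"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"\n')

def cell_at(csv_text: str, fragment: str) -> str:
    """Resolve a single-cell fragment like '#cell=3,4': row then
    column, both 1-based, with the header row counted as row 1.
    Ranges and '*' from RFC 7111 are not handled in this sketch."""
    row, col = map(int, re.fullmatch(r"#cell=(\d+),(\d+)", fragment).groups())
    table = list(csv.reader(io.StringIO(csv_text)))
    return table[row - 1][col - 1]
```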

To create comments, then, the notes property can hold an array of objects that use the Web Annotation structure. For example:

{

"@context": "http://www.w3.org/ns/csvw",

"url": "countries.csv",

"notes": [{

"type": "Annotation",

"target": "countries.csv#cell=2,6-*,7",

"body": "These locations are of representative points.",

"motivation": "commenting"

}, {

"type": "Annotation",

"target": "countries.csv#cell=3,4",

"body": "Corrected.",

"motivation": "editing"

}]

}

See also:

Table notes in [[tabular-metadata]]

Validating CSVs

Validation is all about checking whether a file contains what you expect it to contain. For CSV files, this can be about:

the number of columns there are and what they're called (see above)

what kind of values columns contain

whether any of those values are unique

whether they match values in other CSV files

How can you say what kinds of values are expected in a column?

There's lots more that you can say about the expected content of columns in a CSV file. The most obvious thing is to indicate the data type. For example, with the CSV file:

"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"

"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"

"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"

"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

The first five columns are strings and the last two are numbers. You can indicate this with the datatype property for each column:

{

"@context": "http://www.w3.org/ns/csvw",

"url": "countries.csv",

"tableSchema": {

"columns": [{

"titles": "country",

"datatype": "string"

},{

"titles": "country group",

"datatype": "string"

},{

"titles": "name (en)",

"datatype": "string"

},{

"titles": "name (fr)",

"datatype": "string"

},{

"titles": "name (de)",

"datatype": "string"

},{

"titles": "latitude",

"datatype": "number"

},{

"titles": "longitude",

"datatype": "number"

}]

}

}

You don't have to include "datatype": "string" for columns that are strings — columns are assumed to hold strings if no datatype is explicitly specified.
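A sketch of how a processor might apply these annotations when parsing, with the string default described above (only three datatypes are covered; the standard defines many more):

```python
from decimal import Decimal

# Illustrative casts for a few datatypes. Columns without an explicit
# datatype default to string, as described above.
CASTS = {"string": str, "integer": int, "number": Decimal}

def cast_row(row, columns):
    return [CASTS[col.get("datatype", "string")](value)
            for value, col in zip(row, columns)]

columns = [{"titles": "country"}, {"titles": "country group"},
           {"titles": "name (en)"}, {"titles": "name (fr)"},
           {"titles": "name (de)"},
           {"titles": "latitude", "datatype": "number"},
           {"titles": "longitude", "datatype": "number"}]
```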

There are a number of different datatypes supported by CSV on the Web implementations, based on the set defined in [[xmlschema11-2]]. The complete set is shown in the following diagram:

Diagram showing the built-in datatypes, based on [[xmlschema11-2]]; names in parentheses denote aliases to the [[xmlschema11-2]] terms (see the diagram in SVG or PNG formats)

See also:

Datatypes in [[tabular-metadata]]

Datatypes in [[tabular-data-model]]

How do you define new datatypes?

You can define new datatypes based on the built-in datatypes using an object as the value of the datatype property rather than a string. For example:

"datatype": {

"base": "integer",

"minimum": "1",

"maximum": "5"

}

The base property must be an existing datatype. The other properties on the new datatype define extra restrictions on values of the new datatype. You can give the new datatype a name and description to provide extra documentation for people using the data:

"datatype": {

"dc:title": "Star Rating",

"dc:description": "A star rating between 1 and 5.",

"base": "integer",

"minimum": "1",

"maximum": "5"

}

See also:

Derived Datatypes in [[tabular-metadata]]

Datatypes in [[tabular-data-model]]

How do you restrict what kind of strings a column contains?

In the example we've been using, the first column always contains a country code consisting of two lowercase letters. This is a new datatype based on string. You can specify the format for string values in a column using a regular expression, like this:

{

"titles": "country",

"datatype": {

"dc:title": "Country Code",

"dc:description": "Country codes as specified in ISO 3166.",

"base": "string",

"format": "[a-z]{2}"

}

}

It's also possible to restrict the length of a string-based datatype using the length or minLength and/or maxLength properties. For example the following says that the column holding the English names of countries must have values between 3 and 128 characters long:

{

"titles": "name (en)",

"datatype": {

"base": "string",

"minLength": "3",

"maxLength": "128"

}

}
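The format regex and length constraints shown above can be checked like this (a minimal sketch; valid_string is an illustrative name, and the format expression is treated here as matching the whole value, which is the behaviour the country-code example expects):

```python
import re

def valid_string(value: str, datatype: dict) -> bool:
    """Check a string value against a derived string datatype's
    'format' regular expression and length constraints, as in the
    two examples above."""
    fmt = datatype.get("format")
    if fmt is not None and re.fullmatch(fmt, value) is None:
        return False
    if len(value) < int(datatype.get("minLength", 0)):
        return False
    if len(value) > int(datatype.get("maxLength", len(value))):
        return False
    return True
```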

See also:

Formats for other types in [[tabular-data-model]]

Length Constraints in [[tabular-data-model]]

How do you restrict the size of numbers a column contains?

The size of numbers in a column can be restricted using the minimum and maximum properties and/or the minExclusive and maxExclusive properties. In our example, one column contains latitudes, which can range between -90 and +90:

{

"titles": "latitude",

"datatype": {

"base": "number",

"minimum": "-90",

"maximum": "90"

}

}

See also:

Value Constraints in [[tabular-data-model]]

How do you ensure that decimal numbers have a particular precision or leading zeros?

In the example we're using, the latitudes are provided to between six and eight decimal places and there are no leading or trailing zeros. You can use the format property to provide a pattern that matches these numbers. In the pattern, 0 represents a required digit and # represents an optional digit. For the latitude, the definition looks like:

{

"titles": "latitude",

"datatype": {

"base": "number",

"minimum": "-90",

"maximum": "90",

"format": "#0.000000##"

}

}

The format property can also be used to indicate that values in a column should have leading zeros. For example, if a column were supposed to hold a three digit number you could use the pattern 000.

See also:

Formats for numeric types in [[tabular-data-model]]

How do you validate numbers that aren't in standard numeric formats?

Sometimes numbers within a CSV file won't be in a standard numeric format. For example, they might include commas as grouping characters (e.g. 12,345,678) or as decimal points (e.g. 12,3). In these cases, you can use the format property with an object value.

To match numbers with grouping separators as in 12,345,678 you should specify "," as the groupChar for the format. The pattern property then holds the pattern that indicates how many digits should be in each group. This example validates numbers with groups of three digits separated by commas:

"datatype": {

"base": "integer",

"format": {

"groupChar": ",",

"pattern": "#,##0"

}

}

To match numbers with decimal separators other than ., as in 12,3, you should specify "," as the decimalChar for the format. This example validates numbers with commas as decimal points:

"datatype": {

"base": "number",

"format": {

"decimalChar": ","

}

}

You can mix and match decimal and grouping characters and patterns. For example, in France it's standard to use commas for decimal points and spaces for grouping characters, so CSV files produced in France might contain numbers like 1 234 567,89. These could be validated with a datatype like:

"datatype": {

"base": "number",

"format": {

"decimalChar": ",",

"groupChar": " ",

"pattern": "# ##0,0#"

}

}
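A sketch of parsing such values by normalising the separators before converting (pattern validation is omitted, and parse_number is an illustrative name, not a standard API):

```python
from decimal import Decimal

def parse_number(raw: str, fmt: dict) -> Decimal:
    """Parse a numeric cell using the groupChar/decimalChar idea
    above: strip grouping characters, then normalise the decimal
    separator to '.'."""
    group = fmt.get("groupChar")
    dec = fmt.get("decimalChar", ".")
    if group:
        raw = raw.replace(group, "")
    if dec != ".":
        raw = raw.replace(dec, ".")
    return Decimal(raw)
```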

See also:

Formats for numeric types in [[tabular-data-model]]

How do you restrict what kind of dates and times a column contains?

Dates and times are treated similarly to numbers. You can use the minimum, maximum, minExclusive and/or maxExclusive properties to restrict their values.

For example, to indicate that the column should contain dates later than 1st January 2000, you can use the datatype:

"datatype": {

"base": "date",

"minimum": "2000-01-01"

}

To indicate that the column should contain times before midday (exclusive), you can use the datatype:

"datatype": {

"base": "time",

"minExclusive": "12:00:00"

}

The format of dates or times used for minimum or maximum values is always the ISO 8601 format: yyyy-MM-dd for dates, HH:mm:ss.S for times and yyyy-MM-ddTHH:mm:ss.S for date/times.

See also:

Value Constraints in [[tabular-data-model]]

How do you validate dates that aren't in standard date or time formats?

Dates and times in CSV files often come in formats other than the standard ISO 8601 format. You can use the format property to indicate the expected format of the date or time.

For example, to recognise dates in the usual UK format such as 31/10/2015 for 31st October 2015, you could use:

"datatype": {

"base": "date",

"format": "dd/MM/yyyy"

}
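These date patterns map straightforwardly onto strptime-style codes; a sketch covering a few of them (the mapping table is illustrative and deliberately not exhaustive):

```python
from datetime import date, datetime

# Map a CSVW-style date pattern to a Python strptime pattern. Only a
# few formats are covered; a full implementation supports the whole
# list of required formats.
STRPTIME = {
    "yyyy-MM-dd": "%Y-%m-%d",
    "dd/MM/yyyy": "%d/%m/%Y",
    "M/d/yyyy": "%m/%d/%Y",
}

def parse_date(raw: str, fmt: str) -> date:
    return datetime.strptime(raw, STRPTIME[fmt]).date()
```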

Implementations are only required to understand a particular set of common formats for dates and times. These formats are, for dates:

yyyy-MM-dd e.g. 2015-03-22

yyyyMMdd e.g. 20150322

dd-MM-yyyy e.g. 22-03-2015

d-M-yyyy e.g. 22-3-2015

MM-dd-yyyy e.g. 03-22-2015

M-d-yyyy e.g. 3-22-2015

dd/MM/yyyy e.g. 22/03/2015

d/M/yyyy e.g. 22/3/2015

MM/dd/yyyy e.g. 03/22/2015

M/d/yyyy e.g. 3/22/2015

dd.MM.yyyy e.g. 22.03.2015

d.M.yyyy e.g. 22.3.2015

MM.dd.yyyy e.g. 03.22.2015

M.d.yyyy e.g. 3.22.2015

For times:

HH:mm:ss.S with one or more trailing S characters indicating the maximum number of fractional seconds e.g. HH:mm:ss.SSS for 15:02:37.143

HH:mm:ss e.g. 15:02:37

HHmmss e.g. 150237

HH:mm e.g. 15:02

HHmm e.g. 1502

And for date/times:

yyyy-MM-ddTHH:mm:ss.S with one or more trailing S characters indicating the maximum number of fractional seconds e.g. yyyy-MM-ddTHH:mm:ss.SSS for 2015-03-15T15:02:37.143

yyyy-MM-ddTHH:mm:ss e.g. 2015-03-15T15:02:37

yyyy-MM-ddTHH:mm e.g. 2015-03-15T15:02

any of the date formats above, followed by a single space, followed by any of the time formats above, e.g. M/d/yyyy HH:mm for 3/22/2015 15:02 or dd.MM.yyyy HH:mm:ss for 22.03.2015 15:02:37

None of these formats include names or abbreviations for months or days. The implementation you use might support other date and time formats as well, including specialised formats for the other date and time datatypes such as gMonthYear. Check your implementation's documentation to see what it supports.

See also:

Formats for dates and times in [[tabular-data-model]]

How do you validate boolean values that aren't true or false?

By default, validators will recognise boolean values that are 1 or 0 or true or false. If a CSV file contains boolean values like T and F or Yes and No then you can create a derived boolean datatype that uses that syntax using the format property, with the two possible values separated by a |, for example:

"datatype": {

"base": "boolean",

"format": "Yes|No"

}
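Interpreting cells against such a datatype can be sketched as follows (parse_boolean is an illustrative helper; the text before the | is the true value, the text after it the false value):

```python
def parse_boolean(raw: str, fmt: str = "true|false") -> bool:
    """Interpret a boolean cell using the 'format' convention above:
    the value before the '|' means true, the value after means false.
    Anything else is a validation error."""
    true_val, false_val = fmt.split("|")
    if raw == true_val:
        return True
    if raw == false_val:
        return False
    raise ValueError(f"{raw!r} is neither {true_val!r} nor {false_val!r}")
```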

See also:

Formats for booleans in [[tabular-data-model]]

How do you specify a list of valid values for a column?

The example CSV file we're using is this:

"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"

"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"

"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"

"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

In it, the second column, country group, contains either the value eu or the value non-eu. Despite there being two values, this isn't a boolean column. Instead, it's a column that has only two valid values.

There are two ways to specify that a column contains one of a list of values: using a regular expression to list the values and using a reference to a separate CSV file that contains the values.

Using a regular expression to give a list of valid values

Using a regular expression to list values works best if those values are strings, if there are only a few of them, and if they are self-explanatory such that you don't want to provide any additional information about them.

In this example, the country group column could be specified as:

{

"titles": "country group",

"datatype": {

"base": "string",

"format": "eu|non-eu"

}

}

As described above, the format property contains a regular expression. List the options separated by | and ensure that you escape any characters in the options that have special meaning in regular expressions.

See also:

Formats for other types in [[tabular-data-model]]

Using a separate CSV file to give a list of valid values

A more powerful method of listing the valid values in a particular column is to list those values in a separate CSV file. The CSV file can be very simple, containing just a single column that lists the valid values. In this example, we can create country-groups.csv containing:

group

eu

non-eu

We can then provide definitions for both the countries.csv and country-groups.csv files, and state that the country group column in countries.csv references the group column in country-groups.csv. This reference from one file to another is called a foreign key.

To use a foreign key, both files must be referenced in the same metadata document, and both columns must be given names. Column names are only used inside the metadata document and you can only use (ASCII) letters, numbers, . and _ within them. So the basic metadata document, before adding the foreign key, should look like:

{

"@context": "http://www.w3.org/ns/csvw",

"tables": [{

"url": "countries.csv",

"tableSchema": {

"columns": [{

"titles": "country"

},{

"name": "country_group",

"titles": "country group"

},{

"titles": "name (en)"

},{

"titles": "name (fr)"

},{

"titles": "name (de)"

},{

"titles": "latitude"

},{

"titles": "longitude"

}]

}

}, {

"url": "country-groups.csv",

"tableSchema": {

"columns": [{

"name": "group",

"titles": "group"

}]

}

}]

}

The foreign key is defined in the schema for the countries.csv table, as follows:

{

"@context": "http://www.w3.org/ns/csvw",

"tables": [{

"url": "countries.csv",

"tableSchema": {

"columns": [{

"titles": "country"

},{

"name": "country_group",

"titles": "country group"

},{

"titles": "name (en)"

},{

"titles": "name (fr)"

},{

"titles": "name (de)"

},{

"titles": "latitude"

},{

"titles": "longitude"

}],

"foreignKeys": [{

"columnReference": "country_group",

"reference": {

"resource": "country-groups.csv",

"columnReference": "group"

}

}]

}

}, {

"url": "country-groups.csv",

"tableSchema": {

"columns": [{

"name": "group",

"titles": "group"

}]

}

}]

}

The foreignKeys property can hold several foreign keys. Each contains a columnReference to a column or list of columns in one CSV file, and a reference which defines a column or list of columns in another CSV file.
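The check a validator performs for a foreign key can be sketched in a few lines of Python. This is an illustrative sketch, not code from any CSVW library; check_foreign_key is a hypothetical helper, and the two CSV files are inlined as strings following the example above:

```python
import csv
import io

def check_foreign_key(referencing_rows, key_column, allowed_values):
    """Return the rows whose key_column value is not in allowed_values."""
    return [row for row in referencing_rows if row[key_column] not in allowed_values]

# In-memory stand-ins for country-groups.csv and countries.csv.
groups_csv = "group\neu\nnon-eu\n"
countries_csv = "country,country group\nat,eu\nxx,emea\n"  # 'emea' is not a valid group

allowed = {row["group"] for row in csv.DictReader(io.StringIO(groups_csv))}
bad = check_foreign_key(csv.DictReader(io.StringIO(countries_csv)), "country group", allowed)
print([row["country"] for row in bad])  # the row using 'emea' violates the foreign key
```

A real validator reports such rows as foreign-key violations rather than silently dropping them.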

The advantage of this method of listing the values allowed in a column is that the CSV file that contains the list of possible values can also provide additional information about those values. For example, we can provide expansions of what eu and non-eu mean in different languages:

group,name (en),name (fr),name (de)

eu,"European Union","Union européenne","Europäische Union"

non-eu,"Non EU countries","Pays hors Union européenne",Nicht-EU-Länder

See also:

Foreign key examples in [[tabular-metadata]]

How do you enable a column to have a mix of value types?

Sometimes a column that contains numbers will contain special values, such as X or NK, when a value is unknown or redacted. If these columns are simply classified as numeric then the non-numeric values will be classed as errors.

To avoid values being classified as errors when they are being used to indicate missing values, list those values as null values using the null property. This can take either a single string or an array of strings. For example, the latitude column might usually be numeric but hold an X if there is no indicative point for the country:

{

"titles": "latitude",

"null": "X",

"datatype": {

"base": "number",

"minimum": "-90",

"maximum": "90"

}

}

The null property can also be useful when a column contains values that are of the right type but used to indicate a missing value. It's not uncommon, for example, for publishers to use the value 99 in a column that contains integers to indicate that a value is missing.
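The cell-parsing behaviour described here can be approximated as follows. This is a simplified sketch of the spec's parsing algorithm, with parse_cell as a hypothetical helper:

```python
def parse_cell(raw, null_values=("X",), datatype=float):
    """Return None for a declared null value, otherwise a typed value.
    A value that is neither null nor valid for the datatype would be
    flagged as an error by a real validator."""
    if raw in null_values:
        return None
    return datatype(raw)

print(parse_cell("47.6965545"))  # 47.6965545
print(parse_cell("X"))           # None
```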

See also:

null property in [[tabular-metadata]]

Parsing Cells in [[tabular-data-model]]

What if the cells in a column contain lists of values?

Cells may contain lists of values with spaces, semi-colons or other characters acting as separators. For example, instead of using separate latitude and longitude columns, the CSV that we're looking at could contain a single latlong column consisting of the latitude and longitude separated by a space:

"country","country group","name (en)","name (fr)","name (de)","latlong"

"at","eu","Austria","Autriche","Österreich","47.6965545 13.34598005"

"be","eu","Belgium","Belgique","Belgien","50.501045 4.47667405"

"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375 25.4823218"

In this scenario, the separator property can be used to indicate that the values in a column are lists themselves, and what separator is used between the items in the list. For example:

{

"titles": "latlong",

"separator": " ",

"datatype": {

"base": "number",

"minimum": "-180",

"maximum": "180"

}

}

When separator is specified, the datatype property applies to each of the values in the list. There's no way to indicate that the values in the list have different datatypes, or set limits on the length of the list.
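The effect of separator on cell parsing can be sketched like this (parse_list_cell is a hypothetical helper, not part of any CSVW tool):

```python
def parse_list_cell(raw, separator=" ", datatype=float):
    """Split a cell on the declared separator and apply the datatype to each item."""
    return [datatype(item) for item in raw.split(separator)]

print(parse_list_cell("47.6965545 13.34598005"))  # [47.6965545, 13.34598005]
```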

See also:

separator property in [[tabular-metadata]]

Parsing Cells in [[tabular-data-model]]

How do you ensure every row has a value for a column?

By default, a validator won't give any errors if a value is missing in a column. If you want to ensure that a value is provided for every row in the column, use the required property for that column, with the value true.

In our example, we might say that all the columns are required except the French and German names (applications being expected to default to the English name if the translation is missing):

"tableSchema": {

"columns": [{

"titles": "country",

"required": true

},{

"titles": "country group",

"required": true

},{

"titles": "name (en)",

"required": true

},{

"titles": "name (fr)"

},{

"titles": "name (de)"

},{

"titles": "latitude",

"required": true

},{

"titles": "longitude",

"required": true

}]

}

Setting required to true means that you can't have any null values in a column. If, in this example, latitude and longitude had null set to X then those columns couldn't contain an X. It doesn't usually make sense to specify both null and required.
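A required check in a validator amounts to scanning for empty cells in the named columns, roughly like this sketch (missing_required is a hypothetical helper):

```python
def missing_required(rows, required_columns):
    """Yield (row_number, column) pairs where a required cell is empty."""
    for i, row in enumerate(rows, start=1):
        for col in required_columns:
            if row.get(col, "") == "":
                yield (i, col)

rows = [
    {"country": "at", "latitude": "47.6965545"},
    {"country": "be", "latitude": ""},  # missing latitude
]
print(list(missing_required(rows, ["country", "latitude"])))  # [(2, 'latitude')]
```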

See also:

required property in [[tabular-metadata]]

How do you indicate all the values in a column are unique?

In our example, the country column must contain unique values: each row should define a different country. To specify this, you can use a primary key to refer to the name of the column:

"tableSchema": {

"columns": [{

"name": "country",

"titles": "country"

},{

"titles": "country group"

},{

"titles": "name (en)"

},{

"titles": "name (fr)"

},{

"titles": "name (de)"

},{

"titles": "latitude"

},{

"titles": "longitude"

}],

"primaryKey": "country"

}

Each CSV file can only have one primary key. A primary key can be made up of a number of columns that have to be unique in combination: the classic example would be ["firstName", "lastName"].
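A uniqueness check over a (possibly composite) primary key can be sketched as follows, using the classic firstName/lastName example; duplicate_keys is a hypothetical helper:

```python
def duplicate_keys(rows, key_columns):
    """Return the key tuples that appear more than once."""
    seen, dupes = set(), set()
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key in seen:
            dupes.add(key)
        seen.add(key)
    return dupes

people = [
    {"firstName": "Ada", "lastName": "Lovelace"},
    {"firstName": "Alan", "lastName": "Turing"},
    {"firstName": "Ada", "lastName": "Lovelace"},  # duplicate composite key
]
print(duplicate_keys(people, ["firstName", "lastName"]))  # {('Ada', 'Lovelace')}
```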

See also:

primaryKey property in [[tabular-metadata]]

Transforming CSVs

CSV is great for transferring data around the place, but it's often not as useful for processing that data. Other formats can be better at providing structure and meaning for data. So a means for transforming data out of CSV and into other formats is a common requirement.

What can you transform CSV into?

You can of course transform CSV into anything you like using your favourite programming language. However, the CSV on the Web specs provide standardised ways of mapping CSV into two other formats:

JSON, in [[csv2json]]

RDF, in [[csv2rdf]]

These specifications make use of the metadata described in this primer during the transformation to decide what to include in the transformed output and how to include it. Processors that support these standardised transformations can be used by people who can't program.

These specifications describe what output you get if you don't supply any metadata. Given a CSV file like this:

"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"

"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"

"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"

"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

the usual, minimal JSON output would be:

[{

"country": "at",

"country group": "eu",

"name (en)": "Austria",

"name (fr)": "Autriche",

"name (de)": "Österreich",

"latitude": "47.6965545",

"longitude": "13.34598005"

}, {

"country": "be",

"country group": "eu",

"name (en)": "Belgium",

"name (fr)": "Belgique",

"name (de)": "Belgien",

"latitude": "50.501045",

"longitude": "4.47667405"

}, {

"country": "bg",

"country group": "eu",

"name (en)": "Bulgaria",

"name (fr)": "Bulgarie",

"name (de)": "Bulgarien",

"latitude": "42.72567375",

"longitude": "25.4823218"

}]

and the RDF output would be:

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[

<#country> "at";

<#country%20group> "eu";

<#latitude> "47.6965545";

<#longitude> "13.34598005";

<#name%20%28de%29> "Österreich";

<#name%20%28en%29> "Austria";

<#name%20%28fr%29> "Autriche"

] .

[

<#country> "be";

<#country%20group> "eu";

<#latitude> "50.501045";

<#longitude> "4.47667405";

<#name%20%28de%29> "Belgien";

<#name%20%28en%29> "Belgium";

<#name%20%28fr%29> "Belgique"

] .

[

<#country> "bg";

<#country%20group> "eu";

<#latitude> "42.72567375";

<#longitude> "25.4823218";

<#name%20%28de%29> "Bulgarien";

<#name%20%28en%29> "Bulgaria";

<#name%20%28fr%29> "Bulgarie"

] .

The specifications define how to transform CSV into RDF. In this Primer all the examples use Turtle as the serialisation for that RDF. Implementations may generate other serialisations for RDF such as RDF/XML or JSON-LD.
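On the JSON side, the minimal-mode output for a headered CSV with no metadata is essentially what Python's csv.DictReader produces: one object per row, keyed by the header titles, with every value left as a string. A sketch, using an inlined fragment of the data above:

```python
import csv
import io
import json

csv_text = '''"country","name (en)","latitude"
"at","Austria","47.6965545"
"be","Belgium","50.501045"
'''

rows = list(csv.DictReader(io.StringIO(csv_text)))
print(json.dumps(rows, indent=2))
print(rows[0]["name (en)"])  # Austria; without datatype metadata, all values stay strings
```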

See also:

Minimal Mode in [[csv2json]]

Converting Tabular Data to RDF in [[csv2rdf]]

What values get used in the output of a transformation?

The result of a transformation will include typed values based on the datatype and language of each column. So if we state that the longitude and latitude are numbers and the names are strings in the given language, as in this metadata:

{

"@context": "http://www.w3.org/ns/csvw",

"url": "countries.csv",

"tableSchema": {

"columns": [{

"titles": "country"

},{

"titles": "country group"

},{

"titles": "name (en)",

"lang": "en"

},{

"titles": "name (fr)",

"lang": "fr"

},{

"titles": "name (de)",

"lang": "de"

},{

"titles": "latitude",

"datatype": "number"

},{

"titles": "longitude",

"datatype": "number"

}]

}

}

then the JSON will look like:

[{

"country": "at",

"country group": "eu",

"name (en)": "Austria",

"name (fr)": "Autriche",

"name (de)": "Österreich",

"latitude": 47.6965545,

"longitude": 13.34598005

},{

"country": "be",

"country group": "eu",

"name (en)": "Belgium",

"name (fr)": "Belgique",

"name (de)": "Belgien",

"latitude": 50.501045,

"longitude": 4.47667405

},{

"country": "bg",

"country group": "eu",

"name (en)": "Bulgaria",

"name (fr)": "Bulgarie",

"name (de)": "Bulgarien",

"latitude": 42.72567375,

"longitude": 25.4823218

}]

and the RDF will look like:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[

<#country> "at";

<#country%20group> "eu";

<#latitude> 4.76965545e1;

<#longitude> 1.334598005e1;

<#name%20%28de%29> "Österreich"@de;

<#name%20%28en%29> "Austria"@en;

<#name%20%28fr%29> "Autriche"@fr

] .

[

<#country> "be";

<#country%20group> "eu";

<#latitude> 5.0501045e1;

<#longitude> 4.47667405e0;

<#name%20%28de%29> "Belgien"@de;

<#name%20%28en%29> "Belgium"@en;

<#name%20%28fr%29> "Belgique"@fr

] .

[

<#country> "bg";

<#country%20group> "eu";

<#latitude> 4.272567375e1;

<#longitude> 2.54823218e1;

<#name%20%28de%29> "Bulgarien"@de;

<#name%20%28en%29> "Bulgaria"@en;

<#name%20%28fr%29> "Bulgarie"@fr

] .

See also:

Datatypes in [[csv2json]]

Datatypes in [[csv2rdf]]

What value gets used in the output if it's missing in the CSV?

If there's a missing value in the CSV, then usually the property will be omitted in the result as well. So if the latitude and longitude are missing for the first row of the CSV file we've been using, the equivalent JSON won't include those properties either:

{

"country": "at",

"country group": "eu",

"name (en)": "Austria",

"name (fr)": "Autriche",

"name (de)": "Österreich"

}

and nor will the RDF:

[

<#country> "at";

<#country%20group> "eu";

<#name%20%28de%29> "Österreich"@de;

<#name%20%28en%29> "Austria"@en;

<#name%20%28fr%29> "Autriche"@fr

] .

If you want a value to appear even when the value is missing in the CSV, you can provide that value as the default for the column. This value must be supplied as a string but will be treated exactly as if it had appeared within the CSV file. For example, if you supply a non-numeric string for a numeric column, as in:

{

"titles": "latitude",

"datatype": "number",

"default": "NOT KNOWN"

}

The default value will be seen as an invalid value and therefore represented as a string in the output.

See also:

Minimal Mode in [[csv2json]]

Converting Tabular Data to RDF in [[csv2rdf]]

How are the properties in the output of a transformation named?

By default, the properties in the JSON and RDF come from the titles of the columns (the headers in the CSV file). In RDF, since the properties are URIs, the names are URL-encoded and turned into fragment identifiers in a URL based on the location of the CSV file being transformed. Hence in the previous examples there have been property names like "name (en)" in JSON and #name%20%28en%29 in RDF.
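The fragment identifiers above come from percent-encoding the column title. In Python this is urllib.parse.quote with no characters treated as safe (title_to_fragment is a hypothetical helper):

```python
from urllib.parse import quote

def title_to_fragment(title):
    """Percent-encode a column title for use as a URL fragment identifier."""
    return "#" + quote(title, safe="")

print(title_to_fragment("name (en)"))  # #name%20%28en%29
```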

You can override this default by supplying a name for the column. That name will be used instead of the title of the column when creating the property. So if you have:

{

"titles": "name (en)",

"name": "english_name"

}

then the property will be called english_name in the JSON output and #english_name in the RDF output.

You can also use the propertyUrl property to supply a prefixed name or a URL for the property. For example, to use schema:latitude as the name for the latitude property in both the JSON and the RDF output, you could use:

{

"titles": "latitude",

"propertyUrl": "schema:latitude"

"datatype": "number"

}

The propertyUrl property can be used to map several columns in the CSV file onto properties with the same name. In our example, each country has several names which are all really the same property; the schema could look like:

{

"titles": "name (en)",

"propertyUrl": "schema:name"

"lang": "en"

},{

"titles": "name (fr)",

"propertyUrl": "schema:name"

"lang": "fr"

},{

"titles": "name (de)",

"propertyUrl": "schema:name"

"lang": "de"

}

With that schema, the JSON output will contain:

"schema:name": ["Belgium", "Belgique", "Belgien"]

and the RDF output will contain:

schema:name "Belgium"@en, "Belgique"@fr, "Belgien"@de

If there isn't a relevant property in one of the vocabularies built into CSV on the Web (those listed as part of the RDFa 1.1 initial context), the propertyUrl can hold a URL template. Usually this template won't include any substitutable parts, because the property is generally the same for the whole column. For example, you might have:

{

"titles": "country group",

"propertyUrl": "http://example.org/def/countryGroup"

}

In this case, the result of a transformation to JSON will contain:

"http://example.org/def/countryGroup": "eu"

and the RDF similarly:

"eu"

See also:

name property in [[tabular-metadata]]

propertyUrl property in [[tabular-metadata]]

How do you map values into URLs?

Sometimes a column contains a value that can be programmatically mapped into a URL. In this case, the valueUrl property contains a template for the URL that it should be mapped into.

For example, say that there were pages for each country on the web at e.g. http://example.org/country/at. In that case, the URL for the country could be generated with the URL template http://example.org/country/{country}. Within this template, {country} inserts the value from the column named country into the URL. So the metadata could contain:

{

"titles": "country",

"name": "country",

"valueUrl": "http://example.org/country/{country}",

"propertyUrl": "schema:url"

}

The JSON output would then contain:

"schema:url": "http://example.org/country/at"

and the RDF output:

schema:url <http://example.org/country/at>
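URL templates such as http://example.org/country/{country} follow the URI Template syntax (RFC 6570). Simple {name} substitution, which is all the examples here use, can be sketched like this (expand is a hypothetical helper covering only this simple case, not the full RFC 6570 syntax):

```python
def expand(template, values):
    """Naive expansion of simple {name} URI-template expressions."""
    for name, value in values.items():
        template = template.replace("{" + name + "}", str(value))
    return template

print(expand("http://example.org/country/{country}", {"country": "at"}))
# http://example.org/country/at
print(expand("https://www.google.com/maps/@{lat},{long},7z",
             {"lat": 50.501045, "long": 4.47667405}))
```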

If you want to preserve the original value from the column and also use it to create a URL, you may want to introduce a virtual column. For example, with the latitude and longitude of each country available, you might want to provide a link to a map centered on the country within Google Maps. The URLs for these look like https://www.google.com/maps/@50.501045,4.476674,7z, which can be generated with a template like https://www.google.com/maps/@{lat},{long},7z.

To add a property that points to this URL, add a virtual column at the end of the column definitions within the schema. A virtual column definition looks just like a normal column definition but with the virtual property set to true:

{

"@context": "http://www.w3.org/ns/csvw",

"url": "countries.csv",

"tableSchema": {

"columns": [{

"titles": "country"

},{

"titles": "country group"

},{

"titles": "name (en)",

"lang": "en"

},{

"titles": "name (fr)",

"lang": "fr"

},{

"titles": "name (de)",

"lang": "de"

},{

"titles": "latitude",

"name": "lat",

"datatype": "number"

},{

"titles": "longitude",

"name": "long",

"datatype": "number"

},{

"virtual": true,

"propertyUrl": "schema:hasMap",

"valueUrl": "https://www.google.com/maps/@{lat},{long},7z"

}]

}

}

In JSON, this will result in the output:

"schema:hasMap": "https://www.google.com/maps/@42.72567375,25.4823218,7z"

and in RDF, the output:

schema:hasMap <https://www.google.com/maps/@42.72567375,25.4823218,7z>

See also:

valueUrl property in [[tabular-metadata]]

How do you include an identifier for the thing described by each row?

By default, the things described by each row don't have identifiers associated with them in either JSON or RDF outputs. You can add an identifier for the row by setting the aboutUrl property. Usually that's done at the top level of the schema.

For example, say each row in countries.csv was about a country whose identifier looked like http://example.org/country/{code}, where code was the value within the first column of the CSV file (the country column). The aboutUrl property could be set to generate this URL for each row using a URL template:

{

"@context": "http://www.w3.org/ns/csvw",

"url": "countries.csv",

"tableSchema": {

"aboutUrl": "http://example.org/country/{code}",

"columns": [{

"titles": "country",

"name": "code"

},{

"titles": "country group"

},{

"titles": "name (en)",

"lang": "en"

},{

"titles": "name (fr)",

"lang": "fr"

},{

"titles": "name (de)",

"lang": "de"

},{

"titles": "latitude",

"datatype": "number"

},{

"titles": "longitude",

"datatype": "number"

}]

}

}

In the JSON, these identifiers are turned into @id properties on the objects generated for each row:

[{

"@id": "http: //example.org/country/at",

"code": "at",

"country group": "eu",

"name (en)": "Austria",

"name (fr)": "Autriche",

"name (de)": "Österreich"

},{

"@id": "http: //example.org/country/be",

"code": "be",

"country group": "eu",

"name (en)": "Belgium",

"name (fr)": "Belgique",

"name (de)": "Belgien",

"latitude": 50.501045,

"longitude": 4.47667405

},{

"@id": "http: //example.org/country/bg",

"code": "bg",

"country group": "eu",

"name (en)": "Bulgaria",

"name (fr)": "Bulgarie",

"name (de)": "Bulgarien",

"latitude": 42.72567375,

"longitude": 25.4823218

}]

In the RDF, these identifiers become the identifiers for the entities that the properties relate to, rather than those being blank nodes:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<#code> "at";

<#country%20group> "eu";

<#latitude> 4.76965545e1;

<#longitude> 1.334598005e1;

<#name%20%28de%29> "Österreich"@de;

<#name%20%28en%29> "Austria"@en;

<#name%20%28fr%29> "Autriche"@fr .

<#code> "be";

<#country%20group> "eu";

<#latitude> 5.0501045e1;

<#longitude> 4.47667405e0;

<#name%20%28de%29> "Belgien"@de;

<#name%20%28en%29> "Belgium"@en;

<#name%20%28fr%29> "Belgique"@fr .

<#code> "bg";

<#country%20group> "eu";

<#latitude> 4.272567375e1;

<#longitude> 2.54823218e1;

<#name%20%28de%29> "Bulgarien"@de;

<#name%20%28en%29> "Bulgaria"@en;

<#name%20%28fr%29> "Bulgarie"@fr .

See also:

aboutUrl property in [[tabular-metadata]]

How do you indicate the type of the thing described by each row?

Whether generating JSON or RDF it can be useful to indicate that each row contains data about a particular type of thing, such as a Person or a Country. There isn't usually a column within tabular data that indicates the type of the row (because it's generally the same for every row, so including it would be superfluous), so you have to add it as a virtual column.

The virtual column needs to come after the descriptions of columns actually within the data. It should have its virtual property set to true and its propertyUrl set to rdf:type to indicate that the virtual column will indicate the type of entity the row is about. The valueUrl property can then be set to the prefixed name or URL of the type of the entity. For example, when each row represents a Country, you might use:

{

"virtual": true,

"propertyUrl": "rdf:type",

"valueUrl": "schema:Country"

}

In the JSON this value will be transformed into the value of the @type property on the relevant object:

"@type": "schema:Country"

In RDF output, it becomes the class for the entity:

a schema:Country

See also:

virtual property in [[tabular-metadata]]

virtual columns in [[tabular-data-model]]

How do you include extra metadata in the result of the transformation?

Transformations into JSON or RDF can be carried out in one of two modes. In minimal mode, which is what we've looked at so far, the output only contains data from the rows within the CSV file. In standard mode, the output also contains metadata about the CSV file, including metadata from the metadata file.

For example, say that our metadata file looked like:

{

"@context": "http://www.w3.org/ns/csvw",

"url": "countries.csv",

"schema:name": "Countries",

"schema:description": "European countries for which comparative statistics are collected by Eurostat.",

"schema:creator": { "schema:name": "Eurostat" },

"tableSchema": {

...

}

}

The output of a standard-mode JSON transformation would look like:

{

"tables": [{

"url":"countries.csv",

"schema:name":"Countries",

"schema:description":"European countries for which comparative statistics are collected by Eurostat.",

"schema:creator": {

"schema:name": "Eurostat"

},

"row": [{

"url": "countries.csv#row=2",

"rownum":1,

"describes": [{

"country": "at",

"country group": "eu",

"name (en)": "Austria",

"name (fr)": "Autriche",

"name (de)": "Österreich",

"latitude": 47.6965545,

"longitude": 13.34598005

}]

}, {

"url": "countries.csv#row=3",

"rownum": 2,

"describes": [{

"country": "be",

"country group": "eu",

"name (en)": "Belgium",

"name (fr)": "Belgique",

"name (de)": "Belgien",

"latitude": 50.501045,

"longitude": 4.47667405

}]

}, {

"url": "countries.csv#row=4",

"rownum": 3,

"describes": [{

"country": "bg",

"country group": "eu",

"name (en)": "Bulgaria",

"name (fr)": "Bulgarie",

"name (de)": "Bulgarien",

"latitude": 42.72567375,

"longitude": 25.4823218

}]

}]

}]

}

Similarly, the output of the standard-mode RDF transformation would look like:

@prefix csvw: <http://www.w3.org/ns/csvw#> .

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

@prefix schema: <http://schema.org/> .

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[

a csvw:TableGroup;

csvw:table [

a csvw:Table;

schema:creator [ schema:name "Eurostat" ];

schema:description "European countries for which comparative statistics are collected by Eurostat.";

schema:name "Countries";

csvw:row [

a csvw:Row;

csvw:describes [

<#country> "at";

<#country%20group> "eu";

<#latitude> 4.76965545e1;

<#longitude> 1.334598005e1;

<#name%20%28de%29> "Österreich"@de;

<#name%20%28en%29> "Austria"@en;

<#name%20%28fr%29> "Autriche"@fr

];

csvw:rownum 1;

csvw:url <#row=2>

], [

a csvw:Row;

csvw:describes [

<#country> "be";

<#country%20group> "eu";

<#latitude> 5.0501045e1;

<#longitude> 4.47667405e0;

<#name%20%28de%29> "Belgien"@de;

<#name%20%28en%29> "Belgium"@en;

<#name%20%28fr%29> "Belgique"@fr

];

csvw:rownum 2;

csvw:url <#row=3>

], [

a csvw:Row;

csvw:describes [

<#country> "bg";

<#country%20group> "eu";

<#latitude> 4.272567375e1;

<#longitude> 2.54823218e1;

<#name%20%28de%29> "Bulgarien"@de;

<#name%20%28en%29> "Bulgaria"@en;

<#name%20%28fr%29> "Bulgarie"@fr

];

csvw:rownum 3;

csvw:url <#row=4>

];

csvw:url <>

]

] .

Metadata provided about tables is interpreted based on the rules for [[json-ld]], which means that you can provide as much structure within that metadata as you like, including structured values, languages and datatypes, so that the data in the output includes what you need it to.

See also:

Common Properties in [[tabular-metadata]]

Standard Mode in [[csv2json]]

Converting Tabular Data to RDF in [[csv2rdf]]

How can you remove output from a transformation result?

By default, the output from a JSON or RDF transformation will include all the data from all the columns of all the tables in the metadata document. It may be that you're not interested in some of that within the output of your transformation. In that case, you can use the suppressOutput property in the metadata to exclude the data that you're not interested in.

For example, perhaps I'm not interested in the non-English names of countries in my output. In that case, I could suppress them as follows:

{

"titles": "name (fr)",

"lang": "fr",

"suppressOutput": true

}

Similarly, when generating the output for a set of tables, you can suppress the output from a whole table by adding the suppressOutput property to the description of that table:

{

"@context": "http://www.w3.org/ns/csvw",

"tables": [{

"url": "countries.csv"

}, {

"url": "country-groups.csv",

"suppressOutput": true

}, {

"url": "unemployment.csv"

}]

}

See also:

suppressOutput property for columns in [[tabular-metadata]]

suppressOutput property for tables in [[tabular-metadata]]

How do you transform into nested structures in JSON?

While tabular data is by necessity flat, it often holds data that is actually structured. For example, the data that we have been looking at:

"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"

"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"

"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"

"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

if modelled according to the schema.org vocabulary, would look like:

[{

"@type": "schema:Country",

"schema:name": ["Austria", "Autriche", "Österreich"],

"schema:geo": {

"@type": "schema:GeoCoordinates",

"schema:latitude":47.6965545,

"schema:longitude":13.34598005

}

}, {

"@type": "schema:Country",

"schema:name": ["Belgium", "Belgique", "Belgien"],

"schema:geo": {

"@type": "schema:GeoCoordinates",

"schema:latitude":50.501045,

"schema:longitude":4.47667405

}

}, {

"@type": "schema:Country",

"schema:name": ["Bulgaria", "Bulgarie", "Bulgarien"],

"schema:geo": {

"@type": "schema:GeoCoordinates",

"schema:latitude":42.72567375,

"schema:longitude":25.4823218

}

}]

Generating JSON in this shape requires the judicious use of virtual columns, aboutUrl and valueUrl: if you create a column whose valueUrl corresponds to the aboutUrl of another column, you will create nested properties.

In this example, we can use two aboutUrls: one in the form http://example.org/country/{code} for the Country and one in the form http://example.org/country/{code}#geo for the geo-coordinates of the country. The names are properties of the former while the longitude and latitude are properties of the latter. A virtual column can add the association between the two objects, with a propertyUrl of schema:geo, like so:

{

"@context": "http://www.w3.org/ns/csvw",

"url": "countries.csv",

"tableSchema": {

"aboutUrl": "http://example.org/country/{code}",

"columns": [{

"titles": "country",

"name": "code",

"suppressOutput": true

},{

"titles": "country group",

"suppressOutput": true

},{

"titles": "name (en)",

"lang": "en",

"propertyUrl": "schema:name"

},{

"titles": "name (fr)",

"lang": "fr",

"propertyUrl": "schema:name"

},{

"titles": "name (de)",

"lang": "de",

"propertyUrl": "schema:name"

},{

"titles": "latitude",

"datatype": "number",

"aboutUrl": "http://example.org/country/{code}#geo",

"propertyUrl": "schema:latitude"

},{

"titles": "longitude",

"datatype": "number",

"aboutUrl": "http://example.org/country/{code}#geo",

"propertyUrl": "schema:longitude"

},{

"virtual": true,

"propertyUrl": "rdf:type",

"valueUrl": "schema:Country"

},{

"virtual": true,

"propertyUrl": "schema:geo",

"valueUrl": "http://example.org/country/{code}#geo"

},{

"virtual": true,

"aboutUrl": "http://example.org/country/{code}#geo",

"propertyUrl": "rdf:type",

"valueUrl": "schema:GeoCoordinates"

}]

}

}

Note also in this example the use of suppressOutput to remove properties that we're not interested in, and the use of virtual columns to add type information to both types of generated object.

The result of this transformation is close to what we were aiming for, though with the addition of @id properties:

[{

"@id": "http://example.org/country/at",

"@type": "schema:Country",

"schema:name": ["Austria", "Autriche", "Österreich"],

"schema:geo": {

"@id": "http://example.org/country/at#geo",

"@type": "schema:GeoCoordinates",

"schema:latitude": 47.6965545,

"schema:longitude": 13.34598005

}

}, {

"@id": "http://example.org/country/be",

"@type": "schema:Country",

"schema:name": ["Belgium", "Belgique", "Belgien"],

"schema:geo": {

"@id":"http://example.org/country/be#geo",

"@type": "schema:GeoCoordinates",

"schema:latitude": 50.501045,

"schema:longitude": 4.47667405

}

}, {

"@id": "http://example.org/country/bg",

"@type": "schema:Country",

"schema:name": ["Bulgaria", "Bulgarie", "Bulgarien"],

"schema:geo": {

"@id": "http://example.org/country/bg#geo",

"@type": "schema:GeoCoordinates",

"schema:latitude": 42.72567375,

"schema:longitude": 25.4823218

}

}]
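The nesting rule — a column's valueUrl matching another column's aboutUrl creates a nested object — can be sketched procedurally. This is an illustrative sketch, not processor code; nest_row is a hypothetical helper, and for brevity it takes the three names pre-merged into a single list rather than as separate columns:

```python
def nest_row(row):
    """Build the nested object for one flat CSV row, following the
    aboutUrl/valueUrl pattern in the metadata above (example.org URLs
    and the 'code' name come from that metadata)."""
    base = "http://example.org/country/" + row["code"]
    return {
        "@id": base,
        "@type": "schema:Country",
        "schema:name": row["names"],
        "schema:geo": {
            # this object's @id is the valueUrl of the schema:geo virtual column
            "@id": base + "#geo",
            "@type": "schema:GeoCoordinates",
            "schema:latitude": float(row["latitude"]),
            "schema:longitude": float(row["longitude"]),
        },
    }

row = {"code": "at", "names": ["Austria", "Autriche", "Österreich"],
       "latitude": "47.6965545", "longitude": "13.34598005"}
print(nest_row(row)["schema:geo"]["@id"])  # http://example.org/country/at#geo
```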

The same metadata will generate similar RDF, though the nesting structure is not so obvious because of the way RDF works:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

@prefix schema: <http://schema.org/> .

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/country/at> a schema:Country;

schema:geo <http://example.org/country/at#geo>;

schema:name "Austria"@en, "Autriche"@fr, "Österreich"@de .

<http://example.org/country/be> a schema:Country;

schema:geo <http://example.org/country/be#geo>;

schema:name "Belgium"@en, "Belgique"@fr, "Belgien"@de .

<http://example.org/country/bg> a schema:Country;

schema:geo <http://example.org/country/bg#geo>;

schema:name "Bulgaria"@en, "Bulgarie"@fr, "Bulgarien"@de .

<http://example.org/country/at#geo> a schema:GeoCoordinates;

schema:latitude 4.76965545e1;

schema:longitude 1.334598005e1 .

<http://example.org/country/be#geo> a schema:GeoCoordinates;

schema:latitude 5.0501045e1;

schema:longitude 4.47667405e0 .

<http://example.org/country/bg#geo> a schema:GeoCoordinates;

schema:latitude 4.272567375e1;

schema:longitude 2.54823218e1 .

See also:

Generating Nested Objects in [[csv2json]]

How do you indicate that values should be mapped to a list rather than repeating properties in RDF?

If you are used to using RDF you'll know that there's a big difference between a property that has multiple values (i.e. multiple triples with the same subject and property) and a property that has an rdf:List as a value. Sometimes one is appropriate, sometimes the other. The ordered property enables you to indicate which to use for values that are sequences in the original data.

Let's use as an example the version of our data in which the latitude and longitude are in the same column:

"country","country group","name (en)","name (fr)","name (de)","latlong"

"at","eu","Austria","Autriche","Österreich","47.6965545 13.34598005"

"be","eu","Belgium","Belgique","Belgien","50.501045 4.47667405"

"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375 25.4823218"

In this example, if we state that the latlong column is space-separated like so:

{

"titles": "latlong",

"separator": " ",

"datatype": "number"

}

we'll end up with output like this:

[

<#country> "at";

<#country%20group> "eu";

<#name%20%28en%29> "Austria"@en;

<#name%20%28fr%29> "Autriche"@fr;

<#name%20%28de%29> "Österreich"@de;

<#latlong> 4.76965545e1, 1.334598005e1;

] .

This shows the #latlong property having two values: 4.76965545e1 and 1.334598005e1. These values could easily get mixed up such that the first was taken to be the longitude and the second the latitude rather than the other way around.

To avoid this mix-up occurring, set the ordered property to true:

{

"titles": "latlong",

"separator": " ",

"ordered": true,

"datatype": "number"

}

This will make the processor use an rdf:List for the value of the #latlong property instead, which in Turtle looks like:

[

<#country> "at";

<#country%20group> "eu";

<#name%20%28en%29> "Austria"@en;

<#name%20%28fr%29> "Autriche"@fr;

<#name%20%28de%29> "Österreich"@de;

<#latlong> ( 4.76965545e1 1.334598005e1 );

] .

The ordered property makes no difference to JSON output because sequences in CSV are always transformed into arrays in JSON.

See also:

ordered property in [[tabular-metadata]]

How should you transform CSV into JSON-LD?

As illustrated above, it is possible to transform CSV into something that looks like JSON-LD by transforming it into JSON. You can add @id properties for identifiers using aboutUrl and add @type properties for types using virtual columns.

However, if you're really after JSON-LD as an output, the best route is to transform into RDF and emit that RDF as JSON-LD. This will give you more control over the context that's used to determine the properties and structure of the output.

See also:

Converting Tabular Data to RDF in [[csv2rdf]]

How should you display CSV tables in HTML?

There's no single specified way of displaying CSV tables in HTML. If you are writing code to do so, it's good practice to:

use the locale of the user of the browser to determine which titles to display for columns, and how to format numbers and dates within the table

respect the directionality of the table and the text within the table that's indicated through the tableDirection and textDirection properties

use the aboutUrl and valueUrl properties to link out to other pages from cells within the table, and the propertyUrl to link out to information about the property a column represents; these properties could also be used to embed RDFa into the HTML table, as described in [[html-rdfa]]

display the metadata that's available about the table so that people get a full idea of the context of the data

provide a way to access notes and annotations that have been provided about individual cells, rows or columns

highlight cells that contain errors so that they're easy to spot and correct

provide links to other related resources indicated through the metadata supplied for the table

keep columns that are used to provide a title for each row (using rowTitles) visible; for screen readers, use these to label the rows so that people can navigate through the table easily

link out to the original CSV file so that people can download it

See also:

Presenting Tables in [[tabular-data-model]]

How can you transform CSV into the Data Cube vocabulary?

The Data Cube vocabulary [[vocab-data-cube]] is a vocabulary for statistical data based on SDMX (Statistical Data and Metadata eXchange). Statistical data is often expressed in tables, sometimes with one row per Observation and sometimes with each cell containing a different Observation.

Generating data in the Data Cube vocabulary requires the use of techniques that have been described above, such as adding identifiers to entities, adding structure and types through virtual columns, and using metadata to supply additional static context.

There is a fully worked example of transforming to Data Cube, using meteorological data as its basis, available on the GitHub repository for the Working Group.

See also:

Generating Data Cube Example

How can you transform CSV into other formats?

While there are only specifications for transforming CSV into JSON and RDF, there is an extension mechanism within CSV metadata to indicate other transformations that could be applied to CSV files. The transformations property on a table description holds an array of descriptions of transformations that processors could carry out. There is no guarantee that a given processor will recognise them, but over time recognised practices for how such transformations work may emerge.

The transformations must have the following properties:

targetFormat

gives a URL for the format that the transformation transforms into, for example http://www.iana.org/assignments/media-types/application/xml for XML

url

points to a script or template that can be used to transform the CSV into that format

scriptFormat

gives a URL for the format that the script is in, for example https://mustache.github.io/ for Mustache or http://www.iana.org/assignments/media-types/application/javascript for Javascript

You can also supply a titles property to provide a human-readable description of the output of the transformation, and a source property to indicate that the input to the transformation isn't the original CSV or tabular data, but json or rdf.

For example, if I wanted to convert the data that we've been using into XML, I could create a Mustache template like this at xml-template.mustache:

{{#tables}}
<countries>
{{#row}}{{#describes}}
  <country>
    <name lang="en">{{name_en}}</name>
    <name lang="fr">{{name_fr}}</name>
    <name lang="de">{{name_de}}</name>
  </country>
{{/describes}}{{/row}}
</countries>
{{/tables}}

In the metadata for the CSV file I could then include:

{

"@context": "http://www.w3.org/ns/csvw",

"url": "countries.csv",

"transformations": [{

"targetFormat": "http://www.iana.org/assignments/media-types/application/xml",

"titles": "Simple XML version",

"url": "xml-template.mustache",

"scriptFormat": "https://mustache.github.io/",

"source": "json"

}]

}

Processors that recognise the URL for Mustache could offer users the option of passing the JSON output to a Mustache processor, which would generate:

<countries>
  <country>
    <name lang="en">Austria</name>
    <name lang="fr">Autriche</name>
    <name lang="de">Österreich</name>
  </country>
  <country>
    <name lang="en">Belgium</name>
    <name lang="fr">Belgique</name>
    <name lang="de">Belgien</name>
  </country>
  <country>
    <name lang="en">Bulgaria</name>
    <name lang="fr">Bulgarie</name>
    <name lang="de">Bulgarien</name>
  </country>
</countries>

Processors should document the kind of transformation scripts that they can recognise and how they process them.

See also:

Transformation Definitions in [[tabular-metadata]]

Handling language in CSVs

There are a number of features in CSV on the Web metadata documents that support CSV files using different languages. We have already discussed handling varying number formats and date formats. Here we'll look at how to create metadata files, schemas and CSV files that take account of, and work across, multiple languages.

How do you indicate the language used by the metadata file?

The metadata file will often contain natural language text, such as titles and descriptions of columns and tables. Unless you specify otherwise, implementations will assume all this text is in an undefined language (und). If you want to specify what natural language is in use within the metadata file, you have to change the way the @context is specified. Instead of the normal value:

"@context": "http://www.w3.org/ns/csvw"

The @context should take an array, where the first value is the usual URL as a string and the second is an object with a @language property set to the language being used within the metadata file. This example, which has English-language titles and descriptions, illustrates:

{

"@context": [ "http://www.w3.org/ns/csvw", { "@language": "en" } ],

"dc:title": "Countries",

"url": "countries.csv",

"tableSchema": {

"columns": [{

"titles": "country",

"dc:description": "The ISO two-letter code for a country, in lowercase."

},{

"titles": "country group",

"dc:description": "A lowercase two-letter code for a group of countries."

},{

"titles": "name (en)",

"dc:description": "The official name of the country in English."

},{

"titles": "name (fr)",

"dc:description": "The official name of the country in French."

},{

"titles": "name (de)",

"dc:description": "The official name of the country in German."

},{

"titles": "latitude",

"dc:description": "The latitude of an indicative point in the country."

},{

"titles": "longitude",

"dc:description": "The longitude of an indicative point in the country."

}]

}

}

See also:

default language in [[tabular-metadata]]

How do you provide metadata such as descriptions in different languages?

Metadata such as descriptions can be objects rather than strings. Using objects is useful when you want to provide the language for a value. In this case, the object should have two properties: a @value property holding the natural-language string and a @language property indicating what language that string is in. For example:

{

"titles": "name (en)",

"dc:description": {

"@value": "The official name of the country in English.",

"@language": "en"

}

}

You can use an array to provide the same metadata in many different languages, for example:

"dc:title": [{

"@language": "en",

"@value": "Unemployment in Europe (monthly)"

},{

"@language": "de",

"@value": "Arbeitslosigkeit in Europa (monatlich)"

},{

"@language": "fr",

"@value": "Le Chômage en Europe (mensuel)"

}]

If you don't indicate the language used for metadata, processors will assume it's the default language used in the metadata as a whole.

See also:

Values of Common Properties in [[tabular-metadata]]

How do you provide titles for columns in different languages?

You can use an object as the value for the titles property for a column to provide titles in different languages. Within the object, each property is a language and the value is the title in that language:

"titles": {

"en": "Country",

"de": "Land",

"fr": "Pays"

}

If you don't indicate the language used for the title of a column, processors will assume it's the default language used in the metadata as a whole.
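A consuming application can use this structure to pick a title matching the user's locale. A minimal sketch (pick_title is a hypothetical helper, not part of any CSVW library, and the fallback behaviour shown is an assumption rather than anything the specification mandates):

```python
# Sketch: choosing which title to display from a CSVW titles object.
titles = {"en": "Country", "de": "Land", "fr": "Pays"}

def pick_title(titles, preferred, default_lang="en"):
    # Fall back to the metadata's default language, then to any title at all.
    return titles.get(preferred) or titles.get(default_lang) or next(iter(titles.values()))

print(pick_title(titles, "de"))  # Land
print(pick_title(titles, "es"))  # Country (falls back to the default language)
```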

See also:

Natural Language Properties in [[tabular-metadata]]

How do you specify the language of the values in a column?

Within an individual CSV file, it may be that different columns contain values that are in different languages. In the example we're using, there are three columns that each contain the name of a country, in English, French and German:

"country","country group","name (en)","name (fr)","name (de)","latlong"

"at","eu","Austria","Autriche","Österreich","47.6965545 13.34598005"

"be","eu","Belgium","Belgique","Belgien","50.501045 4.47667405"

"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375 25.4823218"

Use the lang property on the column description to indicate the language of text in that column:

{

"@context": [ "http://www.w3.org/ns/csvw", { "@language": "en" } ],

"dc:title": "Countries",

"url": "countries.csv",

"tableSchema": {

"columns": [{

"titles": "country",

"dc:description": "The ISO two-letter code for a country, in lowercase."

},{

"titles": "country group",

"dc:description": "A lowercase two-letter code for a group of countries."

},{

"titles": "name (en)",

"dc:description": "The official name of the country in English.",

"lang": "en"

},{

"titles": "name (fr)",

"dc:description": "The official name of the country in French.",

"lang": "fr"

},{

"titles": "name (de)",

"dc:description": "The official name of the country in German.",

"lang": "de"

},{

"titles": "latitude",

"dc:description": "The latitude of an indicative point in the country."

},{

"titles": "longitude",

"dc:description": "The longitude of an indicative point in the country."

}]

}

}

There's no relationship between the language used in a metadata file and that used in the CSV file that it describes.

See also:

lang property in [[tabular-metadata]]

How do you indicate that tables should be displayed right-to-left?

Implementations that display tables according to the specs should mostly be able to guess whether a table should be displayed left-to-right or right-to-left. They will look at the content of the cells to work out which way to display each cell's content, and at the content of the table as a whole to work out whether to display the first column on the right or the left of the page.

If you want to override the direction of a particular column, you can set the textDirection property on the column description explicitly to rtl or ltr:

{

"titles": "name (ar)",

"lang": "ar",

"textDirection": "rtl"

}

If you want to override the direction of the table as a whole, you can set the tableDirection property on the description of the table, or on the description of the group to cover all tables in the group.

{

"@context": "http://www.w3.org/ns/csvw",

"url": "results.csv",

"tableDirection": "rtl"

}

The value of the tableDirection property is inherited by all columns in the table, so any text within the table will similarly be displayed right-to-left. This can be overridden by setting textDirection to ltr or auto (in which case the direction of the text within each cell will be determined by its contents).

See also:

textDirection property in [[tabular-metadata]]

tableDirection property in [[tabular-metadata]]

Bidirectional Tables in [[tabular-data-model]]

Advanced Use

How do you support units of measure?

There is no native support for expressing the units of measure for a particular column. You can, however, use documentation to tell people who are using the data what unit of measure is used for that particular column. This can be informal within the description of the column:

{

"titles": "distance",

"dc:description": "Distance (kilometres)"

}

Alternatively, it can be more explicit using an existing units-of-measure property and vocabulary, such as:

{

"titles": "distance",

"http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure": {

"@id": "http://qudt.org/vocab/unit#Kilometer"

}

}

to which you could even add more detail if you wanted (this replicates the canonical definition of unit:Kilometer from the Quantities, Units, Dimensions and Data Types (QUDT) Ontologies):

{

"titles": "distance",

"http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure": {

"@id": "http://qudt.org/vocab/unit#Kilometer",

"@type": [

"http://qudt.org/schema/qudt#SIUnit",

"http://qudt.org/schema/qudt#DerivedUnit",

"http://qudt.org/schema/qudt#LengthUnit"

],

"rdfs:label": "Kilometer",

"http://qudt.org/schema/qudt#abbreviation": "km",

"http://qudt.org/schema/qudt#code": "1091",

"http://qudt.org/schema/qudt#conversionMultiplier": 1000,

"http://qudt.org/schema/qudt#conversionOffset": 0.0,

"http://qudt.org/schema/qudt#symbol": "km",

"skos:exactMatch": { "@id": "http://dbpedia.org/resource/Kilometre" }

}

}

Supporting units of measure by transforming to structured values

If you are generating JSON or RDF from CSV, you may want to generate structured values that include the units of each value from the CSV file. This is a little complicated, but useful if different rows contain values that use different units. In this case, the output that you're aiming for in RDF would look something like:

[] :distance <#row-1-distance> .

<#row-1-distance>

schema:value 3.5 ;

schema:unitCode <http://qudt.org/vocab/unit#Kilometer> ;

.

and in JSON something like this:

"distance": {

"@id": "#row-1-distance",

"schema:value": 3.5,

"schema:unitCode": "http://qudt.org/vocab/unit#Kilometer"

}

You may want to use different properties than schema:value and schema:unitCode; if so, just use different propertyUrls.

You need to decide on a pattern for the URLs of the values themselves, and set the aboutUrl for the relevant column to create that URL. In this example, the URLs that look like #row-1-distance can be generated with the pattern #row-{_row}-distance. The propertyUrl for the column needs to be schema:value, as the value in the column provides the value for that property. So the column description looks like:

{

"name": "distance_value",

"titles": "distance",

"datatype": "number",

"aboutUrl": "#row-{_row}-distance",

"propertyUrl": "schema:value"

}
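The {_row} variable stands for the 1-based row number, so the template yields a distinct URL for each row. Using Python's str.format as a stand-in for a real URI-template engine (an illustration, not how a CSVW processor is implemented):

```python
# Sketch: expanding the aboutUrl template "#row-{_row}-distance" per row.
template = "#row-{_row}-distance"

about_urls = [template.format(_row=n) for n in range(1, 4)]
print(about_urls)  # ['#row-1-distance', '#row-2-distance', '#row-3-distance']
```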

You then need to use virtual columns (descriptions of additional columns that don't exist in the source CSV) to generate two things: the relationship between the thing whose distance is being measured and the structured value, and the property providing the unit on the structured value.

To generate the relationship between the thing that has the distance and the structured value, the virtual column's valueUrl needs to hold the same URL template as you used before:

{

"name": "distance",

"virtual": true,

"valueUrl": "#row-{_row}-distance"

}

To create the units property, you need another virtual column where the aboutUrl of that virtual column generates the URL for the structured value, the propertyUrl is schema:unitCode and the valueUrl is the URL representing the unit (in this case http://qudt.org/vocab/unit#Kilometre):

{

"name": "distance_unit",

"virtual": true,

"aboutUrl": "#row-{_row}-distance",

"propertyUrl": "schema:unitCode",

"valueUrl": "http://qudt.org/vocab/unit#Kilometer"

}

If it's necessary to add more detail about the unit (e.g. the fact that it's a unit of length) this can be done with additional virtual columns:

{

"name": "kilometer_type",

"virtual": true,

"aboutUrl": "http://qudt.org/vocab/unit#Kilometer",

"propertyUrl": "rdf:type",

"valueUrl": "http://qudt.org/schema/qudt#LengthUnit"

}

Usually, however, processors should recognise or be able to resolve the URL for the unit to understand that it's a unit of length, if this is important for onward processing.

See also:

Quantities, Units, Dimensions and Data Types (QUDT) Ontologies

Supporting units of measure with named datatypes in RDF

If you are generating RDF from CSV, you may want to define a datatype for a column and then provide additional information about that datatype as properties. For example, the column description could look like:

{

"titles": "distance",

"datatype": {

"@id": "http://example.org/unit/kilometre",

"rdfs:label": "Kilometre",

"base": "number"

}

}

When values are generated in RDF for this column, they will be assigned the relevant datatype, for example:

[] :distance "3.5"^^<http://example.org/unit/kilometre> .

<http://example.org/unit/kilometre> rdfs:label "Kilometre" .

Again, it is possible to include additional information about the unit being used as the datatype within the definition of the datatype:

{

"titles": "distance",

"datatype": {

"@id": "http://example.org/unit/kilometre",

"@type": "http://example.org/quantity/length",

"rdfs:label": "Kilometre",

"base": "number",

"skos:notation": "km"

}

}

See also:

Interpreting Datatypes in [[csv2rdf]]

How do you support geospatial data?

There are many different ways of representing geospatial data within a CSV file, and no single best practice for doing so.

At the simplest level, it's possible to represent geospatial coordinates as points, with separate columns for latitude, longitude and, if necessary, altitude (or using a different spatial reference system). This enables separate validation of the separate coordinates. The example used throughout this primer uses this setup:

"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"

"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"

"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"

"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

Metadata can be used to provide specialist types for the values of these coordinates, to indicate that they are latitude and longitude (by mapping to the well-known schema:latitude and schema:longitude properties which specify the use of WGS84), to group the coordinates together, and to provide a link that uses the coordinates to provide a map:

{

"titles": "latitude",

"name": "lat",

"datatype": {

"base": "number",

"minimum": "-90",

"maximum": "90"

},

"aboutUrl": "http://example.org/country/{code}#geo",

"propertyUrl": "schema:latitude"

}, {

"titles": "longitude",

"name": "long",

"datatype": {

"base": "number",

"minimum": "-180",

"maximum": "180"

},

"aboutUrl": "http://example.org/country/{code}#geo",

"propertyUrl": "schema:longitude"

}, {

"virtual": true,

"propertyUrl": "schema:geo",

"valueUrl": "http://example.org/country/{code}#geo"

}, {

"virtual": true,

"aboutUrl": "http://example.org/country/{code}#geo",

"propertyUrl": "rdf:type",

"valueUrl": "schema:GeoCoordinates"

}, {

"virtual": true,

"propertyUrl": "schema:hasMap",

"valueUrl": "https://www.google.com/maps/@{lat},{long},7z"

}

You can put latitude and longitude into a single column, with a character separator between the numbers, as shown earlier. However, this makes it harder to validate the individual coordinates accurately, and the two coordinates cannot be mapped to separate latitude and longitude properties when converting to JSON or RDF. So this is a more restrictive method and best avoided.
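The extra work a consumer has to do with a combined column can be sketched in a few lines (an illustration only; with separate columns, the minimum/maximum constraints shown earlier do this validation declaratively):

```python
# Sketch: validating a combined latlong cell by hand. check_latlong is a
# hypothetical helper, not part of any CSVW library.
def check_latlong(cell):
    lat, lon = (float(v) for v in cell.split(" "))
    assert -90 <= lat <= 90, "latitude out of range"
    assert -180 <= lon <= 180, "longitude out of range"
    return lat, lon

print(check_latlong("47.6965545 13.34598005"))  # (47.6965545, 13.34598005)
```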

CSV files may also need to contain geometries beyond individual points. There are no built-in formats for geometries recognised by implementations of CSV on the Web. Geometries may be expressed using GeoJSON, GML, KML or OGC Well-Known Text (WKT) representations. In each case, schemas may indicate that columns containing geometries adhere to a particular type:

json if the format is JSON-based such as GeoJSON

xml if the format is XML-based such as GML or KML

string if the format is text-based such as OGC WKT

You can use the format property to further constrain the content of columns containing these values. For example, you could use:

"datatype": {

"@id": "http://geojson.org/",

"base": "json",

"format": "\\{ ?\"type\": ?\"Polygon\",.+\\}"

}

to indicate that a column contains a Polygon in GeoJSON, or:

"datatype": {

"@id": "http://www.iana.org/assignments/media-types/application/gml+xml",

"base": "xml",

"format": ".*<gml:Point.*>.+</gml:Point>.*"

}

to indicate that a column contains a GML Point, or:

"datatype": {

"@id": "http://www.iana.org/assignments/media-types/application/vnd.google-earth.kml+xml",

"base": "xml",

"format": ".*<kml.*>.+</kml>.*"

}

to indicate that a column contains KML, or:

"datatype": {

"base": "string",

"format": "POLYGON \\(\\(\\d+(\\.\\d+)? \\d+(\\.\\d+)?(, \\d+(\\.\\d+)? \\d+(\\.\\d+)?)+\\)\\)",

"rdfs:seeAlso": "http://www.opengeospatial.org/standards/sfa"

}

to indicate that a column contains a very basic OGC WKT Polygon. (More sophisticated regular expressions could be used if properties held different types of objects such as points or lines or more complex polygons expressed in OGC WKT.)
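The WKT pattern above can be tried out directly as a Python regular expression (the JSON escaping `\\(` becomes `\(` in a raw string); note that this basic pattern requires at least two coordinate pairs and does not allow negative coordinates:

```python
import re

# The "very basic" OGC WKT Polygon pattern from the format property above.
pattern = r"POLYGON \(\(\d+(\.\d+)? \d+(\.\d+)?(, \d+(\.\d+)? \d+(\.\d+)?)+\)\)"

assert re.fullmatch(pattern, "POLYGON ((30 10, 40 40, 20 40, 30 10))")
assert not re.fullmatch(pattern, "POLYGON ((30 10))")          # needs more than one point
assert not re.fullmatch(pattern, "POLYGON ((-30 10, 40 40))")  # no negatives in this pattern
```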

There is no standard way of expressing which coordinate reference system (CRS) is being used within CSV metadata, but GeoJSON, GML and OGC WKT values can include information about which CRS they use. The W3C Spatial Data on the Web Working Group intend to make recommendations in this area within [[sdw-bp]].

See also:

Datatypes in [[tabular-metadata]]

Datatypes in [[tabular-data-model]]

How can you specify a single schema for multiple CSV files?

CSV on the Web is designed to enable you to reuse the same schema when publishing multiple CSV files, even if those files are created by different organisations and therefore reside in different places. Rather than embedding a schema within the description of a table, the tableSchema property can point to a schema held somewhere else.

For example, if you were a statistical agency and wanted municipalities to publish their unemployment figures using the same schema, you could specify the columns that you wanted included within the schema you specified at http://example.org/schema/unemployment.json:

{

"columns": [{

"name": "municipality"

}, {

"name": "month",

"datatype": "gYearMonth"

}, {

"name": "unemployment",

"datatype": "number"

}]

}

Experience shows that publishers of data in CSV files often use their own headings for the columns. So long as these don't change the meaning of a column, as a statistical agency you probably don't care. The titles property lets you supply an array of alternative titles that people may use:

{

"columns": [{

"name": "municipality",

"titles": [ "Municipality", "City", "Area" ]

}, {

"name": "month",

"titles": [ "Month", "Period" ],

"datatype": "gYearMonth"

}, {

"name": "unemployment",

"titles": [ "Unemployment Rate", "Unemployment", "Number unemployed" ],

"datatype": "number"

}]

}

In cases where the data is being gathered from multiple countries, it may also be useful to specify multiple possible titles in different languages, as described above.

It will help consistency and processing of the data if the municipalities use a consistent set of codes to indicate which municipality the data relates to. As the statistical agency, you can supply the relevant codes in a CSV file, e.g. http://example.org/ref/municipalities.csv:

code,name

0101,Absecon City

0102,Atlantic City

0103,Brigantine City

...

This file will of course have its own simple schema, http://example.org/schema/municipalities.json, that provides the datatype for the municipality codes and indicates that they are unique through a primary key:

{

"columns": [{

"name": "code",

"datatype": { "format": "\\d{4}" }

}, {

"name": "name"

}],

"primaryKey": "code"

}

A foreign key in the schema supplied by the statistical agency ensures that the codes used in the unemployment data match the standard set of municipality codes:

{

"columns": [{

"name": "municipality",

"titles": [ "Municipality", "City", "Area" ]

}, {

"name": "month",

"titles": [ "Month", "Period" ],

"datatype": "gYearMonth"

}, {

"name": "unemployment",

"titles": [ "Unemployment Rate", "Unemployment", "Number unemployed" ],

"datatype": "number"

}],

"foreignKeys": [{

"columnReference": "municipality",

"reference": {

"resource": "http://example.org/ref/municipalities.csv",

"columnReference": "code"

}

}]

}
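What the primaryKey and foreignKeys constraints amount to can be sketched by validating the two files by hand. The inline strings below stand in for municipalities.csv and unemployment.csv (shortened, invented rows); a real CSVW validator performs these checks for you:

```python
import csv
import io

# Stand-ins for the reference data and a municipality's published data.
municipalities = "code,name\n0101,Absecon City\n0102,Atlantic City\n"
unemployment = "municipality,month,unemployment\n0101,2024-01,3.5\n0102,2024-01,4.1\n"

codes = [row["code"] for row in csv.DictReader(io.StringIO(municipalities))]
assert len(codes) == len(set(codes))  # primaryKey: codes must be unique

for row in csv.DictReader(io.StringIO(unemployment)):
    # foreignKeys: every municipality value must match a reference code
    assert row["municipality"] in codes, f"unknown municipality {row['municipality']}"
print("all foreign keys resolve")
```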

When a municipality describes their unemployment CSV, they will need to point to your schema for their data and to the centrally provided municipalities reference data and schema:

{

"@context": "http://www.w3.org/ns/csvw",

"tables": [{

"url": "http://local.example.org/data/unemployment.csv",

"tableSchema": "http://example.org/schema/unemployment.json"

}, {

"url": "http://example.org/ref/municipalities.csv",

"tableSchema": "http://example.org/schema/municipalities.json"

}]

}

A more complex example in which there is linking between pairs of files where the schemas are provided by a central authority is provided in the section Foreign Key Reference Between Schemas within [[tabular-metadata]].

See also:

Examples of Foreign Keys in [[tabular-metadata]]

How can you provide a title for a row?

It is useful to have titles for rows both for screen readers and in other displays where it may be difficult to view the complete context of a table and therefore to understand the content of a row.

The rowTitles property within a schema provides an array of columns that give sufficient context to label the row. These may sometimes be the same as the columns used for the primaryKey of the table, but are more likely to be columns containing human-readable text.

For example, with the CSV:

"country","country group","name (en)","name (fr)","name (de)","latitude","longitude"

"at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"

"be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"

"bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

The rowTitles property could be set to reference the columns containing the name of each country, in the different languages available:

"tableSchema": {

"columns": [{

"titles": "country"

},{

"titles": "country group"

},{

"name": "name_en",

"titles": "name (en)",

"lang": "en"

},{

"name": "name_fr",

"titles": "name (fr)",

"lang": "fr"

},{

"name": "name_de",

"titles": "name (de)",

"lang": "de"

},{

"titles": "latitude"

},{

"titles": "longitude"

}],

"rowTitles": ["name_en", "name_fr", "name_de"]

}

In this case, a screen reader or other display of the table could choose to read or display only the row title that matched the user's preferred language.

In other cases, the rowTitles property may be set to an array of columns that together provide sufficient context to understand the row (e.g. ["firstName", "lastName"]).
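Building such a label is just a matter of joining the rowTitles cells in order. A minimal sketch (row_title is a hypothetical helper; row is assumed to be a dict of column name to cell value):

```python
# Sketch: building a human-readable row label from the rowTitles columns.
def row_title(row, row_titles):
    return " ".join(row[name] for name in row_titles)

row = {"firstName": "Ada", "lastName": "Lovelace"}
print(row_title(row, ["firstName", "lastName"]))  # Ada Lovelace
```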

See also:

rowTitles property in [[tabular-metadata]]

Column and Row Labelling in [[tabular-data-model]]

What about CSV that isn't standard CSV?

A lot of what's called "CSV" that's published on the web isn't actually CSV. It might use something other than commas (such as tabs or semi-colons) as separators between values, or might have multiple header lines.

The specification for CSV as a format is [[RFC4180]]. However, this is an informational specification and not a formal standard, so applications may deviate from it.

The metadata that's described here can be used with files that contain tabular data but that aren't CSV. You can provide guidance to processors that are trying to parse those files through the dialect property on a table description. For example, say we were dealing with a tab-separated file that contains multiple header lines at http://example.org/data/unemployment.tsv:

"country" "country group" "name (en)" "name (fr)" "name (de)" "latitude" "longitude"

"Land" "Ländergruppe" "Name (en)" "Name (fr)" "Name (de)" "Breite" "Länge"

"pays" "groupe de pays" "nom (en)" "nom (fr)" "nom (de)" "latitude" "longitude"

"at" "eu" "Austria" "Autriche" "Österreich" "47.6965545" "13.34598005"

"be" "eu" "Belgium" "Belgique" "Belgien" "50.501045" "4.47667405"

"bg" "eu" "Bulgaria" "Bulgarie" "Bulgarien" "42.72567375" "25.4823218"

The metadata for this file could be:

{

"@context": "http://www.w3.org/ns/csvw",

"url": "http://example.org/data/unemployment.tsv",

"dialect": {

"delimiter": "\t",

"headerRowCount": 3

}

}

There are a number of other properties that you can set within the dialect to cater for the large range of weird things that people do in CSV files. They are:

commentPrefix

If the file contains comment lines, set this to the character used at the start of the lines that are comments (usually that's #).

delimiter

If the file doesn't use commas as separators between values, set this to the separator that it uses.

doubleQuote

If the file uses \ to escape double quotes within values, set this to false.

encoding

If the encoding of the file is not UTF-8, set this to the encoding.

header

If the file doesn't have a header line, set this to false.

headerRowCount

If the file has more than one header line, set this to the number of header lines it has.

lineTerminators

If the file uses an unusual character at the end of its lines, set this to that character.

quoteChar

If the file doesn't use double quotes (") around values that contain commas, set this to the character that it does use.

skipBlankRows

If the file contains blank rows that should just be ignored, set this to true.

skipColumns

If the file has some columns at the start that don't contain useful information, set this to that number of columns.

skipInitialSpace

If values in the file sometimes start with whitespace that should be ignored, set this to true.

skipRows

If the file has some rows at the start that don't contain useful information, set this to that number of rows. (Sometimes people put metadata at the start of a CSV file, before the header lines.)

trim

If you don't want to ignore whitespace around values, set this to false. If you want to only ignore whitespace at the beginning of values, set it to start and if you want to only ignore whitespace at the end of values, to end. By default whitespace at the start and the end of a value will be stripped away.
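Applying a few of these dialect settings by hand shows what a processor does with them. This sketch uses Python's csv module with an invented two-header-line tab-separated snippet; a CSVW processor reads the same settings from the dialect description:

```python
import csv
import io

# A hypothetical dialect: tab-separated, two header lines, trim whitespace.
dialect = {"delimiter": "\t", "headerRowCount": 2, "trim": True}

data = "country\tname\nLand\tName\nat\t Austria \nbe\t Belgium \n"
rows = list(csv.reader(io.StringIO(data), delimiter=dialect["delimiter"]))

header = rows[0]                         # the first header line supplies the titles
body = rows[dialect["headerRowCount"]:]  # skip all header lines
if dialect["trim"]:
    body = [[cell.strip() for cell in row] for row in body]

print(body)  # [['at', 'Austria'], ['be', 'Belgium']]
```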

See also:

Parsing Tabular Data in [[tabular-data-model]]

Dialect Descriptions in [[tabular-metadata]]

What about tables in HTML?

Tables in HTML are a bit more complicated than tables in CSV files. However, it is possible to use the metadata described here and in [[tabular-metadata]] to describe HTML tables by embedding the metadata within a script element in the head of the HTML page. A description and example of how to do this is provided in the [[csvw-html]] Note.

See also:

[[csvw-html]]

What if you want to put metadata files somewhere else?

By default, if you point a processor at a CSV file on the web, it will look for a Link header within the response with rel="describedby" and a content type of application/csvm+json, application/ld+json or application/json. If you're publishing CSV files and can set the Link header, this is the best way of telling processors where to look for metadata files.

If a processor doesn't see an appropriate Link header, it will append -metadata.json to the URL of the CSV file to try to find metadata. If it can't find a metadata file there, it will look in the directory containing the CSV file for a file called csv-metadata.json and use that.

Looking for files in this way isn't appropriate on all servers. As a publisher, it might be that you have URLs for CSV files that are automatically generated using queries, such as http://example.org/data?format=csv&x=15&y=53 or that you want metadata to live in a separate subdirectory or central location.

As a publisher, if you can't or don't want to use the Link header, you can control where processors look for metadata for your CSV files by listing the locations within the /.well-known/csvm file on your server. (This is a well-known location as defined in [[RFC5785]].) This file should contain a list of URL templates, which are expanded by substituting {url} with the URL of the CSV file and then resolving against the location of the CSV file. Lines might look like:

/metadata?for={url}

?format=metadata

{+url}m

With a CSV file at http://example.org/data.csv this would lead the processor to search for metadata at:

http://example.org/metadata?for=http://example.org/data.csv

http://example.org/data.csv?format=metadata

http://example.org/data.csvm
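The lookup above can be sketched in a few lines of Python. This is a simplified expansion (a conformant processor would use full RFC 6570 URI Template processing), and the pattern list is the example from this page, not a real server's /.well-known/csvm file:

```python
from urllib.parse import urljoin

def candidate_metadata_urls(csv_url, patterns):
    """Expand /.well-known/csvm URL patterns for a given CSV file.

    Each pattern may reference the CSV file's URL as {url} or {+url};
    the expanded pattern is then resolved against the CSV file's own
    location. Simplified: both forms are substituted verbatim here.
    """
    candidates = []
    for pattern in patterns:
        expanded = pattern.replace("{+url}", csv_url).replace("{url}", csv_url)
        candidates.append(urljoin(csv_url, expanded))
    return candidates

patterns = ["/metadata?for={url}", "?format=metadata", "{+url}m"]
for url in candidate_metadata_urls("http://example.org/data.csv", patterns):
    print(url)
```

This reproduces the three candidate locations listed above.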

See also:

Locating Metadata in [[tabular-data-model]]

CSVW - Standards


The Core CSVW Specification

The CSVW specification is deep, building on multiple standards, and it can be overwhelming to read from start to finish. This page provides an overview to help you find the parts most relevant to your interests.

The CSVW syntax specification describes a data model for tabular data. That is to say, it defines that a table is a collection of cells arranged into columns and rows. It also describes how to annotate a table with a metadata file to help processors parse and interpret the contents.

The schema for the metadata file is provided by the CSVW metadata vocabulary. The vocabulary defines the properties that can be used in an annotation. This includes things like a schema of column descriptions, datatypes and foreign key relations.

CSVW also defines how a table may be translated into RDF via the csv2rdf procedure or into JSON via the csv2json procedure.

Building on a deep stack of standards

The specification depends upon a range of other technical standards which may be of interest themselves.

CSV Dialect provides a standard for describing what “kind” of CSV file you have - i.e. the terminator strings, quoting rules, escape rules etc. This allows you, for example, to use tab-separated instead of comma-separated values, or to deal with the often invisible differences between applications and operating systems (like encoding or line endings). The CSVW syntax specification provides some sensible Best Practice recommendations for standardising your CSV serialisation (e.g. use UTF-8 encoding and CRLF line endings).
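Most CSV parsers expose these dialect settings directly. As a quick illustration, here is Python's built-in csv module reading an in-memory, tab-separated rendering of the grit-bins rows (the tab serialisation is our own, made up for the example):

```python
import csv
import io

# The same rows serialised with a tab delimiter instead of commas.
tsv = "42\t425584\t439562\n43\t425301\t439519\n"

# Dialect settings tell the parser how the file is serialised:
# here a tab delimiter and standard double-quote quoting.
reader = csv.reader(io.StringIO(tsv), delimiter="\t", quotechar='"')
rows = list(reader)
print(rows)
```

A CSVW dialect description carries exactly this kind of information so that every consumer configures its parser the same way.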

JSON-LD is a serialisation for linked-data in JSON which is used for the CSVW metadata file. This standard gives JSON documents an “@context” property which defines how to get from keys and values to URIs through processes called “expansion” and “compaction”. In CSVW the context is fixed for compatibility with non-JSON-LD-aware processors.

XML Schema Datatypes lists the various datatypes - string, decimal, dateTime and so on. The Datatypes section of the CSVW syntax specification builds upon this to allow annotated datatypes that can restrict the range of a number or provide format strings for parsing dates.
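A processor that sees a datatype annotation can parse the cell into a native value rather than leaving it as a string. A rough Python sketch - the cell values are invented, and the strptime pattern is our illustrative stand-in for the CSVW format string "dd/MM/yyyy", not part of the standard:

```python
from datetime import datetime
from decimal import Decimal

# Cell annotated with {"datatype": "decimal"}: parse to an exact decimal.
easting = Decimal("425584")

# Cell annotated with {"datatype": {"base": "date", "format": "dd/MM/yyyy"}}:
# the format string tells the processor how to read a non-ISO date.
installed = datetime.strptime("25/02/2016", "%d/%m/%Y").date()

print(easting, installed.isoformat())
```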

Compact URIs is a standard which describes how to make URIs shorter by using prefixes (i.e. if I tell you dcterms means “http://purl.org/dc/terms/” then I can write dcterms:title or dcterms:publisher without having to spell the whole URI out each time). Compact URIs are much easier for humans to read!

URI Templates can be used to create URIs by interpolating variables from table cells. A template like http://example.net/thing/{id}, when combined with the cell value “123456”, for example, will expand to the URI http://example.net/thing/123456. This is particularly useful for aboutUrl or valueUrl properties.
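A minimal level-1 expansion - enough for the {id} example, though full RFC 6570 defines many more operators - might look like:

```python
import re
from urllib.parse import quote

def expand(template, values):
    """Expand simple {name} expressions in a URI template."""
    return re.sub(
        r"\{(\w+)\}",
        lambda match: quote(str(values[match.group(1)]), safe=""),
        template,
    )

print(expand("http://example.net/thing/{id}", {"id": "123456"}))
```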

CSV Fragment Identifiers make it possible to refer to parts of a CSV file identified by row, column, or cell e.g. http://example.net/data.csv#row=5-7. These are used by the default metadata to give any CSV file a linked-data translation without requiring any configuration.

Vocabularies for interoperability

CSVW suggests a set of ontologies you can adopt to integrate with other data on the web.

The CSVW namespace defines a set of ontologies that may be used in an annotation in their prefixed form (i.e. it provides a JSON-LD “@context”). This means that instead of writing:

{"http://purl.org/dc/terms/title": "My Great Table Group"}

you’re able to write:

{"dcterms:title": "My Great Table Group"}

which is much more legible.
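The expansion is mechanical: look the prefix up in the context and concatenate. A sketch, with the prefix map reduced to two entries:

```python
PREFIXES = {
    "dcterms": "http://purl.org/dc/terms/",
    "csvw": "http://www.w3.org/ns/csvw#",
}

def expand_curie(curie):
    """Expand a compact URI such as dcterms:title to its full form."""
    prefix, sep, local = curie.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return curie  # not a known prefix: leave it alone

print(expand_curie("dcterms:title"))
```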

These ontologies aren’t required in an annotation but the fact that the authors chose to include them in the CSVW Namespace implies a sort of tacit recommendation. Indeed these ontologies are a great place to find useful properties for annotating your tables.

Data Catalog Vocabulary (DCAT) defines Catalogs and Datasets; it also makes heavy use of Dublin Core Terms;

Dublin Core Terms (DCTERMS) defines fundamental dataset metadata terms like title, modified date, publisher and license;

RDF Data Cube (QB) provides a vocabulary for describing multi-dimensional statistical data building upon the SDMX standard. This defines things like observations, dimensions, and measures.

Simple Knowledge Organisation System (SKOS) can be used to express concept schemes such as “thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, and other similar types of controlled vocabulary”.

Good Relations (gr) is designed for e-commerce and defines things like Business Entities, Products or Services and Price Specifications.

PROV lets you explain the provenance of a dataset with reference to Entities and Activities.

This is just a shortlist; consult the namespace document itself for the full details.

Indeed if you’re looking for URIs to use in your annotations then you might also like to try searching with the Linked Open Vocabularies project or browsing around the prefix.cc namespace lookup (just bear in mind that if the prefixes aren’t used in the CSVW namespace then you’ll need to spell out the URIs in full).

A tribute to CSVW made by Swirrl and shared under the Creative Commons Attribution-ShareAlike license.

We'd be delighted to receive contributions. If you've got an idea for a guide or tool please create an issue on the site's repository.

CSVW - Guides


Guides

How to make CSVW

This guide shows how to turn a CSV file into CSVW

We’ve discussed why to use CSVW, but how can you do that?

Let’s annotate a csv table with csvw metadata to accurately describe its contents. The annotation will improve the machine-readability of the dataset and connect it to a wider knowledge graph spread across the web.

We’ll start with something simple.

Starting with CSV

Interpreting a CSV file often requires a bit of investigation

Data Mill North publishes a list of Grit Bins on behalf of Leeds City Council.

The first five lines of the csv look like this:

On its own, this is of course a little tricky to interpret.

Fortunately the page we got the file from tells us that the columns are:

location

easting

northing

We interpret location as an identifier for the spot where the bin is placed - it serves to distinguish the location, not to locate it (i.e. on a map). The second and third columns locate these places using Ordnance Survey National Grid coordinates.

Annotate the table

Metadata can help to make implicit knowledge explicit and machine-readable

We can use the csvw standard to make a machine-readable definition of how to interpret the csv. This is done with a json file.

Here’s how we would give the columns names:

We’ve also specified here that the table doesn’t provide column names in a header row.
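At this stage the annotation might look like the following, sketched here as a Python dict mirroring the JSON (column names only, plus the header: false dialect):

```python
import json

metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "tableSchema": {
        "columns": [
            {"name": "location"},
            {"name": "easting"},
            {"name": "northing"},
        ]
    },
    # The CSV file has no header row, so the names above supply them.
    "dialect": {"header": False},
}

print(json.dumps(metadata, indent=2))
```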

We can then add simple datatypes:

We can also explain that the second and third columns are grid references.

We could describe the datatype of these columns as being in metres, which are the units for eastings and northings (and yes the data does seem rather more precise than it needs to be). This wouldn’t be terribly helpful as we’re not really talking about distance per se, but a 2-dimensional distance from the National Grid’s origin point (0, 0 is somewhere south-west of the Isles of Scilly). Indeed it’s only possible to interpret those distances when you know the origin and the projection used to get from the curved surface of the globe to a flat 2D map.

Incorporate knowledge from the web

Connect your data to a wider knowledge graph of standards and vocabularies

Fortunately, we don’t need to worry about transforming between coordinate systems or projections. We can just say that we’re using the Ordnance Survey system by using the identifiers that they provide for this purpose.

We can find those identifiers with the Linked Open Vocabularies search engine. When we search for “easting”, the Ordnance Survey’s Spatial Relations Ontology is the top hit. If we browse to the ontology, we can see that it also declares the “northing” property.

We declare them by setting the “propertyUrl” for each column.

Now applications that know what to do with eastings and northings can interpret the values correctly.

Provide identifiers for others to use

Let others enrich your data by connecting to it

We also want to identify the entities in our data - the grit bins themselves. This lets other users refer to them unambiguously.

We can tell the data user that the first column serves as an identifier to distinguish each row.

We can do that by saying that each row is about the location - or what csvw calls the “aboutUrl”. We don’t just want to set this on the location column, as it applies to all 3 columns. Instead we set it for the whole table schema. We also want to include the value from the location column in our URIs. To do this we can use a URI template; this standard lets us describe a URI in terms of some variables.
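With "aboutUrl": "#{location}" set on the table schema, a row's identifier is just the template expanded with that row's location value and resolved against the CSV file's URL. A sketch:

```python
from urllib.parse import urljoin

CSV_URL = "http://opendata.leeds.gov.uk/downloads/gritting/grit_bins.csv"
ABOUT_URL = "#{location}"

def row_uri(location):
    """Build the identifier for the row describing one grit bin."""
    fragment = ABOUT_URL.replace("{location}", str(location))
    return urljoin(CSV_URL, fragment)

print(row_uri(696))
```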

That is basically all the description we really need to provide for the schema.

Fit the pieces together

Associate the annotation file with the csv file

Finally, we can connect our annotation to the original csv file. We can do this by including the URL of the table. There are a few other options described in the standard, but typically you’ll want to publish the json alongside the csv like this.

We also need to say that this json file is specified in terms of the csvw standard. We set this using a “@context” key, following the lead of the json-ld standard (a json serialisation for linked-data).

Our final metadata annotation for the table looks like this:
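Built here as a Python dict and serialised, so it can be checked programmatically, the complete grit_bins.json annotation is:

```python
import json

OS = "http://data.ordnancesurvey.co.uk/ontology/spatialrelations/"

metadata = {
    "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
    "tables": [{
        "url": "http://opendata.leeds.gov.uk/downloads/gritting/grit_bins.csv",
        "tableSchema": {
            "columns": [
                {"name": "location", "datatype": "integer"},
                {"name": "easting", "datatype": "decimal",
                 "propertyUrl": OS + "easting"},
                {"name": "northing", "datatype": "decimal",
                 "propertyUrl": OS + "northing"},
            ],
            "aboutUrl": "#{location}",
        },
    }],
    "dialect": {"header": False},
}

print(json.dumps(metadata, indent=2))
```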

Using CSVW

What can you do with a well annotated CSV file?

You may be wondering how you can use this newly annotated table…

Ideally we’d be able to open this up in a mapping tool like QGIS, and have it automatically recognise which columns to use as X and Y coordinates so we could create a map of grit bins.

Sadly we’re not quite at that point yet. This standard is still new and hasn’t been adopted by applications like this yet.

What we do have is a good number of libraries which implement csvw in various programming languages.

These can be used to validate the compatibility of a csv file and its json annotation. This is useful either to check the annotations are correct, or to validate that a given csv file conforms to a schema.

They can also translate the csv into either json or rdf formats (for example, using a csv2rdf tool). This is what a couple of lines of our csv look like as linked data:

The substantive content is the same, but the extra description earns us the final two open data stars, as promised in the first guide explaining the rationale for CSVW:

★★★★ using identifiers to denote things, so that people can talk about your resources unambiguously

★★★★★ linking your data to other data to provide context

We’ve given the bins URIs. Here location 696 gets the URI http://opendata.leeds.gov.uk/downloads/gritting/grit_bins.csv#696. People can talk about this anywhere on the web and know exactly what we’re talking about.

We’ve also adopted URIs from other vocabularies - both the Ordnance Survey’s Spatial Relations Ontology and the venerable XSD Datatypes - linking this csv file into the web of data. Any tools that know how to deal with Ordnance Survey points or decimal datatypes will know how to interpret this csv.

This example was nice and simple. It’s almost certain that the data you’re working with is considerably more complex! It mightn’t be immediately obvious how to apply this lesson to your data. Which vocabularies should you use? How should you combine them? How should you structure the table? We’ll provide answers to these questions in our next guide in this series, with further examples of how to publish csvw.


CSV on the Web Community Group


Discussions and mutual support for implementers, publishers and spec developers of the technologies developed by the CSV on the Web Working Group. The group is not chartered to change published W3C documents, but can record new issues, errata, and test cases for those specifications. The CG may also discuss related work, e.g. R2RML or other potential extensions.


CSV metadata generator and editor

Dan Brickley | Posted on: January 20, 2016

Jürgen Umbrich has announced an early preview release of a CSV metadata generator and editor that uses the new W3C CSVW specifications. From his announcement:

One of our students ( Manuel Undesser) developed as part of his bachelor thesis a UI for generating and editing the metadata of CSV files which is compliant with the CSV on the Web metadata standard.

the current early alpha status and test version is available at: http://data.wu.ac.at/csvengine/

We would be very happy for any feedback, comments, criticism, suggestions, bug reports, etc…

We plan to release our editor in the near future to the public with the hope that such a tool will help to further drive the adoption of the metadata standard and ease the generation of an initial version.  We will also release the source code on github after another round of testing, improvements and adding documentation.

 


Welcome to the CSVW Community Group!

Dan Brickley | Posted on: January 7, 2016

The CSVW Community Group was created by members of the CSVW Working Group to provide an ongoing home for discussion of the CSVW specifications, their use and implementation.

Unlike the Working Group, this Community Group is primarily email-based, and open to all interested parties. Working Groups, as the name suggests, are much harder work, and CSVW was no exception. As the CSVW WG charter draws to a close following the publication of the CSV specs as W3C Recommendations, it is important to have somewhere on the Web that is a focal point for using these specifications, rather than just creating them.

It is possible that ideas for new W3C specifications (and Working Groups), may arise from the Community Group, but we also encourage broader collaboration amongst publishers and implementors. We would particularly like to hear from users of the transformation definition mechanism, and of any efforts that integrate or extend the specifications to work alongside related technologies such as R2RML.


Call for Participation in CSV on the Web Community Group

W3C Team | Posted on: January 6, 2016

The CSV on the Web Community Group has been launched:

CSV on the Web collaboration, discussion and mutual support for implementors, publishers and spec developers. Focus is on W3C’s CSVW specs but also on related work e.g. R2RML, potential extensions, errata and test cases.

In order to join the group, you will need a W3C account. Please note, however, that W3C Membership is not required to join a Community Group.

This is a community initiative. This group was originally proposed on 2016-01-06 by Dan Brickley. The following people supported its creation: Dan Brickley, Ethan Gruber, Ivan Herman, Gregg Kellogg, Erik Mannens, Pavlik elf. W3C’s hosting of this group does not imply endorsement of the activities.

The group must now choose a chair. Read more about how to get started in a new group and good practice for running a group.

We invite you to share news of this new group in social media and other channels.

If you believe that there is an issue with this group that requires the attention of the W3C staff, please email us at site-comments@w3.org

Thank you,

W3C Community Development Team



GitHub - w3c/csvw: Documents produced by the CSV on the Web Working Group


CSV on the Web Repository

The Past

This repository was originally used by W3C’s CSV on the Web Working Group. The group has defined and published 7 documents, namely:

CSV on the Web: A Primer

Model for Tabular Data and Metadata on the Web

Metadata Vocabulary for Tabular Data

Generating JSON from Tabular Data on the Web

Generating RDF from Tabular Data on the Web

Embedding Tabular Metadata in HTML

Use Cases and Requirements

The group was chaired by Jeni Tennison and Dan Brickley. The W3C staff contact was Ivan Herman. The group completed its work in February 2016 and was officially closed in March 2016.

The repository includes a final release, representing the status of the repository when the Working Group was closed.

The Present

After the closure of the Working Group, the activity around CSV (and, more generally, tabular) data on the Web has been taken up by the CSV on the Web Community Group. All discussions on possible new features, implementation experiences, possible issues, etc., are conducted in that community group, which is open to everyone. Although the Community Group is not allowed to make changes to the official documents, this repository is used for issue management, new documents, experiments, etc. If you are interested in this area, please join the Community Group!
