A large percentage of the data published on the Web is tabular data, commonly published as comma-separated values (CSV) files.

The CSV on the Web Working Group aims to specify technologies that provide greater interoperability for data-dependent applications on the Web when working with tabular datasets comprising single or multiple files in CSV or a similar format. This document lists the use cases compiled by the Working Group that are considered representative of how tabular data is commonly used within data-dependent applications. The use cases observe existing common practice undertaken when working with tabular data, often illustrating shortcomings or limitations of existing formats or technologies.

This document also provides a set of requirements derived from these use cases that have been used to guide the specification design. This is a draft document which may be merged into another document or eventually make its way into being a standalone Working Draft. CSV files may be of a significant size but they can be generated and manipulated easily, and there is a significant body of software available to handle them.

However, although these tools make conversion to CSV easy, it is resisted by some publishers because CSV is a much less rich format that cannot express important detail that the publishers want to express, such as annotations or the meaning of identifier codes. Existing formats for tabular data are format-oriented and hard to process (e.g. Excel).

None of these formats allows developers to pull in multiple data sets and then manipulate, visualize and combine them. Other information relevant to these datasets, such as access rights and provenance, is not easy to find.

CSV is a very useful and simple format, but to unlock the data and make it portable to environments other than the one in which it was created, there needs to be a means of encoding and associating relevant metadata. Each use case provides a narrative describing how a representative user works with tabular data to achieve their goal, supported, where possible, with example datasets.

As a result, the use cases seek to identify where user effort may be reduced. The use cases below describe many applications of tabular data. Whilst there are many different variations of tabular data, all the examples conform to the definition of tabular data given in the Model for Tabular Data and Metadata on the Web. Tabular data is data that is structured into rows, each of which contains information about some thing.

Each row contains the same number of fields (although some of these fields may be empty), which provide values of properties of the thing described by the row. In tabular data, fields within the same column provide values for the same property of the thing described by the particular row. In selecting the use cases we have reviewed a number of row-oriented data formats that, at first glance, appear to be tabular data.

However, closer inspection indicates that one or other of the characteristics of tabular data was not present. For example, the HL7 format from the health informatics domain defines a separate schema for each row (known as a "segment" in that format), which means that HL7 messages do not have a regular number of columns for each row. The laws of England and Wales place obligations upon departments and The National Archives (TNA) for the collection, disposal and preservation of records.
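The requirement that every row has the same number of fields, which HL7 messages fail as noted above, can be checked mechanically. The following is a minimal sketch only: the function name `irregular_rows` and both sample strings are hypothetical, and the pipe-delimited sample is a toy illustration of segment-like rows, not a valid HL7 message.

```python
import csv
import io


def irregular_rows(text, delimiter=","):
    """Return (row_number, field_count) for rows whose width differs from row 1."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return []
    expected = len(rows[0])
    return [
        (number, len(row))
        for number, row in enumerate(rows, start=1)
        if len(row) != expected
    ]


# Hypothetical samples: an empty field still counts as a field.
regular = "id,name,value\n1,a,\n2,b,7\n"
# Toy segment-like rows, each with its own width.
segmented = "MSH|x|y\nPID|1\nOBX|1|2|3|4\n"

print(irregular_rows(regular))
print(irregular_rows(segmented, delimiter="|"))
```

An empty result means the text satisfies this one characteristic of tabular data; it says nothing about whether the columns share a common meaning.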

Government departments are obliged within the Public Records Act sections 3, 4 and 5 to select, transfer, preserve and make available those records that have been defined as public records. These obligations apply to records in all formats and media, including paper and digital records. Details concerning the selection and transfer of records can be found here. Departments transferring records to TNA must catalogue or list the selected records according to The National Archives' defined cataloguing principles and standards.

Cataloguing is the process of writing a description, or Transcription of Records, for the records being transferred. Once each Transcription of Records is added to the Records Catalogue, records can be subsequently discovered and accessed using the supplied descriptions and titles. TNA specifies what information should be provided within a Transcription of Records and how that information should be formatted.

A number of formats and syntaxes are supported, including RDF. As a result, it is necessary to describe the interrelationships between Records within a single CSV file. Each row within a CSV file relates to a particular Record and is allocated a unique identifier.

This unique identifier behaves as a primary key for the Record within the scope of the CSV file and is used when referencing that Record from within other Record transcriptions.

The unique identifier is unique within the scope of the datafile; in order for the Record to be referenced from outside this datafile, the local identifier must be mapped to a globally unique identifier such as a URI.
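The identifier handling described above can be sketched as three small checks: that identifiers are unique within the file, that references between Records resolve within the file, and that a local identifier can be turned into a URI. Everything named here is an assumption made for illustration: the column names `identifier` and `parent`, the sample rows, and the base URI `http://example.org/records/` do not come from TNA's specifications.

```python
import csv
import io
from collections import Counter
from urllib.parse import quote

BASE_URI = "http://example.org/records/"  # hypothetical base, not a TNA URI


def duplicate_ids(rows, key="identifier"):
    """Identifiers that appear more than once, so cannot act as a primary key."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def dangling_references(rows, key="identifier", ref="parent"):
    """Referenced identifiers that no row in the same file defines."""
    known = {row[key] for row in rows}
    return sorted({row[ref] for row in rows if row[ref] and row[ref] not in known})


def to_uri(local_id, base=BASE_URI):
    """Map a file-local identifier to a globally unique URI."""
    return base + quote(local_id, safe="")


# Hypothetical sample file: R3 refers to a Record that is not present.
sample = "identifier,parent,title\nR1,,Series\nR2,R1,File A\nR3,R9,File B\n"
rows = list(csv.DictReader(io.StringIO(sample)))

print(duplicate_ids(rows))
print(dangling_references(rows))
print([to_uri(row["identifier"]) for row in rows])
```

Percent-encoding the local identifier keeps characters such as `/` or spaces from changing the structure of the resulting URI.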

Upon receipt by TNA, each of the Transcriptions of Records is validated against the set of centrally published data definitions; it is essential that received CSV metadata comply with these specifications to ensure efficient and error-free ingest into the Records Catalogue.

The validation applied depends on the type of entity described in each row. Entity type is specified in a specific column.
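Validation that varies by entity type can be sketched as a lookup from the value in the type column to a rule set. This is an illustration under stated assumptions, not TNA's validation logic: the column name `type`, the entity types `file` and `item`, and the required fields are all hypothetical.

```python
import csv
import io


def require(*fields):
    """Build a check that reports each required field that is empty or absent."""
    def check(row):
        return [f"missing {field}" for field in fields if not row.get(field)]
    return check


# Hypothetical rule sets, keyed by the value found in the type column.
VALIDATORS = {
    "file": require("identifier", "title"),
    "item": require("identifier", "title", "parent"),
}


def validate(rows, type_column="type"):
    """Return (line_number, message) pairs; an empty list means no problems."""
    problems = []
    # Data starts on line 2, assuming a header row and no multi-line fields.
    for number, row in enumerate(rows, start=2):
        entity_type = row.get(type_column, "")
        check = VALIDATORS.get(entity_type)
        if check is None:
            problems.append((number, f"unknown entity type {entity_type!r}"))
            continue
        problems.extend((number, message) for message in check(row))
    return problems


sample = (
    "type,identifier,parent,title\n"
    "file,R1,,Series\n"
    "item,R2,,Letter\n"
    "box,R3,R1,Misc\n"
)
problems = validate(csv.DictReader(io.StringIO(sample)))
print(problems)
```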


The CSV files do not include all the information required to undertake the transformation. However, the conversion of CSV to XML (in this case used as an interim conversion step) is illustrative of a common data conversion workflow.

The Office for National Statistics (ONS) is responsible for collecting and publishing statistics related to the economy, population and society at national, regional and local levels.

Sets of statistics are typically grouped together into datasets comprising collections of related tabular data. Within their underlying information systems, ONS maintains a clear separation between the statistical data itself and the metadata required for interpretation.

ONS classifies the metadata into two categories. These datasets are published online in both CSV format and as Microsoft Excel Workbooks that have been manually assembled from the underlying data. For example, dataset QSEW Economic activity, derived from the Census, is available as a precompiled Microsoft Excel Workbook for several sets of administrative geographies.

A user may choose to browse through the entire list or filter that list by topic. To enable the user to determine whether or not a dataset meets their need, summary information is available for each dataset. Once the required dataset has been selected, the user is prompted to choose how they would like the statistical data to be aggregated. In the case of QSEW Economic activity, the user is required to choose between the two mutually exclusive geography types. Effectively, the QSEW Economic activity dataset is partitioned into two separate tables for publication.

The user is also provided with an option to sub-select only the elements of the dataset that they deem pertinent for their needs. In the case of QSEW Economic activity, the user may select data from a limited number of geographic areas within the dataset to create a data subset that meets their needs. The data subset is provided as a compressed file containing both a CSV-formatted data file and a complementary HTML file containing the reference metadata.
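Separating the data file from its reference metadata inside such a download can be sketched as follows. This assumes the compressed file is a ZIP archive and that its members are UTF-8 text, neither of which the source confirms; the function name `split_subset` and the member names are hypothetical, and the archive is built in memory purely for demonstration.

```python
import io
import zipfile


def split_subset(archive_bytes):
    """Return ({csv name: text}, {html name: text}) from a zipped data subset."""
    data, metadata = {}, {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in archive.namelist():
            lower = name.lower()
            if lower.endswith(".csv"):
                data[name] = archive.read(name).decode("utf-8")
            elif lower.endswith((".html", ".htm")):
                metadata[name] = archive.read(name).decode("utf-8")
    return data, metadata


# Build a stand-in archive with made-up content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("subset.csv", "area,value\nA,1\n")
    archive.writestr("subset.html", "<p>notes</p>")

data, metadata = split_subset(buffer.getvalue())
print(sorted(data), sorted(metadata))
```

Note that the link between the two members is carried only by their packaging: nothing in the CSV file itself points at the HTML file that explains it.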

Correct interpretation of the statistics requires additional qualification or awareness of context. To achieve this, the complementary HTML file includes supplementary information and annotations pertinent to the data published in the accompanying CSV file.

Annotations or references may be applied to different parts of the dataset. Furthermore, these statistical data sets make frequent use of predefined category codes and geographic regions.

Climate change and global warming are among the most pressing environmental concerns in society today. Whilst there is an abundance of data recording the climate at locations the world over, the scrutiny under which climate science is put means that much of this data remains unused, leading to a paucity of data in some regions with which to verify our understanding of climate change.


The International Surface Temperature Initiative seeks to create a consolidated global land surface temperature databank as an open and freely available resource for climate scientists.

Given the need for openness and transparency in creating the databank, it is essential that the provenance of the source data is clear. Original source data, particularly for records captured prior to the mid-twentieth century, may be in hard-copy form. In order to incorporate the widest possible scope of source data, the International Surface Temperature Initiative is supported by data rescue activities to digitise hard copy records.

The Stage 1 data is typically provided in tabular form; the most common variant is white-space delimited ASCII files. Each data deck comprises multiple files which are packaged as a compressed tarball. Included within the compressed tarball package, and provided alongside, is a read-me file providing unstructured supplementary information. Summary information is often embedded at the top of each file.

For example, see the Ugandan Stage 1 data deck (local copy) and associated readme file (local copy). The Ugandan Stage 1 data deck appears to comprise two discrete datasets, uganda-raw and uganda-bestguess, each partitioned into a sub-directory within the tarball. Each sub-directory includes a Microsoft Word document providing supplementary information about the provenance of the dataset; of particular note is that uganda-raw is collated from 9 source datasets whilst uganda-bestguess provides what is considered by the data publisher to be the best set of values with duplicate values removed. Dataset uganda-raw is split into 96 discrete files, each providing maximum, minimum or mean monthly air temperature for one of the 32 weather observation station sites included in the data set.

Similarly, dataset uganda-bestguess is partitioned into discrete files; in this case just 3 files, each of which provides maximum, minimum or mean monthly air temperature data for all sites. The mapping from data file to data sub-set is described in the Microsoft Word document. A snippet of the data indicating maximum monthly temperature for Entebbe, Uganda, from uganda-raw is provided below. A snippet of the data indicating maximum monthly temperature for all stations in Uganda from uganda-bestguess is provided below, truncated to 9 columns.

At present, the global surface temperature databank comprises 25 Stage 1 data decks for monthly temperature observations.

These are provided by numerous organisations in heterogeneous forms. In order to merge these data decks into a single combined dataset, each data deck has to be converted into a standard form. An example Stage 2 data file is given for Entebbe, Uganda, below. Because of the heterogeneity of the Stage 1 data decks, bespoke data processing programs were required for each data deck, consuming valuable effort and resource in simple data pre-processing.

If the semantics, structure and other supplementary metadata pertinent to the Stage 1 data decks had been machine readable, then this data homogenisation stage could have been avoided altogether. Data provenance is crucial to this initiative, therefore it would be beneficial to be able to associate the supplementary metadata without needing to edit the original data files.
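One way to picture machine-readable metadata that sits beside an untouched data file is a small description that a single generic reader can apply to any data deck. The sketch below is an assumption-laden illustration, not the Initiative's tooling or any published metadata format: the keys `skip_lines`, `missing_value`, `columns` and `datatype`, the function `read_with_metadata`, and all sample values (which are made up, not real observations) are hypothetical.

```python
# Map hypothetical datatype names, as they might appear in a metadata file,
# to Python conversion functions.
CASTS = {"integer": int, "number": float}


def read_with_metadata(text, metadata):
    """Parse white-space delimited text using a separate metadata description."""
    columns = metadata["columns"]
    missing = metadata["missing_value"]
    records = []
    # Skip the summary lines embedded at the top of the file.
    for line in text.splitlines()[metadata["skip_lines"]:]:
        fields = line.split()
        if not fields:
            continue
        if len(fields) != len(columns):
            raise ValueError(
                f"expected {len(columns)} fields, got {len(fields)}: {line!r}"
            )
        records.append({
            column["name"]: None if value == missing
            else CASTS[column["datatype"]](value)
            for column, value in zip(columns, fields)
        })
    return records


# The metadata lives apart from the data, so the original file is never edited.
metadata = {
    "skip_lines": 2,
    "missing_value": "-999",
    "columns": [
        {"name": "year", "datatype": "integer"},
        {"name": "jan", "datatype": "number"},
        {"name": "feb", "datatype": "number"},
    ],
}
raw = "Station: EXAMPLE\nMonthly maximum temperature\n1950 26.1 -999\n1951 25.8 26.4\n"

records = read_with_metadata(raw, metadata)
print(records)
```

With a description like this per data deck, the per-deck effort shifts from writing a bespoke program to writing a few lines of metadata.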

The data pre-processing tools created to parse each Stage 1 data deck into the standard Stage 2 format, and the merge process to create the consolidated Stage 3 data set, were written using the software most familiar to the participating scientists. The merge software source code is available online.

It is worth noting that this sector of the scientific community also commonly uses IDL and is gradually adopting Python as the default software language choice.

The resulting merged dataset is published in several formats, including tabular text. A snippet of the inventory for merged data is provided below; each row describes one of the 31, sites in the dataset. The data is fixed format rather than delimited. Similarly, a snippet of the merged data itself is provided. Again, the data is fixed format rather than delimited. Here we see the station identifier REC being used as a foreign key to refer to the observing station details; in this case Entebbe Airport.
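Reading fixed-format text and following a station identifier from a data row to its inventory row can be sketched as below. The column positions, field names, station identifiers and values are all hypothetical and do not reproduce the databank's actual layout; the sample lines are generated with format specifiers so that their widths match the assumed layout exactly.

```python
def parse_fixed(line, layout):
    """Slice one fixed-format line into named fields, stripping the padding."""
    return {name: line[start:end].strip() for name, start, end in layout}


# Hypothetical layouts: (field name, start offset, end offset).
INVENTORY_LAYOUT = [("station_id", 0, 8), ("name", 8, 28), ("latitude", 28, 36)]
DATA_LAYOUT = [("station_id", 0, 8), ("year", 8, 13), ("value", 13, 20)]

# Made-up sample lines, padded to the widths assumed above.
inventory_lines = [
    f"{'STN00001':<8}{'Example Airport':<20}{'0.05':>8}",
    f"{'STN00002':<8}{'Example Field':<20}{'1.25':>8}",
]
data_lines = [
    f"{'STN00002':<8}{1950:>5}{24.3:>7.1f}",
    f"{'STN00001':<8}{1951:>5}{26.1:>7.1f}",
]

# Index the inventory by its key, then resolve each data row's foreign key.
stations = {
    row["station_id"]: row
    for row in (parse_fixed(line, INVENTORY_LAYOUT) for line in inventory_lines)
}
observations = [parse_fixed(line, DATA_LAYOUT) for line in data_lines]
joined = [
    (stations[obs["station_id"]]["name"], int(obs["year"]), float(obs["value"]))
    for obs in observations
]
print(joined)
```

The layouts carry knowledge that a fixed-format file does not state about itself, which is why they have to be supplied separately by whoever reads the data.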

