Welcome to the website of GitTables!

About GitTables

GitTables is a corpus that currently contains 1.7M relational tables extracted from CSV files on GitHub. Ongoing curation aims to grow the corpus to at least 20M tables. Table columns in GitTables have been annotated with more than 2K different semantic types from Schema.org and DBpedia. Our column annotations consist of semantic types, hierarchical relations, range types and descriptions.

The high-level pipeline in Figure 1 illustrates how GitTables was created.

Figure 1: high-level pipeline of the process of constructing GitTables.

Why GitTables

Existing table corpora primarily contain tables extracted from HTML pages, which limits how well they represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, additional resources are needed with tables that resemble relational database tables. We built GitTables to meet that need.

The corpus

The tables in GitTables were extracted from CSV files. On average the tables have 25 columns and 209 rows. We annotated table columns with real-world concepts that the columns refer to. The labels for these column annotations (referred to as semantic types) were extracted from the DBpedia and Schema.org ontologies. We used two different annotation methods:

Figure 2 presents the distribution of semantic types of the tables per annotation method and ontology.

Figure 2: distribution of top 25 semantic types resulting from different annotation methods and ontologies.

Each table is stored in a Parquet file and consists of:

Downloads

GitTables is hosted on Zenodo with DOI: 10.5281/zenodo.4943312. To facilitate different use cases, we publish different versions of GitTables. To ensure that GitTables can be used, extended and replicated in the long term, we publish the ontologies used for annotation as well.

Corpus downloads

The GitHub Search API requires queries to include a keyword, which we refer to as a ‘topic’. For example, you can search for code files related to the topic ‘thing’. This returns all CSV files that contain the string ‘thing’. We have kept this ‘topic’ structure in place, hence each zip file consists of the tables retrieved for that topic.
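As a rough sketch of how one topic archive could be inspected after download, the snippet below lists the Parquet files inside a zip file using only the Python standard library. The function names, the example file name thing.zip, and the idea that the archive file name matches the topic are assumptions made for illustration; check the actual file names on Zenodo.

```python
import zipfile
from pathlib import Path, PurePosixPath


def topic_of(zip_path):
    """Guess the topic from the archive file name (assumption: 'thing.zip' -> 'thing')."""
    return Path(zip_path).stem


def list_tables(zip_path):
    """Return the sorted names of all Parquet files inside one topic archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(
            name
            for name in archive.namelist()
            # Keep only table files; skip directories and any other members.
            if PurePosixPath(name).suffix == ".parquet"
        )
```

For a downloaded archive, list_tables("thing.zip") would return the table files retrieved for the topic ‘thing’, which can then be extracted and read one at a time.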

Ontology downloads

The tables have been annotated with snapshots of DBpedia and Schema.org. These ontologies are provided in the form of a pickle file. Each pickle contains a pickled Pandas DataFrame that can be read through Pandas.
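A minimal sketch of reading one of these ontology files is shown below. It relies on pandas.read_pickle; the function name load_ontology and the file name in the usage note are hypothetical, and the columns of the resulting DataFrame are not specified here, so inspect them after loading. Note that unpickling can execute arbitrary code, so only load pickle files obtained from a source you trust.

```python
import pandas as pd


def load_ontology(path):
    """Load a pickled ontology and check that it really is a pandas DataFrame."""
    ontology = pd.read_pickle(path)
    if not isinstance(ontology, pd.DataFrame):
        raise TypeError(f"expected a DataFrame, got {type(ontology).__name__}")
    return ontology
```

For example, calling load_ontology("dbpedia.pkl") (a placeholder file name) and then printing ontology.columns shows which fields are available for each semantic type.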

License

GitTables is licensed under the Creative Commons Attribution 4.0 International license (CC BY 4.0). The table data might, however, be licensed under different licenses inherited from the GitHub repositories that the CSVs were retrieved from.

A new version of GitTables will soon be released in which 1) all tables have a license and 2) the license of each table is included in the metadata. In the meantime, we suggest using GitHub’s License API to retrieve the license associated with a table (you can use the URL in the metadata to do so) to understand what restrictions apply to each table.
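The license lookup could be sketched as follows. GitHub exposes a repository's license at the endpoint /repos/{owner}/{repo}/license; the helper names below are hypothetical, and the snippet assumes that the URL stored in the table metadata is a GitHub URL whose path starts with the owner and repository name (as in github.com or raw.githubusercontent.com links). Unauthenticated requests to the GitHub API are rate limited, so a token may be needed for bulk lookups.

```python
import json
import urllib.request
from urllib.parse import urlparse


def license_api_url(table_url):
    """Build the GitHub License API endpoint for the repository behind a table URL."""
    parsed = urlparse(table_url)
    parts = [part for part in parsed.path.split("/") if part]
    # API URLs look like /repos/<owner>/<repo>/..., so drop the 'repos' prefix.
    if parsed.netloc == "api.github.com" and parts[:1] == ["repos"]:
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"cannot find owner/repo in {table_url!r}")
    owner, repo = parts[0], parts[1]
    return f"https://api.github.com/repos/{owner}/{repo}/license"


def fetch_license_spdx(table_url, timeout=10.0):
    """Return the SPDX identifier reported by GitHub, or None if it is missing.

    Performs a network request; repositories without a detected license
    make GitHub answer with HTTP 404, which raises urllib.error.HTTPError.
    """
    request = urllib.request.Request(
        license_api_url(table_url),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return (payload.get("license") or {}).get("spdx_id")
```

Keeping the URL construction separate from the network call makes it easy to check the parsing against a few metadata URLs before sending any requests.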

Citation

The paper describes the construction and analysis of GitTables in more detail and can be downloaded here. If you use GitTables, please cite our paper:

@article{GitTables,
   title={GitTables: A Large-Scale Corpus of Relational Tables},
   author={Hulsebos, Madelon and Demiralp, Çağatay and Groth, Paul},
   journal={arXiv preprint arXiv:2106.07258},
   url={https://arxiv.org/abs/2106.07258},
   year={2021}
}

Contact

GitTables has been developed by:

If you have feedback or suggestions, or have done an interesting project with GitTables, feel free to share through the form below!




Alternatively, you can send an email to m.hulsebos(at)uva.nl.