Welcome to the website of GitTables!

dataset | paper | github repository

About GitTables

GitTables is a dataset that currently contains 1.7M relational tables extracted from CSV files on GitHub. We continue to curate the dataset and aim to grow it to at least 20M tables. Table columns in GitTables are annotated with more than 2K different semantic types from Schema.org and DBpedia. The column annotations consist of semantic types, hierarchical relations, range types and descriptions.

The high-level pipeline in Figure 1 illustrates how GitTables was created.

Figure 1: high-level pipeline of the process of constructing GitTables.

Why GitTables

Existing large-scale table corpora primarily contain tables extracted from HTML pages, which limits how well they represent database tables. These table corpora also lack semantic annotations, such as semantic column types.

To train and evaluate models for applications beyond the Web, additional resources are needed with tables that resemble relational database tables. We built GitTables to meet that need.

The dataset

The tables in GitTables were extracted from CSV files on GitHub. On average, the tables have 25 columns and 209 rows. We annotated table columns with the real-world concepts that the columns refer to. The labels for these column annotations (referred to as semantic types) were extracted from the DBpedia and Schema.org ontologies.

We used two different annotation methods:

Figure 2 presents the distribution of semantic types of the tables per annotation method and ontology.

Figure 2: distribution of top 25 semantic types resulting from different annotation methods and ontologies.

Each table is stored in a Parquet file and consists of:

Downloads

GitTables is hosted on Zenodo with DOI: 10.5281/zenodo.4943312. To support the use, extension and replication of GitTables in the longer term, we publish the ontologies used for annotation as well.

Dataset downloads

The GitHub Search API requires queries to include a keyword, which we refer to as a topic. For example, you can search code files related to the topic thing. This returns all CSV files that contain the string thing. We have kept this structure in place, hence each zip file consists of the tables retrieved for a topic.
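Because each zip file holds the tables for one topic and each table is stored as a Parquet file, a topic archive can be read without unpacking it to disk. The following is a minimal sketch under a few assumptions that are not stated on this page: the archive members end in `.parquet`, and a Parquet engine such as pyarrow is installed for pandas. The function names are hypothetical.

```python
import io
import zipfile


def is_parquet(name):
    """Assumption: table files inside the archive carry a .parquet extension."""
    return name.lower().endswith(".parquet")


def parquet_members(zip_path):
    """Return the sorted names of archive members that look like Parquet files."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(name for name in archive.namelist() if is_parquet(name))


def iter_tables(zip_path):
    """Yield (member name, DataFrame) pairs for each Parquet file in a topic archive."""
    # Imported here so that listing members works without pandas installed.
    import pandas as pd  # reading Parquet also needs an engine such as pyarrow

    with zipfile.ZipFile(zip_path) as archive:
        for name in sorted(filter(is_parquet, archive.namelist())):
            # Read the member fully into memory; Parquet readers need random access.
            data = archive.read(name)
            yield name, pd.read_parquet(io.BytesIO(data))
```

For example, `for name, table in iter_tables("thing.zip"): print(name, table.shape)` would print the dimensions of each table in a topic archive, where `thing.zip` is a hypothetical file name.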

Ontology downloads

The tables have been annotated with snapshots of DBpedia and Schema.org. These ontologies are provided as pickle files. Each pickle file contains a pandas DataFrame that can be loaded with pandas.
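Loading an ontology snapshot could look like the sketch below. It uses `pandas.read_pickle`; the file name in the usage note and the helper `load_ontology` are hypothetical, and the columns of the resulting DataFrame are not documented on this page, so inspect them before relying on any. Pickle files can execute code when loaded, so only open files obtained from the official download location, and note that a pickle written with a different pandas version may not load cleanly.

```python
import pandas as pd


def load_ontology(path):
    """Load an ontology snapshot stored as a pickled pandas DataFrame."""
    ontology = pd.read_pickle(path)
    # Guard against loading a pickle that holds something other than a DataFrame.
    if not isinstance(ontology, pd.DataFrame):
        raise TypeError(f"expected a DataFrame, got {type(ontology).__name__}")
    return ontology
```

A call such as `load_ontology("dbpedia.pickle").columns` (hypothetical file name) then shows which fields the snapshot provides.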

License

GitTables is licensed under the Creative Commons Attribution 4.0 International license (CC BY 4.0). The table data might, however, be subject to different licenses inherited from the GitHub repositories that the CSV files were retrieved from.

A new version of GitTables will soon be released in which all tables have a license, and the license of each table is contained in the metadata. In the meantime, we suggest using GitHub’s License API to retrieve the license associated with a table (you can use the URL in the metadata to do so) to understand what restrictions apply to each table.
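A license lookup along these lines can be sketched as follows. It relies on GitHub's REST endpoint `GET /repos/{owner}/{repo}/license`. The sketch assumes that the URL in the table metadata has the owner and repository as its first two path segments, as in `https://github.com/<owner>/<repo>/...`; the exact URL format in the metadata is not specified on this page, and the function names are hypothetical. Unauthenticated requests are subject to GitHub's rate limits.

```python
import json
import urllib.error
import urllib.request
from urllib.parse import urlparse


def license_api_url(file_url):
    """Map a GitHub file URL to the license endpoint of its repository.

    Assumption: the first two path segments are the owner and the repository.
    """
    parts = [part for part in urlparse(file_url).path.split("/") if part]
    if len(parts) < 2:
        raise ValueError(f"cannot find owner/repo in {file_url!r}")
    owner, repo = parts[0], parts[1]
    return f"https://api.github.com/repos/{owner}/{repo}/license"


def fetch_spdx_id(file_url):
    """Return the SPDX identifier GitHub reports, or None if no license is found."""
    request = urllib.request.Request(
        license_api_url(file_url),
        headers={"Accept": "application/vnd.github+json"},
    )
    try:
        with urllib.request.urlopen(request) as response:
            payload = json.load(response)
    except urllib.error.HTTPError as error:
        if error.code == 404:  # repository missing or no license detected
            return None
        raise
    return (payload.get("license") or {}).get("spdx_id")
```

The returned identifier (for example `MIT`) only reflects what GitHub detects for the repository as a whole, so it is a starting point for checking restrictions rather than a definitive answer for an individual table.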

Citation

The paper describes the construction and analysis of GitTables in more detail and can be downloaded here. If you use GitTables, please cite our paper:

@article{GitTables,
   title={GitTables: A Large-Scale Corpus of Relational Tables},
   author={Hulsebos, Madelon and Demiralp, Çağatay and Groth, Paul},
   journal={arXiv preprint arXiv:2106.07258},
   url={https://arxiv.org/abs/2106.07258},
   year={2021}
}

Contact

GitTables has been developed by:

Please consider reporting cases of personal or otherwise undesired tables in GitTables using the form below. Feedback, suggestions and results from projects with GitTables are also very welcome!




Alternatively, you can send an email to m.hulsebos(at)uva.nl.