Canterbury Corpus

www.data-compression.info
The Data Compression Resource on the Internet

The Canterbury Corpus was developed in 1997 by Ross Arnold and Timothy Bell at the University of Canterbury, New Zealand, as an improved version of the Calgary Corpus. The files were chosen because their results with existing compression algorithms are typical. The corpus was published at DCC 97 in the paper "A corpus for the evaluation of lossless compression". The final files were selected from a pool of more than 800 candidate files. The DCC 97 paper explains how the files were chosen and why it is difficult to find "typical" files.
There are two main editions of the Canterbury Corpus: the Standard Canterbury Corpus, consisting of 11 files (alice29.txt, asyoulik.txt, cp.html, fields.c, grammar.lsp, kennedy.xls, lcet10.txt, plrabn12.txt, ptt5, sum, xargs.1), and the Large Canterbury Corpus, consisting of 3 files (bible.txt, e.coli, world192.txt).

The corpus is available below and at http://corpus.canterbury.ac.nz/.

Publications

Title

Description

A corpus for the evaluation of lossless compression

The paper by Ross Arnold and Timothy Bell that introduced the Canterbury Corpus, published at DCC 97 in 1997. It explains how the files were chosen and why it is difficult to find "typical" files.

The Canterbury Corpus

The website of the Canterbury Corpus, maintained by Matt Powell.
The site provides extensive information about the corpus, its editions and its purpose, along with summaries and details of compression rates and times for a variety of compression algorithms.

Evaluating Lossless Compression Methods

In this 2001 paper, Matt Powell describes the work of maintaining the Canterbury Corpus website, in particular the process of automating results generation. The paper investigates the popularity and usefulness of the Canterbury Corpus as a data compression standard and proposes several areas for further research and development of the current system.

People

Name

Description

Ross Arnold

Ross Arnold was a student of Timothy Bell and, together with him, is one of the two "fathers" of the Canterbury Corpus.

Timothy Bell

Timothy Bell works at the University of Canterbury, New Zealand, and is the "father" of the Canterbury Corpus. His research interests include compression, computer science for children, and music.

Matt Powell

Matt Powell is studying Computer Science at the University of Canterbury, New Zealand. He likes academic life and drawing cartoons. As the secretary of the University Comedy Club, he does all sorts of comedy, including skits, improv, stand-up and songs.

Source Code

Title

Description

The Large Canterbury Corpus

ZIP file containing: bible.txt, e.coli, world192.txt

The Standard Canterbury Corpus

ZIP file containing: alice29.txt, asyoulik.txt, cp.html, fields.c, grammar.lsp, kennedy.xls, lcet10.txt, plrabn12.txt, ptt5, sum, xargs.1
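
As a rough illustration of how the unpacked files might be used, the following Python sketch reports the compressed size in bits per input byte for every file in a directory. It is not the method used on the corpus website. The three standard-library compressors (zlib, bz2, lzma), the function names bits_per_byte and evaluate_corpus, and the directory argument are all assumptions chosen for this example.

```python
import bz2
import lzma
import zlib
from pathlib import Path

# Standard-library compressors used purely as examples (an assumption of this
# sketch); the corpus itself does not prescribe any particular algorithm.
COMPRESSORS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}


def bits_per_byte(data: bytes) -> dict:
    """Return the compressed size in bits per input byte for each compressor."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    return {
        name: 8 * len(compress(data)) / len(data)
        for name, compress in COMPRESSORS.items()
    }


def evaluate_corpus(corpus_dir: str) -> dict:
    """Map each regular file in corpus_dir to its bits-per-byte figures."""
    results = {}
    for path in sorted(Path(corpus_dir).iterdir()):
        if path.is_file():
            results[path.name] = bits_per_byte(path.read_bytes())
    return results
```

Calling evaluate_corpus on the directory that holds the unpacked archive returns one entry per file, keyed by file name. Lower values mean better compression, and a value of 8 or more means the compressor did not shrink the file at all. Timing is deliberately left out, since it depends heavily on the machine.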

Copyright © 2002-2025 Dr.-Ing. Juergen Abel, Neckarstrasse 4, 41469 Neuss, Germany. All rights reserved.