www.data-compression.info
The Data Compression Resource on the Internet

 Canterbury Corpus


The University of Canterbury


The Canterbury Corpus was developed by Ross Arnold and Timothy Bell in 1997 at the University of Canterbury, New Zealand, as an improved version of the Calgary Corpus. The files were chosen because the results that existing compression algorithms achieve on them are typical. The corpus was published at DCC 97 in the paper "A corpus for the evaluation of lossless compression". The final files were selected from a set of more than 800 candidate files. The DCC 97 paper explains how the files were chosen and why it is difficult to find "typical" files.
There are two main editions of the Canterbury Corpus: the Standard Canterbury Corpus, consisting of 11 files (alice29.txt, asyoulik.txt, cp.html, fields.c, grammar.lsp, kennedy.xls, lcet10.txt, plrabn12.txt, ptt5, sum, xargs.1), and the Large Canterbury Corpus, consisting of 3 files (bible.txt, e.coli, world192.txt).

The corpus is available below and at http://corpus.canterbury.ac.nz/.

 Publications


Title

Description

A corpus for the evaluation of lossless compression


The paper that introduced the Canterbury Corpus, written by Ross Arnold and Timothy Bell and published at DCC 97 in 1997. The paper explains how the files were chosen and why it is difficult to find "typical" files.
 

The Canterbury Corpus


The website of the Canterbury Corpus, maintained by Matt Powell. The site includes extensive information about the corpus itself, its different editions and its purpose, along with a summary and details of compression rates and times for a variety of compression algorithms.
 

Evaluating Lossless Compression Methods


In his 2001 paper, Matt Powell describes the work of maintaining the Canterbury Corpus website, and in particular the process of automating results generation. The popularity and usefulness of the Canterbury Corpus as a data compression standard are investigated, and several areas for further research and development of the current system are proposed.
 

 People


Name

Description

Ross Arnold


Ross Arnold was a student of Timothy Bell and, together with him, is a "father" of the Canterbury Corpus.
 

Timothy Bell


Timothy Bell works at the University of Canterbury, New Zealand, and is the "father" of the Canterbury Corpus. His research interests include compression, computer science for children, and music.
 

Matt Powell


Matt Powell is studying Computer Science at the University of Canterbury, New Zealand. He likes academic life and drawing cartoons. As the secretary of the University Comedy Club, he does all sorts of comedy, including skits, improv, stand-up and songs.
 

 Source Code


Title

Description

The Large Canterbury Corpus


ZIP file containing: bible.txt, e.coli, world192.txt
 

The Standard Canterbury Corpus


ZIP file containing: alice29.txt, asyoulik.txt, cp.html, fields.c, grammar.lsp, kennedy.xls, lcet10.txt, plrabn12.txt, ptt5, sum, xargs.1
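
As a rough illustration of how a corpus archive like the ones above might be used, the following Python sketch reads every file from a ZIP archive and reports how well three general-purpose compressors from the Python standard library do on it. It is only a sketch under stated assumptions: the function names `bits_per_byte` and `evaluate_corpus` are invented for this example, the metric (compressed bits per input byte) is one common choice, and none of this is the procedure used to produce the results on the Canterbury Corpus website.

```python
import bz2
import lzma
import zipfile
import zlib


def bits_per_byte(original: bytes, compressed: bytes) -> float:
    """Compressed size in bits divided by the original size in bytes."""
    if not original:
        return 0.0  # avoid division by zero for empty files
    return 8 * len(compressed) / len(original)


def evaluate_corpus(archive) -> dict:
    """Return {file name: {compressor name: bits per byte}} for a ZIP archive.

    `archive` may be a path or a binary file-like object (hypothetical
    helper for illustration; not part of any corpus tooling).
    """
    # Standard-library compressors only; the choice is an assumption.
    compressors = {
        "zlib": lambda data: zlib.compress(data, 9),
        "bz2": lambda data: bz2.compress(data, 9),
        "lzma": lzma.compress,
    }
    results = {}
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries, keep only files
            data = zf.read(info.filename)
            results[info.filename] = {
                name: bits_per_byte(data, compress(data))
                for name, compress in compressors.items()
            }
    return results
```

Calling `evaluate_corpus` with the path of a downloaded corpus archive (the local file name is up to the reader) returns one entry per corpus file; lower values mean better compression.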
 

 

Copyright © 2002-2022 Dr.-Ing. Jürgen Abel, Lechstraße 1, 41469 Neuß, Germany. All rights reserved.