The Canterbury Corpus was developed by Ross Arnold and Timothy Bell in 1997 at the University of Canterbury, New Zealand, as an improved version of the Calgary Corpus. The files were chosen because their results with existing compression algorithms are typical. The corpus was published at DCC 97 in the paper "A corpus for the evaluation of lossless compression". The final files were selected from a set of more than 800 candidate files. The paper explains how the files were chosen and why it is difficult to find "typical" files. There are two main editions of the Canterbury Corpus: the Standard Canterbury Corpus, consisting of 11 files (alice29.txt, asyoulik.txt, cp.html, fields.c, grammar.lsp, kennedy.xls, lcet10.txt, plrabn12.txt, ptt5, sum, xargs.1), and the Large Canterbury Corpus, consisting of 3 files (bible.txt, e.coli, world192.txt).
The corpus is available below and at http://corpus.canterbury.ac.nz/.
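
As a rough illustration of how a corpus like this is typically used, the following Python sketch measures compressed size in bits per input byte for each file in a directory. It is not the evaluation procedure used by the corpus authors or the corpus website. The function names `bits_per_byte` and `evaluate`, the `COMPRESSORS` table, and the choice of compressors (those shipped with the Python standard library) are assumptions made for this example.

```python
import bz2
import lzma
import zlib
from pathlib import Path

# Compressors from the Python standard library (an arbitrary choice for this
# sketch); any callable mapping bytes -> bytes could be added here.
COMPRESSORS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lzma.compress,
}


def bits_per_byte(data: bytes, compress) -> float:
    """Return the compressed size in bits per input byte (lower is better)."""
    if not data:
        raise ValueError("cannot measure an empty input")
    return 8 * len(compress(data)) / len(data)


def evaluate(corpus_dir: str) -> dict[str, dict[str, float]]:
    """Measure every regular file in corpus_dir with every compressor."""
    results = {}
    for path in sorted(Path(corpus_dir).iterdir()):
        if path.is_file():
            data = path.read_bytes()
            results[path.name] = {
                name: bits_per_byte(data, fn) for name, fn in COMPRESSORS.items()
            }
    return results
```

Calling `evaluate("cantrbry")`, where `cantrbry` is a hypothetical directory holding the unpacked corpus files, would return one entry per file with a score for each compressor. Reporting per-file scores rather than a single total keeps the differences between file types visible.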

Title | Description
A corpus for the evaluation of lossless compression | The paper by Ross Arnold and Timothy Bell that introduced the Canterbury Corpus, published at DCC 97 in 1997. It explains how the files were chosen and why it is difficult to find "typical" files.
The Canterbury Corpus | The website of the Canterbury Corpus, maintained by Matt Powell. The site includes extensive information about the corpus itself, its different editions and its purpose, as well as a summary and details of compression rates and times for a variety of compression algorithms.
Evaluating Lossless Compression Methods | In this 2001 paper, Matt Powell describes the work of maintaining the Canterbury Corpus website, in particular the process of automating results generation. The popularity and usefulness of the Canterbury Corpus as a data compression standard is investigated, and several areas for further research and development of the current system are proposed.

Name | Description
Ross Arnold | Ross Arnold was a student of Timothy Bell and is, together with him, the "father" of the Canterbury Corpus.
Timothy Bell | Timothy Bell works at the University of Canterbury, New Zealand, and is, together with Ross Arnold, the "father" of the Canterbury Corpus. His research interests include compression, computer science for children, and music.
Matt Powell | Matt Powell is studying Computer Science at the University of Canterbury, New Zealand. He likes academic life and drawing cartoons. As the secretary of the University Comedy Club, he does all sorts of comedy, including skits, improv, stand-up and songs.

Title | Description
The Large Canterbury Corpus | ZIP file containing: bible.txt, e.coli, world192.txt
The Standard Canterbury Corpus | ZIP file containing: alice29.txt, asyoulik.txt, cp.html, fields.c, grammar.lsp, kennedy.xls, lcet10.txt, plrabn12.txt, ptt5, sum, xargs.1

Copyright © 2002-2025 Dr.-Ing. Juergen Abel, Neckarstrasse 4, 41469 Neuss, Germany. All rights reserved.