Why data should be free

Germanic Lexicon Project

Humanity benefits when there is a common pool of knowledge and information which everybody can freely draw from. Old works and old pieces of information can be reworked and combined in novel ways.

Sharing is the basic social instinct. If you create data and make it available, you benefit other people. Likewise, you benefit when other people share their data with you.

When you share your data without restrictions, it becomes possible for others to create derived works. These derived works often bring benefit back to you. People will use your data in interesting ways which you might have never foreseen.

I'm not going to attempt to fully explore the arguments in favor of data sharing. Here are a few links to projects which have a similar philosophy to the one I have just briefly described, and which discuss this issue in greater depth:

Creative Commons

Free Software Foundation

Project Gutenberg

On copyright

The publishing, music, and motion picture industries have come under strong criticism from many quarters because of a long history of abuses. These abuses have led some people to question the whole notion of copyright.

Despite the ethic of data sharing which I sketched above, I am not opposed to the general concept of copyright. I recognize that copyright helps create a financial incentive for authors and artists to publish, and these publications benefit the public. (Financial reward is by no means the only reason why authors and artists create; but it is often a motivating factor.) I do not think that there is anything fundamentally wrong about businesses which make a reasonable profit by creating, copyrighting, and selling books, music, motion pictures, and other forms of information.

On the other hand, a large pool of free information is also beneficial to the public. This pool can only remain vital if new materials continue to enter it. No copyrights on published works have expired in the United States since 1998. Under current law, no new works will enter the public domain through expiration of copyright until 2019. Such a long term of copyright is not beneficial to the public.

The term of copyright can be too short or too long. If it is too short, then it is not good for the public, because there are fewer new publications due to reduced monetary incentive. If it is too long, then it is also not good for the public, because the pool of common information is not replenished, and scholarship and art stagnate because of an inability to rework and recombine what has come before.

Even when copyright expires, the original publisher can still continue to profit by selling copies of the public domain text. The pricing comes to be determined by competition rather than by a legally enforced monopoly. This competitive situation still allows opportunities for profit. For example, the works of Charles Dickens can be downloaded for free; and yet publishers continue to make a livelihood by selling inexpensive paperback editions to one market segment, expensive leather-bound hardback editions to another market segment, etc. Favoring shorter copyright terms is not necessarily an anti-business view.

The great majority of published works do not continue to have any significant commercial value many decades after their original publication. When the copyright on a work expires, the former copyright owner generally suffers no financial harm and usually does not care that the copyright has expired.

However, there is a tiny fraction of works which continue to have significant commercial value many decades after their original publication. The current copyright laws are written entirely to serve the interests of the owners of copyright on this tiny minority of works.

The duration of copyright in the U.S. was originally 14 years. For many years, the duration was 28 years with the ability to renew for another 28. Now, the duration of copyright in the U.S. is the entire life of the author plus an extra 70 years, which in many cases is longer than the shelf life of the physical medium of the work (as in the case of celluloid films or many magnetic media, for example).

One solution which has been proposed is that a copyright owner must pay the U.S. Copyright Office one dollar per year to maintain copyright on a work (some easy system can be set up for copyright owners to re-register their copyrights online). This solution strikes me as very sensible and a very good balance. If a copyright owner doesn't consider a work worth the trouble of renewing every year (which will be true of many works), then the work will enter the public domain. This strategy requires a modest amount of extra bookkeeping and effort by major publishing or entertainment corporations, but it is a very small amount compared to the entire range of activity of such a business.

Following are a few points in question-and-answer format.

Q: What is the copyright status of these documents?

Under current U.S. law, everything published prior to 1923 has passed into the public domain (i.e., the copyright has permanently expired, and anybody who wants to can republish the materials). Hence, the content in the scanned image documents and raw text files is unambiguously not under copyright.

A work published in the U.S. between 1923 and 1950 might or might not still be under copyright. It depends on whether the copyright was renewed. If the copyright owner did not renew the copyright, then the copyright expired after 28 years; the copyright was permanently lost and was not subject to the extensions of copyright term which took effect in 1978. The only way to tell is by looking up the work in the records of the U.S. Copyright Office in Washington, D.C. (Older works published outside the U.S. are treated under the Uruguay Round Agreements Act as if the copyright had been renewed after 28 years. For example, a book published in Germany in 1925 will automatically be under copyright in the U.S. through the end of 2020.)

U.S. copyright law might automatically assign copyright to me on certain materials here. Since my intent is for these materials to be freely shareable, I hereby assign any copyright I have on these language materials to the public domain. This might be an issue in the following two cases: 1) my annotations to the Tocharian texts, and 2) the markup tags which I have added to some of the public domain texts. The markup tags are probably not copyrightable to begin with (see below), but my relinquishment of copyright clears up any ambiguity at least for the materials posted here.

Q: If a text is in the public domain and I digitize it and add markup tags, do I have copyright on the markup tags?

This is a very good question. I've looked into this question, and I believe that the answer I will give here is a well-informed one, but let me emphasize that I am not an attorney. I did discuss this matter with a professor of copyright law. He did not have direct experience with the matter of electronic markup tags and did not know of any cases which addressed the very specific question I am considering here. However, he felt that my interpretation was a reasonable one.

First, let me explain what is meant by markup tags. Consider an entry in a paper dictionary. The entry consists of various different sections, such as a headword, a pronunciation, an etymology, a gloss (or definition), etc. A paper dictionary could explicitly label each of these fields; for example, it might include the label "Headword:" before the headword. Because of space considerations, most paper dictionaries do not do this; the structure of the entry is not made this fully explicit, although some fields might be indicated with text styling (such as boldface type for the headword).

In an electronic version of a dictionary, by contrast, these fields or sections of an entry are often explicitly denoted. This is usually done with markup tags; for example, the headword might be marked with the tags <HEADWORD>...</HEADWORD>.

Suppose you digitize a paper dictionary whose copyright has expired. You add tags to the text to make its structure explicit. The question is this: do you have a legitimate claim to copyright on the tags you have added, even though the text itself is in the public domain?

I have seen two clear instances where such a claim of copyright on markup tags has been made in exactly these circumstances. However, just because somebody says that they own the copyright on something does not necessarily make it so; some things cannot be copyrighted. My interpretation is that this particular claim of copyright is not enforceable, at least in the United States. If the matter went to court, there is good reason to believe that the court would find that the individual or group who added the tags does not own a copyright on those tags, for the following reasons.

In 1991, the U.S. Supreme Court issued a ruling in the case of Feist Publications, Inc., v. Rural Telephone Service Co., Inc. (499 U.S. 340). Rural claimed that Feist had violated its copyright by republishing the listings in a telephone book published by Rural. It is a principle of U.S. copyright law that facts cannot be copyrighted. The court ruled that a telephone book is simply a listing of facts; the contents are devoid of expressive content and do not rise to the standard of being an original creative work which is capable of being copyrighted. Rural argued that it was entitled to copyright on the telephone book contents because it had invested a large effort in compiling and maintaining this collection of facts. The court specifically rejected this "sweat of brow" argument; simply putting a lot of work into something does not by itself create a right to copyright.

A related case is Matthew Bender v. West Publishing Co. (158 F.3d 674). West created online editions of court rulings, adding public information about the attorneys and about the subsequent history of each case. Another company copied the electronic editions from West, stripping the clearly copyrightable commentary but leaving in the purely factual information added by West. The resulting dispute over whether this copying violated West's copyright went to court. The court acknowledged that West Publishing Company spends considerable time and effort ensuring that its electronic versions of court rulings are accurate. Nevertheless, the court ruled that this editorial effort was largely the mechanical application of pre-existing rules and did not involve a sufficient degree of originality or creativity to entitle West to copyright on these court texts. Court rulings are created by government officials at public expense; no amount of effort by a private company entitles it to a copyright on this part of the public record.

With these rulings in mind, let's consider the case of markup tags.

Marking up a text with tags involves a lot of work. However, the mere fact that much work is involved does not by itself make the product capable of being copyrighted (as was held in both the Feist and the West cases). The tags can only be copyrighted if they rise to the standard of being "expressive" content, meaning that they are an original creative work of the author rather than mere facts.

The addition of the markup tags is an entirely mechanical task. It would be hard to argue that there is the spark of creativity or originality which would lead to the product being considered novel expressive content. Indeed, in the case of dictionaries, a computer program can perform most of the work involved in marking up a text. The tags simply make explicit the inherent structure of the existing text; they encode certain facts about the structure of a public-domain text, and facts cannot be copyrighted. These facts are generally expressed in a markup form prescribed by industry standard (e.g. the TEI Guidelines). Two individuals working separately, and applying the same industry-standard markup rules to the same text, would produce files which are indistinguishable.
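To give a rough idea of how mechanical this task can be, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the one-line entry layout ("headword (part of speech) gloss"), the tag names other than HEADWORD, the sample entry, and the function name tag_entry are all hypothetical, and none of them is taken from the TEI Guidelines or from the files on this site.

```python
import re
from xml.sax.saxutils import escape

# Assumed (hypothetical) layout of a plain-text entry:
#   headword (part of speech) gloss
ENTRY_PATTERN = re.compile(
    r"^(?P<headword>\S+)\s+\((?P<pos>[^)]+)\)\s+(?P<gloss>.+)$"
)


def tag_entry(line: str) -> str:
    """Wrap each field of one dictionary entry in explicit markup tags.

    The function adds no new content; it only labels structure that is
    already present in the text of the entry.
    """
    match = ENTRY_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"entry does not follow the assumed layout: {line!r}")
    # escape() protects characters such as & and < in the original text.
    return (
        "<ENTRY>"
        f"<HEADWORD>{escape(match['headword'])}</HEADWORD>"
        f"<POS>{escape(match['pos'])}</POS>"
        f"<GLOSS>{escape(match['gloss'])}</GLOSS>"
        "</ENTRY>"
    )


if __name__ == "__main__":
    # Illustrative sample entry, not copied from any dictionary on this site.
    print(tag_entry("cyning (noun) king"))
```

Anyone who runs the same rules over the same entry gets the same tagged output, which is the point made above: the tags record facts about the structure of the text rather than adding anything original to it.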

Q: Can I copy these materials into my own web space and/or make modifications to them?

Yes, please do. I welcome this and encourage it. This is a good thing for at least the following reasons:

You're providing a valuable backup in case my pages ever go under.

If you make modifications to the data, you're adding value to it for everybody else.

I would especially encourage you to create derived works from the data, e.g. a web service where you click a word in a Gothic text to look up dictionary information about it.

You don't need my permission to copy these public domain materials, but it is polite to inform me; and I'd be glad to know you copied them. Of course, I ask that you give credit where it is due, and a link back to my pages would be nice. I would also appreciate it if you report to me any errors you find.

Q: Are there any ways I can't use the data?

Since I don't own the copyright on these materials, I can't stop you from doing as you please with them. However, I would like to point out the following ethical matter.

It has already come up once that someone copied TIFF files of scanned books from my web pages, converted them to GIF files, and then posted them to his own site, adding a notice that he did not want others to copy his materials into their own web space (this request is not legally enforceable, since the materials are in the public domain; but it would likely scare off those who aren't aware of their rights under U.S. copyright law, and it creates an environment which is hostile to data sharing in general). I took issue with this; either participate in the gift economy or don't, but don't take freely available materials from others unless you're willing to give back to the public in like kind. I think this individual and I finally came to an understanding, but I'm sure that this won't be the last time I have to deal with this issue.

Q: Why are you making the materials available for free?

There are certainly cases where others have created online compilations of public domain historical linguistic data and are charging high fees for a CD (e.g. $200). The benefit to charging a fee like this is that I might pull in a few thousand dollars a year, which might be enough to pay one undergrad to correct the data for 10 or 20 hours a week.

However, one of my major goals here is to further interest and research in these languages, and I believe that I can best further this goal by making the data freely available. Consider the following sorts of people:

An undergrad who wants to crunch the data for a term paper

A non-specialist who occasionally needs to look something up

A high school student with a burning interest in Old English

Someone at a college in a lower-income country who has no access to a decent library but who is fortunate enough to have an Internet connection

Etc.

In most cases, it is impractical or impossible for these people to come up with the $200 for the CD; they would simply have to do without, which is not in the best interests of the field. Further, this kind of fee is breaking the backs of college libraries, and I don't wish to contribute to the problem.

There are benefits to making the data freely available. For example, I have had very good experience with volunteers helping to enter or correct the data. Volunteers are willing to do this because the resulting data belong to the community, for everybody to share and use as they please. Volunteer help would be unlikely if I were charging for the data.

Similarly, if the data are freely available, it is much more likely that individuals will create derived works. For example, if I make a dictionary of Old English freely available, and someone else makes a corpus of Old English texts freely available, then some third party could sit down and create a handy web resource where you can click words in the text to obtain dictionary information. Some hobbyist could do this in his or her spare time. However, if a license is needed for each piece of the derived product, such projects are much less likely to happen.

Just because somebody claims copyright on materials does not necessarily mean that the claim is enforceable. For example, I could state that I hold the copyright on the online versions of the Indo-European materials on this site on the grounds that I put a lot of work into collecting them and putting them online; but this interpretation of U.S. copyright law would simply be wrong (Feist Publications, Inc. v. Rural Tel. Serv. Co., 499 U.S. 340 (1991), in which the U.S. Supreme Court specifically rejected the "sweat of brow" doctrine). There's no law against incorrectly stating that you own the copyright on something which is in fact in the public domain, but it can intimidate people who aren't sure of their rights under copyright law.

The reason I'm emphasizing the point is that there are some who have chosen otherwise, to my own personal detriment as well as to the detriment of the community in general. I hope that enough of us continue to contribute to the pool of free data that the economy of artificial scarcity will come to be engulfed in the rising tide.
