Pandora's Jar
In December of 2004, Google, in its ever-expanding quest to index content, began scanning and indexing the texts of millions of books from partner libraries. Google allows users to search the database and identify the books relevant to their research needs. In one sense, the Google Book Search project is not a new concept. Proprietary databases have existed for years. LexisNexis, a database of legal cases, was founded in 1970, and by 1980 included news sources. Westlaw, a competitor, also began in the 1970s.
There are plenty of other proprietary archives, such as JSTOR for academic papers, and open access databases like Project Gutenberg, a user-created attempt to make all out-of-copyright works available. There are two main differences between Google's book search and these other projects. First, these other resources more or less cater to specific markets and sets of users; they do not, as Google does, offer up the vast contents of books to the world. Second, Google is much less interested in making available the texts it indexes and much more interested in making those texts searchable.
Publishers and some authors don't seem to comprehend the difference. Since the announcement of the project, the publishing industry has been suing to stop Google from indexing books under copyright, a category that includes the vast majority of books. Book search, however, is more about connecting a potential customer with a potential product. It will display only a short snippet of a copyrighted text and then offer a link to purchase the book. In short, Google Book Search is selling, and will continue to sell, more books, mostly books that many users might never otherwise have found.
Google is not the only software giant attempting to catalogue the world's literature. Microsoft has its own version as well, and many large libraries, especially university libraries, have digitized their card catalogues. But all of these projects, and the proprietary databases as well, have holes. Early texts are often overlooked because delicate manuscripts are difficult to scan. Proprietary databases are not linked together, so a researcher still needs to search multiple databases no matter what.
But assuming Google and Microsoft prevail in their legal fights with the publishing industry, and there seems little reason to think they won't considering how little water the publishers' argument holds, it is quite reasonable to believe that in a few short years most modern Western writing will appear in a search-friendly database, available for anyone to find.
A few years ago, I was on a team researching a local politician, looking through the municipal records of his tenure in office. His career spanned about ten years. The tomes of municipal records each covered about six months; the books were more than five hundred pages each, and about the size of a newspaper. The municipality had records going back to the 19th century, had I been particularly interested in council meeting minutes from 1890. In all, identifying the relevant pages from the books took more than 50 man-hours, just to cover ten years of information. The thought occurred to me: why is this not indexed by Google?
New Jersey's Open Public Records Act requires that all public records be accessible. Clerks and secretaries of government agencies are allowed to charge a small copying fee per page, but otherwise must fulfill all requests; they have seven days from the time the request is filed. During the course of my research, we could have simply filed an OPRA request, giving the municipal clerk seven days to make roughly 10,000 copies for a mere $2,500. Such a request would have overwhelmed the municipal clerk, which is why I ultimately spent three days marking off the exact pages we needed copied. How much simpler this would have been had the records been digitized and available online in a database that could be searched with Google.
To the chagrin of many New Jersey politicians more accustomed to hiding their deeds behind the veil of government bureaucracy, the Open Public Records Act did not bring undue financial hardships to the state's agencies. Government did not stop working under the weight of unfettered OPRA requests. Likewise, mandating the digitization of public records would cause little more trouble.
The process is largely a matter of linking existing hardware with existing software. In short, there are few obstacles, other than the political ramifications for corrupt politicians, keeping this information from being easily accessible in a searchable database. From the perspective of researchers, raw data like municipal meeting minutes could prove invaluable, if only it were freely available. In all likelihood, digitizing public records in the same way Google Book Search is cataloging library contents will be a common practice in the coming years.
The information age is upon us, and the digitization of all information, on demand, anytime, anywhere, will define the future. Digital archives are becoming so ubiquitous that hard-copy libraries are shrinking. Thanks to LexisNexis and Westlaw, the traditional law library is in many firms nothing more than ornamental. JSTOR's archive of academic journals is so far superior to anything a single academic library could contain that many institutions are cutting back on journals in favor of paying for database access. Yet, while digital archives are physically smaller, more available, and easier to search for relevant information, the system is not perfect.
As The New Yorker hinted in "Future Reading" -- available digitally, of course -- all this digitizing may democratize information, but that doesn't speak to its accuracy. Anthony Grafton explains: "When Erasmus told the story of Pandora, he said that she opened not a jar, as in the original version of the story, by the Greek poet Hesiod, but a box. In every European language except Italian, Pandora’s box became proverbial."
In essence, with digital information we run the risk, as a society, that all information becomes Pandora's Jar.
The plot of Star Wars: Attack of the Clones centers on Jedi Obi-Wan Kenobi attempting to unravel a mystery. A source tells him of a mythical planet. He first heads to the main library archive to search the database -- but can't find information on the planet he knows exists. The librarian informs him rather curtly that if the planet isn't in the database, it simply does not exist. The answer is revealed by the simple mind of a child: the database has been altered.
In Orwell's brilliantly frightening 1984, Winston Smith works in the Ministry of Truth, where he and his coworkers painstakingly update history. They remove photographs and alter magazines and newspapers to provide the correct and "accurate" history. When Orwell wrote 1984, "cut," "copy" and "paste" were not figurative, but in fact literal, physical actions. With digital archives, though, editing and revising history becomes significantly easier. Cut, copy and paste are a matter of a few easy keyboard commands, or a few clicks of the mouse.
A complete digital archive of history, literature and science is a real possibility in the not too distant future. But such a record carries the very real possibility of manipulation and alteration. Relying solely on a database like Google Book Search might seem an easy solution to the arduous research process, but we also risk opening Pandora's box -- or jar, as it may actually be.
