EEBO in TypeWright
We are pleased to announce that the Mellon-funded Early Modern OCR Project – eMOP – has finished running optical character recognition (OCR) software on the 138,538 documents in ProQuest’s Early English Books Online (EEBO), and we are now making almost all of them available at 18thConnect.org for OCR correction. Some document images were too poor to run through the software, but we have loaded the resulting “dirty OCR” for 113,909 documents into the TypeWright tool at 18thConnect.org for crowd-sourced correction (http://www.18thconnect.org/typewright/documents). We were able to get an excellent contract with both ProQuest and Gale for all the documents that are loaded into TypeWright, all of EEBO and Eighteenth-Century Collections Online (ECCO): any scholar or student who corrects a document gets to keep it and do whatever they wish with it, ideally creating an online digital edition such as the one you can see here, created by an undergraduate student of Stephen Gregg’s.
Once you have corrected a document, 18thConnect will send it to you in both plain-text and TEI-encoded formats. Additionally, the full text will then be searchable in both ProQuest’s EEBO and Gale’s ECCO, and in 18thConnect.org. When you search the latter, 18thConnect gives search returns in the form of links to the texts in EEBO or ECCO, but, for those who use 18thConnect without subscriptions to those databases, we also provide information about holding libraries. Moreover, for those who DO subscribe to these catalogs, our research capacities will have been increased by working on the data we care about. Please note that these catalogs are being sold to libraries just as they are – in correcting the data, we are NOT increasing the profits of these companies, only our own research capacities (please see Mandell and Grumbach, “The Business of Digital Humanities: Capitalism and Enlightenment,” Scholarly and Research Communication 6.4).
A word about search: although all of Gale’s ECCO is searchable by word, OCR errors diminish the number of results one gets. A forthcoming article by Mandell demonstrates that the error rate in searching for bigrams (two-word phrases) is 50 to 60%; that is, one is missing over half the results one might otherwise get. In the case of EEBO, only those texts that have been typed by the Text Creation Partnership are searched by word when you search EEBO, as you can see on the EEBO search page, in the drop-down box describing what is searchable.
We sincerely hope that professors and students can work together to make sure that these untranscribed and poorly mechanically transcribed documents (the 85,200 documents so far not available to search as full text) do not become part of a “dark archive,” but can be fully searchable by future generations of scholars, both inside and outside the academy. [Note: This paragraph was slightly edited from its original version on March 11th, 2016.]
You can access the EEBO documents at 18thConnect.org, using the TypeWright tab, “Advanced Search,” or the Search Tab and selecting “TypeWright Enabled Documents”; in both cases, also select “EEBO” under “Other Collections.”
In addition to the instructions for using TypeWright available on the site itself once you begin editing a document, we have an introductory video available. We also have a few short videos available in a playlist on YouTube (and below) that introduce TypeWright features one by one, including a video about editing EEBO texts specifically, which pose their own kinds of problems.
Also, feel free to contact us with questions or concerns at technologies@18thConnect.org.