Brought to you by the Massachusetts Historical Society

"I have nothing to do here, but to take the Air, enquire for News, talk Politicks and write Letters."

John Adams to Abigail Adams, 30 June 1774

Wednesday, August 12, 2009

Phase 2 Timeline

Our project to digitize the Adams Papers Control File began in January 2009. We originally planned to spend a few short months on proofreading before moving into encoding. However, proofreading 109,348 slips, one by one, has taken a little longer than we anticipated. This phase of the work is vitally important, though, and we have continued doggedly pursuing our final reels. We have found important corrections and updates and have begun entering those changes into the XML files. The input of corrections has been folded into the first phase of encoding and so far is going smoothly.

The first seven months of the project were also devoted to schema development (see Master Encoding Guide). This summer we secured the services of an excellent XSL consultant to write an XSL transformation that converts our abbreviated vendor schema into the full schema and populates much of the consistent data automatically. The XSLTs have been very helpful, and we hope to build on them to automatically generate other data as we work through the initial encoding.

Thus our schedule for 2009:
  • January-August: proofreading (project manager, proofreader, EAD coordinator)
  • March-June: schema development (project manager and web developer)
  • July-August: XSL development and contract work (project manager, web developer, and consultant)
  • August-December: encoding level 1 (project manager, encoder, EAD coordinator)
  • September-December: XSL development (project manager and web developer)

Master Encoding Guide: Record Basics

<record>
The control file contains 109,348 color-coded slips. While each slip refers to a document, the slip is not exclusive. One document may have several slips representing it. For example, one letter written from John to Abigail may exist as a letterbook copy (white), a recipient’s copy held in the Adams Family Papers archive (pink), a contemporary copy held by the National Archives (yellow), and a printed version from the Boston Chronicle (blue). This same letter may have copies in a dozen institutions, and thus garner a dozen separate yellow accession slips. The overall database, then, is a database of the slips, not of the documents. Because of this data model, all text on the slip is retained as it is written and controlled vocabulary should appear in the attributes. Thus, a recipient that reads "to JA, with 1 enclosure (copy of Lovell to Dana, 6. Jan. 1781)" will appear as it is written but the attributes will be controlled for searching and sorting purposes.

<recipient>to <person ref="JA">JA</person>, with <enclosure ref="030001">1 enclosure (copy of <person ref="lovelljames">Lovell</person> to <person ref="danafrancis">Dana</person>, 6. Jan. 1781)</enclosure></recipient>
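
As a rough illustration of this data model (not part of the project's own tooling), the following Python sketch parses a well-formed version of the recipient example and shows that the slip text can be recovered exactly as written while the controlled values are read from the attributes. The variable names are ours, chosen only for this example.

```python
import xml.etree.ElementTree as ET

# A well-formed version of the recipient example from this guide.
SLIP = (
    '<recipient>to <person ref="JA">JA</person>, with '
    '<enclosure ref="030001">1 enclosure (copy of '
    '<person ref="lovelljames">Lovell</person> to '
    '<person ref="danafrancis">Dana</person>, 6. Jan. 1781)'
    '</enclosure></recipient>'
)

recipient = ET.fromstring(SLIP)

# Ignoring the tags gives back the slip text exactly as it was transcribed.
slip_text = "".join(recipient.itertext())

# The controlled values for searching and sorting live only in the attributes.
person_refs = [person.get("ref") for person in recipient.iter("person")]

print(slip_text)
print(person_refs)
```

The same pattern should apply to any element in the guide: the text nodes carry the transcription and the attributes carry the controlled vocabulary.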

Required Attributes

@id. Each slip is assigned a unique ID number. It is an attribute of the Record element and is required. To learn how the number is constructed and assigned, see Record ID.

@color. The color of the slip conveys a great deal of information in a quick, easy way, so this information will be retained. The vendor was unable to record it during data entry because they were working from black and white microfilm. The color assignments are made in the first phase of work, proofreading, and will be input during the second phase, tag refinement. There are five color choices (1pink, 2white, 3yellow, 4blue, 5goldenrod) to choose from; records will not validate without a color selected.
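
A minimal sketch of the color check, assuming the five values listed above are the complete set. The function name check_color is hypothetical, and in the project itself this validation is done by the schema rather than by a script.

```python
# Allowed @color values as listed in this guide.
ALLOWED_COLORS = ("1pink", "2white", "3yellow", "4blue", "5goldenrod")


def check_color(value):
    """Return True when value is one of the five allowed @color values."""
    return value in ALLOWED_COLORS


print(check_color("3yellow"))  # an allowed value
print(check_color("yellow"))   # rejected: the numeric prefix is missing

# A plain string sort of the values follows the numeric prefixes.
print(sorted(["4blue", "1pink", "3yellow", "2white"]))
```

One apparent benefit of the numeric prefix, though the guide does not state it as the reason, is that an ordinary string sort puts the colors in the same pink, white, yellow, blue order used for filing the physical slips.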

Optional Attributes

@z. In the case of redating a letter or correcting authors or recipients, slips are struck through with a large Z and a reference to the newly created slip. This allows for proper tracking of a letter that may have previously been published under the incorrect information. The original slip is marked with the @z (with a value of "z") and the new cross-reference is tagged with a <zref> note tag (see Notes below). [Occasionally, editors have underlined a date and drawn a line to a note on the slip. These are not cancellations but rather cross-references.]

@r. The review tag was introduced so that the vendor could flag a record that confused them. Almost all of these questions have been resolved at the proofreading phase; however, the attribute will remain an option if needed throughout the project.

@language. The large majority of documents described in the control file are written in English; however, there is a need to track the number of foreign language documents. When possible, the @language will include any language other than English. The codes will mirror those used in MARC21.

Within the structure of the schema, each record contains <ref name="record-contents"/>. This is a cross-reference in the <record> element, and the contents are detailed under <define name="record-contents">.

Master Encoding Guide: Value-Added Information

Value-Added Information

The value-added information will be encoded in Encoding Level 2. This will add links to the reel numbers of the Adams Papers Microfilm and unique URLs to the Adams Papers Digital Edition and the Adams Electronic Archives. The details will be determined at a later point.

Master Encoding Guide: Notes

Record Content: Notes

Many records have a line or several lines of text on the bottom half of the slip that do not conveniently fit into any other category. The vendor was asked to transcribe all of this text and code it with the <n> tag. If there was a logical paragraph break between one note and another, each note was tagged separately. There is no limit to the number of <n> tags in a record. In the full schema, the <n> tag will be parsed to control for different kinds of notes. This section may be expanded as we move through Encoding Level 1, but thus far we have identified ten distinct types of notes: cross-reference, see-reference, content, collection, auction, quote, z-reference, enclosure, subject, and internal. Examples will be shown as we work through encoding.

Master Encoding Guide: Printed

Record Content: Printed

Many records will have a line or several lines of text following the word “Printed.” All of this text was coded by the vendor with the <pr> tag. If there was a logical paragraph break between one printed citation and another, each citation was tagged separately. There is no limit to the number of <pr> tags in a record. In the full schema, the <pr> tag will be parsed to allow for tracking of bibliographic information, including titles and authors. This level of encoding is reserved for Level 2.

Example:
<printed>Printed: <ref target="AFC2" href="linktoAbigail">AFC vol. 2</ref>:345</printed>

<printed>Printed: <ref target="BcolCent">Boston Columbian Centinel</ref>, 18 August 1785</printed>

The @target and the @href will link to a separate database of the Adams Papers short titles and to the bibliographic record in ABIGAIL, the MHS online catalog. This section of the full schema may also change as we learn more about the capabilities of the full database.

Master Encoding Guide: Title

Record Content: Title Element

If there is no clearly defined "author to recipient" statement, the second line under the date should be tagged as a title, <ti>. To conform to the RelaxNG schema, every record must contain an author, a title, or both.

<title>
As with the author and recipient elements, a title element may include child elements for <person>, <corporate>, and <office>, all with target attributes to separate directories. As with the recipient element, a title element may also include optional child elements for <enclosure>, <content>, and <ref> elements. The <ref> element includes a @target.

Example:

<title> Letter of credence from Congress <person target="huntingtonsamuel">(Samuel Huntington</person>, <office target="presidentofcon">President)</office> to the <office target="stadtholder">Stadtholder of the Netherlands.</office></title>

<legaltitle><docket>
A child element unique to <title> is the <legaltitle> element. Within <legaltitle> there is another child element, <docket>.

Example:

<title> <legaltitle>Minutes on court cases: <docket>J[oshua] Green, Admr. v. G[eorge] Green</docket> <docket>Malcom v. Mackay</docket> </legaltitle> </title>

Master Encoding Guide: Recipient

Record Content: Recipient

<recipient>
Most records consist of an author and a recipient. The author is loosely defined as any text before the first instance of the word "to." The recipient, then, is any text following the first instance of the word "to." The vendor was instructed to include all text following "to" in the recipient tag. Frequently, the text following a recipient’s name includes statements about content, enclosures or references to other letters. During Encoding Level 2, these other elements will be separately tagged. The recipient element includes the same child elements as the author element, that is, <person>, <corporate>, and <office>. The same guidelines and rules apply to these child elements as in the author element (see the Author section of this guide). In addition, three child elements may also be used in <recipient>.

<enclosure>
A letter often includes enclosures. The text of the record will expressly state that an "enclosure" is included. These may also contain a target reference to another letter and there may be multiple references. These are tagged with <ref @target>.

<content>
Similar to an enclosure statement, the recipient text block may also include a note regarding content. This is a child element of the recipient element.

<ref>
There may also appear miscellaneous references to other documents or persons that can be tagged with a <ref> tag with a target to a separate directory or separate document.

Example:

<recipient>to <person target="washingtongeorge">George Washington</person>: <content>Opinion re salaries of American diplomats,</content> with several <enclosure>enclosures (in which are found, on p. 9-10, copy of <ref target="000000">JA to John Jay, 13 May 1785,</ref> &amp; on p. 10, copy of <ref target="00000">John Jay to JA, 3 Aug. 1785)</ref></enclosure></recipient>

Master Encoding Guide: Author

Record Content: Author Element

<author>
Every record has an author name (or initials) on the second line under the date. The vendor was instructed to treat all text that appears before the first instance of the word "to" as the author and to code it under a single author <a> tag. For most records, there will be one name (or initials) before the first instance of the word "to." In the full schema, there are three optional child elements to encode within the parent author tag: person, corporate, and/or office. These tags can be used together or separately. See below for examples.

<person>
Refers simply to a single individual. This element includes a target reference to a separate database of persons with birth and death dates. During encoding (Level 2), encoders will assign a target reference pulled from a list of the most commonly found names. In addition, any names not found on the list will be assigned a target reference. Construction of the target reference should be all lower case, "lastfirstmiddletitle." This will allow for cleaner sorting in the names database.
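
To make the "lastfirstmiddletitle" construction concrete, here is a small Python sketch. The function name make_target is hypothetical, and dropping spaces and punctuation is our assumption, since the guide only specifies lower case and the order of the name parts.

```python
import re


def make_target(last, first="", middle="", title=""):
    """Build a 'lastfirstmiddletitle' target reference in lower case."""
    joined = f"{last}{first}{middle}{title}".lower()
    # Assumption: spaces, periods, and other punctuation are dropped.
    # Accented letters would also be dropped by this simple pattern.
    return re.sub(r"[^a-z0-9]", "", joined)


print(make_target("Huntington", "Samuel"))
print(make_target("Washington", "George"))
```

These calls produce the same values used in the examples in this guide, "huntingtonsamuel" and "washingtongeorge".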

<corporate>
The corporate tag borrows from the standard cataloging rules for a corporate author, that is, a group of individuals such as the U.S. Congress.

<office>
The office tag is used to distinguish an individual author from someone writing in the capacity of their office, such as the President of Congress, writing official documentation in his capacity as the President of Congress, but not necessarily as his own person. For example, John Adams may write to his colleague, Samuel Huntington, or he may write to the President of Congress who happens to be Samuel Huntington for several months.

Example:
<author><person target="huntingtonsamuel">Samuel Huntington,</person><office target="presidentofcon"> President of Congress,</office></author>

<author><corporate target="houseofrep">House of Representatives</corporate></author>

Master Encoding Guide: Series

Record Content: Series

Some records will have a handwritten code on the bottom right corner consisting of a Roman numeral, either “II” or “III”, followed by an Arabic number. Only the Roman numeral should be transcribed, and it should be tagged with <series>. The encoder need only verify that this code is valid. If not, check with the project manager for a possible slip error.

Example:

<series>II</series>

<series>III</series>

Master Encoding Guide: Copy

Record Content: Copy

Many records will include a format designation, usually on the fourth line, following the page number. The most common text will be: MS, Xpr, EnlPr, Xerox, microfilm, Photostat. These abbreviations and any text following should be coded with the <copy> tag and the information will be controlled using @format. There are only four @formats to choose from: Photocopy, Manuscript, Microfilm, Digital Image.

Example:

<copy format="Photocopy">Xerox print</copy>

<copy format="Manuscript">MS</copy>

<copy format="Microfilm">microfilm</copy>

<copy format="Digital Image">Electronic archive scan</copy>
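
A hedged sketch of how the slip text might be mapped to a controlled @format value. It covers only the pairings shown in the examples above. The other abbreviations (Xpr, EnlPr, Photostat) are deliberately left unmapped rather than guessed, and the names KNOWN and suggest_format are ours.

```python
# The four controlled @format values named in this guide.
ALLOWED_FORMATS = {"Photocopy", "Manuscript", "Microfilm", "Digital Image"}

# Only pairings that appear in the examples above.
KNOWN = {"xerox": "Photocopy", "ms": "Manuscript", "microfilm": "Microfilm"}


def suggest_format(slip_text):
    """Suggest an @format from the first word of the slip text, or None."""
    words = slip_text.split()
    if not words:
        return None
    return KNOWN.get(words[0].lower())


print(suggest_format("Xerox print"))
print(suggest_format("MS"))
print(suggest_format("EnlPr"))  # not mapped here, so an encoder must decide
```

Returning None for anything unrecognized keeps the decision with the encoder instead of silently assigning a format.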

Master Encoding Guide: Length

Record Content: Length

Many records will include the document length, generally on the third line, below the author and recipient line. The convention most records will follow is an Arabic number and "p." for "pages." All text on this line should be coded as <length pages="3">. If there are multiple page numbers listed to signify a letter and an enclosure, the pages are added up and entered as one value.

For example:

<length pages="4">2 p., 2 p.</length>
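
The addition can be sketched in a few lines of Python. This assumes every page count on the line is written as a number followed by "p.", which is the convention described above; the function name total_pages is hypothetical.

```python
import re


def total_pages(length_text):
    """Sum every number that is followed by 'p.' in the length line."""
    counts = re.findall(r"(\d+)\s*p\.", length_text)
    return sum(int(n) for n in counts)


print(total_pages("2 p., 2 p."))
print(total_pages("3 p."))
```

For the example above, the two counts of 2 add up to the pages="4" value shown in the <length> tag.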

Master Encoding Guide: Code



Record Content: Code

About half of all records include a code found on the upper right corner of the slip. The codes are used to track individual accession documents, letterbook documents, or diary and miscellany documents. The codes are a combination of letters and numbers. The vendor was asked to encode all of this text with <c> and we later ran an XSL transformation to parse out the distinct pieces of information. This will allow us to track the documents by their accession and letterbook numbers as well as to separately track the institutional codes for accessioned documents.

There are four types of codes and each one is also associated with slip color: letterbook=white, accession=yellow, diary=pink, and miscellany=pink.

Letterbook codes all begin with "Lb" followed by a number of up to five digits. These slips are white (<record color="2white">). They have an optional @author as well, but this is only rarely applicable:
Example:

<code type="letterbook" number="1234">Lb1234</code>

<code type="letterbook" number="21" author="JQA">JQA/Lb/21 [end]</code>

Accession codes are made up of an institution code (where the original is housed and from whom we received a copy) and up to a six-digit number (the document tracking number). These slips are almost always yellow (<record color="3yellow">). Occasionally, an accession code is found on a blue slip. This is considered an editor error in the creation of the slip and will be reconciled later. The code should be rendered the same.
Example:

<code type="accession" repository="DNA" number="2589">DNA:2589</code>

<code type="accession" privateowner="MBSmith" number="2589">MBSmith:2589</code>

Miscellany codes are essentially shelf marks for their physical location, constructed with M for Miscellany, initials (usually JA or JQA) for the author, and a number for the volume.
Example:

<code type="miscellany" author="JA" number="78">M/JA/78</code>

Diary codes are also essentially shelf marks for their physical location, constructed with D for Diary, initials (usually JA or JQA) for the author, and a number for the volume.
Example:

<code type="diary" author="JQA" number="12">D/JQA/12</code>

General codes. Some slips, usually blue, have a code associated only with an institution but do not have a number attached because we do not have a physical copy to accession and track.
Example:

<code type="general">NN</code>
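
The project parsed these codes with an XSL transformation that is not reproduced here. Purely as an illustration, the Python sketch below shows one way the patterns described above could be recognized. The regular expressions are our own guesses based on the examples in this section, and the sketch cannot tell a private owner from a repository, so every accession prefix is reported as "repository".

```python
import re

# Patterns inferred from the examples in this section (assumptions, not
# the project's actual rules). They are tried in order.
PATTERNS = [
    ("letterbook", re.compile(r"^(?:(?P<author>[A-Z]+)/)?Lb/?(?P<number>\d{1,5})\b")),
    ("diary", re.compile(r"^D/(?P<author>[A-Z]+)/(?P<number>\d+)\b")),
    ("miscellany", re.compile(r"^M/(?P<author>[A-Z]+)/(?P<number>\d+)\b")),
    ("accession", re.compile(r"^(?P<repository>[A-Za-z]+):(?P<number>\d{1,6})\b")),
    ("general", re.compile(r"^[A-Za-z]+$")),
]


def parse_code(text):
    """Return a dict of suggested <code> attributes, or None if unrecognized."""
    text = text.strip()
    for code_type, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            attrs = {"type": code_type}
            # Keep only the named groups that actually matched.
            attrs.update({k: v for k, v in match.groupdict().items() if v})
            return attrs
    return None


for sample in ["Lb1234", "JQA/Lb/21 [end]", "DNA:2589", "M/JA/78", "D/JQA/12", "NN"]:
    print(sample, parse_code(sample))
```

Anything that returns None would need to be reviewed by hand, which fits the proofreading-first approach described elsewhere on this blog.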

Master Encoding Guide: Place

Record Content: Place

If a place is included on the slip, it will follow the date on the top line. Place names should be transcribed as they appear, either abbreviated or spelled out, and tagged with the <place> element. Place names are controlled through two attributes: location and axis.

Required Attributes

@location.
If a place name appears on the slip, it is usually rendered exactly as it appears in the document, so "The Hague" might be written in the French, "La Haye," if it appeared that way on the original manuscript. The value of the @location should have the English spelling of the city only. Any other place information, for example a street address, will be found in the text of the record. If two city locations are included, both cities will be listed as the value of @location, separated with one white space. When there is a question about the proper English spelling of a city, consult the Getty Thesaurus of Geographic Names.

Example: <place location="The Hague">La haye</place>
<place location="Philadelphia">Philadelphia, 32 South Street</place>
<place location="Braintree Quincy">Braintree and Quincy</place>

Most of the city names have been automatically populated using XSL transformation. Encoders need only check that the values are English city names.

Optional Attributes

@axis.
There is an optional attribute to include the axis point for a given location. This information is provided in the Getty Thesaurus. The rules for encoding the longitude and latitude axis points will be determined during encoding level 2.

Still Proofreading... but Encoding Begins!

It has been a busy couple of months since the last post. While continuing with the proofreading, which has taken much longer than originally estimated, we have developed a full schema (also in RelaxNG) for encoding the data and have hired a consultant to develop some nifty XSL transformations to move the data from our short vendor schema to the full schema. Following this post I will begin uploading the master encoding guide that provides a detailed narrative of each element and examples of the mark-up.

Master Encoding Guide: Dates


Record Content: Dates

<date> The date element is the most important piece of information on each slip. The physical card file, as well as the physical archive collection and the microfilm, is arranged chronologically. It is the first access point to the documents. It is also the most complicated, due to the variety of date formats and the rules regarding sorting. The date section of the full schema is derived from the unique dating structure of the control file, as it was originally organized and inventoried. As noted above, all text in the date field is retained as is and the attributes control the data for sorting and searching.

Required Attributes
The <date> requires a choice of one of the following attributes:

@when.
The most common type of date found in the control file is an exact date with a known day, month, and year. These are controlled in the @when with a date parameter of year-month-day (YYYY-MM-DD). If either the month or day is uncertain, encode the unknown part of the @when with "99", e.g. YYYY-MM-99 or YYYY-99-99. This will allow for proper sorting: complete dates come before incomplete dates.

Example: <date when="1771-12-21">21 Dec. 1771.</date>
<date when="1776-07-99">July 1776</date>
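
The effect of the "99" placeholder on sorting can be seen with a plain string sort. The Python sketch below also includes a simple format check; the pattern and the name is_valid_when are our own and are not taken from the project schema.

```python
import re

# YYYY-MM-DD, where an unknown month or day is written as 99.
WHEN_PATTERN = re.compile(r"^\d{4}-(0[1-9]|1[0-2]|99)-(0[1-9]|[12]\d|3[01]|99)$")


def is_valid_when(value):
    """Return True if value looks like a @when date as described above."""
    return bool(WHEN_PATTERN.match(value))


whens = ["1776-07-99", "1776-07-04", "1776-99-99", "1771-12-21"]

# Because 99 is larger than any real month or day, an ordinary string
# sort places complete dates ahead of incomplete ones.
print(sorted(whens))
print(is_valid_when("1776-07-99"), is_valid_when("1776-13-01"))
```

In the sorted list, 4 July 1776 comes before "July 1776", which in turn comes before the bare year 1776.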

@when and @ante.
Any date on a document known to have been written before a certain date, marked on the slip with the term "ante." These are also controlled with the standard date parameter year-month-day.
Example: <date when="1790-07-11" ante="ante">ante 11 July 1790</date>

@when and @post.
Any date on a document known to have been written after a certain date, marked on the slip with the term "post."
Example: <date when="1790-07-11" post="post">post 11 July 1790</date>

@when and @to.
Any range of dates with a known beginning and end. These elements must be used as a group.
Example: <date when="1790-07-11" to="1790-08-31">11 July–31 Aug. 1790</date>

Optional Attributes
The following attributes may be added to any of the required attributes above:

@circa.
Any of the above four types of dates (when, when/ante, when/post, and when/to) that are preceded by a "ca." or "circa" should also include the @circa. The value is "yes" and @circa can be combined with any other date attributes.
Example: <date when="1745-10-21" circa="yes">ca. 21 Oct. 1745</date>

@conjectural.
Any date that appears in brackets is a conjectural date. The date may be conjectural for a variety of reasons (illegible, not present on the letter, supplied by context), but the key feature is the use of brackets. The value is "yes" and @conjectural may be combined with any other date attributes.
Example: <date when="1772-12-02" conjectural="yes">2 Dec. [1772]</date>

@noDate.
Any date field that includes the phrase "n.d." should have the @noDate. This may appear alone, but more often accompanies a supplied date in brackets. The @noDate may be combined with any other date attribute.
Example: <date noDate="yes" when="1773-06-16" post="post" conjectural="yes">n.d. [post 16 June 1773?]</date>

@rank.
The rank will be added using a separate XSL transformation. The ranking will be determined based on the number of attributes to allow for proper sorting.

A note on sorting.
The sorting rules for the database will follow the filing rules used in the physical paper file. Not only are the editors already familiar with these rules, they are the most logical method for the variety of dates—certain and uncertain, complete and incomplete—found in the control file. The original directive adequately described the sorting:

The file contains all ribbon copies of all four types of control slips. In cases where several slips represent different versions of the same document, the order of filing is as follows:

Pink (CDF), White (Letterbook), Yellow (Accession), Blue (Printed or Manuscript Lead)

Arrangement of slips in this file follows strict chronological order throughout the entire span of our documents, 1639-1889. Dates which contain all three elements (day, month, and year) precede those which are incomplete. Inclusive dates are filed under the earlier of the dates, but following all other slips for that day. (An exception to this rule is the filing of accounts with inclusive dates, which go under the later of the inclusive dates, but preceding all other slips for that day.) An example may be taken from the end of a theoretical year:
Ante 15 Dec. 1800
15 Dec. 1800
15-19 Dec. 1800
31 Dec. 1800
31 Dec. 1800-21 Jan. 1801
31 Dec. 1800-18 May 1801
Dec. 1800
1800
Ca. 1800
[1800?]
1800-1809
[post 1800]
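
To check our reading of these filing rules, here is a Python sketch that sorts the twelve example slips using the date attributes described in this guide. The rank values are our own guess at an ordering that reproduces the example. They are not the project's actual @rank, which will be generated by XSL. The @when values assigned to each label are also our own encoding, and the exception for accounts with inclusive dates is not handled.

```python
def sort_key(date_attrs):
    """Sort key for a dict of <date> attributes (hypothetical ranking)."""
    if "ante" in date_attrs:
        rank = 0
    elif "post" in date_attrs:
        rank = 5
    elif "to" in date_attrs:
        rank = 4
    elif "conjectural" in date_attrs:
        rank = 3
    elif "circa" in date_attrs:
        rank = 2
    else:
        rank = 1
    # Ties between ranges on the same start date are broken by the end date.
    return (date_attrs["when"], rank, date_attrs.get("to", ""))


# The example slips, in the filing order given in the directive.
SLIPS = [
    ("Ante 15 Dec. 1800", {"when": "1800-12-15", "ante": "ante"}),
    ("15 Dec. 1800", {"when": "1800-12-15"}),
    ("15-19 Dec. 1800", {"when": "1800-12-15", "to": "1800-12-19"}),
    ("31 Dec. 1800", {"when": "1800-12-31"}),
    ("31 Dec. 1800-21 Jan. 1801", {"when": "1800-12-31", "to": "1801-01-21"}),
    ("31 Dec. 1800-18 May 1801", {"when": "1800-12-31", "to": "1801-05-18"}),
    ("Dec. 1800", {"when": "1800-12-99"}),
    ("1800", {"when": "1800-99-99"}),
    ("Ca. 1800", {"when": "1800-99-99", "circa": "yes"}),
    ("[1800?]", {"when": "1800-99-99", "conjectural": "yes"}),
    ("1800-1809", {"when": "1800-99-99", "to": "1809-99-99"}),
    ("[post 1800]", {"when": "1800-99-99", "post": "post", "conjectural": "yes"}),
]

# Start from the reverse order and let the key restore the filing order.
shuffled = list(reversed(SLIPS))
ordered = sorted(shuffled, key=lambda slip: sort_key(slip[1]))

for label, _ in ordered:
    print(label)
```

Sorting first on @when, then on rank, then on @to is enough to reproduce the example list, which suggests the "99" convention and a small rank value can carry most of the filing rules.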