Brought to you by the Massachusetts Historical Society

"I have nothing to do here, but to take the Air, enquire for News, talk Politicks and write Letters."

John Adams to Abigail Adams, 30 June 1774

Wednesday, December 15, 2010

What's in a name tag?

We started Encoding Level 2 in late April 2010, and the process continued until November. In Level 2, the focus was on names as well as data appearing within the <title> tag. An XSLT was created to automate much of this, so that in an author or recipient tag, each instance of JA, AA, JQA, etc. was automatically converted to "adams-john1735", "adams-abigail1744", "adams-john-quincy1767", etc. For non-Adams correspondents, Thomas Jefferson was flipped to "jefferson-thomas", etc. Would that it were this consistent the whole way through the project! While this did a lot of the work, it did not account for all the variables that are inherent in a collection the size of the Adams Papers. For text within the <title> tag we added the information from scratch.
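To make the two naming rules concrete, here is a minimal sketch in Python (not the project's XSLT, which is not reproduced in this post). The lookup table holds only the three initials named above; the function name `name_to_id` and the table name `ADAMS_IDS` are made up for illustration.

```python
# Hypothetical lookup table: only the three pairs mentioned in the post.
ADAMS_IDS = {
    "JA": "adams-john1735",
    "AA": "adams-abigail1744",
    "JQA": "adams-john-quincy1767",
}


def name_to_id(name: str) -> str:
    """Return a name-authority id for the text found in an author or recipient tag."""
    name = name.strip()
    # Known Adams family initials map straight to their ids.
    if name in ADAMS_IDS:
        return ADAMS_IDS[name]
    words = name.split()
    if len(words) == 1:
        return words[0].lower()
    # Otherwise flip "First ... Last" into "last-first", lower-cased.
    return f"{words[-1]}-{words[0]}".lower()


print(name_to_id("JA"))                # adams-john1735
print(name_to_id("Thomas Jefferson"))  # jefferson-thomas
```

Anything the table does not know about simply falls through to the flip rule, which is where the inconsistencies described in this post come from.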

The XSLT looked for the text within an author or recipient tag and flipped the names around. For a recipient tag, an additional rule was created to skip the word "to", which always appears, and thus take the second and the last word within the tag. For example,
<recipient>to HA</recipient>
was converted to
<recipient><ref target="adams-henry1838" type="person">to HA</ref></recipient>.
And
<recipient>to Simeon Andinwooll</recipient>
was converted to
<recipient><ref target="andinwooll-simeon" type="person">to Simeon Andinwooll</ref></recipient>.
It should also be clear that all attribute values were automatically converted to lower case. Ultimately, this was correctly applied to the majority of records, but there were anomalies, and thus the style sheet also introduced some bad attributes. For example, the type "person" was the default, so for offices, corporations, etc. we had to manually fix the type attribute. Those individuals who went by their initials (E. W. Dodge) posed another set of issues; we were very literal in our transcription of the data on the original slip files (which in turn are faithful to the original document), so unless E. W. Dodge was defined as, for example, Eliphalet Winchester, he (assumed) is confined to the anonymity of how his (assumed) name was signed.
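The recipient rule described above (skip "to", take the next word and the last word, flip them, lower-case the result, default the type to "person") can be sketched roughly as follows. This is an illustration in Python rather than the actual stylesheet, and the function name `wrap_recipient` is an assumption. It deliberately leaves out the initials lookup, so a tag like "to HA" would not resolve to "adams-henry1838" here.

```python
import xml.etree.ElementTree as ET


def wrap_recipient(xml_text: str) -> str:
    """Wrap the text of a <recipient> element in a <ref> with a guessed target."""
    elem = ET.fromstring(xml_text)
    text = elem.text or ""
    words = text.split()
    # Skip the leading "to", which always appears in a recipient tag.
    if words and words[0].lower() == "to":
        words = words[1:]
    if not words:
        return xml_text  # nothing to build a target from
    # Take the first remaining word and the last word, flipped and lower-cased.
    if len(words) == 1:
        target = words[0].lower()
    else:
        target = f"{words[-1]}-{words[0]}".lower()
    # "person" is the default type; offices and corporations need manual fixes.
    ref = ET.SubElement(elem, "ref", {"target": target, "type": "person"})
    ref.text = text
    elem.text = None
    return ET.tostring(elem, encoding="unicode")


print(wrap_recipient("<recipient>to Simeon Andinwooll</recipient>"))
print(wrap_recipient("<recipient>to the President of Congress</recipient>"))
```

The second call yields a target of "congress-the", which is exactly the kind of bad attribute that had to be caught in review.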

Sounds easy, right? After the first few weeks we got into the groove and learned tricks, what to look for, etc. On average a reel took maybe four or five days, depending on the number of records and any significant events such as deaths, wars, and the like. As I said above, we were very literal in the process of creating name authorities, since not all of us are Adams experts. A fine example of this is T. B. Johnson. T. B. Johnson signed many of his letters as T. B. Johnson. He also signed them Thomas B. Johnson. And there were a few that were signed Thomas Baker Johnson. But he is likely not the only T. B. Johnson in the history of the world, so it is difficult to determine whether they are one and the same or different people. So currently all three variants exist in the database, which could make searching difficult and less than exhaustive. Frequently in a run of letters we were able to determine that T. B. was indeed Thomas Baker, and in cases like this we felt comfortable changing an instance of T. B. Johnson to the fuller Thomas Baker Johnson. However, if we could not conclusively determine that it was indeed Thomas Baker Johnson, we left it alone, and one of the Adams editors can make that change.
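One way variants like these could be surfaced for an editor's review is to group names by surname plus the initials of the given names. The sketch below is only an illustration of that idea, not something the project is described as having run; `initials_key` and `candidate_duplicates` are hypothetical names. It flags candidates and never merges anything, since two different people can share the same initials.

```python
from collections import defaultdict


def initials_key(name: str) -> tuple:
    """Reduce a name to (surname, given-name initials), e.g. ('johnson', 'tb')."""
    words = name.replace(".", " ").split()
    if not words:
        return ("", "")
    surname = words[-1].lower()
    initials = "".join(w[0].lower() for w in words[:-1])
    return (surname, initials)


def candidate_duplicates(names):
    """Group name variants that share a key; keep only groups with 2+ variants."""
    groups = defaultdict(set)
    for name in names:
        groups[initials_key(name)].add(name)
    return {key: sorted(v) for key, v in groups.items() if len(v) > 1}


names = ["T. B. Johnson", "Thomas B. Johnson", "Thomas Baker Johnson", "E. W. Dodge"]
print(candidate_duplicates(names))
# {('johnson', 'tb'): ['T. B. Johnson', 'Thomas B. Johnson', 'Thomas Baker Johnson']}
```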

More examples...

Letters with multiple authors and/or recipients were a little complicated, as were letters addressed generally to someone by their title/position/office.

So, after Level 1 a sample multiple author tag appeared as:

<author>JA, B. Franklin, J. Jay, H. Laurens, and T. Jefferson.</author>


After we ran the XSLT and did some housecleaning, it was transformed to look like this:

<author> <ref target="adams-john1735" type="person">JA</ref>, <ref type="person" target="franklin-benjamin">B. Franklin</ref>, <ref type="person" target="jay-john">J. Jay</ref>, <ref type="person" target="laurens-henry">H. Laurens</ref>, and <ref type="person" target="jefferson-thomas">T. Jefferson</ref>.</author>
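The first step in handling a tag like this is pulling the individual names out of the list so that each one can get its own <ref>. A rough sketch of that splitting step, in Python and with the hypothetical function name `split_names`, might look like the following. It does not attempt the housecleaning: expanding "B. Franklin" to "franklin-benjamin" still takes a person who knows who B. Franklin is.

```python
import re


def split_names(author_text: str) -> list:
    """Split 'A, B, and C.' into ['A', 'B', 'C']."""
    text = author_text.strip()
    # Drop the sentence-final period, but keep periods inside initials.
    if text.endswith("."):
        text = text[:-1]
    # Split on commas (optionally followed by "and") or on a bare "and".
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", text)
    return [p.strip() for p in parts if p.strip()]


print(split_names("JA, B. Franklin, J. Jay, H. Laurens, and T. Jefferson."))
# ['JA', 'B. Franklin', 'J. Jay', 'H. Laurens', 'T. Jefferson']
```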


Where a letter was addressed to someone by their title/office etc., after Level 1, a sample recipient tag looked this way:

<recipient>to the President of Congress</recipient>


After the XSLT was run, it looked like this:

<recipient><ref target="congress-the" type="person">to the President of Congress</ref></recipient>


You can see that the transformation skipped the word "to" and then took the next word and the last word. Once we reviewed the records and conducted a little research, it became this:

<recipient><ref target="huntington-samuel" type="person">to the President of Congress</ref></recipient>.


Simple beauty!

As of right now, the names database contains 19,454 names. We have yet to systematically clean up possible duplicates like the Johnson example above, or instances where names were spelled differently in America than in, say, the Netherlands. Fortunately these are the exception and not the rule, so the process should go smoothly. A lot more needs to be said about Encoding Level 2, and I'm sure I didn't touch on many aspects of our process. But hopefully this post gives a little flavor of the goings-on at the Massachusetts Historical Society during the spring, summer, and fall of 2010.

Tuesday, December 14, 2010

We've been busy!


The project manager was away for a while... but the TEAM has been busy. More updates very soon!