The most recent task assigned to me as we prepare to launch the Adams Papers Digital catalog was to review instances where names we assigned as attributes (e.g. adams-john1735 for John Adams, a.k.a. JA) appeared in the XML code but not in the People database. This is part one of two, the second part being instances where people appear in the database but do not appear in the attributes in the XML.
This is another one of those combinations of human and computer errors, some of which were unavoidable (computer) and some of which happened because of how mundane some of the work was (human).
The attributes that needed seeing to were 1,200 strong and, in printed form, stretched to 110 pages. (Before you think we're total idiots, this represents approximately 5-6% of the database, which means we got 94-95% right. On a ten-point grading scale that is a solid A! Sheepishly, we realize mistakes happen, but anticipating criticism of the work we did, we wanted to try to spin it positively.)
We ran two queries to produce the data. The first query produced the faulty attributes and the ID number of the slip in which each attribute appeared. For the second run, we included within double quotes the name associated with the faulty attribute. It gave a little insight into what we could expect before searching the database and editing the code.
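The first query amounts to a set difference between the attributes used in the XML and those in the People database. Here is a minimal sketch in Python, assuming the XML attributes have already been pulled out as (attribute, slip ID) pairs and the People database as a set of attribute strings. The function name, the data layout, and the slip ID 000001 are illustrative assumptions, not the project's actual query:

```python
from collections import defaultdict


def find_missing_attributes(xml_refs, people_attributes):
    """Return {attribute: [slip IDs]} for every attribute that is used
    in the XML but has no matching entry in the People database."""
    missing = defaultdict(list)
    for attribute, slip_id in xml_refs:
        if attribute not in people_attributes:
            missing[attribute].append(slip_id)
    return dict(missing)


# Hypothetical sample data modeled on the examples in this post.
xml_refs = [
    ("adams-john1735", "000001"),
    ("henry-laurens", "031911"),
    ("palfrey-john", "340735"),
    ("palfrey-john", "341083"),
]
people_attributes = {"adams-john1735", "laurens-henry", "palfrey-john-gorham"}

report = find_missing_attributes(xml_refs, people_attributes)
for attribute, slip_ids in sorted(report.items()):
    # One line per faulty attribute, followed by the slips that contain it.
    print(attribute, " | ".join(slip_ids))
```

Swapping the two inputs (database attributes that never occur in the XML) would give the second half of the clean-up.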
Some typical instances looked like this:
a. [extract]-monroe] 160017 - "to Sec. of State [James Monroe] [extract]" | 160043 - "to Sec. of State [James Monroe] [extract]"
b. bourne-sylvanus? 072781 - "to Mr. Sylvanus? Bourne"
c. henry-laurens 031911 - "to Henry Laurens"
d. nicolay-albery-h 341282 - "to Albert H Nicolay"
e. palfrey-john 340735 - "John G. Palfrey" | 341083 - "John G. Palfrey"
A number of consistent error patterns became apparent quickly when cleaning up the data. In some instances the attribute was valid but had not been entered into the database. While frustrating, this was an easy fix and was most likely the product of human error. A second cause was a typographical error in the XML attribute that was not present in the attribute for that person in the People database, or the reverse, where there was a typo in the People database but the XML attribute was actually correct. Typographical errors include transpositions of letters (examples c and d). A third possibility was that in the transformation process which took place in Level 1, the attribute was inaccurately reviewed (examples a and b above). As those examples should make evident, punctuation and other marks like [ ) ? . were not allowed in the attribute. Finally, we saw cases where attributes in the XML were not as complete as in the database, and the opposite (example e).
The fixes for the above examples should be relatively easy to make yourselves:
a. monroe-james (remove [], add james)
b. bourne-sylvanus (remove ?)
c. laurens-henry (flip first and last names)
d. nicolay-albert-h (typo in albert corrected)
e. palfrey-john-gorham (added middle name)
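The rule that punctuation and other marks are not allowed in an attribute is one that a computer can check. Below is a minimal sketch in Python. The pattern is an assumption inferred from the attributes quoted in these posts (lowercase words joined by hyphens, optionally ending in digits such as a birth year), not a documented project rule:

```python
import re

# Assumed shape of an attribute: lowercase words joined by hyphens,
# optionally followed by digits (e.g. a birth year or a disambiguator).
VALID_ATTRIBUTE = re.compile(r"[a-z]+(?:-[a-z]+)*[0-9]*")


def is_well_formed(attribute):
    """True if the attribute contains only the characters assumed to be allowed."""
    return VALID_ATTRIBUTE.fullmatch(attribute) is not None


for candidate in ["[extract]-monroe]", "bourne-sylvanus?", "henry-laurens",
                  "nicolay-albery-h", "adams-john1735"]:
    print(candidate, is_well_formed(candidate))
```

A check like this flags only examples a and b. Examples c, d, and e are well formed but wrong, so they still need a person to compare the slip against the database.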
In addition to checking the slip, we also had to check the database for each name (unless it was a prominent figure). The back-end view for the Adams Papers editors (and us) allows for tabbed browsing of the digital control file. I have not really seen the public interface but imagine this might be a similar feature.
We did not keep count of the number of names added, but there were quite a few. In the process we were able to clean up a number of bad attributes that were in the database and in some cases merge or separate people based on a close inspection of names, dates, etc. For example, in correcting some of the slips for William Cunningham Jr., I discovered that his attribute should be "cunningham-william0", whereas the majority of them were just "cunningham-william", which is the attribute for his father, William Sr. These have all been fixed now so that the users of the digital control file will get the results they expect when searching for these people.
We will re-run the query in a few days to ensure that every instance was seen to accurately. Let's pretend this was the case, for if any were missed, I won't tell you about it!
The other side to this clean-up, whereby we'll run a query to determine in the database where attributes exist that do not appear in the XML, is more straightforward...those attributes will be deleted from the database.
Beta testing of the public interface has also been going on. Susan and I were invited to sit in on a meeting yesterday about how that testing went, what some of the feedback was, etc. So I'll post a bit about the public interface next time. The project comes to a close on 30 June 2011, so blogging might slow down - if it doesn't stop outright - after that date. I don't want to turn this into a Brokeback Mountain "I can't quit you" kind of moment, but the reality is that once the funds stop, so does the blog!
Showing posts with label Names. Show all posts
Tuesday, June 21, 2011
Friday, March 11, 2011
Document Types & Names
At the moment, we are working on entering "Document Types" into the database. This is a hybrid part of the project where we are working both with the slips and interface. See images below. We have completed JA's document types and are working now on JQA. I skipped AA as I couldn't find her slips.


At the same time, the good people in the Adams Papers are doing preliminary work on cleaning up the names database. This involved printing out the entire list of names and looking particularly at the Adamses, Smiths, etc. for duplicates, and then seeing which can be merged or which can be better identified. Of course, our by now classic example is Thomas Baker Johnson, who had at least three entries (johnson-t-b; johnson-thomas-b; and johnson-thomas-baker). They have all been fixed, in that the attributes are all "johnson-thomas-baker" now. For now, "Johnson, T. B." and "Johnson, Thomas B." still appear in the drop-down list of names when searching slips, as these forms of his name do appear on the physical slips. However, anyone who selects these options will get 0 results. Perhaps it will be worth it to remove them altogether?
Monday, January 10, 2011
Supporting Databases, Part 1
This sounds like an Oscar category...And the nominees for Best Supporting Database in a Digital Conversion Project are: Accessions. Institutions. People. Places.
The supporting databases in the project allow us to regularize the way in which we store and retrieve information. At the present time, there are four supporting databases: Accessions, Institutions, People, Places. There are additional supporting documents that we created and used, such as the Microfilm Conversion Chart. Fellow Adams Slip File encoder and blogger Susan Martin worked with the Accessions and Institutions databases as well as the Microfilm Conversion Chart and MHS Collection Codes, so she will write on them.
As mentioned in the post on 15 December 2010, at that time the People database contained 19,454 names. This number will fluctuate a bit as digital control file staff and Adams Papers editors identify duplicate entries and/or clarify & identify more fully those records for which staff have more information. Occasionally also we find names skipped during encoding level 2; this generally was the result of the density or complexity of a record.
The Places database was the first to be built and populated during Level 1 Encoding. In Level 2, while it was not a focus, we took the opportunity to review attributes and perform basic data clean-up where necessary. The Places database contains 3,090 records: from Abbeville to Zwolle.
The fields we populated in Level 1 in the Places database are location, city, state, country, and notes. The location field is the controlled form of the entry - the attribute. Generally the first time a city appeared it received a one word attribute: "quincy", "tallahassee", and "athol" for example. However, once the country expanded, we were left with the task of differentiating between places with the same name in different states and/or countries. A good example is Burlington. We have eight different records for Burlington: "burlington", "burlington-county", "burlington-ia", "burlington-ma", "burlington-me", "burlington-nj", "burlington-ny", and "burlington-vt". We assigned the fullest known attribute to distinguish one from the other. However, sometimes the address listed simply says Burlington. In these instances it was not always possible to determine if it was the Burlington in Massachusetts or some other state.
This is a long way of saying we did the best we could with the information we had. As with the People database, the Adams Papers editors can use their expertise to help solidly define and identify a place if needed.
Labels:
About the project,
Database,
Encoding Level 1,
Encoding Level 2,
Names,
Place
Wednesday, December 15, 2010
What's in a name tag?
We started Encoding Level 2 in late April 2010 and this process continued until November. In Level 2, the focus was on names as well as data appearing within the <title> tag. An XSLT was created to automate much of this, so that in an author or recipient tag, each instance of JA, AA, JQA, etc. was automatically converted to "adams-john1735", "adams-abigail1744", "adams-john-quincy1767", etc. For non-Adams correspondents, Thomas Jefferson was flipped to "jefferson-thomas", etc. Would that it were this consistent the whole way through the project! While this did a lot of the work, it did not account for all the variables that are inherent in a collection the size of the Adams Papers. For text within the <title> tag we added the information from scratch.
The XSLT looked for the text within an author or recipient tag and flipped the names around. If the text was contained in a recipient tag, an additional rule skipped the word "to", which always appears, and thus took the second and the last word within the tag. For example,
<recipient>to HA</recipient>
was converted to
<recipient><ref target="adams-henry1838" type="person">to HA</ref></recipient>.
And
<recipient>to Simeon Andinwooll</recipient>
was converted to
<recipient><ref target="andinwooll-simeon" type="person">to Simeon Andinwooll</ref></recipient>.
It should be clear also that all attributes were automatically converted to lower case. Ultimately, this was correctly applied to the majority of records, but there were anomalies, and thus the style sheet also introduced some bad attributes. For example, classifying the type as "person" was the default, so for offices, corporations, etc. we had to fix the type attribute manually. Those individuals who went by their initials (E. W. Dodge) posed another set of issues; we were very literal in our transcription of the data on the original slip files (which in turn are faithful to the original documents), so unless E. W. Dodge was defined as, for example, Eliphalet Winchester, he (assumed) is confined to the anonymity of how his (assumed) name was signed.
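The flip rule described here can be sketched in a few lines. This is a Python illustration, not the project's XSLT. The function name is hypothetical, and the lookup table holds only the four Adams abbreviations mentioned in this post:

```python
# Abbreviations and attributes quoted in this post; the real list was longer.
ADAMS_ABBREVIATIONS = {
    "JA": "adams-john1735",
    "AA": "adams-abigail1744",
    "JQA": "adams-john-quincy1767",
    "HA": "adams-henry1838",
}


def guess_target(tag_text, is_recipient=False):
    """Guess a person attribute from the text of an author or recipient tag."""
    words = tag_text.split()
    # In a recipient tag, skip the leading "to".
    if is_recipient and words and words[0].lower() == "to":
        words = words[1:]
    if not words:
        return ""
    # A single word is either a known Adams abbreviation or used as-is.
    if len(words) == 1:
        return ADAMS_ABBREVIATIONS.get(words[0], words[0].lower())
    # Otherwise take the first remaining word and the last word, flip, lowercase.
    return f"{words[-1]}-{words[0]}".lower()


print(guess_target("to Simeon Andinwooll", is_recipient=True))
print(guess_target("to the President of Congress", is_recipient=True))
```

The second call shows why every record still needed human review: the rule has no way of knowing that "the President of Congress" is an office rather than a name.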
Sounds easy, right? After the first few weeks we got into the groove and learned tricks, what to look for, etc. On average a reel took maybe four or five days, depending on the number of records and any significant events such as deaths, wars, and the like. As I said above, we were very literal in the process of creating name authorities, since not all of us are Adams experts. A fine example of this is T. B. Johnson. T. B. Johnson signed many of his letters as T. B. Johnson. He also signed them Thomas B. Johnson. And there were a few signed Thomas Baker Johnson. But he is likely not the only T. B. Johnson in the history of the world, so it is difficult to determine if they are one and the same or different people. So currently all three variants exist in the database, which could make searching difficult and not exhaustive. Frequently in a run of letters we were able to determine that T. B. was indeed Thomas Baker, and so in cases like this we felt comfortable changing an instance of T. B. Johnson to the fuller Thomas Baker Johnson. However, if we could not conclusively determine that it was indeed Thomas Baker Johnson, we left it alone, and one of the Adams editors can make that change.
More examples...
Letters with multiple authors and/or recipients were a little complicated, as well as letters addressed generally to someone by their title/position/office.
So, after Level 1 a sample multiple author tag appeared as:
<author>JA, B. Franklin, J. Jay, H. Laurens, and T. Jefferson.</author>
After we ran the schema and did some house cleaning, it was transformed to look like this:
<author> <ref target="adams-john1735" type="person">JA</ref>, <ref type="person" target="franklin-benjamin">B. Franklin</ref>, <ref type="person" target="jay-john">J. Jay</ref>, <ref type="person" target="laurens-henry">H. Laurens</ref>, and <ref type="person" target="jefferson-thomas">T. Jefferson</ref>.</author>
Where a letter was addressed to someone by their title/office etc., after Level 1, a sample recipient tag looked this way:
<recipient>to the President of Congress</recipient>
After the XSLT was run, it looked like this:
<recipient><ref target="congress-the" type="person">to the President of Congress</ref></recipient>
You can see that the transformation skipped the word "to" and then looked at the next word and the last word. Once we reviewed the records and conducted a little research, it became this:
<recipient><ref target="huntington-samuel" type="person">to the President of Congress</ref></recipient>.
Simple beauty!
As of right now, the names database contains 19,454 names. We have yet to systematically clean up possible duplicates like the Johnson example above, or instances where names were spelled differently in America than in, say, the Netherlands. Fortunately these are exceptions and not the rule, so the process should go smoothly. A lot more needs to be said about Encoding Level 2, and I'm sure I didn't touch on many aspects of our process. But hopefully this post gives a little flavor of the goings-on at the Massachusetts Historical Society during the spring, summer, and fall of 2010.