Brought to you by the Massachusetts Historical Society

"I have nothing to do here, but to take the Air, enquire for News, talk Politicks and write Letters."

John Adams to Abigail Adams, 30 June 1774

Tuesday, June 21, 2011

Names Clean-up in the People Database

The most recent task assigned to me as we prepare to launch the Adams Papers Digital catalog was to review instances where names we assigned as attributes (e.g. adams-john1735 for John Adams, a.k.a. JA) appeared in the XML code but not in the People database. This is part one of two; the second part covers instances where people appear in the database but do not appear in the attributes in the XML.

This is another one of those combinations of human and computer errors, some of which were unavoidable (computer) and some of which happened because of how mundane some of the work was (human).

The attributes that needed seeing to were 1200 strong, and in printed form stretched to 110 pages. (Before you think we're total idiots, this represents approximately 5-6% of the database, which means we got 94-95% right. On a ten-point grading scale that is a solid A! Sheepishly, we realize mistakes happen, but anticipating negative criticism about the work we did, we wanted to try to spin it positively.)

We ran two queries to produce the data. The first query produced the faulty attributes and the ID number of the slip in which each attribute was contained. In the second run, we also included, within double quotes, the name associated with the faulty attribute. That gave us a little insight into what we could expect before searching the database and editing the code.
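For the curious, here is a rough sketch of the kind of comparison the first query makes. It is not our actual query. It assumes the XML attributes have already been pulled out as (slip ID, attribute) pairs and that the People database keys are available as a set; the function name, the variable names, and the slip ID "000001" are all made up for illustration.

```python
from collections import defaultdict

def find_unmatched(xml_attributes, people_keys):
    """Map each attribute missing from the People database
    to the sorted list of slip IDs where it occurs."""
    unmatched = defaultdict(list)
    for slip_id, attribute in xml_attributes:
        if attribute not in people_keys:
            unmatched[attribute].append(slip_id)
    return {attr: sorted(ids) for attr, ids in unmatched.items()}

# Hypothetical sample data modeled on the examples in this post.
people_keys = {"adams-john1735", "laurens-henry", "palfrey-john-gorham"}
xml_attributes = [
    ("031911", "henry-laurens"),
    ("340735", "palfrey-john"),
    ("341083", "palfrey-john"),
    ("000001", "adams-john1735"),  # invented slip ID; attribute is valid
]
report = find_unmatched(xml_attributes, people_keys)
```

Swapping the roles of the two inputs (database keys that never appear in the XML) would give the second half of the clean-up.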

Some typical instances looked like this:

a. [extract]-monroe] 160017 - "to Sec. of State [James Monroe] [extract]" | 160043 - "to Sec. of State [James Monroe] [extract]"
b. bourne-sylvanus? 072781 - "to Mr. Sylvanus? Bourne"
c. henry-laurens 031911 - "to Henry Laurens"
d. nicolay-albery-h 341282 - "to Albert H Nicolay"
e. palfrey-john 340735 - "John G. Palfrey" | 341083 - "John G. Palfrey"


A number of consistent error patterns emerged quickly once we started cleaning up the data. In some instances the attribute was valid but had not been entered into the database. While frustrating, this was an easy fix and was most likely the product of human error. A second cause was a typographical error in the XML attribute that was not present in the attribute for that person in the People database, or the reverse, where there was a typo in the People database but the XML attribute was actually correct. Typographical errors include transpositions of letters (examples c and d). A third possibility was that in the transformation process which took place in Level 1, the attribute was inaccurately reviewed (examples a and b above). As those two examples should make evident, punctuation and other marks like [ ) ? . were not allowed in the attribute. Finally, we saw cases where attributes in the XML were not as complete as in the database, and the opposite (example e).
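The punctuation rule is the one error type a computer can catch without any help. Here is a small sketch of such a check. The pattern is my own guess at the attribute format, based only on the examples in this post (lowercase name parts joined by hyphens, optionally followed by digits, as in adams-john1735); it is not an official specification.

```python
import re

# Assumed format: lowercase name parts joined by hyphens, optional trailing digits.
ATTRIBUTE_PATTERN = re.compile(r"[a-z]+(?:-[a-z]+)*\d*")

def is_well_formed(attribute):
    """Return True if the attribute contains no disallowed marks."""
    return ATTRIBUTE_PATTERN.fullmatch(attribute) is not None

# Attributes taken from the examples in this post.
checks = {
    attr: is_well_formed(attr)
    for attr in ["adams-john1735", "[extract]-monroe]", "bourne-sylvanus?", "henry-laurens"]
}
```

Note that henry-laurens passes this check even though it is wrong: a well-formed attribute can still point at nobody, which is why each name also had to be looked up in the database.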

The fixes for the above examples should be relatively easy to make yourselves:

a. monroe-james (remove [], add james)
b. bourne-sylvanus (remove ?)
c. laurens-henry (flip first and last names)
d. nicolay-albert-h (typo in albert corrected)
e. palfrey-john-gorham (added middle name)
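We made these fixes by hand, but some of them could in principle be suggested by a script. The sketch below is only an illustration under my own assumptions: suggest_fix is a hypothetical helper, the 0.7 similarity cutoff is arbitrary, and the approximate matching comes from get_close_matches in Python's standard difflib module.

```python
import difflib
import re

def suggest_fix(bad_attribute, people_keys):
    """Return candidate People database keys for a faulty attribute."""
    # Drop characters that are not allowed in an attribute.
    cleaned = re.sub(r"[^a-z0-9-]", "", bad_attribute.lower()).strip("-")
    if cleaned in people_keys:
        return [cleaned]
    # Catch flipped first and last names by comparing name parts in any order.
    parts = sorted(cleaned.split("-"))
    flipped = [key for key in people_keys if sorted(key.split("-")) == parts]
    if flipped:
        return sorted(flipped)
    # Otherwise fall back to approximate string matching (arbitrary cutoff).
    return difflib.get_close_matches(cleaned, sorted(people_keys), n=3, cutoff=0.7)

# Hypothetical database keys, taken from the corrected examples above.
people_keys = {
    "monroe-james", "bourne-sylvanus", "laurens-henry",
    "nicolay-albert-h", "palfrey-john-gorham",
}
suggestions = {
    bad: suggest_fix(bad, people_keys)
    for bad in ["bourne-sylvanus?", "henry-laurens", "nicolay-albery-h", "palfrey-john"]
}
```

Example a shows the limits of this approach: stripping the brackets from [extract]-monroe] leaves extract-monroe, which is not close enough to monroe-james to be suggested, so a person still has to read the slip.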

In addition to checking the slip, we also had to check the database for each name (unless it was a prominent figure). The back-end view for the Adams Papers editors (and us) allows for tabbed browsing of the digital control file. I have not really seen the public interface, but I imagine it might have a similar feature.

We did not keep count of the number of names added, but there were quite a few. In the process we were able to clean up a number of bad attributes that were in the database and, in some cases, merge or separate people based on a close inspection of names, dates, etc. For example, in correcting some of the slips for William Cunningham Jr., I discovered that his attribute should be "cunningham-william0", whereas the majority of them were just "cunningham-william", which is the attribute for his father, William Sr. These have all been fixed now so that the users of the digital control file will get the results they expect when searching for these people.

We will re-run the query in a few days to ensure that every instance was seen to accurately. Let's pretend this was the case, for if any were missed, I won't tell you about it!

The other side of this clean-up, where we'll run a query to find attributes that exist in the database but do not appear in the XML, is more straightforward: those attributes will be deleted from the database.

Beta testing of the public interface has also been going on. Susan and I were invited to sit in on a meeting yesterday about how that testing went, what some of the feedback was, etc. So I'll post a bit about the public interface next time. The project comes to a close on 30 June 2011, so blogging might slow down - if it doesn't stop outright - after that date. I don't want to turn this into a Brokeback Mountain "I can't quit you" kind of moment, but the reality is that once the funds stop, so does the blog!
