The most recent task assigned to me as we prepare to launch the Adams Papers Digital catalog was to review instances where names we assigned as attributes (e.g., adams-john1735 for John Adams, a.k.a. JA) appeared in the XML code but not in the People database. This is part one of two, the second part being instances where people appear in the database but not among the attributes in the XML.
This is another one of those combinations of human and computer errors, some of which were unavoidable (computer) and some of which happened because of how mundane some of the work was (human).
The list of attributes that needed attention was 1,200 strong and in printed form stretched to 110 pages. (Before you think we're total idiots, this represents approximately 5-6% of the database, which means we got 94-95% right. On a ten-point grading scale that's a solid A! We sheepishly acknowledge that mistakes happen, but anticipating criticism of the work, we wanted to spin it positively.)
We ran two queries to produce the data. The first query produced the faulty attributes and the slip ID numbers in which each attribute appeared. For the second run, we also included, within double quotes, the name associated with each faulty attribute. This gave a little insight into what to expect before searching the database and editing the code.
Some typical instances looked like this:
a. [extract]-monroe] 160017 - "to Sec. of State [James Monroe] [extract]" | 160043 "to Sec. of State [James Monroe] [extract]"
b. bourne-sylvanus? 072781 - "to Mr. Sylvanus? Bourne"
c. henry-laurens 031911 - "to Henry Laurens"
d. nicolay-albery-h 341282 - "to Albert H Nicolay"
e. palfrey-john 340735 - "John G. Palfrey" | 341083 - "John G. Palfrey"
There were a number of consistent patterns to the errors, which became clear quickly while cleaning up the data. In some instances the attribute was valid but had not been entered into the database; while frustrating, this was an easy fix and was most likely the product of human error. A second cause was a typographical error in the XML attribute that was not present in the attribute for that person in the People database, or the reverse, where the typo was in the People database and the XML attribute was correct. Typographical errors include transpositions of letters (examples c and d). A third possibility was that in the transformation process that took place in Level 1, the attribute was inaccurately reviewed (examples a and b above). From those examples it should be evident that punctuation and other marks like [ ) ? . were not allowed in attributes. Finally, we saw cases where an attribute in the XML was less complete than in the database, and the opposite (example e).
The fixes for the above examples are relatively easy to follow:
a. monroe-james (remove [], add james)
b. bourne-sylvanus (remove ?)
c. laurens-henry (flip first and last names)
d. nicolay-albert-h (typo in albert corrected)
e. palfrey-john-gorham (added middle name)
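The mechanical part of these fixes (stripping the punctuation that was never allowed in attributes) can be sketched as a simple normalization pass. This is a hypothetical illustration in Python, not the actual tooling used on the project, and it only handles the punctuation cases; reorderings and added name parts, as in examples a, c, and e, still took human judgment:

```python
import re

def normalize_attribute(attr):
    """Sketch of the punctuation cleanup: lowercase, strip the marks
    that were not allowed in attributes, and tidy stray hyphens."""
    attr = attr.lower()
    # Marks like [ ] ( ) ? . were not allowed in attributes
    attr = re.sub(r"[\[\]()?.]", "", attr)
    # Collapse doubled hyphens and trim leading/trailing ones
    attr = re.sub(r"-{2,}", "-", attr).strip("-")
    return attr

print(normalize_attribute("bourne-sylvanus?"))   # bourne-sylvanus
print(normalize_attribute("[extract]-monroe]"))  # extract-monroe (still needs "james" added by hand)
```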
In addition to checking the slip, we also had to check the database for each name (unless it was a prominent figure). The back-end view for the Adams Papers editors (and us) allows for tabbed browsing of the digital control file. I have not really seen the public interface but imagine it might offer a similar feature.
We did not keep count of the number of names added, but there were quite a few. In the process we were able to clean up a number of bad attributes in the database and in some cases merge or separate people based on a close inspection of names, dates, etc. For example, in correcting some of the slips for William Cunningham Jr., I discovered that his attribute should be "cunningham-william0", whereas the majority of them were just "cunningham-william", which is the attribute for his father, William Sr. These have all been fixed now, so users of the digital control file will get the results they expect when searching for these people.
We will re-run the query in a few days to ensure that every instance was seen to accurately. Let's pretend this was the case, for if any were missed, I won't tell you about it!
The other side of this clean-up, whereby we'll run a query to determine which attributes exist in the database but do not appear in the XML, is more straightforward: those attributes will be deleted from the database.
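Both sides of the clean-up amount to a set comparison between the attributes used in the XML and the attributes recorded in the People database. A minimal sketch, using made-up attribute lists in place of our real query results:

```python
# Hypothetical attribute sets standing in for the real query results
xml_attributes = {"adams-john1735", "henry-laurens", "nicolay-albery-h"}
db_attributes = {"adams-john1735", "laurens-henry", "nicolay-albert-h"}

# Part one: used in the XML but missing from the database (fix by hand)
in_xml_only = xml_attributes - db_attributes
# Part two: in the database but never used in the XML (delete from the database)
in_db_only = db_attributes - xml_attributes

print(sorted(in_xml_only))  # ['henry-laurens', 'nicolay-albery-h']
print(sorted(in_db_only))   # ['laurens-henry', 'nicolay-albert-h']
```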
Also going on has been Beta testing of the public interface. Susan and I were invited to sit in on a meeting yesterday about how that testing went, what some of the feedback was, etc. So I'll post a bit about the public interface next time. The project comes to a close on 30 June 2011, so blogging might slow down, if it doesn't stop outright, after that date. I don't want to turn this into a Brokeback Mountain "I can't quit you" kind of moment, but the reality is that once the funding stops, so does the blog!
Tuesday, June 21, 2011
Monday, January 10, 2011
Supporting Databases, Part 1
This sounds like an Oscar category...And the nominees for Best Supporting Database in a Digital Conversion Project are: Accessions. Institutions. People. Places.
The supporting databases in the project allow us to regularize and make consistent the way in which we store and retrieve information. At present there are four supporting databases: Accessions, Institutions, People, and Places. There are additional supporting documents that we created and used, such as the Microfilm Conversion Chart. Fellow Adams Slip File encoder and blogger Susan Martin worked with the Accessions and Institutions databases as well as the Microfilm Conversion Chart and MHS Collection Codes, so she will write on them.
As mentioned in the post of 15 December 2010, at that time the People database contained 19,454 names. This number will fluctuate a bit as digital control file staff and Adams Papers editors identify duplicate entries and/or more fully identify those records for which staff have more information. We also occasionally find names skipped during Encoding Level 2; this was generally the result of the density or complexity of a record.
The Places database was the first to be built and populated during Level 1 Encoding. In Level 2, while not a focus, we took the opportunity to review attributes and perform basic data clean-up where necessary. The Places database contains 3,090 records: from Abbeville to Zwolle.
The fields we populated in Level 1 in the Places database are location, city, state, country, and notes. The location field is the controlled form of the entry - the attribute. Generally the first time a city appeared it received a one word attribute: "quincy", "tallahassee", and "athol" for example. However, once the country expanded, we were left with the task of differentiating between places with the same name in different states and/or countries. A good example is Burlington. We have eight different records for Burlington: "burlington", "burlington-county", "burlington-ia", "burlington-ma", "burlington-me", "burlington-nj", "burlington-ny", and "burlington-vt". We assigned the fullest known attribute to distinguish one from the other. However, sometimes the address listed simply says Burlington. In these instances it was not always possible to determine if it was the Burlington in Massachusetts or some other state.
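The Burlington-style disambiguation could be sketched as a small helper that appends whatever state or country information is known. This is a hypothetical illustration of the rule, not the project's actual code:

```python
def place_attribute(city, state=None, country=None):
    """Build a place attribute: lowercase city, with a state (or,
    failing that, country) suffix appended when we know it.
    Hypothetical helper illustrating the naming convention."""
    parts = [city.lower().replace(" ", "-")]
    if state:
        parts.append(state.lower())
    elif country:
        parts.append(country.lower())
    return "-".join(parts)

print(place_attribute("Burlington", state="MA"))  # burlington-ma
print(place_attribute("Burlington"))              # burlington (ambiguous; left bare)
```

When the slip just says "Burlington" with no state, the bare attribute is all the information supports, which is exactly the ambiguity described above.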
This is a long way of saying we did the best we could with the information we had. As with the People database, the Adams Papers editors can use their expertise to help solidly define and identify a place if needed.
Labels:
About the project,
Database,
Encoding Level 1,
Encoding Level 2,
Names,
Place
Tuesday, April 20, 2010
Control of Control File Closing In!
Despite the lack of posting, we have all been plugging away at Encoding Level 1 and have just completed the initial phase of mark-up on all 110,000 records. As it stands now, we have 42 XML files that have been proofread and input to match the massive paper file down the hall. In addition, we have control (through attributes) over all dates, codes, locations, length, and format for each record. The big pieces that remain for Encoding Level 2 are the control of names--for both authors and recipients--published references, and notes.
As we've gone through the encoding, we have also been developing supplemental databases that will enhance searchability in the final interface. These currently include locations (where letters were written) and accessioned documents (repositories other than the MHS that hold the original manuscripts, e.g. the Library of Congress). We are also building a supplemental database of all persons and short titles (published versions of documents).
Much of the work in the coming months will focus on Encoding Level 2 (with an emphasis on automating as much data entry as possible through XSLTs) and the building of the database infrastructure in eXist. As we iron out the kinks in building and managing these databases, I will post what we learn and produce. Stay tuned!
Tuesday, November 10, 2009
Checklist for Encoding Level 1
We are currently 35% of the way through Encoding Level 1, which involves inputting proofreading corrections, verifying the basic code, and creating the first of several authority look-up tables (this one for place names). The work is still broken down by reel. The following is the checklist for each record.
To open a new file for encoding level 1:
--Open new XML file through the Tortoise SVN Directory at C:\Repositories\slipfile\xml\proofread and confirm that the file name ends in “_level1”
--Confirm that FULL_schemaV2_MR.rng is associated through the Tortoise SVN Directory at C:\Repositories\slipfile\xml\schemas (reassociate if the red underlines don’t appear)
--Run the XSL transformation copyformat.xsl; overwrite the new file under same name.
--Commit these changes by right clicking on the slipfile folder on your C:\ drive and selecting "SVN Commit" from the drop down menu. Select the files to commit, click "OK" and then type in your password.
To open a working file for encoding level 1:
--Open XML file through the Tortoise SVN Directory at C:\Repositories\slipfile\xml\level1
--Enter changes and save periodically to the Tortoise SVN Directory at C:\Repositories\slipfile\xml\level1
--When finished with work, commit changes by right clicking on the slipfile folder on your C:\ drive and selecting "SVN Commit" from the drop down menu. Select the files to commit, click "OK" and then type in your password.
For each record:
Input proofreading file changes
--Confirm @color, enter if absent (if you delete the entire @color and hit the space bar, a drop down menu will appear with possible attributes and values). The choices are: 1pink, 2yellow, 3white, 4blue, or 5goldenrod
--Confirm <place>, remove unnecessary information from @location and confirm correct English spelling; confirm the place name against the Excel spreadsheet list and add new authority names to the list, e.g. “Philadelphia, 31 South Street” should have a @location value of “Philadelphia”
--Confirm <code>, use drop-down prompts to fill in attributes when necessary. Codes that are not @type=Accession, Letterbook, Miscellany, or Diary should be encoded as “General” under @type, e.g. “TS Wills and Deeds”
--Confirm <length>, enter value in @pages if absent; add up multiple page numbers listed, e.g. if there is an enclosure and <length>2 p., 3 p.</length> then the total value for @pages=“5”
--Confirm <copy>, enter value for @format. The copyformat.xsl should have populated most of these. When there are two values, one for MS and one for XPr (or the like), copy @format should have “Manuscript” as the value and the subsequent XPr values should be encoded as a note
--Confirm <date>, verify that populated dates are correct, confirm all attributes are present as necessary, enter @to for date ranges and any other appropriate attributes.
Most dates should be automatically populated, except for date ranges. A date range will have the first date entered as @when; the encoder must enter the end date in @to as year-month-day. For unknown months or days, enter “99”. For conjectural or corrected dates, encode the corrected date. For questions, check the Master Encoding Guide. E.g., "1 January 1799 [i.e. 1800]" should be @when="1800-01-01".
--Add new slips found in paper file, create new ID number at end of reel
--Cross check any changes in the Corrections Binder (may be redundant, but important!)
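Two of the steps in the checklist above lend themselves to small helpers: totaling @pages from a <length> value, and formatting @when/@to dates with “99” for unknown parts. A sketch in Python; the function names are ours for illustration, not part of the project's tooling:

```python
import re

def total_pages(length_text):
    """Sum every page count in a <length> value,
    e.g. a letter plus an enclosure like "2 p., 3 p."."""
    return sum(int(n) for n in re.findall(r"(\d+)\s*p\.", length_text))

def range_date(year, month=None, day=None):
    """Format a date for @when/@to as year-month-day,
    using "99" for an unknown month or day."""
    return f"{year:04d}-{month or 99:02d}-{day or 99:02d}"

print(total_pages("2 p., 3 p."))  # 5
print(range_date(1800, 1, 1))     # 1800-01-01
print(range_date(1799))           # 1799-99-99
```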
Labels:
Checklists,
Codes,
Color,
Copy,
Dates,
Encoding Level 1,
Format,
Length,
Place,
Series
Monday, October 19, 2009
Encoding excitement and other things
I feel I should apologize for not posting recently, although to whom I would be apologizing I never can tell. Anyway, encoding for level 1 actually began September 1 and has been proceeding steadily. We have Jim (Connolly) the Encoder working four days a week and Susan (Martin) the EAD Gal encoding one day per week. We have moved through seven reels and are picking up steam. I will follow up with another post about the XSL transformation we ran with the help of a very clever consultant and the checklist for encoding level 1.
As to my time, I am just finishing up the proofreading and coming off the ADE Annual Meeting in Springfield, IL. Ondine LeBlanc (Director of Publications) and I presented a workshop on getting legacy content out of MS Word and into XML and I was able to share a little bit about this project as well. Overall, we were very pleased with the participation and follow-up questions. I, for one, am very interested to see what manner of digital resources come out of the documentary editing community over the next few years.
Now I hope to get down to the busy work of encoding too... stay tuned.
Wednesday, August 12, 2009
Phase 2 Timeline
Our project to digitize the Adams Papers Control File began in January 2009. We originally planned on spending a few short months on proofreading before moving into encoding. However, proofreading 109,348 slips, one by one, has taken a little longer than we anticipated. This phase of the work is vitally important, though, and we have continued doggedly pursuing our final reels. We have found important corrections and updates and have begun entering those changes into the XML files now. The input of corrections has been folded into first phase of encoding and so far is going smoothly.
The first seven months of the project were also devoted to schema development (see Master Encoding Guide), and this summer we secured the services of an excellent XSL consultant to write an XSL transformation to convert our abbreviated vendor schema into the full schema and populate much of the consistent data automatically. The XSLTs have been very helpful and we hope to build on them to automatically generate other data as we work through the initial encoding.
Thus our schedule for 2009:
- January-August: proofreading (project manager, proofreader, EAD coordinator)
- March-June: schema development (project manager and web developer)
- July-August: XSL development and contract work (project manager, web developer, and consultant)
- August-December: encoding level 1 (project manager, encoder, EAD coordinator)
- September-December: XSL development (project manager and web developer)
Labels:
About the project,
Encoding Level 1,
Proofreading,
Schedule,
Schema