Blog post by Noah Huffman, Duke University Libraries
As part of a recent project to migrate data from EAD (Encoded Archival Description) to ArchivesSpace, I needed to clean up about 27,000 name and subject headings spread across more than 2,000 EAD records in XML. Because the majority of these EAD XML files were encoded by hand using a basic text editor (don’t ask why), I knew there were likely to be variants of the same subject and name terms throughout the corpus: terms with extra white space, different punctuation and capitalization, etc. I needed a quick way to analyze all these terms, deduplicate them, normalize them, and update the XML before importing it into ArchivesSpace. I knew OpenRefine was the tool for the job, but the process of getting the terms 1) out of the EAD, 2) into OpenRefine for munging, and 3) back into EAD wasn’t something I’d tackled before.
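To make the kind of variation described above concrete, here is a minimal Python sketch of how headings that differ only in white space, punctuation, or capitalization can be grouped under a shared comparison key. It is an illustration only, not part of the workflow in this post (which relies on OpenRefine), and it is much cruder than OpenRefine's own key-collision clustering. The function names `heading_key` and `group_variants` and the sample headings are all hypothetical.

```python
import string
import unicodedata
from collections import defaultdict

# Translation table that turns every ASCII punctuation character into a space
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def heading_key(term: str) -> str:
    """Return a crude comparison key for a heading (hypothetical helper).

    Folds case, replaces ASCII punctuation with spaces, and collapses
    runs of white space, so near-identical headings share one key.
    """
    text = unicodedata.normalize("NFKC", term).casefold()
    text = text.translate(_PUNCT_TO_SPACE)
    return " ".join(text.split())


def group_variants(terms):
    """Map each key to its distinct spellings, keeping only keys with variants."""
    groups = defaultdict(set)
    for term in terms:
        groups[heading_key(term)].add(term)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


# Made-up sample headings, for illustration only
sample = [
    "Tobacco industry--North Carolina",
    "Tobacco industry -- North Carolina.",
    "tobacco industry--north carolina ",
    "Durham (N.C.)",
]

for key, forms in group_variants(sample).items():
    print(key, forms)
```

A key like this is only useful for spotting candidate duplicates. Choosing which spelling becomes the authorized form is still a human decision, which is where an interactive tool like OpenRefine earns its keep.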
Below is a basic outline of the workflow I devised, combining XSLT, OpenRefine, and, yes, Excel. I’ve provided links to source files where available. As with any major data cleanup project, I’m sure there are 100 better ways to do this, but hopefully somebody will find something useful here.