Last month I described how we grew our Excel worksheet to support gathering descriptions for AMNH expeditions and personnel in data fields mapped to EAC-CPF. Today I will explain how this basic worksheet developed, and how it became an invaluable tool for creating EAC-CPF XML as well.
I know… nothing that can be done in a spreadsheet is as cool as riding to camp on the back of a camel, but allow me to tell you how we got our records from XLS to XML. Nudge-nudge, in case you aren’t aware, the AMNH Library recently launched its collection of digitized images online, where I found this fabulous photograph.
As Becca’s team of interns researched and wrote entity records using the Excel file in its first major evolution, we realized that the contextual standard could capture much richer metadata if we could provide the means to record it. And as long as our studious interns were delving into the details of these expedition histories, it made sense to spend some quality time making sure those details were encoded in a way that would be useful for future research and resource discovery (and, possibly, data visualizations). For instance, expedition itineraries could simply be recorded under Sites as a list of place names, or we could enrich that list by adding <date>, <descriptiveNote>, and <placeRole> to each location, much of that information already documented discretely in the itinerary itself. But to achieve this kind of depth, the Excel form would require some major revision and restructuring.
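To make the difference concrete, here is a hypothetical sketch (not one of our actual records) of what an enriched itinerary entry could look like in EAC-CPF. The place, role, and date values are invented for illustration, and the element arrangement is simplified:

```xml
<place>
  <placeRole>camp site</placeRole>
  <placeEntry>Shabarakh Usu, Mongolia</placeEntry>
  <date standardDate="1925-06">June 1925</date>
  <descriptiveNote>
    <p>Base camp during the summer field season.</p>
  </descriptiveNote>
</place>
```

Compare that to a bare place name in a list: the role, the date, and the note each become searchable, sortable pieces of data instead of prose trapped in a cell.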
Already the “legs” of the worksheet (separate sheets for Timeline and Relations) had complicated our singular table environment. We had to reevaluate the data layout model: how could we maintain the ease of entering information into spreadsheet rows while capturing meaningful metadata chunks in particular fields? We also needed a workform that reflected the CPF standard more completely. At the time, we were fortunate enough to be hosting University of Michigan graduate student Nick Krabbenhoeft for his program’s practicum. Nick grew the Excel form into the template we now use. It has four sheets to reflect the four main areas of EAC-CPF: Control, Identity, Description, and Relations. Each sheet holds separate tables to record information under specific elements, which allows for nested descriptions. And any table can be expanded by inserting new rows, accounting for multiple entries (though not all elements are repeatable). The Excel templates can be found at the end of this post.
In addition to this new structure, Nick also wrote a macro to convert the data into EAC-CPF XML. A winning combination to record granular data AND GENERATE XML. It was a huge step for us, given our humble beginnings. The macro basically treats each table as a separate CPF element: if a table has content, it takes the content and wraps it in metadata tags defined by the column headers. You may notice the lower- and camelCase structure of the field names in the screenshots above. The macro plucks the tag names from the header cells and creates EAC-CPF elements. Hierarchical elements have more than one row of headers; attributes are defined in the macro so that any header containing “xlink”, for instance, will be coded accordingly. The four separate sheets themselves operate as parent elements.
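Nick’s macro is written in VBA inside the workbook, so I can’t paste it here, but the header-to-tag idea is simple enough to sketch in a few lines of Python. This is a minimal illustration of the concept, not the actual macro: the table name, headers, and values below are hypothetical, and it ignores attributes and multi-row headers entirely.

```python
# Sketch of the header-driven conversion: each column header becomes an
# element name, each non-empty cell becomes that element's content, and
# the whole table is wrapped in a parent element.
from xml.sax.saxutils import escape

def table_to_xml(parent_tag, headers, rows):
    """Wrap each non-empty cell in a tag named by its column header."""
    parts = [f"<{parent_tag}>"]
    for row in rows:
        for tag, value in zip(headers, row):
            if value:  # empty cells produce no element, like empty tables
                parts.append(f"  <{tag}>{escape(value)}</{tag}>")
    parts.append(f"</{parent_tag}>")
    return "\n".join(parts)

print(table_to_xml("existDates", ["fromDate", "toDate"], [["1921", "1930"]]))
```

Running this prints an `<existDates>` element containing a `<fromDate>` and a `<toDate>`, which is essentially what the macro does across every table on all four sheets.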
However, this isn’t just Excel as we know it. Two things had to change in our familiar Excel landscape to make this conversion to XML happen: 1) the file was changed to a macro-enabled workbook (.xlsm), and 2) the Developer tab had to be displayed in the ribbon to run the script. If you look at the tabs in the screenshots, Developer appears after View. Notice that the View tab has a Macros button as well. For data creation, the interns use the macro-enabled workbook without the Developer tab; it is really only necessary to create, edit, view, and run the macro itself, functionality that does not impact data gathering. And let me just say it again: WE HAVE XML FILES and no one had to hand-code a single backslash. A million thanks to Nick for this essential development!
Being able to push our data into a system-independent format such as XML meant that we were one step closer to making our records publicly available. This is why I spent some time this winter creating a very, very basic stylesheet.
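For anyone curious what “very basic” means, here is a hypothetical skeleton of such a stylesheet, not our actual one. It assumes the records use the EAC-CPF namespace and simply pulls the authorized name and biographical/historical paragraphs into an HTML page:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:eac="urn:isbn:1-931666-33-4">
  <!-- Render the first name entry as a heading -->
  <xsl:template match="/">
    <html>
      <body>
        <h1><xsl:value-of select="//eac:nameEntry[1]/eac:part"/></h1>
        <!-- Render each biogHist paragraph -->
        <xsl:for-each select="//eac:biogHist/eac:p">
          <p><xsl:value-of select="."/></p>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

Even a transform this small turns a wall of angle brackets into something a reviewer can actually read.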
It’s not to our standard for publishing records online, but it certainly feels like we’ve passed a major milestone. If nothing else, it has greatly improved the review process — have you ever tried reading a spreadsheet cell packed with four meaty paragraphs? I don’t recommend it.
Looking ahead, whether we use a program such as xEAC to publish our entity records or run the stylesheet transformation, the data now has an outlet. Excel might be a great place to start for capturing information, but we don’t want to live in it for the long term.
Ideally, the records should be integrated into a larger system of content management. Our next step is to find a system to manage our entities alongside our finding aids. Though it is no small accomplishment to go from XLS to XML to HTML, the process has its shortcomings, also not small. XML validation is one hurdle. The macro, in its current version, creates well-formed code as long as the data is entered correctly into the Excel template and the tables have not been modified. Small oversights, such as leaving a required field blank, will render the code invalid. I use an XML editor, oXygen, to run validation and troubleshoot. Some errors are easy to fix; others can be less evident. This is the drawback of using a spreadsheet versus a database, where each field can be designed for specific input. But the flexibility and adaptability of Excel have made it a worthy tool. We could not have gained as much ground on this project without it.
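The distinction between well-formed and valid is worth a quick illustration. A sketch like the following (not part of our workflow, just an illustration using Python’s standard library) can catch broken markup before a file ever reaches the XML editor, but it cannot catch schema violations like a missing required field; that still requires validating against the EAC-CPF schema in a tool such as oXygen:

```python
# Well-formedness check only: this parses the markup but knows nothing
# about the EAC-CPF schema, so a missing required element still passes.
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<eac-cpf><control/></eac-cpf>"))  # parses fine
print(is_well_formed("<eac-cpf><control></eac-cpf>"))   # mismatched tag
```

The first snippet is well-formed (though far from valid EAC-CPF); the second fails because the tags don’t match, the kind of error the macro protects us from as long as the tables stay intact.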