I’ve got a part-time job in a technical services group this summer and I’m slowly rebalancing day-to-day living. Part-time is a good fit for me since I’m also working on the last course for my degree. The part-time gig is basically digital publishing in a library context. I’m setting up an OJS-based site for a well-established but print-only journal. What I’m enjoying most is that there’s a good mix of things I already know how to do (project managing web publications), things I’ve recently learned and want to grow some more (manipulating data with XML), and things I haven’t done much of (garden-variety digitization).
Last week I built the shell of the journal: this was mostly tire-kicking. Building the site itself wasn’t particularly tricky — the look and feel are pretty straightforward and required fairly minor CSS tweaks. There’s a stack of things that I still need to test and set up. Setting up the financial backend looks like it will be more or less straightforward as long as all the plugins work and the server cooperates. And nerdily enough, I’m looking forward to sitting down and trying out XML batch importing.
Photo by bridges&balloons (CC BY-NC-ND 2.0)
This week, after a little flurry of copying and backup-making, I got my hands on the files that need to be migrated and discovered the fly in the electronic ointment. I knew going in that there were a range of file states we’d need to deal with (little PDFs, big PDFs, image files) but it hadn’t dawned on me that the files wouldn’t be accompanied by some sort of externalized metadata set. So. Rekeying everything isn’t practical or a particularly good use of my contract time. This week, then, I’m building a more detailed project plan, figuring out mechanics, and coming up with a kludged way of building a metadata set.
I fully expect that once I’ve built out and tested some processes, I will discover easier methods, but for now it’s kludge-and-learn time. Persistence and a tolerance for rework are going to come in handy. Progress since the fly turned up in the ointment has been okay, if slightly less linear than the steps below suggest:
Step 1: Finagled a license for the software I’ll need to manipulate the PDF files. The machine I’m using is not the speediest or shiniest but so far it’s grudgingly tolerating the increased workload.
Step 2: Software in hand, I figured out the most basic steps of splitting some of the very large PDFs (200+ pages) into individual items. It’s unlikely that I’ll be splitting more than the most recent 10 years’ worth of files. Bonus round: I’ve got a rough estimate of how much time to allow for minimal file splitting. Tomorrow or next week, I’ll sort out whether any other file processing falls within the project budget.
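The splitting itself happens in the licensed desktop tool, but the underlying idea is simple enough that a scripted version might look roughly like this (an untested sketch; pypdf and the hard-coded page ranges are my assumptions for illustration, not what’s actually running on that grudging machine):

```python
# Rough sketch: split one scanned issue into per-article PDFs.
# Assumes the pypdf library and that the page ranges are already known.
from pypdf import PdfReader, PdfWriter

# Hypothetical 1-based, inclusive page ranges for each article in one issue.
articles = [
    ("v42n1_01_frontmatter", 1, 4),
    ("v42n1_02_first_article", 5, 18),
    ("v42n1_03_second_article", 19, 33),
]

reader = PdfReader("v42n1_full_issue.pdf")
for name, start, end in articles:
    writer = PdfWriter()
    for i in range(start - 1, end):  # pypdf pages are 0-indexed
        writer.add_page(reader.pages[i])
    with open(f"{name}.pdf", "wb") as out:
        writer.write(out)
```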
Step 3: Decided we needed an accurate list of the files we had and a better sense of the state of each group of files. The old DOS skills came back, but only after pushing past the not-quite-as-old UNIX skills. I produced a list. And since it’s a list of everything (including duplicates), it’s massive (~10,000 items). But the shaggy data is now in a spreadsheet and I can bend it to my will.
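(The listing itself was an old-school redirect of a directory command into a text file; a Python equivalent of the same manifest might look something like the sketch below, with the drive path a made-up stand-in.)

```python
# Walk the scan directory and write a simple manifest: one row per file.
import csv
import os

root = r"J:\journal_scans"  # hypothetical location of the scanned files
with open("manifest.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["folder", "filename", "extension", "size_bytes"])
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            extension = os.path.splitext(name)[1].lower()
            writer.writerow([dirpath, name, extension, os.path.getsize(full_path)])
```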
Step 4: Griped about the wildly inconsistent filenames, which make sorting volumes and issues into any meaningful order impossible.
Step 5: Put off dealing with changing any of the actual file names and used the powers of logical find-and-replace steps to come up with a sortable list. Lists are magic.
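The find-and-replace chain boils down to pulling volume and issue numbers out of the messy names and building a sort key. A rough sketch of the idea in Python (the filename patterns here are invented for illustration):

```python
# Build a (volume, issue) sort key from inconsistently named files.
import re

def sort_key(filename):
    # Match shapes like "Vol12_iss3", "v12n3", "v 12 no 3", etc.
    m = re.search(r"v(?:ol)?\.?\s*(\d+)\D+(\d+)", filename, re.IGNORECASE)
    if not m:
        return (9999, 9999, filename)  # unparseable names sort to the bottom
    return (int(m.group(1)), int(m.group(2)), filename)

names = ["Vol12_iss3.pdf", "v2n1.pdf", "v12n1 scan.pdf", "mystery.pdf"]
for name in sorted(names, key=sort_key):
    print(name)
```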
Step 6: Analyzed the magic, if shaggy, list and figured out exactly how many files of what sort we have to deal with. Excel filters are the bomb. Double-checked the lists of issues to see whether the scans, which were done by multiple people over several years, are complete. And, yeah, a small handful are missing. Updated the mental to-do list with the task of finding out whether the missing issues were actually published or whether we’re dealing with a garden-variety journal numbering anomaly. Must add to the project plan tomorrow.
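The completeness check itself is really just a set difference between the issues the journal should have and the issues that turned up in the cleaned-up list. A toy version (the volume and issue counts are made up):

```python
# Compare expected (volume, issue) pairs against what actually got scanned.
expected = {(vol, iss) for vol in range(1, 41) for iss in (1, 2, 3, 4)}

# Pretend these pairs were parsed out of the sortable file list.
scanned = {(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2)}  # ...and so on

missing = sorted(expected - scanned)
print(f"{len(missing)} issues unaccounted for, starting with {missing[:5]}")
```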
Step 7: Started tracking down which of the PDF files contain the annual cumulative index. Once all the indices are in hand, I’m going to try a kludge: OCR the PDFs, grab each index in plain text, and “move” the plain text into Excel. The plain text to Excel part will be painful unless I can figure out how to add separators between title, author, and page number. This step is going to be a pain but will be quicker than re-keying. But first things first: figure out how many of the scanned issues actually captured the indices.
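If the OCR cooperates, the separator problem might yield to a regular expression rather than hand-editing. A speculative sketch, assuming index entries come out looking roughly like “Title, Author .... page” (real OCR output will be messier, and anything that doesn’t match gets flagged for manual cleanup):

```python
# Parse OCR'd index lines into title/author/page columns for the spreadsheet.
import csv
import re

line_pattern = re.compile(r"^(?P<title>.+?),\s+(?P<author>[^,]+?)\s*\.{2,}\s*(?P<page>\d+)$")

with open("index_1987.txt", encoding="utf-8") as src, \
     open("index_1987.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["title", "author", "page"])
    for raw in src:
        m = line_pattern.match(raw.strip())
        if m:
            writer.writerow([m.group("title"), m.group("author"), m.group("page")])
        else:
            writer.writerow([raw.strip(), "", ""])  # didn't parse; fix by hand
```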
At the end of this, I should have enough data to plan out the needed metadata. Before I go too far down that road, I’m going to have some fun testing out what I’ve read about moving spreadsheet data into XML and from there into the OJS platform.
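For the curious, the spreadsheet-to-XML step might look something like the sketch below. The element names are placeholders, not the actual OJS import schema; the point is just the shape of turning one metadata row per article into nested XML before mapping it onto whatever the import plugin actually expects.

```python
# Toy example: turn a per-issue metadata CSV into nested article XML.
import csv
import xml.etree.ElementTree as ET

issue = ET.Element("issue", volume="12", number="3", year="1987")
with open("v12n3_metadata.csv", encoding="utf-8") as src:
    for row in csv.DictReader(src):
        article = ET.SubElement(issue, "article")
        ET.SubElement(article, "title").text = row["title"]
        ET.SubElement(article, "author").text = row["author"]
        ET.SubElement(article, "pages").text = row["page"]
        ET.SubElement(article, "galley", href=row["filename"])

ET.ElementTree(issue).write("v12n3_import.xml", encoding="utf-8", xml_declaration=True)
```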
I spent a lot of time in my previous work incarnation cleaning up datasets and fixing metadata and I’m hoping that I’ll be able to refine my planned kludge. I can brute force the metadata into shape but that’s both tedious and hard on the hands.