ETDs as the data-curation wedge?

Many doctoral institutions now accept and archive (or are planning to accept and archive) theses and dissertations electronically. Virginia Tech pioneered this quite some time ago, and it has caught on slowly but steadily for reasons of cost, convenience, access, and necessity.

Necessity? Afraid so. Some theses and dissertations are honest digital artifacts, unable to be faithfully represented in ink on paper or in other analog fashion. Others might be flattened into analog, but that wouldn't be their (or their author's) preference. Still others contain digital artifacts of various sorts. Source code. Multimedia. Data.

ETDs don't pose any special digital-preservation challenges over and above the usual. (I got into an exchange on Twitter yesterday about a dissertation presented with a web content-management system, raising the issue of the artifact's sustainability given the CMS dependency. But any CMS with any content involves those same issues.) What they do present, given their popularity among faculty, students, administrators, and even (some) librarians, is an opportunity.

Institutions consider dissertations to be vital institutional history. (Master's theses—well, that varies from institution to institution, and even within institutions.) There can be no question of throwing away a dissertation simply because it's digital; an institution receiving digital dissertations has no choice but to do something about them.

Now, a lot of institutions, it seems to me, aren't doing much or are doing the wrong things. (If your institution has an unaudited pile of CD-ROMs, that's the wrong thing. Perfectly understandable given the circumstances, but still wrong in today's technology environment.) This shouldn't be surprising or terrifying, nor is it an excuse to excoriate the institutions. We all do our best with what we have and what we know at the time.

However… the tools now exist for us to step up our digital-preservation game, and ETDs give us an unassailable, mission-critical reason to. Remember, the problems aren't specific to ETDs, so if we solve them for ETDs, we've solved them for a wide swathe of other kinds of documents and data as well.
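To make "audited" a little more concrete: one of the smallest useful audits is a fixity check, which records a checksum for every file at deposit time and later confirms that nothing has gone missing or silently changed. What follows is only a sketch of that idea in Python, not a description of any particular repository's workflow. The function names and the manifest layout (a plain mapping of relative path to SHA-256 digest) are assumptions made up for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative POSIX path) to its digest.

    Hypothetical manifest format, chosen only for this illustration.
    """
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the files now under root against a stored manifest."""
    current = build_manifest(root)
    return {
        # Listed in the manifest but no longer on disk.
        "missing": sorted(set(manifest) - set(current)),
        # Still on disk, but the bytes no longer match.
        "changed": sorted(
            name for name in manifest
            if name in current and current[name] != manifest[name]
        ),
        # On disk but never recorded in the manifest.
        "unexpected": sorted(set(current) - set(manifest)),
    }
```

In use, build_manifest would run once when a dissertation and its supplementary files arrive, with the result stored somewhere other than the disk being audited; audit would then run on a schedule, and any non-empty list in its report is a reason to go look.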

Perhaps instead of spinning jargon-laden webs of words such as "cyberinfrastructure," we should start with an easy-to-recognize problem that we already know we have.

Virginia Tech might require an ETD for the university, but in my department they still wanted a hard copy. And my advisor wanted 2 copies, 1 for him and 1 for the lab.

That was expensive and painful :p

I'm disappointed, but not surprised. There is still a widespread belief in academia that digital files are not to be trusted, and paper and microfilm are the only appropriate archival formats.

This is also an unfortunate side-effect of faculty governance; nobody can tell your department NOT to require print.

Sorry that happened to you. I agree it was excessive and unnecessary.

At the University of Southern California, the libraries no longer archive print copies of TDs. Starting in 2006, we decided to archive only digital copies. Most of these are PDF files, but every once in a while we get a movie file, image collection, PPT, or even a .exe.

We catalog them according to Dublin Core and ISBD standards and keep them all on locally managed servers. Considering we receive just under 1k TDs a year, this is a huge space saver. Of course, as JohnV mentioned above, some of the individual schools still require print copies for their own collections.

McGill University has now mandated that all theses be submitted electronically to the Graduate & Postdoctoral Studies Office. The GPSO then sends them on to the library to have the PDF/As (and soon - hopefully - data sets, a/v material, what have you) deposited in our institutional repository. At the same time, the library is actively digitizing earlier theses and uploading those to the IR.
The ultimate goal is to have all graduate, doctoral, and honours undergraduate theses (the last when sponsored by a supervising faculty member) available in our IR - eScholarship@McGill.

John, Amy -- any thoughts of taking the data work you're doing beyond ETDs?

Dorothea - indeed! We're only at the early evaluation stages, but the dream is to create a VRE linked to the IR: VRE for live data, IR for archiving. Ideally we'd look into one system to manage the whole thing (VRE-IR shift), tag data with researcher info, etc.

Dorothea, I'm very happy to find you again after suspending CavLec, which I always enjoyed. Your blog post mentions data in ETDs but the other comments seem to relate mostly to ETDs themselves, not the data. I strongly agree with you that ETDs (and theses in general) are an opportunity for data. I'm particularly concerned about the envelope glued into the back cover of a printed thesis, containing disks or CDs of data. Or the links to the candidate's web site, sometimes based on a username that will shortly be deleted automatically once the candidate graduates and leaves.

Many theses come with data, often simple things like Excel spreadsheets, or videos or sound recordings, or SPSS datasets, or... Getting to grips with these, rather than just leaving them as a pile of CD-ROMs, is definitely the Right Thing, as you suggest. Just trying, and sharing the results, would be extremely helpful. Finding somewhere reasonably permanent (even in the repository itself), and giving it a URL to link from the thesis, and some independent metadata, would be great. So, thanks for this post, and I guess I'll need to find some time to have a look at your local ETD repository to see how you've done it!

By Chris Rusbridge (not verified) on 29 Sep 2009 #permalink

We haven't done it yet, Chris! We're still very much in the planning stages where I am... and also amidships of a total technology-platform shift.

I can say, though, that the possibility of archiving data alongside theses immediately captured the goodwill of the engineering professor on the ETD committee. We've had the experience of sending those CDs merrily off to ProQuest, where they appear to... drop into an oubliette. We're less than pleased about that.

The backlog of CDs is a problem in its own right. I'd like to go back and deal with it, but we'll see how it goes... it's always easier to do the right thing going forward than retrospectively.

(Which is a topic that may deserve a post in its own right!)