Mining the Medical Heritage Library

– All right, great. Thank you so much for the
invite to Peter and the DHLab. I’m here on behalf of Melissa
Grafe and Arthur Belanger at the Medical Heritage Library and the Cushing/Whitney
Medical Library here at Yale. And I’m gonna talk about part of a project I helped out with this past summer and that’s been kind of
humming away on our workstation through the fall and the winter. So my title is Mining the
Medical Heritage Library, Notes towards a lightweight,
open machine vision solution. And here’s a quick agenda. I want to talk about what MHL is. And then the task that
I was mainly working on, which was eliminating false-positive, or junk, non-illustrations that MHL was trying to upload to their Flickr account in their quest to upload centuries of medical illustrations. I'll talk about the solution that I helped come up with, how it uses deep learning techniques, some results, and then some next steps on what I've currently been working with: frameworks for sort of collaborative annotation and forward annotation of images. I want to talk a lot
about the importance of using training data in a
reproducible and transparent way. So a lot of the talks have talked about their sort of motivating challenges, so I thought I'd want to do that as well. The preliminary work of data collection, especially when you're thinking
about small organizations or individual researchers, is very hard. It’s sort of thankless. It doesn’t seem to connect to scholarship or be professionally
legible in many cases. And this is especially true when the data you’re trying to get, you have
to download over a network. For instance, in our case it
existed on Internet Archive. And you don’t already
have it lying around. And then you'll do some training of the model you'll create. You might save it for a while, put it on a model zoo, as they're called. But the data you actually use to train that model is often forgotten or thrown out once you've hit an arbitrary accuracy goal, or more often when you've just left that project behind. So I wanna think about
machine vision solutions for small organizations
and individual researchers, for instance PhD students like myself. What can you do if you need to move fast and you don’t have a bunch
of institutional funding and you maybe even hack something together with one collaborator, by yourself, over the course of a few weeks, using one of these
public digital libraries or resources like the Internet Archive? Now there are a lot of tutorials and models and frameworks out there, more than ever, driven by entrepreneurial forces and industry. And the kind of payment
tiering for all this is still emerging. So I’ll try to share some notes on what I found worked for me. And then, okay, what’s the data? We’ve seen a lot of awesome
natural science data. Well, the data I was working with, with MHL, was books, which are very messy, even more so than
newspapers and periodicals. But they’re also incredibly structured and very semantically rich and
predictable in certain ways that we can take advantage of. And then a thought about book data is whether it’s better
to curate sets of books and have very rich annotations
and structure for them, or to make tools that
work with the existing major digital libraries like Internet Archive,
HathiTrust, and so forth. And then, in this area of extracting and serving historical illustrations, we want to kind of find people that are also interested in this, or who may become interested when they find that this archive may
pertain to their own work. Okay, so what is the
Medical Heritage Library? Well, it was founded in 2010. And it’s a consortium of
different medical libraries and special collections centers, among them Yale, Harvard,
National Library of Medicine, institutions like the Wellcome
in London and in France. And it spans basically the
16th century to the present. And it’s a digital consortium, so it has an org type, but it mainly exists on Internet
Archive as a collection. So here it is and there’s
almost 300,000 volumes now. All of them have been digitized by the various partner libraries, and scanned and put on Internet Archive where anyone can download them as PDFs, or using Internet Archive’s APIs, they can download them in
bulk if they have enough time and enough space.
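Roughly, that kind of bulk download can be scripted with the `internetarchive` Python package. Here is a minimal sketch; the collection identifier, date query, and file pattern are assumptions for illustration rather than the exact values used in the project:

```python
# Minimal sketch: pull items in bulk from the MHL collection on Internet Archive
# with the `internetarchive` package (pip install internetarchive). The collection
# identifier, date query, and glob pattern here are assumptions for illustration.
from internetarchive import search_items, download

query = 'collection:medicalheritagelibrary AND year:[1770 TO 1899]'
for result in search_items(query):
    identifier = result['identifier']
    # Grab just the Abbyy OCR output for now; page images can be fetched the same way.
    download(identifier,
             glob_pattern='*_abbyy.gz',
             destdir='mhl_data',
             ignore_existing=True,
             verbose=True)
```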
So MHL has a few tools. The Countway Library at Harvard has developed a nice full-text searching tool that gives you a bit more capability than Internet Archive itself has. But what MHL approached me
about was their Flickr project. They are extracting these illustrations from these historical
medical treatises, pamphlets, medical journals and posting
them so that the public can view them on Flickr. Whether or not this is
the best place to do it, that’s sort of beside the point. One of the points of this presentation is if you’re given a task like this, how can you best use what’s
out there to make it work? So MHL has Flickr accounts, one of them with all this noisy junk that I'll show you. So these are all images that, in the pipeline they were using, were deemed to be illustrations or to have pictorial content. These are probably from
19th century books. As you can see, a lot of tabular data, a lot of textual data was incorrectly getting picked up as an illustration. And I should just hint that this workflow comes from leveraging the OCR, optical character recognition, data. So when all these books are scanned and the text is attempted to be recognized, that generates some estimates, pretty noisy, of where there might be pictorial regions. So part of the use of deep
learning for this project was to pop on a few classifiers
at the end of that pipeline to eliminate these false
positive, or junk images. And MHL was actually
employing student workers to go through on Flickr
and delete these manually, which was very inefficient. And this is the paradigmatic case for sort of automation through machine learning: eliminating these highly redundant, non-enriching tasks. Here's some more of that junk. So you see they do pick up some of the good stuff, but it's mostly false positives. They're getting a lot of junk. And so then, by the time I was able to add some filtering to it, we have much nicer results. At the end, if we have time, I can scroll through a little
bit of MHL's Flickr feeds. All the remarks that have been made today about ethics and privacy apply even more so to this archive, because you have a lot of … The archive is defined, sort of, by the bodies of the deceased. And so questions of privacy
and ethics definitely come up. Okay, so existing work in this space: in DH, in digital humanities, there's been a lot of work, probably the most work, on periodicals and newspapers. So the Library of Congress has a big project, there's the Europeana newspapers project, the British Library has something, and some nice papers have come out recently. This one by Wevers and Smits uses retrained convolutional
neural networks. And I’m glad all the
details of what those are have been explained so capably before me. And what these teams have arrived at is some basic clustering, or finding of visual similarities and regularities. So Wevers and Smits, working in a Dutch archive, found kind of all the things in the newspaper that were recurrent, like the chess puzzles, or the weather patterns, or these clocks. And Paul Fyfe led a team finding the kind of incredible whiteness, as they called it, of the 19th century newspaper archive, in the way it did its sort of portraits. But there are problems with this. When you're working on
historical books or periodicals as has been already shown before, the standard object recognition
engines are pretty bad. Well, they don’t give you what you want because they are not
attuned to the materiality of the book as a form,
and the different codes and conventions of pictorial
representation within history, within a particular print format. So here’s some of the MHL
data thrown just yesterday into Google, and you get sort of all these valence and emotional scores. Yeah, it got that there are certainly heads. But this doesn't capture anything about, like, the really awesome … I have some more examples from this text, which is on X-rays and stereoscopy. But clearly this sort of
off the shelf solution is not going to cut it. You need to do a bit of
retraining in the domain. Google also fails when
you have text and image, which is sort of the dominant word-image relation with these inline images in the text, within medical history; "Gray's Anatomy", for instance. Actually, the 1830s and '40s are one of the places where you start to see a shift from full-page plate illustrations, copper-plate engravings, things like that, to having smaller images integrated into the text, often
made with wood engravings. So another problem with
projects in this DH or cultural analytics space, is that they often use data
that’s already been annotated, which again, doesn’t really generalize to the experience of that
individual researcher or that small group
that needs to move fast and maybe doesn’t have
that many resources, whether of computing power or money. So you'll see at the bottom, a lot of these newspaper databases have already had their layout annotated with tools like docWorks or Veridian. Doug has done some very clever things, so he knows his way around this. But a lot of the work uses stuff that's already been mapped out. In other words, okay, it says: on this part of the page, at these coordinates, there's a picture; in these parts there's a column of text; here's a headline; so on and so forth. But if you don't have that,
how’re you gonna do it? And the data on Internet Archive does not come with that information. You’re kind of going blind. So I mentioned that the way to cut down on the size of the problem
is to use the OCR data. So here’s a sample kind
of random book from MHL. It’s online on the Internet Archive. And because Internet Archive
is open, and MHL is an extremely open organization, you can download anything you want, the raw images, whatever. Here's what we kind of do on our machine there. So we get those initial estimates of where the OCR thought: okay, this isn't text that I will recognize, this is a picture.
And this has been done by a guy named Kalev Leetaru. I worked this summer to try
to speed it up a little bit, and make it cleaner, and incorporate this sort of filtering step. Leetaru's approach, and that of some other people, has been to say: okay, the way we figure out where our pictures are when we're analyzing historical books is we set all these rules. We say, well, it can't be less than 50 pixels wide, and it can't come at the beginning of the book, and sort of all these heuristics about the size, the aspect ratio of the picture, where it comes in the sequence of page images.
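Just to make that rule-based approach concrete, it amounts to something like the function below; the specific thresholds are made up for illustration, not the actual values used in that pipeline:

```python
# Illustrative only: the kind of hard-coded heuristic filter described above.
# The specific thresholds are assumptions for the sake of example.
def looks_like_real_illustration(width_px, height_px, page_index, total_pages):
    aspect = width_px / max(height_px, 1)
    if width_px < 50 or height_px < 50:                   # too small to be an illustration
        return False
    if page_index < 5 or page_index > total_pages - 5:    # likely front or back matter
        return False
    if aspect < 0.2 or aspect > 5.0:                      # implausibly long and thin
        return False
    return True
```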
And we wanted something more generalizable than that, something that could find book plates, illustrated covers, pictures of unusual size or aspect ratio. And we wanted to filter out the junk. So this sort of leads
me to my first insight, which is to sort of embrace the labeling process itself. So when working with cultural heritage data, especially in the context of these classification tasks, and there's really no other task other than classification, any technical solution is gonna require acts of historical interpretation and comparison. So then, noisy data, like what MHL was facing, is an opportunity, I think, to collaboratively learn about historical ways of organizing and materializing knowledge. It's not a barrier to be overcome through a one-time clever engineering solution, or through hard-coded heuristics. So a corollary: we need
as many people involved in the data labeling and
annotating as possible. Ben Schmidt has made this point in kind of a more NLP-type project, but when you're working with data, and we were working mostly with 19th century data, well, when you're working with historical data and classifiers, oftentimes what you're really looking at is the history of, in many cases, library classification systems themselves, or the way that the various people and the sociology of the
text that produced the book as an object conceived of the division and representation of knowledge. So you’re learning about
historical epistemology as you do this work. Okay, a very schematic view. There’s a bunch of extracted
images from MHL in a cloud, which I can thank Damon for, actually. Here's a little logo from this Diffgram tool, which I'll hopefully show at the end, which provides a kind of way to crowdsource annotation for small teams and individual researchers. And there's a GPU, pointing to the fact that, well, what do we have for MHL? We have a little bit of stuff. We had a workstation with one GPU and we had a few months to run it. So that's kind of the basic setup. So by the numbers, we
started off by kind of doing a long 19th century project. So I said that MHL had about 300K volumes, a lot of them in the 19th century, for obvious reasons having to do with intellectual property. So we ended up doing pretty much the full set of what MHL had for the 19th century, from 1770 to 1899: 123,000 volumes. We extracted 2.2 million regions of interest; that's just a sort of neutral way of putting it, I don't want to call them pictures 'cause in some cases they're not. You know, there are some errors. But they're visual regions of interest. And I'll talk more about how I trained the model, and what I did when I was annotating bounding boxes around visual regions of interest
in historical books and how I want to do that better, and how I want all your input
on how to do that better. Some facts about the speed of this. What did we use for this sort of transfer-learning approach sitting at the end of this pipeline? One of Google's MobileNet convolutional neural networks. So it was a simple multinomial problem, which I'll talk about in a little bit. There are all these different classes of junk, and then there are the different things that you want. And you apply this to the whole page image, then you keep the ones you want, throw away the junk, and then you use a Mask R-CNN, a localizing convolutional neural net, to get the actual regions that you want and extract them from the page.
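As a hedged sketch of that transfer-learning step: take a pretrained MobileNet, freeze the backbone, and retrain only the final layer on a handful of page classes. The framework, class names, and hyperparameters below are assumptions, not the project's actual training code (which, given Google's MobileNet, may well have been in TensorFlow):

```python
# A minimal sketch of the transfer-learning classifier: freeze a pretrained
# MobileNet backbone and retrain only the final layer on a few page classes.
# Framework, class names, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision

PAGE_CLASSES = ['inline_image', 'plate', 'table_or_chart', 'text_only', 'junk']  # illustrative

def build_page_classifier(num_classes=len(PAGE_CLASSES)):
    model = torchvision.models.mobilenet_v2(pretrained=True)
    for param in model.features.parameters():      # freeze the convolutional backbone
        param.requires_grad = False
    # Swap the final layer for our multinomial page-class problem.
    model.classifier[1] = nn.Linear(model.last_channel, num_classes)
    return model

model = build_page_classifier()
optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
```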
And okay, in training: I did all the labeling myself. And I think I did about 3,000 before I got tired for the first stage, which was just saying, like, hey, does this page have an image on it, yes or no? And about 300 for the next stage, which was saying, okay, here's a bunch of pages with pictures on them, draw a rectangle around where they are, and then let the network, which is trained on the COCO dataset, figure out background class versus region-of-interest class.
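And for that localization stage, a common way to set up a COCO-pretrained Mask R-CNN for a two-class problem, background versus visual region of interest, looks like the sketch below. This follows the standard torchvision recipe and is offered as an illustration rather than the project's actual code:

```python
# Sketch of the localization stage: adapt a COCO-pretrained Mask R-CNN so it
# predicts just two classes, background versus visual region of interest.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def build_region_detector(num_classes=2):   # background + region of interest
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
    # Replace the box head so it predicts our two classes instead of COCO's 91.
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    # Replace the mask head to match.
    in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, num_classes)
    return model
```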
Here's a sort of schematic. So here's a fun book by Matthew Baillie, written in the 1790s, one of the first works of pathological anatomy. So depictions of disease,
the etiology of it, there’s a kind of fun
literary history connection. The illustrations for this book, shown here, were in William and John Hunter's museum in London, which was a kind of basis for the house in "Dr. Jekyll and Mr. Hyde", which is kind of cool. Anyway, dig deep on any one book and you'll find an amazing kind of thing. It's almost frustrating how zoomed out this is, but it's a necessary first step. Okay, so great. So through the Internet Archive API we can get this book. It's 392 total pages, you know, JPEGs. And then we use the OCR data to say, let's cut that down. There are only 197 pages with a
likely illustration on them. And here’s the sort of taxonomy of classes that we came up with. So we only want to keep certain classes. Some are junk. We’re not really interested
in tables and charts, although there’s some cool DH projects having to do with tabular
data and diagrams. We only want to keep our inline images and plate images, and they were also interested in book plates, and end papers, and things like that. So now we've cut down on the
space of the problem a lot. So we sort of have these
illustrated pages collected. Now what they wanted to do
with their Flickr project was really to have a kind of crop or bounding box of the images themselves. So that's where the second stage of this network, this Mask R-CNN, came in. So you can now see. And when I was creating the training data I kind of just went: like, if you were to crop this image, what would be the natural human crop of an image in a book? So you want a little bit of white space. I went back and forth, I
wasn’t super consistent on whether to include captions. That’s the kind of like
very nitty gritty question you probably want a group of
people to reach consensus on and then go for it, which is why I’m gonna show Diffgram at the end because it’s a way of better achieving that kind of consensus. So anyway, this is the pipeline. And then you end up with 155
visual regions of interest. And the nice thing about this
is that the Mask R-CNN model can kind of decompose
complicated illustrations that have maybe a couple
different parts to them into the component parts, but it's also able to keep, in many cases, the full diagram. Okay, we've talked about activation maps. I don't need to go over any material on that. Here are some examples of
what was in the training data. So the junk class contained
a lot of scanning artifacts. You see so many fingers and
hands and stuff, it's great. And then the inline image class: we had great accuracy and precision for this. So these, again, are this dominant 19th century sort of mode: the wood engraving integrated into the text of the page. And I'll present a sort of
numbing number of figures here. But this is just to show that
yep, even a simple problem could require a bunch
of different classes. And the key point is that you learn while doing the analysis. Like we got some things for
free doing this project. We now have a model that can predict, oh that’s a page of poetry,
or that’s a page of text. Here’s some stuff from the localization
part of the pipeline. The sort of loss, or what you're aiming at, is a high IoU score, which is intersection over union. So the brown there is the union of the prediction and the ground-truth label, and so forth.
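For reference, IoU for two axis-aligned boxes is just the standard ratio below; this is the textbook definition, not code from the project:

```python
# Worked example of IoU (intersection over union) for axis-aligned boxes
# given as (left, top, right, bottom).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero if the boxes don't intersect).
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - intersection)
    return intersection / union if union else 0.0

# e.g. a predicted crop vs. a hand-drawn ground-truth crop:
# iou((100, 200, 600, 900), (120, 210, 590, 880))  -> roughly 0.9
```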
Here's the tool I did it in. Importantly, this tool just runs on your desktop. So I'm interested in showing Diffgram, and in the time I have remaining, here are some results. And it's a more collaborative tool. Well, what can we show? You can't really show
anything about visual themes or specific objects at this scale. We have millions of images. I know the lab has been working on putting them into PixPlot. But you can show stuff
about the format of a book at scale over a whole century
for all of medical publishing in the West almost. So you find that books
become way more illustrated. So on average a book would have illustrations on about 1% of its pages at the beginning of the century, but by the end it's about 6%. And what's driving that? Well, the blue and the red lines are very closely tied, and the increase is all coming from these inline illustrations. So publishers are figuring out how to integrate illustration into the page. The expensive engraved plates stay the same. And here's a breakdown by city.
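The aggregation behind a trend line like that is simple once every page has a predicted class; a hedged sketch, with made-up column names, might look like this:

```python
# Hedged sketch of the aggregation behind that trend line: one row per page,
# with the book's publication year and the classifier's label. Column names
# are made up for illustration.
import pandas as pd

pages = pd.read_csv('page_predictions.csv')   # assumed columns: book_id, year, page_class

ILLUSTRATED = {'inline_image', 'plate'}
pages['illustrated'] = pages['page_class'].isin(ILLUSTRATED)

# Share of illustrated pages per book, then the mean of that share per year.
per_book = pages.groupby(['year', 'book_id'])['illustrated'].mean()
per_year = per_book.groupby(level='year').mean()
print(per_year.head())
```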
So okay, as I finish up: what are we moving towards next with this MHL project? Well, the age of photography at scale. And we're gonna try to have
more focused experiments on distinguishing different
types of photographic processes, considering more thoroughly these legal and ethical questions, thinking about advertising. Advertising starts to
come into these books. And then representations of disease. Okay, so I just wanted to
show you what I’m using now, which is called Diffgram, which uses a forward annotation. Oh yeah, good. This is just a picture from Diffgram which is good. So okay, so Diffgram allows
you to collaboratively label. And you can get some nice
hierarchical labeling. So we’re just in this unit of the figure. So that includes the capture and it includes the whole thing. But a lot of books in the first
decades of the 19th century have stereoscopic views, so you have left view and the right view and you use the viewer. So that in itself is a
very interesting category. So I label the two as a
pair, as a stereoscopic pair, and then I also label
each of these as an extra. And that’s the kind of
collaborative sort of nested hierarchical labeling that you can only do with
one of these new frameworks. The cool thing about Diffgram
is that as you label, it starts to propose
better labels for you. So I’m interested in what’s
out there in the space. And I’d love to talk after this. But thank you so much for your time. – [Moderator] Thank you, very much. (audience applauds) (gentle music)
