27/07/2011

Making a bibtex file from a folder of pdf files

The issue
As I'm going to be writing some big documents with lots of references I'd be a fool to try to manage these manually, I therefore needed to pick a reference management piece of software. After some browsing I settled on JabRef because: it's free, it's open source, it's lightweight, it's cross-platform and it handles bibtex format natively (which is what I need for it to integrate with latex). It should also link nicely into the Sciplore mind mapping software which I'm using (more about that some other time).

JabRef is basically a database management tool for references that stores its database in bibtex format. It  looks like it will work rather well, but unfortunately my first stumbling block is that I already have a folder full of my references in pdf format (~200). This means that I'm immediately faced with the big task of going through and adding the details of each pdf individually. There must be a better way...

Someone else asked the same question here. The answer seemed to be that there was no easy way in JabRef, but it could be done in some other reference management software - such as Mendeley. So I could install that as well and export from there to use JabRef, that seemed like a pain though, especially as you need log in details and all sorts for Mendeley.

The solution
Somewhere else cb2Bib was suggested. This looks like an awesome piece of software, almost to the point that I could use it instead of JabRef, although I don't think it does quite the same job. It's designed as a bibtex database manager, however it is more tailored towards reference entry than editing or final use (e.g. citations) - although it can do this. Its method of adding a new reference is based on what's currently in the clipboard - thats whatever you most recently 'cut' or 'copied' in your operating system. This can either be a piece of text or a pdf file.

Files from the system can also be queued up to be added to the clipboard for addition to the bibtex database - in this manner a folders worth of pdf files can be added. Once the file is in the clipboard the software interrogates it to try to extract the right details for the bibtex reference entry. It is also able to do some other clever things like search the web and find a web reference for it that matches only one of the pieces of data it has extracted. There is also the option to manually edit the fields or to set off a whole run of files to add automatically.

My implementation
In practice the software took a little while to get used to; the buttons aren't in quite the locations I'd expect, there seem to be about 3 different windows that are independent but interrelated and the method of specifying a bibtex file and then successively saving additions to it felt a little odd (rather than running through to create a file and then saving it all at once). But once I was used to it at that level it all worked.

When I came to actually try to add all of my pre-saved pdfs however, I hit problems. Whilst automatic extraction usually managed to pull out a few nuggets of useful data, it rarely found enough for a complete entry. Hitting the button to search the web didn't seem to give much assistance. So it was time to dig a little deeper.

Probing through the website there is quite a lot of useful information on how to configure the software to do what you want. What I needed to do was look into where was being searched on the web for my articles. This is all setup in a configuration file located at:
C:\Program Files\cb2bib\data\netqinf.txt (windows)
or
/usr/share/cb2bib/data/netqinf.txt (linux) (you'll need permissions or to be root to edit)

Wading into there you can find out where is being searched and in what order. What would have been ideal for me would have been a search of the IEEE Xplore site, as that would have turned up most of my papers. Unfortunately it was not in there. Second best was google scholar, sitting at the bottom of the list of options. The documentation in the file wasn't brilliant, but with a bit of trial and error I was able to work out what was going on.

The major change I made to the file was to add this at the top of the queries list:

# QUERY INFO FOR Google Scholar
journal=
query=http://scholar.google.com/scholar?hl=en&lr=&ie=UTF-8&q=<<title>>&btnG=Search
capture_from_query=info:(.+):scholar
referenceurl_prefix=http://scholar.google.com/scholar.bib?hl=en&lr=&ie=UTF-8&q=info:
referenceurl_sufix=:scholar.google.com/&output=citation&oe=ASCII&oi=citation
pdfurl_prefix=
pdfurl_sufix=
action=


journal=
query=http://scholar.google.com/scholar?hl=en&lr=&ie=UTF-8&q=<<excerpt>>&btnG=Search
capture_from_query=info:(.+):scholar
referenceurl_prefix=http://scholar.google.com/scholar.bib?hl=en&lr=&ie=UTF-8&q=info:
referenceurl_sufix=:scholar.google.com/&output=citation&oe=ASCII&oi=citation
pdfurl_prefix=
pdfurl_sufix=
action=

The important changes here are the <<title>> and <<excerpt>> search strings, and the change from capture_from_query=info:(\w+):scholar in the existing scholar searches to capture_from_query=info:(.+):scholar in my search. I'm not too sure what the latter change did, but its effect was that it found the details - where previously it was often missing them!

The other change I made was to untick the option "Set 'title' in double braces" box in the configuration window. After I'd made these changes it worked a lot more consistently.

Some of the time it still pulled out the wrong details if it mis-extracted the article title, however I'd named all my pdfs with the title of the paper, therefore it was simply a case of copying and pasting the filename into the title field and rerunning. It would have been really nice to be able to use the title of my pdf as part of the search but unfortunately I couldn't find a way of doing that.

The only other issue I'm having is that although cb2bib adds in the link to the pdf file, JabRef wont understand it as it uses a very slightly different bibtex format for it. The cb2bib format seems to be:
file = {location}
whereas the JabRef format seems to be:
file = {description:location:type}
There is a comment here by a Mendeley admin that suggests that there is no prescribed format for this aspect of a bibtex file, so I guess it's to be expected. I should be able to work around it with a bit of clever find/replace, but it's an annoyance.
ACTUALLY - this seems to be working under windows! It looks like a different version of JabRef has gotten around this issue.

UPDATE: After a couple of months of getting used to cb2bib and using it to produce a document I'm not really finding the need to use JabRef at all! The 'citer' facility of cb2bib is actually really good.

UPDATE: I hadn't previously gotten round to extracting from IEEE Xplore, as almost everything is on Google Scholar. However I've just tried to set it up and found that the IEEE pages use javascript buttons to produce the citation. This makes it difficult to fully automate.


If you add the following to netqinf.txt then it should search IEEE Xplore for the title, you can then manually click the "download citation" button, select BibTeX format and then copy the BibTeX citation into cb2bib:

# QUERY INFO FOR IEEEXplore
journal=
query=http://ieeexplore.ieee.org/search/searchresult.jsp?newsearch=true&queryText=<<title>>&x=35&y=7
capture_from_query=arnumber=(\d+)&contentType
referenceurl_prefix=http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=
referenceurl_sufix=
pdfurl_prefix=
pdfurl_sufix=
action=browse_referenceurl

6 comments:

  1. Hi, I am also trying to figure out my way around cb2bib.I am also facing a problem regarding extracting information from IEEE explore, and Science Direct.You have given the changes you brought about regarding query from Google Scholar, can you please post the changes required to extract information from IEEE explore??

    Thanks,
    Sarthak Ghosh

    ReplyDelete
  2. Sarthak, I've added an update that may help, although it's not as complete a solution as I would like.

    ReplyDelete
  3. Hi, I like your workflow and the description of cb2bib and I may change from Mendeley. I just wonder how I can remove duplicates in the .bib file with cb2bib? Any idea?

    Regards, PHilipp

    ReplyDelete
    Replies
    1. Philipp,
      I'm not sure that there is a specific system for detecting/removing duplicates within cb2bib, however I would imagine it would be easy to spot and remove them manually. When you are viewing filtered results pressing 'e' allows you to directly edit the bib file and delete any you don't need.
      I hope the switch from Mendeley goes smoothly if you do decide to do this.

      Delete
  4. 1. Install Mendeley (It is Free).
    2. Open it.
    3. Select all PDF files.
    4. Drag and drop all the files into Mendeley Interface
    5. It is done, It will sort all your references.
    6. You can Export .bib file now and use it in Jabref, Lyx or Latex.

    ReplyDelete
    Replies
    1. Out of interest I've just tried this. It is indeed as easy as you suggest, however some comments:
      1. Mendeley is a larger program that allows you to do a whole load of other stuff including having an online profile and whatnot - this may be what you want or it might not.
      2. The automatically extracted details for references that appeared in my .bib file do not seem to be as complete as those provided by the technique above.
      3. The method of searching and citing within a LaTeX document is similar, but maybe not quite as efficient in terms of number of key presses?

      Thanks for the suggestion though.

      Delete