PhD Tools: cb2bib

Showing posts with label cb2bib. Show all posts

02/08/2012

Adding bzr files through the cb2bib

I recently described my referencing process where I simultaneously hold reference pdf files in a version control system and keep track of their details in bibtex file. I have also described a problem I've been having with my version control system of choice, Bazaar.

Based on these I decided that there might be a better (more automated) way of adding my references. After a very useful email exchange with the author of cb2bib they confirmed that this should be possible. Now, after a day of messing around and learning about various command-line tools that I hadn't used before, I think I've got it working. Here's how...

First I created a new batch file that I called "bzrAddRef.bat". Here are its contents:

@echo OFF

rem --------------------------------------------------------------------------
rem cb2Bib Tools addition by J. Welford
rem --------------------------------------------------------------------------

echo bzrAddRef:
echo cb2Bib script for adding BibTeX files to a Bazaar repository
echo.
echo Using sed and xargs utilities from:
echo http://gnuwin32.sourceforge.net
echo.
echo Path below may need changing to be within the repository
echo.

echo Processing:
echo.
cd "C:\Users\welf\Documents\literatureReview"
sed -n -e "s|.*file.*=.*{\(.*\)}.*|\"\1\"|p" %1 > bzrRefs.tmp
xargs bzr add <bzrRefs.tmp
del bzrRefs.tmp

Only the last 4 lines are really important. The first one sets the working directory, I don't think it matters what you use as long as you have write access and it is within the repository you want to add to. The next line runs through the current bibtex file and extracts the location of all the references to a temporary file. The next line adds all these files to Bazaar repository. The final line deletes the temporary file. (maybe I could have used a pipe between commands so that the temporary file was not required?)

Some special commands are used that will need to be installed, what you need are: sed.exe (and its dependencies: regex2.dll, libintl3.dll, libiconv2.dll) and xargs.exe.

Within the settings of cb2bib this batch file can now be pointed at under the "Configure BibTeX" - "External BibTeX Postprocessing" - "Command:" section. Once that is done simply hitting "Alt"+"p" in the cb2bib window should run the batch file and add all the references to version control.

I hope the helps someone! (I presume it could be altered for other version control systems or be called from other applications.)

31/07/2012

My referencing process

I suspect everyone has a subtly different way of doing this, depending on the tools they prefer and the way they like to work. I thought I would document my process as I think it's pretty efficient and has some real advantages, I'm sure there is still plenty of room for improvement though.

Overall aim

As an academic I often need to refer to work done by others, this normally done by noting their published work as a list of references or bibliography at the end of my documents. The most basic way of achieving this would be to have a stack of published books and papers in my drawer that I can refer to and then type in the reference details at the end of the each document I want to write. I'm sure that plenty of people do work this way, but from my point of view I see it as pretty inefficient (lots of printing, lots of sorting, lots of typing, not very portable, an awful pain to change reference styles, etc).

I manage my references entirely in software using a few specific tools. I've mentioned most of them in other posts but I'll go into a bit more detail here on the actual process I use.

I'm going to split referencing into two separate processes. Firstly, as I'm researching a topic, I tend to gather references to get an idea what I'm doing. Secondly, when I come to document my work I need to search and cite the references I've found.

Tools

Google Scholar is generally my primary resource for finding papers and documents. I also rely on standard Google searches a massive amount and I have a Google alert set up to email me when a few key phrases appear in new articles added to the web.

IEEExplore generally has most of the published work (in terms of papers, Journal articles, etc) that I need to refer to. The University has a subscription to this that allows me to download what I need (otherwise I'd have to pay!).

Pdf format is generally how almost all papers are delivered and the format that I keep them in. I use Foxit reader to read pdf documents. I keep all my reference documents in one big folder rather than worrying about any kind of complex filing system.

I use Bazaar to version control my folder of pdf documents. I've previously discussed how this works and how it allows me to work between different computers, even without installing any software on them.

I use a tool called cb2bib to maintain a list of references in bibtex format. I started off just using this to add references to my bibtex file, but I've found it to also be really good for browsing references and citing whilst writing a document. I changed some of the default setup to help it retrieve data from the net.

I use Latex to typeset most of my work. Within this I can simply point it at the bibtex file for all the details of the references. I have previously mentioned how to use a bibtex file that is not in the same place as the rest latex document.

Gathering references

Search for the document that I need to find using Google, Google Scholar or general web browsing.
Open and read the document to see if it looks relevant and useful. Assuming that it does...
Download the document to my big folder of "third party" references. I tend to use the full title of the work as the save name of the document - this can lead to quite long file names, but it makes it a lot easier to find things!
Add the saved file to my bazaar version control system. This only takes a couple of clicks through tortoiseBzr menus in windows explorer.
Add the document to my bibtex file using cb2bib. This is a pretty straightforward process:
1. Open cb2bib, I have a keyboard shortcut setup for this (it should also remember what bibtex file is being used)
2. Click "import from pdf file"
3. Click "select files"
4. Select pdf files saved previously (hold ctrl to select a bunch at once)
5. Click "process"
6. The software will try to extract as much info as possible from the pdf file, this probably won't be enough so...
7. Click Network query to retrieve all the info about the file from the web (this usually works fine, but it's worth checking the results, you may need to give it the right title to start it off)
8. Click save to add the reference to the bibtex file
The changed bibtex file and added references will need to be "commited" to the version control repository.

UPDATE: I have added now made a batch file that effectively takes the place of step 4 and occurs after step 5 - it can be run after a whole set of references have been input through cd2bib and adds them all to my version control repository using "Alt"+"p".

Citing references

With a Latex document that has a pointer to the bibtex file within it.
Open cb2bib citer (I use a keyboard shortcut) and select the reference(s) required:
1. [optional] Select the way I want my reference list displayed by pressing "a" (author), "j" (journal), "t" (title) or "y" (year). I find author is usually best.
2. [optional] Filter the reference list by pressing "f" and then typing what you want to search for ("d" clears the search)
3. Click on a chosen reference
4. [optional] Press "o" to open the reference and read it
5. Press "enter" to cite the reference, a small pen marker will appear next to it (in author view it will appear next to each author for the same paper). Multiple references may be cited by selecting them and pressing "enter". "delete" clears all the selected references.
6. Press "c" once all the references for citing are selected, this will close the citer window and copy the latex text for the references to the clipboard.
Paste the text into the latex file to include the references at that point in the document.

Although those might seem like a lot of steps it's really pretty straightforward once you get used to it. The only real difficulty is remembering the keyboard commands for cb2bib citer. With this setup I can take all my references with me between machines (even between Linux and Windows) and use the same process everywhere.

27/07/2011

Making a bibtex file from a folder of pdf files

The issue
As I'm going to be writing some big documents with lots of references I'd be a fool to try to manage these manually, I therefore needed to pick a reference management piece of software. After some browsing I settled on JabRef because: it's free, it's open source, it's lightweight, it's cross-platform and it handles bibtex format natively (which is what I need for it to integrate with latex). It should also link nicely into the Sciplore mind mapping software which I'm using (more about that some other time).

JabRef is basically a database management tool for references that stores its database in bibtex format. It looks like it will work rather well, but unfortunately my first stumbling block is that I already have a folder full of my references in pdf format (~200). This means that I'm immediately faced with the big task of going through and adding the details of each pdf individually. There must be a better way...

Someone else asked the same question here. The answer seemed to be that there was no easy way in JabRef, but it could be done in some other reference management software - such as Mendeley. So I could install that as well and export from there to use JabRef, that seemed like a pain though, especially as you need log in details and all sorts for Mendeley.

The solution
Somewhere else cb2Bib was suggested. This looks like an awesome piece of software, almost to the point that I could use it instead of JabRef, although I don't think it does quite the same job. It's designed as a bibtex database manager, however it is more tailored towards reference entry than editing or final use (e.g. citations) - although it can do this. Its method of adding a new reference is based on what's currently in the clipboard - thats whatever you most recently 'cut' or 'copied' in your operating system. This can either be a piece of text or a pdf file.

Files from the system can also be queued up to be added to the clipboard for addition to the bibtex database - in this manner a folders worth of pdf files can be added. Once the file is in the clipboard the software interrogates it to try to extract the right details for the bibtex reference entry. It is also able to do some other clever things like search the web and find a web reference for it that matches only one of the pieces of data it has extracted. There is also the option to manually edit the fields or to set off a whole run of files to add automatically.

My implementation
In practice the software took a little while to get used to; the buttons aren't in quite the locations I'd expect, there seem to be about 3 different windows that are independent but interrelated and the method of specifying a bibtex file and then successively saving additions to it felt a little odd (rather than running through to create a file and then saving it all at once). But once I was used to it at that level it all worked.

When I came to actually try to add all of my pre-saved pdfs however, I hit problems. Whilst automatic extraction usually managed to pull out a few nuggets of useful data, it rarely found enough for a complete entry. Hitting the button to search the web didn't seem to give much assistance. So it was time to dig a little deeper.

Probing through the website there is quite a lot of useful information on how to configure the software to do what you want. What I needed to do was look into where was being searched on the web for my articles. This is all setup in a configuration file located at:
C:\Program Files\cb2bib\data\netqinf.txt (windows)
or
/usr/share/cb2bib/data/netqinf.txt (linux) (you'll need permissions or to be root to edit)

Wading into there you can find out where is being searched and in what order. What would have been ideal for me would have been a search of the IEEE Xplore site, as that would have turned up most of my papers. Unfortunately it was not in there. Second best was google scholar, sitting at the bottom of the list of options. The documentation in the file wasn't brilliant, but with a bit of trial and error I was able to work out what was going on.

The major change I made to the file was to add this at the top of the queries list:

# QUERY INFO FOR Google Scholar
journal=
query=http://scholar.google.com/scholar?hl=en&lr=&ie=UTF-8&q=<<title>>&btnG=Search
capture_from_query=info:(.+):scholar
referenceurl_prefix=http://scholar.google.com/scholar.bib?hl=en&lr=&ie=UTF-8&q=info:
referenceurl_sufix=:scholar.google.com/&output=citation&oe=ASCII&oi=citation
pdfurl_prefix=
pdfurl_sufix=
action=

journal=
query=http://scholar.google.com/scholar?hl=en&lr=&ie=UTF-8&q=<<excerpt>>&btnG=Search
capture_from_query=info:(.+):scholar
referenceurl_prefix=http://scholar.google.com/scholar.bib?hl=en&lr=&ie=UTF-8&q=info:
referenceurl_sufix=:scholar.google.com/&output=citation&oe=ASCII&oi=citation
pdfurl_prefix=
pdfurl_sufix=
action=

The important changes here are the <<title>> and <<excerpt>> search strings, and the change from capture_from_query=info:(\w+):scholar in the existing scholar searches to capture_from_query=info:(.+):scholar in my search. I'm not too sure what the latter change did, but its effect was that it found the details - where previously it was often missing them!

The other change I made was to untick the option "Set 'title' in double braces" box in the configuration window. After I'd made these changes it worked a lot more consistently.

Some of the time it still pulled out the wrong details if it mis-extracted the article title, however I'd named all my pdfs with the title of the paper, therefore it was simply a case of copying and pasting the filename into the title field and rerunning. It would have been really nice to be able to use the title of my pdf as part of the search but unfortunately I couldn't find a way of doing that.

The only other issue I'm having is that although cb2bib adds in the link to the pdf file, JabRef wont understand it as it uses a very slightly different bibtex format for it. The cb2bib format seems to be:

file = {location}

whereas the JabRef format seems to be:

file = {description:location:type}

There is a comment here by a Mendeley admin that suggests that there is no prescribed format for this aspect of a bibtex file, so I guess it's to be expected. I should be able to work around it with a bit of clever find/replace, but it's an annoyance.
ACTUALLY - this seems to be working under windows! It looks like a different version of JabRef has gotten around this issue.

UPDATE: After a couple of months of getting used to cb2bib and using it to produce a document I'm not really finding the need to use JabRef at all! The 'citer' facility of cb2bib is actually really good.

UPDATE: I hadn't previously gotten round to extracting from IEEE Xplore, as almost everything is on Google Scholar. However I've just tried to set it up and found that the IEEE pages use javascript buttons to produce the citation. This makes it difficult to fully automate.

If you add the following to netqinf.txt then it should search IEEE Xplore for the title, you can then manually click the "download citation" button, select BibTeX format and then copy the BibTeX citation into cb2bib:

# QUERY INFO FOR IEEEXplore
journal=
query=http://ieeexplore.ieee.org/search/searchresult.jsp?newsearch=true&queryText=<<title>>&x=35&y=7
capture_from_query=arnumber=(\d+)&contentType
referenceurl_prefix=http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=
referenceurl_sufix=
pdfurl_prefix=
pdfurl_sufix=
action=browse_referenceurl