PhD Tools: version control

Showing posts with label version control. Show all posts

30/05/2013

Thesis mark I

I'll be using LaTeX for the writing, in whatever document template I'm given (or have to make to fulfill my departments guidelines). I'll try to keep it split into relatively small chunks, but without going to too many levels of subheading (4 max). I plan to keep each chapter in its own subdirectory and include it using the \include command. Within chapters I will break things down again using \input commands to allow me to keep sections and subsections in their own individual tex files.

I'll try to use hyperlinks where possible using \usepackage{hyperref}, using the colorlinks option to avoid the default ugly boxes. By using a subtle colour for the links they shouldn't be too jarring in a printed report (where they will be no use) but obvious enough when read on a computer.
As previously discussed in this blog, I'll try to include figures from matlab and simulink as vector images, stored wherever they are generated. I'll also use a single bibtex reference file for all my references.
As with other work, ongoing revision of the thesis will be controlled using Bazaar. I can include the version used to produce output files within them using vc.tex, which is pretty neat. That way when I have multiple versions printed out and sent to people for comments I hopefully shouldn't lose track of which version they're commenting on!

Latex to Kindle?

Thesis mark II

Latex to html?

html to chd?

I never really finished writing this post, but it gives a flavour of what I was thinking a little while ago when I started writing in earnest!

02/08/2012

Adding bzr files through the cb2bib

I recently described my referencing process where I simultaneously hold reference pdf files in a version control system and keep track of their details in bibtex file. I have also described a problem I've been having with my version control system of choice, Bazaar.

Based on these I decided that there might be a better (more automated) way of adding my references. After a very useful email exchange with the author of cb2bib they confirmed that this should be possible. Now, after a day of messing around and learning about various command-line tools that I hadn't used before, I think I've got it working. Here's how...

First I created a new batch file that I called "bzrAddRef.bat". Here are its contents:

@echo OFF

rem --------------------------------------------------------------------------
rem cb2Bib Tools addition by J. Welford
rem --------------------------------------------------------------------------

echo bzrAddRef:
echo cb2Bib script for adding BibTeX files to a Bazaar repository
echo.
echo Using sed and xargs utilities from:
echo http://gnuwin32.sourceforge.net
echo.
echo Path below may need changing to be within the repository
echo.

echo Processing:
echo.
cd "C:\Users\welf\Documents\literatureReview"
sed -n -e "s|.*file.*=.*{\(.*\)}.*|\"\1\"|p" %1 > bzrRefs.tmp
xargs bzr add <bzrRefs.tmp
del bzrRefs.tmp

Only the last 4 lines are really important. The first one sets the working directory, I don't think it matters what you use as long as you have write access and it is within the repository you want to add to. The next line runs through the current bibtex file and extracts the location of all the references to a temporary file. The next line adds all these files to Bazaar repository. The final line deletes the temporary file. (maybe I could have used a pipe between commands so that the temporary file was not required?)

Some special commands are used that will need to be installed, what you need are: sed.exe (and its dependencies: regex2.dll, libintl3.dll, libiconv2.dll) and xargs.exe.

Within the settings of cb2bib this batch file can now be pointed at under the "Configure BibTeX" - "External BibTeX Postprocessing" - "Command:" section. Once that is done simply hitting "Alt"+"p" in the cb2bib window should run the batch file and add all the references to version control.

I hope the helps someone! (I presume it could be altered for other version control systems or be called from other applications.)

31/07/2012

My referencing process

I suspect everyone has a subtly different way of doing this, depending on the tools they prefer and the way they like to work. I thought I would document my process as I think it's pretty efficient and has some real advantages, I'm sure there is still plenty of room for improvement though.

Overall aim

As an academic I often need to refer to work done by others, this normally done by noting their published work as a list of references or bibliography at the end of my documents. The most basic way of achieving this would be to have a stack of published books and papers in my drawer that I can refer to and then type in the reference details at the end of the each document I want to write. I'm sure that plenty of people do work this way, but from my point of view I see it as pretty inefficient (lots of printing, lots of sorting, lots of typing, not very portable, an awful pain to change reference styles, etc).

I manage my references entirely in software using a few specific tools. I've mentioned most of them in other posts but I'll go into a bit more detail here on the actual process I use.

I'm going to split referencing into two separate processes. Firstly, as I'm researching a topic, I tend to gather references to get an idea what I'm doing. Secondly, when I come to document my work I need to search and cite the references I've found.

Tools

Google Scholar is generally my primary resource for finding papers and documents. I also rely on standard Google searches a massive amount and I have a Google alert set up to email me when a few key phrases appear in new articles added to the web.

IEEExplore generally has most of the published work (in terms of papers, Journal articles, etc) that I need to refer to. The University has a subscription to this that allows me to download what I need (otherwise I'd have to pay!).

Pdf format is generally how almost all papers are delivered and the format that I keep them in. I use Foxit reader to read pdf documents. I keep all my reference documents in one big folder rather than worrying about any kind of complex filing system.

I use Bazaar to version control my folder of pdf documents. I've previously discussed how this works and how it allows me to work between different computers, even without installing any software on them.

I use a tool called cb2bib to maintain a list of references in bibtex format. I started off just using this to add references to my bibtex file, but I've found it to also be really good for browsing references and citing whilst writing a document. I changed some of the default setup to help it retrieve data from the net.

I use Latex to typeset most of my work. Within this I can simply point it at the bibtex file for all the details of the references. I have previously mentioned how to use a bibtex file that is not in the same place as the rest latex document.

Gathering references

Search for the document that I need to find using Google, Google Scholar or general web browsing.
Open and read the document to see if it looks relevant and useful. Assuming that it does...
Download the document to my big folder of "third party" references. I tend to use the full title of the work as the save name of the document - this can lead to quite long file names, but it makes it a lot easier to find things!
Add the saved file to my bazaar version control system. This only takes a couple of clicks through tortoiseBzr menus in windows explorer.
Add the document to my bibtex file using cb2bib. This is a pretty straightforward process:
1. Open cb2bib, I have a keyboard shortcut setup for this (it should also remember what bibtex file is being used)
2. Click "import from pdf file"
3. Click "select files"
4. Select pdf files saved previously (hold ctrl to select a bunch at once)
5. Click "process"
6. The software will try to extract as much info as possible from the pdf file, this probably won't be enough so...
7. Click Network query to retrieve all the info about the file from the web (this usually works fine, but it's worth checking the results, you may need to give it the right title to start it off)
8. Click save to add the reference to the bibtex file
The changed bibtex file and added references will need to be "commited" to the version control repository.

UPDATE: I have added now made a batch file that effectively takes the place of step 4 and occurs after step 5 - it can be run after a whole set of references have been input through cd2bib and adds them all to my version control repository using "Alt"+"p".

Citing references

With a Latex document that has a pointer to the bibtex file within it.
Open cb2bib citer (I use a keyboard shortcut) and select the reference(s) required:
1. [optional] Select the way I want my reference list displayed by pressing "a" (author), "j" (journal), "t" (title) or "y" (year). I find author is usually best.
2. [optional] Filter the reference list by pressing "f" and then typing what you want to search for ("d" clears the search)
3. Click on a chosen reference
4. [optional] Press "o" to open the reference and read it
5. Press "enter" to cite the reference, a small pen marker will appear next to it (in author view it will appear next to each author for the same paper). Multiple references may be cited by selecting them and pressing "enter". "delete" clears all the selected references.
6. Press "c" once all the references for citing are selected, this will close the citer window and copy the latex text for the references to the clipboard.
Paste the text into the latex file to include the references at that point in the document.

Although those might seem like a lot of steps it's really pretty straightforward once you get used to it. The only real difficulty is remembering the keyboard commands for cb2bib citer. With this setup I can take all my references with me between machines (even between Linux and Windows) and use the same process everywhere.

07/06/2011

Portable Version Control

Background
I've mentioned previously that I'm version controlling my work using distributed version control software - specifically Bazaar. I explained that part of the reason I'm doing this is because I work at a number of different locations, and I need to be able to manage various different versions of my work and keep them all in sync.
One of the locations I work at is a Laptop provided by my sponsor. Unfortunately their corporate system does not allow the installation of unauthorised software and Bazaar is not on their approved list. I did briefly look into using subversion (as it was used elsewhere in the company and linking between this and Bazaar is possible) however it looked like it was going to be too much hassle to justify getting it installed. I've therefore decided to attempt to run my version control purely from my flash memory. This will not only allow me to version control my work on my sponsors laptop, but also on any other machine I decide to use.

Incidentally it's worth mentioning that I'm storing all my work on the flash memory (micro-sd) card of my phone. I upgraded to an 8 Gb card which should be plenty (at least initially). I always carry my phone anyway so all I need to retrieve my work at any time or location is a USB cable (or bluetooth if I'm desperate!). I don't treat this as a working copy, but I backup to it regularly and use it to transfer versions between machines. In version control terminology this means I have a 'branch' on each machine I work at and another 'branch' on my phone. I 'push' or 'pull' changes between the phone branch and the local copy branch to backup/transfer the work.

Bazaar is coded using the Python programming language. This is a pretty modern and, by most accounts, a pretty handy programming language; however it's not something that I'm particularly familiar with. The main benefit to me is that there is a variant of it available called Portable Python which is designed to run from a USB storage device (like my phone). This means that it is possible for me to install Portable Python on my phone, and then run Bazaar from that.

I was quite surprised not to be able to find any instructions for this setup on the net anywhere, as it seems like quite a handy proposition (certainly for someone in my situation). The only references that I could find to it were suggestions of using it here, and here. I'm far from an expert on this type of thing but I've sort of got it working - so I thought I'd publish the steps I took incase anyone has any feedback, or it is of use to someone else.

Unfortunately portable python is only windows based; however this is where I'm most likely to need it. (there's also a chance that it will work under linux through wine, but I haven't really experimented with that yet.)

Python Installation
The current version of Python is 3.2, however I don't think any of the Bazaar source is built on that version, so I went with 2.6 (I'm not sure if I could have gotten away with a newer version or not?). I downloaded Portable Python from here and copied it to a "python" folder on my phone. I then ran the installer and told it to install to the same folder on the phone.

Bazaar Installation
I tried installing Bazaar from some executables, unfortunately these complained that it needed python 2.6 in the registry, and gave me no option for manually specifying a python path. So instead I downloaded a tarball of the Bazaar source from here which I extracted to the Python "App" directory. I then opened a command prompt, navigated to the python "App" directory, and ran:
python bzr-2.4b3\setup.py install
(actually I was still using Linux at this point and instead ran SPE-Portable.exe under wine and then ran setup.py through that - I'm pretty sure it had the same effect though)
This ran for a while and terminated without any errors. Within the bazaar directory a "build" folder had now appeared.

From the command prompt I was then able to run :
python bzr-2.4b3\bzr status C:\test
where C:\test was a directory under version control. This gave the correct response, but it also gave me this warning that some extensions couldn't be loaded. I don't think this is a problem, so I ignored it (the link gives a method for turning off the warning if you're bothered about it).

Bazaar Explorer Installation
I wasn't keen on typing everything through the command line, so I looked into getting bazaar explorer running. The official instructions for this are here.
So I changed to the plugins directory of bazaar:
cd F:\python\PortablePython_1.1_py2.6.1\App\bzr-2.4b3\bzrlib\plugins

and ran bzr with the command to download from launchpad:
F:\python\PortablePython_1.1_py2.6.1\App\python F:\python\PortablePython_1.1_py2.6.1\App\bzr-2.4b3\bzr branch lp:bzr-explorer explorer
This warned me that I hadn't given a launchpad ID, but I don't think that matters as I'm not intending on writing anything back to launchpad. Then it went about downloading and building explorer. This process took a little while.

I then tried to run it, but there were a few dependencies that needed fulfilling. Firstly it told me I needed QBzr. So, still in the plugins directory, I ran:
F:\python\PortablePython_1.1_py2.6.1\App\python F:\python\PortablePython_1.1_py2.6.1\App\bzr-2.4b3\bzr branch lp:qbzr
this then went through the same process as with explorer.

On the next try it requested qt libraries (ERROR: No modeule names PyQt4). Some googling reveals this, which suggests installing it in a "non-portable" install and then copying selected files across. (I did try a proper install before this but got nowhere*). I downloaded "PyQt-Py2.6-gpl-4.5.4-1.exe" from here, and installed to C:\Python26. This created a folder in C:\Python26\Lib called "site-packages", I copied the contents of this to the Lib\site-packages folder of my portable python install.

I then tried running:
python bzr-2.4b3\bzr explorer
and up it popped!

Unfortunately a lot of the buttons bring up a deprecation warning for commands in plugins\explorer\lib\app_suite.py. This is something I haven't been able to fix yet. My guess is that it's due to a new version of explorer being used with an older version of Bazaar, but that's just a guess. If anyone can help with it then I'd love to hear from you!
It's still useful as a GUI for inspecting the changes and looking at differences; however any major commands need to be made from the command prompt.

Also this all took a while working from the phone flash card, next time I might consider doing it locally on the hard drive and then copying it across.

Please let me know if you know of any better methods for getting this working, or where I've gone wrong with getting explorer working. Hope this is of interest to someone.
I've also just come across another (easier looking) solution here.

* The copying method sounded like it might be a bit flaky, and I saw some sites indicating that it could be done in a more "conventional" manner, so I got the files for it from here. I also needed "SIP" (I tried it without but it said no). SIP can be obtained from the same place here, and extracted to the python/App directory. I then changed to the python directory and ran:

python sip-4.12.3\configure.py

after this had created a sip module Makefile it errored with unable to open siplib\siplib.sbf - not sure what this means, so I then tried:

python PyQt-win-gpl-4.8.4\configure.py

this said: "make sure you have a working qt v4 qmake on your path". I was a bit lost then so I gave up and tried the technique of copying across!

19/05/2011

Professional plotting

I've written and read plenty of technical reports in my time, and one major feature that almost all technical reports have in common is the inclusion of figures. Pick up any ten technical documents and I reckon you'll find at least 5 different ways of including figures; and of these 5 only one will actually look any good. So as I start to produce figures for my PhD I've been starting to look into the best way of managing this initially simple sounding task.

When I say "figure" I'm generally thinking of some kind of plot, or set of plots, usually in 2D; however what I'm going to discuss should apply to most other "technical" "pictures", but probably won't extend to photo-type images (for reasons that will hopefully become obvious).

The Issues
I won't bother going into all the different ways a figure and a report might come together, except to say that at the very bottom of the scale would be a 'picture'/screenshot pasted into a word document - this just looks awful. What I'll work up to will hopefully be the top of the scale.

What I will list is what I see as some major stumbling blocks in figure/report preparation:

Difficulty updating the figure.
Inability to resize the figure to suit the report.
Images (+ text) looking crappy when zoomed in.
Disconnect between the figure labels and the report text.
Difficulty regenerating the figure at a later date.

A lot of these are irrelevant when we're looking at a printed out finished article, so what we really need to understand are the details of how the information is stored and processed on the computer.

Background Details
One of the first important distinctions is the difference between types of image. Most images that one would typically come across on a computer are raster images, these are stored as a set of pixels - imagine a piece of squared paper with each square filled in a different colour. From a distance, or if all the squares are very small, then this looks great; however if we zoom in (e.g. make each square larger) then we start to see the joins between edges and everything starts looking "blocky". Most programs usually handle this by blurring the pixels together slightly, which can help up to a point, but often we just end up with a blurred mess - not what we want in a precise technical report.

The alternative is vector graphics. These are saved more as a description of what needs to be drawn, rather than point-for-point what is on the screen. This means that zooming is purely a mathematical operation, and all the lines will still appear as prefect lines. The same also works for text, which is stored within a vector graphic as actual text, rather than as a picture of it.

There are plenty of graphics explaining this along with a good description in the Wiki pages linked above. But if you're still not sure then try this simple experiment: type a word into a paint program (e.g. Microsoft Paint) and zoom in, and then do the same in a word processing program (e.g. Microsoft Word) - the difference should be pretty obvious.

In summary, unless what your are working with is an actual picture (in which case converting it to vector graphics would be impossible) then you will get best quality out of maintaining it in a vector format. There are plenty of these formats to choose from; however I find them to be surprisingly unsupported in a lot of applications. As my final target format is pdf (as mentioned elsewhere in this blog) I'm going to be working with eps and pdf formats. These both rely on postscript as an underlying process and are therefore fairly compatible.

My process (overview)
With all of the above as my aims I've worked out a basic process for generating figures. It seems to be working fairly well so far, so I'll outline it here:

1) Write a script to produce the figure and save is as an eps file. This means that I can always go back and see how each figure was produced (what the original data was, how it was scaled, etc, etc). If the data changes then I can simply rerun the script and a new figure will be produced. If I need the figure in a different ratio or with different colours (or a different label, etc, etc) then I can make some minor changes to the script and rerun it. I keep the script under version control, but not the eps file it produces (as I can always reproduce this if necessary). I use Matlab for this process as it is what I am most familiar with (although I often store the raw data in an excel or csv file and read this into Matlab). I suspect I could use Gnuplot or something similar instead.

2) Include the eps file in my LaTeX script. This means that when I regenerate the pdf output from my LaTeX it always includes the most recent version of the figure. As it remains as vector graphics throughout the process I get nice clean professional results.

This process solves all of the problems outlined above, except point 4. It is still possible to produce an eps figure from Matlab with "Courier" font and then include it in a Latex document using "Times" font. I find that this looks out of place. I get around this by using a function called matlabfrag, in combination with pstool package for LaTeX. This means that the final figure picks up the font used in the rest of the document. It also allows me to use full LaTeX expressions in my figures.

My process (in detail)
This may get more refined as time goes by, but currently this is the detailed version of how I would produce a figure:
1a) Write a Matlab script to plot the figure as normal. Using standard matlab command to plot and label axes, etc.
1b) Within the script include a call to 'saveFigure.m'. This is a function I have created which accepts a file name and optionally a figure size (otherwise some default values are used), resizes the figure and then calls matlabfrag to save it as an eps file (and an associated tex file including all the labels).
2a) In the LaTeX preamble include '\usepackage{pstool}'. This allows the use of the psfragfig command.
2b) Within my LaTeX include the figure in the normal way. However, instead of using the latex '\includegraphics' command, I replace it with the '\psfragfig' command.

Notes
I can make my 'saveFigure.m' function available to anyone interested, but it doesn't do much more than I have described above!
I have created a slightly revised process for including Simulink models in documents which is a little different that I can discuss if anyone is interested?
I spent a little time trying to get psfrag to play well with eps files produced from other packages - e.g. Google Sketchup, however I don't think I've quite got to the bottom of it yet.

05/05/2011

Document version control

The Problem
Imagine you start writing a document in Microsoft Word and save it as "notes.doc".
If Word crashes (or there is a power cut, or Windows crashes, or your friend accidentally unplugs your PC, or...) then you lose everything. For a short letter to a friend this is annoying, for a days report writing this is more than just annoying. Therefore:

You save your work regularly

I think most people do this automatically these days.
If you're making quite large changes to your document, like altering the formatting or trying out a completely different paragraph order, then maybe you:

Resave your work with a version in the filename

This is again fairly common and allows you to go back to your previous version if you don't like the changes or something goes wrong. But what to pick as a new filename? A common approach seems to be to append the title with an identifier, so you end up with "notes_2.doc". (What I would strongly advise against is ever choosing to title a version "final.doc", as inevitably you then end up with a "final2.doc", and so on)
After a few iterations of this you decide to send "notes_5.doc" to a colleague to proof read. Maybe they're savvy enough to use the built in Word 'track changes' feature, or maybe they just make the corrections in the document. If you're unlucky perhaps they just print it out and scrawl on it in unintelligible handwriting. Either way, they then send it back to you, and you need to keep it separate from your version, so either they or you:

Save it with a different filename/version

You might be able to get away with moving to "notes_6.doc", but if there have been any changes to your version (which might still be "notes_5") then there is a chance that changes could be missed. So maybe you go for "notes_5a.doc" or appending the reviewers initials as "notes_5_pcf.doc".
You might then decide to make some corrections whilst working from home, but unfortunately you've only got "notes_4.doc" with you. What do you save it as then? How do you ensure your changes are properly imported into "notes_6.doc"?
For anything more than quite a straightforward document this can all get out of hand quite quickly. Before long you have a folder of really quite large files, all with slightly different names, and you're not really sure which one is the most recent version. This is a problem that happens far too often and really infuriates me. Whilst I would agree that all the above steps are sensible and I would encourage them (in lieu of a better alternative - see below), one thing I have also started doing is:

Creating a "archive" subfolder to dump all previous versions in.

That way when I navigate to a folder with three different documents in it, I only see three documents, rather than 30 revisions of three documents. This helps a lot with finding things, but not with the underlying issue of manual naming and keeping track of things.

The Solution
This is not a new problem, in fact it's one that was identified, and solved, many years ago in computer science. In fact their problem is much more complicated as it often requires that two people be editing the same document (in their case a computer program) simultaneously.
The problem is solved by having a piece of software do the version control for you. This allows the user(s) to have a single copy of the work on their computer, with a simple title (e.g. "notes.doc") and then they can leave all the complicated stuff to the software. All they need to do instead of resaving a version with a new name is to 'check in' a version to the software.
This will work for virtually any type of file with one important caveat: the software is almost always expecting the file to be a plain text document, not a binary file (all programming source code is plain text, some file types such as Word documents are not). That's not to say it won't work for binary files, in fact it still works pretty well; however not all of the version control software's functionality will be available. This means that some of the more clever functions, such as file differencing and merging won't be available.
So what can a piece of version control software do that will be of use for our documents?

Maintain the current version - as a simple set of files with simple names
Allow access to previous versions - either specific files on their own or a whole folders worth from a particular date
View the differences between files (only for plain text) - so that all the changes since a previous version are highlighted
Merge difference versions (only for plain text) - so that two separate versions are combined into one

This is all pretty useful stuff, not that difficult to set up, and I've found that so far it has really helped keep my work tidy and avoided confusion.

My Setup
I'm using a distributed version control system called Bazaar with a single repository containing all of my documents and code fragments. As I'm trying to use plain text file formats wherever possible (Latex predominantly) I'm able to use the difference and merge functions.
I'm checking in everything except:

Autogenerated files - such as pdf's and plots (likely to be the subject of a later post)
Results files - as these are not expected to change, are only used once to produce results plots, and can be quite large files

I'm hoping that a check out of my repository will be a complete record of everything I've done in my PhD.

I tend to check in any changes to files once a day (or every few days if the changes are minor) and backup by 'pushing' a copy of the repository to my external drive every week or so. I can go into more technical detail on my setup if anyone is interested.

I'm also planning to set up a portable version of the software on my external drive so that I can plug into third party machines (on which I'm not allowed to install software - such as my sponsors laptop) and still use version control. I think this should be possible by using portable python as bazaar is coded in python, but I haven't got very far with it yet...