PhD Tools: May 2011

19/05/2011

Professional plotting

I've written and read plenty of technical reports in my time, and one major feature that almost all of them have in common is the inclusion of figures. Pick up any ten technical documents and I reckon you'll find at least 5 different ways of including figures; and of these 5 only one will actually look any good. So as I start to produce figures for my PhD I've been looking into the best way of managing this initially simple-sounding task.

When I say "figure" I'm generally thinking of some kind of plot, or set of plots, usually in 2D; however what I'm going to discuss should apply to most other "technical" "pictures", but probably won't extend to photo-type images (for reasons that will hopefully become obvious).

The Issues
I won't bother going into all the different ways a figure and a report might come together, except to say that at the very bottom of the scale would be a 'picture'/screenshot pasted into a Word document - this just looks awful. What I'll work up to will hopefully be the top of the scale.

What I will list is what I see as some major stumbling blocks in figure/report preparation:

1) Difficulty updating the figure.
2) Inability to resize the figure to suit the report.
3) Images (+ text) looking crappy when zoomed in.
4) Disconnect between the figure labels and the report text.
5) Difficulty regenerating the figure at a later date.

A lot of these are irrelevant when we're looking at a printed out finished article, so what we really need to understand are the details of how the information is stored and processed on the computer.

Background Details
One of the first important distinctions is the difference between types of image. Most images that one would typically come across on a computer are raster images; these are stored as a set of pixels - imagine a piece of squared paper with each square filled in a different colour. From a distance, or if all the squares are very small, this looks great; however if we zoom in (i.e. make each square larger) then we start to see the joins between the squares and everything starts looking "blocky". Most programs handle this by blurring the pixels together slightly, which can help up to a point, but often we just end up with a blurred mess - not what we want in a precise technical report.

The alternative is vector graphics. These are saved as a description of what needs to be drawn, rather than point-for-point what is on the screen. This means that zooming is purely a mathematical operation, and all the lines will still appear as perfect lines. The same also works for text, which is stored within a vector graphic as actual text, rather than as a picture of it.

There are plenty of graphics explaining this along with a good description in the Wiki pages linked above. But if you're still not sure then try this simple experiment: type a word into a paint program (e.g. Microsoft Paint) and zoom in, and then do the same in a word processing program (e.g. Microsoft Word) - the difference should be pretty obvious.
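The squared-paper picture can also be sketched in a few lines of Python (chosen purely for illustration; nothing else in this post depends on it). The function names 'zoom_raster' and 'zoom_vector' and the 3x3 test image are made up for the example, and nearest-neighbour zooming is assumed as the simplest way a program might enlarge pixels.

```python
# A raster image stores pixels; a vector image stores a description of shapes.

def zoom_raster(pixels, factor):
    """Nearest-neighbour zoom: every pixel becomes a factor x factor block."""
    zoomed = []
    for row in pixels:
        wide_row = [value for value in row for _ in range(factor)]
        zoomed.extend(list(wide_row) for _ in range(factor))
    return zoomed


def zoom_vector(line, factor):
    """Zooming a vector line just multiplies its end-point coordinates."""
    (x0, y0), (x1, y1) = line
    return ((x0 * factor, y0 * factor), (x1 * factor, y1 * factor))


# A 3x3 raster of a diagonal line (1 = ink, 0 = paper).
raster = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
# The same line as a vector description: from one corner to the opposite one.
vector = ((0, 0), (3, 3))

for row in zoom_raster(raster, 3):
    print("".join("#" if value else "." for value in row))
print(zoom_vector(vector, 3))
```

Printing the zoomed raster shows the diagonal as a staircase of 3x3 blocks, whereas the zoomed vector line is still just two end points and can be redrawn cleanly at the new size.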

In summary, unless what you are working with is an actual picture (in which case converting it to vector graphics is not practical) then you will get the best quality by keeping it in a vector format. There are plenty of these formats to choose from; however I find them to be surprisingly unsupported in a lot of applications. As my final target format is pdf (as mentioned elsewhere in this blog) I'm going to be working with eps and pdf formats. Both are descended from PostScript (eps is a restricted form of PostScript, and pdf is built on the same imaging model) and are therefore fairly compatible.

My process (overview)
With all of the above as my aims I've worked out a basic process for generating figures. It seems to be working fairly well so far, so I'll outline it here:

1) Write a script to produce the figure and save it as an eps file. This means that I can always go back and see how each figure was produced (what the original data was, how it was scaled, etc, etc). If the data changes then I can simply rerun the script and a new figure will be produced. If I need the figure in a different ratio or with different colours (or a different label, etc, etc) then I can make some minor changes to the script and rerun it. I keep the script under version control, but not the eps file it produces (as I can always reproduce this if necessary). I use Matlab for this process as it is what I am most familiar with (although I often store the raw data in an Excel or csv file and read this into Matlab). I suspect I could use Gnuplot or something similar instead.

2) Include the eps file in my LaTeX document. This means that when I regenerate the pdf output from my LaTeX it always includes the most recent version of the figure. As it remains as vector graphics throughout the process I get nice clean professional results.
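To show that step 1 does not depend on any particular plotting tool, here is a minimal sketch in Python rather than Matlab. It is not the Matlab workflow described in this post: it writes a bare-bones eps file by hand using only the standard library, with no axes or labels. The function names 'read_points' and 'make_eps', the sample data and the output name 'figure.eps' are all hypothetical.

```python
import csv
import io


def read_points(csv_text):
    """Parse 'x,y' rows (after one header line) into a list of float pairs."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    return [(float(x), float(y)) for x, y in reader]


def make_eps(points, width=300, height=200, margin=20):
    """Return the text of a minimal EPS file drawing the points as one line.

    width, height and margin are in PostScript points (1/72 inch).
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_span = (max(xs) - min(xs)) or 1.0  # avoid dividing by zero for flat data
    y_span = (max(ys) - min(ys)) or 1.0

    def to_page(x, y):
        # Scale the data so that it fills the page inside the margins.
        px = margin + (x - min(xs)) / x_span * (width - 2 * margin)
        py = margin + (y - min(ys)) / y_span * (height - 2 * margin)
        return px, py

    lines = [
        "%!PS-Adobe-3.0 EPSF-3.0",
        f"%%BoundingBox: 0 0 {width} {height}",
        "1 setlinewidth",
        "newpath",
    ]
    for i, (x, y) in enumerate(points):
        px, py = to_page(x, y)
        lines.append(f"{px:.2f} {py:.2f} {'moveto' if i == 0 else 'lineto'}")
    lines += ["stroke", "showpage", "%%EOF"]
    return "\n".join(lines) + "\n"


# Hypothetical raw data; in practice this would be read from a csv file on disk.
DATA = "time,value\n0,0\n1,1\n2,4\n3,9\n"

if __name__ == "__main__":
    with open("figure.eps", "w") as handle:
        handle.write(make_eps(read_points(DATA)))
```

Changing the size of the figure is then a matter of editing the 'width' and 'height' arguments and rerunning the script, which is the property that makes it worth versioning the script rather than its output.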

This process solves all of the problems outlined above, except the fourth (the disconnect between the figure labels and the report text). It is still possible to produce an eps figure from Matlab with "Courier" font and then include it in a LaTeX document using "Times" font. I find that this looks out of place. I get around this by using a function called matlabfrag, in combination with the pstool package for LaTeX. This means that the final figure picks up the font used in the rest of the document. It also allows me to use full LaTeX expressions in my figures.

My process (in detail)
This may get more refined as time goes by, but currently this is the detailed version of how I would produce a figure:
1a) Write a Matlab script to plot the figure as normal, using standard Matlab commands to plot and label axes, etc.
1b) Within the script include a call to 'saveFigure.m'. This is a function I have created which accepts a file name and optionally a figure size (otherwise some default values are used), resizes the figure and then calls matlabfrag to save it as an eps file (and an associated tex file including all the labels).
2a) In the LaTeX preamble include '\usepackage{pstool}'. This allows the use of the psfragfig command.
2b) Within my LaTeX include the figure in the normal way. However, instead of using the LaTeX '\includegraphics' command, I replace it with the '\psfragfig' command.

Notes
I can make my 'saveFigure.m' function available to anyone interested, but it doesn't do much more than I have described above!
I have created a slightly revised process for including Simulink models in documents, which I can discuss if anyone is interested.
I spent a little time trying to get psfrag to play well with eps files produced from other packages (e.g. Google SketchUp); however I don't think I've quite got to the bottom of it yet.

17/05/2011

Selecting the correct format

Does format matter?

Today I received an invite to a party in ".doc" format (i.e. a Microsoft Word document) via email. Whilst I was happy to be invited to the party, and the invite very much served its purpose, I can't help thinking that it could have been presented better. Here are some comments I would make:

".doc" is a proprietary format, which although popular and therefore supported on most people's computers, can lead to inconsistent formatting or in the worst case a user not being able to open it at all.
It is also what I would consider to be an "editing format", which means it is fine for producing a document, or passing over to someone else for meddling with, but not (in my opinion) for presenting to a recipient. This is for several reasons:

The possibility of the user accidentally editing the document. Say for example the last thing the sender changes is the date of the party "3/6/2011" - it would be all too easy for me to open the document and then accidentally nudge the zero key, only to turn up to a distinct lack of party on "30/6/2011".
Access to details the sender didn't wish to share. Formats such as this provide for a great deal of version history to be saved with the document; so unless specific steps are taken to ensure that this is not included in what I receive, there is every chance that I would be able to view previous versions of it, or comments about its contents.
It does not necessarily open in an easy to view format - either on the wrong page, or at the wrong zoom level, or with certain formatting visible (for example a nice red line under all the spelling mistakes). Microsoft did try to improve on this with the introduction of their "Reading Layout" view in Word 2003; however I don't think this really helped and only served to confuse the majority of users.
Because of all the extra information the format contains, often the files are far, far larger than they need to be.

It was sent attached to an email. Email is already a perfectly good format for presenting information, with a variety of different effects available (providing html format is used), so it seems a little unnecessary to attach a file with the information in.

How about an Analogy?

To make the closest analogy possible - if this were an invitation sent by good old-fashioned snail mail, it would be:

a handwritten letter;
with all the comments and corrections scribbled in the margins;
with some spelling and grammatical mistakes highlighted but not corrected;
spread over multiple pages - but folded open somewhere in the middle of the document;
posted in a large, heavy and cumbersome box which some recipients lack the tools or knowledge to open;
that is itself housed within a larger box.

Now that is maybe a worst-case scenario, but not particularly exaggerated in my experience; and whilst you might be a bit surprised to receive that as a party invite, you would be pretty disgusted to receive it as a Master's-level thesis submission. And even more appalled if it was presented as a final report for a multi-thousand pound project contract! Yet this is exactly the sort of thing that gets done in Word every day. That's not to say a lot of those issues don't crop up in the use of other programs; however Word seems to be the most common object of misuse.

So what's the alternative?

A lot of the fuss I've made above can be avoided through proper use of the Microsoft tools. Provided you remove any hidden data, properly spell check your work and set the display up before you finally save it then things should come out looking ok. You can even protect the document to avoid accidental editing. However, to completely negate these issues, I prefer to use a totally separate "display format" for presenting information.

For anything that is disseminated wider than myself (or my immediate team) I am very keen on the use of common, open standards. The most common of these, I have found, is pdf. Most people have a pdf reader installed on their computer, no matter what their operating system. In fact most modern phones can display pdf files. Many programs are able to save to pdf as a built-in feature and those that can't are invariably able to print to one of the many pdf conversion programs available.

There are also some very neat features of pdf files that are not often exploited, but can be used to produce some very useful effects, e.g. opening by default in full screen mode, embedding other files within them, etc. (perhaps I'll cover this at a later date).

By far my most important reason for trying to use pdf format for dissemination though is that it is a format that is difficult to edit (granted editors do exist, but 'accidental editing' is almost impossible). This means that if I send a file to someone, and they choose to send it to someone else, I can be fairly confident that the final recipient will see what I want them to see (and nothing else!).

05/05/2011

Pie charts

I feel like I'm often moaning about pie charts and then having to explain why I hate them, so I thought I should post here so that I can simply refer people here for an explanation.
But when I Googled the subject it turns out that everyone else hates them too.

So there's really not much more I can say on the subject except for supplying the best link I found, which is a 2007 document by Stephen Few. It does a really good job of explaining how bad they are. It's pretty readable, but for the lazy you can get most of what you need to know from just the illustrations and their explanation.

Also here is the best quote I found on the use of pie charts:

"Piecharts are the information visualization equivalent of a roofing hammer to the frontal lobe. They have no place in the world of grownups, and occupy the same semiotic space as short pants, a runny nose, and chocolate smeared on one’s face. They are as professional as a pair of assless chaps. Anyone who suggests their use should be instinctively slapped."

Document version control

The Problem
Imagine you start writing a document in Microsoft Word and save it as "notes.doc".
If Word crashes (or there is a power cut, or Windows crashes, or your friend accidentally unplugs your PC, or...) then you lose everything. For a short letter to a friend this is annoying; for a day's report writing it is more than just annoying. Therefore:

You save your work regularly

I think most people do this automatically these days.
If you're making quite large changes to your document, like altering the formatting or trying out a completely different paragraph order, then maybe you:

Resave your work with a version in the filename

This is again fairly common and allows you to go back to your previous version if you don't like the changes or something goes wrong. But what to pick as a new filename? A common approach seems to be to append the title with an identifier, so you end up with "notes_2.doc". (What I would strongly advise against is ever choosing to title a version "final.doc", as inevitably you then end up with a "final2.doc", and so on.)
After a few iterations of this you decide to send "notes_5.doc" to a colleague to proofread. Maybe they're savvy enough to use the built-in Word 'track changes' feature, or maybe they just make the corrections in the document. If you're unlucky perhaps they just print it out and scrawl on it in unintelligible handwriting. Either way, they then send it back to you, and you need to keep it separate from your version, so either they or you:

Save it with a different filename/version

You might be able to get away with moving to "notes_6.doc", but if there have been any changes to your version (which might still be "notes_5") then there is a chance that changes could be missed. So maybe you go for "notes_5a.doc" or appending the reviewer's initials as "notes_5_pcf.doc".
You might then decide to make some corrections whilst working from home, but unfortunately you've only got "notes_4.doc" with you. What do you save it as then? How do you ensure your changes are properly imported into "notes_6.doc"?
For anything more than quite a straightforward document this can all get out of hand quite quickly. Before long you have a folder of really quite large files, all with slightly different names, and you're not really sure which one is the most recent version. This is a problem that happens far too often and really infuriates me. Whilst I would agree that all the above steps are sensible and I would encourage them (in lieu of a better alternative - see below), one thing I have also started doing is:

Creating an "archive" subfolder to dump all previous versions in.

That way when I navigate to a folder with three different documents in it, I only see three documents, rather than 30 revisions of three documents. This helps a lot with finding things, but not with the underlying issue of manual naming and keeping track of things.
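The tidying-up step could even be automated. The following is a rough Python sketch, assuming files follow the "name_N.ext" pattern used in the examples above; the function name 'archive_old_versions' and the rule "keep only the highest number" are assumptions made for illustration.

```python
import re
import shutil
from pathlib import Path

# Matches stems such as "notes_5", but not "notes_5a" or "notes_5_pcf".
VERSIONED = re.compile(r"^(?P<stem>.+)_(?P<number>\d+)$")


def archive_old_versions(folder):
    """Move all but the highest-numbered 'name_N.ext' file into folder/archive."""
    folder = Path(folder)
    groups = {}
    for path in folder.iterdir():
        match = VERSIONED.match(path.stem)
        if path.is_file() and match:
            key = (match["stem"], path.suffix)
            groups.setdefault(key, []).append((int(match["number"]), path))

    moved = []
    for versions in groups.values():
        versions.sort()  # lowest version number first
        for _, path in versions[:-1]:  # everything except the newest
            archive = folder / "archive"
            archive.mkdir(exist_ok=True)
            shutil.move(str(path), str(archive / path.name))
            moved.append(path.name)
    return sorted(moved)
```

Names such as "notes_5a.doc" or "notes_5_pcf.doc" do not fit the pattern and are left where they are, which is a fair illustration of how quickly manual naming schemes outgrow any simple rule.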

The Solution
This is not a new problem; in fact it's one that was identified, and solved, many years ago in computer science. If anything, the programmers' version of the problem is harder, as it often requires that two people be editing the same document (in their case a computer program) simultaneously.
The problem is solved by having a piece of software do the version control for you. This allows the user(s) to have a single copy of the work on their computer, with a simple title (e.g. "notes.doc") and then they can leave all the complicated stuff to the software. All they need to do instead of resaving a version with a new name is to 'check in' a version to the software.
This will work for virtually any type of file with one important caveat: the software is almost always expecting the file to be a plain text document, not a binary file (all programming source code is plain text; some file types, such as Word documents, are not). That's not to say it won't work for binary files, in fact it still works pretty well; however some of the cleverer functions, such as file differencing and merging, won't be available.
So what can a piece of version control software do that will be of use for our documents?

Maintain the current version - as a simple set of files with simple names
Allow access to previous versions - either specific files on their own or a whole folder's worth from a particular date
View the differences between files (only for plain text) - so that all the changes since a previous version are highlighted
Merge different versions (only for plain text) - so that two separate versions are combined into one

This is all pretty useful stuff, not that difficult to set up, and I've found that so far it has really helped keep my work tidy and avoided confusion.
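To give a feel for what "viewing the differences" means for plain text, here is a small sketch using Python's standard 'difflib' module. It is not what any particular version control tool runs internally, just the same idea in miniature; the function name 'describe_changes', the file name "notes.tex" and the two sample versions are made up for the example.

```python
import difflib


def describe_changes(old_text, new_text, name="notes.tex"):
    """Return a unified diff between two versions of a plain text file."""
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=name + " (previous version)",
        tofile=name + " (current version)",
    )
    return "".join(diff)


# Two hypothetical versions of the same short document.
previous = "Introduction\nThe results are shown in the figure.\nConclusion\n"
current = "Introduction\nThe results are shown in Figure 1.\nConclusion\n"

print(describe_changes(previous, current))
```

Lines starting with "-" were removed and lines starting with "+" were added, while unchanged lines are shown for context. This line-by-line comparison is only meaningful because the file is plain text; for a binary file the most a tool can usually report is that the file has changed.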

My Setup
I'm using a distributed version control system called Bazaar with a single repository containing all of my documents and code fragments. As I'm trying to use plain text file formats wherever possible (LaTeX predominantly) I'm able to use the difference and merge functions.
I'm checking in everything except:

Autogenerated files - such as pdfs and plots (likely to be the subject of a later post)
Results files - as these are not expected to change, are only used once to produce results plots, and can be quite large files

I'm hoping that a check out of my repository will be a complete record of everything I've done in my PhD.

I tend to check in any changes to files once a day (or every few days if the changes are minor) and back up by 'pushing' a copy of the repository to my external drive every week or so. I can go into more technical detail on my setup if anyone is interested.

I'm also planning to set up a portable version of the software on my external drive so that I can plug into third-party machines (on which I'm not allowed to install software - such as my sponsor's laptop) and still use version control. I think this should be possible by using Portable Python, as Bazaar is written in Python, but I haven't got very far with it yet...