Imagine you start writing a document in Microsoft Word and save it as "notes.doc".
If Word crashes (or there is a power cut, or Windows crashes, or your friend accidentally unplugs your PC, or...) then you lose everything since your last save. For a short letter to a friend this is annoying; for a day's worth of report writing it is far worse. Therefore:
- You save your work regularly
If you're making quite large changes to your document, like altering the formatting or trying out a completely different paragraph order, then maybe you:
- Resave your work with a version in the filename
After a few iterations of this you decide to send "notes_5.doc" to a colleague to proofread. Maybe they're savvy enough to use Word's built-in 'track changes' feature, or maybe they just make the corrections in the document. If you're unlucky, perhaps they print it out and scrawl on it in unintelligible handwriting. Either way, they then send it back to you, and you need to keep it separate from your version, so either they or you:
- Save it with a different filename/version
You might then decide to make some corrections whilst working from home, but unfortunately you've only got "notes_4.doc" with you. What do you save it as then? How do you ensure your changes are properly imported into "notes_6.doc"?
For anything more than a fairly straightforward document this can get out of hand quickly. Before long you have a folder of large files, all with slightly different names, and you're not sure which one is the most recent version. This happens far too often and really infuriates me. Whilst all the above steps are sensible and I would encourage them (in lieu of a better alternative - see below), one thing I have also started doing is:
- Creating an "archive" subfolder to dump all previous versions in.
That way when I navigate to a folder with three different documents in it, I only see three documents, rather than 30 revisions of three documents. This helps a lot with finding things, but not with the underlying issue of manual naming and keeping track of things.
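The archiving habit can be automated with a short script. Here is a minimal Python sketch, assuming versions follow the `name_N.ext` pattern used in this post (e.g. "notes_5.doc"). The function name `archive_old_versions` and the default folder name are illustrative choices, not part of any existing tool.

```python
import re
import shutil
from pathlib import Path

# Assumed naming scheme: "<stem>_<number><extension>", e.g. "notes_5.doc".
VERSIONED = re.compile(r"^(?P<stem>.+)_(?P<num>\d+)(?P<ext>\.[^.]+)$")


def archive_old_versions(folder, archive_name="archive"):
    """Move all but the highest-numbered version of each document into a subfolder.

    Returns the names of the files that were moved, sorted alphabetically.
    """
    folder = Path(folder)
    groups = {}  # (stem, ext) -> list of (version number, path)
    for path in folder.iterdir():
        match = VERSIONED.match(path.name) if path.is_file() else None
        if match:
            key = (match["stem"], match["ext"])
            groups.setdefault(key, []).append((int(match["num"]), path))

    moved = []
    for versions in groups.values():
        versions.sort(key=lambda item: item[0])  # oldest first, newest last
        for _, path in versions[:-1]:  # leave the newest version where it is
            archive = folder / archive_name
            archive.mkdir(exist_ok=True)
            shutil.move(str(path), str(archive / path.name))
            moved.append(path.name)
    return sorted(moved)
```

Calling `archive_old_versions("path/to/folder")` on a folder holding "notes_4.doc", "notes_5.doc" and "notes_6.doc" would leave only "notes_6.doc" in place. Files without a version number in the name are not touched.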
This is not a new problem; in fact it was identified, and solved, many years ago in computer science. The programmers' version of the problem is actually more complicated, as it often requires that two people edit the same document (in their case a computer program) simultaneously.
The problem is solved by having a piece of software do the version control for you. This allows the user(s) to have a single copy of the work on their computer, with a simple title (e.g. "notes.doc") and then they can leave all the complicated stuff to the software. All they need to do instead of resaving a version with a new name is to 'check in' a version to the software.
This will work for virtually any type of file, with one important caveat: the software almost always expects the file to be plain text rather than binary (all programming source code is plain text; some file types, such as Word documents, are not). That's not to say it won't work for binary files, in fact it still works pretty well. However, some of the cleverer functions, such as file differencing and merging, won't be available.
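One rough way to tell which side of this divide a file falls on is to look for NUL bytes near the start of it, which ASCII or UTF-8 text does not normally contain. The sketch below uses that heuristic. It is an illustration only: the function name `looks_binary` and the sample size are assumptions made for this example, and a given version control tool may make the decision differently.

```python
def looks_binary(path, sample_size=8000):
    """Heuristic: treat a file as binary if its first bytes contain a NUL byte."""
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    return b"\x00" in sample
```

A LaTeX source file would normally come back as `False`, while an old-style ".doc" file would be expected to come back as `True`. Being a heuristic, it can be fooled, for example by text saved as UTF-16.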
So what can a piece of version control software do that will be of use for our documents?
- Maintain the current version - as a simple set of files with simple names
- Allow access to previous versions - either specific files on their own or a whole folder's worth from a particular date
- View the differences between files (only for plain text) - so that all the changes since a previous version are highlighted
- Merge different versions (only for plain text) - so that two separate versions are combined into one
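To give a feel for what viewing the differences means for plain text, here is a small sketch using Python's standard `difflib` module. It is not the version control software itself, just an illustration of the same idea; the two versions of the text and the labels are made up for the example.

```python
import difflib

# Two made-up versions of the same short document, one line per list entry.
old_version = [
    "Dear Sam,\n",
    "Thanks for the report.\n",
    "See you on Monday.\n",
]
new_version = [
    "Dear Sam,\n",
    "Thanks for the detailed report.\n",
    "See you on Monday.\n",
]


def show_differences(old_lines, new_lines):
    """Return a unified diff: removed lines start with '-', added lines with '+'."""
    diff = difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile="notes (previous)",
        tofile="notes (current)",
    )
    return "".join(diff)


print(show_differences(old_version, new_version))
```

Only the changed line shows up with `-` and `+` markers; unchanged lines appear as context. This line-by-line comparison is what becomes impossible when the file is an opaque binary format.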
I'm using a distributed version control system called Bazaar with a single repository containing all of my documents and code fragments. As I'm trying to use plain text file formats wherever possible (LaTeX predominantly), I'm able to use the difference and merge functions.
I'm checking in everything except:
- Autogenerated files - such as PDFs and plots (likely to be the subject of a later post)
- Results files - as these are not expected to change, are only used once to produce results plots, and can be quite large files
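Exclusions like these boil down to a list of filename patterns (Bazaar keeps its own in a `.bzrignore` file). As a rough illustration of the idea, the sketch below filters a list of paths with Python's standard `fnmatch` module. The patterns and file names are hypothetical examples, not the real contents of the repository described here.

```python
from fnmatch import fnmatch

# Hypothetical patterns: generated output and bulky results files.
IGNORE_PATTERNS = ["*.pdf", "*.png", "results/*"]


def files_to_check_in(paths, patterns=IGNORE_PATTERNS):
    """Return the paths that match none of the ignore patterns."""
    return [
        path
        for path in paths
        if not any(fnmatch(path, pattern) for pattern in patterns)
    ]
```

For example, given "thesis.tex", "thesis.pdf" and "results/run1.csv", only "thesis.tex" would be left to check in.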
I tend to check in any changes to files once a day (or every few days if the changes are minor) and back up by 'pushing' a copy of the repository to my external drive every week or so. I can go into more technical detail on my setup if anyone is interested.
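For the curious, a routine like this maps onto three Bazaar commands: `bzr add`, `bzr commit` and `bzr push`. The sketch below wraps them in Python. It is a guess at what such a routine could look like rather than a description of the actual setup, the function names are invented for the example, and the backup location is whatever path you pass in. By default it only reports the commands it would run.

```python
import subprocess


def _run(commands, dry_run):
    """Run each command in turn unless dry_run is set; return the command list."""
    for command in commands:
        if not dry_run:
            # check=True raises CalledProcessError if a command fails.
            subprocess.run(command, check=True)
    return commands


def daily_check_in(message, dry_run=True):
    """Start tracking any new files, then record a new revision locally."""
    return _run([["bzr", "add"], ["bzr", "commit", "-m", message]], dry_run)


def weekly_backup(backup_location, dry_run=True):
    """Push a copy of the branch to the backup location (hypothetical path)."""
    return _run([["bzr", "push", backup_location]], dry_run)
```

Calling `daily_check_in("Edited chapter 2", dry_run=False)` from inside the working folder would actually run the commands, which of course requires Bazaar to be installed.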
I'm also planning to set up a portable version of the software on my external drive so that I can plug into third-party machines (on which I'm not allowed to install software, such as my sponsor's laptop) and still use version control. I think this should be possible using Portable Python, as Bazaar is written in Python, but I haven't got very far with it yet...