Dispelling Subversion FUD

Ben Collins-Sussman

sussman@red-bean.com

I'm a Subversion developer who has worked on the project from the very beginning. This is a private essay, written by me. I don't pretend to be objective at all; it represents my personal opinions and feelings about Subversion. It's not official project documentation, but my hope is that people will link to this document whenever they see FUD about Subversion. My goal is to dispel some of the more common rumors and misconceptions I've heard floating around the net.

Before I begin, a word of advice to curious administrators. If you're learning about Subversion and thinking of using it in your group or company, please approach it the way you'd approach any new product: with caution. This isn't to say that Subversion is unreliable... but that doesn't mean you shouldn't use some common sense either. Don't blindly jump into the deep end without a test-drive. No user wants a new product forced upon them, and if you're going to be responsible for administering the system, you'd better have some familiarity with it before rolling it out to everyone. Find a smallish project, and set it up as a "pilot" for Subversion. Ask for enthusiastic volunteers to test-drive the experiment. In the end, if Subversion turns out to be a good fit, you'll have much happier developers (who have been part of the process from the start) and you'll be ready to support a larger installation as well.

That said, here are the most common bits of FUD I've heard.

Subversion is too difficult to build, with too many dependencies. I hear that it requires Apache... talk about a showstopper!

Let's address the Apache issue first: Subversion does not require Apache. It depends on the Apache Portable Runtime (APR) library, but that's not the same as the Apache webserver. APR allows Subversion clients and servers to compile anywhere Apache does, in the same way that the Netscape Portable Runtime (NSPR) library makes Mozilla compile everywhere.

Subversion has two different servers: you can use Apache2 with a custom WebDAV module, or you can run a small standalone 'svnserve' server which is similar to CVS's pserver. Neither server is "more official", and both have trade-offs. See the beginning of chapter 6 in the Subversion book for a comparison of features.

Next, regarding the "difficult to build" problem: when was the last time you compiled CVS? Never? That's because it's preinstalled on just about every system, right? If you're using a well-supported operating system, Subversion binaries should be standard packages, either built-in, supplied by your distribution (rpms, debs, fink, etc.), or easily downloadable (in the case of win32).

Building is for developers, not users. Mozilla, Evolution, KDE, and Gnome all have an insane number of dependencies too, but most normal users don't know or care, because they're not compiling. The fact is, Subversion has a lot of dependencies because it has a lot of complex features, and doesn't reinvent the wheel. Nothing unusual about that.

Subversion doesn't break new ground -- it keeps the same old lame CVS model. Why imitate CVS at all?

From the start, the Subversion project has always had a "fundamental axiom":

CVS is an excellent, proven model for version control; it just wasn't implemented as well as it could be.

We're not polishing a turd, we're polishing a diamond in the rough. Subversion takes the CVS model and adds directory versioning, atomic commits, a database backend, versioned metadata, efficient binary handling, flexible network abilities, and a solid C API. Most of us think that it's what CVS should have been in the first place.

If you disagree with the fundamental axiom of the project, there really isn't much more to talk about; Subversion is not for you.

Some of the newer competing version control systems are "distributed" or "decentralized" -- projects like Monotone, or Arch, or even non-free systems like Bitkeeper. These products offer a somewhat radical new way of working, where each developer has a private repository, and repositories are able to exchange changes in any sort of hierarchy.

A number of Subversion developers have mixed feelings about these distributed systems. On the one hand, they sound really neat, and we're curious to try them out. On the other hand, we've heard a lot of people complain about how difficult they are to use, though perhaps that will improve over time. And at least one Subversion developer believes that the decentralized model isn't right for free software development. You'll have to decide for yourself.

At the moment, there are no concrete plans to evolve Subversion into a decentralized system. But an interesting project called svk is a new decentralized system based on the Subversion libraries, and is supposedly "compatible" with ordinary Subversion repositories and regular users not using svk. A lot of people really love it, so you might want to check it out. The Subversion project, at a minimum, plans to study svk some day just to see how it implements various "smart merging" behaviors. Who knows? Maybe some decentralized abilities will creep into Subversion too. It's all speculation at this point.

If Subversion is only "CVS improved", why the heck did it take four years to get to 1.0? Geez, how hard can it be to slap some features on top of CVS?

Please, don't insult the project by claiming that we just "slapped some features on CVS." Those features aren't "slappable" on CVS. The CVS codebase is a bloody mess, and very difficult to extend. (Though at least two projects attempted to do so: CVSNT and MetaCVS.) That's why we started from scratch with a completely new design. Subversion and CVS share zero code; the only things they have in common are a concurrent, centralized model and similar UI.

We started out by implementing a journaled library that manages working copy data and understands versioned directories. Then we implemented a repository on top of a transactional database, one which stores snapshots of entire trees. It took about 14 months of coding before Subversion was complete enough to start hosting itself. After that, it's been two and a half years of continuous stabilization, bug-fixing, and regression tests, with releases every few weeks. Versioning directories is a hard problem.

When Subversion hit "alpha" it was already being used by dozens of private developers and shops for real work. Any other project probably would have called the product "1.0" at that point, but we deliberately decided to delay that label as long as possible. Because we're talking about managing people's irreplaceable data, the project was extremely conservative about labeling something 1.0. We were aware that many people were waiting for that label before using Subversion, and had very specific expectations about the meaning of that label. So we stuck to that standard. All it takes is one high-profile case of data loss to destroy an SCM's reputation.

I'm researching different SCM solutions for my company, and I've seen tables that compare Subversion with other systems. I notice that Subversion is lacking [feature X]. Don't you think that's a problem? Are there plans to address this? My group might be willing to contribute resources to this project, but it definitely won't happen if we don't see this feature implemented.

First of all, threatening will get you nowhere. A lot of people think they can influence a project by offering resources, but then using that offer as a means of "blackmailing" the project in a certain direction. Subversion, like any other open-source project, is a meritocracy based on code contribution and lots of discussion. You're welcome to participate like everyone else, but it has to be on the same terms and rules that everyone else follows. See the HACKING document for more detail.

Second, Subversion's developers are acutely aware of the Feature Creep problem. Many projects have loose goals and no solid definition of "done", so the project scope forever drifts and expands, the community shifts, and nothing is ever released. As testament, just look at the hundreds of dead projects on Sourceforge. From day one, our developers laid down a crisp definition of exactly which CVS problems Subversion 1.0 would fix and which ones it wouldn't. If you missed that discussion, I'm sorry. It's on the front page of the website, and it's been our unchanging guide for years. If you want to influence the priorities of post-1.0 features, feel free to get involved in the project discussion and be prepared to write code. Make sure to look through our issue tracker and mailing lists for previous discussions about your favorite unimplemented feature. I can almost guarantee you're not the first person to ask about it.

Finally, a little rant about the several SCM "comparison tables" that I've seen out on the net. Honestly, I give very little credibility to these tables for a couple of reasons. Many of them are written by people who are core developers for a specific SCM system, and there's just no way such a person can write an objective comparison. Consciously or unconsciously, the whole discussion is framed in terms of methodologies and features most important to the author's own system. Other times, the authors are simply information-gatherers: the table reads like a book report. You get the impression that the author went around and read each project's self-description, neatly summarized it for us, but has little or no experience using the systems for real work in a group setting. Lastly, I have a personal objection to the assumption behind these tables. Various SCM features are listed as if there's some platonic, ideal system out there somewhere: "let's see how these systems stack up when compared to the perfect system!" That's a bunch of hooey. There is no perfect system. Every system has advantages and disadvantages, and each will be a better or worse fit for different groups. No chart is going to definitively tell you if a system is good for you. You need to try it for yourself.

Why wasn't Subversion written in a good modern language like Java or C++? Why did you use crufty old C?

This is dangerous ground -- nobody wants to get into a language holy war. There are a few reasons we chose C. Paraphrasing a couple of our developers:

Portability. C++ compilers are not standardized to the degree that C compilers are. What works in one C++ compiler doesn't in another, and linking to C++ libraries can be a nightmare.
C has a large pool of skilled programmers.
C library APIs are accessible from almost every other language. This is not true of Java.

Portability is the main point here. Just because Subversion is written as a collection of C libraries doesn't mean you have to use C. There are Subversion library bindings for Perl, Python, Java, and C++ out there, all being used by third-party projects.

A database back-end is too dangerous and unfriendly. What if I need to hack on the data directly? With CVS, at least I can open the RCS files in my text editor.

Are you suggesting that people mucking directly in RCS files is safe? Let me turn the question around: why are you loading RCS files into your editor in the first place? Why are your administrators hand-moving files around in the CVS repository? In my experience, it's almost always to overcome some shortcoming or annoyance caused by CVS itself. A well-functioning system shouldn't need its repository "hacked".

When you want to share highly organized data over a network, what's the standard practice these days? Easy: put the data in a database (like MySQL) and make it available through a web interface. It's the classic LAMP solution.

Subversion is doing the same thing: putting your data in a database, and making it available over a network. Notice that nobody panics over storing critical data in MySQL, and MySQL data isn't exactly hackable in your editor. If you want to look at the low-level data, use database utilities to dump tables. If you want to migrate your data, dump it out into a portable, transportable format.
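For the "dump it out" step, here is a minimal sketch of how an administrator might script a repository dump. It assumes the svnadmin tool is installed and on the PATH; the helper names and the paths in the usage comment are hypothetical placeholders, not anything Subversion itself provides.

```python
import subprocess


def dump_command(repo_path):
    """Build the argv list for 'svnadmin dump', which writes a
    portable dump stream of the repository to standard output."""
    return ["svnadmin", "dump", repo_path]


def dump_repository(repo_path, dumpfile_path):
    """Run 'svnadmin dump' and save the dump stream to a file.

    Raises subprocess.CalledProcessError if svnadmin exits non-zero.
    """
    with open(dumpfile_path, "wb") as out:
        subprocess.run(dump_command(repo_path), stdout=out, check=True)


# Hypothetical usage (paths are placeholders):
#   dump_repository("/var/svn/pilot", "/backups/pilot.dump")
```

The resulting file is the portable format mentioned above; it can be fed back into another repository with 'svnadmin load'.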

Also note that as of Subversion 1.1, you can create a repository that doesn't use BerkeleyDB at all. An "fsfs" repository stores data in the ordinary OS filesystem. (Though the files are still in a binary format, and still not meant to be human editable!)
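If you'd like to script the creation of such a repository, something like the following sketch should do. It assumes svnadmin 1.1 or later is on the PATH and relies on its --fs-type option; the helper names and the example path are hypothetical.

```python
import subprocess


def create_command(repo_path, fs_type="fsfs"):
    """Build the argv list for 'svnadmin create' with an explicit
    back-end: 'fsfs' (plain files) or 'bdb' (BerkeleyDB)."""
    if fs_type not in ("fsfs", "bdb"):
        raise ValueError("fs_type must be 'fsfs' or 'bdb'")
    return ["svnadmin", "create", "--fs-type", fs_type, repo_path]


def create_repository(repo_path, fs_type="fsfs"):
    """Create the repository; raises CalledProcessError on failure."""
    subprocess.run(create_command(repo_path, fs_type), check=True)


# Hypothetical usage (path is a placeholder):
#   create_repository("/var/svn/pilot")
```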

My friend said that Subversion is dog-slow.

Yes, that used to be true. We spent a long time working on correctness rather than speed. In late 2003, though, we spent a significant amount of time working on performance optimizations. By our own testing, Subversion 1.x should be pretty close to CVS in speed.

Look, Subversion can't be all butterflies and rainbows. What problems should I expect when using it?

I'm not going to lie to you. There are some annoying things about Subversion, but in the interest of actually releasing something useful to the world before the heat death of the universe (to quote Karl Fogel), we had to let some imperfections slide:

A lot of error messages could be clearer. We're working on it.
It's easy to get charset conversion failures. The repository stores all paths and commit messages in UTF-8, but clients can't always convert incoming UTF-8 data to the native system locale. We need to be more graceful about these sorts of failures, and get better at validating UTF-8.
BerkeleyDB requires care and feeding. On the one hand, it's incredibly convenient to have a transactional database in a shared library, rather than forcing people to set up a full SQL system. But on the other hand, most folks are too reckless with the database. If the process accessing the repository (apache, svnserve, svnadmin, svn, whatever) doesn't have complete read-write permission on all the db files, or if the process is interrupted, then the database locks up and requires journaled recovery to get back into a consistent state. This is not a big deal when it happens, but it's almost always a result of someone being careless who doesn't yet know any better. "With great power comes great responsibility" -- but most people are unaware of this responsibility and get burned when they treat an SVN repository just like a CVS repository. Please read this part of the book so you can become an "educated user".

Alternately, create an 'fsfs' repository instead of a BerkeleyDB one -- no wedging of the database, and works over NFS. See the Subversion 1.1 release notes.
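To make the charset problem from the list above concrete, here is a small Python sketch of the two ways a conversion can go wrong: the incoming bytes aren't valid UTF-8 at all, or they are valid but contain characters the client's native encoding can't represent. This only illustrates the general failure mode and is not Subversion's actual code; the function name and the sample filenames are made up.

```python
def convert_path(utf8_bytes, native_encoding):
    """Try to convert UTF-8 bytes from the repository into the
    client's native encoding.

    Returns a (status, result) pair, where status is one of
    'ok', 'invalid-utf8', or 'unrepresentable'.
    """
    try:
        text = utf8_bytes.decode("utf-8")  # validation step
    except UnicodeDecodeError:
        return ("invalid-utf8", None)
    try:
        return ("ok", text.encode(native_encoding))
    except UnicodeEncodeError:
        # Valid UTF-8, but the native charset has no such characters.
        return ("unrepresentable", None)


# A Latin-1 client can represent an accented 'e' ...
print(convert_path("caf\u00e9.txt".encode("utf-8"), "latin-1"))
# ... but not a Japanese filename.
print(convert_path("\u65e5\u672c\u8a9e.txt".encode("utf-8"), "latin-1"))
```

A more graceful client would report the second case clearly and carry on, rather than bailing out with a cryptic error.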