Proofing Your Own

Last updated: Sun, 04 Sep 2005 12:01:00 GMT

I confess, gentle reader, that I find it hard to read what I write.

Not, you understand, because what I write is so difficult to read -- I've grown used to that over the years -- but because I find it hard not to read what I think I've written, not what I've actually committed to paper or disc. I'm not alone in this, I know. It's easy for errors to slip through the net when proof-reading my own writing, whether that be prose or code.

I don't develop full time -- I sit on the fence between developer and sysadmin, and what I write tends to be scripts, tools databases to glue systems together, to drive things like account creation and deletion, to populate our DNS zones. You get the idea. I remember well a discussion I had with my cartoon conscience some years ago, when he suggested that I really should pick one or the other -- coder or sysadmin -- because I wouldn't do my career any good chopping from one discipline to the other, as I was doing at the time.

Well, he was right. I find myself now, after ten years in the big bad world, not quite "ten years good" at anything. Jack of all trades, the saying goes. And master of none, it ends. Still, if you need someone to tie your Unix and Windows 2000 authentication systems together without resorting to handing all of your data over to a 2K3 server, I'm your man.

It's just that problem I found myself working on 18 months ago. We had been in the process of migrating a lot of our legacy crap away from the various technolgies in relied on and into the brave new world of Oracle. I almost have an ideology when it comes to development and databases, which Oracle helps me stick to. Not just Oracle, but that's what I use here.

One of my golden rules is "no logic outside the database." Oracle helps there with stored procedures and functions. You could achieve the same ends with a middleware between the user and the data, but keeping the code close to the data is nice. Of course, there will always be code outside the database, for interfacing those pesky users. UI doesn't count, and I don't care what you do with it, but the only code outside my database is concerned with display, entry and validation.

Another golden rule is "no direct changes to the data; modification through API only." I don't like trigger-driven data manipulation as a rule. I don't like my data being changed on the fly. Triggers can be turned off. And there'll always be some logic that can't be captured or elegantly implemented by making triggers do the work in the background. Various contraints make sure that stored data conforms, and one simple rule means that all data changes have to go through the same stored procedures. I do this even if the stored procedure is little more than the INSERT statement I'd have written anyway. It seems tight, but it'll save my life one day. Of course, you can pull the data our any which way you like, I don't care, I just want my data to be internally consistent at all times. Say "no" to ambiguity.

The project was small enough that it had only a few non-trivial problems, but large enough that it was a test of the various rules I wanted to set myself. I worked for a couple of weeks, got something that was 80% finished, exhaustively commented every PL/SQL package header and then committed the first revision to our CVS repository.

And then I stuck my hand in some moving machinery, which we've discussed before.

While I was recuperating, development began on a related project that was to use my account database. It was a big project, and the guys working on it were full-time developers. I have respect for these guys, and am not too proud to take advice from them, because they are ten years good. Unfortunately, though, they've never really seemed that keen to try and adopt any more structured or closely-knit development practices. Which is a shame. They were happy to bide by the "modification through the API only" rule, but apart from that, they wanted to pretty much keep themselves to themselves. No matter.

They seemed pleased with what I'd handed them, and got on with their task while I chewed codeine tablets and had my dressings changed. When I got back, there were a few modifications to make to my code, an API change here and there, but nothing more than you'd expect. We polished corners off our turds and before we new it it was time to go live.

Everything worked a treat, barring some nastiness I had to write to tie the two systems we'd just written in with the one legacy system we'd decided to leave until Phase Two. Everyone was happy, there may even have been champaggen.

Fast-forward a year, to the end of last week. A frantic phonecall. Well, not quite frantic -- this guy's as laid back and friendly as they come. But, still.

"Have you made any changes to the account database? Like, overnight?"

"No. None. I've been away for the past two days."

"Well, I'm seeing some strange behaviour. Something's changed, users are getting old-style passwords."

"Uh. Right. I'll ask around, see what's happened while I've been away."

I asked around. The database instance had been shut down while I was away, during some hardware maintenance. Fair enough. But no-one had made any changes. In that intervening year that we've just fast-forwarded past, there had inevitably been a few minor tweaks and updates, all of which were in CVS, but I was naturally cautious of making changes to a production system. And of course, there are a couple of development instances of this database lying around, both of which were liable to be at a later revision state.

I spent fifteen minutes making sure that my understanding of the behaviour of Oracle stored procedures was as I understood it, i.e. that packages are like other database objects, and that DDL change behaviour is followed: changes are reflected immediately. After all, these machines had been up for some time -- could this guy's system have held open a transaction for that long, could he be using old code?

No. Code changes are immediate.

So, what had changed? Well, according to the guy who'd just rung off, users were getting the wrong style of password. I won't bore you with the details but some users get one style of initial password, other users get a different style. This behaviour is switched on a simple flag in the one API call that's relevant and public.

I had a quick read through and called him back, just to get a grip on what he understood the problem to be. I described how I expected the procedure to behave, given various inputs.

"No, that's now how it works," he said. "It's always seemed a bit counter-intuitive, but p_use_new_style gives an account a new-style password if I pass a zero through, and an old-style password if it's non-zero."

"That can't be right; if I'd meant that I would have called it p_use_old_style or something."

"Yeah, I thought that, but my understanding matches the package documentation. I did think it was a bit strange, though."

I took a look at the package spoec and what do you know? He was right. That's exactly what my documentation said.

"It's been working right up until last night, the way I expect it to behave, the way you've documented it."

"Hell, I'm sorry, I'll dig around and see what's up." I had to apologise, the evidence was stacking up against me rapidly. So, I followed the code through from that call, down into the bowels of the relevant packages. Turns out I pass it through unmolested to a function that returns one password style or the other, given the one flag as an input. I remembered writing that. I remembered thinking "this is overkill, but I never know how many times I'm going to have to make this decision elsewhere, so I might as well put the logic in one place." And the code was a simple as I could wish, and directly contradictory to the behaviour this guy had been relying on for the past year.

Eh?

I gave him a call to ask him if there were any chance he'd ever passed me a NULL. No, he hadn't. He called it in only one place, and there was no possibility that he'd ever be calling with a NULL. He couldnt have without failing input validation anyway, but I was grasping at straws. (Oh for an orthogonal availability of BOOLEANs in Oracle!)

Time to change tack. When had the code changed? If I could find when the code had changed, I might have a handle on why it was only becoming apparent now.

CVS to the rescue. I started plowing through annotated output to see what had changed and when. Dead end. Nothing in the path of execution had changed for at least six months, and the particular function that chose a password for me hadn't change since import, some fourteen months previously.

Panic time was approaching, The reason that guy had phoned me in an uncharacteristic flap was that he was leaving at the end of the week for a month's holiday. He wanted a solution quick, and we had to agree on something today, so that we could run it for a day before he left. Face to face was called for.

"I'd like you to change your code back to what it was yesterday."

"But I haven't changed anything. The relevant code has been static since two months before we went production."

"That's as may be, but we can see that something's changed in the data I'm getting back from you. Something's wrong, and I haven't changed anything."

"I know, I know. I don't know what's going on here. But you've got to understand that you're asking me to change something back that has never changed, that I don't remember changing, and that I've always understood to work in the opposite way that you understand. My documentation's out, but I never intended the code to work the way you think it works."

"What can I say? It's worked that way in all of the testing I've done. For over a year, since we began testing against it. You've got to change it back."

"There is no back."

We sat there. He's a nice, laid back guy, but he knows I'm not. He thinks I'm going to take this personally and things will get messy. I'm not. I'm just not backing down, because I don't understand what's going on. Stand-off.

"Show me the data that proves that this is the way it's always worked, let's look at the last user who got the right behaviour and the first that got the wrong." He queried, we looked. Nothing obvious.

Except...

"How come this user's registered in the Active Directory already? That, there, says that Windows has to know who this guy is, which isn't possible at creation time. Modification date's recent, too -- this morning." Curious! Enter changelog.

A quick word about my changelogs. It's little more than column-level trigger-driven modification tracking. All my core tables have a changelog table which records the state of the row before change, the nature of the change, flags that tell me which columns changed (because it's more expensive to deduce this in retrospect) and some other ancilliary data like the time, the order they occurred in and things that identify the session that made the change.

Sounds verbose? It is. Sounds like a waste of space? Space is free. Sounds like a waste of time? Get a faster digital watch, buddy. It makes keeping external subscribers -- like an LDAP directory -- as easy as hell. I can make point-in-time recovery on a per-row or per-table level. It also makes for great debug.

I took a look at the changelog and what did I find? Users are being created with the "wrong" password style. Always have been, since forever. They're always created as per my understanding of the code, but contrary to the package documentation. Every night for a year, this guy's system creates a batch of malformed accounts and then, the next morning, a fairy comes along and cleans up the mess.

From whence comes this code fairy? Remember that scripty nastiness I mentioned earlier, that glues the old system into this one? The guy who wrote that hideous script knew the meaning of this one flag without even looking at the docs.

That guy was me.

Laugh? We nearly cried. We decided to leave everything as it was, given that it'd obviously been working in a way for a year. We'll tidy it up when the guy gets back from his holiday.

My bad. I made a mistake documenting the package. The name of the variable proved to be better documentation than the prose accompanying it, and the fact that one didn't match the other should be considered a bug. I'm not sure how I'll ever be able to avoid that kind of thing, except by leaving the code so long before I document it that I have to reacquaint myself with it.

I have to ask myself if the problem was a mistake in the documentation or, saints save me from my own blasphemy, whether documenting it at all was the mistake. My code's pretty readable, I should have just included an example or two. That way, I could at least have cut and pasted the example to make sure it produced the right output. The docs are just chaff for eejits.

But I'm left with another question, still. I'm not sure I'll be chasing for an answer. But the question is this: just how rigourously have the other two tested their code, if they spent two months full-time developing a system against my database and never once noticed that the documentation disagreed entirely with the code's behaviour? Yet they noticed that the documentation disagreed with the name of the variable?

I guess even people who are ten years good make mistakes. They're still better at this that I am.