Getting Down From My Own Category Tree
This is for anyone who has ever wrestled with hierarchical data or even just wanting to manage finances in their own way. I’m not a programmer or a database designer, but that doesn’t stop me from trying to learn about these things. I should probably quit, since I usually get myself in over my head, just like I suppose I did with my category table in my PostgreSQL database. But I guess it’s just my nature to do things like this. And, well, sometimes, what you learn along the way may just surprise you.
I had designed a database with just 4 tables to track my personal financial transactions. I specifically did not want to use any of the plethora of available personal accounting software, and so I ‘rolled my own’. I created a table to hold some entities, be they myself or some business or other organization with whom I might engage in financial transactions. Another table holds minimal information about a given transaction, while a third table holds the details of each and every transaction. The fourth table holds the categories into which transactions can be classified.
This design allows me to make line items of all my grocery store receipts, if I want to. I don’t have to, but I can. When I go to Sam’s Club or Wal-Mart or Target, I may buy some groceries, music CDs, a new gadget for the kitchen and maybe even some oil for the car. This raises the question of how to categorize the one transaction? Does it fall under groceries? Household? Automotive? Most might simply categorize these as “miscellaneous”. But my design allows me to break down the receipt into line items, and then categorize the line items.
This approach, however, raises something of a minor dilemma. How do we design a table that allows us to categorize the transaction items, up to 3 or more levels deep? After all, Transportation can be broken down into various modes (bike, bus, car, etc.) and further into fares, maintenance, etc. for each mode. Maybe I should have just stopped at the second level and called it a day. Yes, I probably should have. But then I wouldn’t be in over my head, would I? What’s more, I had no idea that I was about to cross into water that deep.
I got the idea for this category table design, after reading several blog posts about hierarchical tables. I found an example and followed it. I thought it would be fairly easy to work with. The example seemed easy enough – and it really is. And, in terms of entering data, it is an extremely easy design. I just assign the transaction item to the appropriate category via the category_id. But – ahhh, you knew that but was coming somewhere, didn’t you? But the problem comes in requesting the data in the way I want to see it. Indeed, the example I followed served a completely different purpose, related to another realm of problem solving, albeit with similar challenges. In other words, it was analogous, but not the same.
Here is what my original category table looked like:
The screen shot shows the last few entries of the category table as I originally designed it. The first column is the category_id, simply a unique identifier for each category. The second column is the category_name, the actual name of the category. Then the parent_id allows me to link a given category to its parent category. The column lineage is meant to allow me to extract the parent and grandparent categories from the bottom-level category. This was supposed to be – so I thought – where the magic would lie. I thought I could easily use this to create a Sum-by-Category view of my finances, using only the top-level categories, like so:
This spreadsheet view shows the desired view using some test data. I set this up as an intermediate solution until I could reach the solution I wanted in the database itself. The spreadsheet solution is to connect to the database server, select a view that shows the amounts summed by bottom-level category, and then create formulas to calculate the sums by top-level category.
This is a common task. In business, the executives generally don’t care about the details of every single transaction; they really only want to know about the general operating budget. If something is out of line, then they may need to drill down deeper to see where the anomaly lies. The same goes for personal finances. You may really only want to see the general income and expenses to be sure everything is in line with your budget. But if you spend $1000 on your transportation budget in one month, then you can quickly see something is out of line with the norm. Perhaps it was an unavoidable repair, but it could also have been those fancy new hubcaps you really didn’t need.
The amount shown comes from the details table, which I called transdetails. So the query to the database has to draw from both, the category table and the details table. Folks familiar with the Structured Query Language that PostgreSQL uses will probably immediately see the challenge. The problem with my original table design, in terms of being able to see data like this, is that it takes far more effort to fulfill this seemingly simple request than is warranted. I encountered the following:
- I could show the sums grouped by parent categories, which actually omitted the categories that have no subcategories, and did not necessarily reach the top-level category.
- I could show the top-level category for a given branch of the tree, but then I would have to repeat this process for each category.
In other words, no matter what I did, I kept coming up short, either not being able to see all the categories at once, or having to effectively iterate through every category individually in order to see them all. I honestly thought that someone with more experience would be able to either suggest a solid solution or help me think through the problem to help me arrive at a solution. Alas, the suggestions I received either seemed beyond my immediate comprehension or just weren’t quite as helpful as I had hoped.
I finally decided to change that lineage column so that it only reflects the root category to which a given category belongs. While I lose that line of heritage, it makes linking the bottom-level categories with their roots much easier, and I can still use the parent_id, when I just want to see groupings by parent category. But most of the time, I will just want to see either the top categories or the bottom categories. I don’t see much need to view the mid-level categories.
Here is what the new category table looks like:
This new design requires a single intermediate step, in which I request the amount from the details table, summed by the category_id, and the lineage from the category table. Since the lineage column is a link to the root category, it is a simple matter to query this result as a view (or maybe as a temporary table), and arrive at the same result as the spreadsheet view in the second image. Having tested it, the results are the same. Whether it constitutes “good design”, from the standpoint of a textbook, I cannot say. But I can say that this design works well for this situation. That brings me to some important thoughts.
When I made this change, I felt like I was somehow cheating. I had failed, in my mind, to work through the problem completely. I had missed something, it seemed. I felt like I was taking the easy way out. The reality is that I probably had chosen a design that works well for some things, but not for what I needed. After a month of wrestling with this, it was probably time to ‘move on’. Perhaps one day I will acquire that magical power that only the true database Jedi possess. But more likely, I have acquired something at least as important – a solution that works.
When it comes to designing a database, I’ve learned that hierarchical structures are tricky to work with in general. Perhaps more importantly, I’ve learned that there is a balance in “good design” between design, ease-of-use and performance. As my friend, Jens Vöckler, pointed out “Anyone who has to actually run a db system in production will do whatever it takes to make it run.” What it really takes to make a database run is accuracy and performance. If you’ve got that, you’ve got reason to be fairly satisfied.
There is also a lesson I still haven’t learned – and hopefully won’t learn any time soon. I have not learned to quit. Not very easily, anyway.


