22
May - 2012
Tuesday
SUBSCRIBE TO NEWS
SUBSCRIBE TO COMMENTS

Ever Forward

Always discovering, learning, moving… ever forward

Getting Down From My Own Category Tree

Posted by donp On August - 30 - 2011Comments Off

This is for anyone who has ever wrestled with hierarchical data or even just wanting to manage finances in their own way.  I’m not a programmer or a database designer, but that doesn’t stop me from trying to learn about these things.  I should probably quit, since I usually get myself in over my head, just like I suppose I did with my category table in my PostgreSQL database.  But I guess it’s just my nature to do things like this.  And, well, sometimes, what you learn along the way may just surprise you.

I had designed a database with just 4 tables to track my personal financial transactions.  I specifically did not want to use any of the plethora of available personal accounting software, and so I ‘rolled my own’.  I created a table to hold some entities, be they myself or some business or other organization with whom I might engage in financial transactions.  Another table holds minimal information about a given transaction, while a third table holds the details of each and every transaction.  The fourth table holds the categories into which transactions can be classified.

This design allows me to make line items of all my grocery store receipts, if I want to.  I don’t have to, but I can.  When I go to Sam’s Club or Wal-Mart or Target, I may buy some groceries, music CDs, a new gadget for the kitchen and maybe even some oil for the car.  This raises the question of how to categorize the one transaction?  Does it fall under groceries?  Household?  Automotive?  Most might simply categorize these as “miscellaneous”.  But my design allows me to break down the receipt into line items, and then categorize the line items.

This approach, however, raises something of a minor dilemma.  How do we design a table that allows us to categorize the transaction items, up to 3 or more levels deep?  After all, Transportation can be broken down into various modes (bike, bus, car, etc.) and further into fares, maintenance, etc. for each mode.  Maybe I should have just stopped at the second level and called it a day.  Yes, I probably should have.  But then I wouldn’t be in over my head, would I?  What’s more, I had no idea that I was about to cross into water that deep.

I got the idea for this category table design, after reading several blog posts about hierarchical tables.  I found an example and followed it.  I thought it would be fairly easy to work with.  The example seemed easy enough – and it really is.  And, in terms of entering data, it is an extremely easy design.  I just assign the transaction item to the appropriate category via the category_id.  But – ahhh, you knew that but was coming somewhere, didn’t you?  But the problem comes in requesting the data in the way I want to see it.  Indeed, the example I followed served a completely different purpose, related to another realm of problem solving, albeit with similar challenges.  In other words, it was analogous, but not the same.

Here is what my original category table looked like:

Original table design image

Original table design

The screen shot shows the last few entries of the category table as I originally designed it.  The first column is the category_id, simply a unique identifier for each category.  The second column is the category_name, the actual name of the category.  Then the parent_id allows me to link a given category to its parent category.  The column lineage is meant to allow me to extract the parent and grandparent categories from the bottom-level category.  This was supposed to be – so I thought – where the magic would lie.  I thought I could easily use this to create a Sum-by-Category view of my finances, using only the top-level categories, like so:

 

Sum By Category

What I want to see: Sums by Root (Top) Category

This spreadsheet view shows the desired view using some test data.  I set this up as an intermediate solution until I could reach the solution I wanted in the database itself.  The spreadsheet solution is to connect to the database server, select a view that shows the amounts summed by bottom-level category, and then create formulas to calculate the sums by top-level category.

This is a common task.  In business, the executives generally don’t care about the details of every single transaction; they really only want to know about the general operating budget.  If something is out of line, then they may need to drill down deeper to see where the anomaly lies.  The same goes for personal finances.  You may really only want to see the general income and expenses to be sure everything is in line with your budget.  But if you spend $1000 on your transportation budget in one month, then you can quickly see something is out of line with the norm.  Perhaps it was an unavoidable repair, but it could also have been those fancy new hubcaps you really didn’t need.

The amount shown comes from the details table, which I called transdetails.  So the query to the database has to draw from both, the category table and the details table.  Folks familiar with the Structured Query Language that PostgreSQL uses will probably immediately see the challenge.  The problem with my original table design, in terms of being able to see data like this, is that it takes far more effort to fulfill this seemingly simple request than is warranted.  I encountered the following:

  • I could show the sums grouped by parent categories, which actually omitted the categories that have no subcategories, and did not necessarily reach the top-level category.
  • I could show the top-level category for a given branch of the tree, but then  I would have to repeat this process for each category.

In other words, no matter what I did, I kept coming up short, either not being able to see all the categories at once, or having to effectively iterate through every category individually in order to see them all.  I honestly thought that someone with more experience would be able to either suggest a solid solution or help me think through the problem to help me arrive at a solution.  Alas, the suggestions I received either seemed beyond my immediate comprehension or just weren’t quite as helpful as I had hoped.

I finally decided to change that lineage column so that it only reflects the root category to which a given category belongs.  While I lose that line of heritage, it makes linking the bottom-level categories with their roots much easier, and I can still use the parent_id, when I just want to see groupings by parent category.  But most of the time, I will just want to see either the top categories or the bottom categories.  I don’t see much need to view the mid-level categories.

Here is what the new category table looks like:

 

New Category Table Design

The 'lineage' column simply refers to the root category_id

This new design requires a single intermediate step, in which I request the amount from the details table, summed by the category_id, and the lineage from the category table.  Since the lineage column is a link to the root category, it is a simple matter to query this result as a view (or maybe as a temporary table), and arrive at the same result as the spreadsheet view in the second image.  Having tested it, the results are the same.  Whether it constitutes “good design”, from the standpoint of a textbook, I cannot say.  But I can say that this design works well for this situation.  That brings me to some important thoughts.

When I made this change, I felt like I was somehow cheating.  I had failed, in my mind, to work through the problem completely.  I had missed something, it seemed.  I felt like I was taking the easy way out.  The reality is that I probably had chosen a design that works well for some things, but not for what I needed.  After a month of wrestling with this, it was probably time to ‘move on’.  Perhaps one day I will acquire that magical power that only the true database Jedi possess.  But more likely, I have acquired something at least as important – a solution that works.

When it comes to designing a database, I’ve learned that hierarchical structures are tricky to work with in general.  Perhaps more importantly, I’ve learned that there is a balance in “good design” between design, ease-of-use and performance.  As my friend, Jens Vöckler, pointed out “Anyone who has to actually run a db system in production will do whatever it takes to make it run.” What it really takes to make a database run is accuracy and performance.  If you’ve got that, you’ve got reason to be fairly satisfied.

There is also a lesson I still haven’t learned – and hopefully won’t learn any time soon.  I have not learned to quit.  Not very easily, anyway.

Connecting LibreOffice to PostgreSQL

Posted by donp On July - 31 - 2011Comments Off

In my previous post, I mentioned a database I put together using PostgreSQL, and now I need to connect LibreOffice to it so I can directly access my data from within Calc.  It’s really quite simple, assuming you have all the other elements in place.

I won’t bother with all the gory details one can find at the OpenOffice.org DBA website that provides in-depth instructions on adding in the SDBC connector for PostgreSQL.  Frankly, I used LibreOffice’s Extension Manager to install the PostgreSQL SDBC driver, so that was already taken care of.  This just beats the pants off of having to follow the instructions you’ll find at the above link.

Note, I use Debian GNU/Linux and Ubuntu Linux respectively, but the same applies to Windows users (though the instructions for installing the various elements will certainly be different.  And I am assuming that, like me, you already have all that done.

Step #1: Start a new database in LibreOffice or OpenOffice.org

If your office suit is not yet open, you can just launch Base directly.  If you are already working in your office suite, you can choose Base from the New Document icon on the toolbar, or via the File  –> New menu option.

Step #2:  Select Connect to an Existing Database

In the resulting connection wizard dialog, choose the Connect to an Existing Database option.  Select PostgreSQL from the drop-down dialog.  Then click Next.

Step #3: Enter the connection URL

This is not an Internet URL, but it functions similarly.  You now need to tell LibreOffice to connect to a specific database on a specific computer:

dbname=mydb host=localhost

Use the name of your database in the dbname part.  I used localhost for the host part, since PostgreSQL is on the same computer, but I would need to change that in the event that I want to connect from a remote computer.

Step #4: Enter your username and password

Fairly simple, really.  You most like want to use the password option – the wizard failed to connect when I tried it without my password.  If the connection tests successfully, you will get a message saying so.  In this case, you can click Next.  If it fails again, there is a possibility that something went wrong in the previous steps – double check typing, etc. just to be sure.  Failing that, you may need to post a help request on one of the OpenOffice.org forums or LibreOffice mailing list.

Step# 5: Looking at your data

Once connected to your PostgreSQL database, you will initially see the following sets of tables:

  • Information schema
  • pg_catalog
  • public

You will find your data in the public section.  If you’re a novice, I’d suggest leaving the other stuff alone until you know more about that.  If you’re an advanced user, you’ll likely know more about it than I do.

One note: I find I am not able to add or edit data – at least not using LibreOffice 3.3.2 on Ubuntu 11.04 and using the SDBC driver (0.7.6b).  I understand this is a bug.  We’ll see how long it takes to fix this issue.  I have not yet tried this with my Debian system, which runs PostgreSQL 9.0  The OpenOffice.org web page above has a list of the general features and limitations of the SDBC driver – don’t know how up-to-date it is.

An Adventure in PostgreSQL

Posted by donp On July - 31 - 2011Comments Off

I have occasionally dabbled in database design over the past few years.  I once toyed with the idea of a church management database, which never really went anywhere.  Mind you, the experts that examined it really liked the overall design, but I had some work to do with making it actually useful.  Fast forward to the present, and I’ve tinkered a little more, and finally developed my own income and expense database that, so far, I really really like.  Again, I still have some work to do yet, especially showing expenses for one, two or even three levels of categories.  Still, the basic functionality is in place and seems to work very well.  Here’s a little of what I did.

I could have simply used GnuCash or KMyMoney or other personal finance software, but then I would not have learned what I have about databases in general.  Besides, I can’t stand to have to “sync” every single transaction – what I really want is to just see what I’ve earned and spent, and where all that money went.  I chose PostgreSQL because OpenOffice.org & LibreOffice are currently experiencing problems with their development of Base.  I also wanted to learn more about PostgreSQL, and Base’s development issues present a great opportunity to delve into the former. And once I connect LibreOffice to PostgreSQL, I should be able to create all the pretty charts that the personal finance software can create.

If you don’t know what PostgreSQL is, it is what is commonly called a “Relational Database Management System” or RDBMS.  It is comparable to Oracle and MS SQL Server, and used in many quarters by some rather notable “customers”.  It’s features make it one of the absolute best database systems available.  It is, of course, libre  (or what some might refer to as “open source”) software.

I bought “Beginning Databases with PostgreSQL” (APRESS) through my cell phone’s Kindle app.  Incidentally, Kindle on a cell phone is great for novels, not so great for technical books with charts and tables that you might want to read.  Worse, Amazon still has no version of Kindle for GNU/Linux distributions, and I don’t see a way to read the book via my web browser, either.  Otherwise, the book is a great buy, and has helped me understand some things that basic documentation does not answer – at least, not very well.

I created 4 simple (more or less) tables: Entity, Category, Transaction, and TransactionDetails.  That last one is thanks to the examples I’ve seen using “orders” and “order details” in tutorials.  This last one is especially important, since it allows me to draw out very specific details, if I really want to.  The Entity table, of course, stores the basic data for myself and businesses or other persons with whom I may do business.  The Category table should be self-explanatory, but it currently holds up to 3 levels of categories.  The Transaction table only holds the transaction date, and possibly a note about the general transaction.  I make no attempt whatsoever to categorize the general transaction because a single transaction may frequently be split into multiple categories.

A note on the categories: Some people will not want the most minute details on every single transaction.  That’s fine.  If you don’t want to record every item you bought at the grocery store, you can just use the “Groceries: General” category, and be done with it.  But the design lets one drill as deep as one wants to get.  There may be times when you only want to see the top-level categories, but there might also be reason to drill deeper.  This format lets the user decide at what level to stop drilling.

The TransactionDetails table is, frankly, where the majority of the “action” is.  Well, the reality is that the “action” is in the queries, not the tables.  Still, I can split a paycheck to show what Uncle Sam has taken out.  I can show every item I purchased at the grocery store, and later run a report to see how much money I spent on fruits and vegetables, or on dairy products.  I should pause here to say that my database does not show my current bank balance.  I did not make any effort to do so.  That said, one can create a category called “Accounts”, with subcategories for “Checking”, “Savings”, etc.  Then, when I make a transfer, I simply perform a bit of double-entry accounting, showing the minus transaction for one account and a plus transaction for the other.  Performing balance checks should not be overly complex either.

The following query shows how I have pulled out the Date, Entity and Amount, ordered by the date:

SELECT 
 transaction_date AS "Date", entity_name AS "Store",
 sum(transdetails_amount) AS "Amount"
FROM
 public.tblentity e, public.tbltransaction t, public.tbltransdetails d
WHERE
 e.entity_id = t.entity_id AND  t.transaction_id = d.transaction_id
GROUP BY entity_name, transaction_date
ORDER BY transaction_date;

The above SQL commands select the date, entity and the total amount for each transaction from the three main tables, groups them by the entity name and transaction date, and then makes sure the result is ordered by date (in this case, from oldest to most recent).  Incidentally, the result shows my net pay from my most recent paycheck, as well as the totals for all my purchases at each restaurant or store. Oh, and it does all this in about 0.17 milliseconds.  You’ll forgive me, I hope, for not showing the actual results, but maybe at some point I can produce some sample data that I won’t mind sharing.

This SQL shows the amounts by catgory:

SELECT category_name AS "Category", sum(transdetails_amount) AS "Amount"
FROM tblcategory c, tbltransdetails d
WHERE c.category_id = d.category_id
GROUP BY category_name;

The result shows what I spent in each specific category.  I have not yet attempted to only categorize by the first-level categories.  That should prove to be an interesting exercise, and I’ll try to return once I’ve done that.  This query shows “income” as a gross amount, but also shows the taxes in their respective categories.  Incidentally, I understand the design of my categories table (which I picked up here) will only go so far, but for most of my purposes it should work just fine.  The main thing here, though, is that I can see where my money is going.

Another thing I should note is that, while writing SQL queries at the command prompt is not necessarily the worst thing in the world, using PGAdmin III certainly improves the work environment.  I did not use the graphical query builder, except to display the tables (with their column descriptions), so I could see which columns I wanted to pull from.  I can click between th SQL tab and the graphical tab to make sure I’ve crossed my ‘”t’s” and dotted my “i’s”.  Also, you can save the queries once you’re sure they work correctly, and then you can save the resulting data into CSV files which you can then open in any spreadsheet.

Understand that this is more of a show-n-tell than a howto.  I just wanted to share a little of my adventure in exploring the PostgreSQL database system and how one can do some really cool stuff, whether for personal or business matters.  If anyone is interested, maybe I can devote a future post to the table design.  Also, if you had to think of a name for this little database, what would you call it?