Skip to content

Are GUIDs Really the Way to Go?

7
Nov
2007

I recently read a (slightly inflammatory) posting entitled “The Gospel of the GUID”.  In it, the author attempts to put forward several arguments in favor of using GUIDs for all database primary keys (as opposed to the more pedestrian sequence-generated INTEGER).  I’ve heard similar arguments in the past, and I keep coming back to the same conclusion: it’s all bunk.

To get right to the heart of my opinion, I really don’t see a compelling reason to use GUIDs for 90% of database use-cases.  Consider for a moment some of the really large databases in this world (Wikipedia, Slashdot’s comments table, Digg, etc).  I can’t think of a single web application in that league which uses GUIDs.  If you don’t believe me, look at the code for MediaWiki, the URL structure for Digg and the idiocy involving INTEGER vs BIGINT on Slashdot.  While de facto practice may not dictate optimal design, it does point to a trend that’s worthy of notice.  More importantly, it proves that INTEGER (or rather, BIGINT) primary key fields are practicable under real-world stress.  But what about the theoretical use-case?

Well to be honest, the theoretical use-case doesn’t interest me all that much.  I mean, if you tell me that your schema can support theoretically 29 trillion rows in its users table, I’ll geek out right along with you.  But if you try to feed that to a client, you’ll get either blank stares, or the equally common: “but there are only 5 billion people to chose from.”  (unless you’re talking to someone in upper management)  For an even better party, try telling your client that you can merge their data with one of their competitor’s in 45 minutes flat.  Somehow I doubt that argument will fly.

Now let’s look at the cons involved.  GUIDs are really, really hard to remember and type by hand.  When you’re in a hurry, pulling a little data-mine-while-you-wait for your boss, you’re not going to want to dig around and find that string you copy/pasted into Notepad.  Oh, and heaven forbid you actually type the GUID wrong!  Finding a single-bit deviation in a 64-bit alpha-numeric is not a trivial task let me tell you.  It’s at about this point that you start to think that maybe INTEGER keys would have been a better way to go.  At least then you can throw the userID into good ol’ short-term memory and bang it out ten seconds later in your SQL query window.  Oh, in case you’ve never seen a GUID before, here’s a couple quick examples:

  • {3F2504E0-4F89-11D3-9A0C-0305E82C3301}
  • {52335ABA-2D8B-4892-A8B7-86B817AAC607}
  • {79E9F560-FD70-4807-BEED-50A87AA911B1}

I’m not even sure how to read that, much less remember it.

Another thing worth considering is that GUIDs are VARCHARs within the database.  This means that not all ORMs will handle them appropriately.  Hibernate will, as will iBatis and ActiveObjects.  But if you’re using something like Django or ActiveRecord, you’re out of luck (disclaimer: Django might actually support VARCHAR primary keys, I’m not sure).  On top of that, some database clustering techniques don’t seem to work nicely when applied to GUID-based key schemes.  I haven’t run into this myself, but my good friend Lowell Heddings has told me tales of strange mystics and evil tides when playing with distributed indexes and GUIDs in MS SQL Server.  Technically speaking, while any sane database may work with a VARCHAR for the table’s primary key, the architects really had INTEGERs in mind when they designed the algorithms.  This means that by using GUIDs, you’re flirting with a less-well-tested use-case.  It’s certainly not unsupported, but you should consider yourself in the minority of users if you go that route.

Oh, and one little myth I’d like to debug: there really isn’t that much more overhead in grabbing the generated value of an INTEGER primary key than in sending an in-code generated GUID value.  Granted, some databases make this harder than others, but that’s their fault isn’t it?  As long as you’re using a well designed ORM, you really won’t see much of a performance increase from removing that miniscule callback.  In fact, depending on how your GUID algorithm works, you may see a significant degradation in overall performance due to extra overhead in your application.

Conclusion

So, the obligatory thirty second overview for our attention-deprived era:

Benefits of GUIDs perfectly unique across all your data and tables; way, way more head-room in terms of values
Problems with GUIDs really, really un-developer friendly; weird side-effects in databases and ORMs; unconventional approach to a “solved” problem; easy way to upset your DBA

So it really seems the only upshot for GUIDs is their massive range.  In trade-off, you lose usability from a raw SQL standpoint, your result sets are that much more cluttered and you risk odd bugs in your persistence backend.  Hmm, I wonder which one I’m going to use in my next app?

Before you make the choice, ask yourself: “how much uniqueness do I really require?”  Over-designing a solution is just as much a problem as under-designing one.  Try to find the mid-point that works best for your use-case.

Is a Separate Text Search Engine a Bad Idea?

11
Oct
2007

I was reading this blog entry a few days ago, and it started me thinking about full-text searching.  That wasn’t the main topic of the post, but I think the little side-trek into the field was interesting enough to merit some thought.  Right smack in the middle, Jamie goes on a bit of a rant about the pain of what is effectively two, separate databases (for example, MySQL and Lucene):

A fellow Rails developer asked me in all seriousness why I wasn’t abandoning the full text search functionality of TSearch2 and just using a completely separate, redundant database product designed exclusively for full text search. Seriously, that is considered the “easy” approach: one database for full text search, and another for ACID/OLTP/CRUD. Honestly if I were going to go down that road I would try hard to just abandon the SQL RDMBS and put everything in the other database, since Lucene and its imitators are capable of far more than just find-text-in-document queries. The pain of duplicating everything, using two query languages, two document representations (in addition to the object representation in Ruby) and writing application-tier query correlation makes the double-DB approach seem very unwise.

There is some validity to this thought.  After all, duplication in software usually means you’re doing something wrong - or at least, there could be an easier way.  Even ignoring this precept, it’s just common sense that keeping data synchronized concurrently between two data sources as complex as a relational database and a full-text index is not an easy task.  Granted, some ORMs can handle this task for you (actually, I can only think of Hibernate and ActiveObjects having this feature), but the principle is the same.  And even if everything is neatly and auto-magically synced, there’s always a danger of something getting out of place, and then you’re stuck with a transient stale data issue that’s difficult to track down.

The author of the post mentions that he favors the full-text search capabilities of PostgreSQL, the popular open-source database and competitor to MySQL.  This does have the advantage that you’re putting all the data in one place, handling everything with a single query language (SQL), and reducing the technologies your software depends upon.  This inarguably makes things a whole lot easier.

The main problem as I see it is this is putting a ton of unnecessary strain on the database.  In most modern server-side applications, the bottleneck is in the database (usually caused by too much badly written SQL).  There are whole mountains of documentation which offers suggestions on how to alleviate this problem.  Indexes, database clustering and a carefully chosen ORM can go a long way.  Unfortunately, tacking on full-text indexing seems like a step in the wrong direction.

Lucene is very good at what it does.  It’s indexing and storage performance is second to none.  In fact, it’s so fast that a lot of companies use it as a quick-and-dirty storage dumping ground for raw data, knowing that it will be much faster and more scalable than a relational database.  Why not take advantage of this incredible power and take one more item off of your database’s back?  This is all not to mention the fact that a Lucene index query is probably a lot faster than an SQL query grabbing data from a PostgreSQL full-text index.

So what about the flip side of things?  Why not just put all the data into Lucene (or clone) and eschew relational databases altogether?  Well as I mentioned above, a lot of companies do this for simple things.  Lucene is fantastic at both scalability, and very fast indexing and querying of large blocks of text.  Where it begins to trip up is when you turn it loose on other data types.  Don’t get me wrong, Lucene is an amazing piece of technology.  But just like PostgreSQL isn’t a full-text search engine, Lucene isn’t an RDBMS.  Each component of the infrastructure needs to handle what it’s best at.  In fact, this is really a large aspect of scalability.  Ensuring that every technology is utilized to its fullest potential and no more is crucial to a high-volume application.

Final verdict?  I think I’m sticking with MySQL and Lucene working in tandem, each doing what they do best.  ActiveObjects makes the synchronization almost completely transparent, so it’s not like I’m loading myself down with unnecessary work from a code standpoint.  Seems like a good solution to me; and since most of the industry agrees, it’s probably a safe bet for you too.

ActiveObjects: Indexing vs Searching

1
Oct
2007

So in the intervening time since I last updated you on ActiveObjects, I’ve been busy refactoring and repurposing some of the core.  I’ve added a full JUnit test suite, which definitely helps my own confidence about the stability of the source.  Also, there’s a whole bunch of new features that have come down the pipe which hopefully I’ll get to address in the next few posts.  So, without further ado…

The change which is probably going to cause you the most grief is the switch from @Index to @Searchable, and the addition of the @Indexed annotation.  Yes, you really did read that correctly; and no @Indexed isn’t even related to the functionality provided by the old @Index annotation.

The old @Index annotation used to handle tagging of entity methods to mark them as to be added to the Lucene full-text search index (see this post for more details).  This was a little confusing for a number of reasons, not the least of which my failure to remember the tense of the annotation name (see the comments on the indexing post).  By convention, most Java annotations are declarative in name.  Thus, the name should not be the present tense “index” but the past tense “indexed”.  So my first thought was to just refactor the annotation, but then I came into a slightly hairier name-clash.

Database Field Indexing

One of the most common techniques for optimizing your database’s read (SELECT) performance is to create indexes on certain fields.  When a field is indexed, the database will maintain some separate hash tables to enable very fast selection of rows based on the field in question.  This is a really good thing for almost all foreign keys, for example:

SELECT ID FROM people WHERE companyID = ?

Here we’re SELECTing the id field from the people table where companyID matches a certain value.  The database can execute this query fairly quickly.  In fact, the only bottle-neck is finding all of the rows which match the specified companyID value.  In a table containing hundreds of thousands of rows, one can see how this could be a problem.

The problem goes away (sort of) with the use of field indexing.  Instead of having to linearly search through the table for rows matching the companyID, the database can perform a quick hash lookup in an index and get a set of rowid(s) based on the companyID.  Simple, efficient, and incredibly scalable.  Practically, the DBMS wouldn’t take any longer to execute such a query against a table of 100,000,000 rows than it does to execute the same query against a table of 100 rows.  So, this is a really good thing right?

Well, indexes have their drawbacks.  I won’t go into all of the reasons not to use indexes, the following two points will likely suffice:

  • Indexing slows down UPDATEs and INSERTs
  • Indexing adds to your table’s storage requirements (all those hashes have to be put somewhere)

So perfect database performance isn’t attainable just by indexing every field, one has to be quite judicious about it.  In fact, choosing the fields to be indexed is really as much of an art as it is a science.

This long and tangled introduction actually did have a point…  ActiveObjects didn’t have any support for field indexing.  I hadn’t really considered the possibility, so I didn’t factor it into my design.  In retrospect, this was probably a bad idea.  So making up for lost time, I’ve now introduced field indexing into the library!

public interface Person extends Entity {
 
    @Indexed
    public String getName();
    @Indexed
    public void setName(String name);
 
    public int getAge();
    public void setAge(int age);
 
    public Company getCompany();
    public void setCompany(Company company);
}

When ActiveObjects generates the migration DDL for this table (running against MySQL), it will look something like this:

CREATE TABLE people (
    ID INTEGER NOT NULL AUTO_INCREMENT,
    NAME VARCHAR(255),
    age INTEGER,
    companyID INTEGER,
    CONSTRAINT fk_people_companyID FOREIGN KEY (companyID) REFERENCES companies(ID),
    PRIMARY KEY(ID)
);
CREATE INDEX NAME ON people(NAME);
CREATE INDEX companyID ON people(companyID);

This is where the aforementioned @Indexed annotation comes in.  As you can see from the resultant DDL, adding a field index is as simple as tagging the corresponding method.  Also, foreign keys are automatically indexed, to ensure maximum performance in SELECTion of relations.

So that’s the good news, the bad news is that index creation doesn’t play too nicely with migrations.  Everything works of course, and if you’re actually adding a table or field, the corresponding index(es) will also be created.  Likewise if you drop a table or field, the corresponding index(es) will also be dropped.  However, that’s about the limit of the migrations support for indexes at the moment.  JDBC has an unfortunately limitation which prevents developers from getting a list of indexes within a database.  Since I have no way (within JDBC) of finding pre-existing indexes, they must be excluded from the schema diff and thus are CREATEd or DROPped irregardless of the existing schema.  I do have a plan to fix this (along with migrations on Oracle), but I seriously doubt that it’ll be included in the 1.0 release, owing to the changes required.

Refactoring End Result

The final result of the refactoring and the adding of field indexing is that the annotation formerly named @Index is now the @Searchable annotation.  Likewise (and keeping with convention), the formerly IndexingEntityManager is now named SearchableEntityManager.  Both of these types remain in the net.java.ao package.  To allow for field indexing, the @Indexed annotation was added to the net.java.ao.schema package, owing to the fact that it only effects schema generation and doesn’t change runtime behavior in the framework at all.

Hopefully these changes won’t be too confusing (now that you’re aware of them) and will be a welcome addition to the ActiveObjects functionality.  As always, I welcome comments, suggestions and criticisms!

Performance is Good: ActiveObjects vs ActiveRecord

14
Aug
2007

So ActiveObjects is a fairly cool ORM.  However, coolness alone does not an enterprise ORM make.  In fact, the real qualifications for an enterprise-ready framework are as follows:

  • Stability
  • Performance

I’m sure there are other questions which factor into design decisions on whether or not to use a library, but those are the two which I look at most closely.  Stability is usually a hard metric to find, since it usually depends on a lot of adopters hammering the library until it breaks, is fixed and then hammered again.  However, performance numbers are almost always easy to come by, since all that is required are a few simple benchmark tests to just get a ballpark-number.

Since benchmarks are so fun, I’ve decided to do a few for ActiveObjects.  Or rather, I’ve decided to run a simple (read, very simple) benchmark test with ActiveObjects as well as a number of other ORMs.  At the moment, I’ve only been able to run the test with ActiveRecord (sorry guys, Hibernate’s a really complex framework), but I think the numbers are still worth looking at.

ActiveRecord claims only a 50% overhead compared to manual database access (that number is actually listed as a feature).  There has been some dispute over whether the test used to obtain that particular figure was valid or not, but that’s besides the point.  ActiveObjects should be able to do at least that well, right?

Well, as it turns out, it can.  Here are the numbers from my reasonably simple benchmark:

ActiveObjects
==============
Queries test: 55 ms
Retrieval test: 68 ms
Persistence test: 55 ms
Relations test: 154 ms

ActiveRecord
=============
Queries test: 154 ms
Retrieval test: 6 ms
Persistence test: 76 ms
Relations test: 75 ms

Surprisingly close numbers actually.  I had assumed that there would be some significant disparity, one way or another.  However, as you can see ActiveObjects is fairly comparable to ActiveRecord on a set of extremely trivial tests.  There are some jumps and obvious areas of strength/weakness in both frameworks, but on average they’re pretty similar in performance.

As my friend Lowell Heddings pointed out, ORM benchmarks are far more useful if you actually examine the SQL generated to see how efficient it really is from a theoretical standpoint.  So, to make things easier I sed/grepped the logs and arrived at the following SQL outputs for each respective ORM.

Details

Now, I will be the first to admit that this is hardly at even test to begin with.  Obviously there are different strengths and weaknesses in every library, and though I tried to be impartial in the designing of the benchmarks, I probably accidentally favored one ORM over the other.  Also, there are inherent performance advantages to Java over Ruby, especially in the area of database access.  In short, ActiveObjects probably had a sizeable advantage coming right out of the gate, so take my numbers with a grain of salt.

The test itself consisted of four phases, each involving three entities: Person, Profession and Workplace.  Person has a many-to-many relation with Profession through a fourth entity, Professional.  Workplace has a one-to-many relation with Person.  These relations were exploited directly in the relations benchmark (e.g. Person#getProfessions(), Workplace#getPeople(), etc).  Each entity had a number of fields, including one CLOB (or TEXT, as MySQL refers to them) in the Person entity.  The tables for each respective schema were pre-populated with the same data, which involved several rows with different values (except for the CLOB, which was a roughly 4000 character paragraph and the same for every row).  In the ActiveObjects Person entity, I used the @Preload annotation to eagerly load firstName and lastName.

For the retrieval test, the benchmark iterates through every Person row and grabs firstName, lastName, age, alive, and bio.  Since ActiveObjects only preloaded firstName and lastName, it suffered a bit here. 

The persistence test iterates through every person row and changes the first and last name to one selected from a pool of names I populated with random names which came to mind.  It then goes through the same iteration again and sets the age, alive flag and the bio to our 4000 word Pulitzer-winning essay.  Each row is saved through each iteration, thus each row is saved exactly twice throughout the test.  ActiveObjects came out ahead here probably because of its use of PreparedStatements, as well as the more efficient UPDATE statement generation.

The relations test involved first finding all of the Professions associated with each individual Person and retrieving the Profession name.  Next, the Workplace for the Person is retrieved, then all of the Person(s) associated with that Workplace and their firstName and lastName values accessed.

The queries test was little more than getting all of the Person(s), all of the Workplace(s), all of the Professional mappings, along with all of the Profession(s).  ActiveObjects far outperformed ActiveRecord in this area since ActiveRecord uses SELECT * for everything and eagerly loads the row values.  This means (especially with a CLOB thrown into the mix) that ActiveRecord’s initial query time will be very long, while it’s field access time will be very quick.  Most ORMs function in this way, and it can be a very good thing at times (our benchmark is one of those times).

Lessons Learned

  • Eager loading can be a good thing
  • ActiveObjects generates some weird SQL for relations access

Obviously I can only do so much about the eager loading issue.  I believe pretty strongly that ActiveObject’s approach (in lazy loading most things) is the right one for most use-cases.  However, the second lesson to be learned here is one which I think I need to take a bit more to heart: keep it simple SQL.

Normally, ActiveObjects will generate a query something like the following for accessing a one-to-many relation:

SELECT DISTINCT a.outMap AS outMap FROM (
    SELECT ID AS outMap,workplaceID AS inMap FROM people 
       WHERE workplaceID = ?) a

Yuck!  For obvious reasons, this is an incredibly inefficient bit of querying.  Actually, not only is it inefficient, but needlessly so.  You and I of course know that we could replace the above query with the much simpler:

SELECT ID FROM people WHERE workplaceID = ?

So why doesn’t ActiveObjects do that?  Frankly, I was lazy in my coding of the EntityProxy#retrieveRelations method, so a lot of ugly SQL slipped through the cracks in cases where it really wasn’t necessary.  I’ve spent a bit of time on this, and I think I’ve got the issue resolved.  The problem is that ActiveObjects was assuming that any relation (one-to-many or many-to-many) can have multiple mapping fields, thus requiring a wrapping DISTINCT outer query around a subquery SELECT which is UNIONed with an arbitrary number of other SELECTs, corresponding to the other mapping fields.  Obviously, it is almost never the case that we have to deal with multiple mapping paths, so I added a short-circuit to the logic which creates far simpler queries if at all possible.  As a result, the benchmark numbers for the relations test in ActiveObjects are between 80 and 100 ms.  Still slower than ActiveRecord, but much improved.

It’s worth noting that if we ran each benchmark twice, we would see a marked improvement in the ActiveObjects performance the second time through.  Not just because a lot of the values would be cached, but also because the prepared statements in question would have been compiled and stored.  This is a fairly major area in which ActiveRecord falls short since it doesn’t utilize prepared statements, thus having a constant runtime for its queries and remaining unable to take advantage of cached, compiled queries.

So in short, ActiveObjects may be really neat, but it’s performance numbers don’t seem all that superior to those of ActiveRecord, a Ruby ORM with numerous known shortcomings in this area.  I guess I need to work on things a bit more.  :-)  Next up, either manual JDBC code or Hibernate running the same benchmark, depending on how soon I’m able to figure out Hibernate’s crazy XML mapping schema.

Note: I forgot to mention this… You can get the source for my benchmark from the ActiveObjects SVN repository: svn co https://activeobjects.dev.java.net/svn/activeobjects/trunk/Benchmarks