Skip to content

Wide World of Pool Providers: Side-by-Side Comparison

13
Nov
2007

It seems that for any conceivable functionality in Java, there exist a myriad of frameworks which accomplish the task in more-or-less the same way.  ORMs for example; I can count five different Java ORMs without even trying, and I’m sure that number would expand exponentially if I actually sat down and used Google to get a more precise estimate. 

Just like any other function, there seems to be a glut of frameworks which provide JDBC connection pooling.  Choosing between these frameworks can sometimes be a daunting task.  After all, what qualifications do you look at?  Performance?  Licensing?  Documentation?  Connection pooling is such an (apparently) small part of an application’s infrastructure that many development teams devote very little time to this vital selection process.

Due to this management shortsightedness, many projects will simply use whatever pooling library is default for the ORM their using, or (more likely) the first Google hit when searching for “java connection pool”.  Of course, this will get you something which is usable, but rarely will you arrive at a pool provider which is optimal.

To help you to make a more informed decision in this vital aspect of project design, I hereby present some of the lessons I have learned on the subject while working on ActiveObjects.  All benchmarks were run against MySQL 5.0 running on Windows Vista Premium, 2 Ghz Intel Core2Duo, 2 GB DDR2, 7200 RPM SATA drive.  Each test was run five times (executing the DDL each time) with the average runtime taken as the result.  Any obviously poor results (several seconds above the mean) were dropped and re-tested.  All tests were run using the Eclipse JUnit 4 test runner.

commons-dbcp

This is quite possibly the most commonly used pool provider (by default backed by commons-pool), mainly because it used to come up first on Google.  Though, perhaps more important is the fact that commons-dbcp is an Apache sub-project.  This gives it both a less-restrictive license than some other projects, as well as a credibility that comes with being hosted at Apache.  Honestly, if I see a project’s URL contains “apache.org”, I immediately give it the benefit of the doubt, assuming that it will be of reasonable to high quality.  There’s certainly something to be said for reputation…

commons-dbcp is probably the easiest pool provider I’ve seen in terms of API and “just work”-ness.  It has two mechanisms for setting up connections: alternative JDBC URI and a JNDI DataSource implementation.  It’s interesting to note here that the JDBC javadocs state that DataSource is the preferred way to retrieve connections, while the commons-pool javadoc asserts that the alternative URI method passed directly to DriverManager is preferable.  In practice, I find myself using the DataSource method for connection pools, mainly because it reminds me that I’m not dealing with a normal connection creation, but something which is potentially pooled:

BasicDataSource ds = new BasicDataSource();
ds.setDriverClassName(jdbcDriver);
 
ds.setUsername(getUsername());
ds.setPassword(getPassword());
ds.setUrl(getURI());
 
ds.setMaxActive(20);
 
// get connection here
Connection conn = ds.getConnection();
 
// dispose of pool
ds.close();

It’s important to note here that the pool is explicitly disposed. This is always a good idea, even for pools which don’t state the requirement in their documentation.  An undisposed pool can hold database connection open, tying up resources and dragging your database performance through the dirt.  Always, always, always dispose of your connection pools when you’re done with them.

So the API seems pretty intuitive here.  All of the methods do exactly what one would expect.  What’s more, the entire library is extremely well documented.  There’s quite a bit of material on the commons-pool project page discussing how to get started, what best practices to follow, etc.  The public API is javadoc’d, and there are a number of examples available.  I was up-and-running with the framework a few short minutes after I punched in the URL to my address bar.

One thing I haven’t addressed yet is performance.  It’s vitally important that the connection pool chosen run as efficiently as possible.  After all, its whole purpose is to optimize access and reduce the strain on the database in the form of connection create as well as statement compilation.  Obviously all of the really interesting stuff is in this segment of the library.  If this code performs poorly, it would be a very bad idea to try and use the framework for any sort of serious project.

I just happen to have a reasonably comprehensive database benchmark handy in the form of the ActiveObjects JUnit test suite.  ActiveObjects uses a reasonable number of JDBC features (it doesn’t use many conventional Statement(s) or stored procedures).  Since neither the suite nor the library itself changes between benchmarks, we can test arbitrary connection pools easily and receive reasonably accurate results.

Continuing my recent obsession with HTML tables and their use in product reviews, here is the obligatory “five second rundown”:

Documentation Excellent
API Easy and intuitive
License Apache License 2.0
AO Test Suite Run Time 20.6302 seconds

C3P0

C3P0 is another very common connection pool framework, partially because it’s the default pool used with the ever-popular Hibernate ORM.  Unlike commons-pool, C3P0 is actually hosted at SourceForge, that ever popular source of dead open-source projects and over-ambitious specs.  With that said, C3P0 is actually quite respectable as a framework and seems to have avoided the premature fate which befalls most open-source frameworks: developer boredom.

Unfortunately, like so many projects on SourceForge, C3P0 does not have a separate website.  The maintainers opted to stick with the SourceForge project interface as the sole source of “official” information.  Add to that the fact that they decided not to add anything to the “Documentation” section of the page and you arrive at one very frustrating first impression for a new user.  Fortunately there’s a lot of material on using C3P0 (both with and without Hibernate) available around the internet.  Always remember: Google is your friend.

ComboPooledDataSource cpds = new ComboPooledDataSource();
cpds.setDriverClass(jdbcDriver);
 
cpds.setJdbcUrl(getURI());
cpds.setUser(getUsername());
cpds.setPassword(getPassword());
 
cpds.setMaxPoolSize(20);
cpds.setMaxStatements(180);
 
// get connection here
Connection conn = cpds.getConnection();
 
// dispose of pool
try {
    DataSources.destroy(cpds);
} catch (SQLException e) {
}

The API is somewhat similar to that of commons-dbcp.  Both use the DataSource API as a foundation (which is the “right” approach according to the JDBC docs), and both allow roughly the same configuration options on a pool.  At face value, the APIs seem similar to the point that comparison between the two on such a level would be pointless.

One very important “feature” of the C3P0 library is its license: LGPL.  For those of you who don’t know, LGPL is basically identical to the famed GPL 2.0 without the so-called “viral clause”.  GPL pre-3.0 has some legal ambiguity relating to “derivative works” and what qualifies as such.  For this reason, many projects (especially commercial applications written using object-oriented languages) tend to shy away from libraries licensed as such.  LGPL of course doesn’t have this problem, so it has seen moderately better acceptance from the corporate gods.  Unfortunately, it is still a fairly restrictive license relating to other matters such as redistribution.  It is fairly common practice to perform what can be described as “static linking” of JAR files for an application (un-JARring dependencies and then re-JARring them into the main application JAR).  This is something which is prohibited under LGPL, thus restricting deployment options somewhat.  Also, if I remember correctly any non-LGPL application or framework using a LGPL licensed dependency must include a copy of LGPL somewhere in the application (the About or Help section springs to mind).  It was due to this licensing that a company I recently worked for decided against using C3P0 for its application.  Of course, everyone’s requirements are different, but you should still be aware of the possible consequences of using a restrictively licensed framework.

Documentation Lousy
API Easy and intuitive
License LGPL
AO Test Suite Run Time 17.277 seconds

Proxool

Like C3P0, Proxool is an open-source pooling library hosted on SourceForge.  Thankfully, unlike C3P0, Proxool’s maintainers actually took the time to build a full site for the project, containing documentation and examples.  Unfortunately for us, the examples don’t do much good.

Proxool’s documentation is obfuscated and hidden away, making it somewhat difficult to get started with the framework.  Unintuitively enough, the “Quick start” section is of very little help when trying to actually use the library.  Oh it does contain samples, but in my tests I couldn’t get the samples to run successfully.  Add to this the fact that Proxool is a less well-known framework and you lead to some very frustrating experiences trying to get up and running.

To their credit, the Proxool maintainers have written quite a bit of documentation which covers a great deal of the framework functionality.  Organizing a project page intuitively is very hard, it’s just a shame that would-be adopters of the framework have to pay the penalty. 

So to make things easier for others like myself looking to try the framework, here’s the basic setup code for a Proxool pool:

Class.forName(jdbcDriver);
 
Properties props = new Properties();
props.setProperty("proxool.maximum-connection-count", "20");
props.setProperty("user", getUsername());
props.setProperty("password", getPassword());
 
String driverUrl = getURI();
String url = "proxool.mypool:" + jdbcDriver + ":" + getURI();
 
ProxoolFacade.registerConnectionPool(url, props);
 
// get connection here
Connection conn = DriverManager.getConnection("proxool.mypool");
 
// dispose of pool
try {
    ProxoolFacade.removeConnectionPool("mypool");
} catch (ProxoolException e) {
}

Hardly intuitive I’d say.  Nevertheless, the above code seems to get the job done.

One of the debatable advantages to the Proxool library is that it allows developers to take advantage of connection pooling simply by using a special JDBC URI prefix (commons-dbcp allows this too).  I’m not using that syntax in the above example mainly because I think that developers are better served remembering when they are or are not using a pool.  Also, I never could quite get the syntax working (again, poorly structured documentation).

One interesting feature of Proxool that’s worth mentioning is that it allows developers access to things like pool stats, event listeners and so on.  I believe these features are completely unique to Proxool, and while they’re not very interesting in a small test application, imagine the power which can be unleashed in a real-world application.  Exposing this information through something like JMX could make tracing and debugging of database bottlenecks on a production server significantly easier.

Documentation Frustrating
API Poor
License Apache License
AO Test Suite Run Time 18.6406 seconds

Benchmark Comparison

In terms of raw performance, C3P0 comes out ahead by almost a second and a half.  For a short-running test suite like that of ActiveObjects, that’s a fairly impressive difference.  That translates into hours of clock time saved on a database-intensive application of the course of a few weeks.  In my book, that’s something seriously worth considering.

Proxool came in a solid second place, at eighteen and a half seconds.  It’s definitely slower than C3P0, but it’s a full two seconds faster than commons-dbcp.  Considering that Proxool is licensed under the far less restrictive Apache License, it may be worth sacrificing the odd millisecond per query, depending on the opinion of your legal department.

commons-dbcp was the slowest of the three benchmarked at a disappointing twenty and one half seconds.  I’m not entirely sure why DBCP is so much slower in its default, commons-pool backed implementation.  However, the fact remains that performance-wise, it isn’t even worth comparing with C3P0.  Seems I need to make some changes in the classpath of some of my projects…

Throughout the whole benchmarking process, I was constantly reminding why Vista is so notoriously difficult as a host OS for application benchmarks.  The results were constantly fluctuating dramatically up and down, based on how much Vista had superloaded, indexing state, open apps, etc.  In short, Vista was so frustratingly difficult to deal with in the testing process that the test results should be treated with some skepticism.  After all, it’s hard to say that this is empirical, hard evidence when I’m throwing away three quarters of the test results due to vast deviation from the mean.

Conclusion

I must (grudgingly) admit that C3P0 is probably the best choice for most projects.  I say grudgingly because the extreme lack of documentation really bothers me.  Granted, Proxool, the next closest in performance, only has an advantage in licensing; its documentation is no better than C3P0’s.  Proxool of course has the added disadvantage of having a difficult API, as well as less popularity, therefore fewer articles and samples available around the web.

So if you’re a license purist, and you want an intuitive API at the expense of performance, commons-dbcp is the way to go.  However, if you’re willing to work within the restrictions of the LGPL license and you know how to use Google effectively, C3P0 would be the preferred choice, given its higher performance and excellent configurablility.

Update: I didn’t have time to run the benchmarks in any sort of rigorous way (see aforementioned whining about Vista’s benchmark flakiness), but preliminary runtimes indicate that DBPool is an even better framework, performance-wise.  It has a less restrictive license than C3P0, and seems to have a second to a second and a half edge in runtime. Again, these are just quick numbers I grabbed as I was adding support for the provider to ActiveObjects, but I thought it was worth mentioning.

Are GUIDs Really the Way to Go?

7
Nov
2007

I recently read a (slightly inflammatory) posting entitled “The Gospel of the GUID”.  In it, the author attempts to put forward several arguments in favor of using GUIDs for all database primary keys (as opposed to the more pedestrian sequence-generated INTEGER).  I’ve heard similar arguments in the past, and I keep coming back to the same conclusion: it’s all bunk.

To get right to the heart of my opinion, I really don’t see a compelling reason to use GUIDs for 90% of database use-cases.  Consider for a moment some of the really large databases in this world (Wikipedia, Slashdot’s comments table, Digg, etc).  I can’t think of a single web application in that league which uses GUIDs.  If you don’t believe me, look at the code for MediaWiki, the URL structure for Digg and the idiocy involving INTEGER vs BIGINT on Slashdot.  While de facto practice may not dictate optimal design, it does point to a trend that’s worthy of notice.  More importantly, it proves that INTEGER (or rather, BIGINT) primary key fields are practicable under real-world stress.  But what about the theoretical use-case?

Well to be honest, the theoretical use-case doesn’t interest me all that much.  I mean, if you tell me that your schema can support theoretically 29 trillion rows in its users table, I’ll geek out right along with you.  But if you try to feed that to a client, you’ll get either blank stares, or the equally common: “but there are only 5 billion people to chose from.”  (unless you’re talking to someone in upper management)  For an even better party, try telling your client that you can merge their data with one of their competitor’s in 45 minutes flat.  Somehow I doubt that argument will fly.

Now let’s look at the cons involved.  GUIDs are really, really hard to remember and type by hand.  When you’re in a hurry, pulling a little data-mine-while-you-wait for your boss, you’re not going to want to dig around and find that string you copy/pasted into Notepad.  Oh, and heaven forbid you actually type the GUID wrong!  Finding a single-bit deviation in a 64-bit alpha-numeric is not a trivial task let me tell you.  It’s at about this point that you start to think that maybe INTEGER keys would have been a better way to go.  At least then you can throw the userID into good ol’ short-term memory and bang it out ten seconds later in your SQL query window.  Oh, in case you’ve never seen a GUID before, here’s a couple quick examples:

  • {3F2504E0-4F89-11D3-9A0C-0305E82C3301}
  • {52335ABA-2D8B-4892-A8B7-86B817AAC607}
  • {79E9F560-FD70-4807-BEED-50A87AA911B1}

I’m not even sure how to read that, much less remember it.

Another thing worth considering is that GUIDs are VARCHARs within the database.  This means that not all ORMs will handle them appropriately.  Hibernate will, as will iBatis and ActiveObjects.  But if you’re using something like Django or ActiveRecord, you’re out of luck (disclaimer: Django might actually support VARCHAR primary keys, I’m not sure).  On top of that, some database clustering techniques don’t seem to work nicely when applied to GUID-based key schemes.  I haven’t run into this myself, but my good friend Lowell Heddings has told me tales of strange mystics and evil tides when playing with distributed indexes and GUIDs in MS SQL Server.  Technically speaking, while any sane database may work with a VARCHAR for the table’s primary key, the architects really had INTEGERs in mind when they designed the algorithms.  This means that by using GUIDs, you’re flirting with a less-well-tested use-case.  It’s certainly not unsupported, but you should consider yourself in the minority of users if you go that route.

Oh, and one little myth I’d like to debug: there really isn’t that much more overhead in grabbing the generated value of an INTEGER primary key than in sending an in-code generated GUID value.  Granted, some databases make this harder than others, but that’s their fault isn’t it?  As long as you’re using a well designed ORM, you really won’t see much of a performance increase from removing that miniscule callback.  In fact, depending on how your GUID algorithm works, you may see a significant degradation in overall performance due to extra overhead in your application.

Conclusion

So, the obligatory thirty second overview for our attention-deprived era:

Benefits of GUIDs perfectly unique across all your data and tables; way, way more head-room in terms of values
Problems with GUIDs really, really un-developer friendly; weird side-effects in databases and ORMs; unconventional approach to a “solved” problem; easy way to upset your DBA

So it really seems the only upshot for GUIDs is their massive range.  In trade-off, you lose usability from a raw SQL standpoint, your result sets are that much more cluttered and you risk odd bugs in your persistence backend.  Hmm, I wonder which one I’m going to use in my next app?

Before you make the choice, ask yourself: “how much uniqueness do I really require?”  Over-designing a solution is just as much a problem as under-designing one.  Try to find the mid-point that works best for your use-case.

Custom Data Types in ActiveObjects

17
Oct
2007

ORMs really interest me, so naturally I read a lot of material regarding ORMs of all kinds, especially Hibernate and ActiveRecord.  One of the more interesting reviews I read recently complained about the rigidity of the type system in the Rails ORM.  According to the author’s examination of the code, ActiveRecord just uses a monolithic switch/case statement to determine the appropriate Ruby type from the SQL type in the result set.  This may make sense from a simplicity standpoint, but it may not be the best approach when it comes to flexibility.

The problem with this approach is that it’s impossible to easily add new types to the ORM.  Granted, the framework authors could do it by modifying the switch/case statement(s) - and the approach does usually require more than one statement - and releasing a whole new version of the framework.  This is not a significant issue as the framework authors already have access to the full library sources.  The real trial is with third-party developers who require custom data types.

An alternative approach (suggested in the article) is to implement a series of type delegates inheriting from a common superclass, or possibly using a mixin as allowed by Ruby.  These type classes would each be responsible for a single type, handling the mapping both to and from the language-native type to the database type.  This would allow for both easy addition and modification of core types by the framework authors, but also trivial support for arbitrary types as implemented by third-party developers.

Not one to shirk good advise when I hear it, I’ve decided to go with this approach to types in ActiveObjects.  Formerly, I must admit I had gone with the multiple, giant switch/case statements.  This seemed to make sense when I first implemented the framework, but it developed, it became apparent that this was inadequate, especially if third-party types are desired.  This decision led to the refactoring of the type system and subsequent creation of the TypeManager class.

TypeManager is basically the singleton manager for the entire type system.  It maintains the list of available DatabaseType(s) and can resolve both Java classes and SQL types to the appropriate delegate.  A number of core types (VarcharType, IntegerType, etc) are added to the singleton instance of TypeManager, ensuring that basic functionality works without any extra effort on the part of the developer.  If a type other than the core types is needed, all that is necessary is to add the type delegate instance to the TypeManager prior to the type’s usage in either migrations or data access.  Thusly:

public interface Company extends Entity {
    public String getName();
    public void setName(String name);
 
    public Class<?> getJavaType();
    public void setJavaType(Class<?> type);
}
 
public class ClassType extends DatabaseType<Class<?>> {
 
    public ClassType() {
        super(Types.VARCHAR, 255, Class.class);
    }
 
    @Override
    public Class<?> convert(EntityManager manager, ResultSet res, 
                Class<? extends Class<?>&gt; type, String field) throws SQLException {
        try {
            return Class.forName(res.getString(field));
        } catch (Throwable t) {
            return null;
        }
    }
 
    @Override
    public void putToDatabase(int index, PreparedStatement stmt, 
                Class<?> value) throws SQLException {
        stmt.setString(index, value.getName());
    }
 
    @Override
    public Object defaultParseValue(String value) {
        try {
            return Class.forName(value);
        } catch (Throwable t) {
            return null;
        }
    }
 
    @Override
    public String valueToString(Object value) {
        if (value instanceof Class) {
            return ((Class<?>) value).getName();
        }
 
        return super.valueToString(value);
    }
 
    @Override
    public String getDefaultName() {
        return "VARCHAR";
    }
}
 
// ...
TypeManager.getInstance().addType(new ClassType());
 
Company[] stringCompanies = manager.find(Company.class, "javaType = ?", String.class);
for (Company c : stringCompanies) {
    System.out.println(c.getName() + " former held type " + c.getJavaType().getName());
 
    c.setJavaType(Exception.class);
    c.save();
}

The most complicated bit of the example above is the database type itself.  Yet even this delegate isn’t too horrible.  The ClassType class first specifies in its constructor which types it corresponds to, both database and Java.  Multiple Java class types can be specified, allowing for cases like IntegerType which maps to both Integer.class and int.class.

The rest of the database type is fairly self-explanatory.  There are methods to read the Java value out of a JDBC ResultSet, put the Java value back into a JDBC PreparedStatement, as well as three methods to handle some of the non-database type-sensitive operations, such as parsing a String value into a type-specific value and visa-versa.  These database non-specific conversions are required for things like parsing the value of a @Default or an @OnUpdate annotation.  Finally, getDefaultName() allows the default DDL rendering of the type to be specified.  This can be overridden in the DatabaseProvider implementation for that particular database, but the use of getDefaultName() allows for third party types that the database provider developers may not have foreseen.  Thus, it effectively opens the door to third-party types in migrations.

Of course, no example would be complete without another one to complement it!  Here’s how we could create a type delegate for the java.awt.Point class:

public class PointType extends DatabaseType<Point> {
    private static final Pattern PATTERN = Pattern.compile("x=(\\d+),y=(\\d+)");
 
    protected PointType() {
        super(Types.VARCHAR, 45, Point.class);
    }
 
    @Override
    public Point convert(EntityManager manager, ResultSet res, 
            Class<? extends Point> type, String field) throws SQLException {
        return (Point) defaultParseValue(res.getString(field));
    }
 
    @Override
    public void putToDatabase(int index, PreparedStatement stmt, Point value) 
                throws SQLException {
        stmt.setString(index, valueToString(value));
    }
 
    @Override
    public Object defaultParseValue(String value) {
        Point back = null;
        Matcher matcher = PATTERN.matcher(value);
 
        if (matcher.find()) {
            back = new Point();
            back.x = Integer.parseInt(matcher.group(1));
            back.y = Integer.parseInt(matcher.group(2));
        }
 
        return back;
    }
 
    @Override
    public String getDefaultName() {
        return "VARCHAR";
    }
}

One thing of note here which has changed from the previous example of ClassType is that the second parameter to the super constructor is now 45, instead of 255.  This parameter is actually the default precision of the SQL type when rendered into the database.  If the SQL type doesn’t have a precision or should just take the database default, a negative value should be specified for this parameter.  Another item of note is that we’re delegating work between methods in a way that I simply didn’t do for ClassType.  Because the rendering of the type in the database is in VARCHAR (String) form, we can rely upon our default String conversion methods to render into the database.  As an aside, the superclass implementation of valueToString(Object) uses the toString() method for that particular value.

As you can see, the type system in ActiveObjects is incredibly powerful and capable of satisfying many use-cases that were impossible in previous versions or other ORMs.  Hopefully this brief glimpse into advanced uses of the type system will aid you in databasing efforts.

Custom Primary Keys with ActiveObjects

15
Oct
2007

One of the main complaints I’ve heard leveled against ActiveObjects is that it’s just not suitable for mapping to legacy schemas.  More generically, concerns have been mooted that it enforces naming conventions and field conventions which aren’t suitable/preferable for some projects.  I suppose at first both of these were true.  After all, ActiveObjects’s entire premise was convention over configuration, and this requires some restrictions by default.  However, I don’t think it’s entirely accurate any longer.

Over the last few months, I’ve added several features which satisfy three primary goals:

  • Customize the table name convention
  • Customize the field name convention
  • Allow for primary key fields (and types) other than id INTEGER

The first two goals were easily met through the addition of TableNameConverter and FieldNameConverter.  These two classes are used by every feature within ActiveObjects, from migrations to simple data access, to determine the database table and field names from the class and method names respectively.  The canonical example of this is table name pluralization, which can be accomplished in the following way:

EntityManager manager = new EntityManager(
    "jdbc:mysql://localhost/test", "username", "secret");
manager.setTableNameConverter(new PluralizedNameConverter());

Not too horrible.  The second use-case is assigning a different field name convention than the default camelCase.  For example, some people really like the ActiveRecord (Rails) field naming convention.  (e.g. “first_name” as opposed to “firstName”)  This can easily be accomplished by specifying a field name converter:

EntityManager manager = new EntityManager(
    "jdbc:mysql://localhost/test", "username", "secret");
 
// lower_case convention
manager.setFieldNameConverter(new UnderscoreFieldNameConverter(false));

Custom table and field name converters are also possible, allowing for a great deal of flexibility in name conventions.  Additionally, it’s always possible to specify field and table names directly in the entities, using the @Accessor, @Mutator and @Table annotations respectively.

Custom Primary Keys

The most challenging goal (from a library standpoint) is to allow for primary key fields other than “id”.  This is partially such a challenge because it had been hard coded literally everywhere in ActiveObjects that the “id” field is the field to use in any sort of SELECT, JOIN, INSERT, UPDATE, etc.  In short, changing this required finding all of these instances and converting the code to query a centralized source for the data.  A few days of fiddling with Eclipse’s text search accomplished this without inordinate pain, but the hard part was coming.

The question remained: how to specify the primary key within the entity itself?  After all, it’s been hard coded and sort of magically “worked” based on the method definition in the Entity superinterface.  There had been a syntax to specify a second PRIMARY KEY for the schema migration, but ActiveObjects didn’t treat these fields any differently, and this sort of syntax wouldn’t really cut it if we were trying to completely override the existing getID() method in the superinterface.

The solution is to refactor all of the interesting functionality in Entity up into a super-superinterface, RawEntity.  Thus the only method defined within Entity would be getID(), annotated appropriately to be recognized as a PRIMARY KEY field.  This would do away with all the magic tricks under the surface which assumed the existence of the getID() method.  ActiveObjects can easily parse the class to find the PRIMARY KEY field amongst the methods, both defined and inherited.  The only compromise which must be made is only one PRIMARY KEY can now be allowed per table.  This isn’t such an issue, since 99% of the time, that’s all you need anyway.  Usually that remaining 1% can be more properly accomplished using UNIQUE and some sort of auto-generation of values.

Since we’ve refactored interesting functionality up into RawEntity and kept getID() within Entity, no legacy code needs to be changed.  Any entities previously written against ActiveObjects will run without modification or any behavior changes.  We are merely allowed the flexibility of specifying our own primary keys.  So, without further ado, the obligatory example:

public interface Person extends Entity {
    public String getFirstName();
    public void setFirstName(String firstName);
 
    public String getLastName();
    public void setLastName(String lastName);
 
    public Company getCompany();
    public void setCompany(Company company);
 
    public House getHome();
    public void setHome(House home);
}
 
public interface Company extends RawEntity<String> {
 
    @PrimaryKey
    @NotNull
    @Generator(UUIDValueGenerator.class)
    public String getCompanyKey();
 
    public String getName();
    public void setName(String name);
 
    @OneToMany
    public Person[] getEmployees();
}
 
public interface House extends RawEntity<Integer> {
 
    @PrimaryKey
    @NotNull
    @AutoIncrement
    public int getHouseID();
 
    // ...
 
    @OneToMany
    public Person[] getOccupants();
}
 
public class UUIDValueGenerator implements ValueGenerator<String> {
    public String generateValue(EntityManager em) {
        // generate uuid
        return uuid;
    }
}
 
// ...
Person p = manager.get(Person.class, 1);
Company c = manager.get(Company.class, "abff999dd99ddf0a225f");

Maybe a bit longer of an example than you were expecting, but it does cover the material well.  What’s happening here is the Person entity has a standard, “id” primary key.  This follows the same convention that ActiveObjects has been enforcing since the beginning of time (or at least since I started the project).  Company and House are the interesting entities here.

House defines a getHouseID() method of type int which is marked as a PRIMARY KEY as well as being auto-incremented by the database (SERIAL on PostgreSQL, AUTO_INCREMENT on MySQL, etc).  This is the same sort of declaration that you would find if you looked in the source for Entity.  The difference is that House will not contain the “id” field and its PRIMARY KEY will be “houseID”.  The really interesting entity here Company.

Company defines a primary key that is not only a different field, but also an entirely different type.  Also, its value is generated automatically not by the database, but by the application itself.  This is a fairly common use-case in those crazy databases which use UUIDs as primary keys.  Not only does this field define “companyKey” as a different type than INTEGER, but it also ensures that the “companyID” FORIEGN KEY field in the “person” table is also of type VARCHAR.

Another item of note in this example is that the RawEntity interface is parameterized.  This is to allow the get(...) method in EntityManager to stay type-checked, ensuring that the values passed are actually valid primary key values for the entity in question.  Of course, there’s nothing that can be done to ensure that the actual method definition of the primary key is of the proper type.  However, at some point the developer must be trusted to make sure their entity model doesn’t violate the dictates of logic.

Conclusion

With this latest addition to the ActiveObjects feature set, it should be possible to use the ORM with any schema whatsoever.  While AO may still be an implementation of the active record pattern, and thus less powerful than solutions such as Hibernate, there should be no problems applying AO to just about any sane use-case.

Is a Separate Text Search Engine a Bad Idea?

11
Oct
2007

I was reading this blog entry a few days ago, and it started me thinking about full-text searching.  That wasn’t the main topic of the post, but I think the little side-trek into the field was interesting enough to merit some thought.  Right smack in the middle, Jamie goes on a bit of a rant about the pain of what is effectively two, separate databases (for example, MySQL and Lucene):

A fellow Rails developer asked me in all seriousness why I wasn’t abandoning the full text search functionality of TSearch2 and just using a completely separate, redundant database product designed exclusively for full text search. Seriously, that is considered the “easy” approach: one database for full text search, and another for ACID/OLTP/CRUD. Honestly if I were going to go down that road I would try hard to just abandon the SQL RDMBS and put everything in the other database, since Lucene and its imitators are capable of far more than just find-text-in-document queries. The pain of duplicating everything, using two query languages, two document representations (in addition to the object representation in Ruby) and writing application-tier query correlation makes the double-DB approach seem very unwise.

There is some validity to this thought.  After all, duplication in software usually means you’re doing something wrong - or at least, there could be an easier way.  Even ignoring this precept, it’s just common sense that keeping data synchronized concurrently between two data sources as complex as a relational database and a full-text index is not an easy task.  Granted, some ORMs can handle this task for you (actually, I can only think of Hibernate and ActiveObjects having this feature), but the principle is the same.  And even if everything is neatly and auto-magically synced, there’s always a danger of something getting out of place, and then you’re stuck with a transient stale data issue that’s difficult to track down.

The author of the post mentions that he favors the full-text search capabilities of PostgreSQL, the popular open-source database and competitor to MySQL.  This does have the advantage that you’re putting all the data in one place, handling everything with a single query language (SQL), and reducing the technologies your software depends upon.  This inarguably makes things a whole lot easier.

The main problem as I see it is this is putting a ton of unnecessary strain on the database.  In most modern server-side applications, the bottleneck is in the database (usually caused by too much badly written SQL).  There are whole mountains of documentation which offers suggestions on how to alleviate this problem.  Indexes, database clustering and a carefully chosen ORM can go a long way.  Unfortunately, tacking on full-text indexing seems like a step in the wrong direction.

Lucene is very good at what it does.  It’s indexing and storage performance is second to none.  In fact, it’s so fast that a lot of companies use it as a quick-and-dirty storage dumping ground for raw data, knowing that it will be much faster and more scalable than a relational database.  Why not take advantage of this incredible power and take one more item off of your database’s back?  This is all not to mention the fact that a Lucene index query is probably a lot faster than an SQL query grabbing data from a PostgreSQL full-text index.

So what about the flip side of things?  Why not just put all the data into Lucene (or clone) and eschew relational databases altogether?  Well as I mentioned above, a lot of companies do this for simple things.  Lucene is fantastic at both scalability, and very fast indexing and querying of large blocks of text.  Where it begins to trip up is when you turn it loose on other data types.  Don’t get me wrong, Lucene is an amazing piece of technology.  But just like PostgreSQL isn’t a full-text search engine, Lucene isn’t an RDBMS.  Each component of the infrastructure needs to handle what it’s best at.  In fact, this is really a large aspect of scalability.  Ensuring that every technology is utilized to its fullest potential and no more is crucial to a high-volume application.

Final verdict?  I think I’m sticking with MySQL and Lucene working in tandem, each doing what they do best.  ActiveObjects makes the synchronization almost completely transparent, so it’s not like I’m loading myself down with unnecessary work from a code standpoint.  Seems like a good solution to me; and since most of the industry agrees, it’s probably a safe bet for you too.