Skip to content

Thoughts on jPersist

15
Nov
2007

Disclaimer: For those of you who don’t know, my pet project is an ORM based on the active record pattern (ActiveObjects).  As such, my opinions are probably colored somewhat.  Take everything I say here with a grain of salt.

There seems to have been a recent resurgence of interest in Java implementations of the active record pattern in recent months.  Most traditional Java ORMs (like Hibernate/JPA) implement the data mapper pattern, as it is far easier to apply to a static language like Java.  Yet I can count a number of new (at least to me) ORM projects which attempt to fill this void.  One such project is jPersist.  From the project page:

jPersist is an extremely powerful object-relational persistence API that is based on the Active-Record pattern. jPersist is mapless and has no need for XML or annotation based mapping (all mapping is automatic and dynamic).

Sounds promising!  I’m always in favor of any framework which can reduce the boiler-plate configuration required, especially when the configuration would have been in a language as verbose as XML.  But irregardless of the method for configuration, I’m always interested in how alternative frameworks implement the active record pattern.  As they say, inspiration is one part perspiration and three parts plagiarism.

So I started to dig through the project page a bit, trying to find more info.  Unfortunately, the jPersist project page is a bit weird in how the whole thing is structured.  The page layout implies that jPersist is in fact a sub-project of some kind, which lead me to initially discount the “Documentation” link in the main nav bar.  Even after I found the documentation though, I was somewhat disappointed in its limited scope.  Effectively, the only documentation available is some basic javadoc and a “getting started” tutorial.  That’s alright though, I can work with that.  Code samples are more informative anyway, right?  I downloaded the full distribution (including javadoc and examples) and began to poke around.

The first thing I noticed upon opening the “DatabaseExample.java” file was the syntax: it’s very verbose.  Granted, most persistence APIs suffer from a less-than-DSL design, but I was still a bit taken aback at the volume of code required to accomplish a simple task.  For example, this is how you INSERT a new entity “Contact” with jPersist:

Database db = dbm.getDatabase();
db.saveObject(new Contact("alincoln", "mypasswd1", "Abraham", "Lincoln", 
        null, "alincoln@unitedstates.gov"));

This is assuming that you have defined the POJO Contact with a constructor taking all of the above arguments.  Obviously this could have been just as easily simplified into something more like this:

Contact c = new Contact();
c.setContactId("alincoln");
c.setPassword("mypasswd1");
c.setFirstName("Abraham");
c.setLastName("Lincoln");
c.setEmail("alincoln@unitedstates.gov");
 
Database db = dbm.getDatabase();
db.saveObject(c);

Still a bit verbose, but understandably so. I don’t know of any persistence API in Java that can accomplish the same thing more compactly, so I won’t hold it against jPersist. 

What really stands out at me in this example is that Contact isn’t actually a database peer, it’s a bean.  It’s just a regular, run-of-the-mill POJO created using a constructor upon which mutators are invoked.  This doesn’t really seem significant until you consider the claim from the project page (emphasis mine):

jPersist is an extremely powerful object-relational persistence API that is based on the Active-Record pattern.

Hmm, that’s odd, because it sure looks like data mapping to me.  By definition, the active record pattern requires the entities to directly peer to database rows or (more correctly) result-set rows.  You can’t just instantiate an active record entity, it must be created for you since it has to peer to the database somehow.  More importantly, to satisfy the active record pattern, entities must have the ability to persist themselves, at least from an API standpoint.  In this example, we have to haul around an instance of Database which we invoke separately to create, update, delete and query our entities.  This property is the hallmark of the data mapper pattern, which (as the name implies) involves a separate mapper which translates data from one form (entity beans) to another (database records).

Upon digging a bit further, I did manage to find “ProxyExample.java”.  The title itself seemed promising as any implementation of the active-record pattern in Java must utilize proxies at some level (JDK interface proxies or otherwise).  Digging into the source for the example, I discovered an interface Contacts extending DatabaseObject which was used further down in the following way:

Database db = dbm.getDatabase();
Contacts c = (Contacts) db.castDatabase(Contacts.class);
 
c.setResultSetConcurrency(ResultSet.CONCUR_UPDATABLE);
c.executeQuery("select * from contacts where 1 = 0");
 
c.moveToInsertRow();
c.setContactId("contactId" + System.currentTimeMillis());
c.setPassword("password");
c.setFirstName("First Name");
c.setLastName("Last Name");
c.setCompanyName("Company Name");
c.setEmail("email");
c.insertRow();

So it does seem that the active-record pattern is supported as a secondary mechanism for persistence (the documentation heavily emphasizes the data mapper style).  I do find it a bit odd though that we’re executing queries against an entity directly, cursoring around in result sets and still performing operations with a non-peered entity instance.  It’s also a bad sign that even having the documentation in front of me, as well as a pretty solid knowledge of database concepts, I’m still not entirely sure what the above code does on a semantic level.

Incidentally, if you look at the javadoc for DatabaseObject, it is basically functioning as a super-interface for any persistence class.  Database is the most notable implementation.  Thus, technically speaking we’re still not getting the active-record pattern in the above sample, we’re just being handed a data mapper which maps data from itself to the database.  You could make an argument that this satisfies the basic properties of active record, but I’m skeptical.

Along this line, I’d like to point out that the API of DatabaseObject (and by extension, Database) is heavily dependant on modification of its internal state.  For example, notice anything odd about the following snippet?

Database db = dbm.getDatabase();
db.queryObject(new Contact(), 
        "where :companyName = 'United States' order by :lastName");
 
Contact contact = null;
while (db.hasNext() && db.next() != null) {
    db.loadObject(contact = new Contact());
    System.out.println(contact);
}

Passing over the introduction of yet another framework-specific query language, notice the way the calls are structured?  Database itself is being used as an iterator.  So in a nutshell, you query the database using the queryObject(Object, String) method, which opens a result-set and saves it internally.  You then iterate over the database, loading the current object as you go.  If you want to be academic about this, what we have here is called an “internal iterator”.  As any compsci 201 student will tell you, internal iterators are bad news, both for concurrency and scalability.  Also, they tend to somewhat unintuitive APIs (iterating over a database?) and non-standard idioms.  In addition to all, this approach also imposes a hard limit on the number of queries which can be inspected at once against a given Database instance (one).  This leads to more odd idioms (like maintaining n Database instances for flexibility) and can introduce some bugs which would be very difficult to track down.

Moving on through the examples, a few more things jumped out at more, both positive and negative.  As I mentioned above, jPersist introduces its own SQL-derivative query language which allows querying for entities directly using field names (something supported by Hibernate using HQL but not by ActiveRecord or ActiveObjects).  Consider the following jPersist example:

Database db = dbm.getDatabase();
db.queryObject(new Contact(), 
        "where :companyName = 'United States' order by :lastName");

Compare to the corresponding ActiveObjects code:

EntityManager em = ...
Contact[] contacts = em.query(Contact.class, 
        Query.select().where("companyName = 'United States'").order("lastName"));

Hmm, which do you think is more readable? (hint: it ain’t ActiveObjects)

There are varying schools of thought on whether framework-specific query languages are a good idea or not.  Personally, I’m very opposed to them due to the added complexity and overhead they impose.  Why would I want to learn n different query languages rather than sticking with good old SQL?  On the flip side, I can see why such “spin-off” languages are nice when I look at examples like the above.  It’s really a matter of personal preference.

Another thing that’s really worth mentioning is the design decision made by jPersist not to avail itself of Java 5 syntax.  For example, if you want to use prepared statement parameters with a jPersist query, you must pass them in the following way:

db.queryObject(customer, "where :state = ? order by :lastName", 
        new Object[] {"arizona"});

A basic application of varargs would have simplified this API tremendously. Another example of the lack of Java 5 usage is in the active-record sample:

Contacts c = (Contacts) db.castDatabase(Contacts.class);

Notice the cast? Adding a type parameter to the castDatabase method would eliminate the cast entirely without bulking up the method call in any way.

The major advantage to not using Java 5 features is compatibility.  You can use jPersist on pre-Java 5 systems, possibly even back as far as Java 1.2 (not sure if it uses assertions or not).  Though, this advantage is of questionable value given the scarcity of such neolithic JVMs.

There’s more that’s worth mentioning, like the rather odd syntax for defining relations or the all-important schema generation, but this post is already pushing record sizes.  In brief: jPersist exhibits some interesting properties, and it certain provides a unique perspective on the field of database persistence, but I can’t say I’m particularly fond of the API.  Hopefully, the weaknesses I’ve enumerated will eventually be rectified, resulting in a really excellent library.

Are GUIDs Really the Way to Go?

7
Nov
2007

I recently read a (slightly inflammatory) posting entitled “The Gospel of the GUID”.  In it, the author attempts to put forward several arguments in favor of using GUIDs for all database primary keys (as opposed to the more pedestrian sequence-generated INTEGER).  I’ve heard similar arguments in the past, and I keep coming back to the same conclusion: it’s all bunk.

To get right to the heart of my opinion, I really don’t see a compelling reason to use GUIDs for 90% of database use-cases.  Consider for a moment some of the really large databases in this world (Wikipedia, Slashdot’s comments table, Digg, etc).  I can’t think of a single web application in that league which uses GUIDs.  If you don’t believe me, look at the code for MediaWiki, the URL structure for Digg and the idiocy involving INTEGER vs BIGINT on Slashdot.  While de facto practice may not dictate optimal design, it does point to a trend that’s worthy of notice.  More importantly, it proves that INTEGER (or rather, BIGINT) primary key fields are practicable under real-world stress.  But what about the theoretical use-case?

Well to be honest, the theoretical use-case doesn’t interest me all that much.  I mean, if you tell me that your schema can support theoretically 29 trillion rows in its users table, I’ll geek out right along with you.  But if you try to feed that to a client, you’ll get either blank stares, or the equally common: “but there are only 5 billion people to chose from.”  (unless you’re talking to someone in upper management)  For an even better party, try telling your client that you can merge their data with one of their competitor’s in 45 minutes flat.  Somehow I doubt that argument will fly.

Now let’s look at the cons involved.  GUIDs are really, really hard to remember and type by hand.  When you’re in a hurry, pulling a little data-mine-while-you-wait for your boss, you’re not going to want to dig around and find that string you copy/pasted into Notepad.  Oh, and heaven forbid you actually type the GUID wrong!  Finding a single-bit deviation in a 64-bit alpha-numeric is not a trivial task let me tell you.  It’s at about this point that you start to think that maybe INTEGER keys would have been a better way to go.  At least then you can throw the userID into good ol’ short-term memory and bang it out ten seconds later in your SQL query window.  Oh, in case you’ve never seen a GUID before, here’s a couple quick examples:

  • {3F2504E0-4F89-11D3-9A0C-0305E82C3301}
  • {52335ABA-2D8B-4892-A8B7-86B817AAC607}
  • {79E9F560-FD70-4807-BEED-50A87AA911B1}

I’m not even sure how to read that, much less remember it.

Another thing worth considering is that GUIDs are VARCHARs within the database.  This means that not all ORMs will handle them appropriately.  Hibernate will, as will iBatis and ActiveObjects.  But if you’re using something like Django or ActiveRecord, you’re out of luck (disclaimer: Django might actually support VARCHAR primary keys, I’m not sure).  On top of that, some database clustering techniques don’t seem to work nicely when applied to GUID-based key schemes.  I haven’t run into this myself, but my good friend Lowell Heddings has told me tales of strange mystics and evil tides when playing with distributed indexes and GUIDs in MS SQL Server.  Technically speaking, while any sane database may work with a VARCHAR for the table’s primary key, the architects really had INTEGERs in mind when they designed the algorithms.  This means that by using GUIDs, you’re flirting with a less-well-tested use-case.  It’s certainly not unsupported, but you should consider yourself in the minority of users if you go that route.

Oh, and one little myth I’d like to debug: there really isn’t that much more overhead in grabbing the generated value of an INTEGER primary key than in sending an in-code generated GUID value.  Granted, some databases make this harder than others, but that’s their fault isn’t it?  As long as you’re using a well designed ORM, you really won’t see much of a performance increase from removing that miniscule callback.  In fact, depending on how your GUID algorithm works, you may see a significant degradation in overall performance due to extra overhead in your application.

Conclusion

So, the obligatory thirty second overview for our attention-deprived era:

Benefits of GUIDs perfectly unique across all your data and tables; way, way more head-room in terms of values
Problems with GUIDs really, really un-developer friendly; weird side-effects in databases and ORMs; unconventional approach to a “solved” problem; easy way to upset your DBA

So it really seems the only upshot for GUIDs is their massive range.  In trade-off, you lose usability from a raw SQL standpoint, your result sets are that much more cluttered and you risk odd bugs in your persistence backend.  Hmm, I wonder which one I’m going to use in my next app?

Before you make the choice, ask yourself: “how much uniqueness do I really require?”  Over-designing a solution is just as much a problem as under-designing one.  Try to find the mid-point that works best for your use-case.

Custom Data Types in ActiveObjects

17
Oct
2007

ORMs really interest me, so naturally I read a lot of material regarding ORMs of all kinds, especially Hibernate and ActiveRecord.  One of the more interesting reviews I read recently complained about the rigidity of the type system in the Rails ORM.  According to the author’s examination of the code, ActiveRecord just uses a monolithic switch/case statement to determine the appropriate Ruby type from the SQL type in the result set.  This may make sense from a simplicity standpoint, but it may not be the best approach when it comes to flexibility.

The problem with this approach is that it’s impossible to easily add new types to the ORM.  Granted, the framework authors could do it by modifying the switch/case statement(s) - and the approach does usually require more than one statement - and releasing a whole new version of the framework.  This is not a significant issue as the framework authors already have access to the full library sources.  The real trial is with third-party developers who require custom data types.

An alternative approach (suggested in the article) is to implement a series of type delegates inheriting from a common superclass, or possibly using a mixin as allowed by Ruby.  These type classes would each be responsible for a single type, handling the mapping both to and from the language-native type to the database type.  This would allow for both easy addition and modification of core types by the framework authors, but also trivial support for arbitrary types as implemented by third-party developers.

Not one to shirk good advise when I hear it, I’ve decided to go with this approach to types in ActiveObjects.  Formerly, I must admit I had gone with the multiple, giant switch/case statements.  This seemed to make sense when I first implemented the framework, but it developed, it became apparent that this was inadequate, especially if third-party types are desired.  This decision led to the refactoring of the type system and subsequent creation of the TypeManager class.

TypeManager is basically the singleton manager for the entire type system.  It maintains the list of available DatabaseType(s) and can resolve both Java classes and SQL types to the appropriate delegate.  A number of core types (VarcharType, IntegerType, etc) are added to the singleton instance of TypeManager, ensuring that basic functionality works without any extra effort on the part of the developer.  If a type other than the core types is needed, all that is necessary is to add the type delegate instance to the TypeManager prior to the type’s usage in either migrations or data access.  Thusly:

public interface Company extends Entity {
    public String getName();
    public void setName(String name);
 
    public Class<?> getJavaType();
    public void setJavaType(Class<?> type);
}
 
public class ClassType extends DatabaseType<Class<?>> {
 
    public ClassType() {
        super(Types.VARCHAR, 255, Class.class);
    }
 
    @Override
    public Class<?> convert(EntityManager manager, ResultSet res, 
                Class<? extends Class<?>&gt; type, String field) throws SQLException {
        try {
            return Class.forName(res.getString(field));
        } catch (Throwable t) {
            return null;
        }
    }
 
    @Override
    public void putToDatabase(int index, PreparedStatement stmt, 
                Class<?> value) throws SQLException {
        stmt.setString(index, value.getName());
    }
 
    @Override
    public Object defaultParseValue(String value) {
        try {
            return Class.forName(value);
        } catch (Throwable t) {
            return null;
        }
    }
 
    @Override
    public String valueToString(Object value) {
        if (value instanceof Class) {
            return ((Class<?>) value).getName();
        }
 
        return super.valueToString(value);
    }
 
    @Override
    public String getDefaultName() {
        return "VARCHAR";
    }
}
 
// ...
TypeManager.getInstance().addType(new ClassType());
 
Company[] stringCompanies = manager.find(Company.class, "javaType = ?", String.class);
for (Company c : stringCompanies) {
    System.out.println(c.getName() + " former held type " + c.getJavaType().getName());
 
    c.setJavaType(Exception.class);
    c.save();
}

The most complicated bit of the example above is the database type itself.  Yet even this delegate isn’t too horrible.  The ClassType class first specifies in its constructor which types it corresponds to, both database and Java.  Multiple Java class types can be specified, allowing for cases like IntegerType which maps to both Integer.class and int.class.

The rest of the database type is fairly self-explanatory.  There are methods to read the Java value out of a JDBC ResultSet, put the Java value back into a JDBC PreparedStatement, as well as three methods to handle some of the non-database type-sensitive operations, such as parsing a String value into a type-specific value and visa-versa.  These database non-specific conversions are required for things like parsing the value of a @Default or an @OnUpdate annotation.  Finally, getDefaultName() allows the default DDL rendering of the type to be specified.  This can be overridden in the DatabaseProvider implementation for that particular database, but the use of getDefaultName() allows for third party types that the database provider developers may not have foreseen.  Thus, it effectively opens the door to third-party types in migrations.

Of course, no example would be complete without another one to complement it!  Here’s how we could create a type delegate for the java.awt.Point class:

public class PointType extends DatabaseType<Point> {
    private static final Pattern PATTERN = Pattern.compile("x=(\\d+),y=(\\d+)");
 
    protected PointType() {
        super(Types.VARCHAR, 45, Point.class);
    }
 
    @Override
    public Point convert(EntityManager manager, ResultSet res, 
            Class<? extends Point> type, String field) throws SQLException {
        return (Point) defaultParseValue(res.getString(field));
    }
 
    @Override
    public void putToDatabase(int index, PreparedStatement stmt, Point value) 
                throws SQLException {
        stmt.setString(index, valueToString(value));
    }
 
    @Override
    public Object defaultParseValue(String value) {
        Point back = null;
        Matcher matcher = PATTERN.matcher(value);
 
        if (matcher.find()) {
            back = new Point();
            back.x = Integer.parseInt(matcher.group(1));
            back.y = Integer.parseInt(matcher.group(2));
        }
 
        return back;
    }
 
    @Override
    public String getDefaultName() {
        return "VARCHAR";
    }
}

One thing of note here which has changed from the previous example of ClassType is that the second parameter to the super constructor is now 45, instead of 255.  This parameter is actually the default precision of the SQL type when rendered into the database.  If the SQL type doesn’t have a precision or should just take the database default, a negative value should be specified for this parameter.  Another item of note is that we’re delegating work between methods in a way that I simply didn’t do for ClassType.  Because the rendering of the type in the database is in VARCHAR (String) form, we can rely upon our default String conversion methods to render into the database.  As an aside, the superclass implementation of valueToString(Object) uses the toString() method for that particular value.

As you can see, the type system in ActiveObjects is incredibly powerful and capable of satisfying many use-cases that were impossible in previous versions or other ORMs.  Hopefully this brief glimpse into advanced uses of the type system will aid you in databasing efforts.

Custom Primary Keys with ActiveObjects

15
Oct
2007

One of the main complaints I’ve heard leveled against ActiveObjects is that it’s just not suitable for mapping to legacy schemas.  More generically, concerns have been mooted that it enforces naming conventions and field conventions which aren’t suitable/preferable for some projects.  I suppose at first both of these were true.  After all, ActiveObjects’s entire premise was convention over configuration, and this requires some restrictions by default.  However, I don’t think it’s entirely accurate any longer.

Over the last few months, I’ve added several features which satisfy three primary goals:

  • Customize the table name convention
  • Customize the field name convention
  • Allow for primary key fields (and types) other than id INTEGER

The first two goals were easily met through the addition of TableNameConverter and FieldNameConverter.  These two classes are used by every feature within ActiveObjects, from migrations to simple data access, to determine the database table and field names from the class and method names respectively.  The canonical example of this is table name pluralization, which can be accomplished in the following way:

EntityManager manager = new EntityManager(
    "jdbc:mysql://localhost/test", "username", "secret");
manager.setTableNameConverter(new PluralizedNameConverter());

Not too horrible.  The second use-case is assigning a different field name convention than the default camelCase.  For example, some people really like the ActiveRecord (Rails) field naming convention.  (e.g. “first_name” as opposed to “firstName”)  This can easily be accomplished by specifying a field name converter:

EntityManager manager = new EntityManager(
    "jdbc:mysql://localhost/test", "username", "secret");
 
// lower_case convention
manager.setFieldNameConverter(new UnderscoreFieldNameConverter(false));

Custom table and field name converters are also possible, allowing for a great deal of flexibility in name conventions.  Additionally, it’s always possible to specify field and table names directly in the entities, using the @Accessor, @Mutator and @Table annotations respectively.

Custom Primary Keys

The most challenging goal (from a library standpoint) is to allow for primary key fields other than “id”.  This is partially such a challenge because it had been hard coded literally everywhere in ActiveObjects that the “id” field is the field to use in any sort of SELECT, JOIN, INSERT, UPDATE, etc.  In short, changing this required finding all of these instances and converting the code to query a centralized source for the data.  A few days of fiddling with Eclipse’s text search accomplished this without inordinate pain, but the hard part was coming.

The question remained: how to specify the primary key within the entity itself?  After all, it’s been hard coded and sort of magically “worked” based on the method definition in the Entity superinterface.  There had been a syntax to specify a second PRIMARY KEY for the schema migration, but ActiveObjects didn’t treat these fields any differently, and this sort of syntax wouldn’t really cut it if we were trying to completely override the existing getID() method in the superinterface.

The solution is to refactor all of the interesting functionality in Entity up into a super-superinterface, RawEntity.  Thus the only method defined within Entity would be getID(), annotated appropriately to be recognized as a PRIMARY KEY field.  This would do away with all the magic tricks under the surface which assumed the existence of the getID() method.  ActiveObjects can easily parse the class to find the PRIMARY KEY field amongst the methods, both defined and inherited.  The only compromise which must be made is only one PRIMARY KEY can now be allowed per table.  This isn’t such an issue, since 99% of the time, that’s all you need anyway.  Usually that remaining 1% can be more properly accomplished using UNIQUE and some sort of auto-generation of values.

Since we’ve refactored interesting functionality up into RawEntity and kept getID() within Entity, no legacy code needs to be changed.  Any entities previously written against ActiveObjects will run without modification or any behavior changes.  We are merely allowed the flexibility of specifying our own primary keys.  So, without further ado, the obligatory example:

public interface Person extends Entity {
    public String getFirstName();
    public void setFirstName(String firstName);
 
    public String getLastName();
    public void setLastName(String lastName);
 
    public Company getCompany();
    public void setCompany(Company company);
 
    public House getHome();
    public void setHome(House home);
}
 
public interface Company extends RawEntity<String> {
 
    @PrimaryKey
    @NotNull
    @Generator(UUIDValueGenerator.class)
    public String getCompanyKey();
 
    public String getName();
    public void setName(String name);
 
    @OneToMany
    public Person[] getEmployees();
}
 
public interface House extends RawEntity<Integer> {
 
    @PrimaryKey
    @NotNull
    @AutoIncrement
    public int getHouseID();
 
    // ...
 
    @OneToMany
    public Person[] getOccupants();
}
 
public class UUIDValueGenerator implements ValueGenerator<String> {
    public String generateValue(EntityManager em) {
        // generate uuid
        return uuid;
    }
}
 
// ...
Person p = manager.get(Person.class, 1);
Company c = manager.get(Company.class, "abff999dd99ddf0a225f");

Maybe a bit longer of an example than you were expecting, but it does cover the material well.  What’s happening here is the Person entity has a standard, “id” primary key.  This follows the same convention that ActiveObjects has been enforcing since the beginning of time (or at least since I started the project).  Company and House are the interesting entities here.

House defines a getHouseID() method of type int which is marked as a PRIMARY KEY as well as being auto-incremented by the database (SERIAL on PostgreSQL, AUTO_INCREMENT on MySQL, etc).  This is the same sort of declaration that you would find if you looked in the source for Entity.  The difference is that House will not contain the “id” field and its PRIMARY KEY will be “houseID”.  The really interesting entity here Company.

Company defines a primary key that is not only a different field, but also an entirely different type.  Also, its value is generated automatically not by the database, but by the application itself.  This is a fairly common use-case in those crazy databases which use UUIDs as primary keys.  Not only does this field define “companyKey” as a different type than INTEGER, but it also ensures that the “companyID” FORIEGN KEY field in the “person” table is also of type VARCHAR.

Another item of note in this example is that the RawEntity interface is parameterized.  This is to allow the get(...) method in EntityManager to stay type-checked, ensuring that the values passed are actually valid primary key values for the entity in question.  Of course, there’s nothing that can be done to ensure that the actual method definition of the primary key is of the proper type.  However, at some point the developer must be trusted to make sure their entity model doesn’t violate the dictates of logic.

Conclusion

With this latest addition to the ActiveObjects feature set, it should be possible to use the ORM with any schema whatsoever.  While AO may still be an implementation of the active record pattern, and thus less powerful than solutions such as Hibernate, there should be no problems applying AO to just about any sane use-case.

ActiveObjects: Indexing vs Searching

1
Oct
2007

So in the intervening time since I last updated you on ActiveObjects, I’ve been busy refactoring and repurposing some of the core.  I’ve added a full JUnit test suite, which definitely helps my own confidence about the stability of the source.  Also, there’s a whole bunch of new features that have come down the pipe which hopefully I’ll get to address in the next few posts.  So, without further ado…

The change which is probably going to cause you the most grief is the switch from @Index to @Searchable, and the addition of the @Indexed annotation.  Yes, you really did read that correctly; and no @Indexed isn’t even related to the functionality provided by the old @Index annotation.

The old @Index annotation used to handle tagging of entity methods to mark them as to be added to the Lucene full-text search index (see this post for more details).  This was a little confusing for a number of reasons, not the least of which my failure to remember the tense of the annotation name (see the comments on the indexing post).  By convention, most Java annotations are declarative in name.  Thus, the name should not be the present tense “index” but the past tense “indexed”.  So my first thought was to just refactor the annotation, but then I came into a slightly hairier name-clash.

Database Field Indexing

One of the most common techniques for optimizing your database’s read (SELECT) performance is to create indexes on certain fields.  When a field is indexed, the database will maintain some separate hash tables to enable very fast selection of rows based on the field in question.  This is a really good thing for almost all foreign keys, for example:

SELECT ID FROM people WHERE companyID = ?

Here we’re SELECTing the id field from the people table where companyID matches a certain value.  The database can execute this query fairly quickly.  In fact, the only bottle-neck is finding all of the rows which match the specified companyID value.  In a table containing hundreds of thousands of rows, one can see how this could be a problem.

The problem goes away (sort of) with the use of field indexing.  Instead of having to linearly search through the table for rows matching the companyID, the database can perform a quick hash lookup in an index and get a set of rowid(s) based on the companyID.  Simple, efficient, and incredibly scalable.  Practically, the DBMS wouldn’t take any longer to execute such a query against a table of 100,000,000 rows than it does to execute the same query against a table of 100 rows.  So, this is a really good thing right?

Well, indexes have their drawbacks.  I won’t go into all of the reasons not to use indexes, the following two points will likely suffice:

  • Indexing slows down UPDATEs and INSERTs
  • Indexing adds to your table’s storage requirements (all those hashes have to be put somewhere)

So perfect database performance isn’t attainable just by indexing every field, one has to be quite judicious about it.  In fact, choosing the fields to be indexed is really as much of an art as it is a science.

This long and tangled introduction actually did have a point…  ActiveObjects didn’t have any support for field indexing.  I hadn’t really considered the possibility, so I didn’t factor it into my design.  In retrospect, this was probably a bad idea.  So making up for lost time, I’ve now introduced field indexing into the library!

public interface Person extends Entity {
 
    @Indexed
    public String getName();
    @Indexed
    public void setName(String name);
 
    public int getAge();
    public void setAge(int age);
 
    public Company getCompany();
    public void setCompany(Company company);
}

When ActiveObjects generates the migration DDL for this table (running against MySQL), it will look something like this:

CREATE TABLE people (
    ID INTEGER NOT NULL AUTO_INCREMENT,
    NAME VARCHAR(255),
    age INTEGER,
    companyID INTEGER,
    CONSTRAINT fk_people_companyID FOREIGN KEY (companyID) REFERENCES companies(ID),
    PRIMARY KEY(ID)
);
CREATE INDEX NAME ON people(NAME);
CREATE INDEX companyID ON people(companyID);

This is where the aforementioned @Indexed annotation comes in.  As you can see from the resultant DDL, adding a field index is as simple as tagging the corresponding method.  Also, foreign keys are automatically indexed, to ensure maximum performance in SELECTion of relations.

So that’s the good news, the bad news is that index creation doesn’t play too nicely with migrations.  Everything works of course, and if you’re actually adding a table or field, the corresponding index(es) will also be created.  Likewise if you drop a table or field, the corresponding index(es) will also be dropped.  However, that’s about the limit of the migrations support for indexes at the moment.  JDBC has an unfortunately limitation which prevents developers from getting a list of indexes within a database.  Since I have no way (within JDBC) of finding pre-existing indexes, they must be excluded from the schema diff and thus are CREATEd or DROPped irregardless of the existing schema.  I do have a plan to fix this (along with migrations on Oracle), but I seriously doubt that it’ll be included in the 1.0 release, owing to the changes required.

Refactoring End Result

The final result of the refactoring and the adding of field indexing is that the annotation formerly named @Index is now the @Searchable annotation.  Likewise (and keeping with convention), the formerly IndexingEntityManager is now named SearchableEntityManager.  Both of these types remain in the net.java.ao package.  To allow for field indexing, the @Indexed annotation was added to the net.java.ao.schema package, owing to the fact that it only effects schema generation and doesn’t change runtime behavior in the framework at all.

Hopefully these changes won’t be too confusing (now that you’re aware of them) and will be a welcome addition to the ActiveObjects functionality.  As always, I welcome comments, suggestions and criticisms!