Custom Primary Keys with ActiveObjects

15 Oct 2007

One of the main complaints I’ve heard leveled against ActiveObjects is that it’s just not suitable for mapping to legacy schemas.  More generically, concerns have been mooted that it enforces naming conventions and field conventions which aren’t suitable/preferable for some projects.  I suppose at first both of these were true.  After all, ActiveObjects’s entire premise was convention over configuration, and this requires some restrictions by default.  However, I don’t think it’s entirely accurate any longer.

Over the last few months, I’ve added several features which satisfy three primary goals:

  • Customize the table name convention
  • Customize the field name convention
  • Allow for primary key fields (and types) other than id INTEGER

The first two goals were easily met through the addition of TableNameConverter and FieldNameConverter.  These two classes are used by every feature within ActiveObjects, from migrations to simple data access, to determine the database table and field names from the class and method names respectively.  The canonical example of this is table name pluralization, which can be accomplished in the following way:

EntityManager manager = new EntityManager(
    "jdbc:mysql://localhost/test", "username", "secret");
manager.setTableNameConverter(new PluralizedNameConverter());

Not too horrible.  The second use-case is assigning a different field name convention than the default camelCase.  For example, some people really like the ActiveRecord (Rails) field naming convention.  (e.g. “first_name” as opposed to “firstName”)  This can easily be accomplished by specifying a field name converter:

EntityManager manager = new EntityManager(
    "jdbc:mysql://localhost/test", "username", "secret");
 
// lower_case convention
manager.setFieldNameConverter(new UnderscoreFieldNameConverter(false));

Custom table and field name converters are also possible, allowing for a great deal of flexibility in name conventions.  Additionally, it’s always possible to specify names directly in the entities: field names with the @Accessor and @Mutator annotations, and table names with the @Table annotation.
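To give a feel for what a name converter has to compute, here is a minimal, self-contained sketch.  It does not use the ActiveObjects classes at all: NameConventionSketch, toUnderscore and toPluralTableName are hypothetical names, and the pluralization is deliberately naive (it knows nothing about irregular forms such as “people”).

```java
/** Hypothetical stand-in for name conversion logic; not the ActiveObjects API. */
final class NameConventionSketch {

    /** Converts a camelCase name such as "firstName" to "first_name". */
    static String toUnderscore(String camelCase) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < camelCase.length(); i++) {
            char c = camelCase.charAt(i);
            // Every upper-case letter after the first character starts a new word.
            if (Character.isUpperCase(c) && i > 0) {
                out.append('_');
            }
            out.append(Character.toLowerCase(c));
        }
        return out.toString();
    }

    /** Naive pluralization: lower-cases the first letter and appends "s" or "es". */
    static String toPluralTableName(String className) {
        String base = Character.toLowerCase(className.charAt(0)) + className.substring(1);
        return base.endsWith("s") ? base + "es" : base + "s";
    }

    public static void main(String[] args) {
        System.out.println(toUnderscore("firstName"));
        System.out.println(toPluralTableName("House"));
    }
}
```

A real converter would need a table of irregular plurals and smarter handling of runs of capitals; the point here is only that the mapping is a pure function from Java names to database names.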

Custom Primary Keys

The most challenging goal (from a library standpoint) is to allow for primary key fields other than “id”.  Part of the challenge is that it had been hard-coded literally everywhere in ActiveObjects that the “id” field is the field to use in any sort of SELECT, JOIN, INSERT, UPDATE, etc.  In short, changing this required finding all of these instances and converting the code to query a centralized source for the data.  A few days of fiddling with Eclipse’s text search accomplished this without inordinate pain, but the hard part was still to come.

The question remained: how to specify the primary key within the entity itself?  After all, it’s been hard coded and sort of magically “worked” based on the method definition in the Entity superinterface.  There had been a syntax to specify a second PRIMARY KEY for the schema migration, but ActiveObjects didn’t treat these fields any differently, and this sort of syntax wouldn’t really cut it if we were trying to completely override the existing getID() method in the superinterface.

The solution is to refactor all of the interesting functionality in Entity up into a super-superinterface, RawEntity.  Thus the only method defined within Entity would be getID(), annotated appropriately to be recognized as a PRIMARY KEY field.  This would do away with all the magic tricks under the surface which assumed the existence of the getID() method.  ActiveObjects can easily parse the class to find the PRIMARY KEY field amongst the methods, both defined and inherited.  The only compromise which must be made is only one PRIMARY KEY can now be allowed per table.  This isn’t such an issue, since 99% of the time, that’s all you need anyway.  Usually that remaining 1% can be more properly accomplished using UNIQUE and some sort of auto-generation of values.
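The “parse the class to find the PRIMARY KEY field” step can be sketched with plain reflection.  This is an illustration under assumptions, not the library’s code: the Key annotation and the Base, WithId and Person interfaces are hypothetical stand-ins defined inside the sketch itself.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

final class PrimaryKeyScanSketch {

    /** Hypothetical stand-in for a primary key marker annotation. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Key {}

    /** Plays the role of the super-superinterface: behavior only, no key. */
    interface Base {
        void save();
    }

    /** Plays the role of the default entity type: declares only the key. */
    interface WithId extends Base {
        @Key
        int getID();
    }

    interface Person extends WithId {
        String getFirstName();
    }

    /** Returns the single method marked with @Key, declared or inherited. */
    static Method findPrimaryKey(Class<?> type) {
        Method found = null;
        // getMethods() also returns methods inherited from superinterfaces.
        for (Method m : type.getMethods()) {
            if (m.isAnnotationPresent(Key.class)) {
                if (found != null) {
                    throw new IllegalStateException("More than one primary key in " + type.getName());
                }
                found = m;
            }
        }
        if (found == null) {
            throw new IllegalStateException("No primary key in " + type.getName());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findPrimaryKey(Person.class).getName());
    }
}
```

Rejecting a second annotated method mirrors the one-primary-key-per-table compromise described above.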

Since we’ve refactored interesting functionality up into RawEntity and kept getID() within Entity, no legacy code needs to be changed.  Any entities previously written against ActiveObjects will run without modification or any behavior changes.  We are merely allowed the flexibility of specifying our own primary keys.  So, without further ado, the obligatory example:

public interface Person extends Entity {
    public String getFirstName();
    public void setFirstName(String firstName);
 
    public String getLastName();
    public void setLastName(String lastName);
 
    public Company getCompany();
    public void setCompany(Company company);
 
    public House getHome();
    public void setHome(House home);
}
 
public interface Company extends RawEntity<String> {
 
    @PrimaryKey
    @NotNull
    @Generator(UUIDValueGenerator.class)
    public String getCompanyKey();
 
    public String getName();
    public void setName(String name);
 
    @OneToMany
    public Person[] getEmployees();
}
 
public interface House extends RawEntity<Integer> {
 
    @PrimaryKey
    @NotNull
    @AutoIncrement
    public int getHouseID();
 
    // ...
 
    @OneToMany
    public Person[] getOccupants();
}
 
public class UUIDValueGenerator implements ValueGenerator<String> {
    public String generateValue(EntityManager em) {
        // generate a random UUID with the JDK's java.util.UUID
        return java.util.UUID.randomUUID().toString();
    }
}
 
// ...
Person p = manager.get(Person.class, 1);
Company c = manager.get(Company.class, "abff999dd99ddf0a225f");

Maybe a bit longer of an example than you were expecting, but it does cover the material well.  What’s happening here is the Person entity has a standard, “id” primary key.  This follows the same convention that ActiveObjects has been enforcing since the beginning of time (or at least since I started the project).  Company and House are the interesting entities here.

House defines a getHouseID() method of type int which is marked as a PRIMARY KEY as well as being auto-incremented by the database (SERIAL on PostgreSQL, AUTO_INCREMENT on MySQL, etc).  This is the same sort of declaration that you would find if you looked in the source for Entity.  The difference is that House will not contain the “id” field and its PRIMARY KEY will be “houseID”.  The really interesting entity here is Company.

Company defines a primary key that is not only a different field, but also an entirely different type.  Also, its value is generated automatically not by the database, but by the application itself.  This is a fairly common use-case in those crazy databases which use UUIDs as primary keys.  Not only does this field define “companyKey” as a different type than INTEGER, but it also ensures that the “companyID” FOREIGN KEY field in the “person” table is also of type VARCHAR.

Another item of note in this example is that the RawEntity interface is parameterized.  This is to allow the get(...) method in EntityManager to stay type-checked, ensuring that the values passed are actually valid primary key values for the entity in question.  Of course, there’s nothing that can be done to ensure that the actual method definition of the primary key is of the proper type.  However, at some point the developer must be trusted to make sure their entity model doesn’t violate the dictates of logic.
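The way a type parameter keeps a lookup method type-checked can be shown in isolation.  The sketch below is not the EntityManager implementation; Keyed, Manager and lastKey are hypothetical names, and the manager merely records the key instead of returning an entity.

```java
import java.util.HashMap;
import java.util.Map;

final class TypedKeySketch {

    /** Stand-in for a parameterized entity supertype; K is the primary key type. */
    interface Keyed<K> {}

    interface Company extends Keyed<String> {}

    interface House extends Keyed<Integer> {}

    /** Records lookups only; a real manager would hand back entity instances. */
    static final class Manager {
        final Map<Class<?>, Object> lastKey = new HashMap<>();

        /** The key argument must match the K declared by the entity type. */
        <T extends Keyed<K>, K> void get(Class<T> type, K key) {
            lastKey.put(type, key);
        }
    }

    public static void main(String[] args) {
        Manager manager = new Manager();
        manager.get(Company.class, "abff999dd99ddf0a225f"); // String key: compiles
        manager.get(House.class, 42);                        // Integer key: compiles
        // manager.get(House.class, "forty-two");            // rejected at compile time
        System.out.println(manager.lastKey.get(House.class));
    }
}
```

As the paragraph above says, nothing here checks that the annotated getter actually returns K; only the key passed at the call site is verified.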

Conclusion

With this latest addition to the ActiveObjects feature set, it should be possible to use the ORM with any schema whatsoever.  While AO may still be an implementation of the active record pattern, and thus less powerful than solutions such as Hibernate, there should be no problems applying AO to just about any sane use-case.

Is a Separate Text Search Engine a Bad Idea?

11 Oct 2007

I was reading this blog entry a few days ago, and it started me thinking about full-text searching.  That wasn’t the main topic of the post, but I think the little side-trek into the field was interesting enough to merit some thought.  Right smack in the middle, Jamie goes on a bit of a rant about the pain of what is effectively two separate databases (for example, MySQL and Lucene):

A fellow Rails developer asked me in all seriousness why I wasn’t abandoning the full text search functionality of TSearch2 and just using a completely separate, redundant database product designed exclusively for full text search. Seriously, that is considered the “easy” approach: one database for full text search, and another for ACID/OLTP/CRUD. Honestly if I were going to go down that road I would try hard to just abandon the SQL RDMBS and put everything in the other database, since Lucene and its imitators are capable of far more than just find-text-in-document queries. The pain of duplicating everything, using two query languages, two document representations (in addition to the object representation in Ruby) and writing application-tier query correlation makes the double-DB approach seem very unwise.

There is some validity to this thought.  After all, duplication in software usually means you’re doing something wrong - or at least, there could be an easier way.  Even ignoring this precept, it’s just common sense that keeping data synchronized concurrently between two data sources as complex as a relational database and a full-text index is not an easy task.  Granted, some ORMs can handle this task for you (actually, I can only think of Hibernate and ActiveObjects having this feature), but the principle is the same.  And even if everything is neatly and auto-magically synced, there’s always a danger of something getting out of place, and then you’re stuck with a transient stale data issue that’s difficult to track down.

The author of the post mentions that he favors the full-text search capabilities of PostgreSQL, the popular open-source database and competitor to MySQL.  This does have the advantage that you’re putting all the data in one place, handling everything with a single query language (SQL), and reducing the technologies your software depends upon.  This inarguably makes things a whole lot easier.

The main problem as I see it is that this puts a ton of unnecessary strain on the database.  In most modern server-side applications, the bottleneck is in the database (usually caused by too much badly written SQL).  There are whole mountains of documentation which offer suggestions on how to alleviate this problem.  Indexes, database clustering and a carefully chosen ORM can go a long way.  Unfortunately, tacking on full-text indexing seems like a step in the wrong direction.

Lucene is very good at what it does.  Its indexing and storage performance is second to none.  In fact, it’s so fast that a lot of companies use it as a quick-and-dirty storage dumping ground for raw data, knowing that it will be much faster and more scalable than a relational database.  Why not take advantage of this incredible power and take one more item off of your database’s back?  This is all not to mention the fact that a Lucene index query is probably a lot faster than an SQL query grabbing data from a PostgreSQL full-text index.

So what about the flip side of things?  Why not just put all the data into Lucene (or a clone) and eschew relational databases altogether?  Well, as I mentioned above, a lot of companies do this for simple things.  Lucene is fantastic at scaling, and at very fast indexing and querying of large blocks of text.  Where it begins to trip up is when you turn it loose on other data types.  Don’t get me wrong, Lucene is an amazing piece of technology.  But just like PostgreSQL isn’t a full-text search engine, Lucene isn’t an RDBMS.  Each component of the infrastructure needs to handle what it’s best at.  In fact, this is really a large aspect of scalability.  Ensuring that every technology is utilized to its fullest potential and no more is crucial to a high-volume application.

Final verdict?  I think I’m sticking with MySQL and Lucene working in tandem, each doing what they do best.  ActiveObjects makes the synchronization almost completely transparent, so it’s not like I’m loading myself down with unnecessary work from a code standpoint.  Seems like a good solution to me; and since most of the industry agrees, it’s probably a safe bet for you too.

Performance is Good: ActiveObjects vs ActiveRecord

14 Aug 2007

So ActiveObjects is a fairly cool ORM.  However, coolness alone does not an enterprise ORM make.  In fact, the real qualifications for an enterprise-ready framework are as follows:

  • Stability
  • Performance

I’m sure there are other questions which factor into design decisions on whether or not to use a library, but those are the two which I look at most closely.  Stability is usually a hard metric to find, since it usually depends on a lot of adopters hammering the library until it breaks, is fixed and then hammered again.  However, performance numbers are almost always easy to come by, since all that is required is a few simple benchmark tests to get a ballpark number.

Since benchmarks are so fun, I’ve decided to do a few for ActiveObjects.  Or rather, I’ve decided to run a simple (read, very simple) benchmark test with ActiveObjects as well as a number of other ORMs.  At the moment, I’ve only been able to run the test with ActiveRecord (sorry guys, Hibernate’s a really complex framework), but I think the numbers are still worth looking at.

ActiveRecord claims only a 50% overhead compared to manual database access (that number is actually listed as a feature).  There has been some dispute over whether the test used to obtain that particular figure was valid or not, but that’s beside the point.  ActiveObjects should be able to do at least that well, right?

Well, as it turns out, it can.  Here are the numbers from my reasonably simple benchmark:

ActiveObjects
==============
Queries test: 55 ms
Retrieval test: 68 ms
Persistence test: 55 ms
Relations test: 154 ms

ActiveRecord
=============
Queries test: 154 ms
Retrieval test: 6 ms
Persistence test: 76 ms
Relations test: 75 ms

Surprisingly close numbers actually.  I had assumed that there would be some significant disparity, one way or another.  However, as you can see ActiveObjects is fairly comparable to ActiveRecord on a set of extremely trivial tests.  There are some jumps and obvious areas of strength/weakness in both frameworks, but on average they’re pretty similar in performance.

As my friend Lowell Heddings pointed out, ORM benchmarks are far more useful if you actually examine the SQL generated to see how efficient it really is from a theoretical standpoint.  So, to make things easier I sed/grepped the logs and arrived at the following SQL outputs for each respective ORM.

Details

Now, I will be the first to admit that this is hardly an even test to begin with.  Obviously there are different strengths and weaknesses in every library, and though I tried to be impartial in designing the benchmarks, I probably accidentally favored one ORM over the other.  Also, there are inherent performance advantages to Java over Ruby, especially in the area of database access.  In short, ActiveObjects probably had a sizeable advantage coming right out of the gate, so take my numbers with a grain of salt.

The test itself consisted of four phases, each involving three entities: Person, Profession and Workplace.  Person has a many-to-many relation with Profession through a fourth entity, Professional.  Workplace has a one-to-many relation with Person.  These relations were exploited directly in the relations benchmark (e.g. Person#getProfessions(), Workplace#getPeople(), etc).  Each entity had a number of fields, including one CLOB (or TEXT, as MySQL refers to them) in the Person entity.  The tables for each respective schema were pre-populated with the same data, which involved several rows with different values (except for the CLOB, which was a roughly 4000 character paragraph and the same for every row).  In the ActiveObjects Person entity, I used the @Preload annotation to eagerly load firstName and lastName.

For the retrieval test, the benchmark iterates through every Person row and grabs firstName, lastName, age, alive, and bio.  Since ActiveObjects only preloaded firstName and lastName, it suffered a bit here. 

The persistence test iterates through every person row and changes the first and last name to one selected from a pool of names I populated with random names which came to mind.  It then goes through the same iteration again and sets the age, alive flag and the bio to our 4000-character Pulitzer-winning essay.  Each row is saved through each iteration, thus each row is saved exactly twice throughout the test.  ActiveObjects came out ahead here probably because of its use of PreparedStatements, as well as the more efficient UPDATE statement generation.

The relations test involved first finding all of the Professions associated with each individual Person and retrieving the Profession name.  Next, the Workplace for the Person is retrieved, then all of the Person(s) associated with that Workplace and their firstName and lastName values accessed.

The queries test was little more than getting all of the Person(s), all of the Workplace(s), all of the Professional mappings, along with all of the Profession(s).  ActiveObjects far outperformed ActiveRecord in this area since ActiveRecord uses SELECT * for everything and eagerly loads the row values.  This means (especially with a CLOB thrown into the mix) that ActiveRecord’s initial query time will be very long, while its field access time will be very quick.  Most ORMs function in this way, and it can be a very good thing at times (our benchmark is one of those times).
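For readers who want to time phases like these themselves, a bare-bones harness might look like the following.  This is not the benchmark code from the repository; PhaseTimerSketch and timeMillis are hypothetical names, the phase bodies are empty placeholders, and a single cold run like this ignores JIT warm-up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

final class PhaseTimerSketch {

    /** Runs one phase and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Runnable phase) {
        long start = System.nanoTime();
        phase.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Placeholder bodies; a real phase would exercise the ORM under test.
        Map<String, Runnable> phases = new LinkedHashMap<>();
        phases.put("Queries", () -> {});
        phases.put("Retrieval", () -> {});
        phases.put("Persistence", () -> {});
        phases.put("Relations", () -> {});

        for (Map.Entry<String, Runnable> phase : phases.entrySet()) {
            System.out.println(phase.getKey() + " test: " + timeMillis(phase.getValue()) + " ms");
        }
    }
}
```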

Lessons Learned

  • Eager loading can be a good thing
  • ActiveObjects generates some weird SQL for relations access

Obviously I can only do so much about the eager loading issue.  I believe pretty strongly that ActiveObjects’s approach (in lazy loading most things) is the right one for most use-cases.  However, the second lesson to be learned here is one which I think I need to take a bit more to heart: keep it simple SQL.

Normally, ActiveObjects will generate a query something like the following for accessing a one-to-many relation:

SELECT DISTINCT a.outMap AS outMap FROM (
    SELECT ID AS outMap,workplaceID AS inMap FROM people 
       WHERE workplaceID = ?) a

Yuck!  For obvious reasons, this is an incredibly inefficient bit of querying.  Actually, not only is it inefficient, but needlessly so.  You and I of course know that we could replace the above query with the much simpler:

SELECT ID FROM people WHERE workplaceID = ?

So why doesn’t ActiveObjects do that?  Frankly, I was lazy in my coding of the EntityProxy#retrieveRelations method, so a lot of ugly SQL slipped through the cracks in cases where it really wasn’t necessary.  I’ve spent a bit of time on this, and I think I’ve got the issue resolved.  The problem is that ActiveObjects was assuming that any relation (one-to-many or many-to-many) can have multiple mapping fields, thus requiring a wrapping DISTINCT outer query around a subquery SELECT which is UNIONed with an arbitrary number of other SELECTs, corresponding to the other mapping fields.  Obviously, it is almost never the case that we have to deal with multiple mapping paths, so I added a short-circuit to the logic which creates far simpler queries if at all possible.  As a result, the benchmark numbers for the relations test in ActiveObjects are between 80 and 100 ms.  Still slower than ActiveRecord, but much improved.
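The short-circuit can be sketched roughly as follows.  This is not the actual EntityProxy#retrieveRelations code: RelationQuerySketch and buildQuery are hypothetical, and the multi-field branch is only a reconstruction of the DISTINCT/UNION shape described above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

final class RelationQuerySketch {

    /** Builds the ID lookup for a relation; a single mapping field takes the short path. */
    static String buildQuery(String table, List<String> mappingFields) {
        if (mappingFields.isEmpty()) {
            throw new IllegalArgumentException("At least one mapping field is required");
        }
        if (mappingFields.size() == 1) {
            // Short-circuit: no subquery, no DISTINCT, no UNION.
            return "SELECT ID FROM " + table + " WHERE " + mappingFields.get(0) + " = ?";
        }
        // General case: one SELECT per mapping field, merged and de-duplicated.
        StringJoiner union = new StringJoiner(" UNION ");
        for (String field : mappingFields) {
            union.add("SELECT ID AS outMap," + field + " AS inMap FROM " + table
                    + " WHERE " + field + " = ?");
        }
        return "SELECT DISTINCT a.outMap AS outMap FROM (" + union + ") a";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("people", Arrays.asList("workplaceID")));
        // "managerID" is a made-up second mapping field, used only to show the general branch.
        System.out.println(buildQuery("people", Arrays.asList("workplaceID", "managerID")));
    }
}
```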

It’s worth noting that if we ran each benchmark twice, we would see a marked improvement in the ActiveObjects performance the second time through.  Not just because a lot of the values would be cached, but also because the prepared statements in question would have been compiled and stored.  This is a fairly major area in which ActiveRecord falls short since it doesn’t utilize prepared statements, thus having a constant runtime for its queries and remaining unable to take advantage of cached, compiled queries.

So in short, ActiveObjects may be really neat, but its performance numbers don’t seem all that superior to those of ActiveRecord, a Ruby ORM with numerous known shortcomings in this area.  I guess I need to work on things a bit more.  :-)  Next up, either manual JDBC code or Hibernate running the same benchmark, depending on how soon I’m able to figure out Hibernate’s crazy XML mapping schema.

Note: I forgot to mention this… You can get the source for my benchmark from the ActiveObjects SVN repository: svn co https://activeobjects.dev.java.net/svn/activeobjects/trunk/Benchmarks

Even More ActiveObjects: Preloading

13 Aug 2007

There has been some talk recently regarding the ActiveObjects lazy-loading mechanism.  It’s starting to seem that what I thought was a great idea and terribly innovative when I designed the framework might not have been such a great idea after all.  :-)  That’s a good thing though (finding my mistakes, that is); it just forces me to think a little harder about how to solve the problem.

One of the guiding ideas behind ActiveObjects is that nothing should be loaded until it’s needed.  Once it’s loaded, it should be cached and then up-chucked on command, obviating the need for multiple loads.  This technique, commonly known as “lazy-loading”, works really well if you’re in a memory-crunch situation.  This is because even for tables with extremely large numbers of columns (think 50-100), none of the data in a row is loaded if you don’t need it.  Thus, you could work with a database-peered object without having to load the entire row into memory, a potentially long and expensive operation.

The problem with this is it tends to create large numbers of queries.  Also, it can be very inefficient for certain types of operations.  For example:

for (Person p : manager.find(Person.class)) {
    System.out.println(p.getName());
}

This will generate the following SQL (assuming 6 rows in the people table):

SELECT ID FROM people
SELECT NAME FROM people WHERE ID = ?
SELECT NAME FROM people WHERE ID = ?
SELECT NAME FROM people WHERE ID = ?
SELECT NAME FROM people WHERE ID = ?
SELECT NAME FROM people WHERE ID = ?
SELECT NAME FROM people WHERE ID = ?

Granted, it’s a prepared statement, so it will be compiled and run very quickly 5 out of 6 times.  However, this is still pretty inefficient.  Imagine if there were 100,000 people in the database, instead of 6 (not an unreasonable assumption).  This code could take hours to run.
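To make the query count concrete, here is a small simulation with no database involved.  FakePeopleTable is a hypothetical in-memory stand-in that simply counts how many “queries” the lazy pattern issues: one for the IDs plus one per row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LazyLoadCountSketch {

    /** In-memory stand-in for the people table that counts each simulated query. */
    static final class FakePeopleTable {
        private final Map<Integer, String> names = new LinkedHashMap<>();
        int queries = 0;

        FakePeopleTable(int rows) {
            for (int id = 1; id <= rows; id++) {
                names.put(id, "person" + id);
            }
        }

        /** Simulates: SELECT ID FROM people */
        List<Integer> selectIds() {
            queries++;
            return new ArrayList<>(names.keySet());
        }

        /** Simulates: SELECT NAME FROM people WHERE ID = ? */
        String selectName(int id) {
            queries++;
            return names.get(id);
        }
    }

    /** Walks every row the lazy way and returns how many queries were issued. */
    static int lazyQueryCount(int rows) {
        FakePeopleTable table = new FakePeopleTable(rows);
        for (int id : table.selectIds()) {
            table.selectName(id);
        }
        return table.queries;
    }

    public static void main(String[] args) {
        System.out.println(lazyQueryCount(6));
    }
}
```

With 100,000 rows the same loop would issue 100,001 queries, which is where the time goes.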

Now, if you were writing the JDBC code by hand, you’d probably do something like this (exception handling omitted):

Connection conn = getConnection();
PreparedStatement ps = conn.prepareStatement("SELECT name FROM people");
ResultSet res = ps.executeQuery();
while (res.next()) {
    System.out.println(res.getString("name"));
}
res.close();
ps.close();
conn.close();

One statement, that’s all that’s really required.  Paging through a result set is a pretty quick operation, so even with 100,000 rows this shouldn’t be an insanely slow piece of code.  In fact, the slow-down here is probably how fast the console can print the text in question (not very fast actually).

So, obviously we have very disparate performance between JDBC by hand and using ActiveObjects, and we really can’t have that.  The solution is to force ActiveObjects to somehow load all of the names for the people in the first query, like we did when we ran the SQL by hand.  For a while now, ActiveObjects has had this capability:

for (Person p : manager.find(Person.class, Query.select("id,name"))) {
    System.out.println(p.getName());
}

Now we just execute a single line of SQL:

SELECT ID,NAME FROM people

Much more efficient.  However, the code is now much uglier and a little unintuitive. (I mean, who’s going to think of Query.select("…") when looking to override lazy-loading?)  Also, we would have to use this cryptic syntax in every single query in which we want to override the lazy-loading.  This could be a bit of a pain, especially if you know at design time that every time you get a Person, you’ll probably need a “name” shortly thereafter.  So, for situations just like this one, I’ve now added the @Preload annotation (not in the 0.4 release, but available in trunk/):

@Preload("name")
public interface Person extends Entity {
    public String getName();
    public void setName(String name);
 
    public int getAge();
    public void setAge(int age);
}
 
// ...
for (Person p : manager.find(Person.class)) {
    System.out.println(p.getName());
}

Just as we would expect, this now runs the following single-query SQL statement:

SELECT NAME,ID FROM people

If we were to add a call to p.getAge(), it would of course lazy-load that value, leading to another SQL statement.  However, we can just as easily add it to the @Preload clause like this:

@Preload({"name", "age"})
public interface Person extends Entity {
    // ...
}

Or, since this is really all of the properties in Person, we can use the following, shorter syntax:

@Preload
public interface Person extends Entity {
    // ...
}

So effectively, you can disable lazy-loading in ActiveObjects by adding the @Preload annotation without any parameters to every entity you use.  However, this is a little inefficient since it will pretty much turn any non-joining SELECT statement into a SELECT *.  For this reason, I suggest you only use @Preload for situations like our name-printing loop.  In other words: only for values you know will be queried every time you grab a bunch of entities of a given type.
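How a preload hint could turn into a column list is easy to sketch with a stand-in annotation.  Everything below is hypothetical (the Eager annotation, selectFor, and the assumption that a bare annotation means “*”); it only illustrates the three cases discussed above, not how ActiveObjects builds its SQL.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Locale;

final class PreloadSketch {

    /** Hypothetical stand-in for a preload hint; a bare @Eager is assumed to mean every column. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Eager {
        String[] value() default {"*"};
    }

    @Eager("name")
    interface Person {}

    @Eager
    interface Everything {}

    interface Plain {}

    /** Builds the initial SELECT for a non-joining query against the given table. */
    static String selectFor(Class<?> type, String table) {
        Eager hint = type.getAnnotation(Eager.class);
        StringBuilder columns = new StringBuilder();
        if (hint != null) {
            for (String field : hint.value()) {
                if (field.equals("*")) {
                    return "SELECT * FROM " + table;
                }
                columns.append(field.toUpperCase(Locale.ROOT)).append(',');
            }
        }
        // The key column is always selected; other fields stay lazy.
        columns.append("ID");
        return "SELECT " + columns + " FROM " + table;
    }

    public static void main(String[] args) {
        System.out.println(selectFor(Person.class, "people"));
        System.out.println(selectFor(Everything.class, "people"));
        System.out.println(selectFor(Plain.class, "people"));
    }
}
```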

One more thing worthy of note: this is a hint only.  It doesn’t mean that every Person instance will have a preloaded name value.  Any Query(s) with JOIN clauses will ignore the @Preload annotation to avoid accidentally running JOINs with SELECT *.  Also, quite a few Person instances won’t have any values at all by default.  For example, if you use EntityManager#create(), a new row will be INSERTed into the people table, but the resulting Person instance won’t have any value cached for name.  Likewise, if you make a simple call to EntityManager#get(Class<? extends Entity>, int), this will return the Entity instance which corresponds to that id value, but it may or may not have a cached name.  Thus, the get() method still does not run any queries, it merely creates the object peers.

An Easier Java ORM: Indexing

6 Aug 2007

In continuing with my series on ActiveObjects, this post delves into the eternal mysteries of search indexing and Lucene integration. Most modern web applications not only store data in a database, but also in an index of some kind to allow fast and efficient searching. Java’s Lucene framework provides an excellent mechanism for this functionality; however, it can be somewhat cryptic and hard to use. To ease this pain, ActiveObjects provides auto-magical Lucene integration for specified fields, making it trivial to index and search for entities.

Unless there is great public outcry, I intend this to be the last of my “Easier Java ORM” series (with the exception of a roundup post for linking purposes). As fun as it is being self-promoting and pushing my favorite open source project, I feel a slight twinge of guilt every time I flood your feed aggregator with more information on a library in which you may or may not have interest. I’ll probably still post about ActiveObjects from time to time, but only on occasions when there is something of special note.

Indexing

Of course, we can’t even begin to talk about searching for entities unless there is some data from the entity added to the index. The actual creation and maintenance of the index is usually considered the hardest part of working with Lucene. In ActiveObjects, it requires two separate steps.

Firstly, you must decide which fields and which entities you wish to index. Let’s say that we have a simple blog schema as follows:

public interface UserModifiedEntity extends SaveableEntity {
    @Default("CURRENT_TIMESTAMP")
    public Calendar getDate();
    @Default("CURRENT_TIMESTAMP")
    public void setDate(Calendar calendar);
 
    @Default("false")
    public boolean isDeleted();
    @Default("false")
    public void setDeleted(boolean deleted);
}
 
public interface Post extends UserModifiedEntity {
    public String getTitle();
    public void setTitle(String title);
 
    @SQLType(Types.CLOB)
    public String getText();
    @SQLType(Types.CLOB)
    public void setText(String text);
 
    @OneToMany
    public Comment[] getComments();
}
 
public interface Comment extends UserModifiedEntity {
    public Post getPost();
    public void setPost(Post post);
 
    public String getCommenter();
    public void setCommenter(String name);
 
    @SQLType(Types.CLOB)
    public String getText();
    @SQLType(Types.CLOB)
    public void setText(String text);
}

In this schema, we have both Post and Comment entities. Both entity types extend UserModifiedEntity, which contains some fields which will be common to both resulting tables. Both Comment and Post also have “text” fields, containing the actual meat of each entity’s value.

Now, for our blog’s search engine, we’re going to want to do something a bit more precise than search for all values contained in any entities. Actually, at this point, ActiveObjects wouldn’t index any values whatsoever. We need to tag the fields we want to add to the index with the @Index annotation. Let’s assume that we don’t need to search on comments at all, just posts. The modified Post entity might look something like this:

public interface Post extends UserModifiedEntity {
    @Index
    public String getTitle();
    @Index
    public void setTitle(String title);
 
    @Index
    @SQLType(Types.CLOB)
    public String getText();
 
    @Index
    @SQLType(Types.CLOB)
    public void setText(String text);
 
    @OneToMany
    public Comment[] getComments();
}

That takes care of step one in the indexing procedure. ActiveObjects now has everything it needs to know relating to what it should index. Now we need to inform it to actually perform the indexing, and where to store the result. This is all handled using a special EntityManager subclass: IndexingEntityManager.

// ...
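// Java does not expand "~" in paths, so resolve the home directory explicitly
IndexingEntityManager manager = new IndexingEntityManager(jdbcURI, username, password,
        FSDirectory.getDirectory(System.getProperty("user.home") + "/lucene_index"));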
 
Post post = manager.create(Post.class);
post.setTitle("My Cool Post");
post.setText("Here's some test text that I'll use to test the search indexing.  "
        + "It's really amazing what you can do with so little code...");
post.save();

As you can see, we’re using an instance of IndexingEntityManager to access and create all of our entity instances (all one of them). This is all that is necessary to cause ActiveObjects to handle the indexing for these entities.

Oh, FSDirectory is actually a Lucene class (sub-classing Directory) which is used to tell the Lucene backend where to store the index. Since we’re actually using the Lucene Directory abstraction classes, the index could just as easily be stored in memory, or even in another database.

Searching

Obviously, an index isn’t all that useful if you can’t do anything with it. Since our goal from the start was to provide search capabilities to our rather limited blog, we need to have a way of accessing the Lucene indexing and performing a search. Again, ActiveObjects makes this incredibly easy:

// ...code from above
Post[] results = manager.search(Post.class, "test search terms");
 
System.out.println("Search results:");
for (Post post : results) {
    System.out.println("   " + post.getTitle());
}

The search method delegates its call down to the Lucene engine, which parses the search terms and runs through the index searching for any key-value sets (or Document(s), as Lucene refers to them) which match in the “title” or “text” fields. By default, ActiveObjects runs the search against all index fields in the specified entity type. Since this is usually the behavior people want when using Lucene, it is a sane default.

If the mindless defaults aren’t good enough for your application, you are quite free to use the Lucene index directly. IndexingEntityManager provides accessors for the Directory containing the index (getIndexDir()), as well as the Analyzer in use (getAnalyzer()). Of course, you can also extend IndexingEntityManager and provide your own search() implementation.

Removing from the Index

Almost as important as adding entities to an index is removing them. We don’t want our searches to pull back deleted posts. IndexingEntityManager can handle this task for us automatically, to a point. The problem is that in our case, we’re not actually deleting the posts as such. We’re simply setting a flag in the row which indicates the post is deleted. We’re supplying all of the logic (theoretically) to ignore deleted posts and comments.

If we were using the EntityManager#delete(Entity…) method, we would be DELETEing the rows properly and then IndexingEntityManager could automatically remove the relevant Document(s) from the index. However, since we’re not doing this, we need a bit more logic. For simplicity’s sake, we’re going to put this logic into a defined implementation for the UserEditableEntity interface:

@Implementation(UserEditableEntityImpl.class)
public interface UserEditableEntity extends SaveableEntity {
    // ...
}
 
public class UserEditableEntityImpl {
    private UserEditableEntity entity;
 
    public UserEditableEntityImpl(UserEditableEntity entity) {
        this.entity = entity;
    }
 
    public void setDeleted(boolean deleted) {
        if (deleted && !entity.isDeleted()) {
            // deleting the entity, remove it from index
            ((IndexingEntityManager) entity.getEntityManager()).removeFromIndex(entity);
        } else if (!deleted && entity.isDeleted()) {
            // we're un-deleting the entity here
            ((IndexingEntityManager) entity.getEntityManager()).addToIndex(entity);
        }
 
        entity.setDeleted(deleted);
    }
}

Now, whenever we call setDeleted(boolean) on a Post or Comment instance, the entity will be removed from the index (if we’re deleting it), or re-added to the index (if we’re un-deleting it). Comment has no @Indexed methods, so IndexingEntityManager will more or less ignore the call to addToIndex(Entity) (it will actually still iterate through all of the methods looking for any @Indexed).
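If you want to convince yourself that the flag-flipping logic above does the right thing, it’s easy to try in isolation. The following is a minimal sketch with no ActiveObjects or Lucene involved: FakeIndex and SoftDeletable are hypothetical stand-ins I’m making up here, and the "index" is just a set of ids. The point is only the state transitions: the index is touched when the flag actually changes, and not otherwise.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the search index: tracks which ids are searchable.
final class FakeIndex {
    final Set<Integer> indexed = new LinkedHashSet<>();

    void addToIndex(int id) {
        indexed.add(id);
    }

    void removeFromIndex(int id) {
        indexed.remove(id);
    }
}

// Hypothetical stand-in for an entity with a "deleted" flag instead of a real DELETE.
final class SoftDeletable {
    private final int id;
    private final FakeIndex index;
    private boolean deleted;

    SoftDeletable(int id, FakeIndex index) {
        this.id = id;
        this.index = index;
        index.addToIndex(id); // new entities start out searchable
    }

    boolean isDeleted() {
        return deleted;
    }

    // Mirrors the transitions in the implementation class above:
    // only a real change of the flag touches the index.
    void setDeleted(boolean deleted) {
        if (deleted && !this.deleted) {
            index.removeFromIndex(id); // live -> deleted
        } else if (!deleted && this.deleted) {
            index.addToIndex(id);      // deleted -> live
        }
        this.deleted = deleted;
    }
}
```

Calling setDeleted(true) twice in a row removes the id once and then does nothing, and a later setDeleted(false) puts it back.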

Related Content

Many sites have need of a “related content” algorithm. This is most often seen in blogs which show a list of “related posts”. Since ActiveObjects auto-magically handles indexing and searching, it only makes sense that it provide some mechanism for accessing related entities based on their indexed values. This is handled using the RelatedEntity super-interface.

Let’s assume that we want to be able to find posts related to a given Post instance. The only thing we need to do is make sure that the Post interface also extends RelatedEntity:

public interface Post extends UserEditableEntity, RelatedEntity<Post> {
    // ...
}

Now we can call:

Post post = // ...
Post[] related = post.getRelated();
 
System.out.println("Posts related to " + post.getTitle() + ":");
for (Post relate : related) {
    System.out.println("   " + relate.getTitle());
}

Alright, caveat time… First off, this does depend on the Lucene Queries contrib library, specifically the MoreLikeThis class. Secondly, I’m not entirely sure that this is working right. :-) I’ve yet to actually get it to return any related values whatsoever in my test bed. This could be due to the way I’m indexing, or possibly the way I’m using MoreLikeThis; I’m not sure. If it works for you, let me know! Also, if you have any experience with the MoreLikeThis functionality, I’d appreciate any pointers you may have.
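For a rough feel of what a "related" lookup is computing, here is a minimal sketch that ranks texts by how many terms they share. I should stress that this is not MoreLikeThis (which, as far as I understand it, picks out the most significant terms of a document and builds a query from them) and it is not what getRelated() does internally. TermOverlap is a hypothetical helper, and plain Jaccard similarity over word sets is a much cruder technique swapped in purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: ranks texts by the overlap of their word sets.
final class TermOverlap {

    // Lower-cases the text, splits on runs of non-letters, and keeps
    // words longer than three characters (a very crude stop-word filter).
    static Set<String> terms(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.length() > 3) {
                result.add(token);
            }
        }
        return result;
    }

    // Jaccard similarity: shared terms divided by total distinct terms.
    // Returns 0.0 when neither text has any usable terms.
    static double similarity(String a, String b) {
        Set<String> left = terms(a);
        Set<String> right = terms(b);
        Set<String> union = new HashSet<>(left);
        union.addAll(right);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(left);
        shared.retainAll(right);
        return (double) shared.size() / union.size();
    }

    // Returns the candidates that share at least one term with the source,
    // most similar first.
    static List<String> related(String source, List<String> candidates) {
        List<String> result = new ArrayList<>();
        for (String candidate : candidates) {
            if (similarity(source, candidate) > 0.0) {
                result.add(candidate);
            }
        }
        result.sort((x, y) -> Double.compare(similarity(source, y), similarity(source, x)));
        return result;
    }
}
```

Something this simple is obviously no substitute for a real index, but it can be a handy sanity check: if two posts share no terms at all by this measure, it’s not too surprising when a term-based lookup finds nothing related either.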

Well, that about sums it up for indexing in ActiveObjects. Hopefully, this simplifies your data backend code a bit further and eases your pain in dealing with Lucene.