
Better Utilizing JRuby in Project Scripting

21
Aug
2007

Often times in a large-scale Java project, I find myself in need of performing small, discrete tasks with the existing project infrastructure.   The standard way of doing this has always been to add a main(String[]) method to a class or to add a separate driver class to perform the one operation.  This is a bit messy though, since it requires polluting your nice clean project with random utility classes.  A better way to do this would be if we could have a set of separate scripts which just perform the actions we need.  After all, that’s what scripting languages are best at, right?  Enter JRuby.

JRuby is really designed for this sort of thing.  Interoperating with Java is second nature to it, and easily performing small, discrete tasks is a perfect use for Ruby’s compact and intuitive syntax.  The one problem is JRuby makes it pretty hard to get at a project’s class library.

For example, I have a medium-sized web application project based on Wicket and ActiveObjects.  This project has not only its own sources (and hence compiled binaries) as part of its classpath, but it also relies heavily upon almost two dozen other JARs (contained within the WebContent/WEB-INF/lib directory).  Any sort of utility scripting would have to somehow gain access to all of these classes, otherwise its capabilities would be extremely limited.

Now the task I need to perform with this scripting (at least, the task which precipitated this effort) is to optimize the Lucene search index.  This index is managed through ActiveObjects’ automagical indexing of entity values.  The problem is that the index will grow rather organically, eventually becoming extremely slow.  I can’t just cause the application to stop at some arbitrary point and optimize the index since that would mean blocking a page load somewhere along the line.  As such, this is a perfect candidate for an external utility script.

Examining the Problem

So it really would be nice if we could just fire a JRuby script within the project root and expect it to be able to grab all of the project’s required classes.  However, there’s no way JRuby can do this for us.  What we need to do is actually manage a classloader of our own which searches through three separate locations relevant to the project:

  • Project compiled binaries (build/classes/)
  • ActiveObjects project compiled binaries (../ActiveObjects/bin/)
  • Project JAR dependencies (WebContent/WEB-INF/lib/*.jar)

While JRuby doesn’t provide any handy mechanism to do this (nor should they), Java does.  We can use a URLClassLoader instance from within JRuby and associate it with all of the given paths.  Using this URLClassLoader, we can load whatever class we need by String name and then – through the Java reflection API – access whatever methods/constructors are required.
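To make the idea concrete, here is a minimal Java sketch of such a classloader setup.  It is not the project's actual code: the ProjectClassLoaders class and its method names are hypothetical, and the caller is assumed to pass in the class directories (build/classes/, ../ActiveObjects/bin/) and the JAR directory (WebContent/WEB-INF/lib).

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class ProjectClassLoaders {

    // Collects one URL per class directory, plus one per *.jar file found in libDir.
    static List<URL> classpathUrls(List<Path> classDirs, Path libDir) throws IOException {
        List<URL> urls = new ArrayList<>();
        for (Path dir : classDirs) {
            // Path.toUri() appends the trailing slash only if the directory exists;
            // URLClassLoader treats URLs without that slash as JAR files.
            urls.add(dir.toUri().toURL());
        }
        if (Files.isDirectory(libDir)) {
            try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "*.jar")) {
                for (Path jar : jars) {
                    urls.add(jar.toUri().toURL());
                }
            }
        }
        return urls;
    }

    // Builds a loader over all collected URLs, delegating to this class's own loader first.
    static URLClassLoader create(List<Path> classDirs, Path libDir) throws IOException {
        List<URL> urls = classpathUrls(classDirs, libDir);
        return new URLClassLoader(urls.toArray(new URL[0]),
                ProjectClassLoaders.class.getClassLoader());
    }
}
```

A class can then be loaded by name with Class.forName(name, true, loader), which is the same call the Ruby wrapper ends up making.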

If this sounds complicated, then you’re probably paying attention.  Classloaders and Java reflection both have notoriously verbose and complicated APIs.  In short, this isn’t something with which we want to pollute every single utility script we may need to write.  We need to develop some sort of Ruby API to wrap around all of this complexity.  Hopefully one which will enable us to access project classes in an intuitive (and memorable) manner.

Designing the Interface

Ruby lends itself extremely well to the development of so-called DSLs (Domain Specific Languages).  While we don’t need a full-blown DSL here, it would be really nice if we could have an intuitive syntax which could handle all of the complexity for us.  With that in mind, let’s imagine exactly what syntax we really want for our wrapper API:

#!/usr/bin/env jruby
 
require 'java'
require 'java_classes'
 
Utilities = get_class 'com.myproject.app.Utilities'
Utilities.createManager.optimize

In my project, the com.myproject.app.Utilities class contains a static method createManager() which returns an instance of IndexingEntityManager.  This ActiveObjects class in turn contains an optimize() instance method which calls the appropriate Lucene functions to optimize the index.  As you can see, our goal is to use the Utilities class within our script exactly as if it were a standard Ruby class (with the exception of the “dot” syntax as opposed to the double colon for the class method).  Thankfully, this is an achievable goal.

If we write the get_class method to return a Ruby wrapper class corresponding to our custom-loaded project class, we could conceivably end up with a syntax like the above.  Obviously, in the above example we’re assigning our wrapper class to the “Utilities” variable, and then treating that variable as if it were a proper class symbol (actually, as far as Ruby’s concerned, it may as well be).  So conceptually, that part of the problem is taken care of.

The second aspect to the syntax is actually treating project class static methods (and constructors, since we would potentially want to call “Utilities.new” and get a Utilities instance) as member methods of our wrapper class.  This is actually where much of the complexity will have to go.  Ruby does provide a mechanism for accomplishing this called “method_missing” (this is how ActiveRecord works).  However, we still need to write the logic needed to actually convert parameters, reflectively invoke methods, etc…

Implementation

So of course the first thing we need to take care of is the initialization of the classloader.  This is just hard work with the URLClassLoader API, so we’re going to assume we took care of it already.  :-)   For the sake of brevity, we’re going to pretend that the CLASSLOADER variable in the java_classes.rb file is an instance of URLClassLoader, properly initialized to hit the project classpath.

The real interest of this mini API is in the JClassWrapper implementation.  It will be an instance of this Ruby class which is returned from the get_class method.  For the sake of simplicity, we’ll make the method_missing method within JClassWrapper do all the work.  Thus, the only things for which the get_class method is responsible are loading the class in question via CLASSLOADER and creating an instance of JClassWrapper to return.  The implementation is shown below:

def get_class(name)
    JClassWrapper.new java.lang.Class.forName(name, true, CLASSLOADER)
end

The really interesting code is in method_missing.  This is where we will handle both static methods and the special new method, which will be passed on to the wrapped Class’s constructor.  This is also where we need to worry about auto-converting the method parameters into values which will make sense to the wrapped Java method.  For the sake of simplicity, we’re not going to worry about wrapping anything like complex classes.  Instead, we’ll just assume that the parameters passed will either already be Java objects, or simple primitives like String or Fixnum.

The auto-conversion logic should look something like this (we can add to it as necessary):

def method_missing(sym, *args)
    jarg_types = java.lang.Class[args.size].new
    jargs = java.lang.Object[args.size].new
 
    for i in 0..(args.length - 1)
        if args[i].kind_of? String
            args[i] = java.lang.String.new args[i]
        elsif args[i].kind_of? Fixnum
            args[i] = java.lang.Integer.new args[i]
        elsif args[i].kind_of? JClassWrapper
            args[i] = args[i].java_class
        end
 
        jarg_types[i] = args[i].java_class
        jargs[i] = args[i]
    end
    # ...
end

As you can see, all this does is create and populate a types and a values array.  There are some basic, hard-coded conversions, and that’s about it.  This gives us all we need to pass values to the reflectively discovered static methods or constructor.  In fact, the only really interesting logic left to us is the actual reflective invocations:

def method_missing(sym, *args)
    # ...
    if sym == :new
        begin
            constructor = @clazz.getConstructor jarg_types
        rescue
            return super
        end
 
        return constructor.newInstance(jargs)
    elsif sym == :java_class
        return @clazz
    end
 
    begin
        method = @clazz.getMethod(sym.to_s, jarg_types)
    rescue
        return super
    end
 
    # nil receiver, since the reflected method is static
    method.invoke(nil, jargs)
end

Here we have some special logic for the new and java_class methods, since we don’t want to pass these directly to the wrapped class, but the rest of the logic is surprisingly simple.  Really all we need to do is find the corresponding static method by name and using the types array we populated earlier.  Then, using the java.lang.reflect.Method instance, we invoke the method passing the values array and nil for the instance (since it’s a static method).
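For comparison, here is roughly what the same lookup-and-invoke step looks like written directly against the Java reflection API.  This is a sketch under assumptions rather than part of java_classes.rb: StaticInvoker and invokeStatic are hypothetical names, and the argument types are taken straight from the runtime classes of the arguments.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical helper, for illustration only.
final class StaticInvoker {

    // Finds a public static method by name and by the runtime classes of the
    // arguments, then invokes it with a null receiver.
    static Object invokeStatic(ClassLoader loader, String className,
                               String methodName, Object... args)
            throws ReflectiveOperationException {
        Class<?> clazz = Class.forName(className, true, loader);

        // The "types array": one Class per argument value.
        Class<?>[] argTypes = new Class<?>[args.length];
        for (int i = 0; i < args.length; i++) {
            argTypes[i] = args[i].getClass();
        }

        // getMethod matches declared parameter types exactly, so a parameter
        // declared as int, or as a supertype of the argument, will not be found.
        Method method = clazz.getMethod(methodName, argTypes);
        if (!Modifier.isStatic(method.getModifiers())) {
            throw new NoSuchMethodException(methodName + " is not static");
        }
        return method.invoke(null, args);
    }

    public static void main(String[] args) throws Exception {
        ClassLoader loader = StaticInvoker.class.getClassLoader();
        // Integer.parseInt(String) is found because the argument is a String.
        System.out.println(invokeStatic(loader, "java.lang.Integer", "parseInt", "42"));
    }
}
```

The exact-type matching is worth keeping in mind: a boxed Integer argument will not locate a method declared with an int parameter.  The Ruby method_missing above performs the same kind of lookup, only driven by whatever method name the script happens to call.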

One of the nice things about this is any Java values returned from these methods will be automatically handled and wrapped by JRuby.  Thus, we can do something like this if we really want to:

Person = get_class 'com.myproject.db.Person'
EntityManager = get_class 'net.java.ao.EntityManager'
 
manager = EntityManager.new(config[:uri], config[:user], config[:pass])
people = manager.find(Person.java_class)
 
people.each do |p|
    puts "Person: #{p.first_name} #{p.last_name} is #{p.age} years old"
end

You’ll notice we don’t have any special logic at work dealing with the EntityManager instance or the Person array and instances returned from the find method.  Regardless of our fancy ClassLoader tricks and class wrapping, we can still rely upon JRuby’s built-in Java integration facilities to take care of most of the heavy lifting.  The full source for the “java_classes.rb” file is available here.  Note, you will have to customize the values a bit depending on your project’s classpath.  Enjoy!

OpenID Prefiller Greasemonkey Script

17
Aug
2007

OpenID is really taking off.  We’re seeing more and more sites which offer OpenID login in addition to the standard create/sign-in login system.  Still more sites are offering to be an OpenID provider (such as AOL and LiveJournal).  Distributed single sign-on for the web is really a compelling concept, and I can see why it’s becoming so popular.  However, OpenID does have its issues.

Obviously, there are questions about login security, since OpenID just authenticates you; it doesn’t ensure you aren’t a spam-bot or similar.  But the big issue for me is (ironically) how much more typing it costs.  Think about which is easier: to type a username and a password (a pair which you probably type several times a day), or a long, unwieldy URL?  Personally, I find it takes much longer to type the URL than the username/password, and this of course cuts into my workflow and interrupts my train of thought.

The ultimate solution would be if my browser could prefill my OpenID into any OpenID login form elements on the page.  This doesn’t have the issues that the Password Manager in Firefox has, because I don’t really care if everyone knows what my OpenID is; they can’t use it anyway (one of the many wonderful things about OpenID).  Now as I understand it, this feature is coming in Firefox 3.0, but I want something that will solve my problems now.

Enter Greasemonkey.  If you haven’t already installed this extension, you really should do so.  You can do tons of stuff with the right script, like moving or removing elements around the page, or adding functionality to Gmail, or even reverting recent interface changes made at DZone.  This extension really is a must-have for any web power user.  In fact, this extension is exactly what we need to solve our little OpenID prefilling problem.

Since a Greasemonkey script by definition allows us to locally modify elements of pages as they load, we can create a script which will look for OpenID text fields in the loading page, and if found, pre-populate them with a given value (in this case, our OpenID).  This of course depends on all OpenID text fields having certain attributes in common, but thankfully this is the case.  It just so happens that every OpenID text field I’ve ever found has the word “openid” somewhere in either the element id, or somewhere in the name.  Using this bit of information, we can construct a script to search for form elements matching these criteria:

// ==UserScript==
// @name           OpenID Prefiller
// @namespace      codecommit.com
// @description    Prefills my openid into any openid text field on the page
// @include        *
// ==/UserScript==
 
var OPENID = 'openid.danielspiewak.org';    // change this to set your openid
 
var all = document.getElementsByTagName('input');
 
for (var i = 0; i < all.length; i++) {
    var current = all[i];
 
    if (current.type == 'text'
        && (current.id.indexOf('openid') >= 0
            || current.name.indexOf('openid') >= 0
            || current.id.indexOf('openId') >= 0
            || current.name.indexOf('openId') >= 0)) {
        current.value = OPENID;
    }
}

As you can see, all this does is search for input type=”text” elements with “openid” or “openId” in the id or the name attributes.  If it finds such a field, it sets its value to the OpenID we’ve hard-coded into the script and moves on.  Simple, yet effective.

To use this script yourself, simply install it into Greasemonkey and set it to run on all sites.  Open up the script, and change the OPENID String value to your own OpenID.  Save the script, and you’re on your way!


So far, I’ve tried this script on several dozen sites and it’s worked perfectly.  If you like, there’s an (incomplete) OpenID site directory available, listing sites upon which you can try this.  Actually, the only site I’ve found with which this script doesn’t work is DZone.  This is because DZone does some weird, lightbox pre-population of the login div, which thus isn’t modifiable by the Greasemonkey script.  (Interestingly enough, other sites which do use lightbox logins do work with this script; DZone is the only one which doesn’t.)

SaveableEntity Bids a Fond Farewell

15
Aug
2007

Well, to make a small, side entry out of something which probably should be in bold print on the ActiveObjects website…  It’s worth announcing that I’ve merged SaveableEntity into the Entity super-interface.  The only reason to keep these two separate was so that some entities could be configured to receive calls to setters and immediately execute UPDATE statements.  This is a really inefficient way to code your database model and I think the only real use of it was in my sample code.  :-)   Since it really was an API misstep, I’ve decided to do away with it.  The save() method is now obligatory for any data modification.  Thus, any legacy code you may have which extended Entity may not function in the way you would expect (e.g. the DogfoodBlog example no longer persists data properly).  If you have any code which extended SaveableEntity, just change this to extend the Entity interface and everything should work as before.  Just thought I’d make a general announcement.

Performance is Good: ActiveObjects vs ActiveRecord

14
Aug
2007

So ActiveObjects is a fairly cool ORM.  However, coolness alone does not an enterprise ORM make.  In fact, the real qualifications for an enterprise-ready framework are as follows:

  • Stability
  • Performance

I’m sure there are other questions which factor into design decisions on whether or not to use a library, but those are the two which I look at most closely.  Stability is usually a hard metric to find, since it usually depends on a lot of adopters hammering the library until it breaks, is fixed and then hammered again.  However, performance numbers are almost always easy to come by, since all that is required are a few simple benchmark tests to just get a ballpark-number.

Since benchmarks are so fun, I’ve decided to do a few for ActiveObjects.  Or rather, I’ve decided to run a simple (read, very simple) benchmark test with ActiveObjects as well as a number of other ORMs.  At the moment, I’ve only been able to run the test with ActiveRecord (sorry guys, Hibernate’s a really complex framework), but I think the numbers are still worth looking at.

ActiveRecord claims only a 50% overhead compared to manual database access (that number is actually listed as a feature).  There has been some dispute over whether the test used to obtain that particular figure was valid or not, but that’s beside the point.  ActiveObjects should be able to do at least that well, right?

Well, as it turns out, it can.  Here are the numbers from my reasonably simple benchmark:

ActiveObjects
==============
Queries test: 55 ms
Retrieval test: 68 ms
Persistence test: 55 ms
Relations test: 154 ms

ActiveRecord
=============
Queries test: 154 ms
Retrieval test: 6 ms
Persistence test: 76 ms
Relations test: 75 ms

Surprisingly close numbers actually.  I had assumed that there would be some significant disparity, one way or another.  However, as you can see ActiveObjects is fairly comparable to ActiveRecord on a set of extremely trivial tests.  There are some jumps and obvious areas of strength/weakness in both frameworks, but on average they’re pretty similar in performance.

As my friend Lowell Heddings pointed out, ORM benchmarks are far more useful if you actually examine the SQL generated to see how efficient it really is from a theoretical standpoint.  So, to make things easier I sed/grepped the logs and arrived at the following SQL outputs for each respective ORM.

Details

Now, I will be the first to admit that this is hardly an even test to begin with.  Obviously there are different strengths and weaknesses in every library, and though I tried to be impartial in the designing of the benchmarks, I probably accidentally favored one ORM over the other.  Also, there are inherent performance advantages to Java over Ruby, especially in the area of database access.  In short, ActiveObjects probably had a sizeable advantage coming right out of the gate, so take my numbers with a grain of salt.

The test itself consisted of four phases, each involving three entities: Person, Profession and Workplace.  Person has a many-to-many relation with Profession through a fourth entity, Professional.  Workplace has a one-to-many relation with Person.  These relations were exploited directly in the relations benchmark (e.g. Person#getProfessions(), Workplace#getPeople(), etc).  Each entity had a number of fields, including one CLOB (or TEXT, as MySQL refers to them) in the Person entity.  The tables for each respective schema were pre-populated with the same data, which involved several rows with different values (except for the CLOB, which was a roughly 4000 character paragraph and the same for every row).  In the ActiveObjects Person entity, I used the @Preload annotation to eagerly load firstName and lastName.

For the retrieval test, the benchmark iterates through every Person row and grabs firstName, lastName, age, alive, and bio.  Since ActiveObjects only preloaded firstName and lastName, it suffered a bit here. 

The persistence test iterates through every person row and changes the first and last name to one selected from a pool of names I populated with random names which came to mind.  It then goes through the same iteration again and sets the age, alive flag and the bio to our 4000-character Pulitzer-winning essay.  Each row is saved through each iteration, thus each row is saved exactly twice throughout the test.  ActiveObjects came out ahead here probably because of its use of PreparedStatements, as well as the more efficient UPDATE statement generation.

The relations test involved first finding all of the Professions associated with each individual Person and retrieving the Profession name.  Next, the Workplace for the Person is retrieved, then all of the Person(s) associated with that Workplace and their firstName and lastName values accessed.

The queries test was little more than getting all of the Person(s), all of the Workplace(s), all of the Professional mappings, along with all of the Profession(s).  ActiveObjects far outperformed ActiveRecord in this area since ActiveRecord uses SELECT * for everything and eagerly loads the row values.  This means (especially with a CLOB thrown into the mix) that ActiveRecord’s initial query time will be very long, while its field access time will be very quick.  Most ORMs function in this way, and it can be a very good thing at times (our benchmark is one of those times).

Lessons Learned

  • Eager loading can be a good thing
  • ActiveObjects generates some weird SQL for relations access

Obviously I can only do so much about the eager loading issue.  I believe pretty strongly that ActiveObjects’ approach (in lazy loading most things) is the right one for most use-cases.  However, the second lesson to be learned here is one which I think I need to take a bit more to heart: keep it simple SQL.

Normally, ActiveObjects will generate a query something like the following for accessing a one-to-many relation:

SELECT DISTINCT a.outMap AS outMap FROM (
    SELECT ID AS outMap,workplaceID AS inMap FROM people 
       WHERE workplaceID = ?) a

Yuck!  For obvious reasons, this is an incredibly inefficient bit of querying.  Actually, not only is it inefficient, but needlessly so.  You and I of course know that we could replace the above query with the much simpler:

SELECT ID FROM people WHERE workplaceID = ?

So why doesn’t ActiveObjects do that?  Frankly, I was lazy in my coding of the EntityProxy#retrieveRelations method, so a lot of ugly SQL slipped through the cracks in cases where it really wasn’t necessary.  I’ve spent a bit of time on this, and I think I’ve got the issue resolved.  The problem is that ActiveObjects was assuming that any relation (one-to-many or many-to-many) can have multiple mapping fields, thus requiring a wrapping DISTINCT outer query around a subquery SELECT which is UNIONed with an arbitrary number of other SELECTs, corresponding to the other mapping fields.  Obviously, it is almost never the case that we have to deal with multiple mapping paths, so I added a short-circuit to the logic which creates far simpler queries if at all possible.  As a result, the benchmark numbers for the relations test in ActiveObjects are between 80 and 100 ms.  Still slower than ActiveRecord, but much improved.
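The short-circuit can be pictured with a small, self-contained sketch.  To be clear, this is not the EntityProxy#retrieveRelations code: RelationQueries and its method are hypothetical, the second mapping field in the usage note is invented, and table and column names are assumed to be trusted identifiers rather than user input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical query builder, for illustration only.
final class RelationQueries {

    // Builds the ID lookup for a one-to-many relation on the given table.
    static String oneToMany(String table, List<String> mappingFields) {
        if (mappingFields.isEmpty()) {
            throw new IllegalArgumentException("at least one mapping field is required");
        }

        // Short-circuit: a single mapping path needs no subquery, UNION or DISTINCT.
        if (mappingFields.size() == 1) {
            return "SELECT ID FROM " + table + " WHERE " + mappingFields.get(0) + " = ?";
        }

        // General case: one subquery per mapping field, UNIONed and de-duplicated.
        List<String> selects = new ArrayList<>();
        for (String field : mappingFields) {
            selects.add("SELECT ID AS outMap," + field + " AS inMap FROM " + table
                    + " WHERE " + field + " = ?");
        }
        return "SELECT DISTINCT a.outMap AS outMap FROM ("
                + String.join(" UNION ", selects) + ") a";
    }
}
```

Calling oneToMany("people", List.of("workplaceID")) yields the simple statement shown earlier, while passing a second, made-up field such as "formerWorkplaceID" falls back to the wrapped DISTINCT/UNION form.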

It’s worth noting that if we ran each benchmark twice, we would see a marked improvement in the ActiveObjects performance the second time through.  Not just because a lot of the values would be cached, but also because the prepared statements in question would have been compiled and stored.  This is a fairly major area in which ActiveRecord falls short since it doesn’t utilize prepared statements, thus having a constant runtime for its queries and remaining unable to take advantage of cached, compiled queries.

So in short, ActiveObjects may be really neat, but its performance numbers don’t seem all that superior to those of ActiveRecord, a Ruby ORM with numerous known shortcomings in this area.  I guess I need to work on things a bit more.  :-)   Next up, either manual JDBC code or Hibernate running the same benchmark, depending on how soon I’m able to figure out Hibernate’s crazy XML mapping schema.

Note: I forgot to mention this… You can get the source for my benchmark from the ActiveObjects SVN repository: svn co https://activeobjects.dev.java.net/svn/activeobjects/trunk/Benchmarks

Even More ActiveObjects: Preloading

13
Aug
2007

There has been some talk recently regarding the ActiveObjects lazy-loading mechanism.  It’s starting to seem that what I thought was a great idea and terribly innovative when I designed the framework might not have been such a great idea after all.  :-)   That’s a good thing though (finding my mistakes, that is); it just forces me to think a little harder about how to solve the problem.

One of the guiding ideas behind ActiveObjects is that nothing should be loaded until it’s needed.  Once it’s loaded, it should be cached and then up-chucked on command, obviating the need for multiple loads.  This technique, commonly known as “lazy-loading”, works really well if you’re in a memory-crunch situation.  This is because even for tables with extremely large numbers of columns (think 50-100), none of the data in a row is loaded if you don’t need it.  Thus, you could work with a database-peered object without having to load the entire row into memory, a potentially long and expensive operation.
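Stripped of everything database-related, the load-once-then-cache idea looks something like the toy below.  It is only a sketch and says nothing about how ActiveObjects is implemented internally: LazyRow is a hypothetical class, and the loader function stands in for a per-field SELECT against the row.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy model of a lazily loaded row, for illustration only.
final class LazyRow {
    // Stands in for something like "SELECT <field> FROM people WHERE ID = ?".
    private final Function<String, Object> loader;
    private final Map<String, Object> cache = new HashMap<>();
    private int loads = 0;

    LazyRow(Function<String, Object> loader) {
        this.loader = loader;
    }

    // Loads the field on first access only; later reads come from the cache.
    Object get(String field) {
        if (!cache.containsKey(field)) {
            cache.put(field, loader.apply(field));
            loads++;
        }
        return cache.get(field);
    }

    // Number of times the loader was actually consulted.
    int loads() {
        return loads;
    }
}
```

Reading the same field twice costs one load, and a field that is never read costs nothing at all, which is the whole appeal of lazy-loading.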

The problem with this is it tends to create large numbers of queries.  Also, it can be very inefficient for certain types of operations.  For example:

for (Person p : manager.find(Person.class)) {
    System.out.println(p.getName());
}

This will generate the following SQL (assuming 6 rows in the people table):

SELECT ID FROM people
SELECT NAME FROM people WHERE ID = ?
SELECT NAME FROM people WHERE ID = ?
SELECT NAME FROM people WHERE ID = ?
SELECT NAME FROM people WHERE ID = ?
SELECT NAME FROM people WHERE ID = ?
SELECT NAME FROM people WHERE ID = ?

Granted, it’s a prepared statement, so it will be compiled and run very quickly 5 out of 6 times.  However, this is still pretty inefficient.  Imagine if there were 100,000 people in the database, instead of 6 (not an unreasonable assumption).  This code could take hours to run.

Now, if you were writing the JDBC code by hand, you’d probably do something like this (exception handling omitted):

Connection conn = getConnection();
PreparedStatement ps = conn.prepareStatement("SELECT name FROM people");
ResultSet res = ps.executeQuery();
while (res.next()) {
    System.out.println(res.getString("name"));
}
res.close();
ps.close();
conn.close();

One statement, that’s all that’s really required.  Paging through a result set is a pretty quick operation, so even with 100,000 rows this shouldn’t be an insanely slow piece of code.  In fact, the slow-down here is probably how fast the console can print the text in question (not very fast actually).

So, obviously we have very disparate performance between JDBC by hand and using ActiveObjects, and we really can’t have that.  The solution is to force ActiveObjects to somehow load all of the names for the people in the first query, like we did when we ran the SQL by hand.  For a while now, ActiveObjects has had this capability:

for (Person p : manager.find(Person.class, Query.select("id,name"))) {
    System.out.println(p.getName());
}

Now we just execute a single line of SQL:

SELECT ID,NAME FROM people

Much more efficient.  However, the code is now much uglier and a little unintuitive. (I mean, who’s going to think of Query.select(“…”) when looking to override lazy-loading?)  Also, we would have to use this cryptic syntax in every single query in which we want to override the lazy-loading.  This could be a bit of a pain, especially if you know at design time that every time you get a Person, you’ll probably need a “name” shortly thereafter.  So, for situations just like this one, I’ve now added the @Preload annotation (not in the 0.4 release, but available in trunk/):

@Preload("name")
public interface Person extends Entity {
    public String getName();
    public void setName(String name);
 
    public int getAge();
    public void setAge(int age);
}
 
// ...
for (Person p : manager.find(Person.class)) {
    System.out.println(p.getName());
}

Just as we would expect, this now runs the following single-query SQL statement:

SELECT NAME,ID FROM people

If we were to add a call to p.getAge(), it would of course lazy-load that value, leading to another SQL statement.  However, we can just as easily add it to the @Preload clause like this:

@Preload({"name", "age"})
public interface Person extends Entity {
    // ...
}

Or, since this is really all of the properties in Person, we can use the following, shorter syntax:

@Preload
public interface Person extends Entity {
    // ...
}

So effectively, you can disable lazy-loading in ActiveObjects by adding the @Preload annotation without any parameters to every entity you use.  However, this is a little inefficient since it will pretty much turn any non-joining SELECT statement into a SELECT *.  For this reason, I suggest you only use @Preload for situations like our name-printing loop.  In other words: only for values you know will be queried every time you grab a bunch of entities of a given type.
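To get a feel for the trade-off, the statement counts from the examples above can be modelled with a bit of arithmetic.  This is a back-of-the-envelope sketch, not a measurement: QueryCount is a hypothetical helper, and it assumes a plain find() with no JOINs where each accessed field is read once per row and cached afterwards.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Back-of-the-envelope model, for illustration only.
final class QueryCount {

    // One SELECT for the find() itself, plus one SELECT per row for every
    // accessed field that was not preloaded by that first statement.
    static long statements(long rows, Set<String> preloaded, List<String> accessed) {
        Set<String> lazy = new LinkedHashSet<>(accessed);
        lazy.removeAll(preloaded);
        return 1 + rows * lazy.size();
    }

    public static void main(String[] args) {
        // No preloading, printing names for 6 rows: the 7 statements listed earlier.
        System.out.println(statements(6, Set.of(), List.of("name")));
        // With "name" preloaded, only the initial SELECT remains.
        System.out.println(statements(6, Set.of("name"), List.of("name")));
    }
}
```

Under the same assumptions, the 100,000-row case comes to 100,001 statements without preloading, and adding an unpreloaded getAge() call to the loop brings back one extra statement per row.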

One more thing worthy of note: this is a hint only.  It doesn’t mean that every Person instance will have a preloaded name value.  Any Query(s) with JOIN clauses will ignore the @Preload annotation to avoid accidentally running JOINs with SELECT *.  Also, quite a few Person instances won’t have any values at all by default.  For example, if you use EntityManager#create(), a new row will be INSERTed into the people table, but the resulting Person instance won’t have any value cached for name.  Likewise, if you make a simple call to EntityManager#get(Class<? extends Entity>, int), this will return the Entity instance which corresponds to that id value, but it may or may not have a cached name.  In other words, the get() method still does not run any queries; it merely creates the object peers.