Scala as a Scripting Language?

3 Nov 2008

I know, the title seems a bit…bizarre.  I don’t know about you, but when I think of Scala, I think of many of the same uses to which I apply Java.   Scala is firmly entrenched in my mind as a static, mid-level language highly applicable to things like large-scale applications and non-trivial architectures, but less so for tasks like file processing and system maintenance.   However, as I have been discovering, Scala is also extremely well suited to basic scripting tasks, things that I would normally solve in a dynamic language like Ruby.

One particular task which I came across quite recently was the parsing of language files into bloom filters, which were then stored on disk.  To me, this sounds like a perfect application for a scripting language.  It’s fairly simple, self-contained, involves a moderate degree of file processing, and should be designed, coded and then discarded as quickly as possible.   Dynamic languages have a tendency to produce working designs much faster than static ones, and given the fact that the use-case required access to a library written in Scala, JRuby seemed like the obvious choice (Groovy would have been a fine choice as well, but I’m more familiar with Ruby).  The result looked something like this:

require 'scala'
 
import com.codecommit.collection.BloomSet
 
import java.io.BufferedOutputStream
import java.io.FileOutputStream
 
WIDTH = 2000000
 
def compute_k(lines, width)
  # ...
end
 
def compute_m(lines)
  #...
end
 
Dir.foreach 'wordlists' do |fname|
  unless File.directory? "wordlists/#{fname}"
    count = 0
    File.open "wordlists/#{fname}" do |file|
      file.each { |line| count += 1 }
    end
 
    optimal_m = compute_m(count)
    optimal_k = compute_k(count, WIDTH)
 
    set = BloomSet.new(optimal_m, optimal_k)
 
    File.open "wordlists/#{fname}" do |file|
      file.each do |line|
        set += line.strip
      end
    end
 
    os = BufferedOutputStream.new FileOutputStream.new("gen/#{fname}")
    set.store os
    os.close
  end
end

As far as scripts go, this one isn’t too bad.  I’ve written some real whoppers for things like video encoding and incremental backups.  The main trick here is the fact that we need to make two separate passes over the same file in order to get the number of lines before constructing the set.  We could load the file into an array buffer in a single pass, count its length and then iterate over the array, placing each element in the bloom filter.   However, this really wouldn’t be too much faster than just hitting the file twice (we still need two separate passes) and it has the additional drawback of requiring a fair amount of memory.

All in all, this script is a fairly natural representation of my requirements.   I needed to loop over a number of word lists, push the results into separate bloom filters and then freeze-dry the state.  However, look at what we’ve actually done here.  Remember earlier where we were considering which language to use?  We wanted a language which could concisely and quickly express our intent.  For that decision making process, we just assumed that a dynamic language would serve better than one hampered by a static type system.  However, at no point in the above script do we actually do anything truly dynamic.  By that I mean: open classes, unfixed parameter types, method_missing, that sort of thing.   In fact, we haven’t really done anything that we couldn’t do in Scala:

import com.codecommit.collection.BloomSet
import java.io.{BufferedOutputStream, File, FileOutputStream}
import scala.io.Source
 
val WIDTH = 2000000
 
def computeK(lines: Int, width: Int) = // ...
 
def computeM(lines: Double) = // ...
 
for (file <- new File("wordlists").listFiles) {
  if (!file.isDirectory) {
    val src = Source.fromFile(file)
    val count = src.getLines.foldLeft(0) { (i, line) => i + 1 }
 
    val optimalM = computeM(count)
    val optimalK = computeK(count, optimalM)
 
    val init = new BloomSet[String](optimalM, optimalK)
 
    val set = src.reset.getLines.foldLeft(init) { _ + _.trim }
 
    val os = new BufferedOutputStream(new FileOutputStream("gen/" + file.getName))
    set.store(os)
    os.close()
  }
}

This is actually runnable Scala.  I’m not omitting boiler-plate or cheating in any similar respect.  If you copy this code into a .scala file and make sure that BloomSet is on your CLASSPATH (which you would have needed anyway for JRuby), you would be able to run the script uncompiled using the scala command.  Unlike Java, Scala actually includes an “interpreter” which can parse raw Scala sources and execute the representative program just as if it had been pre-compiled using scalac.   One of the perquisites of this approach is the ability to simply omit any main method or Application class.  In nearly every sense of the word, Scala is a scripting language…as well as an enterprise-ready Java-killer (let the flames begin).

Now that we’re fairly convinced that the above is valid Scala, let’s compare it with the original version of the script written using JRuby.  If we just go off LoC (Lines of Code), Scala actually wins here.  This was a more-than-slightly surprising discovery for me, given how often dynamic languages (and Ruby in particular) are touted as being more concise and expressive than static languages.  But of course, sheer LoC-brevity isn’t everything: we also should consider things like readability.  A few characters of Befunge can accomplish more than I can do in several lines of Scala, but that doesn’t mean I’ll be able to figure out what it means tomorrow morning.

On the readability score, I think Scala wins here too.  The file processing and set creation is all done in a highly functional style (using foldLeft).   At least to my eyes, this is a lot easier to follow than the imperative form in Ruby.  More importantly, I think it’s a bit harder to make silly mistakes.  When I wrote the Ruby version of the script, it took several tries before I solidly pinned down the exact incantation I was seeking.  The Scala version literally required only one revision after the initial prototype.   Granted, I had the Ruby version to go off of, but I think we would all agree that the scripts use some fairly different libraries and methodologies for accomplishing identical tasks.
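
To make the comparison concrete, here is a small sketch of the two styles side by side. It is not the script above: it uses an in-memory list and the standard immutable Set as a stand-in for BloomSet (an assumption, made only so the snippet runs without the library), and the names words, loopSet and foldSet are purely illustrative. It is meant to be run as a script.

```scala
// Stand-in data; the real script reads these lines from a word list file.
val words = List(" apple ", "banana", " cherry", "apple")

// Imperative style: a mutable variable updated inside a loop.
var loopSet = Set.empty[String]
for (line <- words) {
  loopSet = loopSet + line.trim
}

// Functional style: the accumulator is threaded through foldLeft,
// so the script never needs a mutable variable.
val foldSet = words.foldLeft(Set.empty[String]) { _ + _.trim }

// Counting lines is the same pattern with an Int accumulator.
val count = words.foldLeft(0) { (i, _) => i + 1 }
```

Both versions build the same set. The fold simply leaves less room for the kind of slip where a loop updates the wrong variable.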

So what is it that makes Scala so surprisingly well suited to the task of quick-and-dirty file processing and scripting?  After all, isn’t it just a fancy syntax wrapped around the plain-old-Java standard library?  While it is true that Scala has first-class access to Java libraries (as demonstrated in the script), that isn’t all that it offers.  I believe that Scala has two important features which make it so suitable for these tasks:

  • Type inference
  • Powerful core libraries

The first feature is of course evident wherever you look in the script.  With the exception of the two methods and the BloomSet constructor, we never actually declare a type anywhere in the script.  This gives the whole thing a very “dynamic feel” without actually sacrificing static type safety.   The first time you try this sort of language feature it is an almost euphoric experience (especially coming from highly-verbose languages like Java).
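
A tiny sketch of what that looks like in practice, independent of BloomSet. None of the values below carries a type annotation, yet each has a fixed static type which the compiler checks. The names are illustrative only and do not come from the script above.

```scala
// No annotations: the compiler infers Int, String, List[String] and Map[String, Int].
val width = 2000000
val name = "wordlists"
val words = List("alpha", "beta", "gamma")
val lengths = words.map(w => w -> w.length).toMap

// Method parameters still need explicit types; the result type (Double) is inferred.
def ratio(lines: Int, width: Int) = lines.toDouble / width

// The inferred types are enforced before the script runs. Uncommenting the
// next line is rejected by the compiler, because a String is not an Int:
// val bad: Int = name
```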

The second feature is a bit harder to see.  It is most evident in the way in which we handle file IO.  The directory listing is of course yet another application of the venerable java.io.File class, but the process of opening and reading the file line-by-line seems to be a lot easier than anything Java can muster.  This is made possible by Scala’s Source API.   Rather than fiddling with BufferedReader and the whole menagerie that goes along with it, we just get a new Source from a File instance and then use conventional Scala methods to iterate over its contents.  In fact, we’re actually applying a functional idiom (fold) rather than a standard imperative iteration.  Finally, when we’re done with our first pass, we don’t need to re-open the file from scratch (inviting initialization mistakes in our coding), we just reset the Source and start from the beginning once more.
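
The two-pass pattern can be tried in isolation with something like the following sketch. It assumes a throwaway temporary file in place of a real word list and the standard immutable Set in place of BloomSet. It also writes getLines() and reset() with explicit parentheses, which recent Scala versions expect. Names such as rewound are illustrative.

```scala
import java.io.File
import java.nio.file.Files
import scala.io.Source

// Write a small temporary word list so the sketch is self-contained.
val path = Files.createTempFile("wordlist", ".txt")
Files.write(path, "alpha\n beta \ngamma\n".getBytes("UTF-8"))
val file: File = path.toFile

// First pass: count the lines with a fold.
val src = Source.fromFile(file, "UTF-8")
val count = src.getLines().foldLeft(0) { (i, _) => i + 1 }

// Second pass: reset() hands back a Source positioned at the start again.
val rewound = src.reset()
val set = rewound.getLines().foldLeft(Set.empty[String]) { _ + _.trim }

// Close both handles and clean up the temporary file.
rewound.close()
src.close()
Files.delete(path)
```

Note that reset() returns a fresh Source rather than rewinding the original in place, so the result is what should be read from on the second pass.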

Using Scala as a scripting language comes with some pretty hefty benefits.   For one thing, you get immediate and idiomatic access to the mighty wealth of libraries which exist in Java.  Even for scripting, this sort of interoperability is invaluable.  JRuby does provide some excellent Java interop, but it simply can’t compare to what you get with Scala.  Further, Scala has a static type system to check your work (for a script, this happens when the script is compiled, just before it runs) and ensure that you haven’t done anything obviously bone-headed.  This too is nothing to sniff at.

Given the fact that Scala’s “scripting syntax” is just as concise as Ruby’s (sometimes more), it’s hard to see a reason not to employ it for around-the-server tasks.  Amusingly, the most compelling reason not to use Scala for scripting just might be its comment syntax.  Not having direct support for the magic “hash bang” (#!) incantation to define a file interpreter just means that Scala scripts have to go through some extra steps to be directly executable.  However, if immediately-executable scripts aren’t an issue, you may want to consider Scala as your scripting language of choice for your next non-trivial outing.  You may reap the rewards in ways you weren’t even expecting.

Comments

  1. I’m not a linux geek, but it appears that you actually can write sort of “#! scripts” with scala: http://www.scala-lang.org/node/166#Scriptit

    Pawel Badenski Monday, November 3, 2008 at 1:34 am
  2. Apparently if you want a directly executable script you can do this:

    #!/bin/sh
    exec scala "$0" "$@"
    !#
    // Say hello to the first argument
    println("Hello, " + args(0) + "!")

    Asd Monday, November 3, 2008 at 3:21 am
  3. Hi,

    Nice writeup. The Scala version does look nice, but I found myself wondering why you used an imperative algorithm for the Ruby version. You could have used inject, just like you used foldLeft… =)

    Cheers

    Ola Bini Monday, November 3, 2008 at 3:27 am
  4. scala ftw :D

    Scala doesn’t have libraries which would simplify scripting. As you know, you are using directly java File class. I think that after some engineering a library for file operations could be prepared which would make writing scripts even faster – with very readable results. Ruby already has such a library, but creating one for Scala may result in even nicer abstractions as imo closures syntax is better and Scala is also very well suited for creating dsl-s.

    jau Monday, November 3, 2008 at 6:43 am
  5. To make the JRuby and Scala examples more similar (not debating whether inject/reduce/foldLeft is more readable):

    count = File.open("wordlists/#{fname}") { |f| f.inject(0) { |sum,line| sum+1 } }

    init = BloomSet.new(optimal_m, optimal_k)
    set = File.open("wordlists/#{fname}") { |f| f.inject(init) { |set, line| set += line.strip } }

    > (let the flames begin)
    You mean about your comments that JRuby/Groovy(/Jython) aren’t enterprise-ready Java killers? :)

    orip Monday, November 3, 2008 at 7:00 am
  6. foldLeft(init) { _ + _.trim }

    Huh? It’s unreadable, nonsense lines like this that will be Scala’s undoing.

    Stephen Colebourne Monday, November 3, 2008 at 7:17 am
  7. It’s been about a year since I played with Scala, but shouldn’t you be able to avoid those two folds? (I just don’t like folds. They are too low-level. :)

    Something like:

    val count = src.getLines.length
    :
    val set = init.addAll(src.reset.getLines.map { _.trim })

    Of course I don’t know if BloomSet has an addAll.

    Nathan Sanders Monday, November 3, 2008 at 8:29 am
  8. @Stephen Colebourne:

    Would you prefer:

    foldLeft(init) { (acc,cur) => acc + cur.trim }

    It means the same thing. Some people prefer the { _ + _.trim } form, since for a fold as simple as this, introducing names for the parameters doesn’t buy you much, and may distract readers from what the function is actually doing. Or do you just object to using folds at all? Folds are very commonly used when programming in a functional style, and once you are used to them, they can be much easier to read than explicit loops.

    Paul Chiusano Monday, November 3, 2008 at 9:48 am
  9. Your Ruby might be a little artificially lengthy. With the caveat that I may be misunderstanding your code, I think this

    count = 0
    File.new "wordlists/#{fname}" do |file|
    file.each { |line| count += 1 }
    end

    would be better expressed like this

    File.open("wordlists/#{fname}").readlines.length

    , and this

    File.new "wordlists/#{fname}" do |fname|
    file.each do |line|
    set += line.strip
    end
    end

    more like this

    set += File.new "wordlists/#{fname}".readlines.inject { |i, k| i.strip.concat k.strip }

    though I’m not a Rubyist myself.

    James Cunningham Monday, November 3, 2008 at 10:06 am
  10. Er, make that last one

    File.new "wordlists/#{fname}".readlines.each { |i| set += i.strip }

    James Cunningham Monday, November 3, 2008 at 10:09 am
  11. @Pawel and @Asd

    Yes, there is a slight hack to set up #! executable Scala scripts. I think you’ll agree though that this is nowhere near as easy as “#!/usr/bin/ruby” or even “#!/usr/bin/env jruby”. True, it’s boilerplate, but it’s not very nice either. What’s more, it breaks every Scala editor on the books (though it probably wouldn’t be too hard to update the jEdit mode to work properly with it).

    @Stephen

    I actually find foldLeft(init) { _ + _.trim } to be *extremely* readable. Paul pointed out in a later comment that you can replace the underscores with named parameters, but I think that the underscore syntax does a nice job of removing the unnecessary distraction of parameter names in trivial cases like this. You’ll note that I didn’t use the far-more-cryptic (init /: src.getLines) { _ + _.trim }. I suppose that this too is just a matter of taste, but I do draw the line somewhere in my own coding style. :-) I usually only use the /: and :\ operators when brevity is essential and readability is not.

    @Ola

    I used an imperative version in the Ruby code for two reasons. First (and probably most significant), I’m not familiar with the functional side of Ruby’s libraries, at least not enough to apply them in this fashion. At least to my mind, Ruby’s libraries are very strange and extremely poorly designed. I realize this is a subjective issue, but it does play into the choice to use Scala rather than JRuby in this particular scripting instance: I can leverage the platform more because I *remember* how to do so. :-)

    And just to nip this argument in the bud: if you add it up, I’ve probably written a lot more Ruby than I have Scala. I’ve certainly been using that language longer. So if I still can’t remember basic collections methods like #inject after *5 years* of experience with the language, I’m either a little slower than most or the libraries just aren’t consistently organized. I’ve actually seen places where Scala’s libraries are like this (#length vs #size), but not even close to the degree that Ruby demonstrates.

    @orip

    :-) Actually, my comment about the flames was in reference to my implication that Scala is a Java-killer. It seems these days that whenever you list *any* language as “the next Java”, your family receives death threats and the police have to start scanning your mail for letter bombs.

    @Nathan

    It is true that I probably overuse fold. As far as functional programming goes, it is a fairly low-level operation. BloomSet does have #addAll, but unfortunately your version requires significantly more memory than mine. By adding the lines to the BloomSet as I read them, I’m able to do everything in-place and keep the footprint to a minimum. If I were to use #map as you suggest, then Scala would have to read the entire file into memory (or rather, the mapped file). This pure in-memory Iterable would then be passed to BloomSet#addAll.

    One approach I *could* have taken which wouldn’t have required any extra memory would be to leverage lazy collections:

    set ++ src.getLines.projection.map { _.trim }

    This doesn’t require any extra memory. I think it’s probably a shade slower, but I don’t use Scala’s lazy collections enough to be sure.

    Daniel Spiewak Monday, November 3, 2008 at 10:15 am
  12. Here’s a better way to do it in Ruby:

    require 'bloomset'

    WIDTH = 2e6

    def compute_k(lines, width);end # …
    def compute_m(lines);end # …

    for fname in Dir["wordlists/*.txt"]
    count = `wc -l #{fname}`.to_i
    optimal_m = compute_m(count)
    optimal_k = compute_k(count, WIDTH)
    set = BloomSet.new(optimal_m, optimal_k)
    IO.foreach(fname){|line| set += line.strip}
    File.open("gen/#{fname}", 'w'){|file| set.store file}
    end

    Jules Monday, November 3, 2008 at 10:27 am
  13. Oh no, my indentation :(

    Jules Monday, November 3, 2008 at 10:27 am
  14. @Jules

    Shelling out just to count the number of lines in a file? That’s not very sporting! Besides, it’s not cross-platform. I originally wrote these scripts on Windows, so `wc -l` wasn’t exactly an option.

    With respect to the last line (File.open… {|file| set.store file}), that’s just not compatible with BloomSet. Remember, BloomSet is a Scala class designed to work with Scala libraries. The only way to persist it is to use streams directly. The invocation “set.store file” is going to fail with a ClassCastException in the JRuby runtime due to the fact that Ruby’s File class is not the same as Java’s OutputStream.

    > Oh no, my indentation :(

    Yeah, that’s been on my TODO list forever. I hate the default WordPress comment system, I just haven’t gotten a chance to fix up something better.

    Daniel Spiewak Monday, November 3, 2008 at 10:36 am
  15. I’m not a java person so please correct me if I’m wrong. Doesn’t the jvm require a fair amount of memory just to start up? My projects (rails) have a lot of little ruby scripts that run from cron and many of them use something like 5MB of memory while running. I’d be concerned if I had to have a bunch of 30MB jvm’s running all over the place because of scala scripts started from cron. Am I wrong about this aspect of using scala as a scripting language?

    reck Monday, November 3, 2008 at 12:40 pm
  16. @reck

    Not as much as you would think. I haven’t looked in a while, but the above script probably takes no more than 3-4 MB of memory on startup. Scala doesn’t really impose any extra overhead (it’s just another Java library), so you can probably trust the numerous references around the web which discuss the VM’s memory usage and startup performance.

    One thing which *will* increase the memory and startup time for the JVM is the -server option. It improves overall performance (as well as consistency) of the application in question, but it does chew through a bit of extra RAM.

    Daniel Spiewak Monday, November 3, 2008 at 12:50 pm
  17. I tried to run it like this and it didn’t seem to have any effect (it always used about 18MB)

    $ scala -DXms8m HelloWorld
    and
    $ scala -DXms128m HelloWorld

    Is this the right way to set this in scala?

    reck Monday, November 3, 2008 at 3:02 pm
  18. > Is this the right way to set this in scala?

    Good question. :-) I don’t think so. To be safe, try it like this:

    java -Xms8m -cp ${SCALA_HOME}/lib/scala-library.jar:. HelloWorld

    You may have to use HelloWorld$ as the main class if your main object has a companion class.

    Daniel Spiewak Monday, November 3, 2008 at 3:07 pm
  19. Daniel: Do not speak about what you have not tried. Passing a Ruby File to a method that takes OutputStream does not ClassCastException, it raises a Ruby error saying it does not know how to automatically convert a Ruby File to an OutputStream. And then you can turn around and do this

    set.store file.to_outputstream

    This will take the “out” channel of the File object and present it as an output stream suitable for any Java API. There’s also to_inputstream and to_channel.

    Charles Oliver Nutter Monday, November 3, 2008 at 3:17 pm
  20. @Charles

    Ah, very clever! I was unaware of those handy conversion methods. I seem to remember a ClassCastException arising from some dynamic JVM language (not necessarily JRuby) passing objects of the wrong type into a Java method, thus I assumed that the result would be the same here.

    In any case, the code given was still not going to work. There certainly were more concise ways that I could have written the Ruby code…mostly because I wasn’t aware of them. However, that doesn’t really affect my point. I was trying to illustrate how Ruby and Scala aren’t really too far separated in concision, contrary to popular perception of statically typed languages. In fact, even if Scala *were* significantly more concise than Ruby for this script, it wouldn’t mean anything. This script was interacting with a Java (actually, Scala) API for most of its functionality. *Naturally* Scala is going to be a little more in its element than Ruby. It’s a credit to JRuby’s Java integration that the script was as concise as it was.

    Daniel Spiewak Monday, November 3, 2008 at 3:26 pm
  21. Here’s how you can count lines in a file with Ruby:

    n = 0
    IO.foreach(fname){|line| n += 1}

    > If we just go off LoC (Lines of Code), Scala actually wins here. This was a more-than-slightly surprising discovery for me [...]

    Could it be that your Scala program is shorter because you know Scala better?

    > On the readability score, I think Scala wins here too. The file processing and set creation is all done in a highly functional style (using foldLeft). At least to my eyes, this is a lot easier to follow than the imperative form in Ruby.

    I don’t agree with this.

    val init = new BloomSet[String](optimalM, optimalK)
    val set = src.reset.getLines.foldLeft(init) { _ + _.trim }

    I find this less readable than the Ruby version:

    set = BloomSet.new(optimal_m, optimal_k)
    IO.foreach(fname){|line| set += line.strip}

    And less writable too (mostly because I can’t remember the order of arguments to fold).

    So while I don’t agree with everything you said, this is a very good article. Keep it up :)

    It’s strange that there isn’t a built-in way to iterate over all files in a directory in either Ruby or Scala (I could be wrong). You almost always want to get the files, not directories.

    Jules Saturday, November 8, 2008 at 12:58 pm
  22. > It’s strange that there isn’t a built-in way to iterate over all files
    > in a directory in either Ruby or Scala (I could be wrong). You
    > almost always want to get the files, not directories.

    I suspect this is partially because most operating systems treat files and directories identically under the surface. Ever wonder why directories need to have the executable bit set in order to allow cd? That’s it. :-)

    As to a built-in mechanism, I think the Java 7 NIO2 will have something like this. But until then, you can do the following in Scala:

    for (f <- myDir.listFiles; if f.isDirectory) {
    // do your thing…
    }

    Daniel Spiewak Saturday, November 8, 2008 at 1:07 pm
  23. Oh, I’m backwards:

    for (f <- myDir.listFiles; if !f.isDirectory) {
    // do your thing…
    }

    Daniel Spiewak Saturday, November 8, 2008 at 1:07 pm
  24. > I’ve probably written a lot more Ruby than I have Scala. I’ve certainly been using that language longer.
    > So if I still can’t remember basic collections methods like #inject after *5 years* of experience with
    > the language, I’m either a little slower than most or the libraries just aren’t consistently organized.

    I think this has totally another reason: Ruby started to be the “better Perl” and was heavily pushed by showing that it really is. The backtick solution of Jules (shell exec) is a typical example of that style of influence. Beside that, Ruby was introduced as “truly object oriented”, targeting at Java, leaving behind Python and especially the Perl OO. Functional style was never really discussed in Ruby communities, unless in niche threads.

    OTOH Scala was introduced as being “functional” (beside OO) and all focus was on documenting what that means, how this is applied a.s.o. (among that your article, covering “functional scripting” ;-) ).

    No wonder, that people grown up with such documentation background end thinking imperatively or object-imperatively in Ruby, whilst they are driven to think functional in Scala.

    You have to explicitly think about “functional Ruby” to approach the API from this perspective.

    Det Tuesday, November 18, 2008 at 12:35 am
  25. Your article would be wonderful, if you just remove the Scala vs JRuby comparison. Five years of experience with Ruby and can’t remember inject? Oh, c’mon…

    Fabio Kung Friday, November 28, 2008 at 6:48 pm
  26. @Fabio

    Strange, but true! It would help if inject were given a better name (like one that was, maybe, descriptive of what it does). :-)

    Daniel Spiewak Friday, November 28, 2008 at 6:54 pm
