Wednesday, March 17, 2010

Expressing joint behavior of modules

Sometimes, special code is needed to make two modules interact. This seems to come up all the time in Factor. For example, two language features, locals and prettyprinting, are implemented in the Factor library. If both are loaded, then special code is needed to make locals prettyprint, but neither one depends on the other.

The old way to handle this situation was to have code that looked like this, as a top-level form in the locals vocab:
"prettyprint" vocab [
    "locals.prettyprint" require
] when
But this is problematic: what if locals is loaded first, and then prettyprint? This actually happened, due to a change in some other code, and locals stopped prettyprinting correctly! You could fix this by adding top-level statements in both vocabs, as mirrors and specialized arrays did, but then a single joint dependency is duplicated in two places, and maintenance is more difficult than it should be.

To fix this problem, I added a new word, require-when. The code above, as a top-level form in the locals vocab, would be replaced with
"prettyprint" "locals.prettyprint" require-when
The logic is: if the prettyprint vocab is already loaded, then locals.prettyprint will be loaded. Otherwise, the dependency is registered with the module system, so if prettyprint is loaded later by something else, then locals.prettyprint will be as well.
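That bookkeeping is simple enough to sketch in a few lines. Here is a hypothetical model in Python (not Factor's actual implementation; the names loaded, pending, require and require_when are made up for illustration):

```python
# Sketch of require-when bookkeeping: a set of loaded vocabs, plus a table
# of deferred loads keyed by the vocab that should trigger them.
loaded = set()   # names of vocabs loaded so far
pending = {}     # trigger vocab -> list of vocabs to load when it arrives

def require(vocab):
    """Load a vocab (actual code loading elided), then load
    anything that was registered as waiting on it."""
    if vocab in loaded:
        return
    loaded.add(vocab)
    for dependent in pending.pop(vocab, []):
        require(dependent)

def require_when(trigger, dependent):
    """If trigger is already loaded, load dependent now; otherwise
    register it to be loaded whenever trigger is loaded later."""
    if trigger in loaded:
        require(dependent)
    else:
        pending.setdefault(trigger, []).append(dependent)

# Either load order ends with locals.prettyprint loaded:
require_when("prettyprint", "locals.prettyprint")  # registers the pairing
require("prettyprint")                             # triggers the pending load
```

The point of the sketch is that the joint dependency is stated once, and the order in which the two vocabs arrive no longer matters.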

I'm pretty satisfied with this solution to a somewhat long-standing problem in Factor. I wonder how other module systems solve the same problem. I have this implemented in a branch in my repository and it should be incorporated into mainline Factor soon.

Monday, March 8, 2010

Paper about call( -- ) inlining

Since I think my method of inlining closures in Factor is stronger than previous similar optimizations, and since I wouldn't mind a potentially free trip to Toronto, I wrote an abstract on what I did for the PLDI student research contest. It's called Closure Elimination as Constant Propagation (not sure if that's the best title). It's supposed to be under 800 words, and it's about double that now. Readers, I'd really appreciate your help in fixing errors, figuring out what to cut, and finding what is unclear and what I left out. Send me any suggestions as comments here or by email at ehrenbed@carleton.edu. Thanks!

Update: I edited the paper, and changed the link above.

Update 2: There's also a severely abridged version, which is more like what I'll submit, here.

Update 3: I got accepted to PLDI! If you're coming, I'll see you there!

Thursday, March 4, 2010

A protocol for sets

It's bothered me for some time that Factor has a protocol (an informal abstract class) for sequences and associative mappings, but nothing for sets. In core and basis, there are three set implementations: you can use a hashtable, a bit array, or a plain old sequence as a set. Different words are used to manipulate each of these, and each set type has a rather incomplete set of operations.

In a branch in my repository, I've fixed this omission. Now all three are under a common protocol, and more can easily be added. If you're interested in reading about the details of the protocol, you can pull from this repository, bootstrap and read the documentation included.

There are a number of benefits to this.
  1. All sets now support all operations, using names that correspond to the intuitive meanings of the operations.
  2. It is much easier to change set representations within a piece of code: just change the code that initializes the set.
  3. It's easier to implement new types of sets, because you get many operations 'for free' and a nice common API.

The design I went with had a few conflicting goals:
  • The protocol should be clean and simple, and therefore easy to implement.
  • The new protocol should be mostly compatible with the existing sets vocabulary.
  • There should be no performance overhead (besides method dispatch) for using this protocol.

Each of these had to be sacrificed a little. For performance, the protocol had to include many different operations, so that different set implementations can override them for efficiency. To keep things simple and easy to implement, I did this only for operations already in use in the Factor code base, and I included default methods for as many operations as I could. For compatibility, I made unordered sequences act as sets in a way that's very similar to the old sets vocabulary, and the major operations are generalizations of the operations on sequences.
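The shape of such a protocol — a small required core, default methods built on it, and implementations free to override a default for speed — can be sketched as follows. This is a hypothetical Python model, not Factor's actual protocol; all names here (AbstractSet, adjoin, in_, members, empty_like) are illustrative:

```python
from abc import ABC, abstractmethod

class AbstractSet(ABC):
    # Required core: implementors must supply these three operations.
    @abstractmethod
    def adjoin(self, elt): ...      # add an element
    @abstractmethod
    def members(self): ...          # sequence of elements, one copy each
    @abstractmethod
    def empty_like(self): ...       # fresh empty set of the same type

    # Default methods: come "for free", but may be overridden
    # with something faster for a particular representation.
    def in_(self, elt):
        return elt in self.members()        # default: linear scan

    def union(self, other):
        result = self.empty_like()
        for x in self.members():
            result.adjoin(x)
        for x in other.members():
            result.adjoin(x)
        return result

class HashSet(AbstractSet):
    """A set backed by a hashtable (here, a dict with dummy values)."""
    def __init__(self, elts=()):
        self._d = dict.fromkeys(elts)
    def adjoin(self, elt):
        self._d[elt] = None
    def members(self):
        return list(self._d)
    def empty_like(self):
        return HashSet()
    def in_(self, elt):             # override: O(1) beats the default scan
        return elt in self._d
```

A sequence-backed set could implement just the three core operations and inherit in_ and union unchanged, which is the "many operations for free" benefit in miniature.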

There is not total backwards compatibility, though. If that had been a design requirement (say, if this were a change made to Factor 1.0), the design would be much cruftier. One change is that the word prune is gone, subsumed by members, which generates a sequence of the members of a set. Given a sequence, this gives a sequence with the same elements but only one copy of each.
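For the sequence case, the behavior can be sketched like this (a Python illustration of the idea, assuming first-occurrence order is kept; the real members word applies to any set):

```python
def members(seq):
    """One copy of each element of seq, keeping first-occurrence order —
    the prune-like behavior when a sequence is viewed as a set."""
    seen, out = set(), []
    for x in seq:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

members([3, 1, 3, 2, 1])  # -> [3, 1, 2]
```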

A bigger change is that hashtables are no longer meant to be used as sets. In their place is a new data structure, the hash set. Hash sets have literal syntax like HS{ 1 2 3 }, similar to other collections. In the current implementation they use a hashtable underneath, but a future implementation may use a more memory-efficient construction. Hash sets implement set operations and hashtables do not. Previously, words like conjoin, conjoin-at and key? were used to manipulate hashtables as sets, but these are now subsumed by the set words adjoin, adjoin-at and in?. There is a lot of code that uses hashtables as sets, and for someone who hasn't written that code, it's not easy to sort out the set uses of hashtables from the non-set uses. So for now, words to manipulate hashtables as sets are still present. conjoin and conjoin-at will be eliminated when all code in the Factor repository is updated to use hash sets instead.
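As a rough analogy, in Python, whose dict and set can stand in for Factor's hashtables and hash sets, the shift looks like this:

```python
# Old style: a hashtable with dummy values used as a set.
seen = {}
seen[3] = True   # analogous to conjoin: insert under a dummy value
3 in seen        # analogous to key?: membership via the assoc interface

# New style: a dedicated set type with real set operations.
seen = set()
seen.add(3)      # analogous to adjoin
3 in seen        # analogous to in?
```

The dedicated type makes the intent explicit: a reader no longer has to guess whether a hashtable's values matter or only its keys.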

It's somewhat unusual to have a language evolution process where the language is not guaranteed to be compatible from one version to the next, as Factor's is right now. There will continue to be incompatibilities until version 1.0 is released, for the sake of clean organization of the language. There is a tradeoff here: incompatibilities make Factor harder to use now and prevent adoption today, but the resulting system will be better organized and easier to use. Factor would probably be in much worse shape today if a policy of backwards compatibility had been adopted a few years ago, and it's a little too soon to freeze the language now.

Update: By the way, I forgot to mention in the original post: many other high-level programming languages don't have such a nice generic collection of data structures. Factor's isn't complete, but in many ways it's more advanced than those of many popular languages. Java, C++ and C# have generic data structure libraries, but using them is more verbose due to missing language features, like syntax for literal hashtables. Popular scripting languages like Python, Ruby and Perl tend to privilege hashtables and resizable arrays over other data structures. In Python or Ruby, it's possible to create data types whose interfaces look just like arrays or hashtables, but the higher-order methods on these data structures only work for the built-in types, and there's no way to make them work for your own type other than reimplementing them. Functional languages like Scheme and Haskell have libraries for lists and arrays, but the interfaces differ between data types; even though Haskell has type classes, both languages' standard libraries are written with lists in mind for the most common operations. Factor used to resemble the scripting languages in its support for data structures, but experience writing large programs with them has led to a better-thought-out, more object-oriented model.