Critiquing Facebook’s new PHP spec

Yesterday, Facebook released an initial draft specification for PHP. Written by a team of Facebook employees including a veteran of many specification committees, it looks like a serious effort to provide a much-needed specification for a language that has gone without one for a long time.

So I thought I’d take a look and see if it was any good.

Why care what I think?

To start with it might be worth mentioning some background. I worked on a PHP compiler for about 4 years from 2005-2009, in which I implemented lots of code generation, static analysis and optimization for PHP, and wrote a PhD on it. If you’ve heard of me before, it’s probably for my rant about HipHop.

As one of the few people outside Facebook and the PHP community that has built a compiler for PHP, and as the author of the most advanced static analyzer for PHP (again, excluding Facebook’s one), my thoughts on Facebook’s PHP spec might be interesting to a larger audience.

Quick summary

Let me summarize this post by saying I love this specification. Both the fact that they wrote it, and the spec itself are awesome.

I don’t just love it because it is a specification, and because PHP sorely needed one, but because I truly think this is well done. It describes the current implementation exceptionally well, and it’s clear they have a ton of expertise in understanding how PHP works. Not only that, but it’s exceptionally clear and well written, and I haven’t been able to find a single flaw in the semantics at all (“semantics” in the programming language world just means “how does the language work”).

A quick aside here. In my PhD, I discussed how PHP, along with Ruby, Python and some others, is a language defined by its implementation. There’s really only one very good discussion of how the PHP implementation works, which is a book from 2006 by Sara Golemon called Extending and Embedding PHP.

The book serves as a reference for the entire PHP implementation. I had a dog-eared copy that I read cover to cover about four times, and it basically taught me the PHP memory model. There are few people as qualified to discuss the implications of copy-on-write and change-on-write sets, and the other weirdness of the PHP implementation and its effect on PHP’s semantics, as its author Sara Golemon. So it’s no surprise really that one of the folks leading the charge on the HHVM team is Sara, and it feels like this is indicative of the quality I saw in the spec.

Terminology

Since “PHP” is such an overloaded term in this post, I’m going to refer to the PHP interpreter as the “Zend engine”, refer to PHP the language as “the language”, and refer to Facebook’s PHP specification as “the spec”.

Deviation

One of the most interesting things is the places they’ve chosen to deviate from how the Zend engine does things. This is an interesting political move. In a couple of places, the spec says that an implementation can choose how to implement the language, and that other implementations can go a different way. Each case is pretty well chosen, and although the reasoning isn’t presented, there’s a pretty clear goal of allowing HHVM to deviate from the language’s weird corner cases.

This is interesting because they’re changing the definition of the language through a sort of back-channel. They’re allowing breaking changes by effectively deciding that other implementation choices are equally valid.

I’ll give you an example, which I’ll get into more below. There’s a little known bug in the Zend engine around copying arrays that contain references. IBM wrote a paper about this bug in 2009. Basically, this bug was necessary in Zend to make copying arrays fast, and IBM figured out a way to do it in a way that was actually correct, for only a 10% performance penalty.

The spec describes this use case in great detail, and then says you don’t need to implement it this way. They say:

If $source’s VStore has a refcount that is greater than 1, the Engine uses an implementation-defined algorithm to decide whether to copy the element using value assignment ($destination = $source) or byRef assignment ($destination =& $source).

This is one of eighteen times in the spec that they use the phrase “implementation-defined”. So they actually let an implementation choose between different ways of doing it – the Zend way and what I assume is the HHVM way. This isn’t just a sane option, it’s actually the only sane option.

I presume this came directly from the requirements of HHVM, but by effectively saying “look, we don’t want to be bound by this decade old mistake” they not only free HHVM from having to implement this, but they actually change the rules for what Zend itself has to implement going forward. A very nice move, in my opinion.

PHP memory model

One of the best things in the spec is the description of the language’s memory model. A “memory model” is a technical term for “how variables and values and assignment and stuff works”.

They made a very interesting choice of how to specify this. Rather than trying to specify the exact algorithm for everything (which is what the JS spec typically does, for example), they chose to describe the Zend model (or close enough) and say “it has to appear to work like this”.

So the spec provides a high-level model based on the idea that there are variables (called VSlots), and separately there are values (called VStores) which hold integers and other scalars or refer to objects, strings or arrays, and then there is an extra thing called the HStore, which is where objects and arrays are kept.

This gets to the crux of the Zend memory model, and it’s a really strong abstraction of it. It isn’t perfectly representative of how Zend actually works under the hood, which they own up to, but they also say that you can’t tell the difference between this and Zend. I think that’s correct, or at least I can’t think of any counter-examples, and I’ve spent a long time thinking of weird counter-examples.

This is also one of those areas where the writing is really good. It’s not perfect, but for the most part they describe very complicated stuff in very clear language, systematically going through all the edge cases: whether the assignment is to or from an array, whether the values involved are references, and whether the assignments are by reference. Very high quality stuff.

Reference counting

One thing they abstract over is reference counting. Zend is implemented using reference counting, and the reference counting allows Zend to implement an optimization called “copy-on-write”. This means when the Zend engine copies a value, it doesn’t actually copy it, it just increments the value’s reference count and says that it’s copied. If the value gets changed later (the “write” in “copy-on-write”), then the copy happens lazily when the change happens.
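To make that concrete, here is a minimal sketch of what copy-on-write looks like from the program's side. The variable names are mine, and the comments about when the copy physically happens describe the Zend engine's usual strategy rather than anything the language guarantees.

```php
<?php
// From the program's point of view, assignment copies the array.
$original = array(1, 2, 3);
$copy = $original;   // under Zend, no elements are copied yet
$copy[] = 4;         // the "write": the lazy copy happens around here

// The original never sees the change, however the engine chose to copy.
echo count($original), ' ', count($copy), "\n"; // prints "3 4"
```

The only observable fact is the last line; when the copy physically happens is the engine's business.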

Since the semantics of this sometimes leak out to the world, you would imagine the spec would need to include the reference-counting semantics. Instead, they tease out the exact details of the copy-on-write semantics, and describe that in high-level terms, with examples. Which is a really powerful way to avoid hamstringing future implementors by specifying the Zend engine’s behaviour.

PHP’s nitty gritty details

Here’s a good test of your PHP knowledge: describe the difference between

$a = new Point(1, 3)

and

$a =& new Point (1, 3)

Answer: I forget! I think it’s that the next assignment to “$a” will do something odd, but honestly I don’t remember the subtleties.

However, clearly the spec’s authors haven’t forgotten the subtleties, and the spec describes them in great detail. If you want to understand the PHP language’s idiosyncrasies (ignore the inconsistent parameter ordering, reference semantics are the real mess), this makes a good read.

One of the more interesting discussions in the spec is around “deferred array copying”, which is that thing in the IBM paper I mentioned earlier. You have an array which contains a value which references another value. When you go to copy that array, you have a choice: does the referenced value exist in both arrays? Or do you make a copy of it like you did the array? Well, the latter probably makes more sense but Zend implements the former.

Here’s an example from the spec of code that triggers this behaviour:

$x = 0;
$a = array(&$x);
$b = $a;
$x = 2;
unset($x);
$b[1]++;
$b[0]++;
echo $a[0], ' ', $b[0];

The spec says that you can have an “implementation-defined algorithm” to choose between different ways to implement it. I presume this is because HHVM chose a different algorithm than Zend did. At least, they should because the Zend choice doesn’t make sense (“but at least it’s fast!”), and anyone relying on this behaviour deserves what they get.
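A smaller case, which is my own reduction rather than something taken from the spec, shows the Zend choice more directly. The printed result is what I would expect from the Zend engine; since the spec leaves this implementation-defined, another engine could legitimately print something else.

```php
<?php
$x = 0;
$a = array(&$x);   // $a[0] and $x are bound to the same value
$b = $a;           // looks like a plain value copy of the array
$b[0]++;           // under Zend, this write goes through the shared reference

// Zend behaviour: the "copy" still shares element 0 with $a and $x.
echo $x, ' ', $a[0], ' ', $b[0], "\n"; // Zend prints "1 1 1"
```

If arrays were copied with pure value semantics, you would expect "0 0 1" instead.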

Imperfection

Now, as much as I like the spec, I think there are a few mistakes in here. One of them is the description of when to do garbage collection.

The spec declares that you have to GC the memory that holds variables at the end of a scope:

A variable having automatic storage duration comes into being and is initialized at its declaration or on its first use, if it has no declaration. Its lifetime is delimited by an enclosing scope. The automatic variable’s lifetime ends at the end of that scope. Automatic variables lend themselves to being stored on a stack where they can help support argument passing and recursion. Local variables, which include function parameters, have automatic storage duration.

To my reading, this says that you must keep the variables alive until they fall out of scope (until the function ends, for example). Similarly:

The Engine must reclaim each VSlot when the storage duration of its corresponding variable ends, when the variable is explicitly unset by the programmer, or when the script exits, whichever comes first. In the case where a VSlot is contained within an HStore (i.e. an array element or an object instance property), the engine must immediate reclaim the VSlot when it is explicitly unset by the programmer, when the containing HStore is reclaimed, or when the script exits, whichever comes first.

I read this as saying that when a variable dies, you must immediately clear it up. I suspect that this will make the GC a little less flexible than it has to be. For example, some advanced garbage collectors might want to keep the variables alive a little bit longer. Alternatively, an implementation might want to share storage for a pair of variables which it knows aren’t alive at the same time. My read here is that they won’t allow that.

[Edit: thanks to HN user wvenable for pointing this out!]

One side effect of the new GC behaviour is that the HStore (representing an object) does not have to be reclaimed immediately. That’s great if you’re writing an implementation with a GC – cleaning up objects later is ideal for advanced garbage collectors. But it’s not so great if you rely on RAII, which the Zend engine currently supports. It looks like this is going to be a pretty big language change if it goes through as is. An implementation supporting RAII would still be allowed, just not required.

[RAII means “Resource Acquisition is Initialization”. It means you grab everything you need in your constructor, and free it in your destructor. This is possible in languages like C++ which have explicit rules that you must call your destructor when something goes out of scope, but not in say Java where finalizers run when the GC gets around to it. Zend effectively has C++’s rules, while the spec is closer to Java’s rules.]
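Here is a rough sketch of the kind of code that depends on this. ScopedResource and doWork are hypothetical names invented for illustration, and the ordering in the final comment is what the Zend engine's reference counting produces; under my reading of the spec, an implementation would be free to run the destructor later.

```php
<?php
// Hypothetical class, used only to record when the destructor runs.
class ScopedResource
{
    public static $log = array();
    private $name;

    public function __construct($name)
    {
        $this->name = $name;
        self::$log[] = "acquire {$this->name}";
    }

    public function name()
    {
        return $this->name;
    }

    public function __destruct()
    {
        self::$log[] = "release {$this->name}";
    }
}

function doWork()
{
    $resource = new ScopedResource('lock');
    ScopedResource::$log[] = 'working with ' . $resource->name();
}   // Zend destroys the object here, when its last reference goes away

doWork();
ScopedResource::$log[] = 'after doWork';

echo implode(', ', ScopedResource::$log), "\n";
// Zend prints: acquire lock, working with lock, release lock, after doWork
```

Code that holds a lock or a file handle this way is relying on "release" appearing before "after doWork".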

Array cursors and threading

One other thing they specified is that array cursors are internal. (To be honest, I didn’t much like the description of arrays itself, but that might be the computer scientist in me: a PHP array is an ordered map, really). This would matter if a new PHP implementation wanted to use a different threading model: would two threads looping through the same array use the same cursor? Sounds pretty racy. Per-thread cursors or optional external cursors might be good options here, and leaving the door open to this might be useful.
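To illustrate the difference, here is a small sketch contrasting the internal cursor with external ones. The advance helper is hypothetical, and I use ArrayIterator from the standard SPL only as an example of what an external cursor looks like, not as something the spec proposes.

```php
<?php
$items = array('a', 'b', 'c');

// The internal cursor lives with the array, so every caller that advances
// it through a reference is moving the same position.
function advance(array &$array)
{
    return next($array);
}

advance($items);
$internal = current($items);   // 'b'

// External cursors: each ArrayIterator tracks its own position.
$first = new ArrayIterator($items);
$second = new ArrayIterator($items);
$first->rewind();
$second->rewind();
$first->next();
$first->next();

echo $internal, ' ', $first->current(), ' ', $second->current(), "\n";
// prints "b c a"
```

Two threads sharing $items would be fighting over the one internal position, whereas each could own its own iterator.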

Aliases

I’m also not crazy about some of the wording they chose. They use words like “pointer” and “memory location” and “aliases” in ways that are not strictly correct, or which are at least confusing since those words hold a different meaning to how they use them in the spec. For example, they say two VSlots (variables) are “aliased” if they hold values which are references to each other (aka “point to the same VStore”). While I agree that those are aliases in the technical sense, that definition excludes other aliased values, such as two variables pointing to the same object (which would be two VSlots pointing to two VStores pointing to one HStore, in the spec’s terminology).
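The two kinds of aliasing are easy to tell apart in code. This is a minimal sketch with a hypothetical Point class; the comments map each assignment onto the spec's terminology as I understand it.

```php
<?php
// Hypothetical class for illustration.
class Point
{
    public $x;

    public function __construct($x)
    {
        $this->x = $x;
    }
}

$a = new Point(1);
$b = $a;     // two VSlots, two VStores, one shared HStore (the object)
$c =& $a;    // two VSlots sharing one VStore: "aliases" in the spec's sense

$b->x = 5;   // visible through $a and $c, since all three reach one object
$c = null;   // overwrites the VStore shared with $a; $b is unaffected

var_dump($a);       // NULL
echo $b->x, "\n";   // prints "5"
```

Both $b and $c alias $a in the everyday sense, but only $c does under the spec's definition.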

By the way, this might be a good time to reflect on how specific this nitpicking is. Imagine how good a job they did on the whole thing if these are the depths to which I have to go to find something they did wrong!

Overflow

That said, I think their choice of how overflow works is unfortunate. They allow the implementation to define both the type and value of overflows, and I think this is a really bad idea. Call me a purist, but I think we should have a very specific understanding of exactly what happens when integers overflow.

I suspect the reason here is that Zend’s choice is poor, and Facebook wants to go in a different direction with HHVM. I loved that approach before with the deferred array copies, so it seems harsh to condemn here what I was happy with before, but integer overflow is important. Still, they’re likely to be hamstrung by Zend’s implementation, which has to remain valid since Zend is by definition a correct implementation.
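For reference, this is a quick sketch of what the Zend engine does today when an integer addition overflows. The result shown is Zend's behaviour; as I read the spec, another implementation could produce a different type or value here.

```php
<?php
$max = PHP_INT_MAX;
$sum = $max + 1;   // Zend converts to float rather than wrapping around

var_dump(is_int($max));    // bool(true)
var_dump(is_float($sum));  // bool(true) under Zend
```

So in Zend both the type and the value change on overflow, which is exactly the behaviour the spec declines to pin down.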

A welcome specification

Overall, I’d say this is a pretty great spec, and a welcome addition to the PHP implementation writer’s repertoire. It’s clear that some work remains (I didn’t highlight explanations that were poor, etc, and there are a few), but I feel confident that this is very close to being an excellent doc. The HHVM team did a great job!