AST in JSON format

# Kevin Curtis (15 years ago)

In May there was some discussion on the possibility of a standardized format for the ECMAScript AST in JSON/JsonML.

The V8 engine has an option (in bleeding_edge) where the internal AST tree can be output to JsonML for debugging purposes: ./shell_g --print_json_ast <file.js>

This is V8's internal AST type, which necessarily includes some implementation-specific artifacts. That said, the V8 AST is very nearly straight out of the ECMA 262 spec, so it's pretty generic. (Note: it's an initial version, e.g. it doesn't recurse into switch statement cases.) It could be useful as input on what a standard JSON AST should look like (which, I guess, ECMAScript engines could support as a new, additional format alongside any existing AST serialization formats).

Here's an example - with some V8 artefacts removed for clarity. Note: the script gets wrapped in a FunctionLiteral and VariableProxy == Identifier.

--- source ---

x = 1; if (x > 0) { y = x + 2; print(y); }

--- AST JsonML ---

["FunctionLiteral", {"name":""}, ["ExpressionStatement", ["Assignment", {"op":"ASSIGN"}, ["VariableProxy", {"name":"x"} ], ["Literal", {"handle":1} ] ] ], ["IfStatement", ["CompareOperation", {"op":"GT"}, ["VariableProxy", {"name":"x"} ], ["Literal", {"handle":0} ] ], ["Block", ["ExpressionStatement", ["Assignment", {"op":"ASSIGN"}, ["VariableProxy", {"name":"y"} ], ["BinaryOperation", {"op":"ADD"}, ["VariableProxy", {"name":"x"} ], ["Literal", {"handle":2} ] ] ] ], ["ExpressionStatement", ["Call", ["VariableProxy", {"name":"print"} ], ["VariableProxy", {"name":"y"} ] ] ] ], ["EmptyStatement"] ] ]
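For readers unfamiliar with JsonML, the shape used above is: each node is an array whose first element is the node type, optionally followed by an attributes object, followed by the child nodes. A minimal sketch of a traversal, assuming exactly that shape (the helper names `nodeParts` and `walk` are hypothetical and not part of V8):

```javascript
// Split a JsonML node into its type, attributes and children.
// Assumes the layout shown above: [type, optionalAttributes, ...children].
function nodeParts(node) {
  var second = node[1];
  var hasAttrs = typeof second === "object" && second !== null &&
                 !Array.isArray(second);
  return {
    type: node[0],
    attrs: hasAttrs ? second : {},
    children: node.slice(hasAttrs ? 2 : 1)
  };
}

// Visit every node depth-first, calling visit(type, attrs) on each.
function walk(node, visit) {
  var parts = nodeParts(node);
  visit(parts.type, parts.attrs);
  parts.children.forEach(function (child) { walk(child, visit); });
}

// The first statement of the example above: x = 1;
var sampleAst = ["FunctionLiteral", {"name":""},
  ["ExpressionStatement",
    ["Assignment", {"op":"ASSIGN"},
      ["VariableProxy", {"name":"x"}],
      ["Literal", {"handle":1}]]]];

var seenTypes = [];
walk(sampleAst, function (type) { seenTypes.push(type); });
console.log(seenTypes.join(" "));
// FunctionLiteral ExpressionStatement Assignment VariableProxy Literal
```

Treating the attributes object as optional matters here because nodes such as ["EmptyStatement"] carry neither attributes nor children.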

3

# David-Sarah Hopwood (15 years ago)

Kevin Curtis wrote:

In May there was some discussion on the possibility of a standardized format for the ECMAScript AST in JSON/JsonML.

The V8 engine has an option (in bleeding_edge) where the internal AST tree can be output to JsonML for debugging purposes: ./shell_g --print_json_ast <file.js>

This is V8's internal AST type, which necessarily includes some implementation-specific artifacts. That said, the V8 AST is very nearly straight out of the ECMA 262 spec, so it's pretty generic. (Note: it's an initial version, e.g. it doesn't recurse into switch statement cases.) It could be useful as input on what a standard JSON AST should look like (which, I guess, ECMAScript engines could support as a new, additional format alongside any existing AST serialization formats).

The Jacaranda parser (not released yet) also produces a JsonML AST. Below is the same example for comparison, also with Jacaranda-specific artefacts removed.

Here's an example - with some V8 artefacts removed for clarity. Note: the script gets wrapped in a FunctionLiteral and VariableProxy == Identifier.

--- source ---

x = 1; if (x > 0) { y = x + 2; print(y); }

["SEQ", {}, ["EXPRSTMT", {}, ["=", {}, ["REF", {"name":"x"}], ["NUMBER", {"MV":1}]]], ["if", {}, [">", {}, ["REF", {"name":"x"}], ["NUMBER", {"MV":0}]], ["{", {}, ["SEQ", {}, ["EXPRSTMT", {}, ["=", {}, ["REF", {"name":"y"}], ["+", {}, ["REF", {"name":"x"}], ["NUMBER", {"MV":2}]]], ["EXPRSTMT", {}, ["(", {}, ["REF", {"name":"print"}], ["ARGS", {}, ["REF", {"name":"y"}]]]]]]]]

# Kevin Curtis (15 years ago)

DSH - very interesting.

Is the idea eventually (for say ES6) to have something like the Python ast module, where the AST (in JsonML format) can be executed? Or just JS source-to-source roundtripping, e.g. js -> JsonML AST -> js?

I'm working on an experiment utilizing the V8 engine - using the JSON object as a (temporary) namespace:

var source = "var x = 3; if (x > 2) print('hello world');";

var astStr = JSON.AST.parse(source); // returns the AST in a JsonML string

JSON.AST.execute(astStr);

# Kevin Curtis (15 years ago)

A patch is available for the V8 engine which enables:

JSON.AST.parse(|js source code|) -> |AST JsonML String|

At: code.google.com/p/ecmascript-ast

# Kevin Curtis (14 years ago)

I have tweaked the JsonML AST format (shorter element names, and operator tokens rather than names, e.g. ">=" instead of "GTE") and added an evaluate() function which can execute the JsonML string. Example:

more example.js

var source = "x = 2; if (x > 1) print(x + 3);";

print("--- js source ---"); print(source); print("");

print("--- ast ---"); var ast = JSON.AST.parse(source); print(ast); print("");

print("--- evaluate(ast) ---"); JSON.AST.evaluate(ast);

./shell_g example.js

--- js source ---

x = 2; if (x > 1) print(x + 3);

--- ast ---

["Program", ["ExprStmt", ["Assign", {"op":"="}, ["Id", {"name":"x"} ], ["Literal", {"handle":2} ] ] ], ["If", ["CompareOp", {"op":">"}, ["Id", {"name":"x"} ], ["Literal", {"handle":1} ] ], ["ExprStmt", ["Call", ["Id", {"name":"print"} ], ["BinOp", {"op":"+"}, ["Id", {"name":"x"} ], ["Literal", {"handle":3} ] ] ] ], ["EmptyStmt"] ] ]

--- evaluate(ast) ---

5

# Kevin Curtis (14 years ago)

The previous example was pretty printed. A production version would offer a minified alternative without newlines and indentation. e.g:

["Program",["ExprStmt",["Assign",{"op":"="},["Id",{"name":"x"}], ...

Perhaps there could be in addition an alternative compact format where the element strings are replaced with integers:

[44,[20,[21,{"op":"="},[25,{"name":"x"}], ...

And the attribute name string with an integer or maybe the first letter:

[44,[20,[21,{1:"="},[25,{3:"x"}], ...

[44,[20,[21,{"o":"="},[25,{n:"x"}], ...

It's compact and as the integers match the underlying enum no hash lookup for the element/attribute string names is required. Kind of a half way house between the AST and bytecode. e.g:

var ast = JSON.AST.parse(source, true); // true for compact mode

var result = JSON.AST.evaluate(ast); // No special mode required

The evaluate() function would simply test if the element value was a string or integer and hash lookup or atoi.
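A rough sketch of how the compact mode could work, assuming a lookup table shared by the serializer and the evaluator. The table and the `compact` and `typeName` helpers are hypothetical: the codes 44, 20, 21 and 25 follow the example above, while 30 for "Literal" is made up for illustration.

```javascript
// Hypothetical table: 44, 20, 21 and 25 follow the example above;
// 30 for "Literal" is invented for this sketch.
var TYPE_CODES = { Program: 44, ExprStmt: 20, Assign: 21, Id: 25, Literal: 30 };
var TYPE_NAMES = {};
Object.keys(TYPE_CODES).forEach(function (name) {
  TYPE_NAMES[TYPE_CODES[name]] = name;
});

// Replace each element name with its integer code, leaving attributes alone.
function compact(node) {
  return node.map(function (part, i) {
    if (i === 0) return TYPE_CODES[part];
    return Array.isArray(part) ? compact(part) : part;
  });
}

// What evaluate() would do first: accept either a string or an integer.
function typeName(node) {
  var head = node[0];
  return typeof head === "number" ? TYPE_NAMES[head] : head;
}

var verbose = ["Program", ["ExprStmt", ["Assign", {"op":"="},
  ["Id", {"name":"x"}], ["Literal", {"handle":2}]]]];
var packed = compact(verbose);

console.log(JSON.stringify(packed));
// [44,[20,[21,{"op":"="},[25,{"name":"x"}],[30,{"handle":2}]]]]
console.log(typeName(packed), typeName(verbose));
// Program Program
```

Because `typeName` accepts both forms, the same evaluator could consume verbose and compact trees without a mode flag, which is the property described above.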

# Kevin Curtis (14 years ago)

Re the idea of standardizing a serialization format in JsonML of the ECMAScript AST based on the ECMAScript grammar.

The ecmascript-ast project's JSON.AST.parse() function now serializes all of the grammar/AST to JsonML (I think). JSON.AST.evaluate() is rudimentary (example.js and ast-test/if.js and while.js work!) The code to build JSON.AST on V8 is at: code.google.com/p/ecmascript-ast

JSON.AST - like the JSON object - could initially be implemented in pure javascript. (evaluate() could parse JsonML and generate JS which could then be eval'ed). Then engines could determine whether to do native implementations. A native JSON.AST.parse() outputting JsonML should be relatively straightforward. JSON.AST.evaluate() is more tricky, with 2 alternatives: parse and execute the JsonML directly, or parse the JsonML to JS, which can then be eval'ed. (A JsonML to JS function is probably desirable in any case).

Here's an example of parsing JS source to JsonML AST:

./ast.sh ast-test/while.js

--- js source ---

x = 1; while (x < 5) { print(x); x = x + 1; };

--- ast ---

["Program", ["ExprStmt", ["Assign", {"op":"="}, ["Id", {"name":"x"} ], ["Literal", {"handle":1} ] ] ], ["While", ["CompareOp", {"op":"<"}, ["Id", {"name":"x"} ], ["Literal", {"handle":5} ] ], ["Block", ["ExprStmt", ["Call", ["Id", {"name":"print"} ], ["Id", {"name":"x"} ] ] ], ["ExprStmt", ["Assign", {"op":"="}, ["Id", {"name":"x"} ], ["BinOp", {"op":"+"}, ["Id", {"name":"x"} ], ["Literal", {"handle":1} ] ] ] ] ] ] ]

--- evaluate(ast) ---

1 2 3 4
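The "parse the JsonML to JS" alternative can be sketched in pure JavaScript for the handful of node types that appear in the examples in this thread. This is an illustration only: the `toJS` function is hypothetical, it is not the ecmascript-ast implementation, and it sidesteps operator precedence by parenthesizing every binary expression.

```javascript
// Hypothetical JsonML-to-source generator covering only the node types
// used in the examples in this thread.
function toJS(node) {
  var type = node[0];
  var hasAttrs = node.length > 1 && !Array.isArray(node[1]);
  var attrs = hasAttrs ? node[1] : {};
  var kids = node.slice(hasAttrs ? 2 : 1).map(toJS);
  switch (type) {
    case "Program":   return kids.join("\n");
    case "Block":     return "{ " + kids.join(" ") + " }";
    case "ExprStmt":  return kids[0] + ";";
    case "EmptyStmt": return ";";
    case "Id":        return attrs.name;
    case "Literal":   return JSON.stringify(attrs.handle);
    case "Assign":    return kids[0] + " " + attrs.op + " " + kids[1];
    case "BinOp":
    case "CompareOp": return "(" + kids[0] + " " + attrs.op + " " + kids[1] + ")";
    case "Call":      return kids[0] + "(" + kids.slice(1).join(", ") + ")";
    case "While":     return "while (" + kids[0] + ") " + kids[1];
    case "If":        return "if (" + kids[0] + ") " + kids[1] +
                             (kids.length > 2 ? " else " + kids[2] : "");
    default: throw new Error("Unsupported node type: " + type);
  }
}

// The while.js AST from the example above.
var whileAst = ["Program",
  ["ExprStmt", ["Assign", {"op":"="}, ["Id", {"name":"x"}], ["Literal", {"handle":1}]]],
  ["While",
    ["CompareOp", {"op":"<"}, ["Id", {"name":"x"}], ["Literal", {"handle":5}]],
    ["Block",
      ["ExprStmt", ["Call", ["Id", {"name":"print"}], ["Id", {"name":"x"}]]],
      ["ExprStmt", ["Assign", {"op":"="}, ["Id", {"name":"x"}],
        ["BinOp", {"op":"+"}, ["Id", {"name":"x"}], ["Literal", {"handle":1}]]]]]]];

var generated = toJS(whileAst);
console.log(generated);
// x = 1;
// while ((x < 5)) { print(x); x = (x + 1); }

// Run the generated source, supplying print() as a parameter.
var printed = [];
new Function("print", generated)(function (v) { printed.push(v); });
console.log(printed.join(" "));
// 1 2 3 4
```

Going through source text means the engine's ordinary parser validates the result, at the cost of a second parse.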

# Mark S. Miller (14 years ago)

On Sun, Dec 6, 2009 at 4:19 AM, Kevin Curtis <kevinc1846 at googlemail.com> wrote:

Re the idea of standardizing a serialization format in JsonML of the ECMAScript AST based on the ECMAScript grammar.

The ecmascript-ast project's JSON.AST.parse() function now serializes all of the grammar/AST to JsonML (I think). JSON.AST.evaluate() is rudimentary (example.js and ast-test/if.js and while.js work!) The code to build JSON.AST on V8 is at: code.google.com/p/ecmascript-ast

I am eager to see such a proposal. However, on your pages, I could not find any documentation, other than small examples, showing what your proposed encoding of JS ASTs in JSON is. Are you writing a draft spec?

# Kevin Curtis (14 years ago)

No spec as yet - more a speculative prototype.

The patch creates a subdir 'ast-test' in the V8 dir which contain a range of JS source examples covering the ECMAScript grammar. These examples (and any other JS scripts) can be parsed to JsonML using the ast-parse.sh shell script (which calls JSON.AST.Parse()): ./ast-parse.sh ast-test/forin.js

The project is an attempt to use the (already pretty generic) JsonML output feature offered by the V8 debug shell as a starting point for a generic standardized JSON AST format. Thus, output from an unpatched V8 debug shell can also give a good feel of what the JsonML looks like. (Note: this output is incomplete - it prints only the node name for some AST constructs): ./shell_g file.js --print-json-ast

It would be useful to get a feel whether the general idea is a starter - or non-starter. Maybe it merits a wiki straw man entry.

more ast-test/forin.js

x = [4,5,6]; for (i in x) print(x[i]);

./ast-parse.sh ast-test/forin.js

["Program", ["ExprStmt", ["Assign", {"op":"="}, ["Id", {"name":"x"} ], ["ArrayLiteral", ["Literal", {"handle":4} ], ["Literal", {"handle":5} ], ["Literal", {"handle":6} ] ] ] ], ["ForIn", ["Id", {"name":"i"} ], ["Id", {"name":"x"} ], ["ExprStmt", ["Call", ["Id", {"name":"print"} ], ["Property", {"type":"NORMAL"}, ["Id", {"name":"x"} ], ["Id", {"name":"i"} ] ] ] ] ] ]

# Oliver Hunt (14 years ago)

On Dec 7, 2009, at 3:52 AM, Kevin Curtis wrote:

No spec as yet - more a speculative prototype.

The patch creates a subdir 'ast-test' in the V8 dir which contain a range of JS source examples covering the ECMAScript grammar. These examples (and any other JS scripts) can be parsed to JsonML using the ast-parse.sh shell script (which calls JSON.AST.Parse()): ./ast-parse.sh ast-test/forin.js

The project is an attempt to use the (already pretty generic) JsonML output feature offered by the V8 debug shell as a starting point for a generic standardized JSON AST format. Thus, output from an unpatched V8 debug shell can also give a good feel of what the JsonML looks like. (Note: this output is incomplete - it prints only the node name for some AST constructs): ./shell_g file.js --print-json-ast

It would be useful to get a feel whether the general idea is a starter - or non-starter. Maybe it merits a wiki straw man entry.

Maybe i've missed an email or something, but what is the purpose of this spec? What is it trying to let developers do?

# Kevin Curtis (14 years ago)

This covers the origin of the idea and some of its uses: esdiscuss/2009-May/009234

I'm interested in JsonML AST as a DSL target.

Hacking the YACC file in jsc to parse the ES5 grammar as expressed in JsonML could yield an (executable) spec of sorts.

# Maciej Stachowiak (14 years ago)

On Dec 7, 2009, at 7:22 AM, Kevin Curtis wrote:

This covers the origin of the idea and some of its uses: esdiscuss/2009-May/009234

I'm interested in JsonML AST as a DSL target.

Hacking the YACC file in jsc to parse the ES5 grammar as expressed in JsonML could yield an (executable) spec of sorts.

I can see how modifying the AST client-side prior to execution could
be useful, to implement macro-like processing. But I don't see the use
case for serializing an AST in JSON-like format, or sending it over
the wire that way. It seems like it would be larger (and therefore
slower to transmit), and likely no faster to parse, as compared to
JavaScript source code. So shouldn't the serialization format just be
JS source code?

, Maciej

# Kevin Curtis (14 years ago)

True. JS as the ultimate delivery mechanism is most likely. With JSON, serialization comes for free, I guess.

Though I propose a 'compact format' - which is definitely 'speculative':

Perhaps there could be an alternative compact format where the element strings are replaced with integers:

[44,[20,[21,{"op":"="},[25,{"name":"x"}], ...

And the attribute name string with an integer or maybe the first letter:

[44,[20,[21,{1:"="},[25,{3:"x"}], ...

[44,[20,[21,{"o":"="},[25,{n:"x"}], ...

It's compact and as the integers match the underlying enum no hash lookup for the element/attribute string names is required. Kind of a half way house between the AST and bytecode.

var ast = JSON.AST.parse(source, true); // true for compact mode

var result = JSON.AST.evaluate(ast); // No special mode required

The evaluate function would simply test if the element value was a string or integer and hash lookup or atoi.

# Maciej Stachowiak (14 years ago)

I could see a potential use case for a format that was more compact
and/or faster to parse down to an executable form than JS source.
However, for size the fair comparison should probably be to JS source
that has been run through a source-to-source minimizer/compressor, and
with both run through gzip. I seriously doubt any text-based AST
format will do better under these conditions.

I also note that your compact form no longer seems very useful for
direct manipulation of the AST, so presumably the use case would be
different.

If we really want to minimize size for delivery and parsing speed,
then probably some sort of binary format with built-in compression would
be most useful, though there would be risk presented in properly
validating such a format.

Actually, this is potentially a factor for any natively supported AST
format. If execution is direct rather than via transformation to JS
source, the implementation would have to verify that the AST is one
that could be created by parsing JS source.

, Maciej

# Kevin Curtis (14 years ago)

On Mon, Dec 7, 2009 at 4:56 PM, Maciej Stachowiak <mjs at apple.com> wrote:

I could see a potential use case for a format that was more compact and/or faster to parse down to an executable form than JS source. However, for size the fair comparison should probably be to JS source that has been run through a source-to-source minimizer/compressor, and with both run through gzip. I seriously doubt any text-based AST format will do better under these conditions.

V8 has the implementation idea of symbols - frequently used strings - such as 'valueOf, String, Number' etc. In ES5 'Object.defineProperty' etc is going to be used quite frequently. It would be possible to turn these common Call strings into integers in JsonML. If the Call Id is an integer then it is considered a direct call to one of ES5's non-monkeypatched 'special forms'. Given the number of object hardening factories that I imagine will be written this could be a win.

["Call", ["Property", ["Id", {name:"Object"}, ["Id", "defineProperty"]]

[33,[55,{4:12}...

Maybe that would be more compact. Though it's stretching!

If we really want to minimize size for delivery and parsing speed, then probably some sort of binary format with built-in compression would be most useful, though there would be risk presented in properly validating such a format.

Maybe the ByteVector proposal offers possibilities in this area! (With a module system could this be namespaced within a native builtin module? Then Data could be a name possibility. Though having ByteVector as top-level and immediately accessible is sweet).

Overall, there seems to be a tension between those who wish to treat ECMAScript as a semantic runtime with serverside compilers generating compact, optimized JS and those who want to add sugar to the language so that it could compete with - say Python - for user friendliness as a general purpose PL. I prefer to reduce the language and then add any necessary core semantics for safety, speed and simplicity. Then let DSLs compete for the best sugar - C syntax or otherwise.

Actually, this is potentially a factor for any natively supported AST format. If execution is direct rather than via transformation to JS source, the implementation would have to verify that the AST is one that could be created by parsing JS source.

Though in some ways a JsonML format - or whatever - offers freedom from JS backward compatibility issues. Want a new keyword: ["Import", ...


# Brendan Eich (14 years ago)

On Dec 7, 2009, at 8:56 AM, Maciej Stachowiak wrote:

Actually, this is potentially a factor for any natively supported
AST format. If execution is direct rather than via transformation
to JS source, the implementation would have to verify that the AST
is one that could be created by parsing JS source.

This reminds me of SafeTSA:

portal.acm.org/citation.cfm?id=378825, portal.acm.org/citation.cfm?doid=1377492.1377496

and more specifically of work by Christian Stork and Michael Franz, see:

www.ics.uci.edu/~cstork

The idea as I first heard it from Chris and Michael was to
arithmetically code ASTs such that no ill-formed tree could be
encoded. You could take a JPEG of the Mona Lisa, run it through the
decoder, and if it succeeded, get a (almost-certainly) nonsensical yet
syntactically well-formed AST. The encoding is fairly efficient, not
as good as optimized Huffman coding but close.

This work was motivated by the sometimes bad (O(n^4)) complexity in
the Java bytecode verifier (or at least in early versions of it).

My view is that there will never be a standardized bytecode (politics
look insuperable to me), and more: that there should not be. Besides
the conflicts among target VM technical details, and ignoring latent
IPR issues, I believe view-source capability is essential. Even
minification lets one pretty-print (jsbeautifier.org) and
learn or diagnose.

JS is still used in edit-shift-reload, crawl-walk-run development
style and part of this culture involves sharing. Of course no one
could mandate binary syntax to the exclusion of source, but a binary
syntax that did not allow pretty-printing would shove us all down the
slippery slope toward the opaque, closed-box world of Java applets,
Flash SWFs (modulo Flash+Flex's server-fetched view-source
capabilities), etc.

Compression at the transport (session, whatever, the model is climbing
the traditional layering) is a separate issue.

# Maciej Stachowiak (14 years ago)

On Dec 7, 2009, at 10:11 AM, Brendan Eich wrote:

On Dec 7, 2009, at 8:56 AM, Maciej Stachowiak wrote:

Actually, this is potentially a factor for any natively supported
AST format. If execution is direct rather than via transformation
to JS source, the implementation would have to verify that the AST
is one that could be created by parsing JS source.

This reminds me of SafeTSA:

portal.acm.org/citation.cfm?id=378825, portal.acm.org/citation.cfm?doid=1377492.1377496

and more specifically of work by Christian Stork and Michael Franz,
see:

www.ics.uci.edu/~cstork

The idea as I first heard it from Chris and Michael was to
arithmetically code ASTs such that no ill-formed tree could be
encoded. You could take a JPEG of the Mona Lisa, run it through the
decoder, and if it succeeded, get a (almost-certainly) nonsensical
yet syntactically well-formed AST. The encoding is fairly efficient,
not as good as optimized Huffman coding but close.

This work was motivated by the sometimes bad (O(n^4)) complexity in
the Java bytecode verifier (or at least in early versions of it).

My view is that there will never be a standardized bytecode
(politics look insuperable to me), and more: that there should not
be. Besides the conflicts among target VM technical details, and
ignoring latent IPR issues, I believe view-source capability is
essential. Even minification lets one pretty-print (jsbeautifier.org)
and learn or diagnose.

JS is still used in edit-shift-reload, crawl-walk-run development
style and part of this culture involves sharing. Of course no one
could mandate binary syntax to the exclusion of source, but a binary
syntax that did not allow pretty-printing would shove us all down
the slippery slope toward the opaque, closed-box world of Java
applets, Flash SWFs (modulo Flash+Flex's server-fetched view-source
capabilities), etc.

Compression at the transport (session, whatever, the model is
climbing the traditional layering) is a separate issue.

Given the above, do you think there is a valid case to be made for a
serialization format other than JavaScript source itself? It seems
like anything binary is likely to have the same downsides as bytecode,
and anything text-based enough to be truly readable and view-source
compatible would be rather inefficient as a wire format (I would
consider a JSON encoding with mysterious integers all over to be not
truly view-source compatible). Thus I would propose that we should not
define an alternate serialization at all.

(This is as considered separately from the possibility of
programmatically manipulating a parsed AST - the use cases for that are
clear. Though there may still be verification issues depending on the
nature of the manipulation API. It seems like the possibilities are
either specialized objects that enforce validity on every individual
manipulation, or something that accepts JSON-like objects and verifies
validity after the fact, or something that accepts JSON-like objects
and verifies validity by converting to JavaScript source code and then
parsing it).

, Maciej
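The second possibility above, accepting JSON-like objects and verifying them after the fact, might look roughly like the following. The arity table and the `checkAst` name are assumptions made for illustration, using node names from the examples earlier in the thread; a real verifier would need to cover the whole grammar and check far more than child counts.

```javascript
// Hypothetical shape rules: node type -> required number of children.
var ARITY = {
  Program: null,          // null means any number of children
  Block: null,
  ExprStmt: 1,
  Assign: 2,
  BinOp: 2,
  CompareOp: 2,
  Id: 0,
  Literal: 0,
  EmptyStmt: 0
};

// Return a list of problems found; an empty list means the tree passed.
function checkAst(node, path) {
  path = path || "root";
  if (!Array.isArray(node) || typeof node[0] !== "string") {
    return [path + ": not a JsonML node"];
  }
  var type = node[0];
  if (!Object.prototype.hasOwnProperty.call(ARITY, type)) {
    return [path + ": unknown node type " + type];
  }
  var second = node[1];
  var hasAttrs = typeof second === "object" && second !== null &&
                 !Array.isArray(second);
  var kids = node.slice(hasAttrs ? 2 : 1);
  var problems = [];
  if (ARITY[type] !== null && kids.length !== ARITY[type]) {
    problems.push(path + ": " + type + " expects " + ARITY[type] +
                  " children, got " + kids.length);
  }
  kids.forEach(function (kid, i) {
    problems = problems.concat(checkAst(kid, path + "/" + type + "[" + i + "]"));
  });
  return problems;
}

var wellFormed = ["Program", ["ExprStmt", ["Assign", {"op":"="},
  ["Id", {"name":"x"}], ["Literal", {"handle":2}]]]];
var malformed = ["Program", ["ExprStmt", ["Assign", {"op":"="},
  ["Id", {"name":"x"}]]]];   // the right-hand side is missing

console.log(checkAst(wellFormed));
console.log(checkAst(malformed));
```

Checking after the fact keeps the tree plain data that any code can build or edit, whereas the specialized-object approach rejects a bad edit at the moment it is made.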

# Mark Miller (14 years ago)

On Mon, Dec 7, 2009 at 7:45 AM, Maciej Stachowiak <mjs at apple.com> wrote:

On Dec 7, 2009, at 7:22 AM, Kevin Curtis wrote:

This covers the origin of the idea and some of its uses: esdiscuss/2009-May/009234

I'm interested in JsonML AST as a DSL target.

Hacking the YACC file in jsc to parse the ES5 grammar as expressed in JsonML could yield an (executable) spec of sorts.

I can see how modifying the AST client-side prior to execution could be useful, to implement macro-like processing. But I don't see the use case for serializing an AST in JSON-like format, or sending it over the wire that way. It seems like it would be larger (and therefore slower to transmit), and likely no faster to parse, as compared to JavaScript source code. So shouldn't the serialization format just be JS source code?

+1.

While potentially useful, I have no interest in these ASTs as a serialization format nor in a compact AST encoding. I am interested in having a standard JsonML AST encoding of parsed ES5, and eventually an efficient and standard browser-side parser that emits these ASTs. Many forms of JS meta-programming that currently occur only on the server (e.g., Caja, FBJS, MSWebSandbox, Jacaranda) or have to download a full JS parser to the client per frame (ADsafe, JSLint, Narcissus, Narrative JS) could instead become lighter weight client side programs.

# David-Sarah Hopwood (14 years ago)

Mark Miller wrote:

On Mon, Dec 7, 2009 at 7:45 AM, Maciej Stachowiak <mjs at apple.com> wrote:

I can see how modifying the AST client-side prior to execution could be useful, to implement macro-like processing. But I don't see the use case for serializing an AST in JSON-like format, or sending it over the wire that way. It seems like it would be larger (and therefore slower to transmit), and likely no faster to parse, as compared to JavaScript source code. So shouldn't the serialization format just be JS source code?

+1.

While potentially useful, I have no interest in these ASTs as a serialization format nor in a compact AST encoding. I am interested in having a standard JsonML AST encoding of parsed ES5, and eventually an efficient and standard browser-side parser that emits these ASTs. Many forms of JS meta-programming that currently occur only on the server (e.g., Caja, FBJS, MSWebSandbox, Jacaranda) or have to download a full JS parser to the client per frame (ADsafe, JSLint, Narcissus, Narrative JS) could instead become lighter weight client side programs.

+1.

Note that:

  • although the size of the JSON serialization of the AST is not critical for this kind of usage, the size of the in-memory representation definitely is.

  • encoding node type strings as integers, as suggested earlier in the thread, does not help with this memory usage.

  • any Lempel-Ziv-based compression algorithm will do much better than replacing type strings with integers, in the few situations where it is useful to serialize the AST and to minimize the size of the serialization.

  • JsonML is a reasonable basis for an AST format even when "serialization for free" is of fairly low importance. In particular, it is useful that it only uses structures that are common across programming languages (for instance, the prototype Jacaranda verifier uses it even though it is written in Java). Also, programmers of AST-processing applications will see this serialization when debugging, and it is likely to appear in test cases for such applications and for parsers/emitters.

# ihab.awad at gmail.com (14 years ago)

On Mon, Dec 7, 2009 at 3:10 PM, David-Sarah Hopwood <david-sarah at jacaranda.org> wrote:

... programmers of AST-processing applications will see this serialization when debugging, and it is likely to appear in test cases for such applications and for parsers/emitters.

Also: would a JsonML representation be quicker to execute than the original human readable JS source? If so, it could be useful as a wire format for the code of mobile objects.

Furthermore, this format could be a good target for generated code.

On the other hand, this discussion slowly creeps into asking whether (say) collaborative IDEs could benefit from such a common representation. For that use case, it would be necessary to support comments, which would grow the problem space considerably....

Ihab

# Oliver Hunt (14 years ago)

On Dec 7, 2009, at 3:21 PM, ihab.awad at gmail.com wrote:

On Mon, Dec 7, 2009 at 3:10 PM, David-Sarah Hopwood <david-sarah at jacaranda.org> wrote:

... programmers of AST-processing applications will see this serialization when debugging, and it is likely to appear in test cases for such applications and for parsers/emitters.

Also: would a JsonML representation be quicker to execute than the original human readable JS source? If so, it could be useful as a wire format for the code of mobile objects.

I doubt it -- the format would be larger and therefore have more ota time and more content to parse than plain JS content. The AST would be untrusted content so would still need to be validated (as the normal parser does implicitly).

Furthermore, this format could be a good target for generated code.

I'm not sure what you mean here...

# ihab.awad at gmail.com (14 years ago)

On Mon, Dec 7, 2009 at 3:29 PM, Oliver Hunt <oliver at apple.com> wrote:

On Dec 7, 2009, at 3:21 PM, ihab.awad at gmail.com wrote:

Furthermore, this format could be a good target for generated code.

I'm not sure what you mean here...

Oh, I'm just claiming that if you need to write a code generator, it may be easier to output an AST in JsonML then use a standard renderer to serialize it, rather than worry about how to render as real JS source code and print out opening and closing braces, parentheses, semicolons, etc (or hunt around finding a library that will do that for you).

Ihab

# David-Sarah Hopwood (14 years ago)

ihab.awad at gmail.com wrote:

On Mon, Dec 7, 2009 at 3:10 PM, David-Sarah Hopwood <david-sarah at jacaranda.org> wrote:

... programmers of AST-processing applications will see this serialization when debugging, and it is likely to appear in test cases for such applications and for parsers/emitters.

Also: would a JsonML representation be quicker to execute than the original human readable JS source? If so, it could be useful as a wire format for the code of mobile objects.

Parsing ECMAScript correctly is hairy, complicated, and likely to be inefficient unless you have a lot of time to spare to optimize the parser (as I'm sure you know from Caja :-). Parsing a JsonML serialized AST is trivial: use JSON.parse.
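For instance, given a serialized AST in the format of the earlier examples (abbreviated here to the single statement `x = 2;`), recovering the tree needs nothing beyond the built-in JSON object. The variable names are only for illustration:

```javascript
// Serialized JsonML for "x = 2;", in the format of the earlier examples.
var serialized = '["Program",["ExprStmt",["Assign",{"op":"="},' +
                 '["Id",{"name":"x"}],["Literal",{"handle":2}]]]]';

var tree = JSON.parse(serialized);
var assign = tree[1][1];           // Program -> ExprStmt -> Assign

console.log(assign[0]);            // Assign
console.log(assign[1].op);         // =
console.log(assign[2][1].name);    // x
```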

OTOH, if we standardize an AST format, then presumably we'll be adding a source->AST API function that uses the implementation's existing parser.

If and when such an API is available, there would be little reason to use the JsonML serialization as a wire format.

Furthermore, this format could be a good target for generated code.

The in-memory AST, yes. Also possibly the serialized form where the code generator is written in a language other than ECMAScript (since writing a correct JSON emitter, even if you can't reuse one, is much easier than writing a correct ECMAScript emitter).

On the other hand, this discussion slowly creeps into asking whether (say) collaborative IDEs could benefit from such a common representation. For that use case, it would be necessary to support comments, which would grow the problem space considerably....

It depends whether you just want to preserve comments, which is pretty easy and would be quite a useful option (especially for "doc-comments"), or whether you want to preserve whitespace, which I think should be considered out-of-scope for an AST format.

# Oliver Hunt (14 years ago)

<snip>

OTOH, if we standardize an AST format, then presumably we'll be adding a source->AST API function that uses the implementation's existing parser.

I'd be worried about assuming that this is an obvious/trivial thing for implementations to do, you're effectively requiring that the internal AST representation of an implementation be entirely standardised. For example it is not possible for JSC's parser to produce an AST that exactly matches the input code -- I would expect similar problems with other implementations.

# Kevin Curtis (14 years ago)

(This is as considered separately from the possibility of programmatically manipulating a parsed AST - the use cases for that are clear. Though there may still be verification issues depending on the nature of the manipulation API. It seems like the possibilities are either specialized objects that enforce validity on every individual manipulation, or something that accepts JSON-like objects and verifies validity after the fact, or something that accepts JSON-like objects and verifies validity by converting to JavaScript source code and then parsing it).

, Maciej

I asked something similar here: esdiscuss/2009-May/009243

Are there 2 approaches to adding AST functionality in ecmascript:

  1. The AST as JSON: either Brendan's original example or the S-expression-ish jsonml. This covers both the in memory representation of the AST and its serialization format.

  2. API: similar to the ast module in python where multiple api calls build the AST nodes.


Given that there are no takers for 'scheme by the back door', then, with the object hardening features of ES5 - 2) should be considered. e.g:

// assume an ast module factory

var x = ast.BinOp("+", 4, 5);

x.left = 6;

x.right = 7

x.whatever = 8 // error
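With ES5, such a factory could be approximated using Object.seal, which leaves existing properties writable but rejects new ones (with a TypeError in strict mode). A minimal sketch, in which the `ast` object and its `BinOp` constructor are hypothetical, as in the example above:

```javascript
// Hypothetical ast module with a single sealed node constructor.
var ast = {
  BinOp: function (op, left, right) {
    // Object.seal keeps existing properties writable but forbids new ones.
    return Object.seal({ type: "BinOp", op: op, left: left, right: right });
  }
};

function demo() {
  "use strict";               // makes the rejected write throw rather than fail silently
  var x = ast.BinOp("+", 4, 5);
  x.left = 6;                 // allowed: existing property
  x.right = 7;                // allowed: existing property
  try {
    x.whatever = 8;           // rejected: the node is sealed
    return "no error";
  } catch (e) {
    return e instanceof TypeError ? "TypeError" : "other error";
  }
}

console.log(demo());          // TypeError
```

Sealing only fixes the set of property names; checking that `left` and `right` hold valid nodes would need setters or a separate validation pass.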

The Python ast module has a dump method which dumps the AST tree of Python objects to a string of nested functions calls - which can be eval'ed to recreate in memory.

# Brendan Eich (14 years ago)

On Dec 7, 2009, at 11:16 AM, Maciej Stachowiak wrote:

On Dec 7, 2009, at 10:11 AM, Brendan Eich wrote:

[snip...] Compression at the transport (session, whatever, the
model is climbing the traditional layering) is a separate issue.

Given the above, do you think there is a valid case to be made for a
serialization format other than JavaScript source itself?

No, I do not. I've cooled on the idea since last talking about it on
this list.

But as Mark pointed out in reply (sorry, I've been off-net all day and
am only now catching up) a JsonML AST encoding standard would help
certain programs. Narcissus could be a lot faster if it could call a
standard method to parse its source into such a JSON object tree and
then execute that, for example (almost all profiled time running
Narcissus is in its lexer and parser).

# Brendan Eich (14 years ago)

On Dec 7, 2009, at 4:07 PM, Oliver Hunt wrote:

<snip>

OTOH, if we standardize an AST format, then presumably we'll be
adding a source->AST API function that uses the implementation's existing
parser.

I'd be worried about assuming that this is an obvious/trivial thing
for implementations to do, you're effectively requiring that the
internal AST representation of an implementation be entirely
standardised. For example it is not possible for JSC's parser to
produce an AST that exactly matches the input code -- I would expect
similar problems with other implementations.

This is a good point, we've talked here before about bottom-up vs. top-down parser trade-offs, left- vs. right-associativity for && and ||,
etc.

Also, some (at least Waldemar, IIRC) on TC39 have objected to
intermediating concrete syntax to semantics via an AST, since it
increases the size and complexity of the standard.

# Brendan Eich (14 years ago)

On Dec 7, 2009, at 4:28 PM, Kevin Curtis wrote:

(This is as considered separately from the possibility of
programmatically manipulating a parsed AST - the use cases for that are clear.
Though there may still be verification issues depending on the nature of the
manipulation API. It seems like the possibilities are either specialized objects
that enforce validity on every individual manipulation, or something
that accepts JSON-like objects and verifies validity after the fact, or
something that accepts JSON-like objects and verifies validity by converting to
JavaScript source code and then parsing it).

, Maciej

I asked something similar here: esdiscuss/2009-May/009243

Are there 2 approaches to adding AST functionality in ecmascript:

  1. The AST as JSON Either Brendan's original example or the S-expression-ish jsonml. This covers both the in memory representation of the AST and its serialization format.

I withdrew my sketch in favor of the JsonML direction back in the
original thread.

  2. API Similar to the ast module in python where multiple api calls build the AST nodes.

Given that there are no takers for 'scheme by the back door', then, with the object hardening features of ES5 - 2) should be considered. e.g:

    // assume an ast module factory
    var x = ast.BinOp("+", 4, 5);
    x.left = 6;
    x.right = 7;
    x.whatever = 8; // error

The Python ast module has a dump method which dumps the AST tree of Python objects to a string of nested functions calls - which can be eval'ed to recreate in memory.

As I think David-Sarah just pointed out, using too much memory to
express something that can be encoded in a much smaller string can be
a bit of a lose.

Standardized ASTs are not a done deal, they constitute a prior work
item, call it 0. We should check whether it is something TC39 members
could agree on in more detail before going too far, although some
popular implementations and a bit of luck not over-constraining parser
implementation choices could lead to adoption and de-facto
standardization.

# David-Sarah Hopwood (14 years ago)

Kevin Curtis wrote:

David-Sarah,

Jacaranda is - I think - a DSL which is not a subset of JS.

No, it's a subset of JS (like most of the other languages mentioned -- Caja/Cajita, FBJS, ADsafe etc.)

How do you anticipate jacaranda programs being delivered to the browser.

Either:

  • verify on the server/proxy, and deliver the verified program as-is (with the Jacaranda runtime library), or
  • tag the program as Jacaranda and have a browser add-on verify it before execution.

The value of the AST format here would be in potentially greatly simplifying some future version of the verifier, by allowing an existing parser to be reused.

That would however depend on an assessment of whether browser implementors had succeeded in implementing secure and correct ES5->AST parsers (with a mode that accepts exactly ES5 as specified, not ES5 plus undocumented cruft and short-cuts for edge cases).

# Breton Slivka (14 years ago)

On Tue, Dec 8, 2009 at 3:57 PM, David-Sarah Hopwood <david-sarah at jacaranda.org> wrote: <snip>

That would however depend on an assessment of whether browser implementors had succeeded in implementing secure and correct ES5->AST parsers (with a mode that accepts exactly ES5 as specified, not ES5 plus undocumented cruft and short-cuts for edge cases).

-- David-Sarah Hopwood  ⚥  davidsarah.livejournal.com

would it make sense to abandon our attachment to using the browser native parser, and just implement an ES5 parser/serializer as a separate standard unit, without ties to the js engine itself? Would there be significant disadvantage to having two parsers in one ES interpreter?

# David-Sarah Hopwood (14 years ago)

Oliver Hunt wrote:

<snip>

OTOH, if we standardize an AST format, then presumably we'll be adding a source->AST API function that uses the implementation's existing parser.

I'd be worried about assuming that this is an obvious/trivial thing for implementations to do, you're effectively requiring that the internal AST representation of an implementation be entirely standardised.

Not at all. An implementation could, for example, parse to its internal AST format and then convert from that to the standard format (which is a trivial tree walk). This only requires that the internal format not lose information relative to the standard one. If it does currently lose information, then changing it not to is relatively straightforward.

In any case, without a source->AST API, what use is a standard AST format?

The existence of that API (and the corresponding AST->source pretty-printing API) is the main motivation for standardizing the format, AFAICS.
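The "trivial tree walk" from an internal format to a standard one can be sketched as follows. The internal node shape (`kind`, `attrs`, `children`) is purely an assumption for illustration; the output mimics the JsonML shape of the V8 dump shown at the start of the thread.

```javascript
// Hypothetical internal node shape: { kind: string, attrs: object, children: array }.
// Output: JsonML-style array [kind, attrs?, ...children].
function toJsonML(node) {
  var out = [node.kind];
  // Emit the attribute object only when it carries something.
  if (node.attrs && Object.keys(node.attrs).length > 0) {
    out.push(node.attrs);
  }
  (node.children || []).forEach(function (child) {
    out.push(toJsonML(child));
  });
  return out;
}

// Internal tree for the statement `x = 1`.
var internal = {
  kind: "Assignment",
  attrs: { op: "ASSIGN" },
  children: [
    { kind: "VariableProxy", attrs: { name: "x" }, children: [] },
    { kind: "Literal", attrs: { handle: 1 }, children: [] }
  ]
};

var jsonml = toJsonML(internal);
```

The walk itself is simple; the condition stated above still applies, namely that the internal tree must retain everything the standard format needs.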

# David-Sarah Hopwood (14 years ago)

Breton Slivka wrote:

On Tue, Dec 8, 2009 at 3:57 PM, David-Sarah Hopwood <david-sarah at jacaranda.org> wrote: <snip>

That would however depend on an assessment of whether browser implementors had succeeded in implementing secure and correct ES5->AST parsers (with a mode that accepts exactly ES5 as specified, not ES5 plus undocumented cruft and short-cuts for edge cases).

would it make sense to abandon our attachment to using the browser native parser, and just implement an ES5 parser/serializer as a separate standard unit, without ties to the js engine itself? Would there be significant disadvantage to having two parsers in one ES interpreter?

What "attachment to using the browser native parser"? It's an implementation detail how the ES5->AST parser is constructed.

However, I wouldn't expect many implementors to want to duplicate code and effort.

Note that with an event-driven parser, for example, it's trivially easy to plug in different event consumers to the same parser and generate different AST formats.
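As a rough illustration of the event-consumer idea, here is a toy event-driven parser with two interchangeable consumers. Everything in it is hypothetical: it only handles input such as `1 + 2 + 3` (it is not a JavaScript parser), and the event names `literal`, `binary` and `result` are invented for the sketch.

```javascript
// Toy parser: emits literal(value) for each operand and binary(op) once both
// operands of an operator have been seen (postfix order, left-associative).
function parseSum(source, consumer) {
  var tokens = source.split(/\s+/).filter(function (t) { return t !== ""; });
  consumer.literal(Number(tokens[0]));
  for (var i = 1; i < tokens.length; i += 2) {
    consumer.literal(Number(tokens[i + 1]));
    consumer.binary(tokens[i]);
  }
  return consumer.result();
}

// Consumer 1: builds JsonML-style arrays.
function jsonMLConsumer() {
  var stack = [];
  return {
    literal: function (v) { stack.push(["Literal", { value: v }]); },
    binary: function (op) {
      var right = stack.pop(), left = stack.pop();
      stack.push(["BinaryOperation", { op: op }, left, right]);
    },
    result: function () { return stack[0]; }
  };
}

// Consumer 2: builds plain object nodes from the same events.
function objectConsumer() {
  var stack = [];
  return {
    literal: function (v) { stack.push({ type: "Literal", value: v }); },
    binary: function (op) {
      var right = stack.pop(), left = stack.pop();
      stack.push({ type: "Binary", op: op, left: left, right: right });
    },
    result: function () { return stack[0]; }
  };
}

var asJsonML = parseSum("1 + 2", jsonMLConsumer());
var asObjects = parseSum("1 + 2", objectConsumer());
```

The parser never names a tree format; only the consumers do, which is why a second output format costs one more consumer rather than one more parser.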

# Kevin Curtis (14 years ago)

A comparative analysis of how the open-source engines lay out their AST nodes could be useful. If there is commonality there could be progress.

Also, it could be left to engine vendors - how or if - they parse JS -> JsonML AST string:-

  • pure JS - a fallback. No work by vendor. Reuse a JS parser written in JS.
  • native. Reuse native AST tree walking functionality.

De facto standardization could emerge with this approach. As with json2.js and native JSON.

Also, is there a case for a standard function to parse JsonML to JS? Maybe a native implementation - depending on performance compared to pure javascript. N.B: not relevant for narcissus style meta-circular processors.

# Allen Wirfs-Brock (14 years ago)

-----Original Message----- From: es-discuss-bounces at mozilla.org [mailto:es-discuss-bounces at mozilla.org] On Behalf Of Breton Slivka Sent: Monday, December 07, 2009 9:36 PM

would it make sense to abandon our attachment to using the browser native parser, and just implement an ES5 parser/serializer as a separate standard unit, without ties to the js engine itself? Would there be significant disadvantage to having two parsers in one ES interpreter?

# Oliver Hunt (14 years ago)

On Dec 8, 2009, at 2:18 AM, David-Sarah Hopwood wrote:

Oliver Hunt wrote:

<snip>

OTOH, if we standardize an AST format, then presumably we'll be adding a source->AST API function that uses the implementation's existing parser.

I'd be worried about assuming that this is an obvious/trivial thing for implementations to do, you're effectively requiring that the internal AST representation of an implementation be entirely standardised.

Not at all. An implementation could, for example, parse to its internal AST format and then convert from that to the standard format (which is a trivial tree walk). This only requires that the internal format not lose information relative to the standard one. If it does currently lose information, then changing it not to is relatively straightforward.

And here you are assuming that ES implementations don't lose information, which is an incorrect assumption. I believe SpiderMonkey at minimum does constant folding while parsing; JSC does constant folding as well as variable and function hoisting, and it also does not actually generate an AST at all for function bodies.
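To make the information loss concrete, here is a small sketch of constant folding over a JsonML-style tree. The node names follow the V8 dump at the start of the thread; the folding function itself is an illustration and is not taken from any engine.

```javascript
function isNumberLiteral(node) {
  return Array.isArray(node) && node[0] === "Literal" &&
         typeof node[1].handle === "number";
}

// Fold ADD of two numeric literals into a single literal, bottom-up.
function foldConstants(node) {
  if (!Array.isArray(node)) {
    return node; // tag strings and attribute objects pass through unchanged
  }
  var folded = node.map(function (part) { return foldConstants(part); });
  var isAdd = folded[0] === "BinaryOperation" && folded[1].op === "ADD";
  if (isAdd && isNumberLiteral(folded[2]) && isNumberLiteral(folded[3])) {
    return ["Literal", { handle: folded[2][1].handle + folded[3][1].handle }];
  }
  return folded;
}

// Tree for the source text `1 + 2`.
var fromSum = foldConstants(
  ["BinaryOperation", { op: "ADD" },
    ["Literal", { handle: 1 }],
    ["Literal", { handle: 2 }]]);

// Tree for the source text `3`.
var fromLiteral = foldConstants(["Literal", { handle: 3 }]);
```

After folding, `fromSum` and `fromLiteral` are structurally identical, so a tree produced this way cannot say which of the two source texts it came from.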

In any case, without a source->AST API, what use is a standard AST format? The existence of that API (and the corresponding AST->source pretty-printing API) is the main motivation for standardizing the format, AFAICS.

That's kind of my point -- adding an ES API to get the parse tree requires one of two things, it's either specifying the internal implementation of an engine's parser, or it's requiring a second parser. Both of these alternatives seem bad.

I have also yet to see an actual developer centric use case -- the most frequently repeated use cases seem to be analysis tools used to ensure code can be run on a standard ES implementation while essentially being a stricter subset of ES. The developer scenario for that is "developer wishes to use 'safe' subset of ES", not "developer wants to perform their own analysis"

# Breton Slivka (14 years ago)

On Wed, Dec 9, 2009 at 6:28 AM, Oliver Hunt <oliver at apple.com> wrote:

On Dec 8, 2009, at 2:18 AM, David-Sarah Hopwood wrote:

Oliver Hunt wrote:

I have also yet to see an actual developer centric use case -- the most frequently repeated use cases seem to be analysis tools used to ensure code can be run on a standard ES implementation while essentially being a stricter subset of ES.  The developer scenario for that is "developer wishes to use 'safe' subset of ES", not "developer wants to perform their own analysis"

--Oliver

Run time mashups? As a developer, I now have access in some browsers to a cross domain enabled getJSON method. Suppose I wanted to use that to retrieve a javascript program from another server (perhaps some kind of module repository), but since I don't trust external code, I want to validate it before I compile it to a JS function, or run it. This is, as you say, a code analysis task. However, in this case, the developer wants to ensure that someone else is using a safe subset of js, rather than as you put it, the developer simply wanting to ensure their own code is in the safe subset.

Right now there are projects to do this (caja, adsafe), but to do a runtime check requires that the user download a full JS parser, and validator. If part of the parsing task was built into the browser, there would be less code to download, and the verification would run much faster. This has real implications for users and developers, and would enable new and novel uses for JS in a browser, and distributed code modules.

Perhaps most of this is possible with just a hermetic eval, or some kind of worker inspired api as crockford suggested once, however consider this:

There is a nice function in js called map which takes a function. If that function could be guaranteed to have certain properties (such as being referentially transparent), then map could be safely broken up into parts and run in parallel, with perhaps a worker api implementation. You could do that now, but it requires some degree of faith that the provided function doesn't do anything naughty. You would not be able to accept such a function from a user, or from an external server.

Anyway, I'm sure there's other uses. Just a thought.
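A deliberately conservative sketch of the kind of check this would need, assuming a JsonML AST like the V8 dump earlier in the thread. It does not prove referential transparency in general; it only accepts expressions built from a small whitelist of node types whose variable references are all parameters. The function name and the whitelist are assumptions made for illustration.

```javascript
// Node types considered side-effect free in this sketch.
var ALLOWED_TYPES = ["Literal", "BinaryOperation", "CompareOperation", "VariableProxy"];

// Returns true only if every node is whitelisted and every variable
// reference names one of `params`. Anything unrecognised is rejected.
function isPureExpression(node, params) {
  if (!Array.isArray(node) || ALLOWED_TYPES.indexOf(node[0]) === -1) {
    return false;
  }
  var attrs = node[1];
  var hasAttrs = node.length > 1 && attrs !== null &&
                 typeof attrs === "object" && !Array.isArray(attrs);
  if (node[0] === "VariableProxy") {
    return hasAttrs && params.indexOf(attrs.name) !== -1;
  }
  return node.slice(hasAttrs ? 2 : 1).every(function (child) {
    return isPureExpression(child, params);
  });
}

// `x + 2` with parameter x: accepted.
var pure = isPureExpression(
  ["BinaryOperation", { op: "ADD" },
    ["VariableProxy", { name: "x" }],
    ["Literal", { handle: 2 }]],
  ["x"]);

// `print(y)`: rejected, because Call is not on the whitelist.
var impure = isPureExpression(
  ["Call", ["VariableProxy", { name: "print" }], ["VariableProxy", { name: "y" }]],
  ["y"]);
```

Rejecting everything not explicitly allowed keeps the check easy to reason about, at the cost of turning away many functions that are in fact safe.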

# Oliver Hunt (14 years ago)

On Dec 8, 2009, at 7:30 PM, Breton Slivka wrote:

On Wed, Dec 9, 2009 at 6:28 AM, Oliver Hunt <oliver at apple.com> wrote:

On Dec 8, 2009, at 2:18 AM, David-Sarah Hopwood wrote:

Oliver Hunt wrote:

I have also yet to see an actual developer centric use case -- the most frequently repeated use cases seem to be analysis tools used to ensure code can be run on a standard ES implementation while essentially being a stricter subset of ES. The developer scenario for that is "developer wishes to use 'safe' subset of ES", not "developer wants to perform their own analysis"

--Oliver

Run time mashups? As a developer, I now have access in some browsers to a cross domain enabled getJSON method. Suppose I wanted to use that to retrieve a javascript program from another server (perhaps some kind of module repository), but since I don't trust external code, I want to validate it before I compile it to a JS function, or run it. This is, as you say, a code analysis task. However, in this case, the developer wants to ensure that someone else is using a safe subset of js, rather than as you put it, the developer simply wanting to ensure their own code is in the safe subset.

Right now there are projects to do this (caja, adsafe), but to do a runtime check requires that the user download a full JS parser, and validator. If part of the parsing task was built into the browser, there would be less code to download, and the verification would run much faster. This has real implications for users and developers, and would enable new and novel uses for JS in a browser, and distributed code modules.

Providing an AST doesn't get you anything substantial here as the hard part of all this is validation, not parsing. Realistically you would want the browser to be responsible for validation because it is able to do much more interesting forms of validation, there are in fact already multiple concepts being investigated by the whatwg to solve just this problem, without requiring js subsetting. Especially given in the mashup scenario you don't just have JS, you have the DOM and html -- assuming you can completely separate html from the js, you're still fairly limited as your validation either prohibits any access to the dom or the validation can be circumvented.

Perhaps most of this is possible with just a hermetic eval, or some kind of worker inspired api as crockford suggested once, however consider this:

There is a nice function in js called map which takes a function. If that function could be guaranteed to have certain properties (such as being referentially transparent), then map could be safely broken up into parts and run in parallel, with perhaps a worker api implementation. You could do that now, but it requires some degree of faith that the provided function doesn't do anything naughty. You would not be able to accept such a function from a user, or from an external server.

Serialisation doesn't get you anything here -- you would have to do a large amount of validation, and then in the end you would need to manually spawn workers. Realistically there's nothing stopping a browser from doing so itself in the first place, and it would be able to do so more efficiently than an end developer could (an end developer is limited to workers, which are only a DOM-level tech, messaging between workers requires copying, etc.; OTOH an engine can relatively trivially track referential transparency of a function, both as a standalone function, and with the constraints of the input elements, and is able to ignore threading restrictions that apply to actual ES code).

# David-Sarah Hopwood (14 years ago)

Oliver Hunt wrote:

On Dec 8, 2009, at 7:30 PM, Breton Slivka wrote:

Right now there are projects to do this (caja, adsafe), but to do a runtime check requires that the user download a full JS parser, and validator. If part of the parsing task was built into the browser, there would be less code to download, and the verification would run much faster. This has real implications for users and developers, and would enable new and novel uses for JS in a browser, and distributed code modules.

Providing an AST doesn't get you anything substantial here as the hard part of all this is validation, not parsing.

That's not entirely accurate. In implementing Jacaranda, I estimate the split of effort between validation/parsing has been about 60/40. ECMAScript is really quite difficult to lex+parse if you absolutely need to do so correctly.

# Mark S. Miller (14 years ago)

On Tue, Dec 8, 2009 at 7:59 PM, Oliver Hunt <oliver at apple.com> wrote:

Providing an AST doesn't get you anything substantial here as the hard part of all this is validation, not parsing.

Given ES5 as a starting point,

  1. validation for many interesting purposes, especially security, is no longer hard,
  2. the subset restrictions need no longer be severe, and
  3. the issue isn't what's hard but what's slow and large. Lexing and parsing JS accurately is slow. Accurate JS lexers and parsers are large. Even if JS is now fast enough to write a parser competitive with the one built into the browsers, this parser would itself need to be downloaded per frame. Even if all downloads of the parser code hit on the browser's cache, the parser would still need to be parsed per frame that needed it (unless browsers cache a frame-independent parsed representation of JS scripts).

I am currently working on just such a validator and safe execution environment -- assuming ES5 and a built in parser->AST. Going out on a limb, I expect it to have a small download, a simple translation, no appreciable code expansion, and no appreciable runtime overhead. Once I've posted it, we can reexamine my claims above against it.

Realistically you would want the browser to be responsible for validation because it is able to do much more interesting forms of validation,

What are these more interesting forms of validation?

there are in fact already multiple concepts being investigated by the whatwg to solve just this problem, without requiring js subsetting.

What are these other concepts? I am aware of one -- the sandboxed iframe. Compared to JS subsetters, this is flawed in many ways. But the more important contrast is that whatwg is investigating security frameworks to be centrally designed, by them/us, and then implemented and deployed by the browser makers. When they screw up, the rest of us downstream have no recourse. By contrast, multiple competing projects are trying various approaches to JS subsetting -- Caja, FBJS, MS WebSandbox, ADsafe, Jacaranda. By one measure www.eros-os.org/pipermail/cap-talk/2009-October/013567.html, these already dominate the Same Origin Policy as the primary isolation mechanism on the web.

This victory happened despite the insane difficulty of doing this on an ES3 base. Again, starting from ES5, this becomes vastly easier.

Especially given in the mashup scenario you don't just have JS, you have the DOM and html -- assuming you can completely separate html from the js, you're still fairly limited as your validation either prohibits any access to the dom or the validation can be circumvented.

All the JS subsetters mentioned above mediate access to the dom but do not prevent it. The virtualized dom provided by Caja is a sufficient emulation of the browser DOM that the YUI library from Yahoo! now operates fully cajoled (translated by Caja and accessing the dom only via Caja's mediation). Please show how the protections provided by Caja can be circumvented.

# Oliver Hunt (14 years ago)

On Dec 8, 2009, at 8:51 PM, Mark S. Miller wrote:

On Tue, Dec 8, 2009 at 7:59 PM, Oliver Hunt <oliver at apple.com> wrote:

Providing an AST doesn't get you anything substantial here as the hard part of all this is validation, not parsing.

Given ES5 as a starting point,

  1. validation for many interesting purposes, especially security, is no longer hard,
  2. the subset restrictions need no longer be severe, and
  3. the issue isn't what's hard but what's slow and large. Lexing and parsing JS accurately is slow. Accurate JS lexers and parsers are large. Even if JS is now fast enough to write a parser competitive with the one built into the browsers, this parser would itself need to be downloaded per frame. Even if all downloads of the parser code hit on the browser's cache, the parser would still need to be parsed per frame that needed it (unless browsers cache a frame-independent parsed representation of JS scripts).

I am currently working on just such a validator and safe execution environment -- assuming ES5 and a built in parser->AST. Going out on a limb, I expect it to have a small download, a simple translation, no appreciable code expansion, and no appreciable runtime overhead. Once I've posted it, we can reexamine my claims above against it.

Realistically you would want the browser to be responsible for validation because it is able to do much more interesting forms of validation,

What are these more interesting forms of validation?

I think basically I took Breton's concept to be (effectively) whitelisting language constructs (to me a reasonable interpretation of his statements) -- the logical step for an end engine to take would be an object capability model of some kind.

there are in fact already multiple concepts being investigated by the whatwg to solve just this problem, without requiring js subsetting.

What are these other concepts? I am aware of one -- the sandboxed iframe. Compared to JS subsetters, this is flawed in many ways. But the more important contrast is that whatwg is investigating security frameworks to be centrally designed, by them/us, and then implemented and deployed by the browser makers. When they screw up, the rest of us downstream have no recourse. By contrast, multiple competing projects are trying various approaches to JS subsetting -- Caja, FBJS, MS WebSandbox, ADsafe, Jacaranda. By one measure www.eros-os.org/pipermail/cap-talk/2009-October/013567.html, these already dominate the Same Origin Policy as the primary isolation mechanism on the web.

This victory happened despite the insane difficulty of doing this on an ES3 base. Again, starting from ES5, this becomes vastly easier.

Same origin policy is intended to prevent content from one domain from accessing another; it's not meant to prevent the embedded site from doing lamentable things inside its own context -- same origin is not a concept that it makes sense to involve in an ES spec. At a very basic level you could treat it as an object capabilities model that has the rather simple rule: function actionAllowed(executionContext) { return executionContext.origin == ownOrigin; }

The issue I have with the various object capability models that are layered on top of ES is that (as far as i can tell -- correct me if i'm wrong) they attempt to restrict the language to make validation possible, whereas the many and varied sandbox-themed concepts the whatwg is considering/has considered work on the assumption that ES should not be restricted at all, and failure should only occur when you attempt to do something illegal (cross-origin access being the prime example). The sandboxing rules are simply an extension to this concept, allowing content from the same origin to be restricted as if it were from a separate origin.

Especially given in the mashup scenario you don't just have JS, you have the DOM and html -- assuming you can completely separate html from the js, you're still fairly limited as your validation either prohibits any access to the dom or the validation can be circumvented.

All the JS subsetters mentioned above mediate access to the dom but do not prevent it. The virtualized dom provided by Caja is a sufficient emulation of the browser DOM that the YUI library from Yahoo! now operates fully cajoled (translated by Caja and accessing the dom only via Caja's mediation). Please show how the protections provided by Caja can be circumvented.

Do you have a site set up with Caja that I can try out? e.g. something where the sole purpose is to allow someone to throw random content at it and see what sticks?

The advantage that in-engine validation has over models like Caja and Jacaranda is that there do not need to be any language restrictions (the security constraints can be enforced by the engine at access time, etc.).

That said if in future a system such as Caja or Jacaranda (or some other yet to be developed system) turns out to be effective and popular I'm sure some effort will be made to standardise it.

# Mark S. Miller (14 years ago)

On Tue, Dec 8, 2009 at 10:02 PM, Oliver Hunt <oliver at apple.com> wrote:

Do you have a site set up with Caja that I can try out? e.g. something where the sole purpose is to allow someone to throw random content at it and see what sticks?

The Caja Playground is at caja.appspot.com.

Please let us know if you run into any issues. Let's resume the rest of this discussion after you've had a chance to play with Caja. Have fun!

# Mark S. Miller (14 years ago)

On Tue, Dec 8, 2009 at 8:51 PM, Mark S. Miller <erights at google.com> wrote:

On Tue, Dec 8, 2009 at 7:59 PM, Oliver Hunt <oliver at apple.com> wrote:

Providing an AST doesn't get you anything substantial here as the hard part of all this is validation, not parsing.

Given ES5 as a starting point,

  1. validation for many interesting purposes, especially security, is no longer hard,
  2. the subset restrictions need no longer be severe, and
  3. the issue isn't what's hard but what's slow and large. Lexing and parsing JS accurately is slow. Accurate JS lexers and parsers are large. Even if JS is now fast enough to write a parser competitive with the one built into the browsers, this parser would itself need to be downloaded per frame. Even if all downloads of the parser code hit on the browser's cache, the parser would still need to be parsed per frame that needed it (unless browsers cache a frame-independent parsed representation of JS scripts).

I am currently working on just such a validator and safe execution environment -- assuming ES5 and a built in parser->AST. Going out on a limb, I expect it to have a small download, a simple translation, no appreciable code expansion, and no appreciable runtime overhead. Once I've posted it, we can reexamine my claims above against it.

Work in progress is at <code.google.com/p/es-lab/source/browse/trunk/src/ses>.

This SES implementation is not actually quite complete yet. Even once it seems complete, we can't test it until there is an available ES5 implementation we can try running it on. However, it is complete enough, and we're confident enough about what it will be once bugs are fixed, that we can try assessing the limb I climbed out on above. Though, until it is tested, all this should still be taken with some salt.

  • SES can be implemented entirely in any ES5 implementation satisfying a few additional constraints that we're trying to accumulate at <code.google.com/p/es-lab/wiki/SecureableES5>.

  • The implementation sketch shown there depends on two elements not currently provided by ES5 or the browser:

    • A Parser->AST API, for which Tom wrote an OMeta/JS parser at <code.google.com/p/es-lab/source/browse/trunk/src/parser/es5parser.ojs> that does run in current JavaScript, producing the ASTs described at <code.google.com/p/es-lab/wiki/JsonMLASTFormat> and available at <es-lab.googlecode.com/svn/trunk/site/esparser/index.html>.

    • An object-identity-based key/value table, such as the EphemeronTables from the weak-pointer strawman. Assuming tables made the SES runtime initialization a touch easier to write. But I can (and probably should) refactor the SES runtime initialization so that it does not use such a table, just to see how adequate ES5 itself already is at supporting SES.

  • Given a parser, the rest of the SES startup is indeed small and fast. For an SES in which the parser needs to be provided in JS, the parser will dominate the size of the SES runtime library. The SES verifier is really trivial by comparison with any accurate JS parser.

  • We are enumerating the subset restrictions imposed by SES at <code.google.com/p/es-lab/wiki/SecureEcmaScript>. For JS code that is already written to widely accepted best practice, such as no monkey patching of primordials, I would guess these restrictions to be quite reasonable. This is the area where we need the most feedback -- is there any less restrictive or more pleasant object-capability subset of ES5 than the one described here?

  • Because this SES implementation is verification-based, not translation-based, there is no code expansion.

  • SES does no blacklisting and no runtime whitelisting. It does all its whitelisting only at initialization and verification time.

  • Aside from "eval", "Function", "RegExp.prototype.exec", "RegExp.prototype.test", and initialization of the SES runtime (which, given a built-in parser, should be fast), SES has no runtime overhead. This also applies to "eval" and "Function" themselves. All their overhead is in starting up. Given a fast parser, this startup overhead should be small. After startup, code run by either eval or Function has no remaining runtime overhead.