Writing an open source Apex syntax highlighter for the Monaco editor

Microsoft's VSCode has come on in leaps and bounds in the last couple of years, both in functionality and popularity. One of my favourite things about it is that it's open source, which means that not only can you view the source code, you can contribute, too!

At its core, VSCode is powered by the Monaco editor, which handles all of the code editing functionality. Since it's all built on top of browser technology, it's actually possible to use the Monaco editor by itself inside a browser - and in fact, that's exactly what we do to render most of our code and XML diffs within 91导航.

Monaco handles things like diff rendering, quick navigation and syntax highlighting for you, but when it came to displaying Apex diffs, there was no Apex syntax highlighter available. Since Java and Apex are syntactically pretty similar, we got away with using the inbuilt Java syntax highlighting for a little while, but we wanted to both improve our product and give back to the open source community that helps build this fantastic editor. So, I decided to take a stab at writing an Apex syntax highlighter and contributing it back to the project.

What does a syntax highlighter do, anyway?

Syntax highlighting requires understanding the code and "tokenizing" it. This entails marking which bits of text correspond to different "token types", where a token type might be something like "identifier" or "string" or "comment". Each of those token types can then be mapped to a certain colour to provide syntax highlighting. In a lot of editors, those colours can be customised by themes - and the tokenization can also help the editor provide other functionality such as highlighting matching brackets and code folding.

The code/configuration that carries out this tokenization process is sometimes called a tokenizer or a grammar, or even a lexical specification.
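To make that concrete, here is a minimal sketch of a tokenizer in TypeScript. It is not how Monaco does it; the token types, patterns and the tokenize function are all made up for illustration, and it only handles single-line input.

```typescript
// Illustrative token types; real highlighters usually have many more.
type TokenType = "comment" | "string" | "number" | "identifier" | "whitespace" | "other";

interface Token {
  type: TokenType;
  text: string;
}

// Patterns are tried in order and are anchored to the start of the remaining text.
const patterns: Array<[TokenType, RegExp]> = [
  ["comment", /^\/\/.*/],
  ["string", /^'(?:[^'\\]|\\.)*'/],
  ["number", /^\d+(?:\.\d+)?/],
  ["identifier", /^[A-Za-z_]\w*/],
  ["whitespace", /^\s+/],
];

function tokenize(line: string): Token[] {
  const tokens: Token[] = [];
  let pos = 0;
  while (pos < line.length) {
    const rest = line.slice(pos);
    let matched = false;
    for (const [type, pattern] of patterns) {
      const match = pattern.exec(rest);
      if (match !== null && match[0].length > 0) {
        tokens.push({ type, text: match[0] });
        pos += match[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      // Anything unrecognised becomes a one-character "other" token.
      tokens.push({ type: "other", text: line.charAt(pos) });
      pos += 1;
    }
  }
  return tokens;
}

// A theme is then just a mapping from token type to colour (values are arbitrary).
const theme: Record<TokenType, string> = {
  comment: "green",
  string: "red",
  number: "teal",
  identifier: "black",
  whitespace: "none",
  other: "grey",
};
```

Calling tokenize on a line such as x = 4.5 // 4.5 yields an identifier, an operator-like "other" token, a number and a comment, with whitespace tokens in between. Each token can then be looked up in the theme to pick its colour.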

How does Monaco do syntax highlighting?

If you're familiar with Salesforce's Apex Code Editor plugin for VSCode, you might be wondering why I needed to create a syntax highlighter for Monaco, given that the plugin provides Apex syntax highlighting.

It turns out that there are actually several ways to implement syntax highlighting in Monaco:

TextMate grammars

TextMate grammars were created for the TextMate editor, but are now used by many editors and are something of a de facto standard when it comes to syntax highlighting. Since so many TextMate grammars exist for so many different languages, when you make a new editor it makes sense to support TextMate grammars for your syntax highlighting. So, in VSCode, Monaco has a module which supports TextMate grammars.

Unfortunately this support relies on a native regex library for performance reasons, and therefore the support only works in VSCode, and not when you run Monaco in a browser.

Language servers

Microsoft also has a specification called the Language Server Protocol (LSP), which can be used to let an editor communicate with a separate server which will provide language-specific features like autocomplete and syntax highlighting.

The Apex Code Editor plugin uses this functionality - it provides a separate Apex Language Server which communicates with VSCode over LSP. Unfortunately, this separate server is written in Java and designed to run locally, so it doesn't make sense in the browser either.

Monarch grammars

Monaco also has its own way of creating syntax highlighters by specifying rules in a JSON format, using a library called Monarch. This library is designed to be efficient (so that it can run fast in a browser environment) - and is the only syntax highlighting method that runs in Monaco in the browser.

Since no Monarch Apex grammar existed, we were stuck when it came to highlighting Apex in the browser.

Creating a Monarch grammar for Apex

The Monaco editor website has documentation on how to write a Monarch grammar. Not only does it explain the Monarch specification pretty clearly, it also has a playground where you can modify a language definition and see the resulting syntax highlighting applied to some code in real time. The real-time editor was invaluable when trying out different things during development.

As mentioned above, in Monarch you provide a series of attributes for your language in a JSON format in order to create a language definition. The main configuration happens within the tokenizer attribute, which contains a series of states and rules. The rules match on the input and tell the tokenizer to perform a certain action - usually to transition into a different state or mark the matched text with a certain token.

The states are needed to keep track of context - the string 4.5 will probably match some sort of number token normally, but it shouldn't do so when it appears within a comment.
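As a rough sketch of what that looks like, here is a tiny, made-up language definition in the Monarch style, written as a plain TypeScript object. It is not the Apex grammar; the keyword list and token names are only for illustration. The root state recognises numbers, but once a block comment opens the tokenizer moves into a comment state where everything, including 4.5, is marked as a comment until the closing delimiter pops back out.

```typescript
// A hypothetical, cut-down language definition in the shape Monarch expects.
const sketchLanguage = {
  // Referenced from the rules below as "@keywords".
  keywords: ["class", "public", "global"],

  tokenizer: {
    // The state the tokenizer starts in.
    root: [
      // Opening a block comment pushes the "comment" state.
      [/\/\*/, "comment", "@comment"],
      [/\d+(\.\d+)?/, "number"],
      // Words are keywords if they appear in the list above, identifiers otherwise.
      [/[a-zA-Z_]\w*/, { cases: { "@keywords": "keyword", "@default": "identifier" } }],
      [/\s+/, "white"],
    ],

    // Inside a comment there is no number rule, so 4.5 is just comment text.
    comment: [
      [/[^\/*]+/, "comment"],
      // The closing delimiter pops back to the previous state.
      [/\*\//, "comment", "@pop"],
      [/[\/*]/, "comment"],
    ],
  },
};
```

In Monaco itself, an object like this would typically be registered with monaco.languages.register and monaco.languages.setMonarchTokensProvider, or pasted into the Monarch playground to try it out.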

A starting point

As a novice when it comes to writing language grammars, I would have been a bit lost trying to start from scratch!

Luckily, I had a good jumping off point in the form of the Java Monarch grammar which already exists for Monaco (and which we were already using to highlight our Apex code in 91导航). Java and Apex are very similar in many ways, so it made sense to copy the existing grammar and modify it to suit some of the differences that Apex has.

One of the ways I went about finding these differences was to grab a fairly diverse set of Apex code from open source repositories on GitHub which I could run the current highlighter on for testing. The Monarch website's live playground editor was really useful for this.

One of the things that immediately jumps out when you use a Java highlighter on Apex code is that some of the keywords aren't highlighted - keywords like global, bulk and future don't exist in Java, and so they usually get interpreted as identifiers instead. Merging in the list of Apex keywords immediately improved the results.
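Merging the lists is nothing fancy. A sketch of the idea, assuming two small sample lists (the real lists are much longer, and the function name is my own):

```typescript
// Tiny illustrative samples, not the full keyword lists.
const javaKeywordSample = ["class", "public", "static", "void", "if"];
const apexKeywordSample = ["global", "bulk", "future", "class", "if"];

// Combine any number of keyword lists, keeping the first occurrence of each word.
function mergeKeywords(...lists: string[][]): string[] {
  const merged: string[] = [];
  for (const list of lists) {
    for (const keyword of list) {
      if (merged.indexOf(keyword) === -1) {
        merged.push(keyword);
      }
    }
  }
  return merged;
}

const mergedKeywords = mergeKeywords(javaKeywordSample, apexKeywordSample);
```

With these samples, mergedKeywords contains each word once, including Apex-only keywords like global, bulk and future.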

Another difference is that Apex is case insensitive, and Java isn't. You can mark Monarch grammars as case insensitive, which sounds like a great solution - unfortunately, it clashed with another feature which I wanted to include. When you set your grammar to be case insensitive, it stops any of your rules discriminating based on the casing of the text (which makes sense).

The problem was, I also wanted to highlight identifiers that started with an uppercase letter differently to those that didn't, because it's a good clue that the identifier is a type rather than a variable name. This rule wasn't possible to write without case sensitivity turned on.
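You can see the clash with plain regular expressions. This is only a sketch: it assumes that a case-insensitive grammar behaves much like adding the i flag to each rule's regex, and the token names are illustrative.

```typescript
// Matches identifiers that begin with an uppercase letter.
const startsUppercase = /^[A-Z]\w*$/;
// The same pattern with case sensitivity switched off.
const startsUppercaseIgnoringCase = /^[A-Z]\w*$/i;

// Label an identifier as a probable type name or a plain identifier.
function classify(identifier: string, pattern: RegExp): string {
  return pattern.test(identifier) ? "type.identifier" : "identifier";
}

const typeGuess = classify("Account", startsUppercase);
const variableGuess = classify("account", startsUppercase);
const brokenGuess = classify("account", startsUppercaseIgnoringCase);
```

With case sensitivity on, Account is classified as a type and account as an identifier. With it off, [A-Z] also matches lowercase letters, so account is treated as a type too and the distinction disappears.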

As a compromise, I created a small function which takes all of the keywords and generates some common casing variations (specifically, all uppercase and with the first letter upper cased). This means that the highlighter will correctly match things like SELECT, if and Decimal as keywords, but it won't match uSiNg or tomORROW. This seemed like a decent compromise to me, but it's not perfect. In particular, PascalCase keywords like TestMethod seem perfectly sensible but won't be highlighted correctly.
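A function along those lines might look something like this. It is a sketch of the idea rather than the exact code in the grammar, and the function name is my own:

```typescript
// For each lowercase keyword, also produce an ALL CAPS variant and a
// variant with only the first letter upper cased.
function withCasingVariations(keywords: string[]): string[] {
  const result: string[] = [];
  const add = (word: string): void => {
    if (result.indexOf(word) === -1) {
      result.push(word);
    }
  };
  for (const keyword of keywords) {
    add(keyword);
    add(keyword.toUpperCase());
    add(keyword.charAt(0).toUpperCase() + keyword.slice(1));
  }
  return result;
}

const variations = withCasingVariations(["select", "testmethod"]);
```

Here variations contains select, SELECT and Select, along with testmethod, TESTMETHOD and Testmethod, but not TestMethod, which is exactly the PascalCase gap described above.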

Some other small changes to the Java highlighter which I made included:

Removing binary, hex and octal numbers (Apex doesn't support them)
Changing the javadoc tokens to apexdoc tokens

Making sure it works

Every Salesforce developer knows the value of a good test suite - so it was time to make sure the highlighter had a good set of tests to verify the functionality (and help out the next people who come to make improvements).

Tests in the repository basically consist of a set of test inputs and the expected token output. Again, the existing test suite for the Java highlighter was a big help.
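To give a flavour of that shape, here is a sketch of a test case as a line of input paired with the tokens we expect back, each identified by where it starts and its type. The interface names, the comparison helper and the token type strings are assumptions for illustration, not the repository's actual test runner.

```typescript
interface ExpectedToken {
  startIndex: number; // character offset where the token begins
  type: string;       // token type, illustrative names only
}

interface TokenizationCase {
  line: string;
  tokens: ExpectedToken[];
}

// Two hypothetical cases: a whole-line comment, and a number followed by a comment.
const apexCases: TokenizationCase[] = [
  { line: "// a comment", tokens: [{ startIndex: 0, type: "comment.apex" }] },
  {
    line: "4.5 // 4.5",
    tokens: [
      { startIndex: 0, type: "number.apex" },
      { startIndex: 3, type: "white.apex" },
      { startIndex: 4, type: "comment.apex" },
    ],
  },
];

// Compare the expected tokens with whatever a tokenizer actually produced.
function tokensMatch(expected: ExpectedToken[], actual: ExpectedToken[]): boolean {
  if (expected.length !== actual.length) {
    return false;
  }
  for (let i = 0; i < expected.length; i++) {
    if (expected[i].startIndex !== actual[i].startIndex || expected[i].type !== actual[i].type) {
      return false;
    }
  }
  return true;
}
```

A test then amounts to running the tokenizer over each line and checking tokensMatch against the expected list.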

Submitting a pull request

Finally, after running through the checklist of things to do when adding a new language, I submitted a pull request with my work.

It was accepted, and is now released in version 0.14.0 of Monaco editor - and it's also now live in the 91导航 app for your Apex code diffs! 🎉

PRs welcome

Of course, although it's a good starting point, it's not perfect by any means. Here's a list of things that could do with some improvement:

SOQL - although lots of SOQL keywords like SELECT and FROM are highlighted correctly, the tokenizer won't recognise SOQL as a different context, and therefore the highlighting within SOQL queries isn't perfect.
I didn't manage to find a fully comprehensive list of Apex keywords. There's one in the Apex documentation, but some keywords such as switch are marked as "for future use" despite being in use now, and some keywords such as get and void are missing. It also doesn't include any of the built-in types. Because of this, I actually merged this list with the list of keywords from the Java highlighter to get my final list.
As mentioned above, the highlighter isn't totally case insensitive.
Some keywords are context dependent - for example, with and sharing are keywords when used in defining a class, but aren't reserved words and can be used as identifiers in other contexts. I didn't tackle this issue in my first pass - the highlighter simply won't detect these as keywords.
Although apexdoc is recognised as a whole block, individual usages of things like @param within the apexdoc block are not parsed separately.

Like I said, it's all open source, so if one of these issues is bugging you and you think you can take a crack at making it better, you can!

Join us!

Come and work in a team where trying new things is the norm! Take a look at our latest engineering jobs if you think 91导航 could be the next step for you.
