SL file format

An SL file holds a library of Swanson code that has already been translated into S₀. The SL file uses a binary format that should be very easy to parse, making it useful for “bootstrap” code.

The bulk of the Swanson standard library is implemented in S₁, and the translators for other languages will typically assume S₁ as the “simplest” language that they can be implemented in. But we don’t want every host to have to implement an S₁ parser and translator directly. Instead, we use a single “bootstrap host” to translate that S₁ code into S₀ and write it out into SL files. Since the standard library includes an S₁ translator implemented in S₁ itself, all other hosts can get by with loading SL files. By loading the standard library via its bootstrapped SL files, other hosts then get an S₁ translator for free.

This document assumes familiarity with the Swanson execution model and the S₀ language.

Overview

An SL file represents a Swanson library, and contains one or more units. Each unit is implemented as an S₀ module.

An SL file consists of three sections:

  • a header section

  • a binaries section

  • a modules section

Varint encoding

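Unless stated otherwise, integer values in an SL file are encoded using a variable-length encoding, marked [varint] in the layouts below. The exceptions are the fixed-width fields in the header and the one-byte codes and glob markers.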

The first byte includes a prefix indicating the encoded length of this particular integer. The length prefix is a sequence of 0 bits followed by a 1 bit. The number of 0 bits indicates the number of additional bytes (not including the byte containing the length prefix). All remaining bits in the first byte, along with all bits in any additional bytes, provide the value of the integer, encoded in big-endian order.

Some examples:

0x80 => 1_0000000 => 0

Length prefix of 1 means no additional bytes. Remaining bits 0000000 encode the number 0.

0xff => 1_1111111 => 127

Length prefix of 1 means no additional bytes. Remaining bits 1111111 encode the number 127.

0x40 0x80 => 01_000000 10000000 => 128

Length prefix of 01 means one additional byte. Remaining bits 000000 10000000 encode the number 128.

0x20 0xc3 0x50 => 001_00000 11000011 01010000 => 50,000

Length prefix of 001 means two additional bytes. Remaining bits 00000 11000011 01010000 encode the number 50,000.

When decoding, this scheme has the nice property that you can determine the number of bytes needed for an integer using a single “count leading zeros” operation, which is available as a single instruction on most modern CPUs, and exposed as an intrinsic in most host languages.
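The decoding rule above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation. The function names decode_varint and encode_varint are hypothetical, and the sketch makes two assumptions that this document does not state: a first byte of 0x00 (no length-prefix bit in the first byte) is treated as invalid, and the encoder always emits the shortest form.

```python
def decode_varint(data: bytes, offset: int = 0):
    """Decode one varint starting at `offset`; return (value, next_offset)."""
    if offset >= len(data):
        raise ValueError("truncated varint")
    first = data[offset]
    if first == 0:
        # Assumption: a length prefix that spills past the first byte is
        # not described by the format, so this sketch rejects it.
        raise ValueError("unsupported varint length prefix")
    # Leading zero bits in the first byte = number of additional bytes.
    extra = 8 - first.bit_length()
    end = offset + 1 + extra
    if end > len(data):
        raise ValueError("truncated varint")
    # Strip the length prefix (the zero bits and the 1 bit that ends them).
    value = first & ((1 << (7 - extra)) - 1)
    # Remaining bytes are appended in big-endian order.
    for byte in data[offset + 1:end]:
        value = (value << 8) | byte
    return value, end


def encode_varint(value: int) -> bytes:
    """Encode `value` using the fewest bytes (an assumption of this sketch)."""
    if not 0 <= value < 1 << 56:
        raise ValueError("value out of range for a one-byte length prefix")
    total = 1
    # An encoding of `total` bytes carries 7 * total value bits.
    while value >= 1 << (7 * total):
        total += 1
    # The prefix's 1 bit sits directly above the value bits.
    return (value | (1 << (7 * total))).to_bytes(total, "big")
```

With these definitions, decode_varint(bytes([0x20, 0xC3, 0x50])) returns (50000, 3), matching the last example above. Python's int.bit_length() stands in for the “count leading zeros” operation here.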

Header section

The header section identifies this file as an SL file and gives the version of the file format that it uses.

header section {
  magic number [uint32_be]
  version [uint32_be]
}

The magic number is the four-byte big-endian constant 0x534C4942, which is the same as the ASCII string “SLIB”.

The version is the four-byte big-endian constant 0x00000004, indicating that this is version 4 of the SL file format.
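As a rough sketch of how a host might read this section, the following Python uses the standard struct module to unpack the two fixed-width fields. The names parse_header and SL_MAGIC are hypothetical. The sketch checks only the magic number and hands the version back to the caller, since which versions a host accepts is a policy decision that this document does not prescribe.

```python
import struct

SL_MAGIC = 0x534C4942  # ASCII "SLIB"


def parse_header(data: bytes, offset: int = 0):
    """Read the header; return (version, offset_of_next_section)."""
    if len(data) - offset < 8:
        raise ValueError("file too short for an SL header")
    # ">II" = two big-endian unsigned 32-bit integers.
    magic, version = struct.unpack_from(">II", data, offset)
    if magic != SL_MAGIC:
        raise ValueError("not an SL file (bad magic number)")
    return version, offset + 8
```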

Binaries section

The binaries section contains all of the binary constants used throughout the rest of the file. Binary constants are used for S₀ names, and for the value of any Swanson literals created by an S₀ “create literal” statement.

binaries section {
  binary count [varint]
  binary constants [array of constant]
}

constant {
  length [varint]
  content [bytes]
}

The binary count field specifies how many binary constants there are in the section. Each constant then appears consecutively. Each constant starts with a length field indicating how long the constant’s content is. The content then follows. The constant is not encoded in any way; its binary content is included in the file verbatim.
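A minimal sketch of reading this section in Python is shown below. The function names are hypothetical, and read_varint is a compact decoder for the varint encoding described earlier, without the bounds checking that a production parser would want. The constants are returned as a list so that later sections can refer to them by index.

```python
def read_varint(data: bytes, pos: int):
    """Decode one varint at `pos`; return (value, next_pos)."""
    first = data[pos]
    if first == 0:
        raise ValueError("unsupported varint length prefix")
    extra = 8 - first.bit_length()            # leading zero bits = extra bytes
    value = first & ((1 << (7 - extra)) - 1)  # strip the length prefix
    for byte in data[pos + 1:pos + 1 + extra]:
        value = (value << 8) | byte
    return value, pos + 1 + extra


def parse_binaries(data: bytes, pos: int):
    """Read the binaries section; return (list_of_constants, next_pos)."""
    count, pos = read_varint(data, pos)
    constants = []
    for _ in range(count):
        length, pos = read_varint(data, pos)
        end = pos + length
        if end > len(data):
            raise ValueError("truncated binary constant")
        # The content is stored verbatim, so a plain slice is enough.
        constants.append(data[pos:end])
        pos = end
    return constants, pos
```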

Names

“Names” are used throughout the modules section. Each name is annotated with the source file location where the name appears. Names all have the same structure:

name {
  content [varint]
  location [source location]
}

source location {
  source file [varint]
  start line [varint]
  start column [varint]
  end line [varint]
  end column [varint]
}

The content and source file fields are each the index of one of the binary constants in the binaries section. The content field’s constant gives the content of the name. The source file field’s constant gives the name of the source file where the name appears. The start line, start column, end line, and end column fields give the location of the name within the source file. Each of these fields is 0-indexed. The end fields should point at the character immediately following the name in its source file. (That means that if the name appears on a single line, with start line and end line being equal, then subtracting start column from end column gives the length of the source file syntax that the name comes from.)
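The name layout can be read with six consecutive varint decodes, as in the following Python sketch. The type and function names (SourceLocation, Name, read_name) are hypothetical, and read_varint is a compact decoder for the varint encoding described earlier.

```python
from typing import NamedTuple


class SourceLocation(NamedTuple):
    source_file: int   # index into the binaries section
    start_line: int    # all positions are 0-indexed
    start_column: int
    end_line: int
    end_column: int    # points just past the last character


class Name(NamedTuple):
    content: int       # index into the binaries section
    location: SourceLocation


def read_varint(data: bytes, pos: int):
    """Decode one varint at `pos`; return (value, next_pos)."""
    first = data[pos]
    if first == 0:
        raise ValueError("unsupported varint length prefix")
    extra = 8 - first.bit_length()            # leading zero bits = extra bytes
    value = first & ((1 << (7 - extra)) - 1)  # strip the length prefix
    for byte in data[pos + 1:pos + 1 + extra]:
        value = (value << 8) | byte
    return value, pos + 1 + extra


def read_name(data: bytes, pos: int):
    """Read a name (content index plus source location)."""
    fields = []
    for _ in range(6):
        value, pos = read_varint(data, pos)
        fields.append(value)
    return Name(fields[0], SourceLocation(*fields[1:])), pos
```

For a name that sits on a single line, name.location.end_column - name.location.start_column gives the length of its source text, as noted above.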

Globbed lists

Several parts of a module include a globbed list, which consists of a list of names, along with an optional glob. Like names, globs are annotated with the source file location where the glob appears.

glob {
  present [u8 = "*"]
  location [source location]
}

missing glob {
  missing [u8 = " "]
}

optional glob = glob | missing glob

globbed list {
  name count [varint]
  names [array of name]
  glob [optional glob]
}

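The name count field specifies how many elements there are in the names field. The glob field follows the names and starts with a one-byte marker: 0x2A (ASCII “*”) if a glob is present, in which case the glob’s source location follows, or 0x20 (ASCII space) if there is no glob.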
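The following Python sketch shows one way to read a globbed list. All function names are hypothetical. Names are returned as (content, location) tuples, a present glob is represented by its source location, and a missing glob by None. Rejecting any marker byte other than “*” or a space is an assumption of this sketch.

```python
def read_varint(data: bytes, pos: int):
    """Decode one varint at `pos`; return (value, next_pos)."""
    first = data[pos]
    if first == 0:
        raise ValueError("unsupported varint length prefix")
    extra = 8 - first.bit_length()            # leading zero bits = extra bytes
    value = first & ((1 << (7 - extra)) - 1)  # strip the length prefix
    for byte in data[pos + 1:pos + 1 + extra]:
        value = (value << 8) | byte
    return value, pos + 1 + extra


def read_location(data: bytes, pos: int):
    """Read the five varints of a source location as a tuple."""
    fields = []
    for _ in range(5):
        value, pos = read_varint(data, pos)
        fields.append(value)
    return tuple(fields), pos


def read_name(data: bytes, pos: int):
    """Read a name as (content index, source location)."""
    content, pos = read_varint(data, pos)
    location, pos = read_location(data, pos)
    return (content, location), pos


def read_globbed_list(data: bytes, pos: int):
    """Return (names, glob_location_or_None, next_pos)."""
    count, pos = read_varint(data, pos)
    names = []
    for _ in range(count):
        name, pos = read_name(data, pos)
        names.append(name)
    marker, pos = data[pos:pos + 1], pos + 1
    if marker == b"*":
        # A present glob carries its own source location.
        glob, pos = read_location(data, pos)
    elif marker == b" ":
        glob = None
    else:
        raise ValueError("unexpected glob marker")
    return names, glob, pos
```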

Modules section

The modules section provides the definition of each S₀ module in the file.

modules section {
  module count [varint]
  modules [array of module]
}

The module count field specifies how many modules there are in the file. Each module then appears consecutively.

module {
  module name [name]
  block count [varint]
  blocks [array of block]
}

The block count field specifies how many blocks there are in the module. Each block then appears consecutively.

block {
  block name [name]
  containing [globbed list]
  branch count [varint]
  branches [array of branch]
}

Each block starts with its name and its containing clause, which is a globbed list. After the containing clause is the block’s list of branches. The branch count field specifies how many branches there are in the block. Each branch then appears consecutively.

branch {
  branch name [name]
  receiving [globbed list]
  statement [array of statement]
  invocation [invocation]
}

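Each branch starts with its name and its receiving clause, which is a globbed list. After the receiving clause is the list of statements in the branch, followed by the branch’s invocation. The layout gives no statement count; because every statement and the invocation start with a distinct one-byte code, a parser can keep reading statements until it sees the invocation’s code. Each kind of statement, and the invocation, has a different format.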

statement = create closure | create literal | rename

create closure {
  code [u8 = "C"]
  dest [name]
  block [varint]
  close over [globbed list]
}

The one-byte code field has the value 0x43 (ASCII “C”) for a create closure statement. The block field is the index of one of the blocks in this module. The close-over clause is a globbed list.

create literal {
  code [u8 = "L"]
  dest [name]
  content [varint]
  location [source location]
}

The one-byte code field has the value 0x4C (ASCII “L”) for a create literal statement. The content field specifies the content of the new literal. It is an index of one of the binary constants in the binaries section. (Note that like names, the literal content is annotated with information about its location within a source file.)

rename {
  code [u8 = "R"]
  dest [name]
  source [name]
}

The one-byte code field has the value 0x52 (ASCII “R”) for a rename statement.

invocation {
  code [u8 = "I"]
  target [name]
  branch [name]
  inputs [globbed list]
}

The one-byte code field has the value 0x49 (ASCII “I”) for an invocation. The invocation’s inputs are a globbed list.

The invocation is the last portion of a branch. The next branch in the block immediately follows the invocation. (If there are no more branches in the block, the next block in the module immediately follows. If there are no more blocks in the module, the next module in the file immediately follows. If there are no more modules in the file, no more content appears in the file.)
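Putting the branch layout together, a parser might dispatch on the one-byte code as in the Python sketch below. All function names and the tuple and dict shapes of the results are hypothetical. The sketch assumes that the statement list ends at the first “I” code, since the branch layout has no statement count, and it omits bounds checking for brevity.

```python
def read_varint(data, pos):
    """Decode one varint at `pos`; return (value, next_pos)."""
    first = data[pos]
    if first == 0:
        raise ValueError("unsupported varint length prefix")
    extra = 8 - first.bit_length()            # leading zero bits = extra bytes
    value = first & ((1 << (7 - extra)) - 1)  # strip the length prefix
    for byte in data[pos + 1:pos + 1 + extra]:
        value = (value << 8) | byte
    return value, pos + 1 + extra


def read_location(data, pos):
    """Read the five varints of a source location as a tuple."""
    fields = []
    for _ in range(5):
        value, pos = read_varint(data, pos)
        fields.append(value)
    return tuple(fields), pos


def read_name(data, pos):
    """Read a name as (content index, source location)."""
    content, pos = read_varint(data, pos)
    location, pos = read_location(data, pos)
    return (content, location), pos


def read_globbed_list(data, pos):
    """Read a globbed list as (names, glob location or None)."""
    count, pos = read_varint(data, pos)
    names = []
    for _ in range(count):
        name, pos = read_name(data, pos)
        names.append(name)
    marker, pos = data[pos:pos + 1], pos + 1
    if marker == b"*":
        glob, pos = read_location(data, pos)
    elif marker == b" ":
        glob = None
    else:
        raise ValueError("unexpected glob marker")
    return (names, glob), pos


def read_branch(data, pos):
    """Read one branch: name, receiving clause, statements, invocation."""
    name, pos = read_name(data, pos)
    receiving, pos = read_globbed_list(data, pos)
    statements = []
    while True:
        code, pos = data[pos:pos + 1], pos + 1
        if code == b"C":    # create closure
            dest, pos = read_name(data, pos)
            block, pos = read_varint(data, pos)
            close_over, pos = read_globbed_list(data, pos)
            statements.append(("closure", dest, block, close_over))
        elif code == b"L":  # create literal
            dest, pos = read_name(data, pos)
            content, pos = read_varint(data, pos)
            location, pos = read_location(data, pos)
            statements.append(("literal", dest, content, location))
        elif code == b"R":  # rename
            dest, pos = read_name(data, pos)
            source, pos = read_name(data, pos)
            statements.append(("rename", dest, source))
        elif code == b"I":  # invocation: always the last item in a branch
            target, pos = read_name(data, pos)
            branch, pos = read_name(data, pos)
            inputs, pos = read_globbed_list(data, pos)
            return {"name": name, "receiving": receiving,
                    "statements": statements,
                    "invocation": (target, branch, inputs)}, pos
        else:
            raise ValueError("unknown statement code")
```

Because read_branch returns the offset just past the invocation, a caller can invoke it repeatedly, branch count times per block, to walk the rest of the modules section.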

Version history

Version 1 was introduced in February 2021. Up until then, S₀ code was always encoded in its human-readable text format. The SL format was created to be easier for hosts to parse.

Version 2 was introduced in January 2022, as part of the work to add explicit inputs and input globs to S₀ invocations.

Version 3 was introduced in March 2022, as part of the work to add globs to the containing and receiving clauses in S₀ blocks and branches.

Version 4 was introduced in June 2022, and adds location information to literals.