Improved Transaction Serialization

Motivation

One of the major bottlenecks in all blockchain systems is storage: transactions occupy storage space, and need to be kept around forever. Transaction size also translates directly to cost, as transaction fees are generally proportional to the size of a transaction in bytes.

For a variety of reasons, the current binary encoding for A-Block transactions is both inefficient and difficult to upgrade:

  • Byte arrays are often stored as hexadecimal strings, which occupy two bytes per byte of actual data.
    • Among other things, this includes all binary data on the stack, TxOut destination addresses, and OutPoint transaction hashes.
  • Enums are encoded with a 32-bit type ID, even when far fewer bytes would suffice.
    • Most significantly, this includes script opcodes, so every opcode in a script ends up occupying 8 bytes: 4 to indicate that the stack entry is a StackEntry::Op, and another 4 for the actual opcode number.
  • All arrays/vectors are encoded with a 64-bit length prefix. This is unnecessarily long for variable-length sequences (such as the number of transaction inputs/outputs) and completely unnecessary where the array length is fixed (such as the number of bytes in a public key).
  • The version number is stored near the end of the transaction, making it impossible to determine which version a transaction was serialized with without first deserializing all the inputs and outputs (whose format itself depends on the transaction version!).
  • A full script is encoded in every TxIn, and is then matched against known script patterns to determine the actual transaction type. As most transaction inputs and outputs follow a consistent format (e.g. P2PKH), it would be nice to split TxIn and TxOut into enums which store only the minimal required information for each type of transaction, rather than having to serialize a full script (which is space-inefficient) and then pattern-match against the script contents (which is error-prone and adds code complexity).
  • Scripts currently operate on integers of type usize, which is non-portable as its size may vary across systems. This should be changed to an integer type with known size, likely u64 or u128, so as to ensure consistent behavior across all platforms.
  • Script entries have separate item types for public keys, signatures, integers and binary data. Not only does this currently require two enum discriminants per script entry in serialized scripts, but it also has further disadvantages:
    • Numbers and byte strings are not interoperable. For instance, there is no way for a script to hash a number, or to concatenate a number with a byte string. Even if we choose to treat them separately at the opcode level (to allow more compact serialization using VarInts), it would be nice if the interpreter treated numbers as byte strings, ideally ones with a fixed length.
    • Public keys and signatures could be trivially treated as byte strings. To ensure compact representation, we could dedicate some opcodes for representing constant-length byte data, so that public keys and signatures could be stored in the script without a length prefix.
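To put rough numbers on the overhead described in the list above, here is a small Rust sketch. It assumes the legacy layout is exactly as the bullets describe: a hex string costs a 64-bit length prefix plus two characters per data byte, and an opcode costs two 32-bit discriminants. The function and constant names are hypothetical and exist only for illustration.

```rust
/// Assumed legacy cost of a byte array stored as a hex string:
/// an 8-byte (64-bit) length prefix plus two hex characters per data byte.
pub fn legacy_hex_field_size(data_len: usize) -> usize {
    8 + 2 * data_len
}

/// Assumed legacy cost of one script opcode: a 32-bit StackEntry
/// discriminant followed by a 32-bit opcode number.
pub const LEGACY_OPCODE_SIZE: usize = 4 + 4;

/// Cost of the same data as a fixed-length raw byte array:
/// no prefix is needed because the length is known in advance.
pub fn compact_fixed_field_size(data_len: usize) -> usize {
    data_len
}
```

Under these assumptions a 32-byte hash occupies 72 bytes in the legacy encoding and 32 bytes as a raw fixed-length array.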

Potential difficulties:

  • Currently, transaction hashes consist of the letter g followed by 31 (!!!) hexadecimal digits. This means that in their current state, they cannot be decoded into an array of bytes, as they only contain 15.5 bytes of information. I’ve chosen to treat them as 16-byte arrays where the 4 trailing bits are zero; however, we could take this opportunity to extend them.
    • The same goes for P2SH addresses, which currently consist of the letter H followed by 63 hexadecimal digits. However, as there are no existing transactions on-chain which use P2SH, this can be changed to use 64 digits without any issues.
  • While the hash signed by P2PKH transactions is fortunately based on a textual representation of a transaction (see construct_tx_in_out_signable_hash()), the actual transaction hash is not. Instead, it’s based on the bincode serialization of the transaction object. This means that in order to handle legacy transactions, it will still be necessary to have code to serialize old transactions in the old format so that their hash can be determined.
  • get_stack_entry_signable_string, used when computing the signable message for druid transactions, relies on the exact types of stack elements. Changing this would also break existing transactions; however, it seems that there are no existing druid transactions on-chain, so this should be a safe change.
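The 15.5-byte transaction hash problem mentioned above can be illustrated with a short Rust sketch. It decodes the letter g plus 31 hexadecimal digits into a 16-byte array whose 4 trailing bits are zero. The function names are hypothetical, and the encoder assumes that legacy hashes use lowercase hexadecimal digits, which the text above does not state.

```rust
/// Decodes a legacy transaction hash ('g' followed by 31 hex digits)
/// into 16 bytes. The low 4 bits of the last byte are left as zero.
pub fn decode_legacy_tx_hash(s: &str) -> Option<[u8; 16]> {
    let digits = s.strip_prefix('g')?;
    if digits.len() != 31 || !digits.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let mut out = [0u8; 16];
    for (i, c) in digits.chars().enumerate() {
        let nibble = c.to_digit(16)? as u8;
        // Even positions fill the high nibble, odd positions the low nibble.
        if i % 2 == 0 {
            out[i / 2] |= nibble << 4;
        } else {
            out[i / 2] |= nibble;
        }
    }
    Some(out)
}

/// Re-creates the legacy string form (assumption: lowercase hex digits).
/// The final nibble is dropped, so it must be zero to round-trip.
pub fn encode_legacy_tx_hash(hash: &[u8; 16]) -> String {
    let hex: String = hash.iter().map(|b| format!("{:02x}", b)).collect();
    format!("g{}", &hex[..31])
}
```

Because the 31st digit lands in the high nibble of the last byte, a decoded hash always ends in a byte of the form 0xN0.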

Specification

Transaction

Field name | Field type | Notes
version | varint(u64) |
inputs_len | varint(u32) | Number of elements in the following array
inputs | Array(TxIn) |
outputs_len | varint(u32) | Number of elements in the following array
outputs | Array(TxOut) |
fees_len | varint(u32) | Number of elements in the following array
fees | Array(TxOut) |
druid_info | DruidInfo |
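The tables in this specification use varint fields without fixing a concrete variable-length integer format. As one possible choice, the Rust sketch below uses unsigned LEB128 (7 data bits per byte, least significant group first, high bit set on every byte except the last). This is an assumption made for illustration, not a decision recorded in this document, and the function names are hypothetical.

```rust
/// Appends `value` as an unsigned LEB128 varint (assumed encoding).
pub fn write_varint(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            // Last group: high bit clear.
            out.push(byte);
            return;
        }
        // More groups follow: high bit set.
        out.push(byte | 0x80);
    }
}

/// Reads a varint from the start of `input`.
/// Returns the value and the number of bytes consumed, or None if the
/// input is truncated or does not fit in a u64.
pub fn read_varint(input: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in input.iter().enumerate() {
        let shift = 7 * i as u32;
        if shift >= 64 {
            return None;
        }
        let chunk = (byte & 0x7f) as u64;
        // At shift 63 only a single bit still fits into a u64.
        if shift == 63 && chunk > 1 {
            return None;
        }
        value |= chunk << shift;
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}
```

With this encoding, counts below 128 (the common case for inputs_len, outputs_len and fees_len) occupy a single byte instead of the 8 bytes used by a 64-bit length prefix.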

Address

Field name | Field type | Notes
type | u8 enum | See below

type | Field name | Field type | Notes
0: P2PKH | pubkey_hash | Array(u8) (32) | SHA3-256 hash of the receiver’s public key
1: P2SH | lock_script_hash | Array(u8) (32) | SHA3-256 hash of the stringified lock script (see below)
2: Burn | no fields | |
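As an illustration of the Address layout, the following Rust sketch writes the one-byte type tag followed by the variant's fields, and reads it back. The enum and method names are hypothetical; only the tag values and field sizes come from the table.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Address {
    P2PKH { pubkey_hash: [u8; 32] },
    P2SH { lock_script_hash: [u8; 32] },
    Burn,
}

impl Address {
    /// Appends the u8 type tag, then the fixed-size fields of the variant.
    pub fn write(&self, out: &mut Vec<u8>) {
        match self {
            Address::P2PKH { pubkey_hash } => {
                out.push(0);
                out.extend_from_slice(pubkey_hash);
            }
            Address::P2SH { lock_script_hash } => {
                out.push(1);
                out.extend_from_slice(lock_script_hash);
            }
            Address::Burn => out.push(2),
        }
    }

    /// Parses an address from the start of `input`.
    /// Returns the address and the number of bytes consumed.
    pub fn read(input: &[u8]) -> Option<(Address, usize)> {
        let (&tag, rest) = input.split_first()?;
        match tag {
            0 | 1 => {
                let mut hash = [0u8; 32];
                hash.copy_from_slice(rest.get(..32)?);
                let address = if tag == 0 {
                    Address::P2PKH { pubkey_hash: hash }
                } else {
                    Address::P2SH { lock_script_hash: hash }
                };
                Some((address, 33))
            }
            2 => Some((Address::Burn, 1)),
            _ => None, // unknown type tag
        }
    }
}
```

Because both hashes have a fixed length of 32 bytes, no length prefix is written: a P2PKH or P2SH address occupies 33 bytes and a Burn address a single byte.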

Asset

Field name | Field type | Notes
type | u8 enum | See below

type | Field name | Field type | Notes
0: Token | amount | varint(u64) | The value in AIBCOIN
1: Item | amount | varint(u64) | The value in item asset tokens
 | genesis_hash | AssetGenesisHash | Identifies the transaction which created the item (see AssetGenesisHash)

AssetGenesisHash

Field name | Field type | Notes
type | u8 enum | See below

type | Field name | Field type | Notes
0: Create | no fields | | Only allowed in a TxOut in an item creation transaction. This indicates that the genesis hash is equal to the enclosing transaction’s hash.
1: Hash | hash | TxHash | The hash of the transaction which created the item.
2: Default | no fields | | Dummy value, seems to be used as a placeholder for a response from a DRUID transaction?
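The Asset and AssetGenesisHash layouts nest naturally as tagged enums. The Rust sketch below serializes both. The type and method names are hypothetical, and the varint encoding is assumed to be unsigned LEB128 because this document does not fix one.

```rust
/// Appends `value` as an unsigned LEB128 varint (assumed encoding).
fn write_varint(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

pub enum AssetGenesisHash {
    Create,
    Hash([u8; 16]), // a TxHash
    Default,
}

pub enum Asset {
    Token { amount: u64 },
    Item { amount: u64, genesis_hash: AssetGenesisHash },
}

impl AssetGenesisHash {
    /// Appends the u8 type tag, then the 16-byte hash for the Hash variant.
    pub fn write(&self, out: &mut Vec<u8>) {
        match self {
            AssetGenesisHash::Create => out.push(0),
            AssetGenesisHash::Hash(hash) => {
                out.push(1);
                out.extend_from_slice(hash);
            }
            AssetGenesisHash::Default => out.push(2),
        }
    }
}

impl Asset {
    /// Appends the u8 type tag, the varint amount and, for items,
    /// the nested genesis hash.
    pub fn write(&self, out: &mut Vec<u8>) {
        match self {
            Asset::Token { amount } => {
                out.push(0);
                write_varint(*amount, out);
            }
            Asset::Item { amount, genesis_hash } => {
                out.push(1);
                write_varint(*amount, out);
                genesis_hash.write(out);
            }
        }
    }
}
```

Under these assumptions a token amount of 1000 serializes to three bytes (tag 0, then the two-byte varint), and a single newly created item also serializes to three bytes (tag 1, amount 1, genesis tag 0).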

DruidExpectation

Field name | Field type | Notes
from | Array(u8) (32) | SHA3-256 hash of the transaction inputs which need to be spent (TODO: this is currently completely broken?)
to | Address | The address to which the value must be sent
asset | Asset | The asset which must be transferred to the to address

DruidInfo

Field name | Field type | Notes
participants | varint(u32) | If 0, the DruidInfo is considered absent and the subsequent fields are omitted
druid_len | varint(u32) | The length of the following array
druid | Array(u8) | UTF-8 string
expectations_len | varint(u32) | Number of elements in the following array
expectations | Array(DruidExpectation) |
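The "participants is 0 means absent" rule can be sketched in Rust as follows. To stay short, this sketch keeps each DruidExpectation as an opaque, already-serialized byte vector, and it assumes unsigned LEB128 for varints; both choices and all names are illustrative assumptions rather than part of the specification.

```rust
/// Appends `value` as an unsigned LEB128 varint (assumed encoding).
fn write_varint(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

pub struct DruidInfo {
    pub participants: u32,
    pub druid: String,
    /// Each entry is an already-serialized DruidExpectation
    /// (kept opaque in this sketch).
    pub expectations: Vec<Vec<u8>>,
}

/// Writes an optional DruidInfo. An absent value is a single varint 0.
pub fn write_druid_info(
    info: Option<&DruidInfo>,
    out: &mut Vec<u8>,
) -> Result<(), &'static str> {
    let info = match info {
        None => {
            write_varint(0, out);
            return Ok(());
        }
        Some(info) => info,
    };
    // A present DruidInfo with 0 participants would be read back as absent.
    if info.participants == 0 {
        return Err("participants must be non-zero when DruidInfo is present");
    }
    write_varint(u64::from(info.participants), out);
    write_varint(info.druid.len() as u64, out);
    out.extend_from_slice(info.druid.as_bytes());
    write_varint(info.expectations.len() as u64, out);
    for expectation in &info.expectations {
        out.extend_from_slice(expectation);
    }
    Ok(())
}
```

One consequence of this layout is that a transaction without DRUID data pays exactly one byte for the druid_info field.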

OutPoint

Field name | Field type | Notes
t_hash | TxHash | A transaction hash
n | varint(u32) | The index of an output in the transaction with the specified hash

TxHash

Field name | Field type | Notes
hash | Array(u8) (16) | SHA3-256 hash of a transaction (see below). The hash is truncated to 16 bytes, and the 16th byte is masked with 0xF0.
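The truncate-and-mask step can be sketched in Rust as below. The sketch takes a 32-byte SHA3-256 digest as input rather than computing it, and it assumes that truncation keeps the leading 16 bytes of the digest, which the table does not state explicitly. The function name is hypothetical.

```rust
/// Derives the 16-byte TxHash from a 32-byte SHA3-256 digest.
/// Assumption: truncation keeps the first 16 bytes of the digest.
pub fn tx_hash_from_digest(digest: &[u8; 32]) -> [u8; 16] {
    let mut hash = [0u8; 16];
    hash.copy_from_slice(&digest[..16]);
    // Clear the low 4 bits of the 16th byte so that only 15.5 bytes of
    // information remain, matching the legacy 31-digit form.
    hash[15] &= 0xF0;
    hash
}
```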

TxIn

Field name | Field type | Notes
type | u8 enum | See below

type | Field name | Field type | Notes
0: P2PKH | previous_out | OutPoint | The P2PKH transaction output being spent
 | public_key | PublicKey |
 | signature | Signature |
1: P2SH | previous_out | OutPoint | The P2SH transaction output being spent
 | unlock_script | Script |
 | lock_script | Script |
2: CreateItem | block_number | varint(u64) | The current block number, used as a placeholder to prevent replay attacks
 | public_key | PublicKey |
 | signature | Signature |
3: Coinbase | block_number | varint(u64) | The mined block’s block number
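To show how the TxIn variants avoid serializing a full script, here is a Rust sketch of the writer. It deliberately omits the P2SH variant (tag 1), since the Script layout is still undecided, and it assumes unsigned LEB128 for varints. All type and method names are hypothetical.

```rust
/// Appends `value` as an unsigned LEB128 varint (assumed encoding).
fn write_varint(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

pub struct OutPoint {
    pub t_hash: [u8; 16],
    pub n: u32,
}

/// TxIn without the P2SH variant (tag 1), whose Script layout is TBD.
pub enum TxIn {
    P2PKH { previous_out: OutPoint, public_key: [u8; 32], signature: [u8; 64] },
    CreateItem { block_number: u64, public_key: [u8; 32], signature: [u8; 64] },
    Coinbase { block_number: u64 },
}

impl TxIn {
    /// Appends the u8 type tag followed by the variant's fields.
    pub fn write(&self, out: &mut Vec<u8>) {
        match self {
            TxIn::P2PKH { previous_out, public_key, signature } => {
                out.push(0);
                out.extend_from_slice(&previous_out.t_hash);
                write_varint(u64::from(previous_out.n), out);
                out.extend_from_slice(public_key);
                out.extend_from_slice(signature);
            }
            TxIn::CreateItem { block_number, public_key, signature } => {
                out.push(2);
                write_varint(*block_number, out);
                out.extend_from_slice(public_key);
                out.extend_from_slice(signature);
            }
            TxIn::Coinbase { block_number } => {
                out.push(3);
                write_varint(*block_number, out);
            }
        }
    }
}
```

Under these assumptions, a P2PKH input spending a low-numbered output occupies 1 + 16 + 1 + 32 + 64 = 114 bytes, with no length prefixes on the key or signature.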

TxOut

Field name | Field type | Notes
value | Asset |
locktime | varint(u64) |
script_public_key | Address | In spite of the name (which is preserved for legacy reasons), this is actually the receiver’s address.

PublicKey

Field name | Field type | Notes
public_key | Array(u8) (32) | An Ed25519 public key

Signature

Field name | Field type | Notes
signature | Array(u8) (64) | An Ed25519 signature

Script

Field name | Field type | Notes
len | varint(u32) | Number of elements in the following array
ops | Array(TBD) | Undecided as to how scripts should be stored. Do we break existing scripts and switch to a new format which unifies numbers, public keys, signatures and bytes?

Example

TBD

Considerations

TBD