Appendix A. The JPA archive format, v.1.2

Design goals

The JPA format is a compressed archive format designed specifically so that a PHP script can create archives efficiently. It is similar in design to the PKZIP format, with a few notable differences:

  • CRC32 is not used; calculation of file checksums is time consuming and can lead to errors when attempted on large files from a script running under PHP4, or a script running on PHP5 without the hash extension.

  • The only allowed compression methods are store, deflate and (discouraged) Bzip2.

  • There is no Central Directory (simplifies management of the file).

  • File permissions (UNIX style) are stored within the file.

Even though JPA is designed for use by PHP scripts, creating a command-line utility, a programming library or even a GUI program in any other language is still possible. JPA is not supposed to have high compression ratios, or to be as secure and error-tolerant as other archive formats. It is merely an attempt to provide the best compromise for creating archives of very large directory trees using nothing but PHP code.

This is an open format. You may use it in any commercial or non-commercial application royalty-free. Even though the PHP implementation is GPL-licensed, we can provide it under commercial-friendly licenses, e.g. LGPL v3. Please ask us if you want to use it in your own software.

Structure of an archive

An archive consists of exactly one Standard Header and one or more Entity Blocks. Each Entity Block consists of exactly one Entity Description Block and at most one File Data Block. All values are stored in little-endian byte order, unless otherwise specified.

All textual data, e.g. file names and symlink targets, must be written as UTF-8 strings without a null terminator, for the widest compatibility possible. (UTF-8 is a byte-oriented encoding, so byte order does not apply to it.)

Standard Header

The function of the Standard Header is to allow identification of the archive format and supply the client with general information regarding the archive at hand. It is a binary block appearing at the beginning of the archive file and there alone. It consists of the following data (in order of appearance):

Signature, 3 bytes

The bytes 0x4A 0x50 0x41 (uppercase ASCII string “JPA”) used for identification purposes.

Header length, 2 bytes

Unsigned short integer represented as two bytes, holding the size of the header in bytes. This is currently fixed at 19 bytes; the field exists to allow for forward compatibility. When extra header fields are present, this value will be 19 plus the length of all extra fields.

Major version, 1 byte

Unsigned integer represented as a single byte, holding the archive format major version, e.g. 0x01 for version 1.2.

Minor version, 1 byte

Unsigned integer represented as a single byte, holding the archive format minor version, e.g. 0x02 for version 1.2.

File count, 4 bytes

Unsigned long integer represented as four bytes, holding the number of files present in the archive.

Uncompressed size, 4 bytes

Unsigned long integer represented as four bytes, holding the total size of the archive's files when uncompressed.

Compressed size, 4 bytes

Unsigned long integer represented as four bytes, holding the total size of the archive's files in their stored (compressed) form.
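The field list above maps directly onto a fixed 19-byte little-endian record. The following is a minimal sketch in Python of how such a header could be written and read back. The function names are hypothetical and not part of any reference implementation, and extra header fields are only accounted for in the length value, not written.

```python
import struct

# Field order as listed above: signature (3), header length (2), major (1),
# minor (1), file count (4), uncompressed size (4), compressed size (4).
# "<" selects little-endian byte order with no padding between fields.
STANDARD_HEADER = struct.Struct("<3sHBBIII")


def pack_standard_header(file_count, uncompressed_size, compressed_size,
                         major=1, minor=2, extra_length=0):
    """Build the Standard Header bytes (hypothetical helper).

    extra_length is the total size of any extra header fields that the
    caller writes right after this header.
    """
    return STANDARD_HEADER.pack(
        b"JPA",
        STANDARD_HEADER.size + extra_length,
        major,
        minor,
        file_count,
        uncompressed_size,
        compressed_size,
    )


def parse_standard_header(data):
    """Decode the Standard Header found at the start of `data`."""
    (signature, header_length, major, minor,
     file_count, uncompressed_size, compressed_size) = STANDARD_HEADER.unpack_from(data)
    if signature != b"JPA":
        raise ValueError("missing JPA signature")
    return {
        "header_length": header_length,
        "version": (major, minor),
        "file_count": file_count,
        "uncompressed_size": uncompressed_size,
        "compressed_size": compressed_size,
    }
```

A reader would typically use the decoded header length, rather than the constant 19, to find where the first Entity Block (or the first extra header field) begins.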

Extra Header Field - Spanned Archive Marker

This is an optional field, written after the Standard Header but before the first Entity Block, denoting that the current archive spans multiple files. Its structure is:

Signature, 4 bytes

The bytes 0x4A, 0x50, 0x01, 0x01.

Extra Field Length, 2 bytes

The length of the extra field, without counting the signature length. Its value is fixed and equals 4.

Number of parts, 2 bytes

The total number of parts this archive consists of.

When creating spanned archives, the first file (part) of the archive set has an extension of .j01, the next part has an extension of .j02 and so on. The last file of the archive set has the extension .jpa.

When creating spanned archives you must ensure that the Entity Description Block is within the limits of a single part, i.e. the contents of the Entity Description Block must not cross part boundaries. The File Data Block data can cross one or more part boundaries.
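As an illustration of the marker layout and the part naming rule described above, here is a minimal Python sketch. The helper names are hypothetical. The two-digit zero padding is inferred from the .j01 and .j02 examples; the text does not say what happens beyond 99 parts, so that case is left to the default integer formatting.

```python
import struct

# Signature (4 bytes), extra field length (2 bytes), number of parts (2 bytes).
SPAN_MARKER = struct.Struct("<4sHH")
SPAN_SIGNATURE = bytes([0x4A, 0x50, 0x01, 0x01])


def pack_span_marker(total_parts):
    """Build the Spanned Archive Marker; the length value is fixed at 4."""
    return SPAN_MARKER.pack(SPAN_SIGNATURE, 4, total_parts)


def parse_span_marker(data, offset=0):
    """Return the number of parts stored in a marker located at `offset`."""
    signature, length, total_parts = SPAN_MARKER.unpack_from(data, offset)
    if signature != SPAN_SIGNATURE or length != 4:
        raise ValueError("not a Spanned Archive Marker")
    return total_parts


def part_names(base_name, total_parts):
    """List part file names in order: .j01, .j02, ... and finally .jpa."""
    names = [f"{base_name}.j{index:02d}" for index in range(1, total_parts)]
    names.append(f"{base_name}.jpa")
    return names
```

For a three-part set named "site", `part_names("site", 3)` yields `site.j01`, `site.j02` and `site.jpa`, in that order.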

Entity Block

An Entity Block is merely the aggregation of an Entity Description Block and at most one File Data Block. At present, an Entity can be a File, a Directory or a Symbolic Link. If the Entity is a File of zero length or a Directory, the File Data Block is omitted. In any other case, the File Data Block must exist.

Entity Description Block

The function of the Entity Description Block is to provide the client information about an Entity included in the archive. The client can then use this information in order to reconstruct a copy of the Entity on the client's file system. It is a binary block consisting of the following data (in order of appearance):

Signature, 3 bytes

The bytes 0x4A, 0x50, 0x46 (uppercase ASCII string “JPF”) used for identification purposes.

Block length, 2 bytes

Unsigned short integer, represented as 2 bytes, holding the total size of this Entity Description Block.

Length of entity path, 2 bytes.

Unsigned short integer, represented as 2 bytes, holding the size of the entity path data below.

Entity path data, variable length.

Holds the complete (relative) path of the Entity as a UTF-8 encoded string, without trailing null. The path separator must be a forward slash (“/”), even on systems which use a different path separator, e.g. Windows.

Entity type, 1 byte.
  • 0x00 for directories (instructs the client to recursively create the directory specified in Entity path data).

  • 0x01 for files (instructs the client to reconstruct the file specified in Entity path data).

  • 0x02 for symbolic links (instructs the client to create a symbolic link whose target is stored, uncompressed, as the entity's File Data Block). When the type is 0x02 the Compression Type MUST be 0x00 as well.

Compression type, 1 byte.
  • 0x00 for no compression; the data contained in File Data Block should be written as-is to the file. Also used for directories, symbolic links and zero-sized files.

  • 0x01 for deflate (Gzip) compression; the data contained in File Data Block must be inflated (decompressed) using Gzip before being written to the file.

  • 0x02 for Bzip2 compression; the data contained in File Data Block must be uncompressed using Bzip2 before being written to the file. This is generally discouraged, as both the archiving and unarchiving scripts must be run in a PHP environment which supports the bzip2 library.

Compressed size, 4 bytes

An unsigned long integer representing the size of the File Data Block in bytes. For directories, symlinks and zero-sized files it is zero (0x00000000).

Uncompressed size, 4 bytes

An unsigned long integer representing the size of the resulting file in bytes. For directories, symlinks and zero-sized files it is zero (0x00000000).

Entity permissions, 4 bytes

UNIX-style permissions of the stored entity.

Extra fields data, variable length

The extra fields for each file are stored here. The total length of extra fields is included in the Block Length above.

Each Extra Field consists of:

Extra Field Identifier, 2 bytes

A signature denoting the data stored in the extra field.

Extra Field Length, 2 bytes

The length (in bytes) of the Extra Field Data.

Extra Field Data, variable length

The internal structure varies by the type of the Extra Field, as noted in the Extra Field Identifier.
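To make the Entity Description Block layout more concrete, the sketch below builds and parses one in Python. It rests on several assumptions that are not confirmed by the text: the Block length is read literally as the size of the whole block, including the signature and the length field itself; the path is encoded as UTF-8, following the general rule for textual data; permissions are handled as an unsigned 32-bit little-endian integer; and extra fields are passed through as opaque bytes. All function names are hypothetical.

```python
import struct

# Fields after the path: type (1), compression (1), compressed size (4),
# uncompressed size (4), permissions (4), all little-endian.
FIXED_TAIL = struct.Struct("<BBIII")
PREFIX = struct.Struct("<3sHH")  # signature, block length, path length


def pack_entity_description(path, entity_type, compression,
                            compressed_size, uncompressed_size,
                            permissions, extra=b""):
    """Build an Entity Description Block (assumed layout, see lead-in)."""
    path_bytes = path.encode("utf-8")
    tail = FIXED_TAIL.pack(entity_type, compression, compressed_size,
                           uncompressed_size, permissions)
    # Assumption: block length counts every byte of the block.
    block_length = PREFIX.size + len(path_bytes) + len(tail) + len(extra)
    prefix = PREFIX.pack(b"JPF", block_length, len(path_bytes))
    return prefix + path_bytes + tail + extra


def parse_entity_description(data, offset=0):
    """Decode the Entity Description Block that starts at `offset`."""
    signature, block_length, path_length = PREFIX.unpack_from(data, offset)
    if signature != b"JPF":
        raise ValueError("missing JPF signature")
    path_start = offset + PREFIX.size
    tail_start = path_start + path_length
    entity_type, compression, csize, usize, perms = FIXED_TAIL.unpack_from(
        data, tail_start)
    return {
        "block_length": block_length,
        "path": data[path_start:tail_start].decode("utf-8"),
        "type": entity_type,
        "compression": compression,
        "compressed_size": csize,
        "uncompressed_size": usize,
        "permissions": perms,
        # Whatever remains inside the block is extra field data.
        "extra": data[tail_start + FIXED_TAIL.size:offset + block_length],
    }
```

Under these assumptions a block without extra fields occupies 21 bytes plus the length of the encoded path.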

Timestamp Extra Field

Its purpose is to store the date and time the file was last modified. This extra field should be omitted for directories and symlinks or, if present, its Timestamp should be set to 0x00000000. Its format is:

Extra Field Identifier, 2 bytes

The bytes 0x00 0x01.

Extra Field Length, 2 bytes

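The value 0x08 stored in little-endian format (the bytes 0x08 0x00). Note that 8 matches the size of the whole extra field (2 + 2 + 4 bytes), not just the 4-byte Timestamp.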

Timestamp, 4 bytes

A 4-byte UNIX timestamp of the file's modification time, as returned by filemtime().
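The Timestamp Extra Field is small enough to show byte by byte. The Python sketch below writes the identifier bytes exactly as listed, followed by the length value 0x08 and the timestamp. It assumes the timestamp is an unsigned 32-bit little-endian integer, which follows from the general byte-order rule but is not stated for this field. The function names are hypothetical.

```python
import struct

TIMESTAMP_IDENTIFIER = bytes([0x00, 0x01])


def pack_timestamp_field(mtime):
    """Build the 8-byte Timestamp Extra Field for a modification time."""
    # Length (2 bytes, value 8) and timestamp (4 bytes), both little-endian.
    return TIMESTAMP_IDENTIFIER + struct.pack("<HI", 8, int(mtime))


def parse_timestamp_field(data, offset=0):
    """Return the UNIX timestamp stored in a Timestamp Extra Field."""
    if data[offset:offset + 2] != TIMESTAMP_IDENTIFIER:
        raise ValueError("not a Timestamp Extra Field")
    length, timestamp = struct.unpack_from("<HI", data, offset + 2)
    if length != 8:
        raise ValueError("unexpected Timestamp Extra Field length")
    return timestamp
```

In Python the value passed in would usually come from `os.path.getmtime()`, which plays the same role as PHP's `filemtime()`.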

File Data Block

The File Data Block is only present if the Entity is a file with a non-zero file size. It can consist of one and only one of the following, depending on the Compression Type:

  • Binary dump of file contents or textual representation of the symlink's target, for CT=0x00

  • Gzip compression output, without the trailing Adler32 checksum, for CT=0x01

  • Bzip2 compression output, for CT=0x02
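A reader has to pick a decoder based on the Compression Type. The following Python sketch shows one way to do that. It makes a notable assumption: the CT=0x01 payload is treated as a raw deflate stream, i.e. with neither a zlib header nor a trailing checksum, which is what PHP's `gzdeflate()` produces. If an archive keeps the 2-byte zlib header, the `wbits` argument would need to change. The function name is hypothetical.

```python
import bz2
import zlib


def decode_file_data(payload, compression_type):
    """Turn a File Data Block payload back into the original bytes."""
    if compression_type == 0x00:
        # Stored: file contents or symlink target, written as-is.
        return payload
    if compression_type == 0x01:
        # Assumption: raw deflate stream; wbits=-15 tells zlib to expect
        # no header and no trailing checksum.
        return zlib.decompress(payload, -15)
    if compression_type == 0x02:
        return bz2.decompress(payload)
    raise ValueError(f"unknown compression type: {compression_type:#04x}")
```

Because the format stores no CRC32, a reader can only sanity-check the result by comparing its length with the Uncompressed size recorded in the Entity Description Block.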

Change Log

Revision History
June 2009, NKD: Updated to format version 1.1; fixed incorrect descriptions of header signatures.