Thursday 5 July 2012

Native C/C++ Like Performance For Java Object Serialisation

Do you ever wish you could turn a Java object into a stream of bytes as fast as it can be done in a native language like C++?  If you use standard Java Serialization you could be disappointed with the performance.  Java Serialization was designed for a very different purpose than serialising objects as quickly and compactly as possible.

Why do we need fast and compact serialisation?  Many of our systems are distributed and we need to communicate by passing state between processes efficiently.  This state lives inside our objects.  I've profiled many systems and often a large part of the cost is the serialisation of this state to-and-from byte buffers.  I've seen a significant range of protocols and mechanisms used to achieve this.  At one end of the spectrum are the easy-to-use but inefficient protocols like Java Serialisation, XML and JSON.  At the other end of this spectrum are the binary protocols that can be very fast and efficient but require a deeper understanding and greater skill.

In this article I will illustrate the performance gains that are possible when using simple binary protocols and introduce a little known technique available in Java to achieve similar performance to what is possible with native languages like C or C++.

The three approaches to be compared are:
  1. Java Serialization: The standard method in Java of having an object implement Serializable.
  2. Binary via ByteBuffer: A simple protocol using the ByteBuffer API to write the fields of an object in binary format.  This is our baseline for what is considered a good binary encoding approach.
  3. Binary via Unsafe: Introduction to Unsafe and its collection of methods that allow direct memory manipulation.  Here I will show how to get similar performance to C/C++.
The Code
import sun.misc.Unsafe;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.util.Arrays;

public final class TestSerialisationPerf
{
    public static final int REPETITIONS = 1 * 1000 * 1000;

    private static final ObjectToBeSerialised ITEM =
        new ObjectToBeSerialised(
            1010L, true, 777, 99,
            new double[]{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0},
            new long[]{1, 2, 3, 4, 5, 6, 7, 8, 9, 10});


    public static void main(final String[] arg) throws Exception
    {
        for (final PerformanceTestCase testCase : testCases)
        {
            for (int i = 0; i < 5; i++)
            {
                testCase.performTest();

                System.out.format("%d %s\twrite=%,dns read=%,dns total=%,dns\n",
                                  i,
                                  testCase.getName(),
                                  testCase.getWriteTimeNanos(),
                                  testCase.getReadTimeNanos(),
                                  testCase.getWriteTimeNanos() + 
                                  testCase.getReadTimeNanos());

                if (!ITEM.equals(testCase.getTestOutput()))
                {
                    throw new IllegalStateException("Objects do not match");
                }

                System.gc();
                Thread.sleep(3000);
            }
        }
    }

    private static final PerformanceTestCase[] testCases =
    {
        new PerformanceTestCase("Serialisation", REPETITIONS, ITEM)
        {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();

            public void testWrite(ObjectToBeSerialised item) throws Exception
            {
                for (int i = 0; i < REPETITIONS; i++)
                {
                    baos.reset();

                    ObjectOutputStream oos = new ObjectOutputStream(baos);
                    oos.writeObject(item);
                    oos.close();
                }
            }

            public ObjectToBeSerialised testRead() throws Exception
            {
                ObjectToBeSerialised object = null;
                for (int i = 0; i < REPETITIONS; i++)
                {
                    ByteArrayInputStream bais = 
                        new ByteArrayInputStream(baos.toByteArray());
                    ObjectInputStream ois = new ObjectInputStream(bais);
                    object = (ObjectToBeSerialised)ois.readObject();
                }

                return object;
            }
        },

        new PerformanceTestCase("ByteBuffer", REPETITIONS, ITEM)
        {
            ByteBuffer byteBuffer = ByteBuffer.allocate(1024);

            public void testWrite(ObjectToBeSerialised item) throws Exception
            {
                for (int i = 0; i < REPETITIONS; i++)
                {
                    byteBuffer.clear();
                    item.write(byteBuffer);
                }
            }

            public ObjectToBeSerialised testRead() throws Exception
            {
                ObjectToBeSerialised object = null;
                for (int i = 0; i < REPETITIONS; i++)
                {
                    byteBuffer.flip();
                    object = ObjectToBeSerialised.read(byteBuffer);
                }

                return object;
            }
        },

        new PerformanceTestCase("UnsafeMemory", REPETITIONS, ITEM)
        {
            UnsafeMemory buffer = new UnsafeMemory(new byte[1024]);

            public void testWrite(ObjectToBeSerialised item) throws Exception
            {
                for (int i = 0; i < REPETITIONS; i++)
                {
                    buffer.reset();
                    item.write(buffer);
                }
            }

            public ObjectToBeSerialised testRead() throws Exception
            {
                ObjectToBeSerialised object = null;
                for (int i = 0; i < REPETITIONS; i++)
                {
                    buffer.reset();
                    object = ObjectToBeSerialised.read(buffer);
                }

                return object;
            }
        },
    };
}

abstract class PerformanceTestCase
{
    private final String name;
    private final int repetitions;
    private final ObjectToBeSerialised testInput;
    private ObjectToBeSerialised testOutput;
    private long writeTimeNanos;
    private long readTimeNanos;

    public PerformanceTestCase(final String name, final int repetitions,
                               final ObjectToBeSerialised testInput)
    {
        this.name = name;
        this.repetitions = repetitions;
        this.testInput = testInput;
    }

    public String getName()
    {
        return name;
    }

    public ObjectToBeSerialised getTestOutput()
    {
        return testOutput;
    }

    public long getWriteTimeNanos()
    {
        return writeTimeNanos;
    }

    public long getReadTimeNanos()
    {
        return readTimeNanos;
    }

    public void performTest() throws Exception
    {
        final long startWriteNanos = System.nanoTime();
        testWrite(testInput);
        writeTimeNanos = (System.nanoTime() - startWriteNanos) / repetitions;

        final long startReadNanos = System.nanoTime();
        testOutput = testRead();
        readTimeNanos = (System.nanoTime() - startReadNanos) / repetitions;
    }

    public abstract void testWrite(ObjectToBeSerialised item) throws Exception;
    public abstract ObjectToBeSerialised testRead() throws Exception;
}

class ObjectToBeSerialised implements Serializable
{
    private static final long serialVersionUID = 10275539472837495L;

    private final long sourceId;
    private final boolean special;
    private final int orderCode;
    private final int priority;
    private final double[] prices;
    private final long[] quantities;

    public ObjectToBeSerialised(final long sourceId, final boolean special,
                                final int orderCode, final int priority,
                                final double[] prices, final long[] quantities)
    {
        this.sourceId = sourceId;
        this.special = special;
        this.orderCode = orderCode;
        this.priority = priority;
        this.prices = prices;
        this.quantities = quantities;
    }

    public void write(final ByteBuffer byteBuffer)
    {
        byteBuffer.putLong(sourceId);
        byteBuffer.put((byte)(special ? 1 : 0));
        byteBuffer.putInt(orderCode);
        byteBuffer.putInt(priority);

        byteBuffer.putInt(prices.length);
        for (final double price : prices)
        {
            byteBuffer.putDouble(price);
        }

        byteBuffer.putInt(quantities.length);
        for (final long quantity : quantities)
        {
            byteBuffer.putLong(quantity);
        }
    }

    public static ObjectToBeSerialised read(final ByteBuffer byteBuffer)
    {
        final long sourceId = byteBuffer.getLong();
        final boolean special = 0 != byteBuffer.get();
        final int orderCode = byteBuffer.getInt();
        final int priority = byteBuffer.getInt();

        final int pricesSize = byteBuffer.getInt();
        final double[] prices = new double[pricesSize];
        for (int i = 0; i < pricesSize; i++)
        {
            prices[i] = byteBuffer.getDouble();
        }

        final int quantitiesSize = byteBuffer.getInt();
        final long[] quantities = new long[quantitiesSize];
        for (int i = 0; i < quantitiesSize; i++)
        {
            quantities[i] = byteBuffer.getLong();
        }

        return new ObjectToBeSerialised(sourceId, special, orderCode, 
                                        priority, prices, quantities);
    }

    public void write(final UnsafeMemory buffer)
    {
        buffer.putLong(sourceId);
        buffer.putBoolean(special);
        buffer.putInt(orderCode);
        buffer.putInt(priority);
        buffer.putDoubleArray(prices);
        buffer.putLongArray(quantities);
    }

    public static ObjectToBeSerialised read(final UnsafeMemory buffer)
    {
        final long sourceId = buffer.getLong();
        final boolean special = buffer.getBoolean();
        final int orderCode = buffer.getInt();
        final int priority = buffer.getInt();
        final double[] prices = buffer.getDoubleArray();
        final long[] quantities = buffer.getLongArray();

        return new ObjectToBeSerialised(sourceId, special, orderCode, 
                                        priority, prices, quantities);
    }

    public boolean equals(final Object o)
    {
        if (this == o)
        {
            return true;
        }
        if (o == null || getClass() != o.getClass())
        {
            return false;
        }

        final ObjectToBeSerialised that = (ObjectToBeSerialised)o;

        if (orderCode != that.orderCode)
        {
            return false;
        }
        if (priority != that.priority)
        {
            return false;
        }
        if (sourceId != that.sourceId)
        {
            return false;
        }
        if (special != that.special)
        {
            return false;
        }
        if (!Arrays.equals(prices, that.prices))
        {
            return false;
        }
        if (!Arrays.equals(quantities, that.quantities))
        {
            return false;
        }

        return true;
    }
}

class UnsafeMemory
{
    private static final Unsafe unsafe;
    static
    {
        try
        {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            unsafe = (Unsafe)field.get(null);
        }
        catch (Exception e)
        {
            throw new RuntimeException(e);
        }
    }

    private static final long byteArrayOffset = unsafe.arrayBaseOffset(byte[].class);
    private static final long longArrayOffset = unsafe.arrayBaseOffset(long[].class);
    private static final long doubleArrayOffset = unsafe.arrayBaseOffset(double[].class);

    private static final int SIZE_OF_BOOLEAN = 1;
    private static final int SIZE_OF_INT = 4;
    private static final int SIZE_OF_LONG = 8;

    private int pos = 0;
    private final byte[] buffer;

    public UnsafeMemory(final byte[] buffer)
    {
        if (null == buffer)
        {
            throw new NullPointerException("buffer cannot be null");
        }

        this.buffer = buffer;
    }

    public void reset()
    {
        this.pos = 0;
    }

    public void putBoolean(final boolean value)
    {
        unsafe.putBoolean(buffer, byteArrayOffset + pos, value);
        pos += SIZE_OF_BOOLEAN;
    }

    public boolean getBoolean()
    {
        boolean value = unsafe.getBoolean(buffer, byteArrayOffset + pos);
        pos += SIZE_OF_BOOLEAN;

        return value;
    }

    public void putInt(final int value)
    {
        unsafe.putInt(buffer, byteArrayOffset + pos, value);
        pos += SIZE_OF_INT;
    }

    public int getInt()
    {
        int value = unsafe.getInt(buffer, byteArrayOffset + pos);
        pos += SIZE_OF_INT;

        return value;
    }

    public void putLong(final long value)
    {
        unsafe.putLong(buffer, byteArrayOffset + pos, value);
        pos += SIZE_OF_LONG;
    }

    public long getLong()
    {
        long value = unsafe.getLong(buffer, byteArrayOffset + pos);
        pos += SIZE_OF_LONG;

        return value;
    }

    public void putLongArray(final long[] values)
    {
        putInt(values.length);

        long bytesToCopy = values.length << 3;
        unsafe.copyMemory(values, longArrayOffset,
                          buffer, byteArrayOffset + pos,
                          bytesToCopy);
        pos += bytesToCopy;
    }

    public long[] getLongArray()
    {
        int arraySize = getInt();
        long[] values = new long[arraySize];

        long bytesToCopy = values.length << 3;
        unsafe.copyMemory(buffer, byteArrayOffset + pos,
                          values, longArrayOffset,
                          bytesToCopy);
        pos += bytesToCopy;

        return values;
    }

    public void putDoubleArray(final double[] values)
    {
        putInt(values.length);

        long bytesToCopy = values.length << 3;
        unsafe.copyMemory(values, doubleArrayOffset,
                          buffer, byteArrayOffset + pos,
                          bytesToCopy);
        pos += bytesToCopy;
    }

    public double[] getDoubleArray()
    {
        int arraySize = getInt();
        double[] values = new double[arraySize];

        long bytesToCopy = values.length << 3;
        unsafe.copyMemory(buffer, byteArrayOffset + pos,
                          values, doubleArrayOffset,
                          bytesToCopy);
        pos += bytesToCopy;

        return values;
    }
}

Results
2.8GHz Nehalem - Java 1.7.0_04
==============================
0 Serialisation  write=2,517ns read=11,570ns total=14,087ns
1 Serialisation  write=2,198ns read=11,122ns total=13,320ns
2 Serialisation  write=2,190ns read=11,011ns total=13,201ns
3 Serialisation  write=2,221ns read=10,972ns total=13,193ns
4 Serialisation  write=2,187ns read=10,817ns total=13,004ns
0 ByteBuffer     write=264ns   read=273ns    total=537ns
1 ByteBuffer     write=248ns   read=243ns    total=491ns
2 ByteBuffer     write=262ns   read=243ns    total=505ns
3 ByteBuffer     write=300ns   read=240ns    total=540ns
4 ByteBuffer     write=247ns   read=243ns    total=490ns
0 UnsafeMemory   write=99ns    read=84ns     total=183ns
1 UnsafeMemory   write=53ns    read=82ns     total=135ns
2 UnsafeMemory   write=63ns    read=66ns     total=129ns
3 UnsafeMemory   write=46ns    read=63ns     total=109ns
4 UnsafeMemory   write=48ns    read=58ns     total=106ns

2.4GHz Sandy Bridge - Java 1.7.0_04
===================================
0 Serialisation  write=1,940ns read=9,006ns total=10,946ns
1 Serialisation  write=1,674ns read=8,567ns total=10,241ns
2 Serialisation  write=1,666ns read=8,680ns total=10,346ns
3 Serialisation  write=1,666ns read=8,623ns total=10,289ns
4 Serialisation  write=1,715ns read=8,586ns total=10,301ns
0 ByteBuffer     write=199ns   read=198ns   total=397ns
1 ByteBuffer     write=176ns   read=178ns   total=354ns
2 ByteBuffer     write=174ns   read=174ns   total=348ns
3 ByteBuffer     write=172ns   read=183ns   total=355ns
4 ByteBuffer     write=174ns   read=180ns   total=354ns
0 UnsafeMemory   write=38ns    read=75ns    total=113ns
1 UnsafeMemory   write=26ns    read=52ns    total=78ns
2 UnsafeMemory   write=26ns    read=51ns    total=77ns
3 UnsafeMemory   write=25ns    read=51ns    total=76ns
4 UnsafeMemory   write=27ns    read=50ns    total=77ns

Analysis

Writing and reading back a single relatively small object on my fast 2.4 GHz Sandy Bridge laptop takes ~10,000ns using Java Serialization, whereas with Unsafe this comes down to well under 100ns, even accounting for the test code itself.  To put this in context, with Java Serialization the cost is on a par with a network hop!  That would be very costly if your transport is a fast IPC mechanism on the same system.

There are numerous reasons why Java Serialisation is so costly.  For example, it writes out the fully qualified class and field names for each object plus version information.  ObjectOutputStream also keeps a collection of all written objects so that repeated references can be written as back-references rather than full copies.  Java Serialisation requires 340 bytes for this example object, whereas the binary versions require only 185 bytes.  Details of the format can be found in the Java Object Serialization Specification.  If I had not used arrays for the majority of the data, then the serialised object would have been significantly larger with Java Serialization because of the field names.  In my experience text-based protocols like XML and JSON can be even less efficient than Java Serialization.  Also be aware that Java Serialization is the standard mechanism employed for RMI.
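
The size difference is easy to check.  Below is a minimal sketch that reuses the ObjectToBeSerialised class from the example above and prints the encoded size for both approaches (SizeComparison is just an illustrative name):

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.nio.ByteBuffer;

public final class SizeComparison
{
    public static void main(final String[] args) throws Exception
    {
        final ObjectToBeSerialised item = new ObjectToBeSerialised(
            1010L, true, 777, 99,
            new double[]{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0},
            new long[]{1, 2, 3, 4, 5, 6, 7, 8, 9, 10});

        // Java Serialization: class metadata, field names, and field values
        final ByteArrayOutputStream baos = new ByteArrayOutputStream();
        final ObjectOutputStream oos = new ObjectOutputStream(baos);
        oos.writeObject(item);
        oos.close();
        System.out.println("Java Serialization: " + baos.size() + " bytes");

        // Simple binary protocol: just the field values and array lengths
        final ByteBuffer byteBuffer = ByteBuffer.allocate(1024);
        item.write(byteBuffer);
        System.out.println("Binary protocol:    " + byteBuffer.position() + " bytes");
    }
}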

The real issue is the number of instructions to be executed.  The Unsafe method wins by a significant margin because in Hotspot, and many other JVMs, the optimiser treats these operations as intrinsics and replaces the calls with assembly instructions that perform the memory manipulation directly.  For primitive types this results in a single x86 MOV instruction, which can often complete in a single cycle.  The details can be seen by having Hotspot output the optimised code, as I described in a previous article.
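
For reference, on a JVM with the hsdis disassembler plugin installed, the optimised code for the hot methods can be dumped with the diagnostic flags below (a sketch of the invocation only; your class path and main class may differ):

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly TestSerialisationPerf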

Now it has to be said that "with great power comes great responsibility": using Unsafe is effectively the same as programming in C, and with that come memory access violations when you get the offsets wrong.
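
One way to take some of the edge off is to guard each access explicitly.  Below is a minimal sketch of a bounds-checked variant of putLong that could be added to the UnsafeMemory class above (the method name is hypothetical):

public void putLongChecked(final long value)
{
    // Fail fast with an exception rather than risk corrupting memory
    if (pos + SIZE_OF_LONG > buffer.length)
    {
        throw new IndexOutOfBoundsException("pos=" + pos + " capacity=" + buffer.length);
    }

    unsafe.putLong(buffer, byteArrayOffset + pos, value);
    pos += SIZE_OF_LONG;
}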

Adding Some Context

"What about the likes of Google Protocol Buffers?", I hear you cry out.  These are very useful libraries and can often offer better performance and more flexibility than Java Serialisation.  However they are not remotely close to the performance of using Unsafe like I have shown here.  Protocol Buffers solve a different problem and provide nice self-describing messages which work well across languages.  Please test with different protocols and serialisation techniques to compare results.

The astute among you will also be asking, "What about the endianness (byte ordering) of the integers written?"  With Unsafe the bytes are written in native order.  This is great for IPC and between systems of the same type.  When systems use differing byte orders then a conversion will be necessary.
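
If a fixed wire format is required, a minimal sketch of one approach (class and method names are hypothetical) is to decide once whether the native order matches the chosen wire order and only swap when it does not:

import java.nio.ByteOrder;

public final class WireOrder
{
    // Decided once at class load time; true when the platform is little-endian
    private static final boolean REVERSE =
        ByteOrder.nativeOrder() != ByteOrder.BIG_ENDIAN;

    public static long toWire(final long value)
    {
        return REVERSE ? Long.reverseBytes(value) : value;
    }

    public static long fromWire(final long value)
    {
        return REVERSE ? Long.reverseBytes(value) : value;
    }
}

Long.reverseBytes() and Integer.reverseBytes() are themselves intrinsics (BSWAP on x86), so the cost of the swap is small.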

How do we deal with multiple versions of a class, or determine what class an object belongs to?  I want to keep this article focused, but let's say a simple integer to indicate the implementation class is all that is required for a header.  This integer can be used to look up the appropriate implementation for the de-serialisation operation.
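
Below is a minimal sketch of such a header using the classes from the example above (the type id value and the codec class are hypothetical):

public final class MessageCodec
{
    private static final int OBJECT_TO_BE_SERIALISED_TYPE_ID = 1;

    public static void write(final UnsafeMemory buffer, final ObjectToBeSerialised item)
    {
        // Prefix the message with its type id so the reader can dispatch on it
        buffer.putInt(OBJECT_TO_BE_SERIALISED_TYPE_ID);
        item.write(buffer);
    }

    public static Object read(final UnsafeMemory buffer)
    {
        final int typeId = buffer.getInt();
        switch (typeId)
        {
            case OBJECT_TO_BE_SERIALISED_TYPE_ID:
                return ObjectToBeSerialised.read(buffer);

            default:
                throw new IllegalStateException("Unknown type id: " + typeId);
        }
    }
}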

An argument I often hear against binary protocols, and in favour of text protocols, is: what about human readability and debugging?  There is an easy solution to this.  Develop a tool for reading the binary format!
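
As a minimal sketch, such a tool only has to walk the same layout the writer used.  The class below (the name is hypothetical) decodes the ByteBuffer format defined above and prints the fields:

import java.nio.ByteBuffer;
import java.util.Arrays;

public final class BinaryDumper
{
    public static void dump(final ByteBuffer byteBuffer)
    {
        // Follows the field order used by ObjectToBeSerialised.write(ByteBuffer)
        System.out.println("sourceId   = " + byteBuffer.getLong());
        System.out.println("special    = " + (0 != byteBuffer.get()));
        System.out.println("orderCode  = " + byteBuffer.getInt());
        System.out.println("priority   = " + byteBuffer.getInt());

        final double[] prices = new double[byteBuffer.getInt()];
        for (int i = 0; i < prices.length; i++)
        {
            prices[i] = byteBuffer.getDouble();
        }
        System.out.println("prices     = " + Arrays.toString(prices));

        final long[] quantities = new long[byteBuffer.getInt()];
        for (int i = 0; i < quantities.length; i++)
        {
            quantities[i] = byteBuffer.getLong();
        }
        System.out.println("quantities = " + Arrays.toString(quantities));
    }
}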

Conclusion

In conclusion, it is possible to achieve native C/C++-like levels of performance in Java for serialising an object to-and-from a byte stream by effectively using the same techniques.  The UnsafeMemory class, for which I've provided a skeleton implementation, could easily be expanded to encapsulate this behaviour and thus protect oneself from many of the potential issues when dealing with such a sharp tool.

Now for the burning question.  Would it not be so much better if Java offered an alternative Marshallable interface to Serializable, providing natively what I've effectively done here with Unsafe?

42 comments:

  1. Does AKKA use the unsafe technique?

    Replies
    1. No. Akka by default uses Google Protocol Buffers for its own internal messages, and standard Java serialization for Serializable user objects.

    2. Hi Anthony!

      No, Akka doesn't use this technique, and I'm not sure where that'd be beneficial right now as in-memory message passing is done via references.

      Akka Serialization is completely pluggable & configurable so you can write your own serializers, use other people's serializers and mix and match at will. The Akka Remote Protocol uses Protocol Buffers to allow for platform independence but the payloads use Akka Serialization.

      Hope that answers the question!

      Cheers,

  2. You can get much better performance (~3x) out of the ByteBuffer implementation with a couple of simple changes:

    - Use native ordered direct ByteBuffers. (use ByteBuffer.allocateDirect(1024).order(ByteOrder.nativeOrder()))

    - Change the array loops to the bulk-copy equivalent of Unsafe's copyMemory, that is:

    replace this:

    for (final double price : prices) {
        byteBuffer.putDouble(price);
    }

    with this:

    byteBuffer.asDoubleBuffer().put(prices);
    byteBuffer.position(byteBuffer.position() + prices.length * 8);

    Tested on 1.7.0_06, Sandy Bridge 3.1GHz.
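
    A sketch of what these changes might look like against the write(ByteBuffer) method of ObjectToBeSerialised above (assuming the buffer is allocated with ByteBuffer.allocateDirect(1024).order(ByteOrder.nativeOrder()) and that the read side is adjusted in the same way with asDoubleBuffer().get(prices) / asLongBuffer().get(quantities)):

    public void write(final ByteBuffer byteBuffer)
    {
        byteBuffer.putLong(sourceId);
        byteBuffer.put((byte)(special ? 1 : 0));
        byteBuffer.putInt(orderCode);
        byteBuffer.putInt(priority);

        // Bulk copy the arrays via typed views instead of per-element loops.
        // The views do not advance the parent buffer, so bump the position manually.
        byteBuffer.putInt(prices.length);
        byteBuffer.asDoubleBuffer().put(prices);
        byteBuffer.position(byteBuffer.position() + prices.length * 8);

        byteBuffer.putInt(quantities.length);
        byteBuffer.asLongBuffer().put(quantities);
        byteBuffer.position(byteBuffer.position() + quantities.length * 8);
    }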

    Replies
    1. Thanks for the follow up. I get a ~2.5X improvement with DirectByteBuffer and native byte ordering as you suggest.

      However it is still 2-2.5X slower than using Unsafe.

    2. I tried a few modified tests on an i7-6820HQ CPU @ 2.70GHz / Java 1.8.0_121; below are the results:

      0 Serialisation write=2,281ns read=7,995ns total=10,276ns
      1 Serialisation write=1,423ns read=7,626ns total=9,049ns
      2 Serialisation write=1,403ns read=7,651ns total=9,054ns
      3 Serialisation write=1,417ns read=7,595ns total=9,012ns
      4 Serialisation write=1,405ns read=7,642ns total=9,047ns
      0 DirectByteBuffer NativeOrder write=63ns read=80ns total=143ns
      1 DirectByteBuffer NativeOrder write=37ns read=66ns total=103ns
      2 DirectByteBuffer NativeOrder write=36ns read=65ns total=101ns
      3 DirectByteBuffer NativeOrder write=37ns read=69ns total=106ns
      4 DirectByteBuffer NativeOrder write=36ns read=73ns total=109ns
      0 DirectByteBuffer write=63ns read=102ns total=165ns
      1 DirectByteBuffer write=48ns read=79ns total=127ns
      2 DirectByteBuffer write=47ns read=75ns total=122ns
      3 DirectByteBuffer write=48ns read=84ns total=132ns
      4 DirectByteBuffer write=48ns read=78ns total=126ns
      0 ByteBuffer write=193ns read=219ns total=412ns
      1 ByteBuffer write=168ns read=191ns total=359ns
      2 ByteBuffer write=182ns read=213ns total=395ns
      3 ByteBuffer write=177ns read=202ns total=379ns
      4 ByteBuffer write=167ns read=190ns total=357ns
      0 UnsafeMemory write=24ns read=55ns total=79ns
      1 UnsafeMemory write=26ns read=50ns total=76ns
      2 UnsafeMemory write=18ns read=56ns total=74ns
      3 UnsafeMemory write=19ns read=49ns total=68ns
      4 UnsafeMemory write=19ns read=52ns total=71ns


      Not sure though why setting native order improves performance.

  3. It would be interesting for completeness' sake to time the use of standard Java serialization but with an overridden readObject and writeObject implementation that just wrote the field values.

    Replies
    1. Please go for it and post the results.

    2. Out of curiosity, I added an Externalizable test case. Extending ObjectToBeSerialised (as ObjectToBeExternalized) required some minor surgery here and there. I am running Win7/64 and java 1.7.0_02. The original (unmodified code) results were:

      0 Serialisation write=2,203ns read=11,066ns total=13,269ns
      1 Serialisation write=1,992ns read=10,924ns total=12,916ns
      2 Serialisation write=2,050ns read=10,892ns total=12,942ns
      3 Serialisation write=2,054ns read=11,092ns total=13,146ns
      4 Serialisation write=2,075ns read=10,958ns total=13,033ns
      0 ByteBuffer write=200ns read=229ns total=429ns
      1 ByteBuffer write=172ns read=215ns total=387ns
      2 ByteBuffer write=160ns read=206ns total=366ns
      3 ByteBuffer write=174ns read=205ns total=379ns
      4 ByteBuffer write=163ns read=213ns total=376ns
      0 UnsafeMemory write=35ns read=67ns total=102ns
      1 UnsafeMemory write=20ns read=51ns total=71ns
      2 UnsafeMemory write=20ns read=50ns total=70ns
      3 UnsafeMemory write=20ns read=52ns total=72ns
      4 UnsafeMemory write=21ns read=51ns total=72ns

      The modified code was:

      0 Serialisation write=2,316ns read=10,930ns total=13,246ns
      1 Serialisation write=2,124ns read=10,653ns total=12,777ns
      2 Serialisation write=2,047ns read=10,549ns total=12,596ns
      3 Serialisation write=2,044ns read=10,534ns total=12,578ns
      4 Serialisation write=2,022ns read=10,359ns total=12,381ns
      0 Externalisation write=1,693ns read=8,523ns total=10,216ns
      1 Externalisation write=1,650ns read=8,413ns total=10,063ns
      2 Externalisation write=1,705ns read=8,526ns total=10,231ns
      3 Externalisation write=1,648ns read=8,416ns total=10,064ns
      4 Externalisation write=1,658ns read=8,528ns total=10,186ns
      0 ByteBuffer write=216ns read=244ns total=460ns
      1 ByteBuffer write=182ns read=216ns total=398ns
      2 ByteBuffer write=182ns read=211ns total=393ns
      3 ByteBuffer write=192ns read=210ns total=402ns
      4 ByteBuffer write=190ns read=209ns total=399ns
      0 UnsafeMemory write=37ns read=70ns total=107ns
      1 UnsafeMemory write=28ns read=56ns total=84ns
      2 UnsafeMemory write=23ns read=53ns total=76ns
      3 UnsafeMemory write=23ns read=51ns total=74ns
      4 UnsafeMemory write=23ns read=52ns total=75ns

      So it's a bit better than standard serialization, and extern can be improved upon by overriding the class descriptor writer, but there's not much point. It's not going to beat ByteBuffer, let alone UnsafeMemory.

      Great project. Thanks for that.

    3. Did you write the arrays manually, i.e. write a length field and the elements, as opposed to calling writeObject on the array?

    4. Yes. Exactly the same.

      // Assumes the "minor surgery" mentioned above: the fields of
      // ObjectToBeSerialised made protected and non-final, and a no-arg
      // constructor added to the parent class.
      class ObjectToBeExternalized extends ObjectToBeSerialised implements Externalizable {
          public ObjectToBeExternalized() {
              super();
          }

          public ObjectToBeExternalized(long sourceId, boolean special,
              int orderCode, int priority, double[] prices, long[] quantities) {
              super(sourceId, special, orderCode, priority, prices, quantities);
          }

          public void writeExternal(ObjectOutput out) throws IOException {
              out.writeLong(sourceId);
              out.writeByte((special ? 1 : 0));
              out.writeInt(orderCode);
              out.writeInt(priority);

              out.writeInt(prices.length);
              for (final double price : prices) {
                  out.writeDouble(price);
              }

              out.writeInt(quantities.length);
              for (final long quantity : quantities) {
                  out.writeLong(quantity);
              }
          }

          public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
              sourceId = in.readLong();
              special = 0 != in.readByte();
              orderCode = in.readInt();
              priority = in.readInt();

              int pricesSize = in.readInt();
              prices = new double[pricesSize];
              for (int i = 0; i < pricesSize; i++) {
                  prices[i] = in.readDouble();
              }

              int quantitiesSize = in.readInt();
              quantities = new long[quantitiesSize];
              for (int i = 0; i < quantitiesSize; i++) {
                  quantities[i] = in.readLong();
              }
          }
      }

    5. nickman: I am surprised that you found such poor improvement from Externalizable.

      Check out the Thrift benchmarks webpage:
      http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking

      They found that Externalizable gives the best results of all, beating even kryo!

      I wonder what caused this huge discrepancy.

      Possibly it was something with the environment (hardware, OS, JVM, etc).

      Another possibility is the seemingly minor choice of what streams to use. The source code that you posted uses ObjectInput/ObjectOutput interfaces. What concrete classes did you use? The only ones built into the JDK are ObjectInputStream/ObjectOutputStream.

      In contrast, download the Thrift benchmarks as described here
      http://code.google.com/p/thrift-protobuf-compare/source/checkout

      Then open up the class
      ...\thrift-protobuf-compare-read-only\tpc\src\serializers\JavaExtSerializer.java

      There you will find that it defines custom ObjectInput/ObjectOutput implementations that essentially are just DataInputStream/DataOutputStream.

      Could it be that DataXxxStream has vastly lower overhead than ObjectXxxStream, even when all you do is write/read primitives like you do? Skimming the ObjectOutputStream source code, I note that primitives are written using an underlying BlockDataOutputStream. (This is also mentioned near the end of that class's javadocs.) In contrast, if you look at the source code of DataOutputStream, you will see that methods like writeInt write directly to the underlying OutputStream.
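
      A minimal sketch of that adapter idea (class names are hypothetical): DataOutputStream/DataInputStream already provide everything ObjectOutput/ObjectInput require except writeObject/readObject, so for Externalizable objects that only write primitives the adapters can simply refuse object writes.

      import java.io.DataInputStream;
      import java.io.DataOutputStream;
      import java.io.InputStream;
      import java.io.ObjectInput;
      import java.io.ObjectOutput;
      import java.io.OutputStream;

      class DataObjectOutput extends DataOutputStream implements ObjectOutput
      {
          DataObjectOutput(final OutputStream out)
          {
              super(out);
          }

          // Primitives only; the Externalizable test object never calls this
          public void writeObject(final Object obj)
          {
              throw new UnsupportedOperationException();
          }
      }

      class DataObjectInput extends DataInputStream implements ObjectInput
      {
          DataObjectInput(final InputStream in)
          {
              super(in);
          }

          public Object readObject()
          {
              throw new UnsupportedOperationException();
          }
      }

      The test could then call objectToBeExternalized.writeExternal(new DataObjectOutput(baos)) and readExternal(new DataObjectInput(bais)) directly, bypassing ObjectOutputStream/ObjectInputStream entirely.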

      So, I suspect that your results could be improved a lot with a tiny code adjustment. Would you mind re-running?

      On another note, I have not played with it yet, but here is a project that claims to be a drop-in replacement for JDK serialization, but as fast or faster than Kryo:
      http://code.google.com/p/fast-serialization/

    6. Followup to my own reply: looking more closely at the thrift benchmark JavaExtSerializer code, I think that they somewhat cheated.

      If you look closer, you will see that they only write then read an object of a single type (MediaContent). Consequently, their deserialize method does not determine what type of object is next on the input stream--it assumes that a single MediaContent instance is there.

      So, the solution that they give is NOT a general solution for writing a series of objects of potentially different types to a stream. A general solution needs to be more complicated (e.g. when writing, first write what type of object is coming).

      That said, their approach of only effectively using a DataXxxStream is intriguing. Nickman, I would still like to see what you get if you try that.

    7. Please feel free to take the above code and add the test case for Externalizable and using DataXxxStream as you suggest. If you post the code via a link then we can all try it for comparison.

  4. Thanks for posting these measurements. Very helpful to know what the tradeoff is. I will probably continue to use ByteBuffers for the bounds checking. So much parallelism is available that I don't see it as the right tradeoff for serialization to disk or network.

    I think the right way to view these results is in terms of the number of megabytes of serialization you can do per core per second. My napkin math says 974 megabytes per core. By way of comparison, when I tested Snappy I got the advertised 250 megabytes/second and LZ4 compression claims to do 300 megabytes/sec.

    As you point out these numbers matter if you are doing IPC on a local system, and if you have 10-gig E there is some fat to trim, but you can scale your way out of this particular bottleneck. I'll bet there is more fat to trim in the cache misses incurred by local IPC and the associated coordination than in the serialization scheme itself. That too is on the order of 100s of nanoseconds.

  5. It would be nice if you could collaborate with the team behind [https://github.com/eishay/jvm-serializers] (there is a mailing list). It is much easier to get a decent perspective; especially since while (de)serialization is sometimes a significant cost, quite often it really is not (which was implied in the article too).

    Replies
    1. I added a quick and dirty Unsafe serializer to the jvm-serializer project locally and the results on my rather slow server are:

                          create     ser     deser     total   size   +dfl
      java-built-in          193   24805    110891    135696    889    514
      java-manual            193    3034      3046      6080    255    147
      unsafe-manual          193    1334      1192      2526   1024    188
      kryo                   193    1920      2784      4705    214    134
      kryo-opt               193    2107      2884      4991    211    131
      kryo-manual            193    1570      2369      3939    211    131

      Given that kryo is typically among the fastest serializers listed (https://github.com/eishay/jvm-serializers/wiki/Staging-Results), the results are definitely interesting.

    2. How do we read these results? Did you test with sufficient warm up so the optimiser can apply intrinsics?

    3. Apologies for the chart formatting. My understanding is that the serializer runner in the jvm-serializer test suite does indeed do warmup runs and runs each test 500 times against each serializer. I had the test run against the serializers I'm most familiar with - Java and Kryo.

      The two numbers I was interested in were the serialization and deserialization costs, which are the second and third numbers in the chart above (in nanoseconds, I believe). We currently use an older version of Kryo than the one in the test, so I was curious just how much faster, if at all, using Unsafe would be. From the test results above it would seem that the serialization times are slightly faster and the deserialization times are about half of what Kryo achieves.

    4. For any performance test it is important to do multiple runs of over 10 thousand iterations to allow the optimiser to kick in, otherwise you are just testing the interpreter. An option is to vary the CompileThreshold.

      From the documentation for -server JVM:
      -XX:CompileThreshold=10000 Number of method invocations/branches before compiling [-client: 1,500]

    5. I assume Bruce used the default of 500 trials in the jvm-serializers project. Each trial is 2000 iterations.

  6. If you want a specific byte ordering with Unsafe, you can use Long/Integer.reverseBytes(). On Intel these are both implemented as intrinsics (BSWAP) and are quite efficient. You can statically evaluate whether you need to reverse the bytes for your platform by writing out a long in the desired byte order and reading it in again. Store the result in a static final field and the value can be read via Unsafe and reversed if required.

    long longFromMemory = unsafe.getLong(buffer, offset);
    return SHOULD_REVERSE ? Long.reverseBytes(longFromMemory) : longFromMemory;

    Because SHOULD_REVERSE is a constant, Hotspot will remove the unused branch; if not, the CPU will probably branch predict it correctly.

    We're currently using this technique and it adds about 20% overhead and is still faster than ByteBuffer.

  7. Hi Martin,
    I've recently written a similar article on my blog and updated it after reading this article. I've made a full set of ByteBuffer performance tests: heap/direct buffers; little/big endian; working on separate elements of an array or using bulk methods; Java 6 b30/Java 7 b2. I've tested byte[] and long[] processing times separately, so I ended up with 64 time measurements for ByteBuffers and 16 for Unsafe.
    As it turned out, direct byte buffers with native byte order used to serialize arrays of any primitive type are nearly indistinguishable in performance from Unsafe.
    You can read my article here: http://vorontsovm.blogspot.com.au/2012/07/javaniobytebuffer-javaiodatainputstream.html

    Replies
    1. Thanks Mike, this is interesting. I've found direct ByteBuffers can give sufficient performance for large byte arrays when intrinsics get applied. By the way, you should test across all JVMs; I was very surprised at the differences! Hotspot and Azul rock but the others (JRockit, IBM, et al) are somewhat lacking.

      Most JDK libraries now support the use of ByteBuffer and Channel, unfortunately many 3rd party libraries only take a byte[] in their API. This is why Unsafe is more interesting to me. I can use it to manipulate *any* type of buffer I get a reference to.

  8. Hi Folks,

    Very nice article on a topic that is frequently overlooked, I think. On second thought, however, I wonder whether comparing JSE serialization to a custom protocol is of any value. Are we comparing apples to apples here?
    JSE serialisation only requires you to implement the correct interface. Your object can have custom IO code but does not need it at all. This is a pretty big contrast to the ByteBuffer and UnsafeMemory implementations.
    Secondly, JSE serialization has additional logic to serialize object references to (re)construct object graphs etc. I see no equivalent of that feature in the other 2 test cases.
    I have added a test case for the Hessian library which gives results in line with those for JSE serialization.

    Curious to hear your thoughts on this...

    Regards,
    Benoit

    Replies
    1. Java Serialization is not a fair comparison at all. I added that test case for context to what people would likely do if they were not aware of custom binary protocols. I could have shown how much slower XML is even than Java Serialization to be really silly.

      Yes, serialising a graph of objects does require additional code and this can be implemented with a simple key protocol. However in a high-performance environment, if you are passing complex graphs for basic communication then I'd have to counsel a step back to question the design.

      The point of the article is to show how it is possible to get native like performance in a managed runtime environment like Java. If you are working in a high-performance domain then you need to be thinking like this. If performance is not an issue then ease of coding is probably your best investment.

  9. Someone let me know when they've modified Kryo to use Unsafe in the com.esotericsoftware.kryo.io Output and Input classes. I'd love to see the results then. I'd do it myself, but I'm sure someone else has already thought of this minor mod.

  10. Very good article. I'm a fan of Kryo's.
    But I've also ported "Python Construct", an excellent library for parsing and building custom binary protocols, to Java.
    Speed hasn't been a requirement so far, but if someone wants to develop the low level further and use "unsafe" rather than ByteBuffer, that'd be a useful contribution:

    https://github.com/ZiglioNZ/construct

    http://techno.emanueleziglioli.it/2012/07/java-construct-112-release.html

    https://github.com/construct/construct

    http://construct.readthedocs.org/en/latest/index.html

  11. It's interesting to see that I'm not the only person featuring the sun.misc.Unsafe idea. I started a Github project (Lightning) a few months ago to implement a serialization framework that uses as many Unsafe constructs as possible but can fall back to reflection.
    Another thing it does is generate marshaller / unmarshaller bytecode at runtime to run at native speed the moment HotSpot jumps in.
    The last performance feature is the missing constructor invocation. This means ONLY value objects can be transferred with this serializer, but for most cases this is enough.

    Another reason for the framework was a fail-fast principle: it is used for clusters and all cluster nodes need to have the same codebase. So the master node builds a class definition container holding all information about registered / configured classes and attributes.

    The current implementation only features JGroups integration, so when a new node connects the cluster master transfers its own class definition container to the node and the new cluster member tests its own classes against the definitions. If one class fails, the new node is automatically disconnected from the cluster.

    The whole project started as a prototype to find out if my expectations about the speed improvement would be correct, but the implementation was good enough to make it a real project.

    Here are some results (for interested people, the source is linked below):
    Lightning Serializer build time: 2795 ms
    Lightning Serialization Avg: 899,13 ns, runs: 800000, size: 42 bytes
    Lightning Deserialization Avg: 918,84 ns, runs: 800000, size: 40 bytes
    Java Serialization Avg: 3939,51 ns, runs: 800000, size: 375 bytes
    Java Deserialization Avg: 19581,37 ns, runs: 800000, size: 375 bytes

    The build time is a one-time operation when generating the class definition container and the bytecode marshallers. This is normally done at startup time of the cluster node.
    The possible differences in byte size depend on whether a field was randomized to null or filled with a value.

    For interested people I would like to see others joining the project or offering ideas on what to add or how to improve the framework. I'm open to questions and opinions as well. You can contact me on m e [at] n o c t a r i u s [DOT] c o m or Google+. I hope to see some reaction from the blog's owner.

    Cheers,
    Noctarius

    https://github.com/noctarius/Lightning
    https://github.com/noctarius/Lightning/raw/master/lightning-core/src/test/java/com/github/lightning/Benchmark.java

    Replies
    1. Interesting - I was thinking about something like this for Apache DirectMemory - but you're already on the way ahead ;) We now use protostuff as the default serializer - do you think you could get better figures?

    2. I already thought about adding a serializer for directmemory (https://github.com/noctarius/Lightning/issues/10). it would be nice to have a chat about how to integrate both systems in a nice way :) Feel free to contact me with the above mail address.

    3. This thread is going too far off topic and I feel it is more about promoting your projects. Unless adding to the topic directly I will remove comments.

      Please restrict the blatant promotion.

  12. Very timely article for software I am developing right now. However, we are trying to implement memory-mapped buffers to allow paging in order to make our application work on devices with limited physical memory. Is there (a reasonable) way I could leverage the use of Unsafe operations as an alternative to using a MappedByteBuffer?

    Replies
    1. You can use reflection to get the address of the memory inside the MappedByteBuffer then use Unsafe to manipulate it. However on some non-mainstream JVMs Unsafe may not have intrinsics applied, therefore you are better off using the ByteBuffer interface. Best to test for your own requirements and platform.
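
      A minimal sketch of that reflection step (the helper class is hypothetical; the base address of a direct or mapped buffer lives in the private java.nio.Buffer.address field):

      import java.lang.reflect.Field;
      import java.nio.Buffer;
      import java.nio.MappedByteBuffer;

      class MappedBufferAddress
      {
          public static long addressOf(final MappedByteBuffer buffer) throws Exception
          {
              // Read the package-private address field via reflection
              final Field addressField = Buffer.class.getDeclaredField("address");
              addressField.setAccessible(true);

              return addressField.getLong(buffer);
          }
      }

      The returned address can then be used with the absolute-addressing forms such as unsafe.putLong(address + offset, value), subject to the intrinsics caveat above.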

  13. Guys, I just came across this article which follows up on this one with some brief benchmarks:
    http://java.dzone.com/articles/fast-java-file-serialization

    One surprising claim made in that article is that standard Java serialization to a File can be sped up by ~4X if you use a FileOutputStream(RandomAccessFile.getFD()) instead of the usual BufferedOutputStream(FileOutputStream).

    What? That claim was news to me! Anyone else ever seen a performance claim like that before?

    It cannot be because RandomAccessFile natively implements DataOutput, since a FileOutputStream wraps and hides that API...

    I have found that benchmarking file I/O is really nasty due to disk caching. You gotta be super careful between benchmarks to do stuff to ruin the cache before doing a new benchmark. Perhaps these results are just not trustworthy?

  14. I've implemented a serialization that optionally uses Unsafe to speed things up. Additionally it is possible to inject hints into the serialization algorithm using annotations.
    In a real project, doing handcrafted serialization of complex structures is costly in terms of man-days and resulting bugs.
    Theoretically it should be possible to achieve similar (or in most cases better) performance by using a mix of an efficient generic serialization implementation and annotation hints.

    see http://code.google.com/p/fast-serialization/

  15. Not really an apples to apples comparison. For example, the byte buffer approach does not serialize the class type itself, IOW it is not polymorphic. It also uses a fixed-size buffer, which often is not possible, especially if the object is bigger than the allocated size.

    For an apples to apples comparison you should use Externalizable but only invoke the writeExternal and readExternal methods, never calling #writeObject on the stream with the object to be externalized. Make sense? And of course if you really want to be fair you will either reuse the byte array for the ObjectOutputStream/InputStream or allocate a new ByteBuffer in the loop, as this is how most code would work.

    Replies
    1. It is best to handle the "class" of a serialised message by a type id which is an integer. In the example I have used a long (sourceId).

      The example shows how to do simple fixed-size messages, as you point out. This week I'm releasing a new open-source framework that supports much more complex variable-sized structures.

  16. Hello Martin,

    Thank you for the excellent article. Extending your idea, I have implemented a general purpose serialization using the Unsafe class, and extended it to implement an off-heap cache.

    I have tested my solution using your performance test framework and found it to be as fast as (and even slightly faster than) Kryo.

    However, before talking real big, I would really like to have it tested through some other robust test suite as well.

    Can you please provide any pointers as to how I should go about testing it?

    Thanks,
    Sutanu

    Replies
    1. We could do with some good test suites in this area. The few I've seen are very POJO-based with significant allocation, thus skewing the results.

      I've been working on a new codec for C++, C#, and Java that uses these techniques.

      https://github.com/real-logic/simple-binary-encoding

  17. Hi Martin,

    I've followed @iotsakp's direct buffer suggestion and decided to post the results for reference.
    Platform used: Windows 10 64bit, Haswell i5-4460 3.2 GHz, 16GB RAM, JDK 1.8.0_91 x64, default JVM settings.

    Direct ByteBuffer actually comes quite close to Unsafe.

    0 Serialisation write=1,663ns read=6,784ns total=8,447ns
    1 Serialisation write=1,412ns read=6,425ns total=7,837ns
    2 Serialisation write=1,404ns read=6,388ns total=7,792ns
    3 Serialisation write=1,391ns read=6,427ns total=7,818ns
    4 Serialisation write=1,395ns read=6,394ns total=7,789ns

    0 ByteBuffer write=118ns read=165ns total=283ns
    1 ByteBuffer write=97ns read=141ns total=238ns
    2 ByteBuffer write=98ns read=143ns total=241ns
    3 ByteBuffer write=96ns read=142ns total=238ns
    4 ByteBuffer write=96ns read=142ns total=238ns

    0 DirectByteBuffer write=49ns read=78ns total=127ns
    1 DirectByteBuffer write=33ns read=62ns total=95ns
    2 DirectByteBuffer write=29ns read=62ns total=91ns
    3 DirectByteBuffer write=28ns read=62ns total=90ns
    4 DirectByteBuffer write=28ns read=62ns total=90ns

    0 UnsafeMemory write=27ns read=52ns total=79ns
    1 UnsafeMemory write=22ns read=51ns total=73ns
    2 UnsafeMemory write=24ns read=55ns total=79ns
    3 UnsafeMemory write=21ns read=49ns total=70ns
    4 UnsafeMemory write=20ns read=48ns total=68ns
