Nov 6, 2022

How does Python calculate tuple hash?

Well, in Python we can find several builtin types that can be proud of its own optimized implementation of a hash function:

complex, float (both use _Py_HashDouble),
bytes, Memoryview (_Py_HashBytes),
str
tuple

In this post, I’m gonna deep dive into implementation of the latter one. Tuples hashing mechanism has changes over the years. Modern Python versions has been using SipHash (starting from 2013), but finally moved over to the xxHash. Issue 34751 describes its way into release code.

Python3.8 Changelog states:

The hash function for tuples is now based on xxHash which gives better collision results on (formerly) pathological cases. Additionally, on 64-bit systems it improves tuple hashes in general. Patch by Jeroen Demeyer with substantial contributions by Tim Peters.

Issue author notices, how easy it is to induce hash collision, even for a trivial tuples:

hash((1, 0, 0)) == hash((1, -2, -2))

Collision have great impact on a performance, so lowering them was the main goal of the new algorithm. New implementation can be found in a tupleobject.c file:

#if SIZEOF_PY_UHASH_T > 4
#define _PyHASH_XXPRIME_1((Py_uhash_t) 11400714785074694791 ULL)
#define _PyHASH_XXPRIME_2((Py_uhash_t) 14029467366897019727 ULL)
#define _PyHASH_XXPRIME_5((Py_uhash_t) 2870177450012600261 ULL)
#define _PyHASH_XXROTATE(x)((x << 31) | (x >> 33)) /* Rotate left 31 bits */
#else
#define _PyHASH_XXPRIME_1((Py_uhash_t) 2654435761 UL)
#define _PyHASH_XXPRIME_2((Py_uhash_t) 2246822519 UL)
#define _PyHASH_XXPRIME_5((Py_uhash_t) 374761393 UL)
#define _PyHASH_XXROTATE(x)((x << 13) | (x >> 19)) /* Rotate left 13 bits */
#endif

static Py_hash_t
tuplehash(PyTupleObject * v) {
  Py_ssize_t i, len = Py_SIZE(v);
  PyObject ** item = v -> ob_item;

  Py_uhash_t acc = _PyHASH_XXPRIME_5;
  for (i = 0; i < len; i++) {
    Py_uhash_t lane = PyObject_Hash(item[i]);
    if (lane == (Py_uhash_t) - 1) {
      return -1;
    }
    acc += lane * _PyHASH_XXPRIME_2;
    acc = _PyHASH_XXROTATE(acc);
    acc *= _PyHASH_XXPRIME_1;
  }

  /* Add input length, mangled to keep the historical value of hash(()). */
  acc += len ^ (_PyHASH_XXPRIME_5 ^ 3527539 UL);

  if (acc == (Py_uhash_t) - 1) {
    return 1546275796;
  }
  return acc;
}

Prime numbers are used heavily in the calculations here.

Python equivalent:

import ctypes

_PyHASH_XX_PRIME_1: int = 11400714785074694791
_PyHASH_XX_PRIME_2: int = 14029467366897019727
_PyHASH_XX_PRIME_5: int = 2870177450012600261
BITS_LEFT_ROTATE: int = 31
PLATFORM: int = 64


def _py_hash_xx_rotate(val: int, rotate_by: int, bits) -> int:
    mask = (1 << bits) - 1  # define 0xffffffffffffffff mask
    left_rotate_masked = val << rotate_by % bits & mask  # rotate left and apply mask
    right_shift_len = bits - rotate_by % bits  # calculate right shift length
    right_rotate_masked = (val & mask) >> right_shift_len  # mask bits and rotate right
    return left_rotate_masked | right_rotate_masked


def tuple_hash(v: tuple[Any, ...]) -> int:
    acc = _PyHASH_XX_PRIME_5
    for item in v:
        lane = hash(item)
        acc += lane * _PyHASH_XX_PRIME_2
        acc = _py_hash_xx_rotate(acc, BITS_LEFT_ROTATE, PLATFORM)
        acc *= _PyHASH_XX_PRIME_1

        acc += len(v) ^ _PyHASH_XX_PRIME_5 ^ 3527539
    return ctypes.c_longlong(acc).value

tuple_hash function hashes and applies modifiers to all objects contained in the tuple.

Note: Result value, despite being Py_uhash_t (unsigned 64-bit int) is implicitly converted to signed int. So it has to be casted to ctypes.c_longlong.

Finally, we can check how it’s behaving:

test_tuple = 1, 2
hash(test_tuple)  # -3550055125485641917
tuple_hash(test_tuple)  # -3550055125485641917

Interestingly, the most bizzare collision still exists:

hash(-1) == hash(-2)

You can try it yourself.