not a beautiful or unique snowflake (nothings) wrote,
not a beautiful or unique snowflake

Wow, I learned a new linked-list trick. I wonder how long this one has been around.

Suppose you want to make a non-binary tree, where each node has a list of all its children. And you're going to thread the list through the nodes directly.

If you want to make adding and removing nodes from the tree an O(1) operation, you have to use a doubly-linked list.

Now, suppose we want to be able to delete a node from the tree, and we don't already know who its parent is. There are three implementation strategies I know of to deal with all these constraints:

  1. A plain old doubly-linked list, with NULLs at each end, and a parent pointer on each node. Whenever you delete a node, you have to check if you're deleting the first child of the parent (which points to this node) and advance that pointer. The code for this is the most complex.
  2. A circularly-linked list. You still need a parent pointer to update the child pointer. The code is a little simpler; the only special case is deleting from a one-item list.
  3. A doubly-linked list with a sentinel list entry. The parent keeps a pointer to the sentinel. To delete a node, you just unthread it from the list; there are no tests, and you don't need to access the parent.

So, if you're trying to keep storage usage down, what are the overheads here?

  1. a parent pointer
  2. a parent pointer
  3. a sentinel node per list

If your tree doesn't have a special leaf data type, then every node in the tree has a child list, and thus every node has a sentinel node, so the sentinel node code actually uses more storage. (In the typical linked-list case the overhead is tolerable because we don't have every entry on the list having its own child list below.) If the nodes are large, this gets pretty nasty; you can try to be clever and allocate smaller sentinels, but you have to make sure the pointers are in the right place and that's pretty gross and it's still going to be more than a parent pointer.

The thing is, the only time you really need the parent pointer is when you're deleting the first node on the list--and in case #1, that node has no previous pointer anyway, although I'm so used to using #2 I didn't see that right away. So you could do some hack where you use the bottom bit of the prev pointer to encode whether it's really a parent pointer, not a previous pointer. But it turns out there's a better approach that doesn't require bit-hackery.

In fact, I already knew this approach algorithmically, but not as a data structure, so let's look at that case first. Suppose you have this singly-linked-list pointer walk to remove an item:

   if (List->first_item == target) {
      List->first_item = target->next;
   } else {
      Thing *cur, *prev;
      prev = List->first_item;
      cur = prev->next;
      while (cur != target) {
          // [1]
          prev = cur;
          cur = prev->next;
      // remove it
      prev->next = target->next;

Note that at the location marked [1], if you think about the node 'cur', even though it's singly linked, we have all the same information as a doubly-linked list: we've computed the equivalent of 'cur->prev' and stored it into prev.

Now, there's a well known trick in C to remove the special cases from this code. It's only possible in a language that lets you take a pointer into the middle of an object; most garbage-collecting languages don't let you do this, even though they're totally tractable (and GCs actually have to deal with them internally--CPU registers may have intermediate values in the middle, and more of them on the stack).

   Thing **prevn = &List->first_item;
   // prevn points to the 'next' field in 'prev'
   // *prevn = cur
   while (*prevn != target) {
       // [2]
       prevn = &(*prevn)->next;
   // prevnext points to the 'next' field in target->prev
  *prevn = (*prevn)->next;

Sometimes I write these sorts of delete loops one way, sometimes the other way.

Note that at [2], which corresponds to [1] from before, *prevn is the same as 'cur' would be. We have access to 'cur->next' by saying '(*prevn)->next', but we don't have access to 'cur->prev'; we only have access to '&cur->prev->next'. But that's all we need for this algorithm.

Also note that the way this avoids having a special case is by conceptually cheating; the very first time through, prevn doesn't point to the previous items next pointer--there is no previous item. Instead it points to the list head's 'first item' pointer. Since these are the same types, the algorithm works. It might make more sense if you think of it as the list header actually being a sentinel at the head of the list.

So the new trick, the 4th approach, is to make this manifest in the data itself. (I learned about this browsing the source code to Instead of the linked list data structure being defined as
    Node *next;
    Node *prev;

you define the data structure as
    Node *next;
    Node **prevn;

and then you make the first node on the list have its 'prevn' pointer pointing "into" the parent node. Now your doubly-linked node deletion operation is just:
   target->next->prevn = target->prevn;
   *(target->prevn) = target->next;

and that handles all the cases, and you don't need a parent pointer.
  • Post a new comment


    default userpic

    Your reply will be screened

    Your IP address will be recorded 

    When you submit the form an invisible reCAPTCHA check will be performed.
    You must follow the Privacy Policy and Google Terms of use.