MMIX LOGO

MMIX Bug Report copy_block

Table of Content

Content

MMIXware Version

mmix-20131017

Bug Reported

Initial: 27/11/2014

Author

Laurent Desnogues

Description

The function copy_block copies only dirty parts of a cache block.
static void copy_block(cache *c, cacheblock *p, cache *cc, cacheblock *pp)
{
  int j, jj, i, ii, lim;
  int off = p->tag & (cc->bb - 1);

  if (c->g != cc->g || p->tag - off != pp->tag)
    panic(confusion("copy block"));
  for (j = 0, jj = off >> c->g; j < c->bb >> c->g; j++, jj++)
    if (p->dirty[j]) {
      pp->dirty[jj] = true;
      for (i = j << (c->g - 3), ii = jj << (c->g - 3), lim =
           (j + 1) << (c->g - 3); i < lim; i++, ii++)
        pp->data[ii] = p->data[i];
    }
}
This doesn't make sense and it indeed breaks in the following case:
  • a partially dirty cache line is evicted from Dcache
  • a line in the Scache exists that is clean and which should be replaced by the evicted line.
The result is a line in Scache which is a mix of two lines that don't have the same address, which is wrong.

Note that the tag check in copy_block won't fail due to this:

        case flush_to_S:
...
            case 3:
              if (Scache->filler.next)
                wait(1);
              p = alloc_slot(Scache, c->outbuf.tag);
              if (!p)
                wait(1);
              data->ptr_b = (void *) p;
->            p->tag = c->outbuf.tag & (-Scache->bb);
              if (block_diff) {
                octa *d = p->data;

                p->data = Scache->inbuf.data;
                Scache->inbuf.data = d;
              }
            case 4:
              copy_block(c, &(c->outbuf), Scache, p);
The line prefixed with '->' is making sure the tags will be the same when calling copy_block.

Proposed Patch

The trivial fix to copy all of the 'dirty' and 'data' arrays doesn't work. To make my test pass I have to do the following:
  for (j = 0, jj = off >> c->g; j < c->bb >> c->g; j++, jj++) {
    pp->dirty[jj] |= p->dirty[j];
    for (i = j << (c->g - 3), ii = jj << (c->g - 3),
           lim = (j + 1) << (c->g - 3); i < lim; i++, ii++)
      pp->data[ii] = p->data[i];
  }
That is 'or' pp and p dirty arrays. This is due to the fact that you can have in Scache a line that is dirty, though it's not dirty in the same places as in the Dcache. I'm not sure this is expected: a line in Scache can only be dirty after being evicted from Dcache; then later the line can go back to Dcache but I'd expect the dirty flags to be replicated from Scache to Dcache. In real CPU that'd be mostly mandatory (for instance to ensure coherency at L1 level).

Anyway the code above fixes the issue I have seen. But I think it's slightly overkill and there might be better ways to fix it (for instance keep the old way and only do a full copy if the tags differed).

My larger test still fails and I have to investigate, but it's becoming harder and harder. I'm afraid I can't share that test since it's from SPEC CPU2000.

Discussion

Laurent Desnogues: Here is a new proposed fix: the idea uses the fact that alloc_slot will mark the clean lines in Scache as invalid; that information is saved when copying the tag in flush_to_S case 3; in copy_block this is used to force a replacement of the line. In terms of code this looks like this:
  • keep the invalidate bit:
            case flush_to_S:
    ...
                case 3:
                  if (Scache->filler.next)
                    wait(1);
                  p = alloc_slot(Scache, c->outbuf.tag);
                  if (!p)
                    wait(1);
                  data->ptr_b = (void *) p;
                  /* keep the invalidate bit of the tag for copy_block */
                  p->tag = (p->tag & sign_bit64) | (c->outbuf.tag & (-Scache->bb));
    
    the last two lines replace p->tag = c->outbuf.tag & (-Scache->bb)
  • handle the invalidate bit in copy_block:
    /* Old code was copying only dirty parts of p into pp, but this breaks in the
     * following situation (flush_to_S):
     *  - evicted partially dirty line from Dcache to Scache
     *  - a clean line is replaced in Scache which hasn't the same address
     * We fix that by using the invalid bit of pp which means we can safely replace
     * all of the line (cf case 3 from flush_to_S).
     */
    static void copy_block(cache *c, cacheblock *p, cache *cc, cacheblock *pp)
    {
      int j, jj, i, ii, lim;
      int off = p->tag & (cc->bb - 1);
      int replace;
    
      replace = pp->tag >> 63;
      pp->tag &= ~sign_bit64;
      if (p->tag != pp->tag)
        abort();
      if (c->g != cc->g || p->tag - off != pp->tag)
        panic(confusion("copy block"));
      if (replace) {
        for (j = 0, jj = off >> c->g; j < c->bb >> c->g; j++, jj++) {
          pp->dirty[jj] = p->dirty[j];
          for (i = j << (c->g - 3), ii = jj << (c->g - 3),
                 lim = (j + 1) << (c->g - 3); i < lim; i++, ii++)
            pp->data[ii] = p->data[i];
        }
      } else {
        for (j = 0, jj = off >> c->g; j < c->bb >> c->g; j++, jj++)
          if (p->dirty[j]) {
            pp->dirty[jj] = true;
            for (i = j << (c->g - 3), ii = jj << (c->g - 3), lim =
                   (j + 1) << (c->g - 3); i < lim; i++, ii++)
              pp->data[ii] = p->data[i];
          }
      }
    }
    
    I have run the first 2B cycles of some SPEC 2k test with two different configurations and things seem to be OK.

Further Discussion

The copy_block bug manifests itself in the flush_to_S coroutine.

There are four execution pathes through this coroutine which use the copy_block function and are potentially affected by this bug.

After the coroutine accquires the Scache lock, there are four different execution pathes leading to copy_block:

    • A cache hit in state 1.
    • copy_block in state 4
    • A cache miss in state 1 and
      The Scache is in WRITE_ALLOC mode and
      the blocksize of the Dcache is less than the blocksize in the Scache.
    • fill the Scache inbuf in state 2
    • Allocate a slot in the Scache in state 3
    • Swap the Scache inbuf with the allocated slot in state 3
    • copy_block in state 4
    • A cache miss in state 1 and
      The Scache is in WRITE_ALLOC mode and
      the blocksize of the Dcache is equal to the blocksize in the Scache.
    • Allocate a slot in the Scache in state 3
    • copy_block in state 4
    • A cache miss in state 1 and
      The Scache is in WRITE_AROUND mode
    • Handle write around in state 6
    • copy_block to the Scache outbuf
The present implementation of copy_block copies only the dirty granules from the source block to the destination block. This implementation works fine for path 1 and 4 as follows:

Path 1: There is a target block with valid data and valid dirty bits. and a source block with valid dirty bits, and valid data in the dirty granules; the result is a block with valid dirty bits and valid data.

Note that the implementation of the write_from_wbuf coroutine will write in state 3 (Handle write-around when writing to the Dcache) just one octabyte to the outbuf of Dcache, setting all dirty bits to zero except for the one octabyte written. This occurs only if the granule size is 8 (one octa), there is no hit in the Dcache, and the Dcache is in WRITE_AROUND mode or the instruction is STUNC (store-uncached). As a consequence, the Dcache outbuf will contain invalid data, except for one granule of 8 byte marked dirty. Hence, copy_block must not copy "clean" data; it might be invalid.

Path 4: First all granules in the Scache outbuf are marked "clean", and then the dirty granules in the Dcache outbuf will be copied to the Scache outbuf. Later, flush_to_mem will write only dirty data to memory.

The present implementation of copy_block is not correct for path 2 and 3:

Path 2: After swapping the inbuf data with the data in the allocated slot in state 3, the newly allocated slot contains valid data. Unfortunately, the present code does not reset the dirty information, so we find the old dirty information in the newly allocated block. After copy_block, we will have valid data, but some granules might be marked dirty, where in fact these granules are clean. This will not cause the processor to malfunction, but might cause extra flushing to memory.

Path 3: (This bug was found by Laurent Desnogues; his bug report started my investigation into this.) The alloc_slot function will try to find an available block in the Scache. If all candidates are in use, one of them is flushed to memory and without clearing data or dirty bits, this block is returned; only the sign bit in the tag is set to mark the block as invalid. The flush_to_S will set the new tag and will copy the dirty granules from the Dcache outbuf into the newly allocated block. If the outbuf was only partially dirty, some data from the old block content will remain.

Proposal (Laurent): Keep the invalid bit in the newly allocated block in state 3 and augment copy_block as follows: check for a newly allocated block by testing the invalid bit and in this case copy all the data (including the "clean" data) and all the dirty bits.

Unfortunately, this will not work in all cases, since the "clean" data in the Dcache outbuf might be invalid, for instance after a STUNC instruction (see note above).

The bug in path 3 is difficult to fix. We have to distinguish two cases: either all data in the Dcache outbuf is valid or only the dirty granules contain valid data. In the first case, we do not need to load data from memory, in the second, loading data is necessary.

The present flush_to_S coroutine in state 2 takes a liberal approach to loading data from memory: The complete cache block is always loaded but the delay is computed only for the amount of octabyte realy needed. Further, the memory is locked only for the required time.

In this spirit, we should inform the flush_to_S coroutine if only a single octabyte in the Dcache outbuf is valid, compute block_diff accordingly and take path 2 instead of path 3 in this case. We take path 3 only if the Dcache outbuf is completely valid and in this case copy all data and all dirty bits to the newly allocated slot.

To indicate that not all data in Dcache outbuf is valid, we could

  • use the dirty bytes, e.g. using -1 for an invalid granule.
  • use the sign bit of the tag field in the outbuf.
  • use some field in the Dcache structure.
  • use the rank field in the outbuf.
The rank field in the outbuf is currently unused (it's only used if the the cache block is part of a cacheset). So, I propose to use the rank field, setting it to the number of valid data byte in the outbuf before calling the flusher coroutine.

So we change the following:

In the subroutine flush_cache, we change

@x
    startup(&c->flusher,c->copy_out_time); /* will not be aborted */
@y
    p->rank=c->bb; /* this many valid bytes */
    startup(&c->flusher,c->copy_out_time); /* will not be aborted */
@z

In flush_to_S, we change

@x
  register int block_diff=Scache->bb-c->bb;
@y
  register int block_diff=Scache->bb-c->outbuf.rank;
@z
and in state 3 of flush_to_S, we change
@x
    if (block_diff) @inbuf| to slot |p|@>;         
@y
    if (block_diff) @inbuf| to slot |p|@>
    else@+for (j=0;jbb>>3;j++) p->data[j]=c->outbuf.data[j];
    for (j=0;jbb>>Scache->g;j++) p->dirty[j]=false;
@z
In @<Handle write-around when writing to the D-cache@> we add at the beginning
if (Dcache->filler.next) goto write_restart;
if ((Scache&&Scache->lock) || (!Scache&&mem_lock)) goto write_restart;
and before starting the flusher, we record the number of valid bytes:
@x
set_lock(self,wbuf_lock);
startup(&Dcache->flusher,Dcache->copy_out_time);
@y
Dcache->outbuf.rank=Dcache->gg; /* this many valid bytes */
set_lock(self,wbuf_lock);
startup(&Dcache->flusher,Dcache->copy_out_time);
@z

These changes will replace path 3 by path 2 in all cases where the data in the Dcache outbuf is not sufficent to fill an Scache slot and should also fix the bug in path 2.

Status

Fixed in the repository, Rev. 92, on 2015-05-28.

Please help to keep this site up to date! If you want to point out important material or projects that are not listed here, if you find errors or want to suggest improvements, please send email to email