模拟LRU

CacheSimulator

pass

shc@shc-virtual-machine:~/Shared_ln/csapp/cachelab$ ./test-csim 
                        Your simulator     Reference simulator
Points (s,E,b)    Hits  Misses  Evicts    Hits  Misses  Evicts
    3 (1,1,1)       9       8       6       9       8       6  traces/yi2.trace
    3 (4,2,4)       4       5       2       4       5       2  traces/yi.trace
    3 (2,1,4)       2       3       1       2       3       1  traces/dave.trace
    3 (2,1,3)     167      71      67     167      71      67  traces/trans.trace
    3 (2,2,3)     201      37      29     201      37      29  traces/trans.trace
    3 (2,4,3)     212      26      10     212      26      10  traces/trans.trace
    3 (5,1,5)     231       7       0     231       7       0  traces/trans.trace
    6 (5,1,5)  265189   21775   21743  265189   21775   21743  traces/long.trace
    27

TEST_CSIM_RESULTS=27

注意事项

层次：所模拟的是SRAM作为DRAM的缓存的替换策略。
Placement policy: determines where b goes
Replacement policy: determines which block gets evicted (victim)

只是一个cacheSimluator 并不是真正的cache
- A cache simulator is NOT a cache!
- Memory contents NOT stored
- Block offsets are NOT used – the b bits in your address don’t matter
- Simply count hits, misses, and evic-ons
- 由于不需要处理数据，那么load store和modify就可简化为同一函数updateCache操作。

读题时困惑许久
- cacheSimulator中，如果我Load了一个数据，发现在cache中没有，他该从哪里加载进来呢？是我们假设他从下层加载进cache了吗？但实际我们并没有做这个操作？是这样么？
  - 没错，就是这样。
- 本实验只是专注于实现一个组中的替换（驱逐）策略：LRU
  - 至于读时如果不命中如何从下层加载到cache中，这些都不考虑，假设不命中时会自动缓存到cache中。
  - 至于写命中时直写还是写回、写不命中时写分配还是非写分配，这些都不做。只做LRU替换策略。

Linux C处理命令行参数
- getopt

C格式化读取文件

fscanf

while(fscanf(f," %c %lx,%d",&op,&addr,&sz)>0)
{
    printf("%c, %lx %d\n",op,addr,sz);
    switch(op)
    {
        case 'L':       //  load data
            updateCache(addr);
            break;
        case 'M':       //  modified = load and store    load hit then update cache(write-back) . load not hit and then load into cache and update(write-allocate) 
            updateCache(addr);   //  load
        case 'S':       //  store
            updateCache(addr);
            break;             
    }
}

shc@shc-virtual-machine:~/Shared_ln/csapp/cachelab$ ./csim -s 4 -E 1 -b 4 -t traces/yi.trace
L, 10 1
M, 20 1
L, 22 1
S, 18 1
L, 110 1
L, 210 1
M, 12 1
hits:4 misses:5 evictions:3

yi.trace
L 10,1
M 20,1
L 22,1
S 18,1
L 110,1
L 210,1
M 12,1

用unsigned long作为地址 64bits 防止被截断
如何取address低x位被截断
- 通过mask/只通过移位
- 见datalab如何获取低x位。
思路和lc的LRU缓存思路大致相同。
- 本实验大致步骤
  - 如何从命令中拿到所需的参数 getopt
  - 如何从文件中读入内容 fscanf
  - 如何构造cache结构
    - 本实验是实现了S组LRU。
      - 通过数组将index映射到S组中的一组。
      - 每组通过timelist维护数据和访问时间
      - 每组LRU做到了O(1)时间的驱逐evict
      - 但是，没有做到O(1)的取数load，而是O(n)。通过遍历组中的所有节点(cacheLine)
      - L：load。S：store。M：modify = L + S。由于不处理数据，可以简化为同一函数。
    - lc通过hash实现了O(1)的取数load，通过timelist实现O(1)的evict。

Part B(略)

有些繁琐、感觉没啥意思。
参考

32*32

C = 1024 Bytes
直接映射 E = 1
B = 32Bytes b = 5
S = C / (S*B) = 1024/32 = 32 s = 5
观察A和B的地址可以发现，A和B中相同坐标的元素会映射到cache中的同一个cache line
一行cache line可以存储8个int。A的一行可以映射到4行不冲突的cacheline。A的8行可以映射到所有cacheline而不冲突。cache可以存储8行1*32的int行。
A/B[0,0]会和A/B[8,0]冲突
数组的行和cache的映射关系如下图
为什么划分为8*8？因为要求不超过12个临时变量。4*4没有利用好cacheline.16*16冒出cacheline(cachlien一行只能8个int)
对角线元素相同，没必要交换，且A和B会映射到同一cacheline
不一次读完一行：在读某一行时，初始会有一个读miss；写对角线元素时，会有一个写miss；且会将正在读的A的该行替换出cache。而A的该行还没读完，还需要接着读，因此需要再替换进cache。造成抖动

一次读完一行：在读某一行时，初始会有一个读miss，且一次性全部读完，在写对角线元素时，会有一个写miss，会将读的A行替换出cache。但不会造成抖动，因为A的改行已经读完。

void transpose_submit(int M, int N, int A[N][M], int B[M][N])
{
//  不一次读
    int i, j, m, n;
    for (i = 0; i < N; i += 8)
            for (j = 0; j < M; j += 8)
                    for (m = i; m < i + 8; ++m)
                            for (n = j; n < j + 8; ++n)
                            {
                                    B[n][m] = A[m][n];
                            }
//  一次读
    for(int i=0;i<M;i+=8)
    {
        for(int j=0;j<N;j+=8)
        {
            for(int k=i;k<i+8;++k)
            {
                int t0 = A[k][j];
                int t1 = A[k][j+1];
                int t2 = A[k][j+2];
                int t3 = A[k][j+3];
                int t4 = A[k][j+4];
                int t5 = A[k][j+5];
                int t6 = A[k][j+6];
                int t7 = A[k][j+7];
  
                B[j][k] = t0;
                B[j+1][k] = t1;
                B[j+2][k] = t2;
                B[j+3][k] = t3;
                B[j+4][k] = t4;
                B[j+5][k] = t5;
                B[j+6][k] = t6;
                B[j+7][k] = t7;
            }
        }
    }


}

# 不一次读（a的小矩阵每行多一次读miss）
shc@shc-virtual-machine:~/桌面/cachelab$ ./test-trans -M 32 -N 32

Function 0 (2 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 0 (Transpose submission): hits:1710, misses:343, evictions:311

Function 1 (2 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 1 (Simple row-wise scan transpose): hits:870, misses:1183, evictions:1151

Summary for official submission (func 0): correctness=1 misses=343

TEST_TRANS_RESULTS=1:343


# 一次读

shc@shc-virtual-machine:~/桌面/cachelab$ ./test-trans -M 32 -N 32

Function 0 (2 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 0 (Transpose submission): hits:1766, misses:287, evictions:255

Function 1 (2 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 1 (Simple row-wise scan transpose): hits:870, misses:1183, evictions:1151

Summary for official submission (func 0): correctness=1 misses=287

TEST_TRANS_RESULTS=1:287

不落辰

Cache-Lab

CacheSimulator

注意事项

Part B(略)