Check-in [643eefa7e6]

 1 2 3 4 5 6 7 8 9 10 11 12 .. 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 .. 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 .. 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108  # Optimizing data motion into and out of callframes in Tcl Kevin B Kenny 16 February 2019 ## Introduction As of the current version ([0718166269]), the handling of variables in the callframe reflects some fairly fundamental oversights in the effect of Tcl operations on variables. Essentially, what it does is to introduce moveToCallFrame and moveToCallFrame operations around a partial ................................................................................ This discussion is written at least partly for the benefit of its author, to help with understanding what optimizations are safe and profitable, and how to compute them. It will, however, hopefully aid those who come after in understanding how the the decisions that were made. ## Back to basics - the code generation strategy One general strategy has been overwhelmingly successful in other areas of quadcode manipulation. Rather than attempting to add operations that handle the awkward cases when they are detected, it has been much easier to generate code for the worst case always, and then optimize it away. The code for variable accesses, for instance, is truly horrific when initially generated: ................................................................................ that is, extract the machine-native representation. Nevertheless, all of these tests become redundant if a second access to the variable is dominated by the first. In that case, partial redundancy elimination will eliminate all five of these steps and simply use the pure value in the arithmetic operation. The partial redundancy elimination is based largely on the value-based algorithms developed in [SIMP96][] We therefore assume here that we will take the brutally simple approach of generating code that: + whenever a local variable (whether potentially aliased or not) is loaded, generate a moveFromCallFrame instruction transferring a Tcl value into an SSA value. ................................................................................ amenable to his technique owing to anti-dependencies (a STORE operation may not be hoisted above a LOAD operation that might refer to the same memory). His technique also does not contemplate downward code motion, where a STORE operation that does not affect a loop's operation might be moved to after the loop exit. To address these deficiencies, it falls to us to develop a better theory of Tcl variable accesses. ## References [SIMP96]: . $SIMP96$. Loren Taylor Simpson. 1996. Value-Driven Redundancy Elimination. Ph.D. Dissertation. Rice University, Houston, TX, USA. AAI9631092. []   < | | > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > <  1 2 3 4 5 6 7 8 9 10 11 .. 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 .. 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 .. 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406  # Optimizing data motion into and out of callframes in Tcl Kevin B Kenny ## Introduction As of the current version ([0718166269]), the handling of variables in the callframe reflects some fairly fundamental oversights in the effect of Tcl operations on variables. Essentially, what it does is to introduce moveToCallFrame and moveToCallFrame operations around a partial ................................................................................ This discussion is written at least partly for the benefit of its author, to help with understanding what optimizations are safe and profitable, and how to compute them. It will, however, hopefully aid those who come after in understanding how the the decisions that were made. ### Back to basics - the code generation strategy One general strategy has been overwhelmingly successful in other areas of quadcode manipulation. Rather than attempting to add operations that handle the awkward cases when they are detected, it has been much easier to generate code for the worst case always, and then optimize it away. The code for variable accesses, for instance, is truly horrific when initially generated: ................................................................................ that is, extract the machine-native representation. Nevertheless, all of these tests become redundant if a second access to the variable is dominated by the first. In that case, partial redundancy elimination will eliminate all five of these steps and simply use the pure value in the arithmetic operation. The partial redundancy elimination is based largely on the value-based algorithms developed in $SIMP96$. We therefore assume here that we will take the brutally simple approach of generating code that: + whenever a local variable (whether potentially aliased or not) is loaded, generate a moveFromCallFrame instruction transferring a Tcl value into an SSA value. ................................................................................ amenable to his technique owing to anti-dependencies (a STORE operation may not be hoisted above a LOAD operation that might refer to the same memory). His technique also does not contemplate downward code motion, where a STORE operation that does not affect a loop's operation might be moved to after the loop exit. To address these deficiencies, it falls to us to develop a better theory of Tcl variable accesses. ### Trace semantics As soon as we get into discussion of non-local variables, the question of traces comes up. In this particular writeup, a very limited concept of traces is admitted: * We assume that any traces present are 'well behaved' in the sense that they do not invalidate the compilation (for instance, by redefining or changing the resolution of compiled commands, creating new aliases among local variables or changing their values). * We assume that skipping traces is permissible to a limited extent. In particular, repeated writes to the same variable may fire only a trace for the last write, while repeated reads may fire only a single read trace. We do, however, respect the ordering among reads or writes; a procedure that reads a variable, changes its value, and puts it back will fire a read trace and a write trace in the correct sequence. We also promise that any read trace will fire at a time when all prior writes have been traced. These constraints are weak enough that many loop-invariant reads and writes can be optimized away. For instance, a procedure that repeatedly reads the same global value in a loop can fire a trace only for the first read, and one that accumulates a result through multiple writes of the same value can defer the write until after the loop. They nevertheless are strong enough that the common cases (the read trace on ::env, and the write traces on variables that appear in Tk user interfaces) will work - and, in fact, may avoid needless work relative to interpreted code. ### Types of optimization to be considered The particular optimizations considered in this document all consist of eliminating code motion to and from the callframe. There are a number of these that we will attempt to cover: * Dead LOAD - An instruction moves a value from the callframe, but the result is not used anywhere in the program. This case should already be covered by the dead code analysis in quadcode. * LOAD-LOAD - An instruction moves a value from the callframe, and an earlier instruction has already put the Tcl value into an SSA value. It can be proven that the value in the callframe has not changed, so the second move can be eliminated. * STORE-LOAD - An instruction moves a value from the callframe, and an earlier instruction has moved an SSA value into the callframe. It can be proven that the value in the callframe has not changed. The move from the callframe can be eliminated. * Loop-invariant LOAD - An instruction inside a loop moves a value from the callframe, and it can be proven that nothing inside the loop alters the value. The move from the callframe can be hoisted to before the start of the loop. (Note that in this case, partial redundancy elimination actually works by creating a second move outside the loop, and reducing the optimization to the LOAD-LOAD case.) * Dead STORE - An instruction moves a value to the callframe, but the variable in the callframe is not aliased and is not referred to again before procedure exit. * LOAD-STORE - An instruction moves a value to the callframe, but the value came from an earlier move from the same place in the callframe, and it can be proven that the value has not changed. The move to the callframe can be eliminated. * STORE-STORE - An instruction moves a value to the callframe, and there is a _later_ instruction on each code path that moves a value to the same place in the callframe, and nothing in between the two instructions might read the value. It is safe to eliminate the first store. * Loop invariant STORE - An instruction withing a loop moves a value to the callframe, and nothing inside the loop loads the value. (This condition will commonly be introduced by STORE-LOAD elimination.) In this case, the STORE may be pushed to after the loop exit(s). It is expected that it will be simplest to introduce speculative move instructions at each loop exit, which will then cause STORE-STORE optimization to eliminate the move inside the loop. (It is possible that some of the speculative instructions can then be optimized away by other means.) From among these, it is useful to separate those requiring _anterograde_ analysis (proceeding from a procedure's entry to its exits) from those requiring _retrograde_ analysis (proceeding from the procedure exits to the entry). LOAD-LOAD, STORE-LOAD, loop-invariant LOAD, and LOAD-STORE are all anterograde. They depend on the availability of SSA counterparts to callframe values at a particular point in the program and on whether the use of a value in the callframe can be anticipated at a given point in the program. By contrast, dead-LOAD, dead-STORE, STORE-STORE, and loop-invariant STORE are retrograde. They depend on whether it is true that all paths forward to the program exit will contain a move to the callframe or a use of a callframe value, or whether some path contains a reference to a callframe value. The pass structure of the two types of analysis is different. While the 'iterate to convergence' structure would allow visiting the blocks of a program in any order, performance demands that anterograde analysis visit dominators before the blocks that they dominate, while retrograde analysis visits postdominators before the blocks that they postdominate. With that ordering, a given iteration to convergence will require only a number of passes proportional to the loop nesting depth. ## TASK 1: Availability analysis The first part of anterograde analysis will begin with developing the set of callframe values that are known to be available as SSA values. This involves looking at the instructions of a block in order, and accumulating the mapping of name and callframe instance to SSA value. Initially, the available set on exit from each block is empty. We iterate through a basic block, applying the following rules. Initially, the set of available values is the set from the block's dominator, or Ø if the block is the entry block or unreachable. + entry - All values listed on the instruction acquire Nothing as their values. + moveFromCallFrame - The result of the instruction becomes the SSA value corresponding to the given variable name in the given callframe. + moveToCallFrame - The name-value mapping in the result callframe is set up as follows: + The available values in the source callframe become available values in the destination callframe. + For each name specified in the instruction from left to right, the available value for the name becomes the corresponding value in the instruction, and all other names that potentially alias the given name lose their values. + invoke[Expanded] - The available values in the source callframe become available in the destination callframe. Then, any names that the invoked command might change and any names that might alias them, lose their available values. + directSet etc. - Any name that might be an alias of the changing variable loses its available value. (If the changing variable name cannot be determined, all names might be aliases, because in this case we cannot determine that the name is fully qualified.) + variable, upvar, nsupvar - Any existing value for the given name and any possible alias thereof is no longer available. Note that this is actually a program error, but we need to retain correct behaviour in the case where the error is caught. + φ - If several callframes meet at a φ, the set of names that have available values will be the intersection of the available sets of the input callframes. For each name in that intersection: + If all the incoming values are the same, that value becomes the value of the name in the result callframe. + If values are different, then we find the available values for each name on exit from the predecessor blocks, and construct a φ instruction that combines them. We search for a φ that combines them. + If such a φ is found, its result is the new value for the given name. + If no such φ is found, we insert such a φ speculatively into the header of the block, and give its result as the value for the given name. At the end of the block, after applying these rules, we have the block's new available set. The calculation of this set may in turn change the available values on φ instructions at the start of loops, so the calculation must be applied to convergence over all blocks in the program. Those who are versed in formal data flow analysis are welcome to separate the descriptions of the operations given above into separate sets AVAIL_GEN and AVAIL_KILL so that the problem can be given as the forward analysis: ![\texttt{AVAIL\_OUT}[B] = \texttt{AVAIL\_GEN}[B] ~ \cup ~ \left( \overline{\texttt{AVAIL\_KILL}[B]} ~ \cap \bigcap_{P\in \texttt{pred}[B]} \texttt{AVAIL\_OUT}[P] \right) ](./avail.svg) This separation will be made in the code if it turns out that it enables combining of passes with other callframe optimizations. ### Availability analysis: implications Calculating the available values already exposes opportunities for optimization. In particular, LOAD-LOAD and STORE-LOAD redundancies can be eliminated by checking each moveFromCallFrame operation in turn for whether it is accessing an available value. If it is, it is eliminated and its result is replaced with the available value. ### Availability analysis: equivalence sets LOAD-STORE pairs can also be detected, by examining 'moveToCallFrame' instructions and eliminating them if the variable being written is available, and known to be equal to the Tcl value in the callframe. Therefore, in the availability analysis, as well as accumulating the sets of available Tcl values and the names in the callframe that correspond to them, we must track the sets of names in a given callframe that correspond correspond to each SSA value. That way, when we examine a moveToCallFrame instruction, we can detect that the given value is already in the callframe under the given name. ## TASK 2: Live Tcl value analysis The first part of retrograde analysis needs to compute the sets of Tcl variables whose values are possibly needed at the entry to each block in the program. This analysis involves examining the instructions of the block in _reverse_ order (that is, visiting postdominators before the blocks that they postdominate) and accumulating the sets of needed values. Initially, the required set at each block is empty. When beginning to examine a block, if the block is an exit, then the set of required variables is the set of local variables that may be aliased (by virtue of appearing in [upvar], [namespace upvar]/[global], or [variable]. This assertion mirrors the requirement that a compiled procedure must have all the nonlocal variables in sync when it leaves. The set is marked as being relative to the input callframe of the return instruction. If a block is not an exit block, then the set of required variables is the union of the sets of variables required at the entry to its successors. If a successor contains a φ operation that produces the callframe to which the variable references refer, then the callframe reference of each variable is updated to the appropriate input to the φ. Then, back to front, the sets get updated by examining individual instructions: + moveFromCallFrame - The given variable, and all possible aliases are added to the set of variables whose values are required. + moveToCallFrame - The variables required will be relative to the input callframe. Starting with the set of variables required in the output callframe, we examine the variables given in the instruction from left to right. For each of those, the given variable is removed from the set of variables whose values are required. All possible aliases of the given variable are added to the set. + invoke[expanded] - The variables required will be relative to the input callframe. We start with the set required relative to the output callframe. All variables that may be read or written, and all possible aliases, are added to the set. (Since we do not in general examine whether a procedure must alter a given variable, but only whether it may alter it, we require in general that the previous value of the variable must already be in sync.) + directGet etc. - The variables required will be relative to the input callframe. We start with the set required relative to the output callframe. Any variable that might alias the variable being retrieved or set in the instruction is added to the set. + variable, upvar, nsupvar - The variables required will be relative to the input callframe. We start with the set required relative to the output callframe. The target variable and any possible aliases are added to the set. + φ - Analysis stops at a φ instruction that references a callframe. The translation to the correct input callframe will happen when analyzing the predecessor block, as noted above. When we arrive at a loop's entry, we may produce a greater set of values that must be in sync at the bottom of the loop, so just as with the availability analysis, the liveness analysis must be iterated to convergence. ### Liveness analysis: implications When liveness analysis is complete, it informs removal of certain moveToCallFrame instructions. Any moveToCallFrame instruction that designates a variable that is not live in the given callframe may be safely removed. Either it is a dead STORE (that is, neither it nor any potential alias will ever be accessed again), or it participates in a STORE-STORE pair. ## TASK 3: Anticipability analysis and loop-invariant loads ## TASK 4: Loop-invariant stores ## References $SIMP96$. Loren Taylor Simpson. 1996. Value-Driven Redundancy Elimination. Ph.D. Dissertation. Rice University, Houston, TX, USA. AAI9631092. []