Friday, May 3, 2013

First working program using the Repa plugin

Haskell Source

repa_process :: R.Stream k Int -> Int
repa_process s
 = R.fold (+) 0 s + R.fold (*) 1 s

Actual well formed GHC Core code

.. or at least the parts that get printed with -dsuppress-all.
repa_process
repa_process =
  \ @ k_aq6 arg_s2Rf ->
    let { (# x1_s2R6, x1_acc_s2R5 #) ~ _ <- newByteArray# 8 realWorld# } in
    let { __DEFAULT ~ x2_s2R8            <- writeIntArray# x1_acc_s2R5 0 0 x1_s2R6 } in
    let { (# x3_s2Rd, x3_acc_s2Rc #) ~ _ <- newByteArray# 8 x2_s2R8 } in
    let { __DEFAULT ~ x4_s2RS            <- writeIntArray# x3_acc_s2Rc 0 1 x3_s2Rd } in
    let { Stream ds1_s2Rs ds2_s2Rj ~ _   <- arg_s2Rf } in
    let { Vector rb_s2Rz _ rb2_s2Ry ~ _  <- ds2_s2Rj `cast` ... } in
    letrec {
      go_s2RP
      go_s2RP =
        \ ix_s2Rr world_s2Ru ->
          case >=# ix_s2Rr ds1_s2Rs of _ {
            False ->
              let { (# x7_s2RF, x0_s2RC #) ~ _ <- readIntArray# x1_acc_s2R5 0 world_s2Ru } in
              let { __DEFAULT ~ sat_s2SV       <- +# rb_s2Rz ix_s2Rr } in
              let { __DEFAULT ~ wild3_s2RD     <- indexIntArray# rb2_s2Ry sat_s2SV } in
              let { __DEFAULT ~ sat_s2SU       <- +# x0_s2RC wild3_s2RD } in
              let { __DEFAULT ~ x8_s2RH        <- writeIntArray# x1_acc_s2R5 0 sat_s2SU x7_s2RF } in
              let { (# x9_s2RN, x1_s2RL #) ~ _ <- readIntArray# x3_acc_s2Rc 0 x8_s2RH } in
              let { __DEFAULT ~ sat_s2ST       <- *# x1_s2RL wild3_s2RD } in
              let { __DEFAULT ~ x10_s2RR       <- writeIntArray# x3_acc_s2Rc 0 sat_s2ST x9_s2RN } in
              let { __DEFAULT ~ sat_s2SS <- +# ix_s2Rr 1 } in
              go_s2RP sat_s2SS x10_s2RR;
            True -> world_s2Ru
          }; } in
    let { __DEFAULT ~ x11_s2RU <- go_s2RP 0 x4_s2RS } in
    let { (# x12_s2RY, x1_s2S2 #) ~ _  <- readIntArray# x1_acc_s2R5 0 x11_s2RU } in
    let { (# _, x2_s2S3 #) ~ _         <- readIntArray# x3_acc_s2Rc 0 x12_s2RY } in
    let { __DEFAULT ~ sat_s2SW         <- +# x1_s2S2 x2_s2S3 } in I# sat_s2SW
Im using ByteArrays to fake imperative variables. The next job is to convert the arrays x1_acc_s2R5 and x3_acc_s2Rc into proper accumulation variables, and attach them to the go_s2RP loop.

x86 assembly code

This is just for the inner loop.
LBB11_2:
        movq    (%rcx), %rdi
        addq    %rdi, 16(%rdx)
        imulq   16(%rsi), %rdi
        movq    %rdi, 16(%rsi)
        addq    $8, %rcx
        incq    %r14
        cmpq    %r14, %rax
        jg      LBB11_2
The extra memory operations are there because the LLVM optimiser doesn't know that the two ByteArrays I'm using to fake imperative variables don't alias. This should turn into optimal code once it uses pure accumulators.

Wednesday, May 1, 2013

Array fusion using the lowering transform

I'm getting back into blogging. This is current work in progress.

Repa 4 will include a GHC plugin that performs array fusion using a version of Richard Waters's series expressions system, extended to support the segmented operators we need for Data Parallel Haskell.

The plugin converts GHC core code to Disciple Core, performs the fusion transform, and then converts back to GHC core. We're using Disciple Core for the main transform because it has a simple (and working) external core format, and the core language AST along with most of the core-to-core transforms are parametric in the type used for names. This second feature is important because we use a version of Disciple Core where the main array combinators (map, fold, filter etc) are primitive operators rather than regular function bindings.

The fusion transform is almost (almost) working for some simple examples. Here is the compilation sequence:


Haskell Source

process :: Stream k Int -> Int
process s
 = fold (+) 0 s + fold (*) 1 s

In this example we're performing two separate reductions over the same stream. None of the existing short-cut fusion approaches can handle this properly. Stream fusion in Data.Vector, Repa-style delayed array fusion, and build/foldr fusion will all read the stream elements from memory twice (if they are already manifest) or compute them twice (if they are delayed). We want to compute both reductions in a single pass.


Raw GHC core converted to DDC core

repa_process_r2 : [k_aq0 : Data].Stream_r0 k_aq0 Int_3J -> Int_3J
  = \(k_c : *_34d).\(s_aqe : Stream_r0 k_c Int_3J).
    $fNumInt_$c+_rjF
       (fold_r34 [k_c] [Int_3J] [Int_3J] $fNumInt_$c+_rjF 
                 (I#_6d 0i#) s_aqe)
       (fold_r34 [k_c] [Int_3J] [Int_3J] $fNumInt_$c*_rjE 
                 (I#_6d 1i#) s_aqe)


Detect array combinators from GHC core, and convert to DDC primops

repa_process_r2 : [k_aq0 : Rate].Stream# k_aq0 Int# -> Int#
 = /\(k_c : Rate).
    \(s_aqe : Stream# k_c Int#).
   add# [Int#]
        (fold# [k_c] [Int#] [Int#] (add# [Int#]) 0i# s_aqe)
        (fold# [k_c] [Int#] [Int#] (mul# [Int#]) 1i# s_aqe)


Normalize and shift array combinators to top-level
All array combinators are used in their own binding.

repa_process_r2 : [k_aq0 : Rate].Stream# k_aq0 Int# -> Int#
 = /\(k_c : Rate).
    \(s_aqe : Stream# k_c Int#).
   let x0 = add# [Int#] in
   let x1 = fold# [k_c] [Int#] [Int#] x0 0i# s_aqe in
   let x2 = mul# [Int#] in
   let x3 = fold# [k_c] [Int#] [Int#] x2 1i# s_aqe in
   add# [Int#] x1 x3


Inline and eta-expand worker functions
This puts the program in the correct form for the next phase.

repa_process_r2 : [k_aq0 : Rate].Stream# k_aq0 Int# -> Int#
 = /\(k_c : Rate).
    \(s_aqe : Stream# k_c Int#).
   let x1
        = fold# [k_c] [Int#] [Int#]
                (\(x0 x1 : Int#). add# [Int#] x0 x1) 0i# s_aqe in
   let x3
        = fold# [k_c] [Int#] [Int#]
                (\(x2 x3 : Int#). mul# [Int#] x2 x3) 1i# s_aqe in
    add# [Int#] x1 x3


Do the lowering transform
This is the main pass that performs array fusion. Note that we've introduced a single loop# that computes both of the fold# results.

repa_process_r2 : [k_c : Rate].Stream# k_c Int# -> Int#
 = /\(k_c : Rate).
    \(s_aqe : Stream# k_c Int#).
   let x1_acc : Ref# Int# = new# [Int#] 0i# in
   let x3_acc : Ref# Int# = new# [Int#] 1i# in
   let _ : Unit
        = loop# (lengthOfRate# [k_c])
                (\(x0 : Nat#).
                let x1 : Int# = next# [Int#] [k_c] s_aqe x0 in
                let x0 : Int# = read# [Int#] x1_acc in
                let _  : Void#
                       = write# [Int#] x1_acc (add# [Int#] x0 x1) in
                let x2 : Int# = read# [Int#] x3_acc in
                let _  : Void#
                       = write# [Int#] x3_acc (mul# [Int#] x2 x1) in
                   ()) in
   let x1 : Int# = read# [Int#] x1_acc in
   let x3 : Int# = read# [Int#] x3_acc in
   add# [Int#] x1 x3


Assign imperative variable storage to arrays
We need to convert the code back to GHC core, but we don't want to use IORefs because they can't hold unboxed values (of types like Int#). Instead, we use some new arrays to hold these values instead.

repa_process_r2 : [k_c : Rate].Stream# k_c Int# -> Int#
 = /\(k_c : Rate).
    \(s_aqe : Stream# k_c Int#).
   let x1_acc : Array# Int# = newArray# [Int#] 8# in
   let _      : Void#       = writeArray# [Int#] x1_acc 0# 0i# in
   let x3_acc : Array# Int# = newArray# [Int#] 8# in
   let _      : Void#       = writeArray# [Int#] x3_acc 0# 1i# in
   let _ : Unit
         = loop# (lengthOfRate# [k_c])
                 (\(x0 : Nat#).
                  let x1 : Int# = next# [Int#] [k_c] s_aqe x0 in
                  let x0 : Int# = readArray# [Int#] x1_acc 0# in
                  let _ : Void#
                        = writeArray# [Int#] x1_acc 0# 
                              (add# [Int#] x0 x1) in
                  let x2 : Int# = readArray# [Int#] x3_acc 0# in
                  let _ : Void#
                         = writeArray# [Int#] x3_acc 0# 
                               (mul# [Int#] x2 x1) in
                   ()) in
      let x1 : Int# = readArray# [Int#] x1_acc 0# in
      let x3 : Int# = readArray# [Int#] x3_acc 0# in
      add# [Int#] x1 x3


Thread state token through effectful primops
The lowered code is naturally imperative, and GHC uses state threading to represent this. 

repa_process_r2 : [k_c : Rate].Stream# k_c Int# -> World# -> Tuple2# World# Int#
    = /\(k_c : Rate).
       \(s_aqe : Stream# k_c Int#).\(x0 : World#).
      caselet T2# (x1 : World#) (x1_acc : Array# Int#) 
        = newArray# [Int#] 8# x0 in
      caselet T2# (x2 : World#) (_ : Void#) 
        = writeArray# [Int#] x1_acc 0# 0i# x1 in
      caselet T2# (x3 : World#) (x3_acc : Array# Int#) 
        = newArray# [Int#] 8# x2 in
      caselet T2# (x4 : World#) (_ : Void#) 
        = writeArray# [Int#] x3_acc 0# 1i# x3 in
      caselet T2# (x11 : World#) (_ : Unit) 
        = loop# (lengthOfRate# [k_c])
              (\(x0 : Nat#).\(x5 : World#).
               caselet T2# (x6 : World#) (x1 : Int#) 
                 = next# [Int#] [k_c] s_aqe x0 x5 in
               caselet T2# (x7 : World#) (x0 : Int#) 
                 = readArray# [Int#] x1_acc 0# x6 in
               caselet T2# (x8 : World#) (_ : Void#) 
                 = writeArray# [Int#] x1_acc 0# (add# [Int#] x0 x1) x7 in
               caselet T2# (x9 : World#) (x2 : Int#) 
                 = readArray# [Int#] x3_acc 0# x8 in
               caselet T2# (x10 : World#) (_ : Void#) 
                 = writeArray# [Int#] x3_acc 0# (mul# [Int#] x2 x1) x9 in
               T2# x10 ()) x4 in
      caselet T2# (x12 : World#) (x1 : Int#) 
        = readArray# [Int#] x1_acc 0# x11 in
      caselet T2# (x13 : World#) (x3 : Int#) 
        = readArray# [Int#] x3_acc 0# x12 in
      T2# x13 (add# [Int#] x1 x3)

Here, "caselet" is just sugar for a case expression with a single alternative.


Covert back to GHC core

repa_process_sTX
  :: forall k_c.
     Data.Array.Repa.Series.Stream k_c GHC.Types.Int
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, GHC.Prim.Int# #)
[LclId]
lowered_sTX =
  \ (@ k_c)
    (rate_sTY :: GHC.Prim.Int#)
    (x_sTZ :: Data.Array.Repa.Series.Stream k_c GHC.Types.Int)
    (x_sU0 :: GHC.Prim.State# GHC.Prim.RealWorld) ->
    let { (# x_sU1, x_sU2 #) ~ scrut_sUv
    <- newByteArray#_sU3 @ GHC.Prim.RealWorld 8 x_sU0
    } in
    let { __DEFAULT ~ x_sU8
    <- writeIntArray#_sU4 @ GHC.Prim.RealWorld x_sU2 0 0 x_sU1
    } in
    let { (# x_sU5, x_sU6 #) ~ scrut_sUw
    <- newByteArray#_sU7 @ GHC.Prim.RealWorld 8 x_sU8
    } in
    let { __DEFAULT ~ x_sUp
    <- writeIntArray#_sU9 @ GHC.Prim.RealWorld x_sU6 0 1 x_sU5
    } in
    let { (# x_sUa, x_sUb #) ~ x_sUa
    <- Main.primLoop
         (Main.primLengthOfRate rate_sTY)
         (\ (x_sUc :: GHC.Prim.Int#)
            (x_sUd :: GHC.Prim.State# GHC.Prim.RealWorld) ->
            let { (# x_sUe, x_sU1 #) ~ x_sUe
            <- Main.primNext_Int @ k_c x_sTZ x_sUc x_sUd
            } in
            let { (# x_sUf, x_sUc #) ~ x_sUf
            <- readIntArray#_sUg x_sU2 0 x_sUe
            } in
            let { __DEFAULT ~ x_sUl
            <- writeIntArray#_sUh
                 @ GHC.Prim.RealWorld x_sU2 0 (+#_sUi x_sUc x_sU1) x_sUf
            } in
            let { (# x_sUj, x_sU8 #) ~ x_sUj
            <- readIntArray#_sUk x_sU6 0 x_sUl
            } in
            let { __DEFAULT ~ x_sUo
            <- writeIntArray#_sUm
                 @ GHC.Prim.RealWorld x_sU6 0 (*#_sUn x_sU8 x_sU1) x_sUj
            } in
            (# x_sUo, GHC.Tuple.() #))
         x_sUp
    } in
    let { (# x_sUq, x_sU1 #) ~ x_sUq
    <- readIntArray#_sUr x_sU2 0 x_sUa
    } in
    let { (# x_sUs, x_sU5 #) ~ x_sUs
    <- readIntArray#_sUt x_sU6 0 x_sUq
    } in
    (# x_sUs, +#_sUu x_sU1 x_sU5 #)


This doesn't work yet because I've forgotten to pass the type arguments to the unboxed tuple constructor (#,#), and maybe other problems as well. I'll post again when I have an actual program running.