Generativity & interpretation: a study of generated comics


In his classic comics theory book Understanding Comics, Scott McCloud introduces six kinds of panel transitions: moment-to-moment, action-to-action, subject-to-subject, scene-to-scene, aspect-to-aspect, and non sequitur.
Since my last reading of the book, I've been curious whether these transition types can be operationalized toward any of the following goals, approximately ordered from "most human effort needed" to "least human effort needed":
  • A Dadaist/Oulipian collaborative cartooning game where players take turns rolling a six-sided die and drawing a panel on a shared piece of paper according to the transition type determined by the die roll.
  • "Fridge poetry" for comics: come up with a fixed set of panels that link together via the different transition types, then let humans decide how to order them.
  • A board game in which each player has a "hand" of panels, as well as some goals that align with good global comic construction, using the same die-roll mechanism as in the first idea.
  • A mixed-initiative digital comic creation tool in which the system suggests possible next panels based on transition types, and the human selects and modifies these panels.
  • A comic-generating program that creates abstract "comic specifications" by automating next panel selection, and lets a human render the comic concretely.
  • A fully-automated comic generator that does all of the above plus visual element placement and rendering.
A few days ago, I ran an experiment in which I made some panels out of index cards and a combination of two sticker packs, then used die rolls to select the first panel and each subsequent one. At first, I tried straightforward application of McCloud's transition types, which meant doing a lot of human work: interpreting panel sequences in certain ways and adding modifiers/emotes to make that meaning more visible. Here are the first few generated results:

The nice thing about comics (especially when wordless) is that they can be understood as telling stories through a very simple language of visual elements and their spatial relationships to one another, e.g. their relative size, rotation, horizontal and vertical juxtaposition, and the distance between them. Of course, by using robot stickers with humanoid faces (and further augmenting them with emotes), the human brain of the reader fills in some of the semantic gaps that would otherwise be impossible to resolve. Human brains' ability to fill in gaps is also why comics are simpler than animation in this respect: animations are expected to provide continuous motion between frames, whereas two comic frames need only be plausibly connected by some narrative justification. And that's where transition types come in: when you exclude non sequiturs, they constrain the space of next panels to ones that "make sense."

Figuring out how to tell a computer which concrete meanings to apply to an arrangement of visual elements seems like a deep and difficult problem, so I decided to see if I could sidestep it and solve the simpler problem of telling a computer how to generate abstract arrangements of visual elements according to panel transition types. To do so, I came up with the following terms:

Visual element (VE): a unique identifier from an infinite set, mappable to visually distinct image components such as anthropomorphic "characters," scenery, and geometric shapes.

Frame: a named panel outline dictating the minimum number of visual elements required to fill it in; e.g. "give" requires three visual elements (a giver, a gift, and a giftee). The frame should also contain instructions for visual rendering, e.g. an image with three holes giving the spatial position of each element.

Panel: a frame with its holes filled by visual elements, and optionally some additional VEs (e.g. observers carried over from previous panels).

Modifier: visual details overlaid on frames and VEs to add semantic coherence to the comic, such as floating emotes, facial expressions, motion lines, word balloons, and other text.
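
In the ML program that appears later, these terms reduce to roughly the following types. This is a sketch inferred from the record shapes in the generator's output below, not the literal source; modifiers never appear in the abstract generator and are left entirely to the human renderer:

type ve = int                                      (* visual element: a unique identifier *)
type frame = {name : string, nholes : int}         (* named outline with its minimum VE count *)
type panel = {name : string, elements : ve list}   (* a frame whose holes are filled with VEs *)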

McCloud's transitions can (mostly) be made sense of in these terms:

Moment: panel i+1 has the same frame and VEs as panel i but different modifiers.
Action: panel i+1 has the same VEs (give or take) as panel i, but a different frame and modifiers.
Scene: different frame, VEs, and modifiers.
Subject: panel i+j+1 shows VEs from panel i in the same or a similar frame to panel i+j.
Aspect: panel i+1 shows a subset of panel i's VEs together with new VEs.

But of course, these transition types are designed for human interpretation, not machine generation, and there's still a considerable amount of gap-filling to do: what distinguishes an "aspect" change from an "action" change other than the interpretation of different visual elements being part of the same space vs. part of the next step in time? What distinguishes a scene change from a non-sequitur unless the new scene is eventually revealed to connect with the previous one? And furthermore, there's a lot of nondeterminism in when visual elements are allowed to join or leave the narrative, and when new ones can be generated.

So I came up with a more machine-friendly set of panel transitions:

Moment: keep VEs, change frame and/or modifiers.
Add: introduce a VE that didn't appear in the previous panel (but might have appeared earlier).
Subtract: remove a VE from the previous panel (and potentially choose a new frame).
Meanwhile: choose a new frame and only show VEs that didn't appear in the previous panel, generating new VEs if necessary.
Rendez-vous: choose a new frame and fill it with any combination of previously-appearing VEs. Generate new VEs only when there aren't enough previous VEs to fill the frame.

Finally, I also introduced an End transition to allow the generated comic strip to terminate.
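
In the ML program described below, these transitions (End included) presumably boil down to a plain enumerated datatype; the constructor names here match the ones visible in the generator's output:

datatype transition = Moment | Add | Subtract | Meanwhile | RendezVous | End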

After a couple more paper prototype tests, I wrote an ML program to generate abstract comics in this form, e.g.

- ComicGen.gen [] 2;
val it =
  [({elements=[1,2],name="aid"},Add),
  ({elements=[3,1,2],name="monolog"},Subtract),
  ({elements=[3,2],name="carry"},Moment),
  ({elements=[3,2],name="whisper"},Meanwhile),
  ({elements=[],name="blank"},Meanwhile),
  ({elements=[1],name="fall"},RendezVous),
  ({elements=[4,3],name="touch"},Meanwhile),
  ({elements=[1],name="monolog"},RendezVous),
  ({elements=[4,3],name="dialog"},End)]
  : (ComicGen.panel * ComicGen.transition) list
The bit of the program that interprets transitions is:

case transition of
    (* Moment: keep the previous panel's VEs and pick a frame they fit *)
    Moment =>
           let
             val {name, nholes} = pickFrame currentNVEs
           in
             ({name=name, elements=justPrior}, totalNVEs) 
           end
    (* Add: introduce one VE that wasn't in the previous panel (possibly brand-new) *)
  | Add =>
           let
             val unused = nonmembers justPrior allPrior
             val howManyNew = 1 (* Random.randRange (1,2) rand *)
             val {name, nholes} = pickFrame (currentNVEs + howManyNew)
             val (new_elts, new_total) = pickRandomVEs unused howManyNew totalNVEs
             val new_elts = new_elts @ justPrior
           in
             ({name=name, elements=new_elts}, new_total) 
           end
    (* Subtract: drop a random VE from the previous panel (or pick an empty frame if none remain) *)
  | Subtract =>
             if List.length justPrior > 0 then
              let
                val nVEs = currentNVEs - 1
                val {name, nholes} = pickFrame nVEs
                val elts = removeRandom justPrior
              in
                ({name=name, elements=elts}, totalNVEs)
              end
            else
              let
                val {name, nholes} = pickFrame 0
              in
                ({name=name, elements=[]}, totalNVEs)
              end
    (* Meanwhile: new random frame, showing only VEs absent from the previous panel *)
  | Meanwhile =>
             let
               val skipVEs = nonmembers justPrior allPrior
               val {name, nholes} = pickRandomFrame ()
               val (elts, newTotal) = pickRandomVEs skipVEs nholes totalNVEs
             in
               ({name=name, elements=elts}, newTotal)
             end
    (* RendezVous: new random frame, filled from all previously-seen VEs *)
  | RendezVous =>
             let
               val {name, nholes} = pickRandomFrame ()
               val (elts, newTotal) = pickRandomVEs allPrior nholes totalNVEs
             in
               ({name=name, elements=elts}, newTotal)
             end
    (* End: terminate with a blank panel *)
  | End => ({name="blank", elements=[]}, totalNVEs)
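
For completeness, here's one way the helper functions used above could be defined. These are reconstructions from the call sites (using SML/NJ's Random structure), not the original definitions; in particular, the frame table just reuses names from the sample output with guessed hole counts, and the real pickFrame/pickRandomVEs may make different choices about when to mint fresh VEs:

val rand = Random.rand (17, 42)  (* SML/NJ random seed *)

(* the frame library; names taken from the sample output *)
val frames : {name : string, nholes : int} list =
  [{name="blank", nholes=0}, {name="monolog", nholes=1},
   {name="fall", nholes=1}, {name="aid", nholes=2},
   {name="carry", nholes=2}, {name="whisper", nholes=2},
   {name="touch", nholes=2}, {name="dialog", nholes=2}]

(* nonmembers xs ys: the elements of ys that do not occur in xs *)
fun nonmembers xs ys =
  List.filter (fn y => not (List.exists (fn x => x = y) xs)) ys

(* pickRandomFrame (): any frame, uniformly at random *)
fun pickRandomFrame () =
  List.nth (frames, Random.randRange (0, length frames - 1) rand)

(* pickFrame n: a random frame that n VEs suffice to fill (nholes <= n),
   since panels may carry extra VEs beyond the frame's holes *)
fun pickFrame n =
  let val fitting = List.filter (fn {nholes, ...} => nholes <= n) frames
  in List.nth (fitting, Random.randRange (0, length fitting - 1) rand) end

(* removeRandom xs: xs with one randomly chosen element deleted *)
fun removeRandom xs =
  let val i = Random.randRange (0, length xs - 1) rand
  in List.take (xs, i) @ List.drop (xs, i + 1) end

(* pickRandomVEs pool n total: draw n VEs, preferring the pool and minting
   fresh identifiers (total+1, total+2, ...) once it is exhausted; returns
   the drawn VEs along with the updated count of VEs ever created *)
fun pickRandomVEs pool n total =
  if n <= 0 then ([], total)
  else if null pool then
    let val (rest, total') = pickRandomVEs [] (n - 1) (total + 1)
    in ((total + 1) :: rest, total') end
  else
    let
      val i = Random.randRange (0, length pool - 1) rand
      val pool' = List.take (pool, i) @ List.drop (pool, i + 1)
      val (rest, total') = pickRandomVEs pool' (n - 1) total
    in (List.nth (pool, i) :: rest, total') end
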
I did a few hand-renderings of these generated strips:

Later, I also wrote a Ceptre version of the generator, mostly just for the comparison exercise. My current conclusions: the Ceptre version is indeed more concise (especially when putting aside the re-implemented arithmetic and basic datatypes), but it was quite a bit more difficult to get bug-free. (If only there were some analog of types for generative logic programs...)

If I continue working on this project, I plan to port my ML code to JavaScript and write a panel renderer so that I can let people play with the generator in a browser. If anyone wants to scoop me for this step, though, please feel free, since this is not my primary research project and I should probably move on from it. :)

Theoretically speaking, there's already a fair amount to reflect on here. I'm used to taking a simulationist approach to narrative generation, i.e. modeling an action possibility space for virtual agents and letting action descriptions constrain the generated artifact. With comic generation, I'm struck by how much the nonsense typical of Markov-chain output is mitigated by prioritizing reference to previously seen visual elements, and by how "narrative-feeling" these generated panel sequences manage to be.

Mattie Brice has written about the strange lack of interpretive components in games, pointing to the Tarot as an example of a practice that does centralize interpretation within a generative system. (Tarot in particular could be an interesting system to operationalize for narrative generation, since its two-dimensional "spreads" symbolize more complex relationships than simple temporal sequence.) Divination systems have an established link to generative stories through Nick Montfort's observation that the I Ching and the Llull machine are pre-digital examples of text generators. And Mitu Khandaker-Kokoris has spoken about two understandings of "immersion": one is the typical VR fantasy, and the other comes from human brains filling in the gaps left by sufficiently agentive-seeming abstract rule systems.

I feel like the current mainstream of game design and PCG is so literalist, measuring the effectiveness of play or generated artifacts in terms of how immediately legible they are, while arbitrarily privileging other forms of difficulty (spatial reasoning, twitch reflexes, etc.). In contrast, I find the recurring theme of meaning-making as a mechanic really exciting: it conceptualizes human engagement with a system as an effort of interpretive meditation, and embraces that phenomenon as an alternative to other metrics of "flow"/"immersion"/"fun."

Comments

  1. This is a really cool project and I think the ICCC community would love it, by the way (deadline's February and it's in Paris!)

    The last paragraph is interesting to me - I think I agree with what you're saying, and I would love to do more work in that space, but personally I feel constrained already by how evaluation-light my work is, and I feel obliged to do work that is easier to do (lazy, meaningless but reviewer-pleasing) evaluation on. It's a sad realisation (the papers I'm drafting for next year are all of this form, in fact).

    (This is @mtrc btw, it took me forever to sign in to comment >_>)

