[OPR] Schmidt & Marx: Co-Constructing Tele-Presence by Embodying Avatars: Evidence from Let’s Play Videos

Update (22.03.2023): The Open Peer Review for this submission has been completed. Based on the Open Peer Review, the article has been approved for publication in the Journal for Media Linguistics and is available at: https://doi.org/10.21248/jfml.2021.35.

On this page you can download the discussion paper that was submitted for publication in the Journal for Media Linguistics. The blogstract summarises the submission in a comprehensible manner. You can comment on the discussion paper and the blogstract below this post. Please use your real name for this purpose. For detailed comments on the discussion paper please refer to the line numbering of the PDF.

This submission is a contribution to the special issue “Co-constructing presence between players and non-players in videogame interactions”.

Discussion Paper (PDF)

Blogstract of

Co-Constructing Tele-Presence by Embodying Avatars: Evidence from Let’s Play Videos

by Axel Schmidt & Konstanze Marx

Our data come from so-called Let’s Plays, videos that present and comment on computer gaming on the internet and that are among the most successful YouTube genres. Let’s Plays can be produced in single-player mode (one person plays and comments) or in multiplayer mode (several people play and comment together).

Video games are attractive because they are highly immersive and interactive (Freyermuth 2015). Exactly these characteristics are lost as soon as Let’s Plays are produced as videos. Recipients no longer have the chance to intervene in the game; they can only watch others play. Thus, the reception situation is comparable to watching a show on TV (Ackermann 2016). We assume that the accompanying moderation of Let’s Plays is crucial to making a computer game ‘watchable’ (Schmidt/Marx 2020). That is, Let’s Players are constantly engaged in embodying their avatars by formulating and explaining their actions in the game and by producing so-called response cries (Goffman 1981) in reaction to game events. In this way, they make their experiences during the game more transparent for spectators and thereby construct a specific kind of (tele-)presence.

Following an ethnomethodological conversation analytic approach, our paper focuses on practices of making computer games ‘watchable’. One way to do this is to exploit the participation framework specific to computer games, which comprises at least players and avatars that are connected with one another in several ways (Baldauf-Quilliatre/Colón de Carvajal 2015; Mondada 2012; Keating/Sunakawa 2010). The presentation mode of Let’s Plays usually consists of game activities on a large screen and the players’ simultaneous facial expressions on a small screen (transmitted by a facecam). The embodied activities of the players are evidently used to enhance the pleasure of merely watching the game. We are interested in how players use their voices and the facecam either to interact with avatars or non-player characters (mainly in single-player mode) or to animate avatars (frequently in multiplayer mode). Both practices can be read as attempts to ‘embody’ avatars in order to ‘bring them to life’ and to make (watching) the game more lively.


Ackermann, Judith (2016) (Ed.): Phänomen Let’s play-Video: Entstehung, Ästhetik, Aneignung und Faszination aufgezeichneten Computerhandelns. Wiesbaden: Springer VS.

Baldauf-Quilliatre, Heike/Colón de Carvajal, Isabel (2015): Is the avatar considered as a participant by the players? A conversational analysis of multi-player videogames interactions. In: PsychNology Journal, 13, 2-3, 127-147.

Freyermuth, Gundolf S. (2015): Games, game design, game studies: eine Einführung. Bielefeld: transcript.

Goffman, Erving (1981): Response Cries. In Goffman, Erving (Ed.): Forms of Talk. Philadelphia: University of Pennsylvania Press, 78-122.

Mondada, Lorenza (2012): Coordinating action and talk-in-interaction in and out of video games. In Ayaß, Ruth/Gerhardt, Cornelia (Eds.): The appropriation of media in everyday life. Philadelphia: Benjamins, 231-270.

Keating, Elizabeth/Sunakawa, Chiho (2010): Participation cues: Coordinating activity and collaboration in complex online gaming worlds. In: Language in Society, 39, 331-356.

Schmidt, Axel/Marx, Konstanze (2020): Making Let’s Plays watchable: An interactional approach to multimodality. In Thurlow, Crispin/Dürscheid, Christa/Diémoz, Federica (Eds.): Visualizing (in) the New Media. London: John Benjamins, 131-150.

2 Replies to “[OPR] Schmidt & Marx: Co-Constructing Tele-Presence by Embodying Avatars: Evidence from Let’s Play Videos”

  1. Martin Luginbühl, July 15, 2020 at 08:23

    This is an excellent article on crucial characteristics of Let’s Play videos, a genre with a complex media setting: gamers use their devices to control an avatar that virtually represents these activities and with whom they partly conflate, judging by their language use; at the same time, the Let’s Players are constantly oriented towards the viewers in order to make their game “watchable”, as the authors put it. The analysis within the field of multimodal, ethnographic CA (but, as I might add, media linguistics as well) is concerned with questions of media affordances, participation framework, and accounts of immediacy and presence that are performed with specific practices. Analyzing examples of a German-speaking player, the authors focus on two important practices: formulating one’s own actions and animating avatars via response cries. While formulating one’s own actions makes the Let’s Play more transparent for viewers, the response cries are part of a (multimodal, as the analysis shows very nicely) embodiment of avatars. The main hypothesis – whose correctness is shown very convincingly – is that these practices interact on the split screen with game actions as well as with the facecam of the gamers in order to embody the avatars, resulting in tele-presence, which enables the viewers to experience the gamer’s presence that has shifted from the real world to the virtual world.
    Overall, a very informative introduction that really made me want to read on. I was confused by the use of “co-construction” and asked myself whether the analysis would be on games that are played together. It is only in the conclusion that you explain your understanding of co-construction. I asked myself whether this term is adequate in this context, as in the Let’s Plays you are looking at asynchronous one-way communication. If you describe the practices of the Let’s Players and speak of co-construction, then probably every one-way communication could be labelled co-constructed, as it aims at a preferred reading by the viewers. And in the following section (lines 95 ff.) you point out that viewers cannot interact – so how can they co-construct (in a narrower sense of the word)?
    Very good description of the participation framework, which explains crucial terms for the following analysis. As you show in your analysis, not only the verbs but also the body behavior is oriented to the viewers. So the “extra level” you are talking about and the staged “intimate interaction” are also (and probably to a significant degree) the result of showing body movements and facial expressions in close-up videos by the facecam (similar to TV hosts, as shown in different studies on TV news). This becomes very clear in your analysis (e.g. lines 418f.), but in this section you only mention “embodied conduct” (line 88) in passing. Perhaps you could emphasize this aspect more clearly here.
    I would have liked to read some theoretical considerations when it comes to the media ‘infrastructure’ of Let’s Plays; you mention the media involved and you mention “mediation” and “re-mediation” in the Conclusion (lines 783 and 790). Also, you speak of affordances.
    I wonder how you would describe the affordances of such a combination on a theoretical level and how notions of mediation and re-mediation fit in here (we have two media in a third here). In a way you do that implicitly in your article, but I think you could make an additional theoretical point here. Can you grasp on a theoretical level what happens in terms of media affordances, mediation and re-mediation when gamers not just play for themselves but play for a Let’s Play?
    Again, a very readable and stimulating chapter. I was wondering if you would restrict the term tele-presence to the gamers or also extend it to the viewers. You speak of the gamers (in a quote in line 236 the “media users”) and it is clear that gamers are tele-present in the games. But what about the viewers? Are they – due to the practices of the gamers you describe – tele-present as well? I think you would say so, but I am not sure. And if yes: Do we need a sub-classification of tele-presence (gamers vs. viewers)?
    Chapter describing data and method, no comments on that.
    Very convincing, plausible and detailed (but never boring) analysis, with very good examples showing different subtypes of the practices mentioned above (by the way: video 8 is just hilarious). While in the previous chapters you sometimes speak of the “inner state” of the gamers, you write “she displays a stance” here (line 337). Watching video 2, you truly get the impression that the player got scared – but of course the Let’s Players probably perform these displays quite consciously, so I do prefer referring to these phenomena as ‘display of a stance’. (You surely are aware of all of this, but I still think it is important to highlight this difference, especially because it remains unclear what is authentic and what is staged, see below.)
    I think the argument that formulating one’s own actions makes the action more transparent and therefore more attractive is very convincing – as is the conflation of gamer and avatar, e.g. by using “I” (line 377).
    As mentioned above, I would be careful about talking of “inner states” as well as “spontaneous reactions” (line 566); some of them might be spontaneous, others just well staged for the entertainment of a bigger audience. The same is true for emotions (“the player’s emotions”, line 594, similarly line 758). There is some work on reality TV that also addresses this issue, as you never really know what is authentic and what is staged. This is in line with your observation of “liveliness” (line 664), a term created by Tolson (2006; perhaps you could add this reference) that points out that “live” is not just a temporal phenomenon, but can be staged with different means (therefore liveliness). I think that here the notion of “immediacy” is also important – in addition to presence.
    Good summary with a very interesting outlook on future research. Again, I am skeptical about your conceptualization of “co-construction” (lines 745ff.): Does taking viewers into account equal co-construction?
    Consider minor revisions mentioned above.
    Minor remarks:
    Line 209: It should probably read “... is only possible within a functional cycle of sensing”, not “within in”.
    Line 213: At first I struggled to connect the first sentence to the preceding paragraph. Perhaps you could elaborate: “What occurs WHEN PLAYERS DO XYZ…”
    Line 307: In this transcript “GS” and “Fig” are (at least in my PDF) not aligned correctly.

  2. Redaktion, November 23, 2020 at 10:47

    Reviewer: Jannis Androutsopoulos, Hamburg

    Recommendation: Accept with minor revisions

    Review: This is a really worthwhile, in-depth analysis of verbal and embodied actions by which computer gamers animate their avatars when gaming for an audience in so-called Let’s Play videos. The relevance of this research is grounded in the immense popularity of gaming and Let’s Play videos in particular, which are, however, not sufficiently studied from a language and social interaction perspective. The paper is well grounded in interdisciplinary research from relevant fields, notably multimodal interaction analysis and game studies, at least those segments that consider interactional processes and not just the virtual narrative. Key to the analysis the authors offer here is the distinction between action in the real world (HCI) vs. the fictional world (video game fictional interaction), the two combined in a “cybernetic control loop”, which in turn creates the backdrop for gamers’ audience-directed animation practices.

    Against this backdrop, the authors draw on several examples to analyse and illustrate the two practices they focus on: formulating actions and embodying avatars by means of response cries. While the analysis and interpretation are of high scholarly quality and certainly publishable, I would recommend a few amendments to the paper’s overall organisation, which is at times counterintuitive and difficult to follow. The following issues, in order of their appearance in the paper, need to be addressed and amended before acceptance.


    1) The paper lacks advance organisers at key transitional points, by which to direct the reader to the course and unfolding of the argument, esp. between sections 1 and 2, between 4 and 5, and at the start of the very lengthy section 5. As it now stands the argument unfolds rather slowly, and even though the paper is overall well written, there is a risk of coherence loss because the reader is not notified in advance about what’s coming ahead.

    2) It would be good to focus more on the discussion of participation framework in section 2 e.g. by taking it up in the title, adding a bit more detail and background, and by pulling all relevant discussion here. You might want to consider here the work by Marta Dynel on participation framework in YouTube interaction, 2014, it seems highly relevant when it comes to complex participation frameworks in digital media interactions.

    3) Section 3 discusses at length (lines 201 ff) how players feel when immersed in the game. I was struck by the lack of references, so what is the basis of your claims on how gamers feel etc?

    4) Section 4 on data and methods is really thin and fails to give enough background on data, e.g. the exact number of videos, duration of data, and so on. In my view the first para of section 4 could go to the introduction in order to make transparent where the paper is situated epistemologically.

    5) At the start of the analysis, there is a slightly redundant repetition of the two analytical legs (formulating actions and voicing avatars). These are defined and briefly illustrated several times across the paper, which results in redundancy. The authors label the first subsection “Hypothesis”, but it is unclear to me what exactly their hypothesis is. The analysis of these two practices does not constitute a hypothesis. Either clarify this right at the start of section 5 or perhaps consider renaming subsection 5.1 as e.g. “Definitions” of the two practices in focus.

    6) During the analysis there are a few references to research literature, which seem to come in pretty late, e.g. lines 688-690. This seems like relevant research findings to the core issues of your analysis, so why is it disclosed so late? There are also references to data size (lines 571-575) that are not pre-empted in the presentation of data.
