Initial Capabilities

 

The user actions that the very first incarnation of a Time Browser must support:

 

 

Recording Audio

 

Limitations:

 

  • The user will need to use a headset with a microphone to ensure that only the user's voice is captured.
  • Recording will be in high quality (44.1 kHz at 16 bits).
  • Speech-to-text transcription will take place on the user's device, as a first attempt.

 

Recording Set-Up: Room ID & Event

 

A 'host' will share a Room ID. This ID will be the name of the Room plus the ID of the host.
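To make this concrete, a minimal sketch of how such an ID might be composed and parsed; the '::' separator and the function names are assumptions, not part of this specification:

```python
# Hypothetical sketch: composing and parsing a Room ID from the Room
# name and the host's user ID. The "::" separator is an assumption.
def make_room_id(room_name: str, host_id: str) -> str:
    return f"{room_name}::{host_id}"

def parse_room_id(room_id: str) -> tuple[str, str]:
    room_name, host_id = room_id.split("::", 1)
    return room_name, host_id

# Example: a host with user ID "ed42" sharing a Room called
# "Monday Standup" would hand out the ID "Monday Standup::ed42".
```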

 

  • Everyone taking part in the same Event (an event is an audio, video or other recording with a specific start and end time) will enter this Room ID into their application.
  • An Event does not strictly need a Room ID assigned to it, but without one it is only a single-user, single-voice recording, so for version 1 a Room ID will be required.
  • An Event is defined as any contiguous recording by one or more users.
  • When no one is recording with a given Room ID, the Event ends.
  • For version 1 an Event is a maximum of 6 hours, to prevent anyone recording at all times. This is an artificial limitation which will be re-evaluated over time.
  • The client application will send a ping to the server while recording (see the sketch after this list).
  • Recordings will be compressed (audio levels, not quality) to reduce volume changes during head movement.
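A minimal sketch of the recording ping, assuming a simple JSON payload, a placeholder endpoint and a 30-second interval (none of which are specified here). The idea is that the server can treat an Event as over once no client with that Room ID has pinged for a while.

```python
# Hypothetical sketch of the recording ping: while recording, the client
# periodically tells the server it is still active in a Room, so the
# server can consider the Event ended once no client has pinged recently.
# The endpoint, interval and payload fields are assumptions.
import json
import time
import urllib.request

PING_INTERVAL_SECONDS = 30               # assumed interval
SERVER_URL = "https://example.com/ping"  # placeholder endpoint

def send_recording_ping(user_id: str, room_id: str) -> None:
    payload = json.dumps({
        "userId": user_id,
        "roomId": room_id,
        "timestamp": time.time(),        # real-world time of the ping
    }).encode("utf-8")
    request = urllib.request.Request(
        SERVER_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)

def ping_while_recording(user_id: str, room_id: str, is_recording) -> None:
    # is_recording is a callable returning True while audio is being captured.
    while is_recording():
        send_recording_ping(user_id, room_id)
        time.sleep(PING_INTERVAL_SECONDS)
```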

 

Server

 

  • On clicking 'Done' the client will compress the audio and put it into a .zip wrapper with:
    • The full text transcription, which has time-stamped every word
    • A JSON metadata document which includes:
      • Instructions for assigning real-world time to the internal audio timeline
      • The number of expected participants in the Room at the time of recording
  • The .zip will be named in this format: User ID, Room ID, start and stop times (a packaging sketch follows this list).
  • This is then sent to the server, likely hosted on Google or Amazon infrastructure.
  • When the server receives a .zip it will extract it and check who else it expects recordings from.
  • When the server is confident it has all the recordings from the Event, it will combine them into a multitrack audio file, bundle all the transcriptions, zip the result and send it to all participants, keeping a 'master' copy on the server.
  • If the documents are tagged as requiring human transcription, the audio files are sent out for it; when the results are received, a new .zip is created with 'ht' (for 'Human Transcription') appended to the name and sent to all participants.
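A minimal sketch of the client-side 'Done' step described above, assuming particular file names, JSON field names and a name separator that are not part of this specification:

```python
# Hypothetical sketch of the 'Done' step: bundle the compressed audio,
# the word-time-stamped transcription, and a JSON metadata document into
# a .zip named from the User ID, Room ID and start/stop times.
# File names, field names and the underscore separator are assumptions.
import json
import zipfile
from datetime import datetime, timezone

def package_recording(audio_path: str, transcript_words: list[dict],
                      user_id: str, room_id: str,
                      start: datetime, stop: datetime,
                      expected_participants: int) -> str:
    metadata = {
        # Instructions for assigning real-world time to the internal audio
        # timeline: the audio's 0:00 corresponds to this UTC instant.
        "realWorldStart": start.astimezone(timezone.utc).isoformat(),
        # Number of expected participants in the Room at recording time.
        "expectedParticipants": expected_participants,
    }
    # Name format from the spec: User ID, Room ID, start and stop times.
    zip_name = f"{user_id}_{room_id}_{start:%Y%m%dT%H%M%S}_{stop:%Y%m%dT%H%M%S}.zip"
    with zipfile.ZipFile(zip_name, "w", zipfile.ZIP_DEFLATED) as bundle:
        bundle.write(audio_path, arcname="audio.flac")
        # Each transcript entry carries a word and its offset into the audio.
        bundle.writestr("transcript.json", json.dumps(transcript_words))
        bundle.writestr("metadata.json", json.dumps(metadata))
    return zip_name
```

Storing the real-world start in UTC is one way to make the mapping unambiguous across time zones; the actual choice of representation is open.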

 

 

 

Interaction

 

The resulting event record will be presented as a text document with rich controls:

 

 

Query the record for an utterance or a response

how to choose what is presented on the screen – a search

 

In this scenario we expect the user to be able to query the system via voice. The results should be shown as a text document with deep interactions:

 

  • Show me everything Ed has said this month. This is the most basic query: show utterances based on a participant. The time limitation is optional; it should also be possible to ask to see everything Ed has ever said into the system.
  • What did Lisa say about OHS this week? This is also a fairly basic query: show utterances based on a participant and a keyword.
  • What did Sam say in response to Stan about DKR? Results should be shown in the document as text, showing everything uttered by Sam immediately after Stan was speaking, where one of them mentioned DKR (a minimal filtering sketch follows this list).
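A minimal sketch of how these three query types could be answered over a list of transcribed utterances. The utterance fields and the 'in response to' heuristic (an utterance immediately following the other speaker) are assumptions:

```python
# Hypothetical sketch: answering the example queries against a
# time-stamped transcript. Field names and the adjacency heuristic for
# "in response to" are assumptions.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Utterance:
    speaker: str
    text: str
    start: datetime   # real-world time the utterance began

def said_by(utterances, speaker, since=None):
    # "Show me everything Ed has said this month."
    return [u for u in utterances
            if u.speaker == speaker and (since is None or u.start >= since)]

def said_about(utterances, speaker, keyword, since=None):
    # "What did Lisa say about OHS this week?"
    return [u for u in said_by(utterances, speaker, since)
            if keyword.lower() in u.text.lower()]

def said_in_response(utterances, speaker, other, keyword):
    # "What did Sam say in response to Stan about DKR?"
    results = []
    for previous, current in zip(utterances, utterances[1:]):
        if previous.speaker == other and current.speaker == speaker:
            mentioned = any(keyword.lower() in u.text.lower()
                            for u in (previous, current))
            if mentioned:
                results.append(current)
    return results
```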

 

 

Select text to interact with it and change the view

how to interact with what is on the screen

 

The primary view is that of a text document. The user should be able to select text and:

 

  • Press the Space Bar (or something similar) to play back the recorded audio for that selection of text (see the playback sketch after this list).
  • Ctrl-Click (or something similar) on a participant's name and choose:
    • Show me only text from ‘name’
    • Do not show me text from ‘name’
    • Show me who speaks most before ‘name’
    • Show me who speaks most after ‘name’
    • & further queries along the same style…
  • Ctrl-Click (or something similar) on any text and choose:
    • Show only utterances with ‘keyword(s)’
    • Show only utterances without ‘keyword(s)’
    • Expand this section to show more of what was said before or after (either as a command or as an action somehow)
    • Copy as Citation (copies this text with citation information to enter into any document, where the citation includes a link to the audio snippet of this text)
  • ‘Liquid’-style interactions where the user can search based on any text, look up references, translate and so on.
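A minimal sketch of the Space Bar playback action described above, assuming each transcript word carries its start and end offset into the audio; the field names and word indices are assumptions:

```python
# Hypothetical sketch: mapping a text selection to the audio span it
# covers, using the per-word timestamps from the transcription.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start_seconds: float   # offset of the word in the internal audio timeline
    end_seconds: float

def selection_to_audio_span(words: list[Word],
                            first_index: int,
                            last_index: int) -> tuple[float, float]:
    # first_index/last_index are the indices of the first and last
    # selected words in the transcript.
    selected = words[first_index:last_index + 1]
    return selected[0].start_seconds, selected[-1].end_seconds

# A player would then seek to the returned start offset and stop at the
# returned end offset to play back exactly the selected text.
```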

 

Note: As the user continues this interaction, the amount of data on the screen changes. These are only view changes, but the views will be tracked so that they can be saved and shared.

 

 

Create ‘Saved Views’

views are as important as documents

 

The user will be able to apply control over what is shown, store this view, and share it with others.
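A minimal sketch of what a saved view could be: the filters the user has applied, stored as plain data so the view can be re-applied and shared. All field names are assumptions:

```python
# Hypothetical sketch of a 'Saved View': a serialisable record of the
# filters applied to an Event record. Field names are assumptions.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class SavedView:
    event_id: str
    include_speakers: list[str] = field(default_factory=list)
    exclude_speakers: list[str] = field(default_factory=list)
    required_keywords: list[str] = field(default_factory=list)

def save_view(view: SavedView, path: str) -> None:
    with open(path, "w") as f:
        json.dump(asdict(view), f)

def load_view(path: str) -> SavedView:
    with open(path) as f:
        return SavedView(**json.load(f))
```

Because the view is just data, sharing it means sending this small record rather than a copy of the underlying Event.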

 

 

Share Real-Time Coded Documents

emergence of a new way to handle real-world time in documents

 

Another key feature is the ability to share a stream document with real-world time encoded, so that anyone can open it and jump not just to a specified offset in the record but to a specific real-world time, such as 2:37pm. This would entail matching real-world time to the record's internal timecode and storing that mapping in the document itself, perhaps in the EXIF data, though this should not prevent the record from being opened in a regular media browser.
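A minimal sketch of the lookup this would enable, assuming the record's real-world start time is stored with the document:

```python
# Hypothetical sketch: converting a wall-clock time such as 2:37pm into
# an offset on the record's internal timeline, given the real-world
# start time stored in the document's metadata (field name assumed).
from datetime import datetime

def real_time_to_offset_seconds(real_world_start: datetime,
                                target: datetime) -> float:
    # e.g. a recording that started at 14:02:10, asked for 14:37:00,
    # yields an offset of 2090 seconds into the media timeline.
    offset = (target - real_world_start).total_seconds()
    if offset < 0:
        raise ValueError("Requested time is before the recording started")
    return offset
```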

 

We could use the same .zip wrapper the client application uses to send the initial recording to the server, but this would not allow other applications to use the audio without being made aware of, or updated to support, this format.

 

•  User emails a document which looks like a normal media document, and someone who is not using any Time Browser software can still open it and play it as normal.

•  User emails a document which looks like a normal media document, and someone who is using Time Browser software can jump to a specific real-world time in the player timeline.