Initial Capabilities
The user actions that the very first incarnation of a Time Browser must support:
Vocabulary
Room is a dimension/tag used to collect participants' recordings.
Event is defined by a start time and an end time.
Session is any recording made in a Room during an Event.
Conference
The way someone joins a Room is through entering the shared Room ID into their application (see Recording Set-Up below).
Recording Audio
Limitations
- The initial platform will be the Google Chrome Browser.
- The user will need to use a headset & microphone to ensure that only the user's voice is captured.
- Recording will be in high quality (44.1kHz at 16 bits).
- The audio recording will be audio-compressed (level compression, not file compression) to minimize volume changes when the user's head moves, if this is cheap to implement; a capture sketch follows this list.
- Speech-to-text transcription will take place on the user's device for a first-run attempt. Tests will need to be done to establish the best library for this.
- We will need to test NTP with this system to see how accurate we can get the timing. Timing is CRUCIAL for this project.
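A minimal sketch of the capture path in Chrome, assuming the Web Audio API's DynamicsCompressorNode is an acceptable way to do the level compression described above; the sample rate matches the spec, but the threshold and ratio values, and the use of MediaRecorder for capture, are assumptions.

```typescript
// Sketch: capture the microphone, run it through a dynamics compressor to
// even out level changes (e.g. from head movement), and record the result.
// Parameter values are assumptions for illustration; MediaRecorder's default
// codec (Opus in WebM) is a placeholder and may not meet the 16-bit PCM goal.
async function startCompressedRecording(): Promise<MediaRecorder> {
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });

  const ctx = new AudioContext({ sampleRate: 44100 }); // 44.1 kHz target
  const source = ctx.createMediaStreamSource(mic);

  // Dynamics (level) compression, not file compression.
  const compressor = ctx.createDynamicsCompressor();
  compressor.threshold.value = -30; // dB, assumed starting point
  compressor.ratio.value = 4;       // gentle 4:1 compression, assumed

  // Route the compressed signal into a new stream for the recorder.
  const destination = ctx.createMediaStreamDestination();
  source.connect(compressor).connect(destination);

  const recorder = new MediaRecorder(destination.stream);
  const chunks: Blob[] = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.onstop = () => {
    const blob = new Blob(chunks, { type: recorder.mimeType });
    console.log("Session audio ready:", blob.size, "bytes"); // to be transcribed and zipped
  };
  recorder.start();
  return recorder;
}
```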
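For the timing question, the usual NTP offset calculation is the likely starting point. A minimal sketch, assuming a hypothetical HTTP time endpoint since a browser cannot speak raw NTP; the URL and response fields are placeholders.

```typescript
// Sketch: estimate the offset between the local clock and a reference clock,
// NTP-style. The endpoint and its response shape are hypothetical.
interface TimeServerResponse {
  receivedAt: number; // t1: server receive time (ms since epoch)
  sentAt: number;     // t2: server send time (ms since epoch)
}

async function estimateClockOffset(url: string): Promise<number> {
  const t0 = Date.now();           // client send time
  const res = await fetch(url);
  const t3 = Date.now();           // client receive time
  const { receivedAt: t1, sentAt: t2 } = (await res.json()) as TimeServerResponse;

  // Standard NTP offset: ((t1 - t0) + (t2 - t3)) / 2
  return ((t1 - t0) + (t2 - t3)) / 2;
}
```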
Recording Set-Up: Room ID & Event
A participant will share a Room ID - which will be used again and again if preferred. This ID will be the name of the Room plus the ID of the host.
- Everyone taking part in the same Event (an event is an audio, video or other recording with a specific start and end time) will enter this Room ID into their application.
- An Event does not need a Room ID assigned to it, but then it is only a single-user, single-voice recording, so for version 1 a Room ID will be required.
- An Event is defined as any contiguous recording by one or more users.
- When no one is recording with a certain Room ID, the Event stops.
- For version 1 an Event is a maximum of 6 hours, to prevent anyone recording at all times. This is an artificial limitation which will be re-evaluated over time.
- The client application will send a ping to the server while recording (see the ping sketch after this list).
- Recording will be compressed (audio levels, not quality) to reduce volume changes during head movement.
- For this initial release we expect perfect recording, with no dropouts, so there will be no fault-tolerance work.
- The naming convention for the Event will be the Room name plus the date and time, with 'meeting 2' etc. appended if there is more than one meeting on that date.
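A minimal sketch of the recording ping, which is also what lets the server notice when no one is recording with a given Room ID any more; the endpoint path, payload fields, and interval are assumptions.

```typescript
// Sketch: while recording, periodically tell the server that this user is
// still contributing to the Event in this Room. Endpoint and fields are
// assumptions for illustration.
function startRecordingPing(roomId: string, userId: string): () => void {
  const intervalMs = 15_000; // assumed ping interval
  const timer = setInterval(() => {
    fetch("/api/recording-ping", {          // hypothetical endpoint
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ roomId, userId, clientTime: Date.now() }),
    });
  }, intervalMs);
  return () => clearInterval(timer);        // call to stop pinging
}
```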
Server
- On clicking 'Done', the client application will data-compress the audio.
- It will then create a .zip wrapper with:
- The data-compressed audio.
- The full text transcription, with every word time-stamped
- A JSON metadata document (see the metadata sketch after this list) which includes:
- Instructions for assigning real-world time to the internal audio timeline
- The number of expected participants in the Room at the time of recording
- The .zip will be named in this format: User ID, Room ID, start and stop times.
- This .zip is then sent to the server, likely Google or Amazon provided.
- When the server receives a new contribution (an end-user .zip file with a matching Room ID and an overlapping Event time), it will extract it and check who else it expects recordings from.
- When the server is confident it has all the recordings from the Event, it will combine them into a multitrack audio file, bundle all the transcriptions, zip the result, and send it to all participants, keeping a 'master' copy on the server.
- If the documents are tagged as requiring human transcription, the audio files are sent out for this; when the results are received, a new .zip is created with 'ht' (for 'Human Transcription') appended to the name and sent to all participants.
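As a sketch of what the JSON metadata document mentioned in the .zip wrapper above might contain, expressed as a TypeScript shape: every field name below is an assumption, since the spec only fixes the time-mapping instructions and the expected participant count.

```typescript
// Sketch of the per-session metadata carried inside the .zip wrapper.
// Field names are assumptions; the spec only requires a way to map
// real-world time onto the audio timeline and the expected participant count.
interface SessionMetadata {
  roomId: string;
  userId: string;
  expectedParticipants: number;  // participants in the Room when recording started
  // Mapping from the audio's internal timeline to real-world time:
  recordingStartUtc: string;     // e.g. "2017-03-04T14:30:05.210Z"
  clockOffsetMs: number;         // NTP-style offset applied to the local clock
}

// Example .zip name following "User ID, Room ID, start and stop times"
// (the exact separator and date format are assumptions):
// ed_OHS-team_2017-03-04T1430_2017-03-04T1615.zip
```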
Interaction
The resulting event record will be presented as a text document with rich controls:
Query the record for an utterance or a response
how to choose what is presented on the screen – a search
In this scenario we expect the user to be able to query the system via voice. The results should be shown as a text document with deep interactions:
- Show me everything Ed has said this month. This is the most basic query: show phrases based on a keyword. The time limitation is optional; it should also be possible to ask to see everything Ed has ever said into the system.
- What did Lisa say about OHS this week? This is also a fairly basic query: show phrases based on a participant and a keyword.
- What did Sam say in response to Stan about DKR? Results should be shown in the document as text, showing all text uttered by Sam after Stan was speaking and where one of them mentioned DKR.
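These queries all reduce to filtering a set of speaker-attributed, time-stamped utterances. A minimal sketch, assuming the transcript is available in that shape; the type and function names are assumptions.

```typescript
// Sketch: the queries above all reduce to filtering time-stamped,
// speaker-attributed utterances. Types and helper names are assumptions.
interface Utterance {
  speaker: string;   // e.g. "Ed", "Lisa", "Sam"
  text: string;
  startUtc: number;  // ms since epoch, from the time-mapped transcript
  endUtc: number;
}

function queryUtterances(
  record: Utterance[],
  opts: { speaker?: string; keyword?: string; from?: number; to?: number }
): Utterance[] {
  return record.filter((u) =>
    (opts.speaker === undefined || u.speaker === opts.speaker) &&
    (opts.keyword === undefined ||
      u.text.toLowerCase().includes(opts.keyword.toLowerCase())) &&
    (opts.from === undefined || u.startUtc >= opts.from) &&
    (opts.to === undefined || u.endUtc <= opts.to)
  );
}

// e.g. "What did Lisa say about OHS this week?" (startOfWeek is hypothetical):
// queryUtterances(record, { speaker: "Lisa", keyword: "OHS", from: startOfWeek });
```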
Select text to interact with it and change the view
how to interact with what is on the screen
The primary view is that of a text document. The user should be able to select text and:
- Press the Space Bar (or something similar) to play back the recorded audio for that selection of text (a playback sketch follows this list).
- Ctrl-Click (or something similar) on a participant's name and choose:
- Show me only text from ‘name’
- Do not show me text from ‘name’
- Show me who speaks most often before ‘name’
- Show me who speaks most often after ‘name’
- & further queries in the same style…
- Ctrl-Click (or something similar) on any text and choose:
- Show only utterances with ‘keyword(s)’
- Show only utterances without ‘keyword(s)’
- Expand this section to show more of what was said before or after (either as a command or as an action somehow)
- Copy as Citation (copies this text with citation information to enter into any document, where the citation includes a link to the audio snippet of this text)
- ‘Liquid’ style interactions where the user can search based on any text, look up references, translate and so on.
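A minimal sketch of the Space-Bar playback above, assuming the selected text can be mapped back to word-level timestamps in the transcript; the names and timestamp units are assumptions.

```typescript
// Sketch: play the audio behind a text selection by seeking to the first
// selected word's timestamp and pausing at the last word's end time.
// Assumes word-level timestamps (seconds into the recording) are available.
function playSelection(
  audio: HTMLAudioElement,
  selectionStartSec: number,
  selectionEndSec: number
): void {
  audio.currentTime = selectionStartSec;
  audio.play();
  const onTimeUpdate = () => {
    if (audio.currentTime >= selectionEndSec) {
      audio.pause();
      audio.removeEventListener("timeupdate", onTimeUpdate);
    }
  };
  audio.addEventListener("timeupdate", onTimeUpdate);
}
```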
Note: As the user continues this interaction, the amount of data on the screen changes. These are only view changes, but the views will be tracked so that they can be saved and shared.
Create ‘Saved Views’
views are as important as documents
The user will be able to control what is shown, store this view, and share it with others.
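One way a 'Saved View' could be represented, assuming a view is just the filter state applied to the record rather than a copy of its contents; the shape below is an assumption.

```typescript
// Sketch: a Saved View is the filter state, not the data itself, so it is
// small enough to store and share. The shape below is an assumption.
interface SavedView {
  name: string;
  roomId: string;
  includeSpeakers?: string[];  // "Show me only text from 'name'"
  excludeSpeakers?: string[];  // "Do not show me text from 'name'"
  keywords?: string[];         // "Show only utterances with 'keyword(s)'"
  from?: number;               // optional real-world time range (ms since epoch)
  to?: number;
}

// Re-applying a shared view is just re-running its filters over the record,
// e.g. with the queryUtterances() sketch above.
```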
Share Real-Time Coded Documents
emergence of a new way to handle real-world time in documents
Another key feature is the ability to share a stream document with real-world time encoded, so that anyone can open it and go not just to a specified offset into the record but to a specific real-world time, such as 2:37pm. This would entail matching real-world time to the record's internal timecode and storing it in the document itself, perhaps in the EXIF data, though this should not prevent the record from being opened in a regular media browser.
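A minimal sketch of that mapping, assuming the recording's real-world start time travels with the document: seeking to a wall-clock time such as 2:37pm is then simple arithmetic against the embedded start time.

```typescript
// Sketch: convert a real-world time into a seek position in the recording,
// given the recording's real-world start time carried in the document.
function seekToRealWorldTime(
  audio: HTMLAudioElement,
  recordingStartUtcMs: number,  // from the embedded metadata
  targetUtcMs: number           // e.g. Date.parse("2017-03-04T14:37:00Z")
): void {
  const offsetSec = (targetUtcMs - recordingStartUtcMs) / 1000;
  if (offsetSec >= 0 && offsetSec <= audio.duration) {
    audio.currentTime = offsetSec;
  }
}
```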
We could use the same .zip wrapper that the client application uses to send the initial recording to the server, but this would not allow other applications to use the audio without being made aware of, or updated to allow for, this format.
• User emails a document which looks like a normal media document, and someone who is not using any Time Browser software can still open it and play it as normal.
• User emails a document which looks like a normal media document, and someone who is using Time Browser software can point to a specific real-world time in the player timeline.