Parse WebVTT file and iterate through results

I’m trying to learn Swift, haven’t done much programming in a very long time, and could use some help.

I’m trying to load a local WebVTT file and then parse the timestamps and their associated text.

Ultimately I’m hoping to make a Mac app for personal use that loads a WebVTT transcript and syncs it to the associated audio file as an interactive transcript: the text follows along with the audio, and conversely, tapping on text starts the audio playing from that location.

Below is a sample WebVTT file:

let testString = """
WEBVTT
 
00:00:00.000 --> 00:00:04.560
Did you know that up to 80 to 90 percent


00:00:04.560 --> 00:00:09.080
of those making a decision to Code will hit a wall and give up?


00:00:09.080 --> 00:00:14.920
What? Yes, it's true. Did you know that most people give up within the first year?


00:00:14.920 --> 00:00:19.400
Some extra dialog would go here


00:00:19.400 --> 00:00:23.360
also additional dialog would also go here
"""

In case the formatting doesn’t show well: there’s a single return between WEBVTT and the first timestamp, and then a double return between all the remaining cues, including at the end of the file.

My Swift code:

let regexPattern = /(?m)^(\d{2}:\d{2}:\d{2}\.\d+) +--> +(\d{2}:\d{2}:\d{2}\.\d+).*[\r\n]+\s*(?s)((?:(?!\r?\n\r?\n).)*)/

if let match40 = testString.firstMatch(of: regexPattern) {
    let entireResult = match40.output.0   // the whole matched cue
    let firstResult = match40.output.1    // start timestamp
    let secondResult = match40.output.2   // end timestamp
    let thirdResult = match40.output.3    // cue text
}

This works for getting the first result of each match and group, which was my proof of concept that I seem to be on the right track.

My issue, however, is that when I try switching out firstMatch for wholeMatch, nothing I’ve tried lets me go beyond the first match (or works at all).

I’ve had Xcode convert the regex into RegexBuilder code, but then the code above doesn’t work, and nothing I try works with the auto-generated builder code.

Secondly, and the larger next issue: how do I iterate through the matches as needed? This will be for a file that contains approximately an hour of transcription.

I’m at a loss. Over the last month I have posted this in other places and am really kind of stuck.

That's a very scary-looking regular expression. 🙂

Why don't you tackle the problem by breaking it into smaller pieces and working on one piece at a time? That is, break the whole text into lines, and parse each line individually and collate the results.
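A minimal sketch of that line-by-line approach, assuming every cue is a timing line containing "-->" followed by its text lines, with a blank line ending the cue. The `Cue` type, the `parseCues` function, and the `sampleVTT` constant are made-up names for illustration, not part of any library:

```swift
import Foundation

// Hypothetical type for illustration: one transcript cue.
struct Cue {
    let start: String
    let end: String
    var text: String
}

// Splits the input into lines and collates them into cues.
// Lines before the first timing line (such as the WEBVTT header) are ignored.
func parseCues(_ vtt: String) -> [Cue] {
    var cues: [Cue] = []
    var current: Cue? = nil

    for rawLine in vtt.components(separatedBy: .newlines) {
        let line = rawLine.trimmingCharacters(in: .whitespaces)

        if line.contains("-->") {
            // A timing line starts a new cue.
            if let finished = current { cues.append(finished) }
            let parts = line.components(separatedBy: "-->")
            let start = parts[0].trimmingCharacters(in: .whitespaces)
            // Keep only the timestamp and drop anything that follows it.
            let end = parts[1].trimmingCharacters(in: .whitespaces)
                .components(separatedBy: " ")[0]
            current = Cue(start: start, end: end, text: "")
        } else if line.isEmpty {
            // A blank line ends the cue in progress, if there is one.
            if let finished = current {
                cues.append(finished)
                current = nil
            }
        } else if var cue = current {
            // Any other line is text belonging to the cue in progress.
            cue.text += (cue.text.isEmpty ? "" : "\n") + line
            current = cue
        }
    }
    if let finished = current { cues.append(finished) }
    return cues
}

let sampleVTT = """
WEBVTT

00:00:00.000 --> 00:00:04.560
Did you know that up to 80 to 90 percent


00:00:04.560 --> 00:00:09.080
of those making a decision to Code will hit a wall and give up?
"""

let cues = parseCues(sampleVTT)
for cue in cues {
    print(cue.start, cue.end, cue.text)
}
```

Each piece is small enough to check by hand in a playground, which is the main appeal over a single large regular expression.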

That’s initially what I was trying to do when I had Xcode try to convert the RegEx into the newer RegEx builder code.

I figured that with the captures I could do named captures, but nothing worked for me, and I’m not at the point where I could figure out why it wasn’t working, as there were no explicit errors.

Maybe it was a mistake on my part, but many of the tutorials out there for learning Swift focus on games, and I figured I’d try to learn by building something I was interested in and would maintain myself.

Breaking it down is my next step, but I wanted to put out a call for help as well, as I realize I’m in over my head at the moment.

Lexical analysis and parsing of text input is an excellent way to learn programming in Swift.

Start by writing a lexical analyser, something that reads text input character by character and turns it into a sequence of tokens.

Then write a simple recursive-descent parser, something which takes the sequence of tokens and creates an abstract syntax tree.

Once you have the abstract syntax tree, you can do lots of fun stuff with it.
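To make the tokenizing idea concrete, here is a small sketch applied to a single timestamp rather than a whole file. It reads the input character by character, groups digits into number tokens, and then checks for one fixed token sequence, which is much simpler than a real recursive-descent parser. `Token`, `tokenize`, and `seconds(from:)` are hypothetical names, and the sketch assumes exactly three fractional digits, as in the sample file:

```swift
// Hypothetical token type for illustration.
enum Token: Equatable {
    case number(Int)
    case colon
    case dot
}

// Reads the input one character at a time and emits tokens.
// Returns nil if it meets a character it does not recognise.
func tokenize(_ input: String) -> [Token]? {
    var tokens: [Token] = []
    var digits = ""

    // Turns any digits collected so far into a single number token.
    func flushDigits() {
        if let value = Int(digits) { tokens.append(.number(value)) }
        digits = ""
    }

    for character in input {
        switch character {
        case _ where character.isASCII && character.isNumber:
            digits.append(character)
        case ":":
            flushDigits()
            tokens.append(.colon)
        case ".", ",":
            flushDigits()
            tokens.append(.dot)
        default:
            return nil
        }
    }
    flushDigits()
    return tokens
}

// Accepts exactly: number : number : number . number
// Assumes the last number is milliseconds (three digits).
func seconds(from timestamp: String) -> Double? {
    guard let tokens = tokenize(timestamp), tokens.count == 7,
          case .number(let hours) = tokens[0], tokens[1] == .colon,
          case .number(let minutes) = tokens[2], tokens[3] == .colon,
          case .number(let secs) = tokens[4], tokens[5] == .dot,
          case .number(let millis) = tokens[6]
    else { return nil }
    return Double(hours * 3600 + minutes * 60 + secs) + Double(millis) / 1000
}

if let offset = seconds(from: "00:00:04.560") {
    print(offset) // roughly 4.56 seconds into the audio
}
```

A number of seconds like this is probably the form you would want when seeking in an audio player, though that part is outside this sketch.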

Have fun learning new things. 🙂

Appreciate the tips and definitely have it on my short list of things to dive deeper into.

For anyone else coming across this I wanted to post what I've worked out so far:

Note: my sample WebVTT file is stored in a multi-line string constant named "testString", as shown earlier in this thread.

import RegexBuilder

let patternTest = Regex {
    /^/
    // Capture 1: start timestamp (hh:mm:ss.mmm)
    Capture {
        Regex {
            Repeat(count: 2) {
                One(.digit)
            }
            ":"
            Repeat(count: 2) {
                One(.digit)
            }
            ":"
            Repeat(count: 2) {
                One(.digit)
            }
            One(.anyOf(".,"))
            Repeat(count: 3) {
                One(.digit)
            }
        }
    }
    One(.whitespace)
    "-->"
    One(.whitespace)
    // Capture 2: end timestamp (hh:mm:ss.mmm)
    Capture {
        Regex {
            Repeat(count: 2) {
                One(.digit)
            }
            ":"
            Repeat(count: 2) {
                One(.digit)
            }
            ":"
            Repeat(count: 2) {
                One(.digit)
            }
            One(.anyOf(".,"))
            Repeat(count: 3) {
                One(.digit)
            }
        }
    }
    "\u{A}"
    // Capture 3: cue text, up to (but not including) the next blank line
    Capture {
        Regex {
            ZeroOrMore {
                /./
            }
            ZeroOrMore {
                Regex {
                    Optionally {
                        "\u{D}"
                    }
                    "\u{A}"
                    NegativeLookahead {
                        Regex {
                            Optionally {
                                "\u{D}"
                            }
                            "\u{A}"
                        }
                    }
                    ZeroOrMore {
                        /./
                    }
                }
            }
        }
    }
}
.anchorsMatchLineEndings()

The above is my RegexBuilder code that I got working.

Below is my code to loop through and print all the regex matches:

let matches = testString.matches(of: patternTest)

print("Array of Text elements are:")
for match14 in matches {
    let (notSureWhatGoesHere) = match14.output

    print(match14.output.3)
}

The code above prints only the text from the WebVTT file, without the timestamps. If you need all of the starting timestamps, change "print(match14.output.3)" to "print(match14.output.1)", and if you need all of the ending timestamps, use "print(match14.output.2)".

Embarrassingly enough I wasn't sure what goes in the second let hence the "notSureWhatGoesHere" but for my purposes of just accessing all the RegEx Matches and Groups in Swift this is working for me in a Swift Playground file.
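On the question of what goes in that second let: the output of a match with three capture groups is a four-element tuple (the whole match, then each capture), so it can be destructured into one name per element. Below is a minimal sketch using a deliberately simplified stand-in pattern, not the builder code above; `sample`, `cuePattern`, `starts`, and `texts` are names made up for the example. It uses the `#/.../#` form of regex literal and assumes each cue has a single line of text:

```swift
let sample = """
WEBVTT

00:00:00.000 --> 00:00:04.560
Did you know that up to 80 to 90 percent

00:00:04.560 --> 00:00:09.080
of those making a decision to Code will hit a wall and give up?
"""

// Simplified stand-in pattern (an assumption, not the pattern from this thread):
// two timestamps on one line, then a single line of text.
let cuePattern = #/(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\n(.*)/#

var starts: [String] = []
var texts: [String] = []

for match in sample.matches(of: cuePattern) {
    // One name per tuple element: the whole match, then capture groups 1 to 3.
    // The underscore skips the whole match, which is not needed here.
    let (_, start, end, text) = match.output
    print("\(start) to \(end): \(text)")
    starts.append(String(start))
    texts.append(String(text))
}

// wholeMatch(of:) only succeeds when the pattern covers the entire string,
// so it returns nil here because the sample starts with the WEBVTT header.
let whole = sample.wholeMatch(of: cuePattern)
print(whole == nil)
```

That last part is likely also why swapping firstMatch for wholeMatch earlier in the thread produced nothing: firstMatch(of:) finds the first occurrence, matches(of:) finds every occurrence, and wholeMatch(of:) requires the pattern to match the string from start to end.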

I hope this might help someone else later on as well.