Since my misadventures in creating an enum-based trie resulted in discovering a compiler bug, I've gone back to manually creating a CR/LF/CRLF/CR-CRLF parser. I tried this years ago, and I'm now updating it with my better experience with Swift.
I start with a protocol to indicate the CR and LF values:
/// A type that supports values for the ASCII line feed and carriage return.
protocol InternetLineBreakerValues {
    /// The value of the ASCII+ carriage return code point.
    static var crValue: Self { get }
    /// The value of the ASCII+ line feed code point.
    static var lfValue: Self { get }
}
I could have extended just UInt8, but I made this protocol and then extended it with default implementations for ExpressibleByIntegerLiteral and ExpressibleByUnicodeScalarLiteral types, so I can then add extensions for UnicodeScalar, Int8, etc. (Just in case I go back to the Swift-language token iterator.)
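A minimal sketch of how those default implementations could look. The constrained-extension layout and the literal values (0x0D and 0x0A, "\r" and "\n") are assumptions based on the description above, not code taken from the original project:

```swift
/// A type that supports values for the ASCII line feed and carriage return.
protocol InternetLineBreakerValues {
    static var crValue: Self { get }
    static var lfValue: Self { get }
}

// Assumed default: integer-like types use the ASCII code numbers.
extension InternetLineBreakerValues where Self: ExpressibleByIntegerLiteral {
    static var crValue: Self { 0x0D }
    static var lfValue: Self { 0x0A }
}

// Assumed default: scalar-like types use the escape-sequence literals.
extension InternetLineBreakerValues where Self: ExpressibleByUnicodeScalarLiteral {
    static var crValue: Self { "\r" }
    static var lfValue: Self { "\n" }
}

// With the defaults in place, each conformance is a one-liner.
extension UInt8: InternetLineBreakerValues {}
extension Int8: InternetLineBreakerValues {}
extension Unicode.Scalar: InternetLineBreakerValues {}
```

Each of these types adopts only one of the two literal protocols, so it picks up exactly one set of defaults. A type that adopted both would have to supply its own values to avoid an ambiguity.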
Now add an iterator to search for line breaks:
/// An iterator over locations within a given collection of where its line-
/// breaking sequences are.
struct LineTerminatorLocationIterator<Base: Collection> where Base.Element: Equatable & InternetLineBreakerValues {
    /// The remaining sub-collection to search.
    var collection: Base.SubSequence
    /// Which line-breaking sequences to search for.
    let targets: LineTerminatorSearchTargets
}
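The LineTerminatorSearchTargets type is not shown in this post. A minimal sketch, assuming it is an OptionSet whose members match the names the iterator checks with contains (.cr, .lf, .crlf, .crcrlf); the raw-value layout and the all convenience are hypothetical:

```swift
/// Assumed shape: a set of line-breaking sequences to look for.
struct LineTerminatorSearchTargets: OptionSet {
    let rawValue: UInt8

    /// A lone carriage return.
    static let cr = LineTerminatorSearchTargets(rawValue: 1 << 0)
    /// A lone line feed.
    static let lf = LineTerminatorSearchTargets(rawValue: 1 << 1)
    /// A carriage return followed by a line feed.
    static let crlf = LineTerminatorSearchTargets(rawValue: 1 << 2)
    /// Two carriage returns followed by a line feed.
    static let crcrlf = LineTerminatorSearchTargets(rawValue: 1 << 3)

    /// Hypothetical convenience: every supported sequence.
    static let all: LineTerminatorSearchTargets = [.cr, .lf, .crlf, .crcrlf]
}

// Option sets compose with array-literal syntax.
let unixAndOldMac: LineTerminatorSearchTargets = [.lf, .cr]
```

An option set fits here because the sequences are independent switches: a caller can ask for CRLF without also treating a bare CR as a line break.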
Since I need to look at up to three code points at a time, I originally arranged the next three values by hand, but had to add special cases when there were fewer than three elements left. Having a function that simply returns nil once it goes out of bounds was a lot easier than having to stop the flow to create if-let-else branches in my initializations.
extension Collection {
    /// Returns the position immediately after the given index, or nil if
    /// that position is the end index and so cannot be dereferenced.
    func elementIndex(after i: Index) -> Index? {
        precondition(i < endIndex)
        let next = index(after: i)
        return next < endIndex ? next : nil
    }
}
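To show how the helper keeps the look-ahead free of special cases, here is a small standalone example. The helper is repeated so the snippet compiles on its own, and the two-element sample array is just an illustration:

```swift
extension Collection {
    /// Returns the index after `i`, or nil if that would be `endIndex`.
    func elementIndex(after i: Index) -> Index? {
        precondition(i < endIndex)
        let next = index(after: i)
        return next < endIndex ? next : nil
    }
}

// Only two elements, so a three-element window runs out of bounds.
let bytes: [UInt8] = [0x0D, 0x0A]

// Each step either produces a valid index or propagates nil.
let first: Int? = bytes.isEmpty ? nil : bytes.startIndex
let second = first.flatMap { bytes.elementIndex(after: $0) }
let third = second.flatMap { bytes.elementIndex(after: $0) }
```

Here first is 0, second is 1, and third is nil, all without a single if-let-else branch.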
extension LineTerminatorLocationIterator: IteratorProtocol {
    mutating func next() -> Range<Base.Index>? {
        var result: Range<Base.Index>?
        // A sliding window of up to three indices; missing ones are nil.
        var first = collection.isEmpty ? nil : collection.startIndex
        var second = first.flatMap { collection.elementIndex(after: $0) }
        var third = second.flatMap { collection.elementIndex(after: $0) }
        while let firstIndex = first, result == nil {
            // Slide the window forward by one element after each pass.
            defer {
                first = second
                second = third
                third = third.flatMap { collection.elementIndex(after: $0) }
            }
            let secondValue = second.map { collection[$0] }
            let thirdValue = third.map { collection[$0] }
            // Try the longest sequence first so CR-CRLF wins over CRLF and CR.
            switch (collection[firstIndex], secondValue, thirdValue) {
            case (Base.Element.crValue, Base.Element.crValue?, Base.Element.lfValue?) where targets.contains(.crcrlf):
                result = firstIndex..<collection.index(after: third!)
            case (Base.Element.crValue, Base.Element.lfValue?, _) where targets.contains(.crlf):
                result = firstIndex..<collection.index(after: second!)
            case (Base.Element.crValue, _, _) where targets.contains(.cr),
                 (Base.Element.lfValue, _, _) where targets.contains(.lf):
                result = firstIndex..<collection.index(after: firstIndex)
            default:
                break
            }
        }
        // Drop everything up to the end of the match, or everything if none.
        collection = collection[(result?.upperBound ?? collection.endIndex)...]
        return result
    }
}
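To see the iterator run end to end, the pieces above can be assembled into one file. This is a sketch under stated assumptions: the direct UInt8 conformance and the OptionSet definition of LineTerminatorSearchTargets are stand-ins for code not shown in this post, and the sample input is made up.

```swift
protocol InternetLineBreakerValues {
    static var crValue: Self { get }
    static var lfValue: Self { get }
}

// Assumption: a direct conformance standing in for the protocol defaults.
extension UInt8: InternetLineBreakerValues {
    static var crValue: UInt8 { 0x0D }
    static var lfValue: UInt8 { 0x0A }
}

// Assumption: the search targets form an option set with these members.
struct LineTerminatorSearchTargets: OptionSet {
    let rawValue: UInt8
    static let cr = LineTerminatorSearchTargets(rawValue: 1 << 0)
    static let lf = LineTerminatorSearchTargets(rawValue: 1 << 1)
    static let crlf = LineTerminatorSearchTargets(rawValue: 1 << 2)
    static let crcrlf = LineTerminatorSearchTargets(rawValue: 1 << 3)
}

extension Collection {
    func elementIndex(after i: Index) -> Index? {
        precondition(i < endIndex)
        let next = index(after: i)
        return next < endIndex ? next : nil
    }
}

struct LineTerminatorLocationIterator<Base: Collection>: IteratorProtocol
where Base.Element: Equatable & InternetLineBreakerValues {
    var collection: Base.SubSequence
    let targets: LineTerminatorSearchTargets

    mutating func next() -> Range<Base.Index>? {
        var result: Range<Base.Index>?
        var first = collection.isEmpty ? nil : collection.startIndex
        var second = first.flatMap { collection.elementIndex(after: $0) }
        var third = second.flatMap { collection.elementIndex(after: $0) }
        while let firstIndex = first, result == nil {
            defer {
                first = second
                second = third
                third = third.flatMap { collection.elementIndex(after: $0) }
            }
            let secondValue = second.map { collection[$0] }
            let thirdValue = third.map { collection[$0] }
            switch (collection[firstIndex], secondValue, thirdValue) {
            case (Base.Element.crValue, Base.Element.crValue?, Base.Element.lfValue?) where targets.contains(.crcrlf):
                result = firstIndex..<collection.index(after: third!)
            case (Base.Element.crValue, Base.Element.lfValue?, _) where targets.contains(.crlf):
                result = firstIndex..<collection.index(after: second!)
            case (Base.Element.crValue, _, _) where targets.contains(.cr),
                 (Base.Element.lfValue, _, _) where targets.contains(.lf):
                result = firstIndex..<collection.index(after: firstIndex)
            default:
                break
            }
        }
        collection = collection[(result?.upperBound ?? collection.endIndex)...]
        return result
    }
}

// Sample input: a CRLF, a CR-CRLF, a lone LF, and a trailing lone CR.
let bytes = Array("a\r\nb\r\r\nc\nd\r".utf8)
var iterator = LineTerminatorLocationIterator<[UInt8]>(
    collection: bytes[...], targets: [.cr, .lf, .crlf, .crcrlf])

// Drain the iterator, collecting the range of every terminator found.
var found: [Range<Int>] = []
while let range = iterator.next() { found.append(range) }
```

With every target enabled, found holds 1..<3, 4..<7, 8..<9 and 10..<11: the longest sequence wins at each position, and the trailing CR is still matched even though no elements follow it.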
I made a Sequence for this iterator, then upgraded it with Collection extensions, then optionally added BidirectionalCollection extensions. Implementing index(before:) required an elementIndex(before:) method to help in the same way.
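The elementIndex(before:) method is not shown in this post. A minimal sketch of a backward counterpart, assuming it lives on BidirectionalCollection and returns nil when there is nothing before the given index; the precondition and exact semantics are guesses:

```swift
extension BidirectionalCollection {
    /// Returns the position immediately before the given index, or nil if
    /// the index is already at the start of the collection.
    func elementIndex(before i: Index) -> Index? {
        precondition(i <= endIndex)
        return i > startIndex ? index(before: i) : nil
    }
}

let sample = [10, 20, 30]
// Stepping back from endIndex lands on the last element.
let last = sample.elementIndex(before: sample.endIndex)
// Stepping back from startIndex has nowhere to go.
let none = sample.elementIndex(before: sample.startIndex)
```

Unlike the forward version, this one accepts endIndex as input, since a backward search naturally starts from one past the last element.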