Checking if an array contains duplicate elements

AlexanderM · June 3, 2019, 1:54pm

Isn't the birthday paradox pretty fundamental, like that's no way to "remove" it, besides just making the hashes way wider and decreasing the chances?

From my understanding (plus some intuition) cryptographic hashes:

are intentionally slow (to slow down brute force attacks)
very extra uniform and have lower rates of collisions
have a very strong "avalanche effect", which improves point 2
are particularly hard to reverse

It seems to me like ideally (for the hashed collection use case) you'd like a fast hash function (counter to point 1) that excels at point 2, 3, and optionally also point 4 (so long as it doesn't come at the cost of the behaviour of points 1-3).

Are these the differences you had in mind, or did you have others?

Avi · June 3, 2019, 2:18pm

All hashes are irreversible. Hashes lose information, on purpose.

AlexanderM · June 3, 2019, 2:53pm

Losing information isn't sufficient to not be reversible.

Take the simple % function. If I tell you that x % 3 is 0, then you obviously know that x is one of {..., -3, 0, 3, 6, ...}. Let's call that the "possible input set", the set of values that you know all hash to x, which thereby satisfy that equality.

With the % function, you can figure out the "possible input set", but you can't figure out precisely which member of the set was the original x value that was hashed (because of the information loss you mention).

However, you can figure out what at least some (if not all) elements of the "possible input set" are.

In a cryptographic context, this means that you may not know what message produced a given MAC, but you can forge a message that has the same MAC. This is a pre-image attack, and cryptographic functions are specifically designed to make pre-image attacks hard. Given a hash value produced by a strong cryptographic function, it should be hard for you to come up with any value that's in the "possible input set".

Avi · June 3, 2019, 3:15pm

Yes. I only meant that you can't directly reverse a hash. I didn't mean that you couldn't exploit the algorithm to find collisions.

AlexanderM · June 3, 2019, 4:04pm

No, even that's not true. Read my comment carefully

Avi · June 3, 2019, 4:07pm

Fixed the typo. (What's it called when the typo is an absence of typing?)

AlexanderM · June 3, 2019, 5:01pm

Ah okay, I'll respond to the edit:

It depends on exactly what you mean by "directly reverse". If you mean "using some analytic method to find the exact input", then no, hashes can't be reversed. This is because of the pigeon hole principle, causing that "information loss" that you alluded to. However, that's not necessarily what you need.

One could take "reverse" to have a looser definition: to produce an input (any one input), that hashes to the same value as the true input. This is the more useful definition, and for this definition of "reverse", it is not true to say that

If the input space is larger than the output space, information loss occurs and the hash is not reversible.

Take a basic message authentication code, for example. You have a message you want to send and a shared secret key that authenticates you to an awaiting party. You want to authenticate the message, so you send your friend the message, plus hash(msg, yourSharedSecretKey). For me to forge the message, I have to intercept and change the message, but also update the hash so that your friend's verification works out. For me to succeed at this, I don't need to reverse what the exactly value yourSharedSecretKey is. I only need to figure out a key value, any key value, which hashes the same way as yourSharedSecretKey; whether or not the key I guess is exactly yourSharedSecretKey is irrelevant. For this attack to be successful (for for the hash function to be determined to be shit), I only need a way of easily figuring out values that collide with yourSharedSecretKey. This property is not guaranteed simply because "hashes lose information", and it's one of the defining characteristics of cryptographic hashes, that this process is made exceptionally difficult.

Here's a cute little example. I use Ints for messages, keys and signatures to simplify things, but I'm sure you can imagine how this could be generalized to any bit stream.

// A really shitty hash function
func hash(input: Int, sharedSecret: Int) -> Int { return (sharedSecret + input) % 3 }

enum SignitureError: Error {
	case invalidSigniture(expectedSigniture: Int, attachedSigniture: Int)
}

struct SignedMessage {
	var message: Int
	var signiture: Int
	
	init(message: Int, sharedSecret: Int) {
		self.message = message
		self.signiture = hash(input: message, sharedSecret: sharedSecret)
	}
	
	func verify(expectedSharedSecret: Int) throws {
		let expectedSigniture = hash(input: self.message, sharedSecret: expectedSharedSecret)
		
		if self.signiture != expectedSigniture {
			throw SignitureError.invalidSigniture(expectedSigniture: expectedSigniture, attachedSigniture: self.signiture)
		}
	}
}

let avisSharedSecret = 6562346

func sendMessageToAvisFriend(message: SignedMessage) {
	// Avi'd friend receives the message and verifies it:
	do {
		try message.verify(expectedSharedSecret: avisSharedSecret)
		print("Avi's friend receives a valid message: \(message.message) with signiture \(message.signiture)")
	} catch SignitureError.invalidSigniture(let expectedSigniture, let attachedSigniture) {
		print("""
		Avi's friend received an invalid message!
			message: \(message.message)
			attached signiture: \(attachedSigniture)
			expected signiture: \(expectedSigniture)
		""")
	} catch { fatalError("Unexpected error") }
}


func happyCase() {
	// Good guy Avi sends a real message:
	let realMessage = SignedMessage(message: 123, sharedSecret: avisSharedSecret)
	sendMessageToAvisFriend(message: realMessage)
}

func happyCaseForgedMessageDetected() {
	// Good guy Avi tries to send a real message:
	let realMessage = SignedMessage(message: 123, sharedSecret: avisSharedSecret)
	
	// Evil guy Alex intercepts the message, changes it, but doesn't change the signiture:
	var forgedMessage = realMessage
	forgedMessage.message = 100
	
	// Avi's friend receives the message, verifies it, and discovers that the signiture isn't
	// what he expected. The message has been tampered with!
	sendMessageToAvisFriend(message: forgedMessage)
}

func sadCaseMessageForged() {
	// Good guy Avi tries to send a real message:
	let realMessage = SignedMessage(message: 123, sharedSecret: avisSharedSecret)

	// Evil guy Alex intercepts the message, changes it
	// Alex is smart this time. He inspected the hash function, and was able to analytically
	// deduce that if the key is one of {2, 5, 8, 11, ...}, he produces a compatible signiture as the real shared secret.
	//
	// In general, a key that works with one message to produce a valid signiture doesn't mean it will necessarily work
	// to produce a valid signiture for every message, but that's okay, it just needs to work for this message.
	// For another message, a different key would be figured out, just the same.
	let guessedSharedSecret = 2
	let fakeMessageBody = 100
	var forgedMessage = SignedMessage(message: fakeMessageBody, sharedSecret: guessedSharedSecret)
	
	// Avi's friend thinks the message is legit! :(
	// The hash has been "reversed" successfully, even though the hash function caused a loss of information.
	sendMessageToAvisFriend(message: forgedMessage)
}

happyCase()
happyCaseForgedMessageDetected()
sadCaseMessageForged()

Even though I couldn't reverse engineer the hash function to know that Avi's sharedSecret was precisely 6562346, there are a set of values I could use that would work instead, because they all hash to produce the same signature 0. (2, 5, 8, 11 ...)

If hash(input:sharedSecret:) was a cryptographic hash function, it would have been designed so that it's extremely difficult to reverse the hash to find any value that hashes equivalently to 6562346. Only then would forging the message be impossible. That's the useful definition of "irreversible".

Avi · June 3, 2019, 6:29pm

You are using a very particular definition for reverse which is not one I'd ever use, even within this domain. We don't mean the same thing when we use the term, and there's no actual disagreement about how hashes, cryptographic or otherwise, work. When it comes to hashes used for data structures, collision avoidance is a desirable property, but the slowness of a cryptographic hash is not. This thread is about efficiently finding duplicates in an array, and it's not a discussion about the merits of particular hashing properties.

AlexanderM · June 3, 2019, 11:15pm

Yep, that's the root of the misunderstanding. But I think there's far more utility to think of reversal as the oposite of collision resistance, rather than a definition as strong as "finding out the exact input that resulted in a given hash value", because the former is a far bigger concern in a wider range of applications.

This all stems from your initial comment:

Which (unless I understood incorrectly) is the very common misconception that it's the non-injectiveness of functions that makes them hash functions. This (perceived, on my part) misconception was what I was trying alleviate.

The one-way-ness of hash functions does not relate to the mathematical property of being a not injective function.

paiv · June 4, 2019, 2:39pm

The performance goal for cryptographic hashes is actually opposite: the quicker the better. Because it doesn't define how strong the hash is, but its usability. To counter brute force, one need to wrap hash in a key stretching algorithm.

AlexanderM · June 4, 2019, 4:06pm

Didn't know that, cool. So then what differences are there between what cryptographic hashes offer, and what Swift's hashed collections require?

It seems to me that all the properties of cryptographic hashes (fast, uniform, avalanche) are also beneficial to a hashed collection. What's the difference?