A cryptographer's eye on antivirus analysis
Not so long ago, I arrived, all fresh and eager, from a world full of cryptography -- you know, RSA, AES, SHA-256 etc. -- very excited to discover a new face of computer security. It is in such situations that you notice the importance of vocabulary, context and shortcuts. All of a sudden, you can't make head or tail of conversations held in your own mother tongue. I'll share a couple of surprises I had.
We have our AV (antivirus) engine scan a "signature" database. In cryptography, a signature is produced by processing some input through an asymmetric algorithm with a private key, or through a symmetric algorithm with a secret key (strictly speaking, the latter is called a MAC, Message Authentication Code). In the AV industry, a signature database groups various detection patterns for malware. Indeed, just as many criminals like (or feel compelled) to sign their acts (e.g. Jack the Ripper), each piece of malware also has its own style: specific techniques used, payload, targets, messages displayed etc. This is the malware's signature. It does not mean that any "cryptographic signature" is involved.
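To make the two meanings concrete, here is a minimal Python sketch. Everything named in it is an assumption made for illustration: the key, the byte pattern and the malware name are invented, and a real AV engine uses far richer detection patterns than a single substring match. The cryptographic side uses a MAC (HMAC-SHA-256) rather than an asymmetric signature, simply because it needs nothing beyond the standard library.

```python
import hashlib
import hmac

# Cryptographer's meaning: a tag computed over a message with a secret key.
# SECRET_KEY is a made-up value for this sketch.
SECRET_KEY = b"hypothetical shared secret"

def mac(message: bytes) -> bytes:
    """Compute an HMAC-SHA-256 tag over the message."""
    return hmac.new(SECRET_KEY, message, hashlib.sha256).digest()

def verify(message: bytes, tag: bytes) -> bool:
    """Check a tag in constant time."""
    return hmac.compare_digest(mac(message), tag)

# AV meaning: a detection pattern, with no key involved at all.
# The name and the pattern are invented for this sketch.
AV_SIGNATURES = {
    "Hypothetical.Worm.A": b"\xde\xad\xbe\xefI own your box",
}

def scan(data: bytes):
    """Return the name of the first matching pattern, or None."""
    for name, pattern in AV_SIGNATURES.items():
        if pattern in data:
            return name
    return None
```

The first half answers "did the key holder produce this message?", the second half answers "does this file look like something we already know?". Same word, two unrelated questions.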
We also use checksums in several cases, mostly CRC32 but also CRC16... and, more surprisingly, in some situations, MD5. As a cryptographer, I had barely ever used CRCs and, if I could, would have banned MD5. Technically speaking, MD5 is a hash function, not a checksum. Cryptographic hash functions are designed with security in mind, to detect malicious corruption, whereas checksums are built to detect accidental corruption. Consequently, for instance, hash functions must be difficult to invert (this is called pre-image resistance), whereas checksums need not fulfill this requirement. Hence, cryptographers typically use hash functions, not checksums, while safety-oriented processing tends to rely on checksums (easier and faster to implement).
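One way to see the difference in design goals is that CRC32 has a simple algebraic structure that a hash function is meant to avoid. For inputs of equal length, the CRC of an XOR of three messages equals the XOR of their three CRCs. The sketch below checks this with Python's standard `zlib.crc32` and shows that MD5 (via `hashlib.md5`) does not behave this way; the three sample inputs are arbitrary strings chosen for illustration.

```python
import hashlib
import zlib

def xor_bytes(*chunks: bytes) -> bytes:
    """XOR together byte strings of equal length."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)

def crc32_is_affine(a: bytes, b: bytes, c: bytes) -> bool:
    """Check crc(a ^ b ^ c) == crc(a) ^ crc(b) ^ crc(c)."""
    combined = zlib.crc32(xor_bytes(a, b, c))
    return combined == zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)

def md5_is_affine(a: bytes, b: bytes, c: bytes) -> bool:
    """Same check for MD5 digests; expected to fail."""
    combined = hashlib.md5(xor_bytes(a, b, c)).digest()
    digests = [hashlib.md5(x).digest() for x in (a, b, c)]
    return combined == xor_bytes(*digests)

# Arbitrary equal-length inputs, for illustration only.
a, b, c = b"malware!", b"goodware", b"whatever"
print("CRC32 affine:", crc32_is_affine(a, b, c))
print("MD5 affine:  ", md5_is_affine(a, b, c))
```

This predictability is harmless against random bit flips, which is all a checksum promises to catch, but it lets anyone adjust a file to reach a CRC value of their choosing.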
So, apart from performance issues, using the hash function MD5 as a checksum should be okay: anything that detects malicious corruption will also detect accidental corruption... except that, in recent years, several flaws have been identified in MD5 and no cryptographer recommends its use anymore. This is why I said I would have banned it. The problem is that MD5 collisions, i.e. different inputs that end up with the same digest, can actually be found. The first full collisions were disclosed in 2004. Since then, the attacks have been improved -- to the point that collisions can be found for inputs starting with a prefix chosen by the attacker -- and practical demonstrations have been released.
Let's go back to our particular case. In some situations, we use MD5 to identify a piece of code as a given malware A. Then an attacker can probably craft a similar malware B, with the same prefix as A, whose hash matches A's (known collision attacks assume the attacker builds both A and B; matching the digest of a file that somebody else has already fixed is the much harder second pre-image problem). This means we would wrongly identify malware B as A. As long as we stop both, this doesn't sound too critical to me (although it would probably be quicker to use CRC32, for not much less security).
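As a rough picture of digest-based identification, here is a minimal Python sketch. The database layout, the sample bytes and the malware name are all hypothetical; this is not a description of how any actual AV engine stores its detections.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of a piece of code."""
    return hashlib.md5(data).hexdigest()

# Invented sample standing in for the bytes of malware A.
sample_a = b"MZ hypothetical payload of malware A"

# Hypothetical database: digest -> detection name.
KNOWN_MALWARE = {md5_hex(sample_a): "Hypothetical.Malware.A"}

def identify(data: bytes):
    """Return the detection name for a known digest, or None."""
    return KNOWN_MALWARE.get(md5_hex(data))

print(identify(sample_a))
print(identify(b"some clean file"))
```

With this kind of lookup, a file B whose digest collides with A's would be reported under A's name. The label would be wrong, but B would still be flagged, which is the point made above.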
The point I really want to make in this article is that statements which are true in one domain may not hold in another: a signature does not mean the same thing in cryptography as in the AV industry, yet both definitions make sense. Similarly, for a cryptographer, using MD5 is nonsense... but it actually depends on what you are using it for and which attacks you consider relevant.
-- The Crypto Girl