In a paper posted to arXiv on 2021-10-30, “Trojan Source: Invisible Vulnerabilities”, University of Cambridge researchers Nicholas Boucher and Ross Anderson report a new attack vector that affects essentially any programming language that accepts Unicode characters in its source code, including source encoded in UTF-8, which is becoming ubiquitous.
The trick is simple: use Unicode bidirectional text formatting control codes to write text that the compiler interprets differently from how it appears to a human reading the code. The abstract describes the method as follows.
The paper also discusses other forms of deceptive code attack, including invisible characters and homoglyph attacks (replacing a character with another that is visually similar, for example the Latin alphabet “H” and the Cyrillic “Н”).
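To make the homoglyph and invisible-character cases concrete, here is a minimal Python sketch. The helper names `describe` and `rough_scripts` are illustrative and do not come from the paper. Treating the first word of a character's Unicode name as its script is a rough assumption, adequate for a demonstration but not a substitute for a proper script-property lookup.

```python
import unicodedata

LATIN_H = "\u0048"           # Latin capital letter H
CYRILLIC_EN = "\u041d"       # Cyrillic capital letter En, which renders much like "H"
ZERO_WIDTH_SPACE = "\u200b"  # takes up no visible width in most fonts


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


def rough_scripts(identifier: str) -> set:
    """Approximate the set of scripts used by the letters in an identifier.

    Assumption: the first word of a letter's Unicode name ("LATIN",
    "CYRILLIC", ...) stands in for its script. This is only a heuristic.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in identifier
        if ch.isalpha()
    }


if __name__ == "__main__":
    # The two letters look alike on screen but are different code points.
    print(describe(LATIN_H))
    print(describe(CYRILLIC_EN))
    print(LATIN_H == CYRILLIC_EN)

    # An identifier that mixes scripts is a hint that a homoglyph may be present.
    print(rough_scripts("\u041dello"))

    # An invisible character changes the string without changing its appearance.
    print(len("ab"), len("a" + ZERO_WIDTH_SPACE + "b"))
```

Two identifiers that differ only by such a substitution look identical in most fonts, yet a compiler treats them as distinct names, which is what makes the attack hard to spot in review.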
Compilers and code auditing tools will need to check for these Unicode tricks in source code and flag them for programmers and software testers. It will probably not be long before public code repositories such as GitHub start screening code for deceptive Unicode constructions.
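A basic check of this kind can be quite small. The Python sketch below scans source text for the nine Unicode bidirectional embedding, override, and isolate control characters and reports the line and column of each. The function name `find_bidi_controls` and the tuple-based report format are assumptions made for illustration, not the interface of any existing tool. A production scanner would probably also consider the directional marks (U+200E, U+200F, U+061C) and the homoglyph and invisible-character cases.

```python
# Bidirectional formatting controls, written as escapes so that this
# file itself contains none of the characters it is looking for.
BIDI_CONTROLS = {
    "\u202a": "LRE",  # left-to-right embedding
    "\u202b": "RLE",  # right-to-left embedding
    "\u202c": "PDF",  # pop directional formatting
    "\u202d": "LRO",  # left-to-right override
    "\u202e": "RLO",  # right-to-left override
    "\u2066": "LRI",  # left-to-right isolate
    "\u2067": "RLI",  # right-to-left isolate
    "\u2068": "FSI",  # first strong isolate
    "\u2069": "PDI",  # pop directional isolate
}


def find_bidi_controls(source: str) -> list:
    """Return (line, column, abbreviation) for each bidi control found.

    Lines and columns are 1-based; columns count code points, not bytes.
    """
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                findings.append((lineno, col, BIDI_CONTROLS[ch]))
    return findings


if __name__ == "__main__":
    # Hypothetical suspicious snippet: controls hidden inside a string literal.
    sample = 'x = 1\nif level != "user\u202e \u2066# check\u2069 \u2066":\n    pass\n'
    for lineno, col, name in find_bidi_controls(sample):
        print(f"line {lineno}, column {col}: {name}")
```

Flagging every occurrence is deliberately blunt. Legitimate right-to-left text in comments or string literals would also be reported, so a real tool would need a policy for when these characters are acceptable.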