They both fetch the _same_ counter value, both add one to it, and both store that same result, so the counter ends up incremented by one, _not_ by two.
That's the problem with non-atomic updates.
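To make the lost update concrete, here is a minimal sketch in C++ using `std::atomic` (my choice for illustration, not something the platform discussion here depends on). The function names `count_atomically` and `count_with_split_update` are hypothetical. The first does the fetch, add, and store as one indivisible step; the second splits the fetch and the store into two separate steps, which is exactly the window in which two threads can read the same value.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: every thread bumps the shared counter with a single
// atomic read-modify-write, so no increment can be lost.
long count_atomically(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                counter.fetch_add(1);  // fetch + add + store as one step
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter.load();
}

// Hypothetical helper: the fetch and the store are each atomic on their own,
// but another thread can slip in between them, so updates can be lost.
long count_with_split_update(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                long seen = counter.load();  // two threads may see the same value
                counter.store(seen + 1);     // ...and both store seen + 1
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter.load();
}
```

With, say, 4 threads and 100000 iterations each, `count_atomically` always returns 400000, while `count_with_split_update` can return anything up to 400000 and will usually come up short on a multi-core machine.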
On some platforms it can be even worse than that. Writes issued in a certain order by one processor (e.g. write address A, then address B, then address C) might be observed by another processor in a different order, even if the second processor is only reading that memory (e.g. the second processor reads B and gets the new value, then reads A and gets the old value, even though, from the first processor's point of view, it wrote to A before writing to B). One platform I know of where this can happen is the Xbox 360 (which has three dual-threaded PPC cores sharing the same no-allocate L2 cache).
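As a rough illustration of how you ask for ordering explicitly, here is a sketch using C++ `std::atomic` with release/acquire ordering. This is the portable C++ way of expressing it, not the Xbox 360-specific mechanism; the function name `publish_and_read` and the value 42 are made up for the example. `payload` plays the role of address A and `ready` the role of address B.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical example: the writer stores the data first, then sets a flag.
// The release store and acquire load are what guarantee that a reader who
// sees the flag also sees the data written before it.
int publish_and_read() {
    int payload = 0;                 // plain data ("address A")
    std::atomic<bool> ready{false};  // flag ("address B")
    int observed = -1;

    std::thread writer([&] {
        payload = 42;                                  // write A
        ready.store(true, std::memory_order_release);  // then write B
    });

    std::thread reader([&] {
        // Spin until the new value of B is visible.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        observed = payload;  // A's new value is guaranteed visible here
    });

    writer.join();
    reader.join();
    assert(observed != -1);  // the reader always runs to completion
    return observed;
}
```

If both operations on `ready` used `std::memory_order_relaxed` instead, the language would no longer promise that the reader sees `payload == 42`, which is the reordering described above.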
The bottom line with synchronization is that you have to be very, very careful. The best approach is to use other people's primitive operations which are fully debugged and known to behave in predictable ways (such as the Interlocked* primitives on Windows). To go any deeper than that you need to have detailed info about the cache coherency and atomicity guarantees provided by your platform -- and more importantly, the ones NOT provided! =)
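Here is a small sketch of what "building on someone else's primitive" looks like in practice. It uses the C++ standard library's `compare_exchange_weak`, which fills roughly the same role as the compare-and-exchange member of the Interlocked* family on Windows. The helpers `atomic_store_max` and `max_across_threads` are hypothetical names for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: raise `target` to `candidate` if candidate is larger.
// All of the hard work is done by the library's compare-and-exchange.
void atomic_store_max(std::atomic<int>& target, int candidate) {
    int current = target.load();
    // On failure, compare_exchange_weak refreshes `current` with the value
    // actually stored, so the loop re-checks against the latest value.
    while (current < candidate &&
           !target.compare_exchange_weak(current, candidate)) {
    }
}

// Hypothetical driver: thread t proposes the value t * 10; the largest wins
// regardless of the order in which the threads happen to run.
int max_across_threads(int threads) {
    assert(threads > 0);
    std::atomic<int> highest{0};
    std::vector<std::thread> workers;
    for (int t = 1; t <= threads; ++t) {
        workers.emplace_back([&highest, t] {
            atomic_store_max(highest, t * 10);
        });
    }
    for (auto& w : workers) w.join();
    return highest.load();
}
```

The point is that the retry loop is the only logic you write yourself; the atomicity and ordering guarantees come from a primitive that someone else has already debugged.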