I really wonder how that rating system works.

Sugoi (Levi) being considerably better than DeepL seems like nonsense to me.
DeepL fails on noise sections where there's just moaning, thumping, or similar sounds without context, and it crashes fairly often, so Sugoi clearly wins there. But for normal dialogue (which I would consider much more important), DeepL still seems considerably better than Sugoi, mostly because Sugoi sometimes has a person refer to themselves when they are talking about someone else, or claim ownership of something they don't actually own. That makes some situations difficult to follow. Also, sentences in general tend to feel more "natural" with DeepL, less "fancy".
(I'm only talking about the Levi model here; I haven't had time to test V4 extensively yet.)