Actually, with the OpenCV library, the input bitmap should ideally be 600x600, but with scaling I can support any video resolution. For example, if the original video frame is 720x480, I will just resize the bitmap to 600x400. From what I saw, Mad Dog McCree already defines its hitboxes in a different resolution than the actual video frame (I think the hitbox resolution was 360x240, as you said), so a bit of scaling is not a problem.
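To make the scaling idea concrete, here is a rough sketch in plain Python, not the actual code I run. The helper names `fit_within` and `map_box` are made up for illustration, the sample hitbox is invented, and the 360x240 hitbox space is only my guess from above. The real resize of the pixels would be done by something like OpenCV's `cv2.resize`; this only shows the arithmetic.

```python
def fit_within(width, height, max_side=600):
    """Return the (width, height) that fits inside max_side x max_side
    while keeping the original aspect ratio."""
    if width >= height:
        return max_side, round(height * max_side / width)
    return round(width * max_side / height), max_side


def map_box(box, src_size, dst_size):
    """Map an (x, y, w, h) box from one resolution to another."""
    x, y, w, h = box
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    return (
        round(x * dst_w / src_w),
        round(y * dst_h / src_h),
        round(w * dst_w / src_w),
        round(h * dst_h / src_h),
    )


# 720x480 video frame scaled to fit the 600x600 input
detect_size = fit_within(720, 480)
print(detect_size)  # (600, 400)

# Hypothetical hitbox given in the assumed 360x240 hitbox space,
# mapped into the 600x400 detection bitmap
print(map_box((90, 60, 30, 45), (360, 240), detect_size))  # (150, 100, 50, 75)
```

The same `map_box` works in the other direction too, so a mask found on the 600x400 bitmap could be converted back into whatever resolution the game expects for its hitboxes.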
When I originally had the idea to use object recognition for this problem, I was sure it wouldn't work. I talked about it a bit with @mazinger4life, without much hope. But when I discovered the Mask R-CNN technique, I got pretty excited about the possibilities. My last fear is the speed of the computation, which could be a major deal breaker since right now it runs on the CPU and not the GPU. I also haven't tested yet what happens when the input bitmap contains a lot of objects to detect.
So I hope @Karis and @sduensin can give us their input on this as well...