A system for detection and classification of findings in an image, comprising at least one hardware processor configured to:
    receive the image;
    process the image by a plurality of convolutional and pooling layers of a neural network to produce a plurality of feature maps;
    process one of the feature maps by some of the layers and another plurality of layers to produce a plurality of region proposals;
    produce a plurality of region of interest (ROI) pools by using a plurality of pooling layers to downsample the plurality of region proposals with each one of the plurality of feature maps;
    process the plurality of ROI pools by at least one concatenation layer to produce a combined ROI pool;
    process the combined ROI pool by a classification network comprising some other of the convolutional and pooling layers to produce one or more classifications; and
    output the one or more classifications.
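
The ROI pooling and concatenation steps recited above can be illustrated with a minimal sketch. It is not the claimed implementation. It assumes single-channel feature maps, a square image, max pooling into a fixed 2x2 grid, and linear scaling of proposal coordinates to each feature map, none of which the claim specifies. The names `roi_max_pool` and `combined_roi_pool` are hypothetical, and the feature-extraction, region-proposal, and classification networks are omitted.

```python
import math
from typing import List, Tuple

FeatureMap = List[List[float]]           # single-channel H x W grid (simplifying assumption)
Box = Tuple[float, float, float, float]  # region proposal (x0, y0, x1, y1) in image pixels


def roi_max_pool(fmap: FeatureMap, box: Box, image_size: int, out_size: int = 2) -> List[float]:
    """Downsample the part of `fmap` under `box` to out_size * out_size values (max pooling)."""
    h, w = len(fmap), len(fmap[0])
    # Scale the proposal from image coordinates to this feature map's resolution.
    sx, sy = w / image_size, h / image_size
    x0, y0, x1, y1 = box[0] * sx, box[1] * sy, box[2] * sx, box[3] * sy
    pooled = []
    for by in range(out_size):
        # Row range of this bin; clamped so every bin covers at least one cell.
        r0 = min(int(y0 + (y1 - y0) * by / out_size), h - 1)
        r1 = min(max(math.ceil(y0 + (y1 - y0) * (by + 1) / out_size), r0 + 1), h)
        for bx in range(out_size):
            c0 = min(int(x0 + (x1 - x0) * bx / out_size), w - 1)
            c1 = min(max(math.ceil(x0 + (x1 - x0) * (bx + 1) / out_size), c0 + 1), w)
            pooled.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
    return pooled


def combined_roi_pool(fmaps: List[FeatureMap], box: Box, image_size: int,
                      out_size: int = 2) -> List[float]:
    """Pool the same proposal against each feature map, then concatenate the ROI pools."""
    combined: List[float] = []
    for fmap in fmaps:
        combined.extend(roi_max_pool(fmap, box, image_size, out_size))
    return combined


# Two feature maps of one 8x8 image at different resolutions (toy values).
fmap_a = [[float(r * 4 + c) for c in range(4)] for r in range(4)]  # 4x4
fmap_b = [[1.0, 2.0], [3.0, 4.0]]                                  # 2x2
proposal = (0.0, 0.0, 8.0, 8.0)                                    # whole image

print(combined_roi_pool([fmap_a, fmap_b], proposal, image_size=8))
# -> [5.0, 7.0, 13.0, 15.0, 1.0, 2.0, 3.0, 4.0]
```

Because every ROI pool has the same fixed size regardless of the resolution of the feature map it came from, the pools can be concatenated into a single vector of known length. In this sketch, that vector is what a downstream classification network would receive.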