Getting a pipeline to behave reliably across an entire survey is, in my opinion, the hardest part of creating a big stellar catalog!
For the catalogs I've created, I ended up producing a lot of quality statistics and diagnostic graphs per field using standard Python tools. I would then explore the statistics in topcat and flick through the graphs by eye. This invariably revealed a bunch of "failures", which I then used to fine-tune the pipeline. Identifying useful quality diagnostics and fine-tuning the pipeline is of course a painful, iterative, never-ending process.
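For concreteness, here is a minimal sketch of the kind of per-field summary table I mean, written to a format topcat reads directly. Note that `fields` and `compute_field_stats` are hypothetical placeholders for your own list of field IDs and your own QA routine, and the column names are illustrative too.

```python
from astropy.table import Table

# Sketch only: 'fields' and 'compute_field_stats' are hypothetical placeholders
# for your own list of field IDs and your own per-field QA routine.
rows = []
for field_id in fields:
    stats = compute_field_stats(field_id)    # hypothetical: returns a dict of QA numbers
    rows.append((field_id,
                 stats["n_sources"],
                 stats["dmag_median"],        # overlap photometry offset (see list below)
                 stats["dmag_sigma"]))        # overlap photometry scatter

qa = Table(rows=rows,
           names=("field", "n_sources", "dmag_median", "dmag_sigma"))
qa.write("field_qa.fits", overwrite=True)    # FITS tables load straight into topcat
```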
Examples of quality diagnostics which I found to be very useful are:
- consistency of the photometry in field overlaps, i.e. the median offset and standard deviation of repeat photometry (see the sketch after this list);
- scatter plots of PSF fit residuals against source magnitudes (should be flat);
- scatter plots of local sky estimates against source magnitudes (should be flat);
- maps of the density of catalogued stars on the sky (these reveal locations with lots of spurious sources).
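To make the first diagnostic concrete, here is a rough sketch of how the overlap statistics could be computed with astropy. The inputs (numpy arrays of coordinates and magnitudes for two overlapping fields) and the 1-arcsec match radius are illustrative assumptions, not taken from any particular pipeline.

```python
import numpy as np
import astropy.units as u
from astropy.coordinates import SkyCoord

def overlap_photometry_stats(ra1, dec1, mag1, ra2, dec2, mag2,
                             radius_arcsec=1.0):
    """Median offset and scatter of repeat photometry in a field overlap.

    Inputs are numpy arrays for the two overlapping field catalogs;
    the match radius is an illustrative assumption.
    """
    c1 = SkyCoord(ra=ra1 * u.deg, dec=dec1 * u.deg)
    c2 = SkyCoord(ra=ra2 * u.deg, dec=dec2 * u.deg)
    # Nearest neighbour in catalog 2 for every source in catalog 1.
    idx, sep, _ = c1.match_to_catalog_sky(c2)
    matched = sep < radius_arcsec * u.arcsec
    dmag = mag1[matched] - mag2[idx[matched]]
    # The median offset flags calibration problems; a robust scatter
    # (1.4826 * MAD) is less sensitive to blends and mismatches than np.std.
    offset = np.median(dmag)
    scatter = 1.4826 * np.median(np.abs(dmag - offset))
    return offset, scatter, int(matched.sum())
```

A well-behaved field pair should give an offset consistent with zero; a large offset usually points at a calibration problem in one of the two fields.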
Mind you, I don't think a pipeline can ever measure every star reliably. There are plenty of cases where you wouldn't be able to extract a decent magnitude for all the love in the world (e.g. a cosmic ray blended with a faint star). Catalogs can never be 100% complete. I think it is fine to acknowledge this and just make sure that all the poor estimates are either left blank or flagged in a user-friendly way. SDSS does the latter quite well with its "clean" flag.
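As a toy illustration of that flagging approach (the column names and thresholds here are all made up for the example; this is not SDSS's actual clean-flag logic):

```python
import numpy as np
import pandas as pd

def add_clean_flag(cat: pd.DataFrame) -> pd.DataFrame:
    """Mark unreliable photometry instead of silently keeping it."""
    bad = (
        (cat["psf_chi2"] > 3.0)      # poor PSF fit, e.g. a blended cosmic ray
        | (cat["mag_err"] > 0.2)     # too noisy to be useful
        | cat["near_saturated"]      # contaminated by a saturated neighbour
    )
    cat["clean"] = ~bad              # SDSS-style boolean quality column
    cat.loc[bad, "mag"] = np.nan     # or leave the poor estimates blank entirely
    return cat
```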
Your idea of masking out pixels with residual emission seems a bit dangerous to me: it sounds like a recipe for photometry with weird systematic errors. Also, what if the source really is slightly extended?
I hope these rambling thoughts are helpful. Let us know how you get on!