Error: Barcodes/ISBN

After scanning my very large comic collection (over 53K books) into the software, I have noticed that barcode identification, easily one of the best features of the software, is populated with some pretty bad data. Too often I scan a book and it shows me multiple titles and/or issues associated with that barcode. Sometimes the book I scanned is not even one of them. Many other times the barcode does not work at all. My research does show that some publishers re-used barcodes, and in the distant past variants commonly shared a barcode as well. That explains some of the errors, but there are more actual errors in the data than these reasons can account for.

For example, here is a summary by barcode length:

First, it is important to note that barcode or ISBN lengths like 11,12,15,16,18,19 are not valid and should immediately raise suspicion.
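A length summary like the one described above could be produced with a few lines of Python. This is only a sketch: the suspect lengths are copied from this post and should be treated as an assumption to tune, and the function names are made up for illustration.

```python
from collections import Counter

# Lengths this post calls out as suspicious (an assumption, adjust as needed).
SUSPECT_LENGTHS = {11, 12, 15, 16, 18, 19}

def summarize_lengths(barcodes):
    """Return {length: count} for the non-empty barcodes, sorted by length."""
    counts = Counter(len(b.strip()) for b in barcodes if b and b.strip())
    return dict(sorted(counts.items()))

def suspect_barcodes(barcodes, suspect_lengths=SUSPECT_LENGTHS):
    """Return the barcodes whose length is in the suspect set."""
    return [b for b in barcodes if b and len(b.strip()) in suspect_lengths]
```

Running `summarize_lengths` over the barcode column gives the table, and `suspect_barcodes` gives the entries worth a manual look.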

Next, there are many that are duplicated. Of the 53K books in my collection I found 2834 distinct barcodes with duplicates across 6976 of my books. Take this list for example:

The first barcode, 0714860246108, produces a list of 14 different issues in my collection. When I examine them I see that it is actually legit: somehow Marvel used the same barcode across issues of X-Men and Power Man/Iron Fist, which is easily confirmed by looking at the covers…

…but if I look at the second barcode, 759606056859, I see 13 issues of Marvel Adventures: Spider-Man where someone did in fact put the wrong barcode for every book and that is going to make bar code scanning impossible for any issue in this series.

As I dig further into this, I see that there was a time when all variants of an issue had the same barcode, so duplicates within a single issue are to be expected. But if you factor out per-issue duplication and move to a period where publishers no longer re-used barcodes (say 1995 and later), there are still a lot of duplicate barcodes due to bad entry. Initially, I had assumed that at least one item in each set was correct, but looking at them closely it is evident that most of the issues identified in the table below are flat out wrong, and all you have to do is click the cover image to see it.
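The filtering described above (ignore variants of the same issue, ignore older books) might look roughly like this in Python. The record keys `barcode`, `title`, `issue` and `year` are hypothetical and would need to be mapped to the real column names; `year` is assumed to be an integer.

```python
from collections import defaultdict

def cross_issue_duplicates(books, min_year=1995):
    """Map barcode -> sorted (title, issue) pairs when it spans 2+ distinct issues.

    `books` is an iterable of dicts with the hypothetical keys
    'barcode', 'title', 'issue' and 'year'.
    """
    issues_by_barcode = defaultdict(set)
    for book in books:
        barcode = (book.get("barcode") or "").strip()
        if not barcode or book["year"] < min_year:
            continue
        # A set collapses variants of one issue, so per-issue duplication is ignored.
        issues_by_barcode[barcode].add((book["title"], book["issue"]))
    return {bc: sorted(iss) for bc, iss in issues_by_barcode.items() if len(iss) > 1}
```

Anything this returns still needs a look at the covers, since it cannot tell a legitimate publisher re-use from a data entry error.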

All of this information aside, there are easy ways for you to detect bad barcodes. For example, the first 6 digits indicate the publisher. So take a look at the list below:

It shows that there are nearly 30K books associated with Marvel Comics by the first 6 digits. Given that, look at the others near it, such as 75940609502500611, which is listed as Ghost-Spider, Vol. 1 #6A. Wrong. You can clearly see a typo from the image below:

So now I have given you a technique (not 100% foolproof but pretty close) which you can use to at least identify when the publisher portion is wrong (on books from the last couple of decades for certain). Note: some publishers legitimately have multiple 6-digit publisher codes (Marvel is one of them).
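One way to sketch the publisher check in Python is to count how often each 6-digit prefix appears per catalogued publisher and flag the rare ones. The record keys `publisher` and `barcode` and the 5% threshold are assumptions made for this example, not anything taken from the app.

```python
from collections import Counter, defaultdict

def prefix_outliers(books, prefix_len=6, min_share=0.05):
    """Flag books whose barcode prefix is rare for their catalogued publisher.

    `books` is a list of dicts with the hypothetical keys 'publisher' and
    'barcode'. A prefix used by fewer than `min_share` of a publisher's
    barcoded books is treated as a likely typo.
    """
    prefixes = defaultdict(Counter)
    for book in books:
        if book.get("barcode"):
            prefixes[book["publisher"]][book["barcode"][:prefix_len]] += 1
    flagged = []
    for book in books:
        if not book.get("barcode"):
            continue
        counts = prefixes[book["publisher"]]
        share = counts[book["barcode"][:prefix_len]] / sum(counts.values())
        if share < min_share:
            flagged.append(book)
    return flagged
```

Using a share threshold instead of one fixed code per publisher leaves room for publishers that legitimately have several prefixes.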

Also, the last five digits are significant for most books (especially modern ones): they represent the issue number and the variant. You can easily write queries to check that, when a 17-digit number is entered, the first 3 of the last 5 digits correspond (usually) to the issue number and the last two to the variant (11 for an A variant, 12 for a B variant, etc.).
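As a rough Python illustration of that check, the function below compares the three issue digits with the catalogued issue number. The function name and the handling of non-numeric or 4-digit issue numbers are my own assumptions.

```python
import re

def issue_digits_mismatch(barcode, issue_number):
    """Return True when a 17-digit barcode's issue digits disagree with issue_number.

    Returns None when the check does not apply: the barcode is not 17 digits,
    or the issue number has no leading digits or does not fit in 3 digits.
    """
    if not re.fullmatch(r"[0-9]{17}", barcode):
        return None
    match = re.match(r"[0-9]+", str(issue_number))
    if match is None or int(match.group()) > 999:
        return None
    encoded_issue = int(barcode[12:15])  # first 3 of the last 5 digits
    return encoded_issue != int(match.group())
```

A `True` result is only a hint, since the correspondence holds "usually" rather than always.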

But one of the simplest things you might be able to do with the data you have today is to run a snippet of code against each cover scan to see if the image contains a barcode and, if it does, extract it and compare it to the barcode you have for the issue. This is easy to do. You can then flag discrepancies and make some manual checks to confirm before applying them and fixing the database.
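The comparison step could be sketched like this in Python. The actual image decoding is left to a caller-supplied `decode_barcode` function, which is hypothetical here (any barcode-reading library could sit behind it), and the record keys `id`, `cover_path` and `barcodes` are also assumptions.

```python
def find_discrepancies(records, decode_barcode):
    """Yield (record_id, stored_barcodes, decoded) where the cover disagrees.

    `decode_barcode` is a caller-supplied (hypothetical) function that takes
    an image path and returns the decoded digits, or None when no barcode
    is visible on the cover.
    """
    for record in records:
        decoded = decode_barcode(record["cover_path"])
        if decoded is None:
            continue  # no readable barcode on the cover: nothing to compare
        stored = record.get("barcodes") or []
        if decoded not in stored:
            yield record["id"], stored, decoded
```

The output is meant as a review queue for manual confirmation, not as something to apply to the database automatically.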

Glad to assist if you would like more information. In my particular case I have already corrected hundreds of entries in my local database.

Regardless of the actions you ultimately take (hopefully ignoring is not one of them) you do have the tools to improve the database greatly and make scanning and input a lot better for users.

If you would like copies of any of the queries I use against my local database to adapt to your environment let me know…

Hi Gary,

I need to look into the details of your post, but here are some initial comments:

  • Looking at the data you show, it is clear that MOST barcodes are correct. It is a database, with data coming from multiple sources, including user-contributed data. So I find it quite impressive that there are only a small number of bad barcodes listed.
  • The erroneous barcodes are probably from the more “obscure” comics, otherwise the errors would have already been fixed. That makes them less important and not encountered by many users.
  • Please note that you are looking at your OWN database, not at the Core data. This means that
    • the Core data may have already been fixed
    • the Core may have the CORRECT barcode too (Core allows multiple barcodes per comic)
    • the incomplete barcode you have in your database MAY have resulted from a barcode scanning problem, that is, the scanner only picking up a PARTIAL barcode. In that case, the app shows you multiple results and you picked the correct one. So Core probably has the correct and full barcode.

I don’t think there is a really big problem here. Our stats show that our users are getting an extremely high success rate on their barcode searches.

…but if I look at the second barcode, 759606056859, I see 13 issues of Marvel Adventures: Spider-Man where someone did in fact put the wrong barcode for every book and that is going to make bar code scanning impossible for any issue in this series.

No, that is not correct. We DO have the correct barcodes listed for all those issues. Apparently we also have that partial barcode listed, but that is not a problem in any way. When users scan the full barcode, the app will give them the correct comic.
Screenshot from our Core admin:

Anyway, thanks for your suggestions on how to improve the barcodes. I must say this is not on our radar as a high-priority problem, also because fewer and fewer users use the Add By Barcode feature nowadays. Almost everyone uses the new cover scanning method.

We have to pick our battles :slight_smile:

Thanks for the reply @CLZ_Alwin

You are correct that most of the barcodes in the database are right and I have no doubt that many of the missing barcodes are simply because the issues in question just do not have one. But based on what you have just said I think there is a very easy fix that you can apply to make what you have more useful in the case where you have multiple barcodes.

You could 1.) update the codebase to prioritize either the longest or a 17-digit barcode when one is present or 2.) write a query to update your Barcode field to move the 17-digit/longest to the beginning of your string as that would appear to be the currently prioritized value. The first would be a coding change but the second is just an UPDATE query.
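Either option comes down to the same ordering rule. A minimal Python sketch, assuming the barcodes for one comic are available as a list of strings (how they are actually stored is not something I can see from my side):

```python
def prioritize(barcodes):
    """Order barcodes so any 17-digit one comes first, then longest first.

    The sort is stable, so barcodes of equal length keep their original order.
    """
    return sorted(barcodes, key=lambda b: (len(b) != 17, -len(b)))
```

The first element of the result is then the value to display or to write to the front of the Barcode field.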

Then we would not see bad data in the desktop app or the Android app like below:

In any given modern 17-digit barcode (a 12-digit UPC-A followed by a 5-digit add-on) you have the following:

image

  • Black (digit 1) is the number system (7 for most comics)
  • Red (digits 2-6) indicates the publisher
  • Blue (digits 7-11) indicates the series
  • Yellow (digit 12) is a check digit - use it to validate that you entered it correctly (at least to this point)
  • Purple (digits 13-15) is the issue number
  • Green (digits 16-17) is the variant

So basically the fragments you have for Marvel Adventures: Spider-Man are only enough to identify the publisher and the title - not the issue.
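To make the layout concrete, here is a small Python sketch that splits a barcode into the fields listed above. The class and function names are invented for this example, and it only accepts 12- or 17-digit input.

```python
from typing import NamedTuple, Optional

class ComicBarcode(NamedTuple):
    number_system: str
    publisher: str
    series: str
    check_digit: str
    issue: Optional[str]    # None when the 5-digit add-on is missing
    variant: Optional[str]  # None when the 5-digit add-on is missing

def split_barcode(code: str) -> ComicBarcode:
    """Split a 12-digit UPC-A or a 17-digit UPC-A plus add-on into its fields."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (12, 17):
        raise ValueError(f"expected 12 or 17 digits, got {code!r}")
    has_addon = len(code) == 17
    return ComicBarcode(
        number_system=code[0],
        publisher=code[1:6],
        series=code[6:11],
        check_digit=code[11],
        issue=code[12:15] if has_addon else None,
        variant=code[15:17] if has_addon else None,
    )
```

For the 12-digit fragment 759606056859 this yields a publisher and a series but `None` for the issue and variant, which is the point made above.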

So if you use a query like the one I have written for you below, you can identify all the invalid barcodes in the database and go fix them. There are hundreds in just my own collection. If you want, you can easily change the logic to evaluate 13-digit EAN-13 barcodes using the 13th digit as a check digit.

WITH
	ISBNTemp AS (
		SELECT 
				X.Publisher, 
				SUBSTRING(X.ISBN, 2, 5) AS [ISBN Publisher], 
				X.ISBN, 
				X.Title, 
				X.IssueName,
				CAST(SUBSTRING(X.ISBN,1,1) AS int) AS D1,
				CAST(SUBSTRING(X.ISBN,2,1) AS int) AS D2,
				CAST(SUBSTRING(X.ISBN,3,1) AS int) AS D3,
				CAST(SUBSTRING(X.ISBN,4,1) AS int) AS D4,
				CAST(SUBSTRING(X.ISBN,5,1) AS int) AS D5,
				CAST(SUBSTRING(X.ISBN,6,1) AS int) AS D6,
				CAST(SUBSTRING(X.ISBN,7,1) AS int) AS D7,
				CAST(SUBSTRING(X.ISBN,8,1) AS int) AS D8,
				CAST(SUBSTRING(X.ISBN,9,1) AS int) AS D9,
				CAST(SUBSTRING(X.ISBN,10,1) AS int) AS D10,
				CAST(SUBSTRING(X.ISBN,11,1) AS int) AS D11,
				CAST(SUBSTRING(X.ISBN,12,1) AS int) AS D12
			FROM dbo.Comics X
			WHERE 1=1
				AND X.ISBN IS NOT NULL
				AND LEN(X.ISBN) >= 12
				AND YEAR(X.ReleaseDate) > 1990 --EAN-13 formatted UPC-A is fully adopted throughout the industry by this point
				AND X.ISBN NOT LIKE '97[789]%' --EAN-13
				AND X.ISBN NOT LIKE '4[0-9][0-9]%' --German EAN-13
		),
	ISBNComputed AS (
		SELECT 
				Y.*,
				(D1 + D3 + D5 + D7 + D9 + D11) AS Odd,
				(D2 + D4 + D6 + D8 + D10) AS Even,
				D12 AS CheckDigit
			FROM ISBNTemp Y
		),
	ISBNValidated AS (
		SELECT *,
				(10 - (((X.Odd * 3) + X.Even) % 10)) % 10 AS CheckSum
			FROM ISBNComputed X
		)
	SELECT *
		FROM ISBNValidated X
		WHERE X.CheckDigit <> X.CheckSum
		ORDER BY 
			X.Publisher,
			X.Title,
			X.IssueName,
			X.ISBN

Here is a subset of the results showing bad bar codes and highlighting the book that set this whole thing off for me: Astonishing X-Men 56A :slight_smile:

You could very easily add validation logic to the app for this stuff - there are not that many barcode formats - and with the predictable sequential patterning you could even auto-identify a lot of book details on adding to the database from just the barcode itself.
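For app-side validation, the same check-digit test as in my query could be written along these lines in Python. It is a sketch with made-up function names that covers 12-digit UPC-A, 13-digit EAN-13, and 17-digit barcodes (where only the first 12 digits are checked).

```python
def check_digit(body: str) -> int:
    """Compute the mod-10 check digit for the digits that precede it."""
    total = 0
    # Weights alternate 3, 1, 3, ... starting from the rightmost body digit.
    for i, ch in enumerate(reversed(body)):
        total += int(ch) * (3 if i % 2 == 0 else 1)
    return (10 - total % 10) % 10

def has_valid_check_digit(code: str) -> bool:
    """Validate a 12-digit UPC-A, 13-digit EAN-13, or 17-digit UPC-A plus add-on."""
    if not (code.isascii() and code.isdigit()):
        return False
    if len(code) == 17:
        code = code[:12]  # the 5-digit add-on is not covered by the check digit
    if len(code) not in (12, 13):
        return False
    return check_digit(code[:-1]) == int(code[-1])
```

With this, the Ghost-Spider barcode 75940609502500611 from my first post fails the check, while the 12-digit 759606056859 passes, so a typo in the publisher digits is caught at entry time.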

Also: I understand that you say a lot of people are using cameras to scan their books, but in my case I add 100 new books to my collection every 2 weeks and it literally takes 2 minutes with a barcode scanner. I doubt anyone can touch that with a camera. And since the compute capacity needed for cover recognition is significantly higher than for a barcode lookup, you would probably lower your energy costs significantly using the barcode method. Even if you use a picture only to extract the barcode, you can reduce that.

Of course, that should be easy.

As I said, it’s not really a priority in the grand scheme of things that we still want/need to do.

I am not saying the cover scanner is faster. I am just saying more people use that nowadays.

Again, thanks for your suggestions, but fixing this is not a high priority problem right now.