Odd Characters in some places

DevNull wrote 9 years ago: 1

From http://api.tvmaze.com/shows?page=56

"name":"TV OD","type":"Talk Show"

There is some odd character before The "T" in "TV OD"
In notepad, it is not visible but the arrow key has to be hit twice to get past the "T"
In UltraEdit I see "name":"?TV OD","type":"Talk Show"

There are several other listings with this same thing:

Benelli's The Fowl Life
The Big Interview with Dan Rather
Chris Brackett's Fear No Evil
The First Eden: The Mediterranean World and Man
The Trials of Life: A Natural History of Behaviour
The Voice Versus
King of Cars
Skies Above Britain
Celebrity Haunted Hotel Live
The Adventures of Pow Wow

It makes those shows impossible to search for in the app I am working on, if that matters.

gazza911 wrote 9 years ago: 1

That's called a zero width character, unicode \u200B

The reason that in your database it's appearing as '?' is because you're using a different character set; the default (at least for MySQL, not sure about others) is latin1, whereas that specific unicode requires something like utf8.

Unfortunately there's not too much that can be done about existing ones as they've already been converted as it was an unknown character (at least in your character set). Replacing the question mark (as it will most likely appear in your database) with an empty string would also remove question marks that are supposed to be there in show names.

For the future, there's 2 options:

1. Change your collation to utf8;

ALTER TABLE tableName CHARACTER SET utf8 COLLATE utf8_general_ci;

(You may also have to go into your database and change the collation for the specific column(s), I'm not sure)

However the people who use your site may have the same problems if they copy and paste it elsewhere / you offer your own API; you could use the second option in conjunction with this to get around that.

2. Replace the unicode value in PHP with an empty string before you insert it into the database

Now, what TVMaze should do - in addition to option #2 in the future - is replace it in their database. This will only work because TVMaze is already using the correct character set, meaning that it didn't have to convert the unknown value.

The following MySQL statement will remove the zero width character in all rows that have them:

UPDATE tableName SET columnName = REPLACE(columnName, 0xE2808B, '') WHERE columnName LIKE CONCAT('%', 0xE2808B, '%');

david wrote 9 years ago: 1

Right, a zero-width space.. I guess it sneaked along with a copy & paste a couple of times. I'll add something to filter them out in the future, thanks for reporting :)

(Moved to Bugs, it's not API specific. Reference to self: http://kb.mozillazine.org/Network.IDN.blacklist_ch...)

DevNull wrote 9 years ago: 1

gazza911 wrote:
That's called a zero width character, unicode \u200B
The reason that in your database it's appearing as '?' is because you're using a different character set; the default (at least for MySQL, not sure about others) is latin1, whereas that specific unicode requires something like utf8.
Unfortunately there's not too much that can be done about existing ones as they've already been converted as it was an unknown character (at least in your character set). Replacing the question mark (as it will most likely appear in your database) with an empty string would also remove question marks that are supposed to be there in show names.
For the future, there's 2 options:
1. Change your collation to utf8;
ALTER TABLE tableName CHARACTER SET utf8 COLLATE utf8_general_ci;
(You may also have to go into your database and change the collation for the specific column(s), I'm not sure)
However the people who use your site may have the same problems if they copy and paste it elsewhere / you offer your own API; you could use the second option in conjunction with this to get around that.
2. Replace the unicode value in PHP with an empty string before you insert it into the database
Now, what TVMaze should do - in addition to option #2 in the future - is replace it in their database. This will only work because TVMaze is already using the correct character set, meaning that it didn't have to convert the unknown value.
The following MySQL statement will remove the zero width character in all rows that have them:
UPDATE tableName SET columnName = REPLACE(columnName, 0xE2808B, '') WHERE columnName LIKE CONCAT('%', 0xE2808B, '%');

I'm not using at database for now, it's a standalone .NET app - I let the user select the countries of interest, then go through each page from this API - "http://api.tvmaze.com/shows?page=X" X starts at 1 and increments until I get a HTTP 404. If the country is in the user selected list, I add the name, The year it premiered, Network, Country, a delimiter (currently "||") and the ID. This gets saved to a UTF8 text file which is essentially the data source for everything. For some strange reason, using String.Replace to remove characters 0xE2, 0x80, 0x8B is not working. This I am working on. Thanks a bunch for the information.

DevNull wrote 9 years ago: 1

david wrote:
Right, a zero-width space.. I guess it sneaked along with a copy & paste a couple of times. I'll add something to filter them out in the future, thanks for reporting :)
(Moved to Bugs, it's not API specific. Reference to self: http://kb.mozillazine.org/Network.IDN.blacklist_ch...)

I appreciate the reply, does that mean that eventually the show data will no longer have those chars or just the shows that get updated ? As I posted above, I use this API to get the show Data - "http://api.tvmaze.com/shows?page=X" X starts at 1 and increments until I get a HTTP 404, so many of the shows retreived are older and finished and I assume will not be updated.

gazza911 wrote 9 years ago: 1

DevNull wrote:
I'm not using at database for now, it's a standalone .NET app - I let the user select the countries of interest, then go through each page from this API - "http://api.tvmaze.com/shows?page=X" X starts at 1 and increments until I get a HTTP 404. If the country is in the user selected list, I add the name, The year it premiered, Network, Country, a delimiter (currently "||") and the ID. This gets saved to a UTF8 text file which is essentially the data source for everything. For some strange reason, using String.Replace to remove characters 0xE2, 0x80, 0x8B is not working. This I am working on. Thanks a bunch for the information.

I really suggest you use a database so that each person doesn't have to download ~24MB just so that they use it. I assume it's the hosting that's the issue for you.

Anyway I don't know .net, you might have to replace using the Unicode rather than the byte values.

david wrote 9 years ago: 1

DevNull wrote:
I appreciate the reply, does that mean that eventually the show data will no longer have those chars or just the shows that get updated ? As I posted above, I use this API to get the show Data - "http://api.tvmaze.com/shows?page=X" X starts at 1 and increments until I get a HTTP 404, so many of the shows retreived are older and finished and I assume will not be updated.

I'll try to find and fix the existing entries as well.

gazza911 wrote 9 years ago: 1

david wrote:
I'll try to find and fix the existing entries as well.

David, did you see the SQL that I posted?

UPDATE tableName SET columnName = REPLACE(columnName, 0xE2808B, '') WHERE columnName LIKE CONCAT('%', 0xE2808B, '%');

At least in MySQL (after changing the names) that will fix it.

If it's not that, message me if you like.

DevNull wrote 9 years ago: 1

gazza911 wrote:
I really suggest you use a database so that each person doesn't have to download ~24MB just so that they use it. I assume it's the hosting that's the issue for you.
Anyway I don't know .net, you might have to replace using the Unicode rather than the byte values.

The proper replacement appears to be like this:

Dim TroubleChars As String = ChrW(&H200B)

CorrectedString = RawString.Replace(TroubleChars,"")

It might be useful if I include an "update data" function that gets the last ID number and looks for higher ID numbers only to add. Is it correct to assume that ID numbers are sequential (new shows always have higher ID numbers) ? That would not be difficult and the only overlap could be part of a page of 250 shows. If someone adds/removes a country, that won't work so well so I need to think about that for sure.

I also noticed U+2026 e2 80 a6 in "I'm Pregnant And…" - It's not at the beginning so it would not affect my lookup method like the other one.

gazza911 wrote 9 years ago: 1

DevNull wrote:
The proper replacement appears to be like this:
Dim TroubleChars As String = ChrW(&H200B)
CorrectedString = RawString.Replace(TroubleChars,"")
It might be useful if I include an "update data" function that gets the last ID number and looks for higher ID numbers only to add. Is it correct to assume that ID numbers are sequential (new shows always have higher ID numbers) ? That would not be difficult and the only overlap could be part of a page of 250 shows. If someone adds/removes a country, that won't work so well so I need to think about that for sure.
I also noticed U+2026 e2 80 a6 in "I'm Pregnant And…" - It's not at the beginning so it would not affect my lookup method like the other one.

Yes, they are always sequential; even if a show gets deleted, that ID has already been reserved and it won't be given to another show.

DevNull wrote 9 years ago: 1

gazza911 wrote:
Yes, they are always sequential; even if a show gets deleted, that ID has already been reserved and it won't be given to another show.

Thanks again for that - The update data file module is done and the typical download is just 1 page (~300 KB) rather then 90+ pages. I also tell the user if the list of countries in the data file to update differs from the user's previously selected countries.