RISC OS Open
A fast and easily customised operating system for ARM devices

Chars

Subscribe to Chars 135 posts, 23 voices


 
Jul 11, 2016 8:55pm
Steffen Huber (91) 1645 posts

It is easy to scan through the file and determine if it is or is not UTF-8 by looking at the character sequences.

You might be able to do that if there are illegal sequences (i.e. determine that it is not UTF-8), but in the generic case you cannot determine if a file is UTF-8-encoded or a single byte encoding.
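A minimal sketch of that asymmetry in Python, assuming strict decoding is an acceptable validity test (the helper name `is_valid_utf8` is made up for illustration). An invalid sequence rules UTF-8 out; a valid one does not rule a single byte encoding out.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")  # strict by default: rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True


latin1_bytes = "café".encode("latin-1")  # b'caf\xe9'
utf8_bytes = "café".encode("utf-8")      # b'caf\xc3\xa9'

print(is_valid_utf8(latin1_bytes))   # False: definitely not UTF-8
print(is_valid_utf8(utf8_bytes))     # True: could be UTF-8, but not proven
print(is_valid_utf8(b"plain text"))  # True: pure ASCII, no evidence either way
```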

 
Jul 11, 2016 9:48pm
Rick Murray (539) 10579 posts

You can also determine that a file is UTF-8 if you come across legal sequences.

However, as I said – if the file has no high bit set characters (such as plain English with normal punctuation), you cannot tell any difference between UTF-8 and Latin1. That said, for such a file there is no difference…

Having markers in the file adds complication. What should Edit do with such a thing? Show it? Hide it? Allow it to be edited? Insert it upon saving? The thing I like about Edit is that you see what is there – even if you’re looking at binary files.

Speaking of binary files, the reason I installed Notepad++ on my PC was because I got fed up of various bits of Windows (such as the RTF handler) “deciding” that a binary file was some sort of Unicode and thus displaying the file in bits of random Chinese. At least Notepad++ can be told what sort of file it is, so I see what is there and not what some algorithm thinks ought to be there…

Which is circular. We’re back to Edit. Showing what’s there. It can be useful, looking in a data file, you know. Days when magazine cover discs would unhelpfully provide files in Impression format. I’m an Ovation guy. So I used to dump them into Edit or Zap and just read the content straight out of the file. ;-)

 
Jul 11, 2016 10:33pm
Frederick Bambrough (1372) 708 posts

Chris,

Once you’ve got Chars running, does it perform as expected (ie bring up the UCS character names

Yes.

Too many nested structures at line 915

Haven’t seen that. I do get File ‘<Chars$Dir>.!Help’ not found at line 1800 on selecting Help from the icon bar menu. Chars exits.

Desktop is using the standard Homerton font, though with an altered theme for the sprites.

 
Jul 11, 2016 11:03pm
Steffen Huber (91) 1645 posts

You can also determine that a file is UTF-8 if you come across legal sequences.

Every legal UTF-8 sequence is also a legal single byte encoding sequence.
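To illustrate with a hedged Python sketch (the sample character is an arbitrary choice): the two bytes that UTF-8 uses for "é" are also two perfectly legal characters in ISO-8859-1, so nothing in the byte stream itself says which reading was intended.

```python
data = "é".encode("utf-8")          # two bytes: 0xC3 0xA9

as_utf8 = data.decode("utf-8")      # read as one UTF-8 character
as_latin1 = data.decode("latin-1")  # read as two Latin-1 characters, no error

print(repr(as_utf8))    # 'é'
print(repr(as_latin1))  # 'Ã©'
```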

Just witness encoding auto detection in browsers – they often get it wrong, because it is an unsolvable problem.

 
Jul 12, 2016 8:10am
Chris (121) 437 posts

Thanks Frederick.

I do get File ‘<Chars$Dir>.!Help’ not found at line 1800 on selecting Help from the icon bar menu.

OK, I’ve spoken to ROOL who also can’t reproduce the problems with running out of memory, etc. Could you report the results of these commands:

*Show Chars*
*Ex Resources:$.Apps.!Chars
*Show Wimp*

Are you using a standard ROM download from the site, rather than building your own?

 
Jul 12, 2016 9:37am
Rick Murray (539) 10579 posts

Every legal UTF-8 sequence is also a legal single byte encoding sequence.

While this is correct, you need to keep looking and not just judge based upon the first sequence found. I think if you encounter, say ten UTF sequences and no invalid high bit stuff, you may be able to have confidence in the file being UTF-8. It would surely be a very rare file that wasn’t UTF-8…while only containing valid UTF-8 sequences.
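A rough sketch of that kind of scan in Python. The function name, the labels it returns, and the default threshold of ten are illustrative assumptions taken from the "say ten" figure above, not an established algorithm. It yields a judgement, not a determination.

```python
def utf8_confidence(data: bytes, threshold: int = 10) -> str:
    """Classify data by how much evidence there is for UTF-8.

    Returns one of: "not-utf8", "ascii", "possibly-utf8", "probably-utf8".
    """
    try:
        text = data.decode("utf-8")  # any invalid high-bit byte raises here
    except UnicodeDecodeError:
        return "not-utf8"

    # Each non-ASCII character came from exactly one multi-byte sequence.
    multibyte = sum(1 for ch in text if ord(ch) > 0x7F)
    if multibyte == 0:
        return "ascii"  # no high-bit bytes: UTF-8 and Latin1 are identical
    if multibyte >= threshold:
        return "probably-utf8"
    return "possibly-utf8"


print(utf8_confidence(b"plain English"))
print(utf8_confidence("é".encode("utf-8")))
print(utf8_confidence("é".encode("utf-8") * 10))
print(utf8_confidence("é".encode("latin-1")))
```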

 
Jul 12, 2016 11:32am
Steffen Huber (91) 1645 posts

While this is correct, you need to keep looking and not just judge based upon the first sequence found. I think if you encounter, say ten UTF sequences and no invalid high bit stuff, you may be able to have confidence in the file being UTF-8. It would surely be a very rare file that wasn’t UTF-8…while only containing valid UTF-8 sequences.

Experience says: no, not rare. Especially if your decision is not only “UTF-8 or ISO-8859-1(5)”, but also includes other single byte encodings.

You can try to make an educated guess. It can be “judged”. But it cannot be determined.

 
Jul 12, 2016 12:03pm
Rick Murray (539) 10579 posts

But it cannot be determined.

That’s why I said confidence rather than absolute. It’s like science – it only takes one experiment to disprove something, but any number of “proofs” only increase confidence by virtue of the theory not having been disproven. ;-)

 
Jul 12, 2016 2:22pm
Steffen Huber (91) 1645 posts

That’s why I said confidence rather than absolute.

You said “determine”. That’s why I responded at all. “Determine” is – according to my dictionary – not the same as “guess with some confidence”.

 
Jul 12, 2016 3:09pm
Paul Sprangers (346) 192 posts

You can try to make an educated guess. It can be “judged”. But it cannot be determined.

But, cough… how does Windows do it then? Firefox, Thunderbird, Word – even the humblest notepad displays Unicode and I never noticed any failure.

 
Jul 12, 2016 3:42pm
Steffen Huber (91) 1645 posts

But, cough… how does Windows do it then? Firefox, Thunderbird, Word – even the humblest notepad displays Unicode and I never noticed any failure.

You are trying the wrong things :-)

Firefox has no problem if proper HTML is used – after all, specifying the correct encoding is part of “proper HTML”. Now place a plain text file on your server, with a single byte encoding of your choice using high-bit characters. There are very good chances that Firefox “guesses” UTF-8 content.

Thunderbird usually has no problem because modern emails usually carry the correctly specified encoding (or something like “quoted printable”). Give it an email with unspecified encoding, again single byte encoding with high-bit, and watch it fail miserably.

Word always knows which encoding to use because it is either a default (old binary format) or explicitly specified (XML formats).

Bottom line: guessing the encoding is difficult.

 
Jul 12, 2016 4:39pm
Paul Sprangers (346) 192 posts

Bottom line: guessing the encoding is difficult.

Then only one conclusion seems to be left over: RISC OS should be rewritten so that it expects specified encodings in text files.
(No idea if this makes sense, I finished some glasses of delightful grappa already.)

This also seems to contradict Rick’s statement, which actually was mine too. But again, grappa and all that…

 
Jul 12, 2016 4:40pm
Frederick Bambrough (1372) 708 posts

Chris

*show Chars*

*Ex Resources:$.Apps.!Chars
Dir. Resources:$.Apps.!Chars Option 02 (Run) 
CSD  Resources:"Unset"
Lib. Resources:"Unset"
URD  Resources:"Unset"
!Help        WR/     Text      10:27:26 09-Jul-2016    5 kbytes
!Run         WR/     Obey      10:27:23 09-Jul-2016  235  bytes

*Show Wimp*
Wimp$IconTheme : Bluberry.
Wimp$Scrap : SDFS::HardDisc0.$.!BOOT.Resources.!Scrap.ScrapDirs.ScrapDir.ScrapFile
Wimp$ScrapDir : SDFS::HardDisc0.$.!BOOT.Resources.!Scrap.ScrapDirs.ScrapDir
Wimp$State : desktop
*

After running Chars I get;

*Show Chars*
Chars$Dir : SDFS::HardDisc0.$.Public
Chars$Path : SDFS::HardDisc0.$.Public.,Resources:$.Resources.Chars.

Public being the dir I’m using for the altered !Run.

Yup, I’m using the standard ROM. I wouldn’t know how to build one!

 
Jul 12, 2016 5:15pm
Frederick Bambrough (1372) 708 posts

Doh! It eventually occurred to me you want the results after a clean boot and without the changed !Run. Here it is.

*show Chars*
Chars$Dir : Resources:$.Apps.!Chars
Chars$Path : Resources:$.Apps.!Chars.,Resources:$.Resources.Chars.

*Ex Resources:$.Apps.!Chars
Dir. Resources:$.Apps.!Chars Option 02 (Run) 
CSD  Resources:"Unset"
Lib. Resources:"Unset"
URD  Resources:"Unset"
!Help        WR/     Text      10:27:26 09-Jul-2016    5 kbytes
!Run         WR/     Obey      10:27:23 09-Jul-2016  235  bytes

*Show Wimp*
Wimp$IconTheme : Bluberry.
Wimp$Scrap : SDFS::HardDisc0.$.!BOOT.Resources.!Scrap.ScrapDirs.ScrapDir.ScrapFile
Wimp$ScrapDir : SDFS::HardDisc0.$.!BOOT.Resources.!Scrap.ScrapDirs.ScrapDir
Wimp$State : desktop
*

I thought cycling was supposed to improve one’s wits.

 
Jul 12, 2016 5:16pm
Rick Murray (539) 10579 posts

Now place a plain text file on your server, with a single byte encoding of your choice using high-bit characters. There are very good chances that Firefox “guesses” UTF-8 content.

Yes. It does. And there are often very good reasons why – sniffing the index page of this site (which I note requests two cookies to be set, but doesn’t pop up the obligatory annoying notice ;-) ), the first line is:

Content-Type:	text/html; charset=utf-8

If you serve a text file and your server is set to include that within the HTTP header, then Firefox is only doing what it was told…

I ran into this myself, which is why my site doesn’t specify any encoding in the HTTP header. I used http://web-sniffer.net to look at the headers.
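For anyone wanting to check this in code rather than with a sniffer, here is a small sketch using Python's standard `email.message.Message` class to pull the charset out of a Content-Type value. The helper name `charset_from_content_type` is an assumption for illustration; a result of `None` means the header declared nothing and the client is left to guess.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=utf-8"))  # utf-8
print(charset_from_content_type("text/plain"))                # None
```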

Give it an email with unspecified encoding, again single byte encoding with high-bit, and watch it fail miserably.

I don’t know about newer versions of Thunderbird. Older ones never seemed to suffer too badly for receiving Latin1 emails from a RISC OS application. It would be the usual stuff (fancy quotes in a different place in CP-1252) but nothing extraordinary.

Given that I sometimes received mangled address labels, with my “é” turned into some gibberish, I’m wondering if this whole problem isn’t being made harder than it ought to be.

Bottom line: guessing the encoding is difficult.

Guessing the encoding with any level of confidence is harder, but then anybody who attempts to determine UTF-8 by looking only at the first sequence found needs a kick in the goolies. There may well be some obscure Polish word in Latin5 that actually contains a valid UTF-8 sequence, so you really need to scan through to find a few sequences to make any sort of judgement.

That said, we are really getting off the topic of how the Wimp can be expected to cater for both older applications (by older, I mean “every one thus far written”) and Unicode applications. Being in the UTF-8 alphabet is a non-starter as it breaks everything else for non-English users…

 
Jul 12, 2016 5:38pm
Steve Pampling (1551) 6545 posts

It’s like science – it only takes one experiment to disprove something, but any number of “proofs” only increase confidence by virtue of the theory not having been disproven. ;-)

Ah, the joys of misunderstanding the language, even born and bred English speakers get that one wrong.

In the context given “proof” is the result of the test and “prove” is “test”, so multiple tests giving the same result do imply the theory is correct but they do not categorically rule any other option out.
BTW. Since many people wrongly believe “prove” to mean demonstrate to be true there is often a good debate. They are however wrong, no matter how many of them believe otherwise (ref. the old adage about flies)

 
Jul 12, 2016 6:34pm
Doug Webb (190) 858 posts

Chris

Here are my results after deleting EasyFonts from the start up menu.

*show Chars*

*ex Resources:$.Apps.!Chars
Dir. Resources:$.Apps.!Chars Option 02 (Run) 
CSD  Resources:"Unset"
Lib. Resources:"Unset"
URD  Resources:"Unset"
!Help        WR/     Text      10:27:26 09-Jul-2016    5 kbytes
!Run         WR/     Obey      10:27:23 09-Jul-2016  235  bytes

*show Wimp*
Wimp$Font : Homerton.Medium
Wimp$IconTheme : PandaLand2.
Wimp$Scrap : SDFS::ARMiniX.$.!BOOT.Resources.!Scrap.ScrapDirs.ScrapDir.ScrapFile
Wimp$ScrapDir : SDFS::ARMiniX.$.!BOOT.Resources.!Scrap.ScrapDirs.ScrapDir
Wimp$State : desktop


Then after attempting to run !Chars

*show Chars*
Chars$Dir : Resources:$.Apps.!Chars
Chars$Path : Resources:$.Apps.!Chars.,Resources:$.Resources.Chars.
*

I would do it in a nice textual way if the help file was any use whatsoever :-)

 
Jul 12, 2016 7:33pm
Chris (121) 437 posts

Frederick: I’m the one whose wits are slow :) The reason you’re getting the error when selecting Help from the menu is that you’ve moved the !Run file, thus setting Chars$Dir to that directory. I’d forgotten you were doing that in order to get it to run with a larger wimpslot.

So that’s one thing solved. But I’m no closer to understanding why Chars on your/Doug’s system runs out of memory. I suppose it would be useful to know if it’s running as it should on OMAP3/4 ROMs generally, or whether this is something that affects all Beagle/Pandaboards.

 
Jul 12, 2016 8:12pm
Rick Murray (539) 10579 posts

so multiple tests giving the same result do imply the theory is correct but they do not categorically rule any other option out.

Which is why it was put in quotes. A “proof” (layman’s definition) doesn’t really prove anything other than “here’s one more test that doesn’t disprove the theory”.

 
Jul 12, 2016 9:30pm
Steffen Huber (91) 1645 posts

Then only one conclusion seems to be left over: RISC OS should be rewritten so that it expects specified encodings in text files.
(No idea if this makes sense, I finished some glasses of delightful grappa already.)

It is the job of whatever application is showing the text file to support different encodings and, if it cannot be determined, let the user choose the correct encoding.

It would be a good idea if the OS would support conversion between different common encodings. Apart from that, the OS should be encoding agnostic. All IMHO of course.
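As a hedged sketch of that division of labour in Python: `decode_text` stands in for the application (use a declared encoding if there is one, otherwise try UTF-8 as a guess, otherwise fall back to whatever the user picked), and `convert` stands in for an OS-level conversion service. Both names and the `user_choice` parameter are hypothetical, and the UTF-8 attempt is only a guess, not a determination.

```python
from typing import Optional, Tuple


def decode_text(data: bytes, declared: Optional[str] = None,
                user_choice: str = "latin-1") -> Tuple[str, str]:
    """Application side: return (text, encoding actually used)."""
    if declared:
        return data.decode(declared), declared
    try:
        return data.decode("utf-8"), "utf-8"  # a guess that may be wrong
    except UnicodeDecodeError:
        # Cannot be UTF-8, so defer to the encoding the user selected.
        return data.decode(user_choice), user_choice


def convert(data: bytes, src: str, dst: str) -> bytes:
    """Conversion service: re-encode bytes from src to dst."""
    return data.decode(src).encode(dst)


print(decode_text(b"caf\xe9"))                  # falls back to the user's choice
print(convert(b"caf\xe9", "latin-1", "utf-8"))  # b'caf\xc3\xa9'
```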

 
Jul 12, 2016 10:42pm
Doug Webb (190) 858 posts

Chris,

I think I know what is the issue and it seems to be related to the number of Fonts in the !Fonts directory in Resources.

I installed a clean !Boot and then rebooted so all the choices were set up as new, then ran !Chars and it worked.

I then reintroduced all of the added Fonts I had in !Fonts and rebooted and tried !Chars and got the failure.

I deleted them gradually, testing each time after a reboot, until I had 23 different font folders in !Fonts at which point !Chars worked.

To ensure it did not relate to a particular font, I altered the fonts that made up the 24th entry (though I only tried another 10 different fonts, not all of them), and on each occasion !Chars gave the error.

So it does seem to be related to the number of fonts at least on this set up.

Hope this helps

 
Jul 13, 2016 12:12am
Frederick Bambrough (1372) 708 posts

I think I know what is the issue and it seems to be related to the number of Fonts in the !Fonts directory in Resources.

This was easy for me to confirm. I keep two font directories, one for the default fonts (5) and another for fonts I’ve added (68). This made it easy for me to move the second dir out of Resources temporarily and reboot. Result same as Doug’s – Chars works.

 
Jul 13, 2016 8:06am
Chris (121) 437 posts

I think I know what is the issue and it seems to be related to the number of Fonts in the !Fonts directory in Resources.

In the source in CVS it looks like the code that creates the fontlist, which should grow the wimpslot to accommodate long lists, doesn’t. Not sure why – it used to. I think when I did some tidying of the source for submission I must have had an idiot moment and mangled the code. I’ll take a look at it tonight and should be able to send a fix in.

Apologies for the inconvenience, many thanks for your detective work!

 
Jul 13, 2016 9:46am
Rick Murray (539) 10579 posts

Apologies for the inconvenience,

That’s okay. That’s why this is not the “stable” release. Think of it as crowd sourced bug bashing. ;-)

While I’m here – is there somebody with a large font collection willing to zip up and mail me a copy? I ought to test Ovation with lots of fonts.

 
Jul 13, 2016 10:34am
Andrew Conroy (370) 626 posts

While I’m here – is there somebody with a large font collection willing to zip up and mail me a copy? I ought to test Ovation with lots of fonts.

Drop me an email to a.m.conroy (at) owlart.co.uk and I can send you tons of them!

