Automated edit proposal - convert non-standard dashes to standard dashes

Discouraging folks from fixing things seems a pretty bad look IMO. No edit ever fixes everything in the db in one swoop and for all time. Holding an edit to that standard is preposterous. We all get to contribute in our own way. This kind of edit is fine and should be encouraged. It’s not making anything worse and will make things better in some cases.

7 Likes

This is not your run-of-the-mill unstructured tag. You need to invest a substantial amount of effort one way or the other to parse and evaluate the values mechanically.

Normalizing the characters is a 0.0001% thing, literally totally irrelevant from an effort pov, and anything robust will do that in any case. For human consumption the lookalike chars are exactly that and the differences are completely irrelevant.
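
To show how small the normalization step is on the consumer side, here is a minimal Python sketch. The set of lookalike characters and the function name `normalize_dashes` are my own assumptions for illustration; they are not taken from the specification or from any existing parser.

```python
# Lookalike dashes mapped to ASCII hyphen-minus (U+002D).
# Assumption: this set is illustrative, not a list from the specification.
LOOKALIKE_DASHES = {
    "\u2010": "-",  # HYPHEN
    "\u2011": "-",  # NON-BREAKING HYPHEN
    "\u2012": "-",  # FIGURE DASH
    "\u2013": "-",  # EN DASH
    "\u2014": "-",  # EM DASH
    "\u2212": "-",  # MINUS SIGN
}
_TABLE = str.maketrans(LOOKALIKE_DASHES)


def normalize_dashes(value: str) -> str:
    """Return value with every listed lookalike dash replaced by U+002D."""
    return value.translate(_TABLE)
```

A parser could run something like this over the raw tag value before tokenizing, which leaves everything else in the value untouched.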

And I’m not advocating doing nothing, I’m suggesting to fix the subpar implementation.

This is true, though opening_hours=* is the antithesis of the sort of freeform key that mappers generally take a laissez-faire approach toward. I’m not surprised to see both mappers and developers expecting more consistent usage of this key. Thanks to its presentation and complexity, the opening hours specification seems a lot more airtight and unambiguous than it really is.

A similarly superficially rigorous key is phone=*. It turns out that, in some languages and regions, popular style guides or publishing standards mandate the use of en dashes, perhaps one of the reasons why 786 phone numbers contain en dashes. (There’s a particularly high concentration in the D–A–CH region, possibly due to the wiki’s stipulation of DIN 5008, though I don’t have access to the standard to be sure.)

Per Postel’s law, we could strive to make the database more consistent and developers could make their software more forgiving of syntax errors that slip through.

6 Likes

I looked at the “not a reference implementation” OpeningHoursParser/src/main/java/ch/poole/openinghoursparser/OpeningHoursParser.jj at 42d1821b0db3ada0fd9b3cd0da609b99c10d6354 · simonpoole/OpeningHoursParser · GitHub and indeed, adding more lookalikes seems trivial. Perhaps this topic can lengthen the list?

Some of these comments make it sound like we’re browser makers…

Downstream consumers are (mostly) out of our control, but the data isn’t. According to the spec (as @Duja has already linked), it’s illegal to use a non-ASCII minus. So values using a non-ASCII minus are illegal.

Yes, maybe downstream consumers can/could/will/would/shall/should handle this themselves, but only because the upstream data is wrong. Stop yak shaving y’all.

Thanks @matheusgomesms for the initiative! Hope it’s clear that I support this (after the usual bureaucratic stuff).

5 Likes

It does not match the specification, but calling it illegal is taking it a bit too far :slight_smile:

Note that in English “illegal” means “breaking the law”, and misformatting opening hours in OSM does not break any legal rules anywhere.

“Not allowed” would be a better term. (I would send it as a PM but it seems you blocked them)

2 Likes

For what it’s worth, the term “illegal” in the sense “not according to specification” has a history in IT circles:
(https://www.pcmag.com/encyclopedia/term/illegal-operation)

An operation that is not authorized or understood. An “illegal operation” error message typically means that the computer has been directed to execute an invalid instruction and has stopped or has terminated the offending application (see abend).

…although it has fallen out of fashion lately.

6 Likes

As @Duja said. It’s also commonly used in (formal) language theory: an illegal string is one that does not conform to the (formal) language in question.

7 Likes

Hahahahaha.

Calling that page a “specification” (and yes, it is the name of the page) is a bit of a misnomer. A specification implies that people adding opening hours values have been told that they need to add values that conform to that specification, when in practice almost no-one editing data in OSM has seen it. With a bit of luck, they’ll be using an editor that can extract what they know about opening hours and format that into something that other people can understand, but there will always be edge cases (e.g. “I know that this place stays open late on a Thursday but don’t know any other opening hours”).

To be clear, “open late on a Thursday” is in no way a machine-readable opening hours string but it is more useful than writing nothing at all.

2 Likes

Yes, let us shave a few yaks more.

Next you’re gonna tell me that ISO standards, IETF RFCs, HTML, ECMA, and others are not specs because nobody reads them. But first tell me something serious, I can’t stop laughing…

Now, apparently this has a name – “derailing”, someone in this forum called it. So can we get back on the rails? What do you have against the proposed edit?

It is machine-readable if you tell the machine how to read it. No magic. And that’s within the spec, it’s called a comment, as long as it’s surrounded by double quotes.

1 Like

Now it’s me that’s laughing :smiley:

(my emphasis, obviously)

Anyone who’s worked with the opening hours syntax would readily acknowledge that it has all the fuzziness and holes of moldy Swiss cheese, but it’s better than nothing, unless you’re proposing that we abandon it in favor of treating the key as freeform text.

It’s a somewhat formal grammar, that’s all. If we had called the page “guidelines”, it would’ve become a lot less stable and a lot fuzzier. On the bright side, it would probably have an answer for how to represent non-Christian holidays by now. On the other hand, it would probably lay out alternative “approaches” six ways to Sunday, like any other tagging page.

If you think that’s bad, at least everyone has access to the grammar. In OpenHistoricalMap, we have frequent debates about how to represent some common scenarios in EDTF because all we have is a choice between an underspecified, easy-reading summary by the Library of Congress, a formal ISO 8601 specification that no one has seen because it would cost them their firstborn, and style guides by any number of universities trying to bring some order to the chaos. Yet it still makes plenty of sense to try to align on something.

ISO 8601 also has another component about recurrence rules, trying to address the same thing as our opening hours syntax. Lots of software uses it, and its specification is apparently airtight, but no human would ever be expected to author it by hand.

This is a valid consideration for editor developers: any editor implementing a validation rule should treat a malformed value as merely a warning, at least until the same editor supports the entirety of the opening hours syntax in an intuitive point-and-click interface (or maybe link out to ChatGPT).

However, the proposal before us is to clean up some values in the database. As long as @matheusgomesms can be reasonably certain that the edit won’t introduce any additional syntax errors or ambiguity or change the meaning of a tag away from the intended value, then there should be little downside, even if the initial upside may be limited. Maybe someone cares that it would churn lots of POIs or require big bboxes? Maybe a developer won’t appreciate that the special cases they put in will become less essential?

I’d venture a guess that someone has already been carrying out this very cleanup task in one region for years, but no one raised a fuss, and probably no one will care to go back and undo it.

4 Likes

Well, in practice only the subset of opening hours values that actually follow it (almost always because the mapper used an editor that handles the formatting for them) has a decent chance of being usable.

ATYL also applies here, and opening_hours=otwarte od szóstej do osiemnastej w dni robocze (Polish for “open from six to eighteen on working days”) is still better than nothing, but that does not change the fact that it is a specification and that it is really worth following.

I don’t mean to take the role of the moderators here, but the questions about the machine-readability of comments in opening_hours=* and whether or not the specification is an actual formal specification are outside of the scope of the original post: replacing different types of dashes with U+002D. So, continuing the discussion about this proposed edit, I would like to raise the following questions:

  • Are there any (solid) arguments on why this automated edit could be problematic, or why it might be better to review all data manually instead of performing an automated edit?
  • What characters exactly would be replaced with U+002D? Anything with the Unicode property Dash=yes maybe, or specific characters?
  • Would comments within opening_hours=* values also be affected by this edit?
  • Would keys such as collection_times=*, service_times=* and conditional restrictions be affected?
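
On the second question, one way to see what a property-based selection would actually cover is to enumerate the candidates. The sketch below is only an approximation under a stated assumption: Python's standard `unicodedata` module does not expose the binary Dash property, so it uses the general category Pd (dash punctuation) as a stand-in. The function name `dash_punctuation` is hypothetical.

```python
import sys
import unicodedata


def dash_punctuation() -> list[str]:
    """All characters in general category Pd, except U+002D itself.

    Assumption: Pd is used as a rough stand-in for the Unicode Dash
    property, which the standard library does not provide.
    """
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Pd" and cp != 0x2D
    ]


if __name__ == "__main__":
    for ch in dash_punctuation():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

The two selections are not identical: U+2212 MINUS SIGN, for instance, has general category Sm and so would be missed by a Pd filter. That suggests an explicit, reviewed list of characters may be safer than relying on either property alone.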
11 Likes

I was thinking that if people are concerned about comments, features with those could be excluded from a first run.
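
A deliberately conservative way to do that could look like the sketch below. It assumes, as mentioned earlier in the thread, that comments are delimited by ASCII double quotes, and it sets aside any value containing a double quote at all, balanced or not. The function names are hypothetical.

```python
def has_possible_comment(value: str) -> bool:
    """True if the value contains an ASCII double quote anywhere.

    Assumption: comments are delimited by double quotes, so any quote
    (even an unbalanced one) is treated as a sign of a comment.
    """
    return '"' in value


def first_run_candidates(values: list[str]) -> list[str]:
    """Keep only the values that are safe to touch in a first run."""
    return [v for v in values if not has_possible_comment(v)]
```

Values that are filtered out this way could then be reviewed manually or handled in a later run.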

7 Likes

To be honest, nothing - except that without finding and fixing the source of these “odd” values, you’ll have to do the same thing again, and again, and again. As long as there are some “non-conforming” items in the data, software that wants to try and understand it will need to cope with these values.

2 Likes

Would it be useful then to look at 20-30 example changesets that added opening hours in nonstandard syntax? Maybe they all used the same editor or there is something else they have in common.

3 Likes

I think one of the name-suggestion-index maintainers did an analysis like this a few years ago and came to a soft conclusion that a lot of smart dashes were probably being added by Safari on Apple platforms, which enables smart quotes and dashes by default in text fields. I don’t know that there’s ever been a decisive consensus against them in every freeform key, in every language, but keys that accept opening hours syntax are a no-brainer compared to name and inscription.

Regardless, the nonbreaking spaces I turned up suggest that at least some of these mappers are copy-pasting from another site. If so, there’s certainly not enough intentionality to justify all the time we’re spending scrutinizing this proposal.
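
For anyone who wants to look for the same copy-paste signal, a minimal Python sketch follows. It reports any space separator other than the plain ASCII space, using the general category Zs; the function name `suspicious_whitespace` is hypothetical, and this is not the query I actually ran.

```python
import unicodedata


def suspicious_whitespace(value: str) -> list[str]:
    """Unicode names of space separators in value other than U+0020."""
    return [
        unicodedata.name(ch)
        for ch in value
        if unicodedata.category(ch) == "Zs" and ch != " "
    ]
```

A value that returns a non-empty list here was probably pasted from formatted text rather than typed into a tag field.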

4 Likes

I’d say improve the data, move on with your life, and if it’s still interesting to you in six months, do it again.

In the meantime, 2000 opening_hours tags will be easier to parse.

Other problems with the specification or the editing tools or the parsers should be fixed too but no one has volunteered. For now, @matheusgomesms is willing to improve the data right now, so I say go for it.

7 Likes

From what I saw, these usually come from copy-pasting from websites etc. (I did this lots of times), which is why I’m willing not only to fix my own mistakes, which I already did in my city in the example I mentioned, but also to fix the rest of the world.

Looks like a no-brainer to me, but I’m in awe of the number of messages for such a simple thing! OSM never disappoints heh! (Obviously all messages here are well intentioned, and it’s nice to see the deep knowledge our users have about so many things.)

I’ll wait a couple of days until we reach a consensus here, so that’s why I’m not engaging that much anymore. Like I said, I’m willing to put some work here, but I won’t spend many hours on that.

3 Likes