Subscribe to this thread
Home - General / All posts - LIKE operator in M9.0.180.0 SQL
David S
136 post(s)
#08-May-24 10:51

I have a table with over 6.2 million records x 25+ fields. One of the fields ('Geol_symb') type nvarchar contains values of the following form: N_sad, N1_soc, H2_or, Pg2_vas_ap, Pg3_pim_a, C13-J1_soh_J1, ...

There are hundreds of different codes but those examples will suffice for this post.

I have another field 'Geol_Class' typenvarchar that I am using to consolidate the above codes into five geological classifications: Reclaimed, Sedimentary, Igneous-volcanic, Igneous -plutonic, Metamorphic.

I am doing the consolidation using SQL code in a Transform Expression. The SQL code I am attempting to use is as follows:

CASE

WHEN [Geol_symb] LIKE '%\_o%' THEN 'Reclaimed'

WHEN [Geol_symb] LIKE '%\_s%' THEN 'Sedimentary'

WHEN [Geol_symb] LIKE '%\_v%' THEN 'Igneous - volcanic'

WHEN [Geol_symb] LIKE '%\_p%' THEN 'Igneous -plutonic'

WHEN [Geol_symb] LIKE '%\_*%' THEN 'Metamorphic'

END

The intention is that codes containing

  • '_o' will be interpreted as 'Reclaimed'
  • '_s' will be interpreted as'Sedimentary'
  • '_v' will be interpreted as'Igneous - volcanic'
  • '_p' will be interpreted as'Igneous -plutonic'

  • '_*' will be interpreted as'Metamorphic'

    I am using the '\' escape character so the underscore characters in the geological codes are recognised as literal characters, not a wildcards.

    However, the code is not working as expected.

    Yes:

    • H2_or is correctly recognised as 'Reclaimed'

    • N_sad is correctly recognised as 'Sedimentary'
    • Pg3_pim_a is correctly recognised as 'Igneous -plutonic', etc

    BUT:

    • N1_soc appears to be incorrectly recognised as 'Reclaimed'. It should be 'Sedimentary' (contains '_s', not '_o')
    • Pg2_vas_ap appears to be incorrectly recognised as 'Sedimentary'. It should be 'Igneous - volcanic' (contains '_v', not '_s')
    • etc

    It seems the '\' escape characteris not working correctly since '_so' is being matched as '_o'.

    Similarly '_vas' is being matched as '_s'.

    Either '_' is still being treated as a wildcard character despite the escape character (most likely), or in the latter case for example it matches '_s' to '_' plus any following 's', even though it is separated from the '_'.

    The other type of problem I am encountering is with the '*' character used in the Metamorphic codes. I cannot get the SQL to recognise '*' as a literal character, however I try; eg. as '\_*', '*'. or '\*' (in any case, the Manifold User Manual page on the LIKE operator does not show * as a special character)

    ==

    One other oddity is that if I reverse the order of the first two WHEN statements such that I have

    WHEN [Geol_symb] LIKE '%\_s%' THEN 'Sedimentary'

    WHEN [Geol_symb] LIKE '%\_o%' THEN 'Reclaimed'

    then N1_soc is correctly recognised as 'Sedimentary'. And similarly for other cases where there is an 'o' at some point after the '_s'. These were previously being incorrectly equated to 'Reclaimed'

    H2_or is still correctly recognised as 'Reclaimed'

    But it does not fix the other failures as discussed above.

    ==

    As far as I can make out my code is consistent with the Manifold User Manual websites

    https://manifold.net/doc/mfd9/sql_operators.htm

    https://manifold.net/doc/mfd9/like_operator.htm

    This seems to be an error in the SQL implementation in M9.

    Or am I doing something wrong?

    ==

    Please do not respond "get rid of the '_' and '*' in the codes", or "get the data provider to use different codes", or similar. These are high resolution geological data with hundreds of geologicalcodesprovided free online by a Japanese government entity, and the codes are effectively 'official'. They put in an enormous amount of work (years) to unify and rationalise the geological codes used across the whole of Japan. There is no way they are going to totally revise the geologicalcodes at my behest.

  • David S
    136 post(s)
    #08-May-24 11:06

    Briefly adding a couple of points

    • As far as I can make out, my SQL code seems consistent with the User Manual pages I linked, and hence should work as intended?
    • Well perhaps apart from the '*' issue .
    • I have tried a range of alternatives such as using [_] instead of \_ to identify '_' as a literal character.

    [] seems to be standard (/MS?) SQL? However, it does not appear to work in M9. I got zero matches in 6.2 million records when I used that option.

    mlinth
    450 post(s)
    #08-May-24 11:55

    Two thoughts.

    First, I'd be tempted to build a lookup table with the distinct input values and fill a column with the values you want them mapped to. That way you move the information out of your query and into a table, where it is easier to see, and you can replace a complicated case statement with a join.

    It seems like you're looking at the data after the underscore, so I would try to write a regular expression to extract the token following an underscore. There's an example in the user manual Example: Get Tokens From a String (manifold.net)

    M

    David S
    136 post(s)
    #08-May-24 13:36

    Many thanks for the suggestions. I may yet have to look into those, though for a lookup table it might not be easy to collate all the different geological codes in the dataset! There are hundreds buried in amongst 6.2 million records... I'm still discovering even more convoluted ones.

    Following your link did drawn my attention to another alternative. That is to use the StringContains() function rather than the LIKE operator; ie. the code becomes as below.

    CASE

    WHEN StringContains([Geol_symb], '\_o') THEN 'Reclaimed'

    WHEN StringContains([Geol_symb], '\_s') THEN 'Sedimentary'

    WHEN StringContains([Geol_symb], '\_v') THEN 'Igneous - volcanic'

    WHEN StringContains([Geol_symb], '\_p') THEN 'Igneous -plutonic'

    WHEN StringContains([Geol_symb], '\*') THEN 'Metamorphic'

    END

    ==

    With that, the escape character '\' does work correctly so that now Pg2_vas_ap is correctly recognised as 'Igneous - volcanic'. Ditto for N1_soc (correctly recognised as 'Sedimentary').

    This seems to reinforce the likelihood that the escape character '\' may not be correctly implemented for the LIKE operator?????

    ==

    Unfortunately:

    1. StringContains() doesn't fix the '*' issue . Not certain how I am going to get around that issue as '*' appears to be the only unique character used in all Metamorphic rock codes. Hmm .... This might ultimately force me to take the task outside of Manifold and use a one off home baked program to do the consolidation.
    2. I have now discovered that among the hundreds of different codes in the dataset, there are some of the form P12-Tr32_vbs_o_J3. That is, the code includes a '_v' and a '_o'. This is incorrectly being interpreted as 'Reclaimed' whereas it should be 'Igneous - volcanic'. I will need somehow to restrict the matching to the first '_' occurrence. May need to go down the regular expression route (has an Occurrence operator), but it seems like that may become messy for the dataset I have.

    Starting to look like I may need to use a home baked program :(.

    David S
    136 post(s)
    #08-May-24 13:56

    Quick follow up. Putting the 'Reclaimed' and 'Sedimentary' code lines at the bottom fixed the P12-Tr32_vbs_o_J3 and similar cases (now correctly identified as 'Igneous - volcanic'). But too much data to yet clarify if other problems are still there (apart from '*') or were introduced.

    David S
    136 post(s)
    #08-May-24 14:32

    Argh ... I have now discovered that the '*' is infact a wildcard in the geological codes and can represent any one of 48 different 3 - 5 character strings listed on a separate sheet of the Legend spreadsheet (Legend is spread over five sheets).

    Fortunately those all start with 'm' (for metamorphic of course) so at least the '*' problem disappears (use '\_m').

    Seems like may now be able to sort it using StringContains(), provided there are no ' ... _s ... _v ...' type situations where I need the revised order of statements does not cause 'Sedimentary' to be mis-assigned as 'Igneous - volcanic'.

    It is the letter following the first instance of '_' that significant the general rock classification.

    Would be good if there was a way of explicitly restricting the matches to the he letter following the first instance of '_'.

    ==

    Thanks to both of you. I think I may now have cracked it.

    Still leaves an apparent issue with '\' in the LIKE implementation.

    Sloots

    694 post(s)
    #08-May-24 12:40

    Try this as an alternative:

    --SQL 

    CASE

     WHEN StringRegexpMatches([Geol_symb]'.+\_o.+''gi'THEN 'Reclaimed'

     WHEN StringRegexpMatches([Geol_symb]'.+\_s.+''gi'THEN 'Sedimentary'

     WHEN StringRegexpMatches([Geol_symb]'.+\_v.+''gi'THEN 'Igneous - volcanic'

     WHEN StringRegexpMatches([Geol_symb]'.+\_p.+''gi'THEN 'Igneous -plutonic'

     WHEN StringRegexpMatches([Geol_symb]'.+\_.+''gi'THEN 'Metamorphic'

    END

    .+ means one of more characters

    \_ means a single underscore

    Chris


    http://www.mppng.nl/manifold/pointlabeler

    David S
    136 post(s)
    #08-May-24 14:01

    Thanks. Will look at tomorrow.

    What is the 'gi'?

    I get the 'i' part but couldn't find a description relating to 'gi'.

    Manifold appears to use .* rather than .+ ? See StringRegexpMatches() function examples in https://manifold.net/doc/mfd9/string_sql_functions.htm

    Sloots

    694 post(s)
    #08-May-24 15:18

    The 'g' is meaningless in Manifold. 'i' ignores the case.

    Regular expressions are very powerful. .* has a different meaning than .+.

    I'm 100% sure that the StringRegexpMatches() function can handle all cases in your table without the need of postprocessing.


    http://www.mppng.nl/manifold/pointlabeler

    David S
    136 post(s)
    #09-May-24 00:44

    Thanks @Sloots. Yes as I have now discovered, regular expressions are very powerful.

    Both the StringRegexpMatches() and the StringContains() options handle the great bulk of codes in the dataset.

    But as per one of my posts yesterday, I later discovered there are codes that contain more than one instance of a token that matches the target values. For example, the code P12-Tr32_vbs_o_J3 will match to both '_v' (correct), and '_o' (incorrect). The result therefore becomes dependent on the order of the WHEN statements in the CASE structure. Since there are possibly other codes (so many of them!) where the '_v' snd '_o' may be in the reverse order (or similarly for other letter combinations), it is not possible to ensure that all results from either StringRegexpMatches() or StringContains() are correct. Or at least not possible easily. Only the first encountered tokens are relevant to the current task. That is it is the letter immediately following the first '_' that is relevant.

    Not all lost though. On digging around following your suggestions, I discovered there is a regular expression that does do the job, as per the code below. Ironically it is even simpler coding than the other two options!

    The code below extracts only to the first instance of '_' and the immediately following character. Have now confirmed that this does work as intended.

    ==

    CASE StringRegexpSubstring([Geol_symb], '\_.', 'i') 

      WHEN '\_o' THEN 'Reclaimed'

      WHEN '\_s' THEN 'Sedimentary'

      WHEN '\_v' THEN 'Igneous-volcanic'

      WHEN '\_p' THEN 'Igneous-plutonic'

      WHEN '\_m' THEN 'Metamorphic'

    END

    Sloots

    694 post(s)
    #09-May-24 07:15

    Nice solution!


    http://www.mppng.nl/manifold/pointlabeler

    lionel

    1,001 post(s)
    #08-May-24 23:59

    it seem using this SQL SELECT * FROM [Table4] WHERE [Geol_symb] LIKE '%_s%';

    replace %_s% by %s% or %\_s% return same number of lines in result table 3 !!

    Attachments:
    Like_test.mxb


    Book about Science , cosmological model , Interstellar travels

    Boyle surface fr ,en

    David S
    136 post(s)
    #09-May-24 00:57

    @lionel. Again thanks for the suggestions. As per my post above, I have now solved the problem and now have simple code using a regular expression that robustly does the task.

    ==

    Still an apparent error in the LIKE implementation in M9 though, or in the documentation for it.

    Would be good if someone can confirm that (see my OP).

    (Fortunately no longer a problem for my current task)

    Dimitri


    7,479 post(s)
    #10-May-24 06:50

    Looks like a bug in that the documentation for LIKE fails to repeat the essential advice for using escape characters within strings in SQL that is given in the regular expression topic.

    The way LIKE is implemented inside 9 is that LIKE is treated as a simplified regular expression: the pattern for LIKE is converted into a regular expression, which is then processed by the regular expression machinery.

    So far, so good. However, the documentation for LIKE does not cover the parsing of escape characters within the pattern string that have meaning within SQL itself for any strings that occur in a query. That a string is used with an operator such as LIKE does not mean those parsing rules do not apply.

    To use escape characters within pattern strings for LIKE you have to follow the advice in the Escaping Special Characters and Avoiding the Use of Escape Characters sections of the Regular Expressions topic, which boil down to using dual \\ escapes in the string.

    For example, if your table is called [geology] the following works:

    -- $manifold$

    SELECT [mfd_id][Geol_symb],

      CASE

      WHEN [Geol_symb] LIKE '%\\_o%' THEN 'Reclaimed'

      WHEN [Geol_symb] LIKE '%\\_s%' THEN 'Sedimentary'

      WHEN [Geol_symb] LIKE '%\\_v%' THEN 'Igneous - volcanic'

      WHEN [Geol_symb] LIKE '%\\_p%' THEN 'Igneous -plutonic'

      WHEN [Geol_symb] LIKE '%\\_*%' THEN 'Metamorphic'

      END AS Description

    FROM [geology];

    as does

    -- $manifold$

    SELECT [mfd_id][Geol_symb],

      CASE

      WHEN [Geol_symb] LIKE @'%\_o%' THEN 'Reclaimed'

      WHEN [Geol_symb] LIKE @'%\_s%' THEN 'Sedimentary'

      WHEN [Geol_symb] LIKE @'%\_v%' THEN 'Igneous - volcanic'

      WHEN [Geol_symb] LIKE @'%\_p%' THEN 'Igneous -plutonic'

      WHEN [Geol_symb] LIKE @'%\_*%' THEN 'Metamorphic'

      END AS Description

    FROM [geology];

    But not

    -- $manifold$

    SELECT [mfd_id][Geol_symb],

      CASE

      WHEN [Geol_symb] LIKE '%\_o%' THEN 'Reclaimed'

      WHEN [Geol_symb] LIKE '%\_s%' THEN 'Sedimentary'

      WHEN [Geol_symb] LIKE '%\_v%' THEN 'Igneous - volcanic'

      WHEN [Geol_symb] LIKE '%\_p%' THEN 'Igneous -plutonic'

      WHEN [Geol_symb] LIKE '%\_*%' THEN 'Metamorphic'

      END AS Description

    FROM [geology];

    Here are screenshots of the query in action:

    The above uses dual \\ as described in the Escaping Special Characters section.

    The above uses Manifold's @ innovation, (useful, but not portable) as described in the Avoiding the Use of Escape Characters section.

    The above uses single \ escapes, which does not give the desired result.

    I've suggested some edits, including to the LIKE topic, to cover the above. I also think that some of the other topics in the SQL documentation should be reviewed to possibly add commentary on escapes within SQL strings that might be used in other situations, such as with other functions or operators.

    Thanks for the report!

    Attachments:
    like_01.png
    like_02.png
    like_03.png

    Dimitri


    7,479 post(s)
    #14-May-24 14:23

    To follow up, some edits prompted by your post have been added to the LIKE operator topic and other topics in the documentation.

    Unfortunately, given how SQL rules work (not just in Manifold, but in other SQLs as well) for backslash escaping within string literals the result is some dense reading to make sense of how all that plays out with string literal patterns used with LIKE or regular expressions versus patterns taken from field values.

    Thanks for posting!

    David S
    136 post(s)
    #16-May-24 12:50

    Dimitri, many thanks for implementing the documentation changes, and acting on it quickly (albeit dense reading indeed!)

    To be honest I am not certain I fully follow all of it, though that may be my limited experience with SQL.

    I think I follow the [source] LIKE 'a\_bc' example given near the end OK. That is, that 'a\_bc' gets preprocessed to 'a_bc' for use in the LIKE operator? Which explains why my original attempt at using LIKE didn't work. All good ... so far!

    But the documentation also says

    Because strings are processed to handle use of a backspace character to designate the following character as a literal, we generally have to use two backslashes in string literals in three settings:

    • String literals used as arguments to SQL functions, for example regular expression SQL functions.
    • Patterns used with LIKE.
    • Strings given as patterns forSelect pane templates

    Yet, '\_'did work when I used it in the regular expressions I tried. For example:

    StringRegexpSubstring below extracted the first instance of the literal '_' and its following character, and the resulting substring was successfully matched to the literals '_o', '_s', etc as appropriate, despite using only a single backslash in the <regexp> string.

    CASE StringRegexpSubstring([Geol_symb], '\_.', 'i') 

      WHEN '\_o' THEN 'Reclaimed'

      WHEN '\_s' THEN 'Sedimentary'

      WHEN '\_v' THEN 'Igneous-volcanic'

      WHEN '\_p' THEN 'Igneous-plutonic'

      WHEN '\_m' THEN 'Metamorphic'

    END

    For example, [Geol_symb] = Pg2_vas_ap was correctly matched to the pattern '_v',

    That means the substring extracted from [Geol_symb] was the two character literal string '_v', despite the pattern (<regexp>) specified in StringRegexpSubstring being a string ('\_.') with only a single backslash.

    The fact that this worked correctly, means the pattern specified ('\_.') was interpreted as the literal underscore character followed by any other single character (thankfully!).

    If the '_' had been interpreted as a wildcard for any single character, then the substring extracted would presumablyhave been thefirst two characters in [Geol_symb]; ie. 'Pg' in the example above?

    The discussion in the Dutch\sChocolate example in the new documentation seems to imply that '\\_.' would need to be used for <regexp> in my StringRegexpSubstring statement above?

    Or am I misunderstanding something?

    ==

    Another difference between use of LIKE and regular expressions I noticed while experimenting, trying to get my code to work was that:

    • with LIKE, the wildcard for 'any single character' was '_' (as stated in the documentation), whereas
    • in the regular expressions, the wildcard for 'any single character' was '.' (again as stated in the documentation)
    • '_' did not seem to work as a wildcard in the regular expressions I tested.

    The discussion in the new 'Escaping Special Characters in String Literals' documentation seems to apply as much to string literals used as arguments in regular expression SQL functions as to patterns used with LIKE. It kind of (to me) reads like everything there applies in both cases.

    Possibly misleading?

    For example, as above, the wildcard for 'any single character'seems to differ between LIKE (_) and regular expressions (.)?

    I know the specific wildcards for each case are specified elsewhere in the manual but perhaps it would be helpful to include a brief comment in the new documentation noting that there are differences. Would at least be helpful to SQL novices.

    ==

    Hmm. On reflection, after writing all this, I suspect one reason '\_.' works with regular expressions because '_' is not a special character in the case of regular expressions? Have not tested this yet though. Would '_.' worked just as well as '\_.' in my substring expression???

    If so, I still think the difference would still be worth noting. Perhaps more so, given the many '\_'examples in the new documentation.

    Dimitri


    7,479 post(s)
    #16-May-24 14:18

    To explain the specific case you raise...

    For example, [Geol_symb] = Pg2_vas_ap was correctly matched to the pattern '_v',

    No, in the fragment

    WHEN '\_v' THEN 'Igneous-volcanic'

    the string '\_v' is not a pattern. It's just a two character string value, consisting of an underscore character followed by a v, which you've listed as one of the possible substring results you're picking out from the substrings that may be produced by your use of the StringRegexpSubstring function. In your use of '\_v' the backslash character is redundant. It just means "take the next character literally" which would happen anyway since the underscore character has no special meaning in a string value that is not being used as a pattern by a LIKE operator. It would be like writing the string 'abc' as '\a\b\c' where the backslashes are also redundant.

    The only pattern in the CASE expression you cited is the argument to the StringRegexpSubstring function, the pattern string '\_.'

    That string is processed by SQL to feed the two character pattern _. (an underscore followed by a dot) as a regular expression pattern argument to the StringRegexpSubstring function. That regular expression matches any two character sequence of an underscore character followed by any single character (which is what the dot . character wildcard means in regular expressions). As you've correctly noted, the underscore character is not a wildcard in regular expressions. It's just an underscore character.

    When the StringRegexpSubstring function uses that pattern to match against the tech character text value Pg2_vas_apit picks out the first two character substring that matches the regexp pattern, in this case consisting of _v, that is, an underscore followed by a v.

    In the CASE statement the only WHEN that matches that substring is the one that calls out a substring _v, that is, an underscore followed by a v.

    Note that it may be risky to use regular expression pattern strings like '\_.' with your data set and CASE statement because that pattern matches the first occurrence of any two character sequence of an underscore followed by any other character, and so what is picked out by the WHEN cascade depends on the positioning of possibilities within the string being matched if the underscore character appears more than once followed by some other character.

    For example, suppose the string you're matching against is the text value Pg2_vas_p and the SQL is

    CASE StringRegexpSubstring([Geol_symb], '\_.', 'i') 

      WHEN '\_o' THEN 'Reclaimed'

      WHEN '\_s' THEN 'Sedimentary'

      WHEN '\_v' THEN 'Igneous-volcanic'

      WHEN '\_p' THEN 'Igneous-plutonic'

      WHEN '\_m' THEN 'Metamorphic'

    END

    The result of the CASE test will depend on whether that two character sequence _p comes before or after the sequence _v in the text value to be matched (and not when it occurs in the WHEN cascade). The StringRegexpSubstring function reports the first match to the regular expression, so in the Pg2_vas_ptext value the substring _v is reported and that WHEN is matched. But if the text value was Pg2_pas_vthe substring _p would be reported, because it occurs first.

    If the '_' had been interpreted as a wildcard for any single character, then the substring extracted would presumably have been the first two characters in [Geol_symb]; ie. 'Pg' in the example above?

    That doesn't happen because the underscore character is not a wildcard in either the regular expression pattern used with the StringRegexpSubstring function or in the explicit strings used to match WHENs.

    A few more comments...

    Possibly misleading?

    No, because it's talking about SQL dealing with backslash escapes in all strings. That's a different matter than wildcards which are different in regular expressions or the LIKE operator.

    because '_' is not a special character in the case of regular expressions?

    Yes, the underscore character is not a wildcard when used within a regular expression pattern. See the Regular Expressions topic for all wildcards used in regular expressions.

    Would '_.' worked just as well as '\_.' in my substring expression???

    Yes, because the string '_.' is the same as the string '\_.' when used as a pattern string for a regular expression. There's nothing special about an underscore character in regular expression pattern strings. (Note that referring to the pattern string as a "substring expression" is incorrect terminology in the context of a pattern argument to a regexp function.)

    ---

    A few words on the evolution of all this. It's understandable if anybody is confused by this stuff at a first meeting because the evolution of SQL, the LIKE operator, and the use of regular expressions are a blending of separate technologies. At the risk of simplifying a bit...

    SQL string literals, sequences of characters delimited by single ' or double " quote characters, had to have a way of being able to specify a quote character in them so you could test for words like don't and such, and that's where backslash escaping came from. So it made sense to have a simple rule that when a backslash is encountered in a string then take that backslash not as a literal character but as a command to take whatever follows literally, for example, taking literally a single ' quote character and not interpreting that single ' quote character as meaning the end of the string.

    Regular expressions are a UNIX thing coming out of tools like GREP. They weren't as far as I know a consideration in SQL syntax way back when, and tools like GREP had a very elaborate set of wildcards with the need to have character escaping to be able to specify patterns that also included what otherwise might be command characters.

    But if you have functions in SQL that use regular expressions, you get the complication that SQL still wants to preprocess any string for the special meaning of a backslash character used as an escape.

    And then there's the third complication that comes in from the LIKE operator, which seems like an attempt to get the general benefit of regular expressions but using a simpler wildcard system, using the underscore and percent characters, but which are different wildcard characters than the dot and asterisk wildcard characters used to do similar things in regular expression patterns.

    Add all three complications together and you can have an SQL statement that uses a regexp function with patterns meaning one thing, and a LIKE operator with patterns meaning other things, and all that using literal strings as patterns within which SQL is going to insist on preprocessing backslashes. Big fun.

    That's why Manifold introduced the @ innovation for strings, to make things a lot simpler. Alas, as useful as that is it's not at all portable.

    cassini106 post(s)
    #26-May-24 06:23

    Bonjour Lionel,

    je suis utilisateur de Manifold et j'ai besoin d'un service pour avancer sur un projet.

    Peut-être pourriez vous m'aider ?

    Bien entendu, ça peut-être un service payant.

    Merci

    Manifold User Community Use Agreement Copyright (C) 2007-2021 Manifold Software Limited. All rights reserved.