How can I approach this problem?

Sep 17, 2019 at 8:31am
Ah, a Universal Translator.

I am not sure that can be done from text alone... but thinking that hard is something I try to avoid these days...
Sep 17, 2019 at 8:58am
closed account (SECMoG1T)
Ooh, you know, it's like I'm lost and trying to find a way to go somewhere with no map, but I'm required to get there.

There are easier ways to do stuff, but they don't lie around waiting to be picked up; they are earned through experience, hard work, and lots of head-scratching, you know, that I'm sure of. For folks who have yet to earn them, we walk through dark alleys and get bitten by something hideous and raccoon-like (bugs and misconceptions) until we get the point. The scratches are the badges we earn, lol. I still have miles to go.

Do you know of a simpler or better approach I could try? That would be massively helpful.

Grateful for any ideas.
Last edited on Sep 17, 2019 at 9:02am
Sep 17, 2019 at 1:43pm
Do you know of a simpler or better approach I could try? That would be massively helpful.

We gave you both: a simple elimination list that works maybe 50-75% of the time, maybe less, depending on how tricksy the input data is,
and ideas for a better approach, which, of course, is less simple. The better approaches include:
- parsing the sentence
- AI driven pattern recognition
- non AI driven pattern recognition (but more complex than just a parse)


No, there is not a simple way to get a near-100% correct answer here. You just have to decide how accurate you want it, and how much time, money, and effort you are willing to sink into the issue.

Where are the nouns? Sentences are made of phrases. Phrases tend to have a subject (noun) plus verb structure, with other words stuck in there (adjectives, etc.). So if you can locate the subject and verb targets, you have two candidates. Unfortunately, in a complex sentence that has phrases as these targets, you have to recursively break those down as well, looking for the same structure. I marked the nouns there; observe the italicized one (it's a noun phrase that does not contain a noun).

This is hard. First, English is just awful; it is full of weirdness and exceptions to the rules. Second, it is a nontrivial task even in a less random/confusing language.
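
To make the "elimination list" idea a bit more concrete, here is a rough C++ sketch of a crude suffix-based noun guesser. The suffix list and the sample words are made up for illustration; a real list would be much longer, and this will still be wrong a lot of the time, which is exactly the accuracy point above.

```cpp
#include <algorithm>
#include <cctype>
#include <iostream>
#include <string>
#include <vector>

// Very crude heuristic: flag a word as a "probable noun" if it carries a
// suffix that English often uses to build nouns. This is an elimination
// list, not a parser, so expect plenty of false positives and negatives.
bool looksLikeNoun(std::string word)
{
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return std::tolower(c); });

    static const std::vector<std::string> nounSuffixes = {
        "tion", "sion", "ment", "ness", "ity", "ship", "ance", "ence", "ism", "er", "or"
    };
    for (const std::string& s : nounSuffixes)
        if (word.size() > s.size() &&
            word.compare(word.size() - s.size(), s.size(), s) == 0)
            return true;
    return false;
}

int main()
{
    const std::vector<std::string> words =
        { "translation", "run", "happiness", "execute", "house", "darkness" };
    for (const auto& w : words)
        std::cout << w << (looksLikeNoun(w) ? " -> maybe a noun\n" : " -> unknown\n");
}
```

Note how "house" slips through and "dinner" would be flagged; that is the 50-75% ceiling in practice.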
Last edited on Sep 17, 2019 at 1:50pm
Sep 17, 2019 at 2:41pm
OP, maybe I'm missing something, but are you trying to write an algorithm that can decide if an arbitrary string of characters could be a noun in a given language? I.e. start from the fact that both "dog" and "house" are nouns but "run" and "execute" aren't, and conclude that "sognort" must be a noun.

Like Duthomhas said, that's going to be tough starting from text alone. A writing system doesn't evolve as an isolated system. Arguably, words evolve primarily orally and their spelling is more or less a matter of convenience and convention (e.g. there's nothing that makes "house" an intrinsically better spelling than "haus").
Also, languages don't evolve in isolation. Even if we've concluded that there's some algorithm that English has for inventing new nouns, this won't help you with loan words, especially in the case of distantly-related languages. E.g. "samurai", "yoghurt", "yoga", "kung fu".
Last edited on Sep 17, 2019 at 3:23pm
Sep 17, 2019 at 3:05pm
the fact that both "dog" and "house" are nouns but "run" isn't,
I went for a run on my treadmill...

@#$% English, lol...
all your base are belong to us

I am going to bail on this one unless you have a specific coding / design / algorithm question.
I will leave you with this:
Run just about anything nontrivial through Google Translate into any language you know. Let me know if it gets it right. It won't. So you have a giant company with a large investment in translation software and at best it's like a kid with 2 semesters in the language + the internet did the translation. And often it's not even that good. I can't stress it enough: this is a big, challenging problem.
Last edited on Sep 17, 2019 at 3:08pm
Sep 17, 2019 at 4:52pm
Why have we moved to discouraging OP? She is looking for patterns in language. Specifically, she is looking for patterns that transcend individual languages. What is wrong with that? These are the kinds of questions scientists ask about meta-cognition all the time. You can’t learn anything by not exploring.


@Yolanda
Have you considered paying people to identify nouns in text? Having three people work independently to identify all the nouns in a given text will get you a good result, and it will be faster and cheaper than spending a lot of time making a perfect algorithm...

That said, a lot of what you are looking to do may overlap. Finding patterns means being able to tag words with syntactic value, which, if I understand you correctly, is your focus question:

  Can I identify nouns in an unknown language?

Your question might be better worded as:

  Can I identify nouns in a known or unknown language, in a known language family, with high reliability (>80-90% success)?

This problem has no simple approach, alas.

Let me know if I missed anything.
Sep 17, 2019 at 5:03pm
Why have we moved to discouraging OP?

I wasn't trying to discourage her. Just trying to point out that her question was very nebulous and broad in scope.

Like trying to paint the universe with a toothbrush. It is theoretically possible, just not practical.

The idea she raised is intriguing, but in my opinion it is definitely not something for a beginner. This would be a very large team effort.

I'll shut up now.
Sep 17, 2019 at 5:31pm
Also not trying to discourage. I just like people to be aware of what they have asked for. If your professor asks you to factor a 2000-digit number, for example, it sounds easy to a first-year student.

The language thing is actually something we need. Our translator programs are terrible, and anything important still requires a human to convert. Someone has to do it, and I am all for people taking a crack at making a better one. But whoever does it needs to know that it's not going to be half a page of code or something. What is more discouraging... knowing what you are getting into up front, or finding out after spending weeks on it...
Sep 17, 2019 at 6:29pm
closed account (SECMoG1T)
Ooh, sorry, guys, I have been away for some time but now I'm back.
@Jonnin, I'm grateful for all you guys have provided; it's the basis I'll run with. I didn't in any way say that what I was provided with is less than what I need.
a simpler or better approach
By this I was referring to maybe another approach that doesn't involve language {because I know linguistics is damn difficult to analyze, especially if I have to start from a text-only model}.
People, I'm extremely thankful, and I always will be; you have gone beyond what I expected.

@Helios, this is a tough way to go, I am fully aware of it. Every time I start a challenge of this kind it's usually difficult at the beginning, but with time I gain confidence, and I have always made it out. I have been down this road more than once; I have done similar projects, guys. I have mathematical models {stochastic and deterministic} at my disposal that I use for deep analysis, so don't be worried.
At this step I'm just gathering data that I'll use to collect rules and invariants that allow simple contexts to exist. This is difficult, I know, but it's doable.

@Jonnin
Run just about anything nontrivial through Google Translate into any language you know. Let me know if it gets it right.
 It won't. So you have a giant company with a large investment in translation software and at best it's like a kid with 2 semesters in the language + the internet did the translation. 
And often it's not even that good. I can't stress it enough: this is a big, challenging problem.

I don't know what to say. My main aim is not making translation software, but that could come out of this research. I know this is damn difficult; I have been down this road before, guys, and I understand you.
I'm sorry for asking an extremely difficult question, but this is the only programming forum I call home; I come here all the time, I have always been helped, and I always do likewise whenever I can.

@Duthomhas, I have to take this road if I want to make progress.
Finding patterns means being able to tag words with syntactic value, which, if I understand you correctly, is your focus question:

Sure, this is all I'm going for, and not necessarily in languages; I'm just using language as a starting basis. The question is: "what rules define a context?" For systems without chaos, there are rules that govern them; when you break those rules you end up with chaos.
Consider an enzyme system {neurotransmitters in the CNS, e.g. acetylcholine}. There are rules that implicitly dictate how it must work within the system {it must be released, bind receptors, be recycled by acetylcholinesterase, repeat the cycle on activation}; if any of that fails you end up a dead man.

A simple process you're maybe not even aware of. These rules, how do they exist? If you analyze this system you can clearly derive the rules from the basic interactions exhibited by the system.
- rule 1: There must be activation; without that you die.
- rule 2: There must be a functioning system to produce acetylcholine and acetylcholinesterase in the correct form; without them, chaos.
- rule 3: There must be a proper, non-malfunctioning receptor and a proper binding site on acetylcholine; without them, chaos.
- rule 4: There must be a properly functioning acetylcholinesterase to recycle the activating molecule.
- rule 5: Don't break any of the previous rules and everyone goes home happy.

This is the kind of analysis I'm trying to carry out. I know this one is simple, but I'm working on a more complex system now.

@Furry, I do understand this is difficult, even for me at times, but I have been here before; this is the road I always travel. Please don't always fall back on
It is theoretically possible, just not practical.
when the game seems difficult. Did you see this problem, which required indexing all permutations of a lengthy string without (producing/memoizing/caching/storing) a single permutation whatsoever? It took time, but I finally solved it by looking at the traits of permutations.

It seemed impossible at the start; by the end, after lots of analysis, I had just six lines of code that solved it (one standard way to do this is sketched below).
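
For anyone curious, the usual tool for ranking permutations without ever generating them is the factorial number system (Lehmer code). Here is a rough C++ sketch along those lines; it may or may not match the six lines mentioned above.

```cpp
#include <cstdint>
#include <iostream>
#include <string>

// Lexicographic rank of a permutation of distinct characters, via the
// Lehmer code: for each position, count how many remaining characters are
// smaller, and weight that count by the factorial of the positions left.
// No permutation other than the input is ever produced or stored.
std::uint64_t permutationRank(const std::string& perm)
{
    std::uint64_t rank = 0, factorial = 1;
    for (std::size_t i = perm.size(); i-- > 0; )
    {
        std::size_t smallerToTheRight = 0;
        for (std::size_t j = i + 1; j < perm.size(); ++j)
            if (perm[j] < perm[i]) ++smallerToTheRight;
        rank += smallerToTheRight * factorial;
        factorial *= perm.size() - i;
    }
    return rank;   // 0-based rank; overflows past about 20 characters
}

int main()
{
    std::cout << permutationRank("abc") << ' '   // 0 (first permutation)
              << permutationRank("cba") << '\n'; // 5 (last of 3! = 6)
}
```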

Folks, I'm not broken at all, and I will go down this road. Let me give you an unrelated challenge, and a difficult one.

Write an algorithm that takes random data {any size}, processes it, destroys it completely, retains just a characteristic of the data of exactly 1 KB, and then reproduces the data completely without losing anything.
Last edited on Sep 17, 2019 at 6:38pm
Sep 17, 2019 at 6:56pm
Consider an enzyme system {neurotransmitters in the CNS, e.g. acetylcholine}. There are rules that implicitly dictate how it must work within the system {it must be released, bind receptors, be recycled by acetylcholinesterase, repeat the cycle on activation}; if any of that fails you end up a dead man.
But a language is not a chemical system. There are no physical laws that let you deduce which strings of characters are nouns in language X.

let me give you an unrelated challenge, and a difficult one.

Write an algorithm that takes random data {any size}, processes it, destroys it completely, retains just a characteristic of the data of exactly 1 KB, and then reproduces the data completely without losing anything.
Uh-huh. I have to ask: are you trolling? Are you asking questions that you know have no answer to rile people up and/or purposely waste their time?

An algorithm capable of taking an arbitrary bit string, outputting another bit string of bounded length, and reconstructing the original bit string from the reduced bit string would be capable of infinite compression. Infinite compression has been mathematically proven to be impossible.
https://en.wikipedia.org/wiki/Pigeonhole_principle
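
Spelled out with the numbers from the challenge (1 KB = 8192 bits for the output):

\[
\#\{\text{outputs of at most } 8192 \text{ bits}\}
   \;=\; \sum_{k=0}^{8192} 2^k \;=\; 2^{8193}-1
   \;<\; 2^{n}
   \;=\; \#\{\text{inputs of exactly } n \text{ bits}\}
   \qquad \text{for every } n \ge 8193,
\]

so two different inputs must map to the same 1 KB "characteristic", and at least one of them can never be reproduced exactly.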
Sep 17, 2019 at 6:58pm
What you need for that is a black box that can determine an equation (or many equations) that regenerates the file, such that
foo(1) = byte 1
foo(2) = byte 2
foo(3) = byte 3
and somehow store the equation only in a tiny amount of space. Something like a neural net can learn to reproduce data this way, to an extent. You get into trouble as the files get bigger though.

You can also try recursive compression. That works by compressing the data, then altering it (some form of encryption, though less random and more patterned, so you get lots of redundant stuff to compress next time), then compressing it again (and again, and again...) until the size is below the desired target, and then padding the file out to the target size.

Neither of these techniques has been successfully applied to give any practical benefit, to my knowledge.

You run into a general-purpose problem. You can compress one specific file a lot and regenerate it; there is a one- or two-line program out there in the obfuscated C contest stuff that generates a LOT of the Mary-had-a-lamb poem. Generating anything else with that code is about impossible, though: it does one thing only. Not exactly the same issue.

In theory a random byte generator that could produce every possible combination of bytes could do this, if you could find the input that makes the one file you want. This is just theory though, no way to DO that.
Last edited on Sep 17, 2019 at 7:05pm
Sep 17, 2019 at 7:10pm
You can compress one specific file a lot and regenerate it
That's kind of why Kolmogorov complexity is a useful measurement. There are a few seemingly highly complex data sequences that can in actuality be generated by very small programs. The usual example is a raster of the Mandelbrot set.
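
To make that concrete, here is a complete little C++ program whose output raster looks like complicated data, while the program itself is a tiny description of it (the resolution and iteration count are arbitrary choices):

```cpp
#include <complex>
#include <iostream>

// A couple dozen lines generate a raster that looks like highly complex
// data: the program itself is a very short "description" of its own output.
int main()
{
    const int width = 78, height = 30, maxIter = 50;
    for (int y = 0; y < height; ++y)
    {
        for (int x = 0; x < width; ++x)
        {
            std::complex<double> c(-2.0 + 3.0 * x / width,
                                   -1.2 + 2.4 * y / height);
            std::complex<double> z(0.0, 0.0);
            int i = 0;
            while (i < maxIter && std::abs(z) <= 2.0) { z = z * z + c; ++i; }
            std::cout << (i == maxIter ? '#' : '.');
        }
        std::cout << '\n';
    }
}
```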

In theory a random byte generator that could produce every possible combination of bytes could do this, if you could find the input that makes the one file you want. This is just theory though, no way to DO that.
In theory that would simply not work even if you had an infinite amount of time, because you would not be able to know for sure whether you stumbled upon the original message unless you already had the original message stored somewhere.
Last edited on Sep 17, 2019 at 7:13pm
Sep 17, 2019 at 7:22pm
closed account (SECMoG1T)
@Helios, then roll back to DNA and you will see more than a language. Even try it with yourself as a system: why do you wear shoes of a specific size, and what happens when you wear a smaller size? This is how we all look for embedded traits, phenotypic characteristics, interactions, and the rules that define them; look at the genes in your body and you will be surprised.... I'm not trolling, lol. You want to try this: think of "DARPA-hard" problems; they are difficult at first, but people solve them with very simple approaches. Look at some when you can.


@Jonnin

and somehow store the equation only in a tiny amount of space. Something like a neural net can learn
 to reproduce data this way, to an extent. You get into trouble as the files get bigger though.

I don't agree with that. There exist simple algorithms that do just that and retain just the traits in the form of data. Believe me, you can do it even on a small system without many resources. Maybe you just need to try the problem; it feels great when you have a solution in your hands that you couldn't have thought of previously... Just give it a try; it's a tough road, but you can do it (analytics might be your friend).
Last edited on Sep 17, 2019 at 7:25pm
Sep 17, 2019 at 7:23pm
Not like that.
You have file x; say it's the Scrabble text file of English words, whatever.
You find some seed, say 1234, that generates this file exactly via your generator.
You save 1234. That is the compressed file.
Then to uncompress it, you don't need anything but 1234 and the file size (when to stop). It regenerates the file.
So you DO have the original file (binary or text, it matters not) when you compress it.

Where it breaks down is like the alien with the measuring stick. You need an infinitely large key to generate infinitely many files with this idea, so your key becomes incredibly big, possibly as big as the original data... not sure what happens there. Doesn't matter: it's not tractable anyway.
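
As a toy illustration of both the idea and why it is intractable, here is a sketch (mine, not a serious tool) that brute-forces an std::mt19937 seed which regenerates a short target byte string; the "compressed file" would then be just the seed and the length. Even at a few bytes the expected search time blows up.

```cpp
#include <cstdint>
#include <iostream>
#include <optional>
#include <random>
#include <vector>

// Brute-force a PRNG seed that regenerates a target byte string.
// The "compressed file" is then just the seed plus the length.
// For anything longer than a few bytes the search is hopeless,
// which is exactly why this is not a practical compression scheme.
std::optional<std::uint32_t> findSeed(const std::vector<std::uint8_t>& target,
                                      std::uint32_t maxSeed)
{
    for (std::uint32_t seed = 0; seed < maxSeed; ++seed)
    {
        std::mt19937 gen(seed);
        bool match = true;
        for (std::uint8_t byte : target)
            if (static_cast<std::uint8_t>(gen() & 0xFF) != byte) { match = false; break; }
        if (match) return seed;
    }
    return std::nullopt;
}

int main()
{
    const std::vector<std::uint8_t> file = { 'h', 'i' };   // a 2-byte "file"
    if (auto seed = findSeed(file, 100'000'000))
        std::cout << "compressed to seed " << *seed << '\n';
    else
        std::cout << "no seed found in range\n";
}
```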
Last edited on Sep 17, 2019 at 7:25pm
Sep 17, 2019 at 7:30pm
closed account (SECMoG1T)
If you only knew what things you can do, then everyone would agree with me. Guys, try joining an organization that does just that, or even DARPA, or any other of the same kind.

I'm scared now; what have I brought upon myself?
Last edited on Sep 17, 2019 at 7:31pm
Sep 17, 2019 at 7:39pm
Then roll back to DNA and you will see more than a language. Even try it with yourself as a system: why do you wear shoes of a specific size, and what happens when you wear a smaller size? This is how we all look for embedded traits, phenotypic characteristics, interactions....
I'm not sure what you're getting at.
Sep 17, 2019 at 7:46pm
closed account (SECMoG1T)
I'm not sure what you're getting at.

Solutions to complex systems exist even within simple natural things, solutions you wouldn't believe until you get them. When you have a complex problem, venture beyond the boundaries of the problem definition and explore; it's easier than being stuck on the original problem, and you generate new ideas and clues that you couldn't have seen before.
Last edited on Sep 17, 2019 at 7:50pm
Sep 17, 2019 at 7:52pm
There exist simple algorithms that do just that and retain just the traits in the form of data

If this existed in a simple form, we would have compression tools that topped 90% reduction in common use for all file types, and we would not still be basing lossless compression on LZW-type approaches, or still be using the MP3 audio and H.264 video families of approaches. It would be a game-changer.

Sometimes we make things that take too long to be practical; JPEG 2000 (and generic wavelets, which are even better) lost out because it was ahead of its CPU era, for example. But we all knew about it, and there were libraries that used it; it just missed the web-page market and died on the vine.

If you know something... share, please.
Last edited on Sep 17, 2019 at 7:55pm
Sep 17, 2019 at 7:56pm
Oh, OK. Well, you never stated what the limits of your computing resources were.
Fine, if you have the resources to simulate:
* A few hundred thousand human heads (with ears), larynxes, and lungs.
* A few hundred thousand human brains, especially their language centers.
* Ten or so complex human societies.
* Some world to put around them (it doesn't have to resemble our own).
then yes, I guess this project becomes doable. You should have said so from the beginning, though; it would have saved a fair bit of time to know that you have at your disposal nearly all of the world's computing power.
Last edited on Sep 17, 2019 at 7:57pm
Sep 17, 2019 at 7:59pm
closed account (SECMoG1T)
If this existed in a simple form, we would have compression tools that topped 90% reduction in common use for all file types, and we would not still be basing lossless compression on LZW-type approaches, or still be using the MP3 audio and H.264 video families of approaches. It would be a game-changer.


Unless you believe me, I don't know how I could prove it to you. I'm really getting anxious now.
Am I offending you guys?
Last edited on Sep 17, 2019 at 8:00pm