user agent parsing.


Max Lapshin-2
Hi.

I have a list of 250 000 different user agents: not only browsers, but also devices and so on.

I want to extract the operating system, kind of device, browser, and version from them, so I need a user agent parser.


Right now I see two of them:

https://github.com/ferd/useragent
https://github.com/chitika/uaparser

I have checked them, and they cannot parse a lot of my user agent strings.

What is the proper way to go here:

1) try to edit these libraries?
2) take some service with API?
3) just try to put ad-hoc hardcoded strings that will cover 90% of my records?


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

Re: user agent parsing.

Leonard Boyce-2
Hi Max,


On Fri, May 26, 2017 at 7:39 AM, Max Lapshin <[hidden email]> wrote:

> Hi.
>
> I have a list of 250 000 of different useragents: not only browsers, but
> also devices and so on.
>
> I want to get operation system, kind of device, browser and version from
> them, so I need user agent parser.
>
>
> Right now I see two of them:
>
> https://github.com/ferd/useragent
> https://github.com/chitika/uaparser
>
> I have checked them and they cannot parse lot of my useragent strings.
>
> What is the proper way to go here:
>
> 1) try to edit these libraries?
> 2) take some service with API?
> 3) just try to put ad-hoc hardcoded strings that will cover 90% of my
> records?
>

We spent quite a bit of time evaluating various options and eventually
just bit the bullet and brought Elixir into the mix for UA parsing.
We've been using https://github.com/elixytics/ua_inspector for a few
months now and find it quite solid. It's rare it does not parse things
out correctly.
It uses the databases from https://github.com/piwik/device-detector


Re: user agent parsing.

Alexander Shorin-2
In reply to this post by Max Lapshin-2
Hi Max,

I think you could try to make another one based on https://github.com/ua-parser
It provides a language-agnostic set of rules for parsing user agents, which
is updated quite frequently.
We use Python and Scala parsers based on those rules, and they work pretty
well for us. A nice benefit is that we get consistent parser results
across different projects.

--
,,,^..^,,,
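
The idea of a shared, language-agnostic rule set can be sketched roughly as below. This is only an illustration in Python, not the ua-parser implementation: the rules are kept as plain data (JSON here, purely so the example needs nothing beyond the standard library), and the three rules, the field names `family` and `regex`, and the function names `load_rules` and `parse_ua` are all hypothetical.

```python
import json
import re

# Hypothetical rule data that any language could load. Order matters:
# Chrome user agents also contain "Safari", so Chrome must come first.
RULES_JSON = r"""
[
  {"family": "Firefox", "regex": "Firefox/(?P<major>\\d+)\\.(?P<minor>\\d+)"},
  {"family": "Chrome",  "regex": "Chrome/(?P<major>\\d+)\\.(?P<minor>\\d+)"},
  {"family": "Safari",  "regex": "Version/(?P<major>\\d+)\\.(?P<minor>\\d+).*Safari/"}
]
"""


def load_rules(text):
    """Compile every rule once, keeping the order given in the data."""
    return [(rule["family"], re.compile(rule["regex"])) for rule in json.loads(text)]


def parse_ua(ua, rules):
    """Return the result of the first rule that matches (first match wins)."""
    for family, pattern in rules:
        match = pattern.search(ua)
        if match:
            return {
                "family": family,
                "major": match.group("major"),
                "minor": match.group("minor"),
            }
    # Nothing matched: report an explicit fallback instead of failing.
    return {"family": "Other", "major": None, "minor": None}


RULES = load_rules(RULES_JSON)

CHROME_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
)
print(parse_ua(CHROME_UA, RULES))
```

Because the rules are data rather than code, parsers written in different languages can load the same file, which is where the consistency across projects comes from.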



Re: user agent parsing.

Max Lapshin-2
So everyone parses these with a list of regexes?

That seems to be a problem for me, because I have entries such as "HLS Client/2.0 (compatible; LG NetCast.TV-2012)" or "Flash MAC 10.0.32.18".




Re: user agent parsing.

Alexander Shorin-2
I fear there is no way to do that without regular expressions, since the
User-Agent header has no well-formed structure and can contain
anything. And people really do put anything they like there. So yes,
all you need is to maintain a list of regexps and use them to extract
the bits you need.
--
,,,^..^,,,



Re: user agent parsing.

John Doe
In reply to this post by Max Lapshin-2
Max, it depends on what you need the user agent info for. If you don't need detailed info, such as the model of the phone, you can use a fork of the useragent app, such as https://github.com/brigadier/useragent , and extend it yourself for new user agents.
If you need detailed info, such as the model of a phone or its resolution, you'd need to integrate the database from https://browscap.org . But as you can see, the database is very large (the full one weighs about 130 MB), and naive parsing with regexps would take a lot of CPU.

Often the detailed version is not needed. If you sell ads, buyers might be interested in Android or iOS, or maybe even in the exact version of Android or iOS, but not in the exact model of some obscure dumbphone. For that dumbphone, the ads would cost close to zero regardless of whether you show the name of the device or 'unknown device' instead.


Re: user agent parsing.

Max Lapshin-2
I think we need to know the device, its vendor, the OS, and major versions.

The problem is that it seems I will have to fill the database myself: we have STBs, Flash players, smart TVs, and so on.




Re: user agent parsing.

Antonio SJ Musumeci
In reply to this post by John Doe
If the libraries don't already offer one, it should be simple enough to create a cache that maps the literal user agent input to the parsed response. While there are many permutations, I'd think the set of strings is pretty fixed in practice relative to the number of lookups needed.
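
A cache keyed on the literal User-Agent string might look roughly like this. It is a minimal sketch in Python under stated assumptions: `expensive_parse` is a hypothetical stand-in for whatever slow, regex-heavy parser is actually in use, and the `CALLS` counter exists only to show how often that parser really runs.

```python
from functools import lru_cache

# Counts how often the expensive parser actually runs (illustration only).
CALLS = {"count": 0}


def expensive_parse(ua):
    """Hypothetical stand-in for a slow, regex-heavy parser."""
    CALLS["count"] += 1
    os_name = "Android" if "Android" in ua else "Other"
    # Return an immutable value so callers cannot alter cached results.
    return (("os", os_name),)


@lru_cache(maxsize=100_000)
def cached_parse(ua):
    """Memoize results, keyed on the exact User-Agent string."""
    return expensive_parse(ua)


ua = "Mozilla/5.0 (Linux; Android 7.0; SM-G930F)"
for _ in range(1000):
    cached_parse(ua)

# The parser ran once; the other 999 lookups were served from the cache.
print(CALLS["count"], cached_parse.cache_info().hits)
```

The `maxsize` bound is an arbitrary choice here. It keeps memory use predictable if clients send unexpectedly many distinct strings, at the cost of evicting rarely seen ones.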


Re: user agent parsing.

John Doe
In reply to this post by Max Lapshin-2
Check https://browscap.org , the CSV database; it's the most complete one of the lot. I doubt you'll be able to keep a database up to date yourself. Even if you manage to gather one right now, it will become obsolete very fast.

Also, you can't just compare user agents "as is" with a database of user agents, because user agents often have additional parts added by plugins, minor browser versions, or even proxies. So you would need to split the user agent and look up known features of known user agents in the current one, just to be able to detect at least some features, if not all. Browscap has such trees of features.
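
The split-and-look-up idea can be sketched roughly as follows. This is not how browscap itself matches; it is a simplified illustration in Python, and the `KNOWN_FEATURES` table, the tokenizing regex, and the function names are all assumptions made for the example.

```python
import re

# Hypothetical feature table: lowercased token -> (attribute, value).
KNOWN_FEATURES = {
    "android": ("os", "Android"),
    "windows nt": ("os", "Windows"),
    "iphone": ("device", "iPhone"),
    "chrome": ("browser", "Chrome"),
    "firefox": ("browser", "Firefox"),
}

# Matches either a parenthesized comment or a "Name/Version" product token.
TOKEN_RE = re.compile(r"\(([^)]*)\)|([^\s/()]+)(?:/(\S+))?")


def tokenize(ua):
    """Split a User-Agent into lowercased name tokens, dropping versions."""
    tokens = []
    for comment, product, _version in TOKEN_RE.findall(ua):
        if comment:
            for part in comment.split(";"):
                # Strip a trailing version number, e.g. "Android 7.0".
                name = re.sub(r"[\s/]*[\d._]+$", "", part.strip())
                tokens.append(name.lower())
        elif product:
            tokens.append(product.lower())
    return tokens


def detect_features(ua):
    """Collect every known feature; unknown tokens are simply ignored."""
    found = {}
    for token in tokenize(ua):
        if token in KNOWN_FEATURES:
            attribute, value = KNOWN_FEATURES[token]
            found.setdefault(attribute, value)  # first hit per attribute wins
    return found


UA = (
    "Mozilla/5.0 (Linux; Android 7.0; SM-G930F) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Mobile Safari/537.36 SomePlugin/1.2"
)
print(detect_features(UA))
```

Extra parts such as `SomePlugin/1.2` or an unknown device model do not prevent a partial result, which is the point of matching on features rather than on the whole string.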


Re: user agent parsing.

Anthony Molinaro-4
This library should do what you want


We've been using it in production for some time (maybe 5-6 years).  Every so often you need to update the database (which is now Apache DeviceMap IIRC).

Hope that helps,

-Anthony


Re: user agent parsing.

Marc Worrell
Nice to hear that you are using our library!
If you happen to have database updates, then you are very welcome to make merge requests.

- Marc



Re: user agent parsing.

John Doe
Unfortunately, Apache DeviceMap is retired; in other words, it is no more. There will be no updates.


Re: user agent parsing.

Richard A. O'Keefe-2
In reply to this post by Max Lapshin-2

> On 27/05/2017, at 3:14 AM, Max Lapshin <[hidden email]> wrote:
>
> I think that we need to know device, its vendor, os, major versions.

I have no idea how much of a problem it is these days,
but user agents used to lie a lot.  Indeed, one of the browsers
I use still has an easy-to-use option to say what it should
pretend to be.



Re: user agent parsing.

Max Lapshin-2
We do not need 100% or even 90% accurate detection of user identity; we just need to show who was watching.

For now I have settled on taking the uaparser library and adding an extra list of specific streaming clients.
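
The combination of a general parser plus an extra list might be structured roughly like this. It is a sketch in Python, not the actual code: `general_parse` is a hypothetical stand-in for the library call, the two extra patterns are written only against the example strings quoted earlier in the thread, and reading "MAC" as a platform name is a guess.

```python
import re


def general_parse(ua):
    """Hypothetical stand-in for a general-purpose library parser."""
    match = re.search(r"(Firefox|Chrome)/(\d+)", ua)
    if match:
        return {"client": match.group(1), "version": match.group(2)}
    return None  # the library did not recognize the string


# Hypothetical extra list for streaming clients the library does not know.
EXTRA_RULES = [
    (
        "HLS Client",
        re.compile(r"^HLS Client/(?P<version>[\d.]+) \(compatible; (?P<device>[^)]+)\)"),
    ),
    (
        "Flash",
        re.compile(r"^Flash (?P<platform>\w+) (?P<version>[\d.]+)$"),
    ),
]


def identify(ua):
    """Try the general parser first, then fall back to the extra list."""
    result = general_parse(ua)
    if result is not None:
        return result
    for client, pattern in EXTRA_RULES:
        match = pattern.search(ua)
        if match:
            return {"client": client, **match.groupdict()}
    return {"client": "unknown"}


for ua in (
    "HLS Client/2.0 (compatible; LG NetCast.TV-2012)",
    "Flash MAC 10.0.32.18",
):
    print(identify(ua))
```

Keeping the extra list separate from the library means the library can be upgraded without losing the locally maintained streaming-client patterns.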
