Discussion:
Configurable safety options for high performance Django systems
Rick van Hattem
2014-11-18 12:58:00 UTC
Hi guys,

As it stands, Django has a tendency to kill either your browser (if
you're lucky) or the entire application server when confronted with a large
database. For example, the admin always does counts for pagination, and a
count over a table with many rows (say, on the order of 100M) can really
stall your app and database servers.

In the past I've simply patched Django to include a few "security" features
for cases like these, so someone can't accidentally kill the site by doing
something silly, but I was wondering if it would be a good idea to include
some of these configurable options in the core.

Examples:
- If COUNT_LIMIT is set, then wrap counts in a subquery and limit the
results: SELECT COUNT(*) FROM (SELECT pk FROM table LIMIT <COUNT_LIMIT>)
(see the sketch after this list)
- When someone does "if queryset", automatically limit it to a default
slice size so someone doesn't accidentally fetch all rows in the queryset.
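
A rough sketch of how the first option could look as a helper outside core
(COUNT_LIMIT and bounded_count are hypothetical names, and this leans on
Query.sql_with_params(), which isn't a documented public API):

from django.conf import settings
from django.db import connection

def bounded_count(queryset, default_limit=100000):
    """Count at most COUNT_LIMIT rows by wrapping the query in a LIMITed
    subquery, so the database never has to scan the whole table."""
    limit = getattr(settings, 'COUNT_LIMIT', default_limit)
    inner_sql, params = queryset.order_by().values('pk').query.sql_with_params()
    cursor = connection.cursor()
    cursor.execute(
        'SELECT COUNT(*) FROM (%s LIMIT %%s) AS capped' % inner_sql,
        params + (limit,),
    )
    return cursor.fetchone()[0]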

~wolph

Michael Manfre
2014-11-18 14:08:39 UTC
Django is a generic framework. Those who use it to implement functionality
are the ones who kill either the browser or their entire application.
Applications must be written to meet the functional and performance
requirements of the project. When dealing with large tables, more effort is
certainly needed when using the admin, e.g. writing custom ModelAdmins or
admin filters that restrict the result set size.
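
For illustration, such a restriction can be as simple as this sketch (the
model and its "created" field are made up):

import datetime
from django.contrib import admin
from django.utils import timezone

class RecentOnlyAdmin(admin.ModelAdmin):
    def get_queryset(self, request):
        # Only expose recent rows in the changelist, so the admin never
        # needs to paginate or count over the full table.
        qs = super(RecentOnlyAdmin, self).get_queryset(request)
        return qs.filter(created__gte=timezone.now() - datetime.timedelta(days=30))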

I don't think Django should automatically limit these queries.

Regards,
Michael Manfre

Rick van Hattem
2014-11-18 14:17:14 UTC
The framework is indeed generic, which is why it should also take scenarios
into account where there is a lot more data. There is obviously no
one-size-fits-all solution, which is why I propose configurable options. The
patches needed within the Django core are small and have very limited
impact, but the debugging time they can save when working with larger
querysets is enormous.

Believe me when I tell you that debugging why your entire Django instance
is hanging at random times, due to some erroneous query with an unfortunate
filter, is painful at best. Especially since Django won't generate any
useful debugging output in those cases; it will simply kill your machine
and die.

Aymeric Augustin
2014-11-18 14:31:45 UTC
I've had to implement workarounds for such problems in many Django projects
I created. For example, I wrote code to determine which models have too many
instances for dumping them all into an HTML dropdown, and to automatically
add those models to raw_id_fields in related models.
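
For reference, the raw_id_fields side of that looks like this in a
ModelAdmin (model and field names invented for the example):

from django.contrib import admin

class TicketAdmin(admin.ModelAdmin):
    # Render the FK as a plain pk input with a lookup popup, instead of
    # a <select> with one <option> per related row.
    raw_id_fields = ('customer',)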

The main difficulty I foresee is providing a generic and useful set of
restrictions. Even within specific projects, it was hard to come up with
one-size-fits-all-models solutions. If we want to implement something in
Django itself, the problem is much worse.

Having workarounds for 70% of the places where Django handles large tables
ungracefully isn't much better than 0%. It creates a false sense of
security.

--
Aymeric.

Claude Paroz
2014-11-18 18:18:18 UTC
https://docs.djangoproject.com/en/dev/ref/contrib/admin/#django.contrib.admin.ModelAdmin.show_full_result_count
seems to go in your direction, at least. count(*) performance is
implementation-specific, so once again, there is no one-size-fits-all.
Maybe the documentation could be fleshed out here and there.
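
Using it is a one-liner in the ModelAdmin, for example:

from django.contrib import admin

class BigTableAdmin(admin.ModelAdmin):
    # Skip the extra unfiltered COUNT(*) behind "99 results (103 total)";
    # a "Show all" link is displayed instead of the exact total.
    show_full_result_count = False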

Claude

Rick van Hattem
2014-11-18 18:27:33 UTC
That certainly solves one part of the problem. After that I would still opt
for an optional, configurable default for slicing. Personally I prefer to
raise an error when unsliced querysets are used, since it's almost always
harmful, or at least dangerous, behaviour.

Schmitt, Christian
2014-11-18 18:37:31 UTC
Sorry, but I don't think this is suitable.
If somebody has 100M rows per table, they should probably think about
sharding/replication anyway, so the ORM would still suffer regardless.
My company currently has a few tables with high row counts as well, but
since we never used the django-admin and managed to shard some of the data,
we never ran into such problems.

Django is a head start for many projects/products, but once you have a
certain amount of data you still need to program a few things yourself.

If your patches are enough, then live with them, but I don't see a reason
for optimizing Django for big tables.



Rick van Hattem
2014-11-18 18:59:12 UTC
The 100M rows is an example of a really big problem, but the problem also
exists at a much smaller scale.

Even doing "if some_queryset" with 10,000 items can get pretty slow, just
because it doesn't kill the server in all cases doesn't make it a good
thing to ignore.
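
To spell the pitfall out (SomeModel is a made-up example): truth-testing
evaluates the entire queryset, while a one-row query would do.

qs = SomeModel.objects.filter(active=True)

# This forces full evaluation: every matching row is fetched and cached.
if qs:
    first = qs[0]

# This issues roughly `SELECT ... LIMIT 1` instead and fetches nothing else.
if qs.exists():
    first = qs.first()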

In all but the most basic Django projects I've seen problems like these.
Sane defaults won't hurt anyone and solve issues for people with larger
systems. And running forks of Django seems counterproductive as well.

The "raw_id_field" is a nice solution for making the admin workable again
but I've seen many real-life projects have huge performance problems
because at some point the amount of linked items grew to the point of being
problematic. Having a dropdown with 100 items is a lot but workable, with
1000 items it's a problem. Which is not a large amount of items by any
measure.

Josh Smeaton
2014-11-19 00:42:59 UTC
To me, "sane default" means django should not silently alter the query to
provide a LIMIT when it is not asked for.

I have also run into situations where doing a .count() or iterating a full
table has broken the application or put too much pressure on the database,
specifically with Django bindings to JavaScript datatables. But I still
wouldn't want artificial limiting on such queries.

What *may* be useful is the ability to apply queryset methods to an
already sliced queryset. That would allow users to implement
queryset/manager methods that provide pre-sliced querysets to the rest of
the code. The problem would be: what should happen in this case?

Model.objects.all()[0:10].filter(field=True)

Should the filter be logically/internally moved to before the limit? Or
should the filter be applied to the result of the limit in an outer query?
Traditionally, Django applies mutations in succession, but this wouldn't be
very useful for the majority of operations that would occur "after" the
slicing. We *could* say all slicing is "saved" and applied at the end, but
we'd definitely run into issues with users reporting that filtering isn't
working as they expect - after the slice.
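
For context, today Django refuses the operation outright rather than
picking either interpretation (from memory, the error looks like this):

qs = Model.objects.all()[0:10]
qs.filter(field=True)
# AssertionError: Cannot filter a query once a slice has been taken.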

Michael Manfre
2014-11-19 02:11:49 UTC
I think we'd need a non-slicing syntax to allow chaining of limit/offset,
e.g. Model.objects.all().set_slice(limit=10, offset=0).filter(field=True).
Changing the behaviour of existing slicing would be a major
backwards-incompatibility headache.
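
To make that concrete, usage would look something like this (set_slice is
hypothetical and does not exist in Django today):

qs = Model.objects.all().set_slice(limit=10, offset=0)
qs = qs.filter(field=True)   # still chainable, unlike qs[0:10].filter(...)
print(qs.query)              # LIMIT/OFFSET only applied to the final SQL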


Rick van Hattem
2014-11-19 12:50:20 UTC
Definitely agree on this; silently altering a query's limit is probably not
the way to go. Raising an exception in case of no limit and lots of results
could be useful.

For the sake of keeping the discussion useful:
- Let's say you have a table with 50,000 items, not an insanely large
amount imho.
- Now someone does a non-restricted loop through "Model.objects.all()",
which would return 50,000 items.
- In what case would this be desirable, as opposed to limiting the query to
10,000 items and raising an error when the database actually returns 10,000
items?

Naturally this case would only apply if no slice is given, but if you're
really processing over 10,000 items in a single loop you probably know how
to slice the queryset when needed.
Perhaps something like this: raise QuerysetTooLarge('Your query returned
over {LIMIT} results, if this is intentional please slice the queryset')

Not saying this should be a default straight away, but having a default of
10,000 or even 100,000 should not hurt anyone and protects against killing
a server, which is always a positive result in my book.
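
A minimal sketch of that guard as a custom manager, using only public APIs
(QuerysetTooLarge, MAX_UNSLICED_ROWS and SafeManager are all hypothetical
names):

from django.db import models

MAX_UNSLICED_ROWS = 10000

class QuerysetTooLarge(Exception):
    pass

class SafeManager(models.Manager):
    def safe_iter(self, limit=MAX_UNSLICED_ROWS):
        # Fetch one row more than the cap; seeing it means the full
        # queryset would have exceeded the limit.
        for i, obj in enumerate(self.get_queryset()[:limit + 1]):
            if i >= limit:
                raise QuerysetTooLarge(
                    'Your query returned over %d results; if this is '
                    'intentional please slice the queryset.' % limit)
            yield obj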

Marco Paolini
2014-11-19 12:56:47 UTC
2014-11-19 13:50 GMT+01:00 Rick van Hattem <***@gmail.com>:

> Definitely agree on this; silently altering a query's limit is probably
> not the way to go. Raising an exception in case of no limit and lots of
> results could be useful.
>
> For the sake of keeping the discussion useful:
> - Let's say you have a table with 50,000 items, not an insanely large
> amount imho.
> - Now someone does a non-restricted loop through "Model.objects.all()",
> which would return 50,000 items.
Even a restricted loop causes trouble (at least on postgres).

If you do:

for i, m in enumerate(Model.objects.all()):
    if i > 4:
        break

then psycopg will fetch (and cache locally) the entire cursor result set. Of
course the Django ORM will only fetch the first batch (100 items, I think)
from the psycopg cursor, but that does not solve the underlying issue.

There are options for server-side cursors in psycopg, but Django does not
use them, AFAIK.

see this thread
https://groups.google.com/forum/#!searchin/django-developers/psycopg$20results/django-developers/8ZBA4uMinXo/M1osYTGmvbUJ
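
For completeness, the server-side variant in raw psycopg2 looks like this
(a sketch; the DSN and table name are made up):

import psycopg2

conn = psycopg2.connect('dbname=mydb')
cursor = conn.cursor(name='big_scan')  # a *named* cursor is server-side
cursor.itersize = 2000                 # rows fetched per network round trip
cursor.execute('SELECT id FROM myapp_model')
for row in cursor:
    pass  # rows stream in batches; client memory stays bounded
conn.close()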
Marco Paolini
2014-11-19 12:59:08 UTC
Also, the offset + limit pagination strategy of the Django paginator is
sub-optimal, as it has O(N) complexity: doing SELECT * FROM auth_user LIMIT
100 OFFSET 1000000 causes a million-row table scan.

Schmitt, Christian
2014-11-19 13:49:25 UTC
A sequential scan will only be made if you query non-indexed values.
So if you add a simple ORDER BY you will get an index scan, which is very
fast.
The problem lies more with the database than with the ORM.
As already said, if you need to deal with that many queries you need to log
your SQL statements and optimize them. Nothing to do with Django at all.
We ran into several performance issues with our applications as well, even
at something like 10,000 or 100,000 entries per table, and it was quite
obvious that the problem lay with the database/queries.

Also, we learned that using the cache gives you a big performance win:
"The best query you can have is the query you never need to make."
I don't think it's good to make Django "safer" when dealing with high
performance systems, since it will make the system more complex, and Django
is already really complex inside the core ORM.



Marco Paolini
2014-11-19 14:11:12 UTC
2014-11-19 14:49 GMT+01:00 Schmitt, Christian <***@briefdomain.de>:

> A sequential scan will only be made if you query non-indexed values.
> So if you add a simple ORDER BY you will get an index scan, which is very
> fast.

Nope: a large OFFSET of N will read through N rows, regardless of index
coverage. See http://www.postgresql.org/docs/9.1/static/queries-limit.html


> The problem lies more with the database than with the ORM.

The problem lies with the Django paginator, which builds OFFSET queries
that are inefficient by design. There are alternatives, such as ordering by
primary key and doing SELECT * FROM auth_user WHERE id > latest_id_returned;
that "latest_id_returned" has to be passed back and forth to the client to
keep the pagination happening. Too bad you can't do "jump to page 15" like
the Django admin does.
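
A sketch of that keyset approach in ORM terms (PAGE_SIZE and next_page are
made-up names; User is django.contrib.auth):

from django.contrib.auth.models import User

PAGE_SIZE = 100

def next_page(last_seen_id=0):
    # WHERE id > last_seen_id ORDER BY id LIMIT 100: the cost is
    # proportional to the page size, not to how deep we are in the table.
    return list(User.objects.filter(id__gt=last_seen_id)
                            .order_by('id')[:PAGE_SIZE])

page = next_page()
next_cursor = page[-1].id if page else None  # hand this back to the client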



> As already said, if you need to deal with that many queries you need to
> log your SQL statements and optimize them. Nothing to do with Django at
> all.
> We ran into several performance issues with our applications as well,
> even at something like 10,000 or 100,000 entries per table, and it was
> quite obvious that the problem lay with the database/queries.

Read the relevant thread I posted before. There are huge memory allocation
issues even if you just iterate the first two items of a query that
potentially returns a million rows: all the rows will be fetched and stored
in memory by psycopg and the Django ORM.


> Also, we learned that using the cache gives you a big performance win:
> "The best query you can have is the query you never need to make."

...except when the cache is bigger than your entire VM.

In that case, if your dataset is, say, 10 TB, you will need roughly 10 TB
of memory and 10 TB of data transferred between your Django app and your
postgres server. It is not a matter of optimization; it is mere survival.



> I don't think it's good to make Django "safer" when dealing with high
> performance systems, since it will make the system more complex, and
> Django is already really complex inside the core ORM.

Let's first understand what's needed; then we can decide whether it has a
story inside the Django core.



>
>
> 2014-11-19 13:59 GMT+01:00 Marco Paolini <***@gmail.com>:
>
>> also, the offset + limit pagination strategy of django paginator is
>> sub-optimal as it has N complexity: doing SELECT * FROM auth_user LIMIT 100
>> offset 1000000 causes a 100000-long table scan
>>
>> 2014-11-19 13:56 GMT+01:00 Marco Paolini <***@gmail.com>:
>>
>>>
>>>
>>> 2014-11-19 13:50 GMT+01:00 Rick van Hattem <***@gmail.com>:
>>>
>>>> Definitely agree on this, silently altering a query's limit is probably
>>>> not the way to go. Raising an exception in case of no limit and lots of
>>>> results could be useful.
>>>>
>>>> For the sake of keeping the discussion useful:
>>>> - Let's say you have a table with 50,000 items, not an insanely large
>>>> amount imho.
>>>> - Now someone does a non-restricted loop through "Model.objects.all()"
>>>> which would return 50,000 items.
>>>>
>>> even a restricted loop causes troubles (at least in postgres).
>>>
>>> if you do:
>>>
>>> for i, m in enumerate(Model.objects.all()):
>>> if i > 4:
>>> break
>>>
>>> then psycopg will fetch (and cache locally) the entire cursor resultset.
>>> Of course django orm will only fetch the first batch (100 items I think)
>>> from the psycppg cursor, bu that does not solve the underlying issue.
>>>
>>> There are options to have server-side cursors in psycopg but django does
>>> not use them AFAIK
>>>
>>> see this thread
>>> https://groups.google.com/forum/#!searchin/django-developers/psycopg$20results/django-developers/8ZBA4uMinXo/M1osYTGmvbUJ
>>>
>>>
>>>
>>>
>>> - In what case would this be desirable as opposed to limiting to 10,000
>>>> items and raising an error when the database actually returns 10,000 items.
>>>>
>>>> Naturally this case would only apply if no slice is given, but if
>>>> you're really processing over 10,000 items in a single loop you probably
>>>> know how to slice the queryset when needed.
>>>> Perhaps something like this: raise QuerysetTooLarge('Your query
>>>> returned over {LIMIT} results, if this is intentional please slice the
>>>> queryset')
>>>>
>>>> Not saying this should be a default straight away but having a default
>>>> of 10,000 or even 100,000 should not hurt anyone and protects against
>>>> killing a server which is always a positive result in my book.
>>>>
>>>> On Wednesday, November 19, 2014 1:42:59 AM UTC+1, Josh Smeaton wrote:
>>>>>
>>>>> To me, "sane default" means django should not silently alter the query
>>>>> to provide a LIMIT when it is not asked for.
>>>>>
>>>>> I have also run into situations where doing a .count() or iterating a
>>>>> full table has broken the application, or put too much pressure on the
>>>>> database. Specifically with django bindings to javascript datatables. But I
>>>>> still wouldn't want artificial limiting on such queries.
>>>>>
>>>>> What *may* be useful, is to be able to apply queryset methods onto an
>>>>> already sliced queryset. That would allow users to implement
>>>>> queryset/manager methods that provide pre-sliced querysets to the rest of
>>>>> the code. The problem would be, what should happen in this case?
>>>>>
>>>>> Model.objects.all()[0:10].filter(field=True)
>>>>>
>>>>> Should the filter be logically/internally moved to before the limit?
>>>>> Or should the filter be applied to the result of the limit in an outer
>>>>> query? Traditionally, django applies mutations in succession, but this
>>>>> wouldn't be very useful for the majority of operations that would occur
>>>>> "after" the slicing. We *could* say all slicing is "saved" and applied at
>>>>> the end, but we'd definitely run into issues with users reporting that
>>>>> filtering isn't working as they expect - after the slice.
>>>>>
Schmitt, Christian
2014-11-20 07:30:35 UTC
Permalink
>
> Nope. A large OFFSET of N will read through N rows, regardless of index
> coverage. See http://www.postgresql.org/docs/9.1/static/queries-limit.html
>

That's simply not true.
If you define an ORDER BY on a well-indexed column, the database will only
do a bitmap scan.
That wiki page isn't well explained. Take this:

https://wiki.postgresql.org/images/3/35/Pagination_Done_the_PostgreSQL_Way.pdf

> The problem lies in the django paginator, which builds OFFSET queries that
> are inefficient by design. There are alternatives such as ordering by
> primary key and doing SELECT * FROM auth_user WHERE id > latest_id_returned.
> That "latest_id_returned" has to be passed back and forth to the client
> to keep the pagination happening. Too bad you can't do "jump to page 15"
> like the django admin does.
>

Paginator needs to use OFFSET, which is slower than your seek method, but
seek has certain limitations, as you already said. An index and ORDER BY to
the rescue: they make OFFSET as fast as possible. With an index, ORDER BY,
and maybe some read slaves, or even sharding your data, the result is as
good as the seek method.


> read the relevant thread I posted before. There are huge memory allocation
> issues even if you just iterate the first two items of a query that
> potentially returns a million rows: all the rows will be fetched and stored
> in memory by psycopg and the django ORM.
>

That's only happening because django will build something like SELECT "bla",
"bla", "bla" FROM table LIMIT 10 OFFSET 100; there is no ORDER BY clause,
which is of course really slow, even with only 1,000 entries inside a table.


> ...except when the cache is bigger than your entire VM.
>
> In this case if your dataset is, say, 10 TB, you will need roughly 10 TB
> of memory and 10 TB of data transferred between your django app and your
> postgres server. It is not a matter of optimization: it is mere survival.
>

Okay, just as I said: if you deal with that amount of data you will
certainly use read slaves and/or partition your data.


> Let's first understand what's needed, then we can decide whether it has a
> place inside the django core.
>


The only thing that could be improved is that we should force an ORDER BY id
clause here and there. The index on the automatic id is already there.

I don't know why we should enforce something just because somebody has a
huge amount of data and can't deal with it. That wouldn't follow clean
design. Most people rely on the "Page to" part of the Paginator; if we
patch django to use a seek method as a default it wouldn't be that good
for a lot of developers.
Most people that have a performance problem could easily "fix" it
themselves or throw more hardware at it.

Marco Paolini
2014-11-20 11:05:00 UTC
Permalink
2014-11-20 8:30 GMT+01:00 Schmitt, Christian <***@briefdomain.de>:

>> Nope. A large OFFSET of N will read through N rows, regardless of index
>> coverage. See
>> http://www.postgresql.org/docs/9.1/static/queries-limit.html
>>
>
> That's simply not true.
> If you define an ORDER BY on a well-indexed column, the database will only
> do a bitmap scan.
> That wiki page isn't well explained. Take this:
>
> https://wiki.postgresql.org/images/3/35/Pagination_Done_the_PostgreSQL_Way.pdf
>

You are right: it is not a table scan, but the point is OFFSET has *linear
complexity* even with the ORDER BY and a proper index.

The "seek" (with a PK > seek_point, as I was suggesting) is much better.


>> The problem lies in the django paginator, which builds OFFSET queries
>> that are inefficient by design. There are alternatives such as ordering by
>> primary key and doing SELECT * FROM auth_user WHERE id > latest_id_returned.
>> That "latest_id_returned" has to be passed back and forth to the client
>> to keep the pagination happening. Too bad you can't do "jump to page 15"
>> like the django admin does.
>>
>
> Paginator needs to use OFFSET, which is slower than your seek method, but
> seek has certain limitations, as you already said. An index and ORDER BY
> to the rescue: they make OFFSET as fast as possible. With an index, ORDER
> BY, and maybe some read slaves, or even sharding your data, the result is
> as good as the seek method.
>

Putting sharding into the mix further complicates this discussion, but even
in a sharded environment, having linear complexity in a single shard is to
be avoided.


>
>
>> read the relevant thread I posted before. There are huge memory
>> allocation issues even if you just iterate the first two items of a query
>> that potentially returns a million rows: all the rows will be fetched and
>> stored in memory by psycopg and the django ORM.
>>
>
> That's only happening because django will build something like SELECT
> "bla", "bla", "bla" FROM table LIMIT 10 OFFSET 100; there is no ORDER BY
> clause, which is of course really slow, even with only 1,000 entries
> inside a table.
>

No, it has nothing to do with the ORDER BY. It's just how psycopg works:
the first time you fetch from a cursor, say you fetch the first row, it
automatically fetches all the results from the server. The only solution is
using a server-side cursor, which apparently Oracle uses by default.
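
For reference, a minimal psycopg2 sketch of the two cursor types (the DSN
and table name are placeholders):

import psycopg2

conn = psycopg2.connect('dbname=example')  # placeholder DSN

# Client-side cursor: the full result set is transferred and buffered
# locally on the first fetch, no matter how little you read.
cur = conn.cursor()
cur.execute('SELECT * FROM big_table')
row = cur.fetchone()  # all rows are already in local memory here

# Server-side (named) cursor: rows are streamed in itersize-sized
# batches, so client memory stays bounded.
scur = conn.cursor(name='stream')
scur.itersize = 2000
scur.execute('SELECT * FROM big_table')
for row in scur:
    break  # only the first batch has been pulled from the server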



>
>
>> ...except when the cache is bigger than your entire VM.
>>
>> In this case if your dataset is, say, 10 TB, you will need roughly 10 TB
>> of memory and 10 TB of data transferred between your django app and your
>> postgres server. It is not a matter of optimization: it is mere survival.
>>
>
> Okay, just as I said: if you deal with that amount of data you certainly
> will use read slaves and/or partition your data.
>

Not always true: a single machine can happily handle a few terabytes.

The point here is client-side cursors will eat up a lot of your memory in
the *application tier* if you happen to do a simple:

for user in User.objects.all().iterator():
    break

even if you only have a few thousand users, you will see a big memory and
network spike.


>
>> Let's first understand what's needed, then we can decide whether it has
>> a place inside the django core.
>>
>
> The only thing that could be improved is that we should force an ORDER BY
> id clause here and there. The index on the automatic id is already there.
>

That only solves the issue in the best-case scenario: when you fetch the
first pages. The last-page scenario, even with ORDER BY and an index, still
has O(N) complexity. That's written clearly in the article you posted.



>
> I don't know why we should enforce something just because somebody has a
> huge amount of data and can't deal with it. That wouldn't follow clean
> design. Most people rely on the "Page to" part of the Paginator; if
> we patch django to use a seek method as a default it wouldn't be that good
> for a lot of developers.
> Most people that have a performance problem could easily "fix" it
> themselves or throw more hardware at it.
>

I am all against enforcing something: I only think it is nice to provide
the option of not killing your application layer. These things are normally
used by all django developers that deal with medium-size applications.

Throwing hardware at it is sub-optimal when you can just use 10 lines of
code, as I normally do.



Rick van Hattem
2014-11-20 11:46:01 UTC
Permalink
On Thursday, November 20, 2014 8:31:06 AM UTC+1, Christian Schmitt wrote:
>
>> Nope. A large OFFSET of N will read through N rows, regardless of index
>> coverage. See
>> http://www.postgresql.org/docs/9.1/static/queries-limit.html
>>
>
> That's simply not true.
> If you define an ORDER BY on a well-indexed column, the database will only
> do a bitmap scan.
> That wiki page isn't well explained. Take this:
>
> https://wiki.postgresql.org/images/3/35/Pagination_Done_the_PostgreSQL_Way.pdf
>

Sounds awesome in theory, here's a real world example:

http://explain.depesz.com/s/WpQU
# explain analyze select * from entity_entity order by id limit 10 offset 1000000;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=44832.84..44833.28 rows=10 width=423) (actual time=8280.248..8280.382 rows=10 loops=1)
   ->  Index Scan using entity_entity_pkey_two on entity_entity  (cost=0.44..823410.92 rows=18366416 width=423) (actual time=0.232..6861.507 rows=1000010 loops=1)
 Total runtime: 8280.442 ms
(3 rows)

http://explain.depesz.com/s/qexc
# explain analyze select * from entity_entity where id > 1000000 order by id limit 10;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.44..0.91 rows=10 width=423) (actual time=0.044..0.102 rows=10 loops=1)
   ->  Index Scan using entity_entity_pkey_two on entity_entity  (cost=0.44..823830.94 rows=17405140 width=423) (actual time=0.040..0.071 rows=10 loops=1)
         Index Cond: (id > 1000000)
 Total runtime: 0.152 ms
(4 rows)


And that's not even all the way in the table. That's just the beginning.


>> read the relevant thread I posted before. There are huge memory
>> allocation issues even if you just iterate the first two items of a query
>> that potentially returns a million rows: all the rows will be fetched and
>> stored in memory by psycopg and the django ORM.
>
> That's only happening because django will build something like SELECT
> "bla", "bla", "bla" FROM table LIMIT 10 OFFSET 100; there is no ORDER BY
> clause, which is of course really slow, even with only 1,000 entries
> inside a table.
>

Once again, great in theory. Doesn't always mean much in real world
scenarios. Compare the following numbers to the ones above:


http://explain.depesz.com/s/50Wo
# explain analyze select * from entity_entity where id > 1000000 limit 10;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.00..0.67 rows=10 width=423) (actual time=0.012..0.070 rows=10 loops=1)
   ->  Seq Scan on entity_entity  (cost=0.00..1168998.20 rows=17405140 width=423) (actual time=0.008..0.038 rows=10 loops=1)
         Filter: (id > 1000000)
 Total runtime: 0.122 ms
(4 rows)

http://explain.depesz.com/s/jDJS
# explain analyze select * from entity_entity limit 10 offset 1000000;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=61148.68..61149.30 rows=10 width=423) (actual time=3820.305..3820.347 rows=10 loops=1)
   ->  Seq Scan on entity_entity  (cost=0.00..1123082.16 rows=18366416 width=423) (actual time=0.007..2410.809 rows=1000010 loops=1)
 Total runtime: 3820.403 ms
(3 rows)


> I don't know why we should enforce something just because somebody has a
> huge amount of data and can't deal with it. That wouldn't follow clean
> design. Most people rely on the "Page to" part of the Paginator; if
> we patch django to use a seek method as a default it wouldn't be that good
> for a lot of developers.
> Most people that have a performance problem could easily "fix" it
> themselves or throw more hardware at it.
>

The point of this discussion is to make Django easier and safer to use in
all scenarios. Even in scenarios with relatively small amounts of data (sub
100k rows) Django can already kill a small VPS easily. Having some
(optional) safeties in place doesn't hurt and can help a lot of people.

It's not that large amounts of data are not something you can deal with,
but I've seen many real-life projects and Django apps where people didn't
take into consideration that some people have more than 100 rows. With 1M+
rows you can kill a lot, but many things slow down or die with even 10k
rows.

Mattias Linnap
2014-11-23 10:49:14 UTC
Permalink
I thought I'd chime in since I run a Django app with fairly large tables
(1M, 30M, 120M rows) on a fairly poor VPS (4GB ram, 2 cores, database and
all services on one machine).

I just don't see a situation where "safety limits on ORM queries" would
improve things. Changing behaviour based on the size of the queryset seems
like a really bad idea. Now instead of gradual slowdown as the dataset
grows, you start getting surprise exceptions, invalid data in results, or
missing functionality once you hit a magic threshold.

There is no good way to set that threshold automatically, since people have
very different expectations of what's "slow". For example, loading the
second-to-last list page in the Django admin for my 1M row table takes a
perfectly reasonable 2-3 seconds. That's too slow for the front page of a
popular site, but even 30 seconds would be fine for an admin page that's
visited maybe once a month.

I'm not saying that performance improvements or optimisation suggestions
are a bad idea, but if they are possible, they should stay consistent. If
using "if queryset:" is a bug, then it should show a warning and suggestion
to use "if queryset.exists():" for both 100 and 100M rows. If there's an
option to turn off full pagination in admin, then it should be a ModelAdmin
option, not dependent on data size - the developer can enable it for big
tables manually.
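
Something along those lines is already possible per ModelAdmin today -- a
sketch (class names are mine; the constant makes the admin's page count
approximate, and some versions may still issue their own count for the
"show all" link):

from django.contrib import admin
from django.core.paginator import Paginator

class NoCountPaginator(Paginator):
    # Skip the expensive COUNT(*) on huge tables by pretending the
    # table is enormous; page links become approximate.
    @property
    def count(self):
        return 99999999999

class BigTableAdmin(admin.ModelAdmin):
    paginator = NoCountPaginator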

Behaviour changes based on the queryset size make it more difficult, not
easier for developers - right now you can delay optimisation work until it
really becomes too painful for users, whereas having strict limits would
require doing that work as soon as a limit is reached.

Finally, I agree that it would be nice if Postgresql server-side cursors
were better integrated with the ORM. It's possible to use them for raw
queries already:

from django.db import connection, transaction

connection.ensure_connection()
with transaction.atomic():
    cursor = connection.connection.cursor(name='hello')  # named = server-side
    cursor.execute('BIG QUERYSET SQL')
    for row in cursor:  # fetches rows in cursor.itersize chunks
        pass

I use this in a few scripts that iterate over 10GB of table data. But a way
to map the rows to Django objects would be nice.
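
As a rough sketch of such a mapping (the helper is my own, not a Django
API, and it assumes the SQL returns the model's concrete columns in field
order):

from django.db import connection, transaction

def stream_instances(model, sql, params=(), name='big_query'):
    # e.g. stream_instances(User, 'SELECT * FROM auth_user ORDER BY id')
    fields = [f.attname for f in model._meta.concrete_fields]
    connection.ensure_connection()
    with transaction.atomic():
        cursor = connection.connection.cursor(name=name)  # server-side
        cursor.execute(sql, params)
        for row in cursor:  # streamed in cursor.itersize chunks
            yield model(**dict(zip(fields, row)))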


Rick van Hattem
2014-11-23 12:11:13 UTC
Permalink
Hi Mattias,

Can you comment on the example I've given where this should not cause any
problems and should help pretty much everyone (but just in case, the
setting could be *optional*)?

If/when an *unsliced* queryset were to reach a certain limit (say, 10,000,
but configurable) the system would raise an error.

My arguments for why this would be a good idea and not a problem for anyone:

- Protects servers from going down due to memory constraints.
- Why would anyone need to run through a list of *10,000* items without any
slicing/pagination/etc?
- If someone _does_ need to walk through more than 10,000 items, they'll
know how to slice a queryset.
- You can write your own apps perfectly, but installing 3rd party apps can
still kill the system without fixes like these.


So please, can anyone give a good argument as to why any sane person would
have a problem with a huge default limit which will kill the performance of
your site anyhow but isn't enough to kill the entire system?


Shai Berger
2014-11-23 12:27:44 UTC
Permalink
Hi Rick,

On Sunday 23 November 2014 14:11:13 Rick van Hattem wrote:
>
> So please, can anyone give a good argument as to why any sane person would
> have a problem with a huge default limit which will kill the performance of
> your site anyhow but isn't enough to kill the entire system?
>
...because Django is also used to implement batch processes, and you'd be very
surprised if your system, which works well enough for a year or so, suddenly
has its night-jobs starting to fail one day because some size-limit has been
reached.

The Django project has traditionally held the position, with regard to such
proposals, that we should avoid making such decisions for our users, but we
should allow them to make them by themselves; this can also lead to more
general solutions.

In this case, I think you could get support for allowing yourself
query-validation hooks -- say, allow installing a function that will be
called by the ORM when query results are received, before turning them into
objects. I think this lets you solve your problem outside of core (and
then, if your solution proves useful and popular, it may go into core as
well). This could also allow other interesting functionalities, I suspect,
if it is allowed to modify (and not just validate) the results.
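
No such hook point exists today; purely to illustrate the shape of the
idea, a user-space approximation built on a custom QuerySet (it leans on
the private _fetch_all/_result_cache internals, and every name here is
hypothetical):

from django.db import models

RESULT_HOOKS = []  # registry of validation/transform functions

class HookedQuerySet(models.QuerySet):
    def _fetch_all(self):
        # _fetch_all/_result_cache are private internals; a sketch of
        # the idea, not a supported extension point.
        first_fetch = self._result_cache is None
        super(HookedQuerySet, self)._fetch_all()
        if first_fetch:
            for hook in RESULT_HOOKS:
                self._result_cache = hook(self._result_cache)

def max_rows_hook(rows, limit=10000):
    # validate (or transform) the fetched results
    if len(rows) > limit:
        raise RuntimeError('query returned more than %d rows' % limit)
    return rows

RESULT_HOOKS.append(max_rows_hook)

# attached on a model with: objects = HookedQuerySet.as_manager()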

I think (speaking only for myself, of course) that a proposal along these
lines has a better chance of gaining traction.

HTH,
Shai.
Marco Paolini
2014-11-23 19:38:03 UTC
Permalink
2014-11-23 13:27 GMT+01:00 Shai Berger <***@platonix.com>:

> Hi Rick,
>
> On Sunday 23 November 2014 14:11:13 Rick van Hattem wrote:
> >
> > So please, can anyone give a good argument as to why any sane person
> > would have a problem with a huge default limit which will kill the
> > performance of your site anyhow but isn't enough to kill the entire
> > system?
> >
> ...because Django is also used to implement batch processes, and you'd be
> very surprised if your system, which works well enough for a year or so,
> suddenly has its night-jobs starting to fail one day because some
> size-limit has been reached.
>
> The Django project has traditionally held the position, with regard to
> such proposals, that we should avoid making such decisions for our users,
> but we should allow them to make them by themselves; this can also lead
> to more general solutions.
>
> In this case, I think you could get support for allowing yourself
> query-validation hooks -- say, allow installing a function that will be
> called by the ORM when query results are received, before turning them
> into objects. I think this lets you solve your problem outside of core
> (and then, if your solution proves useful and popular, it may go into
> core as well). This could also allow other interesting functionalities,
> I suspect, if it is allowed to modify (and not just validate) the
> results.
>
>
that function will likely be called much too late.

The client-side PostgreSQL cursor fetches *all query results* in one go
the first time you read from the cursor.

So again, the issue with "big" resultsets is not the ORM generating django
objects, but the very way django uses psycopg.

Why don't we just write a Model.manager.chunked() like proposed here?

https://groups.google.com/forum/#!searchin/django-developers/psycopg$20results/django-developers/8ZBA4uMinXo/M1osYTGmvbUJ
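
As a sketch of what that could look like: a pk-ordered chunking helper
(names are mine) that keeps both psycopg and the ORM to one chunk of rows
at a time:

def chunked(queryset, chunk_size=1000):
    # Each chunk is a separate LIMITed query ordered by pk, so at most
    # chunk_size rows are ever buffered client-side.
    last_pk = None
    while True:
        qs = queryset.order_by('pk')
        if last_pk is not None:
            qs = qs.filter(pk__gt=last_pk)
        chunk = list(qs[:chunk_size])
        if not chunk:
            break
        for obj in chunk:
            yield obj
        last_pk = chunk[-1].pk

# for entity in chunked(Entity.objects.all()):  # hypothetical model
#     process(entity)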



Florian Apolloner
2014-11-23 15:22:55 UTC
Permalink
Hi Rick,

On Sunday, November 23, 2014 1:11:13 PM UTC+1, Rick van Hattem wrote:

> If/when an *unsliced* queryset were to reach a certain limit (say,
> 10,000, but configurable) the system would raise an error.
>

Django can't know if that would be the case without issuing an extra query
-- and even then another transaction might commit a batch operation adding
10k items.



> - Protects servers from going down due to memory constraints.
>

Not really, because psycopg has already fetched everything.

> So please, can anyone give a good argument as to why any sane person would
> have a problem with a huge default limit which will kill the performance of
> your site anyhow but isn't enough to kill the entire system?
>

There are way easier ways to kill the performance of your site than
fetching a few rows; an arbitrary limit on queries is not going to fix
anything. Code applications defensively and don't use 3rd party stuff
without introspecting it for problems first. While you are right that there
are issues in theory, I think in practice they can be worked around easily
and don't cause any real problem. Killing a VPS due to out-of-memory
conditions can happen all the time -- be it the database or a service gone
rogue or whatever -- but in the end I've rarely seen Django be the cause
of that.

Rick van Hattem
2014-11-23 21:07:29 UTC
Permalink
Hi Florian,

On 23 Nov 2014 16:22, "Florian Apolloner" <***@gmail.com> wrote:
>
> Hi Rick,
>
>
> On Sunday, November 23, 2014 1:11:13 PM UTC+1, Rick van Hattem wrote:
>>
>> If/when an unsliced queryset were to reach a certain limit (say, 10,000,
>> but configurable) the system would raise an error.
>
> Django can't know if that would be the case without issuing an extra
> query -- and even then another transaction might commit a batch operation
> adding 10k items.

Actually, it can. If Django limits unsliced queries to 10,001 items it can
simply raise an error when 10,001 items are returned.

>> - Protects servers from going down due to memory constraints.
>
>
> Not really, cause psycopg already fetched everything.

Not if Django limits it by default :)

>> So please, can anyone give a good argument as to why any sane person
>> would have a problem with a huge default limit which will kill the
>> performance of your site anyhow but isn't enough to kill the entire system?
>
> There are way easier ways to kill the performance of your site than
> fetching a few rows; an arbitrary limit on queries is not going to fix
> anything. Code applications defensively and don't use 3rd party stuff
> without introspecting it for problems first. While you are right that there
> are issues in theory, I think in practice they can be worked around easily
> and don't cause any real problem. Killing a VPS due to out-of-memory
> conditions can happen all the time -- be it the database or a service gone
> rogue or whatever -- but in the end I've rarely seen Django be the cause
> of that.

In that case you are lucky. I've seen problems like these in over half a
dozen different projects (which I didn't write myself, so I couldn't set
them up properly).

Even the batch-job argument has flaws imho: in the case of batch jobs it's
still not a good idea to fetch everything in one big go. And even if it
were, within Django a view is easily distinguishable from a batch job, so
that doesn't have to be a problem.

But the hostility against the ideas alone makes me lose hope that anything
will ever be merged in. It's too bad that people always assume full control
over projects. If you're hired to fix problems, you can't fix design flaws
that easily.

Christophe Pettus
2014-11-23 21:13:43 UTC
Permalink
On Nov 23, 2014, at 1:07 PM, Rick van Hattem <***@wol.ph> wrote:

> > Not really, cause psycopg already fetched everything.
>
> Not if Django limits it by default :)

Unfortunately, that's not how it works. There are three things that take up memory as the result of a query:

1. The Django objects. These are created in units of 100 at a time.
2. The psycopg2 Python objects from the result. These are already limited to a certain number (I believe to 100) at a time.
3. The results from libpq. These are not limited, and there is no way of limiting them without creating a named cursor, which is a significant change to how Django interacts with the database.

In short, without substantial, application-breaking changes, you can't limit the amount of memory a query returns unless you add a LIMIT clause to it. However, adding a LIMIT clause can often cause performance issues all by itself:

http://thebuild.com/blog/2014/11/18/when-limit-attacks/

There's no clean fix that wouldn't have significant effects on unsuspecting applications.
--
-- Christophe Pettus
***@thebuild.com

Rick van Hattem
2014-11-23 21:53:16 UTC
Permalink
On 23 Nov 2014 22:13, "Christophe Pettus" <***@thebuild.com> wrote:
>
>
> On Nov 23, 2014, at 1:07 PM, Rick van Hattem <***@wol.ph> wrote:
>
> > > Not really, because psycopg has already fetched everything.
> >
> > Not if Django limits it by default :)
>
> Unfortunately, that's not how it works. There are three things that take
> up memory as the result of a query:
>
> 1. The Django objects. These are created in units of 100 at a time.
> 2. The psycopg2 Python objects from the result. These are already
> limited to a certain number (I believe to 100) at a time.
> 3. The results from libpq. These are not limited, and there is no way of
> limiting them without creating a named cursor, which is a significant
> change to how Django interacts with the database.
>
> In short, without substantial, application-breaking changes, you can't
> limit the amount of memory a query returns unless you add a LIMIT clause
> to it. However, adding a LIMIT clause can often cause performance issues
> all by itself:
>
> http://thebuild.com/blog/2014/11/18/when-limit-attacks/
>
> There's no clean fix that wouldn't have significant effects on
> unsuspecting applications.

Very true, that's a fair point. That's why I'm opting for a configurable
option. Patching this within Django has saved me in quite a few cases but
it can have drawbacks.

Christophe Pettus
2014-11-23 21:57:39 UTC
Permalink
On Nov 23, 2014, at 1:53 PM, Rick van Hattem <***@wol.ph> wrote:

> Very true, that's a fair point. That's why I'm opting for a configurable option. Patching this within Django has saved me in quite a few cases but it can have drawbacks.

As a DB guy, I have to say that if an application is sending a query that expects to get 100 results back but gets 1,000,000 back, you have a bug that needs to be tracked down and fixed. Patching it by limiting the results is kind of a database version of "except: pass" to get rid of an inconvenient but mysterious exception.

--
-- Christophe Pettus
***@thebuild.com

Rick van Hattem
2014-11-24 09:08:48 UTC
Permalink
On 23 November 2014 at 22:57, Christophe Pettus <***@thebuild.com> wrote:

>
> On Nov 23, 2014, at 1:53 PM, Rick van Hattem <***@wol.ph> wrote:
>
> > Very true, that's a fair point. That's why I'm opting for a configurable
> option. Patching this within Django has saved me in quite a few cases but
> it can have drawbacks.
>
> As a DB guy, I have to say that if an application is sending a query that
> expects to get 100 results back but gets 1,000,000 back, you have a bug
> that needs to be tracked down and fixed. Patching it by limiting the
> results is kind of a database version of "except: pass" to get rid of an
> inconvenient but mysterious exception.
>

Indeed, except it's not an "except: pass" but an "except: raise" which I'm
proposing. Which makes a world of difference.

Christophe Pettus
2014-11-24 11:16:55 UTC
Permalink
On Nov 24, 2014, at 1:08 AM, Rick van Hattem <***@wol.ph> wrote:

> Indeed, except it's not an "except: pass" but an "except: raise" which I'm proposing. Which makes a world of difference.

Well, as previously noted, this option would introduce another round-trip into every database query if it's actually going to check (and, in most databases, would have to run inside a transaction at, at least, REPEATABLE READ isolation in order to provide a strong guarantee, so you can add three more statements to a lot of interactions right there). That seems to reduce the "high performance" part of the system.

What I'm getting at is: this is a bug in the application. It's not a misfeature in Django. If you have a bug in an application whose source code cannot be changed, that's a shame, but I don't think that Django can be expected to introduce configuration options to cover every possible scenario in which a previously-written Django application interacts badly with the database, doing extra work (slowing down non-buggy applications and introducing more code to QA) to patch the bad application. To me, this is about as logical as putting in DIVIDE_BY_ZERO_RETURNS_NONE with a default of False because someone wrote some legacy code that would work if we did that.

--
-- Christophe Pettus
***@thebuild.com

Rick van Hattem
2014-11-24 11:36:36 UTC
Permalink
Hi Christophe,

As I previously explained, there's no need for a round-trip or different
transaction levels.

If you fetch N+1 items you know if there are over N items in your list. So
with that in mind you can simply do a query with a limit of N+1 to know if
your query returns over N items.

No need for different isolation modes or extra statements; the only cost is
that, if (and only if) no limit is currently defined for the query, you
will be fetching a maximum of 10,001 items instead of 10,000.
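
As a sketch of the check being described (the names are mine; the exception
name is borrowed from earlier in the thread):

QUERYSET_SIZE_LIMIT = 10000  # the configurable cap being proposed

class QuerysetTooLarge(Exception):
    pass

def fetch_checked(queryset, limit=QUERYSET_SIZE_LIMIT):
    # The slice compiles to LIMIT limit + 1, so detecting overflow costs
    # one sentinel row rather than a second COUNT query or a stricter
    # isolation level.
    results = list(queryset[:limit + 1])
    if len(results) > limit:
        raise QuerysetTooLarge(
            'query returned over %d rows; slice it explicitly '
            'if this is intentional' % limit)
    return results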


As for this being a bug in the application instead of Django, I fully
agree. But it can be very difficult to debug issues like these if at some
random moment the entire server crashes.

- Limiting the memory usage for the application doesn't help in debugging
since it will just kill the Django app without a proper stacktrace.
- In some cases you are unable to limit the resources on a server level
and you'll just have to watch the entire machine die

The proposed solution is simply a safety net to prevent a server from dying
completely. It doesn't even have to be a default option, but it's one I
would like to have, to make debugging easier and to protect servers from
dying without any traceable cause, since resource starvation can kill the
machine without ever showing anything useful in the logs.

On 24 November 2014 at 12:16, Christophe Pettus <***@thebuild.com> wrote:

>
> On Nov 24, 2014, at 1:08 AM, Rick van Hattem <***@wol.ph> wrote:
>
> > Indeed, except it's not an "except: pass" but an "except: raise" which
> I'm proposing. Which makes a world of difference.
>
> Well, as previously noted, this option would introduce another round-trip
> into every database if it's actually going to check (and, in most
> databases, would have to run inside a transaction at, at least, REPEATABLE
> READ isolation mode in order to provide a strong guarantee, so you can add
> three more statements to a lot of interactions right there). That seems to
> reduce the "high performance" part of the system.
>
> What I'm getting at is: This is a bug in the application. It's not a
> misfeature in Django. If you have a bug in an application whose source
> code cannot be changed, that's a shame, but I don't think that Django can
> be expected to introduce configuration options to cover every possible
> scenario in which a previously-written Django application interacts badly
> with the database in a way in which, in theory, it could do extra work
> (slowing down non-buggy applications and introducing more code to QA) to
> patch the bad application. To me, this is about as logical as putting in
> DIVIDE_BY_ZERO_RETURNS_NONE with a default of False because someone wrote
> some legacy code that would work if we did that.
>
> --
> -- Christophe Pettus
> ***@thebuild.com
>

Christophe Pettus
2014-11-24 18:39:50 UTC
Permalink
On Nov 24, 2014, at 3:36 AM, Rick van Hattem <***@wol.ph> wrote:
> If you fetch N+1 items you know if there are over N items in your list.

Let's stop there. Unfortunately, because of the way libpq works, just sending the query and checking the result set size won't solve your problem, except for an even smaller edge case.

Using the standard (non-named-cursor) protocol, when you get the first result back from libpq, *every result* is sent over the wire, and takes up space on the client. Thus, no limitation on the client side (number of Django objects created, number of Python objects created by psycopg2) will prevent an out-of-memory error. In my experience (and I've seen this problem a lot), the OOM occurs on the libpq results, not on the other parts.

Thus, the proposal only solves the relatively narrow case where the libpq result does *not* create an OOM, but creating the Django and psycopg2 objects does.
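
For contrast, here is roughly what the named-cursor route looks like in
plain psycopg2 (the DSN and table are illustrative); the result set then
stays on the PostgreSQL server and streams over in batches:

    import psycopg2

    conn = psycopg2.connect('dbname=mydb')   # illustrative DSN
    cur = conn.cursor(name='streamer')       # a name makes it server-side
    cur.itersize = 2000                      # rows per network round-trip
    cur.execute('SELECT * FROM big_table')
    for row in cur:
        pass  # handle each row; client memory stays bounded by itersize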

I'll note also that it's not the server that dies in this case; the particular thread doing the request gets an exception.

--
-- Christophe Pettus
***@thebuild.com

Rick van Hattem
2014-11-24 19:16:31 UTC
Permalink
It seems you are misunderstanding what I am trying to do here. The 10,000
(or whatever, that should be configurable) is a number large enough not to
bother anyone but small enough not to trigger the OOM system. In that case
it works perfectly.

Do note that the proposed solution is not a hypothetical one; it's a tried
and proven one that has saved me a lot of headache in the past. It
definitely helped a lot when I was trying 3rd party apps at
Fashiolista.com, one of the larger Django-based websites.

On 24 November 2014 at 19:39, Christophe Pettus <***@thebuild.com> wrote:

>
> On Nov 24, 2014, at 3:36 AM, Rick van Hattem <***@wol.ph> wrote:
> > If you fetch N+1 items you know if there are over N items in your list.
>
> Let's stop there. Unfortunately, because of the way libpq works, just
> sending the query and checking the result set size won't solve your
> problem, except for an even smaller edge case.
>
> Using the standard (non-named-cursor) protocol, when you get the first
> result back from libpq, *every result* is sent over the wire, and takes up
> space on the client. Thus, no limitation on the client side (number of
> Django objects created, number of Python objects created by psycopg2) will
> prevent an out-of-memory error. In my experience (and I've seen this
> problem a lot), the OOM occurs on the libpq results, not on the other parts.
>
> Thus, the proposal only solves the relatively narrow case where the libpq
> result does *not* create an OOM, but creating the Django and psycopg2
> objects does.
>
> I'll note also that it's not the server that dies in this case; the
> particular thread doing the request gets an exception.
>
> --
> -- Christophe Pettus
> ***@thebuild.com
>

Christophe Pettus
2014-11-24 20:32:41 UTC
Permalink
On Nov 24, 2014, at 11:16 AM, Rick van Hattem <***@wol.ph> wrote:

> It seems you are misunderstanding what I am trying to do here. The 10,000 (or whatever, that should be configurable) is a number large enough not to bother anyone but small enough not to trigger the OOM system.

There are really only four options that could be implemented at the framework level:

1. You add a LIMIT to every query. As noted, this can change the performance of a query, sometimes quite radically, often in a bad way. An OOM error is clear and explicit; a query suddenly running 100x slower is a lot stranger to track down. (Also, unless you decide that returning the entire LIMIT amount is an error, you can't really know if the query would have returned *more* than that.)

2. You don't add the LIMIT. At that point, you have no guard against excessive memory consumption.

3. You do two queries, one a count(*), one to get the result. This is a huge protocol change, and quite inefficient (and, without wrapping it in a transaction at an elevated isolation mode, not reliable); see the sketch after this list.

4. You use a named cursor. This is also a huge protocol change.

Those are your choices.
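
To make the cost of option 3 concrete, a rough psycopg2 sketch (the table
name and threshold are illustrative): a count plus the real query inside one
REPEATABLE READ transaction, i.e. two statements plus transaction control
for every queryset:

    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect('dbname=mydb')  # illustrative DSN
    conn.set_isolation_level(
        psycopg2.extensions.ISOLATION_LEVEL_REPEATABLE_READ)
    cur = conn.cursor()
    cur.execute('SELECT count(*) FROM big_table')
    (n,) = cur.fetchone()
    if n > 10000:
        conn.rollback()
        raise RuntimeError('refusing to fetch %d rows' % n)
    cur.execute('SELECT * FROM big_table')  # count still holds at this level
    rows = cur.fetchall()
    conn.commit()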

It's not that the *problem* is unusual; it's that being unable to fix the problem in the application is unusual. I've encountered this problem, sure, and what I did was fix the code; it took like two minutes. Indeed, I would *strongly* argue that if you are issuing queries that you expect to get 100 results back and are getting a memory-crushing result back, you fundamentally don't understand something about your data, and need to address that promptly. Running properly on large datasets is a big job; just a patch like this isn't going to solve all the issues.

In your particular case, where you have the relatively unusual situation that:

1. You have this problem, and,
2. You can't fix the code to solve this problem.

... the right answer is probably having a local patch for Django.

--
-- Christophe Pettus
***@thebuild.com

Michael Manfre
2014-11-24 20:53:05 UTC
Permalink
On Mon, Nov 24, 2014 at 3:32 PM, Christophe Pettus <***@thebuild.com> wrote:

> In your particular case, where you have the relatively unusual situation
> that:
>
> 1. You have this problem, and,
> 2. You can't fix the code to solve this problem.
>
> ... the right answer is probably having a local patch for Django.
>

Alternatively, the same could be accomplished with a custom database
backend that enforces whatever limit value is desired. If I needed this
behavior, I'd probably tweak the query.low_mark and query.high_mark values
in SQLCompiler.as_sql before passing it along to the parent.
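
Roughly like this (an untested sketch; the backend plumbing that registers
the compiler is omitted, and MAX_ROWS stands in for whatever limit you
want):

    from django.db.models.sql import compiler

    MAX_ROWS = 10000  # whatever limit you want

    class LimitedSQLCompiler(compiler.SQLCompiler):
        def as_sql(self, *args, **kwargs):
            # Clamp any query that has no explicit upper bound before
            # handing it to the stock compiler.
            if self.query.high_mark is None:
                self.query.set_limits(high=MAX_ROWS)
            return super(LimitedSQLCompiler, self).as_sql(*args, **kwargs)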

I think it's safe to say that this functionality is unlikely to find its
way into Django, but if you have questions about the custom database
backend approach, feel free to contact me.

Regards,
Michael Manfre

Rick van Hattem
2014-11-24 21:05:01 UTC
Permalink
Thanks for the help, but writing the custom database backend won't be a
problem; I've written one before :)

My goal was simply to move the Django project forward but it seems the
problems I've encountered in the field are too uncommon for most other
developers to care or understand.

Thank you all for the discussion.

On 24 November 2014 at 21:53, Michael Manfre <***@gmail.com> wrote:

>
>
> On Mon, Nov 24, 2014 at 3:32 PM, Christophe Pettus <***@thebuild.com>
> wrote:
>
>> In your particular case, where you have the relatively unusual situation
>> that:
>>
>> 1. You have this problem, and,
>> 2. You can't fix the code to solve this problem.
>>
>> ... the right answer is probably having a local patch for Django.
>>
>
> Alternatively, the same could be accomplished with a custom database
> backend that enforces whatever limit value is desired. If I needed this
> behavior, I'd probably tweak the query.low_mark and query.high_mark values
> in SQLCompiler.as_sql before passing it along to the parent.
>
> I think it's safe to say that this functionality is unlikely to find its
> way into Django, but if you have questions about the custom database
> backend approach, feel free to contact me.
>
> Regards,
> Michael Manfre
>

Florian Apolloner
2014-11-25 08:06:53 UTC
Permalink
On Monday, November 24, 2014 10:05:47 PM UTC+1, Rick van Hattem wrote:
>
> My goal was simply to move the Django project forward but it seems the
> problems I've encountered in the field are too uncommon for most other
> developers to care or understand.
>

Oh, I can assure you that we care and understand (and you are certainly not
the only one with a somewhat big site), but in the end I think the
arguments against it as shown by Christophe outweigh what you are
proposing. That said, I still fail to see the issue: set a memory limit for
your apps (man ulimit) and you get a nice MemoryError when libpq tries to
overallocate -- including a nice traceback. Adding security guards for any
ill-written application is not going to fly (especially if it's not as easy
as adding a LIMIT, which has its own problems as pointed out), sorry.
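
You don't even need to touch the hosting setup; something like this early
in wsgi.py does it in-process (Unix only; the 512 MiB cap is an arbitrary
example):

    import resource

    soft = 512 * 1024 * 1024  # bytes of address space; tune per project
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
    # When libpq/psycopg2 overallocates you now get a MemoryError with a
    # normal Python traceback instead of a visit from the kernel OOM killer.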

Cheers,
Florian

Mattias Linnap
2014-11-24 08:36:32 UTC
Permalink
Since your use case seems to be "avoid blocking the server when one
pageview uses unexpectedly many resources", and you don't want to fix or
optimise the actual application code, why not set these limits at a level
where it makes sense?

For example:
* http://uwsgi-docs.readthedocs.org/en/latest/Options.html#limit-as sets a
memory allocation limit on the UWSGI worker processes, and terminates &
restarts the worker if it uses more memory. This will protect against not
just big querysets, but also big plain python lists, loading huge image or
xml files, memory leaks in C extensions and any other memory-consuming
issues.
* http://uwsgi-docs.readthedocs.org/en/latest/FAQ.html#what-is-harakiri-mode
sets a timeout on request handling, and terminates & restarts the worker if
it takes more than X seconds to serve the request. This protects against
not just big querysets, but also any other slow code that ties up the
worker, for example forgetting to set a timeout on calls to external APIs,
or a plain old infinite loop.
* http://stackoverflow.com/questions/5421776/how-can-i-limit-database-query-time-during-web-requests
You can set a PostgreSQL query timeout in seconds for the queries run by
the web app user (see the settings sketch after this list). This makes far
more sense than a limit on the result size, as complex JOINs can be very
resource-intensive but only return a few rows in the result.
There should be similar configuration options for other app servers and
databases as well.
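
The PostgreSQL timeout can even be applied from Django's own settings, via
the libpq options string that psycopg2 passes along (the 5000 ms value is
just an example):

    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.postgresql_psycopg2',
            'NAME': 'mydb',  # illustrative
            'OPTIONS': {
                # Server-side per-session limit: any statement running
                # longer than 5 seconds is cancelled with an error.
                'options': '-c statement_timeout=5000',
            },
        },
    }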

Django's ORM is just the wrong level in the software stack for these
limits, since there are hundreds of other ways to kill the performance of a
server, and the number of results in a queryset is a poor indicator of
performance issues.

On Sunday, 23 November 2014 23:53:26 UTC+2, Rick van Hattem wrote:
>
>
> On 23 Nov 2014 22:13, "Christophe Pettus" <***@thebuild.com> wrote:
> >
> >
> > On Nov 23, 2014, at 1:07 PM, Rick van Hattem <***@wol.ph> wrote:
> >
> > > > Not really, cause psycopg already fetched everything.
> > >
> > > Not if Django limits it by default :)
> >
> > Unfortunately, that's not how it works. There are three things that
> take up memory as the result of a query result:
> >
> > 1. The Django objects. These are created in units of 100 at a time.
> > 2. The psycopg2 Python objects from the result. These are already
> limited to a certain number (I believe to 100) at a time.
> > 3. The results from libpq. These are not limited, and there is no way
> of limiting them without creating a named cursor, which is a significant
> change to how Django interacts with the database.
> >
> > In short, without substantial, application-breaking changes, you can't
> limit the amount of memory a query returns unless you add a LIMIT clause to
> it. However, adding a LIMIT clause can often cause performance issues all
> by itself:
> >
> > http://thebuild.com/blog/2014/11/18/when-limit-attacks/
> >
> > There's no clean fix that wouldn't have significant effects on
> unsuspecting applications.
>
> Very true, that's a fair point. That's why I'm opting for a configurable
> option. Patching this within Django has saved me in quite a few cases but
> it can have drawbacks.
>
> > --
> > -- Christophe Pettus
> > ***@thebuild.com
> >

Rick van Hattem
2014-11-24 09:00:38 UTC
Permalink
If that is an option then it's definitely a better location to set limits
to prevent the server from going down.

It helps nothing when it comes to debugging though, which is the primary
reason for patching the ORM.

And in addition to that, quite a few customers won't let you change the
hosting setup just like that, especially since uwsgi is one of the least
used deployment options in my experience.
On 24 Nov 2014 09:36, "Mattias Linnap" <***@gmail.com> wrote:

> Since your use case seems to be "avoid blocking the server when one
> pageview uses unexpectedly many resources", and you don't want to fix or
> optimise the actual application code, why not set these limits at a level
> where it makes sense?
>
> For example:
> * http://uwsgi-docs.readthedocs.org/en/latest/Options.html#limit-as sets
> a memory allocation limit on the UWSGI worker processes, and terminates &
> restarts the worker if it uses more memory. This will protect against not
> just big querysets, but also big plain python lists, loading huge image or
> xml files, memory leaks in C extensions and any other memory-consuming
> issues.
> *
> http://uwsgi-docs.readthedocs.org/en/latest/FAQ.html#what-is-harakiri-mode
> sets a timeout on request handling, and terminates & restarts the worker if
> it takes more than X seconds to serve the request. This protects against
> not just big querysets, but also any other slow code that ties up the
> worker, for example forgetting to set a timeout on calls to external APIs,
> or a plain old infinite loop.
> *
> http://stackoverflow.com/questions/5421776/how-can-i-limit-database-query-time-during-web-requests
> You can set a Postgresql query timeout in seconds for the queries run by
> the web app user. This makes far more sense than a limit on the result
> size, as complex JOINs can be very resource-intensive but only return a few
> rows in the result.
> There should be similar configuration options for other app servers and
> databases as well.
>
> Django's ORM is just the wrong level in the software stack for these
> limits, since there are hundreds of other ways to kill the performance of a
> server, and the number of results in a queryset is a poor indicator of
> performance issues.
>
> On Sunday, 23 November 2014 23:53:26 UTC+2, Rick van Hattem wrote:
>>
>>
>> On 23 Nov 2014 22:13, "Christophe Pettus" <***@thebuild.com> wrote:
>> >
>> >
>> > On Nov 23, 2014, at 1:07 PM, Rick van Hattem <***@wol.ph> wrote:
>> >
>> > > > Not really, cause psycopg already fetched everything.
>> > >
>> > > Not if Django limits it by default :)
>> >
>> > Unfortunately, that's not how it works. There are three things that
>> take up memory as the result of a query result:
>> >
>> > 1. The Django objects. These are created in units of 100 at a time.
>> > 2. The psycopg2 Python objects from the result. These are already
>> limited to a certain number (I believe to 100) at a time.
>> > 3. The results from libpq. These are not limited, and there is no way
>> of limiting them without creating a named cursor, which is a significant
>> change to how Django interacts with the database.
>> >
>> > In short, without substantial, application-breaking changes, you can't
>> limit the amount of memory a query returns unless you add a LIMIT clause to
>> it. However, adding a LIMIT clause can often cause performance issues all
>> by itself:
>> >
>> > http://thebuild.com/blog/2014/11/18/when-limit-attacks/
>> >
>> > There's no clean fix that wouldn't have significant effects on
>> unsuspecting applications.
>>
>> Very true, that's a fair point. That's why I'm opting for a configurable
>> option. Patching this within Django has saved me in quite a few cases but
>> it can have drawbacks.
>>
>> > --
>> > -- Christophe Pettus
>> > ***@thebuild.com
>> >

Carl Meyer
2014-11-19 02:52:35 UTC
Permalink
Hi Rick,

On 11/18/2014 11:59 AM, Rick van Hattem wrote:
[snip]
> In all but the most basic Django projects I've seen problems like these.
> Sane defaults won't hurt anyone and solves issues for people with larger
> systems. And running forks of Django seems counter productive as well.

As a side note -- I'm not sure why running a fork of Django should be
necessary. All the suggestions you've made so far in this thread could
be implemented in a custom QuerySet subclass, and custom QuerySet
classes can be hooked into your models via a custom Manager (which can
even be created for you by Manager.from_queryset). Create an abstract
base Model using this Manager, inherit all your models from it, and you
can get the behaviors you want throughout your system without forking
Django.
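
A rough sketch of what I mean (SafeQuerySet, SafeModel and the 10,000 cap
are all illustrative, not anything Django ships):

    from django.db import models

    SAFETY_LIMIT = 10000  # illustrative cap

    class SafeQuerySet(models.QuerySet):
        def __iter__(self):
            if self.query.high_mark is None:
                # Unbounded iteration: fetch one extra row so an
                # over-large result raises instead of loading the table.
                rows = list(self[:SAFETY_LIMIT + 1])
                if len(rows) > SAFETY_LIMIT:
                    raise RuntimeError(
                        'queryset exceeded %d rows' % SAFETY_LIMIT)
                return iter(rows)
            return super(SafeQuerySet, self).__iter__()

    class SafeModel(models.Model):
        objects = models.Manager.from_queryset(SafeQuerySet)()

        class Meta:
            abstract = True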

Carl
Rick van Hattem
2014-11-19 12:40:58 UTC
Permalink
Hi Carl,

Truthfully, part of my reason for forking was that I was running an
older version of Django which didn't have custom user models. In that case
it's a bit more difficult to override the manager, and I've seen quite a few
external projects (accidentally) do something like "User.objects.all()" and
put an if or loop around it. Quite problematic with a few million users.

Regardless, it still requires forking either Django or all external
projects (in case of different models), so the problem is not solved by
custom managers.


On Wednesday, November 19, 2014 3:52:52 AM UTC+1, Carl Meyer wrote:
>
> Hi Rick,
>
> On 11/18/2014 11:59 AM, Rick van Hattem wrote:
> [snip]
> > In all but the most basic Django projects I've seen problems like these.
> > Sane defaults won't hurt anyone and solves issues for people with larger
> > systems. And running forks of Django seems counter productive as well.
>
> As a side note -- I'm not sure why running a fork of Django should be
> necessary. All the suggestions you've made so far in this thread could
> be implemented in a custom QuerySet subclass, and custom QuerySet
> classes can be hooked into your models via a custom Manager (which can
> even be created for you by Manager.from_queryset). Create an abstract
> base Model using this Manager, inherit all your models from it, and you
> can get the behaviors you want throughout your system without forking
> Django.
>
> Carl
>
>
